init commit

2026-03-18 11:10:07 -04:00
commit b1ef516348
8 changed files with 730 additions and 0 deletions

110
README.md Normal file

@@ -0,0 +1,110 @@
# discourse-url-to-article
A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.
---
## Features
- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)
---
## Installation
Add the plugin to your `app.yml`:
```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/yourname/discourse-url-to-article.git
```
Then rebuild: `./launcher rebuild app`
---
## Site Settings
| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate body automatically without button click |
| `url_to_article_max_content_length` | `50000` | Max chars extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |
---
## How It Works
### Frontend (Ember.js)
`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:
1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
3. The response populates `composer.model.reply` (body) and optionally updates the title.
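The bare-URL check in step 1 is a single anchored pattern. Here is a Ruby restatement of the frontend's `URL_REGEX` for illustration (the JavaScript initializer is the authoritative copy):

```ruby
# Anchored: the whole trimmed title must be one http(s) URL with no whitespace
URL_PATTERN = %r{\A(https?://[^\s/$.?#][^\s]*)\z}i

# Returns the URL when the title is a bare URL, nil otherwise
def bare_url(title)
  title.strip[URL_PATTERN, 1]
end
```

A title with any text around the URL fails the anchors, so normal titles are never hijacked.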
### Backend (Ruby)
`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:
1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
4. **Cleans the node**: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
5. **Converts to Markdown** using the `reverse_markdown` gem.
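The text-density fallback in step 3 reduces to a simple score: text length minus a fixed penalty per descendant tag, so link-heavy navigation loses to prose-heavy blocks. A toy, Nokogiri-free sketch (candidate hashes stand in for DOM nodes):

```ruby
# Per-tag penalty matching the extractor's scoring idea
TAG_PENALTY = 3

def density_score(text_length, tag_count)
  text_length - TAG_PENALTY * tag_count
end

candidates = [
  { name: "nav",     text_length: 400,  tag_count: 120 }, # many tags, little text
  { name: "article", text_length: 2600, tag_count: 40  }, # mostly prose
]

best = candidates.max_by { |c| density_score(c[:text_length], c[:tag_count]) }
# best[:name] == "article"
```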
### Security
- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict to specific domains.
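A hostname blocklist can be sidestepped by a public DNS name that resolves to a private address. One optional hardening step, not shipped with the plugin, is to resolve the host and reject private or link-local ranges using the stdlib `ipaddr` and `resolv` libraries:

```ruby
require "ipaddr"
require "resolv"

# Loopback, RFC 1918, link-local, and IPv6 loopback/unique-local ranges
PRIVATE_RANGES = %w[
  127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
  169.254.0.0/16 ::1/128 fc00::/7
].map { |cidr| IPAddr.new(cidr) }

# True when the address string falls inside any private/loopback range;
# unparseable input is treated as unsafe.
def private_address?(addr)
  ip = IPAddr.new(addr)
  PRIVATE_RANGES.any? { |range| range.family == ip.family && range.include?(ip) }
rescue IPAddr::InvalidAddressError
  true
end

# Network-dependent wrapper: resolve the host and reject it when any
# resolved address is private. Call this before fetching.
def safe_host?(host)
  addresses = Resolv.getaddresses(host)
  addresses.any? && addresses.none? { |a| private_address?(a) }
end
```

Note that the redirect target in `fetch_html` would also need this check, since a safe URL can redirect to an internal one.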
---
## Output Format
The body is written as:
```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>
*Article description or lead paragraph.*
---
## Article Heading
Full article text in Markdown...
```
---
## Extending
### Custom extraction logic
Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
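A minimal sketch of the `Module#prepend` approach, using stand-in class names (in a real companion plugin you would prepend the shipped `UrlToArticle::ArticleExtractor` instead of the dummy `Extractor` below):

```ruby
# Stand-in for the plugin's extractor; its find_content_node represents
# the built-in selector cascade plus density fallback.
class Extractor
  def find_content_node(doc)
    doc[:generic]
  end
end

module SiteSpecificSelectors
  def find_content_node(doc)
    # Try the site-specific selector first, then fall back to the defaults
    doc[:site_specific] || super
  end
end

Extractor.prepend(SiteSpecificSelectors)
```

Because `prepend` places the module ahead of the class in the ancestor chain, `super` reaches the original implementation without alias tricks.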
### Paywall / JS-rendered sites
For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).
---
## License
MIT

63
app/controllers/url_to_article/articles_controller.rb Normal file

@@ -0,0 +1,63 @@
# frozen_string_literal: true
module UrlToArticle
  class ArticlesController < ::ApplicationController
    requires_login

    before_action :ensure_enabled!
    before_action :validate_url!

    def extract
      result = ArticleExtractor.extract(@url)
      render json: {
        title: result.title,
        byline: result.byline,
        site_name: result.site_name,
        description: result.description,
        markdown: result.markdown,
        url: result.url,
      }
    rescue => e
      Rails.logger.warn("[url-to-article] Extraction failed for #{@url}: #{e.message}")
      render json: { error: "Could not extract article: #{e.message}" }, status: :unprocessable_entity
    end

    private

    def ensure_enabled!
      raise Discourse::NotFound unless SiteSetting.url_to_article_enabled
    end

    def validate_url!
      raw = params.require(:url)

      begin
        uri = URI.parse(raw)
      rescue URI::InvalidURIError
        return render json: { error: "Invalid URL" }, status: :bad_request
      end

      unless %w[http https].include?(uri.scheme)
        return render json: { error: "Only http/https URLs are supported" }, status: :bad_request
      end

      # SSRF protection — block private/loopback addresses
      blocked_domains = SiteSetting.url_to_article_blocked_domains
        .split(",").map(&:strip).reject(&:empty?)
      if blocked_domains.any? { |d| uri.host&.include?(d) }
        return render json: { error: "Domain not allowed" }, status: :forbidden
      end

      # Optionally enforce an allowlist
      allowed_domains = SiteSetting.url_to_article_allowed_domains
        .split(",").map(&:strip).reject(&:empty?)
      if allowed_domains.any? && !allowed_domains.any? { |d| uri.host&.end_with?(d) }
        return render json: { error: "Domain not in allowlist" }, status: :forbidden
      end

      @url = raw
    end
  end
end

197
assets/javascripts/discourse/initializers/url-to-article.js Normal file

@@ -0,0 +1,197 @@
import { apiInitializer } from "discourse/lib/api";
import { debounce } from "@ember/runloop";
import { ajax } from "discourse/lib/ajax";
import I18n from "I18n";
const URL_REGEX = /^(https?:\/\/[^\s/$.?#][^\s]*)$/i;
const DEBOUNCE_MS = 600;
export default apiInitializer("1.8.0", (api) => {
  if (!api.container.lookup("site-settings:main").url_to_article_enabled) {
    return;
  }

  // -----------------------------------------------------------------------
  // Inject a helper button + status banner into the composer
  // -----------------------------------------------------------------------
  api.modifyClass("component:composer-editor", {
    pluginId: "url-to-article",

    didInsertElement() {
      this._super(...arguments);
      this._setupUrlToArticle();
    },

    willDestroyElement() {
      this._super(...arguments);
      this._teardownUrlToArticle();
    },

    _setupUrlToArticle() {
      // Watch the title field — it lives outside the composer-editor DOM,
      // so we observe via the composer model's `title` property.
      const composer = this.get("composer");
      if (!composer) {
        return;
      }
      this._titleObserver = () => this._onTitleChanged();
      composer.addObserver("model.title", this, "_titleObserver");
    },

    _teardownUrlToArticle() {
      const composer = this.get("composer");
      if (!composer) {
        return;
      }
      composer.removeObserver("model.title", this, "_titleObserver");
    },

    _onTitleChanged() {
      const title = this.get("composer.model.title") || "";
      const match = title.trim().match(URL_REGEX);
      if (!match) {
        this._hideArticleBar();
        return;
      }

      const url = match[1];
      if (this._lastDetectedUrl === url) {
        return; // Same URL — no-op
      }
      this._lastDetectedUrl = url;

      const autoPopulate = api.container
        .lookup("site-settings:main")
        .url_to_article_auto_populate;
      if (autoPopulate) {
        debounce(this, "_fetchAndPopulate", url, DEBOUNCE_MS);
      } else {
        this._showArticleBar(url);
      }
    },

    // ---- Bar UI -------------------------------------------------------

    _showArticleBar(url) {
      this._hideArticleBar(); // remove any existing bar first

      const bar = document.createElement("div");
      bar.className = "url-to-article-bar";
      bar.dataset.url = url;
      bar.innerHTML = `
        <span class="url-to-article-icon">📄</span>
        <span class="url-to-article-label">${I18n.t("url_to_article.bar_label")}</span>
        <button class="btn btn-small btn-primary url-to-article-btn">
          ${I18n.t("url_to_article.fetch_button")}
        </button>
        <button class="btn btn-small btn-flat url-to-article-dismiss" aria-label="${I18n.t("url_to_article.dismiss")}">✕</button>
      `;

      bar.querySelector(".url-to-article-btn").addEventListener("click", () => {
        this._fetchAndPopulate(url);
      });
      bar.querySelector(".url-to-article-dismiss").addEventListener("click", () => {
        this._hideArticleBar();
        this._lastDetectedUrl = null; // Allow re-detection if title changes
      });

      const toolbarEl = this.element.querySelector(".d-editor-container");
      if (toolbarEl) {
        toolbarEl.insertAdjacentElement("afterbegin", bar);
      }
    },

    _hideArticleBar() {
      this.element
        ?.querySelectorAll(".url-to-article-bar")
        .forEach((el) => el.remove());
    },

    _setStatus(message, type = "info") {
      const bar = this.element?.querySelector(".url-to-article-bar");
      if (!bar) {
        return;
      }
      let status = bar.querySelector(".url-to-article-status");
      if (!status) {
        status = document.createElement("span");
        status.className = "url-to-article-status";
        bar.appendChild(status);
      }
      status.textContent = message;
      status.className = `url-to-article-status url-to-article-status--${type}`;
    },

    // ---- Fetch & populate ---------------------------------------------

    async _fetchAndPopulate(url) {
      const bar = this.element?.querySelector(".url-to-article-bar");
      const btn = bar?.querySelector(".url-to-article-btn");
      if (btn) {
        btn.disabled = true;
        btn.textContent = I18n.t("url_to_article.fetching");
      }
      this._setStatus(I18n.t("url_to_article.fetching"), "info");

      try {
        const data = await ajax("/url-to-article/extract", {
          type: "POST",
          data: { url },
        });
        if (data.error) {
          throw new Error(data.error);
        }
        this._populateComposer(data);
        this._setStatus(I18n.t("url_to_article.success"), "success");
        // Auto-hide bar after 3 seconds on success
        setTimeout(() => this._hideArticleBar(), 3000);
      } catch (err) {
        const msg =
          err.jqXHR?.responseJSON?.error ||
          err.message ||
          I18n.t("url_to_article.error_generic");
        this._setStatus(`${I18n.t("url_to_article.error_prefix")} ${msg}`, "error");
        if (btn) {
          btn.disabled = false;
          btn.textContent = I18n.t("url_to_article.retry_button");
        }
      }
    },

    _populateComposer(data) {
      const composerModel = this.get("composer.model");
      if (!composerModel) {
        return;
      }

      // Build the article body in Markdown
      const lines = [];

      // Attribution header (site name / byline line only when available)
      const siteName = data.site_name ? `**${data.site_name}**` : "";
      const byline = data.byline ? ` — *${data.byline}*` : "";
      if (siteName || byline) {
        lines.push(`> ${siteName}${byline}`);
      }
      lines.push(`> ${I18n.t("url_to_article.source_label")}: <${data.url}>`);
      lines.push("");

      if (data.description) {
        lines.push(`*${data.description}*`);
        lines.push("");
        lines.push("---");
        lines.push("");
      }

      lines.push(data.markdown || "");
      const body = lines.join("\n");

      // Only set title if it's still the raw URL (avoid overwriting edited titles)
      const currentTitle = composerModel.get("title") || "";
      if (currentTitle.trim() === data.url || currentTitle.trim() === "") {
        composerModel.set("title", data.title || data.url);
      }
      composerModel.set("reply", body);
    },
  });
});

51
assets/stylesheets/url-to-article.scss Normal file

@@ -0,0 +1,51 @@
/* URL-to-Article plugin styles */
.url-to-article-bar {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.5rem 0.75rem;
  margin-bottom: 0.5rem;
  background: var(--tertiary-low, #e8f4ff);
  border: 1px solid var(--tertiary-medium, #8bc2f0);
  border-radius: var(--d-border-radius, 4px);
  font-size: var(--font-down-1);
  flex-wrap: wrap;
}

.url-to-article-icon {
  font-size: 1.1em;
  flex-shrink: 0;
}

.url-to-article-label {
  flex: 1;
  min-width: 8rem;
  color: var(--primary-medium);
  font-weight: 500;
}

.url-to-article-btn {
  flex-shrink: 0;
}

.url-to-article-dismiss {
  flex-shrink: 0;
  padding: 0.25rem 0.5rem !important;
  color: var(--primary-medium) !important;
}

.url-to-article-status {
  font-style: italic;
  font-size: var(--font-down-1);

  &.url-to-article-status--info {
    color: var(--tertiary);
  }
  &.url-to-article-status--success {
    color: var(--success);
  }
  &.url-to-article-status--error {
    color: var(--danger);
  }
}

11
config/locales/client.en.yml Normal file

@@ -0,0 +1,11 @@
en:
  js:
    url_to_article:
      bar_label: "URL detected — import as article?"
      fetch_button: "Import Article"
      retry_button: "Retry"
      fetching: "Fetching…"
      dismiss: "Dismiss"
      success: "Article imported!"
      error_generic: "Unknown error"
      error_prefix: "Error:"
      source_label: "Source"

31
config/settings.yml Normal file

@@ -0,0 +1,31 @@
plugins:
  url_to_article_enabled:
    default: true
    client: true
    type: bool
  url_to_article_auto_populate:
    default: false
    client: true
    type: bool
    description: "Automatically populate the body when a URL is detected in the title (no button click needed)"
  url_to_article_max_content_length:
    default: 50000
    type: integer
    description: "Maximum number of characters to extract from a page"
  url_to_article_fetch_timeout:
    default: 10
    type: integer
    description: "Seconds to wait when fetching a URL"
  url_to_article_allowed_domains:
    default: ""
    type: string
    description: "Comma-separated list of allowed domains. Leave blank to allow all."
  url_to_article_blocked_domains:
    default: "localhost,127.0.0.1,0.0.0.0,::1"
    type: string
    description: "Comma-separated list of blocked domains (SSRF protection)"

232
lib/url_to_article/article_extractor.rb Normal file

@@ -0,0 +1,232 @@
# frozen_string_literal: true
require "nokogiri"
require "reverse_markdown"
require "net/http"
require "uri"
require "timeout"
module UrlToArticle
  class ArticleExtractor
    # Tags that are almost never article content
    NOISE_SELECTORS = %w[
      script style noscript iframe nav footer header
      .navigation .nav .menu .sidebar .widget .ad .advertisement
      .cookie-banner .cookie-notice .popup .modal .overlay
      .social-share .share-buttons .related-posts .comments
      #comments #sidebar #navigation #footer #header
      [role=navigation] [role=banner] [role=contentinfo]
      [aria-label=navigation] [aria-label=footer]
    ].freeze

    # Candidate content selectors tried in order
    ARTICLE_SELECTORS = %w[
      article[class*=content]
      article[class*=post]
      article[class*=article]
      article
      [role=main]
      main
      .post-content
      .article-content
      .entry-content
      .article-body
      .story-body
      .post-body
      .content-body
      .page-content
      #article-body
      #post-content
      #main-content
    ].freeze

    Result = Struct.new(:title, :byline, :site_name, :description, :markdown, :url, keyword_init: true)

    def self.extract(url)
      new(url).extract
    end

    def initialize(url)
      @url = url
      @uri = URI.parse(url)
    end

    def extract
      html = fetch_html
      doc = Nokogiri::HTML(html)

      title = extract_title(doc)
      byline = extract_byline(doc)
      site_name = extract_site_name(doc)
      description = extract_description(doc)

      content_node = find_content_node(doc)
      clean_node!(content_node)
      markdown = node_to_markdown(content_node)
      markdown = truncate(markdown)

      Result.new(
        title: title,
        byline: byline,
        site_name: site_name,
        description: description,
        markdown: markdown,
        url: @url
      )
    end

    private

    def fetch_html
      Timeout.timeout(SiteSetting.url_to_article_fetch_timeout) do
        http = Net::HTTP.new(@uri.host, @uri.port)
        http.use_ssl = @uri.scheme == "https"
        http.open_timeout = 5
        http.read_timeout = SiteSetting.url_to_article_fetch_timeout

        request = Net::HTTP::Get.new(@uri.request_uri)
        request["User-Agent"] = "Mozilla/5.0 (compatible; Discourse URL-to-Article Bot/1.0)"
        request["Accept"] = "text/html,application/xhtml+xml"
        request["Accept-Language"] = "en-US,en;q=0.9"

        response = http.request(request)

        # Follow one redirect
        if response.is_a?(Net::HTTPRedirection) && response["location"]
          redirect_uri = URI.parse(response["location"])
          @uri = redirect_uri
          http = Net::HTTP.new(@uri.host, @uri.port)
          http.use_ssl = @uri.scheme == "https"
          response = http.get(@uri.request_uri, "User-Agent" => request["User-Agent"])
        end

        raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

        response.body.force_encoding("UTF-8")
      end
    end

    def extract_title(doc)
      # Try OG title first, then twitter:title, then <title>
      og = doc.at_css('meta[property="og:title"]')&.attr("content")
      return og.strip if og.present?

      tw = doc.at_css('meta[name="twitter:title"]')&.attr("content")
      return tw.strip if tw.present?

      h1 = doc.at_css("h1")&.text
      return h1.strip if h1.present?

      doc.at_css("title")&.text&.strip || @uri.host
    end

    def extract_byline(doc)
      candidates = [
        doc.at_css('meta[name="author"]')&.attr("content"),
        doc.at_css('[rel="author"]')&.text,
        doc.at_css(".author")&.text,
        doc.at_css('[class*="byline"]')&.text,
        doc.at_css("address")&.text,
      ]
      candidates.compact.map(&:strip).reject(&:empty?).first
    end

    def extract_site_name(doc)
      doc.at_css('meta[property="og:site_name"]')&.attr("content")&.strip ||
        @uri.host.sub(/^www\./, "")
    end

    def extract_description(doc)
      doc.at_css('meta[property="og:description"]')&.attr("content")&.strip ||
        doc.at_css('meta[name="description"]')&.attr("content")&.strip
    end

    def find_content_node(doc)
      # Try known article selectors
      ARTICLE_SELECTORS.each do |sel|
        node = doc.at_css(sel)
        next unless node

        text = node.text.strip
        # Make sure it has meaningful content (>200 chars of text)
        return node if text.length > 200
      end

      # Fallback: score all <div> and <section> blocks by text density
      score_and_pick(doc)
    end

    def score_and_pick(doc)
      candidates = doc.css("div, section, td").map do |node|
        text = node.text.strip
        next if text.length < 150

        # Score = text length - penalize nodes with lots of tags (nav-heavy)
        tag_count = node.css("*").size.to_f
        text_length = text.length.to_f
        score = text_length - (tag_count * 3)
        [score, node]
      end.compact.sort_by { |s, _| -s }

      candidates.first&.last || doc.at_css("body") || doc
    end

    def clean_node!(node)
      return unless node

      # Remove noise elements
      NOISE_SELECTORS.each do |sel|
        node.css(sel).each(&:remove)
      end

      # Remove hidden elements
      node.css("[style]").each do |el|
        el.remove if el["style"] =~ /display\s*:\s*none|visibility\s*:\s*hidden/i
      end

      # Remove empty tags (except br, img, hr)
      node.css("span, div, p, section").each do |el|
        el.remove if el.text.strip.empty? && el.css("img, video, audio, iframe").empty?
      end

      # Strip all attributes except allowed ones on certain tags
      allowed = {
        "a" => %w[href title],
        "img" => %w[src alt title width height],
        "td" => %w[colspan rowspan],
        "th" => %w[colspan rowspan scope],
        "ol" => %w[start type],
        "li" => %w[value],
        "code" => %w[class],
        "pre" => %w[class],
      }

      node.css("*").each do |el|
        tag = el.name.downcase
        permitted = allowed[tag] || []
        el.attributes.each_key do |attr|
          el.remove_attribute(attr) unless permitted.include?(attr)
        end

        # Make relative image/link URLs absolute
        if tag == "img" && el["src"] && !el["src"].start_with?("http", "//", "data:")
          el["src"] = URI.join(@url, el["src"]).to_s rescue nil
        end
        if tag == "a" && el["href"] && !el["href"].start_with?("http", "//", "#", "mailto:")
          el["href"] = URI.join(@url, el["href"]).to_s rescue nil
        end
      end
    end

    def node_to_markdown(node)
      return "" unless node

      ReverseMarkdown.convert(node.to_html, unknown_tags: :bypass, github_flavored: true)
        .gsub(/\n{3,}/, "\n\n") # collapse excessive blank lines
        .strip
    end

    def truncate(text)
      max = SiteSetting.url_to_article_max_content_length
      return text if text.length <= max

      text[0...max] + "\n\n*[Content truncated — visit the original article for the full text.]*"
    end
  end
end

35
plugin.rb Normal file

@@ -0,0 +1,35 @@
# frozen_string_literal: true

# name: discourse-url-to-article
# about: Scrapes a URL pasted into the topic title and populates the composer body with the article content
# version: 0.1.0
# authors: Your Name
# url: https://github.com/yourname/discourse-url-to-article

gem "nokogiri", "1.16.4"
gem "reverse_markdown", "2.1.1"

enabled_site_setting :url_to_article_enabled

# Load the composer bar styles (assumes the stylesheet lives at
# assets/stylesheets/url-to-article.scss)
register_asset "stylesheets/url-to-article.scss"

after_initialize do
  require_relative "lib/url_to_article/article_extractor"

  module ::UrlToArticle
    PLUGIN_NAME = "discourse-url-to-article"

    class Engine < ::Rails::Engine
      engine_name PLUGIN_NAME
      isolate_namespace UrlToArticle
    end
  end

  require_relative "app/controllers/url_to_article/articles_controller"

  UrlToArticle::Engine.routes.draw do
    post "/extract" => "articles#extract"
  end

  Discourse::Application.routes.append do
    mount UrlToArticle::Engine, at: "/url-to-article"
  end
end