# discourse-url-to-article
A Discourse plugin that detects when a URL is pasted into the topic title field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the composer body with a clean Markdown rendering.
## Features
- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)
## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://code.draft13.com/robert/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`
## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate the body automatically without a button click |
| `url_to_article_max_content_length` | `50000` | Max characters extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before the HTTP fetch times out |
| `url_to_article_allowed_domains` | (blank = all) | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |
## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare-URL pattern:

- A dismissible bar appears above the editor offering to import the article.
- On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
- The response populates `composer.model.reply` (the body) and optionally updates the title.
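The bare-URL check can be sketched as a single anchored regex: the title counts as "bare" only when it contains nothing but one URL. This is an illustrative sketch shown in Ruby for consistency with the backend code; the real check lives in the Ember initializer and its exact pattern may differ.

```ruby
# "Bare URL" means the entire trimmed title is a single http(s) URL,
# with no surrounding prose.
BARE_URL = %r{\Ahttps?://\S+\z}

def bare_url?(title)
  !!(title.to_s.strip =~ BARE_URL)
end
```

Anchoring with `\A`/`\z` ensures a title like "Check out https://example.com" is left alone, so normal topic titles that merely mention a link never trigger the import bar.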
### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

- Fetches the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
- Extracts metadata from Open Graph / Twitter Card / standard `<meta>` tags.
- Finds the content node by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
- Cleans the node: removes nav, ads, scripts, and hidden elements; strips non-essential attributes; makes relative URLs absolute.
- Converts to Markdown using the `reverse_markdown` gem.
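The text-density fallback can be illustrated with a simplified scoring rule (this is a sketch of the general idea, not the plugin's exact formula): score each candidate block by how much of its text is *not* link text. Navigation bars and footers are link-dense, so they score low; the article body scores high.

```ruby
# Each candidate stands in for a <div>/<section> node: its total text
# length and the length of text inside its <a> descendants.
Candidate = Struct.new(:selector, :text_length, :link_text_length) do
  def score
    # Penalize link-heavy blocks: navs/footers are mostly anchor text.
    text_length - 2 * link_text_length
  end
end

def pick_content(candidates)
  candidates.max_by(&:score)
end
```

In the real extractor these numbers would come from the parsed DOM; the sketch only shows why a long, link-sparse block beats a short, link-dense one.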
## Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict extraction to specific domains.
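The loopback/private-address check can be sketched as follows (an assumed implementation, not the plugin's exact code). Resolving the hostname *before* checking is the important part: it catches DNS names that point at internal IPs, not just literal `127.0.0.1`-style URLs.

```ruby
require "resolv"
require "ipaddr"

# RFC 1918 / loopback / link-local / unique-local ranges.
PRIVATE_RANGES = %w[
  127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
  169.254.0.0/16 ::1/128 fc00::/7
].map { |cidr| IPAddr.new(cidr) }

def safe_host?(host)
  addresses = Resolv.getaddresses(host)
  return false if addresses.empty? # unresolvable: refuse to fetch
  addresses.none? do |addr|
    ip = IPAddr.new(addr)
    PRIVATE_RANGES.any? { |r| r.family == ip.family && r.include?(ip) }
  end
rescue IPAddr::InvalidAddressError
  false
end
```

A production check would also guard against DNS rebinding (re-resolution between check and fetch); pinning the resolved IP for the actual request closes that gap.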
## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```
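Assembling that template is a simple string interpolation. The method and parameter names below are illustrative, not the plugin's actual API:

```ruby
# Render the composer body from extracted fields (hypothetical helper).
def render_body(site:, author:, url:, description:, markdown:)
  <<~MD
    > **#{site}** — *#{author}*
    > Source: <#{url}>

    *#{description}*

    ---

    #{markdown}
  MD
end
```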
## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
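One clean way to do this is `Module#prepend`. The example below assumes the extractor exposes a `content_selectors` hook, which is hypothetical; the stand-in base class exists only to make the sketch self-contained, and in a real plugin you would prepend onto the actual class instead.

```ruby
module UrlToArticle
  # Stand-in for the plugin's extractor, for illustration only.
  class ArticleExtractor
    def content_selectors
      ["article", "[role=main]", ".post-content"]
    end
  end
end

# Site-specific selectors tried before the defaults.
module ExampleSiteSelectors
  def content_selectors
    ["div.article-body--example"] + super
  end
end

UrlToArticle::ArticleExtractor.prepend(ExampleSiteSelectors)
```

Prepending keeps the override ahead of the original method in the lookup chain, so `super` still reaches the stock selector list and the patch survives without copying it.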
### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless-browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).
## License
MIT