# discourse-url-to-article
A Discourse plugin that detects when a URL is pasted into the topic title field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the composer body with a clean Markdown rendering.
## Features
- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)
## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://code.draft13.com/robert/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`
## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate the body automatically without a button click |
| `url_to_article_max_content_length` | `50000` | Max characters extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before the HTTP fetch times out |
| `url_to_article_allowed_domains` | (blank = all) | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |
## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare-URL pattern:

- A dismissible bar appears above the editor offering to import the article.
- On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
- The response populates `composer.model.reply` (the body) and optionally updates the title.
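The bare-URL check can be sketched as a single anchored regex: the title counts as "bare" only when it contains nothing but one URL. This is an illustrative sketch shown in Ruby for consistency with the backend code; the real check lives in the Ember initializer and its exact pattern may differ.

```ruby
# "Bare URL" means the entire trimmed title is a single http(s) URL,
# with no surrounding prose.
BARE_URL = %r{\Ahttps?://\S+\z}

def bare_url?(title)
  !!(title.to_s.strip =~ BARE_URL)
end
```

Anchoring with `\A`/`\z` ensures a title like "Check out https://example.com" is left alone, so normal topic titles that merely mention a link never trigger the import bar.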
### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

- Fetches the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
- Extracts metadata from Open Graph / Twitter Card / standard `<meta>` tags.
- Finds the content node by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
- Cleans the node: removes nav, ads, scripts, and hidden elements; strips non-essential attributes; makes relative URLs absolute.
- Converts to Markdown using the `reverse_markdown` gem.
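The text-density fallback can be illustrated with a simplified scoring rule (this is a sketch of the general idea, not the plugin's exact formula): score each candidate block by how much of its text is *not* link text. Navigation bars and footers are link-dense, so they score low; the article body scores high.

```ruby
# Each candidate stands in for a <div>/<section> node: its total text
# length and the length of text inside its <a> descendants.
Candidate = Struct.new(:selector, :text_length, :link_text_length) do
  def score
    # Penalize link-heavy blocks: navs/footers are mostly anchor text.
    text_length - 2 * link_text_length
  end
end

def pick_content(candidates)
  candidates.max_by(&:score)
end
```

In the real extractor these numbers would come from the parsed DOM; the sketch only shows why a long, link-sparse block beats a short, link-dense one.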
## Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict extraction to specific domains.
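The loopback/private-address check can be sketched as follows (an assumed implementation, not the plugin's exact code). Resolving the hostname *before* checking is the important part: it catches DNS names that point at internal IPs, not just literal `127.0.0.1`-style URLs.

```ruby
require "resolv"
require "ipaddr"

# RFC 1918 / loopback / link-local / unique-local ranges.
PRIVATE_RANGES = %w[
  127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
  169.254.0.0/16 ::1/128 fc00::/7
].map { |cidr| IPAddr.new(cidr) }

def safe_host?(host)
  addresses = Resolv.getaddresses(host)
  return false if addresses.empty? # unresolvable: refuse to fetch
  addresses.none? do |addr|
    ip = IPAddr.new(addr)
    PRIVATE_RANGES.any? { |r| r.family == ip.family && r.include?(ip) }
  end
rescue IPAddr::InvalidAddressError
  false
end
```

A production check would also guard against DNS rebinding (re-resolution between check and fetch); pinning the resolved IP for the actual request closes that gap.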
## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```
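Assembling that template is a simple string interpolation. The method and parameter names below are illustrative, not the plugin's actual API:

```ruby
# Render the composer body from extracted fields (hypothetical helper).
def render_body(site:, author:, url:, description:, markdown:)
  <<~MD
    > **#{site}** — *#{author}*
    > Source: <#{url}>

    *#{description}*

    ---

    #{markdown}
  MD
end
```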
## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
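One clean way to do this is `Module#prepend`. The example below assumes the extractor exposes a `content_selectors` hook, which is hypothetical; the stand-in base class exists only to make the sketch self-contained, and in a real plugin you would prepend onto the actual class instead.

```ruby
module UrlToArticle
  # Stand-in for the plugin's extractor, for illustration only.
  class ArticleExtractor
    def content_selectors
      ["article", "[role=main]", ".post-content"]
    end
  end
end

# Site-specific selectors tried before the defaults.
module ExampleSiteSelectors
  def content_selectors
    ["div.article-body--example"] + super
  end
end

UrlToArticle::ArticleExtractor.prepend(ExampleSiteSelectors)
```

Prepending keeps the override ahead of the original method in the lookup chain, so `super` still reaches the stock selector list and the patch survives without copying it.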
### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless-browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).
## License
MIT