# discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.

---

## Features

- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)

---

## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/yourname/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`

---

## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate body automatically without button click |
| `url_to_article_max_content_length` | `50000` | Max chars extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |

---

## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:

1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
3.
   The response populates `composer.model.reply` (the body) and optionally updates the title.

### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<p>` and `<div>` elements.
4. **Cleans the node**: removes nav, ads, scripts, and hidden elements; strips non-essential attributes; makes relative URLs absolute.
5. **Converts to Markdown** using the `reverse_markdown` gem.

### Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict extraction to specific domains.

---

## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <url>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```

---

## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.

### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).

---

## License

MIT
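
---

## Appendix: example extension

As a concrete sketch of the "Custom extraction logic" approach described under Extending: a separate plugin can prepend a module onto the extractor so site-specific selectors are tried before the defaults. This is a hypothetical illustration only — `candidate_selectors` is an assumed hook name, and the stub class below stands in for the real `UrlToArticle::ArticleExtractor`.

```ruby
# Hypothetical sketch: a site-specific override from a separate plugin.
# `candidate_selectors` is an illustrative hook name, not necessarily
# the plugin's actual internal API.
module SiteSpecificSelectors
  def candidate_selectors
    # Try our site-specific selectors first, then fall back to the defaults.
    ["#story-body", ".article__content"] + super
  end
end

# Stand-in for UrlToArticle::ArticleExtractor so the sketch runs standalone.
class ArticleExtractor
  def candidate_selectors
    ["article", "[role=main]", ".post-content"]
  end
end

# `prepend` puts the module ahead of the class in the ancestor chain,
# so its method runs first and can call `super` to reach the original.
ArticleExtractor.prepend(SiteSpecificSelectors)
```

The same `prepend`-and-`super` pattern works for post-processing hooks (e.g. wrapping whatever method produces the final Markdown), keeping the override isolated in its own plugin rather than editing this one.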