# discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.

---

## Features

- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)

---

## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/yourname/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`

---

## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate body automatically without button click |
| `url_to_article_max_content_length` | `50000` | Max chars extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |

---

## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:

1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
3. The response populates `composer.model.reply` (body) and optionally updates the title.

### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `