Robert Johnson 8328169049 Update README.md
Changing URL in readme.
2026-03-18 14:31:16 -04:00

discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the topic title field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the composer body with a clean Markdown rendering.


Features

  • 🔗 Detects a bare URL typed/pasted into the topic title
  • 📄 Extracts article content using a Readability-style heuristic (no external API needed)
  • ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
  • 🛡️ SSRF protection: blocks requests to private/loopback addresses
  • ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
  • 🌐 Works with most article-style pages (news, blogs, documentation)

Installation

Add the plugin to your app.yml:

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://code.draft13.com/robert/discourse-url-to-article.git

Then rebuild: ./launcher rebuild app


Site Settings

| Setting | Default | Description |
|---|---|---|
| url_to_article_enabled | true | Enable/disable the plugin |
| url_to_article_auto_populate | false | Populate body automatically without button click |
| url_to_article_max_content_length | 50000 | Max chars extracted from a page |
| url_to_article_fetch_timeout | 10 | Seconds before HTTP fetch times out |
| url_to_article_allowed_domains | (blank = all) | Comma-separated domain allowlist |
| url_to_article_blocked_domains | localhost,127.0.0.1,… | SSRF blocklist |
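
As a rough illustration of how the two domain settings could interact, here is a minimal Ruby sketch. The helper names (parse_domain_list, domain_match?, host_permitted?) are hypothetical and not taken from the plugin, and the subdomain matching and "blocklist wins" precedence are assumptions rather than confirmed behavior.

```ruby
# Splits a comma-separated setting value into normalized domain names.
def parse_domain_list(setting)
  setting.to_s.split(",").map { |d| d.strip.downcase }.reject(&:empty?)
end

# True when host equals domain or is a subdomain of it (assumed matching rule).
def domain_match?(host, domain)
  host == domain || host.end_with?(".#{domain}")
end

# Assumed precedence: the blocklist wins, and a blank allowlist permits
# every host that is not blocked.
def host_permitted?(host, allowed:, blocked:)
  host = host.to_s.downcase
  allowed_list = parse_domain_list(allowed)
  blocked_list = parse_domain_list(blocked)

  return false if blocked_list.any? { |d| domain_match?(host, d) }

  allowed_list.empty? || allowed_list.any? { |d| domain_match?(host, d) }
end
```

With a blank allowlist, host_permitted?("example.com", allowed: "", blocked: "localhost") permits the host, while the same call for "localhost" rejects it.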

How It Works

Frontend (Ember.js)

initializers/url-to-article.js hooks into the composer-editor component and observes the composer.model.title property via Ember's observer system. When the title matches a bare URL pattern:

  1. A dismissible bar appears above the editor offering to import the article.
  2. On click (or automatically if auto_populate is on), it POSTs to /url-to-article/extract.
  3. The response populates composer.model.reply (body) and optionally updates the title.
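
The plugin's frontend check is written in JavaScript. Purely for illustration, the "bare URL" test can be expressed in Ruby as below, for example as a server-side mirror of the same rule. The method name bare_url? is hypothetical, and the exact pattern the plugin uses is not shown in this README.

```ruby
require "uri"

# Returns true when the title consists of a single http(s) URL and nothing else.
def bare_url?(title)
  text = title.to_s.strip
  # Any whitespace means there is more than a lone URL in the field.
  return false if text.empty? || text.match?(/\s/)

  uri = URI.parse(text)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end
```

Under this sketch a title such as "Read this: https://example.com/a" is not treated as a bare URL, because it contains text besides the link.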

Backend (Ruby)

ArticleExtractor in lib/url_to_article/article_extractor.rb:

  1. Fetches the HTML via Net::HTTP with a browser-like User-Agent (follows one redirect).
  2. Extracts metadata from Open Graph / Twitter Card / standard <meta> tags.
  3. Finds the content node by trying a list of known semantic selectors (article, [role=main], .post-content, etc.), then falling back to a text-density scoring algorithm over all <div> and <section> elements.
  4. Cleans the node: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
  5. Converts to Markdown using the reverse_markdown gem.
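
To make the fallback in step 3 more concrete, here is a minimal sketch of a text-density score. The README does not give the plugin's actual formula, so the weighting is an assumption in the spirit of Readability-style heuristics. Candidate, density_score, and best_candidate are hypothetical names, and the candidates are plain structs here, whereas the plugin would derive them from parsed <div> and <section> nodes.

```ruby
# One candidate block: its visible text and the part of that text inside links.
Candidate = Struct.new(:label, :text, :link_text, keyword_init: true)

# Rewards long text, penalizes link-heavy blocks (navigation, footers),
# and adds a small bonus per comma as a rough signal of prose.
def density_score(candidate)
  text_len = candidate.text.length
  return 0.0 if text_len.zero?

  link_density = [candidate.link_text.length.to_f / text_len, 1.0].min
  commas = candidate.text.count(",")
  (text_len * (1.0 - link_density)) + (commas * 10)
end

# Picks the highest-scoring candidate, or nil when the list is empty.
def best_candidate(candidates)
  candidates.max_by { |c| density_score(c) }
end
```

A block whose text consists entirely of links, such as a navigation bar, scores 0.0 in this sketch, so any block with real prose outranks it.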

Security

  • Only authenticated users can call /url-to-article/extract.
  • Only http/https schemes are allowed.
  • Configurable domain blocklist (loopback/private addresses blocked by default).
  • Optional allowlist to restrict to specific domains.
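
A domain blocklist alone does not catch a public hostname that resolves to an internal address. The sketch below shows one way the scheme check and an IP-range check could look, using only Ruby's standard library. The method names are hypothetical, and whether the plugin checks resolved IP addresses in this way is an assumption.

```ruby
require "ipaddr"
require "uri"

# Only http and https URLs are accepted.
def allowed_scheme?(url)
  %w[http https].include?(URI.parse(url).scheme)
rescue URI::InvalidURIError
  false
end

# True for loopback, RFC 1918 private, and link-local addresses.
# Unparseable input is treated as unsafe.
def internal_address?(ip_string)
  # native converts IPv4-mapped IPv6 (e.g. ::ffff:10.0.0.1) to plain IPv4.
  ip = IPAddr.new(ip_string).native
  ip.loopback? || ip.private? || ip.link_local?
rescue IPAddr::Error
  true
end
```

In practice the check would be applied to every address the hostname resolves to, and again after a redirect, since the redirect target can differ from the original host.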

Output Format

The body is written as:

> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
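
For illustration, the layout above could be assembled with a helper along these lines. format_body and its keyword arguments are hypothetical names, and dropping the author and description when they are missing is an assumption about how empty metadata is handled.

```ruby
# Builds the composer body in the layout shown above.
# "\u2014" is the dash that separates site name and author in that layout.
def format_body(site_name:, author:, url:, description:, title:, markdown:)
  author_part = "*#{author}*" if author && !author.empty?
  byline = ["**#{site_name}**", author_part].compact.join(" \u2014 ")

  parts = [
    "> #{byline}\n> Source: <#{url}>",
    ("*#{description}*" if description && !description.empty?),
    "---",
    "## #{title}",
    markdown.strip
  ]
  # Skip absent sections and separate the rest with blank lines.
  parts.compact.join("\n\n")
end
```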

Extending

Custom extraction logic

Subclass or monkey-patch UrlToArticle::ArticleExtractor in a separate plugin to add site-specific selectors or post-processing.

Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the fetch_html method with a call to a headless browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).
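
One possible shape for such a replacement is a module prepended to the extractor, so the original fetch_html remains available through super. This is a sketch under stated assumptions: the stand-in class exists only so the example runs on its own, the @url instance variable and the constructor signature are guesses about the real class, and RenderedFetch and its renderer hook are hypothetical names.

```ruby
module UrlToArticle
  # Stand-in so the sketch runs standalone; the real class ships with the plugin.
  unless const_defined?(:ArticleExtractor)
    class ArticleExtractor
      def initialize(url)
        @url = url
      end

      def fetch_html
        "<html>static fetch</html>"
      end
    end
  end
end

module RenderedFetch
  class << self
    # A callable that takes a URL and returns rendered HTML, or nil to disable.
    attr_accessor :renderer
  end

  # Uses the configured renderer when present, otherwise the original fetch.
  def fetch_html
    renderer = RenderedFetch.renderer
    return super if renderer.nil?

    renderer.call(@url)
  end
end

UrlToArticle::ArticleExtractor.prepend(RenderedFetch)
```

The renderer would typically wrap an HTTP call to the rendering service of your choice. Keeping it as an injectable callable means the service endpoint and credentials stay out of the patch itself.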


License

MIT

Description
Stores URL content in Discourse for personal knowledge management (PKM), i.e., as a personal bookmark database.