init commit

2026-03-18 11:10:07 -04:00
commit b1ef516348
8 changed files with 730 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,110 @@
+# discourse-url-to-article
+
+A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page. It extracts the article content (à la browser Reader Mode) and populates the **composer body** with a clean Markdown rendering.
+
+---
+
+## Features
+
+- 🔗 Detects a bare URL typed/pasted into the topic title
+- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
+- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
+- 🛡️ SSRF protection: blocks requests to private/loopback addresses
+- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
+- 🌐 Works with most article-style pages (news, blogs, documentation)
+
+---
+
+## Installation
+
+Add the plugin to your `app.yml`:
+
+```yaml
+hooks:
+  after_code:
+    - exec:
+        cd: $home/plugins
+        cmd:
+          - git clone https://github.com/yourname/discourse-url-to-article.git
+```
+
+Then rebuild the container from your Discourse directory (typically `/var/discourse`): `./launcher rebuild app`
+
+---
+
+## Site Settings
+
+| Setting | Default | Description |
+|---|---|---|
+| `url_to_article_enabled` | `true` | Enable/disable the plugin |
+| `url_to_article_auto_populate` | `false` | Populate the body automatically, without a button click |
+| `url_to_article_max_content_length` | `50000` | Maximum number of characters extracted from a page |
+| `url_to_article_fetch_timeout` | `10` | Seconds before the HTTP fetch times out |
+| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
+| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |
+
+---
+
+## How It Works
+
+### Frontend (Ember.js)
+
+`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:
+
+1. A dismissible bar appears above the editor offering to import the article.
+2. On click (or automatically when `url_to_article_auto_populate` is enabled), it POSTs to `/url-to-article/extract`.
+3. The response populates `composer.model.reply` (body) and optionally updates the title.
+
+### Backend (Ruby)
+
+`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:
+
+1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
+2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
+3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
+4. **Cleans the node**: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
+5. **Converts to Markdown** using the `reverse_markdown` gem.
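+
+The text-density fallback in step 3 can be illustrated with a minimal sketch. It assumes each candidate `<div>`/`<section>` has already been reduced to a few counts; the `Candidate` struct, the `25` paragraph weight, and the scoring formula are hypothetical and are not taken from the plugin's source.
+
+```ruby
+# Hypothetical per-node counts (assumption: gathered while walking the DOM).
+Candidate = Struct.new(:name, :text_length, :link_text_length, :paragraph_count,
+                       keyword_init: true)
+
+# Share of the node's text that sits inside <a> tags.
+def link_density(candidate)
+  return 0.0 if candidate.text_length.zero?
+
+  candidate.link_text_length.to_f / candidate.text_length
+end
+
+# Assumed rule: reward text volume and paragraphs, penalize link-heavy nodes.
+def density_score(candidate)
+  (candidate.text_length + 25 * candidate.paragraph_count) * (1.0 - link_density(candidate))
+end
+
+# Pick the highest-scoring node as the article body.
+def best_candidate(candidates)
+  candidates.max_by { |c| density_score(c) }
+end
+
+candidates = [
+  Candidate.new(name: "nav", text_length: 400, link_text_length: 380, paragraph_count: 0),
+  Candidate.new(name: "main", text_length: 5200, link_text_length: 300, paragraph_count: 18),
+  Candidate.new(name: "footer", text_length: 600, link_text_length: 450, paragraph_count: 1)
+]
+puts best_candidate(candidates).name # prints "main"
+```
+
+Multiplying by `1.0 - link_density` is one common way to keep navigation blocks and footers, which are mostly links, from outscoring the article text.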
+
+### Security
+
+- Only authenticated users can call `/url-to-article/extract`.
+- Only `http`/`https` schemes are allowed.
+- Configurable domain blocklist (loopback/private addresses blocked by default).
+- Optional allowlist to restrict to specific domains.
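+
+The checks above could be combined roughly as in the sketch below. It is illustrative only: `url_permitted?`, `private_ip_literal?`, and `DEFAULT_BLOCKED` are hypothetical names, and the default list shown is an assumption standing in for the site setting.
+
+```ruby
+require "ipaddr"
+require "uri"
+
+# Assumed defaults; the real blocklist comes from the site setting.
+DEFAULT_BLOCKED = %w[localhost 127.0.0.1].freeze
+
+# True when the host is an IP literal in a loopback, private, or link-local range.
+def private_ip_literal?(host)
+  ip = IPAddr.new(host)
+  ip.loopback? || ip.private? || ip.link_local?
+rescue IPAddr::Error
+  false # not an IP literal, so it is a hostname
+end
+
+# True when the URL passes the scheme, blocklist, and allowlist checks.
+# An empty `allowed` list means every non-blocked domain is accepted.
+def url_permitted?(raw_url, allowed: [], blocked: DEFAULT_BLOCKED)
+  uri = URI.parse(raw_url)
+  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
+  return false unless uri.is_a?(URI::HTTP) && uri.host
+
+  host = uri.hostname.downcase # hostname strips the brackets from IPv6 literals
+  return false if blocked.include?(host)
+  return false if private_ip_literal?(host)
+
+  allowed.empty? || allowed.any? { |d| host == d || host.end_with?(".#{d}") }
+rescue URI::InvalidURIError
+  false
+end
+
+puts url_permitted?("https://example.com/article")
+puts url_permitted?("http://192.168.1.10/")
+```
+
+This sketch inspects only the URL as written. A hostname that resolves to a private address would pass it, so a resolution-time check would be needed to cover that case.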
+
+---
+
+## Output Format
+
+The body is written as:
+
+```markdown
+> **Site Name** — *Author Name*
+> Source: <https://example.com/article>
+
+*Article description or lead paragraph.*
+
+---
+
+## Article Heading
+
+Full article text in Markdown...
+```
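+
+As a rough illustration, the template above could be assembled as follows. `build_body` and its keyword arguments are hypothetical names chosen for this sketch, not the plugin's actual API.
+
+```ruby
+# Hypothetical helper that joins the extracted fields into the body layout.
+def build_body(site_name:, author:, url:, description:, title:, markdown:)
+  [
+    "> **#{site_name}** \u2014 *#{author}*", # \u2014 is the dash separator in the format
+    "> Source: <#{url}>",
+    "",
+    "*#{description}*",
+    "",
+    "---",
+    "",
+    "## #{title}",
+    "",
+    markdown
+  ].join("\n")
+end
+
+puts build_body(
+  site_name: "Example News",
+  author: "Jane Doe",
+  url: "https://example.com/article",
+  description: "Article description or lead paragraph.",
+  title: "Article Heading",
+  markdown: "Full article text in Markdown..."
+)
+```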
+
+---
+
+## Extending
+
+### Custom extraction logic
+
+Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
+
+### Paywall / JS-rendered sites
+
+For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless browser service (e.g. Browserless, Splash), a third-party extraction API (e.g. Diffbot), or a self-hosted extractor such as Postlight Parser (formerly Mercury Parser).
+
+---
+
+## License
+
+MIT