# discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.

---

## Features

- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)

---

## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://code.draft13.com/robert/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`

---

## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate the body automatically, without a button click |
| `url_to_article_max_content_length` | `50000` | Maximum number of characters extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |

---

## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:

1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `url_to_article_auto_populate` is enabled), it POSTs to `/url-to-article/extract`.
3. The response populates `composer.model.reply` (body) and optionally updates the title.
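
The exact pattern the initializer uses is not shown here. As a rough illustration of what a "bare URL" check can look like, here is a minimal sketch written in Ruby rather than the initializer's JavaScript. The names `BARE_URL` and `bare_url?` are hypothetical, and the rule (the trimmed title is a single `http`/`https` URL with no surrounding text) is an assumption.

```ruby
# Assumption: a title is a "bare URL" when, after trimming whitespace,
# the whole string is one http(s) URL containing no spaces.
BARE_URL = %r{\Ahttps?://\S+\z}i

# Returns true only when the title consists of nothing but a URL.
def bare_url?(title)
  BARE_URL.match?(title.to_s.strip)
end
```

A title such as `Read this: https://example.com` would not match under this rule, so ordinary titles that merely mention a link would not trigger the import bar.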

### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
4. **Cleans the node**: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
5. **Converts to Markdown** using the `reverse_markdown` gem.
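
Steps 1 and 3 can be sketched roughly as follows. This is not the plugin's code: the User-Agent string, the `transport` parameter, the `Block` struct, and the scoring formula are all assumptions made so the example runs on its own with only the standard library. The only name taken from this README is `fetch_html`, and its real signature may differ.

```ruby
require "net/http"
require "uri"

# Hypothetical User-Agent string; the plugin's actual value is not documented here.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleArticleFetcher/1.0)"

# Default transport: performs one GET and returns the Net::HTTP response object.
NET_HTTP_TRANSPORT = lambda do |uri, timeout|
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.request(Net::HTTP::Get.new(uri, "User-Agent" => USER_AGENT))
  end
end

# Step 1: fetch a page, following at most `redirects_left` redirects.
# `transport` is injectable so the redirect logic can be exercised offline.
def fetch_html(url, timeout: 10, redirects_left: 1, transport: NET_HTTP_TRANSPORT)
  response = transport.call(URI.parse(url), timeout)
  status = response.code.to_i
  return response.body if status.between?(200, 299)

  location = response["location"]
  raise "unexpected HTTP status #{status}" unless status.between?(300, 399) && location
  raise "too many redirects" if redirects_left.zero?

  # Relative Location headers are resolved against the current URL.
  fetch_html(URI.join(url, location).to_s, timeout: timeout,
             redirects_left: redirects_left - 1, transport: transport)
end

# Step 3 fallback: a stand-in text-density score (illustrative, not the
# plugin's formula). Each candidate block records its total text length and
# how much of that text sits inside links.
Block = Struct.new(:label, :text_length, :link_text_length)

# Text that is mostly links (navigation, footers) scores low.
def density_score(block)
  return 0.0 if block.text_length.zero?

  link_density = block.link_text_length.to_f / block.text_length
  block.text_length * (1.0 - link_density)
end

# Picks the candidate with the highest score.
def densest_block(blocks)
  blocks.max_by { |block| density_score(block) }
end
```

A redirect target is a new URL, so it would normally go through the same scheme and domain checks as the original before being fetched.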

### Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict to specific domains.
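
To make these checks concrete, here is a minimal sketch of URL validation using only Ruby's standard library. It is an illustration under assumptions, not the plugin's implementation: `url_allowed?`, `private_address?`, `domain_match?`, and the default lists are hypothetical names and values, and treating subdomains as matching their parent domain is an assumed rule.

```ruby
require "uri"
require "ipaddr"

# Hypothetical defaults; the real values come from the plugin's site settings.
BLOCKED_DOMAINS = %w[localhost 127.0.0.1].freeze

# Loopback, private, and link-local ranges for IPv4 and IPv6.
PRIVATE_RANGES = %w[
  127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 169.254.0.0/16
  ::1/128 fc00::/7 fe80::/10
].map { |cidr| IPAddr.new(cidr) }.freeze

# True when the host is an IP literal inside one of the ranges above.
def private_address?(host)
  ip = IPAddr.new(host)
  PRIVATE_RANGES.any? { |range| range.include?(ip) }
rescue IPAddr::InvalidAddressError
  false # Not an IP literal; a hostname would need DNS resolution to be checked.
end

# True when the host equals a listed domain or is a subdomain of one.
def domain_match?(host, domains)
  domains.any? { |domain| host == domain || host.end_with?(".#{domain}") }
end

# Applies the scheme check, the blocklist, and the optional allowlist.
def url_allowed?(url, allowed: [], blocked: BLOCKED_DOMAINS)
  uri = URI.parse(url)
  return false unless %w[http https].include?(uri.scheme.to_s.downcase)

  host = uri.hostname.to_s.downcase
  return false if host.empty?
  return false if private_address?(host) || domain_match?(host, blocked)

  allowed.empty? || domain_match?(host, allowed)
rescue URI::InvalidURIError
  false
end
```

This sketch only inspects the URL string. A public hostname that resolves to a private address would pass it, so a stricter check would also resolve the name and test the resulting IP.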

---

## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```
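
As an illustration of how a body in this shape could be assembled, here is a small Ruby sketch. The `Article` struct, its field names, and `render_body` are hypothetical, and skipping the author and description when they are missing is an assumption about the intended behavior.

```ruby
# Hypothetical container for the extracted fields; the plugin's names may differ.
Article = Struct.new(:site_name, :author, :url, :description, :title, :markdown,
                     keyword_init: true)

# Builds the composer body in the layout shown above.
def render_body(article)
  byline = ["**#{article.site_name}**"]
  byline << "*#{article.author}*" if article.author

  # The attribution block is a two-line blockquote.
  parts = ["> #{byline.join(' — ')}\n> Source: <#{article.url}>"]
  parts << "*#{article.description}*" if article.description
  parts << "---"
  parts << "## #{article.title}"
  parts << article.markdown

  # Blank lines between parts keep each one a separate Markdown block.
  parts.join("\n\n")
end
```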

---

## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
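
A subclass might look like the sketch below. The extractor's internal method names are not documented in this README, so `content_selectors` and `post_process` are hypothetical hook points, and the stand-in base class exists only so the example runs on its own. In a real plugin you would override whatever methods `UrlToArticle::ArticleExtractor` actually defines.

```ruby
module UrlToArticle
  # Stand-in for the plugin's class so this sketch is self-contained.
  # The method names here are assumptions, not the plugin's real API.
  class ArticleExtractor
    def content_selectors
      ["article", "[role=main]", ".post-content"]
    end

    def post_process(markdown)
      markdown
    end
  end
end

# Hypothetical site-specific extractor.
class ExampleSiteExtractor < UrlToArticle::ArticleExtractor
  # Try a site-specific selector first, then fall back to the defaults.
  def content_selectors
    [".story-body"] + super
  end

  # Drop lines that contain only the word "Advertisement".
  def post_process(markdown)
    super.gsub(/^Advertisement[ \t]*\n?/, "")
  end
end
```

Subclassing keeps the site-specific rules in one place and leaves the base class untouched, which tends to survive plugin upgrades better than a monkey-patch.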

### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).

---

## License

MIT