init commit
README.md (new file, 110 lines)
@@ -0,0 +1,110 @@
# discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.

---

## Features

- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)

---

## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/yourname/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`

---

## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate the body automatically, without a button click |
| `url_to_article_max_content_length` | `50000` | Max characters extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before the HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |

---

## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:

1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
3. The response populates `composer.model.reply` (the body) and optionally updates the title.
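The "bare URL" check in step 1 boils down to a single anchored regex over the trimmed title. A Ruby rendering of the same pattern, for illustration (the shipped check lives in the JS initializer):

```ruby
# Ruby mirror of the frontend's URL_REGEX, shown for illustration:
# the title must consist of a single http(s) URL and nothing else.
URL_REGEX = %r{\Ahttps?://[^\s/$.?#]\S*\z}i

def bare_url?(title)
  !title.strip.match(URL_REGEX).nil?
end

bare_url?("https://example.com/story")      # => true
bare_url?("Check out https://example.com")  # => false (URL is embedded)
```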

### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
4. **Cleans the node**: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
5. **Converts to Markdown** using the `reverse_markdown` gem.
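The text-density fallback in step 3 can be sketched in a few lines. This uses a stand-in `Struct` instead of real Nokogiri nodes so the idea is self-contained; the constants (150-character floor, ×3 tag penalty) match the shipped extractor:

```ruby
# Sketch of the text-density fallback: prefer blocks with lots of prose
# and few nested tags. `Block` stands in for a Nokogiri node here.
Block = Struct.new(:text, :tag_count)

def pick_densest(blocks)
  scored = blocks.filter_map do |b|
    text = b.text.strip
    next if text.length < 150              # too short to be an article body
    [text.length - b.tag_count * 3, b]     # penalize markup-heavy nodes
  end
  scored.max_by { |score, _| score }&.last
end

nav     = Block.new("Home About Contact Archive " * 8, 50)  # link-heavy chrome
article = Block.new("A long run of paragraph text. " * 30, 10)
pick_densest([nav, article]) == article  # the dense prose block wins
```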

### Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict to specific domains.

---

## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```

---

## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.
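A minimal sketch of the subclassing route. The `candidate_selectors` hook, the stub parent, and `ExampleNewsExtractor` are illustrative only, not part of the plugin's API — the shipped extractor keeps its selectors in the `ARTICLE_SELECTORS` constant, so a real subclass would override that lookup instead:

```ruby
# Illustrative only: a stub parent standing in for the plugin's
# UrlToArticle::ArticleExtractor, plus a site-specific subclass that
# tries its own selectors before the generic ones.
module UrlToArticle
  class ArticleExtractor
    def candidate_selectors
      %w[article main] # stand-in for the real ARTICLE_SELECTORS list
    end
  end

  class ExampleNewsExtractor < ArticleExtractor
    def candidate_selectors
      # Selectors for a hypothetical example-news.com layout, tried first
      %w[.c-article__body #story-text] + super
    end
  end
end
```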

### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).
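One way that swap might look. The endpoint URL and payload shape below are assumptions modeled on a Browserless-style `/content` endpoint — check your rendering service's actual API before using this:

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical replacement for fetch_html that asks a headless-browser
# rendering service for the post-JavaScript HTML. Endpoint and payload
# are assumptions, not a documented API.
module RenderedFetch
  RENDER_ENDPOINT = "http://localhost:3000/content" # assumed local service

  def render_payload(url)
    { url: url, waitFor: 1000 }.to_json # waitFor: ms for scripts to settle
  end

  def fetch_html
    uri = URI(RENDER_ENDPOINT)
    req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
    req.body = render_payload(@url)
    Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }.body
  end
end
```

Mixed in with `ArticleExtractor.prepend(RenderedFetch)`, this would shadow the stock `fetch_html` while leaving the rest of the extraction pipeline untouched.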

---

## License

MIT
app/controllers/url_to_article/articles_controller.rb (new file, 63 lines)
@@ -0,0 +1,63 @@
```ruby
# frozen_string_literal: true

module UrlToArticle
  class ArticlesController < ::ApplicationController
    requires_login
    before_action :ensure_enabled!
    before_action :validate_url!

    def extract
      result = ArticleExtractor.extract(@url)

      render json: {
        title: result.title,
        byline: result.byline,
        site_name: result.site_name,
        description: result.description,
        markdown: result.markdown,
        url: result.url,
      }
    rescue => e
      Rails.logger.warn("[url-to-article] Extraction failed for #{@url}: #{e.message}")
      render json: { error: "Could not extract article: #{e.message}" }, status: :unprocessable_entity
    end

    private

    def ensure_enabled!
      raise Discourse::NotFound unless SiteSetting.url_to_article_enabled
    end

    def validate_url!
      raw = params.require(:url)

      begin
        uri = URI.parse(raw)
      rescue URI::InvalidURIError
        return render json: { error: "Invalid URL" }, status: :bad_request
      end

      unless %w[http https].include?(uri.scheme)
        return render json: { error: "Only http/https URLs are supported" }, status: :bad_request
      end

      # SSRF protection — block private/loopback addresses
      blocked_domains = SiteSetting.url_to_article_blocked_domains
        .split(",").map(&:strip).reject(&:empty?)

      if blocked_domains.any? { |d| uri.host&.include?(d) }
        return render json: { error: "Domain not allowed" }, status: :forbidden
      end

      # Optionally enforce an allowlist
      allowed_domains = SiteSetting.url_to_article_allowed_domains
        .split(",").map(&:strip).reject(&:empty?)

      if allowed_domains.any? && !allowed_domains.any? { |d| uri.host&.end_with?(d) }
        return render json: { error: "Domain not in allowlist" }, status: :forbidden
      end

      @url = raw
    end
  end
end
```
assets/javascripts/discourse/initializers/url-to-article.js (new file, 197 lines)
@@ -0,0 +1,197 @@
```javascript
import { apiInitializer } from "discourse/lib/api";
import { debounce } from "@ember/runloop";
import { ajax } from "discourse/lib/ajax";
import I18n from "I18n";

const URL_REGEX = /^(https?:\/\/[^\s/$.?#][^\s]*)$/i;
const DEBOUNCE_MS = 600;

export default apiInitializer("1.8.0", (api) => {
  if (!api.container.lookup("site-settings:main").url_to_article_enabled) {
    return;
  }

  // -----------------------------------------------------------------------
  // Inject a helper button + status banner into the composer
  // -----------------------------------------------------------------------
  api.modifyClass("component:composer-editor", {
    pluginId: "url-to-article",

    didInsertElement() {
      this._super(...arguments);
      this._setupUrlToArticle();
    },

    willDestroyElement() {
      this._super(...arguments);
      this._teardownUrlToArticle();
    },

    _setupUrlToArticle() {
      // Watch the title field — it lives outside the composer-editor DOM,
      // so we observe via the composer model's `title` property.
      const composer = this.get("composer");
      if (!composer) return;

      composer.addObserver("model.title", this, "_onTitleChanged");
    },

    _teardownUrlToArticle() {
      const composer = this.get("composer");
      if (!composer) return;
      composer.removeObserver("model.title", this, "_onTitleChanged");
    },

    _onTitleChanged() {
      const title = this.get("composer.model.title") || "";
      const match = title.trim().match(URL_REGEX);

      if (!match) {
        this._hideArticleBar();
        return;
      }

      const url = match[1];

      if (this._lastDetectedUrl === url) return; // Same URL — no-op
      this._lastDetectedUrl = url;

      const autoPopulate = api.container
        .lookup("site-settings:main")
        .url_to_article_auto_populate;

      if (autoPopulate) {
        debounce(this, "_fetchAndPopulate", url, DEBOUNCE_MS);
      } else {
        this._showArticleBar(url);
      }
    },

    // ---- Bar UI -------------------------------------------------------

    _showArticleBar(url) {
      this._hideArticleBar(); // remove any existing bar first

      const bar = document.createElement("div");
      bar.className = "url-to-article-bar";
      bar.dataset.url = url;
      bar.innerHTML = `
        <span class="url-to-article-icon">📄</span>
        <span class="url-to-article-label">${I18n.t("url_to_article.bar_label")}</span>
        <button class="btn btn-small btn-primary url-to-article-btn">
          ${I18n.t("url_to_article.fetch_button")}
        </button>
        <button class="btn btn-small btn-flat url-to-article-dismiss" aria-label="${I18n.t("url_to_article.dismiss")}">✕</button>
      `;

      bar.querySelector(".url-to-article-btn").addEventListener("click", () => {
        this._fetchAndPopulate(url);
      });

      bar.querySelector(".url-to-article-dismiss").addEventListener("click", () => {
        this._hideArticleBar();
        this._lastDetectedUrl = null; // Allow re-detection if title changes
      });

      const toolbarEl = this.element.querySelector(".d-editor-container");
      if (toolbarEl) {
        toolbarEl.insertAdjacentElement("afterbegin", bar);
      }
    },

    _hideArticleBar() {
      this.element?.querySelectorAll(".url-to-article-bar").forEach((el) => el.remove());
    },

    _setStatus(message, type = "info") {
      const bar = this.element?.querySelector(".url-to-article-bar");
      if (!bar) return;

      let status = bar.querySelector(".url-to-article-status");
      if (!status) {
        status = document.createElement("span");
        status.className = "url-to-article-status";
        bar.appendChild(status);
      }
      status.textContent = message;
      status.className = `url-to-article-status url-to-article-status--${type}`;
    },

    // ---- Fetch & populate ---------------------------------------------

    async _fetchAndPopulate(url) {
      const bar = this.element?.querySelector(".url-to-article-bar");
      const btn = bar?.querySelector(".url-to-article-btn");

      if (btn) {
        btn.disabled = true;
        btn.textContent = I18n.t("url_to_article.fetching");
      }
      this._setStatus(I18n.t("url_to_article.fetching"), "info");

      try {
        const data = await ajax("/url-to-article/extract", {
          type: "POST",
          data: { url },
        });

        if (data.error) {
          throw new Error(data.error);
        }

        this._populateComposer(data);
        this._setStatus(I18n.t("url_to_article.success"), "success");

        // Auto-hide bar after 3 seconds on success
        setTimeout(() => this._hideArticleBar(), 3000);
      } catch (err) {
        const msg = err.jqXHR?.responseJSON?.error || err.message || I18n.t("url_to_article.error_generic");
        this._setStatus(`${I18n.t("url_to_article.error_prefix")} ${msg}`, "error");
        if (btn) {
          btn.disabled = false;
          btn.textContent = I18n.t("url_to_article.retry_button");
        }
      }
    },

    _populateComposer(data) {
      const composerModel = this.get("composer.model");
      if (!composerModel) return;

      // Build the article body in Markdown
      const lines = [];

      // Attribution header (site/author line only when we have one)
      const siteName = data.site_name ? `**${data.site_name}**` : "";
      const byline = data.byline ? ` — *${data.byline}*` : "";
      if (siteName || byline) {
        lines.push(`> ${siteName}${byline}`);
      }
      lines.push(`> ${I18n.t("url_to_article.source_label")}: <${data.url}>`);
      lines.push("");

      if (data.description) {
        lines.push(`*${data.description}*`);
        lines.push("");
        lines.push("---");
        lines.push("");
      }

      lines.push(data.markdown || "");

      const body = lines.join("\n");

      // Only set title if it's still the raw URL (avoid overwriting edited titles)
      const currentTitle = composerModel.get("title") || "";
      if (currentTitle.trim() === data.url || currentTitle.trim() === "") {
        composerModel.set("title", data.title || data.url);
      }

      composerModel.set("reply", body);
    },
  });
});
```
assets/stylesheets/url-to-article.scss (new file, 51 lines)
@@ -0,0 +1,51 @@
```scss
/* URL-to-Article plugin styles */

.url-to-article-bar {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.5rem 0.75rem;
  margin-bottom: 0.5rem;
  background: var(--tertiary-low, #e8f4ff);
  border: 1px solid var(--tertiary-medium, #8bc2f0);
  border-radius: var(--d-border-radius, 4px);
  font-size: var(--font-down-1);
  flex-wrap: wrap;
}

.url-to-article-icon {
  font-size: 1.1em;
  flex-shrink: 0;
}

.url-to-article-label {
  flex: 1;
  min-width: 8rem;
  color: var(--primary-medium);
  font-weight: 500;
}

.url-to-article-btn {
  flex-shrink: 0;
}

.url-to-article-dismiss {
  flex-shrink: 0;
  padding: 0.25rem 0.5rem !important;
  color: var(--primary-medium) !important;
}

.url-to-article-status {
  font-style: italic;
  font-size: var(--font-down-1);

  &.url-to-article-status--info {
    color: var(--tertiary);
  }
  &.url-to-article-status--success {
    color: var(--success);
  }
  &.url-to-article-status--error {
    color: var(--danger);
  }
}
```
config/locales/client.en.yml (new file, 11 lines)
@@ -0,0 +1,11 @@
```yaml
en:
  js:
    url_to_article:
      bar_label: "URL detected — import as article?"
      fetch_button: "Import Article"
      retry_button: "Retry"
      fetching: "Fetching…"
      dismiss: "Dismiss"
      success: "Article imported!"
      error_generic: "Unknown error"
      error_prefix: "Error:"
      source_label: "Source"
```
config/settings.yml (new file, 31 lines)
@@ -0,0 +1,31 @@
```yaml
plugins:
  url_to_article_enabled:
    default: true
    client: true
    type: bool

  url_to_article_auto_populate:
    default: false
    client: true
    type: bool
    description: "Automatically populate the body when a URL is detected in the title (no button click needed)"

  url_to_article_max_content_length:
    default: 50000
    type: integer
    description: "Maximum number of characters to extract from a page"

  url_to_article_fetch_timeout:
    default: 10
    type: integer
    description: "Seconds to wait when fetching a URL"

  url_to_article_allowed_domains:
    default: ""
    type: string
    description: "Comma-separated list of allowed domains. Leave blank to allow all."

  url_to_article_blocked_domains:
    default: "localhost,127.0.0.1,0.0.0.0,::1"
    type: string
    description: "Comma-separated list of blocked domains (SSRF protection)"
```
lib/url_to_article/article_extractor.rb (new file, 232 lines)
@@ -0,0 +1,232 @@
```ruby
# frozen_string_literal: true

require "nokogiri"
require "reverse_markdown"
require "net/http"
require "uri"
require "timeout"

module UrlToArticle
  class ArticleExtractor
    # Tags that are almost never article content
    NOISE_SELECTORS = %w[
      script style noscript iframe nav footer header
      .navigation .nav .menu .sidebar .widget .ad .advertisement
      .cookie-banner .cookie-notice .popup .modal .overlay
      .social-share .share-buttons .related-posts .comments
      #comments #sidebar #navigation #footer #header
      [role=navigation] [role=banner] [role=contentinfo]
      [aria-label=navigation] [aria-label=footer]
    ].freeze

    # Candidate content selectors tried in order
    ARTICLE_SELECTORS = %w[
      article[class*=content]
      article[class*=post]
      article[class*=article]
      article
      [role=main]
      main
      .post-content
      .article-content
      .entry-content
      .article-body
      .story-body
      .post-body
      .content-body
      .page-content
      #article-body
      #post-content
      #main-content
    ].freeze

    Result = Struct.new(:title, :byline, :site_name, :description, :markdown, :url, keyword_init: true)

    def self.extract(url)
      new(url).extract
    end

    def initialize(url)
      @url = url
      @uri = URI.parse(url)
    end

    def extract
      html = fetch_html
      doc = Nokogiri::HTML(html)

      title = extract_title(doc)
      byline = extract_byline(doc)
      site_name = extract_site_name(doc)
      description = extract_description(doc)
      content_node = find_content_node(doc)

      clean_node!(content_node)
      markdown = node_to_markdown(content_node)
      markdown = truncate(markdown)

      Result.new(
        title: title,
        byline: byline,
        site_name: site_name,
        description: description,
        markdown: markdown,
        url: @url
      )
    end

    private

    def fetch_html
      Timeout.timeout(SiteSetting.url_to_article_fetch_timeout) do
        http = Net::HTTP.new(@uri.host, @uri.port)
        http.use_ssl = @uri.scheme == "https"
        http.open_timeout = 5
        http.read_timeout = SiteSetting.url_to_article_fetch_timeout

        request = Net::HTTP::Get.new(@uri.request_uri)
        request["User-Agent"] = "Mozilla/5.0 (compatible; Discourse URL-to-Article Bot/1.0)"
        request["Accept"] = "text/html,application/xhtml+xml"
        request["Accept-Language"] = "en-US,en;q=0.9"

        response = http.request(request)

        # Follow one redirect (Location may be relative, so join it
        # against the original URI)
        if response.is_a?(Net::HTTPRedirection) && response["location"]
          @uri = URI.join(@uri.to_s, response["location"])
          http = Net::HTTP.new(@uri.host, @uri.port)
          http.use_ssl = @uri.scheme == "https"
          response = http.get(@uri.request_uri, "User-Agent" => request["User-Agent"])
        end

        raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
        response.body.force_encoding("UTF-8")
      end
    end

    def extract_title(doc)
      # Try OG title first, then twitter:title, then <h1>, then <title>
      og = doc.at_css('meta[property="og:title"]')&.attr("content")
      return og.strip if og.present?

      tw = doc.at_css('meta[name="twitter:title"]')&.attr("content")
      return tw.strip if tw.present?

      h1 = doc.at_css("h1")&.text
      return h1.strip if h1.present?

      doc.at_css("title")&.text&.strip || @uri.host
    end

    def extract_byline(doc)
      candidates = [
        doc.at_css('meta[name="author"]')&.attr("content"),
        doc.at_css('[rel="author"]')&.text,
        doc.at_css(".author")&.text,
        doc.at_css('[class*="byline"]')&.text,
        doc.at_css("address")&.text,
      ]
      candidates.compact.map(&:strip).reject(&:empty?).first
    end

    def extract_site_name(doc)
      doc.at_css('meta[property="og:site_name"]')&.attr("content")&.strip ||
        @uri.host.sub(/^www\./, "")
    end

    def extract_description(doc)
      doc.at_css('meta[property="og:description"]')&.attr("content")&.strip ||
        doc.at_css('meta[name="description"]')&.attr("content")&.strip
    end

    def find_content_node(doc)
      # Try known article selectors
      ARTICLE_SELECTORS.each do |sel|
        node = doc.at_css(sel)
        next unless node
        text = node.text.strip
        # Make sure it has meaningful content (>200 chars of text)
        return node if text.length > 200
      end

      # Fallback: score all <div> and <section> blocks by text density
      score_and_pick(doc)
    end

    def score_and_pick(doc)
      candidates = doc.css("div, section, td").map do |node|
        text = node.text.strip
        next if text.length < 150

        # Score = text length, minus a penalty for nodes with lots of
        # tags (nav-heavy blocks)
        tag_count = node.css("*").size.to_f
        text_length = text.length.to_f
        score = text_length - (tag_count * 3)

        [score, node]
      end.compact.sort_by { |s, _| -s }

      candidates.first&.last || doc.at_css("body") || doc
    end

    def clean_node!(node)
      return unless node

      # Remove noise elements
      NOISE_SELECTORS.each do |sel|
        node.css(sel).each(&:remove)
      end

      # Remove hidden elements
      node.css("[style]").each do |el|
        el.remove if el["style"] =~ /display\s*:\s*none|visibility\s*:\s*hidden/i
      end

      # Remove empty wrapper tags, unless they contain media
      node.css("span, div, p, section").each do |el|
        el.remove if el.text.strip.empty? && el.css("img, video, audio, iframe").empty?
      end

      # Strip all attributes except allowed ones on certain tags
      allowed = {
        "a" => %w[href title],
        "img" => %w[src alt title width height],
        "td" => %w[colspan rowspan],
        "th" => %w[colspan rowspan scope],
        "ol" => %w[start type],
        "li" => %w[value],
        "code" => %w[class],
        "pre" => %w[class],
      }
      node.css("*").each do |el|
        tag = el.name.downcase
        permitted = allowed[tag] || []
        # Snapshot the keys first — removing attributes while iterating
        # over `attributes` itself can skip entries
        el.attributes.keys.each do |attr|
          el.remove_attribute(attr) unless permitted.include?(attr)
        end

        # Make relative image URLs absolute
        if tag == "img" && el["src"] && !el["src"].start_with?("http", "//", "data:")
          el["src"] = URI.join(@url, el["src"]).to_s rescue nil
        end
        if tag == "a" && el["href"] && !el["href"].start_with?("http", "//", "#", "mailto:")
          el["href"] = URI.join(@url, el["href"]).to_s rescue nil
        end
      end
    end

    def node_to_markdown(node)
      return "" unless node
      ReverseMarkdown.convert(node.to_html, unknown_tags: :bypass, github_flavored: true)
        .gsub(/\n{3,}/, "\n\n") # collapse excessive blank lines
        .strip
    end

    def truncate(text)
      max = SiteSetting.url_to_article_max_content_length
      return text if text.length <= max
      text[0...max] + "\n\n*[Content truncated — visit the original article for the full text.]*"
    end
  end
end
```
plugin.rb (new file, 35 lines)
@@ -0,0 +1,35 @@
```ruby
# frozen_string_literal: true

# name: discourse-url-to-article
# about: Scrapes a URL pasted into the topic title and populates the composer body with the article content
# version: 0.1.0
# authors: Your Name
# url: https://github.com/yourname/discourse-url-to-article

gem "nokogiri", "1.16.4"
gem "reverse_markdown", "2.1.1"

enabled_site_setting :url_to_article_enabled

after_initialize do
  require_relative "lib/url_to_article/article_extractor"

  module ::UrlToArticle
    PLUGIN_NAME = "discourse-url-to-article"

    class Engine < ::Rails::Engine
      engine_name PLUGIN_NAME
      isolate_namespace UrlToArticle
    end
  end

  require_relative "app/controllers/url_to_article/articles_controller"

  UrlToArticle::Engine.routes.draw do
    post "/extract" => "articles#extract"
  end

  Discourse::Application.routes.append do
    mount UrlToArticle::Engine, at: "/url-to-article"
  end
end
```