init commit

README.md (new file, 110 lines)
# discourse-url-to-article

A Discourse plugin that detects when a URL is pasted into the **topic title** field and offers to scrape the page, extracting the article content (à la browser Reader Mode) and populating the **composer body** with a clean Markdown rendering.

---

## Features

- 🔗 Detects a bare URL typed/pasted into the topic title
- 📄 Extracts article content using a Readability-style heuristic (no external API needed)
- ✍️ Populates the topic body with clean Markdown: heading, byline, description, full article text
- 🛡️ SSRF protection: blocks requests to private/loopback addresses
- ⚙️ Configurable: auto-populate mode, allowlist/blocklist, timeout, content length cap
- 🌐 Works with most article-style pages (news, blogs, documentation)

---

## Installation

Add the plugin to your `app.yml`:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/yourname/discourse-url-to-article.git
```

Then rebuild: `./launcher rebuild app`

---

## Site Settings

| Setting | Default | Description |
|---|---|---|
| `url_to_article_enabled` | `true` | Enable/disable the plugin |
| `url_to_article_auto_populate` | `false` | Populate the body automatically, without a button click |
| `url_to_article_max_content_length` | `50000` | Max characters extracted from a page |
| `url_to_article_fetch_timeout` | `10` | Seconds before the HTTP fetch times out |
| `url_to_article_allowed_domains` | *(blank = all)* | Comma-separated domain allowlist |
| `url_to_article_blocked_domains` | `localhost,127.0.0.1,…` | SSRF blocklist |

---

## How It Works

### Frontend (Ember.js)

`initializers/url-to-article.js` hooks into the `composer-editor` component and observes the `composer.model.title` property via Ember's observer system. When the title matches a bare URL pattern:

1. A dismissible bar appears above the editor offering to import the article.
2. On click (or automatically if `auto_populate` is on), it POSTs to `/url-to-article/extract`.
3. The response populates `composer.model.reply` (body) and optionally updates the title.
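The "bare URL" check is a single anchored regex. Here is a standalone Ruby port of the frontend's `URL_REGEX` (illustrative only; the real check runs in JavaScript):

```ruby
# Port of the frontend's URL_REGEX: the whole (trimmed) title must be a
# single http(s) URL containing no whitespace.
URL_REGEX = %r{\Ahttps?://[^\s/$.?\#][^\s]*\z}i

def bare_url?(title)
  URL_REGEX.match?(title.strip)
end

bare_url?("https://example.com/article")    # => true
bare_url?("Check out https://example.com")  # => false
bare_url?("ftp://example.com")              # => false
```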
### Backend (Ruby)

`ArticleExtractor` in `lib/url_to_article/article_extractor.rb`:

1. **Fetches** the HTML via `Net::HTTP` with a browser-like User-Agent (follows one redirect).
2. **Extracts metadata** from Open Graph / Twitter Card / standard `<meta>` tags.
3. **Finds the content node** by trying a list of known semantic selectors (`article`, `[role=main]`, `.post-content`, etc.), then falling back to a text-density scoring algorithm over all `<div>` and `<section>` elements.
4. **Cleans the node**: removes nav, ads, scripts, hidden elements; strips non-essential attributes; makes relative URLs absolute.
5. **Converts to Markdown** using the `reverse_markdown` gem.
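The text-density fallback in step 3 boils down to "lots of text, few tags wins". A minimal standalone sketch, with plain structs standing in for Nokogiri nodes (the 150-character floor and the tag penalty of 3 mirror the extractor's values):

```ruby
# Each candidate block is scored as (text length) - 3 * (descendant tag count);
# the highest-scoring block above the minimum text length wins.
Candidate = Struct.new(:name, :text_length, :tag_count) do
  def score
    text_length - (tag_count * 3)
  end
end

def pick_content(candidates, min_text: 150)
  candidates
    .select { |c| c.text_length >= min_text }
    .max_by(&:score)
end

blocks = [
  Candidate.new("nav",     400,  120), # link-heavy navigation: heavily penalized
  Candidate.new("article", 3200, 60),  # dense article text: high score
  Candidate.new("footer",  200,  40),
]
pick_content(blocks).name # => "article"
```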
### Security

- Only authenticated users can call `/url-to-article/extract`.
- Only `http`/`https` schemes are allowed.
- Configurable domain blocklist (loopback/private addresses blocked by default).
- Optional allowlist to restrict fetches to specific domains.
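The blocklist and allowlist use different matching rules, which matters when configuring them: the blocklist does substring matching on the host, while the allowlist does suffix matching. A standalone sketch of those semantics:

```ruby
# Blocklist: substring match, so "localhost" also catches "sub.localhost".
def blocked?(host, blocked_domains)
  blocked_domains.any? { |d| host.include?(d) }
end

# Allowlist: suffix match; an empty allowlist permits everything.
# Note that end_with?("example.com") also matches "notexample.com";
# list ".example.com" instead if you need a strict subdomain boundary.
def allowed?(host, allowed_domains)
  allowed_domains.empty? || allowed_domains.any? { |d| host.end_with?(d) }
end

blocked?("localhost", %w[localhost 127.0.0.1]) # => true
allowed?("news.example.com", %w[example.com])  # => true
allowed?("evil.test", %w[example.com])         # => false
```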
---

## Output Format

The body is written as:

```markdown
> **Site Name** — *Author Name*
> Source: <https://example.com/article>

*Article description or lead paragraph.*

---

## Article Heading

Full article text in Markdown...
```
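A sketch of how that attribution header is assembled from the extract endpoint's JSON fields (`site_name`, `byline`, `url`); any field may be absent:

```ruby
# Builds the quoted attribution header; missing fields are simply omitted.
def attribution(site_name: nil, byline: nil, url:)
  parts = []
  parts << "**#{site_name}**" if site_name
  parts << "*#{byline}*" if byline

  lines = []
  lines << "> #{parts.join(' — ')}" unless parts.empty?
  lines << "> Source: <#{url}>"
  lines.join("\n")
end

attribution(site_name: "Example News", byline: "Jane Doe",
            url: "https://example.com/article")
# => "> **Example News** — *Jane Doe*\n> Source: <https://example.com/article>"
```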
---

## Extending

### Custom extraction logic

Subclass or monkey-patch `UrlToArticle::ArticleExtractor` in a separate plugin to add site-specific selectors or post-processing.

### Paywall / JS-rendered sites

For sites that require JavaScript rendering, replace the `fetch_html` method with a call to a headless-browser service (e.g. Browserless, Splash) or a third-party extraction API (Diffbot, Mercury Parser API).

---

## License

MIT
app/controllers/url_to_article/articles_controller.rb (new file, 63 lines)
# frozen_string_literal: true

module UrlToArticle
  class ArticlesController < ::ApplicationController
    requires_login
    before_action :ensure_enabled!
    before_action :validate_url!

    def extract
      result = ArticleExtractor.extract(@url)

      render json: {
        title: result.title,
        byline: result.byline,
        site_name: result.site_name,
        description: result.description,
        markdown: result.markdown,
        url: result.url,
      }
    rescue => e
      Rails.logger.warn("[url-to-article] Extraction failed for #{@url}: #{e.message}")
      render json: { error: "Could not extract article: #{e.message}" }, status: :unprocessable_entity
    end

    private

    def ensure_enabled!
      raise Discourse::NotFound unless SiteSetting.url_to_article_enabled
    end

    def validate_url!
      raw = params.require(:url)

      begin
        uri = URI.parse(raw)
      rescue URI::InvalidURIError
        return render json: { error: "Invalid URL" }, status: :bad_request
      end

      unless %w[http https].include?(uri.scheme)
        return render json: { error: "Only http/https URLs are supported" }, status: :bad_request
      end

      # SSRF protection — block private/loopback addresses
      blocked_domains = SiteSetting.url_to_article_blocked_domains
        .split(",").map(&:strip).reject(&:empty?)

      if blocked_domains.any? { |d| uri.host&.include?(d) }
        return render json: { error: "Domain not allowed" }, status: :forbidden
      end

      # Optionally enforce an allowlist
      allowed_domains = SiteSetting.url_to_article_allowed_domains
        .split(",").map(&:strip).reject(&:empty?)

      if allowed_domains.any? && !allowed_domains.any? { |d| uri.host&.end_with?(d) }
        return render json: { error: "Domain not in allowlist" }, status: :forbidden
      end

      @url = raw
    end
  end
end
assets/javascripts/discourse/initializers/url-to-article.js (new file, 197 lines)
import { apiInitializer } from "discourse/lib/api";
import { debounce } from "@ember/runloop";
import { ajax } from "discourse/lib/ajax";
import I18n from "I18n";

const URL_REGEX = /^(https?:\/\/[^\s/$.?#][^\s]*)$/i;
const DEBOUNCE_MS = 600;

export default apiInitializer("1.8.0", (api) => {
  if (!api.container.lookup("site-settings:main").url_to_article_enabled) {
    return;
  }

  // -----------------------------------------------------------------------
  // Inject a helper button + status banner into the composer
  // -----------------------------------------------------------------------
  api.modifyClass("component:composer-editor", {
    pluginId: "url-to-article",

    didInsertElement() {
      this._super(...arguments);
      this._setupUrlToArticle();
    },

    willDestroyElement() {
      this._super(...arguments);
      this._teardownUrlToArticle();
    },

    _setupUrlToArticle() {
      // Watch the title field — it lives outside the composer-editor DOM,
      // so we observe via the composer model's `title` property.
      const composer = this.get("composer");
      if (!composer) return;

      this._titleObserver = () => this._onTitleChanged();
      composer.addObserver("model.title", this, "_titleObserver");
    },

    _teardownUrlToArticle() {
      const composer = this.get("composer");
      if (!composer) return;
      composer.removeObserver("model.title", this, "_titleObserver");
    },

    _onTitleChanged() {
      const title = this.get("composer.model.title") || "";
      const match = title.trim().match(URL_REGEX);

      if (!match) {
        this._hideArticleBar();
        return;
      }

      const url = match[1];

      if (this._lastDetectedUrl === url) return; // Same URL — no-op
      this._lastDetectedUrl = url;

      const autoPopulate = api.container
        .lookup("site-settings:main")
        .url_to_article_auto_populate;

      if (autoPopulate) {
        debounce(this, "_fetchAndPopulate", url, DEBOUNCE_MS);
      } else {
        this._showArticleBar(url);
      }
    },

    // ---- Bar UI -------------------------------------------------------

    _showArticleBar(url) {
      this._hideArticleBar(); // remove any existing bar first

      const bar = document.createElement("div");
      bar.className = "url-to-article-bar";
      bar.dataset.url = url;
      bar.innerHTML = `
        <span class="url-to-article-icon">📄</span>
        <span class="url-to-article-label">${I18n.t("url_to_article.bar_label")}</span>
        <button class="btn btn-small btn-primary url-to-article-btn">
          ${I18n.t("url_to_article.fetch_button")}
        </button>
        <button class="btn btn-small btn-flat url-to-article-dismiss" aria-label="${I18n.t("url_to_article.dismiss")}">✕</button>
      `;

      bar.querySelector(".url-to-article-btn").addEventListener("click", () => {
        this._fetchAndPopulate(url);
      });

      bar.querySelector(".url-to-article-dismiss").addEventListener("click", () => {
        this._hideArticleBar();
        this._lastDetectedUrl = null; // Allow re-detection if title changes
      });

      const toolbarEl = this.element.querySelector(".d-editor-container");
      if (toolbarEl) {
        toolbarEl.insertAdjacentElement("afterbegin", bar);
      }
    },

    _hideArticleBar() {
      this.element?.querySelectorAll(".url-to-article-bar").forEach((el) => el.remove());
    },

    _setStatus(message, type = "info") {
      const bar = this.element?.querySelector(".url-to-article-bar");
      if (!bar) return;

      let status = bar.querySelector(".url-to-article-status");
      if (!status) {
        status = document.createElement("span");
        status.className = "url-to-article-status";
        bar.appendChild(status);
      }
      status.textContent = message;
      status.className = `url-to-article-status url-to-article-status--${type}`;
    },

    // ---- Fetch & populate ---------------------------------------------

    async _fetchAndPopulate(url) {
      const bar = this.element?.querySelector(".url-to-article-bar");
      const btn = bar?.querySelector(".url-to-article-btn");

      if (btn) {
        btn.disabled = true;
        btn.textContent = I18n.t("url_to_article.fetching");
      }
      this._setStatus(I18n.t("url_to_article.fetching"), "info");

      try {
        const data = await ajax("/url-to-article/extract", {
          type: "POST",
          data: { url },
        });

        if (data.error) {
          throw new Error(data.error);
        }

        this._populateComposer(data);
        this._setStatus(I18n.t("url_to_article.success"), "success");

        // Auto-hide bar after 3 seconds on success
        setTimeout(() => this._hideArticleBar(), 3000);
      } catch (err) {
        const msg = err.jqXHR?.responseJSON?.error || err.message || I18n.t("url_to_article.error_generic");
        this._setStatus(`${I18n.t("url_to_article.error_prefix")} ${msg}`, "error");
        if (btn) {
          btn.disabled = false;
          btn.textContent = I18n.t("url_to_article.retry_button");
        }
      }
    },

    _populateComposer(data) {
      const composerModel = this.get("composer.model");
      if (!composerModel) return;

      // Build the article body in Markdown
      const lines = [];

      // Attribution header (site name / byline may be missing)
      const siteName = data.site_name ? `**${data.site_name}**` : "";
      const byline = data.byline ? ` — *${data.byline}*` : "";
      if (siteName || byline) {
        lines.push(`> ${siteName}${byline}`);
      }
      lines.push(`> ${I18n.t("url_to_article.source_label")}: <${data.url}>`);
      lines.push("");

      if (data.description) {
        lines.push(`*${data.description}*`);
        lines.push("");
        lines.push("---");
        lines.push("");
      }

      lines.push(data.markdown || "");

      const body = lines.join("\n");

      // Only set title if it's still the raw URL (avoid overwriting edited titles)
      const currentTitle = composerModel.get("title") || "";
      if (currentTitle.trim() === data.url || currentTitle.trim() === "") {
        composerModel.set("title", data.title || data.url);
      }

      composerModel.set("reply", body);
    },
  });
});
assets/stylesheets/url-to-article.scss (new file, 51 lines)
/* URL-to-Article plugin styles */

.url-to-article-bar {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.5rem 0.75rem;
  margin-bottom: 0.5rem;
  background: var(--tertiary-low, #e8f4ff);
  border: 1px solid var(--tertiary-medium, #8bc2f0);
  border-radius: var(--d-border-radius, 4px);
  font-size: var(--font-down-1);
  flex-wrap: wrap;
}

.url-to-article-icon {
  font-size: 1.1em;
  flex-shrink: 0;
}

.url-to-article-label {
  flex: 1;
  min-width: 8rem;
  color: var(--primary-medium);
  font-weight: 500;
}

.url-to-article-btn {
  flex-shrink: 0;
}

.url-to-article-dismiss {
  flex-shrink: 0;
  padding: 0.25rem 0.5rem !important;
  color: var(--primary-medium) !important;
}

.url-to-article-status {
  font-style: italic;
  font-size: var(--font-down-1);

  &.url-to-article-status--info {
    color: var(--tertiary);
  }
  &.url-to-article-status--success {
    color: var(--success);
  }
  &.url-to-article-status--error {
    color: var(--danger);
  }
}
config/locales/client.en.yml (new file, 12 lines)
en:
  js:
    url_to_article:
      bar_label: "URL detected — import as article?"
      fetch_button: "Import Article"
      retry_button: "Retry"
      fetching: "Fetching…"
      dismiss: "Dismiss"
      success: "Article imported!"
      error_generic: "Unknown error"
      error_prefix: "Error:"
      source_label: "Source"
config/settings.yml (new file, 31 lines)
plugins:
  url_to_article_enabled:
    default: true
    client: true
    type: bool

  url_to_article_auto_populate:
    default: false
    client: true
    type: bool
    description: "Automatically populate the body when a URL is detected in the title (no button click needed)"

  url_to_article_max_content_length:
    default: 50000
    type: integer
    description: "Maximum number of characters to extract from a page"

  url_to_article_fetch_timeout:
    default: 10
    type: integer
    description: "Seconds to wait when fetching a URL"

  url_to_article_allowed_domains:
    default: ""
    type: string
    description: "Comma-separated list of allowed domains. Leave blank to allow all."

  url_to_article_blocked_domains:
    default: "localhost,127.0.0.1,0.0.0.0,::1"
    type: string
    description: "Comma-separated list of blocked domains (SSRF protection)"
lib/url_to_article/article_extractor.rb (new file, 232 lines)
# frozen_string_literal: true

require "nokogiri"
require "reverse_markdown"
require "net/http"
require "uri"
require "timeout"

module UrlToArticle
  class ArticleExtractor
    # Tags that are almost never article content
    NOISE_SELECTORS = %w[
      script style noscript iframe nav footer header
      .navigation .nav .menu .sidebar .widget .ad .advertisement
      .cookie-banner .cookie-notice .popup .modal .overlay
      .social-share .share-buttons .related-posts .comments
      #comments #sidebar #navigation #footer #header
      [role=navigation] [role=banner] [role=contentinfo]
      [aria-label=navigation] [aria-label=footer]
    ].freeze

    # Candidate content selectors tried in order
    ARTICLE_SELECTORS = %w[
      article[class*=content]
      article[class*=post]
      article[class*=article]
      article
      [role=main]
      main
      .post-content
      .article-content
      .entry-content
      .article-body
      .story-body
      .post-body
      .content-body
      .page-content
      #article-body
      #post-content
      #main-content
    ].freeze

    Result = Struct.new(:title, :byline, :site_name, :description, :markdown, :url, keyword_init: true)

    def self.extract(url)
      new(url).extract
    end

    def initialize(url)
      @url = url
      @uri = URI.parse(url)
    end

    def extract
      html = fetch_html
      doc = Nokogiri::HTML(html)

      title = extract_title(doc)
      byline = extract_byline(doc)
      site_name = extract_site_name(doc)
      description = extract_description(doc)
      content_node = find_content_node(doc)

      clean_node!(content_node)
      markdown = node_to_markdown(content_node)
      markdown = truncate(markdown)

      Result.new(
        title: title,
        byline: byline,
        site_name: site_name,
        description: description,
        markdown: markdown,
        url: @url
      )
    end

    private

    def fetch_html
      Timeout.timeout(SiteSetting.url_to_article_fetch_timeout) do
        http = Net::HTTP.new(@uri.host, @uri.port)
        http.use_ssl = @uri.scheme == "https"
        http.open_timeout = 5
        http.read_timeout = SiteSetting.url_to_article_fetch_timeout

        request = Net::HTTP::Get.new(@uri.request_uri)
        request["User-Agent"] = "Mozilla/5.0 (compatible; Discourse URL-to-Article Bot/1.0)"
        request["Accept"] = "text/html,application/xhtml+xml"
        request["Accept-Language"] = "en-US,en;q=0.9"

        response = http.request(request)

        # Follow one redirect.
        # NOTE: the redirect target is not re-checked against the domain
        # blocklist/allowlist, so a permitted host could redirect elsewhere.
        if response.is_a?(Net::HTTPRedirection) && response["location"]
          redirect_uri = URI.parse(response["location"])
          @uri = redirect_uri
          http = Net::HTTP.new(@uri.host, @uri.port)
          http.use_ssl = @uri.scheme == "https"
          response = http.get(@uri.request_uri, "User-Agent" => request["User-Agent"])
        end

        raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
        response.body.force_encoding("UTF-8")
      end
    end

    def extract_title(doc)
      # Try OG title first, then twitter:title, then <title>
      og = doc.at_css('meta[property="og:title"]')&.attr("content")
      return og.strip if og.present?

      tw = doc.at_css('meta[name="twitter:title"]')&.attr("content")
      return tw.strip if tw.present?

      h1 = doc.at_css("h1")&.text
      return h1.strip if h1.present?

      doc.at_css("title")&.text&.strip || @uri.host
    end

    def extract_byline(doc)
      candidates = [
        doc.at_css('meta[name="author"]')&.attr("content"),
        doc.at_css('[rel="author"]')&.text,
        doc.at_css(".author")&.text,
        doc.at_css('[class*="byline"]')&.text,
        doc.at_css("address")&.text,
      ]
      candidates.compact.map(&:strip).reject(&:empty?).first
    end

    def extract_site_name(doc)
      doc.at_css('meta[property="og:site_name"]')&.attr("content")&.strip ||
        @uri.host.sub(/^www\./, "")
    end

    def extract_description(doc)
      doc.at_css('meta[property="og:description"]')&.attr("content")&.strip ||
        doc.at_css('meta[name="description"]')&.attr("content")&.strip
    end

    def find_content_node(doc)
      # Try known article selectors
      ARTICLE_SELECTORS.each do |sel|
        node = doc.at_css(sel)
        next unless node
        text = node.text.strip
        # Make sure it has meaningful content (>200 chars of text)
        return node if text.length > 200
      end

      # Fallback: score all <div> and <section> blocks by text density
      score_and_pick(doc)
    end

    def score_and_pick(doc)
      candidates = doc.css("div, section, td").map do |node|
        text = node.text.strip
        next if text.length < 150

        # Score = text length, penalizing nodes with lots of tags (nav-heavy)
        tag_count = node.css("*").size.to_f
        text_length = text.length.to_f
        score = text_length - (tag_count * 3)

        [score, node]
      end.compact.sort_by { |s, _| -s }

      candidates.first&.last || doc.at_css("body") || doc
    end

    def clean_node!(node)
      return unless node

      # Remove noise elements
      NOISE_SELECTORS.each do |sel|
        node.css(sel).each(&:remove)
      end

      # Remove hidden elements
      node.css("[style]").each do |el|
        el.remove if el["style"] =~ /display\s*:\s*none|visibility\s*:\s*hidden/i
      end

      # Remove empty tags (except br, img, hr)
      node.css("span, div, p, section").each do |el|
        el.remove if el.text.strip.empty? && el.css("img, video, audio, iframe").empty?
      end

      # Strip all attributes except allowed ones on certain tags
      allowed = {
        "a" => %w[href title],
        "img" => %w[src alt title width height],
        "td" => %w[colspan rowspan],
        "th" => %w[colspan rowspan scope],
        "ol" => %w[start type],
        "li" => %w[value],
        "code" => %w[class],
        "pre" => %w[class],
      }
      node.css("*").each do |el|
        tag = el.name.downcase
        permitted = allowed[tag] || []
        el.attributes.each_key do |attr|
          el.remove_attribute(attr) unless permitted.include?(attr)
        end

        # Make relative image URLs absolute
        if tag == "img" && el["src"] && !el["src"].start_with?("http", "//", "data:")
          el["src"] = URI.join(@url, el["src"]).to_s rescue nil
        end
        if tag == "a" && el["href"] && !el["href"].start_with?("http", "//", "#", "mailto:")
          el["href"] = URI.join(@url, el["href"]).to_s rescue nil
        end
      end
    end

    def node_to_markdown(node)
      return "" unless node
      ReverseMarkdown.convert(node.to_html, unknown_tags: :bypass, github_flavored: true)
        .gsub(/\n{3,}/, "\n\n") # collapse excessive blank lines
        .strip
    end

    def truncate(text)
      max = SiteSetting.url_to_article_max_content_length
      return text if text.length <= max
      text[0...max] + "\n\n*[Content truncated — visit the original article for the full text.]*"
    end
  end
end
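The truncation step at the end of the pipeline is simple enough to show standalone; this mirrors `#truncate` above, with the site-setting cap passed in as a parameter instead of read from `SiteSetting`:

```ruby
# Hard cut at the cap, then append a notice pointing back to the original.
TRUNCATION_NOTICE = "\n\n*[Content truncated — visit the original article for the full text.]*"

def truncate(text, max)
  return text if text.length <= max
  text[0...max] + TRUNCATION_NOTICE
end

truncate("short article", 50)                 # => "short article"
truncate("x" * 100, 40).start_with?("x" * 40) # => true
```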
plugin.rb (new file, 35 lines)
# frozen_string_literal: true

# name: discourse-url-to-article
# about: Scrapes a URL pasted into the topic title and populates the composer body with the article content
# version: 0.1.0
# authors: Your Name
# url: https://github.com/yourname/discourse-url-to-article

gem "nokogiri", "1.16.4"
gem "reverse_markdown", "2.1.1"

enabled_site_setting :url_to_article_enabled

after_initialize do
  require_relative "lib/url_to_article/article_extractor"

  module ::UrlToArticle
    PLUGIN_NAME = "discourse-url-to-article"

    class Engine < ::Rails::Engine
      engine_name PLUGIN_NAME
      isolate_namespace UrlToArticle
    end
  end

  require_relative "app/controllers/url_to_article/articles_controller"

  UrlToArticle::Engine.routes.draw do
    post "/extract" => "articles#extract"
  end

  Discourse::Application.routes.append do
    mount UrlToArticle::Engine, at: "/url-to-article"
  end
end