From: "Eric Wong (Contractor, The Linux Foundation)"
To: meta@public-inbox.org
Subject: [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes
Date: Tue, 6 Mar 2018 08:42:08 +0000
Message-Id: <20180306084242.19988-1-e@80x24.org>

Duplicate detection based on `content_id' now works and rejects
obviously re-sent messages with the same Message-Id.

Since many historical messages already have multiple Message-Ids
(some from buggy versions of git-send-email), we will inject
Message-Ids as needed to differentiate messages with the SAME
Message-Id.  This prevents NNTP readers from missing out on
messages.  Internally, the Message-Id we _favor_ for NNTP is also
the one which gets used for rendering threads.

Excessively long Message-Ids are just truncated to 244 for now
(Xapian limit for terms).  I hope it's not an abuse vector going
forward (only one spam message used it), but this is another
problem our inject-new-Message-Id-on-duplicate scheme "solves".

Internal timestamps used for sorting now favor the first
(last-added) Received: header since it is more likely to be
correct than the Date: header.  A wrong Date: header will still
show up in the per-message ("permalink") view, so it can still be
used to embarrass people with bad clocks :P  (Of course,
downloadable mboxes will continue to show them).  For thread
skeleton (index) views in HTML, we use the internal timestamp for
now; but maybe we'll use the Date: like the permalink view.
Maybe internally there can be two timestamps like git's
author-vs-committer dates.

Xapian index size is reduced, as the "nq:" search field is no
longer redundantly storing information that would be in
searchable diff fields (df* in
https://public-inbox.org/git/_/text/help/).  This (along with
remembering to run fstrim(8)) seems to have reduced best-case
indexing time to around 3.5 hours for the 2000-2017 dataset I'm
using \o/

Eric Wong (Contractor, The Linux Foundation) (34):
  v2writable: delete ::Import obj when ->done
  search: remove informational "warning" message
  searchidx: add PID to error message when die-ing
  content_id: special treatment for Message-Id headers
  evcleanup: disable outside of daemon
  v2writable: deduplicate detection on add
  evcleanup: do not create event loop if nothing was registered
  mid: add `mids' and `references' methods for extraction
  content_id: use `mids' and `references' for MID extraction
  searchidx: use new `references' method for parsing References
  content_id: no need to be human-friendly
  v2writable: inject new Message-IDs on true duplicates
  search: revert to using 'Q' as a uniQue id per-Xapian conventions
  searchidx: support indexing multiple MIDs
  mid: be strict with References, but loose on Message-Id
  searchidx: avoid excessive XNQ indexing with diffs
  searchidxskeleton: add a note about locking
  v2writable: generated Message-ID goes first
  searchidx: use add_boolean_term for internal terms
  searchidx: add NNTP article number as a searchable term
  mid: truncate excessively long MIDs early
  nntp: use NNTP article numbers for lookups
  nntp: fix NEWNEWS command
  searchidx: store the primary MID in doc data for NNTP
  import: consolidate object info for v2 imports
  v2: avoid redundant/repeated configs for git partition repos
  INSTALL: document more optional dependencies
  search: favor skeleton DB for lookup_mail
  search: each_smsg_by_mid uses skeleton if available
  v2writable: remove unnecessary skeleton commit
  favor Received: date over Date: header globally
  import: fall back to Sender for extracting name and email
  scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
  v2writable: detect and use previous partition count

 INSTALL                              |  13 ++
 MANIFEST                             |   2 +
 lib/PublicInbox/ContentId.pm         |  32 +++--
 lib/PublicInbox/Daemon.pm            |   1 +
 lib/PublicInbox/EvCleanup.pm         |   6 +-
 lib/PublicInbox/ExtMsg.pm            |   2 +-
 lib/PublicInbox/Import.pm            |  99 ++++++-------
 lib/PublicInbox/Inbox.pm             |   1 +
 lib/PublicInbox/MID.pm               |  55 +++++++-
 lib/PublicInbox/MsgTime.pm           |  51 +++++++
 lib/PublicInbox/NNTP.pm              |  31 ++---
 lib/PublicInbox/Search.pm            |  70 ++++++++--
 lib/PublicInbox/SearchIdx.pm         | 260 +++++++++++++++++++++--------------
 lib/PublicInbox/SearchIdxPart.pm     |   8 +-
 lib/PublicInbox/SearchIdxSkeleton.pm |  27 +---
 lib/PublicInbox/SearchMsg.pm         |  26 ++--
 lib/PublicInbox/V2Writable.pm        | 166 +++++++++++++++++++---
 lib/PublicInbox/View.pm              |   8 +-
 lib/PublicInbox/WwwAtomStream.pm     |   5 +-
 scripts/import_vger_from_mbox        |  11 +-
 t/content_id.t                       |   5 +-
 t/import.t                           |   9 +-
 t/init.t                             |   2 +
 t/mid.t                              |  22 ++-
 t/nntpd.t                            |   2 +
 t/search-thr-index.t                 |   2 +-
 t/v2writable.t                       | 195 ++++++++++++++++++++++++++
 27 files changed, 842 insertions(+), 269 deletions(-)
 create mode 100644 lib/PublicInbox/MsgTime.pm
 create mode 100644 t/v2writable.t

-- 
EW
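
For readers who want the duplicate-handling rules from this cover
letter in one place, here is a rough Python sketch (the series itself
is Perl, and this is not its code).  Everything named here is an
assumption made for illustration: the `content_id` hash inputs, the
`add` function, the in-memory `store`, and the "@generated.invalid"
suffix are all hypothetical.  Only the three behaviors come from the
text above: reject a re-sent message with the same Message-Id and the
same content, generate a new Message-Id (listed first) when the content
differs, and truncate long Message-Ids to 244.

```python
import hashlib
from email import message_from_string
from email.message import Message

MAX_MID_LEN = 244  # truncation length quoted in the cover letter


def mids(msg: Message) -> list[str]:
    """Return every Message-Id value, brackets stripped and truncated."""
    out = []
    for raw in msg.get_all("Message-Id", []):
        mid = raw.strip().strip("<>")
        if mid:
            out.append(mid[:MAX_MID_LEN])
    return out


def content_id(msg: Message) -> str:
    """Hypothetical content hash: a few headers plus the body.

    The real content_id may hash different fields; this is an assumption.
    """
    h = hashlib.sha256()
    for name in ("Subject", "From", "Date"):
        for val in msg.get_all(name, []):
            h.update(f"{name}:{val}\n".encode("utf-8", "replace"))
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart: flatten the parts
        body = "".join(part.as_string() for part in body)
    h.update(body.encode("utf-8", "replace"))
    return h.hexdigest()


def add(store: dict[str, set[str]], msg: Message) -> tuple[str, list[str]]:
    """Decide what to do with an incoming message.

    `store` maps a Message-Id to the content ids already seen under it.
    Returns (status, message_ids) with the favored Message-Id first.
    """
    cid = content_id(msg)
    ids = mids(msg)
    # Same Message-Id and same content: an obvious re-send, reject it.
    if any(cid in store.get(mid, set()) for mid in ids):
        return "duplicate", []
    unused = [mid for mid in ids if mid not in store]
    if unused:
        primary, status = unused[0], "added"
    else:
        # Every Message-Id is taken by different content (or there is
        # none): generate a new one so NNTP readers can still see it.
        primary, status = cid[:40] + "@generated.invalid", "injected"
    ordered = [primary] + [mid for mid in ids if mid != primary]
    for mid in ordered:
        store.setdefault(mid, set()).add(cid)
    return status, ordered
```

Calling `add` twice with the same message gives "added" and then
"duplicate"; a message that reuses the Message-Id with a different body
gives "injected", with the generated id ahead of the original one.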