From: "Eric Wong (Contractor, The Linux Foundation)" <e@80x24.org>
To: meta@public-inbox.org
Subject: [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes
Date: Tue, 6 Mar 2018 08:42:08 +0000
Message-ID: <20180306084242.19988-1-e@80x24.org>
Duplicate detection based on `content_id' now works and rejects
obviously re-sent messages with the same Message-Id.
Since many historical messages already have multiple Message-Ids (some
from buggy versions of git-send-email), we will inject Message-Ids as
needed to differentiate messages with the SAME Message-Id. This
prevents NNTP readers from missing out on messages.
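
As an illustration of the two cases above (a plain re-send versus
different content behind the same Message-Id), here is a minimal
Python sketch. It is not the Perl code in this series: the header set
hashed by `content_id`, the `store` dict, the `add_message` name and
the generated-id format are all assumptions made for the example, and
it only handles single-part text messages.

```python
import email
import hashlib
import uuid

# Hypothetical header set; the real content_id may hash different fields.
HASHED_HEADERS = ("Subject", "From", "Date", "To", "Cc")


def content_id(msg):
    """Hash selected headers and the body, deliberately skipping Message-Id."""
    h = hashlib.sha256()
    for name in HASHED_HEADERS:
        for value in msg.get_all(name, []):
            h.update(f"{name}:{value}\0".encode("utf-8"))
    # Assumes a single-part message, so the payload is a plain string.
    h.update(msg.get_payload().encode("utf-8"))
    return h.hexdigest()


def add_message(store, msg):
    """Return 'added', 'rejected' or 'injected'.

    `store` maps a Message-Id to the set of content ids seen under it.
    """
    mid = msg["Message-Id"]
    cid = content_id(msg)
    seen = store.setdefault(mid, set())
    if not seen:
        seen.add(cid)
        return "added"
    if cid in seen:
        return "rejected"  # same Message-Id and same content: a re-send
    # Same Message-Id, different content: put a generated id first and
    # keep the original one after it.
    new_mid = f"<{uuid.uuid4().hex}@injected.example>"
    originals = msg.get_all("Message-Id")
    del msg["Message-Id"]
    for value in [new_mid] + originals:
        msg["Message-Id"] = value
    seen.add(cid)
    store[new_mid] = {cid}
    return "injected"
```

Feeding the same raw message twice yields "added" then "rejected",
while a message that reuses the Message-Id with a different body comes
back as "injected" and carries two Message-Id headers, generated one
first.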
Internally, the Message-Id we _favor_ for NNTP is also the one which
gets used for rendering threads.
Excessively long Message-Ids are just truncated to 244 for now (Xapian
limit for terms). I hope it's not an abuse vector going forward (only
one spam message used it), but this is another problem our
inject-new-Message-Id-on-duplicate scheme "solves".
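
A rough Python sketch of the truncation, for illustration only. The
244 figure comes from the paragraph above; counting it in bytes of
UTF-8 rather than characters, stripping the angle brackets first, and
the `truncate_mid` name are assumptions of this example, not
something taken from the patches.

```python
MAX_MID_LEN = 244  # figure from the text above; treated as bytes here


def truncate_mid(mid, limit=MAX_MID_LEN):
    """Strip surrounding angle brackets and cut the id to `limit` bytes."""
    mid = mid.strip()
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    raw = mid.encode("utf-8")[:limit]
    # A cut in the middle of a multi-byte character leaves a partial
    # sequence at the end; errors="ignore" drops it.
    return raw.decode("utf-8", errors="ignore")
```

Two long ids that share their first 244 bytes collapse to the same
value after this step, which is the collision case the duplicate
handling then has to cover.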
Internal timestamps used for sorting now favor the first (last-added)
Received: header since it is more likely to be correct than the Date:
header.
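
The preference order can be sketched in Python as follows. This is
only an illustration under assumptions: the `msg_timestamp` name, the
rule of taking the text after the last ";" of the Received: header as
its date, and the handling of unparseable values are choices made for
the example and are not taken from MsgTime.pm.

```python
import email
from datetime import timezone
from email.utils import parsedate_to_datetime


def _parse(value):
    """Return a Unix timestamp for an RFC 2822 date string, or None."""
    try:
        dt = parsedate_to_datetime(value.strip())
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:  # "-0000" parses as naive; treat it as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def msg_timestamp(msg):
    """Prefer the first Received: header, then fall back to Date:."""
    received = msg.get_all("Received", [])
    if received:
        # Each hop prepends its header, so index 0 is the last one added.
        _, sep, tail = received[0].rpartition(";")
        if sep:
            ts = _parse(tail)
            if ts is not None:
                return ts
    date = msg.get("Date")
    return _parse(date) if date else None
```

A message whose Date: header claims 1970 but whose topmost Received:
header carries a sane date sorts by the latter; a message without any
usable Received: header still sorts by its Date: header.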
A wrong Date: header will still show up in the per-message ("permalink")
view, so it can still be used to embarrass people with bad clocks :P
(Of course, downloadable mboxes will continue to show them).
For thread skeleton (index) views in HTML, we use the internal
timestamp for now; but maybe we'll use the Date: like the permalink
view. Maybe internally there can be two timestamps like git's
author-vs-committer dates.
Xapian index size is reduced, as the "nq:" search field is no longer
redundantly storing information that would be in searchable diff
fields (df* in https://public-inbox.org/git/_/text/help/).
This (along with remembering to run fstrim(8)) seems to have
reduced best-case indexing time to around 3.5 hours for the
2000-2017 dataset I'm using \o/
Eric Wong (Contractor, The Linux Foundation) (34):
v2writable: delete ::Import obj when ->done
search: remove informational "warning" message
searchidx: add PID to error message when die-ing
content_id: special treatment for Message-Id headers
evcleanup: disable outside of daemon
v2writable: deduplicate detection on add
evcleanup: do not create event loop if nothing was registered
mid: add `mids' and `references' methods for extraction
content_id: use `mids' and `references' for MID extraction
searchidx: use new `references' method for parsing References
content_id: no need to be human-friendly
v2writable: inject new Message-IDs on true duplicates
search: revert to using 'Q' as a uniQue id per-Xapian conventions
searchidx: support indexing multiple MIDs
mid: be strict with References, but loose on Message-Id
searchidx: avoid excessive XNQ indexing with diffs
searchidxskeleton: add a note about locking
v2writable: generated Message-ID goes first
searchidx: use add_boolean_term for internal terms
searchidx: add NNTP article number as a searchable term
mid: truncate excessively long MIDs early
nntp: use NNTP article numbers for lookups
nntp: fix NEWNEWS command
searchidx: store the primary MID in doc data for NNTP
import: consolidate object info for v2 imports
v2: avoid redundant/repeated configs for git partition repos
INSTALL: document more optional dependencies
search: favor skeleton DB for lookup_mail
search: each_smsg_by_mid uses skeleton if available
v2writable: remove unnecessary skeleton commit
favor Received: date over Date: header globally
import: fall back to Sender for extracting name and email
scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
v2writable: detect and use previous partition count
INSTALL | 13 ++
MANIFEST | 2 +
lib/PublicInbox/ContentId.pm | 32 +++--
lib/PublicInbox/Daemon.pm | 1 +
lib/PublicInbox/EvCleanup.pm | 6 +-
lib/PublicInbox/ExtMsg.pm | 2 +-
lib/PublicInbox/Import.pm | 99 ++++++-------
lib/PublicInbox/Inbox.pm | 1 +
lib/PublicInbox/MID.pm | 55 +++++++-
lib/PublicInbox/MsgTime.pm | 51 +++++++
lib/PublicInbox/NNTP.pm | 31 ++---
lib/PublicInbox/Search.pm | 70 ++++++++--
lib/PublicInbox/SearchIdx.pm | 260 +++++++++++++++++++++--------------
lib/PublicInbox/SearchIdxPart.pm | 8 +-
lib/PublicInbox/SearchIdxSkeleton.pm | 27 +---
lib/PublicInbox/SearchMsg.pm | 26 ++--
lib/PublicInbox/V2Writable.pm | 166 +++++++++++++++++++---
lib/PublicInbox/View.pm | 8 +-
lib/PublicInbox/WwwAtomStream.pm | 5 +-
scripts/import_vger_from_mbox | 11 +-
t/content_id.t | 5 +-
t/import.t | 9 +-
t/init.t | 2 +
t/mid.t | 22 ++-
t/nntpd.t | 2 +
t/search-thr-index.t | 2 +-
t/v2writable.t | 195 ++++++++++++++++++++++++++
27 files changed, 842 insertions(+), 269 deletions(-)
create mode 100644 lib/PublicInbox/MsgTime.pm
create mode 100644 t/v2writable.t
--
EW