Date | Commit message |
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
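
A minimal sketch of the batching idea on the SQLite side, written in Python for illustration only (it is not the project's code). The `msgmap` table layout, `insert_batched`, and the batch size are assumptions made for this example; the point is that one commit covers many rows instead of one commit per message.

```python
import sqlite3

def insert_batched(conn, rows, batch_size=1000):
    """Insert rows, committing once per batch instead of once per row."""
    commits = 0
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        # "with conn" wraps the whole batch in a single transaction
        # and commits when the block exits without an exception.
        with conn:
            cur.executemany(
                "INSERT INTO msgmap (num, mid) VALUES (?, ?)",
                rows[start:start + batch_size],
            )
        commits += 1
    return commits

# Hypothetical schema, loosely modelled on an article-number map.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT UNIQUE)")
rows = [(n, f"<{n}@example.com>") for n in range(1, 2501)]
commits = insert_batched(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM msgmap").fetchone()[0]
```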
|
|
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users who have incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
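
A rough Python illustration of preferring the Received: timestamp over Date: (not the project's implementation). It assumes "first" means the first Received: header in order of appearance, and that the timestamp follows the last ';' in that header; `received_date` is a hypothetical helper name.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def _parse(stamp):
    """Return a datetime, or None if the stamp cannot be parsed."""
    try:
        return parsedate_to_datetime(stamp.strip())
    except (TypeError, ValueError):
        return None

def received_date(raw):
    """Prefer the first Received: timestamp, fall back to Date:."""
    msg = message_from_string(raw)
    received = msg.get_all("Received") or []
    if received:
        # The hop timestamp is the text after the last ';'.
        _, sep, stamp = received[0].rpartition(";")
        if sep:
            parsed = _parse(stamp)
            if parsed is not None:
                return parsed
    date = msg.get("Date")
    return _parse(date) if date else None

RAW = (
    "Received: from a.example by mx.example; Tue, 02 Jan 2018 03:04:05 +0000\n"
    "Received: from origin.example by a.example; Tue, 02 Jan 2018 03:04:00 +0000\n"
    "Date: Thu, 01 Jan 1970 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
day = received_date(RAW).strftime("%Y-%m-%d")
```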
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns.
|
|
Leaking these pipes to child processes wasn't harmful, but
made determining relationships and dataflow between processes
more confusing.
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
  V2Writable --> Import -> git-fast-import
             \-> SearchIdxThread -> Msgmap (synchronous)
             \-> SearchIdxPart[n] -> SearchIdx[*]
             \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

  [*] each subprocess writes to threader
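
The split between the serialized step and the parallel step can be sketched as follows, in Python and purely for illustration. Plain lists stand in for the per-partition pipes, and routing by article number modulo the partition count is an assumption of this sketch; the message above does not say how a partition is chosen. `Router` is a hypothetical name.

```python
import itertools

class Router:
    """Allocate article numbers serially, then fan messages out."""

    def __init__(self, nparts):
        self.nparts = nparts
        # Serialized: numbers must increase monotonically.
        self._next_num = itertools.count(1)
        # Stand-ins for the pipes to per-partition subprocesses.
        self.partitions = [[] for _ in range(nparts)]

    def add(self, message):
        num = next(self._next_num)
        # Assumption: pick the partition from the article number.
        part = num % self.nparts
        self.partitions[part].append((num, message))
        return num, part

router = Router(4)
results = [router.add(f"msg{i}") for i in range(8)]
```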
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git+Xapian+SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|
|
Wrap the old Import package to enable creating new repos based
on size thresholds. This is better than relying on time-based
rotation as LKML traffic seems to be increasing.
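
A small sketch of size-based rotation, in Python and under stated assumptions: `pick_repo`, the byte counts, and the rule "start a new repo when the next message would exceed the limit" are all hypothetical and only illustrate the idea of rotating on size rather than time.

```python
def pick_repo(index, size, incoming, max_size):
    """Return (repo index, new size) for a message of `incoming` bytes."""
    # Rotate only if the current repo already holds something,
    # so an oversized single message still lands somewhere.
    if size > 0 and size + incoming > max_size:
        return index + 1, incoming
    return index, size + incoming

state = (0, 0)
history = []
for msg_size in (400, 400, 400, 1500, 100):
    state = pick_repo(state[0], state[1], msg_size, max_size=1000)
    history.append(state)
```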
|
|
Despite email not existing until 1971, "Jan 1, 1970 00:00:00"
seems like a common default timestamp for some test emails
to use as a Date: header.
|
|
There's a lot of crap in archives and git-fast-import
accepts empty names and email addresses for authors
just fine.
|
|
For LKML, it appears we need an even more liberal parser than
the RFC 2822 date parser in git. I have not validated that Date::Parse
parses dates correctly, but this at least prevents
git-fast-import(1) from choking.
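
As an illustration of "parse liberally, never hand fast-import something it cannot read", here is a Python sketch. It is an analogue, not the Date::Parse-based code; the fallback formats, the default of epoch 0, and normalizing every result to "+0000" are assumptions of this example. The output uses fast-import's raw date format (seconds since the epoch, then a UTC offset).

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz

# Hypothetical extra formats to try when RFC-style parsing fails.
FALLBACK_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d %b %Y %H:%M")

def raw_git_date(value, default_epoch=0):
    """Return '<epoch> +0000' for a Date: value, never raising."""
    value = (value or "").strip()
    parsed = parsedate_tz(value)
    if parsed is not None:
        return f"{mktime_tz(parsed)} +0000"
    for fmt in FALLBACK_FORMATS:
        try:
            dt = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Assumption: zone-less dates are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
        return f"{int(dt.timestamp())} +0000"
    return f"{default_epoch} +0000"
```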
|
|
Wrap "get-mark" and "checkpoint" commands for git-fast-import
while documenting/cementing parts of the API.
|
|
Call order will need to change a bit since this is going to be
tied to Xapian
|
|
We'll reuse this class in v2, but won't be utilizing
per-git-repository ssoma.lock files.
Meanwhile, stop treating ::Inbox objects as an afterthought
and allow importing name and email into them.
|
|
For machines which have never seen ssoma, they don't need the
index so stop creating it.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
Sometimes an email is an innocent removal "rm" for a
misdirected, off-topic post, while most removed messages are
"spam". Allow anybody to look at history and easily distinguish
the reason for removing the message.
|
|
This seems to allow weirdly-encoded "raw" emails in
blade.nagaokaut.ac.jp/ruby/ruby-core/*
to be handled without difficulties.
|
|
This was necessary for the presence of the 0xa0 byte(*)
in the Subject: of the message at:
http://blade.nagaokaut.ac.jp/ruby/ruby-core/3220
(*) That is 0xa0, not 0x0a ("\n"), so I wonder if the
nibbles got swapped somehow.
|
|
This should fix problems with multipart messages where
text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git
refs/pull/28/head
In the future, we may still introduce a streaming
interface to reduce memory usage on large emails.
|
|
We should not completely kill a process if "git gc --auto"
errors out due to a warning or whatnot.
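
One way to express "warn, but keep going" is sketched below in Python (an illustration, not the project's code). `run_nonfatal` is a hypothetical helper; a caller might pass something like `["git", "gc", "--auto"]`, and a failure is reported on stderr instead of being raised.

```python
import subprocess
import sys

def run_nonfatal(cmd):
    """Run cmd; report failure on stderr and return False instead of dying."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE
        )
    except OSError as exc:
        # The command could not even be started.
        print(f"warning: {cmd[0]}: {exc}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(
            f"warning: {cmd[0]} exited with {result.returncode}",
            file=sys.stderr,
        )
        return False
    return True
```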
|
|
We need to prevent excessive repository growth for
public-inbox-watch and public-inbox-mda users.
|
|
We will be reusing this in the next commit, too.
|
|
This reduces duplication, slightly. We may be using it
yet again in a to-be-introduced function (or we may not
introduce it).
|
|
Not sure why or how I missed this before; but the common address
parsing routine we have should be more correct.
Add a test to ensure excessively quoted names don't make it
through, either.
|
|
We need to pass the Inbox object to SearchIdx to get altid
mappings properly for incremental imports.
TODO: use the Inbox object in more places where it makes sense
to do so.
|
|
This will allow us to release and re-acquire Xapian locks
due to the lack of FD_CLOEXEC on some FDs.
|
|
For reindexing, fresh Xapian DBs do not count as a reindex,
allowing users to blindly use --reindex on the first
run on a clean repo.
While we're at it, allow indexing to override HEAD ref for
multi-head git repos.
|
|
Callers may have localized $/ to something else, so make sure
we chomp the expected character(s) when calling chomp.
|
|
Mailing lists I watch and mirror may not have the best spam
filtering, and an extra layer should not hurt.
|
|
Because our WatchMaildir module is liberal about what
it accepts, we can potentially have messages without a
subject.
|
|
This prevents multiple update processes from stepping over
each other while called under the lock, and also allows the
new -watch process to update the index iff indexing was
desired.
|
|
git has stricter requirements for ident names (no '<>')
which Email::Address allows.
Even in 1.908, Email::Address also has an incomplete fix for
CVE-2015-7686 with a DoS-able regexp for comments. Since we
don't care for or need all the RFC compliance of Email::Address,
avoiding it entirely may be preferable.
Email::Address will still be installed as a requirement for
Email::MIME, but it is only used by the
Email::MIME::header_str_set method, which we do not use.
|
|
We don't need to update-server-info (or read-tree) if fast
import was spawned for removals and no changes were made.
|
|
git doesn't handle '<' and '>' characters in the author
name at all regardless of quoting, not just matched pairs.
So fall back to using the email as the author name since
the commit info isn't critical, anyways (shallow clones
are fine).
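
The fallback can be sketched like this, in Python for illustration only. `git_ident` is a hypothetical helper, and the email address is assumed to be already free of angle brackets.

```python
def git_ident(name, email):
    """Build a 'Name <email>' ident that git's parser can handle."""
    name = name.strip()
    # git cannot cope with '<' or '>' in the name, quoted or not,
    # so use the email address as the name in that case.
    if "<" in name or ">" in name:
        name = email
    return f"{name} <{email}>"
```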
|
|
Mbox formatters may add extra newlines at the end of the
message, and that's not relevant for comparing messages
for deletion.
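
A minimal Python sketch of such a comparison (the function name is hypothetical, and stripping trailing "\r" as well as "\n" is an assumption of this example):

```python
def same_message(a: bytes, b: bytes) -> bool:
    """Compare two raw messages, ignoring trailing line endings."""
    # rstrip removes any run of trailing CR/LF bytes.
    return a.rstrip(b"\r\n") == b.rstrip(b"\r\n")
```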
|
|
We should update $GIT_DIR/info/refs for dumb HTTP clients
whenever we make changes to the repository. The best place
to update is immediately after making commits.
This fixes a bug where public-inbox-learn did not properly
update $GIT_DIR/info/refs after inserting or removing
messages.
|
|
This is probably trivial enough to be final?
|
|
By converting to using our git-fast-import-based Import
module. This should allow us to be more easily installed.
|
|
The read could fail entirely and leave $lf undefined.
|
|
It confuses the git ident parser and may not be a great
idea to fix in git since it could break interoperability
with older versions.
|
|
git is byte-oriented and fast-import will not tolerate
miscalculations. This is necessary for wide characters
in commit messages (email Subjects).
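
The byte-versus-character distinction matters because fast-import's `data` command takes an exact byte count. A short Python illustration (`data_command` is a hypothetical helper, not the project's code):

```python
def data_command(text):
    """Build a fast-import 'data' command with an exact byte count."""
    payload = text.encode("utf-8")
    # Count encoded bytes, not characters: "é" is 1 character, 2 bytes.
    return b"data %d\n" % len(payload) + payload + b"\n"

subject = "h\u00e9llo"
command = data_command(subject)
```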
|
|
Author names may have wide characters in them, so avoid warnings
as git favors UTF-8 for names and fast-import even requires it
for commit messages.
|
|
This will allow us to write fast importers for existing
archives as well as eventually removing the ssoma dependency
for performance and ease-of-installation.
|