Date | Commit message |
|
|
|
Now that the code matches Xapian terminology, ensure
our comments match, too.
|
|
Be consistent with our own terminology and use "epoch" for
[0-9]+\.git repos. The term "partition" is going away entirely.
|
|
|
|
We'll be using the term "shard" from now on to be consistent
with Xapian terminology.
|
|
Another step towards keeping our file and package names
consistent with Xapian terminology.
|
|
Our internal data structure should be consistent with Xapian
terminology.
|
|
Another step towards becoming consistent with Xapian terminology
|
|
Using compact to change shard count was abandoned during
the v2 development phase.
|
|
Oops :x
|
|
* origin/reshard:
xcpdb: support resharding v2 repos
xcpdb: use destination shard as progress prefix
xapcmd: preserve indexlevel based on the destination
v2writable: use a smaller default for Xapian partitions
|
|
Apparently 16 CPUs (probably HT) and SATA storage are common
these days. Having excessive Xapian partitions leads to
contention and excessive FD/space use. So set a smaller
default but continue allowing user-specified values to bump
this up.
|
|
Xapian on Linux <3.15 has trouble with coprocesses since it used
fork() for locking and would hold onto pipes used for git
unnecessarily.
|
|
Much of the existing purge code is repurposed to a general
"replace" functionality.
->purge is simpler because it can just drop the information.
Unlike ->purge, ->replace needs to edit existing git commits (in
case of From: and Subject: headers) and reindex the modified
message.
We currently disallow editing of References:, In-Reply-To: and
Message-ID: headers because it can cause bad side effects with
our threading (and our lack of rethreading support to deal with
excessive matching from incorrect/invalid References).
|
|
Continuing the work by Eric Biederman in commit a118d58a402bd31b
("Import.pm: When purging replace a purged file with a zero length file"),
we can use a generic OID replacement mechanism to implement
purge.
|
|
It's one ugly sub with lots of parameters, but it's better
than calling a bunch of ugly subs with lots of parameters;
as we'll be needing to call it again when reindexing for
message replacements.
|
|
In case some BOFH decides to randomly create directories
using non-ASCII digits all over the place.
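The concern is that Unicode-aware regex engines let `\d` match decimal digits outside ASCII, so an explicit `[0-9]` class is the safer way to recognize epoch directories. A minimal sketch in Python rather than the project's Perl; `is_epoch_dir`, `EPOCH_RE` and `LOOSE_RE` are hypothetical names used only for illustration:

```python
import re

# Explicit ASCII class: only the characters 0-9 count as an epoch number.
EPOCH_RE = re.compile(r"[0-9]+\.git")
# For comparison: on str patterns, \d matches any Unicode decimal digit.
LOOSE_RE = re.compile(r"\d+\.git")


def is_epoch_dir(name: str) -> bool:
    """Return True if name looks like an epoch directory such as '0.git'."""
    return EPOCH_RE.fullmatch(name) is not None


if __name__ == "__main__":
    arabic_indic = "\u0661\u0662.git"  # ARABIC-INDIC DIGIT ONE and TWO
    print(is_epoch_dir("12.git"))
    print(is_epoch_dir(arabic_indic))
    print(LOOSE_RE.fullmatch(arabic_indic) is not None)
```

The loose pattern accepts the Arabic-Indic name while the strict one rejects it, which is the case the commit guards against.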
|
|
We don't need to use git to check ancestry if object IDs
match on a string comparison.
This saves 100ms or so and brings the ~0.5s no-op time on
lore.kernel.org/lkml down to ~0.4s.
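The idea can be sketched as a string comparison that short-circuits before any git process is spawned. This is an illustration in Python, not the project's Perl code; `is_ancestor`, `git_is_ancestor` and the `slow_check` parameter are hypothetical names, and `git merge-base --is-ancestor` is assumed as the slow path:

```python
import subprocess


def git_is_ancestor(git_dir, old, new):
    """Ask git whether `old` is an ancestor of `new` (spawns a process)."""
    cmd = ["git", "--git-dir=" + git_dir,
           "merge-base", "--is-ancestor", old, new]
    # Exit status 0 means `old` is an ancestor of `new`.
    return subprocess.run(cmd).returncode == 0


def is_ancestor(git_dir, old, new, slow_check=git_is_ancestor):
    """Return True if `old` equals `new` or is an ancestor of it."""
    # Fast path: identical object IDs need no git invocation at all.
    if old == new:
        return True
    return slow_check(git_dir, old, new)
```

Passing `slow_check` as a parameter is only there so the fast path can be exercised without a repository.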
|
|
Creating mm_tmp is an expensive operation with large inboxes
and can be avoided if there are no new messages to process.
Since git-fetch(1) currently lacks an --exit-code option(*),
mirrors will run `public-inbox-index' unconditionally after
fetch, which is an expensive op if it needs to duplicate
a large SQLite DB.
This speeds up the mirror case of:
git --git-dir=git/$EPOCH.git fetch && public-inbox-index
This reduces the no-op `public-inbox-index' time from over 8s to
~0.5s on a (currently) 7-epoch clone of https://lore.kernel.org/lkml/
on my system.
(*) WIP --exit-code for git-fetch:
https://public-inbox.org/git/87ftphw7mv.fsf@evledraar.gmail.com/
|
|
This will make future changes easier-to-follow.
|
|
It'll make it easier to detect if we have anything to
unindex and run git-log on, at all.
|
|
We can show progress whenever we commit changes to the FS.
|
|
And use singular `opt' to be consistent with the common name
of 'getopt'.
|
|
Hopefully this improves maintainability by allowing Perl
to do some arg checking for us.
|
|
We don't need to stuff that into $self (V2Writable) which can be
longer-lived than a ->index_sync invocation.
|
|
Yet another temporary variable with no use outside of index_sync.
|
|
regen is always enabled for index_sync nowadays (and has
been for a while).
Rename `index_prepare' to `sync_prepare' to show it's for
->index_sync; and not the online indexing we do for ->add.
|
|
reindexing info is not used outside of the index_sync code path.
|
|
Another small step to reduce parameters passed to reindex_oid.
|
|
A first step towards making the v2 index_sync code
easier-to-follow. More fields to follow...
|
|
`public-inbox-index --reindex' could cause NNTP article number
gaps to form when it also has to deal with new,
never-before-seen commits in mirrors running off `git fetch'.
Fix this by running two distinct invocations of ->index_sync;
once to only reindex old commits, and a second time to index
new commits.
This does not appear to be a problem on v1 at the moment,
but I'll need more time to analyze this.
|
|
Fix a misspelling and ensure line context is printed by
`die' by leaving out the final '\n'. Also, `delete' was
pointless.
|
|
Apparently it's never been used and we write to msgmap directly.
|
|
Emit information about reindexing git revision ranges when used
with xcpdb. Additionally, distinguish Xapian copy output from
v2 git epoch counting by increasing directory context info.
For now, v1 batches are emitted. v2 indexing is still
missing progress reporting for batches, as the data structures
for reindexing would benefit from a refactoring, first.
This does not currently affect the use of public-inbox-index,
but may in the future.
|
|
Copying an entire Xapian DB takes a long time, so update our
reindexing code to support partial reindexing, snapshot the
pre-copydatabase git revisions, perform the lengthy copy,
and do a partial reindex when the copy + renames are done.
|
|
This is preparation to support partial reindexing.
|
|
In retrospect, introducing V1Writable was unnecessary and
InboxWritable->importer is in a better position to abstract
away differences between v1 and v2 writers.
So teach InboxWritable to initialize inboxes and get rid
of V1Writable.
|
|
Avoiding reliance on environment variables is a bit cleaner
for writing tests
|
|
Import initialization is a little strange for historical reasons, but we
also can't change it too much because it's technically a public
API which external code may rely on...
And we may need to support v1 repos indefinitely. This should
make it easier to write tests for both formats.
|
|
This can help users track down the source of warnings
when presented with imperfect emails.
While we're at it, make the __WARN__ callback in t/v2writable.t
a no-op since we don't check for warnings, there.
|
|
Newly-cloned epochs need to be in alternates file of
all.git for the web and NNTP interfaces to work. So
allow invocations of "public-inbox-index" to idempotently
ensure the epoch is visible from the all.git repo.
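"Idempotently" here means repeated runs leave the alternates file unchanged once the epoch is listed. A minimal Python sketch of that behavior, not the project's implementation; `ensure_alternate` is a hypothetical helper and the `../../git/0.git/objects` entry format is an assumption about the layout:

```python
import os
import tempfile


def ensure_alternate(all_git, entry):
    """Add `entry` to objects/info/alternates unless it is already listed.

    Returns True if the file was modified, False if nothing changed.
    """
    info = os.path.join(all_git, "objects", "info")
    os.makedirs(info, exist_ok=True)
    path = os.path.join(info, "alternates")
    try:
        with open(path) as fh:
            content = fh.read()
    except FileNotFoundError:
        content = ""
    if entry in content.splitlines():
        return False  # already visible, nothing to do
    with open(path, "a") as fh:
        # Guard against a last line that lacks a trailing newline.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(entry + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        all_git = os.path.join(tmp, "all.git")
        print(ensure_alternate(all_git, "../../git/0.git/objects"))
        print(ensure_alternate(all_git, "../../git/0.git/objects"))
```

The second call returns False and leaves the file untouched, so running it on every index invocation is harmless.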
|
|
We'll be using this sub to fill $GIT_DIR/objects/info/alternates
if somebody uses clone --mirror, too
|
|
All of our internal epoch rollover calculations are done using
the estimated unpacked (and uncompressed) size of the repo. The
importer instance needs to check that unpacked size before
selecting an epoch when an epoch already has packed data.
This bug did not impact the initial mass imports since we only
initialize the Import instance once-per-epoch and did not need
to take existing epochs into account.
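The rollover decision can be sketched as follows, in Python rather than the project's Perl. Every name and number is hypothetical: `estimate_unpacked`, `select_epoch`, the packing factor and the rotation threshold are assumptions for illustration, not values taken from the code:

```python
def estimate_unpacked(packed_bytes, packing_factor):
    """Estimate the unpacked size of an epoch from its on-disk pack size.

    packing_factor is an assumed packed/unpacked ratio (0 < factor <= 1).
    """
    return int(packed_bytes / packing_factor)


def select_epoch(latest_epoch, unpacked_bytes, rotate_bytes):
    """Pick the epoch to write to, rolling over once the estimate is too big."""
    if unpacked_bytes >= rotate_bytes:
        return latest_epoch + 1
    return latest_epoch


if __name__ == "__main__":
    # An existing epoch 3 holding 600 bytes of packed data, assumed ratio 0.5
    size = estimate_unpacked(600, 0.5)
    print(select_epoch(3, size, 1000))
```

The point is that the check runs against the estimated unpacked size of the already-packed epoch before the first write, not only against bytes written by the current importer instance.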
Tested manually with -mda on a local clone of LKML
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
|
|
A stand-alone tool for purge won't know the epoch
if nothing was ->add()-ed before.
|
|
Otherwise, Perl may exit successfully when a failure code
is desired.
|
|
We don't require every git epoch to exist since we support
the --skip feature in public-inbox-init.
|
|
And doesn't try to access undef as an array ref.
|
|
Hopefully this helps people familiarize themselves with
the source code.
|
|
I've hit /proc/sys/fs/pipe-user-pages-* limits on some systems.
So stop hogging resources on pipes which don't benefit from
giant sizes.
Some of these can use eventfd in the future to further reduce
resource use.
|
|
The new t/*filter_rubylang.t tests call -index immediately
after -init, which causes confusing messages to show up to
the end user.
Check the validity of the ref before calling "git-log".
|