Date | Commit message (Collapse) |
|
We don't want to waste cycles passing non-ASCII characters
to git.
|
|
Its result is used for HTML anchors and such.
|
|
RFC3977 does not have provisions for whitespace beyond ASCII
TAB, SP, CR and LF. I doubt there's any NNTP clients broken
enough to be sending non-ASCII whitespace delimiters.
We're probably excessively liberal regarding TAB acceptance,
even; but it's probably too late to change at this point...
|
|
We aren't able to make sense of non-ASCII digits
cf. perlrecharclass(1) / "Digits" section
|
|
The "\w" character class in Perl matches any word characters
in the Unicode database, not just ASCII characters. So we
must be prepared for that and generate links to IDNs.
|
|
We don't need and won't be needing per-socket PostLoopCallbacks.
|
|
We never enable write watches ourselves for HTTP and NNTP,
and only enable the write watch with EvCleanup because it's
an "always on" watch.
|
|
This was never used in Danga::Socket 1.61, either.
|
|
I've used Danga::Socket for well over a decade in various
projects at this point and have never seen the need for it.
If such a bug ever happens; the process should fall over so
it gets fixed ASAP.
|
|
This is not used by perlbal for OpenSSL support, either;
and it does not appear to be the right layer for doing
write translations anyways (IO::Socket::SSL uses `tie').
|
|
Sometimes I get bored with the email part of this project and
need a distraction :P
|
|
ToClose and HaveEpoll are of no use to us and I see no
future use for them, either.
|
|
Even though we currently don't use it repeatedly, ->Reset
should close() kqueue FDs and not cause the process to run
out of descriptors.
Add a close-on-exec test while we're at it.
|
|
We should not be leaking these FDs to git(1) processes,
in case git has a bug that causes it to access the wrong FD.
|
|
No reason to leave that (usually) empty file open after killing off
"cat-file --batch-check". This wasn't an unbound leak, though,
as respawning the --batch-check process would've clobbered the
old err_c file.
|
|
A constant stream of traffic to either httpd/nntpd would mean
git-cat-file processes never expire. Things can go bad after a
full repack, as a full repack will unlink old pack indices and
git-cat-file does not currently detect unlinked files.
We could do something complicated by recursively stat-ing
objects/pack of every git directory and alternate;
but that's probably not worth the trouble compared to
occasionally restarting the cat-file process.
So simplify the code and let httpd/nntpd expire them
periodically, since spawning a "git-cat-file --batch" process
isn't too expensive. We already spawn for every request which
hits git-http-backend, cgit, and git-apply.
In the future, we may optionally support the Git::Raw module
to avoid IPC; but we must remain careful to not leave lingering
FDs open to unlinked files after repack.
|
|
This is worth a 1-2% speedup in t/perf-msgview.t rendering 2620
messages currently in https://public-inbox.org/meta/
|
|
We don't need to use git to check ancestry if object IDs
match on a string comparison.
This saves 100ms or so and brings down the ~0.5s no-op time on
lore.kernel.org/lkml down to ~0.4s.
|
|
Creating mm_tmp is an expensive operation with large inboxes
and can be avoided if there are no new messages to process.
Since git-fetch(1) currently lacks an --exit-code option(*),
mirrors will run `public-inbox-index' unconditionally after
fetch, which is an expensive op if it needs to duplicate
a large SQLite DB.
This speeds up the mirror case of:
git --git-dir=git/$EPOCH.git fetch && public-inbox-index
This reduces the no-op `public-inbox-index' time from over 8s to
~0.5s on a (currently) 7-epoch clone of https://lore.kernel.org/lkml/
on my system.
(*) WIP --exit-code for git-fetch:
https://public-inbox.org/git/87ftphw7mv.fsf@evledraar.gmail.com/
|
|
This will make future changes easier-to-follow.
|
|
It'll make it easier to detect if we have anything to
unindex and run git-log on, at all.
|
|
And use it from Admin.
It's easy to tell what indexlevel=basic is from unconfigured
inboxes, but distinguishing between 'medium' and 'full' would
require stat()-ing position.* files which is fragile and
Xapian-implementation-dependent.
So use the metadata facility of Xapian and store it in the main
partition so Admin tools can deal better with unconfigured
inboxes copied using generic tools like cp(1) or rsync(1).
|
|
We can show progress whenever we commit changes to the FS.
|
|
It doesn't implement progress of batches, yet, but it wires
up the parsing of the command-line while preserving output
compatibility.
This output is NOT meant to be stable.
|
|
And use singular `opt' to be consistent with the common name
of 'getopt'.
|
|
Hopefully this improves maintainability by allowing Perl
to do some arg checking for us.
|
|
We don't need to stuff that into $self (V2Writable) which can be
longer-lived than a ->index_sync invocation.
|
|
Yet another temporary variable with no use outside of index_sync.
|
|
regen is always enabled for index_sync nowadays (and has
been for a while).
Rename `index_prepare' to `sync_prepare' to show it's for
->index_sync; and not the online indexing we do for ->add.
|
|
reindexing info is not used outside of the index_sync code path.
|
|
Another small step to reduce parameters passed to reindex_oid.
|
|
A first step towards making the v2 index_sync code
easier-to-follow. More fields to follow...
|
|
`public-inbox-index --reindex' could cause NNTP article number
gaps to form when it also has to deal with new,
never-before-seen commits in mirrors running off `git fetch'.
Fix this by running two distinct invocations of ->index_sync;
once to only reindex old commits, and a second time to index
new commits.
This does not appear to be a problem on v1 at the moment,
but I'll need more time to analyze this.
|
|
We can't pass an empty string to `git merge-base --is-ancestor'
AFAIK, this did NOT present issues in the current test suite.
|
|
Streaming large blobs can take multiple iterations of the event
loop in our -httpd; so we must not let the File::Temp::Dir
result go out-of-scope when streaming large blobs created from
patches.
|
|
Fix a misspelling and ensure line context is printed by
`die' by leaving out the final '\n'. Also, `delete' was
pointless.
|
|
No reason to copyright colour schemes :P
|
|
Apparently it's never been used and we write to msgmap directly.
|
|
I have never not found double negatives to be confusing...
|
|
Some users (or bots :P) can trigger horrible queries which
the caller can choose to either log or ignore. This prevents
horrible queries from ExtMsg from logging confusing "ref: "
messages when $@ is not a Perl reference.
|
|
-index documentation avoid redundant v1 information and refers
readers to apropriate v1/v2 manpages. Search::Xapian can also
be optional, now, as only the PSGI search interface uses it.
Favor "INBOX_DIR" where appropriate, since "REPO_DIR" can be
confused for code repos which we also support.
XAPIAN_FLUSH_THRESHOLD is documented for all relevant
bulk commands.
|
|
To properly handle compact tmpdir cleanup in single process
situations, we need to carefully account for Xtmpdir not
being a singleton and ensuring we don't clobber signal
handlers which belong to other Xtmpdirs.
|
|
We don't have to be tied to the number of partitions in case
we made a bad choice at initialization. This doesn't affect
reindexing, but the copying phase is already intensive.
And optimize away the extra process when we only have a single
job which won't parallelize.
The wording for the (v2) reindexing phase could be improved,
later. I also plan to allow repartitioning of existing
Xapian DBs.
|
|
We should not have leftover junk on interrupted invocations.
|
|
Allow users to specify the --blocksize <B>, --no-full, --fuller
options for xapian-compact(1) for fine-tuning compact behavior
for low-traffic/inactive inboxes.
We also won't support --multipass, since it doesn't seem
compatible with our requirement to use --no-renumber.
We also won't support --single-file, since it only seems
intended for totally dead inboxes; and it doesn't seem
worth the support overhead when "totally dead" turns out
to be a misdiagnosis.
|
|
Since -xcpdb is a superset of -compact, we can reuse much of
that code used for driving compact.
For compact (only), this is slightly less memory efficient since
it requires an extra process per-partition, but we get to prefix
the output with the partition name for more readable output.
|
|
Cleanup temporary directories on common termination signals
(INT, HUP, PIPE, TERM), but only if it's not in the process
of being committed via rename() sequence.
|
|
Emit information about reindexing git revision ranges when used
with xcpdb. Additionally, distinguish Xapian copy output from
v2 git epoch counting by increasing directory context info.
For now, v1 batches batches are emitted. v2 indexing is still
missing progress reporting for batches, as the data structures
for reindexing would benefit from a refactoring, first.
This does not currently affect the use of public-inbox-index,
but may in the future.
|
|
`warn' is reserved for actual warnings, as it respects
$SIG{__WARN__} and we rely on that override to print
message context information when we are indexing.
|
|
By creating temporary directories as deep as possible,
we can allow v2 repositories to have `xap$SCHEMA_VERSION'
(e.g. `xap15') reside on a separate FS.
We also check st_dev ahead-of-time to avoid doing work which
will fail with EXDEV. Of course, another process may still
move/change things around.
|