* [PATCH 17/28] cindex: implement --max-size=SIZE
@ 2023-03-21 23:07 7% ` Eric Wong
From: Eric Wong @ 2023-03-21 23:07 UTC (permalink / raw)
To: meta
This matches existing behavior of -index and -extindex, and
will hopefully allow me to avoid OOM problems by skipping
problematic commits.
---
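For readers skimming the thread, the size check this patch adds can be
tried in isolation. The following is only a minimal sketch and not the
actual indexer code: `filter_by_max_size' is a hypothetical helper
invented for illustration, and each record is assumed to be a commit OID
on the first line followed by the rest of the commit data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): separate records
# which would be indexed from those skipped for being too large.
# Assumption: each record looks like "COMMIT_OID\n<rest of commit data>".
sub filter_by_max_size {
	my ($max_size, @records) = @_;
	my (@keep, @skipped);
	for my $buf (@records) {
		# same condition as the patch: a false (unset or 0)
		# $max_size disables the check entirely
		if ($max_size && length($buf) >= $max_size) {
			my ($H) = split(/\n/, $buf, 2); # first line only
			push @skipped, $H;
			next;
		}
		push @keep, $buf;
	}
	return (\@keep, \@skipped);
}

my @records = ("aaaa\nsmall diff", "bbbb\n" . ('x' x 100));
my ($keep, $skipped) = filter_by_max_size(50, @records);
print 'kept: ', scalar(@$keep), ", skipped: @$skipped\n";
```

As in the patch, the comparison is `>=', so a record whose length is
exactly the limit is skipped as well.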
lib/PublicInbox/CodeSearchIdx.pm | 6 ++++++
script/public-inbox-cindex | 4 +++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index fcd28671..b185731d 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -161,6 +161,7 @@ sub shard_index { # via wq_io_do
my $op_p = delete($self->{1}) // die 'BUG: no {1} op_p';
my $batch_bytes = $self->{-opt}->{batch_size} //
$PublicInbox::SearchIdx::BATCH_BYTES;
+ my $max_size = $self->{-opt}->{max_size};
# local-ized in parent before fork
$TXN_BYTES = $batch_bytes;
local $self->{git} = $git; # for patchid
@@ -177,6 +178,11 @@ sub shard_index { # via wq_io_do
$self->begin_txn_lazy;
while (defined($buf = <$rd>)) {
chomp($buf);
+ if ($max_size && length($buf) >= $max_size) {
+ my ($H, undef) = split(/\n/, $buf, 2);
+ warn "W: skipping $H (", length($buf)," >= $max_size)\n";
+ next;
+ }
$TXN_BYTES -= length($buf);
@$cmt{@FMT} = split(/\n/, $buf, scalar(@FMT));
$/ = "\n";
diff --git a/script/public-inbox-cindex b/script/public-inbox-cindex
index 420ef4de..e2500b93 100755
--- a/script/public-inbox-cindex
+++ b/script/public-inbox-cindex
@@ -16,6 +16,7 @@ usage: public-inbox-cindex [options] --project-list=FILE PROJECT_ROOT
--update | -u update previously-indexed code repos with `-d'
--jobs=NUM set or disable parallelization (NUM=0)
--batch-size=BYTES flush changes to OS after a given number of bytes
+ --max-size=BYTES do not index commit diffs larger than the given size
--prune prune old repos and commits
--reindex reindex previously indexed repos
--verbose | -v increase verbosity (may be repeated)
@@ -25,7 +26,8 @@ See public-inbox-cindex(1) man page for full documentation.
EOF
my $opt = { fsync => 1, scan => 1 }; # --no-scan is hidden
GetOptions($opt, qw(quiet|q verbose|v+ reindex jobs|j=i fsync|sync! dangerous
- indexlevel|index-level|L=s batch_size|batch-size=s
+ indexlevel|index-level|L=s
+ batch_size|batch-size=s max_size|max-size=s
project-list=s exclude=s@
d=s update|u scan! prune dry-run|n C=s@ help|h))
or die $help;
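The option spec above stores the value under the first name in the spec
(`max_size') as a string. A minimal sketch of that behavior with
Getopt::Long follows. `size_to_bytes' is a hypothetical helper for this
example only; whether and how cindex converts size suffixes is not shown
in this patch, so treat the suffix handling as an assumption.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper for this sketch only: turn "10m" into a byte count.
# Assumption: k/m/g suffixes are powers of 1024.
sub size_to_bytes {
	my ($val) = @_;
	$val =~ /\A([0-9]+)([kmg])?\z/i or die "bad size: $val\n";
	my %mult = (k => 1024, m => 1024 ** 2, g => 1024 ** 3);
	return $1 * ($2 ? $mult{lc $2} : 1);
}

my @argv = qw(--max-size=10m --batch-size=1048576);
my $opt = {};
GetOptionsFromArray(\@argv, $opt,
	qw(batch_size|batch-size=s max_size|max-size=s))
	or die "bad options\n";

# the first name in each spec becomes the hash key
print "max_size=$opt->{max_size} (",
	size_to_bytes($opt->{max_size}), " bytes)\n";
```

Because the spec uses `=s', the value arrives as a string and any numeric
validation has to happen after option parsing.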
* [PATCH 00/28] cindex coderepo commit indexer
@ 2023-03-21 23:07 6% Eric Wong
From: Eric Wong @ 2023-03-21 23:07 UTC (permalink / raw)
To: meta
Not wired up to WWW nor lei, yet; but indexing + pruning of
commits works.
I'm not sure if indexing (root) tree OIDs or committer
names+emails is worth it, since I don't think those are very
important terms to search for.
I first wanted to shoehorn this into extindex, but I think it
works better as a separate Xapian schema.
It allows both internal indexes ($GIT_DIR/public-inbox-cindex)
for unforked repos, as well as an extindex-style external index
to encompass several projects.
The indexer is structured a bit more nicely than existing
indexers since I'm relying more on OnDestroy and `local'.
I would like to trickle some of these improvements back to
the mail indexers at some point.
--prune and --reindex currently block incremental updates, which
isn't great since both take a while for giant Xapian DBs.
Pruning is pretty important since it's much more common for coderepos
(e.g. the `seen' branch of git.git).
`lei cq' will probably be a new command which behaves
similarly to `lei q -f text', but takes `git log' options
for output...
Eric Wong (28):
ipc: move nproc_shards from v2writable
search: relocate all_terms from lei_search
admin: hoist out resolve_git_dir
admin: ensure resolved GIT_DIR is absolute
test_common: create_inbox: use `$!' properly on mkdir failure
codesearch: initial cut w/ -cindex tool
cindex: parallelize prep phases
cindex: use read-only shards during prep phases
searchidxshard: improve comment wording
cindex: use DS and workqueues for parallelism
ds: @post_loop_do replaces SetPostLoopCallback
cindex: implement --exclude= like -clone
cindex: show shard number in progress message
cindex: drop `unchanged' progress message
cindex: handle graceful shutdown by default
sigfd: pass signal name rather than number to callback
cindex: implement --max-size=SIZE
cindex: check for checkpoint before giant messages
cindex: truncate or drop body for over-sized commits
cindex: attempt to give oldest commits lowest docids
cindex: improve granularity of quit checks
spawn: show failing directory for chdir failures
cindex: filter out non-existent git directories
cindex: add support for --prune
cindex: implement reindex
cindex: squelch incompatible options
cindex: respect existing permissions
cindex: ignore SIGPIPE
MANIFEST | 4 +
lib/PublicInbox/Admin.pm | 18 +-
lib/PublicInbox/CodeSearch.pm | 121 +++++
lib/PublicInbox/CodeSearchIdx.pm | 835 ++++++++++++++++++++++++++++++
lib/PublicInbox/Config.pm | 2 +-
lib/PublicInbox/DS.pm | 30 +-
lib/PublicInbox/Daemon.pm | 4 +-
lib/PublicInbox/ExtSearchIdx.pm | 2 +-
lib/PublicInbox/IPC.pm | 33 +-
lib/PublicInbox/LEI.pm | 4 +-
lib/PublicInbox/LeiSearch.pm | 14 -
lib/PublicInbox/MiscIdx.pm | 2 +-
lib/PublicInbox/Search.pm | 77 ++-
lib/PublicInbox/SearchIdx.pm | 88 ++--
lib/PublicInbox/SearchIdxShard.pm | 7 +-
lib/PublicInbox/Sigfd.pm | 10 +-
lib/PublicInbox/Spawn.pm | 6 +-
lib/PublicInbox/SpawnPP.pm | 2 +-
lib/PublicInbox/TestCommon.pm | 47 +-
lib/PublicInbox/V2Writable.pm | 26 +-
lib/PublicInbox/Watch.pm | 2 +-
script/public-inbox-cindex | 86 +++
script/public-inbox-convert | 2 +-
t/cindex.t | 134 +++++
t/dir_idle.t | 6 +-
t/ds-leak.t | 8 +-
t/imapd.t | 6 +-
t/nntpd.t | 2 +-
t/sigfd.t | 7 +-
t/watch_maildir.t | 8 +-
xt/mem-imapd-tls.t | 7 +-
xt/mem-nntpd-tls.t | 8 +-
xt/net_writer-imap.t | 4 +-
33 files changed, 1424 insertions(+), 188 deletions(-)
create mode 100644 lib/PublicInbox/CodeSearch.pm
create mode 100644 lib/PublicInbox/CodeSearchIdx.pm
create mode 100755 script/public-inbox-cindex
create mode 100644 t/cindex.t
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git