* [PATCH 00/14] IT'S ALIVE! www loads cindex join data
@ 2023-11-28 14:56 7% Eric Wong
2023-11-28 14:56 2% ` [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file Eric Wong
0 siblings, 1 reply; 2+ results
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
To: meta
Patch 8/14 is the key one: it actually makes the cindex data
useful for WWW and for powering solver. Keep in mind, I've had
to cap solver at 3 coderepos as a temporary measure since
there are a lot of "weak" joins we should be weeding out.

More documentation is coming, but cindex joins are very much
a fuzzy thing which will have to deal with false positives
and such. So figuring out a sane scoring scheme would
make sense...

Fortunately, --join=aggressive,reset only takes ~1 hour for me,
so probably 1/3 of that on modern hardware. Incremental
`-cindex --join' (no suboptions) usually takes <5 minutes if
done frequently.

New performance problem: solver could definitely be smarter
about dealing with common roots/groups. For the longest time,
I've only had 1 coderepo per inbox; having hundreds is wacky.

Actual searching against the cindex isn't done yet, but
that's kinda straightforward.
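
To make the temporary cap concrete, here is a minimal sketch of
keeping only the strongest joins. Everything in it is an assumption
for illustration: the series does not define a scoring scheme yet,
and `join_candidate', its fields, and `top_joins' are hypothetical
names, not code from these patches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: neither the field names nor the numeric score
// come from the series; how joins get scored is still undecided.
struct join_candidate {
	std::string git_dir;
	double score;
};

// Keep at most `limit' candidates, strongest join first.
// The default of 3 mirrors the temporary cap mentioned above.
static std::vector<join_candidate>
top_joins(std::vector<join_candidate> all, size_t limit = 3)
{
	// stable_sort keeps the input order for equal scores
	std::stable_sort(all.begin(), all.end(),
		[](const join_candidate &a, const join_candidate &b) {
			return a.score > b.score;
		});
	if (all.size() > limit)
		all.erase(all.begin() + limit, all.end());
	return all;
}
```

Weeding out weak joins would presumably mean a score threshold on
top of (or instead of) a fixed count, but that depends on what the
scores end up meaning.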
Eric Wong (14):
test_common: create_*: detect changes all parameters
t/cindex*: require SCM_RIGHTS for these tests
codesearch: eliminate redundant substitutions
solver: schedule cleanup after synchronous git->check
xap_helper.h: move cindex endpoints to separate file
xap_helper: implement mset endpoint for WWW, IMAP, etc...
hval: use File::Spec to make relative paths for href
www: load and use cindex join data
git: speed up ->git_path for non-worktrees
cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
git: speed up Git->new by 5% or so
admin: resolve_git_dir respects symlinks
cindex: extra quit checks
www: start working on a repo listing
Documentation/public-inbox-cindex.pod | 2 +-
MANIFEST | 3 +
Makefile.PL | 8 +-
lib/PublicInbox/Admin.pm | 25 +-
lib/PublicInbox/CodeSearch.pm | 162 ++++++++++-
lib/PublicInbox/CodeSearchIdx.pm | 52 ++--
lib/PublicInbox/Config.pm | 39 ++-
lib/PublicInbox/Git.pm | 27 +-
lib/PublicInbox/Hval.pm | 12 +-
lib/PublicInbox/RepoList.pm | 39 +++
lib/PublicInbox/Search.pm | 42 +++
lib/PublicInbox/SearchIdx.pm | 10 +-
lib/PublicInbox/SolverGit.pm | 9 +-
lib/PublicInbox/TestCommon.pm | 35 ++-
lib/PublicInbox/View.pm | 7 +-
lib/PublicInbox/WWW.pm | 1 +
lib/PublicInbox/WwwCoderepo.pm | 44 ++-
lib/PublicInbox/WwwStream.pm | 11 +-
lib/PublicInbox/WwwText.pm | 19 +-
lib/PublicInbox/XapHelper.pm | 51 ++--
lib/PublicInbox/XapHelperCxx.pm | 14 +-
lib/PublicInbox/xap_helper.h | 379 +++++++-------------------
lib/PublicInbox/xh_cidx.h | 244 +++++++++++++++++
lib/PublicInbox/xh_mset.h | 96 +++++++
script/public-inbox-cindex | 38 ++-
t/admin.t | 12 +
t/cindex-join.t | 9 +-
t/cindex.t | 91 ++++++-
t/xap_helper.t | 53 +++-
xt/solver.t | 3 +-
30 files changed, 1111 insertions(+), 426 deletions(-)
create mode 100644 lib/PublicInbox/RepoList.pm
create mode 100644 lib/PublicInbox/xh_cidx.h
create mode 100644 lib/PublicInbox/xh_mset.h
* [PATCH 05/14] xap_helper.h: move cindex endpoints to separate file
2023-11-28 14:56 7% [PATCH 00/14] IT'S ALIVE! www loads cindex join data Eric Wong
@ 2023-11-28 14:56 2% ` Eric Wong
0 siblings, 0 replies; 2+ results
From: Eric Wong @ 2023-11-28 14:56 UTC (permalink / raw)
To: meta
This ought to help with organization, since xap_helper.h is
getting somewhat large and we'll need new endpoints to support
WWW, lei, and whatever else comes along.
---
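A rough sketch of why xh_cidx.h has to be #included above the
command table in xap_helper.h: the CMD(n) macro expands to a
reference to cmd_##n, so every cmd_* function must already be
declared at that point. The stub bodies, the cut-down `struct req',
and the `find_cmd' helper are assumptions for illustration only.
The initializers are positional here (the real macro uses designated
initializers) to keep the sketch portable to older C++ standards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the real request struct in xap_helper.h
struct req { int argc; };

// Stubs standing in for what a header such as xh_cidx.h provides;
// they must be visible before the table below is defined.
static bool cmd_dump_ibx(struct req *) { return true; }
static bool cmd_dump_roots(struct req *) { return true; }

struct cmd_entry {
	size_t fn_len;
	const char *fn_name;
	bool (*fn)(struct req *);
};

// CMD(dump_ibx) becomes { 8, "dump_ibx", cmd_dump_ibx }
#define CMD(n) { sizeof(#n) - 1, #n, cmd_##n }
static const struct cmd_entry cmds[] = { CMD(dump_ibx), CMD(dump_roots) };

// Hypothetical lookup helper: returns NULL for unknown commands
static const struct cmd_entry *find_cmd(const char *name)
{
	size_t len = strlen(name);
	for (const struct cmd_entry &c : cmds)
		if (c.fn_len == len && !memcmp(c.fn_name, name, len))
			return &c;
	return NULL;
}
```

Since everything is `static' and lands in one translation unit,
include order is the only thing tying the pieces together.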
MANIFEST | 1 +
lib/PublicInbox/XapHelperCxx.pm | 10 +-
lib/PublicInbox/xap_helper.h | 269 +-------------------------------
lib/PublicInbox/xh_cidx.h | 259 ++++++++++++++++++++++++++++++
4 files changed, 272 insertions(+), 267 deletions(-)
create mode 100644 lib/PublicInbox/xh_cidx.h
diff --git a/MANIFEST b/MANIFEST
index 85811133..bbbe0b91 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -378,6 +378,7 @@ lib/PublicInbox/XapHelperCxx.pm
lib/PublicInbox/Xapcmd.pm
lib/PublicInbox/gcf2_libgit2.h
lib/PublicInbox/xap_helper.h
+lib/PublicInbox/xh_cidx.h
sa_config/Makefile
sa_config/README
sa_config/root/etc/spamassassin/public-inbox.pre
diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm
index f421c7bc..8a66fdcd 100644
--- a/lib/PublicInbox/XapHelperCxx.pm
+++ b/lib/PublicInbox/XapHelperCxx.pm
@@ -20,7 +20,7 @@ $ENV{PERL_INLINE_DIRECTORY} // die('BUG: PERL_INLINE_DIRECTORY unset');
substr($dir, 0, 0) = "$ENV{PERL_INLINE_DIRECTORY}/";
my $bin = "$dir/xap_helper";
my ($srcpfx) = (__FILE__ =~ m!\A(.+/)[^/]+\z!);
-my @srcs = map { $srcpfx.$_ } qw(xap_helper.h);
+my @srcs = map { $srcpfx.$_ } qw(xap_helper.h xh_cidx.h);
my @pm_dep = map { $srcpfx.$_ } qw(Search.pm CodeSearch.pm);
my $ldflags = '-Wl,-O1';
$ldflags .= ' -Wl,--compress-debug-sections=zlib' if $^O ne 'openbsd';
@@ -61,11 +61,9 @@ sub build () {
require PublicInbox::OnDestroy;
my ($prog) = ($bin =~ m!/([^/]+)\z!);
my $lk = PublicInbox::Lock->new("$dir/$prog.lock")->lock_for_scope;
- open my $fh, '>', "$dir/$prog.cpp";
- say $fh qq(# include "$_") for @srcs;
- print $fh PublicInbox::Search::generate_cxx();
- print $fh PublicInbox::CodeSearch::generate_cxx();
- close $fh;
+ write_file '>', "$dir/$prog.cpp", qq{#include "xap_helper.h"\n},
+ PublicInbox::Search::generate_cxx(),
+ PublicInbox::CodeSearch::generate_cxx();
opendir my $dh, '.';
my $restore = PublicInbox::OnDestroy->new(\&chdir, $dh);
diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h
index 5816c24c..89d151d9 100644
--- a/lib/PublicInbox/xap_helper.h
+++ b/lib/PublicInbox/xap_helper.h
@@ -146,6 +146,12 @@ struct worker {
unsigned nr;
};
+struct fbuf {
+ FILE *fp;
+ char *ptr;
+ size_t len;
+};
+
#define SPLIT2ARGV(dst,buf,len) split2argv(dst,buf,len,MY_ARRAY_SIZE(dst))
static size_t split2argv(char **dst, char *buf, size_t len, size_t limit)
{
@@ -253,87 +259,11 @@ static bool starts_with(const std::string *s, const char *pfx, size_t pfx_len)
return s->size() >= pfx_len && !memcmp(pfx, s->c_str(), pfx_len);
}
-static void dump_ibx_term(struct req *req, const char *pfx,
- Xapian::Document *doc, const char *ibx_id)
-{
- Xapian::TermIterator cur = doc->termlist_begin();
- Xapian::TermIterator end = doc->termlist_end();
- size_t pfx_len = strlen(pfx);
-
- for (cur.skip_to(pfx); cur != end; cur++) {
- std::string tn = *cur;
-
- if (starts_with(&tn, pfx, pfx_len)) {
- fprintf(req->fp[0], "%s %s\n",
- tn.c_str() + pfx_len, ibx_id);
- ++req->nr_out;
- }
- }
-}
-
static int my_setlinebuf(FILE *fp) // glibc setlinebuf(3) can't report errors
{
return setvbuf(fp, NULL, _IOLBF, 0);
}
-static enum exc_iter dump_ibx_iter(struct req *req, const char *ibx_id,
- Xapian::MSetIterator *i)
-{
- try {
- Xapian::Document doc = i->get_document();
- for (int p = 0; p < req->pfxc; p++)
- dump_ibx_term(req, req->pfxv[p], &doc, ibx_id);
- } catch (const Xapian::DatabaseModifiedError & e) {
- req->srch->db->reopen();
- return ITER_RETRY;
- } catch (const Xapian::DocNotFoundError & e) { // oh well...
- warnx("doc not found: %s", e.get_description().c_str());
- }
- return ITER_OK;
-}
-
-static bool cmd_dump_ibx(struct req *req)
-{
- if ((optind + 1) >= req->argc)
- ABORT("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
- if (!req->pfxc)
- ABORT("dump_ibx requires -A PREFIX");
-
- const char *ibx_id = req->argv[optind];
- if (my_setlinebuf(req->fp[0])) // for sort(1) pipe
- EABORT("setlinebuf(fp[0])"); // WTF?
- req->asc = true;
- req->sort_col = -1;
- Xapian::MSet mset = mail_mset(req, req->argv[optind + 1]);
-
- // @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
- // in case we need to retry on DB reopens
- for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
- for (int t = 10; t > 0; --t)
- switch (dump_ibx_iter(req, ibx_id, &i)) {
- case ITER_OK: t = 0; break; // leave inner loop
- case ITER_RETRY: break; // continue for-loop
- case ITER_ABORT: return false; // error
- }
- }
- emit_mset_stats(req, &mset);
- return true;
-}
-
-struct fbuf {
- FILE *fp;
- char *ptr;
- size_t len;
-};
-
-struct dump_roots_tmp {
- struct stat sb;
- void *mm_ptr;
- char **entries;
- struct fbuf wbuf;
- int root2off_fd;
-};
-
// n.b. __cleanup__ works fine with C++ exceptions, but not longjmp
// Only clang and g++ are supported, as AFAIK there's no other
// relevant Free(-as-in-speech) C++ compilers.
@@ -367,127 +297,6 @@ static size_t off2size(off_t n)
return (size_t)n;
}
-#define CLEANUP_DUMP_ROOTS __attribute__((__cleanup__(dump_roots_ensure)))
-static void dump_roots_ensure(void *ptr)
-{
- struct dump_roots_tmp *drt = (struct dump_roots_tmp *)ptr;
- if (drt->root2off_fd >= 0)
- xclose(drt->root2off_fd);
- hdestroy(); // idempotent
- size_t size = off2size(drt->sb.st_size);
- if (drt->mm_ptr && munmap(drt->mm_ptr, size))
- EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, size);
- free(drt->entries);
- fbuf_ensure(&drt->wbuf);
-}
-
-static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
-{
- Xapian::TermIterator cur = doc->termlist_begin();
- Xapian::TermIterator end = doc->termlist_end();
- ENTRY e, *ep;
- fbuf_init(root_offs);
- for (cur.skip_to("G"); cur != end; cur++) {
- std::string tn = *cur;
- if (!starts_with(&tn, "G", 1))
- continue;
- union { const char *in; char *out; } u;
- u.in = tn.c_str() + 1;
- e.key = u.out;
- ep = hsearch(e, FIND);
- if (!ep) ABORT("hsearch miss `%s'", e.key);
- // ep->data is a NUL-terminated string matching /[0-9]+/
- fputc(' ', root_offs->fp);
- fputs((const char *)ep->data, root_offs->fp);
- }
- fputc('\n', root_offs->fp);
- if (ferror(root_offs->fp) | fclose(root_offs->fp))
- err(EXIT_FAILURE, "ferror|fclose(root_offs)"); // ENOMEM
- root_offs->fp = NULL;
- return true;
-}
-
-// writes term values matching @pfx for a given @doc, ending the line
-// with the contents of @root_offs
-static void dump_roots_term(struct req *req, const char *pfx,
- struct dump_roots_tmp *drt,
- struct fbuf *root_offs,
- Xapian::Document *doc)
-{
- Xapian::TermIterator cur = doc->termlist_begin();
- Xapian::TermIterator end = doc->termlist_end();
- size_t pfx_len = strlen(pfx);
-
- for (cur.skip_to(pfx); cur != end; cur++) {
- std::string tn = *cur;
- if (!starts_with(&tn, pfx, pfx_len))
- continue;
- fputs(tn.c_str() + pfx_len, drt->wbuf.fp);
- fwrite(root_offs->ptr, root_offs->len, 1, drt->wbuf.fp);
- ++req->nr_out;
- }
-}
-
-// we may have lines which exceed PIPE_BUF, so we do our own
-// buffering and rely on flock(2), here
-static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
-{
- char *p;
- int fd = fileno(req->fp[0]);
- bool ok = true;
-
- if (!drt->wbuf.fp) return true;
- if (fd < 0) EABORT("BUG: fileno");
- if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
- err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
- drt->wbuf.fp = NULL;
- if (!drt->wbuf.len) goto done_free;
- while (flock(drt->root2off_fd, LOCK_EX)) {
- if (errno == EINTR) continue;
- err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
- }
- p = drt->wbuf.ptr;
- do { // write to client FD
- ssize_t n = write(fd, p, drt->wbuf.len);
- if (n > 0) {
- drt->wbuf.len -= n;
- p += n;
- } else {
- perror(n ? "write" : "write (zero bytes)");
- return false;
- }
- } while (drt->wbuf.len);
- while (flock(drt->root2off_fd, LOCK_UN)) {
- if (errno == EINTR) continue;
- err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
- }
-done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
- free(drt->wbuf.ptr);
- drt->wbuf.ptr = NULL;
- return ok;
-}
-
-static enum exc_iter dump_roots_iter(struct req *req,
- struct dump_roots_tmp *drt,
- Xapian::MSetIterator *i)
-{
- CLEANUP_FBUF struct fbuf root_offs = {}; // " $ID0 $ID1 $IDx..\n"
- try {
- Xapian::Document doc = i->get_document();
- if (!root2offs_str(&root_offs, &doc))
- return ITER_ABORT; // bad request, abort
- for (int p = 0; p < req->pfxc; p++)
- dump_roots_term(req, req->pfxv[p], drt,
- &root_offs, &doc);
- } catch (const Xapian::DatabaseModifiedError & e) {
- req->srch->db->reopen();
- return ITER_RETRY;
- } catch (const Xapian::DocNotFoundError & e) { // oh well...
- warnx("doc not found: %s", e.get_description().c_str());
- }
- return ITER_OK;
-}
-
static char *hsearch_enter_key(char *s)
{
#if defined(__OpenBSD__) || defined(__DragonFly__)
@@ -507,70 +316,6 @@ static char *hsearch_enter_key(char *s)
return s;
}
-static bool cmd_dump_roots(struct req *req)
-{
- CLEANUP_DUMP_ROOTS struct dump_roots_tmp drt = {};
- drt.root2off_fd = -1;
- if ((optind + 1) >= req->argc)
- ABORT("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
- if (!req->pfxc)
- ABORT("dump_roots requires -A PREFIX");
- const char *root2off_file = req->argv[optind];
- drt.root2off_fd = open(root2off_file, O_RDONLY);
- if (drt.root2off_fd < 0)
- EABORT("open(%s)", root2off_file);
- if (fstat(drt.root2off_fd, &drt.sb)) // ENOMEM?
- err(EXIT_FAILURE, "fstat(%s)", root2off_file);
- // each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
- // so /32 overestimates the number of expected entries by
- // ~%25 (as recommended by Linux hcreate(3) manpage)
- size_t size = off2size(drt.sb.st_size);
- size_t est = (size / 32) + 1; //+1 for "\0" termination
- drt.mm_ptr = mmap(NULL, size, PROT_READ,
- MAP_PRIVATE, drt.root2off_fd, 0);
- if (drt.mm_ptr == MAP_FAILED)
- err(EXIT_FAILURE, "mmap(%zu, %s)", size, root2off_file);
- size_t asize = est * 2;
- if (asize < est) ABORT("too many entries: %zu", est);
- drt.entries = (char **)calloc(asize, sizeof(char *));
- if (!drt.entries)
- err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
- size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr, size, asize);
- if (tot <= 0) return false; // split2argv already warned on error
- if (!hcreate(est))
- err(EXIT_FAILURE, "hcreate(%zu)", est);
- for (size_t i = 0; i < tot; ) {
- ENTRY e;
- e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
- e.data = drt.entries[i++];
- if (!hsearch(e, ENTER))
- err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
- (const char *)e.data);
- }
- req->asc = true;
- req->sort_col = -1;
- Xapian::MSet mset = commit_mset(req, req->argv[optind + 1]);
-
- // @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
- // in case we need to retry on DB reopens
- for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
- if (!drt.wbuf.fp)
- fbuf_init(&drt.wbuf);
- for (int t = 10; t > 0; --t)
- switch (dump_roots_iter(req, &drt, &i)) {
- case ITER_OK: t = 0; break; // leave inner loop
- case ITER_RETRY: break; // continue for-loop
- case ITER_ABORT: return false; // error
- }
- if (!(req->nr_out & 0x3fff) && !dump_roots_flush(req, &drt))
- return false;
- }
- if (!dump_roots_flush(req, &drt))
- return false;
- emit_mset_stats(req, &mset);
- return true;
-}
-
// for test usage only, we need to ensure the compiler supports
// __cleanup__ when exceptions are thrown
struct inspect { struct req *req; };
@@ -594,6 +339,8 @@ static bool cmd_test_inspect(struct req *req)
return false;
}
+#include "xh_cidx.h" // CodeSearchIdx.pm stuff
+
#define CMD(n) { .fn_len = sizeof(#n) - 1, .fn_name = #n, .fn = cmd_##n }
static const struct cmd_entry {
size_t fn_len;
diff --git a/lib/PublicInbox/xh_cidx.h b/lib/PublicInbox/xh_cidx.h
new file mode 100644
index 00000000..c2d94162
--- /dev/null
+++ b/lib/PublicInbox/xh_cidx.h
@@ -0,0 +1,259 @@
+// Copyright (C) all contributors <meta@public-inbox.org>
+// License: GPL-2.0+ <https://www.gnu.org/licenses/gpl-2.0.txt>
+// This file is only intended to be included by xap_helper.h
+// it implements pieces used by CodeSearchIdx.pm
+
+static void dump_ibx_term(struct req *req, const char *pfx,
+ Xapian::Document *doc, const char *ibx_id)
+{
+ Xapian::TermIterator cur = doc->termlist_begin();
+ Xapian::TermIterator end = doc->termlist_end();
+ size_t pfx_len = strlen(pfx);
+
+ for (cur.skip_to(pfx); cur != end; cur++) {
+ std::string tn = *cur;
+
+ if (starts_with(&tn, pfx, pfx_len)) {
+ fprintf(req->fp[0], "%s %s\n",
+ tn.c_str() + pfx_len, ibx_id);
+ ++req->nr_out;
+ }
+ }
+}
+
+static enum exc_iter dump_ibx_iter(struct req *req, const char *ibx_id,
+ Xapian::MSetIterator *i)
+{
+ try {
+ Xapian::Document doc = i->get_document();
+ for (int p = 0; p < req->pfxc; p++)
+ dump_ibx_term(req, req->pfxv[p], &doc, ibx_id);
+ } catch (const Xapian::DatabaseModifiedError & e) {
+ req->srch->db->reopen();
+ return ITER_RETRY;
+ } catch (const Xapian::DocNotFoundError & e) { // oh well...
+ warnx("doc not found: %s", e.get_description().c_str());
+ }
+ return ITER_OK;
+}
+
+static bool cmd_dump_ibx(struct req *req)
+{
+ if ((optind + 1) >= req->argc)
+ ABORT("usage: dump_ibx [OPTIONS] IBX_ID QRY_STR");
+ if (!req->pfxc)
+ ABORT("dump_ibx requires -A PREFIX");
+
+ const char *ibx_id = req->argv[optind];
+ if (my_setlinebuf(req->fp[0])) // for sort(1) pipe
+ EABORT("setlinebuf(fp[0])"); // WTF?
+ req->asc = true;
+ req->sort_col = -1;
+ Xapian::MSet mset = mail_mset(req, req->argv[optind + 1]);
+
+ // @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
+ // in case we need to retry on DB reopens
+ for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
+ for (int t = 10; t > 0; --t)
+ switch (dump_ibx_iter(req, ibx_id, &i)) {
+ case ITER_OK: t = 0; break; // leave inner loop
+ case ITER_RETRY: break; // continue for-loop
+ case ITER_ABORT: return false; // error
+ }
+ }
+ emit_mset_stats(req, &mset);
+ return true;
+}
+
+struct dump_roots_tmp {
+ struct stat sb;
+ void *mm_ptr;
+ char **entries;
+ struct fbuf wbuf;
+ int root2off_fd;
+};
+
+#define CLEANUP_DUMP_ROOTS __attribute__((__cleanup__(dump_roots_ensure)))
+static void dump_roots_ensure(void *ptr)
+{
+ struct dump_roots_tmp *drt = (struct dump_roots_tmp *)ptr;
+ if (drt->root2off_fd >= 0)
+ xclose(drt->root2off_fd);
+ hdestroy(); // idempotent
+ size_t size = off2size(drt->sb.st_size);
+ if (drt->mm_ptr && munmap(drt->mm_ptr, size))
+ EABORT("BUG: munmap(%p, %zu)", drt->mm_ptr, size);
+ free(drt->entries);
+ fbuf_ensure(&drt->wbuf);
+}
+
+static bool root2offs_str(struct fbuf *root_offs, Xapian::Document *doc)
+{
+ Xapian::TermIterator cur = doc->termlist_begin();
+ Xapian::TermIterator end = doc->termlist_end();
+ ENTRY e, *ep;
+ fbuf_init(root_offs);
+ for (cur.skip_to("G"); cur != end; cur++) {
+ std::string tn = *cur;
+ if (!starts_with(&tn, "G", 1))
+ continue;
+ union { const char *in; char *out; } u;
+ u.in = tn.c_str() + 1;
+ e.key = u.out;
+ ep = hsearch(e, FIND);
+ if (!ep) ABORT("hsearch miss `%s'", e.key);
+ // ep->data is a NUL-terminated string matching /[0-9]+/
+ fputc(' ', root_offs->fp);
+ fputs((const char *)ep->data, root_offs->fp);
+ }
+ fputc('\n', root_offs->fp);
+ if (ferror(root_offs->fp) | fclose(root_offs->fp))
+ err(EXIT_FAILURE, "ferror|fclose(root_offs)"); // ENOMEM
+ root_offs->fp = NULL;
+ return true;
+}
+
+// writes term values matching @pfx for a given @doc, ending the line
+// with the contents of @root_offs
+static void dump_roots_term(struct req *req, const char *pfx,
+ struct dump_roots_tmp *drt,
+ struct fbuf *root_offs,
+ Xapian::Document *doc)
+{
+ Xapian::TermIterator cur = doc->termlist_begin();
+ Xapian::TermIterator end = doc->termlist_end();
+ size_t pfx_len = strlen(pfx);
+
+ for (cur.skip_to(pfx); cur != end; cur++) {
+ std::string tn = *cur;
+ if (!starts_with(&tn, pfx, pfx_len))
+ continue;
+ fputs(tn.c_str() + pfx_len, drt->wbuf.fp);
+ fwrite(root_offs->ptr, root_offs->len, 1, drt->wbuf.fp);
+ ++req->nr_out;
+ }
+}
+
+// we may have lines which exceed PIPE_BUF, so we do our own
+// buffering and rely on flock(2), here
+static bool dump_roots_flush(struct req *req, struct dump_roots_tmp *drt)
+{
+ char *p;
+ int fd = fileno(req->fp[0]);
+ bool ok = true;
+
+ if (!drt->wbuf.fp) return true;
+ if (fd < 0) EABORT("BUG: fileno");
+ if (ferror(drt->wbuf.fp) | fclose(drt->wbuf.fp)) // ENOMEM?
+ err(EXIT_FAILURE, "ferror|fclose(drt->wbuf.fp)");
+ drt->wbuf.fp = NULL;
+ if (!drt->wbuf.len) goto done_free;
+ while (flock(drt->root2off_fd, LOCK_EX)) {
+ if (errno == EINTR) continue;
+ err(EXIT_FAILURE, "LOCK_EX"); // ENOLCK?
+ }
+ p = drt->wbuf.ptr;
+ do { // write to client FD
+ ssize_t n = write(fd, p, drt->wbuf.len);
+ if (n > 0) {
+ drt->wbuf.len -= n;
+ p += n;
+ } else {
+ perror(n ? "write" : "write (zero bytes)");
+ return false;
+ }
+ } while (drt->wbuf.len);
+ while (flock(drt->root2off_fd, LOCK_UN)) {
+ if (errno == EINTR) continue;
+ err(EXIT_FAILURE, "LOCK_UN"); // ENOLCK?
+ }
+done_free: // OK to skip on errors, dump_roots_ensure calls fbuf_ensure
+ free(drt->wbuf.ptr);
+ drt->wbuf.ptr = NULL;
+ return ok;
+}
+
+static enum exc_iter dump_roots_iter(struct req *req,
+ struct dump_roots_tmp *drt,
+ Xapian::MSetIterator *i)
+{
+ CLEANUP_FBUF struct fbuf root_offs = {}; // " $ID0 $ID1 $IDx..\n"
+ try {
+ Xapian::Document doc = i->get_document();
+ if (!root2offs_str(&root_offs, &doc))
+ return ITER_ABORT; // bad request, abort
+ for (int p = 0; p < req->pfxc; p++)
+ dump_roots_term(req, req->pfxv[p], drt,
+ &root_offs, &doc);
+ } catch (const Xapian::DatabaseModifiedError & e) {
+ req->srch->db->reopen();
+ return ITER_RETRY;
+ } catch (const Xapian::DocNotFoundError & e) { // oh well...
+ warnx("doc not found: %s", e.get_description().c_str());
+ }
+ return ITER_OK;
+}
+
+static bool cmd_dump_roots(struct req *req)
+{
+ CLEANUP_DUMP_ROOTS struct dump_roots_tmp drt = {};
+ drt.root2off_fd = -1;
+ if ((optind + 1) >= req->argc)
+ ABORT("usage: dump_roots [OPTIONS] ROOT2ID_FILE QRY_STR");
+ if (!req->pfxc)
+ ABORT("dump_roots requires -A PREFIX");
+ const char *root2off_file = req->argv[optind];
+ drt.root2off_fd = open(root2off_file, O_RDONLY);
+ if (drt.root2off_fd < 0)
+ EABORT("open(%s)", root2off_file);
+ if (fstat(drt.root2off_fd, &drt.sb)) // ENOMEM?
+ err(EXIT_FAILURE, "fstat(%s)", root2off_file);
+ // each entry is at least 43 bytes ({OIDHEX}\0{INT}\0),
+ // so /32 overestimates the number of expected entries by
+ // ~%25 (as recommended by Linux hcreate(3) manpage)
+ size_t size = off2size(drt.sb.st_size);
+ size_t est = (size / 32) + 1; //+1 for "\0" termination
+ drt.mm_ptr = mmap(NULL, size, PROT_READ,
+ MAP_PRIVATE, drt.root2off_fd, 0);
+ if (drt.mm_ptr == MAP_FAILED)
+ err(EXIT_FAILURE, "mmap(%zu, %s)", size, root2off_file);
+ size_t asize = est * 2;
+ if (asize < est) ABORT("too many entries: %zu", est);
+ drt.entries = (char **)calloc(asize, sizeof(char *));
+ if (!drt.entries)
+ err(EXIT_FAILURE, "calloc(%zu * 2, %zu)", est, sizeof(char *));
+ size_t tot = split2argv(drt.entries, (char *)drt.mm_ptr, size, asize);
+ if (tot <= 0) return false; // split2argv already warned on error
+ if (!hcreate(est))
+ err(EXIT_FAILURE, "hcreate(%zu)", est);
+ for (size_t i = 0; i < tot; ) {
+ ENTRY e;
+ e.key = hsearch_enter_key(drt.entries[i++]); // dies on ENOMEM
+ e.data = drt.entries[i++];
+ if (!hsearch(e, ENTER))
+ err(EXIT_FAILURE, "hsearch(%s => %s, ENTER)", e.key,
+ (const char *)e.data);
+ }
+ req->asc = true;
+ req->sort_col = -1;
+ Xapian::MSet mset = commit_mset(req, req->argv[optind + 1]);
+
+ // @UNIQ_FOLD in CodeSearchIdx.pm can handle duplicate lines fine
+ // in case we need to retry on DB reopens
+ for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
+ if (!drt.wbuf.fp)
+ fbuf_init(&drt.wbuf);
+ for (int t = 10; t > 0; --t)
+ switch (dump_roots_iter(req, &drt, &i)) {
+ case ITER_OK: t = 0; break; // leave inner loop
+ case ITER_RETRY: break; // continue for-loop
+ case ITER_ABORT: return false; // error
+ }
+ if (!(req->nr_out & 0x3fff) && !dump_roots_flush(req, &drt))
+ return false;
+ }
+ if (!dump_roots_flush(req, &drt))
+ return false;
+ emit_mset_stats(req, &mset);
+ return true;
+}
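
For readers skimming the moved code: both cmd_dump_ibx and
cmd_dump_roots wrap each document in a bounded retry loop, so a
concurrent writer (DatabaseModifiedError) costs at most 10 attempts
per document before moving on. Below is a minimal sketch of that
pattern without Xapian. `run_with_retries' and `calls_until_ok' are
hypothetical helpers, not part of the patch, and the enum is
redeclared here only so the example stands alone.

```cpp
#include <cassert>
#include <functional>

enum exc_iter { ITER_OK, ITER_RETRY, ITER_ABORT };

// Hypothetical helper: calls step() until it stops asking for a
// retry, giving up after max_tries attempts.  Returns false only on
// ITER_ABORT; like the loops in the patch, it quietly moves on
// (returns true) when every attempt asked for a retry.
static bool run_with_retries(const std::function<enum exc_iter()> &step,
			int max_tries = 10)
{
	for (int t = max_tries; t > 0; --t) {
		switch (step()) {
		case ITER_OK: return true;
		case ITER_RETRY: break; // e.g. after reopening the DB
		case ITER_ABORT: return false;
		}
	}
	return true; // retries exhausted, skip this item
}

// Hypothetical stand-in for a step which asks for a retry
// `failures' times before succeeding; returns the number of calls.
static int calls_until_ok(int failures)
{
	int calls = 0;
	bool ok = run_with_retries([&]() {
		return ++calls <= failures ? ITER_RETRY : ITER_OK;
	});
	assert(ok);
	return calls;
}
```

Duplicate output lines from a retried document are tolerated
because, per the comment in the patch, @UNIQ_FOLD in
CodeSearchIdx.pm handles them on the consuming side.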
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git