public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message (Collapse)
2019-05-15	www: use Inbox->over where appropriate
	We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-05-14	tests: remove unnecessary loading of ::DS and Socket
	PublicInbox::DS works for every platform we we care about, nowadays; so checking for it is a waste of time. Cleanup a few POSIX and Socket imports while we're in the area.
2019-05-14	v2writable: allow setting nproc via creat options
	Avoiding reliance on environment variables is a bit cleaner for writing tests
2019-05-08	Merge remote-tracking branch 'origin/danga-bundle'
	* origin/danga-bundle: DS: epoll: fix misordered EPOLL_CTL_DEL call DS: drop unused "_undef" sub syscall: drop readahead wrapper build: do not manify DS and Syscall pods DS: handle EINTR in IO::Poll path, too DS: workaround IO::Kqueue EINTR (mis-)handling DS: drop profiling support DS: remove unused fields and functions listener: use EPOLLEXCLUSIVE for listen sockets bundle Danga::Socket and Sys::Syscall
2019-05-06	index: warn with info about the message as context
	This can help users track down the source of warnings when presented with imperfect emails. While we're at it, make the __WARN__ callback in t/v2writable.t a no-op since we don't check for warnings, there.
2019-05-04	bundle Danga::Socket and Sys::Syscall
	These modules are unmaintained upstream at the moment, but I'll be able to help with the intended maintainer once/if CPAN ownership is transferred. OTOH, we've been waiting for that transfer for several years, now... Changes I intend to make: * EPOLLEXCLUSIVE for Linux * remove unused fields wasting memory * kqueue bugfixes e.g. https://rt.cpan.org/Ticket/Display.html?id=116615 * accept4 support And some lower priority experiments: * switch to EV_ONESHOT / EPOLLONESHOT (incompatible changes) * nginx-style buffering to tmpfile instead of string array * sendfile off tmpfile buffers * io_uring maybe?
2019-01-11	v2writable: ->purge returns undef on no-op
	And doesn't try to access undef as an array ref.
2019-01-10	t/v2writable.t: force more consistent "git log" output
	This should probably use lower-level git plumbing, but until then, consistently add a bunch of --no-* options to "git log" to get more consistent output. Noticed-by: Johannes Berg https://public-inbox.org/meta/1538164205.14416.76.camel@sipsolutions.net/
2019-01-10	check git version requirements
	This allows v1 tests to continue working on git 1.8.0 for now. This allows git 2.1.4 packaged with Debian 8 ("jessie") to run old tests, at least. I suppose it's safe to drop Debian 7 ("wheezy") due to our dependency on git 1.8.0 for "merge-base --is-ancestor". Writing V2 repositories requires git 2.6 for "get-mark" support, so mask out tests for older gits.
2018-12-29	tests: consolidate process spawning code.
	IPC::Run provides a nice simplification in several places; and we already use it (optionally) on a lot of tests. For the non-test code, we still rely on our vfork-capable Inline::C stuff since real-world server processes can get large enough to where vfork is an advantage. Maybe Perl5 can use CLONE_VFORK somehow, one day: https://rt.perl.org/Ticket/Display.html?id=128227 Ohg V'q engure cbeg choyvp-vaobk gb Ehol :C
2018-07-18	t/search.t t/v2writable.t: Teach search tests to fail more cleanly.
	Now that some of the indexes are optionals these tests might fail so teach them to fail more cleanly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-05-30	respect umask if core.sharedRepository is not set
	This is consistent with git itself and the previous behavior was a result of misunderstanding of how git interprets this. And adjust tests slightly to match the new behavior. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> <38873789-ab42-65a1-20c9-12c30b171f4f@linuxfoundation.org>
2018-04-18	ensure SQLite and Xapian files respect core.sharedRepository
	We can't have files with permissions inconsistent with what's in git objects.
2018-04-18	v2: generate better Message-IDs for duplicates
	While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
2018-04-07	store less data in the Xapian document
	Since we only query the SQLite over DB for OVER/XOVER; do not need to waste space storing fields To/Cc/:bytes/:lines or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-07	v2writable: reduce barriers
	Since we handle the overview info synchronously, we only need barriers in tests, now. We will use asynchronous checkpoints to sync less-important Xapian data. For data deduplication, this requires us to hoist out the cat-blob support in ::Import for reading uncommitted data in git.
2018-04-06	www: favor reading more from SQLite, and less from Xapian
	Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-05	v2writable: remove redundant remove from Over DB
	The Xapian partitions will trigger the removal anyways. Test this and fix some description/spelling errors while we're at it.
2018-04-04	import: rewrite less history during purge
	We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change.
2018-04-03	nntp: simplify the long_response API
	We we worked around the default range/termination conditions of long_response in many cases to reduce calls to SQLite or Xapian. So continue that trend and become more like the PSGI API which doesn't force callers to specify an article range or work inside a loop.
2018-04-03	msgmap: replace id_batch with ids_after
	id_batch had a an overly complicated interface, replace it with id_batch which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me.
2018-04-02	www: rework query responses to avoid COUNT in SQLite
	In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-04-01	truncate Message-IDs and References consistently
	We need to stop ghost messages from generating longer Message-IDs than Xapian can handle with terms.
2018-04-01	v2: one file, really
	We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top-level, now. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier-to-distinguish from "D" in git-log output.
2018-03-30	t/v2writable: use simplify permissions reading
	We have Git::qx nowadays.
2018-03-29	v2writable: support purging messages from git entirely
	Purging existing messages is fairly straightforward since we can take advantage of Xapian and lookup the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id (currently SHA-1). Furthermore, we expire reflogs and prune in hopes a dumb HTTP client won't get the object.
2018-03-29	v2writable: append, instead of prepending generated Message-ID
	The original Message-ID is still the most important when discussing with other recipients who do not rely on a message flowing through public-inbox. So whatever Message-ID we use to deduplicate internally will be secondary and less important. All of our front-end v2 code is order-independent, so we won't let the message count against us, that way.
2018-03-20	content_id: do not take Message-Id into account
	If we need to use content_id, we've already lost hope in relying on Message-Id as a differentiator. This prevents duplicates from showing up repeatedly with -watch when Message-Ids are reused and we generate new Message-Ids to disambiguate.
2018-03-19	v2writable: remove "resent" message for duplicate Message-IDs
	public-inbox-watch gets restarted on reboots and whatnot, so it could get pointlessly noisy. This message was only useful during initial development and imports.
2018-03-19	v2writable: allow disabling parallelization
	While parallel processes improves import speed for initial imports; they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Save some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19	v2writable: ensure ->done is idempotent
	This matches Import::done behavior
2018-03-19	v2writable: test for idempotent removals
	This will make reindexing easier.
2018-03-19	import: switch to URL-safe Base64 for Message-IDs
	Hexdigests are too long and shorter Message-IDs are easier to deal with.
2018-03-19	import: (v2): write deletes to a separate '_' subdirectory
	In the future, we may store "purged" content IDs or other uncommon stuff under "_/" of the git tree. This keeps the top-level tree small and more amenable to deltafication. This helps the the common case where "m" is most commonly changed file at the top level. Also, use 'D' instead of 'd' since it matches git's '--raw' output format.
2018-03-19	import: (v2) delete writes the blob into history in subdir
	This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19	v2writable: implement remove correctly
	We need to hide removals from anybody hitting the search engine.
2018-03-19	v2writable: support "barrier" operation to avoid reforking
	Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06	v2writable: detect and use previous partition count
	We need to detect the number of partitions the repository was created with to ensure Xapian DBs can work across different machines (or even CPU affinity changes) without leaving messages unaffected by search.
2018-03-03	v2: avoid redundant/repeated configs for git partition repos
	We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03	searchidx: store the primary MID in doc data for NNTP
	We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03	v2writable: generated Message-ID goes first
	This is to make SearchMsg behave more sanely under NNTP.
2018-03-03	searchidx: support indexing multiple MIDs
	It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02	v2writable: inject new Message-IDs on true duplicates
	Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.