Date | Commit message (Collapse) |
|
I didn't wait until September to do it, this year!
|
|
popen_rd accepts arbitrary redirects, so we can reuse its
code to setup the pipe end we want to read, saving each
caller a few lines of code compared to calling pipe+spawn.
|
|
Most spawn and popen_rd callers die on failure to spawn,
anyways, and some are missing checks entirely. This saves
us a bunch of verbose error-checking code in callers.
This also makes popen_rd more consistent, since it already
dies on pipe creation failures.
|
|
There's a bunch of leftover "require" and "use" statements we no
longer need and can get rid of, along with some excessive
imports via "use".
IO::Handle usage isn't always obvious, so add comments
describing why a package loads it. Along the same lines,
document the tmpdir support as the reason we depend on
File::Temp 0.19, even though every Perl 5.10.1+ user has it.
While we're at it, favor "use" over "require", since it it gives
us extra compile-time checking.
|
|
Found by codespell, there's a few more in comments and some
debatable ones, but user-facing stuff is more important.
|
|
We can save callers the trouble of {-hold} and {-dev_null}
refs as well as the trouble of calling fileno().
|
|
run_die() doesn't require an $env arg, so there's no
point passing "undef" to it.
|
|
Since we give users no indication or control of how "git gc"
runs, showing its progress is confusing.
|
|
SearchIdx->new no longer accepts a GIT_DIR path as its argument
since commit 585314673236d664729fe3ab2d4fb229d1c0f2d5
("searchidx: require PublicInbox::Inbox (or InboxWritable) ref")
|
|
I don't trust the MIME type to not munge my email messages in horrible
ways upon occasion. Therefore allow for passing in the raw message
value instead of trusting the mime object to preserve it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[ew: use "//" from Perl 5.10+ for defined check]
|
|
Much of the existing purge code is repurposed to a general
"replace" functionality.
->purge is simpler because it can just drop the information.
Unlike ->purge, ->replace needs to edit existing git commits (in
case of From: and Subject: headers) and reindex the modified
message.
We currently disallow editing of References:, In-Reply-To: and
Message-ID headers because it can cause bad side effects with
our threading (and our lack of rethreading support to deal with
excessive matching from incorrect/invalid References).
|
|
Continuing the work by Eric Biederman in commit a118d58a402bd31b
("Import.pm: When purging replace a purged file with a zero length file"),
we can use a generic OID replacement mechanism to implement
purge.
|
|
We will be reusing the same logic for extracting all
the authorship and commit title logic for edits; so
put it all into one sub.
|
|
While I don't expect git to suddenly start spewing non-ASCII
digits in places I'd expect ASCII, this would make things easier
for future hackers and reviewers.
|
|
Consolidate subject handling in the add function to make it easier to
read and understand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Import initialization is a little strange from history, but we
also can't change it too much because it's technically a public
API which external code may rely on...
And we may need to support v1 repos indefinitely. This should
make it easier to write tests for both formats.
|
|
This is for consistency with other fields which follow
this pattern w.r.t. field-naming when referring to internal
fields.
|
|
'$inbox' is more human-readable, so that is for the more
human-readable name in most cases. Making our variable naming
more consistent should make the code easier-to-review and
harder to screw up.
|
|
Zombies are bad.
|
|
Hopefully this helps people familiarize themselves with
the source code.
|
|
Remove confusing documentation around ssoma now that we
have NNTP and downloadable mbox support.
Only lightly-checked for grammar and speling, and not yet
formatting. Edits, corrections and addendums expected :>
|
|
This ensures that the number of added files remains the same and thus
the article numbers derived from a repository will remain the same.
I think this is the last place in public-inbox that has to be tweaked to
guarantee the generated article number will remain the same in an public
inbox archive.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
While working on one of the tests I did:
my $im = PublicInbox::V2Writable->new($ibx, 1);
my $im0 = $im->importer();
$im->add($mime);
Which resulted in a warning of the use of an undefined value from
atfork_child, and the test failing nastily. Inspection of the code
reveals this can happen anytime gfi_start has not been called.
So just fix atfork_child to skip closing file descriptors that have
not yet been setup.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Use ||= '' to ensure that if the From or Sender header is not present
the code sees an empty string and instead of undefined.
I had some email messages with a From field without an @ (because the
sender was local) and without a Sender which were causing errors when
imported. I think this was bad enough that the email messages were
failing to be imported.
Signed-off-by: Eric Biederamn <ebiederm@xmission.com>
|
|
Recently I ran git --git-dir=lkml/git/1.git fsck
and it reported:
> warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte inthe commit object body
Which I found quite scary. Nulls in the wrong place have a bad tendency
to make programs misbehave.
It turns out someone had placed "=?iso-8859-1?q?=00?=" at the end of
their subject line. Which is the mime encoding for NULL. Email::Mime
had correctly decoded the header, and then public-inbox had simply
copied the contents of the header into the subject line of the git
commit.
To prevent that from causing problems replace nulls in such subject
lines with spaces.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
|
|
This should reduce idle cat-file instances
|
|
In case people were running old buggy versions from 2016...
(and -convert should probably clean those up, eventually)
|
|
While hunting duplicates, I noticed a leading '-' in some
Message-IDs as a result of RFC4648 encoding. While '-' seems
allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon
and make using Message-IDs as arguments for command-line tools
more difficult. So prefix them with a datestamp to at least
give readers some sense of the age. And shorten the "localhost"
hostname to "z" to save space.
|
|
Since we handle the overview info synchronously, we only need
barriers in tests, now. We will use asynchronous checkpoints
to sync less-important Xapian data.
For data deduplication, this requires us to hoist out the
cat-blob support in ::Import for reading uncommitted data
in git.
|
|
This is important for people running mirrors via "git fetch",
as they need to be kept up-to-date. Purging is also now
supported in mirrors.
The short-lived "--regenerate" option is gone and is now
implicitly enabled as a result. It's still cheap when
article number regeneration is unnecessary, as we track
the range for each git repository.
|
|
We do not need to rewrite old commits unaffected by the object_id
purge, only newer commits. This was a state management bug :x
We will also return the new commit ID of rewritten history to
aid in incremental indexing of mirrors for the next change.
|
|
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level, now. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
|
|
We'll be using it in more future tests and scripts.
|
|
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
|
|
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
|
|
This also quiets down warnings from -watch when spam training
happens on messages without Message-Id.
|
|
We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.
|
|
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll used the Received: timestamp there, as well.
|
|
Reduce the places where we have duplicate logic for discarding
unwanted headers.
|
|
This reduces code duplication needed for locking and
and hopefully makes things easier to understand.
|
|
Instead of using ssoma-based locking, enable locking via Import
for now.
|
|
Hexdigests are too long and shorter Message-IDs are easier
to deal with.
|
|
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
|
|
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
|
|
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the the common case where "m" is most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
|