Date | Commit message (Collapse) |
|
This code will be shared with future mass-import tools.
|
|
Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.
|
|
This reduces code duplication needed for locking and
and hopefully makes things easier to understand.
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
|
|
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
|
|
Interchangably using "all", "skel", "threader", etc. were
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
V2Writable --> Import -> git-fast-import
\-> SearchIdxThread -> Msgmap (synchronous)
\-> SearchIdxPart[n] -> SearchIdx[*]
\-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
[* ] each subprocess writes to threader
|
|
Wrap the old Import package to enable creating new repos based
on size thresholds. This is better than relying on time-based
rotation as LKML traffic seems to be increasing.
|
|
|
|
Call order will need to change a bit since this is going to be
tied to Xapian
|
|
|
|
|
|
We will also treat all known list addresses as non-obfuscated.
By setting publicinbox.noObfuscate in ~/.public-inbox/config,
this will allow users to disable address obfuscation on a
per-domain or per-address basis.
|
|
This will make it easier to prevent breakage in the future.
|
|
Unfortunately, it appears we have to reject this and instead add
support filtering at View time(*), due to DKIM signatures in
messages from ruby-lang.org.
(*) which may not be worth it
|
|
This allows us to support centralized mailing lists (which suck,
but better than no mailing list at all).
|
|
We'll be adding more reply options for centralized mailing
lists. So split out the logic so it's easy-to-find.
Organizing code is hard :<
|
|
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
|
|
Due to the asynchronous nature of SMTP, it is possible for the
root message of a thread (with no References/In-Reply-To)
to arrive last in a series. We must preserve the thread_id
of the ghost message in this case, as we do when vivifiying
non-root ghosts.
Otherwise, this causes threads to be broken when the root
arrives last.
|
|
Some mailing lists add annoying tags into the Subject line which
discourages readers from doing proper mail organization on the
client side. They also waste precious screen space and
attention span.
Remove them from our archives to reduce clutter.
|
|
This should fix problems with multipart messages where
text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git
refs/pull/28/head
In the future, we may still introduce as streaming
interface to reduce memory usage on large emails.
|
|
I'll be using this to improve message threading performance.
|
|
This will let us stream larger Atom documents bodies without
wasting too much memory and reduce the amount of round-trip
requests needed to get necessary information.
Hopefully clients are using streaming (SAX) parsers, too.
This is the final transition in the core public-inbox
code to allow migrating to a "pull"-based body streaming
scheme which allows a HTTP server to respond appropriately
to backpressure from slow clients.
|
|
This starts to show noticeable performance improvements when
attempting to thread over 400 messages; but the improvement
may not be measurable with less.
However, the resulting code is much shorter and (IMHO)
much easier to understand.
|
|
Introduce our own SearchThread class for threading messages.
This should allow us to specialize and optimize away objects
in future commits.
|
|
Hopefully more folks can download and run public-inbox,
nowadays.
|
|
Begin documenting some basic help functionality.
I may tweak the anchor names of the various HTML endpoints
to be more consistent with each other (old ones will be
supported for a short while), so I'm not documenting
those, for now.
This may become part of a builtin key-value store for
basic texts, but this probably shouldn't become a wiki
engine, either.
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '; '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
|
|
For some existing mailing list archives, messages are identified
by serial number (such as NNTP article numbers in gmane). Those
links may become inaccessible (as is the current case for
gmane), so ensure users can still search based on old serial
numbers.
Now, I run the following periodically to get article numbers
from gmane (while news.gmane.org remains):
NNTPSERVER=news.gmane.org
export NNTPSERVER
GROUP=gmane.comp.version-control.git
perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config as an added "altid" snippet which now
looks like this:
[publicinbox "git"]
address = git@vger.kernel.org
mainrepo = /path/to/git.vger.git
newsgroup = inbox.comp.version-control.git
; relative pathnames expand to $mainrepo/public-inbox/$file
altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git"
periodically.
This ought to allow searching for "gmane:12345" to work for
Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article
serial numbers, use of those for public links is discouraged
since it encourages centralization.
|
|
This is used to quickly generate an article number to Message-ID
mapping.
Usage:
NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
|
|
|
|
Currently only for git-http-backend use, this allows limiting
the number of spawned processes per-inbox or by group, if there
are multiple large inboxes amidst a sea of small ones.
For example, a "big" repo limiter could be used for big inboxes:
which would be shared between multiple repos:
[limiter "big"]
max = 4
[publicinbox "git"]
address = git@vger.kernel.org
mainrepo = /path/to/git.git
; shared limiter with giant:
httpbackendmax = big
[publicinbox "giant"]
address = giant@project.org
mainrepo = /path/to/giant.git
; shared limiter with git:
httpbackendmax = big
; This is a tiny inbox, use the default limiter with 32 slots:
[publicinbox "meta"]
address = meta@public-inbox.org
mainrepo = /path/to/meta.git
|
|
Same as nginx :>
|
|
|
|
We may remove from_name in the future.
...And disallow quotes in email addresses.
Technically I believe they're allowed, but they're definitely
uncommon and unlikely to show up in legitimate mail.
|
|
This should hopefully make it easier to try other anti-spam
systems (or none at all) in the future.
|
|
When running mailing list mirrors, one needs to be careful
spammers do not try to sidestep the list server we want to
mirror from and inject email into our mail directly by setting
the appropriate list headers (e.g. "X-Mailing-List" or
"List-Id"). We trust the top-most Received: header is
the one our own mail server got the mail from.
Bcc:-ing a public mailing list is a very likely indicator of
spam in my experience, so throw in an extra rule mark it.
While public-inbox-mda rejects Bcc: entirely, public-inbox-watch
needs to mirror lists which allow Bcc.
==> list_mirror.cf <==
loadplugin PublicInbox::SaPlugin::ListMirror
ifplugin PublicInbox::SaPlugin::ListMirror
header LIST_MIRROR_RECEIVED eval:check_list_mirror_received()
describe LIST_MIRROR_RECEIVED Received does not match trusted list server
score LIST_MIRROR_RECEIVED 10
header LIST_MIRROR_BCC eval:check_list_mirror_bcc()
describe LIST_MIRROR_BCC Mailing list was Bcc-ed
score LIST_MIRROR_BCC 1
endif
==> ~/.spamassassin/user_prefs <==
ifplugin PublicInbox::SaPlugin::ListMirror
list_mirror X-Mailing-List git@vger.kernel.org *.kernel.org git@vger.kernel.org
endif
|
|
And add a check-manifest target to the Makefile to
ensure we're up-to-date with git (but do not depend on
git).
|
|
This will allow users to run importers off existing mail
accounts where they may not have access to run -mda.
Currently, we only support Maildirs, but IMAP ought to be
doable.
|
|
Oops, maybe this could be auto-maintained somehow...
|
|
Most of its functionality is in the PublicInbox::Inbox class.
While we're at it, we no longer auto-create newsgroup names
based on the inbox name, since newsgroup names probably deserve
some thought when it comes to hierarchy.
|
|
It seems common for users to end statements with URLs,
while it is rare for a URL itself to end with a '.' or ';'.
So make a guess and assume the URL was intended to not
include the trailing '.' or ';'
|
|
This will allow us to more easily reuse it elsewhere.
|
|
Ugh, I wonder if we can/should generate this automatically...
|
|
It's been a while...
|
|
This project is currently implemented in Perl, and pod2man is
probably more common among potential users and developers of
this project.
|
|
We'll be using it for more than just cat-file.
Adding a `popen' API for internal use allows us to save a bunch
of code in other places.
|
|
Thanks to Ask for the patch in
https://rt.cpan.org/Public/Bug/Display.html?id=22817
|
|
This should allow us to more-easily test with Plack.
|
|
We use --git-dir=... instead of $ENV{GIT_DIR} because ENV changes
do not propagate easily with mod_perl.
|