Diffstat (limited to 'Documentation/technical')
-rw-r--r-- Documentation/technical/data_structures.txt | 229
-rw-r--r-- Documentation/technical/ds.txt | 28
-rw-r--r-- Documentation/technical/memory.txt | 56
-rw-r--r-- Documentation/technical/weird-stuff.txt | 22
-rw-r--r-- Documentation/technical/whyperl.txt | 177
5 files changed, 500 insertions, 12 deletions
diff --git a/Documentation/technical/data_structures.txt b/Documentation/technical/data_structures.txt new file mode 100644 index 00000000..11f78041 --- /dev/null +++ b/Documentation/technical/data_structures.txt @@ -0,0 +1,229 @@ +Internal data structures of public-inbox + +This is a guide for hackers new to our code base. Do not +consider our internal data structures stable for external +consumers, this document should be updated when internals +change. I recommend reading this document from the source tree, +with the source code easily accessible if you need examples. + +This mainly documents in-memory data structures. If you're +interested in the stable on-filesystem formats, see the +public-inbox-config(5), public-inbox-v1-format(5) and +public-inbox-v2-format(5) manpages. + +Common abbreviations when used outside of their packages are +documented. `$self' is the common variable name when used +within their package. + +PublicInbox::Config +------------------- + +PublicInbox::Config is the root class which loads a +public-inbox-config file and instantiates PublicInbox::Inbox, +PublicInbox::WWW, PublicInbox::NNTPD, and other top-level +classes. + +Outside of tests, this is typically a singleton. + +Per-message classes +------------------- + +* PublicInbox::Eml - Email::MIME-like class + Common abbreviation: $mime, $eml + Used by: PublicInbox::WWW, PublicInbox::SearchIdx + + A representation of an entire email, multipart or not. + An option to use libgmime or libmailutils may be supported + in the future for performance and memory use. + + This can be a memory hog with big messages and giant + attachments, so our PublicInbox::WWW interface only keeps + one object of this class in memory at a time. + + In other words, this is the "meat" of the message, whereas + $smsg (below) is just the "skeleton". + + Our PublicInbox::V2Writable class may have two objects of this + type in memory at a time for deduplication. 
+ + In public-inbox 1.4 and earlier, Email::MIME and its subclass, + PublicInbox::MIME were used. Despite still slurping, + PublicInbox::Eml is faster and uses less memory due to + lazy header parsing and lazy subpart instantiation with + shorter object lifetimes. + +* PublicInbox::Smsg - small message skeleton + Used by: PublicInbox::{NNTP,WWW,SearchIdx} + Common abbreviation: $smsg + + Represents headers shown in NNTP overview and PSGI message + summaries (thread skeleton). + + This is loaded from either the overview DB (over.sqlite3) or + the Xapian DB (docdata.glass), though the Xapian docdata + won't hold NNTP-only fields (Cc:/To:). + + There may be hundreds or thousands of these objects in memory + at a time, so fields are pruned if unneeded. + +* PublicInbox::SearchThread::Msg - subclass of Smsg + Common abbreviation: $cont or $node + Used by: PublicInbox::WWW + + The structure we use for a non-recursive[1] variant of + JWZ's algorithm: <https://www.jwz.org/doc/threading.html>. + Nowadays, this is a re-blessed $smsg with additional fields. + + As with $smsg objects, there may be hundreds or thousands + of these objects in memory at a time. + + We also do not use a linked list for storing children as JWZ + describes, but instead a Perl hashref for {children} which + becomes an arrayref upon sorting. + + [1] https://rt.cpan.org/Ticket/Display.html?id=116727 + +Per-inbox classes +----------------- + +* PublicInbox::Inbox - represents a single public-inbox + Common abbreviation: $ibx + Used everywhere. + + This represents a "publicinbox" section in the config + file, see public-inbox-config(5) for details. + +* PublicInbox::Git - represents a single git repository + Common abbreviation: $git, $ibx->git + Used everywhere. + + Each configured "publicinbox" or "coderepo" has one of these. + +* PublicInbox::Msgmap - msgmap.sqlite3 read-write interface + Common abbreviation: $mm, $ibx->mm + Used everywhere if SQLite is available. 
+ + Each indexed inbox has one of these, see + public-inbox-v1-format(5) and public-inbox-v2-format(5) + manpages for details. + +* PublicInbox::Over - over.sqlite3 read-only interface + Common abbreviation: $over, $ibx->over + Used everywhere if SQLite is available. + + Each indexed inbox has one of these, see + public-inbox-v1-format(5) and public-inbox-v2-format(5) + manpages for details. + +* PublicInbox::Search - Xapian read-only interface + Common abbreviation: $srch, $ibx->search + Used everywhere if Xapian is available. + + Each indexed inbox has one of these, see + public-inbox-v1-format(5) and public-inbox-v2-format(5) + manpages for details. + +PublicInbox::WWW +---------------- + +The main PSGI web interface, uses several other packages to +form our web interface. + +PublicInbox::SolverGit +---------------------- + +This is instantiated from the $INBOX/$BLOB_OID/s/ WWW endpoint +and represents the stages and states for "solving" a blob by +searching for and applying patches. See the code and comments +in PublicInbox/SolverGit.pm + +PublicInbox::Qspawn +------------------- + +This is instantiated from various WWW endpoints and represents +the stages and states for running and managing subprocesses +in a way which won't exceed configured process limits defined +via "publicinboxlimiter.*" directives in public-inbox-config(5). + +ad-hoc structures shared across packages +---------------------------------------- + +* $ctx - PublicInbox::WWW app request context + This holds the PSGI $env as well as any internal variables + used by various modules of PublicInbox::WWW. + + As with the PSGI $env, there is one per active WWW + request+response cycle. It does not exist for idle HTTP + clients. 
+ +daemon classes +-------------- + +* PublicInbox::NNTP - a NNTP client socket + Common abbreviation: $nntp + Used by: PublicInbox::DS, public-inbox-nntpd + + Unlike PublicInbox::HTTP, all of the NNTP client logic for + serving to NNTP clients is here, including what would be + in $ctx on the HTTP or WWW side. + + There may be thousands of these since we support thousands of + NNTP clients. + +* PublicInbox::HTTP - a HTTP client socket + Common abbreviation: $http + Used by: PublicInbox::DS, public-inbox-httpd + + Unlike PublicInbox::NNTP, this class has no knowledge of any of + the email- or git-specific parts of public-inbox, only PSGI. + However, it supports APIs and behaviors (e.g. streaming large + responses) which PublicInbox::WWW may take advantage of. + + There may be thousands of these since we support thousands of + HTTP clients. + +* PublicInbox::Listener - a SOCK_STREAM listen socket (TCP or Unix) + Used by: PublicInbox::DS, public-inbox-httpd, public-inbox-nntpd + Common abbreviation: @listeners in PublicInbox::Daemon + + This class calls non-blocking accept(2) or accept4(2) on a + listen socket to create new PublicInbox::HTTP and + PublicInbox::NNTP instances. + +* PublicInbox::HTTPD + Common abbreviation: $httpd + + Represents an HTTP daemon which creates PublicInbox::HTTP + wrappers around client sockets accepted from + PublicInbox::Listener. + + Since the SERVER_NAME and SERVER_PORT PSGI variables need to be + exposed for HTTP/1.0 requests when Host: headers are missing, + this is per Listener socket. + +* PublicInbox::HTTPD::Async + Common abbreviation: $async + + Used for implementing an asynchronous "push" interface for + slow, expensive responses which may require spawning + git-httpd-backend(1), git-apply(1) or other commands. + This will also be used for dealing with future asynchronous + operations such as HTTP reverse proxying and slow storage + retrieval operations. 
+ +* PublicInbox::NNTPD + Common abbreviation: $nntpd + + Represents an NNTP daemon which creates PublicInbox::NNTP + wrappers around client sockets accepted from + PublicInbox::Listener. + + This is currently a singleton, but it is associated with a + given PublicInbox::Config which may be instantiated more than + once in the future. + +* PublicInbox::EOFpipe + + Used throughout to trigger a callback when a pipe(7) is closed. + This is frequently used to portably detect process exit without + relying on a catch-all waitpid(-1, ...) call. diff --git a/Documentation/technical/ds.txt b/Documentation/technical/ds.txt index cbd06cfb..afead2f1 100644 --- a/Documentation/technical/ds.txt +++ b/Documentation/technical/ds.txt @@ -1,9 +1,14 @@ PublicInbox::DS - event loop and async I/O base class -Our PublicInbox::DS event loop which powers public-inbox-nntpd -and public-inbox-httpd diverges significantly from the -unmaintained Danga::Socket package we forked from. In fact, -it's probably different from most other event loops out there. +Our PublicInbox::DS event loop which powers most of our long-lived +processes(*) diverges significantly from the unmaintained Danga::Socket +package we forked from. In fact, it's probably different from most +other event loops out there. + +Most notably, it uses one-shot, level-trigger, and edge-trigger +modes of kqueue|epoll depending on the situation. + +(*) public-inbox-netd,(-httpd,-imapd,-nntpd,-pop3d,-watch) + lei-daemon Most notably: @@ -14,7 +19,7 @@ Most notably: triggers a call. The lack of read/write callback distinction is driven by the - fact TLS libraries (e.g. OpenSSL via IO::Socket::SSL) may + fact that TLS libraries (e.g. OpenSSL via IO::Socket::SSL) may declare SSL_WANT_READ on SSL_write(), and SSL_WANT_READ on SSL_read(). 
So we end up having to let each user object decide whether it wants to make read or write calls depending on its @@ -30,7 +35,7 @@ Most notably: Reducing the user-supplied code down to a single callback allows subclasses to keep their logic self-contained. The combination of this change and one-shot wakeups (see below) for bidirectional - data flows make asynchronous code easier to reason about. + data flows makes asynchronous code easier to reason about. Other divergences: @@ -48,7 +53,7 @@ Other divergences: Augmented features: -* obj->write(CODEREF) passes the object itself to the CODEREF +* obj->write(CODEREF) passes the object itself to the CODEREF. Being able to enqueue subroutine calls is a powerful feature in Danga::Socket for keeping linear logic in an asynchronous environment. Unfortunately, each subroutine takes several kilobytes of memory. @@ -64,8 +69,8 @@ Augmented features: * ->requeue support. An optimization of the AddTimer(0, ...) idiom for immediately dispatching code at the next event loop iteration. public-inbox uses this for fairly generating large responses - iteratively (see PublicInbox::NNTP::long_response or the use of - ->getline callbacks for generating gigantic gzipped mboxes). + iteratively (see PublicInbox::NNTP::long_response or ibx_async_cat + for blob retrievals). New features @@ -77,12 +82,11 @@ New features which (if any) events it's interested in for the next loop iteration. * Edge-triggering available via EPOLLET or EV_CLEAR. These reduce wakeups - for unidirectional classes (e.g. PublicInbox::Listener sockets, - and pipes via PublicInbox::HTTPD::Async). + for unidirectional classes when throughput is more important than fairness. * IO::Socket::SSL support (for NNTPS, STARTTLS+NNTP, HTTPS) -* dwaitpid (waitpid wrapper) support for reaping dead children +* awaitpid (waitpid wrapper) support for reaping dead children * reliable signal wakeups are supported via signalfd on Linux, EVFILT_SIGNAL on *BSDs via IO::KQueue. 
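A toy sketch of the single-callback, ->write(CODEREF) and ->requeue ideas from ds.txt above. ToyDS, its fields, and more_chunks are hypothetical names made up for illustration. This is not the PublicInbox::DS API, and it performs no real socket I/O or epoll/kqueue calls; it only models the control flow under those assumptions.

```perl
#!/usr/bin/perl
# Toy model only: ToyDS is hypothetical and is NOT PublicInbox::DS.
use strict;
use warnings;

package ToyDS;

my @next_tick; # objects to dispatch on the next loop iteration

sub new { bless { wbuf => [], out => '', queued => 0 }, shift }

# schedule $self for the next loop iteration, at most once per iteration
sub requeue {
    my ($self) = @_;
    push @next_tick, $self unless $self->{queued}++;
}

# enqueue a string or a CODEREF; a CODEREF is later called with $self
sub write {
    my ($self, $data) = @_;
    push @{$self->{wbuf}}, $data;
    $self->requeue;
}

# the single callback: the object itself decides what to do next
sub event_step {
    my ($self) = @_;
    my $next = shift(@{$self->{wbuf}}) // return;
    if (ref($next) eq 'CODE') {
        $next->($self);
    } else {
        $self->{out} .= $next; # stand-in for a real socket write
    }
    $self->requeue if @{$self->{wbuf}};
}

# run until no object is scheduled
sub run_loop {
    while (my @todo = splice(@next_tick)) {
        for my $obj (@todo) {
            $obj->{queued} = 0;
            $obj->event_step;
        }
    }
}

package main;

# a named sub avoids allocating a new closure for every chunk
sub more_chunks {
    my ($self) = @_;
    $self->{out} .= "chunk $self->{n}\n";
    $self->write(\&more_chunks) if --$self->{n} > 0;
}

my $obj = ToyDS->new;
$obj->{n} = 3;
$obj->write("header\n");
$obj->write(\&more_chunks);
ToyDS::run_loop();
print $obj->{out};
```

Each chunk is produced in its own loop iteration, so other objects scheduled in the same loop would get a turn between chunks. That is the fairness property the ->requeue text describes, shown here without timers.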
diff --git a/Documentation/technical/memory.txt b/Documentation/technical/memory.txt new file mode 100644 index 00000000..039694c3 --- /dev/null +++ b/Documentation/technical/memory.txt @@ -0,0 +1,56 @@ +semi-automatic memory management in public-inbox +------------------------------------------------ + +The majority of public-inbox is implemented in Perl 5, a +language and interpreter not particularly known for being +memory-efficient. + +We strive to keep processes small to improve locality, allow +the kernel to cache more files, and to be a good neighbor to +other processes running on the machine. Taking advantage of +automatic reference counting (ARC) in Perl allows us to +deterministically release memory back to the heap. + +We start with a simple data model with few circular +references. This both eases human understanding and reduces +the likelihood of bugs. + +Knowing the relative sizes and quantities of our data +structures, we limit the scope of allocations as much as +possible and keep large allocations shortest-lived. This +minimizes both the cognitive overhead on humans in addition +to reducing memory pressure on the machine. + +Short-lived non-immortal closures (aka "anonymous subs") are +avoided in long-running daemons unless required for +compatibility with PSGI. Closures are memory-intensive and +may make allocation lifetimes less obvious to humans. They +are also the source of memory leaks in older versions of +Perl, including 5.16.3 found in enterprise distros. + +We also use Perl's `delete' and `undef' built-ins to drop +reference counts sooner than scope allows. These functions +are required to break the few reference cycles we have that +would otherwise lead to leaks. + +Of note, `undef' may be used in two ways: + +1. to free(3) the underlying buffer: + + undef $scalar; + +2. 
to reset a buffer but reduce realloc(3) on subsequent growth: + + $scalar = ""; # useful when repeated appending + $scalar = undef; # usually not needed + +In the future, our internal data model will be further +flattened and simplified to reduce the overhead imposed by +small objects. Large allocations may also be avoided by +optionally using Inline::C. + +Finally, the mwrap-perl LD_PRELOAD wrapper was ported to Perl 5 +and enhanced to provide live memory usage tracking on 64-bit systems +with minimal performance impact on production traffic: + + git clone https://80x24.org/mwrap-perl.git diff --git a/Documentation/technical/weird-stuff.txt b/Documentation/technical/weird-stuff.txt new file mode 100644 index 00000000..0c8d6891 --- /dev/null +++ b/Documentation/technical/weird-stuff.txt @@ -0,0 +1,22 @@ +There's a lot of weird code in public-inbox which may be daunting +to new hackers. + +* The event loop (PublicInbox::DS) is an evolution of a fairly standard + C10K event loop. See ds.txt in this directory for more. + +Things got weirder in 2021: + +* The lei command-line tool is backed by a daemon. This was done to + improve startup time for shell completion and manage git/SQLite/Xapian + single-writer during long, parallel imports. It may eventually become + a read-write IMAP/JMAP server. + +* SOCK_SEQPACKET is used extensively in lei, and will likely make its + way into more places, still. + +And even more so in 2022: + +* public-inbox-clone / PublicInbox::LeiMirror relies on ->DESTROY + for make-like dependency management while providing parallelism. + +More to come, lei will expose Maildirs via FUSE 3... 
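One way ->DESTROY can drive make-like ordering, as mentioned for public-inbox-clone above, is to let every prerequisite hold a reference to the target it unblocks. The target then runs when the last prerequisite is released. The sketch below assumes a hypothetical OnDestroy helper and runs its "jobs" sequentially in one process; it is not taken from PublicInbox::LeiMirror and does not show the parallelism.

```perl
#!/usr/bin/perl
# Hypothetical sketch: OnDestroy is made up, it is not a public-inbox class.
use strict;
use warnings;

package OnDestroy;

# remember a callback and its arguments
sub new {
    my ($class, $cb, @args) = @_;
    bless [ $cb, @args ], $class;
}

# runs when the last reference to the object is dropped
sub DESTROY {
    my ($cb, @args) = @{$_[0]};
    $cb->(@args);
}

package main;

my @log;

# the "link" target must not run before both "compile" jobs are done
my $link = OnDestroy->new(sub { push @log, "link @_" }, qw(a.o b.o));

# every prerequisite job holds a reference to the target it unblocks
my @jobs = map { +{ name => $_, then => $link } } qw(a.o b.o);
undef $link; # from here on, only the jobs keep the target alive

while (my $job = shift @jobs) {
    push @log, "compile $job->{name}";
    # a finished job is released once $job is reassigned or leaves scope
}

print "$_\n" for @log;
```

Because Perl reference counting is deterministic, the "link" step cannot fire while any job still references it, and no explicit dependency table is needed.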
diff --git a/Documentation/technical/whyperl.txt b/Documentation/technical/whyperl.txt new file mode 100644 index 00000000..db1d9793 --- /dev/null +++ b/Documentation/technical/whyperl.txt @@ -0,0 +1,177 @@ +why public-inbox is currently implemented in Perl 5 +--------------------------------------------------- + +While Perl has many detractors and there's a lot not to like +about Perl, we use it anyways because it offers benefits not +(yet) available from other languages. + +This document is somewhat inspired by https://sqlite.org/whyc.html + +Other languages and runtimes may eventually be a possibility +for us, and this document can serve as our requirements list +for possible replacements. + +As always, comments and corrections and additions welcome at +<meta@public-inbox.org>. We're not Perl experts, either. + +Good Things +----------- + +* Availability + + Perl 5 is installed on many, if not most GNU/Linux and + BSD-based servers and workstations. It is likely the most + widely installed programming environment that offers a + significant amount of POSIX functionality. Users won't + have to waste bandwidth or space with giant toolchains or + architecture-specific binaries. + + Furthermore, Perl documentation is typically installed + locally as manpages, allowing users to quickly refer + to documentation as needed. + +* Scripted, always editable by the end user + + Users cannot lose access to the source code. Code written + entirely in any scripting language automatically satisfies + the GPL-2.0, making it easier to satisfy the AGPL-3.0. + + Use of a scripting language improves auditability for + malicious changes. It also reduces storage and bandwidth + requirements for distributors, as the same scripts can be + shared across multiple OSes and architectures. + + Perl's availability and the low barrier to entry of + scripting ensures it's easy for users to exercise their + software freedom. 
+ +* Predictable performance + + While Perl is neither fast nor memory-efficient, its + performance and memory use are predictable and do not + require GC tuning by the user. + + public-inbox is developed for (and mostly on) old + hardware. Perl was fast enough to power the web of the + late 1990s, and any cheap VPS today has more than enough + RAM and CPU for handling plain-text email. + + Low hardware requirements increase the reach of our software + to more users, improving centralization resistance. + +* Compatibility + + Unlike similarly powerful scripting languages, there is no + forced migration to a major new version. From 2000-2020, + Perl had fewer breaking changes than Python or Ruby; we + expect that trend to continue given the inertia of Perl 5. + + As of April 2021, the Perl Steering Committee has confirmed + Perl 7 will require `use v7.0' and existing code should + continue working unchanged: + https://nntp.perl.org/group/perl.perl5.porters/259789 + <CAMvkq_SyTKZD=1=mHXwyzVYYDQb8Go0N0TuE5ZATYe_M4BCm-g@mail.gmail.com> + +* Built for text processing + + Our focus is plain-text mail, and Perl has many built-ins + optimized for text processing. It also has good support + for UTF-8 and legacy encodings found in old mail archives. + +* Integration with distros and non-Perl libraries + + Perl modules and bindings to common libraries such as + SQLite and Xapian are already distributed by many + GNU/Linux distros and BSD ports. + + There should be no need to rely on language-specific + package managers such as cpan(1), those systems increase + the learning curve for users and system administrators. + +* Compactness and terseness + + Less code generally means less bugs. We try to avoid the + "line noise" stereotype of some Perl codebases, yet still + manage to write less code than one would with + non-scripting languages. + +* Performance ceiling and escape hatch + + With optional Inline::C, we can be "as fast as C" in some + cases. 
Inline::C is widely packaged by distros and it + gives us an escape hatch for dealing with missing bindings + or performance problems should they arise. Inline::C use + (as opposed to XS) also preserves the software freedom and + auditability benefits to all users. + + Unfortunately, most C toolchains are big; so Inline::C + will always be optional for users who cannot afford the + bandwidth or space. + + +Bad Things +---------- + +* Slow startup time. Tokenization, parsing, and compilation of + pure Perl is not cached. Inline::C does cache its results, + however. + + We work around slow startup times in tests by preloading + code, similar to how mod_perl works for CGI. + +* High space overhead and poor locality of small data + structures, including the optree. This may not be fixable + in Perl itself given compatibility requirements of the C API. + + These problems are exacerbated on modern 64-bit platforms, + though the Linux x32 ABI offers promise. + +* Lack of vectored I/O support (writev, sendmmsg, etc. syscalls) + and "newer" POSIX functions in general. APIs end up being + slurpy, favoring large buffers and memory copies for + concatenation rather than rope (aka "cord") structures. + +* While mmap(2) is available via PerlIO::mmap, string ops + (m//, substr(), index(), etc.) still require memory copies + into userspace, negating a benefit of zero-copy. + +* The XS/C API makes it difficult to improve internals while + preserving compatibility. + +* Lack of optional type checking. This may be a blessing in + disguise, though, as it encourages us to simplify our data + models and lowers cognitive overhead. + +* SMP support is mostly limited to fork(), since many + libraries (including much of the standard library) are not + thread-safe. Even with threads.pm, sharing data between + interpreters within the same process is inefficient due to + the lack of lock-free and wait-free data structures from + projects such as Userspace RCU. 
+ +* Process spawning speed degrades as memory use increases. + We work around this optionally via Inline::C and vfork(2), + since Perl lacks an approximation of posix_spawn(3). + + We also use `undef' and `delete' ops to free large buffers + as soon as we're done using them to save memory. + + +Red herrings to ignore when evaluating other runtimes +----------------------------------------------------- + +These don't discount a language or runtime from being +used, they're just not interesting. + +* Lightweight threading + + While lightweight threading implementations are + convenient, they tend to be significantly heavier than + pure event-loop systems (or multi-threaded event-loop + systems). + + Lightweight threading implementations have stack overhead + and growth typically measured in kilobytes. The userspace + state overhead of event-based systems is an order of + magnitude less, and a sunk cost regardless of concurrency + model.
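The `undef' and `delete' usage described in memory.txt and whyperl.txt above can be sketched as follows. The Node class, the %ctx hash, and the buffer sizes are hypothetical and exist only to make the effect observable; this is a minimal illustration, not code from public-inbox.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $destroyed = 0;

package Node; # hypothetical class, only here to count DESTROY calls
sub new { bless {}, shift }
sub DESTROY { $destroyed++ }

package main;

# 1. drop a large buffer as soon as it is no longer needed
my $buf = 'x' x (1 << 20);
undef $buf;            # releases the underlying buffer

# 2. reset a buffer that will be appended to again
my $acc = 'partial line';
$acc = '';             # empties the string for repeated appending

# 3. delete a large hash field early, the rest of the hash lives on
my %ctx = (env => {}, body => 'y' x (1 << 20));
delete $ctx{body};

# 4. break a reference cycle so reference counting can free both objects
{
    my $x = Node->new;
    my $y = Node->new;
    $x->{peer} = $y;
    $y->{peer} = $x;   # cycle: neither would be freed at scope exit
    delete $x->{peer}; # cycle broken
}
print "destroyed=$destroyed\n";
```

Without the `delete' in step 4, neither DESTROY would run until interpreter shutdown, which is the kind of leak the memory.txt text warns about.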