Diffstat (limited to 'Documentation/technical')
-rw-r--r--  Documentation/technical/data_structures.txt | 229
-rw-r--r--  Documentation/technical/ds.txt              |  28
-rw-r--r--  Documentation/technical/memory.txt          |  56
-rw-r--r--  Documentation/technical/weird-stuff.txt     |  22
-rw-r--r--  Documentation/technical/whyperl.txt         | 177
5 files changed, 500 insertions, 12 deletions
diff --git a/Documentation/technical/data_structures.txt b/Documentation/technical/data_structures.txt
new file mode 100644
index 00000000..11f78041
--- /dev/null
+++ b/Documentation/technical/data_structures.txt
@@ -0,0 +1,229 @@
+Internal data structures of public-inbox
+
+This is a guide for hackers new to our code base.  Do not
+consider our internal data structures stable for external
+consumers; this document should be updated when internals
+change.  I recommend reading this document from the source tree,
+with the source code easily accessible if you need examples.
+
+This mainly documents in-memory data structures.  If you're
+interested in the stable on-filesystem formats, see the
+public-inbox-config(5), public-inbox-v1-format(5) and
+public-inbox-v2-format(5) manpages.
+
+Common abbreviations when used outside of their packages are
+documented.  `$self' is the common variable name when used
+within their package.
+
+PublicInbox::Config
+-------------------
+
+PublicInbox::Config is the root class which loads a
+public-inbox-config file and instantiates PublicInbox::Inbox,
+PublicInbox::WWW, PublicInbox::NNTPD, and other top-level
+classes.
+
+Outside of tests, this is typically a singleton.
+
+Per-message classes
+-------------------
+
+* PublicInbox::Eml - Email::MIME-like class
+  Common abbreviation: $mime, $eml
+  Used by: PublicInbox::WWW, PublicInbox::SearchIdx
+
+  A representation of an entire email, multipart or not.
+  An option to use libgmime or libmailutils may be supported
+  in the future for performance and memory use.
+
+  This can be a memory hog with big messages and giant
+  attachments, so our PublicInbox::WWW interface only keeps
+  one object of this class in memory at a time.
+
+  In other words, this is the "meat" of the message, whereas
+  $smsg (below) is just the "skeleton".
+
+  Our PublicInbox::V2Writable class may have two objects of this
+  type in memory at a time for deduplication.
+
+  In public-inbox 1.4 and earlier, Email::MIME and its subclass,
+  PublicInbox::MIME were used.  Despite still slurping,
+  PublicInbox::Eml is faster and uses less memory due to
+  lazy header parsing and lazy subpart instantiation with
+  shorter object lifetimes.
+
+* PublicInbox::Smsg - small message skeleton
+  Used by: PublicInbox::{NNTP,WWW,SearchIdx}
+  Common abbreviation: $smsg
+
+  Represents headers shown in NNTP overview and PSGI message
+  summaries (thread skeleton).
+
+  This is loaded from either the overview DB (over.sqlite3) or
+  the Xapian DB (docdata.glass), though the Xapian docdata
+  won't hold NNTP-only fields (Cc:/To:).
+
+  There may be hundreds or thousands of these objects in memory
+  at a time, so fields are pruned if unneeded.
+
+* PublicInbox::SearchThread::Msg - subclass of Smsg
+  Common abbreviation: $cont or $node
+  Used by: PublicInbox::WWW
+
+  The structure we use for a non-recursive[1] variant of
+  JWZ's algorithm: <https://www.jwz.org/doc/threading.html>.
+  Nowadays, this is a re-blessed $smsg with additional fields.
+
+  As with $smsg objects, there may be hundreds or thousands
+  of these objects in memory at a time.
+
+  We also do not use a linked list for storing children as JWZ
+  describes, but instead a Perl hashref for {children} which
+  becomes an arrayref upon sorting.
+
+  [1] https://rt.cpan.org/Ticket/Display.html?id=116727
+
+Per-inbox classes
+-----------------
+
+* PublicInbox::Inbox - represents a single public-inbox
+  Common abbreviation: $ibx
+  Used everywhere.
+
+  This represents a "publicinbox" section in the config
+  file, see public-inbox-config(5) for details.
+
+* PublicInbox::Git - represents a single git repository
+  Common abbreviation: $git, $ibx->git
+  Used everywhere.
+
+  Each configured "publicinbox" or "coderepo" has one of these.
+
+* PublicInbox::Msgmap - msgmap.sqlite3 read-write interface
+  Common abbreviation: $mm, $ibx->mm
+  Used everywhere if SQLite is available.
+
+  Each indexed inbox has one of these, see
+  public-inbox-v1-format(5) and public-inbox-v2-format(5)
+  manpages for details.
+
+* PublicInbox::Over - over.sqlite3 read-only interface
+  Common abbreviation: $over, $ibx->over
+  Used everywhere if SQLite is available.
+
+  Each indexed inbox has one of these, see
+  public-inbox-v1-format(5) and public-inbox-v2-format(5)
+  manpages for details.
+
+* PublicInbox::Search - Xapian read-only interface
+  Common abbreviation: $srch, $ibx->search
+  Used everywhere if Xapian is available.
+
+  Each indexed inbox has one of these, see
+  public-inbox-v1-format(5) and public-inbox-v2-format(5)
+  manpages for details.
+
+PublicInbox::WWW
+----------------
+
+The main PSGI web interface.  It uses several other packages to
+form our web interface.
+
+PublicInbox::SolverGit
+----------------------
+
+This is instantiated from the $INBOX/$BLOB_OID/s/ WWW endpoint
+and represents the stages and states for "solving" a blob by
+searching for and applying patches.  See the code and comments
+in PublicInbox/SolverGit.pm.
+
+PublicInbox::Qspawn
+-------------------
+
+This is instantiated from various WWW endpoints and represents
+the stages and states for running and managing subprocesses
+in a way which won't exceed configured process limits defined
+via "publicinboxlimiter.*" directives in public-inbox-config(5).
+
+ad-hoc structures shared across packages
+----------------------------------------
+
+* $ctx - PublicInbox::WWW app request context
+  This holds the PSGI $env as well as any internal variables
+  used by various modules of PublicInbox::WWW.
+
+  As with the PSGI $env, there is one per active WWW
+  request+response cycle.  It does not exist for idle HTTP
+  clients.
+
+daemon classes
+--------------
+
+* PublicInbox::NNTP - an NNTP client socket
+  Common abbreviation: $nntp
+  Used by: PublicInbox::DS, public-inbox-nntpd
+
+  Unlike PublicInbox::HTTP, all of the NNTP client logic for
+  serving to NNTP clients is here, including what would be
+  in $ctx on the HTTP or WWW side.
+
+  There may be thousands of these since we support thousands of
+  NNTP clients.
+
+* PublicInbox::HTTP - an HTTP client socket
+  Common abbreviation: $http
+  Used by: PublicInbox::DS, public-inbox-httpd
+
+  Unlike PublicInbox::NNTP, this class has no knowledge of any of
+  the email- or git-specific parts of public-inbox, only PSGI.
+  However, it supports APIs and behaviors (e.g. streaming large
+  responses) which PublicInbox::WWW may take advantage of.
+
+  There may be thousands of these since we support thousands of
+  HTTP clients.
+
+* PublicInbox::Listener - a SOCK_STREAM listen socket (TCP or Unix)
+  Used by: PublicInbox::DS, public-inbox-httpd, public-inbox-nntpd
+  Common abbreviation: @listeners in PublicInbox::Daemon
+
+  This class calls non-blocking accept(2) or accept4(2) on a
+  listen socket to create new PublicInbox::HTTP and
+  PublicInbox::NNTP instances.
+
+* PublicInbox::HTTPD
+  Common abbreviation: $httpd
+
+  Represents an HTTP daemon which creates PublicInbox::HTTP
+  wrappers around client sockets accepted from
+  PublicInbox::Listener.
+
+  Since the SERVER_NAME and SERVER_PORT PSGI variables need to be
+  exposed for HTTP/1.0 requests when Host: headers are missing,
+  this is per Listener socket.
+
+* PublicInbox::HTTPD::Async
+  Common abbreviation: $async
+
+  Used for implementing an asynchronous "push" interface for
+  slow, expensive responses which may require spawning
+  git-httpd-backend(1), git-apply(1) or other commands.
+  This will also be used for dealing with future asynchronous
+  operations such as HTTP reverse proxying and slow storage
+  retrieval operations.
+
+* PublicInbox::NNTPD
+  Common abbreviation: $nntpd
+
+  Represents an NNTP daemon which creates PublicInbox::NNTP
+  wrappers around client sockets accepted from
+  PublicInbox::Listener.
+
+  This is currently a singleton, but it is associated with a
+  given PublicInbox::Config which may be instantiated more than
+  once in the future.
+
+* PublicInbox::EOFpipe
+
+  Used throughout to trigger a callback when a pipe(7) is closed.
+  This is frequently used to portably detect process exit without
+  relying on a catch-all waitpid(-1, ...) call.
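The pipe-based exit detection described for PublicInbox::EOFpipe can be
illustrated with a minimal, blocking sketch.  This is not the actual
PublicInbox::EOFpipe code: the helper name `run_and_wait_via_pipe' is
hypothetical, and a real event loop would register the read end for
readability instead of blocking in sysread.  It only shows why EOF on a
pipe is a usable signal for "the process holding the write end is gone".

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper, not public-inbox code: fork a child running
# $child_cb and block until it exits.  Exit is detected via EOF on a
# pipe, after which only that one PID is reaped.
sub run_and_wait_via_pipe {
	my ($child_cb) = @_;
	pipe(my $r, my $w) or die "pipe: $!";
	my $pid = fork;
	die "fork: $!" unless defined $pid;
	if ($pid == 0) { # child: holds the write end until it exits
		close $r;
		$child_cb->();
		POSIX::_exit(0); # the kernel closes $w on exit
	}
	close $w; # the parent must drop its own copy of the write end
	my $n = sysread($r, my $buf, 1); # returns 0 once all writers are gone
	die "sysread: $!" unless defined $n;
	waitpid($pid, 0); # targeted reap, no waitpid(-1, ...) involved
	return ($n, $? >> 8);
}
```

Because the parent reaps a specific PID only after seeing EOF, it cannot
accidentally collect the exit status of an unrelated child, which is the
hazard of a catch-all waitpid(-1, ...).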
diff --git a/Documentation/technical/ds.txt b/Documentation/technical/ds.txt
index cbd06cfb..afead2f1 100644
--- a/Documentation/technical/ds.txt
+++ b/Documentation/technical/ds.txt
@@ -1,9 +1,14 @@
 PublicInbox::DS - event loop and async I/O base class
 
-Our PublicInbox::DS event loop which powers public-inbox-nntpd
-and public-inbox-httpd diverges significantly from the
-unmaintained Danga::Socket package we forked from.  In fact,
-it's probably different from most other event loops out there.
+Our PublicInbox::DS event loop which powers most of our long-lived
+processes(*) diverges significantly from the unmaintained Danga::Socket
+package we forked from.  In fact, it's probably different from most
+other event loops out there.
+
+Most notably, it uses one-shot, level-trigger, and edge-trigger
+modes of kqueue|epoll depending on the situation.
+
+(*) public-inbox-netd,(-httpd,-imapd,-nntpd,-pop3d,-watch) + lei-daemon
 
 Most notably:
 
@@ -14,7 +19,7 @@ Most notably:
   triggers a call.
 
   The lack of read/write callback distinction is driven by the
-  fact TLS libraries (e.g. OpenSSL via IO::Socket::SSL) may
+  fact that TLS libraries (e.g. OpenSSL via IO::Socket::SSL) may
   declare SSL_WANT_READ on SSL_write(), and SSL_WANT_WRITE on
   SSL_read().  So we end up having to let each user object decide
   whether it wants to make read or write calls depending on its
@@ -30,7 +35,7 @@ Most notably:
   Reducing the user-supplied code down to a single callback allows
   subclasses to keep their logic self-contained.  The combination
   of this change and one-shot wakeups (see below) for bidirectional
-  data flows make asynchronous code easier to reason about.
+  data flows makes asynchronous code easier to reason about.
 
 Other divergences:
 
@@ -48,7 +53,7 @@ Other divergences:
 
 Augmented features:
 
-* obj->write(CODEREF) passes the object itself to the CODEREF
+* obj->write(CODEREF) passes the object itself to the CODEREF.
   Being able to enqueue subroutine calls is a powerful feature in
   Danga::Socket for keeping linear logic in an asynchronous environment.
   Unfortunately, each subroutine takes several kilobytes of memory.
@@ -64,8 +69,8 @@ Augmented features:
 * ->requeue support.  An optimization of the AddTimer(0, ...) idiom
   for immediately dispatching code at the next event loop iteration.
   public-inbox uses this for fairly generating large responses
-  iteratively (see PublicInbox::NNTP::long_response or the use of
-  ->getline callbacks for generating gigantic gzipped mboxes).
+  iteratively (see PublicInbox::NNTP::long_response or ibx_async_cat
+  for blob retrievals).
 
 New features
 
@@ -77,12 +82,11 @@ New features
   which (if any) events it's interested in for the next loop iteration.
 
 * Edge-triggering available via EPOLLET or EV_CLEAR.  These reduce wakeups
-  for unidirectional classes (e.g. PublicInbox::Listener sockets,
-  and pipes via PublicInbox::HTTPD::Async).
+  for unidirectional classes when throughput is more important than fairness.
 
 * IO::Socket::SSL support (for NNTPS, STARTTLS+NNTP, HTTPS)
 
-* dwaitpid (waitpid wrapper) support for reaping dead children
+* awaitpid (waitpid wrapper) support for reaping dead children
 
 * reliable signal wakeups are supported via signalfd on Linux,
   EVFILT_SIGNAL on *BSDs via IO::KQueue.
diff --git a/Documentation/technical/memory.txt b/Documentation/technical/memory.txt
new file mode 100644
index 00000000..039694c3
--- /dev/null
+++ b/Documentation/technical/memory.txt
@@ -0,0 +1,56 @@
+semi-automatic memory management in public-inbox
+------------------------------------------------
+
+The majority of public-inbox is implemented in Perl 5, a
+language and interpreter not particularly known for being
+memory-efficient.
+
+We strive to keep processes small to improve locality, allow
+the kernel to cache more files, and to be a good neighbor to
+other processes running on the machine.  Taking advantage of
+automatic reference counting (ARC) in Perl allows us to
+deterministically release memory back to the heap.
+
+We start with a simple data model with few circular
+references.  This both eases human understanding and reduces
+the likelihood of bugs.
+
+Knowing the relative sizes and quantities of our data
+structures, we limit the scope of allocations as much as
+possible and keep large allocations shortest-lived.  This
+minimizes both the cognitive overhead on humans and the
+memory pressure on the machine.
+
+Short-lived non-immortal closures (aka "anonymous subs") are
+avoided in long-running daemons unless required for
+compatibility with PSGI.  Closures are memory-intensive and
+may make allocation lifetimes less obvious to humans.  They
+are also the source of memory leaks in older versions of
+Perl, including 5.16.3 found in enterprise distros.
+
+We also use Perl's `delete' and `undef' built-ins to drop
+reference counts sooner than scope allows.  These functions
+are required to break the few reference cycles we have that
+would otherwise lead to leaks.
+
+Of note, `undef' may be used in two ways:
+
+1. to free(3) the underlying buffer:
+
+        undef $scalar;
+
+2. to reset a buffer but reduce realloc(3) on subsequent growth:
+
+        $scalar = "";                # useful when repeatedly appending
+        $scalar = undef;        # usually not needed
+
+In the future, our internal data model will be further
+flattened and simplified to reduce the overhead imposed by
+small objects.  Large allocations may also be avoided by
+optionally using Inline::C.
+
+Finally, the mwrap-perl LD_PRELOAD wrapper was ported to Perl 5
+and enhanced to provide live memory usage tracking on 64-bit systems
+with minimal performance impact on production traffic:
+
+        git clone https://80x24.org/mwrap-perl.git
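The reference-counting behavior that memory.txt relies on can be observed
with a small sketch.  The `Tracked' class and both functions are
hypothetical and exist only for illustration; they count destructor calls
to show when Perl actually releases an object.

```perl
use strict;
use warnings;

package Tracked; # hypothetical class which only counts destructor calls
our $destroyed = 0;
sub new { bless {}, shift }
sub DESTROY { $destroyed++ }

package main;

# `undef $obj' drops the last reference before the scope ends
sub early_release {
	my $obj = Tracked->new;
	my $before = $Tracked::destroyed;
	undef $obj; # DESTROY runs here, not at `return'
	return $Tracked::destroyed - $before;
}

# Two objects referencing each other; returns how many were freed
# by the time the inner scope was left.
sub cycle_demo {
	my ($break_cycle) = @_;
	my $before = $Tracked::destroyed;
	{
		my $x = Tracked->new;
		my $y = Tracked->new;
		$x->{peer} = $y;
		$y->{peer} = $x;
		# without this delete, neither refcount can reach zero
		delete $x->{peer} if $break_cycle;
	}
	return $Tracked::destroyed - $before;
}
```

Calling cycle_demo(1) frees both objects as soon as the inner block ends,
while cycle_demo(0) frees neither until interpreter shutdown.  That
difference is the leak which `delete' and `undef' are used to prevent.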
diff --git a/Documentation/technical/weird-stuff.txt b/Documentation/technical/weird-stuff.txt
new file mode 100644
index 00000000..0c8d6891
--- /dev/null
+++ b/Documentation/technical/weird-stuff.txt
@@ -0,0 +1,22 @@
+There's a lot of weird code in public-inbox which may be daunting
+to new hackers.
+
+* The event loop (PublicInbox::DS) is an evolution of a fairly standard
+  C10K event loop.  See ds.txt in this directory for more.
+
+Things got weirder in 2021:
+
+* The lei command-line tool is backed by a daemon.  This was done to
+  improve startup time for shell completion and manage git/SQLite/Xapian
+  single-writer during long, parallel imports.  It may eventually become
+  a read-write IMAP/JMAP server.
+
+* SOCK_SEQPACKET is used extensively in lei, and will likely make its
+  way into more places, still.
+
+And even more so in 2022:
+
+* public-inbox-clone / PublicInbox::LeiMirror relies on ->DESTROY
+  for make-like dependency management while providing parallelism.
+
+More to come, lei will expose Maildirs via FUSE 3...
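As a rough illustration of using ->DESTROY for make-like dependency
management, the sketch below runs a target's action only after every
prerequisite job has dropped its reference to it.  This is an assumption
about the general technique, not the PublicInbox::LeiMirror
implementation: the `RunWhenUnreferenced' class, `record' and `build_demo'
are hypothetical names, and the jobs run sequentially here rather than in
parallel.

```perl
use strict;
use warnings;

package RunWhenUnreferenced; # hypothetical name, not a public-inbox class
sub new {
	my ($class, $cb, @args) = @_;
	bless { cb => $cb, args => \@args }, $class;
}
sub DESTROY {
	my ($self) = @_;
	$self->{cb}->(@{$self->{args}});
}

package main;

sub record { my ($log, $what) = @_; push @$log, $what }

sub build_demo {
	my @log;
	# the "target": its action runs once nothing references it anymore
	my $link = RunWhenUnreferenced->new(\&record, \@log, 'link');
	# each prerequisite job holds a reference to the target it feeds
	my @jobs = map { +{ name => $_, target => $link } } qw(cc-a cc-b);
	undef $link; # from here on, only pending jobs keep the target alive
	while (@jobs) {
		my $job = shift @jobs;
		record(\@log, $job->{name}); # "run" the prerequisite
		undef $job; # finished; the last job to finish triggers DESTROY
	}
	return @log;
}
```

No explicit counter or scheduler is needed: the interpreter's reference
count acts as the "remaining prerequisites" counter, so the order in which
jobs finish does not matter.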
diff --git a/Documentation/technical/whyperl.txt b/Documentation/technical/whyperl.txt
new file mode 100644
index 00000000..db1d9793
--- /dev/null
+++ b/Documentation/technical/whyperl.txt
@@ -0,0 +1,177 @@
+why public-inbox is currently implemented in Perl 5
+---------------------------------------------------
+
+While Perl has many detractors and there's a lot not to like
+about Perl, we use it anyways because it offers benefits not
+(yet) available from other languages.
+
+This document is somewhat inspired by https://sqlite.org/whyc.html
+
+Other languages and runtimes may eventually be a possibility
+for us, and this document can serve as our requirements list
+for possible replacements.
+
+As always, comments and corrections and additions welcome at
+<meta@public-inbox.org>.  We're not Perl experts, either.
+
+Good Things
+-----------
+
+* Availability
+
+  Perl 5 is installed on many, if not most, GNU/Linux and
+  BSD-based servers and workstations.  It is likely the most
+  widely installed programming environment that offers a
+  significant amount of POSIX functionality.  Users won't
+  have to waste bandwidth or space with giant toolchains or
+  architecture-specific binaries.
+
+  Furthermore, Perl documentation is typically installed
+  locally as manpages, allowing users to quickly refer
+  to documentation as needed.
+
+* Scripted, always editable by the end user
+
+  Users cannot lose access to the source code.  Code written
+  entirely in any scripting language automatically satisfies
+  the GPL-2.0, making it easier to satisfy the AGPL-3.0.
+
+  Use of a scripting language improves auditability for
+  malicious changes.  It also reduces storage and bandwidth
+  requirements for distributors, as the same scripts can be
+  shared across multiple OSes and architectures.
+
+  Perl's availability and the low barrier to entry of
+  scripting ensure it's easy for users to exercise their
+  software freedom.
+
+* Predictable performance
+
+  While Perl is neither fast nor memory-efficient, its
+  performance and memory use are predictable and do not
+  require GC tuning by the user.
+
+  public-inbox is developed for (and mostly on) old
+  hardware.  Perl was fast enough to power the web of the
+  late 1990s, and any cheap VPS today has more than enough
+  RAM and CPU for handling plain-text email.
+
+  Low hardware requirements increase the reach of our software
+  to more users, improving centralization resistance.
+
+* Compatibility
+
+  Unlike similarly powerful scripting languages, there is no
+  forced migration to a major new version.  From 2000-2020,
+  Perl had fewer breaking changes than Python or Ruby; we
+  expect that trend to continue given the inertia of Perl 5.
+
+  As of April 2021, the Perl Steering Committee has confirmed
+  Perl 7 will require `use v7.0' and existing code should
+  continue working unchanged:
+  https://nntp.perl.org/group/perl.perl5.porters/259789
+  <CAMvkq_SyTKZD=1=mHXwyzVYYDQb8Go0N0TuE5ZATYe_M4BCm-g@mail.gmail.com>
+
+* Built for text processing
+
+  Our focus is plain-text mail, and Perl has many built-ins
+  optimized for text processing.  It also has good support
+  for UTF-8 and legacy encodings found in old mail archives.
+
+* Integration with distros and non-Perl libraries
+
+  Perl modules and bindings to common libraries such as
+  SQLite and Xapian are already distributed by many
+  GNU/Linux distros and BSD ports.
+
+  There should be no need to rely on language-specific
+  package managers such as cpan(1); those systems increase
+  the learning curve for users and system administrators.
+
+* Compactness and terseness
+
+  Less code generally means fewer bugs.  We try to avoid the
+  "line noise" stereotype of some Perl codebases, yet still
+  manage to write less code than one would with
+  non-scripting languages.
+
+* Performance ceiling and escape hatch
+
+  With optional Inline::C, we can be "as fast as C" in some
+  cases.  Inline::C is widely packaged by distros and it
+  gives us an escape hatch for dealing with missing bindings
+  or performance problems should they arise.  Inline::C use
+  (as opposed to XS) also preserves the software freedom and
+  auditability benefits to all users.
+
+  Unfortunately, most C toolchains are big; so Inline::C
+  will always be optional for users who cannot afford the
+  bandwidth or space.
+
+
+Bad Things
+----------
+
+* Slow startup time.  Tokenization, parsing, and compilation of
+  pure Perl is not cached.  Inline::C does cache its results,
+  however.
+
+  We work around slow startup times in tests by preloading
+  code, similar to how mod_perl works for CGI.
+
+* High space overhead and poor locality of small data
+  structures, including the optree.  This may not be fixable
+  in Perl itself given compatibility requirements of the C API.
+
+  These problems are exacerbated on modern 64-bit platforms,
+  though the Linux x32 ABI offers promise.
+
+* Lack of vectored I/O support (writev, sendmmsg, etc. syscalls)
+  and "newer" POSIX functions in general.  APIs end up being
+  slurpy, favoring large buffers and memory copies for
+  concatenation rather than rope (aka "cord") structures.
+
+* While mmap(2) is available via PerlIO::mmap, string ops
+  (m//, substr(), index(), etc.) still require memory copies
+  into userspace, negating a benefit of zero-copy.
+
+* The XS/C API makes it difficult to improve internals while
+  preserving compatibility.
+
+* Lack of optional type checking.  This may be a blessing in
+  disguise, though, as it encourages us to simplify our data
+  models and lowers cognitive overhead.
+
+* SMP support is mostly limited to fork(), since many
+  libraries (including much of the standard library) are not
+  thread-safe.  Even with threads.pm, sharing data between
+  interpreters within the same process is inefficient due to
+  the lack of lock-free and wait-free data structures from
+  projects such as Userspace RCU.
+
+* Process spawning speed degrades as memory use increases.
+  We work around this optionally via Inline::C and vfork(2),
+  since Perl lacks an approximation of posix_spawn(3).
+
+  We also use `undef' and `delete' ops to free large buffers
+  as soon as we're done using them to save memory.
+
+
+Red herrings to ignore when evaluating other runtimes
+-----------------------------------------------------
+
+These don't disqualify a language or runtime from being
+used; they're just not interesting.
+
+* Lightweight threading
+
+  While lightweight threading implementations are
+  convenient, they tend to be significantly heavier than
+  pure event-loop systems (or multi-threaded event-loop
+  systems).
+
+  Lightweight threading implementations have stack overhead
+  and growth typically measured in kilobytes.  The userspace
+  state overhead of event-based systems is an order of
+  magnitude less, and a sunk cost regardless of concurrency
+  model.