public-inbox.git  about / heads / tags
an "archives first" approach to mailing lists
blob 11f7804121bcd22c30d2959d34c019b645cd8556 7757 bytes (raw)
$ git show HEAD:Documentation/technical/data_structures.txt	# shows this blob on the CLI

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
 
Internal data structures of public-inbox

This is a guide for hackers new to our code base.  Do not
consider our internal data structures stable for external
consumers, this document should be updated when internals
change.  I recommend reading this document from the source tree,
with the source code easily accessible if you need examples.

This mainly documents in-memory data structures.  If you're
interested in the stable on-filesystem formats, see the
public-inbox-config(5), public-inbox-v1-format(5) and
public-inbox-v2-format(5) manpages.

Common abbreviations when used outside of their packages are
documented.  `$self' is the common variable name when used
within their package.

PublicInbox::Config
-------------------

PublicInbox::Config is the root class which loads a
public-inbox-config file and instantiates PublicInbox::Inbox,
PublicInbox::WWW, PublicInbox::NNTPD, and other top-level
classes.

Outside of tests, this is typically a singleton.

Per-message classes
-------------------

* PublicInbox::Eml - Email::MIME-like class
  Common abbreviation: $mime, $eml
  Used by: PublicInbox::WWW, PublicInbox::SearchIdx

  A representation of an entire email, multipart or not.
  An option to use libgmime or libmailutils may be supported
  in the future for performance and memory use.

  This can be a memory hog with big messages and giant
  attachments, so our PublicInbox::WWW interface only keeps
  one object of this class in memory at a time.

  In other words, this is the "meat" of the message, whereas
  $smsg (below) is just the "skeleton".

  Our PublicInbox::V2Writable class may have two objects of this
  type in memory at a time for deduplication.

  In public-inbox 1.4 and earlier, Email::MIME and its subclass,
  PublicInbox::MIME were used.  Despite still slurping,
  PublicInbox::Eml is faster and uses less memory due to
  lazy header parsing and lazy subpart instantiation with
  shorter object lifetimes.

* PublicInbox::Smsg - small message skeleton
  Used by: PublicInbox::{NNTP,WWW,SearchIdx}
  Common abbreviation: $smsg

  Represents headers shown in NNTP overview and PSGI message
  summaries (thread skeleton).

  This is loaded from either the overview DB (over.sqlite3) or
  the Xapian DB (docdata.glass), though the Xapian docdata
  won't hold NNTP-only fields (Cc:/To:).

  There may be hundreds or thousands of these objects in memory
  at a time, so fields are pruned if unneeded.

* PublicInbox::SearchThread::Msg - subclass of Smsg
  Common abbreviation: $cont or $node
  Used by: PublicInbox::WWW

  The structure we use for a non-recursive[1] variant of
  JWZ's algorithm: <https://www.jwz.org/doc/threading.html>.
  Nowadays, this is a re-blessed $smsg with additional fields.

  As with $smsg objects, there may be hundreds or thousands
  of these objects in memory at a time.

  We also do not use a linked list for storing children as JWZ
  describes, but instead a Perl hashref for {children} which
  becomes an arrayref upon sorting.

  [1] https://rt.cpan.org/Ticket/Display.html?id=116727

Per-inbox classes
-----------------

* PublicInbox::Inbox - represents a single public-inbox
  Common abbreviation: $ibx
  Used everywhere.

  This represents a "publicinbox" section in the config
  file, see public-inbox-config(5) for details.

* PublicInbox::Git - represents a single git repository
  Common abbreviation: $git, $ibx->git
  Used everywhere.

  Each configured "publicinbox" or "coderepo" has one of these.

* PublicInbox::Msgmap - msgmap.sqlite3 read-write interface
  Common abbreviation: $mm, $ibx->mm
  Used everywhere if SQLite is available.

  Each indexed inbox has one of these, see
  public-inbox-v1-format(5) and public-inbox-v2-format(5)
  manpages for details.

* PublicInbox::Over - over.sqlite3 read-only interface
  Common abbreviation: $over, $ibx->over
  Used everywhere if SQLite is available.

  Each indexed inbox has one of these, see
  public-inbox-v1-format(5) and public-inbox-v2-format(5)
  manpages for details.

* PublicInbox::Search - Xapian read-only interface
  Common abbreviation: $srch, $ibx->search
  Used everywhere if Xapian is available.

  Each indexed inbox has one of these, see
  public-inbox-v1-format(5) and public-inbox-v2-format(5)
  manpages for details.

PublicInbox::WWW
----------------

The main PSGI web interface, uses several other packages to
form our web interface.

PublicInbox::SolverGit
----------------------

This is instantiated from the $INBOX/$BLOB_OID/s/ WWW endpoint
and represents the stages and states for "solving" a blob by
searching for and applying patches.  See the code and comments
in PublicInbox/SolverGit.pm

PublicInbox::Qspawn
-------------------

This is instantiated from various WWW endpoints and represents
the stages and states for running and managing subprocesses
in a way which won't exceed configured process limits defined
via "publicinboxlimiter.*" directives in public-inbox-config(5).

ad-hoc structures shared across packages
----------------------------------------

* $ctx - PublicInbox::WWW app request context
  This holds the PSGI $env as well as any internal variables
  used by various modules of PublicInbox::WWW.

  As with the PSGI $env, there is one per active WWW
  request+response cycle.  It does not exist for idle HTTP
  clients.

daemon classes
--------------

* PublicInbox::NNTP - a NNTP client socket
  Common abbreviation: $nntp
  Used by: PublicInbox::DS, public-inbox-nntpd

  Unlike PublicInbox::HTTP, all of the NNTP client logic for
  serving to NNTP clients is here, including what would be
  in $ctx on the HTTP or WWW side.

  There may be thousands of these since we support thousands of
  NNTP clients.

* PublicInbox::HTTP - a HTTP client socket
  Common abbreviation: $http
  Used by: PublicInbox::DS, public-inbox-httpd

  Unlike PublicInbox::NNTP, this class has no knowledge of any of
  the email- or git-specific parts of public-inbox, only PSGI.
  However, it supports APIs and behaviors (e.g. streaming large
  responses) which PublicInbox::WWW may take advantage of.

  There may be thousands of these since we support thousands of
  HTTP clients.

* PublicInbox::Listener - a SOCK_STREAM listen socket (TCP or Unix)
  Used by: PublicInbox::DS, public-inbox-httpd, public-inbox-nntpd
  Common abbreviation: @listeners in PublicInbox::Daemon

  This class calls non-blocking accept(2) or accept4(2) on a
  listen socket to create new PublicInbox::HTTP and
  PublicInbox::NNTP instances.

* PublicInbox::HTTPD
  Common abbreviation: $httpd

  Represents an HTTP daemon which creates PublicInbox::HTTP
  wrappers around client sockets accepted from
  PublicInbox::Listener.

  Since the SERVER_NAME and SERVER_PORT PSGI variables need to be
  exposed for HTTP/1.0 requests when Host: headers are missing,
  this is per Listener socket.

* PublicInbox::HTTPD::Async
  Common abbreviation: $async

  Used for implementing an asynchronous "push" interface for
  slow, expensive responses which may require spawning
  git-httpd-backend(1), git-apply(1) or other commands.
  This will also be used for dealing with future asynchronous
  operations such as HTTP reverse proxying and slow storage
  retrieval operations.

* PublicInbox::NNTPD
  Common abbreviation: $nntpd

  Represents an NNTP daemon which creates PublicInbox::NNTP
  wrappers around client sockets accepted from
  PublicInbox::Listener.

  This is currently a singleton, but it is associated with a
  given PublicInbox::Config which may be instantiated more than
  once in the future.

* PublicInbox::EOFpipe

  Used throughout to trigger a callback when a pipe(7) is closed.
  This is frequently used to portably detect process exit without
  relying on a catch-all waitpid(-1, ...) call.

git clone https://public-inbox.org/public-inbox.git
git clone http://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/public-inbox.git