user/dev discussion of public-inbox itself

% public-inbox developer manual

=head1 NAME

public-inbox v1 git repository and tree description (aka "ssoma")

=head1 DESCRIPTION

WARNING: this does NOT describe the scalable v2 format used
by public-inbox.  Use of ssoma is not recommended for new
installations due to scalability problems.

ssoma uses a git repository to store each email as a git blob.
The tree filename of the blob is based on the SHA-1 hexdigest of
the first Message-ID header.  A commit is made for each message
delivered.  The commit SHA-1 identifier is used by ssoma clients
to track synchronization state.

=head1 PATHNAMES IN TREES

A Message-ID may be extremely long and may contain slashes, so using
one as a path name is challenging.  Instead we use the SHA-1 hexdigest
of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
to generate a path name.  Leading and trailing white space in the
Message-ID header is ignored for hashing.

A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>

Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63

Thus it is easy to look up the contents of a message matching a given
Message-ID.
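
The mapping described above can be sketched in a few lines of Python.
This is only an illustration, not the ssoma implementation: the
function name ssoma_path is hypothetical, and encoding the Message-ID
as UTF-8 before hashing is an assumption.

```python
import hashlib


def ssoma_path(message_id_header: str) -> str:
    """Map a raw Message-ID header value to a 2/38 tree path (sketch)."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id_header.strip()
    # Drop the leading "<" and trailing ">" if present.
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    # Assumption: the Message-ID is hashed as UTF-8 encoded bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()  # 40 hex digits
    # Split into a 2-character directory and a 38-character file name.
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    print(ssoma_path(" <20131106023245.GA20224@dcvr.yhbt.net> "))
```

Splitting the digest after two characters mirrors the fan-out git uses
for loose objects and keeps any single tree from growing too large.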

=head1 MESSAGE-ID CONFLICTS

public-inbox v1 repositories currently do not resolve conflicting
Message-IDs or messages with multiple Message-IDs.

=head1 HEADERS

The Message-ID header is required.
"Bytes", "Lines" and "Content-Length" headers are stripped and not
allowed, as they can interfere with further processing.
When using ssoma with public-inbox-mda, the "Status" mbox header
is also stripped as that header makes no sense in a public archive.
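
A minimal sketch of these header rules, using Python's standard email
package rather than the actual public-inbox code.  The function name
sanitize and the strip_status flag are hypothetical names chosen for
this example.

```python
from email import message_from_bytes
from email.message import Message

# Headers that are always stripped before a message is stored.
STRIPPED = ("Bytes", "Lines", "Content-Length")


def sanitize(raw: bytes, strip_status: bool = False) -> Message:
    """Parse a message, enforce Message-ID, and drop unwanted headers."""
    msg = message_from_bytes(raw)
    if not msg.get("Message-ID"):
        raise ValueError("Message-ID header is required")
    names = STRIPPED + (("Status",) if strip_status else ())
    for name in names:
        # Deletes every occurrence; no error is raised if absent.
        del msg[name]
    return msg
```

Passing strip_status=True corresponds to the public-inbox-mda case,
where the "Status" mbox header is removed as well.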

=head1 LOCKING

An exclusive L<flock(2)> lock is held on the empty $GIT_DIR/ssoma.lock
file for all non-atomic operations.
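
The locking scheme can be illustrated with Python's fcntl module, which
wraps flock(2) on POSIX systems.  This is a sketch under assumptions:
the context manager name ssoma_lock is hypothetical, and creating the
lock file when it is missing is a choice made for this example.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def ssoma_lock(git_dir: str):
    """Hold an exclusive flock on $GIT_DIR/ssoma.lock for the block."""
    lock_path = os.path.join(git_dir, "ssoma.lock")
    # The lock file stays empty; only the lock on it matters.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)  # explicit release
        os.close(fd)
```

Any non-atomic update would then run inside "with ssoma_lock(git_dir):"
so that concurrent writers are serialized.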

=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)

1. Message is delivered to a mail transport agent (MTA)

1a. (optional) reject/discard spam; this should run before ssoma-mda

1b. (optional) reject/strip unwanted attachments

ssoma-mda handles all of the following steps once invoked.

2. Mail transport agent invokes ssoma-mda

3. reads message via stdin, extracting Message-ID

4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock

5. creates or updates the blob at the associated 2/38 SHA-1 path

6. updates the index and commits

7. releases $GIT_DIR/ssoma.lock

ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
and the ability to monitor IMAP mailboxes using IDLE will be available
in the future.
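
Steps 3 through 7 can be sketched as follows.  This is not how
ssoma-mda is implemented: plain files under a hypothetical store_dir
stand in for git blobs, and an append-only commits.log file stands in
for git commits.  The function name deliver and the UTF-8 encoding of
the Message-ID are likewise assumptions made for illustration.

```python
import fcntl
import hashlib
import os
from email import message_from_bytes


def deliver(raw: bytes, store_dir: str) -> str:
    """Store one message; plain files stand in for git objects."""
    # Step 3: parse the message and extract the Message-ID.
    mid = (message_from_bytes(raw).get("Message-ID") or "").strip()
    if not mid:
        raise ValueError("Message-ID header is required")
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    rel = os.path.join(digest[:2], digest[2:])  # 2/38 path

    os.makedirs(store_dir, exist_ok=True)
    lock_fd = os.open(os.path.join(store_dir, "ssoma.lock"),
                      os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Step 4: acquire the exclusive lock.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        # Step 5: create or update the "blob" at the 2/38 path.
        dest = os.path.join(store_dir, "tree", rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as fh:
            fh.write(raw)
        # Step 6: record the update (stand-in for index update + commit).
        with open(os.path.join(store_dir, "commits.log"), "a") as log:
            log.write(rel + "\n")
    finally:
        # Step 7: release the lock.
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
    return rel
```

Delivering the same Message-ID twice overwrites the stored file and
records a second log entry, matching the "creates or updates" wording
of step 5.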

=head1 GIT REPOSITORIES (SERVERS)

ssoma uses bare git repositories on both servers and clients.

Using the L<git-init(1)> command with --bare is the recommended method
of creating a git repository on a server:

	git init --bare /path/to/wherever/you/want.git

There are no standardized paths for servers; administrators make
all the choices regarding git repository locations.

Special files in $GIT_DIR on the server:

=over

=item $GIT_DIR/ssoma.lock

An empty file for L<flock(2)> locking.
This is necessary to ensure the index and commits are updated
consistently and multiple processes running MDA do not step on
each other.

=item $GIT_DIR/public-inbox/msgmap.sqlite3

SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
article numbers.  Used by L<public-inbox-nntpd(1)> and created
and updated by L<public-inbox-index(1)>.

Users of the L<PublicInbox::WWW> interface will find it
useful for attempting recovery from copy-paste truncations of
URLs containing long Message-IDs.

Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.

Losing or damaging this file will cause synchronization problems for
NNTP clients.  This file is expected to be stable and require no
updates to its schema.

Requires L<DBD::SQLite>.

=item $GIT_DIR/public-inbox/xapian$N/

Xapian database for search indices in the PSGI web UI.

$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
installations may have parallel versions on disk during upgrades
or to roll-back upgrades.

This is created and updated by L<public-inbox-index(1)>.

Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.

This directory can always be regenerated with L<public-inbox-index(1)>
if it is lost or damaged.  There is no need to back it up unless the
CPU/memory cost of regenerating it outweighs the storage/transfer cost.

Since SCHEMA_VERSION 15 and the development of the v2 format,
the "overview" DB also exists in the xapian directory for v1
repositories.  See L<public-inbox-v2-format(5)/OVERVIEW DB>.

Our use of the L<public-inbox-v2-format(5)/OVERVIEW DB> requires Xapian
document IDs to remain stable.  Using the L<public-inbox-compact(1)> and
L<public-inbox-xcpdb(1)> wrappers is recommended over tools
provided by Xapian.

This directory is large, often two to three times the size of
the objects stored in a packed git repository.

=item $GIT_DIR/ssoma.index

This file is no longer used or created by public-inbox, but it is
updated if it exists to remain compatible with ssoma installations.

A git index file used for MDA updates.  The normal git index (in
$GIT_DIR/index) is not used at all as there is typically no working
tree.

=back
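
The stable Message-ID to article number mapping kept in msgmap.sqlite3
can be illustrated with Python's sqlite3 module.  The table and column
names below (msgmap, num, mid) and both function names are assumptions
made for this sketch; they are not taken from the public-inbox schema.

```python
import sqlite3


def open_msgmap(path: str) -> sqlite3.Connection:
    """Open (or create) a Message-ID to article number map."""
    db = sqlite3.connect(path)
    # AUTOINCREMENT ensures article numbers are never reused.
    db.execute(
        "CREATE TABLE IF NOT EXISTS msgmap ("
        "num INTEGER PRIMARY KEY AUTOINCREMENT, "
        "mid TEXT NOT NULL UNIQUE)"
    )
    return db


def article_number(db: sqlite3.Connection, mid: str) -> int:
    """Return the article number for mid, assigning one if needed."""
    row = db.execute(
        "SELECT num FROM msgmap WHERE mid = ?", (mid,)
    ).fetchone()
    if row is not None:
        return row[0]  # already mapped: the number stays stable
    with db:  # commits on success, rolls back on error
        cur = db.execute("INSERT INTO msgmap (mid) VALUES (?)", (mid,))
    return cur.lastrowid
```

Because a Message-ID keeps the number it was first assigned, NNTP
clients that track article numbers stay in sync; this is also why
losing the file causes synchronization problems.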

Each client $GIT_DIR may have multiple mbox/maildir/command targets.
It is possible for a client to extract the mail stored in the git
repository to multiple mboxes for compatibility with a variety of
different tools.

=head1 CAVEATS

It is NOT recommended to check out the working directory of a git
repository, as there may be many files.

It is impossible to completely expunge messages, even spam, as git
retains full history.  Projects may (with adequate notice) cycle to new
repositories/branches with history cleaned up via L<git-filter-branch(1)>.
This is up to the administrators.

=head1 COPYRIGHT

Copyright 2013-2019 all contributors L<mailto:meta@public-inbox.org>

License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>

=head1 SEE ALSO

L<gitrepository-layout(5)>, L<ssoma(1)>