git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: git@vger.kernel.org, dstolee@microsoft.com
Cc: gitster@pobox.com, sbeller@google.com, pclouds@gmail.com,
	avarab@gmail.com, sunshine@sunshineco.com, szeder.dev@gmail.com
Subject: [PATCH v4 01/23] multi-pack-index: add design document
Date: Thu, 12 Jul 2018 15:39:18 -0400	[thread overview]
Message-ID: <20180712193940.21065-2-dstolee@microsoft.com> (raw)
In-Reply-To: <20180712193940.21065-1-dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 109 +++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The pack.multiIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+  contents of the multi-pack-index file match the offsets listed in
+  the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
-- 
2.18.0.118.gd4f65b8d14


  reply	other threads:[~2018-07-12 19:40 UTC|newest]

Thread overview: 192+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
2018-06-11 19:04   ` Stefan Beller
2018-06-18 18:48     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
2018-06-11 19:19   ` Stefan Beller
2018-06-18 19:01     ` Derrick Stolee
2018-06-18 19:41       ` Stefan Beller
2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
2018-06-07 17:20   ` Duy Nguyen
2018-06-18 19:23     ` Derrick Stolee
2018-06-11 21:02   ` Stefan Beller
2018-06-18 19:40     ` Derrick Stolee
2018-06-18 19:55       ` Stefan Beller
2018-06-18 19:58         ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
2018-06-07 17:27   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
2018-06-07 17:35   ` Duy Nguyen
2018-06-12 15:00   ` Duy Nguyen
2018-06-19 12:54     ` Derrick Stolee
2018-06-19 14:59       ` Duy Nguyen
2018-06-19 15:24         ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
2018-06-07 17:54   ` Duy Nguyen
2018-06-20 13:13     ` Derrick Stolee
2018-06-07 18:31   ` Duy Nguyen
2018-06-20 13:33     ` Derrick Stolee
2018-06-20 15:07       ` Duy Nguyen
2018-06-20 16:39         ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
2018-06-07 18:03   ` Duy Nguyen
2018-06-20 16:33     ` [PATCH] packfile: generalize pack directory list Derrick Stolee
2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
2018-06-07 18:26   ` Duy Nguyen
2018-06-21 15:25     ` Derrick Stolee
2018-06-21 17:38       ` Junio C Hamano
2018-06-22 18:25         ` Derrick Stolee
2018-06-22 18:31           ` Junio C Hamano
2018-06-22 18:32             ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
2018-06-09 16:43   ` Duy Nguyen
2018-06-21 17:23     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
2018-06-09 17:07   ` Duy Nguyen
2018-06-21 17:54     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
2018-06-09 17:25   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
2018-06-09 17:28   ` Duy Nguyen
2018-06-21 19:49     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
2018-06-09 17:41   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
2018-06-09 17:47   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
2018-06-09 17:56   ` Duy Nguyen
2018-06-21 20:03     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
2018-06-09 18:01   ` Duy Nguyen
2018-06-22 18:38     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
2018-06-09 18:03   ` Duy Nguyen
2018-06-22 18:39     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
2018-06-09 18:05   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
2018-06-09 18:13   ` Duy Nguyen
2018-06-22 18:44     ` Derrick Stolee
2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
2018-06-07 14:54   ` Derrick Stolee
2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 01/24] multi-pack-index: add design document Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 02/24] multi-pack-index: add format details Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 03/24] multi-pack-index: add builtin Derrick Stolee
2018-06-25 19:15     ` Junio C Hamano
2018-06-25 14:34   ` [PATCH v2 04/24] multi-pack-index: add 'write' verb Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 05/24] midx: write header information to lockfile Derrick Stolee
2018-06-25 19:19     ` Junio C Hamano
2018-07-05 19:13       ` Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 06/24] multi-pack-index: load into memory Derrick Stolee
2018-06-25 19:38     ` Junio C Hamano
2018-07-05 14:19       ` Derrick Stolee
2018-07-05 18:58         ` Eric Sunshine
2018-07-06 19:20           ` Junio C Hamano
2018-06-25 14:34   ` [PATCH v2 07/24] multi-pack-index: expand test data Derrick Stolee
2018-06-25 19:45     ` Junio C Hamano
2018-06-25 14:34   ` [PATCH v2 08/24] packfile: generalize pack directory list Derrick Stolee
2018-06-25 19:57     ` Junio C Hamano
2018-06-25 14:34   ` [PATCH v2 09/24] multi-pack-index: read packfile list Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 11/24] midx: read pack names into array Derrick Stolee
2018-06-25 23:52     ` Eric Sunshine
2018-06-25 14:34   ` [PATCH v2 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 13/24] midx: write object ids in a chunk Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 14/24] midx: write object id fanout chunk Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 15/24] midx: write object offsets Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 16/24] config: create core.multiPackIndex setting Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 17/24] midx: prepare midxed_git struct Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 18/24] midx: read objects from multi-pack-index Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 19/24] midx: use midx in abbreviation calculations Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 20/24] midx: use existing midx when writing new one Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 21/24] midx: use midx in approximate_object_count Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 22/24] midx: prevent duplicate packfile loads Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
2018-06-25 14:34   ` [PATCH v2 24/24] midx: clear midx on repack Derrick Stolee
2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
2018-07-06  0:52     ` [PATCH v3 01/24] multi-pack-index: add design document Derrick Stolee
2018-07-06  0:52     ` [PATCH v3 02/24] multi-pack-index: add format details Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 03/24] multi-pack-index: add builtin Derrick Stolee
2018-07-06  3:54       ` Eric Sunshine
2018-07-06  0:53     ` [PATCH v3 04/24] multi-pack-index: add 'write' verb Derrick Stolee
2018-07-06  4:07       ` Eric Sunshine
2018-07-06  0:53     ` [PATCH v3 05/24] midx: write header information to lockfile Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 06/24] multi-pack-index: load into memory Derrick Stolee
2018-07-06  4:19       ` Eric Sunshine
2018-07-06  5:18         ` Eric Sunshine
2018-07-09 19:08       ` Junio C Hamano
2018-07-12 16:06         ` Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 07/24] multi-pack-index: expand test data Derrick Stolee
2018-07-06  4:36       ` Eric Sunshine
2018-07-06  5:20         ` Eric Sunshine
2018-07-12 14:10         ` Derrick Stolee
2018-07-12 18:02           ` Eric Sunshine
2018-07-12 18:06             ` Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 08/24] packfile: generalize pack directory list Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 09/24] multi-pack-index: read packfile list Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 11/24] midx: read pack names into array Derrick Stolee
2018-07-06  4:58       ` Eric Sunshine
2018-07-06  0:53     ` [PATCH v3 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 13/24] midx: write object ids in a chunk Derrick Stolee
2018-07-06  5:04       ` Eric Sunshine
2018-07-06  0:53     ` [PATCH v3 14/24] midx: write object id fanout chunk Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 15/24] midx: write object offsets Derrick Stolee
2018-07-06  5:27       ` Eric Sunshine
2018-07-12 16:33         ` Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 16/24] config: create core.multiPackIndex setting Derrick Stolee
2018-07-06  5:39       ` Eric Sunshine
2018-07-12 13:19         ` Derrick Stolee
2018-07-12 16:30           ` Derrick Stolee
2018-07-11  9:48       ` SZEDER Gábor
2018-07-12 13:01         ` Derrick Stolee
2018-07-12 13:31           ` SZEDER Gábor
2018-07-12 15:40             ` Derrick Stolee
2018-07-12 17:29             ` Junio C Hamano
2018-07-06  0:53     ` [PATCH v3 17/24] midx: prepare midxed_git struct Derrick Stolee
2018-07-06  5:41       ` Eric Sunshine
2018-07-06  0:53     ` [PATCH v3 18/24] midx: read objects from multi-pack-index Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 19/24] midx: use midx in abbreviation calculations Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 20/24] midx: use existing midx when writing new one Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 21/24] midx: use midx in approximate_object_count Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 22/24] midx: prevent duplicate packfile loads Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
2018-07-06  0:53     ` [PATCH v3 24/24] midx: clear midx on repack Derrick Stolee
2018-07-06  5:52       ` Eric Sunshine
2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
2018-07-12 19:39       ` Derrick Stolee [this message]
2018-07-12 19:39       ` [PATCH v4 02/23] multi-pack-index: add format details Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 03/23] multi-pack-index: add builtin Derrick Stolee
2018-07-20 18:22         ` Junio C Hamano
2018-07-20 22:15           ` brian m. carlson
2018-07-20 22:28             ` Junio C Hamano
2018-07-12 19:39       ` [PATCH v4 04/23] multi-pack-index: add 'write' verb Derrick Stolee
2018-07-12 22:56         ` Eric Sunshine
2018-07-12 19:39       ` [PATCH v4 05/23] midx: write header information to lockfile Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 06/23] multi-pack-index: load into memory Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 07/23] t5319: expand test data Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 08/23] packfile: generalize pack directory list Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 09/23] multi-pack-index: read packfile list Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 10/23] multi-pack-index: write pack names in chunk Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 11/23] midx: read pack names into array Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 12/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 13/23] midx: write object ids in a chunk Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 14/23] midx: write object id fanout chunk Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 15/23] midx: write object offsets Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 16/23] config: create core.multiPackIndex setting Derrick Stolee
2018-07-12 21:05         ` Junio C Hamano
2018-07-13  0:50           ` Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 17/23] midx: read objects from multi-pack-index Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 18/23] midx: use midx in abbreviation calculations Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 19/23] midx: use existing midx when writing new one Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 20/23] midx: use midx in approximate_object_count Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 21/23] midx: prevent duplicate packfile loads Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 22/23] packfile: skip loading index if in multi-pack-index Derrick Stolee
2018-07-12 19:39       ` [PATCH v4 23/23] midx: clear midx on repack Derrick Stolee
2018-07-12 21:11       ` [PATCH v4 00/23] Multi-pack-index (MIDX) Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180712193940.21065-2-dstolee@microsoft.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=sbeller@google.com \
    --cc=sunshine@sunshineco.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).