From: Derrick Stolee
To: git@vger.kernel.org
Cc: gitster@pobox.com, sbeller@google.com, pclouds@gmail.com, avarab@gmail.com, dstolee@microsoft.com
Subject: [PATCH v3 01/24] multi-pack-index: add design document
Date: Thu, 5 Jul 2018 20:52:58 -0400
Message-Id: <20180706005321.124643-2-dstolee@microsoft.com>
X-Mailer: git-send-email 2.18.0.118.gd4f65b8d14
In-Reply-To: <20180706005321.124643-1-dstolee@microsoft.com>
References: <20180625143434.89044-1-dstolee@microsoft.com> <20180706005321.124643-1-dstolee@microsoft.com>
X-Mailing-List: git@vger.kernel.org

Signed-off-by: Derrick Stolee
---
 Documentation/technical/multi-pack-index.txt | 109 +++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git
object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its
+packfile. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The pack.multiIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the
+  most-recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+  contents of the multi-pack-index file match the offsets listed in
+  the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)

base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.118.gd4f65b8d14
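
As a rough illustration of the structure the design notes in this patch describe (a sorted list of object IDs, each mapped to a packfile number j and an offset, with one record kept per object ID), here is a minimal in-memory sketch in Python. It is not Git's implementation and does not model the on-disk format: `build_midx`, `midx_lookup`, the tuple layout, and the caller-supplied modification times are all hypothetical names and assumptions made for illustration.

```python
from bisect import bisect_left


def build_midx(packs):
    """Build a toy multi-pack index (hypothetical layout, not Git's format).

    `packs` is a list of (pack_name, mtime, {oid: offset}) tuples.
    Returns (pack_names, oids, entries), where `oids` is sorted and
    entries[i] == (j, offset) locates oids[i] inside pack_names[j].
    """
    pack_names = [name for name, _, _ in packs]
    best = {}  # oid -> (mtime, j, offset) of the copy we keep
    for j, (_, mtime, objects) in enumerate(packs):
        for oid, offset in objects.items():
            # One record per object ID: prefer the copy in the most
            # recently modified packfile.
            if oid not in best or mtime > best[oid][0]:
                best[oid] = (mtime, j, offset)
    oids = sorted(best)
    entries = [(best[oid][1], best[oid][2]) for oid in oids]
    return pack_names, oids, entries


def midx_lookup(midx, oid):
    """Binary-search the sorted object IDs; return (pack_name, offset) or None."""
    pack_names, oids, entries = midx
    i = bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        j, offset = entries[i]
        return pack_names[j], offset
    return None
```

In this sketch a lookup performs one binary search over the combined object list no matter how many packfiles were indexed, which is the O(log N) property the notes mention; without such an index, each pack-index would need its own search.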