git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/23] Multi-pack-index (MIDX)
@ 2018-06-07 14:03 Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
                   ` (25 more replies)
  0 siblings, 26 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

This patch series includes a rewrite of the previous
multi-pack-index RFC [1] using the feedback from the
commit-graph feature.

I based this series on 'next' as it requires the
recent object-store patches.

The multi-pack-index (MIDX) is explained fully in
the design document 'Documentation/technical/midx.txt'.
The short description is that the MIDX stores the
information from all of the IDX files in a pack
directory. The crucial design decision is that the
IDX files still exist, so we can fall back to the IDX
files if there is any issue with the MIDX (or core.midx
is set to false, or a user downgrades Git, etc.)
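
To make the short description concrete, here is a minimal C sketch
(not code from this series). It merges per-pack index entries into
one sorted list and keeps a single record per object, preferring
the most recently modified pack as the design document describes.
The toy_* names, the 32-bit stand-in for an object ID, and the
explicit mtime field are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one object as seen in one pack-index. */
struct toy_entry {
	uint32_t oid;        /* stand-in for a full object ID */
	uint32_t pack_id;    /* which packfile holds this copy */
	uint64_t offset;     /* offset of the object in that packfile */
	uint64_t pack_mtime; /* modification time of that packfile */
};

/* Order by object ID; among duplicates, newest pack first. */
static int toy_entry_cmp(const void *a, const void *b)
{
	const struct toy_entry *x = a, *y = b;

	if (x->oid != y->oid)
		return x->oid < y->oid ? -1 : 1;
	if (x->pack_mtime != y->pack_mtime)
		return x->pack_mtime > y->pack_mtime ? -1 : 1;
	return 0;
}

/*
 * Sort the combined entries in place and drop duplicate object IDs,
 * keeping the first (newest-pack) copy. Returns the new entry count.
 */
static size_t toy_build_midx(struct toy_entry *e, size_t nr)
{
	size_t i, out = 0;

	if (!nr)
		return 0;
	assert(e);
	qsort(e, nr, sizeof(*e), toy_entry_cmp);
	for (i = 0; i < nr; i++) {
		if (out && e[out - 1].oid == e[i].oid)
			continue; /* older duplicate of the same object */
		e[out++] = e[i];
	}
	return out;
}
```

The IDX files are never modified by this step, which is what makes
the fallback described above possible.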

The MIDX feature has been part of our GVFS releases
for a few months (since the RFC). It has behaved well,
indexing over 31 million commits and trees across up
to 250 packfiles. These MIDX files are nearly 1GB in
size and take ~20 seconds to rewrite when adding new
IDX information. This ~20s mark is something I'd like
to improve, and I mention how to make the file
incremental (similar to split-index) in the design
document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time
after both the MIDX and commit-graph are stable.


Lookup Speedups
---------------

When looking for an object, Git uses a most-recently-used
(MRU) cache of packfiles. This does pretty well to
minimize the number of misses when searching through
packfiles for an object, especially if there is one
"big" packfile that contains most of the objects (so it
will rarely miss and is usually one of the first two
packfiles in the list). The MIDX does provide a way
to remove these misses, improving lookup time. However,
this lookup time greatly depends on the arrangement of
the packfiles.

For instance, if you take the Linux repository and repack
using `git repack -adfF --max-pack-size=128m` then all
commits will be in one packfile and all trees will be in
a small set of packfiles, organized well enough that 'git
rev-list --objects HEAD^{tree}' only inspects one or two
packfiles. In that arrangement the MRU cache rarely misses.

GVFS has the notion of a "prefetch packfile". These are
packfiles that are precomputed by cache servers to
contain the commits and trees introduced to the remote
each day. GVFS downloads these packfiles and places them
in an alternate. Since these are organized by "first
time introduced" and the working directory is so large,
the MRU misses are significant when performing a checkout
and updating the .git/index file.

To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

        Before: 0.17s
        After:  0.13s
        % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.
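
The hit/miss accounting can be sketched in C as below. This is an
illustration under stated assumptions, not the instrumentation I
added to bsearch_pack: the toy_* names are hypothetical, object IDs
are plain unsigned integers, and a "MIDX" lookup is modeled as a
single toy_bsearch() over one merged sorted list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for one pack-index: a sorted array of object ids. */
struct toy_pack {
	const unsigned *ids;
	size_t nr;
};

/* Counters for binary searches that found / did not find the id. */
struct toy_stats {
	unsigned long hits;
	unsigned long misses;
};

static int toy_cmp(const void *a, const void *b)
{
	unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
	return (x > y) - (x < y);
}

/* One binary search in one sorted list, recorded as a hit or a miss. */
static int toy_bsearch(const struct toy_pack *p, unsigned id,
		       struct toy_stats *st)
{
	if (bsearch(&id, p->ids, p->nr, sizeof(*p->ids), toy_cmp)) {
		st->hits++;
		return 1;
	}
	st->misses++;
	return 0;
}

/*
 * Probe packs in most-recently-used order. The pack that answers is
 * moved to the front of 'order' so it is tried first next time.
 */
static int toy_find_mru(const struct toy_pack **order, size_t nr_packs,
			unsigned id, struct toy_stats *st)
{
	size_t i;

	assert(order && st);
	for (i = 0; i < nr_packs; i++) {
		if (toy_bsearch(order[i], id, st)) {
			const struct toy_pack *hit = order[i];
			memmove(order + 1, order, i * sizeof(*order));
			order[0] = hit;
			return 1;
		}
	}
	return 0;
}
```

With three toy packs {1,4,7}, {2,5,8}, {3,6,9} tried in that initial
order, looking up ids 1 through 9 in sequence costs 24 searches for
9 hits (37.5%), while one merged list answers the same lookups with
9 searches and no misses.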



Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate
through all packfiles to ensure no collision exists in
any packfile. This requires O(P log N) time, where P is
the number of packfiles and N is the number of objects.
With the MIDX, this is only O(log N) time. Our standard test [2]
is 'git log --oneline --parents --raw' because it writes
many abbreviations while also doing a lot of other work
(walking commits and trees to compute the raw diff).

For a copy of the Linux repository with 50 packfiles
split by time, we observed the following:

        Before: 100.5 s
        After:   58.2 s
        % Diff: -42.1%
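
The difference between the two bounds can be sketched in C as
follows. This is a simplified model, not the code in sha1-name.c:
the toy_* names are hypothetical, object IDs are short hex strings
of equal length, and IDs within one list are assumed distinct. In a
sorted list the longest shared prefix is always with a neighbor, so
one binary search per list is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy index: a sorted array of equal-length hex object IDs. */
struct toy_index {
	const char *const *ids;
	size_t nr;
};

static size_t toy_common_prefix(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/* Binary search: first position whose ID is >= 'id'. */
static size_t toy_lower_bound(const struct toy_index *idx, const char *id)
{
	size_t lo = 0, hi = idx->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (strcmp(idx->ids[mid], id) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Shortest prefix of 'id' shared by no other ID in one sorted list. */
static size_t toy_min_abbrev(const struct toy_index *idx, const char *id)
{
	size_t pos, len = 1, n;

	assert(idx && id);
	pos = toy_lower_bound(idx, id);
	if (pos > 0) {
		n = toy_common_prefix(idx->ids[pos - 1], id) + 1;
		if (n > len)
			len = n;
	}
	if (pos < idx->nr && !strcmp(idx->ids[pos], id))
		pos++; /* skip the ID itself */
	if (pos < idx->nr) {
		n = toy_common_prefix(idx->ids[pos], id) + 1;
		if (n > len)
			len = n;
	}
	return len;
}

/* Without a MIDX: repeat the search in every pack, O(P log N). */
static size_t toy_min_abbrev_packs(const struct toy_index *packs,
				   size_t nr_packs, const char *id)
{
	size_t i, len = 1;

	for (i = 0; i < nr_packs; i++) {
		size_t n = toy_min_abbrev(&packs[i], id);
		if (n > len)
			len = n;
	}
	return len;
}
```

Calling toy_min_abbrev() once on a merged list gives the same answer
as toy_min_abbrev_packs() over the individual lists, which is the
saving the MIDX provides.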


Request for Review Attention
----------------------------

I tried my best to take the feedback from the commit-graph
feature and apply it to this feature. I also worked to
follow the object-store refactoring as best I could. I also have
some local commits that create a 'verify' subcommand and
integrate with 'fsck' similar to the commit-graph, but I'll
leave those for a later series (and review is still underway
for that part of the commit-graph).

One place where I could use some guidance is related to the
current state of 'the_hash_algo' patches. The file format
allows a different "hash version" which then indicates the
length of the hash. What's the best way to ensure this
feature doesn't cause extra pain in the hash-agnostic series?
This will inform how I go back and make the commit-graph
feature better in this area, too.


Thanks,
-Stolee

[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/T/#u
    Previous MIDX RFC.

[2] https://public-inbox.org/git/20171012120220.226427-1-dstolee@microsoft.com/
    A patch series on abbreviation speedups


Derrick Stolee (23):
  midx: add design document
  midx: add midx format details to pack-format.txt
  midx: add midx builtin
  midx: add 'write' subcommand and basic wiring
  midx: write header information to lockfile
  midx: struct midxed_git and 'read' subcommand
  midx: expand test data
  midx: read packfiles from pack directory
  midx: write pack names in chunk
  midx: write a lookup into the pack names chunk
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  midx: create core.midx config setting
  midx: prepare midxed_git struct
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  midx: use midx to find ref-deltas
  midx: clear midx on repack

 .gitignore                              |   1 +
 Documentation/config.txt                |   4 +
 Documentation/git-midx.txt              |  60 ++
 Documentation/technical/midx.txt        | 109 +++
 Documentation/technical/pack-format.txt |  82 +++
 Makefile                                |   2 +
 builtin.h                               |   1 +
 builtin/midx.c                          |  88 +++
 builtin/repack.c                        |   8 +
 cache.h                                 |   1 +
 command-list.txt                        |   1 +
 config.c                                |   5 +
 environment.c                           |   1 +
 git.c                                   |   1 +
 midx.c                                  | 923 ++++++++++++++++++++++++
 midx.h                                  |  23 +
 object-store.h                          |  35 +
 packfile.c                              |  47 +-
 packfile.h                              |   1 +
 sha1-name.c                             |  70 ++
 t/t5319-midx.sh                         | 192 +++++
 21 files changed, 1652 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 Documentation/technical/midx.txt
 create mode 100644 builtin/midx.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

-- 
2.18.0.rc1



* [PATCH 01/23] midx: add design document
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-11 19:04   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/midx.txt

diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
new file mode 100644
index 0000000000..789f410d71
--- /dev/null
+++ b/Documentation/technical/midx.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.midx config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+  contents of the multi-pack-index file match the offsets listed in
+  the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
-- 
2.18.0.rc1



* [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-11 19:19   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The multi-pack-index (MIDX) feature generalizes the existing pack-
index (IDX) feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to create breaking changes by
advancing the version number. We can also change the hash algorithm
using a different version value.

We will add the individual chunk format information as we introduce
the code that writes that information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
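For reviewers who prefer code to prose, here is a minimal C sketch
of reading the 12-byte header described in this patch and sizing
the chunk lookup table that follows it. It is not part of the
series; the toy_* names and the error convention (0 on success, -1
on failure) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define TOY_MIDX_HEADER_SIZE 12
#define TOY_MIDX_CHUNK_LOOKUP_WIDTH 12 /* 4-byte id + 8-byte offset */

struct toy_midx_header {
	uint32_t signature;
	uint8_t version;
	uint8_t hash_version;
	uint8_t num_chunks;
	uint8_t num_base_files;
	uint32_t num_packs;
};

/* Read a 4-byte number stored in network (big-endian) order. */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is not a version 1 header. */
static int toy_parse_midx_header(const unsigned char *buf, size_t len,
				 struct toy_midx_header *out)
{
	assert(buf && out);
	if (len < TOY_MIDX_HEADER_SIZE)
		return -1;
	out->signature = toy_get_be32(buf);
	out->version = buf[4];
	out->hash_version = buf[5];
	out->num_chunks = buf[6];
	out->num_base_files = buf[7];
	out->num_packs = toy_get_be32(buf + 8);
	if (out->signature != TOY_MIDX_SIGNATURE)
		return -1;
	if (out->version != 1 || out->hash_version != 1)
		return -1;
	return 0;
}

/* The lookup table has one row per chunk plus a terminating row. */
static size_t toy_chunk_lookup_size(const struct toy_midx_header *h)
{
	return ((size_t)h->num_chunks + 1) * TOY_MIDX_CHUNK_LOOKUP_WIDTH;
}
```
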
 Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..17666b4bfc 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== multi-pack-index (MIDX) files have the following format:
+
+The multi-pack-index files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	1-byte version number:
+	    Git only writes or recognizes version 1
+
+	1-byte Object Id Version
+	    Git only writes or recognizes version 1 (SHA-1)
+
+	1-byte number (C) of "chunks"
+
+	1-byte number (I) of base multi-pack-index files:
+	    This value is currently always zero.
+
+	4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	(This section intentionally left incomplete.)
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
-- 
2.18.0.rc1



* [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:20   ` Duy Nguyen
  2018-06-11 21:02   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
                   ` (22 subsequent siblings)
  25 siblings, 2 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

This new 'git midx' builtin will be the plumbing access for writing,
reading, and checking multi-pack-index (MIDX) files. The initial
implementation is a no-op.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                 |  1 +
 Documentation/git-midx.txt | 29 +++++++++++++++++++++++++++++
 Makefile                   |  1 +
 builtin.h                  |  1 +
 builtin/midx.c             | 38 ++++++++++++++++++++++++++++++++++++++
 command-list.txt           |  1 +
 git.c                      |  1 +
 7 files changed, 72 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..e309644d6b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -97,6 +97,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 0000000000..2bd886f1a2
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,29 @@
+git-midx(1)
+============
+
+NAME
+----
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+
+SYNOPSIS
+--------
+[verse]
+'git midx' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a MIDX file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+	Use given directory for the location of Git objects. We check
+	<dir>/packs/multi-pack-index for the current MIDX file, and
+	<dir>/packs for the pack-files to index.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 1d27f36365..88958c7b42 100644
--- a/Makefile
+++ b/Makefile
@@ -1045,6 +1045,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..7b5bd46c7d 100644
--- a/builtin.h
+++ b/builtin.h
@@ -189,6 +189,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 0000000000..59ea92178f
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "parse-options.h"
+
+static char const * const builtin_midx_usage[] ={
+	N_("git midx [--object-dir <dir>]"),
+	NULL
+};
+
+static struct opts_midx {
+	const char *object_dir;
+} opts;
+
+int cmd_midx(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_midx_options[] = {
+		{ OPTION_STRING, 0, "object-dir", &opts.object_dir,
+		  N_("dir"),
+		  N_("The object directory containing set of packfile and pack-index pairs.") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_midx_usage, builtin_midx_options);
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_midx_options,
+			     builtin_midx_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..a21bd7470e 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators           complete
 git-merge-tree                          ancillaryinterrogators
+git-midx                                plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c2f48d53dd..400fadd677 100644
--- a/git.c
+++ b/git.c
@@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
+	{ "midx", cmd_midx, RUN_SETUP },
 	{ "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
 	{ "mktree", cmd_mktree, RUN_SETUP },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
2.18.0.rc1



* [PATCH 04/23] midx: add 'write' subcommand and basic wiring
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:27   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

In anticipation of writing multi-pack-indexes (MIDX files), add a
'git midx write' subcommand and send the options to a write_midx_file()
method. Also create a basic test file that tests the 'write' subcommand.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 22 +++++++++++++++++++++-
 Makefile                   |  1 +
 builtin/midx.c             |  9 ++++++++-
 midx.c                     |  9 +++++++++
 midx.h                     |  4 ++++
 t/t5319-midx.sh            | 10 ++++++++++
 6 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 2bd886f1a2..dcaeb1a91b 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' [--object-dir <dir>]
+'git midx' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	<dir>/packs/multi-pack-index.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-------------------------------------------
+$ git midx write
+-------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate.
++
+-------------------------------------------
+$ git midx --object-dir <alt> write
+-------------------------------------------
+
 
 GIT
 ---
diff --git a/Makefile b/Makefile
index 88958c7b42..aa86fcd8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/midx.c b/builtin/midx.c
index 59ea92178f..dc0a5acd3f 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -3,9 +3,10 @@
 #include "config.h"
 #include "git-compat-util.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-	N_("git midx [--object-dir <dir>]"),
+	N_("git midx [--object-dir <dir>] [write]"),
 	NULL
 };
 
@@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
+	if (argc == 0)
+		return 0;
+
+	if (!strcmp(argv[0], "write"))
+		return write_midx_file(opts.object_dir);
+
 	return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..616af66b13
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,9 @@
+#include "git-compat-util.h"
+#include "cache.h"
+#include "dir.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..3a63673952
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,4 @@
+#include "cache.h"
+#include "packfile.h"
+
+int write_midx_file(const char *object_dir);
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
new file mode 100755
index 0000000000..a590137af7
--- /dev/null
+++ b/t/t5319-midx.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no pakcs' '
+	git midx --object-dir=. write
+'
+
+test_done
-- 
2.18.0.rc1



* [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:35   ` Duy Nguyen
  2018-06-12 15:00   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
                   ` (20 subsequent siblings)
  25 siblings, 2 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. Start
with these basics so we can add the rest of the format in small
increments.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but is not checked. We will
check the return value in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
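A minimal C sketch of the byte-tracking technique described above,
for illustration only. It writes to an in-memory buffer rather than
a hashfile, and the toy_* names and the toy_write_header_checked()
helper are hypothetical; the series itself does not yet check the
return value of write_midx_header().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define TOY_MIDX_HEADER_SIZE 12

/* In-memory stand-in for the output stream. */
struct toy_out {
	unsigned char *buf;
	size_t cap;
	size_t len;
};

static void toy_write(struct toy_out *o, const void *data, size_t n)
{
	assert(o->len + n <= o->cap);
	memcpy(o->buf + o->len, data, n);
	o->len += n;
}

/* Write a 4-byte number in network (big-endian) order. */
static void toy_write_be32(struct toy_out *o, uint32_t v)
{
	unsigned char b[4];

	b[0] = (unsigned char)((v >> 24) & 0xff);
	b[1] = (unsigned char)((v >> 16) & 0xff);
	b[2] = (unsigned char)((v >> 8) & 0xff);
	b[3] = (unsigned char)(v & 0xff);
	toy_write(o, b, sizeof(b));
}

/* Writes the header and reports how many bytes it claims to emit. */
static size_t toy_write_header(struct toy_out *o, unsigned char num_chunks,
			       uint32_t num_packs)
{
	unsigned char byte_values[4];

	toy_write_be32(o, TOY_MIDX_SIGNATURE);
	byte_values[0] = 1; /* file version */
	byte_values[1] = 1; /* hash version (SHA-1) */
	byte_values[2] = num_chunks;
	byte_values[3] = 0; /* number of base files, currently zero */
	toy_write(o, byte_values, sizeof(byte_values));
	toy_write_be32(o, num_packs);
	return TOY_MIDX_HEADER_SIZE;
}

/* Returns 0 if the reported size matches the bytes actually written. */
static int toy_write_header_checked(struct toy_out *o,
				    unsigned char num_chunks,
				    uint32_t num_packs)
{
	size_t before = o->len;
	size_t reported = toy_write_header(o, num_chunks, num_packs);

	return (o->len - before == reported) ? 0 : -1;
}
```
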
 midx.c          | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 t/t5319-midx.sh |  5 +++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 616af66b13..3e55422a21 100644
--- a/midx.c
+++ b/midx.c
@@ -1,9 +1,62 @@
 #include "git-compat-util.h"
 #include "cache.h"
 #include "dir.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1 /* SHA-1 */
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+	struct strbuf midx_name = STRBUF_INIT;
+	strbuf_addstr(&midx_name, object_dir);
+	strbuf_addstr(&midx_name, "/pack/multi-pack-index");
+	return strbuf_detach(&midx_name, NULL);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+				unsigned char num_chunks,
+				uint32_t num_packs)
+{
+	char byte_values[4];
+	hashwrite_be32(f, MIDX_SIGNATURE);
+	byte_values[0] = MIDX_VERSION;
+	byte_values[1] = MIDX_HASH_VERSION;
+	byte_values[2] = num_chunks;
+	byte_values[3] = 0; /* unused */
+	hashwrite(f, byte_values, sizeof(byte_values));
+	hashwrite_be32(f, num_packs);
+
+	return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+	unsigned char num_chunks = 0;
+	uint32_t num_packs = 0;
+	char *midx_name;
+	struct hashfile *f;
+	struct lock_file lk;
+
+	midx_name = get_midx_filename(object_dir);
+	if (safe_create_leading_directories(midx_name)) {
+		UNLEAK(midx_name);
+		die_errno(_("unable to create leading directories of %s"),
+			  midx_name);
+	}
+
+	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	FREE_AND_NULL(midx_name);
+
+	write_midx_header(f, num_chunks, num_packs);
+
+	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	commit_lock_file(&lk);
+
 	return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index a590137af7..80f9389837 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,8 +3,9 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
-test_expect_success 'write midx with no pakcs' '
-	git midx --object-dir=. write
+test_expect_success 'write midx with no packs' '
+	git midx --object-dir=. write &&
+	test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:54   ` Duy Nguyen
  2018-06-07 18:31   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
                   ` (19 subsequent siblings)
  25 siblings, 2 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we build the multi-pack-index feature one chunk at a time,
we want to test that the data is being written correctly.

Create struct midxed_git to store an in-memory representation of a
multi-pack-index and a memory-map of the binary file. Initialize this
struct in load_midxed_git(object_dir).

Create the 'git midx read' subcommand to output basic information about
the multi-pack-index file. This will be expanded as more information is
written to the file.
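
For readers following along, here is a minimal sketch of the header
parsing that load_midxed_git() performs, written against a plain byte
buffer instead of a memory map. The sketch_* names and the struct are
hypothetical and exist only for illustration; the byte offsets are the
ones read in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define SKETCH_MIDX_HEADER_SIZE 12

/* Hypothetical stand-in for the header fields of struct midxed_git. */
struct sketch_midx_header {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
	unsigned char num_chunks;
	uint32_t num_packs;
};

/* Read a big-endian 32-bit value, in the spirit of Git's get_be32(). */
uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse the 12-byte header. Returns 0 on success, or -1 if the buffer
 * is too short or the signature does not match.
 */
int sketch_parse_midx_header(const unsigned char *data, size_t len,
			     struct sketch_midx_header *out)
{
	assert(data != NULL && out != NULL);

	if (len < SKETCH_MIDX_HEADER_SIZE)
		return -1;

	out->signature = sketch_get_be32(data);
	if (out->signature != SKETCH_MIDX_SIGNATURE)
		return -1;

	out->version = data[4];
	out->hash_version = data[5];
	out->num_chunks = data[6];
	/* byte 7 is not read by load_midxed_git() in this patch */
	out->num_packs = sketch_get_be32(data + 8);
	return 0;
}
```

Unlike the real function, the sketch does not check the version or hash
version against the expected values; it only shows where each field
lives in the header.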

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 11 +++++++
 builtin/midx.c             | 23 +++++++++++++-
 midx.c                     | 65 ++++++++++++++++++++++++++++++++++++++
 midx.h                     |  9 ++++++
 object-store.h             | 19 +++++++++++
 t/t5319-midx.sh            | 12 ++++++-
 6 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index dcaeb1a91b..919283fdd8 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -23,6 +23,11 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+read::
+	When given as the verb, read the current MIDX file and output
+	basic information about its contents. Used for debugging
+	purposes only.
+
 write::
 	When given as the verb, write a new MIDX file to
 	<dir>/packs/multi-pack-index.
@@ -43,6 +48,12 @@ $ git midx write
 $ git midx --object-dir <alt> write
 -------------------------------------------
 
+* Read the MIDX file in the .git/objects folder.
++
+-------------------------------------------
+$ git midx read
+-------------------------------------------
+
 
 GIT
 ---
diff --git a/builtin/midx.c b/builtin/midx.c
index dc0a5acd3f..c7002f664a 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -6,7 +6,7 @@
 #include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-	N_("git midx [--object-dir <dir>] [write]"),
+	N_("git midx [--object-dir <dir>] [read|write]"),
 	NULL
 };
 
@@ -14,6 +14,25 @@ static struct opts_midx {
 	const char *object_dir;
 } opts;
 
+static int read_midx_file(const char *object_dir)
+{
+	struct midxed_git *m = load_midxed_git(object_dir);
+
+	if (!m)
+		return 0;
+
+	printf("header: %08x %d %d %d %d\n",
+	       m->signature,
+	       m->version,
+	       m->hash_version,
+	       m->num_chunks,
+	       m->num_packs);
+
+	printf("object_dir: %s\n", m->object_dir);
+
+	return 0;
+}
+
 int cmd_midx(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_midx_options[] = {
@@ -38,6 +57,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	if (argc == 0)
 		return 0;
 
+	if (!strcmp(argv[0], "read"))
+		return read_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 
diff --git a/midx.c b/midx.c
index 3e55422a21..fa18770f1d 100644
--- a/midx.c
+++ b/midx.c
@@ -3,12 +3,15 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
 #define MIDX_HASH_VERSION 1 /* SHA-1 */
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -18,6 +21,68 @@ static char *get_midx_filename(const char *object_dir)
 	return strbuf_detach(&midx_name, NULL);
 }
 
+struct midxed_git *load_midxed_git(const char *object_dir)
+{
+	struct midxed_git *m;
+	int fd;
+	struct stat st;
+	size_t midx_size;
+	void *midx_map;
+	const char *midx_name = get_midx_filename(object_dir);
+
+	fd = git_open(midx_name);
+	if (fd < 0)
+		return NULL;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return NULL;
+	}
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < MIDX_MIN_SIZE) {
+		close(fd);
+		die("multi-pack-index file %s is too small", midx_name);
+	}
+
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
+	strcpy(m->object_dir, object_dir);
+	m->data = midx_map;
+
+	m->signature = get_be32(m->data);
+	if (m->signature != MIDX_SIGNATURE) {
+		error("multi-pack-index signature %X does not match signature %X",
+		      m->signature, MIDX_SIGNATURE);
+		goto cleanup_fail;
+	}
+
+	m->version = *(m->data + 4);
+	if (m->version != MIDX_VERSION) {
+		error("multi-pack-index version %d not recognized",
+		      m->version);
+		goto cleanup_fail;
+	}
+
+	m->hash_version = *(m->data + 5);
+	if (m->hash_version != MIDX_HASH_VERSION) {
+		error("hash version %d not recognized", m->hash_version);
+		goto cleanup_fail;
+	}
+	m->hash_len = MIDX_HASH_LEN;
+
+	m->num_chunks = *(m->data + 6);
+	m->num_packs = get_be32(m->data + 8);
+
+	return m;
+
+cleanup_fail:
+	FREE_AND_NULL(m);
+	munmap(midx_map, midx_size);
+	close(fd);
+	exit(1);
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index 3a63673952..a1d18ed991 100644
--- a/midx.h
+++ b/midx.h
@@ -1,4 +1,13 @@
+#ifndef MIDX_H
+#define MIDX_H
+
+#include "git-compat-util.h"
 #include "cache.h"
+#include "object-store.h"
 #include "packfile.h"
 
+struct midxed_git *load_midxed_git(const char *object_dir);
+
 int write_midx_file(const char *object_dir);
+
+#endif
diff --git a/object-store.h b/object-store.h
index d683112fd7..77cb82621a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,6 +84,25 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
+struct midxed_git {
+	struct midxed_git *next;
+
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	char object_dir[FLEX_ARRAY];
+};
+
 struct raw_object_store {
 	/*
 	 * Path to the repository's object store.
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 80f9389837..e78514d8e9 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,9 +3,19 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+midx_read_expect() {
+	cat >expect <<- EOF
+	header: 4d494458 1 1 0 0
+	object_dir: .
+	EOF
+	git midx read --object-dir=. >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index
+	test_path_is_file pack/multi-pack-index &&
+	midx_read_expect
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 07/23] midx: expand test data
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests to t5319-midx.sh that create repository
data including multiple packfiles with both version 1 and version 2
formats.

The current 'git midx write' command will always write the same file
with no "real" data. This will be expanded in future commits, along with
the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-midx.sh | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index e78514d8e9..2c25a69744 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -14,8 +14,109 @@ midx_read_expect() {
 
 test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
+	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
 	midx_read_expect
 '
 
+test_expect_success 'create objects' '
+	for i in `test_seq 1 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree </dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with one v1 pack' '
+	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more objects' '
+	for i in `test_seq 6 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list2 &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with two packs' '
+	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
+	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more packs' '
+	for j in `test_seq 1 10`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+		git update-index --add file_101 &&
+		tree=$(git write-tree) &&
+		commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+		} >obj-list &&
+		git update-ref HEAD $commit &&
+		git pack-objects --index-version=2 test-pack <obj-list &&
+		i=$(expr $i + 1) || return 1 &&
+		j=$(expr $j + 1) || return 1
+	done
+'
+
+test_expect_success 'write midx with twelve packs' '
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
 test_done
-- 
2.18.0.rc1



* [PATCH 08/23] midx: read packfiles from pack directory
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 18:03   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

When constructing a multi-pack-index file for a given object directory,
scan the enclosed pack directory for files whose names end with ".idx"
and load the paired packfile for each one using add_packed_git().
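
The directory scan can be sketched outside of Git roughly as follows.
This is an illustration only: sketch_ends_with() is a hypothetical
stand-in for Git's ends_with() helper, and the sketch merely counts the
matching entries where the patch calls add_packed_git() on each one.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical equivalent of Git's ends_with() helper. */
int sketch_ends_with(const char *str, const char *suffix)
{
	size_t len = strlen(str), suflen = strlen(suffix);

	return len >= suflen && !strcmp(str + len - suflen, suffix);
}

/*
 * Count the entries in dir_path whose names end in ".idx".
 * Returns -1 if the directory cannot be opened.
 */
long sketch_count_idx_files(const char *dir_path)
{
	DIR *dir;
	struct dirent *de;
	long count = 0;

	assert(dir_path != NULL);

	dir = opendir(dir_path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		/* "." and ".." never end in ".idx", so no special case */
		if (sketch_ends_with(de->d_name, ".idx"))
			count++;
	}

	closedir(dir);
	return count;
}
```

Matching on the ".idx" suffix rather than ".pack" means a packfile
without an index is never considered, which fits the goal of indexing
only packs that Git can already read.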

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c          | 51 +++++++++++++++++++++++++++++++++++++++++++++++--
 t/t5319-midx.sh | 15 ++++++++-------
 2 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index fa18770f1d..9fb89c80a2 100644
--- a/midx.c
+++ b/midx.c
@@ -102,10 +102,15 @@ static size_t write_midx_header(struct hashfile *f,
 int write_midx_file(const char *object_dir)
 {
 	unsigned char num_chunks = 0;
-	uint32_t num_packs = 0;
 	char *midx_name;
 	struct hashfile *f;
 	struct lock_file lk;
+	struct packed_git **packs = NULL;
+	uint32_t i, nr_packs = 0, alloc_packs = 0;
+	DIR *dir;
+	struct dirent *de;
+	struct strbuf pack_dir = STRBUF_INIT;
+	size_t pack_dir_len;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	strbuf_addf(&pack_dir, "%s/pack", object_dir);
+	dir = opendir(pack_dir.buf);
+
+	if (!dir) {
+		error_errno("unable to open pack directory: %s",
+			    pack_dir.buf);
+		strbuf_release(&pack_dir);
+		return 1;
+	}
+
+	strbuf_addch(&pack_dir, '/');
+	pack_dir_len = pack_dir.len;
+	ALLOC_ARRAY(packs, alloc_packs);
+	while ((de = readdir(dir)) != NULL) {
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		if (ends_with(de->d_name, ".idx")) {
+			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+
+			strbuf_setlen(&pack_dir, pack_dir_len);
+			strbuf_addstr(&pack_dir, de->d_name);
+
+			packs[nr_packs] = add_packed_git(pack_dir.buf,
+							 pack_dir.len,
+							 0);
+			if (!packs[nr_packs])
+				warning("failed to add packfile '%s'",
+					pack_dir.buf);
+			else
+				nr_packs++;
+		}
+	}
+	closedir(dir);
+	strbuf_release(&pack_dir);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, num_packs);
+	write_midx_header(f, num_chunks, nr_packs);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+	for (i = 0; i < nr_packs; i++) {
+		close_pack(packs[i]);
+		FREE_AND_NULL(packs[i]);
+	}
+
+	FREE_AND_NULL(packs);
 	return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 2c25a69744..abe545c7c4 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect() {
+	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 0 0
+	header: 4d494458 1 1 0 $NUM_PACKS
 	object_dir: .
 	EOF
 	git midx read --object-dir=. >actual &&
@@ -16,7 +17,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect
+	midx_read_expect 0
 '
 
 test_expect_success 'create objects' '
@@ -47,14 +48,14 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'Add more objects' '
@@ -85,7 +86,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 2
 '
 
 test_expect_success 'Add more packs' '
@@ -108,7 +109,7 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 test-pack <obj-list &&
+		git pack-objects --index-version=2 pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
@@ -116,7 +117,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 12
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 09/23] midx: write pack names in chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 18:26   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The multi-pack-index (MIDX) needs to track which pack-files are covered
by the MIDX file. Store these in our first required chunk. Since
filenames are not well structured, add padding to keep good alignment in
later chunks.
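
As a rough illustration of the padding arithmetic, the sketch below
computes the padded length of the pack-name chunk from a list of names.
The function name is hypothetical; the alignment of four bytes matches
MIDX_CHUNK_ALIGNMENT in the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CHUNK_ALIGNMENT 4

/*
 * Sum the lengths of the null-terminated names, then round up to the
 * next multiple of the chunk alignment.
 */
uint64_t sketch_padded_names_len(const char **names, uint32_t nr)
{
	uint64_t len = 0;
	uint32_t i;

	assert(names != NULL || nr == 0);

	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1; /* +1 for the terminating NUL */

	if (len % SKETCH_CHUNK_ALIGNMENT)
		len += SKETCH_CHUNK_ALIGNMENT - (len % SKETCH_CHUNK_ALIGNMENT);

	return len;
}
```

For example, the names "a.idx" and "bb.idx" occupy 6 + 7 = 13 bytes
with their terminators, so the chunk is padded with three zero bytes to
reach 16.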

Modify the 'git midx read' subcommand to output the existence of the
pack-file name chunk. Modify t5319-midx.sh to reflect this new output
and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks during the
write of a MIDX file that double-check that the values we are writing
fit the format definitions. If any value is incorrect, then we notice
before writing invalid data. This mainly helps developers while working
on the feature, but it can also identify issues that only appear when
dealing with very large data sets. These large sets are hard to encode
into test cases.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |   6 +
 builtin/midx.c                          |   7 +
 midx.c                                  | 176 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/t5319-midx.sh                         |   3 +-
 5 files changed, 188 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 17666b4bfc..2b37be7b33 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index c7002f664a..fe56560853 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -28,6 +28,13 @@ static int read_midx_file(const char *object_dir)
 	       m->num_chunks,
 	       m->num_packs);
 
+	printf("chunks:");
+
+	if (m->chunk_pack_names)
+		printf(" pack_names");
+
+	printf("\n");
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/midx.c b/midx.c
index 9fb89c80a2..d4f4a01a51 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	struct strbuf midx_name = STRBUF_INIT;
@@ -29,6 +34,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	size_t midx_size;
 	void *midx_map;
 	const char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 	if (fd < 0)
@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	m->num_chunks = *(m->data + 6);
 	m->num_packs = get_be32(m->data + 8);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKNAMES:
+				m->chunk_pack_names = m->data + chunk_offset;
+				break;
+
+			case 0:
+				die("terminating MIDX chunk id appears earlier than expected");
+				break;
+
+			default:
+				/*
+				 * Do nothing on unrecognized chunks, allowing future
+				 * extensions to add optional chunks.
+				 */
+				break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die("MIDX missing required pack-name chunk");
+
 	return m;
 
 cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+				    char **pack_names,
+				    uint32_t num_packs)
+{
+	uint32_t i;
+	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+	size_t written = 0;
+
+	for (i = 0; i < num_packs; i++) {
+		size_t writelen = strlen(pack_names[i]) + 1;
+
+		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+			BUG("incorrect pack-file order: %s before %s",
+			    pack_names[i - 1],
+			    pack_names[i]);
+
+		hashwrite(f, pack_names[i], writelen);
+		written += writelen;
+	}
+
+	/* add padding to be aligned */
+	i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+	if (i < MIDX_CHUNK_ALIGNMENT) {
+		bzero(padding, sizeof(padding));
+		hashwrite(f, padding, i);
+		written += i;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
-	unsigned char num_chunks = 0;
+	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	struct hashfile *f;
 	struct lock_file lk;
 	struct packed_git **packs = NULL;
+	char **pack_names = NULL;
+	uint32_t *pack_perm;
 	uint32_t i, nr_packs = 0, alloc_packs = 0;
+	uint32_t alloc_pack_names = 0;
 	DIR *dir;
 	struct dirent *de;
 	struct strbuf pack_dir = STRBUF_INIT;
 	size_t pack_dir_len;
+	uint64_t pack_name_concat_len = 0;
+	uint64_t written = 0;
+	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
+	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -132,12 +233,14 @@ int write_midx_file(const char *object_dir)
 	strbuf_addch(&pack_dir, '/');
 	pack_dir_len = pack_dir.len;
 	ALLOC_ARRAY(packs, alloc_packs);
+	ALLOC_ARRAY(pack_names, alloc_pack_names);
 	while ((de = readdir(dir)) != NULL) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		if (ends_with(de->d_name, ".idx")) {
 			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
 
 			strbuf_setlen(&pack_dir, pack_dir_len);
 			strbuf_addstr(&pack_dir, de->d_name);
@@ -145,21 +248,83 @@ int write_midx_file(const char *object_dir)
 			packs[nr_packs] = add_packed_git(pack_dir.buf,
 							 pack_dir.len,
 							 0);
-			if (!packs[nr_packs])
+			if (!packs[nr_packs]) {
 				warning("failed to add packfile '%s'",
 					pack_dir.buf);
-			else
-				nr_packs++;
+				continue;
+			}
+
+			pack_names[nr_packs] = xstrdup(de->d_name);
+			pack_name_concat_len += strlen(de->d_name) + 1;
+			nr_packs++;
 		}
 	}
+
 	closedir(dir);
 	strbuf_release(&pack_dir);
 
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
+	ALLOC_ARRAY(pack_perm, nr_packs);
+	sort_packs_by_name(pack_names, nr_packs, pack_perm);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, nr_packs);
+	cur_chunk = 0;
+	num_chunks = 1;
+
+	written = write_midx_header(f, num_chunks, nr_packs);
+
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
+
+	for (i = 0; i <= num_chunks; i++) {
+		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
+			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
+			    chunk_offsets[i - 1],
+			    chunk_offsets[i]);
+
+		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
+			BUG("chunk offset %"PRIu64" is not properly aligned",
+			    chunk_offsets[i]);
+
+		hashwrite_be32(f, chunk_ids[i]);
+		hashwrite_be32(f, chunk_offsets[i] >> 32);
+		hashwrite_be32(f, chunk_offsets[i]);
+
+		written += MIDX_CHUNKLOOKUP_WIDTH;
+	}
+
+	for (i = 0; i < num_chunks; i++) {
+		if (written != chunk_offsets[i])
+			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
+			    chunk_offsets[i],
+			    written,
+			    chunk_ids[i]);
+
+		switch (chunk_ids[i]) {
+			case MIDX_CHUNKID_PACKNAMES:
+				written += write_midx_pack_names(f, pack_names, nr_packs);
+				break;
+
+			default:
+				BUG("trying to write unknown chunk id %"PRIx32,
+				    chunk_ids[i]);
+		}
+	}
+
+	if (written != chunk_offsets[num_chunks])
+		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
+		    written,
+		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
@@ -170,5 +335,6 @@ int write_midx_file(const char *object_dir)
 	}
 
 	FREE_AND_NULL(packs);
+	FREE_AND_NULL(pack_names);
 	return 0;
 }
diff --git a/object-store.h b/object-store.h
index 77cb82621a..199cf4bd44 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,6 +100,8 @@ struct midxed_git {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const unsigned char *chunk_pack_names;
+
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index abe545c7c4..fdf4f84a90 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,7 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 0 $NUM_PACKS
+	header: 4d494458 1 1 1 $NUM_PACKS
+	chunks: pack_names
 	object_dir: .
 	EOF
 	git midx read --object-dir=. >actual &&
-- 
2.18.0.rc1



* [PATCH 10/23] midx: write a lookup into the pack names chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 16:43   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c                          |  7 ++++
 midx.c                                  | 56 +++++++++++++++++++++++--
 object-store.h                          |  2 +
 t/t5319-midx.sh                         | 11 +++--
 5 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 2b37be7b33..29bf87283a 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,11 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+	    P * 4 bytes storing the offset in the packfile name chunk for
+	    the null-terminated string containing the filename for the
+	    ith packfile.
+
 	Packfile Names (ID: {'P', 'N', 'A', 'M'})
 	    Stores the packfile names as concatenated, null-terminated strings.
 	    Packfiles must be listed in lexicographic order for fast lookups by
diff --git a/builtin/midx.c b/builtin/midx.c
index fe56560853..3a261e9bbf 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,6 +16,7 @@ static struct opts_midx {
 
 static int read_midx_file(const char *object_dir)
 {
+	uint32_t i;
 	struct midxed_git *m = load_midxed_git(object_dir);
 
 	if (!m)
@@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
 
 	printf("chunks:");
 
+	if (m->chunk_pack_lookup)
+		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
 
 	printf("\n");
 
+	printf("packs:\n");
+	for (i = 0; i < m->num_packs; i++)
+		printf("%s\n", m->pack_names[i]);
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/midx.c b/midx.c
index d4f4a01a51..923acda72e 100644
--- a/midx.c
+++ b/midx.c
@@ -13,8 +13,9 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 1
+#define MIDX_MAX_CHUNKS 2
 #define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
@@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
 
 		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKLOOKUP:
+				m->chunk_pack_lookup = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_PACKNAMES:
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
@@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		}
 	}
 
+	if (!m->chunk_pack_lookup)
+		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
 
+	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+	for (i = 0; i < m->num_packs; i++) {
+		if (i) {
+			if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
+				error("MIDX pack lookup value %d before %d",
+				      ntohl(m->chunk_pack_lookup[i - 1]),
+				      ntohl(m->chunk_pack_lookup[i]));
+				goto cleanup_fail;
+			}
+		}
+
+		m->pack_names[i] = (const char *)(m->chunk_pack_names + ntohl(m->chunk_pack_lookup[i]));
+
+		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
+			error("MIDX pack names out of order: '%s' before '%s'",
+			      m->pack_names[i - 1],
+			      m->pack_names[i]);
+			goto cleanup_fail;
+		}
+	}
+
 	return m;
 
 cleanup_fail:
@@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	}
 }
 
+static size_t write_midx_pack_lookup(struct hashfile *f,
+				     char **pack_names,
+				     uint32_t nr_packs)
+{
+	uint32_t i, cur_len = 0;
+
+	for (i = 0; i < nr_packs; i++) {
+		hashwrite_be32(f, cur_len);
+		cur_len += strlen(pack_names[i]) + 1;
+	}
+
+	return sizeof(uint32_t) * (size_t)nr_packs;
+}
+
 static size_t write_midx_pack_names(struct hashfile *f,
 				    char **pack_names,
 				    uint32_t num_packs)
@@ -275,13 +317,17 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 1;
+	num_chunks = 2;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKLOOKUP;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
@@ -311,6 +357,10 @@ int write_midx_file(const char *object_dir)
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
+			case MIDX_CHUNKID_PACKLOOKUP:
+				written += write_midx_pack_lookup(f, pack_names, nr_packs);
+				break;
+
 			case MIDX_CHUNKID_PACKNAMES:
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
diff --git a/object-store.h b/object-store.h
index 199cf4bd44..1ba50459ca 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,8 +100,10 @@ struct midxed_git {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
 
+	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index fdf4f84a90..a31c387c8f 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,10 +6,15 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 1 $NUM_PACKS
-	chunks: pack_names
-	object_dir: .
+	header: 4d494458 1 1 2 $NUM_PACKS
+	chunks: pack_lookup pack_names
+	packs:
 	EOF
+	if [ $NUM_PACKS -ge 1 ]
+	then
+		ls pack/ | grep idx | sort >> expect
+	fi
+	printf "object_dir: .\n" >>expect &&
 	git midx read --object-dir=. >actual &&
 	test_cmp expect actual
 }
-- 
2.18.0.rc1


* [PATCH 11/23] midx: sort and deduplicate objects from packfiles
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (9 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:07   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Before writing a list of objects and their offsets to a multi-pack-index
(MIDX), we need to collect the list of objects contained in the
packfiles. There may be multiple copies of some objects, so this list
must be deduplicated.

It is possible to artificially get into a state where there are many
duplicate copies of objects. That can create high memory pressure if we
are to create a list of all objects before de-duplication. To reduce
this memory pressure without a significant performance drop,
automatically group objects by the first byte of their object id. Use
the IDX fanout tables to group the data, copy to a local array, then
sort.

Copy only the de-duplicated entries. Among duplicates, keep the copy from
the packfile with the most recent modified time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
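A minimal sketch of the sort-then-deduplicate step described in the
commit message, for readers who want to try the idea in isolation. The
names `struct entry`, `entry_cmp` and `dedup_sorted` are hypothetical
stand-ins, not the functions added to midx.c, and the sketch leaves out
the per-fanout batching that keeps memory use down.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for struct pack_midx_entry. */
struct entry {
	unsigned char oid[20];
	uint32_t pack_int_id;
	time_t pack_mtime;
};

/* Order by object id, then newest pack first, then pack id. */
static int entry_cmp(const void *va, const void *vb)
{
	const struct entry *a = va, *b = vb;
	int cmp = memcmp(a->oid, b->oid, sizeof(a->oid));

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	if (a->pack_int_id != b->pack_int_id)
		return a->pack_int_id < b->pack_int_id ? -1 : 1;
	return 0;
}

/*
 * Sort the entries, then keep only the first entry of each run of
 * equal object ids. The kept entries are compacted to the front of
 * the array; the return value is how many were kept.
 */
static size_t dedup_sorted(struct entry *e, size_t nr)
{
	size_t i, kept = 0;

	assert(e || !nr);
	if (nr)
		qsort(e, nr, sizeof(*e), entry_cmp);

	for (i = 0; i < nr; i++) {
		if (kept && !memcmp(e[kept - 1].oid, e[i].oid, sizeof(e[i].oid)))
			continue;
		if (kept != i)
			e[kept] = e[i];
		kept++;
	}
	return kept;
}
```

Because the comparison puts the newest pack first within each run of
equal ids, "keep the first of each run" is the same as "keep the copy
from the most recently modified pack".
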
 midx.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/midx.c b/midx.c
index 923acda72e..b20d52713c 100644
--- a/midx.c
+++ b/midx.c
@@ -4,6 +4,7 @@
 #include "csum-file.h"
 #include "lockfile.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	}
 }
 
+static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
+{
+	const uint32_t *level1_ofs = p->index_data;
+
+	if (!level1_ofs) {
+		if (open_pack_index(p))
+			return 0;
+		level1_ofs = p->index_data;
+	}
+
+	if (p->index_version > 1) {
+		level1_ofs += 2;
+	}
+
+	return ntohl(level1_ofs[value]);
+}
+
+struct pack_midx_entry {
+	struct object_id oid;
+	uint32_t pack_int_id;
+	time_t pack_mtime;
+	uint64_t offset;
+};
+
+static int midx_oid_compare(const void *_a, const void *_b)
+{
+	struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
+	struct pack_midx_entry *b = (struct pack_midx_entry *)_b;
+	int cmp = oidcmp(&a->oid, &b->oid);
+
+	if (cmp)
+		return cmp;
+
+	if (a->pack_mtime > b->pack_mtime)
+		return -1;
+	else if (a->pack_mtime < b->pack_mtime)
+		return 1;
+
+	return a->pack_int_id - b->pack_int_id;
+}
+
+static void fill_pack_entry(uint32_t pack_int_id,
+			    struct packed_git *p,
+			    uint32_t cur_object,
+			    struct pack_midx_entry *entry)
+{
+	if (!nth_packed_object_oid(&entry->oid, p, cur_object))
+		die("failed to locate object %d in packfile", cur_object);
+
+	entry->pack_int_id = pack_int_id;
+	entry->pack_mtime = p->mtime;
+
+	entry->offset = nth_packed_object_offset(p, cur_object);
+}
+
+/*
+ * It is possible to artificially get into a state where there are many
+ * duplicate copies of objects. That can create high memory pressure if
+ * we are to create a list of all objects before de-duplication. To reduce
+ * this memory pressure without a significant performance drop, automatically
+ * group objects by the first byte of their object id. Use the IDX fanout
+ * tables to group the data, copy to a local array, then sort.
+ *
+ * Copy only the de-duplicated entries (selected by most-recent modified time
+ * of a packfile containing the object).
+ */
+static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+						  uint32_t *perm,
+						  uint32_t nr_packs,
+						  uint32_t *nr_objects)
+{
+	uint32_t cur_fanout, cur_pack, cur_object;
+	uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
+	struct pack_midx_entry *entries_by_fanout = NULL;
+	struct pack_midx_entry *deduplicated_entries = NULL;
+
+	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (open_pack_index(p[cur_pack]))
+			continue;
+
+		total_objects += p[cur_pack]->num_objects;
+	}
+
+	/*
+	 * As we de-duplicate by fanout value, we expect the fanout
+	 * slices to be evenly distributed, with some noise. Hence,
+	 * allocate slightly more than one 256th.
+	 */
+	alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
+
+	ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
+	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
+	*nr_objects = 0;
+
+	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
+		nr_fanout = 0;
+
+		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
+			end = get_pack_fanout(p[cur_pack], cur_fanout);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				nr_fanout++;
+			}
+		}
+
+		QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
+
+		/*
+		 * The batch is now sorted by OID and then mtime (descending).
+		 * Take only the first duplicate.
+		 */
+		for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
+			if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
+						  &entries_by_fanout[cur_object].oid))
+				continue;
+
+			ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
+			memcpy(&deduplicated_entries[*nr_objects],
+			       &entries_by_fanout[cur_object],
+			       sizeof(struct pack_midx_entry));
+			(*nr_objects)++;
+		}
+	}
+
+	FREE_AND_NULL(entries_by_fanout);
+	return deduplicated_entries;
+}
+
 static size_t write_midx_pack_lookup(struct hashfile *f,
 				     char **pack_names,
 				     uint32_t nr_packs)
@@ -254,6 +389,7 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	uint32_t nr_entries;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -312,6 +448,8 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
+	get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
-- 
2.18.0.rc1


* [PATCH 12/23] midx: write object ids in a chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (10 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:25   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
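For illustration only: since the OID lookup chunk holds N fixed-width
ids in lexicographic order, a reader could find an object with a plain
binary search. `oid_chunk_find` is a hypothetical helper, not something
this patch adds, and it assumes 20-byte hashes as in MIDX_HASH_LEN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* mirrors MIDX_HASH_LEN in this series */

/*
 * Hypothetical helper: binary search over `nr` fixed-width ids stored
 * back to back in `chunk`. Returns 1 and sets *pos on a hit, else 0.
 */
static int oid_chunk_find(const unsigned char *chunk, uint32_t nr,
			  const unsigned char *oid, uint32_t *pos)
{
	uint32_t lo = 0, hi = nr;

	assert(chunk || !nr);
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, chunk + (size_t)mid * HASH_LEN, HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

The position returned is the row number of the object, which is what
later per-object chunks would be indexed by.
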
 Documentation/technical/pack-format.txt |  4 ++
 builtin/midx.c                          |  2 +
 midx.c                                  | 50 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/t5319-midx.sh                         |  4 +-
 5 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 29bf87283a..de9ac778b6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,10 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index 3a261e9bbf..86edd30174 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
 		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_lookup)
+		printf(" oid_lookup");
 
 	printf("\n");
 
diff --git a/midx.c b/midx.c
index b20d52713c..d06bc6876a 100644
--- a/midx.c
+++ b/midx.c
@@ -14,10 +14,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 2
+#define MIDX_MAX_CHUNKS 3
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
 static char *get_midx_filename(const char *object_dir)
@@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				m->chunk_oid_lookup = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die("terminating MIDX chunk id appears earlier than expected");
 				break;
@@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
+	if (!m->chunk_oid_lookup)
+		die("MIDX missing required OID lookup chunk");
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 	for (i = 0; i < m->num_packs; i++) {
@@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		if (i < nr_objects - 1) {
+			struct pack_midx_entry *next = list;
+			if (oidcmp(&obj->oid, &next->oid) >= 0)
+				BUG("OIDs not in order: %s >= %s",
+				oid_to_hex(&obj->oid),
+				oid_to_hex(&next->oid));
+		}
+
+		hashwrite(f, obj->oid.hash, (int)hash_len);
+		written += hash_len;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	struct pack_midx_entry *entries;
 	uint32_t nr_entries;
 
 	midx_name = get_midx_filename(object_dir);
@@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
-	get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 2;
+	num_chunks = 3;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -467,9 +501,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -503,6 +541,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 1ba50459ca..7d14d3586e 100644
--- a/object-store.h
+++ b/object-store.h
@@ -102,6 +102,7 @@ struct midxed_git {
 
 	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
+	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index a31c387c8f..e71aa52b80 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 2 $NUM_PACKS
-	chunks: pack_lookup pack_names
+	header: 4d494458 1 1 3 $NUM_PACKS
+	chunks: pack_lookup pack_names oid_lookup
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
-- 
2.18.0.rc1


* [PATCH 13/23] midx: write object id fanout chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (11 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:28   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
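A small sketch of the fanout idea, assuming the ids are already sorted:
F[i] counts the ids whose first byte is at most i, so the ids starting
with byte b occupy rows F[b-1] up to (but not including) F[b]. The
helpers `build_fanout` and `fanout_range` are hypothetical and work on
host-order values; the chunk on disk stores each entry big-endian.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a cumulative fanout table from the first bytes of a sorted id
 * list: fanout[i] is the number of ids whose first byte is <= i.
 */
static void build_fanout(const unsigned char *first_bytes, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i, pos = 0;

	for (i = 0; i < 256; i++) {
		while (pos < nr && first_bytes[pos] == i)
			pos++;
		fanout[i] = pos;
	}
	/* Holds only if the input was sorted. */
	assert(pos == nr);
}

/* Half-open range [*start, *end) of ids whose first byte is `byte`. */
static void fanout_range(const uint32_t fanout[256], unsigned char byte,
			 uint32_t *start, uint32_t *end)
{
	*start = byte ? fanout[byte - 1] : 0;
	*end = fanout[byte];
}
```

A lookup can then restrict its binary search to that range, which is
where the "eight extra binary search iterations" saving comes from
(256 = 2^8 buckets).
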
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c                          |  4 +-
 midx.c                                  | 53 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/t5319-midx.sh                         | 18 +++++----
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index de9ac778b6..77e88f85e4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,11 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects (N).
+
 	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
diff --git a/builtin/midx.c b/builtin/midx.c
index 86edd30174..e1fd0e0de4 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,10 +35,12 @@ static int read_midx_file(const char *object_dir)
 		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_fanout)
+		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
 
-	printf("\n");
+	printf("\nnum_objects: %d\n", m->num_objects);
 
 	printf("packs:\n");
 	for (i = 0; i < m->num_packs; i++)
diff --git a/midx.c b/midx.c
index d06bc6876a..9458ced208 100644
--- a/midx.c
+++ b/midx.c
@@ -14,12 +14,14 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 3
+#define MIDX_MAX_CHUNKS 4
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -96,6 +98,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
@@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
+	if (!m->chunk_oid_fanout)
+		die("MIDX missing required OID fanout chunk");
 	if (!m->chunk_oid_lookup)
 		die("MIDX missing required OID lookup chunk");
 
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 	for (i = 0; i < m->num_packs; i++) {
 		if (i) {
@@ -377,6 +387,35 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_fanout(struct hashfile *f,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	struct pack_midx_entry *last = objects + nr_objects;
+	uint32_t count = 0;
+	uint32_t i;
+
+	/*
+	 * Write the first-level table (the list is sorted,
+	 * but we use a 256-entry lookup to be able to avoid
+	 * having to do eight extra binary search iterations).
+	 */
+	for (i = 0; i < 256; i++) {
+		struct pack_midx_entry *next = list;
+
+		while (next < last && next->oid.hash[0] == i) {
+			count++;
+			next++;
+		}
+
+		hashwrite_be32(f, count);
+		list = next;
+	}
+
+	return MIDX_CHUNK_FANOUT_SIZE;
+}
+
 static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 				    struct pack_midx_entry *objects,
 				    uint32_t nr_objects)
@@ -489,7 +528,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 3;
+	num_chunks = 4;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
@@ -541,6 +584,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				written += write_midx_oid_fanout(f, entries, nr_entries);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
diff --git a/object-store.h b/object-store.h
index 7d14d3586e..c613ff2571 100644
--- a/object-store.h
+++ b/object-store.h
@@ -102,6 +102,7 @@ struct midxed_git {
 
 	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
+	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index e71aa52b80..d4ae988479 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -5,9 +5,11 @@ test_description='multi-pack-indexes'
 
 midx_read_expect() {
 	NUM_PACKS=$1
+	NUM_OBJECTS=$2
 	cat >expect <<- EOF
-	header: 4d494458 1 1 3 $NUM_PACKS
-	chunks: pack_lookup pack_names oid_lookup
+	header: 4d494458 1 1 4 $NUM_PACKS
+	chunks: pack_lookup pack_names oid_fanout oid_lookup
+	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
@@ -23,7 +25,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect 0
+	midx_read_expect 0 0
 '
 
 test_expect_success 'create objects' '
@@ -54,18 +56,18 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'Add more objects' '
-	for i in `test_seq 6 5`
+	for i in `test_seq 6 10`
 	do
 		iii=$(printf '%03i' $i)
 		test-tool genrandom "bar" 200 > wide_delta_$iii &&
@@ -92,7 +94,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect 2
+	midx_read_expect 2 33
 '
 
 test_expect_success 'Add more packs' '
@@ -123,7 +125,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect 12
+	midx_read_expect 12 73
 '
 
 test_done
-- 
2.18.0.rc1


* [PATCH 14/23] midx: write object offsets
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (12 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:41   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The final pair of chunks for the multi-pack-index (MIDX) file stores the
object offsets. We default to using 32-bit offsets as in the pack-index
version 1 format, but if any offset does not fit in 32 bits, we
use a trick similar to the pack-index version 2 format: store all
offsets of at least 2^31 in a 64-bit table, and use the 32-bit table to
point into that 64-bit table as necessary.

We only store these 64-bit offsets if necessary, so create a test that
manipulates a version 2 pack-index to fake a large offset. This allows
us to test that the large offset table is created, but the data does not
match the actual packfile offsets. The MIDX offset does match the
(corrupted) pack-index offset, so a later commit will compare these
offsets during a 'verify' step.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
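The offset scheme from the commit message, sketched on plain arrays
rather than on a hashfile. `encode_offsets` and `decode_offset` are
hypothetical names used only here; the sketch assumes the caller sizes
`large` for the worst case and works on host-order values, whereas the
chunks on disk are big-endian.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_OFFSET_NEEDED 0x80000000u /* mirrors MIDX_LARGE_OFFSET_NEEDED */

/*
 * Encode `nr` offsets into a 32-bit table plus an optional 64-bit
 * table. Returns the number of rows written to `large`, which is 0
 * when every offset fits in 32 bits.
 */
static uint32_t encode_offsets(const uint64_t *offsets, uint32_t nr,
			       uint32_t *small, uint64_t *large)
{
	uint32_t i, nr_large = 0;
	int large_needed = 0;

	for (i = 0; i < nr; i++)
		if (offsets[i] > 0xffffffffu)
			large_needed = 1;

	for (i = 0; i < nr; i++) {
		if (large_needed && (offsets[i] >> 31)) {
			/* High bit set: the low 31 bits are a row number. */
			small[i] = LARGE_OFFSET_NEEDED | nr_large;
			large[nr_large++] = offsets[i];
		} else {
			small[i] = (uint32_t)offsets[i];
		}
	}
	return nr_large;
}

/* Decode one entry; nr_large == 0 means there is no 64-bit table. */
static uint64_t decode_offset(uint32_t value, const uint64_t *large,
			      uint32_t nr_large)
{
	if (nr_large && (value & LARGE_OFFSET_NEEDED)) {
		uint32_t row = value & ~LARGE_OFFSET_NEEDED;

		assert(row < nr_large);
		return large[row];
	}
	return value;
}
```

Note the asymmetry: an offset between 2^31 and 2^32-1 is stored directly
when no large table exists, but moves into the large table as soon as
any single offset needs more than 32 bits.
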
 Documentation/technical/pack-format.txt |  15 +++-
 builtin/midx.c                          |   4 +
 midx.c                                  | 100 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/t5319-midx.sh                         |  45 ++++++++---
 5 files changed, 151 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 77e88f85e4..0256cfb5e0 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -316,7 +316,20 @@ CHUNK DATA:
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
 
-	(This section intentionally left incomplete.)
+	Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^32, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the 31st bit is on, then removing that bit reveals
+		the row in the large offsets containing the 8-byte offset of
+		this object.
+
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
 
 TRAILER:
 
diff --git a/builtin/midx.c b/builtin/midx.c
index e1fd0e0de4..607d2b3544 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -39,6 +39,10 @@ static int read_midx_file(const char *object_dir)
 		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
+	if (m->chunk_object_offsets)
+		printf(" object_offsets");
+	if (m->chunk_large_offsets)
+		printf(" large_offsets");
 
 	printf("\nnum_objects: %d\n", m->num_objects);
 
diff --git a/midx.c b/midx.c
index 9458ced208..a49300bf75 100644
--- a/midx.c
+++ b/midx.c
@@ -14,14 +14,19 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 4
+#define MIDX_MAX_CHUNKS 6
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -106,6 +111,14 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				m->chunk_object_offsets = m->data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				m->chunk_large_offsets = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die("terminating MIDX chunk id appears earlier than expected");
 				break;
@@ -127,6 +140,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required OID fanout chunk");
 	if (!m->chunk_oid_lookup)
 		die("MIDX missing required OID lookup chunk");
+	if (!m->chunk_object_offsets)
+		die("MIDX missing required object offsets chunk");
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
@@ -442,6 +457,56 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 	return written;
 }
 
+static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i, nr_large_offset = 0;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		hashwrite_be32(f, obj->pack_int_id);
+
+		if (large_offset_needed && obj->offset >> 31)
+			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
+		else if (!large_offset_needed && obj->offset >> 32)
+			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
+			    oid_to_hex(&obj->oid),
+			    obj->offset);
+		else
+			hashwrite_be32(f, (uint32_t)obj->offset);
+
+		written += MIDX_CHUNK_OFFSET_WIDTH;
+	}
+
+	return written;
+}
+
+static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
+				       struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	size_t written = 0;
+
+	while (nr_large_offset) {
+		struct pack_midx_entry *obj = list++;
+		uint64_t offset = obj->offset;
+
+		if (!(offset >> 31))
+			continue;
+
+		hashwrite_be32(f, offset >> 32);
+		hashwrite_be32(f, offset & 0xffffffff);
+		written += 2 * sizeof(uint32_t);
+
+		nr_large_offset--;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -462,7 +527,8 @@ int write_midx_file(const char *object_dir)
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct pack_midx_entry *entries;
-	uint32_t nr_entries;
+	uint32_t nr_entries, num_large_offsets = 0;
+	int large_offsets_needed = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -522,13 +588,19 @@ int write_midx_file(const char *object_dir)
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
 	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	for (i = 0; i < nr_entries; i++) {
+		if (entries[i].offset > 0x7fffffff)
+			num_large_offsets++;
+		if (entries[i].offset > 0xffffffff)
+			large_offsets_needed = 1;
+	}
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 4;
+	num_chunks = large_offsets_needed ? 6 : 5;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -548,9 +620,21 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
 
+	cur_chunk++;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	if (large_offsets_needed) {
+		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
+
+		cur_chunk++;
+		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
+					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+	}
+
+	chunk_ids[cur_chunk] = 0;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -592,6 +676,14 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index c613ff2571..9b671f1b0a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -104,6 +104,8 @@ struct midxed_git {
 	const unsigned char *chunk_pack_names;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_object_offsets;
+	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index d4ae988479..709652c635 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,18 +6,21 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
+	NUM_CHUNKS=$3
+	OBJECT_DIR=$4
+	EXTRA_CHUNKS="$5"
 	cat >expect <<- EOF
-	header: 4d494458 1 1 4 $NUM_PACKS
-	chunks: pack_lookup pack_names oid_fanout oid_lookup
+	header: 4d494458 1 1 $NUM_CHUNKS $NUM_PACKS
+	chunks: pack_lookup pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
 	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
 	then
-		ls pack/ | grep idx | sort >> expect
+		ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
 	fi
-	printf "object_dir: .\n" >>expect &&
-	git midx read --object-dir=. >actual &&
+	printf "object_dir: $OBJECT_DIR\n" >>expect &&
+	git midx read --object-dir=$OBJECT_DIR >actual &&
 	test_cmp expect actual
 }
 
@@ -25,7 +28,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect 0 0
+	midx_read_expect 0 0 5 .
 '
 
 test_expect_success 'create objects' '
@@ -56,14 +59,14 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 5 .
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 5 .
 '
 
 test_expect_success 'Add more objects' '
@@ -94,7 +97,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect 2 33
+	midx_read_expect 2 33 5 .
 '
 
 test_expect_success 'Add more packs' '
@@ -125,7 +128,29 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect 12 73
+	midx_read_expect 12 73 5 .
+'
+
+
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+# Force 64-bit offsets by manipulating the idx file.
+# This makes the IDX file _incorrect_ so be careful to clean up after!
+test_expect_success 'force some 64-bit offsets with pack-objects' '
+	mkdir objects64 &&
+	mkdir objects64/pack &&
+	pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
+	idx64=objects64/pack/test-64-$pack64.idx &&
+	chmod u+w $idx64 &&
+	corrupt_data $idx64 2899 "\02" &&
+	midx64=$(git midx write --object-dir=objects64) &&
+	midx_read_expect 1 62 6 objects64 " large_offsets"
 '
 
 test_done
-- 
2.18.0.rc1
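
A note on the offset encoding used by the writer above, as a minimal
standalone sketch rather than the code from the patch. It assumes the
flag MIDX_LARGE_OFFSET_NEEDED is the top bit (0x80000000) and leaves out
the pack-int-id half of each entry. All names ending in _sketch are
hypothetical. Offsets are redirected into the 8-byte table only when at
least one offset in the whole set needs more than 32 bits; otherwise bit
31 is stored as ordinary offset data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of MIDX_LARGE_OFFSET_NEEDED; not shown in this patch. */
#define SKETCH_LARGE_OFFSET_FLAG 0x80000000u

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32_sketch(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Encode nr offsets into 4-byte slots ("slots", 4 * nr bytes) and, when
 * required, 8-byte entries ("large", up to 8 * nr bytes).
 * Returns the number of 8-byte entries written.
 */
size_t encode_offsets_sketch(const uint64_t *offsets, size_t nr,
			     unsigned char *slots, unsigned char *large)
{
	int large_needed = 0;
	size_t i, nr_large = 0;

	/* Pass 1: does any offset need more than 32 bits? */
	for (i = 0; i < nr; i++)
		if (offsets[i] >> 32)
			large_needed = 1;

	/* Pass 2: write each slot, redirecting wide offsets. */
	for (i = 0; i < nr; i++) {
		if (large_needed && (offsets[i] >> 31)) {
			/* The slot can only index 2^31 large entries. */
			assert(nr_large < SKETCH_LARGE_OFFSET_FLAG);
			put_be32_sketch(slots + 4 * i,
					SKETCH_LARGE_OFFSET_FLAG | (uint32_t)nr_large);
			put_be32_sketch(large + 8 * nr_large,
					(uint32_t)(offsets[i] >> 32));
			put_be32_sketch(large + 8 * nr_large + 4,
					(uint32_t)(offsets[i] & 0xffffffffu));
			nr_large++;
		} else {
			put_be32_sketch(slots + 4 * i, (uint32_t)offsets[i]);
		}
	}
	return nr_large;
}
```

With offsets {5, 0x80000000, 0x200000000} the second and third values
both land in the large table, because once the table exists every offset
with bit 31 set has to go through it.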


* [PATCH 15/23] midx: create core.midx config setting
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (13 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The core.midx config setting controls the multi-pack-index (MIDX)
feature. If false, the setting will disable all reads from the
multi-pack-index file.

Add comparison commands in t5319-midx.sh to check that typical Git
behavior stays the same whether the config setting is turned on or off.
This currently includes 'git rev-list' and 'git log' commands to trigger
several object database reads.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
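For readers unfamiliar with how a boolean setting like this is consumed,
here is a minimal standalone sketch. It is not Git's config code: the
names ending in _sketch are hypothetical, and the parser only covers the
textual spellings true/yes/on and false/no/off (Git also accepts
integers, which this sketch leaves out).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive string equality (hypothetical helper). */
static int eq_nocase_sketch(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;
}

/* Returns 1 or 0 for recognized spellings, -1 for anything else. */
int parse_bool_sketch(const char *value)
{
	if (!value)
		return 1;	/* key present without "= value" */
	if (!*value)
		return 0;	/* empty string */
	if (eq_nocase_sketch(value, "true") ||
	    eq_nocase_sketch(value, "yes") ||
	    eq_nocase_sketch(value, "on"))
		return 1;
	if (eq_nocase_sketch(value, "false") ||
	    eq_nocase_sketch(value, "no") ||
	    eq_nocase_sketch(value, "off"))
		return 0;
	return -1;
}

/* Stand-in for the global flag; it defaults to off. */
static int core_midx_sketch;

/* Config callback: returns 0 on success, -1 on a bad value. */
int config_cb_sketch(const char *var, const char *value)
{
	assert(var);
	if (!strcmp(var, "core.midx")) {
		int v = parse_bool_sketch(value);
		if (v < 0)
			return -1;
		core_midx_sketch = v;
	}
	return 0;
}

/* Readers of the multi-pack-index would check this before loading. */
int midx_enabled_sketch(void)
{
	return core_midx_sketch;
}
```

The flag defaults to zero, so in this sketch the feature stays off
unless the setting is explicitly enabled.
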
 Documentation/config.txt |  4 +++
 cache.h                  |  1 +
 config.c                 |  5 ++++
 environment.c            |  1 +
 t/t5319-midx.sh          | 57 ++++++++++++++++++++++++++++++++--------
 5 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..e78150e452 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -908,6 +908,10 @@ core.commitGraph::
 	Enable git commit graph feature. Allows reading from the
 	commit-graph file.
 
+core.midx::
+	Enable multi-pack-index feature. Allows reading from the
+	multi-pack-index file.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 89a107a7f7..c7967f7643 100644
--- a/cache.h
+++ b/cache.h
@@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
+extern int core_midx;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index fbbf0f8e9f..0df3dbdf74 100644
--- a/config.c
+++ b/config.c
@@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.midx")) {
+		core_midx = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2a6de2330b..dcb4417604 100644
--- a/environment.c
+++ b/environment.c
@@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
+int core_midx;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 709652c635..1a50987778 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,6 +3,8 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+objdir=.git/objects
+
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
@@ -62,13 +64,42 @@ test_expect_success 'write midx with one v1 pack' '
 	midx_read_expect 1 17 5 .
 '
 
+midx_git_two_modes() {
+	git -c core.midx=false $1 >expect &&
+	git -c core.midx=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
 test_expect_success 'write midx with one v2 pack' '
-	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
-	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
-	git midx --object-dir=. write &&
-	midx_read_expect 1 17 5 .
+	pack=$(git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list) &&
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 1 17 5 $objdir
 '
 
+midx_git_two_modes() {
+	git -c core.midx=false $1 >expect &&
+	git -c core.midx=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
+compare_results_with_midx "one v2 pack"
+
 test_expect_success 'Add more objects' '
 	for i in `test_seq 6 10`
 	do
@@ -94,12 +125,13 @@ test_expect_success 'Add more objects' '
 '
 
 test_expect_success 'write midx with two packs' '
-	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
-	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
-	git midx --object-dir=. write &&
-	midx_read_expect 2 33 5 .
+	pack2=$(git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list2) &&
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 2 33 5 $objdir
 '
 
+compare_results_with_midx "two packs"
+
 test_expect_success 'Add more packs' '
 	for j in `test_seq 1 10`
 	do
@@ -120,17 +152,20 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 pack/test-pack <obj-list &&
+		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
 '
 
+compare_results_with_midx "mixed mode (two packs + extra)"
+
 test_expect_success 'write midx with twelve packs' '
-	git midx --object-dir=. write &&
-	midx_read_expect 12 73 5 .
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 12 73 5 $objdir
 '
 
+compare_results_with_midx "twelve packs"
 
 # usage: corrupt_data <file> <pos> [<data>]
 corrupt_data() {
-- 
2.18.0.rc1


* [PATCH 16/23] midx: prepare midxed_git struct
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (14 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:47   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
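The idea in this patch is "load at most one multi-pack-index per object
directory and keep them on a linked list". A minimal standalone sketch
of that pattern follows. The struct, the loader stub and every name
ending in _sketch are hypothetical stand-ins, not the code in midx.c.
The sketch only touches the list head once loading has succeeded, so a
failed load leaves previously loaded entries reachable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct midx_sketch {
	struct midx_sketch *next;
	char *object_dir;
};

/* Hypothetical loader: the real one would map and parse a file. */
static struct midx_sketch *load_sketch(const char *object_dir)
{
	struct midx_sketch *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->object_dir = malloc(strlen(object_dir) + 1);
	if (!m->object_dir) {
		free(m);
		return NULL;
	}
	strcpy(m->object_dir, object_dir);
	return m;
}

/*
 * Returns 1 if an index for object_dir is on the list afterwards,
 * 0 if the feature is disabled or loading failed.
 */
int prepare_one_sketch(struct midx_sketch **head, const char *object_dir,
		       int enabled)
{
	struct midx_sketch *m;

	assert(head && object_dir);
	if (!enabled)
		return 0;

	/* Already loaded for this directory? Then there is nothing to do. */
	for (m = *head; m; m = m->next)
		if (!strcmp(object_dir, m->object_dir))
			return 1;

	m = load_sketch(object_dir);
	if (!m)
		return 0;

	/* Prepend only after a successful load. */
	m->next = *head;
	*head = m;
	return 1;
}
```

Calling it twice with the same directory leaves a single list entry,
which is what lets the caller invoke it for the main object directory
and for each alternate without tracking what was already loaded.
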
 midx.c         | 22 ++++++++++++++++++++++
 midx.h         |  2 ++
 object-store.h |  7 +++++++
 packfile.c     |  6 +++++-
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index a49300bf75..5e9290ca8f 100644
--- a/midx.c
+++ b/midx.c
@@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	exit(1);
 }
 
+int prepare_midxed_git_one(struct repository *r, const char *object_dir)
+{
+	struct midxed_git *m = r->objects->midxed_git;
+	struct midxed_git *m_search;
+
+	if (!core_midx)
+		return 0;
+
+	for (m_search = m; m_search; m_search = m_search->next)
+		if (!strcmp(object_dir, m_search->object_dir))
+			return 1;
+
+	m_search = load_midxed_git(object_dir);
+	if (m_search) {
+		m_search->next = m;
+		r->objects->midxed_git = m_search;
+		return 1;
+	}
+
+	return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index a1d18ed991..793203fc4a 100644
--- a/midx.h
+++ b/midx.h
@@ -5,8 +5,10 @@
 #include "cache.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 9b671f1b0a..7908d46e34 100644
--- a/object-store.h
+++ b/object-store.h
@@ -130,6 +130,13 @@ struct raw_object_store {
 	 */
 	struct oidmap *replace_map;
 
+	/*
+	 * private data
+	 *
+	 * should only be accessed directly by packfile.c and midx.c
+	 */
+	struct midxed_git *midxed_git;
+
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 1a714fbde9..b91ca9b9f5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
 
 	if (r->objects->packed_git_initialized)
 		return;
+	prepare_midxed_git_one(r, r->objects->objectdir);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+		prepare_midxed_git_one(r, alt->path);
 		prepare_packed_git_one(r, alt->path, 0);
+	}
 	rearrange_packed_git(r);
 	prepare_packed_git_mru(r);
 	r->objects->packed_git_initialized = 1;
-- 
2.18.0.rc1


* [PATCH 17/23] midx: read objects from multi-pack-index
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (15 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:56   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
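To make the on-disk lookup concrete, here is a minimal standalone sketch
of decoding one entry of the object-offsets chunk. It assumes each entry
is 8 bytes (a 4-byte pack-int-id followed by a 4-byte offset, both
big-endian) and that MIDX_LARGE_OFFSET_NEEDED is the top bit. The names
ending in _sketch are hypothetical and this is not the code in midx.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LARGE_OFFSET_FLAG 0x80000000u	/* assumed flag value */
#define SKETCH_ENTRY_WIDTH 8			/* pack-int-id + offset */

struct midx_pos_sketch {
	uint32_t pack_int_id;
	uint64_t offset;
};

/* Read a 32-bit big-endian value. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode entry number pos. large_offsets may be NULL when the file has
 * no large-offset chunk; in that case bit 31 is plain offset data.
 */
struct midx_pos_sketch decode_entry_sketch(const unsigned char *object_offsets,
					   const unsigned char *large_offsets,
					   size_t pos)
{
	const unsigned char *slot;
	struct midx_pos_sketch e;
	uint32_t off32;

	assert(object_offsets);
	slot = object_offsets + pos * SKETCH_ENTRY_WIDTH;
	e.pack_int_id = get_be32_sketch(slot);
	off32 = get_be32_sketch(slot + 4);

	if (large_offsets && (off32 & SKETCH_LARGE_OFFSET_FLAG)) {
		/* The low 31 bits index the table of 8-byte offsets. */
		size_t idx = off32 ^ SKETCH_LARGE_OFFSET_FLAG;
		const unsigned char *wide = large_offsets + 8 * idx;

		e.offset = ((uint64_t)get_be32_sketch(wide) << 32) |
			   get_be32_sketch(wide + 4);
	} else {
		e.offset = off32;
	}
	return e;
}
```

A caller would first binary-search the OID lookup chunk for the position,
then decode that position as above to learn which pack to open and where
in the pack to read.
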
 midx.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++--
 midx.h         |  2 ++
 object-store.h |  1 +
 packfile.c     |  8 ++++-
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 5e9290ca8f..6eca8f1b12 100644
--- a/midx.c
+++ b/midx.c
@@ -3,6 +3,7 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "sha1-lookup.h"
 #include "object-store.h"
 #include "packfile.h"
 #include "midx.h"
@@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
 	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
 	strcpy(m->object_dir, object_dir);
-	m->data = midx_map;
+	m->data = (const unsigned char *)midx_map;
 
 	m->signature = get_be32(m->data);
 	if (m->signature != MIDX_SIGNATURE) {
@@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
-	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+	m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
+
+	ALLOC_ARRAY(m->pack_names, m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (i) {
 			if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
@@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	exit(1);
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->num_packs)
+		BUG("bad pack-int-id");
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addstr(&pack_name, m->object_dir);
+	strbuf_addstr(&pack_name, "/pack/");
+	strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result)
+{
+	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
+			    MIDX_HASH_LEN, result);
+}
+
+static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+{
+	const unsigned char *offset_data;
+	uint32_t offset32;
+
+	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset32 = get_be32(offset_data + sizeof(uint32_t));
+
+	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
+		if (sizeof(off_t) < sizeof(uint64_t))
+			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
+
+		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
+	}
+
+	return offset32;
+}
+
+static uint32_t nth_midxed_pack_int_id(struct midxed_git *m, uint32_t pos)
+{
+	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+}
+
+static int nth_midxed_pack_entry(struct midxed_git *m, struct pack_entry *e, uint32_t pos)
+{
+	uint32_t pack_int_id;
+	struct packed_git *p;
+
+	if (pos >= m->num_objects)
+		return 0;
+
+	pack_int_id = nth_midxed_pack_int_id(m, pos);
+
+	if (prepare_midx_pack(m, pack_int_id))
+		die(_("error preparing packfile from multi-pack-index"));
+	p = m->packs[pack_int_id];
+
+	/*
+	 * We are about to tell the caller where they can locate the
+	 * requested object.  We better make sure the packfile is
+	 * still here and can be accessed before supplying that
+	 * answer, as it may have been deleted since the MIDX was
+	 * loaded!
+	 */
+	if (!is_pack_valid(p))
+		return 0;
+
+	e->offset = nth_midxed_offset(m, pos);
+	e->p = p;
+
+	return 1;
+}
+
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m)
+{
+	uint32_t pos;
+
+	if (!bsearch_midx(oid, m, &pos))
+		return 0;
+
+	return nth_midxed_pack_entry(m, e, pos);
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
 	struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 793203fc4a..0c66812229 100644
--- a/midx.h
+++ b/midx.h
@@ -8,6 +8,8 @@
 #include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/object-store.h b/object-store.h
index 7908d46e34..5af2a852bc 100644
--- a/object-store.h
+++ b/object-store.h
@@ -108,6 +108,7 @@ struct midxed_git {
 	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
+	struct packed_git **packs;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/packfile.c b/packfile.c
index b91ca9b9f5..73f8cc28ee 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1857,11 +1857,17 @@ static int fill_pack_entry(const struct object_id *oid,
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
+	struct midxed_git *m;
 
 	prepare_packed_git(r);
-	if (!r->objects->packed_git)
+	if (!r->objects->packed_git && !r->objects->midxed_git)
 		return 0;
 
+	for (m = r->objects->midxed_git; m; m = m->next) {
+		if (fill_midx_entry(oid, e, m))
+			return 1;
+	}
+
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
 		if (fill_pack_entry(oid, e, p)) {
-- 
2.18.0.rc1


* [PATCH 18/23] midx: use midx in abbreviation calculations
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (16 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:01   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
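The abbreviation-length calculation relies on the object list being
sorted: only the immediate neighbors of the lookup position can share
the longest prefix with the target. A minimal standalone sketch of that
idea follows, using sorted lowercase hex strings instead of binary
object IDs. All names ending in _sketch are hypothetical; this is not
the code in sha1-name.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of leading characters that a and b have in common. */
static size_t common_prefix_sketch(const char *a, const char *b)
{
	size_t i = 0;

	while (a[i] && a[i] == b[i])
		i++;
	return i;
}

/*
 * Binary search. Returns 1 and sets *pos to the index on a hit;
 * returns 0 and sets *pos to the insertion point on a miss.
 */
static int bsearch_sketch(const char *const *sorted, size_t nr,
			  const char *target, size_t *pos)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mi = lo + (hi - lo) / 2;
		int cmp = strcmp(target, sorted[mi]);

		if (!cmp) {
			*pos = mi;
			return 1;
		}
		if (cmp < 0)
			hi = mi;
		else
			lo = mi + 1;
	}
	*pos = lo;
	return 0;
}

/* Grow *len so that target differs from other within *len characters. */
static void extend_sketch(const char *target, const char *other, size_t *len)
{
	size_t need = common_prefix_sketch(target, other) + 1;

	if (need > *len)
		*len = need;
}

/* Shortest prefix length (at least min_len) that is unique in "sorted". */
size_t abbrev_len_sketch(const char *const *sorted, size_t nr,
			 const char *target, size_t min_len)
{
	size_t first, len = min_len;
	int match;

	assert(min_len > 0);
	if (!nr)
		return len;

	match = bsearch_sketch(sorted, nr, target, &first);
	if (!match) {
		/* A miss: the entry at the insertion point is a neighbor. */
		if (first < nr)
			extend_sketch(target, sorted[first], &len);
	} else if (first + 1 < nr) {
		/* A hit: skip the entry itself, look at the next one. */
		extend_sketch(target, sorted[first + 1], &len);
	}
	if (first > 0)
		extend_sketch(target, sorted[first - 1], &len);
	return len;
}
```

At most two comparisons are needed per index after the binary search,
which is why doing this once against a single multi-pack-index is
cheaper than repeating it for every packfile.
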
 midx.c          | 11 ++++++++
 midx.h          |  3 +++
 packfile.c      |  6 +++++
 packfile.h      |  1 +
 sha1-name.c     | 70 +++++++++++++++++++++++++++++++++++++++++++++++++
 t/t5319-midx.sh |  3 ++-
 6 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 6eca8f1b12..25d8142c2a 100644
--- a/midx.c
+++ b/midx.c
@@ -203,6 +203,17 @@ int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *re
 			    MIDX_HASH_LEN, result);
 }
 
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct midxed_git *m,
+					uint32_t n)
+{
+	if (n >= m->num_objects)
+		return NULL;
+
+	hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n);
+	return oid;
+}
+
 static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
diff --git a/midx.h b/midx.h
index 0c66812229..497bdcc77c 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,9 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct midxed_git *m,
+					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
diff --git a/packfile.c b/packfile.c
index 73f8cc28ee..638e113972 100644
--- a/packfile.c
+++ b/packfile.c
@@ -919,6 +919,12 @@ struct packed_git *get_packed_git(struct repository *r)
 	return r->objects->packed_git;
 }
 
+struct midxed_git *get_midxed_git(struct repository *r)
+{
+	prepare_packed_git(r);
+	return r->objects->midxed_git;
+}
+
 struct list_head *get_packed_git_mru(struct repository *r)
 {
 	prepare_packed_git(r);
diff --git a/packfile.h b/packfile.h
index e0a38aba93..01e14b93fd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -39,6 +39,7 @@ extern void install_packed_git(struct repository *r, struct packed_git *pack);
 
 struct packed_git *get_packed_git(struct repository *r);
 struct list_head *get_packed_git_mru(struct repository *r);
+struct midxed_git *get_midxed_git(struct repository *r);
 
 /*
  * Give a rough count of objects in the repository. This sacrifices accuracy
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..d975a186c9 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -12,6 +12,7 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "repository.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, const unsigned char *
 	return 1;
 }
 
+static void unique_in_midx(struct midxed_git *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+	num = m->num_objects;
+
+	if (!num)
+		return;
+
+	bsearch_midx(&ds->bin_pfx, m, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx".  See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void unique_in_pack(struct packed_git *p,
 			   struct disambiguate_state *ds)
 {
@@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p,
 
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
+	struct midxed_git *m;
 	struct packed_git *p;
 
+	for (m = get_midxed_git(the_repository); m && !ds->ambiguous;
+	     m = m->next)
+		unique_in_midx(m, ds);
 	for (p = get_packed_git(the_repository); p && !ds->ambiguous;
 	     p = p->next)
 		unique_in_pack(p, ds);
@@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, first = 0;
+	struct object_id oid;
+	const struct object_id *mad_oid;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	mad_oid = mad->oid;
+	match = bsearch_midx(mad_oid, m, &first);
+
+	/*
+	 * first is now the position in the multi-pack-index where we
+	 * would insert mad->oid if it does not exist (or the position
+	 * of mad->oid if it does exist). Hence, we consider a maximum
+	 * of two objects nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		if (nth_midxed_object_oid(&oid, m, first))
+			extend_abbrev_len(&oid, mad);
+	} else if (first < num - 1) {
+		if (nth_midxed_object_oid(&oid, m, first + 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		if (nth_midxed_object_oid(&oid, m, first - 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
+	struct midxed_git *m;
 	struct packed_git *p;
 
+	for (m = get_midxed_git(the_repository); m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = get_packed_git(the_repository); p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 1a50987778..e3873da7d6 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -94,7 +94,8 @@ compare_results_with_midx() {
 	MSG=$1
 	test_expect_success "check normal git operations: $MSG" '
 		midx_git_two_modes "rev-list --objects --all" &&
-		midx_git_two_modes "log --raw"
+		midx_git_two_modes "log --raw" &&
+		midx_git_two_modes "log --oneline"
 	'
 }
 
-- 
2.18.0.rc1


* [PATCH 19/23] midx: use existing midx when writing new one
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (17 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
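When the existing multi-pack-index is reused, its entries are mixed with
entries from newly discovered packs and then de-duplicated. Below is a
minimal standalone sketch of that merge step. It assumes the comparator
orders by object ID, then newest pack mtime first, then pack-int-id, so
that keeping the first entry of each run selects the most recent copy;
entries carried over from the old index get mtime 0 as in this patch.
The 4-byte IDs and all names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HASH_LEN 4	/* shortened for illustration only */

struct midx_entry_sketch {
	unsigned char oid[SKETCH_HASH_LEN];
	uint32_t pack_int_id;
	int64_t pack_mtime;	/* 0 for entries taken from the old index */
};

/* Order by object ID, then newest mtime first, then pack-int-id. */
static int entry_cmp_sketch(const void *_a, const void *_b)
{
	const struct midx_entry_sketch *a = _a;
	const struct midx_entry_sketch *b = _b;
	int cmp = memcmp(a->oid, b->oid, SKETCH_HASH_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	return (a->pack_int_id > b->pack_int_id) -
	       (a->pack_int_id < b->pack_int_id);
}

/*
 * Sort entries in place and drop duplicates, keeping the first entry of
 * each run of equal object IDs. Returns the number of entries kept.
 */
size_t merge_dedup_sketch(struct midx_entry_sketch *entries, size_t nr)
{
	size_t i, out = 0;

	assert(entries || !nr);
	if (!nr)
		return 0;

	qsort(entries, nr, sizeof(*entries), entry_cmp_sketch);
	for (i = 0; i < nr; i++) {
		if (out && !memcmp(entries[out - 1].oid, entries[i].oid,
				   SKETCH_HASH_LEN))
			continue;	/* same object, older or equal copy */
		entries[out++] = entries[i];
	}
	return out;
}
```

With mtime 0 on the carried-over entries, an object that also appears
in a newly added pack is attributed to the new pack in the rewritten
file.
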
 midx.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/midx.c b/midx.c
index 25d8142c2a..388d79b7d9 100644
--- a/midx.c
+++ b/midx.c
@@ -389,6 +389,23 @@ static int midx_oid_compare(const void *_a, const void *_b)
 	return a->pack_int_id - b->pack_int_id;
 }
 
+static int nth_midxed_pack_midx_entry(struct midxed_git *m,
+				      uint32_t *pack_perm,
+				      struct pack_midx_entry *e,
+				      uint32_t pos)
+{
+	if (pos >= m->num_objects)
+		return 1;
+
+	nth_midxed_object_oid(&e->oid, m, pos);
+	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->offset = nth_midxed_offset(m, pos);
+
+	/* consider objects in midx to be from "old" packs */
+	e->pack_mtime = 0;
+	return 0;
+}
+
 static void fill_pack_entry(uint32_t pack_int_id,
 			    struct packed_git *p,
 			    uint32_t cur_object,
@@ -414,7 +431,8 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * Copy only the de-duplicated entries (selected by most-recent modified time
  * of a packfile containing the object).
  */
-static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+static struct pack_midx_entry *get_sorted_entries(struct midxed_git *m,
+						  struct packed_git **p,
 						  uint32_t *perm,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
@@ -423,8 +441,9 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
 	struct pack_midx_entry *entries_by_fanout = NULL;
 	struct pack_midx_entry *deduplicated_entries = NULL;
+	uint32_t start_pack = m ? m->num_packs : 0;
 
-	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 		if (open_pack_index(p[cur_pack]))
 			continue;
 
@@ -445,7 +464,23 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
 		nr_fanout = 0;
 
-		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (m) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]);
+			end = ntohl(m->chunk_oid_fanout[cur_fanout]);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				nth_midxed_pack_midx_entry(m, perm,
+							   &entries_by_fanout[nr_fanout],
+							   cur_object);
+				nr_fanout++;
+			}
+		}
+
+		for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
@@ -654,6 +689,7 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries;
 	uint32_t nr_entries, num_large_offsets = 0;
 	int large_offsets_needed = 0;
+	struct midxed_git *m = NULL;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -662,6 +698,8 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	m = load_midxed_git(object_dir);
+
 	strbuf_addf(&pack_dir, "%s/pack", object_dir);
 	dir = opendir(pack_dir.buf);
 
@@ -676,11 +714,27 @@ int write_midx_file(const char *object_dir)
 	pack_dir_len = pack_dir.len;
 	ALLOC_ARRAY(packs, alloc_packs);
 	ALLOC_ARRAY(pack_names, alloc_pack_names);
+
+	if (m) {
+		for (i = 0; i < m->num_packs; i++) {
+			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
+
+			packs[nr_packs] = NULL;
+			pack_names[nr_packs] = xstrdup(m->pack_names[i]);
+			pack_name_concat_len += strlen(pack_names[nr_packs]) + 1;
+			nr_packs++;
+		}
+	}
+
 	while ((de = readdir(dir)) != NULL) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		if (ends_with(de->d_name, ".idx")) {
+			if (m && midx_contains_pack(m, de->d_name))
+				continue;
+
 			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
 			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
 
@@ -705,6 +759,9 @@ int write_midx_file(const char *object_dir)
 	closedir(dir);
 	strbuf_release(&pack_dir);
 
+	if (m && nr_packs == m->num_packs)
+		goto cleanup;
+
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
 					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
@@ -712,7 +769,7 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
-	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	entries = get_sorted_entries(m, packs, pack_perm, nr_packs, &nr_entries);
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
@@ -823,7 +880,8 @@ int write_midx_file(const char *object_dir)
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
-	for (i = 0; i < nr_packs; i++) {
+cleanup:
+	for (i = m ? m->num_packs : 0; i < nr_packs; i++) {
 		close_pack(packs[i]);
 		FREE_AND_NULL(packs[i]);
 	}
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 20/23] midx: use midx in approximate_object_count
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (18 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:03   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 638e113972..059b2aa097 100644
--- a/packfile.c
+++ b/packfile.c
@@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
 {
 	if (!the_repository->objects->approximate_object_count_valid) {
 		unsigned long count;
+		struct midxed_git *m;
 		struct packed_git *p;
 
 		prepare_packed_git(the_repository);
 		count = 0;
-		for (p = the_repository->objects->packed_git; p; p = p->next) {
+		for (m = get_midxed_git(the_repository); m; m = m->next)
+			count += m->num_objects;
+		for (p = get_packed_git(the_repository); p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
 			count += p->num_objects;
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 21/23] midx: prevent duplicate packfile loads
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (19 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:05   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

If the multi-pack-index contains a packfile, then we do not need to add
that packfile to the packed_git linked list or the MRU list.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     | 23 +++++++++++++++++++++++
 midx.h     |  1 +
 packfile.c |  7 +++++++
 3 files changed, 31 insertions(+)

diff --git a/midx.c b/midx.c
index 388d79b7d9..3242646fe0 100644
--- a/midx.c
+++ b/midx.c
@@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mi
 	return nth_midxed_pack_entry(m, e, pos);
 }
 
+int midx_contains_pack(struct midxed_git *m, const char *idx_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(idx_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
 	struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 497bdcc77c..c1db58d8c4 100644
--- a/midx.h
+++ b/midx.h
@@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct midxed_git *m,
 					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
+int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/packfile.c b/packfile.c
index 059b2aa097..479cb69b9f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	DIR *dir;
 	struct dirent *de;
 	struct string_list garbage = STRING_LIST_INIT_DUP;
+	struct midxed_git *m = r->objects->midxed_git;
+
+	/* look for the multi-pack-index for this object directory */
+	while (m && strcmp(m->object_dir, objdir))
+		m = m->next;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 		base_len = path.len;
 		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
 			/* Don't reopen a pack we already have. */
+			if (m && midx_contains_pack(m, de->d_name))
+				continue;
 			for (p = r->objects->packed_git; p;
 			     p = p->next) {
 				size_t len;
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 22/23] midx: use midx to find ref-deltas
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (20 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     |  2 +-
 midx.h     |  1 +
 packfile.c | 15 +++++++++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 3242646fe0..e46f392fa4 100644
--- a/midx.c
+++ b/midx.c
@@ -214,7 +214,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
-static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
 	uint32_t offset32;
diff --git a/midx.h b/midx.h
index c1db58d8c4..6996b5ff6b 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,7 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t n);
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct midxed_git *m,
 					uint32_t n);
diff --git a/packfile.c b/packfile.c
index 479cb69b9f..9b814c89c7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1794,6 +1794,21 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	uint32_t result;
 
 	if (!index) {
+		/*
+		 * If we have a MIDX, then we want to
+		 * check the MIDX for the offset instead.
+		 */
+		struct midxed_git *m;
+
+		for (m = get_midxed_git(the_repository); m; m = m->next) {
+			if (midx_contains_pack(m, p->pack_name)) {
+				if (bsearch_midx(&oid, m, &result))
+					return nth_midxed_offset(m, result);
+
+				break;
+			}
+		}
+
 		if (open_pack_index(p))
 			return 0;
 	}
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 23/23] midx: clear midx on repack
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (21 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:13   ` Duy Nguyen
  2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 8 ++++++++
 midx.c           | 8 ++++++++
 midx.h           | 1 +
 3 files changed, 17 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int midx_cleared = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				continue;
 			}
 
+			if (!midx_cleared) {
+				/* if we move a packfile, it will invalidate the midx */
+				clear_midx_file(get_object_directory());
+				midx_cleared = 1;
+			}
+
 			fname_old = mkpathdup("%s/old-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index e46f392fa4..1043c01fa7 100644
--- a/midx.c
+++ b/midx.c
@@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(pack_names);
 	return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+	char *midx = get_midx_filename(object_dir);
+
+	if (remove_path(midx))
+		die(_("failed to clear multi-pack-index at %s"), midx);
+}
diff --git a/midx.h b/midx.h
index 6996b5ff6b..46f9f44c94 100644
--- a/midx.h
+++ b/midx.h
@@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (22 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
@ 2018-06-07 14:06 ` Derrick Stolee
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
  25 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:06 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

On 6/7/2018 10:03 AM, Derrick Stolee wrote:
> This patch series includes a rewrite of the previous
> multi-pack-index RFC [1] using the feedback from the
> commit-graph feature.

Sorry to everyone who got a duplicate copy of this series. I misspelled 
'kernel.org' and it didn't go to the list.

I also have this series available as a GitHub PR [1].

[1] https://github.com/derrickstolee/git/pull/7


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (23 preceding siblings ...)
  2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
  2018-06-07 14:54   ` Derrick Stolee
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
  25 siblings, 1 reply; 192+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-07 14:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, dstolee, jrnieder, jonathantanmy, mfick


On Thu, Jun 07 2018, Derrick Stolee wrote:

> To test the performance in this situation, I created a
> script that organizes the Linux repository in a similar
> fashion. I split the commit history into 50 parts by
> creating branches on every 10,000 commits of the first-
> parent history. Then, `git rev-list --objects A ^B`
> provides the list of objects reachable from A but not B,
> so I could send that to `git pack-objects` to create
> these "time-based" packfiles. With these 50 packfiles
> (deleting the old one from my fresh clone, and deleting
> all tags as they were no longer on-disk) I could then
> test 'git rev-list --objects HEAD^{tree}' and see:
>
>         Before: 0.17s
>         After:  0.13s
>         % Diff: -23.5%
>
> By adding logic to count hits and misses to bsearch_pack,
> I was able to see that the command above calls that
> method 266,930 times with a hit rate of 33%. The MIDX
> has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.

How does this & the numbers below compare to just a naïve
--max-pack-size=<similar size> on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.

> Abbreviation Speedups
> ---------------------
>
> To fully disambiguate an abbreviation, we must iterate
> through all packfiles to ensure no collision exists in
> any packfile. This requires O(P log N) time. With the
> MIDX, this is only O(log N) time. Our standard test [2]
> is 'git log --oneline --parents --raw' because it writes
> many abbreviations while also doing a lot of other work
> (walking commits and trees to compute the raw diff).
>
> For a copy of the Linux repository with 50 packfiles
> split by time, we observed the following:
>
>         Before: 100.5 s
>         After:   58.2 s
>         % Diff: -59.7%
>
>
> Request for Review Attention
> ----------------------------
>
> I tried my best to take the feedback from the commit-graph
> feature and apply it to this feature. I also worked to
> follow the object-store refactoring as I could. I also have
> some local commits that create a 'verify' subcommand and
> integrate with 'fsck' similar to the commit-graph, but I'll
> leave those for a later series (and review is still underway
> for that part of the commit-graph).
>
> One place where I could use some guidance is related to the
> current state of 'the_hash_algo' patches. The file format
> allows a different "hash version" which then indicates the
> length of the hash. What's the best way to ensure this
> feature doesn't cause extra pain in the hash-agnostic series?
> This will inform how I go back and make the commit-graph
> feature better in this area, too.
>
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
@ 2018-06-07 14:54   ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, sbeller, dstolee, jrnieder, jonathantanmy, mfick

On 6/7/2018 10:45 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jun 07 2018, Derrick Stolee wrote:
>
>> To test the performance in this situation, I created a
>> script that organizes the Linux repository in a similar
>> fashion. I split the commit history into 50 parts by
>> creating branches on every 10,000 commits of the first-
>> parent history. Then, `git rev-list --objects A ^B`
>> provides the list of objects reachable from A but not B,
>> so I could send that to `git pack-objects` to create
>> these "time-based" packfiles. With these 50 packfiles
>> (deleting the old one from my fresh clone, and deleting
>> all tags as they were no longer on-disk) I could then
>> test 'git rev-list --objects HEAD^{tree}' and see:
>>
>>          Before: 0.17s
>>          After:  0.13s
>>          % Diff: -23.5%
>>
>> By adding logic to count hits and misses to bsearch_pack,
>> I was able to see that the command above calls that
>> method 266,930 times with a hit rate of 33%. The MIDX
>> has the same number of calls with a 100% hit rate.
> Do you have the script you used for this? It would be very interesting
> as something we could stick in t/perf/ to test this use-case in the
> future.
>
> How does this & the numbers below compare to just a naïve
> --max-pack-size=<similar size> on linux.git?
>
> Is it possible for you to tar this test repo up and share it as a
> one-off? I've been polishing the core.validateAbbrev series I have, and
> it would be interesting to compare some of the (abbrev) numbers.

Here is what I used. You will want to adjust the constants for whatever
repo you are using. This is for the Linux kernel, which has a
first-parent history of ~50,000 commits. It also leaves a bunch of extra
files around, so it is nowhere near ready to be incorporated into the code.

#!/bin/bash

for i in `seq 1 50`
do
         ORDER=$((51 - $i))
         NUM_BACK=$((1000 * ($i - 1)))
         echo creating batch/$ORDER
         git branch -f batch/$ORDER HEAD~$NUM_BACK
         echo batch/$ORDER
         git rev-parse batch/$ORDER
done

lastbranch=""
for i in `seq 1 50`
do
         branch=batch/$i
         if [ -z "$lastbranch" ]
         then
                 echo "$branch"
                 git rev-list --objects $branch | sed 's/ .*//' >objects-$i.txt
         else
                 echo "$lastbranch"
                 echo "$branch"
                 git rev-list --objects $branch ^$lastbranch | sed 's/ .*//' >objects-$i.txt
         fi

         git pack-objects --no-reuse-delta .git/objects/pack/branch-split2 <objects-$i.txt
         lastbranch=$branch
done


for tag in `git tag --list`
do
         git tag -d $tag
done

rm -rf .git/objects/pack/pack-*
git midx write


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
@ 2018-06-07 17:20   ` Duy Nguyen
  2018-06-18 19:23     ` Derrick Stolee
  2018-06-11 21:02   ` Stefan Beller
  1 sibling, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> new file mode 100644
> index 0000000000..2bd886f1a2
> --- /dev/null
> +++ b/Documentation/git-midx.txt
> @@ -0,0 +1,29 @@
> +git-midx(1)
> +============
> +
> +NAME
> +----
> +git-midx - Write and verify multi-pack-indexes (MIDX files).

No full stop. This headline is collected automatically with others,
and it looks strange for it to have a full stop when the rest do not.

> diff --git a/builtin/midx.c b/builtin/midx.c
> new file mode 100644
> index 0000000000..59ea92178f
> --- /dev/null
> +++ b/builtin/midx.c
> @@ -0,0 +1,38 @@
> +#include "builtin.h"
> +#include "cache.h"
> +#include "config.h"
> +#include "git-compat-util.h"

You only need either cache.h or git-compat-util.h. If cache.h is here,
git-compat-util can be removed.

> +#include "parse-options.h"
> +
> +static char const * const builtin_midx_usage[] ={
> +       N_("git midx [--object-dir <dir>]"),
> +       NULL
> +};
> +
> +static struct opts_midx {
> +       const char *object_dir;
> +} opts;
> +
> +int cmd_midx(int argc, const char **argv, const char *prefix)
> +{
> +       static struct option builtin_midx_options[] = {
> +               { OPTION_STRING, 0, "object-dir", &opts.object_dir,

For paths (including dir), OPTION_FILENAME may be a better option to
handle correctly when the command is run in a subdir. See df217ed643
(parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23)
for more info.

> +                 N_("dir"),
> +                 N_("The object directory containing set of packfile and pack-index pairs.") },

Other help strings do not have a full stop either (I only checked a
couple of commands though).

Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME
for some reason)?

> +               OPT_END(),
> +       };
> +
> +       if (argc == 2 && !strcmp(argv[1], "-h"))
> +               usage_with_options(builtin_midx_usage, builtin_midx_options);
> +
> +       git_config(git_default_config, NULL);
> +
> +       argc = parse_options(argc, argv, prefix,
> +                            builtin_midx_options,
> +                            builtin_midx_usage, 0);
> +
> +       if (!opts.object_dir)
> +               opts.object_dir = get_object_directory();
> +
> +       return 0;
> +}

> diff --git a/git.c b/git.c
> index c2f48d53dd..400fadd677 100644
> --- a/git.c
> +++ b/git.c
> @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
>         { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>         { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>         { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
> +       { "midx", cmd_midx, RUN_SETUP },

If it's plumbing and can take an --object-dir, then I don't think
you should require it to run in a repo (with RUN_SETUP).
RUN_SETUP_GENTLY may be better. You could even leave it empty here and
call setup_git_directory() only when --object-dir is not set.

>         { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
>         { "mktree", cmd_mktree, RUN_SETUP },
>         { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
Duy

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 04/23] midx: add 'write' subcommand and basic wiring
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
@ 2018-06-07 17:27   ` Duy Nguyen
  0 siblings, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 59ea92178f..dc0a5acd3f 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -3,9 +3,10 @@
>  #include "config.h"
>  #include "git-compat-util.h"
>  #include "parse-options.h"
> +#include "midx.h"
>
>  static char const * const builtin_midx_usage[] ={
> -       N_("git midx [--object-dir <dir>]"),
> +       N_("git midx [--object-dir <dir>] [write]"),
>         NULL
>  };
>
> @@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
>         if (!opts.object_dir)
>                 opts.object_dir = get_object_directory();
>
> +       if (argc == 0)
> +               return 0;

Isn't it better to die here when no verb is given? I don't see any
good use case for running a no-op "git midx" without verbs. It's more
likely a mistake (e.g. "git midx $foo" where foo happens to be empty)
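
To make the suggestion concrete, here is a rough sketch in plain C,
outside of Git's own helpers. The function name and the messages are
made up for illustration, and the status 129 is only my assumption of
what a usage error should return; the real command would call
write_midx_file() where the comment says so.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical dispatcher: returns 0 for a known verb and a non-zero
 * status when the verb is missing or unknown, instead of silently
 * reporting success.
 */
static int dispatch_midx_verb_sketch(int argc, const char **argv)
{
	if (argc == 0) {
		fprintf(stderr, "usage: git midx [--object-dir <dir>] write\n");
		return 129; /* assumed usage-error status */
	}

	if (!strcmp(argv[0], "write"))
		return 0; /* the real command would call write_midx_file() here */

	fprintf(stderr, "unrecognized verb: %s\n", argv[0]);
	return 129;
}
```

With this shape, an accidental "git midx $foo" with an empty $foo fails
loudly rather than doing nothing.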

> +
> +       if (!strcmp(argv[0], "write"))
> +               return write_midx_file(opts.object_dir);
> +
>         return 0;
>  }
> diff --git a/midx.c b/midx.c
> new file mode 100644
> index 0000000000..616af66b13
> --- /dev/null
> +++ b/midx.c
> @@ -0,0 +1,9 @@
> +#include "git-compat-util.h"
> +#include "cache.h"

Only one of the two is needed

> +#include "dir.h"

Not needed yet. It's better to include it in the patch that actually needs it.

> +#include "midx.h"
> +
> +int write_midx_file(const char *object_dir)
> +{
> +       return 0;
> +}
> diff --git a/midx.h b/midx.h
> new file mode 100644
> index 0000000000..3a63673952
> --- /dev/null
> +++ b/midx.h
> @@ -0,0 +1,4 @@
> +#include "cache.h"
> +#include "packfile.h"

These includes are not needed, at least not now. And please protect
the header file with #ifndef __MIDX_H__ .. #endif.

> +
> +int write_midx_file(const char *object_dir);
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> new file mode 100755
> index 0000000000..a590137af7
> --- /dev/null
> +++ b/t/t5319-midx.sh
> @@ -0,0 +1,10 @@
> +#!/bin/sh
> +
> +test_description='multi-pack-indexes'
> +. ./test-lib.sh
> +
> +test_expect_success 'write midx with no pakcs' '

no packs


> +       git midx --object-dir=. write
> +'
> +
> +test_done
> --
> 2.18.0.rc1
>



-- 
Duy

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
@ 2018-06-07 17:35   ` Duy Nguyen
  2018-06-12 15:00   ` Duy Nguyen
  1 sibling, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> +static char *get_midx_filename(const char *object_dir)
> +{
> +       struct strbuf midx_name = STRBUF_INIT;
> +       strbuf_addstr(&midx_name, object_dir);
> +       strbuf_addstr(&midx_name, "/pack/multi-pack-index");
> +       return strbuf_detach(&midx_name, NULL);
> +}

I think this whole function can be written as
xstrfmt("%s/pack/multi-pack-index", object_dir);
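
For readers without the Git tree at hand, the same idea can be sketched
with only the standard library. This is an illustration, not the code I
am asking for: midx_filename_sketch() is a made-up name, and inside Git
the two snprintf() calls collapse into the single xstrfmt() call above.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for get_midx_filename(): builds
 * "<object_dir>/pack/multi-pack-index" from one format string.
 * The caller owns the returned buffer and must free() it.
 */
static char *midx_filename_sketch(const char *object_dir)
{
	const char *fmt = "%s/pack/multi-pack-index";
	int len = snprintf(NULL, 0, fmt, object_dir); /* measure first */
	char *name;

	if (len < 0)
		return NULL;

	name = malloc((size_t)len + 1);
	if (!name)
		return NULL;

	snprintf(name, (size_t)len + 1, fmt, object_dir);
	return name;
}
```

Either way the result is heap-allocated, so callers have to free it.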

> +
> +static size_t write_midx_header(struct hashfile *f,
> +                               unsigned char num_chunks,
> +                               uint32_t num_packs)
> +{
> +       char byte_values[4];

unsigned char just to be on the safe side? Whether plain 'char' is
signed is implementation-defined (signed on x86, unsigned by default on
ARM, if I remember correctly).
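
A small sketch of why unsigned bytes are the safer choice when packing
header fields. The helper names here are invented for the example and
are not Git's put_be32()/get_be32():

```c
#include <stdint.h>

/*
 * Hypothetical helper: store a 32-bit value as four big-endian bytes.
 * With unsigned char every byte stays in 0..255, whatever the
 * signedness of the platform's plain char.
 */
static void put_be32_sketch(unsigned char out[4], uint32_t value)
{
	out[0] = (unsigned char)(value >> 24);
	out[1] = (unsigned char)(value >> 16);
	out[2] = (unsigned char)(value >> 8);
	out[3] = (unsigned char)value;
}

/* Read the four bytes back; no sign extension can occur on promotion. */
static uint32_t get_be32_sketch(const unsigned char in[4])
{
	return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

With a signed byte type, any byte of 0x80 or above would turn negative
when promoted, and the read-back would need extra masking.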
-- 
Duy

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
@ 2018-06-07 17:54   ` Duy Nguyen
  2018-06-20 13:13     ` Derrick Stolee
  2018-06-07 18:31   ` Duy Nguyen
  1 sibling, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> As we build the multi-pack-index feature by adding chunks at a time,
> we want to test that the data is being written correctly.
>
> Create struct midxed_git to store an in-memory representation of a

A word play on 'packed_git'? Amusing. Some more descriptive name would
be better though. midxed looks almost like random letters thrown
together.

> multi-pack-index and a memory-map of the binary file. Initialize this
> struct in load_midxed_git(object_dir).

> +static int read_midx_file(const char *object_dir)
> +{
> +       struct midxed_git *m = load_midxed_git(object_dir);
> +
> +       if (!m)
> +               return 0;

This looks like an error case, please don't just return zero,
typically used to say "success". I don't know if this command stays
"for debugging purposes" until the end. Of course in that case it does
not really matter.

> +struct midxed_git *load_midxed_git(const char *object_dir)
> +{
> +       struct midxed_git *m;
> +       int fd;
> +       struct stat st;
> +       size_t midx_size;
> +       void *midx_map;
> +       const char *midx_name = get_midx_filename(object_dir);

mem leak? This function returns allocated memory if I remember correctly.

> +
> +       fd = git_open(midx_name);
> +       if (fd < 0)
> +               return NULL;

do an error_errno() so we know what went wrong at least.

> +       if (fstat(fd, &st)) {
> +               close(fd);
> +               return NULL;

same here, we should know why fstat() fails.

> +       }
> +       midx_size = xsize_t(st.st_size);
> +
> +       if (midx_size < MIDX_MIN_SIZE) {
> +               close(fd);
> +               die("multi-pack-index file %s is too small", midx_name);

_()

The use of die() should be discouraged though. Many people still try
(or wish) to libify code and new die() does not help. I think error()
here would be enough then you can return NULL. Or you can go fancier
and store the error string in a strbuf like refs code.
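
Roughly what I have in mind, as a standalone sketch. The function name,
the SKETCH_MIN_SIZE constant and the fixed-size error buffer are all
assumptions for the example; in Git the buffer would be a strbuf and the
message would be wrapped in _().

```c
#include <stddef.h>
#include <stdio.h>

#define SKETCH_MIN_SIZE 12 /* hypothetical minimum, not Git's MIDX_MIN_SIZE */

/*
 * Hypothetical size check that reports failure to the caller instead
 * of terminating the process. Returns 0 on success, or -1 after
 * writing a message into err (when err is non-NULL).
 */
static int check_midx_size_sketch(size_t size, const char *name,
				  char *err, size_t err_len)
{
	if (size < SKETCH_MIN_SIZE) {
		if (err && err_len)
			snprintf(err, err_len,
				 "multi-pack-index file %s is too small", name);
		return -1;
	}
	return 0;
}
```

The caller then decides whether the failure is fatal, which is what
keeps the door open for libification.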

> +       }
> +
> +       midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +       m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> +       strcpy(m->object_dir, object_dir);
> +       m->data = midx_map;
> +
> +       m->signature = get_be32(m->data);
> +       if (m->signature != MIDX_SIGNATURE) {
> +               error("multi-pack-index signature %X does not match signature %X",
> +                     m->signature, MIDX_SIGNATURE);

_(). Maybe 0x%08x instead of %x

> +               goto cleanup_fail;
> +       }
> +
> +       m->version = *(m->data + 4);

m->data[4] instead? shorter and easier to understand.

Same comment on "*(m->data + x)" and error() without _() for the rest.

> +       if (m->version != MIDX_VERSION) {
> +               error("multi-pack-index version %d not recognized",
> +                     m->version);

_()

> +               goto cleanup_fail;
> +       }
> +
> +       m->hash_version = *(m->data + 5);

m->data[5]

> +cleanup_fail:
> +       FREE_AND_NULL(m);
> +       munmap(midx_map, midx_size);
> +       close(fd);
> +       exit(1);

It's bad enough that you die() but exit() in this code seems too much.
Please just return NULL and let the caller handle the error.

> diff --git a/midx.h b/midx.h
> index 3a63673952..a1d18ed991 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -1,4 +1,13 @@
> +#ifndef MIDX_H
> +#define MIDX_H
> +
> +#include "git-compat-util.h"
>  #include "cache.h"
> +#include "object-store.h"

I don't really think you need object-store here (git-compat-util.h
too). "struct midxed_git;" would be enough for load_midxed_git
declaration below.

>  #include "packfile.h"
>
> +struct midxed_git *load_midxed_git(const char *object_dir);
> +
>  int write_midx_file(const char *object_dir);
> +
> +#endif
> diff --git a/object-store.h b/object-store.h
> index d683112fd7..77cb82621a 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -84,6 +84,25 @@ struct packed_git {
>         char pack_name[FLEX_ARRAY]; /* more */
>  };
>
> +struct midxed_git {
> +       struct midxed_git *next;

Do we really have multiple midx files?

> +
> +       int fd;
> +
> +       const unsigned char *data;
> +       size_t data_len;
> +
> +       uint32_t signature;
> +       unsigned char version;
> +       unsigned char hash_version;
> +       unsigned char hash_len;
> +       unsigned char num_chunks;
> +       uint32_t num_packs;
> +       uint32_t num_objects;
> +
> +       char object_dir[FLEX_ARRAY];

Why do you need to keep object_dir when it could be easily retrieved
when the repo is available?

> +};
> +
>  struct raw_object_store {
>         /*
>          * Path to the repository's object store.
-- 
Duy

* Re: [PATCH 08/23] midx: read packfiles from pack directory
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
@ 2018-06-07 18:03   ` Duy Nguyen
  2018-06-20 16:33     ` [PATCH] packfile: generalize pack directory list Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> @@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
>                           midx_name);
>         }
>
> +       strbuf_addf(&pack_dir, "%s/pack", object_dir);
> +       dir = opendir(pack_dir.buf);
> +
> +       if (!dir) {
> +               error_errno("unable to open pack directory: %s",
> +                           pack_dir.buf);

_()

> +               strbuf_release(&pack_dir);
> +               return 1;
> +       }
> +
> +       strbuf_addch(&pack_dir, '/');
> +       pack_dir_len = pack_dir.len;
> +       ALLOC_ARRAY(packs, alloc_packs);
> +       while ((de = readdir(dir)) != NULL) {
> +               if (is_dot_or_dotdot(de->d_name))
> +                       continue;
> +
> +               if (ends_with(de->d_name, ".idx")) {
> +                       ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
> +
> +                       strbuf_setlen(&pack_dir, pack_dir_len);
> +                       strbuf_addstr(&pack_dir, de->d_name);
> +
> +                       packs[nr_packs] = add_packed_git(pack_dir.buf,
> +                                                        pack_dir.len,
> +                                                        0);
> +                       if (!packs[nr_packs])
> +                               warning("failed to add packfile '%s'",
> +                                       pack_dir.buf);
> +                       else
> +                               nr_packs++;
> +               }
> +       }
> +       closedir(dir);
> +       strbuf_release(&pack_dir);

Can we refactor and share this scanning-for-packs code with
packfile.c? I'm pretty sure it does something similar in there.

> -       write_midx_header(f, num_chunks, num_packs);
> +       write_midx_header(f, num_chunks, nr_packs);

Hmm.. could have stuck to one name from the beginning...
-- 
Duy

* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
@ 2018-06-07 18:26   ` Duy Nguyen
  2018-06-21 15:25     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         m->num_chunks = *(m->data + 6);
>         m->num_packs = get_be32(m->data + 8);
>
> +       for (i = 0; i < m->num_chunks; i++) {
> +               uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
> +               uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);

It would be good to reduce magic numbers like 12 and 16; I think you
have some header length constants for those already.
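
A minimal sketch of what that could look like, as standalone C. The
SKETCH_* names and the chunk_*_at() helpers are invented for this
example; the value 12 for the header size is inferred from the offsets
used in the hunk above, not taken from the actual definition of
MIDX_HEADER_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 12-byte header, then (4-byte id, 8-byte offset) rows. */
#define SKETCH_HEADER_SIZE 12
#define SKETCH_CHUNK_ID_WIDTH sizeof(uint32_t)
#define SKETCH_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))

static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

/* Start of the i-th row of the chunk lookup table. */
static const unsigned char *chunk_lookup_entry(const unsigned char *data,
					       uint32_t i)
{
	return data + SKETCH_HEADER_SIZE + SKETCH_CHUNKLOOKUP_WIDTH * i;
}

static uint32_t chunk_id_at(const unsigned char *data, uint32_t i)
{
	return sketch_get_be32(chunk_lookup_entry(data, i));
}

static uint64_t chunk_offset_at(const unsigned char *data, uint32_t i)
{
	return sketch_get_be64(chunk_lookup_entry(data, i) +
			       SKETCH_CHUNK_ID_WIDTH);
}
```

With helpers like these the loop body only mentions the row index, and
the layout arithmetic lives in one place.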

> +               switch (chunk_id) {
> +                       case MIDX_CHUNKID_PACKNAMES:
> +                               m->chunk_pack_names = m->data + chunk_offset;
> +                               break;
> +
> +                       case 0:
> +                               die("terminating MIDX chunk id appears earlier than expected");

_()

> +                               break;
> +
> +                       default:
> +                               /*
> +                                * Do nothing on unrecognized chunks, allowing future
> +                                * extensions to add optional chunks.
> +                                */

I wrote about the chunk term reminding me of the PNG format, then
deleted it. But it may help to do something similar to PNG here: the
case of the first letter tells us whether the chunk is optional and
can be safely ignored. E.g. a chunk with an uppercase first letter
cannot be ignored; with lowercase, go wild.
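
For what it's worth, the check itself is tiny. A sketch, assuming
chunk ids stay four ASCII letters packed big-endian into a uint32_t as
in this series; chunk_is_optional(), unknown_chunk_action() and the
enum are hypothetical names, and the lowercase "pnam" id in the usage
note below is made up:

```c
#include <stdint.h>

enum chunk_action {
	CHUNK_SKIP,	/* unknown but optional: ignore it */
	CHUNK_REJECT	/* unknown and required: refuse the file */
};

/*
 * PNG-style convention: bit 5 (0x20) of the first byte is the ASCII
 * case bit. Uppercase (bit clear) means required, lowercase (bit set)
 * means optional.
 */
static int chunk_is_optional(uint32_t chunk_id)
{
	unsigned char first = (unsigned char)(chunk_id >> 24);

	return (first & 0x20) != 0;
}

/* What a reader would do in the "default:" arm of the switch. */
static enum chunk_action unknown_chunk_action(uint32_t chunk_id)
{
	return chunk_is_optional(chunk_id) ? CHUNK_SKIP : CHUNK_REJECT;
}
```

So 0x504e414d ("PNAM") would be rejected by a reader that does not
know it, while a hypothetical 0x706e616d ("pnam") would be skipped.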

> +                               break;
> +               }
> +       }
> +
> +       if (!m->chunk_pack_names)
> +               die("MIDX missing required pack-name chunk");

_()

> +
>         return m;
>
>  cleanup_fail:
> @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
>         return MIDX_HEADER_SIZE;
>  }
>
> +struct pack_pair {
> +       uint32_t pack_int_id;

can this be just pack_id?

> +       char *pack_name;
> +};
> +
> +static int pack_pair_compare(const void *_a, const void *_b)
> +{
> +       struct pack_pair *a = (struct pack_pair *)_a;
> +       struct pack_pair *b = (struct pack_pair *)_b;
> +       return strcmp(a->pack_name, b->pack_name);
> +}
> +
> +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
> +{
> +       uint32_t i;
> +       struct pack_pair *pairs;
> +
> +       ALLOC_ARRAY(pairs, nr_packs);
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               pairs[i].pack_int_id = i;
> +               pairs[i].pack_name = pack_names[i];
> +       }
> +
> +       QSORT(pairs, nr_packs, pack_pair_compare);
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               pack_names[i] = pairs[i].pack_name;
> +               perm[pairs[i].pack_int_id] = i;
> +       }

pairs[] is leaked?
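
Roughly what I have in mind, as a standalone sketch: plain malloc()
and qsort() stand in for ALLOC_ARRAY() and QSORT() so it compiles
outside git, and the int return value is my addition to report
allocation failure (xmalloc-style helpers would die instead). The
comparison function also keeps "const" through the cast.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pack_pair {
	uint32_t pack_int_id;
	char *pack_name;
};

static int pack_pair_compare(const void *_a, const void *_b)
{
	const struct pack_pair *a = _a;
	const struct pack_pair *b = _b;

	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort pack_names in place and fill perm[] so that perm[old] == new.
 * Returns 0 on success, -1 if the temporary array cannot be allocated.
 */
static int sort_packs_by_name(char **pack_names, uint32_t nr_packs,
			      uint32_t *perm)
{
	uint32_t i;
	struct pack_pair *pairs;

	if (!nr_packs)
		return 0;

	pairs = malloc(nr_packs * sizeof(*pairs));
	if (!pairs)
		return -1;

	for (i = 0; i < nr_packs; i++) {
		pairs[i].pack_int_id = i;
		pairs[i].pack_name = pack_names[i];
	}

	qsort(pairs, nr_packs, sizeof(*pairs), pack_pair_compare);

	for (i = 0; i < nr_packs; i++) {
		pack_names[i] = pairs[i].pack_name;
		perm[pairs[i].pack_int_id] = i;
	}

	free(pairs);	/* the temporary array is no longer needed */
	return 0;
}
```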

> +}
> +
> +static size_t write_midx_pack_names(struct hashfile *f,
> +                                   char **pack_names,
> +                                   uint32_t num_packs)
> +{
> +       uint32_t i;
> +       unsigned char padding[MIDX_CHUNK_ALIGNMENT];
> +       size_t written = 0;
> +
> +       for (i = 0; i < num_packs; i++) {
> +               size_t writelen = strlen(pack_names[i]) + 1;
> +
> +               if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
> +                       BUG("incorrect pack-file order: %s before %s",
> +                           pack_names[i - 1],
> +                           pack_names[i]);
> +
> +               hashwrite(f, pack_names[i], writelen);
> +               written += writelen;

side note. This pattern happens a lot. It may be a good idea to make
hashwrite() return writelen so we can just write

written += hashwrite(f, ..., writelen);

> +       }
> +
> +       /* add padding to be aligned */
> +       i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
> +       if (i < MIDX_CHUNK_ALIGNMENT) {
> +               bzero(padding, sizeof(padding));
> +               hashwrite(f, padding, i);
> +               written += i;
> +       }
> +
> +       return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
> -       unsigned char num_chunks = 0;
> +       unsigned char cur_chunk, num_chunks = 0;
>         char *midx_name;
>         struct hashfile *f;
>         struct lock_file lk;
>         struct packed_git **packs = NULL;
> +       char **pack_names = NULL;
> +       uint32_t *pack_perm;
>         uint32_t i, nr_packs = 0, alloc_packs = 0;
> +       uint32_t alloc_pack_names = 0;
>         DIR *dir;
>         struct dirent *de;
>         struct strbuf pack_dir = STRBUF_INIT;
>         size_t pack_dir_len;
> +       uint64_t pack_name_concat_len = 0;
> +       uint64_t written = 0;
> +       uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> +       uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];

This long list of local vars may be a good indicator that this
function needs to be split up into smaller ones.

>
>         midx_name = get_midx_filename(object_dir);
>         if (safe_create_leading_directories(midx_name)) {
> @@ -132,12 +233,14 @@ int write_midx_file(const char *object_dir)
>         strbuf_addch(&pack_dir, '/');
>         pack_dir_len = pack_dir.len;
>         ALLOC_ARRAY(packs, alloc_packs);
> +       ALLOC_ARRAY(pack_names, alloc_pack_names);
>         while ((de = readdir(dir)) != NULL) {
>                 if (is_dot_or_dotdot(de->d_name))
>                         continue;
>
>                 if (ends_with(de->d_name, ".idx")) {
>                         ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
> +                       ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
>
>                         strbuf_setlen(&pack_dir, pack_dir_len);
>                         strbuf_addstr(&pack_dir, de->d_name);
> @@ -145,21 +248,83 @@ int write_midx_file(const char *object_dir)
>                         packs[nr_packs] = add_packed_git(pack_dir.buf,
>                                                          pack_dir.len,
>                                                          0);
> -                       if (!packs[nr_packs])
> +                       if (!packs[nr_packs]) {
>                                 warning("failed to add packfile '%s'",
>                                         pack_dir.buf);
> -                       else
> -                               nr_packs++;
> +                               continue;
> +                       }
> +
> +                       pack_names[nr_packs] = xstrdup(de->d_name);
> +                       pack_name_concat_len += strlen(de->d_name) + 1;
> +                       nr_packs++;
>                 }
>         }
> +
>         closedir(dir);
>         strbuf_release(&pack_dir);
>
> +       if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
> +               pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
> +                                       (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
> +
> +       ALLOC_ARRAY(pack_perm, nr_packs);
> +       sort_packs_by_name(pack_names, nr_packs, pack_perm);
> +
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
>
> -       write_midx_header(f, num_chunks, nr_packs);
> +       cur_chunk = 0;
> +       num_chunks = 1;
> +
> +       written = write_midx_header(f, num_chunks, nr_packs);
> +
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
> +
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = 0;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
> +
> +       for (i = 0; i <= num_chunks; i++) {
> +               if (i && chunk_offsets[i] < chunk_offsets[i - 1])
> +                       BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
> +                           chunk_offsets[i - 1],
> +                           chunk_offsets[i]);
> +
> +               if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
> +                       BUG("chunk offset %"PRIu64" is not properly aligned",
> +                           chunk_offsets[i]);
> +
> +               hashwrite_be32(f, chunk_ids[i]);
> +               hashwrite_be32(f, chunk_offsets[i] >> 32);
> +               hashwrite_be32(f, chunk_offsets[i]);
> +
> +               written += MIDX_CHUNKLOOKUP_WIDTH;
> +       }
> +
> +       for (i = 0; i < num_chunks; i++) {
> +               if (written != chunk_offsets[i])
> +                       BUG("inccrrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,

incorrect

> +                           chunk_offsets[i],
> +                           written,
> +                           chunk_ids[i]);
> +
> +               switch (chunk_ids[i]) {
> +                       case MIDX_CHUNKID_PACKNAMES:
> +                               written += write_midx_pack_names(f, pack_names, nr_packs);
> +                               break;
> +
> +                       default:
> +                               BUG("trying to write unknown chunk id %"PRIx32,
> +                                   chunk_ids[i]);
> +               }
> +       }
> +
> +       if (written != chunk_offsets[num_chunks])
> +               BUG("incorrect final offset %"PRIu64" != %"PRIu64,
> +                   written,
> +                   chunk_offsets[num_chunks]);
>
>         finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>         commit_lock_file(&lk);
> @@ -170,5 +335,6 @@ int write_midx_file(const char *object_dir)
>         }
>
>         FREE_AND_NULL(packs);
> +       FREE_AND_NULL(pack_names);

What about the strings in this array? I think they are xstrdup() but I
didn't spot them being freed.

And maybe just use string_list...

>         return 0;
>  }
-- 
Duy

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
  2018-06-07 17:54   ` Duy Nguyen
@ 2018-06-07 18:31   ` Duy Nguyen
  2018-06-20 13:33     ` Derrick Stolee
  1 sibling, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> index dcaeb1a91b..919283fdd8 100644
> --- a/Documentation/git-midx.txt
> +++ b/Documentation/git-midx.txt
> @@ -23,6 +23,11 @@ OPTIONS
>         <dir>/packs/multi-pack-index for the current MIDX file, and
>         <dir>/packs for the pack-files to index.
>
> +read::
> +       When given as the verb, read the current MIDX file and output
> +       basic information about its contents. Used for debugging
> +       purposes only.

On second thought, if you just need a temporary debugging interface,
adding a program in t/helper may be a better option. In the end we
might still need 'read' to dump a file out, but then we should have a
stable output format (and JSON might be a good choice).

That's it, I'm done for today. I will continue with the rest some day, hopefully.
-- 
Duy

* Re: [PATCH 10/23] midx: write a lookup into the pack names chunk
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
@ 2018-06-09 16:43   ` Duy Nguyen
  2018-06-21 17:23     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 16:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt |  5 +++
>  builtin/midx.c                          |  7 ++++
>  midx.c                                  | 56 +++++++++++++++++++++++--
>  object-store.h                          |  2 +
>  t/t5319-midx.sh                         | 11 +++--
>  5 files changed, 75 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 2b37be7b33..29bf87283a 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -296,6 +296,11 @@ CHUNK LOOKUP:
>
>  CHUNK DATA:
>
> +       Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
> +           P * 4 bytes storing the offset in the packfile name chunk for
> +           the null-terminated string containing the filename for the
> +           ith packfile.
> +

Commit message is too light on this one. Why does this need to be
stored? Isn't the cost of rebuilding this in-core cheap?

Adding this chunk on disk in my opinion only adds more burden. Now you
have to verify that these offsets actually point to the right place.
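
To show what "rebuilding in-core" would cost, here is a sketch that
walks the concatenated, NUL-terminated names once and records a
pointer to each. build_pack_names() is a hypothetical helper, and it
assumes the reader knows the length of the names chunk (e.g. from the
next chunk's offset), which this patch does not track yet.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns an array of num_packs pointers into "chunk", or NULL if the
 * chunk does not contain that many NUL-terminated strings. The caller
 * frees the array; the strings themselves are not copied.
 */
static const char **build_pack_names(const unsigned char *chunk,
				     size_t chunk_len, uint32_t num_packs)
{
	const unsigned char *cur = chunk;
	const unsigned char *end = chunk + chunk_len;
	const char **names;
	uint32_t i;

	names = calloc(num_packs ? num_packs : 1, sizeof(*names));
	if (!names)
		return NULL;

	for (i = 0; i < num_packs; i++) {
		const unsigned char *nul;

		if (cur >= end)
			goto fail;
		/* Never look past the end of the chunk for the terminator. */
		nul = memchr(cur, '\0', (size_t)(end - cur));
		if (!nul)
			goto fail;

		names[i] = (const char *)cur;
		cur = nul + 1;
	}
	return names;

fail:
	free(names);
	return NULL;
}
```

That is one pass over the names, bounded by the chunk length, and
there are no stored offsets left to validate.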

>         Packfile Names (ID: {'P', 'N', 'A', 'M'})
>             Stores the packfile names as concatenated, null-terminated strings.
>             Packfiles must be listed in lexicographic order for fast lookups by
> diff --git a/builtin/midx.c b/builtin/midx.c
> index fe56560853..3a261e9bbf 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -16,6 +16,7 @@ static struct opts_midx {
>
>  static int read_midx_file(const char *object_dir)
>  {
> +       uint32_t i;
>         struct midxed_git *m = load_midxed_git(object_dir);
>
>         if (!m)
> @@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
>
>         printf("chunks:");
>
> +       if (m->chunk_pack_lookup)
> +               printf(" pack_lookup");
>         if (m->chunk_pack_names)
>                 printf(" pack_names");
>
>         printf("\n");
>
> +       printf("packs:\n");
> +       for (i = 0; i < m->num_packs; i++)
> +               printf("%s\n", m->pack_names[i]);
> +
>         printf("object_dir: %s\n", m->object_dir);
>
>         return 0;
> diff --git a/midx.c b/midx.c
> index d4f4a01a51..923acda72e 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -13,8 +13,9 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 1
> +#define MIDX_MAX_CHUNKS 2
>  #define MIDX_CHUNK_ALIGNMENT 4
> +#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
> @@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
>
>                 switch (chunk_id) {
> +                       case MIDX_CHUNKID_PACKLOOKUP:
> +                               m->chunk_pack_lookup = (uint32_t *)(m->data + chunk_offset);
> +                               break;
> +
>                         case MIDX_CHUNKID_PACKNAMES:
>                                 m->chunk_pack_names = m->data + chunk_offset;
>                                 break;
> @@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 }
>         }
>
> +       if (!m->chunk_pack_lookup)
> +               die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
>
> +       m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +       for (i = 0; i < m->num_packs; i++) {
> +               if (i) {
> +                       if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
> +                               error("MIDX pack lookup value %d before %d",
> +                                     ntohl(m->chunk_pack_lookup[i - 1]),
> +                                     ntohl(m->chunk_pack_lookup[i]));
> +                               goto cleanup_fail;
> +                       }
> +               }
> +
> +               m->pack_names[i] = (const char *)(m->chunk_pack_names + ntohl(m->chunk_pack_lookup[i]));
> +
> +               if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
> +                       error("MIDX pack names out of order: '%s' before '%s'",
> +                             m->pack_names[i - 1],
> +                             m->pack_names[i]);
> +                       goto cleanup_fail;
> +               }
> +       }
> +
>         return m;
>
>  cleanup_fail:
> @@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
>         }
>  }
>
> +static size_t write_midx_pack_lookup(struct hashfile *f,
> +                                    char **pack_names,
> +                                    uint32_t nr_packs)
> +{
> +       uint32_t i, cur_len = 0;
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               hashwrite_be32(f, cur_len);
> +               cur_len += strlen(pack_names[i]) + 1;
> +       }
> +
> +       return sizeof(uint32_t) * (size_t)nr_packs;
> +}
> +
>  static size_t write_midx_pack_names(struct hashfile *f,
>                                     char **pack_names,
>                                     uint32_t num_packs)
> @@ -275,13 +317,17 @@ int write_midx_file(const char *object_dir)
>         FREE_AND_NULL(midx_name);
>
>         cur_chunk = 0;
> -       num_chunks = 1;
> +       num_chunks = 2;
>
>         written = write_midx_header(f, num_chunks, nr_packs);
>
> -       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKLOOKUP;
>         chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
> +
>         cur_chunk++;
>         chunk_ids[cur_chunk] = 0;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
> @@ -311,6 +357,10 @@ int write_midx_file(const char *object_dir)
>                             chunk_ids[i]);
>
>                 switch (chunk_ids[i]) {
> +                       case MIDX_CHUNKID_PACKLOOKUP:
> +                               written += write_midx_pack_lookup(f, pack_names, nr_packs);
> +                               break;
> +
>                         case MIDX_CHUNKID_PACKNAMES:
>                                 written += write_midx_pack_names(f, pack_names, nr_packs);
>                                 break;
> diff --git a/object-store.h b/object-store.h
> index 199cf4bd44..1ba50459ca 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -100,8 +100,10 @@ struct midxed_git {
>         uint32_t num_packs;
>         uint32_t num_objects;
>
> +       const uint32_t *chunk_pack_lookup;
>         const unsigned char *chunk_pack_names;
>
> +       const char **pack_names;
>         char object_dir[FLEX_ARRAY];
>  };
>
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> index fdf4f84a90..a31c387c8f 100755
> --- a/t/t5319-midx.sh
> +++ b/t/t5319-midx.sh
> @@ -6,10 +6,15 @@ test_description='multi-pack-indexes'
>  midx_read_expect() {
>         NUM_PACKS=$1
>         cat >expect <<- EOF
> -       header: 4d494458 1 1 1 $NUM_PACKS
> -       chunks: pack_names
> -       object_dir: .
> +       header: 4d494458 1 1 2 $NUM_PACKS
> +       chunks: pack_lookup pack_names
> +       packs:
>         EOF
> +       if [ $NUM_PACKS -ge 1 ]
> +       then
> +               ls pack/ | grep idx | sort >> expect
> +       fi
> +       printf "object_dir: .\n" >>expect &&
>         git midx read --object-dir=. >actual &&
>         test_cmp expect actual
>  }
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 11/23] midx: sort and deduplicate objects from packfiles
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-06-09 17:07   ` Duy Nguyen
  2018-06-21 17:54     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Before writing a list of objects and their offsets to a multi-pack-index
> (MIDX), we need to collect the list of objects contained in the
> packfiles. There may be multiple copies of some objects, so this list
> must be deduplicated.

Can you just do merge-sort with a slight modification to ignore duplicates?
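
What I mean, in sketch form: each .idx is already sorted by object id,
so the merge step can drop repeats as it goes. The code below is a
standalone two-way merge with made-up names (struct sketch_oid,
merge_unique()); it ignores the mtime tie-break from the patch and
simply prefers the entry from the first input.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_HASH_LEN 20

struct sketch_oid {
	unsigned char hash[SKETCH_HASH_LEN];
};

/*
 * Merge two sorted arrays into "out", which must have room for
 * na + nb entries. Duplicates, within or across the inputs, are
 * emitted once. Returns the number of entries written.
 */
static size_t merge_unique(const struct sketch_oid *a, size_t na,
			   const struct sketch_oid *b, size_t nb,
			   struct sketch_oid *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		const struct sketch_oid *next;

		if (j >= nb)
			next = &a[i++];
		else if (i >= na)
			next = &b[j++];
		else if (memcmp(a[i].hash, b[j].hash, SKETCH_HASH_LEN) <= 0)
			next = &a[i++];	/* on a tie, take "a" first */
		else
			next = &b[j++];

		/* Skip anything equal to the entry we emitted last. */
		if (n && !memcmp(out[n - 1].hash, next->hash, SKETCH_HASH_LEN))
			continue;

		out[n++] = *next;
	}
	return n;
}
```

Extending this to many packs would need a k-way merge (or repeated
two-way merges), so whether it actually beats the per-fanout sort in
the patch is something to measure rather than assume.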

>
> It is possible to artificially get into a state where there are many
> duplicate copies of objects. That can create high memory pressure if we
> are to create a list of all objects before de-duplication. To reduce
> this memory pressure without a significant performance drop,
> automatically group objects by the first byte of their object id. Use
> the IDX fanout tables to group the data, copy to a local array, then
> sort.
>
> Copy only the de-duplicated entries. Select the duplicate based on the
> most-recent modified time of a packfile containing the object.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 138 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 923acda72e..b20d52713c 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -4,6 +4,7 @@
>  #include "csum-file.h"
>  #include "lockfile.h"
>  #include "object-store.h"
> +#include "packfile.h"
>  #include "midx.h"
>
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> @@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
>         }
>  }
>
> +static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
> +{
> +       const uint32_t *level1_ofs = p->index_data;
> +
> +       if (!level1_ofs) {
> +               if (open_pack_index(p))
> +                       return 0;
> +               level1_ofs = p->index_data;
> +       }
> +
> +       if (p->index_version > 1) {
> +               level1_ofs += 2;
> +       }
> +
> +       return ntohl(level1_ofs[value]);
> +}

Maybe keep this in packfile.c, refactoring the fanout code in there
if necessary, so that .idx file format knowledge stays in that file
instead of spreading out further.

> +
> +struct pack_midx_entry {
> +       struct object_id oid;
> +       uint32_t pack_int_id;
> +       time_t pack_mtime;
> +       uint64_t offset;
> +};
> +
> +static int midx_oid_compare(const void *_a, const void *_b)
> +{
> +       struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
> +       struct pack_midx_entry *b = (struct pack_midx_entry *)_b;

Try not to lose "const" while typecasting.

> +       int cmp = oidcmp(&a->oid, &b->oid);
> +
> +       if (cmp)
> +               return cmp;
> +
> +       if (a->pack_mtime > b->pack_mtime)
> +               return -1;
> +       else if (a->pack_mtime < b->pack_mtime)
> +               return 1;
> +
> +       return a->pack_int_id - b->pack_int_id;
> +}
> +
> +static void fill_pack_entry(uint32_t pack_int_id,
> +                           struct packed_git *p,
> +                           uint32_t cur_object,
> +                           struct pack_midx_entry *entry)
> +{
> +       if (!nth_packed_object_oid(&entry->oid, p, cur_object))
> +               die("failed to located object %d in packfile", cur_object);

_()

> +
> +       entry->pack_int_id = pack_int_id;
> +       entry->pack_mtime = p->mtime;
> +
> +       entry->offset = nth_packed_object_offset(p, cur_object);
> +}
> +
> +/*
> + * It is possible to artificially get into a state where there are many
> + * duplicate copies of objects. That can create high memory pressure if
> + * we are to create a list of all objects before de-duplication. To reduce
> + * this memory pressure without a significant performance drop, automatically
> + * group objects by the first byte of their object id. Use the IDX fanout
> + * tables to group the data, copy to a local array, then sort.
> + *
> + * Copy only the de-duplicated entries (selected by most-recent modified time
> + * of a packfile containing the object).
> + */
> +static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
> +                                                 uint32_t *perm,
> +                                                 uint32_t nr_packs,
> +                                                 uint32_t *nr_objects)
> +{
> +       uint32_t cur_fanout, cur_pack, cur_object;
> +       uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
> +       struct pack_midx_entry *entries_by_fanout = NULL;
> +       struct pack_midx_entry *deduplicated_entries = NULL;
> +
> +       for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
> +               if (open_pack_index(p[cur_pack]))
> +                       continue;

Is it a big problem if you fail to open the .idx for a certain pack?
Should we error out and abort instead of continuing? Later on, in
the second loop over packs, when get_pack_fanout() returns zero
(failure), you don't seem to catch it and skip the pack.

> +
> +               total_objects += p[cur_pack]->num_objects;
> +       }
> +
> +       /*
> +        * As we de-duplicate by fanout value, we expect the fanout
> +        * slices to be evenly distributed, with some noise. Hence,
> +        * allocate slightly more than one 256th.
> +        */
> +       alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
> +
> +       ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
> +       ALLOC_ARRAY(deduplicated_entries, alloc_objects);
> +       *nr_objects = 0;
> +
> +       for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
> +               nr_fanout = 0;

Keep variable scope small, declare nr_fanout here instead of at the
top of the function.

> +
> +               for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
> +                       uint32_t start = 0, end;
> +
> +                       if (cur_fanout)
> +                               start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
> +                       end = get_pack_fanout(p[cur_pack], cur_fanout);
> +
> +                       for (cur_object = start; cur_object < end; cur_object++) {
> +                               ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
> +                               fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
> +                               nr_fanout++;
> +                       }
> +               }
> +
> +               QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
> +
> +               /*
> +                * The batch is now sorted by OID and then mtime (descending).
> +                * Take only the first duplicate.
> +                */
> +               for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
> +                       if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
> +                                                 &entries_by_fanout[cur_object].oid))
> +                               continue;
> +
> +                       ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
> +                       memcpy(&deduplicated_entries[*nr_objects],
> +                              &entries_by_fanout[cur_object],
> +                              sizeof(struct pack_midx_entry));
> +                       (*nr_objects)++;
> +               }
> +       }
> +
> +       FREE_AND_NULL(entries_by_fanout);
> +       return deduplicated_entries;
> +}
> +
>  static size_t write_midx_pack_lookup(struct hashfile *f,
>                                      char **pack_names,
>                                      uint32_t nr_packs)
> @@ -254,6 +389,7 @@ int write_midx_file(const char *object_dir)
>         uint64_t written = 0;
>         uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>         uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +       uint32_t nr_entries;
>
>         midx_name = get_midx_filename(object_dir);
>         if (safe_create_leading_directories(midx_name)) {
> @@ -312,6 +448,8 @@ int write_midx_file(const char *object_dir)
>         ALLOC_ARRAY(pack_perm, nr_packs);
>         sort_packs_by_name(pack_names, nr_packs, pack_perm);
>
> +       get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);

Intentionally ignoring the return value (and temporarily leaking it as
a result) should have at least a comment to acknowledge it and save
reviewers some head scratching. Or even better, just free it now, even
if you don't use it.

> +
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 12/23] midx: write object ids in a chunk
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
@ 2018-06-09 17:25   ` Duy Nguyen
  0 siblings, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt |  4 ++
>  builtin/midx.c                          |  2 +
>  midx.c                                  | 50 +++++++++++++++++++++++--
>  object-store.h                          |  1 +
>  t/t5319-midx.sh                         |  4 +-
>  5 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 29bf87283a..de9ac778b6 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -307,6 +307,10 @@ CHUNK DATA:
>             name. This is the only chunk not guaranteed to be a multiple of four
>             bytes in length, so should be the last chunk for alignment reasons.
>
> +       OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)

So N is the number of objects and H is hash size? Please don't let me guess.

> +           The OIDs for all objects in the MIDX are stored in lexicographic
> +           order in this chunk.

The reason we keep all hashes together, tightly packed, is to reduce
cache footprint. Another observation is that it usually takes just 12
bytes or less to uniquely identify an object, which means we could
pack even tighter if we split the object hash into two chunks, do
bsearch in the first chunk with just <n> bytes, then verify that the
remaining 20-<n> bytes match in the second chunk. This may matter
more when we move to larger hashes. The split would of course be
configurable since different projects may have different optimal
values, but the default could be 10/10 bytes.
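
A minimal sketch of that split lookup, to make the idea concrete. The
struct, its field names and the function name are hypothetical and not
part of the MIDX format in this series; prefix_len plays the role of
<n> above, and both arrays are assumed to hold entries in the same
(fully sorted) order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20

/* Hypothetical layout: two parallel arrays instead of one OID array. */
struct split_oid_table {
	const unsigned char *prefixes; /* nr * prefix_len bytes, sorted */
	const unsigned char *suffixes; /* nr * (HASH_LEN - prefix_len) bytes */
	uint32_t nr;
	size_t prefix_len;
};

/* Return the position of 'hash' in the table, or -1 if it is absent. */
static long split_oid_lookup(const struct split_oid_table *t,
			     const unsigned char *hash)
{
	size_t suffix_len = HASH_LEN - t->prefix_len;
	uint32_t lo = 0, hi = t->nr;

	assert(t->prefix_len > 0 && t->prefix_len <= HASH_LEN);

	/* Binary search for the first entry whose prefix is >= ours. */
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		if (memcmp(t->prefixes + (size_t)mid * t->prefix_len,
			   hash, t->prefix_len) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Several objects may share a prefix; check each suffix in turn. */
	for (; lo < t->nr; lo++) {
		if (memcmp(t->prefixes + (size_t)lo * t->prefix_len,
			   hash, t->prefix_len))
			break;
		if (!memcmp(t->suffixes + (size_t)lo * suffix_len,
			    hash + t->prefix_len, suffix_len))
			return (long)lo;
	}
	return -1;
}
```

The binary search only touches the prefix array, which is where the
smaller cache footprint would come from; the suffix array is read only
for the few candidates that share the prefix.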

> +
>         (This section intentionally left incomplete.)
>
>  TRAILER:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 3a261e9bbf..86edd30174 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
>                 printf(" pack_lookup");
>         if (m->chunk_pack_names)
>                 printf(" pack_names");
> +       if (m->chunk_oid_lookup)
> +               printf(" oid_lookup");
>
>         printf("\n");
>
> diff --git a/midx.c b/midx.c
> index b20d52713c..d06bc6876a 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -14,10 +14,11 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 2
> +#define MIDX_MAX_CHUNKS 3
>  #define MIDX_CHUNK_ALIGNMENT 4
>  #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
> +#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
>  static char *get_midx_filename(const char *object_dir)
> @@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                                 m->chunk_pack_names = m->data + chunk_offset;
>                                 break;
>
> +                       case MIDX_CHUNKID_OIDLOOKUP:
> +                               m->chunk_oid_lookup = m->data + chunk_offset;
> +                               break;


I just realized: how do you protect against duplicate chunks? From
this patch, it looks like you could accept two oidlookup chunks just
fine, then silently ignore the first one.
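
One possible shape for such a check, as a sketch only. The struct and
record_chunk() are hypothetical stand-ins for the switch statement in
load_midxed_git(); the chunk ids are the ones defined in the patch.
Unknown ids are skipped here, which is an assumption about the desired
behavior.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
#define CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */

/* Hypothetical subset of the chunk pointers kept in struct midxed_git. */
struct chunk_table {
	const unsigned char *pack_names;
	const unsigned char *oid_lookup;
};

/*
 * Remember where a chunk starts. Returns 0 on success and -1 if a
 * chunk with the same id was already recorded.
 */
static int record_chunk(struct chunk_table *t, uint32_t id,
			const unsigned char *start)
{
	const unsigned char **slot;

	switch (id) {
	case CHUNKID_PACKNAMES:
		slot = &t->pack_names;
		break;
	case CHUNKID_OIDLOOKUP:
		slot = &t->oid_lookup;
		break;
	default:
		return 0; /* unknown chunk: skipped in this sketch */
	}

	if (*slot)
		return -1; /* duplicate: keep the first, report the error */
	*slot = start;
	return 0;
}
```

The caller could then die() with a "duplicate chunk" message whenever
the helper returns nonzero, instead of quietly overwriting the pointer.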

>                         case 0:
>                                 die("terminating MIDX chunk id appears earlier than expected");
>                                 break;
> @@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
> +       if (!m->chunk_oid_lookup)
> +               die("MIDX missing required OID lookup chunk");

_()

>
>         m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
>         for (i = 0; i < m->num_packs; i++) {
> @@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
>         return written;
>  }
>
> +static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
> +                                   struct pack_midx_entry *objects,
> +                                   uint32_t nr_objects)
> +{
> +       struct pack_midx_entry *list = objects;
> +       uint32_t i;
> +       size_t written = 0;
> +
> +       for (i = 0; i < nr_objects; i++) {
> +               struct pack_midx_entry *obj = list++;
> +
> +               if (i < nr_objects - 1) {
> +                       struct pack_midx_entry *next = list;
> +                       if (oidcmp(&obj->oid, &next->oid) >= 0)
> +                               BUG("OIDs not in order: %s >= %s",
> +                               oid_to_hex(&obj->oid),
> +                               oid_to_hex(&next->oid));

Indentation. I almost thought oid_to_hex() was a separate statement.

> +               }
> +
> +               hashwrite(f, obj->oid.hash, (int)hash_len);

Is (int) really necessary? There's no loss in automatically converting
unsigned char to int. But I didn't check the C spec; maybe there are
some rules...
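
For what it's worth, a small standalone check of that conversion.
takes_int() is a hypothetical stand-in for a function with an int
parameter, not the real hashwrite() signature. With a prototype in
scope the argument is converted to the parameter type as if by
assignment, and the value is preserved as long as UCHAR_MAX fits in an
int.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical callee with an int length parameter. */
static int takes_int(int len)
{
	return len;
}

/* Pass an unsigned char length without an explicit cast. */
static int pass_uchar(unsigned char hash_len)
{
	/* Holds on every platform where int is wider than char. */
	assert(UCHAR_MAX <= INT_MAX);
	return takes_int(hash_len); /* implicit, value-preserving */
}
```

So under that assumption the cast only documents intent; it does not
change what the callee receives.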

> +               written += hash_len;
> +       }
> +
> +       return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
>         unsigned char cur_chunk, num_chunks = 0;
> @@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir)
>         uint64_t written = 0;
>         uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>         uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +       struct pack_midx_entry *entries;
>         uint32_t nr_entries;
>
>         midx_name = get_midx_filename(object_dir);
> @@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir)
>         ALLOC_ARRAY(pack_perm, nr_packs);
>         sort_packs_by_name(pack_names, nr_packs, pack_perm);
>
> -       get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
> +       entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);

This change should belong to the previous patch. This patch alone
can't tell me that entries is a new allocation. If I didn't remember
the last patch, I would not realize that entries should be freed (and
it does not look like it is freed here).

>
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
>
>         cur_chunk = 0;
> -       num_chunks = 2;
> +       num_chunks = 3;
>
>         written = write_midx_header(f, num_chunks, nr_packs);
>
> @@ -467,9 +501,13 @@ int write_midx_file(const char *object_dir)
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
>
>         cur_chunk++;
> -       chunk_ids[cur_chunk] = 0;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = 0;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
> +
>         for (i = 0; i <= num_chunks; i++) {
>                 if (i && chunk_offsets[i] < chunk_offsets[i - 1])
>                         BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
> @@ -503,6 +541,10 @@ int write_midx_file(const char *object_dir)
>                                 written += write_midx_pack_names(f, pack_names, nr_packs);
>                                 break;
>
> +                       case MIDX_CHUNKID_OIDLOOKUP:
> +                               written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
> +                               break;
> +
>                         default:
>                                 BUG("trying to write unknown chunk id %"PRIx32,
>                                     chunk_ids[i]);
> diff --git a/object-store.h b/object-store.h
> index 1ba50459ca..7d14d3586e 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -102,6 +102,7 @@ struct midxed_git {
>
>         const uint32_t *chunk_pack_lookup;
>         const unsigned char *chunk_pack_names;
> +       const unsigned char *chunk_oid_lookup;
>
>         const char **pack_names;
>         char object_dir[FLEX_ARRAY];
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> index a31c387c8f..e71aa52b80 100755
> --- a/t/t5319-midx.sh
> +++ b/t/t5319-midx.sh
> @@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
>  midx_read_expect() {
>         NUM_PACKS=$1
>         cat >expect <<- EOF
> -       header: 4d494458 1 1 2 $NUM_PACKS
> -       chunks: pack_lookup pack_names
> +       header: 4d494458 1 1 3 $NUM_PACKS
> +       chunks: pack_lookup pack_names oid_lookup
>         packs:
>         EOF
>         if [ $NUM_PACKS -ge 1 ]
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 13/23] midx: write object id fanout chunk
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
@ 2018-06-09 17:28   ` Duy Nguyen
  2018-06-21 19:49     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
> @@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
> +       if (!m->chunk_oid_fanout)
> +               die("MIDX missing required OID fanout chunk");

_()

> @@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
>
>         cur_chunk++;
> -       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;

Err.. mistake?

>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;

Same here.

> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
> +
>         cur_chunk++;
>         chunk_ids[cur_chunk] = 0;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
>
-- 
Duy

* Re: [PATCH 14/23] midx: write object offsets
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
@ 2018-06-09 17:41   ` Duy Nguyen
  0 siblings, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote:
> +static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
> +                                      struct pack_midx_entry *objects, uint32_t nr_objects)
> +{
> +       struct pack_midx_entry *list = objects;
> +       size_t written = 0;
> +
> +       while (nr_large_offset) {
> +               struct pack_midx_entry *obj = list++;
> +               uint64_t offset = obj->offset;
> +
> +               if (!(offset >> 31))
> +                       continue;
> +
> +               hashwrite_be32(f, offset >> 32);
> +               hashwrite_be32(f, offset & 0xffffffff);

Not sure if you need a UL suffix or something here on 32-bit platforms.
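
A standalone sketch of the same split, for reference. split_offset()
is a hypothetical helper, not code from the patch. An unsuffixed
0xffffffff always has the value 4294967295 in an integer type wide
enough to hold it, and it is converted to uint64_t before the '&'
because the other operand is uint64_t, so the mask should behave the
same with or without a suffix.

```c
#include <stdint.h>

/* Split a 64-bit pack offset into the two 32-bit words to be written. */
static void split_offset(uint64_t offset, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)(offset >> 32);
	*lo = (uint32_t)(offset & 0xffffffff);
}
```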

> +               written += 2 * sizeof(uint32_t);
> +
> +               nr_large_offset--;
> +       }
> +
> +       return written;
> +}
> +
-- 
Duy

* Re: [PATCH 16/23] midx: prepare midxed_git struct
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
@ 2018-06-09 17:47   ` Duy Nguyen
  0 siblings, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c         | 22 ++++++++++++++++++++++
>  midx.h         |  2 ++
>  object-store.h |  7 +++++++
>  packfile.c     |  6 +++++-
>  4 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/midx.c b/midx.c
> index a49300bf75..5e9290ca8f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         exit(1);
>  }
>
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir)
> +{
> +       struct midxed_git *m = r->objects->midxed_git;
> +       struct midxed_git *m_search;
> +
> +       if (!core_midx)
> +               return 0;
> +
> +       for (m_search = m; m_search; m_search = m_search->next)
> +               if (!strcmp(object_dir, m_search->object_dir))
> +                       return 1;
> +
> +       r->objects->midxed_git = load_midxed_git(object_dir);
> +
> +       if (r->objects->midxed_git) {
> +               r->objects->midxed_git->next = m;
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
>  static size_t write_midx_header(struct hashfile *f,
>                                 unsigned char num_chunks,
>                                 uint32_t num_packs)
> diff --git a/midx.h b/midx.h
> index a1d18ed991..793203fc4a 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -5,8 +5,10 @@
>  #include "cache.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "repository.h"
>
>  struct midxed_git *load_midxed_git(const char *object_dir);
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
>
> diff --git a/object-store.h b/object-store.h
> index 9b671f1b0a..7908d46e34 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -130,6 +130,13 @@ struct raw_object_store {
>          */
>         struct oidmap *replace_map;
>
> +       /*
> +        * private data
> +        *
> +        * should only be accessed directly by packfile.c and midx.c
> +        */
> +       struct midxed_git *midxed_git;
> +
>         /*
>          * private data
>          *
> diff --git a/packfile.c b/packfile.c
> index 1a714fbde9..b91ca9b9f5 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -15,6 +15,7 @@
>  #include "tree-walk.h"
>  #include "tree.h"
>  #include "object-store.h"
> +#include "midx.h"
>
>  char *odb_pack_name(struct strbuf *buf,
>                     const unsigned char *sha1,
> @@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
>
>         if (r->objects->packed_git_initialized)
>                 return;
> +       prepare_midxed_git_one(r, r->objects->objectdir);
>         prepare_packed_git_one(r, r->objects->objectdir, 1);
>         prepare_alt_odb(r);
> -       for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
> +       for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> +               prepare_midxed_git_one(r, alt->path);
>                 prepare_packed_git_one(r, alt->path, 0);
> +       }

Ah, so the object path and the linked list in midxed_git are for
alternates. Makes sense. It would have saved me the trouble if you had
only introduced those fields now, when they are actually used (and
become self-explanatory).

>         rearrange_packed_git(r);
>         prepare_packed_git_mru(r);
>         r->objects->packed_git_initialized = 1;
> --
> 2.18.0.rc1
>
-- 
Duy

* Re: [PATCH 17/23] midx: read objects from multi-pack-index
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-06-09 17:56   ` Duy Nguyen
  2018-06-21 20:03     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 6:55 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  midx.h         |  2 ++
>  object-store.h |  1 +
>  packfile.c     |  8 ++++-
>  4 files changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 5e9290ca8f..6eca8f1b12 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -3,6 +3,7 @@
>  #include "dir.h"
>  #include "csum-file.h"
>  #include "lockfile.h"
> +#include "sha1-lookup.h"
>  #include "object-store.h"
>  #include "packfile.h"
>  #include "midx.h"
> @@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
>         m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
>         strcpy(m->object_dir, object_dir);
> -       m->data = midx_map;
> +       m->data = (const unsigned char*)midx_map;

Hmm? Why is this typecast only needed now? Or is it not really needed at all?

>
>         m->signature = get_be32(m->data);
>         if (m->signature != MIDX_SIGNATURE) {
> @@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
>         m->num_objects = ntohl(m->chunk_oid_fanout[255]);
>
> -       m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +       m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
> +
> +       ALLOC_ARRAY(m->pack_names, m->num_packs);

Please make this ALLOC_ARRAY change in the patch that adds
xcalloc(m->num_packs).

>         for (i = 0; i < m->num_packs; i++) {
>                 if (i) {
>                         if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
> @@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         exit(1);
>  }
>
> +static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
> +{
> +       struct strbuf pack_name = STRBUF_INIT;
> +
> +       if (pack_int_id >= m->num_packs)
> +               BUG("bad pack-int-id");
> +
> +       if (m->packs[pack_int_id])
> +               return 0;
> +
> +       strbuf_addstr(&pack_name, m->object_dir);
> +       strbuf_addstr(&pack_name, "/pack/");
> +       strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);

Just use strbuf_addf()

> +
> +       m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
> +       strbuf_release(&pack_name);
> +       return !m->packs[pack_int_id];

This is a weird return value convention. Normally we go zero/negative
or non-zero/zero for success/failure.

> +}
-- 
Duy

* Re: [PATCH 18/23] midx: use midx in abbreviation calculations
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-06-09 18:01   ` Duy Nguyen
  2018-06-22 18:38     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
> @@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
>
>  static void find_abbrev_len_packed(struct min_abbrev_data *mad)
>  {
> +       struct midxed_git *m;
>         struct packed_git *p;
>
> +       for (m = get_midxed_git(the_repository); m; m = m->next)
> +               find_abbrev_len_for_midx(m, mad);

If all the packs are in midx, we don't need to run the second loop
below, do we? Otherwise I don't see why we waste cycles on finding
abbrev length on midx at all.

>         for (p = get_packed_git(the_repository); p; p = p->next)
>                 find_abbrev_len_for_pack(p, mad);
>  }
-- 
Duy

* Re: [PATCH 20/23] midx: use midx in approximate_object_count
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-06-09 18:03   ` Duy Nguyen
  2018-06-22 18:39     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  packfile.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/packfile.c b/packfile.c
> index 638e113972..059b2aa097 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
>  {
>         if (!the_repository->objects->approximate_object_count_valid) {
>                 unsigned long count;
> +               struct midxed_git *m;
>                 struct packed_git *p;
>
>                 prepare_packed_git(the_repository);
>                 count = 0;
> -               for (p = the_repository->objects->packed_git; p; p = p->next) {
> +               for (m = get_midxed_git(the_repository); m; m = m->next)
> +                       count += m->num_objects;
> +               for (p = get_packed_git(the_repository); p; p = p->next) {

Please don't change this line; it's not related to this patch. The
same concern as before applies: if we have already counted objects in
the midx, we should ignore packs that belong to it, or we double count.
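
A sketch of counting without the double count. The toy_* structs and
functions are simplified, hypothetical stand-ins for midxed_git and
packed_git, and the linear name search is only for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a multi-pack-index. */
struct toy_midx {
	const char **pack_names; /* .idx names covered by this midx */
	uint32_t num_packs;
	uint32_t num_objects;
};

/* Hypothetical, simplified view of one entry in the pack list. */
struct toy_pack {
	const char *idx_name;
	uint32_t num_objects;
	struct toy_pack *next;
};

/* Return 1 if the midx already covers the named pack-index. */
static int toy_midx_covers(const struct toy_midx *m, const char *idx_name)
{
	uint32_t i;

	for (i = 0; i < m->num_packs; i++)
		if (!strcmp(m->pack_names[i], idx_name))
			return 1;
	return 0;
}

/* Count midx objects once, then add only packs the midx does not cover. */
static unsigned long toy_object_count(const struct toy_midx *m,
				      const struct toy_pack *packs)
{
	unsigned long count = m ? m->num_objects : 0;
	const struct toy_pack *p;

	for (p = packs; p; p = p->next) {
		if (m && toy_midx_covers(m, p->idx_name))
			continue;
		count += p->num_objects;
	}
	return count;
}
```

If covered packs never make it into the pack list in the first place,
the skip in the loop becomes unnecessary, but that guarantee would need
to hold before this patch for the count to be right.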

>                         if (open_pack_index(p))
>                                 continue;
>                         count += p->num_objects;
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 21/23] midx: prevent duplicate packfile loads
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-06-09 18:05   ` Duy Nguyen
  0 siblings, 0 replies; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> If the multi-pack-index contains a packfile, then we do not need to add
> that packfile to the packed_git linked list or the MRU list.

Because...?

I think I see the reason, but I'd like it spelled out to avoid any
misunderstanding.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c     | 23 +++++++++++++++++++++++
>  midx.h     |  1 +
>  packfile.c |  7 +++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 388d79b7d9..3242646fe0 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mi
>         return nth_midxed_pack_entry(m, e, pos);
>  }
>
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name)
> +{
> +       uint32_t first = 0, last = m->num_packs;
> +
> +       while (first < last) {
> +               uint32_t mid = first + (last - first) / 2;
> +               const char *current;
> +               int cmp;
> +
> +               current = m->pack_names[mid];
> +               cmp = strcmp(idx_name, current);
> +               if (!cmp)
> +                       return 1;
> +               if (cmp > 0) {
> +                       first = mid + 1;
> +                       continue;
> +               }
> +               last = mid;
> +       }
> +
> +       return 0;
> +}
> +
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir)
>  {
>         struct midxed_git *m = r->objects->midxed_git;
> diff --git a/midx.h b/midx.h
> index 497bdcc77c..c1db58d8c4 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
>                                         struct midxed_git *m,
>                                         uint32_t n);
>  int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> diff --git a/packfile.c b/packfile.c
> index 059b2aa097..479cb69b9f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
>         DIR *dir;
>         struct dirent *de;
>         struct string_list garbage = STRING_LIST_INIT_DUP;
> +       struct midxed_git *m = r->objects->midxed_git;
> +
> +       /* look for the multi-pack-index for this object directory */
> +       while (m && strcmp(m->object_dir, objdir))
> +               m = m->next;
>
>         strbuf_addstr(&path, objdir);
>         strbuf_addstr(&path, "/pack");
> @@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
>                 base_len = path.len;
>                 if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
>                         /* Don't reopen a pack we already have. */
> +                       if (m && midx_contains_pack(m, de->d_name))
> +                               continue;
>                         for (p = r->objects->packed_git; p;
>                              p = p->next) {
>                                 size_t len;
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 23/23] midx: clear midx on repack
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
@ 2018-06-09 18:13   ` Duy Nguyen
  2018-06-22 18:44     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> If a 'git repack' command replaces existing packfiles, then we must
> clear the existing multi-pack-index before moving the packfiles it
> references.

I think there are other places where we add or remove pack files and
need to reprepare_packed_git(). Any midx invalidation should be part
of that as well.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/repack.c | 8 ++++++++
>  midx.c           | 8 ++++++++
>  midx.h           | 1 +
>  3 files changed, 17 insertions(+)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 6c636e159e..66a7d8e8ea 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -8,6 +8,7 @@
>  #include "strbuf.h"
>  #include "string-list.h"
>  #include "argv-array.h"
> +#include "midx.h"
>
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> @@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>         int no_update_server_info = 0;
>         int quiet = 0;
>         int local = 0;
> +       int midx_cleared = 0;
>
>         struct option builtin_repack_options[] = {
>                 OPT_BIT('a', NULL, &pack_everything,
> @@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>                                 continue;
>                         }
>
> +                       if (!midx_cleared) {
> +                               /* if we move a packfile, it will invalidated the midx */

What about removing packs, which also happens in repack? If the
removed pack is part of midx, then midx becomes invalid as well.

> +                               clear_midx_file(get_object_directory());
> +                               midx_cleared = 1;
> +                       }
> +
>                         fname_old = mkpathdup("%s/old-%s%s", packdir,
>                                                 item->string, exts[ext].name);
>                         if (file_exists(fname_old))
> diff --git a/midx.c b/midx.c
> index e46f392fa4..1043c01fa7 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
>         FREE_AND_NULL(pack_names);
>         return 0;
>  }
> +
> +void clear_midx_file(const char *object_dir)

delete_ may be more obvious than clear_

> +{
> +       char *midx = get_midx_filename(object_dir);
> +
> +       if (remove_path(midx))
> +               die(_("failed to clear multi-pack-index at %s"), midx);

die_errno()

> +}
> diff --git a/midx.h b/midx.h
> index 6996b5ff6b..46f9f44c94 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> +void clear_midx_file(const char *object_dir);
>
>  #endif
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 01/23] midx: add design document
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
@ 2018-06-11 19:04   ` Stefan Beller
  2018-06-18 18:48     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Stefan Beller @ 2018-06-11 19:04 UTC (permalink / raw)
  To: stolee; +Cc: git, dstolee, avarab, jrnieder, jonathantanmy, mfick

On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/technical/midx.txt
>
> diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
> new file mode 100644
> index 0000000000..789f410d71
> --- /dev/null
> +++ b/Documentation/technical/midx.txt
> @@ -0,0 +1,109 @@
> +Multi-Pack-Index (MIDX) Design Notes
> +====================================
> +
> +The Git object directory contains a 'pack' directory containing
> +packfiles (with suffix ".pack") and pack-indexes (with suffix
> +".idx"). The pack-indexes provide a way to lookup objects and
> +navigate to their offset within the pack, but these must come
> +in pairs with the packfiles. This pairing depends on the file
> +names, as the pack-index differs only in suffix with its pack-
> +file. While the pack-indexes provide fast lookup per packfile,
> +this performance degrades as the number of packfiles increases,
> +because abbreviations need to inspect every packfile and we are
> +more likely to have a miss on our most-recently-used packfile.
> +For some large repositories, repacking into a single packfile
> +is not feasible due to storage space or excessive repack times.

This leads to the question how MIDX will cope with large repos or
a large number of packs. As it is just an index and not a pack itself,
I guess it is smaller by some orders of magnitude, such that it
is ok for now.

> +The multi-pack-index (MIDX for short) stores a list of objects
> +and their offsets into multiple packfiles. It contains:
> +
> +- A list of packfile names.
> +- A sorted list of object IDs.
> +- A list of metadata for the ith object ID including:
> +  - A value j referring to the jth packfile.
> +  - An offset within the jth packfile for the object.
> +- If large offsets are required, we use another list of large
> +  offsets similar to version 2 pack-indexes.
> +
> +Thus, we can provide O(log N) lookup time for any number
> +of packfiles.

This sounds great for the lookup case!
Though that is for the repo-read case.
Let's read on how the dynamics of a repository are dealt with,
e.g. integrating new packs into the MIDX, or how we deal with
objects in multiple packs.
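
A minimal sketch of the lookup described above, not taken from the
patch: a binary search over the sorted object IDs, with the pack number
and offset kept in parallel arrays. The struct and function names are
hypothetical and the in-memory layout is an assumption made for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* SHA-1 object IDs, as in the patch */

/* Hypothetical in-memory view of the lists the design document names. */
struct midx_view {
	const unsigned char (*oids)[OID_LEN]; /* sorted object IDs */
	const uint32_t *pack_id;              /* j: index into the pack-name list */
	const uint64_t *offset;               /* offset of the object inside pack j */
	size_t nr;                            /* number of objects */
};

/* Binary search; returns 1 and fills the outputs on a hit, 0 on a miss. */
static int midx_find(const struct midx_view *m, const unsigned char *oid,
		     uint32_t *pack_id, uint64_t *offset)
{
	size_t lo = 0, hi;

	assert(m && oid && pack_id && offset);
	hi = m->nr;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, m->oids[mid], OID_LEN);

		if (!cmp) {
			*pack_id = m->pack_id[mid];
			*offset = m->offset[mid];
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

The cost depends on the total number of objects N and not on the number
of packfiles, which is the O(log N) claim in the quoted text.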

> +
> +Design Details
> +--------------
> +
> +- The MIDX is stored in a file named 'multi-pack-index' in the
> +  .git/objects/pack directory. This could be stored in the pack
> +  directory of an alternate. It refers only to packfiles in that
> +  same directory.

So there is one and only one multi pack index?
That makes the case of preparing the next MIDX that contains more
pack references more interesting, as then we have to atomically update
that file.

> +- The core.midx config setting must be on to consume MIDX files.

Looking through current config options, I would rename this to a more
suggestive name. I searched for the core.idx counterpart that enables
idx files -- it turns out that is named pack.indexVersion.

So maybe pack.MultiIndex ? That could start out as a boolean as in this
series and then evolve into a version number or such later.

> +- The file format includes parameters for the object ID hash
> +  function, so a future change of hash algorithm does not require
> +  a change in format.
> +
> +- The MIDX keeps only one record per object ID. If an object appears
> +  in multiple packfiles, then the MIDX selects the copy in the most-
> +  recently modified packfile.

Okay. That answers the question from above. Though this is just the tie
breaking decision and not a hard limitation? (i.e. we could change this
later to the pack that has e.g. the shortest delta chain for that object or
such)
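
To make the tie-breaking point concrete, here is a hedged sketch (not
the code from this series) of keeping one record per object ID. The
struct, its fields and both function names are hypothetical. The policy
lives entirely in the comparator, so a different rule would only change
entry_cmp().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define OID_LEN 20

/* Hypothetical record: one copy of an object found in one pack-index. */
struct entry {
	unsigned char oid[OID_LEN];
	uint32_t pack_id;
	int64_t pack_mtime; /* modification time of the containing pack */
	uint64_t offset;
};

/* Order by object ID, then most recently modified pack first. */
static int entry_cmp(const void *va, const void *vb)
{
	const struct entry *a = va, *b = vb;
	int cmp = memcmp(a->oid, b->oid, OID_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	return 0;
}

/* Sorts in place and keeps one entry per object ID; returns the new count. */
static size_t dedup_entries(struct entry *e, size_t nr)
{
	size_t i, out = 0;

	assert(e || !nr);
	if (!nr)
		return 0;
	qsort(e, nr, sizeof(*e), entry_cmp);
	for (i = 0; i < nr; i++) {
		if (out && !memcmp(e[out - 1].oid, e[i].oid, OID_LEN))
			continue; /* older copy of an object we already kept */
		e[out++] = e[i];
	}
	return out;
}
```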

> +- If there exist packfiles in the pack directory not registered in
> +  the MIDX, then those packfiles are loaded into the `packed_git`
> +  list and `packed_git_mru` cache.

Not sure I understand the implications of this?
Does that mean we first look at the multi index and if an object is not
found, we'll search linearly through all packs that are not part of the
MIDX? That would require the MIDX to be kept up to date reasonably
to be useful.

> +- The pack-indexes (.idx files) remain in the pack directory so we
> +  can delete the MIDX file, set core.midx to false, or downgrade
> +  without any loss of information.

In the future will it be possible to have no .idx files and just have the .midx?
(I guess that depends on the strategy of how to integrate new packs into
the MIDX?)

> +- The MIDX file format uses a chunk-based approach (similar to the
> +  commit-graph file) that allows optional data to be added.

... or the index files v2 (or reftable files)? Sure, you are most familiar with
commit-graph files, but others may find it easier to have some older
file formats to relate to.

> +Future Work
> +-----------
> +
> +- Add a 'verify' subcommand to the 'git midx' builtin to verify the
> +  contents of the multi-pack-index file match the offsets listed in
> +  the corresponding pack-indexes.
> +
> +- The multi-pack-index allows many packfiles, especially in a context
> +  where repacking is expensive (such as a very large repo), or
> +  unexpected maintenance time is unacceptable (such as a high-demand
> +  build machine).

Supposedly maintenance (git gc) can be run in the background without
interfering with day-to-day life, how is the regeneration of commit graph
or MIDX files impacting the work here?

>     However, the multi-pack-index needs to be rewritten
> +  in full every time. We can extend the format to be incremental, so
> +  writes are fast. By storing a small "tip" multi-pack-index that
> +  points to large "base" MIDX files, we can keep writes fast while
> +  still reducing the number of binary searches required for object
> +  lookups.

So we can have multiple MIDX files? How would that work? Would there
be a chunk that refers to other MIDX files?

> +- The reachability bitmap is currently paired directly with a single
> +  packfile, using the pack-order as the object order to hopefully
> +  compress the bitmaps well using run-length encoding. This could be
> +  extended to pair a reachability bitmap with a multi-pack-index. If
> +  the multi-pack-index is extended to store a "stable object order"
> +  (a function Order(hash) = integer that is constant for a given hash,

This stable object order doesn't fly well with integrating new packs?

> +  even as the multi-pack-index is updated) then a reachability bitmap
> +  could point to a multi-pack-index and be updated independently.
> +
> +- Packfiles can be marked as "special" using empty files that share
> +  the initial name but replace ".pack" with ".keep" or ".promisor".
> +  We can add an optional chunk of data to the multi-pack-index that
> +  records flags of information about the packfiles. This allows new
> +  states, such as 'repacked' or 'redeltified', that can help with
> +  pack maintenance in a multi-pack environment. It may also be
> +  helpful to organize packfiles by object type (commit, tree, blob,
> +  etc.) and use this metadata to help that maintenance.
> +
> +- The partial clone feature records special "promisor" packs that
> +  may point to objects that are not stored locally, but available
> +  on request to a server. The multi-pack-index does not currently
> +  track these promisor packs.


* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
@ 2018-06-11 19:19   ` Stefan Beller
  2018-06-18 19:01     ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Stefan Beller @ 2018-06-11 19:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

Hi Derrick,
On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> The multi-pack-index (MIDX) feature generalizes the existing pack-
> index (IDX) feature by indexing objects across multiple pack-files.
>
> Describe the basic file format, using a 12-byte header followed by
> a lookup table for a list of "chunks" which will be described later.
> The file ends with a footer containing a checksum using the hash
> algorithm.
>
> The header allows later versions to create breaking changes by
> advancing the version number. We can also change the hash algorithm
> using a different version value.
>
> We will add the individual chunk format information as we introduce
> the code that writes that information.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 70a99fd142..17666b4bfc 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -252,3 +252,52 @@ Pack file entry: <+
>      corresponding packfile.
>
>      20-byte SHA-1-checksum of all of the above.
> +
> +== midx-*.midx files have the following format:
> +
> +The meta-index files refer to multiple pack-files and loose objects.

So is it meta or multi?

> +In order to allow extensions that add extra data to the MIDX, we organize
> +the body into "chunks" and provide a lookup table at the beginning of the
> +body. The header includes certain length values, such as the number of packs,
> +the number of base MIDX files, hash lengths and types.
> +
> +All 4-byte numbers are in network order.
> +
> +HEADER:
> +
> +       4-byte signature:
> +           The signature is: {'M', 'I', 'D', 'X'}
> +
> +       1-byte version number:
> +           Git only writes or recognizes version 1
> +
> +       1-byte Object Id Version
> +           Git only writes or recognizes verion 1 (SHA-1)

s/verion/version/

> +       1-byte number (C) of "chunks"
> +
> +       1-byte number (I) of base multi-pack-index files:
> +           This value is currently always zero.

Oh? Are meta-index and multi-index files different things?

> +       4-byte number (P) of pack files
> +
> +CHUNK LOOKUP:
> +
> +       (C + 1) * 12 bytes providing the chunk offsets:
> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
> +           Other 8 bytes provide offset in current file for chunk to start.
> +           (Chunks are provided in file-order, so you can infer the length
> +           using the next chunk position if necessary.)

It is so nice to have the header also have 12 bytes, so it fits right into the
lookup table. So an alternative point of view:

  If a chunk needs to store more than 8 bytes, we'll have an offset after
  the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
  of information directly after the 4 bytes.
   "MIDX" is a special chunk and must come first (does it?) and only once
  as it contains the version number.
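
For reference, a small sketch of reading the lookup table as the patch
describes it: (C + 1) entries of 12 bytes, each a 4-byte id followed by
an 8-byte offset in network order, ending with id 0. The helper names
are hypothetical, and the sketch assumes the terminating entry's offset
marks where the last chunk ends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_ENTRY_LEN 12 /* 4-byte id + 8-byte offset */

struct chunk_entry {
	uint32_t id;
	uint64_t offset;
};

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/*
 * Reads nr_chunks + 1 entries into out[]. Returns 0 on success, or -1
 * if a real chunk has id 0, the offsets go backwards, or the final
 * entry is not the terminating label.
 */
static int parse_chunk_lookup(const unsigned char *table, size_t nr_chunks,
			      struct chunk_entry *out)
{
	size_t i;

	assert(table && out);
	for (i = 0; i <= nr_chunks; i++) {
		const unsigned char *p = table + i * CHUNK_ENTRY_LEN;

		out[i].id = get_be32(p);
		out[i].offset = get_be64(p + 4);
		if (i < nr_chunks && !out[i].id)
			return -1; /* terminator arrived too early */
		if (i && out[i].offset < out[i - 1].offset)
			return -1; /* chunks must be listed in file order */
	}
	return out[nr_chunks].id == 0 ? 0 : -1;
}

/* Length of chunk i, inferred from where the next entry starts. */
static uint64_t chunk_len(const struct chunk_entry *c, size_t i)
{
	return c[i + 1].offset - c[i].offset;
}
```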

> +       The remaining data in the body is described one chunk at a time, and
> +       these chunks may be given in any order. Chunks are required unless
> +       otherwise specified.
> +
> +CHUNK DATA:
> +
> +       (This section intentionally left incomplete.)
> +
> +TRAILER:
> +
> +       H-byte HASH-checksum of all of the above.

This means we have to rehash the whole file for updating its contents.
okay.


* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
  2018-06-07 17:20   ` Duy Nguyen
@ 2018-06-11 21:02   ` Stefan Beller
  2018-06-18 19:40     ` Derrick Stolee
  1 sibling, 1 reply; 192+ messages in thread
From: Stefan Beller @ 2018-06-11 21:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

Hi Derrick,
On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> This new 'git midx' builtin will be the plumbing access for writing,
> reading, and checking multi-pack-index (MIDX) files. The initial
> implementation is a no-op.

Let's talk about the name for a second:

.idx files are written by git-index-pack or as part of
git-pack-objects (which just calls write_idx_file as part
of finish_tmp_packfile), and the name actually suggests
it writes the index files. I have a hard time understanding
what the git-midx command does[1].

With both commit graph as well as multi index we introduce
a command that is centered around that concept (similar to
git-remote or git-config that are centered around a concept,
that is closely resembled by a file), but for indexes for packs
it was integrated differently into Git. So I am not sure if I want
to suggest to integrate it into the packfile commands as that
doesn't really fit. But maybe we can have a name that is human
readable instead of the file suffix? Maybe

  git multi-pack-index ?

I suppose that eventually this command is not really used by
users as it will be used by other porcelain commands in the
background or even as part of repack/gc so I am not worried
about a long name, but I'd be more worried about understandability.

[1] While these names are not perfect for the layman, it is okay?
  I am sure you are aware of https://git-man-page-generator.lokaltog.net/


> new file mode 100644
> index 0000000000..2bd886f1a2
> --- /dev/null
> +++ b/Documentation/git-midx.txt
> @@ -0,0 +1,29 @@
> +git-midx(1)
> +============
> +
> +NAME
> +----
> +git-midx - Write and verify multi-pack-indexes (MIDX files).

The reading is done as part of all other commands.

> +
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git midx' [--object-dir <dir>]
> +
> +DESCRIPTION
> +-----------
> +Write or verify a MIDX file.
> +
> +OPTIONS
> +-------
> +
> +--object-dir <dir>::
> +       Use given directory for the location of Git objects. We check
> +       <dir>/packs/multi-pack-index for the current MIDX file, and
> +       <dir>/packs for the pack-files to index.
> +
> +

Maybe we could have a SEE ALSO section that points at
the explanation of multi index files?
(c.f. man git-submodule that has a  SEE ALSO
gitsubmodules(7), gitmodules(5) explaining concepts(7)
and the file(5))

But as this is plumbing and users should not need to worry about it
this is optional, I would think.


* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
  2018-06-07 17:35   ` Duy Nguyen
@ 2018-06-12 15:00   ` Duy Nguyen
  2018-06-19 12:54     ` Derrick Stolee
  1 sibling, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-12 15:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/midx.c b/midx.c
> index 616af66b13..3e55422a21 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1,9 +1,62 @@
>  #include "git-compat-util.h"
>  #include "cache.h"
>  #include "dir.h"
> +#include "csum-file.h"
> +#include "lockfile.h"
>  #include "midx.h"
>
> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> +#define MIDX_VERSION 1
> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
...
> +static size_t write_midx_header(struct hashfile *f,
> +                               unsigned char num_chunks,
> +                               uint32_t num_packs)
> +{
> +       char byte_values[4];
> +       hashwrite_be32(f, MIDX_SIGNATURE);
> +       byte_values[0] = MIDX_VERSION;
> +       byte_values[1] = MIDX_HASH_VERSION;

Quoting from "State of NewHash work, future directions, and discussion" [1]

* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.

[1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60

> +       byte_values[2] = num_chunks;
> +       byte_values[3] = 0; /* unused */
> +       hashwrite(f, byte_values, sizeof(byte_values));
> +       hashwrite_be32(f, num_packs);
> +
> +       return MIDX_HEADER_SIZE;
> +}
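
A standalone sketch of the 12-byte header layout from patch 02, for
readers who want to check the byte positions. It keeps the one-byte
hash version value used in the quoted patch; the struct and the
encode/decode helpers are hypothetical and write to a plain buffer
instead of a hashfile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_HEADER_SIZE 12

/* Hypothetical parsed form of the header fields listed in the patch. */
struct midx_header {
	uint8_t version;
	uint8_t hash_version;
	uint8_t num_chunks;
	uint8_t num_base; /* base multi-pack-index files, currently zero */
	uint32_t num_packs;
};

static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Writes the header into buf and returns the number of bytes written. */
static size_t encode_header(unsigned char *buf, const struct midx_header *h)
{
	assert(buf && h);
	put_be32(buf, MIDX_SIGNATURE);
	buf[4] = h->version;
	buf[5] = h->hash_version;
	buf[6] = h->num_chunks;
	buf[7] = h->num_base;
	put_be32(buf + 8, h->num_packs);
	return MIDX_HEADER_SIZE;
}

/* Returns 0 on success, -1 if the signature does not match. */
static int decode_header(const unsigned char *buf, struct midx_header *h)
{
	assert(buf && h);
	if (get_be32(buf) != MIDX_SIGNATURE)
		return -1;
	h->version = buf[4];
	h->hash_version = buf[5];
	h->num_chunks = buf[6];
	h->num_base = buf[7];
	h->num_packs = get_be32(buf + 8);
	return 0;
}
```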
-- 
Duy


* Re: [PATCH 01/23] midx: add design document
  2018-06-11 19:04   ` Stefan Beller
@ 2018-06-18 18:48     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-18 18:48 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, dstolee, avarab, jrnieder, jonathantanmy, mfick

On 6/11/2018 3:04 PM, Stefan Beller wrote:
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
>>   1 file changed, 109 insertions(+)
>>   create mode 100644 Documentation/technical/midx.txt
>>
>> diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
>> new file mode 100644
>> index 0000000000..789f410d71
>> --- /dev/null
>> +++ b/Documentation/technical/midx.txt
>> @@ -0,0 +1,109 @@
>> +Multi-Pack-Index (MIDX) Design Notes
>> +====================================
>> +
>> +The Git object directory contains a 'pack' directory containing
>> +packfiles (with suffix ".pack") and pack-indexes (with suffix
>> +".idx"). The pack-indexes provide a way to lookup objects and
>> +navigate to their offset within the pack, but these must come
>> +in pairs with the packfiles. This pairing depends on the file
>> +names, as the pack-index differs only in suffix with its pack-
>> +file. While the pack-indexes provide fast lookup per packfile,
>> +this performance degrades as the number of packfiles increases,
>> +because abbreviations need to inspect every packfile and we are
>> +more likely to have a miss on our most-recently-used packfile.
>> +For some large repositories, repacking into a single packfile
>> +is not feasible due to storage space or excessive repack times.
> This leads to the question how MIDX will cope with large repos or
> a large number of packs. As it is just an index and not a pack itself,
> I guess it is smaller by some orders of magnitude, such that it
> is ok for now.

The MIDX file is only slightly larger than the union of the IDX files 
for those packfiles.

>> +The multi-pack-index (MIDX for short) stores a list of objects
>> +and their offsets into multiple packfiles. It contains:
>> +
>> +- A list of packfile names.
>> +- A sorted list of object IDs.
>> +- A list of metadata for the ith object ID including:
>> +  - A value j referring to the jth packfile.
>> +  - An offset within the jth packfile for the object.
>> +- If large offsets are required, we use another list of large
>> +  offsets similar to version 2 pack-indexes.
>> +
>> +Thus, we can provide O(log N) lookup time for any number
>> +of packfiles.
> This sounds great for the lookup case!
> Though that is for the repo-read case.
> Let's read on how the dynamics of a repository are dealt with,
> e.g. integrating new packs into the MIDX, or how we deal with
> objects in multiple packs.
>
>> +
>> +Design Details
>> +--------------
>> +
>> +- The MIDX is stored in a file named 'multi-pack-index' in the
>> +  .git/objects/pack directory. This could be stored in the pack
>> +  directory of an alternate. It refers only to packfiles in that
>> +  same directory.
> So there is one and only one multi pack index?
> That makes the case of preparing the next MIDX that contains more
> pack references more interesting, as then we have to atomically update
> that file.

There is only one, but we can make the file incremental without changing 
this name similar to how the split index works.

>
>> +- The core.midx config setting must be on to consume MIDX files.
> Looking through current config options, I would rename this to a more
> suggestive name. I searched for the core.idx counterpart that enables
> idx files -- it turns out that is named pack.indexVersion.
>
> So maybe pack.MultiIndex ? That could start out as a boolean as in this
> series and then evolve into a version number or such later.

I'll use that name and rename this file to 
Documentation/technical/multi-pack-index.txt

>
>> +- The file format includes parameters for the object ID hash
>> +  function, so a future change of hash algorithm does not require
>> +  a change in format.
>> +
>> +- The MIDX keeps only one record per object ID. If an object appears
>> +  in multiple packfiles, then the MIDX selects the copy in the most-
>> +  recently modified packfile.
> Okay. That answers the question from above. Though this is just the tie
> breaking decision and not a hard limitation? (i.e. we could change this
> later to the pack that has e.g. the shortest delta chain for that object or
> such)

This is a soft requirement. It is an easy thing to track at the moment. 
We can compute the MIDX without opening a packfile, for instance.

>
>> +- If there exist packfiles in the pack directory not registered in
>> +  the MIDX, then those packfiles are loaded into the `packed_git`
>> +  list and `packed_git_mru` cache.
> Not sure I understand the implications of this?
> Does that mean we first look at the multi index and if an object is not
> found, we'll search linearly through all packs that are not part of the
> MIDX? That would require the MIDX to be kept up to date reasonably
> to be useful.

If you add a packfile to the pack directory, you can immediately start 
consuming it. You do not need to wait for the MIDX to be updated. The 
more asynchronous these auxiliary data structures (MIDX, commit-graph) 
can be, the better. This is in direct contrast to the reachability 
bitmap which is useless without its corresponding packfile.
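
As a toy illustration of that lookup order (not the real
packed_git code): consult the MIDX first, then scan the packs it does
not cover. Object IDs are shortened to 32-bit keys for brevity, and
every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for either a MIDX or a single pack-index. */
struct toy_index {
	const uint32_t *keys; /* sorted */
	size_t nr;
};

enum found_in { NOT_FOUND, IN_MIDX, IN_UNINDEXED_PACK };

/* Binary search over the sorted keys; returns 1 on a hit. */
static int toy_contains(const struct toy_index *idx, uint32_t key)
{
	size_t lo = 0, hi;

	assert(idx);
	hi = idx->nr;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx->keys[mid] == key)
			return 1;
		if (key < idx->keys[mid])
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/*
 * One binary search in the MIDX (which may be absent), then a linear
 * walk over the packs that were added after the MIDX was written.
 */
static enum found_in find_object(const struct toy_index *midx,
				 const struct toy_index *extra, size_t nr_extra,
				 uint32_t key)
{
	size_t i;

	if (midx && toy_contains(midx, key))
		return IN_MIDX;
	for (i = 0; i < nr_extra; i++)
		if (toy_contains(&extra[i], key))
			return IN_UNINDEXED_PACK;
	return NOT_FOUND;
}
```

A stale MIDX costs extra searches in this model but never hides an
object, and with no MIDX at all the loop degrades to the per-pack
search.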

>
>> +- The pack-indexes (.idx files) remain in the pack directory so we
>> +  can delete the MIDX file, set core.midx to false, or downgrade
>> +  without any loss of information.
> In the future will it be possible to have no .idx files and just have the .midx?
> (I guess that depends on the strategy of how to integrate new packs into
> the MIDX?)

This may be part of a future plan, but we need to know a user will never 
set pack.multiIndex to false if they deleted their IDX files.

>> +- The MIDX file format uses a chunk-based approach (similar to the
>> +  commit-graph file) that allows optional data to be added.
> ... or the index files v2 (or reftable files)? Sure, you are most familiar with
> commit-graph files, but others may find it easier to have some older
> file formats to relate to.

I specifically mean that we have a "table of contents" describing the 
list of chunks. IDX v2 relies on a fixed ordering of the tables, and the 
offsets are computed by consuming the last fanout value (number of 
objects). Also, I'm not sure how to add optional data (data that can 
safely be ignored by an earlier version of Git) to an IDX without 
incrementing the version.

>> +Future Work
>> +-----------
>> +
>> +- Add a 'verify' subcommand to the 'git midx' builtin to verify the
>> +  contents of the multi-pack-index file match the offsets listed in
>> +  the corresponding pack-indexes.
>> +
>> +- The multi-pack-index allows many packfiles, especially in a context
>> +  where repacking is expensive (such as a very large repo), or
>> +  unexpected maintenance time is unacceptable (such as a high-demand
>> +  build machine).
> Supposedly maintenance (git gc) can be run in the background without
> interfering with day-to-day life, how is the regeneration of commit graph
> or MIDX files impacting the work here?

Assuming infinite RAM and disk, then yes, maintenance would not interfere
with daily life. A big problem we see is that users frequently don't have the
disk space to store a second copy of their packfiles on disk, even if we 
could organize a new packfile in reasonable time.

The MIDX is only invalid when a packfile it references is deleted.

The commit-graph is never invalid, except if a commit is deleted by GC. 
But even in that case, how did we "see" the commit ID? As long as we 
don't access these commits, the commit-graph feature doesn't violate 
expectations and can be generated asynchronously with a GC and repack.

>
>>      However, the multi-pack-index needs to be rewritten
>> +  in full every time. We can extend the format to be incremental, so
>> +  writes are fast. By storing a small "tip" multi-pack-index that
>> +  points to large "base" MIDX files, we can keep writes fast while
>> +  still reducing the number of binary searches required for object
>> +  lookups.
> So we can have multiple MIDX files? How would that work? Would there
> be a chunk that refers to other MIDX files?

We can have an optional chunk that refers to a list of "base" MIDX 
files. We then load that full list into multiple 'midxed_git' structs, 
and iterate through the list. VSTS keeps this list to a maximum length
of 3 (LARGE, Medium, tiny), merging files as necessary.

>
>> +- The reachability bitmap is currently paired directly with a single
>> +  packfile, using the pack-order as the object order to hopefully
>> +  compress the bitmaps well using run-length encoding. This could be
>> +  extended to pair a reachability bitmap with a multi-pack-index. If
>> +  the multi-pack-index is extended to store a "stable object order"
>> +  (a function Order(hash) = integer that is constant for a given hash,
> This stable object order doesn't fly well with integrating new packs?

When you integrate new packs, the lexicographic order changes as the new 
objects are inserted into the list. However, if we track a separate 
integer value (order[obj]) associated with the object, and keep that 
constant, we can track a stable order for an object across multiple 
generations of MIDX files. New objects are assigned order values larger 
than the previous order values. We can select a "good" ordering of the 
new objects as we extend the list.

This requires a new chunk in the file format. It also helps to store the 
reverse-lookup lex[i] which provides the lexicographic position of the 
object 'obj' with stable-order order[obj] == i.
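
A rough sketch of that bookkeeping, under loud assumptions: object IDs
are shortened to 32-bit keys, the new objects are disjoint from the old
ones, and new objects simply take order values in lexicographic order
(a real implementation could pick a better ordering). Both function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge sorted new keys into sorted old keys. Old objects keep their
 * order values; new objects get old_nr, old_nr + 1, ... The output
 * arrays need room for old_nr + new_nr entries. Returns the new count.
 */
static size_t extend_order(const uint32_t *old_keys, const uint32_t *old_order,
			   size_t old_nr,
			   const uint32_t *new_keys, size_t new_nr,
			   uint32_t *keys, uint32_t *order)
{
	size_t i = 0, j = 0, n = 0;
	uint32_t next = (uint32_t)old_nr;

	assert(keys && order);
	while (i < old_nr || j < new_nr) {
		if (j == new_nr || (i < old_nr && old_keys[i] < new_keys[j])) {
			keys[n] = old_keys[i];
			order[n] = old_order[i]; /* unchanged across rewrites */
			i++;
		} else {
			keys[n] = new_keys[j];
			order[n] = next++; /* larger than every old value */
			j++;
		}
		n++;
	}
	return n;
}

/* Reverse lookup: lex[i] is the lexicographic position p with order[p] == i. */
static void build_lex(const uint32_t *order, uint32_t *lex, size_t nr)
{
	size_t p;

	assert(order && lex);
	for (p = 0; p < nr; p++) {
		assert(order[p] < nr); /* order[] must be a permutation */
		lex[order[p]] = (uint32_t)p;
	}
}
```

Inserting new objects shifts lexicographic positions, but order[] for an
existing object stays the same, which is what a bitmap keyed on that
order would rely on.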

I'm being intentionally vague in this document to hint towards a 
valuable feature without giving robust details of something that may 
never get built. But, I do think this is one feature of the MIDX that 
would be of the most value for services that use Git as a server 
process, as it allows the reachability bitmap to be set to this stable 
order instead of a single pack order. This is speculation on my part, as 
I don't know how such servers are maintained in the background.

>
>> +  even as the multi-pack-index is updated) then a reachability bitmap
>> +  could point to a multi-pack-index and be updated independently.
>> +
>> +- Packfiles can be marked as "special" using empty files that share
>> +  the initial name but replace ".pack" with ".keep" or ".promisor".
>> +  We can add an optional chunk of data to the multi-pack-index that
>> +  records flags of information about the packfiles. This allows new
>> +  states, such as 'repacked' or 'redeltified', that can help with
>> +  pack maintenance in a multi-pack environment. It may also be
>> +  helpful to organize packfiles by object type (commit, tree, blob,
>> +  etc.) and use this metadata to help that maintenance.
>> +
>> +- The partial clone feature records special "promisor" packs that
>> +  may point to objects that are not stored locally, but available
>> +  on request to a server. The multi-pack-index does not currently
>> +  track these promisor packs.



* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-11 19:19   ` Stefan Beller
@ 2018-06-18 19:01     ` Derrick Stolee
  2018-06-18 19:41       ` Stefan Beller
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:01 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/11/2018 3:19 PM, Stefan Beller wrote:
> Hi Derrick,
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> The multi-pack-index (MIDX) feature generalizes the existing pack-
>> index (IDX) feature by indexing objects across multiple pack-files.
>>
>> Describe the basic file format, using a 12-byte header followed by
>> a lookup table for a list of "chunks" which will be described later.
>> The file ends with a footer containing a checksum using the hash
>> algorithm.
>>
>> The header allows later versions to create breaking changes by
>> advancing the version number. We can also change the hash algorithm
>> using a different version value.
>>
>> We will add the individual chunk format information as we introduce
>> the code that writes that information.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
>>   1 file changed, 49 insertions(+)
>>
>> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
>> index 70a99fd142..17666b4bfc 100644
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -252,3 +252,52 @@ Pack file entry: <+
>>       corresponding packfile.
>>
>>       20-byte SHA-1-checksum of all of the above.
>> +
>> +== midx-*.midx files have the following format:
>> +
>> +The meta-index files refer to multiple pack-files and loose objects.
> So is it meta or multi?

Good catch. We were calling this the meta-index internally before 
changing to "multi-pack-index" (helps to not change the acronym).

>
>> +In order to allow extensions that add extra data to the MIDX, we organize
>> +the body into "chunks" and provide a lookup table at the beginning of the
>> +body. The header includes certain length values, such as the number of packs,
>> +the number of base MIDX files, hash lengths and types.
>> +
>> +All 4-byte numbers are in network order.
>> +
>> +HEADER:
>> +
>> +       4-byte signature:
>> +           The signature is: {'M', 'I', 'D', 'X'}
>> +
>> +       1-byte version number:
>> +           Git only writes or recognizes version 1
>> +
>> +       1-byte Object Id Version
>> +           Git only writes or recognizes verion 1 (SHA-1)
> s/verion/version/
>
>> +       1-byte number (C) of "chunks"
>> +
>> +       1-byte number (I) of base multi-pack-index files:
>> +           This value is currently always zero.
> Oh? Are meta-index and multi-index files different things?

Not intended to be different things, but this number is related to 
making the feature incremental.

>
>> +       4-byte number (P) of pack files
>> +
>> +CHUNK LOOKUP:
>> +
>> +       (C + 1) * 12 bytes providing the chunk offsets:
>> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
>> +           Other 8 bytes provide offset in current file for chunk to start.
>> +           (Chunks are provided in file-order, so you can infer the length
>> +           using the next chunk position if necessary.)
> It is so nice to have the header also have 12 bytes, so it fits right into the
> lookup table. So an alternative point of view:
>
>    If a chunk needs to store more than 8 bytes, we'll have an offset after
>    the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
>    of information directly after the 4 bytes.
>     "MIDX" is a special chunk and must come first (does it?) and only once
>    as it contains the version number.

This sounds feasible, but unnecessarily complicated. I don't think any 
other chunk will be this small.

>> +       The remaining data in the body is described one chunk at a time, and
>> +       these chunks may be given in any order. Chunks are required unless
>> +       otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> +       (This section intentionally left incomplete.)
>> +
>> +TRAILER:
>> +
>> +       H-byte HASH-checksum of all of the above.
> This means we have to rehash the whole file for updating its contents.
> okay.



* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 17:20   ` Duy Nguyen
@ 2018-06-18 19:23     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:23 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 1:20 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
>> new file mode 100644
>> index 0000000000..2bd886f1a2
>> --- /dev/null
>> +++ b/Documentation/git-midx.txt
>> @@ -0,0 +1,29 @@
>> +git-midx(1)
>> +============
>> +
>> +NAME
>> +----
>> +git-midx - Write and verify multi-pack-indexes (MIDX files).
> No full stop. This headline is collected automatically with the others,
> and its having a full stop while the rest do not looks strange.
>
>> diff --git a/builtin/midx.c b/builtin/midx.c
>> new file mode 100644
>> index 0000000000..59ea92178f
>> --- /dev/null
>> +++ b/builtin/midx.c
>> @@ -0,0 +1,38 @@
>> +#include "builtin.h"
>> +#include "cache.h"
>> +#include "config.h"
>> +#include "git-compat-util.h"
> You only need either cache.h or git-compat-util.h. If cache.h is here,
> git-compat-util can be removed.
>
>> +#include "parse-options.h"
>> +
>> +static char const * const builtin_midx_usage[] ={
>> +       N_("git midx [--object-dir <dir>]"),
>> +       NULL
>> +};
>> +
>> +static struct opts_midx {
>> +       const char *object_dir;
>> +} opts;
>> +
>> +int cmd_midx(int argc, const char **argv, const char *prefix)
>> +{
>> +       static struct option builtin_midx_options[] = {
>> +               { OPTION_STRING, 0, "object-dir", &opts.object_dir,
> For paths (including dir), OPTION_FILENAME may be a better option to
> handle correctly when the command is run in a subdir. See df217ed643
> (parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23)
> for more info.
Thanks for the pointer!
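
To illustrate why the option type matters: when a command is started from a subdirectory, a relative --object-dir has to be re-anchored before it is used, because the process may no longer be running in the directory the user typed it in. The sketch below only demonstrates that idea. apply_prefix() is a hypothetical stand-in, not Git's implementation, and it assumes POSIX-style absolute paths.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the path fix-up: "prefix" is the
 * subdirectory the command was started from (with a trailing slash,
 * or NULL when run from the top level). Absolute paths are kept
 * as-is; relative paths are re-anchored under the prefix.
 * The caller frees the result.
 */
static char *apply_prefix(const char *prefix, const char *arg)
{
	size_t prefix_len = prefix ? strlen(prefix) : 0;
	size_t arg_len = strlen(arg);
	char *out;

	if (arg[0] == '/')
		prefix_len = 0; /* absolute path: ignore the prefix */

	out = malloc(prefix_len + arg_len + 1);
	if (!out)
		return NULL;
	if (prefix_len)
		memcpy(out, prefix, prefix_len);
	memcpy(out + prefix_len, arg, arg_len + 1); /* copies the NUL too */
	return out;
}
```

With a plain string option the argument would be used verbatim, so "../objects" typed in sub/dir/ would be resolved against the wrong directory.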

>
>> +                 N_("dir"),
>> +                 N_("The object directory containing set of packfile and pack-index pairs.") },
> Other help strings do not have full stop either (I only checked a
> couple commands though)
>
> Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME
> for some reason)?
>
>> +               OPT_END(),
>> +       };
>> +
>> +       if (argc == 2 && !strcmp(argv[1], "-h"))
>> +               usage_with_options(builtin_midx_usage, builtin_midx_options);
>> +
>> +       git_config(git_default_config, NULL);
>> +
>> +       argc = parse_options(argc, argv, prefix,
>> +                            builtin_midx_options,
>> +                            builtin_midx_usage, 0);
>> +
>> +       if (!opts.object_dir)
>> +               opts.object_dir = get_object_directory();
>> +
>> +       return 0;
>> +}
>> diff --git a/git.c b/git.c
>> index c2f48d53dd..400fadd677 100644
>> --- a/git.c
>> +++ b/git.c
>> @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
>>          { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>>          { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>>          { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
>> +       { "midx", cmd_midx, RUN_SETUP },
> If it's a plumbing command and can take an --object-dir, then I don't
> think you should require it to run in a repo (with RUN_SETUP).
> RUN_SETUP_GENTLY may be better. You could even leave it empty here and
> call setup_git_directory() only when --object-dir is not set.

I agree. Good point. This could be run to maintain an alternate without 
any .git folder.

>
>>          { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
>>          { "mktree", cmd_mktree, RUN_SETUP },
>>          { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-11 21:02   ` Stefan Beller
@ 2018-06-18 19:40     ` Derrick Stolee
  2018-06-18 19:55       ` Stefan Beller
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:40 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/11/2018 5:02 PM, Stefan Beller wrote:
> Hi Derrick,
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> This new 'git midx' builtin will be the plumbing access for writing,
>> reading, and checking multi-pack-index (MIDX) files. The initial
>> implementation is a no-op.
> Let's talk about the name for a second:
>
> .idx files are written by git-index-pack or as part of
> git-pack-objects (which just calls write_idx_file as part
> of finish_tmp_packfile), and the name actually suggests
> it writes the index files. I have a hard time understanding
> what the git-midx command does[1].
>
> With both commit graph as well as multi index we introduce
> a command that is centered around that concept (similar to
> git-remote or git-config that are centered around a concept,
> that is closely resembled by a file), but for indexes for packs
> it was integrated differently into Git. So I am not sure if I want
> to suggest to integrate it into the packfile commands as that
> doesn't really fit. But maybe we can have a name that is human
> readable instead of the file suffix? Maybe
>
>    git multi-pack-index ?
>
> I suppose that eventually this command is not really used by
> users as it will be used by other porcelain commands in the
> background or even as part of repack/gc so I am not worried
> about a long name, but I'd be more worried about understandability.

I'll use "git multi-pack-index" in v2. I'll keep "midx.c" in the root, 
though, if that is OK.

> [1] While these names are not perfect for the layman, it is okay?
>    I am sure you are aware of https://git-man-page-generator.lokaltog.net/

I was not, and enjoyed that quite a bit.

Thanks,
-Stolee

>
>
>> new file mode 100644
>> index 0000000000..2bd886f1a2
>> --- /dev/null
>> +++ b/Documentation/git-midx.txt
>> @@ -0,0 +1,29 @@
>> +git-midx(1)
>> +============
>> +
>> +NAME
>> +----
>> +git-midx - Write and verify multi-pack-indexes (MIDX files).
> The reading is done as part of all other commands.

I like to think the 'read' verb is a subset of "verify" because we are 
checking for information about the MIDX, and mostly for tests or debugging.

>
>> +
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'git midx' [--object-dir <dir>]
>> +
>> +DESCRIPTION
>> +-----------
>> +Write or verify a MIDX file.
>> +
>> +OPTIONS
>> +-------
>> +
>> +--object-dir <dir>::
>> +       Use given directory for the location of Git objects. We check
>> +       <dir>/packs/multi-pack-index for the current MIDX file, and
>> +       <dir>/packs for the pack-files to index.
>> +
>> +
> Maybe we could have a SEE ALSO section that points at
> the explanation of multi index files?
> (c.f. man git-submodule that has a  SEE ALSO
> gitsubmodules(7), gitmodules(5) explaining concepts(7)
> and the file(5))
>
> But as this is plumbing and users should not need to worry about it
> this is optional, I would think.

The design document is also in 'Documentation/technical' instead of just 
'Documentation/'. Do we have a pattern of linking to the technical 
documents?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-18 19:01     ` Derrick Stolee
@ 2018-06-18 19:41       ` Stefan Beller
  0 siblings, 0 replies; 192+ messages in thread
From: Stefan Beller @ 2018-06-18 19:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

> >> +       (C + 1) * 12 bytes providing the chunk offsets:
> >> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
> >> +           Other 8 bytes provide offset in current file for chunk to start.
> >> +           (Chunks are provided in file-order, so you can infer the length
> >> +           using the next chunk position if necessary.)
> > It is so nice to have the header also have 12 bytes, so it fits right into the
> > lookup table. So an alternative point of view:
> >
> >    If a chunk needs to store more than 8 bytes, we'll have an offset after
> >    the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
> >    of information directly after the 4 bytes.
> >     "MIDX" is a special chunk and must come first (does it?) and only once
> >    as it contains the version number.
>
> This sounds feasible, but unnecessarily complicated. I don't think any
> other chunk will be this small.

I was just writing it as a way to test if I really understood what you said
in the doc, not as a suggestion to incorporate it.

Stefan

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-18 19:40     ` Derrick Stolee
@ 2018-06-18 19:55       ` Stefan Beller
  2018-06-18 19:58         ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Stefan Beller @ 2018-06-18 19:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

> > But as this is plumbing and users should not need to worry about it
> > this is optional, I would think.
>
> The design document is also in 'Documentation/technical' instead of just
> 'Documentation/'. Do we have a pattern of linking to the technical
> documents?

Apparently we do (and I was not aware of it):

    $ git -C Documentation/ grep link:technical
    git-credential.txt:23:link:technical/api-credentials.html[the Git credential API] for more
    git.txt:839:link:technical/api-index.html[Git API documentation].
    gitcredentials.txt:184:link:technical/api-credentials.html[credentials API] for details.
    technical/http-protocol.txt:517:link:technical/pack-protocol.html
    technical/http-protocol.txt:518:link:technical/protocol-capabilities.html
    user-manual.txt:3220:found in link:technical/pack-format.html[pack format].

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-18 19:55       ` Stefan Beller
@ 2018-06-18 19:58         ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:58 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/18/2018 3:55 PM, Stefan Beller wrote:
>>> But as this is plumbing and users should not need to worry about it
>>> this is optional, I would think.
>> The design document is also in 'Documentation/technical' instead of just
>> 'Documentation/'. Do we have a pattern of linking to the technical
>> documents?
> Apparently we do (and I was not aware of it):
>
>      $ git -C Documentation/ grep link:technical
>      git-credential.txt:23:link:technical/api-credentials.html[the Git credential API] for more
>      git.txt:839:link:technical/api-index.html[Git API documentation].
>      gitcredentials.txt:184:link:technical/api-credentials.html[credentials API] for details.
>      technical/http-protocol.txt:517:link:technical/pack-protocol.html
>      technical/http-protocol.txt:518:link:technical/protocol-capabilities.html
>      user-manual.txt:3220:found in link:technical/pack-format.html[pack format].

Thanks! I'll add some links.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-12 15:00   ` Duy Nguyen
@ 2018-06-19 12:54     ` Derrick Stolee
  2018-06-19 14:59       ` Duy Nguyen
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-19 12:54 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/12/2018 11:00 AM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/midx.c b/midx.c
>> index 616af66b13..3e55422a21 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -1,9 +1,62 @@
>>   #include "git-compat-util.h"
>>   #include "cache.h"
>>   #include "dir.h"
>> +#include "csum-file.h"
>> +#include "lockfile.h"
>>   #include "midx.h"
>>
>> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>> +#define MIDX_VERSION 1
>> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
> ...
>> +static size_t write_midx_header(struct hashfile *f,
>> +                               unsigned char num_chunks,
>> +                               uint32_t num_packs)
>> +{
>> +       char byte_values[4];
>> +       hashwrite_be32(f, MIDX_SIGNATURE);
>> +       byte_values[0] = MIDX_VERSION;
>> +       byte_values[1] = MIDX_HASH_VERSION;
> Quoting from "State of NewHash work, future directions, and discussion" [1]
>
> * If you need to serialize an algorithm identifier into your data
>    format, use the format_id field of struct git_hash_algo.  It's
>    designed specifically for that purpose.
>
> [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60

Thanks! I'll also use the_hash_algo->rawsz to infer the length of the 
hash function.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-19 12:54     ` Derrick Stolee
@ 2018-06-19 14:59       ` Duy Nguyen
  2018-06-19 15:24         ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-19 14:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Tue, Jun 19, 2018 at 2:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/12/2018 11:00 AM, Duy Nguyen wrote:
> > On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
> >> diff --git a/midx.c b/midx.c
> >> index 616af66b13..3e55422a21 100644
> >> --- a/midx.c
> >> +++ b/midx.c
> >> @@ -1,9 +1,62 @@
> >>   #include "git-compat-util.h"
> >>   #include "cache.h"
> >>   #include "dir.h"
> >> +#include "csum-file.h"
> >> +#include "lockfile.h"
> >>   #include "midx.h"
> >>
> >> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> >> +#define MIDX_VERSION 1
> >> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
> > ...
> >> +static size_t write_midx_header(struct hashfile *f,
> >> +                               unsigned char num_chunks,
> >> +                               uint32_t num_packs)
> >> +{
> >> +       char byte_values[4];
> >> +       hashwrite_be32(f, MIDX_SIGNATURE);
> >> +       byte_values[0] = MIDX_VERSION;
> >> +       byte_values[1] = MIDX_HASH_VERSION;
> > Quoting from "State of NewHash work, future directions, and discussion" [1]
> >
> > * If you need to serialize an algorithm identifier into your data
> >    format, use the format_id field of struct git_hash_algo.  It's
> >    designed specifically for that purpose.
> >
> > [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60
>
> Thanks! I'll also use the_hash_algo->rawsz to infer the length of the
> hash function.

BTW, since you're the author of commit-graph.c, you may notice it has
the same problem. Don't touch that code. Brian already has some WIP
changes [1]. We just make sure new code does not add extra work for
him. I expect he'll send all those patches out soon.

[1] https://github.com/bk2204/git/commit/3f9031e06cfb21534eb7dfff7b54e7598ac1149f

-- 
Duy

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-19 14:59       ` Duy Nguyen
@ 2018-06-19 15:24         ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-19 15:24 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/19/2018 10:59 AM, Duy Nguyen wrote:
> On Tue, Jun 19, 2018 at 2:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 6/12/2018 11:00 AM, Duy Nguyen wrote:
>>> On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>>>> diff --git a/midx.c b/midx.c
>>>> index 616af66b13..3e55422a21 100644
>>>> --- a/midx.c
>>>> +++ b/midx.c
>>>> @@ -1,9 +1,62 @@
>>>>    #include "git-compat-util.h"
>>>>    #include "cache.h"
>>>>    #include "dir.h"
>>>> +#include "csum-file.h"
>>>> +#include "lockfile.h"
>>>>    #include "midx.h"
>>>>
>>>> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>>>> +#define MIDX_VERSION 1
>>>> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
>>> ...
>>>> +static size_t write_midx_header(struct hashfile *f,
>>>> +                               unsigned char num_chunks,
>>>> +                               uint32_t num_packs)
>>>> +{
>>>> +       char byte_values[4];
>>>> +       hashwrite_be32(f, MIDX_SIGNATURE);
>>>> +       byte_values[0] = MIDX_VERSION;
>>>> +       byte_values[1] = MIDX_HASH_VERSION;
>>> Quoting from "State of NewHash work, future directions, and discussion" [1]
>>>
>>> * If you need to serialize an algorithm identifier into your data
>>>     format, use the format_id field of struct git_hash_algo.  It's
>>>     designed specifically for that purpose.
>>>
>>> [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60
>> Thanks! I'll also use the_hash_algo->rawsz to infer the length of the
>> hash function.
> BTW, since you're the author of commit-graph.c, you may notice it has
> the same problem. Don't touch that code. Brian already has some WIP
> changes [1]. We just make sure new code does not add extra work for
> him. I expect he'll send all those patches out soon.
>
> [1] https://github.com/bk2204/git/commit/3f9031e06cfb21534eb7dfff7b54e7598ac1149f

Thanks for the link. It seems he is creating an oid_version() method
that returns a 1-byte identifier for the hash version instead of the 4-byte
signature of the_hash_algo->format_id. I look forward to incorporating
that into the MIDX format. I'll keep my macros for now, as we work out
the other details, and while Brian's patch is cooking.
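
As a rough illustration of the difference between the two identifiers, the sketch below maps a 4-byte format identifier to a 1-byte on-disk value. Everything here is hypothetical: hash_version_from_format_id() and SKETCH_SHA1_FORMAT_ID are made-up names, the value 0x73686131 is just the ASCII bytes "sha1" and is assumed to be the SHA-1 identifier, and the function is not the oid_version() from the linked work.

```c
#include <stdint.h>

/* "sha1" in ASCII; assumed to be the 4-byte identifier for SHA-1. */
#define SKETCH_SHA1_FORMAT_ID 0x73686131

/*
 * Hypothetical mapping from the 4-byte format identifier to a 1-byte
 * on-disk hash version. Returns 0 for identifiers this sketch does
 * not know about, so callers can reject the file.
 */
static unsigned char hash_version_from_format_id(uint32_t format_id)
{
	switch (format_id) {
	case SKETCH_SHA1_FORMAT_ID:
		return 1; /* same value as MIDX_HASH_VERSION in the patch */
	default:
		return 0;
	}
}
```

A 1-byte value fits the existing header byte, whereas storing the 4-byte identifier directly would change the header layout.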

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 17:54   ` Duy Nguyen
@ 2018-06-20 13:13     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-20 13:13 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 1:54 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> As we build the multi-pack-index feature by adding chunks at a time,
>> we want to test that the data is being written correctly.
>>
>> Create struct midxed_git to store an in-memory representation of a
> A word play on 'packed_git'? Amusing. Some more descriptive name would
> be better though. midxed looks almost like random letters thrown
> together.

I'll use 'struct multi_pack_index'.

>
>> multi-pack-index and a memory-map of the binary file. Initialize this
>> struct in load_midxed_git(object_dir).
>> +static int read_midx_file(const char *object_dir)
>> +{
>> +       struct midxed_git *m = load_midxed_git(object_dir);
>> +
>> +       if (!m)
>> +               return 0;
> This looks like an error case, please don't just return zero,
> typically used to say "success". I don't know if this command stays
> "for debugging purposes" until the end. Of course in that case it does
> not really matter.

It is intended for debugging and testing. Generally, it is not an error 
to not have a MIDX in an object directory.

>> +struct midxed_git *load_midxed_git(const char *object_dir)
>> +{
>> +       struct midxed_git *m;
>> +       int fd;
>> +       struct stat st;
>> +       size_t midx_size;
>> +       void *midx_map;
>> +       const char *midx_name = get_midx_filename(object_dir);
> mem leak? This function returns allocated memory if I remember correctly.
>
>> +
>> +       fd = git_open(midx_name);
>> +       if (fd < 0)
>> +               return NULL;
> do an error_errno() so we know what went wrong at least.
>
>> +       if (fstat(fd, &st)) {
>> +               close(fd);
>> +               return NULL;
> same here, we should know why fstat() fails.
>
>> +       }
>> +       midx_size = xsize_t(st.st_size);
>> +
>> +       if (midx_size < MIDX_MIN_SIZE) {
>> +               close(fd);
>> +               die("multi-pack-index file %s is too small", midx_name);
> _()
>
> The use of die() should be discouraged though. Many people still try
> (or wish) to libify code and new die() does not help. I think error()
> here would be enough then you can return NULL. Or you can go fancier
> and store the error string in a strbuf like refs code.
>
>> +       }
>> +
>> +       midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
>> +
>> +       m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
>> +       strcpy(m->object_dir, object_dir);
>> +       m->data = midx_map;
>> +
>> +       m->signature = get_be32(m->data);
>> +       if (m->signature != MIDX_SIGNATURE) {
>> +               error("multi-pack-index signature %X does not match signature %X",
>> +                     m->signature, MIDX_SIGNATURE);
> _(). Maybe 0x%08x instead of %x
>
>> +               goto cleanup_fail;
>> +       }
>> +
>> +       m->version = *(m->data + 4);
> m->data[4] instead? shorter and easier to understand.
>
> Same comment on "*(m->data + x)" and error() without _() for the rest.
>
>> +       if (m->version != MIDX_VERSION) {
>> +               error("multi-pack-index version %d not recognized",
>> +                     m->version);
> _()
>> +               goto cleanup_fail;
>> +       }
>> +
>> +       m->hash_version = *(m->data + 5);
> m->data[5]
>
>> +cleanup_fail:
>> +       FREE_AND_NULL(m);
>> +       munmap(midx_map, midx_size);
>> +       close(fd);
>> +       exit(1);
> It's bad enough that you die() but exit() in this code seems too much.
> Please just return NULL and let the caller handle the error.

Will do.
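
A minimal sketch of the shape being asked for, assuming only the header fields visible in the quoted hunk (signature at bytes 0-3, version at byte 4, hash version at byte 5). parse_midx_header() and struct midx_header_sketch are hypothetical names, and plain fprintf() stands in for Git's error() and _() so the example is self-contained. The point is that every failure path reports and returns, leaving the decision to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1

struct midx_header_sketch {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
};

/*
 * Hypothetical parser for the first six header bytes. Returns 0 on
 * success and -1 on failure after printing a diagnostic; it never
 * calls exit(), so the caller decides how to react.
 */
static int parse_midx_header(const unsigned char *data, size_t len,
			     struct midx_header_sketch *out)
{
	if (len < 6) {
		fprintf(stderr, "error: multi-pack-index is too small\n");
		return -1;
	}

	out->signature = ((uint32_t)data[0] << 24) |
			 ((uint32_t)data[1] << 16) |
			 ((uint32_t)data[2] << 8) | (uint32_t)data[3];
	if (out->signature != MIDX_SIGNATURE) {
		fprintf(stderr, "error: multi-pack-index signature 0x%08x "
			"does not match signature 0x%08x\n",
			(unsigned)out->signature, (unsigned)MIDX_SIGNATURE);
		return -1;
	}

	out->version = data[4];
	if (out->version != MIDX_VERSION) {
		fprintf(stderr, "error: multi-pack-index version %d "
			"not recognized\n", out->version);
		return -1;
	}

	out->hash_version = data[5];
	return 0;
}
```

A debugging caller could treat -1 as fatal, while a lookup path could simply fall back to the IDX files.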

>
>> diff --git a/midx.h b/midx.h
>> index 3a63673952..a1d18ed991 100644
>> --- a/midx.h
>> +++ b/midx.h
>> @@ -1,4 +1,13 @@
>> +#ifndef MIDX_H
>> +#define MIDX_H
>> +
>> +#include "git-compat-util.h"
>>   #include "cache.h"
>> +#include "object-store.h"
> I don't really think you need object-store here (git-compat-util.h
> too). "struct midxed_git;" would be enough for load_midxed_git
> declaration below.
>
>>   #include "packfile.h"
>>
>> +struct midxed_git *load_midxed_git(const char *object_dir);
>> +
>>   int write_midx_file(const char *object_dir);
>> +
>> +#endif
>> diff --git a/object-store.h b/object-store.h
>> index d683112fd7..77cb82621a 100644
>> --- a/object-store.h
>> +++ b/object-store.h
>> @@ -84,6 +84,25 @@ struct packed_git {
>>          char pack_name[FLEX_ARRAY]; /* more */
>>   };
>>
>> +struct midxed_git {
>> +       struct midxed_git *next;
> Do we really have multiple midx files?

There is one per object directory currently, but you may have one 
locally and one in each of your alternates. I do need to double-check 
that we populate this list later in the series. (And I'll remove it from 
this commit and save it for when it is needed.)

>
>> +
>> +       int fd;
>> +
>> +       const unsigned char *data;
>> +       size_t data_len;
>> +
>> +       uint32_t signature;
>> +       unsigned char version;
>> +       unsigned char hash_version;
>> +       unsigned char hash_len;
>> +       unsigned char num_chunks;
>> +       uint32_t num_packs;
>> +       uint32_t num_objects;
>> +
>> +       char object_dir[FLEX_ARRAY];
> Why do you need to keep object_dir when it could be easily retrieved
> when the repo is available?
>
>> +};
>> +
>>   struct raw_object_store {
>>          /*
>>           * Path to the repository's object store.


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 18:31   ` Duy Nguyen
@ 2018-06-20 13:33     ` Derrick Stolee
  2018-06-20 15:07       ` Duy Nguyen
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-20 13:33 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 2:31 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
>> index dcaeb1a91b..919283fdd8 100644
>> --- a/Documentation/git-midx.txt
>> +++ b/Documentation/git-midx.txt
>> @@ -23,6 +23,11 @@ OPTIONS
>>          <dir>/packs/multi-pack-index for the current MIDX file, and
>>          <dir>/packs for the pack-files to index.
>>
>> +read::
>> +       When given as the verb, read the current MIDX file and output
>> +       basic information about its contents. Used for debugging
>> +       purposes only.
> On second thought. If you just need a temporary debugging interface,
> adding a program in t/helper may be a better option. In the end we
> might still need 'read' to dump a file out, but we should have some
> stable output format (and json might be a good choice).

My intention with this 'read' pattern in the MIDX (and commit-graph) is 
two-fold:

1. We can test that we are writing the correct data in our test suite. A 
test-tool builtin would suffice for this purpose.

2. We can help troubleshoot for users who may be having trouble with their
MIDX files. Having the subcommand in a plumbing command allows us to do
this in the shipped versions of Git.

Maybe this second purpose isn't enough to justify the feature in Git and 
we should move this to the test-tool, especially with the 'verify' mode 
coming in a second series. Note that a 'verify' mode doesn't satisfy 
item (1).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-20 13:33     ` Derrick Stolee
@ 2018-06-20 15:07       ` Duy Nguyen
  2018-06-20 16:39         ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Duy Nguyen @ 2018-06-20 15:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Wed, Jun 20, 2018 at 3:33 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/7/2018 2:31 PM, Duy Nguyen wrote:
> > On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> >> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> >> index dcaeb1a91b..919283fdd8 100644
> >> --- a/Documentation/git-midx.txt
> >> +++ b/Documentation/git-midx.txt
> >> @@ -23,6 +23,11 @@ OPTIONS
> >>          <dir>/packs/multi-pack-index for the current MIDX file, and
> >>          <dir>/packs for the pack-files to index.
> >>
> >> +read::
> >> +       When given as the verb, read the current MIDX file and output
> >> +       basic information about its contents. Used for debugging
> >> +       purposes only.
> > On second thought. If you just need a temporary debugging interface,
> > adding a program in t/helper may be a better option. In the end we
> > might still need 'read' to dump a file out, but we should have some
> > stable output format (and json might be a good choice).
>
> My intention with this 'read' pattern in the MIDX (and commit-graph) is
> two-fold:
>
> 1. We can test that we are writing the correct data in our test suite. A
> test-tool builtin would suffice for this purpose.
>
> 2. We can help troubleshoot for users who may be having trouble with their
> MIDX files. Having the subcommand in a plumbing command allows us to do
> this in the shipped versions of Git.
>
> Maybe this second purpose isn't enough to justify the feature in Git and
> we should move this to the test-tool, especially with the 'verify' mode
> coming in a second series. Note that a 'verify' mode doesn't satisfy
> item (1).

Yeah I think normally we just have some "fsck" thing to verify when
things go bad. If you need more than that I think you just ask the
user to send the .midx to you (with full understanding of potentially
revealing confidential info and stuff). It'll be faster than
instructing them to "run this command", "ok, run another command"....
I thought of suggesting a command to dump the midx file in readable
form (like json), but I think if fsck fails then chances of that
command successfully dumping may be very low.

Either way, if the command is meant for troubleshooting, I think it
should be added at the end when the whole midx file is implemented and
understood and we see what we need to troubleshoot. Adding small
pieces of changes from patch to patch makes it really hard to see if
it helps troubleshooting at all; it only serves the first purpose.
-- 
Duy

^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH] packfile: generalize pack directory list
  2018-06-07 18:03   ` Duy Nguyen
@ 2018-06-20 16:33     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-20 16:33 UTC (permalink / raw)
  To: git; +Cc: pclouds, Derrick Stolee

In anticipation of sharing the pack directory listing with the
multi-pack-index, generalize prepare_packed_git_one() into
for_each_file_in_pack_dir().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---

Duy,

I think this is what you mean by sharing code between packfile.c and
midx.c for reading the packfiles from a pack directory. This does make
the code in midx.c much simpler. Is this change worth it?

This patch could stand on its own, or can be incorporated into the next
version of the MIDX series.
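
For anyone skimming the diff, the pattern is a plain directory walk that hands each entry to a caller-supplied callback. The following stand-alone sketch shows the same shape outside of Git. It is not the code from the patch: for_each_file_in_dir(), each_file_fn and count_idx() are hypothetical names, and it uses a fixed-size buffer where the patch uses a strbuf.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Callback type: receives the full path, the bare file name, and caller data. */
typedef void (*each_file_fn)(const char *full_path, const char *file_name,
			     void *data);

/*
 * Hypothetical simplified iterator: call fn once for every entry in
 * dir_path except "." and "..". Returns 0 on success, -1 if the
 * directory cannot be opened.
 */
static int for_each_file_in_dir(const char *dir_path, each_file_fn fn,
				void *data)
{
	char path[4096];
	struct dirent *de;
	DIR *dir = opendir(dir_path);

	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		int n;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", dir_path, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue; /* skip names that do not fit in the buffer */
		fn(path, de->d_name, data);
	}
	closedir(dir);
	return 0;
}

/* Example callback: count entries whose name ends in ".idx". */
static void count_idx(const char *full_path, const char *file_name, void *data)
{
	size_t len = strlen(file_name);

	(void)full_path;
	if (len >= 4 && !strcmp(file_name + len - 4, ".idx"))
		(*(int *)data)++;
}
```

A second caller, such as the MIDX writer, would only need its own callback and data struct; the directory handling stays in one place.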

Thanks,
-Stolee

 packfile.c | 103 +++++++++++++++++++++++++++++++++--------------------
 packfile.h |   6 ++++
 2 files changed, 71 insertions(+), 38 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7cd45aa4b2..db61c8813b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -738,13 +738,14 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
-static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data)
 {
 	struct strbuf path = STRBUF_INIT;
 	size_t dirnamelen;
 	DIR *dir;
 	struct dirent *de;
-	struct string_list garbage = STRING_LIST_INIT_DUP;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -759,53 +760,79 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	strbuf_addch(&path, '/');
 	dirnamelen = path.len;
 	while ((de = readdir(dir)) != NULL) {
-		struct packed_git *p;
-		size_t base_len;
-
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		strbuf_setlen(&path, dirnamelen);
 		strbuf_addstr(&path, de->d_name);
 
-		base_len = path.len;
-		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
-			/* Don't reopen a pack we already have. */
-			for (p = r->objects->packed_git; p;
-			     p = p->next) {
-				size_t len;
-				if (strip_suffix(p->pack_name, ".pack", &len) &&
-				    len == base_len &&
-				    !memcmp(p->pack_name, path.buf, len))
-					break;
-			}
-			if (p == NULL &&
-			    /*
-			     * See if it really is a valid .idx file with
-			     * corresponding .pack file that we can map.
-			     */
-			    (p = add_packed_git(path.buf, path.len, local)) != NULL)
-				install_packed_git(r, p);
-		}
-
-		if (!report_garbage)
-			continue;
-
-		if (ends_with(de->d_name, ".idx") ||
-		    ends_with(de->d_name, ".pack") ||
-		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep") ||
-		    ends_with(de->d_name, ".promisor"))
-			string_list_append(&garbage, path.buf);
-		else
-			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
+		fn(path.buf, path.len, de->d_name, data);
 	}
+
 	closedir(dir);
-	report_pack_garbage(&garbage);
-	string_list_clear(&garbage, 0);
 	strbuf_release(&path);
 }
 
+struct prepare_pack_data
+{
+	struct repository *r;
+	struct string_list *garbage;
+	int local;
+};
+
+static void prepare_pack(const char *full_name, size_t full_name_len, const char *file_name, void *_data)
+{
+	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
+	struct packed_git *p;
+	size_t base_len = full_name_len;
+
+	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		/* Don't reopen a pack we already have. */
+		for (p = data->r->objects->packed_git; p; p = p->next) {
+			size_t len;
+			if (strip_suffix(p->pack_name, ".pack", &len) &&
+			    len == base_len &&
+			    !memcmp(p->pack_name, full_name, len))
+				break;
+		}
+
+		if (p == NULL &&
+		    /*
+		     * See if it really is a valid .idx file with
+		     * corresponding .pack file that we can map.
+		     */
+		    (p = add_packed_git(full_name, full_name_len, data->local)) != NULL)
+			install_packed_git(data->r, p);
+	}
+
+	if (!report_garbage)
+		return;
+
+	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".pack") ||
+	    ends_with(file_name, ".bitmap") ||
+	    ends_with(file_name, ".keep") ||
+	    ends_with(file_name, ".promisor"))
+		string_list_append(data->garbage, full_name);
+	else
+		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
+}
+
+static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+{
+	struct prepare_pack_data data;
+	struct string_list garbage = STRING_LIST_INIT_DUP;
+
+	data.r = r;
+	data.garbage = &garbage;
+	data.local = local;
+
+	for_each_file_in_pack_dir(objdir, prepare_pack, &data);
+
+	report_pack_garbage(data.garbage);
+	string_list_clear(data.garbage, 0);
+}
+
 static void prepare_packed_git(struct repository *r);
 /*
  * Give a fast, rough count of the number of objects in the repository. This
diff --git a/packfile.h b/packfile.h
index e0a38aba93..d2ad30300a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -28,6 +28,12 @@ extern char *sha1_pack_index_name(const unsigned char *sha1);
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
+typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
+				      const char *file_name, void *data);
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data);
+
 /* A hook to report invalid files in pack directory */
 #define PACKDIR_FILE_PACK 1
 #define PACKDIR_FILE_IDX 2
-- 
2.18.0.rc1



* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-20 15:07       ` Duy Nguyen
@ 2018-06-20 16:39         ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-20 16:39 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/20/2018 11:07 AM, Duy Nguyen wrote:
> On Wed, Jun 20, 2018 at 3:33 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 6/7/2018 2:31 PM, Duy Nguyen wrote:
>>> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>>>> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
>>>> index dcaeb1a91b..919283fdd8 100644
>>>> --- a/Documentation/git-midx.txt
>>>> +++ b/Documentation/git-midx.txt
>>>> @@ -23,6 +23,11 @@ OPTIONS
>>>>           <dir>/packs/multi-pack-index for the current MIDX file, and
>>>>           <dir>/packs for the pack-files to index.
>>>>
>>>> +read::
>>>> +       When given as the verb, read the current MIDX file and output
>>>> +       basic information about its contents. Used for debugging
>>>> +       purposes only.
>>> On second thought. If you just need a temporary debugging interface,
>>> adding a program in t/helper may be a better option. In the end we
>>> might still need 'read' to dump a file out, but we should have some
>>> stable output format (and json might be a good choice).
>> My intention with this 'read' pattern in the MIDX (and commit-graph) is
>> two-fold:
>>
>> 1. We can test that we are writing the correct data in our test suite. A
>> test-tool builtin would suffice for this purpose.
>>
>> 2. We can help trouble-shoot users who may be having trouble with their
>> MIDX files. Having the subcommand in a plumbing command allows us to do
>> this in the shipped versions of Git.
>>
>> Maybe this second purpose isn't enough to justify the feature in Git and
>> we should move this to the test-tool, especially with the 'verify' mode
>> coming in a second series. Note that a 'verify' mode doesn't satisfy
>> item (1).
> Yeah I think normally we just have some "fsck" thing to verify when
> things go bad. If you need more than that I think you just ask the
> user to send the .midx to you (with full understanding of potentially
> revealing confidential info and stuff). It'll be faster than
> instructing them to "run this command", "ok, run another command"....
> I thought of suggesting a command to dump the midx file in readable
> form (like json), but I think if fsck fails then chances of that
> command successfully dumping may be very low.
>
> Either way, if the command is meant for troubleshooting, I think it
> should be added at the end when the whole midx file is implemented and
> understood and we see what we need to troubleshoot. Adding small
> pieces of changes from patch to patch makes it really hard to see if
> it helps troubleshooting at all, it just helps the first purpose.

I'll abandon point (2) for the later 'verify' patch series. Adding the 
test helper early allows tests to demonstrate that each patch does the 
right thing, and that we don't miss testing something.

Thanks,
-Stolee


* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-07 18:26   ` Duy Nguyen
@ 2018-06-21 15:25     ` Derrick Stolee
  2018-06-21 17:38       ` Junio C Hamano
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-21 15:25 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 2:26 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>          m->num_chunks = *(m->data + 6);
>>          m->num_packs = get_be32(m->data + 8);
>>
>> +       for (i = 0; i < m->num_chunks; i++) {
>> +               uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
>> +               uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
> Would be good to reduce magic numbers like 12 and 16, I think you have
> some header length constants for those already.
>
>> +               switch (chunk_id) {
>> +                       case MIDX_CHUNKID_PACKNAMES:
>> +                               m->chunk_pack_names = m->data + chunk_offset;
>> +                               break;
>> +
>> +                       case 0:
>> +                               die("terminating MIDX chunk id appears earlier than expected");
> _()

This die() and others like it are not marked for translation on purpose, 
as they should never be seen by an end-user.

>
>> +                               break;
>> +
>> +                       default:
>> +                               /*
>> +                                * Do nothing on unrecognized chunks, allowing future
>> +                                * extensions to add optional chunks.
>> +                                */
> I wrote about the chunk term reminding me of PNG format then deleted
> it. But it may help to do similar to PNG here. The first letter can
> let us know if the chunk is optional and can be safely ignored. E.g.
> uppercase first letter cannot be ignored, lowercase go wild.

That's an interesting way to think about it. That way you could add a 
new "required" chunk and earlier versions could die() realizing they 
don't know how to parse that required chunk.

I think for this format, we should update the file version value when a 
required chunk is needed.
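
For comparison, the PNG-style convention described above could be
sketched roughly as follows. This is only an illustration, not part of
the series: chunk_id_is_required() is a hypothetical helper, and it
assumes a chunk id is four ASCII bytes read as a big-endian uint32_t,
so the "first letter" is the most significant byte.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not in the patch series: treat a chunk as
 * required when the first byte of its four-byte id is an uppercase
 * ASCII letter, and as optional otherwise (the PNG-like rule).
 */
static int chunk_id_is_required(uint32_t chunk_id)
{
	/* The id is read big-endian, so the first letter is the top byte. */
	unsigned char first = (unsigned char)(chunk_id >> 24);

	return first >= 'A' && first <= 'Z';
}
```

With a rule like this, the default arm of the switch could die() on an
unknown id for which the helper returns nonzero and silently skip the
rest, instead of relying on a version bump.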

>
>> +                               break;
>> +               }
>> +       }
>> +
>> +       if (!m->chunk_pack_names)
>> +               die("MIDX missing required pack-name chunk");
> _()
>
>> +
>>          return m;
>>
>>   cleanup_fail:
>> @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
>>          return MIDX_HEADER_SIZE;
>>   }
>>
>> +struct pack_pair {
>> +       uint32_t pack_int_id;
> can this be just pack_id?

Since packfiles are usually named pack-{hash}.pack, I chose to be 
specific here.

>
>> +       char *pack_name;
>> +};
>> +
>> +static int pack_pair_compare(const void *_a, const void *_b)
>> +{
>> +       struct pack_pair *a = (struct pack_pair *)_a;
>> +       struct pack_pair *b = (struct pack_pair *)_b;
>> +       return strcmp(a->pack_name, b->pack_name);
>> +}
>> +
>> +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
>> +{
>> +       uint32_t i;
>> +       struct pack_pair *pairs;
>> +
>> +       ALLOC_ARRAY(pairs, nr_packs);
>> +
>> +       for (i = 0; i < nr_packs; i++) {
>> +               pairs[i].pack_int_id = i;
>> +               pairs[i].pack_name = pack_names[i];
>> +       }
>> +
>> +       QSORT(pairs, nr_packs, pack_pair_compare);
>> +
>> +       for (i = 0; i < nr_packs; i++) {
>> +               pack_names[i] = pairs[i].pack_name;
>> +               perm[pairs[i].pack_int_id] = i;
>> +       }
> pairs[] is leaked?

Good catch!
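
A minimal sketch of the fix, written against plain libc so it stands
alone (the patch itself uses ALLOC_ARRAY and QSORT; malloc(), qsort()
and the int return value here are substitutions for illustration only):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pack_pair {
	uint32_t pack_int_id;
	char *pack_name;
};

static int pack_pair_compare(const void *_a, const void *_b)
{
	const struct pack_pair *a = _a;
	const struct pack_pair *b = _b;
	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort pack_names in place and record in perm[] the new position of
 * each original index. Returns 0 on success, -1 if allocation fails.
 */
static int sort_packs_by_name(char **pack_names, uint32_t nr_packs,
			      uint32_t *perm)
{
	uint32_t i;
	struct pack_pair *pairs;

	if (!nr_packs)
		return 0;

	pairs = malloc(nr_packs * sizeof(*pairs));
	if (!pairs)
		return -1;

	for (i = 0; i < nr_packs; i++) {
		pairs[i].pack_int_id = i;
		pairs[i].pack_name = pack_names[i];
	}

	qsort(pairs, nr_packs, sizeof(*pairs), pack_pair_compare);

	for (i = 0; i < nr_packs; i++) {
		pack_names[i] = pairs[i].pack_name;
		perm[pairs[i].pack_int_id] = i;
	}

	/* This is the call that was missing: release the scratch array. */
	free(pairs);
	return 0;
}
```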

>
>> +}
>> +
>> +static size_t write_midx_pack_names(struct hashfile *f,
>> +                                   char **pack_names,
>> +                                   uint32_t num_packs)
>> +{
>> +       uint32_t i;
>> +       unsigned char padding[MIDX_CHUNK_ALIGNMENT];
>> +       size_t written = 0;
>> +
>> +       for (i = 0; i < num_packs; i++) {
>> +               size_t writelen = strlen(pack_names[i]) + 1;
>> +
>> +               if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
>> +                       BUG("incorrect pack-file order: %s before %s",
>> +                           pack_names[i - 1],
>> +                           pack_names[i]);
>> +
>> +               hashwrite(f, pack_names[i], writelen);
>> +               written += writelen;
> side note. This pattern happens a lot. It may be a good idea to make
> hashwrite() return writelen so we can just write
>
> written += hashwrite(f, ..., writelen);

If I change the prototype of hashwrite(), will other callers get 
warnings about not inspecting the return value (for some build options 
on some platforms)?

>
>> +       }
>> +
>> +       /* add padding to be aligned */
>> +       i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
>> +       if (i < MIDX_CHUNK_ALIGNMENT) {
>> +               bzero(padding, sizeof(padding));
>> +               hashwrite(f, padding, i);
>> +               written += i;
>> +       }
>> +
>> +       return written;
>> +}
>> +
>>   int write_midx_file(const char *object_dir)
>>   {
>> -       unsigned char num_chunks = 0;
>> +       unsigned char cur_chunk, num_chunks = 0;
>>          char *midx_name;
>>          struct hashfile *f;
>>          struct lock_file lk;
>>          struct packed_git **packs = NULL;
>> +       char **pack_names = NULL;
>> +       uint32_t *pack_perm;
>>          uint32_t i, nr_packs = 0, alloc_packs = 0;
>> +       uint32_t alloc_pack_names = 0;
>>          DIR *dir;
>>          struct dirent *de;
>>          struct strbuf pack_dir = STRBUF_INIT;
>>          size_t pack_dir_len;
>> +       uint64_t pack_name_concat_len = 0;
>> +       uint64_t written = 0;
>> +       uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>> +       uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> This long list of local vars may be a good indicator that this
> function needs split up into smaller ones.

Or the data needs to be grouped into structs, which is happening in my 
local branch due to the for_each_file_in_pack_dir() method.

>>          midx_name = get_midx_filename(object_dir);
>>          if (safe_create_leading_directories(midx_name)) {
>> @@ -132,12 +233,14 @@ int write_midx_file(const char *object_dir)
>>          strbuf_addch(&pack_dir, '/');
>>          pack_dir_len = pack_dir.len;
>>          ALLOC_ARRAY(packs, alloc_packs);
>> +       ALLOC_ARRAY(pack_names, alloc_pack_names);
>>          while ((de = readdir(dir)) != NULL) {
>>                  if (is_dot_or_dotdot(de->d_name))
>>                          continue;
>>
>>                  if (ends_with(de->d_name, ".idx")) {
>>                          ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
>> +                       ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
>>
>>                          strbuf_setlen(&pack_dir, pack_dir_len);
>>                          strbuf_addstr(&pack_dir, de->d_name);
>> @@ -145,21 +248,83 @@ int write_midx_file(const char *object_dir)
>>                          packs[nr_packs] = add_packed_git(pack_dir.buf,
>>                                                           pack_dir.len,
>>                                                           0);
>> -                       if (!packs[nr_packs])
>> +                       if (!packs[nr_packs]) {
>>                                  warning("failed to add packfile '%s'",
>>                                          pack_dir.buf);
>> -                       else
>> -                               nr_packs++;
>> +                               continue;
>> +                       }
>> +
>> +                       pack_names[nr_packs] = xstrdup(de->d_name);
>> +                       pack_name_concat_len += strlen(de->d_name) + 1;
>> +                       nr_packs++;
>>                  }
>>          }
>> +
>>          closedir(dir);
>>          strbuf_release(&pack_dir);
>>
>> +       if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
>> +               pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
>> +                                       (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
>> +
>> +       ALLOC_ARRAY(pack_perm, nr_packs);
>> +       sort_packs_by_name(pack_names, nr_packs, pack_perm);
>> +
>>          hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>>          f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>>          FREE_AND_NULL(midx_name);
>>
>> -       write_midx_header(f, num_chunks, nr_packs);
>> +       cur_chunk = 0;
>> +       num_chunks = 1;
>> +
>> +       written = write_midx_header(f, num_chunks, nr_packs);
>> +
>> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
>> +       chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
>> +
>> +       cur_chunk++;
>> +       chunk_ids[cur_chunk] = 0;
>> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>> +
>> +       for (i = 0; i <= num_chunks; i++) {
>> +               if (i && chunk_offsets[i] < chunk_offsets[i - 1])
>> +                       BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
>> +                           chunk_offsets[i - 1],
>> +                           chunk_offsets[i]);
>> +
>> +               if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
>> +                       BUG("chunk offset %"PRIu64" is not properly aligned",
>> +                           chunk_offsets[i]);
>> +
>> +               hashwrite_be32(f, chunk_ids[i]);
>> +               hashwrite_be32(f, chunk_offsets[i] >> 32);
>> +               hashwrite_be32(f, chunk_offsets[i]);
>> +
>> +               written += MIDX_CHUNKLOOKUP_WIDTH;
>> +       }
>> +
>> +       for (i = 0; i < num_chunks; i++) {
>> +               if (written != chunk_offsets[i])
>> +                       BUG("inccrrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
> incorrect
>
>> +                           chunk_offsets[i],
>> +                           written,
>> +                           chunk_ids[i]);
>> +
>> +               switch (chunk_ids[i]) {
>> +                       case MIDX_CHUNKID_PACKNAMES:
>> +                               written += write_midx_pack_names(f, pack_names, nr_packs);
>> +                               break;
>> +
>> +                       default:
>> +                               BUG("trying to write unknown chunk id %"PRIx32,
>> +                                   chunk_ids[i]);
>> +               }
>> +       }
>> +
>> +       if (written != chunk_offsets[num_chunks])
>> +               BUG("incorrect final offset %"PRIu64" != %"PRIu64,
>> +                   written,
>> +                   chunk_offsets[num_chunks]);
>>
>>          finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>>          commit_lock_file(&lk);
>> @@ -170,5 +335,6 @@ int write_midx_file(const char *object_dir)
>>          }
>>
>>          FREE_AND_NULL(packs);
>> +       FREE_AND_NULL(pack_names);
> What about the strings in this array? I think they are xstrdup() but I
> didn't spot them being freed.
>
> And maybe just use string_list...

Will do! Thanks.


* Re: [PATCH 10/23] midx: write a lookup into the pack names chunk
  2018-06-09 16:43   ` Duy Nguyen
@ 2018-06-21 17:23     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-21 17:23 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick



On 6/9/2018 12:43 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/technical/pack-format.txt |  5 +++
>>   builtin/midx.c                          |  7 ++++
>>   midx.c                                  | 56 +++++++++++++++++++++++--
>>   object-store.h                          |  2 +
>>   t/t5319-midx.sh                         | 11 +++--
>>   5 files changed, 75 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
>> index 2b37be7b33..29bf87283a 100644
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -296,6 +296,11 @@ CHUNK LOOKUP:
>>
>>   CHUNK DATA:
>>
>> +       Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
>> +           P * 4 bytes storing the offset in the packfile name chunk for
>> +           the null-terminated string containing the filename for the
>> +           ith packfile.
>> +
> Commit message is too light on this one. Why does this need to be
> stored? Isn't the cost of rebuilding this in-core cheap?
>
> Adding this chunk on disk in my opinion only adds more burden. Now you
> have to verify that these offsets actually point to the right place.
This is a very good point. I'll drop the chunk and just read the names 
directly to construct the array of strings.
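
Roughly, reading the names directly could look like the sketch below.
It is only an outline under assumptions: parse_pack_names() is a
hypothetical name, the chunk is taken to be a run of NUL-terminated
strings followed by zero padding, and the caller is assumed to know the
chunk length from the next entry in the chunk lookup.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build an array of num_packs pointers into a
 * chunk of concatenated NUL-terminated names. The pointers alias the
 * chunk, so only the returned array itself must be freed. Returns
 * NULL if the chunk ends before num_packs names were found or if
 * allocation fails.
 */
static const char **parse_pack_names(const unsigned char *chunk,
				     size_t chunk_len,
				     uint32_t num_packs)
{
	const char *cur = (const char *)chunk;
	const char *end = cur + chunk_len;
	const char **names;
	uint32_t i;

	names = malloc((num_packs ? num_packs : 1) * sizeof(*names));
	if (!names)
		return NULL;

	for (i = 0; i < num_packs; i++) {
		/* Never scan past the end of the chunk for the terminator. */
		const char *nul = memchr(cur, '\0', (size_t)(end - cur));

		if (!nul) {
			free(names);
			return NULL;
		}
		names[i] = cur;
		cur = nul + 1;
	}

	return names;
}
```

A real reader would probably also want to check that the names are
strictly increasing, mirroring the BUG() check on the write side.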

Thanks,
-Stolee


* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-21 15:25     ` Derrick Stolee
@ 2018-06-21 17:38       ` Junio C Hamano
  2018-06-22 18:25         ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Junio C Hamano @ 2018-06-21 17:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Duy Nguyen, Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

Derrick Stolee <stolee@gmail.com> writes:

> On 6/7/2018 2:26 PM, Duy Nguyen wrote:
>> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>>> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>>          m->num_chunks = *(m->data + 6);
>>>          m->num_packs = get_be32(m->data + 8);
>>>
>>> +       for (i = 0; i < m->num_chunks; i++) {
>>> +               uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
>>> +               uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
>> Would be good to reduce magic numbers like 12 and 16, I think you have
>> some header length constants for those already.
>>
>>> +               switch (chunk_id) {
>>> +                       case MIDX_CHUNKID_PACKNAMES:
>>> +                               m->chunk_pack_names = m->data + chunk_offset;
>>> +                               break;

(style: aren't these case arms indented one level too deep?)

>>> +                       case 0:
>>> +                               die("terminating MIDX chunk id appears earlier than expected");
>> _()
>
> This die() and others like it are not marked for translation on
> purpose, as they should never be seen by an end-user.

Should never be seen because it indicates a software bug, in which
case this should be BUG() instead of die()?

Or did we just find a file corruption on the filesystem?  If so,
then the error is end-user facing and should tell the user something
that hints what is going on in the language the user understands, I
would guess.

>>> +                       default:
>>> +                               /*
>>> +                                * Do nothing on unrecognized chunks, allowing future
>>> +                                * extensions to add optional chunks.
>>> +                                */
>> I wrote about the chunk term reminding me of PNG format then deleted
>> it. But it may help to do similar to PNG here. The first letter can
>> let us know if the chunk is optional and can be safely ignored. E.g.
>> uppercase first letter cannot be ignored, lowercase go wild.
>
> That's an interesting way to think about it. That way you could add a
> new "required" chunk and earlier versions could die() realizing they
> don't know how to parse that required chunk.

That is how the index extension sections work and may be a good
example to follow.


* Re: [PATCH 11/23] midx: sort and deduplicate objects from packfiles
  2018-06-09 17:07   ` Duy Nguyen
@ 2018-06-21 17:54     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-21 17:54 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 1:07 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>> Before writing a list of objects and their offsets to a multi-pack-index
>> (MIDX), we need to collect the list of objects contained in the
>> packfiles. There may be multiple copies of some objects, so this list
>> must be deduplicated.
> Can you just do merge-sort with a slight modification to ignore duplicates?

Are you proposing we consider a multi-way merge of the existing sorted 
lists of packfiles (skipping duplicates)? In my head, this would work 
this way:

1. Keep an array of positions within each of the pack-indexes for the 
"current lex-least OID not already in my sorted list"

2. Scan the list of P pack-indexes to find the lex-least OID among all 
candidates. Advance the position of that pack-index as we put that OID 
in the list (and advance the position of pack-indexes with duplicates).

This would have O(P * N) performance, where P is the number of packfiles 
and N is the total number of objects. This gets slightly better when 
there are duplicates; in the world where we have P identical lists of n 
objects, then N = n * P and we actually get N steps because we can 
advance the position on a duplicate value and not revisit duplicates. 
However, we do not expect duplicates in this density.

By adding some complexity to the algorithm, we could sort the 
pack-indexes in order of their lex-least OIDs, and update the order as 
we advance -- or rather use a min-heap to have access to the proper 
pack-index. This case is most likely to be valuable when updating a 
large MIDX by adding a list of smaller IDX files (which we expect to not 
be the "best" choice for most of the selections). I'm not sure the 
complexity is worth it (would need to measure!).

By concatenating the lists within the fanout values and sorting, we do 
256 sorts of size ~N/256, giving O(N * log(N/256)) performance. This 
method also has an extra array of size ~N/200 to store the batches, 
resulting in extra copies being pushed around.

You've convinced me that your approach may be better, especially in the 
typical case of adding a small number of packfiles to an existing MIDX 
file. Some work is needed to be sure it is better in general (such as 
reported cases of 5000 packfiles!). I'll leave a note to revisit this 
between v2 and v3.
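
To make the simple O(P * N) variant concrete, here is a rough sketch of
steps 1 and 2. It is deliberately simplified: plain uint32_t keys stand
in for OIDs, struct sorted_list and merge_dedup() are made-up names,
and it does not track which list a value came from, whereas the real
code would need that to prefer the pack with the newest mtime.

```c
#include <stddef.h>
#include <stdint.h>

/* One already-sorted input, standing in for a pack-index. */
struct sorted_list {
	const uint32_t *vals;
	size_t nr;
};

/*
 * Merge nr_lists sorted lists into out[], dropping duplicates.
 * pos[] is caller-provided scratch space with nr_lists slots; out[]
 * must have room for the sum of all list lengths. Returns the number
 * of values written.
 */
static size_t merge_dedup(const struct sorted_list *lists, size_t nr_lists,
			  size_t *pos, uint32_t *out)
{
	size_t i, nr_out = 0;

	for (i = 0; i < nr_lists; i++)
		pos[i] = 0;

	for (;;) {
		int found = 0;
		uint32_t min = 0;

		/* Scan every list for the smallest current candidate. */
		for (i = 0; i < nr_lists; i++) {
			if (pos[i] >= lists[i].nr)
				continue;
			if (!found || lists[i].vals[pos[i]] < min) {
				min = lists[i].vals[pos[i]];
				found = 1;
			}
		}
		if (!found)
			break;

		out[nr_out++] = min;

		/* Advance past this value in every list that holds it. */
		for (i = 0; i < nr_lists; i++)
			while (pos[i] < lists[i].nr &&
			       lists[i].vals[pos[i]] == min)
				pos[i]++;
	}

	return nr_out;
}
```

The min-heap variant would replace the inner scan with a heap keyed on
each list's current value, which is where the extra complexity comes
in.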

>
>> It is possible to artificially get into a state where there are many
>> duplicate copies of objects. That can create high memory pressure if we
>> are to create a list of all objects before de-duplication. To reduce
>> this memory pressure without a significant performance drop,
>> automatically group objects by the first byte of their object id. Use
>> the IDX fanout tables to group the data, copy to a local array, then
>> sort.
>>
>> Copy only the de-duplicated entries. Select the duplicate based on the
>> most-recent modified time of a packfile containing the object.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   midx.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 138 insertions(+)
>>
>> diff --git a/midx.c b/midx.c
>> index 923acda72e..b20d52713c 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -4,6 +4,7 @@
>>   #include "csum-file.h"
>>   #include "lockfile.h"
>>   #include "object-store.h"
>> +#include "packfile.h"
>>   #include "midx.h"
>>
>>   #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>> @@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
>>          }
>>   }
>>
>> +static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
>> +{
>> +       const uint32_t *level1_ofs = p->index_data;
>> +
>> +       if (!level1_ofs) {
>> +               if (open_pack_index(p))
>> +                       return 0;
>> +               level1_ofs = p->index_data;
>> +       }
>> +
>> +       if (p->index_version > 1) {
>> +               level1_ofs += 2;
>> +       }
>> +
>> +       return ntohl(level1_ofs[value]);
>> +}
> Maybe keep this in packfile.c, refactor fanout code in there if
> necessary, keep .idx file format info in that file instead of
> spreading out more.
>
>> +
>> +struct pack_midx_entry {
>> +       struct object_id oid;
>> +       uint32_t pack_int_id;
>> +       time_t pack_mtime;
>> +       uint64_t offset;
>> +};
>> +
>> +static int midx_oid_compare(const void *_a, const void *_b)
>> +{
>> +       struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
>> +       struct pack_midx_entry *b = (struct pack_midx_entry *)_b;
> Try not to lose "const" while typecasting.
>
>> +       int cmp = oidcmp(&a->oid, &b->oid);
>> +
>> +       if (cmp)
>> +               return cmp;
>> +
>> +       if (a->pack_mtime > b->pack_mtime)
>> +               return -1;
>> +       else if (a->pack_mtime < b->pack_mtime)
>> +               return 1;
>> +
>> +       return a->pack_int_id - b->pack_int_id;
>> +}
>> +
>> +static void fill_pack_entry(uint32_t pack_int_id,
>> +                           struct packed_git *p,
>> +                           uint32_t cur_object,
>> +                           struct pack_midx_entry *entry)
>> +{
>> +       if (!nth_packed_object_oid(&entry->oid, p, cur_object))
>> +               die("failed to locate object %d in packfile", cur_object);
> _()
>
>> +
>> +       entry->pack_int_id = pack_int_id;
>> +       entry->pack_mtime = p->mtime;
>> +
>> +       entry->offset = nth_packed_object_offset(p, cur_object);
>> +}
>> +
>> +/*
>> + * It is possible to artificially get into a state where there are many
>> + * duplicate copies of objects. That can create high memory pressure if
>> + * we are to create a list of all objects before de-duplication. To reduce
>> + * this memory pressure without a significant performance drop, automatically
>> + * group objects by the first byte of their object id. Use the IDX fanout
>> + * tables to group the data, copy to a local array, then sort.
>> + *
>> + * Copy only the de-duplicated entries (selected by most-recent modified time
>> + * of a packfile containing the object).
>> + */
>> +static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
>> +                                                 uint32_t *perm,
>> +                                                 uint32_t nr_packs,
>> +                                                 uint32_t *nr_objects)
>> +{
>> +       uint32_t cur_fanout, cur_pack, cur_object;
>> +       uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
>> +       struct pack_midx_entry *entries_by_fanout = NULL;
>> +       struct pack_midx_entry *deduplicated_entries = NULL;
>> +
>> +       for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
>> +               if (open_pack_index(p[cur_pack]))
>> +                       continue;
> Is it a big problem if you fail to open .idx for a certain pack?
> Should we error out and abort instead of continuing on? Later on in
> the second pack loop code when get_fanout return zero (failure), you
> don't seem to catch it and skip the pack.
>
>> +
>> +               total_objects += p[cur_pack]->num_objects;
>> +       }
>> +
>> +       /*
>> +        * As we de-duplicate by fanout value, we expect the fanout
>> +        * slices to be evenly distributed, with some noise. Hence,
>> +        * allocate slightly more than one 256th.
>> +        */
>> +       alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
>> +
>> +       ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
>> +       ALLOC_ARRAY(deduplicated_entries, alloc_objects);
>> +       *nr_objects = 0;
>> +
>> +       for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
>> +               nr_fanout = 0;
> Keep variable scope small, declare nr_fanout here instead of at the
> top of the function.
>
>> +
>> +               for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
>> +                       uint32_t start = 0, end;
>> +
>> +                       if (cur_fanout)
>> +                               start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
>> +                       end = get_pack_fanout(p[cur_pack], cur_fanout);
>> +
>> +                       for (cur_object = start; cur_object < end; cur_object++) {
>> +                               ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
>> +                               fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
>> +                               nr_fanout++;
>> +                       }
>> +               }
>> +
>> +               QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
>> +
>> +               /*
>> +                * The batch is now sorted by OID and then mtime (descending).
>> +                * Take only the first duplicate.
>> +                */
>> +               for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
>> +                       if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
>> +                                                 &entries_by_fanout[cur_object].oid))
>> +                               continue;
>> +
>> +                       ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
>> +                       memcpy(&deduplicated_entries[*nr_objects],
>> +                              &entries_by_fanout[cur_object],
>> +                              sizeof(struct pack_midx_entry));
>> +                       (*nr_objects)++;
>> +               }
>> +       }
>> +
>> +       FREE_AND_NULL(entries_by_fanout);
>> +       return deduplicated_entries;
>> +}
>> +
>>   static size_t write_midx_pack_lookup(struct hashfile *f,
>>                                       char **pack_names,
>>                                       uint32_t nr_packs)
>> @@ -254,6 +389,7 @@ int write_midx_file(const char *object_dir)
>>          uint64_t written = 0;
>>          uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>>          uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
>> +       uint32_t nr_entries;
>>
>>          midx_name = get_midx_filename(object_dir);
>>          if (safe_create_leading_directories(midx_name)) {
>> @@ -312,6 +448,8 @@ int write_midx_file(const char *object_dir)
>>          ALLOC_ARRAY(pack_perm, nr_packs);
>>          sort_packs_by_name(pack_names, nr_packs, pack_perm);
>>
>> +       get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
> Intentional ignoring return value (and temporary leaking as a result)
> should have at least a comment to acknowledge it and save reviewers
> some head scratching. Or even better, just free it now, even if you
> don't use it.
>
>> +
>>          hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>>          f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>>          FREE_AND_NULL(midx_name);
>> --
>> 2.18.0.rc1
>>
>


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 13/23] midx: write object id fanout chunk
  2018-06-09 17:28   ` Duy Nguyen
@ 2018-06-21 19:49     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-21 19:49 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 1:28 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>> @@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>                  die("MIDX missing required pack lookup chunk");
>>          if (!m->chunk_pack_names)
>>                  die("MIDX missing required pack-name chunk");
>> +       if (!m->chunk_oid_fanout)
>> +               die("MIDX missing required OID fanout chunk");
> _()
>
>> @@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
>>          chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
>>
>>          cur_chunk++;
>> -       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
>> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
> Err.. mistake?

Not a mistake, just a side-effect of inserting the fanout before the lookup.

The commits are in this order because we need to construct the list 
before we build the fanout, but it makes sense to have the smaller 
fanout chunk before the lookup chunk.
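
A minimal sketch of the resulting layout, assuming made-up chunk ids
and a simplified table that starts at the pack-name chunk (the names
layout_chunks, CHUNK_* and FANOUT_SIZE are illustrative, not the ones
in midx.c). Each chunk starts where the previous one ends, and the
terminating entry records the end of the last chunk:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk ids, for illustration only. */
enum {
	CHUNK_END = 0,
	CHUNK_PACKNAMES = 1,
	CHUNK_OIDFANOUT = 2,
	CHUNK_OIDLOOKUP = 3
};

#define FANOUT_SIZE (256 * sizeof(uint32_t))
#define HASH_LEN 20

/*
 * Fill ids[] and offsets[] (four slots each). The fanout chunk is
 * placed before the lookup chunk, so the lookup chunk's offset is the
 * fanout offset plus the fixed fanout size.
 */
static void layout_chunks(uint64_t first_offset,
			  uint64_t pack_names_len,
			  uint32_t nr_entries,
			  uint32_t ids[4],
			  uint64_t offsets[4])
{
	int cur = 0;

	ids[cur] = CHUNK_PACKNAMES;
	offsets[cur] = first_offset;

	cur++;
	ids[cur] = CHUNK_OIDFANOUT;
	offsets[cur] = offsets[cur - 1] + pack_names_len;

	cur++;
	ids[cur] = CHUNK_OIDLOOKUP;
	offsets[cur] = offsets[cur - 1] + FANOUT_SIZE;

	cur++;
	ids[cur] = CHUNK_END;
	offsets[cur] = offsets[cur - 1] + (uint64_t)nr_entries * HASH_LEN;
}
```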

>
>>          chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>>
>> +       cur_chunk++;
>> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
> Same here.
>
>> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
>> +
>>          cur_chunk++;
>>          chunk_ids[cur_chunk] = 0;
>>          chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
>>


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 17/23] midx: read objects from multi-pack-index
  2018-06-09 17:56   ` Duy Nguyen
@ 2018-06-21 20:03     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-21 20:03 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 1:56 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 6:55 PM Derrick Stolee <stolee@gmail.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   midx.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>   midx.h         |  2 ++
>>   object-store.h |  1 +
>>   packfile.c     |  8 ++++-
>>   4 files changed, 104 insertions(+), 3 deletions(-)
>>
>> diff --git a/midx.c b/midx.c
>> index 5e9290ca8f..6eca8f1b12 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -3,6 +3,7 @@
>>   #include "dir.h"
>>   #include "csum-file.h"
>>   #include "lockfile.h"
>> +#include "sha1-lookup.h"
>>   #include "object-store.h"
>>   #include "packfile.h"
>>   #include "midx.h"
>> @@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>
>>          m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
>>          strcpy(m->object_dir, object_dir);
>> -       m->data = midx_map;
>> +       m->data = (const unsigned char*)midx_map;
> Hmm? Why is this typecast only needed now? Or is it not really needed at all?
>
>>          m->signature = get_be32(m->data);
>>          if (m->signature != MIDX_SIGNATURE) {
>> @@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>
>>          m->num_objects = ntohl(m->chunk_oid_fanout[255]);
>>
>> -       m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
>> +       m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
>> +
>> +       ALLOC_ARRAY(m->pack_names, m->num_packs);
> Please make this ALLOC_ARRAY change in the patch that adds
> xcalloc(m->num_packs).
>
>>          for (i = 0; i < m->num_packs; i++) {
>>                  if (i) {
>>                          if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
>> @@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>          exit(1);
>>   }
>>
>> +static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
>> +{
>> +       struct strbuf pack_name = STRBUF_INIT;
>> +
>> +       if (pack_int_id >= m->num_packs)
>> +               BUG("bad pack-int-id");
>> +
>> +       if (m->packs[pack_int_id])
>> +               return 0;
>> +
>> +       strbuf_addstr(&pack_name, m->object_dir);
>> +       strbuf_addstr(&pack_name, "/pack/");
>> +       strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
> Just use strbuf_addf()
>
>> +
>> +       m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
>> +       strbuf_release(&pack_name);
>> +       return !m->packs[pack_int_id];
> This is a weird return value convention. Normally we go zero/negative
> or non-zero/zero for success/failure.

We are inconsistent.

* open_pack_index() returns non-zero on error. (This was my reference 
point.)
* bsearch_pack() and find_pack_entry() return non-zero when an entry is 
found.

Since the use is "if (error) die()", similar to open_pack_index(), I'll 
keep the current behavior. To switch would require using 
"!!m->packs[pack_int_id]" here and "if (!prepare_midx_pack()) die()" in 
the consumer.
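
To make the two conventions concrete, here is a rough sketch with
hypothetical stand-ins (struct pack, load_pack and both prepare_*
helpers are invented for illustration; only the conventions
themselves come from the functions named above):

```c
#include <stddef.h>

struct pack {
	int id;
};

/* Hypothetical stand-in for add_packed_git(): NULL means failure. */
static struct pack *load_pack(struct pack *slot, int ok)
{
	return ok ? slot : NULL;
}

/*
 * "Non-zero on error", as open_pack_index() does.
 * Callers write: if (prepare_returns_error(...)) die(...);
 */
static int prepare_returns_error(struct pack **out, struct pack *slot, int ok)
{
	*out = load_pack(slot, ok);
	return !*out;
}

/*
 * "Non-zero on success", as bsearch_pack() does.
 * Callers write: if (!prepare_returns_found(...)) die(...);
 */
static int prepare_returns_found(struct pack **out, struct pack *slot, int ok)
{
	*out = load_pack(slot, ok);
	return !!*out;
}
```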

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-21 17:38       ` Junio C Hamano
@ 2018-06-22 18:25         ` Derrick Stolee
  2018-06-22 18:31           ` Junio C Hamano
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-22 18:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/21/2018 1:38 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> On 6/7/2018 2:26 PM, Duy Nguyen wrote:
>>> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>>>> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>>>>           m->num_chunks = *(m->data + 6);
>>>>           m->num_packs = get_be32(m->data + 8);
>>>>
>>>> +       for (i = 0; i < m->num_chunks; i++) {
>>>> +               uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
>>>> +               uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
>>> Would be good to reduce magic numbers like 12 and 16, I think you have
>>> some header length constants for those already.
>>>
>>>> +               switch (chunk_id) {
>>>> +                       case MIDX_CHUNKID_PACKNAMES:
>>>> +                               m->chunk_pack_names = m->data + chunk_offset;
>>>> +                               break;
> (style: aren't these case arms indented one level too deep)?
>
>>>> +                       case 0:
>>>> +                               die("terminating MIDX chunk id appears earlier than expected");
>>> _()
>> This die() and others like it are not marked for translation on
>> purpose, as they should never be seen by an end-user.
> Should never be seen because it indicates a software bug, in which
> case this should be BUG() instead of die()?
>
> Or did we just find a file corruption on the filesystem?  If so,
> then the error is end-user facing and should tell the user something
> that hints what is going on in the language the user understands, I
> would guess.
>
>>>> +                       default:
>>>> +                               /*
>>>> +                                * Do nothing on unrecognized chunks, allowing future
>>>> +                                * extensions to add optional chunks.
>>>> +                                */
>>> I wrote about the chunk term reminding me of PNG format then deleted
>>> it. But it may help to do similar to PNG here. The first letter can
>>> let us know if the chunk is optional and can be safely ignored. E.g.
>>> uppercase first letter cannot be ignored, lowercase go wild.
>> That's an interesting way to think about it. That way you could add a
>> new "required" chunk and earlier versions could die() realizing they
>> don't know how to parse that required chunk.
> That is how the index extension sections work and may be a good
> example to follow.

The index extension documentation doesn't appear to be clear about which 
extensions are optional or required, but it seems the split-index is the 
only "required" one and uses lowercase for its extension id.

Since the multi-pack-index has similar structure to the commit-graph 
file, and that file includes an optional chunk with no special casing of 
the chunk id, I think we should stick with the existing model: chunks 
that are added later are optional and if Git _must_ understand it, then 
we increment the version number. Hence, for each version number there is 
a fixed list of required chunks, but an extendible list of optional chunks.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-22 18:25         ` Derrick Stolee
@ 2018-06-22 18:31           ` Junio C Hamano
  2018-06-22 18:32             ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Junio C Hamano @ 2018-06-22 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Duy Nguyen, Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

Derrick Stolee <stolee@gmail.com> writes:

> The index extension documentation doesn't appear to be clear about
> which extensions are optional or required, but it seems the
> split-index is the only "required" one and uses lowercase for its
> extension id.

read-cache.c::

    /* Index extensions.
     *
     * The first letter should be 'A'..'Z' for extensions that are not
     * necessary for a correct operation (i.e. optimization data).
     * When new extensions are added that _needs_ to be understood in
     * order to correctly interpret the index file, pick character that
     * is outside the range, to cause the reader to abort.
     */
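
The rule in that comment can be sketched as a small predicate. This
is an illustrative helper, not the code in read-cache.c; the names
unknown_extension_action, EXT_IGNORE and EXT_ABORT are made up here:

```c
enum ext_action {
	EXT_IGNORE,	/* optional data, safe to skip */
	EXT_ABORT	/* must be understood, reader has to stop */
};

/*
 * Decide what a reader does with a 4-byte extension id that it does
 * not recognize: ids starting with 'A'..'Z' are optional, anything
 * else is required.
 */
static enum ext_action unknown_extension_action(const char *ext)
{
	if (ext[0] < 'A' || ext[0] > 'Z')
		return EXT_ABORT;
	return EXT_IGNORE;
}
```

Under this rule an unknown id such as "link" (the split-index
extension) forces the reader to stop, while an unknown id in the
'A'..'Z' range is skipped.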


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-22 18:31           ` Junio C Hamano
@ 2018-06-22 18:32             ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-22 18:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/22/2018 2:31 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> The index extension documentation doesn't appear to be clear about
>> which extensions are optional or required, but it seems the
>> split-index is the only "required" one and uses lowercase for its
>> extension id.
> read-cache.c::
>
>      /* Index extensions.
>       *
>       * The first letter should be 'A'..'Z' for extensions that are not
>       * necessary for a correct operation (i.e. optimization data).
>       * When new extensions are added that _needs_ to be understood in
>       * order to correctly interpret the index file, pick character that
>       * is outside the range, to cause the reader to abort.
>       */

Thanks! I was reading Documentation/technical/index-format.txt and 
optional extensions are mentioned but not described precisely.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 18/23] midx: use midx in abbreviation calculations
  2018-06-09 18:01   ` Duy Nguyen
@ 2018-06-22 18:38     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-22 18:38 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 2:01 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>> @@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
>>
>>   static void find_abbrev_len_packed(struct min_abbrev_data *mad)
>>   {
>> +       struct midxed_git *m;
>>          struct packed_git *p;
>>
>> +       for (m = get_midxed_git(the_repository); m; m = m->next)
>> +               find_abbrev_len_for_midx(m, mad);
> If all the packs are in midx, we don't need to run the second loop
> below, do we? Otherwise I don't see why we waste cycles on finding
> abbrev length on midx at all.

We put all packs _at time of writing_ into the midx. More packs may be 
added later that are not in the midx. There are tests in 
t5319-multi-pack-index.sh that verify everything works in this "mixed mode".

It is important that the packfiles are not loaded into the packed_git 
list if they are managed by the midx.

>
>>          for (p = get_packed_git(the_repository); p; p = p->next)
>>                  find_abbrev_len_for_pack(p, mad);
>>   }


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 20/23] midx: use midx in approximate_object_count
  2018-06-09 18:03   ` Duy Nguyen
@ 2018-06-22 18:39     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-22 18:39 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 2:03 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   packfile.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/packfile.c b/packfile.c
>> index 638e113972..059b2aa097 100644
>> --- a/packfile.c
>> +++ b/packfile.c
>> @@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
>>   {
>>          if (!the_repository->objects->approximate_object_count_valid) {
>>                  unsigned long count;
>> +               struct midxed_git *m;
>>                  struct packed_git *p;
>>
>>                  prepare_packed_git(the_repository);
>>                  count = 0;
>> -               for (p = the_repository->objects->packed_git; p; p = p->next) {
>> +               for (m = get_midxed_git(the_repository); m; m = m->next)
>> +                       count += m->num_objects;
>> +               for (p = get_packed_git(the_repository); p; p = p->next) {
> Please don't change this line, it's not related to this patch.

Sure. I'll revert that line.

>   Same
> concern applies, if we have already counted objects in midx we should
> ignore packs that belong to it or we double count.

Since we do not put packfiles into the packed_git list if they are 
tracked by the midx, we will not double count.
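
As a rough sketch of why the sum is safe, using simplified stand-ins
for struct midxed_git and struct packed_git (the *_sketch types and
approximate_count are illustrative only, and the sketch assumes the
pack list holds only packs that no midx covers):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real structs. */
struct midx_sketch {
	uint32_t num_objects;
	struct midx_sketch *next;
};

struct pack_sketch {
	uint32_t num_objects;
	struct pack_sketch *next;
};

static unsigned long approximate_count(const struct midx_sketch *m,
				       const struct pack_sketch *p)
{
	unsigned long count = 0;

	/* Objects covered by a multi-pack-index. */
	for (; m; m = m->next)
		count += m->num_objects;

	/* Packs on this list are assumed not to be in any midx. */
	for (; p; p = p->next)
		count += p->num_objects;

	return count;
}
```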

>
>>                          if (open_pack_index(p))
>>                                  continue;
>>                          count += p->num_objects;
>> --
>> 2.18.0.rc1
>>
>


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 23/23] midx: clear midx on repack
  2018-06-09 18:13   ` Duy Nguyen
@ 2018-06-22 18:44     ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-22 18:44 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/9/2018 2:13 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>> If a 'git repack' command replaces existing packfiles, then we must
>> clear the existing multi-pack-index before moving the packfiles it
>> references.
> I think there are other places where we add or remove pack files and
> need to reprepare_packed_git(). Any midx invalidation should be part
> of that as well.

The other places where we call reprepare_packed_git() are for when we 
may have added a packfile, such as in fetch-pack.c, or sha1_file.c. The 
other candidate to consider is 'git gc', but the packfile deletion is 
handled by a call to 'git repack'.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   builtin/repack.c | 8 ++++++++
>>   midx.c           | 8 ++++++++
>>   midx.h           | 1 +
>>   3 files changed, 17 insertions(+)
>>
>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 6c636e159e..66a7d8e8ea 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -8,6 +8,7 @@
>>   #include "strbuf.h"
>>   #include "string-list.h"
>>   #include "argv-array.h"
>> +#include "midx.h"
>>
>>   static int delta_base_offset = 1;
>>   static int pack_kept_objects = -1;
>> @@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>          int no_update_server_info = 0;
>>          int quiet = 0;
>>          int local = 0;
>> +       int midx_cleared = 0;
>>
>>          struct option builtin_repack_options[] = {
>>                  OPT_BIT('a', NULL, &pack_everything,
>> @@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>                                  continue;
>>                          }
>>
>> +                       if (!midx_cleared) {
>> +                               /* if we move a packfile, it will invalidate the midx */
> What about removing packs, which also happens in repack? If the
> removed pack is part of midx, then midx becomes invalid as well.
>
>> +                               clear_midx_file(get_object_directory());
>> +                               midx_cleared = 1;
>> +                       }
>> +
>>                          fname_old = mkpathdup("%s/old-%s%s", packdir,
>>                                                  item->string, exts[ext].name);
>>                          if (file_exists(fname_old))
>> diff --git a/midx.c b/midx.c
>> index e46f392fa4..1043c01fa7 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
>>          FREE_AND_NULL(pack_names);
>>          return 0;
>>   }
>> +
>> +void clear_midx_file(const char *object_dir)
> delete_ may be more obvious than clear_
>
>> +{
>> +       char *midx = get_midx_filename(object_dir);
>> +
>> +       if (remove_path(midx))
>> +               die(_("failed to clear multi-pack-index at %s"), midx);
> die_errno()
>
>> +}
>> diff --git a/midx.h b/midx.h
>> index 6996b5ff6b..46f9f44c94 100644
>> --- a/midx.h
>> +++ b/midx.h
>> @@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>>   int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>>
>>   int write_midx_file(const char *object_dir);
>> +void clear_midx_file(const char *object_dir);
>>
>>   #endif
>> --
>> 2.18.0.rc1
>>
>


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 00/24] Multi-pack-index (MIDX)
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (24 preceding siblings ...)
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
@ 2018-06-25 14:34 ` " Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 01/24] multi-pack-index: add design document Derrick Stolee
                     ` (24 more replies)
  25 siblings, 25 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

This v2 series has several significant changes from v1. Thanks for all
the feedback that informed these changes:

* The 'midx' builtin is renamed to 'multi-pack-index'

* The 'core.midx' config setting is renamed to 'core.multiPackIndex'

* Many die() or error() statements are marked for translation

* The packfile name lookup chunk is dropped in favor of dynamic
  calculation on read.

* The 'read' verb in the builtin is moved to a test tool.

And one item that I'm saving for investigation while v2 is under review:

* Consider using a merge sort when constructing/deduplicating the list
  of objects and offsets instead of sorting the objects in batches by
  first byte.
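
  A rough sketch of what such a merge could look like, under heavy
  simplification: object ids are reduced to a uint32_t key, each pack
  contributes one already-sorted run, and the smallest head is found
  by a linear scan over the runs. The names entry, run and merge_dedup
  are hypothetical and nothing here is taken from midx.c:

```c
#include <stddef.h>
#include <stdint.h>

struct entry {
	uint32_t key;		/* stand-in for the object id */
	uint32_t pack_mtime;	/* mtime of the pack holding this copy */
};

struct run {
	const struct entry *items;	/* sorted by key */
	size_t nr;
	size_t pos;			/* next unread item */
};

/*
 * Merge nr_runs sorted runs into out[] and return the number of
 * entries written. For duplicate keys, keep the copy from the pack
 * with the largest mtime. out[] must have room for every input entry.
 */
static size_t merge_dedup(struct run *runs, size_t nr_runs, struct entry *out)
{
	size_t nr_out = 0;

	for (;;) {
		const struct entry *best = NULL;
		size_t i;

		/* Pick the smallest head; ties go to the newer pack. */
		for (i = 0; i < nr_runs; i++) {
			const struct entry *e;

			if (runs[i].pos >= runs[i].nr)
				continue;
			e = &runs[i].items[runs[i].pos];
			if (!best || e->key < best->key ||
			    (e->key == best->key &&
			     e->pack_mtime > best->pack_mtime))
				best = e;
		}
		if (!best)
			break;

		out[nr_out++] = *best;

		/* Drop every remaining copy of the key just emitted. */
		for (i = 0; i < nr_runs; i++)
			while (runs[i].pos < runs[i].nr &&
			       runs[i].items[runs[i].pos].key == out[nr_out - 1].key)
				runs[i].pos++;
	}
	return nr_out;
}
```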

Thanks,
-Stolee

---

The multi-pack-index (MIDX) is explained fully in
the design document 'Documentation/technical/multi-pack-index.txt'.
The short description is that the MIDX stores the
information from all of the IDX files in a pack
directory. The crucial design decision is that the
IDX files still exist, so we can fall back to the IDX
files if there is any issue with the MIDX (or
core.multiPackIndex is set to false, or a user downgrades
Git, etc.)

The MIDX feature has been part of our GVFS releases
for a few months (since the RFC). It has behaved well,
indexing over 31 million commits and trees across up
to 250 packfiles. These MIDX files are nearly 1GB in
size and take ~20 seconds to rewrite when adding new
IDX information. This ~20s mark is something I'd like
to improve, and I mention how to make the file
incremental (similar to split-index) in the design
document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time
after both the MIDX and commit-graph are stable.


Lookup Speedups
---------------

When looking for an object, Git uses a most-recently-
used (MRU) cache of packfiles. This does pretty well to
minimize the number of misses when searching through
packfiles for an object, especially if there is one
"big" packfile that contains most of the objects (so it
will rarely miss and is usually one of the first two
packfiles in the list). The MIDX does provide a way
to remove these misses, improving lookup time. However,
this lookup time greatly depends on the arrangement of
the packfiles.

For instance, if you take the Linux repository and repack
using `git repack -adfF --max-pack-size=128m` then all
commits will be in one packfile, all trees will be in
a small set of packfiles and organized well so 'git
rev-list --objects HEAD^{tree}' only inspects one or two
packfiles.

GVFS has the notion of a "prefetch packfile". These are
packfiles that are precomputed by cache servers to
contain the commits and trees introduced to the remote
each day. GVFS downloads these packfiles and places them
in an alternate. Since these are organized by "first
time introduced" and the working directory is so large,
the MRU misses are significant when performing a checkout
and updating the .git/index file.

To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

        Before: 0.17s
        After:  0.13s
        % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.



Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate
through all packfiles to ensure no collision exists in
any packfile. This requires O(P log N) time. With the
MIDX, this is only O(log N) time. Our standard test [2]
is 'git log --oneline --parents --raw' because it writes
many abbreviations while also doing a lot of other work
(walking commits and trees to compute the raw diff).
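
To illustrate why one sorted list is enough, here is a minimal
sketch, assuming distinct hex object ids in a single sorted array
and a position already found by binary search. The helper names
common_prefix and min_unique_abbrev are made up for this example
and are not the functions in sha1-name.c. Only the two neighbours
of the position need to be inspected; without the MIDX the same
check has to be repeated once per packfile.

```c
#include <stddef.h>

/* Length of the common prefix of two NUL-terminated hex strings. */
static size_t common_prefix(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/*
 * Given sorted, distinct hex object ids, return the shortest prefix
 * length that identifies sorted[pos] uniquely within the list.
 */
static size_t min_unique_abbrev(const char **sorted, size_t nr, size_t pos)
{
	size_t len = 0, n;

	if (pos > 0) {
		n = common_prefix(sorted[pos], sorted[pos - 1]);
		if (n > len)
			len = n;
	}
	if (pos + 1 < nr) {
		n = common_prefix(sorted[pos], sorted[pos + 1]);
		if (n > len)
			len = n;
	}
	return len + 1;
}
```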

For a copy of the Linux repository with 50 packfiles
split by time, we observed the following:

        Before: 100.5 s
        After:   58.2 s
        % Diff: -42.1%


Derrick Stolee (24):
  multi-pack-index: add design document
  multi-pack-index: add format details
  multi-pack-index: add builtin
  multi-pack-index: add 'write' verb
  midx: write header information to lockfile
  multi-pack-index: load into memory
  multi-pack-index: expand test data
  packfile: generalize pack directory list
  multi-pack-index: read packfile list
  multi-pack-index: write pack names in chunk
  midx: read pack names into array
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  config: create core.multiPackIndex setting
  midx: prepare midxed_git struct
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  packfile: skip loading index if in multi-pack-index
  midx: clear midx on repack

 .gitignore                                   |   3 +-
 Documentation/config.txt                     |   4 +
 Documentation/git-multi-pack-index.txt       |  56 ++
 Documentation/technical/multi-pack-index.txt | 109 +++
 Documentation/technical/pack-format.txt      |  77 ++
 Makefile                                     |   3 +
 builtin.h                                    |   1 +
 builtin/multi-pack-index.c                   |  46 +
 builtin/repack.c                             |   8 +
 cache.h                                      |   1 +
 command-list.txt                             |   1 +
 config.c                                     |   5 +
 environment.c                                |   1 +
 git.c                                        |   1 +
 midx.c                                       | 900 +++++++++++++++++++
 midx.h                                       |  20 +
 object-store.h                               |  33 +
 packfile.c                                   | 173 +++-
 packfile.h                                   |   9 +
 sha1-name.c                                  |  70 ++
 t/helper/test-read-midx.c                    |  54 ++
 t/helper/test-tool.c                         |   1 +
 t/helper/test-tool.h                         |   1 +
 t/t5319-multi-pack-index.sh                  | 191 ++++
 24 files changed, 1724 insertions(+), 44 deletions(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 Documentation/technical/multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100644 t/helper/test-read-midx.c
 create mode 100755 t/t5319-multi-pack-index.sh


base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 01/24] multi-pack-index: add design document
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 02/24] multi-pack-index: add format details Derrick Stolee
                     ` (23 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 109 +++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.multiPackIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.multiPackIndex to false, or
+  downgrade without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git multi-pack-index' builtin to
+  verify that the contents of the multi-pack-index file match the
+  offsets listed in the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)

base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.24.g1b579a2ee9
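
The lookup described in the design document above (a sorted list of
object IDs, each paired with a packfile index j and an offset) can be
sketched as a plain binary search. This is only an illustration of the
O(log N) claim: the struct midx_view type and the midx_lookup() function
are hypothetical names, not part of the patch, and the layout is a
simplified in-memory view rather than the on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* SHA-1 length, matching the format in this series */

/* Hypothetical in-memory view of a MIDX; not the struct used by Git. */
struct midx_view {
	const unsigned char *oids; /* num_objects * OID_LEN bytes, sorted */
	const uint32_t *pack_ids;  /* pack_ids[i]: index j of the packfile */
	const uint64_t *offsets;   /* offsets[i]: offset inside packfile j */
	uint32_t num_objects;
};

/*
 * Binary search over the sorted object IDs. Returns 1 and fills
 * *pack_id and *offset on a hit, or returns 0 on a miss. The cost is
 * O(log N) in the number of objects, independent of the pack count.
 */
static int midx_lookup(const struct midx_view *m, const unsigned char *oid,
		       uint32_t *pack_id, uint64_t *offset)
{
	uint32_t lo, hi;

	assert(m && oid && pack_id && offset);
	lo = 0;
	hi = m->num_objects;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, m->oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pack_id = m->pack_ids[mid];
			*offset = m->offsets[mid];
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

On a miss, a caller would fall back to any packfiles that are not
registered in the MIDX, as the design details describe.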


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 02/24] multi-pack-index: add format details
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 01/24] multi-pack-index: add design document Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 03/24] multi-pack-index: add builtin Derrick Stolee
                     ` (22 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

The multi-pack-index feature generalizes the existing pack-index
feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to make breaking changes by
advancing the version number. We can also change the hash algorithm
by using a different object ID version value.

We will add the individual chunk format information as we introduce
the code that writes that information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..e060e693f4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== multi-pack-index (MIDX) files have the following format:
+
+The multi-pack-index files refer to multiple pack-files and loose objects.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	1-byte version number:
+	    Git only writes or recognizes version 1.
+
+	1-byte Object Id Version
+	    Git only writes or recognizes version 1 (SHA1).
+
+	1-byte number of "chunks"
+
+	1-byte number of base multi-pack-index files:
+	    This value is currently always zero.
+
+	4-byte number of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	(This section intentionally left incomplete.)
+
+TRAILER:
+
+	20-byte SHA1-checksum of the above contents.
-- 
2.18.0.24.g1b579a2ee9
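
The chunk lookup table described in the patch above can be walked as
in the following sketch. It rests on several assumptions that the
format text does not state outright: the 8-byte offsets are read in
network order like the 4-byte values, and the terminating entry
(id 0) carries the offset where the chunk data ends, so the length of
the last chunk can be inferred the same way as the others. The
find_chunk() function and the helpers are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_LOOKUP_WIDTH 12 /* 4-byte chunk id + 8-byte offset */

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Search a lookup table holding num_chunks entries followed by one
 * terminating entry. On success, return 1 and store the chunk's start
 * offset and its length, inferred from the next entry's offset.
 * Return 0 if the chunk id is not present.
 */
static int find_chunk(const unsigned char *table, unsigned char num_chunks,
		      uint32_t id, uint64_t *offset, uint64_t *length)
{
	unsigned int i;

	assert(table && offset && length);

	for (i = 0; i < num_chunks; i++) {
		const unsigned char *entry = table + (size_t)i * CHUNK_LOOKUP_WIDTH;
		uint32_t chunk_id = read_be32(entry);

		if (!chunk_id)
			break; /* terminating label reached early */
		if (chunk_id == id) {
			*offset = read_be64(entry + 4);
			/* Assumes entries are in file order. */
			*length = read_be64(entry + CHUNK_LOOKUP_WIDTH + 4) - *offset;
			return 1;
		}
	}
	return 0;
}
```

Inferring lengths this way only works if the table is sorted by
offset, which is what the "file-order" note in the format text
suggests.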


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 03/24] multi-pack-index: add builtin
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 01/24] multi-pack-index: add design document Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 02/24] multi-pack-index: add format details Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 19:15     ` Junio C Hamano
  2018-06-25 14:34   ` [PATCH v2 04/24] multi-pack-index: add 'write' verb Derrick Stolee
                     ` (21 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

This new 'git multi-pack-index' builtin will be the plumbing access
for writing, reading, and checking multi-pack-index files. The
initial implementation is a no-op.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                             |  3 +-
 Documentation/git-multi-pack-index.txt | 36 ++++++++++++++++++++++++
 Makefile                               |  1 +
 builtin.h                              |  1 +
 builtin/multi-pack-index.c             | 38 ++++++++++++++++++++++++++
 command-list.txt                       |  1 +
 git.c                                  |  1 +
 7 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..25633bc515 100644
--- a/.gitignore
+++ b/.gitignore
@@ -99,8 +99,9 @@
 /git-mergetool--lib
 /git-mktag
 /git-mktree
-/git-name-rev
+/git-multi-pack-index
 /git-mv
+/git-name-rev
 /git-notes
 /git-p4
 /git-pack-redundant
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
new file mode 100644
index 0000000000..9877f9c441
--- /dev/null
+++ b/Documentation/git-multi-pack-index.txt
@@ -0,0 +1,36 @@
+git-multi-pack-index(1)
+======================
+
+NAME
+----
+git-multi-pack-index - Write and verify multi-pack-indexes
+
+
+SYNOPSIS
+--------
+[verse]
+'git multi-pack-index' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a multi-pack-index (MIDX) file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+	Use given directory for the location of Git objects. We check
+	<dir>/packs/multi-pack-index for the current MIDX file, and
+	<dir>/packs for the pack-files to index.
+
+
+SEE ALSO
+--------
+See link:technical/multi-pack-index.html[The Multi-Pack-Index Design
+Document] and link:technical/pack-format.html[The Multi-Pack-Index
+Format] for more information on the multi-pack-index feature.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index e4b503d259..54610875ec 100644
--- a/Makefile
+++ b/Makefile
@@ -1047,6 +1047,7 @@ BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
+BUILTIN_OBJS += builtin/multi-pack-index.o
 BUILTIN_OBJS += builtin/mv.o
 BUILTIN_OBJS += builtin/name-rev.o
 BUILTIN_OBJS += builtin/notes.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..70997d7ace 100644
--- a/builtin.h
+++ b/builtin.h
@@ -191,6 +191,7 @@ extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
+extern int cmd_multi_pack_index(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
 extern int cmd_name_rev(int argc, const char **argv, const char *prefix);
 extern int cmd_notes(int argc, const char **argv, const char *prefix);
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
new file mode 100644
index 0000000000..f101873525
--- /dev/null
+++ b/builtin/multi-pack-index.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_multi_pack_index_usage[] ={
+	N_("git multi-pack-index [--object-dir <dir>]"),
+	NULL
+};
+
+static struct opts_multi_pack_index {
+	const char *object_dir;
+} opts;
+
+int cmd_multi_pack_index(int argc, const char **argv,
+			 const char *prefix)
+{
+	static struct option builtin_multi_pack_index_options[] = {
+		OPT_FILENAME(0, "object-dir", &opts.object_dir,
+		  N_("The object directory containing set of packfile and pack-index pairs")),
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_multi_pack_index_usage,
+				   builtin_multi_pack_index_options);
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_multi_pack_index_options,
+			     builtin_multi_pack_index_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..61071f8fa2 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators           complete
 git-merge-tree                          ancillaryinterrogators
+git-multi-pack-index                    plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c2f48d53dd..a7509fa5f7 100644
--- a/git.c
+++ b/git.c
@@ -505,6 +505,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
 	{ "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
 	{ "mktree", cmd_mktree, RUN_SETUP },
+	{ "multi-pack-index", cmd_multi_pack_index, RUN_SETUP_GENTLY },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
 	{ "name-rev", cmd_name_rev, RUN_SETUP },
 	{ "notes", cmd_notes, RUN_SETUP },
-- 
2.18.0.24.g1b579a2ee9


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 04/24] multi-pack-index: add 'write' verb
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 03/24] multi-pack-index: add builtin Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 05/24] midx: write header information to lockfile Derrick Stolee
                     ` (20 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

In anticipation of writing multi-pack-indexes, add a
'git multi-pack-index write' subcommand and send the options to a
write_midx_file() method. Also create a basic test file that tests
the 'write' subcommand.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 22 +++++++++++++++++++++-
 Makefile                               |  1 +
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  7 +++++++
 midx.h                                 |  6 ++++++
 t/t5319-multi-pack-index.sh            | 10 ++++++++++
 6 files changed, 54 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-multi-pack-index.sh

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 9877f9c441..c4dc92ddd9 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir <dir>]
+'git multi-pack-index' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	<dir>/packs/multi-pack-index.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-----------------------------------------------
+$ git multi-pack-index write
+-----------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate.
++
+-----------------------------------------------
+$ git multi-pack-index --object-dir <alt> write
+-----------------------------------------------
+
 
 SEE ALSO
 --------
diff --git a/Makefile b/Makefile
index 54610875ec..f5636c711d 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index f101873525..c8f1f19d1f 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -2,9 +2,10 @@
 #include "cache.h"
 #include "config.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] ={
-	N_("git multi-pack-index [--object-dir <dir>]"),
+	N_("git multi-pack-index [--object-dir <dir>] [write]"),
 	NULL
 };
 
@@ -34,5 +35,12 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
+	if (argc == 0)
+		usage_with_options(builtin_multi_pack_index_usage,
+				   builtin_multi_pack_index_options);
+
+	if (!strcmp(argv[0], "write"))
+		return write_midx_file(opts.object_dir);
+
 	return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..32468db1a2
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,7 @@
+#include "cache.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..dbdbe9f873
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,6 @@
+#ifndef __MIDX_H__
+#define __MIDX_H__
+
+int write_midx_file(const char *object_dir);
+
+#endif
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
new file mode 100755
index 0000000000..ec3ddbe79c
--- /dev/null
+++ b/t/t5319-multi-pack-index.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no packs' '
+	git multi-pack-index --object-dir=. write
+'
+
+test_done
-- 
2.18.0.24.g1b579a2ee9


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 05/24] midx: write header information to lockfile
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (3 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 04/24] multi-pack-index: add 'write' verb Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 19:19     ` Junio C Hamano
  2018-06-25 14:34   ` [PATCH v2 06/24] multi-pack-index: load into memory Derrick Stolee
                     ` (19 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. This
lets us add the rest of the format in small increments.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but the return value is not
checked. We will check it in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 49 +++++++++++++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh |  3 ++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 32468db1a2..393d526881 100644
--- a/midx.c
+++ b/midx.c
@@ -1,7 +1,56 @@
 #include "cache.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+	return xstrfmt("%s/pack/multi-pack-index", object_dir);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+				unsigned char num_chunks,
+				uint32_t num_packs)
+{
+	unsigned char byte_values[4];
+	hashwrite_be32(f, MIDX_SIGNATURE);
+	byte_values[0] = MIDX_VERSION;
+	byte_values[1] = MIDX_HASH_VERSION;
+	byte_values[2] = num_chunks;
+	byte_values[3] = 0; /* unused */
+	hashwrite(f, byte_values, sizeof(byte_values));
+	hashwrite_be32(f, num_packs);
+
+	return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+	unsigned char num_chunks = 0;
+	char *midx_name;
+	struct hashfile *f;
+	struct lock_file lk;
+
+	midx_name = get_midx_filename(object_dir);
+	if (safe_create_leading_directories(midx_name)) {
+		UNLEAK(midx_name);
+		die_errno(_("unable to create leading directories of %s"),
+			  midx_name);
+	}
+
+	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	FREE_AND_NULL(midx_name);
+
+	write_midx_header(f, num_chunks, 0);
+
+	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	commit_lock_file(&lk);
+
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ec3ddbe79c..8622a7cdce 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,7 +4,8 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 test_expect_success 'write midx with no packs' '
-	git multi-pack-index --object-dir=. write
+	git multi-pack-index --object-dir=. write &&
+	test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.24.g1b579a2ee9
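
The byte-counting technique from the commit message above (each writer
returns the number of bytes it wrote, and the caller compares that
against the expected size) can be sketched without the hashfile API.
The version below writes into a plain buffer instead of a struct
hashfile; encode_midx_header() and put_be32() are hypothetical names,
while the constants are copied from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define MIDX_HASH_VERSION 1
#define MIDX_HEADER_SIZE 12

/* Store v in network (big-endian) order; returns bytes written. */
static size_t put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)((v >> 24) & 0xff);
	out[1] = (unsigned char)((v >> 16) & 0xff);
	out[2] = (unsigned char)((v >> 8) & 0xff);
	out[3] = (unsigned char)(v & 0xff);
	return 4;
}

/*
 * Encode the 12-byte header into out, which must have room for
 * MIDX_HEADER_SIZE bytes. The running count is checked against the
 * expected size before it is returned.
 */
static size_t encode_midx_header(unsigned char *out,
				 unsigned char num_chunks,
				 uint32_t num_packs)
{
	size_t written = 0;

	written += put_be32(out + written, MIDX_SIGNATURE);
	out[written++] = MIDX_VERSION;
	out[written++] = MIDX_HASH_VERSION;
	out[written++] = num_chunks;
	out[written++] = 0; /* base MIDX count; currently always zero */
	written += put_be32(out + written, num_packs);

	assert(written == MIDX_HEADER_SIZE);
	return written;
}
```

Counting the bytes as they are emitted, rather than returning a
constant, means a later change to the header layout trips the
assertion instead of silently shifting every chunk offset.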


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 06/24] multi-pack-index: load into memory
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (4 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 05/24] midx: write header information to lockfile Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 19:38     ` Junio C Hamano
  2018-06-25 14:34   ` [PATCH v2 07/24] multi-pack-index: expand test data Derrick Stolee
                     ` (18 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Create a new multi_pack_index struct for loading multi-pack-indexes into
memory. Create a test-tool builtin for reading basic information about
that multi-pack-index to verify the correct data is written.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile                    |  1 +
 midx.c                      | 78 +++++++++++++++++++++++++++++++++++++
 midx.h                      |  4 ++
 object-store.h              | 16 ++++++++
 t/helper/test-read-midx.c   | 34 ++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 t/t5319-multi-pack-index.sh | 12 +++++-
 8 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 t/helper/test-read-midx.c

diff --git a/Makefile b/Makefile
index f5636c711d..0b801d1b16 100644
--- a/Makefile
+++ b/Makefile
@@ -717,6 +717,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-path-utils.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-read-cache.o
+TEST_BUILTINS_OBJS += test-read-midx.o
 TEST_BUILTINS_OBJS += test-ref-store.o
 TEST_BUILTINS_OBJS += test-regex.o
 TEST_BUILTINS_OBJS += test-revision-walking.o
diff --git a/midx.c b/midx.c
index 393d526881..0977397d6a 100644
--- a/midx.c
+++ b/midx.c
@@ -1,18 +1,96 @@
 #include "cache.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
 #define MIDX_HASH_VERSION 1
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+struct multi_pack_index *load_multi_pack_index(const char *object_dir)
+{
+	struct multi_pack_index *m = NULL;
+	int fd;
+	struct stat st;
+	size_t midx_size;
+	void *midx_map = NULL;
+	uint32_t hash_version;
+	char *midx_name = get_midx_filename(object_dir);
+
+	fd = git_open(midx_name);
+
+	if (fd < 0) {
+		error_errno(_("failed to read %s"), midx_name);
+		FREE_AND_NULL(midx_name);
+		return NULL;
+	}
+	if (fstat(fd, &st)) {
+		error_errno(_("failed to read %s"), midx_name);
+		FREE_AND_NULL(midx_name);
+		close(fd);
+		return NULL;
+	}
+
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < MIDX_MIN_SIZE) {
+		close(fd);
+		error(_("multi-pack-index file %s is too small"), midx_name);
+		goto cleanup_fail;
+	}
+
+	FREE_AND_NULL(midx_name);
+
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
+	strcpy(m->object_dir, object_dir);
+	m->data = midx_map;
+
+	m->signature = get_be32(m->data);
+	if (m->signature != MIDX_SIGNATURE) {
+		error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
+		      m->signature, MIDX_SIGNATURE);
+		goto cleanup_fail;
+	}
+
+	m->version = m->data[4];
+	if (m->version != MIDX_VERSION) {
+		error(_("multi-pack-index version %d not recognized"),
+		      m->version);
+		goto cleanup_fail;
+	}
+
+	hash_version = m->data[5];
+	if (hash_version != MIDX_HASH_VERSION) {
+		error(_("hash version %u does not match"), hash_version);
+		goto cleanup_fail;
+	}
+	m->hash_len = MIDX_HASH_LEN;
+
+	m->num_chunks = *(m->data + 6);
+
+	m->num_packs = get_be32(m->data + 8);
+
+	return m;
+
+cleanup_fail:
+	FREE_AND_NULL(m);
+	FREE_AND_NULL(midx_name);
+	munmap(midx_map, midx_size);
+	close(fd);
+	return NULL;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index dbdbe9f873..2d83dd9ec1 100644
--- a/midx.h
+++ b/midx.h
@@ -1,6 +1,10 @@
 #ifndef __MIDX_H__
 #define __MIDX_H__
 
+struct multi_pack_index;
+
+struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+
 int write_midx_file(const char *object_dir);
 
 #endif
diff --git a/object-store.h b/object-store.h
index d683112fd7..4f410841cc 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,6 +84,22 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
+struct multi_pack_index {
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	char object_dir[FLEX_ARRAY];
+};
+
 struct raw_object_store {
 	/*
 	 * Path to the repository's object store.
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
new file mode 100644
index 0000000000..5abf969175
--- /dev/null
+++ b/t/helper/test-read-midx.c
@@ -0,0 +1,34 @@
+/*
+ * test-read-midx.c: code to exercise reading of multi-pack-index files
+ */
+#include "test-tool.h"
+#include "cache.h"
+#include "midx.h"
+#include "repository.h"
+#include "object-store.h"
+
+static int read_midx_file(const char *object_dir)
+{
+	struct multi_pack_index *m = load_multi_pack_index(object_dir);
+
+	if (!m)
+		return 0;
+
+	printf("header: %08x %d %d %d\n",
+	       m->signature,
+	       m->version,
+	       m->num_chunks,
+	       m->num_packs);
+
+	printf("object_dir: %s\n", m->object_dir);
+
+	return 0;
+}
+
+int cmd__read_midx(int argc, const char **argv)
+{
+	if (argc != 2)
+		usage("read-midx <object_dir>");
+
+	return read_midx_file(argv[1]);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 805a45de9c..1c3ab36e6c 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -27,6 +27,7 @@ static struct test_cmd cmds[] = {
 	{ "path-utils", cmd__path_utils },
 	{ "prio-queue", cmd__prio_queue },
 	{ "read-cache", cmd__read_cache },
+	{ "read-midx", cmd__read_midx },
 	{ "ref-store", cmd__ref_store },
 	{ "regex", cmd__regex },
 	{ "revision-walking", cmd__revision_walking },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 7116ddfb94..6af8c08a66 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -21,6 +21,7 @@ int cmd__online_cpus(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__read_cache(int argc, const char **argv);
+int cmd__read_midx(int argc, const char **argv);
 int cmd__ref_store(int argc, const char **argv);
 int cmd__regex(int argc, const char **argv);
 int cmd__revision_walking(int argc, const char **argv);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 8622a7cdce..0372704c96 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,9 +3,19 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+midx_read_expect() {
+	cat >expect <<- EOF
+	header: 4d494458 1 0 0
+	object_dir: .
+	EOF
+	test-tool read-midx . >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success 'write midx with no packs' '
 	git multi-pack-index --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index
+	test_path_is_file pack/multi-pack-index &&
+	midx_read_expect
 '
 
 test_done
-- 
2.18.0.24.g1b579a2ee9
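
The header checks performed by load_multi_pack_index() in the patch
above can be isolated from the file handling as in the sketch below.
It works on a buffer that is already in memory, so no open, fstat or
mmap calls are involved. The struct midx_header type, the
midx_parse_result enum and parse_midx_header() are hypothetical names;
the constants and the byte positions follow the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define MIDX_HASH_VERSION 1
#define MIDX_HEADER_SIZE 12
#define MIDX_HASH_LEN 20
#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)

/* Hypothetical result codes; the patch reports errors via error(). */
enum midx_parse_result {
	MIDX_OK = 0,
	MIDX_TOO_SMALL,
	MIDX_BAD_SIGNATURE,
	MIDX_BAD_VERSION,
	MIDX_BAD_HASH_VERSION
};

struct midx_header {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
	unsigned char num_chunks;
	uint32_t num_packs;
};

static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Validate the fixed header in the same order as the patch does. */
static enum midx_parse_result parse_midx_header(const unsigned char *data,
						size_t len,
						struct midx_header *h)
{
	assert(data && h);

	if (len < MIDX_MIN_SIZE)
		return MIDX_TOO_SMALL;

	h->signature = load_be32(data);
	if (h->signature != MIDX_SIGNATURE)
		return MIDX_BAD_SIGNATURE;

	h->version = data[4];
	if (h->version != MIDX_VERSION)
		return MIDX_BAD_VERSION;

	h->hash_version = data[5];
	if (h->hash_version != MIDX_HASH_VERSION)
		return MIDX_BAD_HASH_VERSION;

	h->num_chunks = data[6];
	/* data[7] is the base MIDX count, currently always zero. */
	h->num_packs = load_be32(data + 8);
	return MIDX_OK;
}
```

Checking the length before reading any field keeps every later access
inside the buffer, which matters once the data comes from a mapped
file of arbitrary size.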


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v2 07/24] multi-pack-index: expand test data
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (5 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 06/24] multi-pack-index: load into memory Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 19:45     ` Junio C Hamano
  2018-06-25 14:34   ` [PATCH v2 08/24] packfile: generalize pack directory list Derrick Stolee
                     ` (17 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests to t5319-multi-pack-index.sh that
create repository data including multiple packfiles with both version 1
and version 2 formats.

The current 'git multi-pack-index write' command will always write the
same file with no "real" data. This will be expanded in future commits,
along with the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 99 +++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 0372704c96..d533fd0dbc 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -13,9 +13,108 @@ midx_read_expect() {
 }
 
 test_expect_success 'write midx with no packs' '
+	test_when_finished rm pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
 	test_path_is_file pack/multi-pack-index &&
 	midx_read_expect
 '
 
+test_expect_success 'create objects' '
+	for i in `test_seq 1 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree </dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with one v1 pack' '
+	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more objects' '
+	for i in `test_seq 6 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list2 &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with two packs' '
+	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more packs' '
+	for j in `test_seq 1 10`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+		git update-index --add file_101 &&
+		tree=$(git write-tree) &&
+		commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+		} >obj-list &&
+		git update-ref HEAD $commit &&
+		git pack-objects --index-version=2 test-pack <obj-list &&
+		i=$(expr $i + 1) || return 1 &&
+		j=$(expr $j + 1) || return 1
+	done
+'
+
+test_expect_success 'write midx with twelve packs' '
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
 test_done
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 08/24] packfile: generalize pack directory list
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (6 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 07/24] multi-pack-index: expand test data Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 19:57     ` Junio C Hamano
  2018-06-25 14:34   ` [PATCH v2 09/24] multi-pack-index: read packfile list Derrick Stolee
                     ` (16 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

In anticipation of sharing the pack directory listing with the
multi-pack-index, generalize prepare_packed_git_one() into
for_each_file_in_pack_dir().
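
As a rough, standalone illustration of the callback pattern this patch
introduces, the sketch below walks a directory and hands each entry to a
caller-supplied function. It is not the code from packfile.c: the names
for_each_file_in_dir, each_file_fn and count_idx are hypothetical, and a
fixed-size path buffer stands in for git's strbuf.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type, loosely modeled on each_file_in_pack_dir_fn. */
typedef void (*each_file_fn)(const char *full_path, const char *file_name,
			     void *data);

/*
 * Call fn once for every entry in dir_path, skipping "." and "..".
 * Returns 0 on success, or -1 if the directory cannot be opened.
 */
static int for_each_file_in_dir(const char *dir_path, each_file_fn fn,
				void *data)
{
	char path[4096];
	struct dirent *de;
	DIR *dir;

	assert(dir_path && fn);

	dir = opendir(dir_path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir_path, de->d_name);
		fn(path, de->d_name, data);
	}

	closedir(dir);
	return 0;
}

/* Example callback: count the entries whose name ends in ".idx". */
static void count_idx(const char *full_path, const char *file_name,
		      void *data)
{
	size_t len = strlen(file_name);

	(void)full_path; /* unused in this callback */
	if (len >= 4 && !strcmp(file_name + len - 4, ".idx"))
		(*(int *)data)++;
}
```

The walker knows nothing about packs; everything pack-specific lives in
the callback and in the opaque data pointer, which is what lets a second
caller reuse the same directory scan.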

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 103 +++++++++++++++++++++++++++++++++--------------------
 packfile.h |   6 ++++
 2 files changed, 71 insertions(+), 38 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7cd45aa4b2..db61c8813b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -738,13 +738,14 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
-static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data)
 {
 	struct strbuf path = STRBUF_INIT;
 	size_t dirnamelen;
 	DIR *dir;
 	struct dirent *de;
-	struct string_list garbage = STRING_LIST_INIT_DUP;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -759,53 +760,79 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	strbuf_addch(&path, '/');
 	dirnamelen = path.len;
 	while ((de = readdir(dir)) != NULL) {
-		struct packed_git *p;
-		size_t base_len;
-
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		strbuf_setlen(&path, dirnamelen);
 		strbuf_addstr(&path, de->d_name);
 
-		base_len = path.len;
-		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
-			/* Don't reopen a pack we already have. */
-			for (p = r->objects->packed_git; p;
-			     p = p->next) {
-				size_t len;
-				if (strip_suffix(p->pack_name, ".pack", &len) &&
-				    len == base_len &&
-				    !memcmp(p->pack_name, path.buf, len))
-					break;
-			}
-			if (p == NULL &&
-			    /*
-			     * See if it really is a valid .idx file with
-			     * corresponding .pack file that we can map.
-			     */
-			    (p = add_packed_git(path.buf, path.len, local)) != NULL)
-				install_packed_git(r, p);
-		}
-
-		if (!report_garbage)
-			continue;
-
-		if (ends_with(de->d_name, ".idx") ||
-		    ends_with(de->d_name, ".pack") ||
-		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep") ||
-		    ends_with(de->d_name, ".promisor"))
-			string_list_append(&garbage, path.buf);
-		else
-			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
+		fn(path.buf, path.len, de->d_name, data);
 	}
+
 	closedir(dir);
-	report_pack_garbage(&garbage);
-	string_list_clear(&garbage, 0);
 	strbuf_release(&path);
 }
 
+struct prepare_pack_data
+{
+	struct repository *r;
+	struct string_list *garbage;
+	int local;
+};
+
+static void prepare_pack(const char *full_name, size_t full_name_len, const char *file_name, void *_data)
+{
+	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
+	struct packed_git *p;
+	size_t base_len = full_name_len;
+
+	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		/* Don't reopen a pack we already have. */
+		for (p = data->r->objects->packed_git; p; p = p->next) {
+			size_t len;
+			if (strip_suffix(p->pack_name, ".pack", &len) &&
+			    len == base_len &&
+			    !memcmp(p->pack_name, full_name, len))
+				break;
+		}
+
+		if (p == NULL &&
+		    /*
+		     * See if it really is a valid .idx file with
+		     * corresponding .pack file that we can map.
+		     */
+		    (p = add_packed_git(full_name, full_name_len, data->local)) != NULL)
+			install_packed_git(data->r, p);
+	}
+
+	if (!report_garbage)
+		return;
+
+	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".pack") ||
+	    ends_with(file_name, ".bitmap") ||
+	    ends_with(file_name, ".keep") ||
+	    ends_with(file_name, ".promisor"))
+		string_list_append(data->garbage, full_name);
+	else
+		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
+}
+
+static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+{
+	struct prepare_pack_data data;
+	struct string_list garbage = STRING_LIST_INIT_DUP;
+
+	data.r = r;
+	data.garbage = &garbage;
+	data.local = local;
+
+	for_each_file_in_pack_dir(objdir, prepare_pack, &data);
+
+	report_pack_garbage(data.garbage);
+	string_list_clear(data.garbage, 0);
+}
+
 static void prepare_packed_git(struct repository *r);
 /*
  * Give a fast, rough count of the number of objects in the repository. This
diff --git a/packfile.h b/packfile.h
index e0a38aba93..d2ad30300a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -28,6 +28,12 @@ extern char *sha1_pack_index_name(const unsigned char *sha1);
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
+typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
+				      const char *file_name, void *data);
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data);
+
 /* A hook to report invalid files in pack directory */
 #define PACKDIR_FILE_PACK 1
 #define PACKDIR_FILE_IDX 2
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 09/24] multi-pack-index: read packfile list
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (7 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 08/24] packfile: generalize pack directory list Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
                     ` (15 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

When constructing a multi-pack-index file for a given object directory,
read the files within the enclosed pack directory, select those whose
names end with ".idx", and find the paired packfile for each one using
add_packed_git().
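
The selection step can be sketched on its own, without any of git's
helpers. The code below is only an approximation of what the callback
does: struct name_list, has_suffix and add_if_idx are hypothetical names,
the list stores copied file names rather than struct packed_git
pointers, and the growth policy (doubling, starting at 16) is an
assumption that does not claim to match ALLOC_GROW.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's struct pack_list: names only. */
struct name_list {
	char **names;
	size_t nr;
	size_t alloc;
};

/* Return 1 if str ends with suffix, 0 otherwise. */
static int has_suffix(const char *str, const char *suffix)
{
	size_t len = strlen(str), suffix_len = strlen(suffix);

	return len >= suffix_len && !strcmp(str + len - suffix_len, suffix);
}

/*
 * Append a copy of file_name to the list if it names a pack index.
 * Returns 1 if the name was added, 0 if it was skipped.
 */
static int add_if_idx(struct name_list *list, const char *file_name)
{
	char *copy;

	assert(list && file_name);

	if (!has_suffix(file_name, ".idx"))
		return 0;

	/* Grow the array before writing past its current capacity. */
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 16;
		char **grown = realloc(list->names,
				       new_alloc * sizeof(*grown));
		if (!grown)
			abort();
		list->names = grown;
		list->alloc = new_alloc;
	}

	copy = malloc(strlen(file_name) + 1);
	if (!copy)
		abort();
	strcpy(copy, file_name);
	list->names[list->nr++] = copy;
	return 1;
}
```

Only names that pass the suffix check consume a slot, so the final count
is the number of accepted index files.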

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 45 ++++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh | 16 ++++++-------
 2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 0977397d6a..e79ffb5576 100644
--- a/midx.c
+++ b/midx.c
@@ -1,6 +1,8 @@
 #include "cache.h"
 #include "csum-file.h"
+#include "dir.h"
 #include "lockfile.h"
+#include "packfile.h"
 #include "object-store.h"
 #include "midx.h"
 
@@ -107,12 +109,40 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_list
+{
+	struct packed_git **list;
+	uint32_t nr;
+	uint32_t alloc;
+};
+
+static void add_pack_to_midx(const char *full_path, size_t full_path_len,
+			     const char *file_name, void *data)
+{
+	struct pack_list *packs = (struct pack_list *)data;
+
+	if (ends_with(file_name, ".idx")) {
+		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc);
+
+		packs->list[packs->nr] = add_packed_git(full_path,
+							 full_path_len,
+							 0);
+		if (!packs->list[packs->nr])
+			warning(_("failed to add packfile '%s'"),
+				full_path);
+		else
+			packs->nr++;
+	}
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char num_chunks = 0;
 	char *midx_name;
+	uint32_t i;
 	struct hashfile *f;
 	struct lock_file lk;
+	struct pack_list packs;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -121,14 +151,27 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	packs.nr = 0;
+	packs.alloc = 16;
+	packs.list = NULL;
+	ALLOC_ARRAY(packs.list, packs.alloc);
+
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, 0);
+	write_midx_header(f, num_chunks, packs.nr);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+	for (i = 0; i < packs.nr; i++) {
+		close_pack(packs.list[i]);
+		FREE_AND_NULL(packs.list[i]);
+	}
+
+	FREE_AND_NULL(packs.list);
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index d533fd0dbc..4d4d6ca0a6 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect() {
+	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 0 0
+	header: 4d494458 1 0 $NUM_PACKS
 	object_dir: .
 	EOF
 	test-tool read-midx . >actual &&
@@ -15,8 +16,7 @@ midx_read_expect() {
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect
+	midx_read_expect 0
 '
 
 test_expect_success 'create objects' '
@@ -47,13 +47,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'Add more objects' '
@@ -83,7 +83,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 2
 '
 
 test_expect_success 'Add more packs' '
@@ -106,7 +106,7 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 test-pack <obj-list &&
+		git pack-objects --index-version=2 pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
@@ -114,7 +114,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 12
 '
 
 test_done
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 10/24] multi-pack-index: write pack names in chunk
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (8 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 09/24] multi-pack-index: read packfile list Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 11/24] midx: read pack names into array Derrick Stolee
                     ` (14 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

The multi-pack-index needs to track which packfiles it indexes. Store
these names in our first required chunk. Since filenames vary in length,
pad the chunk to a multiple of four bytes to keep later chunks aligned.
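
The padding arithmetic can be checked in isolation. The sketch below
assumes the four-byte alignment defined by MIDX_CHUNK_ALIGNMENT in this
patch; padding_for and pack_names_chunk_len are hypothetical helper
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_ALIGNMENT 4 /* mirrors MIDX_CHUNK_ALIGNMENT in the patch */

/*
 * Number of zero bytes needed after len bytes to reach the next
 * multiple of CHUNK_ALIGNMENT (0 if len is already aligned).
 */
static size_t padding_for(size_t len)
{
	return (CHUNK_ALIGNMENT - (len % CHUNK_ALIGNMENT)) % CHUNK_ALIGNMENT;
}

/*
 * Total size of a pack-name chunk: every name followed by its NUL
 * terminator, then padding up to the alignment boundary.
 */
static size_t pack_names_chunk_len(const char **names, size_t nr)
{
	size_t i, len = 0;

	assert(names || !nr);

	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	return len + padding_for(len);
}
```

For example, the names "a.idx" and "bb.idx" occupy 6 + 7 = 13 bytes
including terminators, so three padding bytes bring the chunk to 16.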

Modify the 'test-tool read-midx' helper to output the presence of the
pack-file name chunk. Modify t5319-multi-pack-index.sh to reflect this
new output and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks that verify
that the values we are writing fit the format definitions. This mainly
helps developers while working on the feature, but it can also identify
issues that only appear when dealing with very large data sets. These
large sets are hard to encode into test cases.
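
To make the write-time checks concrete, here is a small sketch of two of
them: confirming that chunk offsets are non-decreasing and aligned, and
encoding one chunk lookup entry as a four-byte id followed by an
eight-byte offset, both big-endian. The function names are hypothetical
and the code returns an error instead of calling BUG(), so it
illustrates the idea rather than reproducing midx.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_ALIGNMENT 4 /* mirrors MIDX_CHUNK_ALIGNMENT in the patch */

/* Return 0 if the offsets are non-decreasing and aligned, else -1. */
static int check_chunk_offsets(const uint64_t *offsets, size_t nr)
{
	size_t i;

	assert(offsets || !nr);

	for (i = 0; i < nr; i++) {
		if (i && offsets[i] < offsets[i - 1])
			return -1;
		if (offsets[i] % CHUNK_ALIGNMENT)
			return -1;
	}
	return 0;
}

/*
 * Encode one lookup entry into 12 bytes: a 4-byte chunk id followed by
 * an 8-byte file offset, both in network (big-endian) byte order.
 */
static void encode_chunk_entry(unsigned char out[12], uint32_t id,
			       uint64_t offset)
{
	int i;

	for (i = 0; i < 4; i++)
		out[i] = (unsigned char)(id >> (24 - 8 * i));
	for (i = 0; i < 8; i++)
		out[4 + i] = (unsigned char)(offset >> (56 - 8 * i));
}
```

Encoding the id 0x504e414d yields the bytes 'P', 'N', 'A', 'M', which is
why the chunk is easy to spot in a hex dump of the file.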

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |   6 +
 midx.c                                  | 190 ++++++++++++++++++++++--
 object-store.h                          |   2 +
 t/helper/test-read-midx.c               |   7 +
 t/t5319-multi-pack-index.sh             |   3 +-
 5 files changed, 198 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index e060e693f4..6c5a77475f 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/midx.c b/midx.c
index e79ffb5576..2fad99d1b8 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -27,6 +32,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	void *midx_map = NULL;
 	uint32_t hash_version;
 	char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 
@@ -83,6 +89,33 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	m->num_packs = get_be32(m->data + 8);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
+					     MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
+						 MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+		case MIDX_CHUNKID_PACKNAMES:
+			m->chunk_pack_names = m->data + chunk_offset;
+			break;
+
+		case 0:
+			die(_("terminating multi-pack-index chunk id appears earlier than expected"));
+			break;
+
+		default:
+			/*
+			 * Do nothing on unrecognized chunks, allowing future
+			 * extensions to add optional chunks.
+			 */
+			break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die(_("multi-pack-index missing required pack-name chunk"));
+
 	return m;
 
 cleanup_fail:
@@ -112,8 +145,11 @@ static size_t write_midx_header(struct hashfile *f,
 struct pack_list
 {
 	struct packed_git **list;
+	char **names;
 	uint32_t nr;
-	uint32_t alloc;
+	uint32_t alloc_list;
+	uint32_t alloc_names;
+	size_t pack_name_concat_len;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -122,27 +158,101 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 	struct pack_list *packs = (struct pack_list *)data;
 
 	if (ends_with(file_name, ".idx")) {
-		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
+		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
 							 full_path_len,
 							 0);
-		if (!packs->list[packs->nr])
+		if (!packs->list[packs->nr]) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
-		else
-			packs->nr++;
+			return;
+		}
+
+		packs->names[packs->nr] = xstrdup(file_name);
+		packs->pack_name_concat_len += strlen(file_name) + 1;
+		packs->nr++;
+	}
+}
+
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+
+	FREE_AND_NULL(pairs);
+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+				    char **pack_names,
+				    uint32_t num_packs)
+{
+	uint32_t i;
+	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+	size_t written = 0;
+
+	for (i = 0; i < num_packs; i++) {
+		size_t writelen = strlen(pack_names[i]) + 1;
+
+		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+			BUG("incorrect pack-file order: %s before %s",
+			    pack_names[i - 1],
+			    pack_names[i]);
+
+		hashwrite(f, pack_names[i], writelen);
+		written += writelen;
+	}
+
+	/* add padding to be aligned */
+	i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+	if (i < MIDX_CHUNK_ALIGNMENT) {
+		memset(padding, 0, sizeof(padding));
+		hashwrite(f, padding, i);
+		written += i;
 	}
+
+	return written;
 }
 
 int write_midx_file(const char *object_dir)
 {
-	unsigned char num_chunks = 0;
+	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f;
 	struct lock_file lk;
 	struct pack_list packs;
+	uint32_t *pack_perm;
+	uint64_t written = 0;
+	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
+	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -152,17 +262,77 @@ int write_midx_file(const char *object_dir)
 	}
 
 	packs.nr = 0;
-	packs.alloc = 16;
+	packs.alloc_list = 16;
+	packs.alloc_names = 16;
 	packs.list = NULL;
-	ALLOC_ARRAY(packs.list, packs.alloc);
+	packs.pack_name_concat_len = 0;
+	ALLOC_ARRAY(packs.list, packs.alloc_list);
+	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
+	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	sort_packs_by_name(packs.names, packs.nr, pack_perm);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, packs.nr);
+	cur_chunk = 0;
+	num_chunks = 1;
+
+	written = write_midx_header(f, num_chunks, packs.nr);
+
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+
+	for (i = 0; i <= num_chunks; i++) {
+		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
+			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
+			    chunk_offsets[i - 1],
+			    chunk_offsets[i]);
+
+		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
+			BUG("chunk offset %"PRIu64" is not properly aligned",
+			    chunk_offsets[i]);
+
+		hashwrite_be32(f, chunk_ids[i]);
+		hashwrite_be32(f, chunk_offsets[i] >> 32);
+		hashwrite_be32(f, chunk_offsets[i]);
+
+		written += MIDX_CHUNKLOOKUP_WIDTH;
+	}
+
+	for (i = 0; i < num_chunks; i++) {
+		if (written != chunk_offsets[i])
+			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
+			    chunk_offsets[i],
+			    written,
+			    chunk_ids[i]);
+
+		switch (chunk_ids[i]) {
+		case MIDX_CHUNKID_PACKNAMES:
+			written += write_midx_pack_names(f, packs.names, packs.nr);
+			break;
+
+		default:
+			BUG("trying to write unknown chunk id %"PRIx32,
+			    chunk_ids[i]);
+		}
+	}
+
+	if (written != chunk_offsets[num_chunks])
+		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
+		    written,
+		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
@@ -170,8 +340,10 @@ int write_midx_file(const char *object_dir)
 	for (i = 0; i < packs.nr; i++) {
 		close_pack(packs.list[i]);
 		FREE_AND_NULL(packs.list[i]);
+		FREE_AND_NULL(packs.names[i]);
 	}
 
 	FREE_AND_NULL(packs.list);
+	FREE_AND_NULL(packs.names);
 	return 0;
 }
diff --git a/object-store.h b/object-store.h
index 4f410841cc..c87d051849 100644
--- a/object-store.h
+++ b/object-store.h
@@ -97,6 +97,8 @@ struct multi_pack_index {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const unsigned char *chunk_pack_names;
+
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 5abf969175..a9232d8219 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -20,6 +20,13 @@ static int read_midx_file(const char *object_dir)
 	       m->num_chunks,
 	       m->num_packs);
 
+	printf("chunks:");
+
+	if (m->chunk_pack_names)
+		printf(" pack_names");
+
+	printf("\n");
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 4d4d6ca0a6..1b2778961c 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,7 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 0 $NUM_PACKS
+	header: 4d494458 1 1 $NUM_PACKS
+	chunks: pack_names
 	object_dir: .
 	EOF
 	test-tool read-midx . >actual &&
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 11/24] midx: read pack names into array
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (9 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 23:52     ` Eric Sunshine
  2018-06-25 14:34   ` [PATCH v2 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
                     ` (13 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 31 +++++++++++++++++++++++++++++++
 object-store.h              |  1 +
 t/helper/test-read-midx.c   |  5 +++++
 t/t5319-multi-pack-index.sh |  7 ++++++-
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 2fad99d1b8..8bbc966f6b 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	uint32_t hash_version;
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
+	const char *cur_pack_name;
 
 	fd = git_open(midx_name);
 
@@ -116,6 +117,22 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
 
+	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+
+	cur_pack_name = (const char *)m->chunk_pack_names;
+	for (i = 0; i < m->num_packs; i++) {
+		m->pack_names[i] = cur_pack_name;
+
+		cur_pack_name += strlen(cur_pack_name) + 1;
+
+		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
+			error("MIDX pack names out of order: '%s' before '%s'",
+			      m->pack_names[i - 1],
+			      m->pack_names[i]);
+			goto cleanup_fail;
+		}
+	}
+
 	return m;
 
 cleanup_fail:
@@ -210,6 +227,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	FREE_AND_NULL(pairs);
 }
 
+static size_t write_midx_pack_lookup(struct hashfile *f,
+				     char **pack_names,
+				     uint32_t nr_packs)
+{
+	uint32_t i, cur_len = 0;
+
+	for (i = 0; i < nr_packs; i++) {
+		hashwrite_be32(f, cur_len);
+		cur_len += strlen(pack_names[i]) + 1;
+	}
+
+	return sizeof(uint32_t) * (size_t)nr_packs;
+}
+
 static size_t write_midx_pack_names(struct hashfile *f,
 				    char **pack_names,
 				    uint32_t num_packs)
diff --git a/object-store.h b/object-store.h
index c87d051849..88169b33e9 100644
--- a/object-store.h
+++ b/object-store.h
@@ -99,6 +99,7 @@ struct multi_pack_index {
 
 	const unsigned char *chunk_pack_names;
 
+	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index a9232d8219..0b53a9e8b5 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -9,6 +9,7 @@
 
 static int read_midx_file(const char *object_dir)
 {
+	uint32_t i;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir);
 
 	if (!m)
@@ -27,6 +28,10 @@ static int read_midx_file(const char *object_dir)
 
 	printf("\n");
 
+	printf("packs:\n");
+	for (i = 0; i < m->num_packs; i++)
+		printf("%s\n", m->pack_names[i]);
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1b2778961c..800fa7749c 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -8,8 +8,13 @@ midx_read_expect() {
 	cat >expect <<- EOF
 	header: 4d494458 1 1 $NUM_PACKS
 	chunks: pack_names
-	object_dir: .
+	packs:
 	EOF
+	if [ $NUM_PACKS -ge 1 ]
+	then
+		ls pack/ | grep idx | sort >> expect
+	fi
+	printf "object_dir: .\n" >>expect &&
 	test-tool read-midx . >actual &&
 	test_cmp expect actual
 }
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 12/24] midx: sort and deduplicate objects from packfiles
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (10 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 11/24] midx: read pack names into array Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 13/24] midx: write object ids in a chunk Derrick Stolee
                     ` (12 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Before writing a list of objects and their offsets to a multi-pack-index,
we need to collect the list of objects contained in the packfiles. There
may be multiple copies of some objects, so this list must be deduplicated.

It is possible to artificially get into a state where there are many
duplicate copies of objects. Building a list of all objects before
de-duplicating would then create high memory pressure. To reduce this
memory pressure without a significant performance drop, group objects
by the first byte of their object id: use the IDX fanout tables to
select each group, copy it to a local array, then sort.
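
The grouping relies on the IDX fanout table, whose Nth entry counts the
objects whose first byte is less than or equal to N. The sketch below
shows how such a table maps a first byte to a range of positions in a
sorted object list. build_fanout and fanout_range are hypothetical
helpers operating on a plain array, not on a struct packed_git.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a 256-entry fanout table from the first byte of each object id.
 * After the call, fanout[b] is the number of ids whose first byte is <= b.
 */
static void build_fanout(const unsigned char *first_bytes, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i;
	unsigned int b;

	assert(first_bytes || !nr);

	for (b = 0; b < 256; b++)
		fanout[b] = 0;
	for (i = 0; i < nr; i++)
		fanout[first_bytes[i]]++;
	for (b = 1; b < 256; b++)
		fanout[b] += fanout[b - 1];
}

/*
 * Compute the half-open range [*start, *end) of positions, within a
 * list sorted by object id, that hold ids starting with first_byte.
 */
static void fanout_range(const uint32_t fanout[256], unsigned int first_byte,
			 uint32_t *start, uint32_t *end)
{
	assert(first_byte < 256);

	*start = first_byte ? fanout[first_byte - 1] : 0;
	*end = fanout[first_byte];
}
```

Because each range is contiguous, one group can be copied out of every
pack and sorted without ever holding the full object list in memory.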

Copy only the de-duplicated entries. When an object appears in more
than one packfile, keep the copy from the packfile with the most recent
modification time.
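
A minimal sketch of that selection rule follows: sort by object id,
break ties by newest pack first, then keep the first entry of each run.
struct entry is a simplified, hypothetical stand-in for struct
pack_midx_entry (a raw 20-byte id, no offset), and unlike the patch the
de-duplication is done in place.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define OID_LEN 20

/* Simplified, hypothetical stand-in for struct pack_midx_entry. */
struct entry {
	unsigned char oid[OID_LEN];
	uint32_t pack_id;
	time_t pack_mtime;
};

/* Order by object id, then newest pack first, then lowest pack id. */
static int entry_cmp(const void *_a, const void *_b)
{
	const struct entry *a = _a;
	const struct entry *b = _b;
	int cmp = memcmp(a->oid, b->oid, OID_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	return (a->pack_id > b->pack_id) - (a->pack_id < b->pack_id);
}

/*
 * Sort the entries, then keep only the first entry of each run of equal
 * object ids. Returns the number of entries kept at the array's front.
 */
static size_t sort_and_dedup(struct entry *entries, size_t nr)
{
	size_t i, kept = 0;

	assert(entries || !nr);
	if (!nr)
		return 0;

	qsort(entries, nr, sizeof(*entries), entry_cmp);

	for (i = 0; i < nr; i++) {
		if (kept && !memcmp(entries[kept - 1].oid, entries[i].oid,
				    OID_LEN))
			continue;
		entries[kept++] = entries[i];
	}
	return kept;
}
```

Putting the newest pack first in the sort order is what makes "take the
first duplicate" equivalent to "prefer the most recently modified pack".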

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 packfile.c |  17 +++++++
 packfile.h |   2 +
 3 files changed, 146 insertions(+)

diff --git a/midx.c b/midx.c
index 8bbc966f6b..648a501d74 100644
--- a/midx.c
+++ b/midx.c
@@ -4,6 +4,7 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -187,6 +188,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}
 
+		if (open_pack_index(packs->list[packs->nr])) {
+			warning(_("failed to open pack-index '%s'"), full_path);
+			close_pack(packs->list[packs->nr]);
+			FREE_AND_NULL(packs->list[packs->nr]);
+			return;
+		}
+
 		packs->names[packs->nr] = xstrdup(file_name);
 		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
@@ -227,6 +235,120 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	FREE_AND_NULL(pairs);
 }
 
+struct pack_midx_entry {
+	struct object_id oid;
+	uint32_t pack_int_id;
+	time_t pack_mtime;
+	uint64_t offset;
+};
+
+static int midx_oid_compare(const void *_a, const void *_b)
+{
+	const struct pack_midx_entry *a = (const struct pack_midx_entry *)_a;
+	const struct pack_midx_entry *b = (const struct pack_midx_entry *)_b;
+	int cmp = oidcmp(&a->oid, &b->oid);
+
+	if (cmp)
+		return cmp;
+
+	if (a->pack_mtime > b->pack_mtime)
+		return -1;
+	else if (a->pack_mtime < b->pack_mtime)
+		return 1;
+
+	return a->pack_int_id - b->pack_int_id;
+}
+
+static void fill_pack_entry(uint32_t pack_int_id,
+			    struct packed_git *p,
+			    uint32_t cur_object,
+			    struct pack_midx_entry *entry)
+{
+	if (!nth_packed_object_oid(&entry->oid, p, cur_object))
+		die(_("failed to locate object %d in packfile"), cur_object);
+
+	entry->pack_int_id = pack_int_id;
+	entry->pack_mtime = p->mtime;
+
+	entry->offset = nth_packed_object_offset(p, cur_object);
+}
+
+/*
+ * It is possible to artificially get into a state where there are many
+ * duplicate copies of objects. That can create high memory pressure if
+ * we are to create a list of all objects before de-duplication. To reduce
+ * this memory pressure without a significant performance drop, automatically
+ * group objects by the first byte of their object id. Use the IDX fanout
+ * tables to group the data, copy to a local array, then sort.
+ *
+ * Copy only the de-duplicated entries (selected by most-recent modified time
+ * of a packfile containing the object).
+ */
+static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+						  uint32_t *perm,
+						  uint32_t nr_packs,
+						  uint32_t *nr_objects)
+{
+	uint32_t cur_fanout, cur_pack, cur_object;
+	uint32_t alloc_fanout, alloc_objects, total_objects = 0;
+	struct pack_midx_entry *entries_by_fanout = NULL;
+	struct pack_midx_entry *deduplicated_entries = NULL;
+
+	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		total_objects += p[cur_pack]->num_objects;
+	}
+
+	/*
+	 * As we de-duplicate by fanout value, we expect the fanout
+	 * slices to be evenly distributed, with some noise. Hence,
+	 * allocate slightly more than one 256th.
+	 */
+	alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
+
+	ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
+	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
+	*nr_objects = 0;
+
+	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
+		uint32_t nr_fanout = 0;
+
+		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
+			end = get_pack_fanout(p[cur_pack], cur_fanout);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				nr_fanout++;
+			}
+		}
+
+		QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
+
+		/*
+		 * The batch is now sorted by OID and then mtime (descending).
+		 * Take only the first duplicate.
+		 */
+		for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
+			if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
+						  &entries_by_fanout[cur_object].oid))
+				continue;
+
+			ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
+			memcpy(&deduplicated_entries[*nr_objects],
+			       &entries_by_fanout[cur_object],
+			       sizeof(struct pack_midx_entry));
+			(*nr_objects)++;
+		}
+	}
+
+	FREE_AND_NULL(entries_by_fanout);
+	return deduplicated_entries;
+}
+
 static size_t write_midx_pack_lookup(struct hashfile *f,
 				     char **pack_names,
 				     uint32_t nr_packs)
@@ -284,6 +406,8 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	uint32_t nr_entries;
+	struct pack_midx_entry *entries;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -309,6 +433,8 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
+	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -376,5 +502,6 @@ int write_midx_file(const char *object_dir)
 
 	FREE_AND_NULL(packs.list);
 	FREE_AND_NULL(packs.names);
+	FREE_AND_NULL(entries);
 	return 0;
 }
diff --git a/packfile.c b/packfile.c
index db61c8813b..39d6b66337 100644
--- a/packfile.c
+++ b/packfile.c
@@ -196,6 +196,23 @@ int open_pack_index(struct packed_git *p)
 	return ret;
 }
 
+uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
+{
+	const uint32_t *level1_ofs = p->index_data;
+
+	if (!level1_ofs) {
+		if (open_pack_index(p))
+			return 0;
+		level1_ofs = p->index_data;
+	}
+
+	if (p->index_version > 1) {
+		level1_ofs += 2;
+	}
+
+	return ntohl(level1_ofs[value]);
+}
+
 static struct packed_git *alloc_packed_git(int extra)
 {
 	struct packed_git *p = xmalloc(st_add(sizeof(*p), extra));
diff --git a/packfile.h b/packfile.h
index d2ad30300a..b0eed44c0b 100644
--- a/packfile.h
+++ b/packfile.h
@@ -69,6 +69,8 @@ extern int open_pack_index(struct packed_git *);
  */
 extern void close_pack_index(struct packed_git *);
 
+extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
+
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
 extern void close_pack(struct packed_git *);
-- 
2.18.0.24.g1b579a2ee9


* [PATCH v2 13/24] midx: write object ids in a chunk
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (11 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 14/24] midx: write object id fanout chunk Derrick Stolee
                     ` (11 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |  4 ++
 midx.c                                  | 51 ++++++++++++++++++++++---
 object-store.h                          |  1 +
 t/helper/test-read-midx.c               |  2 +
 t/t5319-multi-pack-index.sh             |  4 +-
 5 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6c5a77475f..78ee0489c6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -302,6 +302,10 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Lookup (ID: {'O', 'I', 'D', 'L'})
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/midx.c b/midx.c
index 648a501d74..aec85b8181 100644
--- a/midx.c
+++ b/midx.c
@@ -14,9 +14,10 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 1
+#define MIDX_MAX_CHUNKS 2
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
 static char *get_midx_filename(const char *object_dir)
@@ -102,6 +103,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				m->chunk_oid_lookup = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
 				break;
@@ -117,6 +122,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
+	if (!m->chunk_oid_lookup)
+		die(_("multi-pack-index missing required OID lookup chunk"));
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 
@@ -127,7 +134,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 		cur_pack_name += strlen(cur_pack_name) + 1;
 
 		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
-			error("MIDX pack names out of order: '%s' before '%s'",
+			error(_("multi-pack-index pack names out of order: '%s' before '%s'"),
 			      m->pack_names[i - 1],
 			      m->pack_names[i]);
 			goto cleanup_fail;
@@ -394,6 +401,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		if (i < nr_objects - 1) {
+			struct pack_midx_entry *next = list;
+			if (oidcmp(&obj->oid, &next->oid) >= 0)
+				BUG("OIDs not in order: %s >= %s",
+				oid_to_hex(&obj->oid),
+				oid_to_hex(&next->oid));
+		}
+
+		hashwrite(f, obj->oid.hash, (int)hash_len);
+		written += hash_len;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -407,7 +440,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t nr_entries;
-	struct pack_midx_entry *entries;
+	struct pack_midx_entry *entries = NULL;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -440,7 +473,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 1;
+	num_chunks = 2;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -448,9 +481,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -480,6 +517,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, packs.names, packs.nr);
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 88169b33e9..25f8530eb4 100644
--- a/object-store.h
+++ b/object-store.h
@@ -98,6 +98,7 @@ struct multi_pack_index {
 	uint32_t num_objects;
 
 	const unsigned char *chunk_pack_names;
+	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 0b53a9e8b5..60bca5b668 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -25,6 +25,8 @@ static int read_midx_file(const char *object_dir)
 
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_lookup)
+		printf(" oid_lookup");
 
 	printf("\n");
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 800fa7749c..47e1c7d99e 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 $NUM_PACKS
-	chunks: pack_names
+	header: 4d494458 1 2 $NUM_PACKS
+	chunks: pack_names oid_lookup
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
-- 
2.18.0.24.g1b579a2ee9


* [PATCH v2 14/24] midx: write object id fanout chunk
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (12 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 13/24] midx: write object ids in a chunk Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 15/24] midx: write object offsets Derrick Stolee
                     ` (10 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee
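
The fanout chunk added here stores, for each first byte i, the number of
OIDs whose first byte is at most i. The sketch below is not taken from
this patch; it is a minimal illustration, assuming the table has already
been converted to host byte order, of how such a cumulative table can be
built and then used to narrow a lookup to one slice of the sorted OID
list. The names build_fanout() and fanout_range() and the first_bytes
input array are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build the cumulative fanout table from the first
 * bytes of an already sorted OID list. fanout[i] ends up holding the
 * number of OIDs whose first byte is <= i, so fanout[255] == nr.
 */
static void build_fanout(const unsigned char *first_bytes, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i, count = 0;

	for (i = 0; i < 256; i++) {
		while (count < nr && first_bytes[count] == i)
			count++;
		fanout[i] = count;
	}
}

/*
 * Hypothetical helper: return the half-open range [*lo, *hi) of
 * positions in the sorted OID list that share the given first byte.
 * A binary search only needs to look inside this range.
 */
static void fanout_range(const uint32_t fanout[256], unsigned char first_byte,
			 uint32_t *lo, uint32_t *hi)
{
	*lo = first_byte ? fanout[first_byte - 1] : 0;
	*hi = fanout[first_byte];
}
```

On disk the entries are written with hashwrite_be32(), so a reader would
apply ntohl() to each entry before using it this way.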

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |  5 +++
 midx.c                                  | 53 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/helper/test-read-midx.c               |  4 +-
 t/t5319-multi-pack-index.sh             | 18 +++++----
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 78ee0489c6..3215f7bfcd 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -302,6 +302,11 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Fanout (ID: {'O', 'I', 'D', 'F'})
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects.
+
 	OID Lookup (ID: {'O', 'I', 'D', 'L'})
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
diff --git a/midx.c b/midx.c
index aec85b8181..0f773e2585 100644
--- a/midx.c
+++ b/midx.c
@@ -14,11 +14,13 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 2
+#define MIDX_MAX_CHUNKS 3
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -103,6 +105,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
@@ -122,9 +128,13 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
+	if (!m->chunk_oid_fanout)
+		die(_("multi-pack-index missing required OID fanout chunk"));
 	if (!m->chunk_oid_lookup)
 		die(_("multi-pack-index missing required OID lookup chunk"));
 
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 
 	cur_pack_name = (const char *)m->chunk_pack_names;
@@ -401,6 +411,35 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_fanout(struct hashfile *f,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	struct pack_midx_entry *last = objects + nr_objects;
+	uint32_t count = 0;
+	uint32_t i;
+
+	/*
+	* Write the first-level table (the list is sorted,
+	* but we use a 256-entry lookup to be able to avoid
+	* having to do eight extra binary search iterations).
+	*/
+	for (i = 0; i < 256; i++) {
+		struct pack_midx_entry *next = list;
+
+		while (next < last && next->oid.hash[0] == i) {
+			count++;
+			next++;
+		}
+
+		hashwrite_be32(f, count);
+		list = next;
+	}
+
+	return MIDX_CHUNK_FANOUT_SIZE;
+}
+
 static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 				    struct pack_midx_entry *objects,
 				    uint32_t nr_objects)
@@ -473,7 +512,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 2;
+	num_chunks = 3;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -481,9 +520,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
@@ -517,6 +560,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, packs.names, packs.nr);
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				written += write_midx_oid_fanout(f, entries, nr_entries);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
diff --git a/object-store.h b/object-store.h
index 25f8530eb4..3357e51100 100644
--- a/object-store.h
+++ b/object-store.h
@@ -98,6 +98,7 @@ struct multi_pack_index {
 	uint32_t num_objects;
 
 	const unsigned char *chunk_pack_names;
+	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 60bca5b668..d1bb7290ae 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -25,10 +25,12 @@ static int read_midx_file(const char *object_dir)
 
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_fanout)
+		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
 
-	printf("\n");
+	printf("\nnum_objects: %d\n", m->num_objects);
 
 	printf("packs:\n");
 	for (i = 0; i < m->num_packs; i++)
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 47e1c7d99e..ad0d447522 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -5,9 +5,11 @@ test_description='multi-pack-indexes'
 
 midx_read_expect() {
 	NUM_PACKS=$1
+	NUM_OBJECTS=$2
 	cat >expect <<- EOF
-	header: 4d494458 1 2 $NUM_PACKS
-	chunks: pack_names oid_lookup
+	header: 4d494458 1 3 $NUM_PACKS
+	chunks: pack_names oid_fanout oid_lookup
+	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
@@ -22,7 +24,7 @@ midx_read_expect() {
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 0
+	midx_read_expect 0 0
 '
 
 test_expect_success 'create objects' '
@@ -53,17 +55,17 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'Add more objects' '
-	for i in `test_seq 6 5`
+	for i in `test_seq 6 10`
 	do
 		iii=$(printf '%03i' $i)
 		test-tool genrandom "bar" 200 > wide_delta_$iii &&
@@ -89,7 +91,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2
+	midx_read_expect 2 33
 '
 
 test_expect_success 'Add more packs' '
@@ -120,7 +122,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12
+	midx_read_expect 12 73
 '
 
 test_done
-- 
2.18.0.24.g1b579a2ee9


* [PATCH v2 15/24] midx: write object offsets
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (13 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 14/24] midx: write object id fanout chunk Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 16/24] config: create core.multiPackIndex setting Derrick Stolee
                     ` (9 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

The final pair of chunks for the multi-pack-index file stores the object
offsets. We default to using 32-bit offsets as in the pack-index version
1 format, but if any offset does not fit in 32 bits, we use a trick
similar to the pack-index version 2 format: every offset of at least
2^31 is stored in a 64-bit table, and the 32-bit entry for that object
points into the 64-bit table instead.

We only store these 64-bit offsets if necessary, so create a test that
manipulates a version 2 pack-index to fake a large offset. This allows
us to test that the large offset table is created, but the data does not
match the actual packfile offsets. The multi-pack-index offset does match
the (corrupted) pack-index offset, so a future feature will compare these
offsets during a 'verify' step.
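
As an illustration of the encoding described above, here is a minimal
sketch of how a reader might resolve one entry of the object offsets
chunk. It is not part of this patch: resolve_offset() and its parameters
are hypothetical, and both the 32-bit entry and the large offset table
are assumed to be in host byte order already.

```c
#include <stddef.h>
#include <stdint.h>

#define LARGE_OFFSET_FLAG 0x80000000u /* same value as MIDX_LARGE_OFFSET_NEEDED */

/*
 * Hypothetical helper: turn one 32-bit offset entry into a 64-bit pack
 * offset. 'large' is the large offset table, or NULL when the file has
 * no large offset chunk. Returns UINT64_MAX if the entry points past
 * the end of the table.
 */
static uint64_t resolve_offset(uint32_t entry, const uint64_t *large,
			       size_t nr_large)
{
	uint32_t row;

	/*
	 * Without a large offset chunk every entry is a plain offset,
	 * even when its top bit happens to be set.
	 */
	if (!large || !(entry & LARGE_OFFSET_FLAG))
		return entry;

	/* Clearing the top bit yields the row in the large offset table. */
	row = entry & ~LARGE_OFFSET_FLAG;
	if (row >= nr_large)
		return UINT64_MAX;
	return large[row];
}
```

The NULL check matters because the writer only sets the flag bit when a
large offset chunk is being written; otherwise an offset between 2^31
and 2^32-1 is stored directly in the 32-bit entry.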

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |  15 +++-
 midx.c                                  | 100 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/helper/test-read-midx.c               |   4 +
 t/t5319-multi-pack-index.sh             |  45 ++++++++---
 5 files changed, 151 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 3215f7bfcd..cab5bdd2ff 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -311,7 +311,20 @@ CHUNK DATA:
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
 
-	(This section intentionally left incomplete.)
+	Object Offsets (ID: {'O', 'O', 'F', 'F'})
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^32, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the 31st bit is on, then removing that bit reveals
+		the row in the large offsets containing the 8-byte offset of
+		this object.
+
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
 
 TRAILER:
 
diff --git a/midx.c b/midx.c
index 0f773e2585..71ca493107 100644
--- a/midx.c
+++ b/midx.c
@@ -14,13 +14,18 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 3
+#define MIDX_MAX_CHUNKS 5
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -113,6 +118,14 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				m->chunk_object_offsets = m->data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				m->chunk_large_offsets = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
 				break;
@@ -132,6 +145,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 		die(_("multi-pack-index missing required OID fanout chunk"));
 	if (!m->chunk_oid_lookup)
 		die(_("multi-pack-index missing required OID lookup chunk"));
+	if (!m->chunk_object_offsets)
+		die(_("multi-pack-index missing required object offsets chunk"));
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
@@ -466,6 +481,56 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 	return written;
 }
 
+static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i, nr_large_offset = 0;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		hashwrite_be32(f, obj->pack_int_id);
+
+		if (large_offset_needed && obj->offset >> 31)
+			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
+		else if (!large_offset_needed && obj->offset >> 32)
+			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
+			    oid_to_hex(&obj->oid),
+			    obj->offset);
+		else
+			hashwrite_be32(f, (uint32_t)obj->offset);
+
+		written += MIDX_CHUNK_OFFSET_WIDTH;
+	}
+
+	return written;
+}
+
+static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
+				       struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	size_t written = 0;
+
+	while (nr_large_offset) {
+		struct pack_midx_entry *obj = list++;
+		uint64_t offset = obj->offset;
+
+		if (!(offset >> 31))
+			continue;
+
+		hashwrite_be32(f, offset >> 32);
+		hashwrite_be32(f, offset & 0xffffffffUL);
+		written += 2 * sizeof(uint32_t);
+
+		nr_large_offset--;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -478,8 +543,9 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries;
+	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
+	int large_offsets_needed = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -506,13 +572,19 @@ int write_midx_file(const char *object_dir)
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
 	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+	for (i = 0; i < nr_entries; i++) {
+		if (entries[i].offset > 0x7fffffff)
+			num_large_offsets++;
+		if (entries[i].offset > 0xffffffff)
+			large_offsets_needed = 1;
+	}
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 3;
+	num_chunks = large_offsets_needed ? 5 : 4;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -528,9 +600,21 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
 
+	cur_chunk++;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	if (large_offsets_needed) {
+		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
+
+		cur_chunk++;
+		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
+					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+	}
+
+	chunk_ids[cur_chunk] = 0;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -568,6 +652,14 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 3357e51100..07bcc80e02 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,6 +100,8 @@ struct multi_pack_index {
 	const unsigned char *chunk_pack_names;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_object_offsets;
+	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index d1bb7290ae..20771d1c1d 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -29,6 +29,10 @@ static int read_midx_file(const char *object_dir)
 		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
+	if (m->chunk_object_offsets)
+		printf(" object_offsets");
+	if (m->chunk_large_offsets)
+		printf(" large_offsets");
 
 	printf("\nnum_objects: %d\n", m->num_objects);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ad0d447522..ccde83bca4 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,25 +6,28 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
+	NUM_CHUNKS=$3
+	OBJECT_DIR=$4
+	EXTRA_CHUNKS="$5"
 	cat >expect <<- EOF
-	header: 4d494458 1 3 $NUM_PACKS
-	chunks: pack_names oid_fanout oid_lookup
+	header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
+	chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
 	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
 	then
-		ls pack/ | grep idx | sort >> expect
+		ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
 	fi
-	printf "object_dir: .\n" >>expect &&
-	test-tool read-midx . >actual &&
+	printf "object_dir: $OBJECT_DIR\n" >>expect &&
+	test-tool read-midx $OBJECT_DIR >actual &&
 	test_cmp expect actual
 }
 
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 0 0
+	midx_read_expect 0 0 4 .
 '
 
 test_expect_success 'create objects' '
@@ -55,13 +58,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 4 .
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 4 .
 '
 
 test_expect_success 'Add more objects' '
@@ -91,7 +94,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2 33
+	midx_read_expect 2 33 4 .
 '
 
 test_expect_success 'Add more packs' '
@@ -122,7 +125,29 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12 73
+	midx_read_expect 12 73 4 .
+'
+
+
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+# Force 64-bit offsets by manipulating the idx file.
+# This makes the IDX file _incorrect_ so be careful to clean up after!
+test_expect_success 'force some 64-bit offsets with pack-objects' '
+	mkdir objects64 &&
+	mkdir objects64/pack &&
+	pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
+	idx64=objects64/pack/test-64-$pack64.idx &&
+	chmod u+w $idx64 &&
+	corrupt_data $idx64 2899 "\02" &&
+	midx64=$(git multi-pack-index write --object-dir=objects64) &&
+	midx_read_expect 1 62 5 objects64 " large_offsets"
 '
 
 test_done
-- 
2.18.0.24.g1b579a2ee9


* [PATCH v2 16/24] config: create core.multiPackIndex setting
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (14 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 15/24] midx: write object offsets Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 17/24] midx: prepare midxed_git struct Derrick Stolee
                     ` (8 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

The core.multiPackIndex config setting controls the multi-pack-
index (MIDX) feature. If false, the setting will disable all reads
from the multi-pack-index file.

Add comparison commands to t5319-multi-pack-index.sh to check that
typical Git behavior remains the same whether the config setting is
turned on or off. This currently includes 'git rev-list' and 'git log'
commands to trigger several object database reads.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt    |  4 +++
 cache.h                     |  1 +
 config.c                    |  5 ++++
 environment.c               |  1 +
 t/t5319-multi-pack-index.sh | 55 +++++++++++++++++++++++++++++++------
 5 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..ab895ebb32 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -908,6 +908,10 @@ core.commitGraph::
 	Enable git commit graph feature. Allows reading from the
 	commit-graph file.
 
+core.multiPackIndex::
+	Use the multi-pack-index file to track multiple packfiles using a
+	single index. See linkgit:technical/multi-pack-index[1].
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 89a107a7f7..d12aa49710 100644
--- a/cache.h
+++ b/cache.h
@@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
+extern int core_multi_pack_index;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index fbbf0f8e9f..95d8da4243 100644
--- a/config.c
+++ b/config.c
@@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.multipackindex")) {
+		core_multi_pack_index = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2a6de2330b..b9bc919cdb 100644
--- a/environment.c
+++ b/environment.c
@@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
+int core_multi_pack_index;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ccde83bca4..f7f55ea181 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,8 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+objdir=.git/objects
+
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
@@ -61,12 +63,42 @@ test_expect_success 'write midx with one v1 pack' '
 	midx_read_expect 1 17 4 .
 '
 
+midx_git_two_modes() {
+	git -c core.multiPackIndex=false $1 >expect &&
+	git -c core.multiPackIndex=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
 test_expect_success 'write midx with one v2 pack' '
-	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17 4 .
+	git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 1 17 4 $objdir
 '
 
+midx_git_two_modes() {
+	git -c core.multiPackIndex=false $1 >expect &&
+	git -c core.multiPackIndex=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
+compare_results_with_midx "one v2 pack"
+
 test_expect_success 'Add more objects' '
 	for i in `test_seq 6 10`
 	do
@@ -92,11 +124,13 @@ test_expect_success 'Add more objects' '
 '
 
 test_expect_success 'write midx with two packs' '
-	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2 33 4 .
+	git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list2 &&
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 2 33 4 $objdir
 '
 
+compare_results_with_midx "two packs"
+
 test_expect_success 'Add more packs' '
 	for j in `test_seq 1 10`
 	do
@@ -117,17 +151,20 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 pack/test-pack <obj-list &&
+		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
 '
 
+compare_results_with_midx "mixed mode (two packs + extra)"
+
 test_expect_success 'write midx with twelve packs' '
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12 73 4 .
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 12 73 4 $objdir
 '
 
+compare_results_with_midx "twelve packs"
 
 # usage: corrupt_data <file> <pos> [<data>]
 corrupt_data() {
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 17/24] midx: prepare midxed_git struct
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (15 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 16/24] config: create core.multiPackIndex setting Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 18/24] midx: read objects from multi-pack-index Derrick Stolee
                     ` (7 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
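Reader's note (not part of the patch): prepare_multi_pack_index_one()
keeps one multi_pack_index per object directory on a singly linked
list, skips directories that are already present, and puts a newly
loaded index at the front. The standalone sketch below illustrates that
list discipline only. The struct and function names are hypothetical
stand-ins, and it allocates a node where the real code would call
load_multi_pack_index().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct multi_pack_index: list fields only. */
struct midx_node {
	struct midx_node *next;
	char *object_dir;
};

/*
 * Prepend a node for object_dir unless one is already on the list.
 * Returns 1 if the list holds a node for object_dir afterwards and
 * 0 if a new node could not be created. On failure the existing
 * list is left untouched.
 */
int midx_list_prepare_one(struct midx_node **head, const char *object_dir)
{
	struct midx_node *n;

	assert(head && object_dir);

	for (n = *head; n; n = n->next)
		if (!strcmp(object_dir, n->object_dir))
			return 1; /* already present: no duplicate entry */

	n = malloc(sizeof(*n));
	if (!n)
		return 0;
	n->object_dir = malloc(strlen(object_dir) + 1);
	if (!n->object_dir) {
		free(n);
		return 0;
	}
	strcpy(n->object_dir, object_dir);

	n->next = *head; /* new node goes in front of the old list */
	*head = n;
	return 1;
}

/* Count the nodes on the list. */
size_t midx_list_length(const struct midx_node *head)
{
	size_t len = 0;

	for (; head; head = head->next)
		len++;
	return len;
}
```

Calling midx_list_prepare_one() twice with the same directory leaves a
single node for it, which mirrors the strcmp() loop in the patch.
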
 midx.c         | 22 ++++++++++++++++++++++
 midx.h         |  3 +++
 object-store.h |  9 +++++++++
 packfile.c     |  6 +++++-
 4 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 71ca493107..3dd5027dc6 100644
--- a/midx.c
+++ b/midx.c
@@ -176,6 +176,28 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return NULL;
 }
 
+int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
+{
+	struct multi_pack_index *m = r->objects->multi_pack_index;
+	struct multi_pack_index *m_search;
+
+	if (!core_multi_pack_index)
+		return 0;
+
+	for (m_search = m; m_search; m_search = m_search->next)
+		if (!strcmp(object_dir, m_search->object_dir))
+			return 1;
+
+	m_search = load_multi_pack_index(object_dir);
+	if (m_search) {
+		m_search->next = m;
+		r->objects->multi_pack_index = m_search;
+		return 1;
+	}
+
+	return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index 2d83dd9ec1..731ad6f094 100644
--- a/midx.h
+++ b/midx.h
@@ -1,9 +1,12 @@
 #ifndef __MIDX_H__
 #define __MIDX_H__
 
+#include "repository.h"
+
 struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 07bcc80e02..7d67ad7aa9 100644
--- a/object-store.h
+++ b/object-store.h
@@ -85,6 +85,8 @@ struct packed_git {
 };
 
 struct multi_pack_index {
+	struct multi_pack_index *next;
+
 	int fd;
 
 	const unsigned char *data;
@@ -126,6 +128,13 @@ struct raw_object_store {
 	 */
 	struct oidmap *replace_map;
 
+	/*
+	 * private data
+	 *
+	 * should only be accessed directly by packfile.c and midx.c
+	 */
+	struct multi_pack_index *multi_pack_index;
+
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 39d6b66337..ff2df22a0b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -937,10 +938,13 @@ static void prepare_packed_git(struct repository *r)
 
 	if (r->objects->packed_git_initialized)
 		return;
+	prepare_multi_pack_index_one(r, r->objects->objectdir);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+		prepare_multi_pack_index_one(r, alt->path);
 		prepare_packed_git_one(r, alt->path, 0);
+	}
 	rearrange_packed_git(r);
 	prepare_packed_git_mru(r);
 	r->objects->packed_git_initialized = 1;
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 18/24] midx: read objects from multi-pack-index
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (16 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 17/24] midx: prepare midxed_git struct Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 19/24] midx: use midx in abbreviation calculations Derrick Stolee
                     ` (6 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
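Reader's note (not part of the patch): nth_midxed_pack_int_id() and
nth_midxed_offset() read one fixed-width record per object. The first
four bytes are the pack-int-id and the next four are the offset, both
big-endian. When the high bit of the offset is set and a large-offsets
chunk exists, the remaining 31 bits index a table of 64-bit offsets.
The sketch below decodes such records from plain byte arrays. The
constant values (8-byte records, 0x80000000 flag) are assumptions
inferred from how the patch uses MIDX_CHUNK_OFFSET_WIDTH and
MIDX_LARGE_OFFSET_NEEDED, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the real constants are defined elsewhere in midx.c. */
#define SKETCH_OFFSET_WIDTH 8u /* 4-byte pack id + 4-byte offset */
#define SKETCH_LARGE_OFFSET_NEEDED 0x80000000u

/* Read big-endian integers byte by byte, independent of host order. */
uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

/* Pack id of the object at position pos in the offsets chunk. */
uint32_t sketch_pack_int_id(const unsigned char *offsets, uint32_t pos)
{
	assert(offsets);
	return sketch_get_be32(offsets + (size_t)pos * SKETCH_OFFSET_WIDTH);
}

/*
 * Offset of the object at position pos. If the high bit is set and a
 * large-offsets chunk is available, the low 31 bits select an entry
 * in that chunk instead of being the offset itself.
 */
uint64_t sketch_offset(const unsigned char *offsets,
		       const unsigned char *large_offsets, uint32_t pos)
{
	uint32_t offset32;

	assert(offsets);
	offset32 = sketch_get_be32(offsets +
				   (size_t)pos * SKETCH_OFFSET_WIDTH + 4);

	if (large_offsets && (offset32 & SKETCH_LARGE_OFFSET_NEEDED)) {
		offset32 ^= SKETCH_LARGE_OFFSET_NEEDED;
		return sketch_get_be64(large_offsets + (size_t)offset32 * 8);
	}
	return offset32;
}
```

The sketch returns uint64_t so that it does not depend on the width of
off_t; the patch instead dies when a 64-bit offset cannot be
represented.
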
 midx.c         | 93 ++++++++++++++++++++++++++++++++++++++++++++++++--
 midx.h         |  2 ++
 object-store.h |  1 +
 packfile.c     |  8 ++++-
 4 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 3dd5027dc6..14514d6828 100644
--- a/midx.c
+++ b/midx.c
@@ -4,7 +4,7 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "object-store.h"
-#include "packfile.h"
+#include "sha1-lookup.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -150,7 +150,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
-	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+	m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
+	ALLOC_ARRAY(m->pack_names, m->num_packs);
 
 	cur_pack_name = (const char *)m->chunk_pack_names;
 	for (i = 0; i < m->num_packs; i++) {
@@ -176,6 +177,94 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return NULL;
 }
 
+static int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->num_packs)
+		BUG("bad pack-int-id");
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir,
+		    m->pack_names[pack_int_id]);
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
+{
+	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
+			    MIDX_HASH_LEN, result);
+}
+
+static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
+{
+	const unsigned char *offset_data;
+	uint32_t offset32;
+
+	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset32 = get_be32(offset_data + sizeof(uint32_t));
+
+	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
+		if (sizeof(off_t) < sizeof(uint64_t))
+			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
+
+		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
+	}
+
+	return offset32;
+}
+
+static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
+{
+	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+}
+
+static int nth_midxed_pack_entry(struct multi_pack_index *m, struct pack_entry *e, uint32_t pos)
+{
+	uint32_t pack_int_id;
+	struct packed_git *p;
+
+	if (pos >= m->num_objects)
+		return 0;
+
+	pack_int_id = nth_midxed_pack_int_id(m, pos);
+
+	if (prepare_midx_pack(m, pack_int_id))
+		die(_("error preparing packfile from multi-pack-index"));
+	p = m->packs[pack_int_id];
+
+	/*
+	 * We are about to tell the caller where they can locate the
+	 * requested object.  We better make sure the packfile is
+	 * still here and can be accessed before supplying that
+	 * answer, as it may have been deleted since the MIDX was
+	 * loaded!
+	 */
+	if (!is_pack_valid(p))
+		return 0;
+
+	e->offset = nth_midxed_offset(m, pos);
+	e->p = p;
+
+	return 1;
+}
+
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m)
+{
+	uint32_t pos;
+
+	if (!bsearch_midx(oid, m, &pos))
+		return 0;
+
+	return nth_midxed_pack_entry(m, e, pos);
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
 {
 	struct multi_pack_index *m = r->objects->multi_pack_index;
diff --git a/midx.h b/midx.h
index 731ad6f094..6b74a0640f 100644
--- a/midx.h
+++ b/midx.h
@@ -6,6 +6,8 @@
 struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/object-store.h b/object-store.h
index 7d67ad7aa9..03cc278758 100644
--- a/object-store.h
+++ b/object-store.h
@@ -106,6 +106,7 @@ struct multi_pack_index {
 	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
+	struct packed_git **packs;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/packfile.c b/packfile.c
index ff2df22a0b..946d0c241f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1904,11 +1904,17 @@ static int fill_pack_entry(const struct object_id *oid,
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
+	struct multi_pack_index *m;
 
 	prepare_packed_git(r);
-	if (!r->objects->packed_git)
+	if (!r->objects->packed_git && !r->objects->multi_pack_index)
 		return 0;
 
+	for (m = r->objects->multi_pack_index; m; m = m->next) {
+		if (fill_midx_entry(oid, e, m))
+			return 1;
+	}
+
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
 		if (fill_pack_entry(oid, e, p)) {
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 19/24] midx: use midx in abbreviation calculations
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (17 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 18/24] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 20/24] midx: use existing midx when writing new one Derrick Stolee
                     ` (5 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
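Reader's note (not part of the patch): find_abbrev_len_for_midx()
relies on the object ids being sorted. After a binary search, only the
entries immediately before and after the search position can share the
longest prefix with the target, so at most two comparisons are needed.
The sketch below shows that idea on a flat, sorted array of 20-byte
hashes. All names are hypothetical, and it searches the whole table
rather than narrowing the range with the fanout chunk as bsearch_midx()
does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_HASH_LEN 20

/* Number of leading hex digits that a and b have in common. */
unsigned sketch_common_hex_digits(const unsigned char *a,
				  const unsigned char *b)
{
	unsigned i;

	for (i = 0; i < SKETCH_HASH_LEN; i++) {
		if (a[i] == b[i])
			continue;
		/* An equal high nibble is one more shared digit. */
		return 2 * i + ((a[i] >> 4) == (b[i] >> 4));
	}
	return 2 * SKETCH_HASH_LEN;
}

/*
 * Binary search a sorted, flat table of nr hashes. Sets *pos to the
 * index of the match, or to the insertion point if there is none.
 * Returns 1 on a match and 0 otherwise.
 */
int sketch_bsearch(const unsigned char *table, size_t nr,
		   const unsigned char *hash, size_t *pos)
{
	size_t lo = 0, hi = nr;

	assert(hash && pos && (table || !nr));
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(hash, table + mid * SKETCH_HASH_LEN,
				 SKETCH_HASH_LEN);
		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*pos = lo;
	return 0;
}

/* Shortest hex prefix of hash that no other table entry shares. */
unsigned sketch_min_abbrev(const unsigned char *table, size_t nr,
			   const unsigned char *hash)
{
	size_t pos, after;
	unsigned len = 1, shared;
	int match = sketch_bsearch(table, nr, hash, &pos);

	after = match ? pos + 1 : pos; /* first entry sorting after hash */
	if (after < nr) {
		shared = sketch_common_hex_digits(
			hash, table + after * SKETCH_HASH_LEN);
		if (shared + 1 > len)
			len = shared + 1;
	}
	if (pos > 0) {
		shared = sketch_common_hex_digits(
			hash, table + (pos - 1) * SKETCH_HASH_LEN);
		if (shared + 1 > len)
			len = shared + 1;
	}
	return len;
}
```

For a table holding 1234..., 123f... and abcd..., the entry 123f...
needs four hex digits, because its first three are shared with 1234....
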
 midx.c                      | 11 ++++++
 midx.h                      |  3 ++
 packfile.c                  |  6 ++++
 packfile.h                  |  1 +
 sha1-name.c                 | 70 +++++++++++++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh |  3 +-
 6 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 14514d6828..c258e3ebdf 100644
--- a/midx.c
+++ b/midx.c
@@ -201,6 +201,17 @@ int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32
 			    MIDX_HASH_LEN, result);
 }
 
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct multi_pack_index *m,
+					uint32_t n)
+{
+	if (n >= m->num_objects)
+		return NULL;
+
+	hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n);
+	return oid;
+}
+
 static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
diff --git a/midx.h b/midx.h
index 6b74a0640f..f7c2ec7893 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,9 @@ struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct multi_pack_index *m,
+					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
diff --git a/packfile.c b/packfile.c
index 946d0c241f..20b743da91 100644
--- a/packfile.c
+++ b/packfile.c
@@ -963,6 +963,12 @@ struct packed_git *get_packed_git(struct repository *r)
 	return r->objects->packed_git;
 }
 
+struct multi_pack_index *get_multi_pack_index(struct repository *r)
+{
+	prepare_packed_git(r);
+	return r->objects->multi_pack_index;
+}
+
 struct list_head *get_packed_git_mru(struct repository *r)
 {
 	prepare_packed_git(r);
diff --git a/packfile.h b/packfile.h
index b0eed44c0b..046280caf3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -45,6 +45,7 @@ extern void install_packed_git(struct repository *r, struct packed_git *pack);
 
 struct packed_git *get_packed_git(struct repository *r);
 struct list_head *get_packed_git_mru(struct repository *r);
+struct multi_pack_index *get_multi_pack_index(struct repository *r);
 
 /*
  * Give a rough count of objects in the repository. This sacrifices accuracy
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..7dc71201e6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -12,6 +12,7 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "repository.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, const unsigned char *
 	return 1;
 }
 
+static void unique_in_midx(struct multi_pack_index *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+	num = m->num_objects;
+
+	if (!num)
+		return;
+
+	bsearch_midx(&ds->bin_pfx, m, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx".  See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void unique_in_pack(struct packed_git *p,
 			   struct disambiguate_state *ds)
 {
@@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p,
 
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
+	struct multi_pack_index *m;
 	struct packed_git *p;
 
+	for (m = get_multi_pack_index(the_repository); m && !ds->ambiguous;
+	     m = m->next)
+		unique_in_midx(m, ds);
 	for (p = get_packed_git(the_repository); p && !ds->ambiguous;
 	     p = p->next)
 		unique_in_pack(p, ds);
@@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct multi_pack_index *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, first = 0;
+	struct object_id oid;
+	const struct object_id *mad_oid;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	mad_oid = mad->oid;
+	match = bsearch_midx(mad_oid, m, &first);
+
+	/*
+	 * first is now the position in the multi-pack-index where we would
+	 * insert mad->oid if it does not exist (or the position of mad->oid
+	 * if it does exist). Hence, we consider a maximum of two objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		if (nth_midxed_object_oid(&oid, m, first))
+			extend_abbrev_len(&oid, mad);
+	} else if (first < num - 1) {
+		if (nth_midxed_object_oid(&oid, m, first + 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		if (nth_midxed_object_oid(&oid, m, first - 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
+	struct multi_pack_index *m;
 	struct packed_git *p;
 
+	for (m = get_multi_pack_index(the_repository); m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = get_packed_git(the_repository); p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index f7f55ea181..d8a636c7b7 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -93,7 +93,8 @@ compare_results_with_midx() {
 	MSG=$1
 	test_expect_success "check normal git operations: $MSG" '
 		midx_git_two_modes "rev-list --objects --all" &&
-		midx_git_two_modes "log --raw"
+		midx_git_two_modes "log --raw" &&
+		midx_git_two_modes "log --oneline"
 	'
 }
 
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 20/24] midx: use existing midx when writing new one
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (18 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 19/24] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 21/24] midx: use midx in approximate_object_count Derrick Stolee
                     ` (4 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
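Reader's note (not part of the patch): get_sorted_entries() walks the
existing multi-pack-index one fanout bucket at a time. Entry b of the
fanout table counts the objects whose first hash byte is at most b, so
the objects with first byte b sit at positions [fanout[b-1], fanout[b])
and fanout[255] is the total. The sketch below builds and queries such
a table. The names are hypothetical, and the values are kept in host
byte order, whereas the patch converts each entry with ntohl().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a cumulative fanout table from the first hash bytes of nr
 * objects. first_bytes must be sorted in ascending order.
 */
void sketch_build_fanout(const unsigned char *first_bytes, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i = 0;
	unsigned b;

	for (b = 0; b < 256; b++) {
		while (i < nr && first_bytes[i] <= b)
			i++;
		fanout[b] = i; /* objects with first byte <= b */
	}
}

/* Positions [*start, *end) hold the objects whose first byte is b. */
void sketch_fanout_range(const uint32_t fanout[256], unsigned b,
			 uint32_t *start, uint32_t *end)
{
	assert(b < 256);
	*start = b ? fanout[b - 1] : 0;
	*end = fanout[b];
}
```

An empty bucket yields start == end, so a loop over the range simply
does nothing for it.
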
 midx.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 midx.h |   1 +
 2 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/midx.c b/midx.c
index c258e3ebdf..02cbfc5bd5 100644
--- a/midx.c
+++ b/midx.c
@@ -47,7 +47,6 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	fd = git_open(midx_name);
 
 	if (fd < 0) {
-		error_errno(_("failed to read %s"), midx_name);
 		FREE_AND_NULL(midx_name);
 		return NULL;
 	}
@@ -276,6 +275,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mu
 	return nth_midxed_pack_entry(m, e, pos);
 }
 
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(idx_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
 {
 	struct multi_pack_index *m = r->objects->multi_pack_index;
@@ -322,6 +344,7 @@ struct pack_list
 	uint32_t alloc_list;
 	uint32_t alloc_names;
 	size_t pack_name_concat_len;
+	struct multi_pack_index *m;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -330,6 +353,9 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 	struct pack_list *packs = (struct pack_list *)data;
 
 	if (ends_with(file_name, ".idx")) {
+		if (packs->m && midx_contains_pack(packs->m, file_name))
+			return;
+
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
@@ -413,6 +439,23 @@ static int midx_oid_compare(const void *_a, const void *_b)
 	return a->pack_int_id - b->pack_int_id;
 }
 
+static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
+				      uint32_t *pack_perm,
+				      struct pack_midx_entry *e,
+				      uint32_t pos)
+{
+	if (pos >= m->num_objects)
+		return 1;
+
+	nth_midxed_object_oid(&e->oid, m, pos);
+	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->offset = nth_midxed_offset(m, pos);
+
+	/* consider objects in midx to be from "old" packs */
+	e->pack_mtime = 0;
+	return 0;
+}
+
 static void fill_pack_entry(uint32_t pack_int_id,
 			    struct packed_git *p,
 			    uint32_t cur_object,
@@ -438,7 +481,8 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * Copy only the de-duplicated entries (selected by most-recent modified time
  * of a packfile containing the object).
  */
-static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
+						  struct packed_git **p,
 						  uint32_t *perm,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
@@ -447,8 +491,9 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	uint32_t alloc_fanout, alloc_objects, total_objects = 0;
 	struct pack_midx_entry *entries_by_fanout = NULL;
 	struct pack_midx_entry *deduplicated_entries = NULL;
+	uint32_t start_pack = m ? m->num_packs : 0;
 
-	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 		total_objects += p[cur_pack]->num_objects;
 	}
 
@@ -466,7 +511,23 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
 		uint32_t nr_fanout = 0;
 
-		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (m) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]);
+			end = ntohl(m->chunk_oid_fanout[cur_fanout]);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				nth_midxed_pack_midx_entry(m, perm,
+							   &entries_by_fanout[nr_fanout],
+							   cur_object);
+				nr_fanout++;
+			}
+		}
+
+		for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
@@ -661,7 +722,7 @@ int write_midx_file(const char *object_dir)
 	struct hashfile *f;
 	struct lock_file lk;
 	struct pack_list packs;
-	uint32_t *pack_perm;
+	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
@@ -676,24 +737,42 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	packs.m = load_multi_pack_index(object_dir);
+
 	packs.nr = 0;
-	packs.alloc_list = 16;
-	packs.alloc_names = 16;
+	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
+	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
+	if (packs.m) {
+		for (i = 0; i < packs.m->num_packs; i++) {
+			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
+			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+
+			packs.list[packs.nr] = NULL;
+			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
+			packs.nr++;
+		}
+	}
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
+	if (packs.m && packs.nr == packs.m->num_packs)
+		goto cleanup;
+
 	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
-					(packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
+	ALLOC_ARRAY(pack_perm, packs.alloc_list);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
-	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
@@ -796,14 +875,18 @@ int write_midx_file(const char *object_dir)
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+cleanup:
 	for (i = 0; i < packs.nr; i++) {
-		close_pack(packs.list[i]);
-		FREE_AND_NULL(packs.list[i]);
+		if (packs.list[i]) {
+			close_pack(packs.list[i]);
+			FREE_AND_NULL(packs.list[i]);
+		}
 		FREE_AND_NULL(packs.names[i]);
 	}
 
 	FREE_AND_NULL(packs.list);
 	FREE_AND_NULL(packs.names);
 	FREE_AND_NULL(entries);
+	FREE_AND_NULL(pack_perm);
 	return 0;
 }
diff --git a/midx.h b/midx.h
index f7c2ec7893..5faffb7bc6 100644
--- a/midx.h
+++ b/midx.h
@@ -11,6 +11,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct multi_pack_index *m,
 					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 21/24] midx: use midx in approximate_object_count
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (19 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 20/24] midx: use existing midx when writing new one Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 22/24] midx: prevent duplicate packfile loads Derrick Stolee
                     ` (3 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 20b743da91..e72e8a685d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -863,11 +863,14 @@ unsigned long approximate_object_count(void)
 {
 	if (!the_repository->objects->approximate_object_count_valid) {
 		unsigned long count;
+		struct multi_pack_index *m;
 		struct packed_git *p;
 
 		prepare_packed_git(the_repository);
 		count = 0;
-		for (p = the_repository->objects->packed_git; p; p = p->next) {
+		for (m = get_multi_pack_index(the_repository); m; m = m->next)
+			count += m->num_objects;
+		for (p = get_packed_git(the_repository); p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
 			count += p->num_objects;
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 22/24] midx: prevent duplicate packfile loads
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (20 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 21/24] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
                     ` (2 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

The multi-pack-index, when present, tracks the existence of objects and
their offsets within a list of packfiles. This allows us to use the
multi-pack-index for object lookups, abbreviations, and object counts.

When the multi-pack-index tracks a packfile, then we do not need to add
that packfile to the packed_git linked list or the MRU list.

We still need to load the packfiles that are not tracked by the
multi-pack-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/packfile.c b/packfile.c
index e72e8a685d..f2b8d6f8a7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -796,6 +796,7 @@ struct prepare_pack_data
 	struct repository *r;
 	struct string_list *garbage;
 	int local;
+	struct multi_pack_index *m;
 };
 
 static void prepare_pack(const char *full_name, size_t full_name_len, const char *file_name, void *_data)
@@ -805,6 +806,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len, const char
 	size_t base_len = full_name_len;
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		if (data->m && midx_contains_pack(data->m, file_name))
+			return;
 		/* Don't reopen a pack we already have. */
 		for (p = data->r->objects->packed_git; p; p = p->next) {
 			size_t len;
@@ -841,6 +844,12 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	struct prepare_pack_data data;
 	struct string_list garbage = STRING_LIST_INIT_DUP;
 
+	data.m = r->objects->multi_pack_index;
+
+	/* look for the multi-pack-index for this object directory */
+	while (data.m && strcmp(data.m->object_dir, objdir))
+		data.m = data.m->next;
+
 	data.r = r;
 	data.garbage = &garbage;
 	data.local = local;
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 23/24] packfile: skip loading index if in multi-pack-index
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (21 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 22/24] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-06-25 14:34   ` [PATCH v2 24/24] midx: clear midx on repack Derrick Stolee
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index f2b8d6f8a7..acd02430a8 100644
--- a/packfile.c
+++ b/packfile.c
@@ -469,8 +469,19 @@ static int open_packed_git_1(struct packed_git *p)
 	ssize_t read_result;
 	const unsigned hashsz = the_hash_algo->rawsz;
 
-	if (!p->index_data && open_pack_index(p))
-		return error("packfile %s index unavailable", p->pack_name);
+	if (!p->index_data) {
+		struct multi_pack_index *m;
+		const char *pack_name = strrchr(p->pack_name, '/');
+
+		for (m = the_repository->objects->multi_pack_index;
+		     m; m = m->next) {
+			if (midx_contains_pack(m, pack_name))
+				break;
+		}
+
+		if (!m && open_pack_index(p))
+			return error("packfile %s index unavailable", p->pack_name);
+	}
 
 	if (!pack_max_fds) {
 		unsigned int max_fds = get_max_fd_limit();
@@ -521,6 +532,10 @@ static int open_packed_git_1(struct packed_git *p)
 			" supported (try upgrading GIT to a newer version)",
 			p->pack_name, ntohl(hdr.hdr_version));
 
+	/* Skip index checking if in multi-pack-index */
+	if (!p->index_data)
+		return 0;
+
 	/* Verify the pack matches its index. */
 	if (p->num_objects != ntohl(hdr.hdr_entries))
 		return error("packfile %s claims to have %"PRIu32" objects"
-- 
2.18.0.24.g1b579a2ee9



* [PATCH v2 24/24] midx: clear midx on repack
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (22 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
@ 2018-06-25 14:34   ` Derrick Stolee
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-06-25 14:34 UTC (permalink / raw)
  To: git; +Cc: sbeller, pclouds, avarab, Derrick Stolee

If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 8 ++++++++
 midx.c           | 8 ++++++++
 midx.h           | 1 +
 3 files changed, 17 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int midx_cleared = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				continue;
 			}
 
+			if (!midx_cleared) {
+				/* if we move a packfile, it will invalidate the midx */
+				clear_midx_file(get_object_directory());
+				midx_cleared = 1;
+			}
+
 			fname_old = mkpathdup("%s/old-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index 02cbfc5bd5..ef9fb38610 100644
--- a/midx.c
+++ b/midx.c
@@ -890,3 +890,11 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(pack_perm);
 	return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+	char *midx = get_midx_filename(object_dir);
+
+	if (remove_path(midx))
+		die(_("failed to clear multi-pack-index at %s"), midx);
+}
diff --git a/midx.h b/midx.h
index 5faffb7bc6..5a42cbed1d 100644
--- a/midx.h
+++ b/midx.h
@@ -15,5 +15,6 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.24.g1b579a2ee9



* Re: [PATCH v2 03/24] multi-pack-index: add builtin
  2018-06-25 14:34   ` [PATCH v2 03/24] multi-pack-index: add builtin Derrick Stolee
@ 2018-06-25 19:15     ` Junio C Hamano
  0 siblings, 0 replies; 192+ messages in thread
From: Junio C Hamano @ 2018-06-25 19:15 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> This new 'git multi-pack-index' builtin will be the plumbing access
> for writing, reading, and checking multi-pack-index files. The
> initial implementation is a no-op.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  .gitignore                             |  3 +-
>  Documentation/git-multi-pack-index.txt | 36 ++++++++++++++++++++++++
>  Makefile                               |  1 +
>  builtin.h                              |  1 +
>  builtin/multi-pack-index.c             | 38 ++++++++++++++++++++++++++
>  command-list.txt                       |  1 +
>  git.c                                  |  1 +
>  7 files changed, 80 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/git-multi-pack-index.txt
>  create mode 100644 builtin/multi-pack-index.c
>
> diff --git a/.gitignore b/.gitignore
> index 388cc4beee..25633bc515 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -99,8 +99,9 @@
>  /git-mergetool--lib
>  /git-mktag
>  /git-mktree
> -/git-name-rev
> +/git-multi-pack-index
>  /git-mv
> +/git-name-rev

Nice attention to detail (even though the patch as an
incremental change gets distracting, the result is better
and future changes to the file will read cleaner).

> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> new file mode 100644
> index 0000000000..9877f9c441
> --- /dev/null
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -0,0 +1,36 @@
> +git-multi-pack-index(1)
> +======================
> +
> +NAME
> +----
> +git-multi-pack-index - Write and verify multi-pack-indexes
> +
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git multi-pack-index' [--object-dir <dir>]
> +
> +DESCRIPTION
> +-----------
> +Write or verify a multi-pack-index (MIDX) file.
> +
> +OPTIONS
> +-------
> +
> +--object-dir <dir>::
> +	Use given directory for the location of Git objects. We check
> +	<dir>/packs/multi-pack-index for the current MIDX file, and
> +	<dir>/packs for the pack-files to index.

Do we want to `quote` these constant strings?

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> new file mode 100644
> index 0000000000..f101873525
> --- /dev/null
> +++ b/builtin/multi-pack-index.c
> @@ -0,0 +1,38 @@
> +#include "builtin.h"
> +#include "cache.h"
> +#include "config.h"
> +#include "parse-options.h"
> +
> +static char const * const builtin_multi_pack_index_usage[] ={

ERROR: spaces required around that '=' (ctx:WxV)
#112: FILE: builtin/multi-pack-index.c:6:
+static char const * const builtin_multi_pack_index_usage[] ={
                                                            ^



* Re: [PATCH v2 05/24] midx: write header information to lockfile
  2018-06-25 14:34   ` [PATCH v2 05/24] midx: write header information to lockfile Derrick Stolee
@ 2018-06-25 19:19     ` Junio C Hamano
  2018-07-05 19:13       ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Junio C Hamano @ 2018-06-25 19:19 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> +#define MIDX_VERSION 1
> +#define MIDX_HASH_VERSION 1
> +#define MIDX_HEADER_SIZE 12
> +
> +static char *get_midx_filename(const char *object_dir)
> +{
> +	return xstrfmt("%s/pack/multi-pack-index", object_dir);
> +}
> +
> +static size_t write_midx_header(struct hashfile *f,
> +				unsigned char num_chunks,
> +				uint32_t num_packs)
> +{
> +	unsigned char byte_values[4];
> +	hashwrite_be32(f, MIDX_SIGNATURE);

WARNING: Missing a blank line after declarations
#48: FILE: midx.c:21:
+       unsigned char byte_values[4];
+       hashwrite_be32(f, MIDX_SIGNATURE);

> +	byte_values[0] = MIDX_VERSION;
> +	byte_values[1] = MIDX_HASH_VERSION;
> +	byte_values[2] = num_chunks;
> +	byte_values[3] = 0; /* unused */
> +	hashwrite(f, byte_values, sizeof(byte_values));
> +	hashwrite_be32(f, num_packs);
> +
> +	return MIDX_HEADER_SIZE;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
> +	unsigned char num_chunks = 0;
> +	char *midx_name;
> +	struct hashfile *f;
> +	struct lock_file lk;
> +
> +	midx_name = get_midx_filename(object_dir);
> +	if (safe_create_leading_directories(midx_name)) {
> +		UNLEAK(midx_name);
> +		die_errno(_("unable to create leading directories of %s"),
> +			  midx_name);
> +	}
> +
> +	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
> +	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
> +	FREE_AND_NULL(midx_name);

I am not sure why people prefer FREE_AND_NULL over free() for things
like this.  It is on the stack; it's not like this is a static variable
visible after this function returns or anything like that.

> +	write_midx_header(f, num_chunks, 0);
> +
> +	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
> +	commit_lock_file(&lk);
> +
>  	return 0;
>  }
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index ec3ddbe79c..8622a7cdce 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -4,7 +4,8 @@ test_description='multi-pack-indexes'
>  . ./test-lib.sh
>  
>  test_expect_success 'write midx with no packs' '
> -	git multi-pack-index --object-dir=. write
> +	git multi-pack-index --object-dir=. write &&
> +	test_path_is_file pack/multi-pack-index
>  '
>  
>  test_done


* Re: [PATCH v2 06/24] multi-pack-index: load into memory
  2018-06-25 14:34   ` [PATCH v2 06/24] multi-pack-index: load into memory Derrick Stolee
@ 2018-06-25 19:38     ` Junio C Hamano
  2018-07-05 14:19       ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Junio C Hamano @ 2018-06-25 19:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +#define MIDX_HASH_LEN 20
> +#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>  
>  static char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
>  
> +struct multi_pack_index *load_multi_pack_index(const char *object_dir)
> +{
> +	struct multi_pack_index *m = NULL;
> +	int fd;
> +	struct stat st;
> +	size_t midx_size;
> +	void *midx_map = NULL;
> +	uint32_t hash_version;
> +	char *midx_name = get_midx_filename(object_dir);
> +
> +	fd = git_open(midx_name);
> +
> +	if (fd < 0) {
> +		error_errno(_("failed to read %s"), midx_name);
> +		FREE_AND_NULL(midx_name);
> +		return NULL;
> +	}
> +	if (fstat(fd, &st)) {
> +		error_errno(_("failed to read %s"), midx_name);
> +		FREE_AND_NULL(midx_name);
> +		close(fd);
> +		return NULL;
> +	}
> +
> +	midx_size = xsize_t(st.st_size);
> +
> +	if (midx_size < MIDX_MIN_SIZE) {
> +		close(fd);
> +		error(_("multi-pack-index file %s is too small"), midx_name);
> +		goto cleanup_fail;
> +	}
> +
> +	FREE_AND_NULL(midx_name);

Error handling in the above part looks a bit inconsistent.  I first
thought that the earlier ones manually clean up and leave because
jumping to cleanup_fail would need a successfully opened fd and
successfully mmapped midx_map, but the above "goto" forces
cleanup_fail: to munmap NULL and close an already closed fd.

I wonder if it is simpler to do

	cleanup_fail:
		/* no need to check for NULL when freeing */
		free(m);
		free(midx_name);
		if (midx_map)
			munmap(midx_map, midx_size);
		if (0 <= fd)
			close(fd);
		return NULL;

and have all of the above error codepaths jump there.
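
The single-exit pattern suggested above can be sketched outside of Git as
follows.  This is only a minimal illustration: it uses plain POSIX calls
rather than Git's wrappers (git_open, xmmap, xcalloc), and the names
`struct mapped_file` and `load_mapped` are hypothetical, not part of the
patch.  Every error path jumps to one label, and the label only releases
what was actually acquired.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for struct multi_pack_index. */
struct mapped_file {
	const unsigned char *data;
	size_t size;
};

/*
 * Map `path` read-only and fail if it is shorter than `min_size`.
 * All failures leave through the single cleanup_fail label.
 */
static struct mapped_file *load_mapped(const char *path, size_t min_size)
{
	struct mapped_file *m = NULL;
	void *map = MAP_FAILED;
	size_t size = 0;
	struct stat st;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto cleanup_fail;
	if (fstat(fd, &st))
		goto cleanup_fail;

	size = (size_t)st.st_size;
	if (size < min_size)
		goto cleanup_fail;

	map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto cleanup_fail;

	m = calloc(1, sizeof(*m));
	if (!m)
		goto cleanup_fail;

	m->data = map;
	m->size = size;
	close(fd);	/* the mapping stays valid after close() */
	return m;

cleanup_fail:
	free(m);	/* free(NULL) is a no-op */
	if (map != MAP_FAILED)
		munmap(map, size);
	if (0 <= fd)
		close(fd);
	return NULL;
}
```

Because the label checks each resource before releasing it, an early
failure (for example, the file not existing) never reaches munmap() or
close() with an invalid argument.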

> +	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> +	strcpy(m->object_dir, object_dir);
> +	m->data = midx_map;
> +
> +	m->signature = get_be32(m->data);
> +	if (m->signature != MIDX_SIGNATURE) {
> +		error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
> +		      m->signature, MIDX_SIGNATURE);
> +		goto cleanup_fail;
> +	}
> +
> +	m->version = m->data[4];
> +	if (m->version != MIDX_VERSION) {
> +		error(_("multi-pack-index version %d not recognized"),
> +		      m->version);
> +		goto cleanup_fail;
> +	}
> +
> +	hash_version = m->data[5];

Is there a good existing example to show a better way to avoid these
hard-coded constants that describe/define the file format?

> +	if (hash_version != MIDX_HASH_VERSION) {
> +		error(_("hash version %u does not match"), hash_version);
> +		goto cleanup_fail;
> +	}
> +	m->hash_len = MIDX_HASH_LEN;
> +
> +	m->num_chunks = *(m->data + 6);

By the way, this mixture of m->data[4] and *(m->data + 6) is even
worse.  You could do get_be32(&8[m->data]) if you want to irritate
readers even more ;-)
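
One way to address both remarks is to name the header offsets once and
index the data consistently.  The sketch below assumes the 12-byte header
layout written by write_midx_header() in patch 05 (signature, version,
hash version, chunk count, one unused byte, pack count).  The MIDX_BYTE_*
names, `struct midx_header`, `read_be32` and `parse_midx_header` are
illustrative and are not taken from this series.

```c
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_HEADER_SIZE 12

/* Hypothetical names for the byte offsets within the header. */
#define MIDX_BYTE_FILE_VERSION 4
#define MIDX_BYTE_HASH_VERSION 5
#define MIDX_BYTE_NUM_CHUNKS 6
#define MIDX_BYTE_NUM_PACKS 8

struct midx_header {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
	unsigned char num_chunks;
	uint32_t num_packs;
};

/* Local stand-in for Git's get_be32(): read a big-endian 32-bit value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill `out` from the first MIDX_HEADER_SIZE bytes of `data`.
 * Returns 0 on success, -1 if the buffer is too short or the
 * signature does not match.
 */
static int parse_midx_header(const unsigned char *data, size_t len,
			     struct midx_header *out)
{
	if (len < MIDX_HEADER_SIZE)
		return -1;

	out->signature = read_be32(data);
	if (out->signature != MIDX_SIGNATURE)
		return -1;

	out->version = data[MIDX_BYTE_FILE_VERSION];
	out->hash_version = data[MIDX_BYTE_HASH_VERSION];
	out->num_chunks = data[MIDX_BYTE_NUM_CHUNKS];
	out->num_packs = read_be32(data + MIDX_BYTE_NUM_PACKS);
	return 0;
}
```

With the offsets named, every single-byte field is read with the same
array-index form, and the reader and writer can share one definition of
the layout.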

> +	m->num_packs = get_be32(m->data + 8);
> +
> +	return m;
> +
> +cleanup_fail:
> +	FREE_AND_NULL(m);
> +	FREE_AND_NULL(midx_name);
> +	munmap(midx_map, midx_size);
> +	close(fd);
> +	return NULL;
> +}
> +


> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 8622a7cdce..0372704c96 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -3,9 +3,19 @@
>  test_description='multi-pack-indexes'
>  . ./test-lib.sh
>  
> +midx_read_expect() {

"midx_read_expect () {", i.e. SP on both sides of (), please.

> +	cat >expect <<- EOF

"<<-\EOF", i.e. make it easy for readers to spot that there is no
funny substitution happening in the here-doc body.


> +	header: 4d494458 1 0 0
> +	object_dir: .
> +	EOF
> +	test-tool read-midx . >actual &&
> +	test_cmp expect actual
> +}
> +
>  test_expect_success 'write midx with no packs' '
>  	git multi-pack-index --object-dir=. write &&
> -	test_path_is_file pack/multi-pack-index
> +	test_path_is_file pack/multi-pack-index &&
> +	midx_read_expect
>  '
>  
>  test_done


* Re: [PATCH v2 07/24] multi-pack-index: expand test data
  2018-06-25 14:34   ` [PATCH v2 07/24] multi-pack-index: expand test data Derrick Stolee
@ 2018-06-25 19:45     ` Junio C Hamano
  0 siblings, 0 replies; 192+ messages in thread
From: Junio C Hamano @ 2018-06-25 19:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>  test_expect_success 'write midx with no packs' '
> +	test_when_finished rm pack/multi-pack-index &&

It is generally a good idea to give "-f" to "rm" used in
test_when_finished.  The main command sequence may have failed
before it has a chance to create that file; even though an error
will be ignored by the when_finished handler, it is good code
hygiene to mark the expected condition (e.g. the file to be removed may
not exist at this point) to signal to future readers that the author
knew what s/he was writing.

>  	git multi-pack-index --object-dir=. write &&
>  	test_path_is_file pack/multi-pack-index &&
>  	midx_read_expect
>  '
>  
> +test_expect_success 'create objects' '
> +	for i in `test_seq 1 5`

Please write it as "$(test_seq 1 5)"

> +	do
> +		iii=$(printf '%03i' $i)
> +		test-tool genrandom "bar" 200 > wide_delta_$iii &&

	test-tool genrandom "bar" 200 >"wide_delta_$iii" &&

i.e. no SP between the redirection operator and the target file
name.  Also put dq around a target file name that depends on variable
substitution to tell bash that we know what we are doing (some
vintages of bash will throw a warning at us unless we do so).



* Re: [PATCH v2 08/24] packfile: generalize pack directory list
  2018-06-25 14:34   ` [PATCH v2 08/24] packfile: generalize pack directory list Derrick Stolee
@ 2018-06-25 19:57     ` Junio C Hamano
  0 siblings, 0 replies; 192+ messages in thread
From: Junio C Hamano @ 2018-06-25 19:57 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +struct prepare_pack_data
> +{

ERROR: open brace '{' following struct go on the same line
#88: FILE: packfile.c:777:
+struct prepare_pack_data
+{

> +	struct repository *r;
> +	struct string_list *garbage;
> +	int local;
> +};
> +
> +static void prepare_pack(const char *full_name, size_t full_name_len, const char *file_name, void *_data)
> +{
> +	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
> +	struct packed_git *p;
> +	size_t base_len = full_name_len;
> +
> +	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
> +		/* Don't reopen a pack we already have. */
> +		for (p = data->r->objects->packed_git; p; p = p->next) {
> +			size_t len;
> +			if (strip_suffix(p->pack_name, ".pack", &len) &&
> +			    len == base_len &&
> +			    !memcmp(p->pack_name, full_name, len))
> +				break;
> +		}
> +
> +		if (p == NULL &&
> +		    /*
> +		     * See if it really is a valid .idx file with
> +		     * corresponding .pack file that we can map.
> +		     */
> +		    (p = add_packed_git(full_name, full_name_len, data->local)) != NULL)
> +			install_packed_git(data->r, p);
> +	}

This is merely moved code and the issue was inherited from the
original, but can we make it easier to read and at the same time
remove that assignment inside the if() condition (which generally makes
the code harder to read)?  The most naïve

	if (!p) {
		p = add_packed_git(full_name, full_name_len, data->local);
		if (p)
			install_packed_git(data->r, p);
	}

isn't all that bad, but there may be even better ways.

> +	if (!report_garbage)
> +	       return;

This "return;" is indented with a run of SP not with HT?



* Re: [PATCH v2 11/24] midx: read pack names into array
  2018-06-25 14:34   ` [PATCH v2 11/24] midx: read pack names into array Derrick Stolee
@ 2018-06-25 23:52     ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-06-25 23:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Stefan Beller, Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Mon, Jun 25, 2018 at 10:35 AM Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/midx.c b/midx.c
> @@ -210,6 +227,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
> +static size_t write_midx_pack_lookup(struct hashfile *f,
> +                                    char **pack_names,
> +                                    uint32_t nr_packs)
> +{
> +       uint32_t i, cur_len = 0;
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               hashwrite_be32(f, cur_len);
> +               cur_len += strlen(pack_names[i]) + 1;
> +       }
> +
> +       return sizeof(uint32_t) * (size_t)nr_packs;
> +}

This static function is never used, and thus breaks the build with DEVELOPER=1:

    midx.c:567:15: error: ‘write_midx_pack_lookup’ defined but not used
        [-Werror=unused-function]
    cc1: all warnings being treated as errors


* Re: [PATCH v2 06/24] multi-pack-index: load into memory
  2018-06-25 19:38     ` Junio C Hamano
@ 2018-07-05 14:19       ` Derrick Stolee
  2018-07-05 18:58         ` Eric Sunshine
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-05 14:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

On 6/25/2018 3:38 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> +	cat >expect <<- EOF
> "<<-\EOF", i.e. make it easy for readers to spot that there is no
> funny substitution happening in the here-doc body.

While I don't use substitutions in this patch, I do use them in later 
patches. Here is the final version of this method:

midx_read_expect () {
         NUM_PACKS=$1
         NUM_OBJECTS=$2
         NUM_CHUNKS=$3
         OBJECT_DIR=$4
         EXTRA_CHUNKS="$5"
         cat >expect <<-\EOF
         header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
         chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
         num_objects: $NUM_OBJECTS
         packs:
         EOF
         if [ $NUM_PACKS -ge 1 ]
         then
                 ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
         fi
         printf "object_dir: $OBJECT_DIR\n" >>expect &&
         test-tool read-midx $OBJECT_DIR >actual &&
         test_cmp expect actual
}

Using <<-\EOF causes these substitutions to fail. Is there a different 
way I should construct this method?

Thanks,
-Stolee


* Re: [PATCH v2 06/24] multi-pack-index: load into memory
  2018-07-05 14:19       ` Derrick Stolee
@ 2018-07-05 18:58         ` Eric Sunshine
  2018-07-06 19:20           ` Junio C Hamano
  0 siblings, 1 reply; 192+ messages in thread
From: Eric Sunshine @ 2018-07-05 18:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Git List, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 10:20 AM Derrick Stolee <stolee@gmail.com> wrote:
> On 6/25/2018 3:38 PM, Junio C Hamano wrote:
> While I don't use substitutions in this patch, I do use them in later
> patches. Here is the final version of this method:
>
> midx_read_expect () {
>          NUM_PACKS=$1
>          NUM_OBJECTS=$2
>          NUM_CHUNKS=$3
>          EXTRA_CHUNKS="$5"
>          cat >expect <<-\EOF
>          header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
>          chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
>          num_objects: $NUM_OBJECTS
>          packs:
>          EOF
>
> Using <<-\EOF causes these substitutions to fail. Is there a different
> way I should construct this method?

When you need to interpolate variables into the here-doc, use <<-EOF;
when you don't, use <<-\EOF.


* Re: [PATCH v2 05/24] midx: write header information to lockfile
  2018-06-25 19:19     ` Junio C Hamano
@ 2018-07-05 19:13       ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-05 19:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, sbeller, pclouds, avarab, Derrick Stolee

On 6/25/2018 3:19 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>> +#define MIDX_VERSION 1
>> +#define MIDX_HASH_VERSION 1
>> +#define MIDX_HEADER_SIZE 12
>> +
>> +static char *get_midx_filename(const char *object_dir)
>> +{
>> +	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>> +}
>> +
>> +static size_t write_midx_header(struct hashfile *f,
>> +				unsigned char num_chunks,
>> +				uint32_t num_packs)
>> +{
>> +	unsigned char byte_values[4];
>> +	hashwrite_be32(f, MIDX_SIGNATURE);
> WARNING: Missing a blank line after declarations
> #48: FILE: midx.c:21:
> +       unsigned char byte_values[4];
> +       hashwrite_be32(f, MIDX_SIGNATURE);
>
>> +	byte_values[0] = MIDX_VERSION;
>> +	byte_values[1] = MIDX_HASH_VERSION;
>> +	byte_values[2] = num_chunks;
>> +	byte_values[3] = 0; /* unused */
>> +	hashwrite(f, byte_values, sizeof(byte_values));
>> +	hashwrite_be32(f, num_packs);
>> +
>> +	return MIDX_HEADER_SIZE;
>> +}
>> +
>>   int write_midx_file(const char *object_dir)
>>   {
>> +	unsigned char num_chunks = 0;
>> +	char *midx_name;
>> +	struct hashfile *f;
>> +	struct lock_file lk;
>> +
>> +	midx_name = get_midx_filename(object_dir);
>> +	if (safe_create_leading_directories(midx_name)) {
>> +		UNLEAK(midx_name);
>> +		die_errno(_("unable to create leading directories of %s"),
>> +			  midx_name);
>> +	}
>> +
>> +	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>> +	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>> +	FREE_AND_NULL(midx_name);
> I am not sure why people prefer FREE_AND_NULL over free() for things
> like this.  It is on the stack; it's not like this is a static variable
> visible after this function returns or anything like that.

I default to FREE_AND_NULL(X) because a later change may introduce logic 
to use X later in the same code block. In this case, we add a 'cleanup:' 
at the end which would fail in a success case if we don't set midx_name 
to NULL here.

I think there are some other FREE_AND_NULLs that are currently at the 
end of a code block, so I'll work to clean those up.
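
The hazard described above can be illustrated with a small standalone
sketch.  The macro here has the same shape as Git's FREE_AND_NULL;
`write_with_cleanup` and its `fail_late` parameter are hypothetical and
only model a function that releases a buffer early and also has a shared
'cleanup:' label at the end.

```c
#include <stdlib.h>
#include <string.h>

/* Same shape as Git's macro: free the pointer, then reset it. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/*
 * Hypothetical model of a writer with a shared cleanup label.
 * Returns 0 on success, -1 on failure.
 */
static int write_with_cleanup(const char *object_dir, int fail_late)
{
	static const char suffix[] = "/pack/multi-pack-index";
	char *midx_name;
	int ret = -1;

	/* sizeof(suffix) already counts the terminating NUL. */
	midx_name = malloc(strlen(object_dir) + sizeof(suffix));
	if (!midx_name)
		goto cleanup;
	strcpy(midx_name, object_dir);
	strcat(midx_name, suffix);

	/* ... the name would be used here, e.g. to take the lock ... */

	/* Released early; reset it so the shared label stays safe. */
	FREE_AND_NULL(midx_name);

	if (fail_late)
		goto cleanup;
	ret = 0;

cleanup:
	free(midx_name);	/* no-op when already released above */
	return ret;
}
```

If the early release used plain free() instead, both the success path and
the late-failure path would reach the label with a dangling pointer and
free it a second time.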

>
>> +	write_midx_header(f, num_chunks, 0);
>> +
>> +	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>> +	commit_lock_file(&lk);
>> +
>>   	return 0;
>>   }
>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> index ec3ddbe79c..8622a7cdce 100755
>> --- a/t/t5319-multi-pack-index.sh
>> +++ b/t/t5319-multi-pack-index.sh
>> @@ -4,7 +4,8 @@ test_description='multi-pack-indexes'
>>   . ./test-lib.sh
>>   
>>   test_expect_success 'write midx with no packs' '
>> -	git multi-pack-index --object-dir=. write
>> +	git multi-pack-index --object-dir=. write &&
>> +	test_path_is_file pack/multi-pack-index
>>   '
>>   
>>   test_done



* [PATCH v3 00/24] Multi-pack-index (MIDX)
  2018-06-25 14:34 ` [PATCH v2 00/24] " Derrick Stolee
                     ` (23 preceding siblings ...)
  2018-06-25 14:34   ` [PATCH v2 24/24] midx: clear midx on repack Derrick Stolee
@ 2018-07-06  0:52   ` Derrick Stolee
  2018-07-06  0:52     ` [PATCH v3 01/24] multi-pack-index: add design document Derrick Stolee
                       ` (24 more replies)
  24 siblings, 25 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Thanks for the feedback on v2. I cleaned up the patch to respond to all
that feedback. I'll include the version diff in a follow-up email.

You can see the CI builds for Linux, Mac, and Windows linked from the
GitHub pull request [1].

Biggest changes in this version:

* Deleted the extra static method.

* Use $(...) instead of `...` in test script

* Fixed spaces in tabbing (hopefully I caught them all)

* Cleaned up some inter-patch diff issues

* `...` quotes in git-multi-pack-index.txt

* Replace FREE_AND_NULL() with free() when obvious that a variable is
  going out of scope

* Due to how Windows handles open file handles when replacing a
  lockfile, be sure to close the midx before writing a new one.

I do still want to revisit the "order by fanout" vs "merge sort"
discussion that we had on v1, but this series is very big already. I'd
like to come back to that as a follow-up.

Thanks,
-Stolee

[1] https://github.com/gitgitgadget/git/pull/5

Derrick Stolee (24):
  multi-pack-index: add design document
  multi-pack-index: add format details
  multi-pack-index: add builtin
  multi-pack-index: add 'write' verb
  midx: write header information to lockfile
  multi-pack-index: load into memory
  multi-pack-index: expand test data
  packfile: generalize pack directory list
  multi-pack-index: read packfile list
  multi-pack-index: write pack names in chunk
  midx: read pack names into array
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  config: create core.multiPackIndex setting
  midx: prepare midxed_git struct
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  packfile: skip loading index if in multi-pack-index
  midx: clear midx on repack

 .gitignore                                   |   3 +-
 Documentation/config.txt                     |   4 +
 Documentation/git-multi-pack-index.txt       |  56 ++
 Documentation/technical/multi-pack-index.txt | 109 +++
 Documentation/technical/pack-format.txt      |  77 ++
 Makefile                                     |   3 +
 builtin.h                                    |   1 +
 builtin/multi-pack-index.c                   |  46 +
 builtin/repack.c                             |   8 +
 cache.h                                      |   1 +
 command-list.txt                             |   1 +
 config.c                                     |   5 +
 environment.c                                |   1 +
 git.c                                        |   1 +
 midx.c                                       | 896 +++++++++++++++++++
 midx.h                                       |  20 +
 object-store.h                               |  33 +
 packfile.c                                   | 169 +++-
 packfile.h                                   |   9 +
 sha1-name.c                                  |  70 ++
 t/helper/test-read-midx.c                    |  54 ++
 t/helper/test-tool.c                         |   1 +
 t/helper/test-tool.h                         |   1 +
 t/t5319-multi-pack-index.sh                  | 191 ++++
 24 files changed, 1717 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 Documentation/technical/multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100644 t/helper/test-read-midx.c
 create mode 100755 t/t5319-multi-pack-index.sh


base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 01/24] multi-pack-index: add design document
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-07-06  0:52     ` Derrick Stolee
  2018-07-06  0:52     ` [PATCH v3 02/24] multi-pack-index: add format details Derrick Stolee
                       ` (23 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
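The O(log N) lookup claimed in the design notes below can be pictured
with a minimal sketch. This is only an illustration under stated
assumptions: the names `sketch_midx_entry` and `sketch_midx_lookup` are
hypothetical, the 20-byte ID assumes SHA-1, and the real on-disk format
keeps object IDs and offsets in separate chunks rather than in one
array of structs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* assumption: SHA-1 object IDs */

/* Hypothetical per-object record: which pack, and where in that pack. */
struct sketch_midx_entry {
	unsigned char oid[SKETCH_HASH_LEN];
	uint32_t pack_int_id; /* index j into the list of packfile names */
	uint64_t offset;      /* offset of the object within pack j */
};

/*
 * Binary search over entries sorted by object ID. Returns the matching
 * entry, or NULL if the object is not covered by the index. The cost is
 * O(log N) in the number of objects, independent of the number of packs.
 */
static const struct sketch_midx_entry *
sketch_midx_lookup(const struct sketch_midx_entry *entries, size_t nr,
		   const unsigned char *oid)
{
	size_t lo = 0, hi = nr;

	assert(oid && (entries || !nr));

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, entries[mid].oid, SKETCH_HASH_LEN);

		if (!cmp)
			return &entries[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

A hit yields the pack number and offset directly, so no per-pack index
needs to be consulted for objects covered by the multi-pack-index.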
 Documentation/technical/multi-pack-index.txt | 109 +++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.multiPackIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.multiPackIndex to false, or
+  downgrade without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git multi-pack-index' builtin to
+  verify the contents of the multi-pack-index file match the offsets
+  listed in the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)

base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 02/24] multi-pack-index: add format details
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
  2018-07-06  0:52     ` [PATCH v3 01/24] multi-pack-index: add design document Derrick Stolee
@ 2018-07-06  0:52     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 03/24] multi-pack-index: add builtin Derrick Stolee
                       ` (22 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:52 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

The multi-pack-index feature generalizes the existing pack-index
feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to create breaking changes by
advancing the version number. We can also change the hash algorithm
using a different version value.

We will add the individual chunk format information as we introduce
the code that writes that information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
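As a rough illustration of the layout documented below, the 12-byte
header can be decoded as follows. This is a sketch only: the
`sketch_*` names are hypothetical and not part of this series; the
field order and sizes are taken from the format text in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define SKETCH_MIDX_HEADER_SIZE 12
#define SKETCH_CHUNK_ENTRY_SIZE 12 /* 4-byte chunk id + 8-byte offset */

struct sketch_midx_header {
	uint32_t signature;
	unsigned char version;
	unsigned char oid_version;
	unsigned char num_chunks;
	unsigned char num_base_midx;
	uint32_t num_packs;
};

/* Read a 4-byte value stored in network (big-endian) order. */
static uint32_t sketch_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is short or the signature is wrong. */
static int sketch_parse_header(const unsigned char *data, size_t len,
			       struct sketch_midx_header *out)
{
	assert(data && out);

	if (len < SKETCH_MIDX_HEADER_SIZE)
		return -1;

	out->signature = sketch_read_be32(data);
	if (out->signature != SKETCH_MIDX_SIGNATURE)
		return -1;

	out->version = data[4];
	out->oid_version = data[5];
	out->num_chunks = data[6];
	out->num_base_midx = data[7];
	out->num_packs = sketch_read_be32(data + 8);
	return 0;
}

/* Size of the chunk lookup table: C entries plus one terminating entry. */
static size_t sketch_chunk_lookup_size(unsigned char num_chunks)
{
	return ((size_t)num_chunks + 1) * SKETCH_CHUNK_ENTRY_SIZE;
}
```

With three chunks, for example, the lookup table occupies
(3 + 1) * 12 = 48 bytes immediately after the header.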
 Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..e060e693f4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== multi-pack-index (MIDX) files have the following format:
+
+The multi-pack-index files refer to multiple pack-files and loose objects.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	1-byte version number:
+	    Git only writes or recognizes version 1.
+
+	1-byte Object Id Version
+	    Git only writes or recognizes version 1 (SHA1).
+
+	1-byte number of "chunks"
+
+	1-byte number of base multi-pack-index files:
+	    This value is currently always zero.
+
+	4-byte number of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	(This section intentionally left incomplete.)
+
+TRAILER:
+
+	20-byte SHA1-checksum of the above contents.
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 03/24] multi-pack-index: add builtin
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
  2018-07-06  0:52     ` [PATCH v3 01/24] multi-pack-index: add design document Derrick Stolee
  2018-07-06  0:52     ` [PATCH v3 02/24] multi-pack-index: add format details Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  3:54       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 04/24] multi-pack-index: add 'write' verb Derrick Stolee
                       ` (21 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

This new 'git multi-pack-index' builtin will be the plumbing access
for writing, reading, and checking multi-pack-index files. The
initial implementation is a no-op.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                             |  3 +-
 Documentation/git-multi-pack-index.txt | 36 ++++++++++++++++++++++++
 Makefile                               |  1 +
 builtin.h                              |  1 +
 builtin/multi-pack-index.c             | 38 ++++++++++++++++++++++++++
 command-list.txt                       |  1 +
 git.c                                  |  1 +
 7 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..25633bc515 100644
--- a/.gitignore
+++ b/.gitignore
@@ -99,8 +99,9 @@
 /git-mergetool--lib
 /git-mktag
 /git-mktree
-/git-name-rev
+/git-multi-pack-index
 /git-mv
+/git-name-rev
 /git-notes
 /git-p4
 /git-pack-redundant
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
new file mode 100644
index 0000000000..a83c0496a6
--- /dev/null
+++ b/Documentation/git-multi-pack-index.txt
@@ -0,0 +1,36 @@
+git-multi-pack-index(1)
+=======================
+
+NAME
+----
+git-multi-pack-index - Write and verify multi-pack-indexes
+
+
+SYNOPSIS
+--------
+[verse]
+'git multi-pack-index' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a multi-pack-index (MIDX) file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+	Use given directory for the location of Git objects. We check
+	`<dir>/pack/multi-pack-index` for the current MIDX file, and
+	`<dir>/pack` for the pack-files to index.
+
+
+SEE ALSO
+--------
+See link:technical/multi-pack-index.html[The Multi-Pack-Index Design
+Document] and link:technical/pack-format.html[The Multi-Pack-Index
+Format] for more information on the multi-pack-index feature.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index e4b503d259..54610875ec 100644
--- a/Makefile
+++ b/Makefile
@@ -1047,6 +1047,7 @@ BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
+BUILTIN_OBJS += builtin/multi-pack-index.o
 BUILTIN_OBJS += builtin/mv.o
 BUILTIN_OBJS += builtin/name-rev.o
 BUILTIN_OBJS += builtin/notes.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..70997d7ace 100644
--- a/builtin.h
+++ b/builtin.h
@@ -191,6 +191,7 @@ extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
+extern int cmd_multi_pack_index(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
 extern int cmd_name_rev(int argc, const char **argv, const char *prefix);
 extern int cmd_notes(int argc, const char **argv, const char *prefix);
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
new file mode 100644
index 0000000000..4853291477
--- /dev/null
+++ b/builtin/multi-pack-index.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_multi_pack_index_usage[] = {
+	N_("git multi-pack-index [--object-dir <dir>]"),
+	NULL
+};
+
+static struct opts_multi_pack_index {
+	const char *object_dir;
+} opts;
+
+int cmd_multi_pack_index(int argc, const char **argv,
+			 const char *prefix)
+{
+	static struct option builtin_multi_pack_index_options[] = {
+		OPT_FILENAME(0, "object-dir", &opts.object_dir,
+		  N_("The object directory containing a set of packfile and pack-index pairs")),
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_multi_pack_index_usage,
+				   builtin_multi_pack_index_options);
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_multi_pack_index_options,
+			     builtin_multi_pack_index_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..61071f8fa2 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators           complete
 git-merge-tree                          ancillaryinterrogators
+git-multi-pack-index                    plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c2f48d53dd..a7509fa5f7 100644
--- a/git.c
+++ b/git.c
@@ -505,6 +505,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
 	{ "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
 	{ "mktree", cmd_mktree, RUN_SETUP },
+	{ "multi-pack-index", cmd_multi_pack_index, RUN_SETUP_GENTLY },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
 	{ "name-rev", cmd_name_rev, RUN_SETUP },
 	{ "notes", cmd_notes, RUN_SETUP },
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 04/24] multi-pack-index: add 'write' verb
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (2 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 03/24] multi-pack-index: add builtin Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  4:07       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 05/24] midx: write header information to lockfile Derrick Stolee
                       ` (20 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

In anticipation of writing multi-pack-indexes, add a
'git multi-pack-index write' subcommand and send the options to a
write_midx_file() method. Also create a basic test file that tests
the 'write' subcommand.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 22 +++++++++++++++++++++-
 Makefile                               |  1 +
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  7 +++++++
 midx.h                                 |  6 ++++++
 t/t5319-multi-pack-index.sh            | 10 ++++++++++
 6 files changed, 54 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-multi-pack-index.sh

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index a83c0496a6..be97c9372e 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir <dir>]
+'git multi-pack-index' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	`<dir>/pack/multi-pack-index` for the current MIDX file, and
 	`<dir>/pack` for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	`<dir>/pack/multi-pack-index`.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-----------------------------------------------
+$ git multi-pack-index write
+-----------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate.
++
+-----------------------------------------------
+$ git multi-pack-index --object-dir <alt> write
+-----------------------------------------------
+
 
 SEE ALSO
 --------
diff --git a/Makefile b/Makefile
index 54610875ec..f5636c711d 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 4853291477..14b32e1373 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -2,9 +2,10 @@
 #include "cache.h"
 #include "config.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir <dir>]"),
+	N_("git multi-pack-index [--object-dir <dir>] [write]"),
 	NULL
 };
 
@@ -34,5 +35,12 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
+	if (argc == 0)
+		usage_with_options(builtin_multi_pack_index_usage,
+				   builtin_multi_pack_index_options);
+
+	if (!strcmp(argv[0], "write"))
+		return write_midx_file(opts.object_dir);
+
 	return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..32468db1a2
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,7 @@
+#include "cache.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..dbdbe9f873
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,6 @@
+#ifndef __MIDX_H__
+#define __MIDX_H__
+
+int write_midx_file(const char *object_dir);
+
+#endif
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
new file mode 100755
index 0000000000..ec3ddbe79c
--- /dev/null
+++ b/t/t5319-multi-pack-index.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no packs' '
+	git multi-pack-index --object-dir=. write
+'
+
+test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 05/24] midx: write header information to lockfile
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (3 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 04/24] multi-pack-index: add 'write' verb Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 06/24] multi-pack-index: load into memory Derrick Stolee
                       ` (19 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. Start
with these basics so we can add the rest of the format in small
increments.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but the return value is not
checked. We will check it in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
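The byte-accounting technique described above can be sketched in
isolation. This is an illustration under assumptions, not the code in
this patch: `sketch_sink` is a hypothetical in-memory stand-in for
Git's hashfile, and the header bytes mirror the values this patch
writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory sink standing in for a hashfile. */
struct sketch_sink {
	unsigned char buf[64];
	size_t len;
};

/* Append n bytes and report how many were written. */
static size_t sketch_sink_write(struct sketch_sink *s, const void *data,
				size_t n)
{
	assert(s->len + n <= sizeof(s->buf));
	memcpy(s->buf + s->len, data, n);
	s->len += n;
	return n;
}

/* Append a 4-byte value in network (big-endian) order. */
static size_t sketch_sink_write_be32(struct sketch_sink *s, uint32_t v)
{
	unsigned char be[4];

	be[0] = (unsigned char)(v >> 24);
	be[1] = (unsigned char)(v >> 16);
	be[2] = (unsigned char)(v >> 8);
	be[3] = (unsigned char)v;
	return sketch_sink_write(s, be, sizeof(be));
}

/* Write the 12-byte header and return the number of bytes written. */
static size_t sketch_write_header(struct sketch_sink *s,
				  unsigned char num_chunks,
				  uint32_t num_packs)
{
	/* file version, hash version, chunk count, base MIDX count */
	unsigned char bytes[4] = { 1, 1, 0, 0 };
	size_t written = 0;

	bytes[2] = num_chunks;
	written += sketch_sink_write_be32(s, 0x4d494458u); /* "MIDX" */
	written += sketch_sink_write(s, bytes, sizeof(bytes));
	written += sketch_sink_write_be32(s, num_packs);
	return written;
}
```

A caller keeps a running total of the returned sizes and compares it
against the offset it expects the next section to start at; a mismatch
signals that the precomputed offsets and the written data disagree.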
 midx.c                      | 50 +++++++++++++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh |  3 ++-
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 32468db1a2..f85f2d334d 100644
--- a/midx.c
+++ b/midx.c
@@ -1,7 +1,57 @@
 #include "cache.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+	return xstrfmt("%s/pack/multi-pack-index", object_dir);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+				unsigned char num_chunks,
+				uint32_t num_packs)
+{
+	unsigned char byte_values[4];
+
+	hashwrite_be32(f, MIDX_SIGNATURE);
+	byte_values[0] = MIDX_VERSION;
+	byte_values[1] = MIDX_HASH_VERSION;
+	byte_values[2] = num_chunks;
+	byte_values[3] = 0; /* number of base MIDX files; currently always zero */
+	hashwrite(f, byte_values, sizeof(byte_values));
+	hashwrite_be32(f, num_packs);
+
+	return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+	unsigned char num_chunks = 0;
+	char *midx_name;
+	struct hashfile *f = NULL;
+	struct lock_file lk;
+
+	midx_name = get_midx_filename(object_dir);
+	if (safe_create_leading_directories(midx_name)) {
+		UNLEAK(midx_name);
+		die_errno(_("unable to create leading directories of %s"),
+			  midx_name);
+	}
+
+	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	FREE_AND_NULL(midx_name);
+
+	write_midx_header(f, num_chunks, 0);
+
+	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	commit_lock_file(&lk);
+
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ec3ddbe79c..8622a7cdce 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,7 +4,8 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 test_expect_success 'write midx with no packs' '
-	git multi-pack-index --object-dir=. write
+	git multi-pack-index --object-dir=. write &&
+	test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 06/24] multi-pack-index: load into memory
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (4 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 05/24] midx: write header information to lockfile Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  4:19       ` Eric Sunshine
  2018-07-09 19:08       ` Junio C Hamano
  2018-07-06  0:53     ` [PATCH v3 07/24] multi-pack-index: expand test data Derrick Stolee
                       ` (18 subsequent siblings)
  24 siblings, 2 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Create a new multi_pack_index struct for loading multi-pack-indexes into
memory. Create a test-tool builtin for reading basic information about
that multi-pack-index to verify the correct data is written.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
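The new struct stores the object directory inline using a trailing
flexible array, so a single allocation holds both the fixed fields and
the path. A minimal sketch of that allocation pattern in standard C
follows; the `sketch_index` names are hypothetical, and plain calloc()
stands in for Git's xcalloc() wrapper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down version of the struct added in this patch. */
struct sketch_index {
	uint32_t num_packs;
	char object_dir[]; /* C99 flexible array member */
};

/*
 * Allocate the struct and its trailing string in one block. The caller
 * releases everything with a single free().
 */
static struct sketch_index *sketch_index_new(const char *object_dir)
{
	size_t len;
	struct sketch_index *m;

	assert(object_dir);

	len = strlen(object_dir);
	m = calloc(1, sizeof(*m) + len + 1);
	if (!m)
		return NULL;

	/* copy the path including its terminating NUL */
	memcpy(m->object_dir, object_dir, len + 1);
	return m;
}
```

Because the path lives inside the same block, freeing the struct on an
error path also releases the string, which keeps cleanup code short.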
 Makefile                    |  1 +
 midx.c                      | 82 +++++++++++++++++++++++++++++++++++++
 midx.h                      |  4 ++
 object-store.h              | 16 ++++++++
 t/helper/test-read-midx.c   | 34 +++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 t/t5319-multi-pack-index.sh | 12 +++++-
 8 files changed, 150 insertions(+), 1 deletion(-)
 create mode 100644 t/helper/test-read-midx.c

diff --git a/Makefile b/Makefile
index f5636c711d..0b801d1b16 100644
--- a/Makefile
+++ b/Makefile
@@ -717,6 +717,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-path-utils.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-read-cache.o
+TEST_BUILTINS_OBJS += test-read-midx.o
 TEST_BUILTINS_OBJS += test-ref-store.o
 TEST_BUILTINS_OBJS += test-regex.o
 TEST_BUILTINS_OBJS += test-revision-walking.o
diff --git a/midx.c b/midx.c
index f85f2d334d..fb388f5858 100644
--- a/midx.c
+++ b/midx.c
@@ -1,18 +1,100 @@
 #include "cache.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
+#define MIDX_BYTE_FILE_VERSION 4
+#define MIDX_BYTE_HASH_VERSION 5
+#define MIDX_BYTE_NUM_CHUNKS 6
+#define MIDX_BYTE_NUM_PACKS 8
 #define MIDX_HASH_VERSION 1
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+struct multi_pack_index *load_multi_pack_index(const char *object_dir)
+{
+	struct multi_pack_index *m = NULL;
+	int fd;
+	struct stat st;
+	size_t midx_size;
+	void *midx_map = NULL;
+	uint32_t hash_version;
+	char *midx_name = get_midx_filename(object_dir);
+
+	fd = git_open(midx_name);
+
+	if (fd < 0)
+		goto cleanup_fail;
+	if (fstat(fd, &st)) {
+		error_errno(_("failed to read %s"), midx_name);
+		goto cleanup_fail;
+	}
+
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < MIDX_MIN_SIZE) {
+		/* cleanup_fail closes fd; closing it here would double-close */
+		error(_("multi-pack-index file %s is too small"), midx_name);
+		goto cleanup_fail;
+	}
+
+	FREE_AND_NULL(midx_name);
+
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
+	strcpy(m->object_dir, object_dir);
+	m->fd = fd;
+	m->data = midx_map;
+	m->data_len = midx_size;
+
+	m->signature = get_be32(m->data);
+	if (m->signature != MIDX_SIGNATURE) {
+		error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
+		      m->signature, MIDX_SIGNATURE);
+		goto cleanup_fail;
+	}
+
+	m->version = m->data[MIDX_BYTE_FILE_VERSION];
+	if (m->version != MIDX_VERSION) {
+		error(_("multi-pack-index version %d not recognized"),
+		      m->version);
+		goto cleanup_fail;
+	}
+
+	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
+	if (hash_version != MIDX_HASH_VERSION) {
+		error(_("hash version %u does not match"), hash_version);
+		goto cleanup_fail;
+	}
+	m->hash_len = MIDX_HASH_LEN;
+
+	m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
+
+	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
+
+	return m;
+
+cleanup_fail:
+	/* no need to check for NULL when freeing */
+	free(m);
+	free(midx_name);
+	if (midx_map)
+		munmap(midx_map, midx_size);
+	if (0 <= fd)
+		close(fd);
+	return NULL;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index dbdbe9f873..2d83dd9ec1 100644
--- a/midx.h
+++ b/midx.h
@@ -1,6 +1,10 @@
 #ifndef __MIDX_H__
 #define __MIDX_H__
 
+struct multi_pack_index;
+
+struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+
 int write_midx_file(const char *object_dir);
 
 #endif
diff --git a/object-store.h b/object-store.h
index d683112fd7..4f410841cc 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,6 +84,22 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
+struct multi_pack_index {
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	char object_dir[FLEX_ARRAY];
+};
+
 struct raw_object_store {
 	/*
 	 * Path to the repository's object store.
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
new file mode 100644
index 0000000000..5abf969175
--- /dev/null
+++ b/t/helper/test-read-midx.c
@@ -0,0 +1,34 @@
+/*
+ * test-read-midx.c: code to read and print multi-pack-index header data
+ */
+#include "test-tool.h"
+#include "cache.h"
+#include "midx.h"
+#include "repository.h"
+#include "object-store.h"
+
+static int read_midx_file(const char *object_dir)
+{
+	struct multi_pack_index *m = load_multi_pack_index(object_dir);
+
+	if (!m)
+		return 0;
+
+	printf("header: %08x %d %d %d\n",
+	       m->signature,
+	       m->version,
+	       m->num_chunks,
+	       m->num_packs);
+
+	printf("object_dir: %s\n", m->object_dir);
+
+	return 0;
+}
+
+int cmd__read_midx(int argc, const char **argv)
+{
+	if (argc != 2)
+		usage("read-midx <object_dir>");
+
+	return read_midx_file(argv[1]);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 805a45de9c..1c3ab36e6c 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -27,6 +27,7 @@ static struct test_cmd cmds[] = {
 	{ "path-utils", cmd__path_utils },
 	{ "prio-queue", cmd__prio_queue },
 	{ "read-cache", cmd__read_cache },
+	{ "read-midx", cmd__read_midx },
 	{ "ref-store", cmd__ref_store },
 	{ "regex", cmd__regex },
 	{ "revision-walking", cmd__revision_walking },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 7116ddfb94..6af8c08a66 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -21,6 +21,7 @@ int cmd__online_cpus(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__read_cache(int argc, const char **argv);
+int cmd__read_midx(int argc, const char **argv);
 int cmd__ref_store(int argc, const char **argv);
 int cmd__regex(int argc, const char **argv);
 int cmd__revision_walking(int argc, const char **argv);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 8622a7cdce..2ecc369529 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,9 +3,19 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+midx_read_expect () {
+	cat >expect <<-EOF
+	header: 4d494458 1 0 0
+	object_dir: .
+	EOF
+	test-tool read-midx . >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success 'write midx with no packs' '
 	git multi-pack-index --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index
+	test_path_is_file pack/multi-pack-index &&
+	midx_read_expect
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (5 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 06/24] multi-pack-index: load into memory Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  4:36       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 08/24] packfile: generalize pack directory list Derrick Stolee
                       ` (17 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests to t5319-multi-pack-index.sh that
create repository data including multiple packfiles with both version 1
and version 2 formats.

The current 'git multi-pack-index write' command will always write the
same file with no "real" data. This will be expanded in future commits,
along with the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 99 +++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 2ecc369529..1be7be02b8 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -13,9 +13,108 @@ midx_read_expect () {
 }
 
 test_expect_success 'write midx with no packs' '
+	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
 	test_path_is_file pack/multi-pack-index &&
 	midx_read_expect
 '
 
+test_expect_success 'create objects' '
+	for i in $(test_seq 1 5)
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 >wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree </dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with one v1 pack' '
+	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more objects' '
+	for i in $(test_seq 6 10)
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 >wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list2 &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with two packs' '
+	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more packs' '
+	for j in $(test_seq 1 10)
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 >wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+		git update-index --add file_101 &&
+		tree=$(git write-tree) &&
+		commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+		} >obj-list &&
+		git update-ref HEAD $commit &&
+		git pack-objects --index-version=2 test-pack <obj-list &&
+		i=$(expr $i + 1) || return 1 &&
+		j=$(expr $j + 1) || return 1
+	done
+'
+
+test_expect_success 'write midx with twelve packs' '
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 08/24] packfile: generalize pack directory list
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (6 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 07/24] multi-pack-index: expand test data Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 09/24] multi-pack-index: read packfile list Derrick Stolee
                       ` (16 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

In anticipation of sharing the pack directory listing with the
multi-pack-index, generalize prepare_packed_git_one() into
for_each_file_in_pack_dir().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 101 +++++++++++++++++++++++++++++++++--------------------
 packfile.h |   6 ++++
 2 files changed, 69 insertions(+), 38 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7cd45aa4b2..ee1ab9b804 100644
--- a/packfile.c
+++ b/packfile.c
@@ -738,13 +738,14 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
-static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data)
 {
 	struct strbuf path = STRBUF_INIT;
 	size_t dirnamelen;
 	DIR *dir;
 	struct dirent *de;
-	struct string_list garbage = STRING_LIST_INIT_DUP;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -759,53 +760,77 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	strbuf_addch(&path, '/');
 	dirnamelen = path.len;
 	while ((de = readdir(dir)) != NULL) {
-		struct packed_git *p;
-		size_t base_len;
-
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		strbuf_setlen(&path, dirnamelen);
 		strbuf_addstr(&path, de->d_name);
 
-		base_len = path.len;
-		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
-			/* Don't reopen a pack we already have. */
-			for (p = r->objects->packed_git; p;
-			     p = p->next) {
-				size_t len;
-				if (strip_suffix(p->pack_name, ".pack", &len) &&
-				    len == base_len &&
-				    !memcmp(p->pack_name, path.buf, len))
-					break;
-			}
-			if (p == NULL &&
-			    /*
-			     * See if it really is a valid .idx file with
-			     * corresponding .pack file that we can map.
-			     */
-			    (p = add_packed_git(path.buf, path.len, local)) != NULL)
-				install_packed_git(r, p);
-		}
-
-		if (!report_garbage)
-			continue;
-
-		if (ends_with(de->d_name, ".idx") ||
-		    ends_with(de->d_name, ".pack") ||
-		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep") ||
-		    ends_with(de->d_name, ".promisor"))
-			string_list_append(&garbage, path.buf);
-		else
-			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
+		fn(path.buf, path.len, de->d_name, data);
 	}
+
 	closedir(dir);
-	report_pack_garbage(&garbage);
-	string_list_clear(&garbage, 0);
 	strbuf_release(&path);
 }
 
+struct prepare_pack_data {
+	struct repository *r;
+	struct string_list *garbage;
+	int local;
+};
+
+static void prepare_pack(const char *full_name, size_t full_name_len,
+			 const char *file_name, void *_data)
+{
+	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
+	struct packed_git *p;
+	size_t base_len = full_name_len;
+
+	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		/* Don't reopen a pack we already have. */
+		for (p = data->r->objects->packed_git; p; p = p->next) {
+			size_t len;
+			if (strip_suffix(p->pack_name, ".pack", &len) &&
+			    len == base_len &&
+			    !memcmp(p->pack_name, full_name, len))
+				break;
+		}
+
+		if (!p) {
+			p = add_packed_git(full_name, full_name_len, data->local);
+			if (p)
+				install_packed_git(data->r, p);
+		}
+	}
+
+	if (!report_garbage)
+		return;
+
+	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".pack") ||
+	    ends_with(file_name, ".bitmap") ||
+	    ends_with(file_name, ".keep") ||
+	    ends_with(file_name, ".promisor"))
+		string_list_append(data->garbage, full_name);
+	else
+		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
+}
+
+static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+{
+	struct prepare_pack_data data;
+	struct string_list garbage = STRING_LIST_INIT_DUP;
+
+	data.r = r;
+	data.garbage = &garbage;
+	data.local = local;
+
+	for_each_file_in_pack_dir(objdir, prepare_pack, &data);
+
+	report_pack_garbage(data.garbage);
+	string_list_clear(data.garbage, 0);
+}
+
 static void prepare_packed_git(struct repository *r);
 /*
  * Give a fast, rough count of the number of objects in the repository. This
diff --git a/packfile.h b/packfile.h
index e0a38aba93..d2ad30300a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -28,6 +28,12 @@ extern char *sha1_pack_index_name(const unsigned char *sha1);
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
+typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
+				      const char *file_name, void *data);
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data);
+
 /* A hook to report invalid files in pack directory */
 #define PACKDIR_FILE_PACK 1
 #define PACKDIR_FILE_IDX 2
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 09/24] multi-pack-index: read packfile list
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (7 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 08/24] packfile: generalize pack directory list Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
                       ` (15 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

When constructing a multi-pack-index file for a given object directory,
read the files within the enclosed pack directory, select those whose
names end with ".idx", and find the paired packfile for each one using
add_packed_git().
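
As a rough illustration of the ".idx" filter described above, here is a
minimal standalone sketch. The helpers has_suffix() and count_idx_files()
are hypothetical names used only for this example; the patch itself relies
on Git's ends_with() and add_packed_git().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if 'name' ends with 'suffix', else 0. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t name_len, suffix_len;

	assert(name && suffix);

	name_len = strlen(name);
	suffix_len = strlen(suffix);
	if (suffix_len > name_len)
		return 0;
	return !strcmp(name + name_len - suffix_len, suffix);
}

/* Hypothetical helper: count the entries that look like pack-index files. */
static size_t count_idx_files(const char **names, size_t nr)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		if (has_suffix(names[i], ".idx"))
			count++;
	return count;
}
```

Given the names "pack-1.idx", "pack-1.pack" and "multi-pack-index", only
the first one would be counted.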

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 46 ++++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh | 16 ++++++-------
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index fb388f5858..b0722485df 100644
--- a/midx.c
+++ b/midx.c
@@ -1,6 +1,8 @@
 #include "cache.h"
 #include "csum-file.h"
+#include "dir.h"
 #include "lockfile.h"
+#include "packfile.h"
 #include "object-store.h"
 #include "midx.h"
 
@@ -112,12 +114,39 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_list {
+	struct packed_git **list;
+	uint32_t nr;
+	uint32_t alloc_list;
+};
+
+static void add_pack_to_midx(const char *full_path, size_t full_path_len,
+			     const char *file_name, void *data)
+{
+	struct pack_list *packs = (struct pack_list *)data;
+
+	if (ends_with(file_name, ".idx")) {
+		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
+
+		packs->list[packs->nr] = add_packed_git(full_path,
+							 full_path_len,
+							 0);
+		if (!packs->list[packs->nr]) {
+			warning(_("failed to add packfile '%s'"),
+				full_path);
+			return;
+		}
+	}
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char num_chunks = 0;
 	char *midx_name;
+	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
+	struct pack_list packs;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -126,14 +155,29 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	packs.nr = 0;
+	packs.alloc_list = 16;
+	packs.list = NULL;
+	ALLOC_ARRAY(packs.list, packs.alloc_list);
+
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, 0);
+	write_midx_header(f, num_chunks, packs.nr);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+	for (i = 0; i < packs.nr; i++) {
+		if (packs.list[i]) {
+			close_pack(packs.list[i]);
+			free(packs.list[i]);
+		}
+	}
+
+	free(packs.list);
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1be7be02b8..fd0a3f3be7 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect () {
+	NUM_PACKS=$1
 	cat >expect <<-EOF
-	header: 4d494458 1 0 0
+	header: 4d494458 1 0 $NUM_PACKS
 	object_dir: .
 	EOF
 	test-tool read-midx . >actual &&
@@ -15,8 +16,7 @@ midx_read_expect () {
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect
+	midx_read_expect 0
 '
 
 test_expect_success 'create objects' '
@@ -47,13 +47,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'Add more objects' '
@@ -83,7 +83,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 2
 '
 
 test_expect_success 'Add more packs' '
@@ -106,7 +106,7 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 test-pack <obj-list &&
+		git pack-objects --index-version=2 pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
@@ -114,7 +114,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 12
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 10/24] multi-pack-index: write pack names in chunk
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (8 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 09/24] multi-pack-index: read packfile list Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 11/24] midx: read pack names into array Derrick Stolee
                       ` (14 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

The multi-pack-index needs to track which packfiles it indexes. Store
these in our first required chunk. Since filenames are not well
structured, add padding to keep good alignment in later chunks.
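
To make the padding concrete, the sketch below copies names into a
caller-supplied buffer as concatenated, null-terminated strings and then
zero-pads to a four-byte boundary. pack_names_into() and CHUNK_ALIGNMENT
are names made up for this illustration (CHUNK_ALIGNMENT is assumed to
mirror MIDX_CHUNK_ALIGNMENT); this is not the code from the patch, which
streams its output through hashwrite().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_ALIGNMENT 4 /* assumed to mirror MIDX_CHUNK_ALIGNMENT */

/*
 * Hypothetical helper: copy 'nr' names into 'out' (capacity 'out_len') as
 * concatenated NUL-terminated strings, then zero-pad to CHUNK_ALIGNMENT.
 * Returns the number of bytes used, or (size_t)-1 if 'out' is too small.
 */
static size_t pack_names_into(unsigned char *out, size_t out_len,
			      const char **names, size_t nr)
{
	size_t i, written = 0;

	assert(out && names);

	for (i = 0; i < nr; i++) {
		size_t len = strlen(names[i]) + 1; /* include the NUL */

		if (len > out_len - written)
			return (size_t)-1;
		memcpy(out + written, names[i], len);
		written += len;
	}

	/* Zero-pad so that the next chunk starts on an aligned offset. */
	while (written % CHUNK_ALIGNMENT) {
		if (written == out_len)
			return (size_t)-1;
		out[written++] = '\0';
	}
	return written;
}
```

For the two names "a.idx" and "bc.idx" the strings occupy 6 + 7 = 13
bytes, so three zero bytes of padding bring the chunk to 16 bytes.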

Modify the 'test-tool read-midx' helper to report the presence of the
pack-file name chunk. Modify t5319-multi-pack-index.sh to reflect this
new output and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks that verify
that the values we are writing fit the format definitions. This mainly
helps developers while working on the feature, but it can also identify
issues that only appear when dealing with very large data sets. These
large sets are hard to encode into test cases.
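
The write-time checks can be pictured as a small validator run over the
chunk table before it is emitted. The following is only a sketch under
stated assumptions: check_chunk_offsets() is a hypothetical name, and it
returns an error code where the patch calls BUG().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_ALIGNMENT 4 /* assumed to mirror MIDX_CHUNK_ALIGNMENT */

/*
 * Hypothetical validator: chunk offsets must be non-decreasing and each
 * one must be a multiple of CHUNK_ALIGNMENT.
 * Returns 0 when the table is valid, -1 otherwise.
 */
static int check_chunk_offsets(const uint64_t *offsets, size_t nr)
{
	size_t i;

	assert(offsets || !nr);

	for (i = 0; i < nr; i++) {
		if (i && offsets[i] < offsets[i - 1])
			return -1; /* offsets went backwards */
		if (offsets[i] % CHUNK_ALIGNMENT)
			return -1; /* misaligned chunk start */
	}
	return 0;
}
```

Running such a check on every write costs little and catches layout
mistakes that would otherwise only surface with very large inputs.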

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |   6 +
 midx.c                                  | 176 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/helper/test-read-midx.c               |   7 +
 t/t5319-multi-pack-index.sh             |   3 +-
 5 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index e060e693f4..6c5a77475f 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/midx.c b/midx.c
index b0722485df..f78c161422 100644
--- a/midx.c
+++ b/midx.c
@@ -17,6 +17,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -31,6 +36,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	void *midx_map = NULL;
 	uint32_t hash_version;
 	char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 
@@ -84,6 +90,33 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
+					     MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
+						 MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKNAMES:
+				m->chunk_pack_names = m->data + chunk_offset;
+				break;
+
+			case 0:
+				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
+				break;
+
+			default:
+				/*
+				 * Do nothing on unrecognized chunks, allowing future
+				 * extensions to add optional chunks.
+				 */
+				break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die(_("multi-pack-index missing required pack-name chunk"));
+
 	return m;
 
 cleanup_fail:
@@ -116,8 +149,11 @@ static size_t write_midx_header(struct hashfile *f,
 
 struct pack_list {
 	struct packed_git **list;
+	char **names;
 	uint32_t nr;
 	uint32_t alloc_list;
+	uint32_t alloc_names;
+	size_t pack_name_concat_len;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -127,6 +163,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 	if (ends_with(file_name, ".idx")) {
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
+		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
 							 full_path_len,
@@ -136,17 +173,90 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 				full_path);
 			return;
 		}
+
+		packs->names[packs->nr] = xstrdup(file_name);
+		packs->pack_name_concat_len += strlen(file_name) + 1;
+		packs->nr++;
+	}
+}
+
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+
+	free(pairs);
+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+				    char **pack_names,
+				    uint32_t num_packs)
+{
+	uint32_t i;
+	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+	size_t written = 0;
+
+	for (i = 0; i < num_packs; i++) {
+		size_t writelen = strlen(pack_names[i]) + 1;
+
+		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+			BUG("incorrect pack-file order: %s before %s",
+			    pack_names[i - 1],
+			    pack_names[i]);
+
+		hashwrite(f, pack_names[i], writelen);
+		written += writelen;
+	}
+
+	/* add padding to be aligned */
+	i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+	if (i < MIDX_CHUNK_ALIGNMENT) {
+		memset(padding, 0, sizeof(padding));
+		hashwrite(f, padding, i);
+		written += i;
 	}
+
+	return written;
 }
 
 int write_midx_file(const char *object_dir)
 {
-	unsigned char num_chunks = 0;
+	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct pack_list packs;
+	uint32_t *pack_perm = NULL;
+	uint64_t written = 0;
+	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
+	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -157,16 +267,76 @@ int write_midx_file(const char *object_dir)
 
 	packs.nr = 0;
 	packs.alloc_list = 16;
+	packs.alloc_names = 16;
 	packs.list = NULL;
+	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
+	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
+	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	sort_packs_by_name(packs.names, packs.nr, pack_perm);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, packs.nr);
+	cur_chunk = 0;
+	num_chunks = 1;
+
+	written = write_midx_header(f, num_chunks, packs.nr);
+
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+
+	for (i = 0; i <= num_chunks; i++) {
+		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
+			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
+			    chunk_offsets[i - 1],
+			    chunk_offsets[i]);
+
+		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
+			BUG("chunk offset %"PRIu64" is not properly aligned",
+			    chunk_offsets[i]);
+
+		hashwrite_be32(f, chunk_ids[i]);
+		hashwrite_be32(f, chunk_offsets[i] >> 32);
+		hashwrite_be32(f, chunk_offsets[i]);
+
+		written += MIDX_CHUNKLOOKUP_WIDTH;
+	}
+
+	for (i = 0; i < num_chunks; i++) {
+		if (written != chunk_offsets[i])
+			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
+			    chunk_offsets[i],
+			    written,
+			    chunk_ids[i]);
+
+		switch (chunk_ids[i]) {
+			case MIDX_CHUNKID_PACKNAMES:
+				written += write_midx_pack_names(f, packs.names, packs.nr);
+				break;
+
+			default:
+				BUG("trying to write unknown chunk id %"PRIx32,
+				    chunk_ids[i]);
+		}
+	}
+
+	if (written != chunk_offsets[num_chunks])
+		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
+		    written,
+		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
@@ -176,8 +346,10 @@ int write_midx_file(const char *object_dir)
 			close_pack(packs.list[i]);
 			free(packs.list[i]);
 		}
+		free(packs.names[i]);
 	}
 
 	free(packs.list);
+	free(packs.names);
 	return 0;
 }
diff --git a/object-store.h b/object-store.h
index 4f410841cc..c87d051849 100644
--- a/object-store.h
+++ b/object-store.h
@@ -97,6 +97,8 @@ struct multi_pack_index {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const unsigned char *chunk_pack_names;
+
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 5abf969175..a9232d8219 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -20,6 +20,13 @@ static int read_midx_file(const char *object_dir)
 	       m->num_chunks,
 	       m->num_packs);
 
+	printf("chunks:");
+
+	if (m->chunk_pack_names)
+		printf(" pack_names");
+
+	printf("\n");
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index fd0a3f3be7..f458758945 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,7 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect () {
 	NUM_PACKS=$1
 	cat >expect <<-EOF
-	header: 4d494458 1 0 $NUM_PACKS
+	header: 4d494458 1 1 $NUM_PACKS
+	chunks: pack_names
 	object_dir: .
 	EOF
 	test-tool read-midx . >actual &&
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 11/24] midx: read pack names into array
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (9 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 10/24] multi-pack-index: write pack names in chunk Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  4:58       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
                       ` (13 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 17 +++++++++++++++++
 object-store.h              |  1 +
 t/helper/test-read-midx.c   |  5 +++++
 t/t5319-multi-pack-index.sh |  7 ++++++-
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index f78c161422..ffe29af65d 100644
--- a/midx.c
+++ b/midx.c
@@ -37,6 +37,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	uint32_t hash_version;
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
+	const char *cur_pack_name;
 
 	fd = git_open(midx_name);
 
@@ -117,6 +118,22 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
 
+	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
+
+	cur_pack_name = (const char *)m->chunk_pack_names;
+	for (i = 0; i < m->num_packs; i++) {
+		m->pack_names[i] = cur_pack_name;
+
+		cur_pack_name += strlen(cur_pack_name) + 1;
+
+		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
+			error(_("multi-pack-index pack names out of order: '%s' before '%s'"),
+			      m->pack_names[i - 1],
+			      m->pack_names[i]);
+			goto cleanup_fail;
+		}
+	}
+
 	return m;
 
 cleanup_fail:
diff --git a/object-store.h b/object-store.h
index c87d051849..88169b33e9 100644
--- a/object-store.h
+++ b/object-store.h
@@ -99,6 +99,7 @@ struct multi_pack_index {
 
 	const unsigned char *chunk_pack_names;
 
+	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index a9232d8219..0b53a9e8b5 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -9,6 +9,7 @@
 
 static int read_midx_file(const char *object_dir)
 {
+	uint32_t i;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir);
 
 	if (!m)
@@ -27,6 +28,10 @@ static int read_midx_file(const char *object_dir)
 
 	printf("\n");
 
+	printf("packs:\n");
+	for (i = 0; i < m->num_packs; i++)
+		printf("%s\n", m->pack_names[i]);
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index f458758945..4610352b69 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -8,8 +8,13 @@ midx_read_expect () {
 	cat >expect <<-EOF
 	header: 4d494458 1 1 $NUM_PACKS
 	chunks: pack_names
-	object_dir: .
+	packs:
 	EOF
+	if [ $NUM_PACKS -ge 1 ]
+	then
+		ls pack/ | grep idx | sort >> expect
+	fi
+	printf "object_dir: .\n" >>expect &&
 	test-tool read-midx . >actual &&
 	test_cmp expect actual
 }
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 12/24] midx: sort and deduplicate objects from packfiles
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (10 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 11/24] midx: read pack names into array Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 13/24] midx: write object ids in a chunk Derrick Stolee
                       ` (12 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Before writing a list of objects and their offsets to a multi-pack-index,
we need to collect the list of objects contained in the packfiles. There
may be multiple copies of some objects, so this list must be deduplicated.

It is possible to artificially get into a state where there are many
duplicate copies of objects. That can create high memory pressure if we
are to create a list of all objects before de-duplication. To reduce
this memory pressure without a significant performance drop,
automatically group objects by the first byte of their object id. Use
the IDX fanout tables to group the data, copy to a local array, then
sort.

Copy only the de-duplicated entries. Select the duplicate based on the
most-recent modified time of a packfile containing the object.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
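A reader's note on the strategy described above. What follows is a minimal
sketch of the sort-then-deduplicate step for a single fanout batch, not the
code in this patch. The names sketch_entry, sketch_entry_cmp,
sketch_dedup_batch and SKETCH_OID_LEN are hypothetical, and the struct is a
simplified stand-in for struct pack_midx_entry.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SKETCH_OID_LEN 20 /* SHA-1 length; hypothetical name for this sketch */

/* Simplified stand-in for struct pack_midx_entry. */
struct sketch_entry {
	unsigned char oid[SKETCH_OID_LEN];
	uint32_t pack_id;
	time_t pack_mtime;
};

/* Order by OID, then newest pack first, then lowest pack id. */
static int sketch_entry_cmp(const void *va, const void *vb)
{
	const struct sketch_entry *a = va, *b = vb;
	int cmp = memcmp(a->oid, b->oid, SKETCH_OID_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	if (a->pack_id != b->pack_id)
		return a->pack_id < b->pack_id ? -1 : 1;
	return 0;
}

/*
 * Sort one batch in place and compact it so that each OID appears once,
 * keeping the copy from the most recently modified pack. Returns the
 * number of entries kept.
 */
static size_t sketch_dedup_batch(struct sketch_entry *batch, size_t nr)
{
	size_t i, kept = 0;

	assert(batch || !nr);
	if (!nr)
		return 0;

	qsort(batch, nr, sizeof(*batch), sketch_entry_cmp);

	for (i = 0; i < nr; i++) {
		/* After sorting, duplicates are adjacent; keep the first. */
		if (kept && !memcmp(batch[kept - 1].oid, batch[i].oid,
				    SKETCH_OID_LEN))
			continue;
		batch[kept++] = batch[i];
	}
	return kept;
}
```

A caller would run this once per first-byte value (256 batches in total) and
append each compacted batch to the final list, so only one batch of possibly
duplicated entries is held in memory at a time.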
 midx.c     | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 packfile.c |  17 +++++++
 packfile.h |   2 +
 3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index ffe29af65d..028e3aa5e9 100644
--- a/midx.c
+++ b/midx.c
@@ -4,6 +4,7 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -183,14 +184,22 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
-							 full_path_len,
-							 0);
+							full_path_len, 0);
+
 		if (!packs->list[packs->nr]) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
+		if (open_pack_index(packs->list[packs->nr])) {
+			warning(_("failed to open pack-index '%s'"),
+				full_path);
+			close_pack(packs->list[packs->nr]);
+			FREE_AND_NULL(packs->list[packs->nr]);
+			return;
+		}
+
 		packs->names[packs->nr] = xstrdup(file_name);
 		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
@@ -231,6 +240,119 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	free(pairs);
 }
 
+struct pack_midx_entry {
+	struct object_id oid;
+	uint32_t pack_int_id;
+	time_t pack_mtime;
+	uint64_t offset;
+};
+
+static int midx_oid_compare(const void *_a, const void *_b)
+{
+	const struct pack_midx_entry *a = (const struct pack_midx_entry *)_a;
+	const struct pack_midx_entry *b = (const struct pack_midx_entry *)_b;
+	int cmp = oidcmp(&a->oid, &b->oid);
+
+	if (cmp)
+		return cmp;
+
+	if (a->pack_mtime > b->pack_mtime)
+		return -1;
+	else if (a->pack_mtime < b->pack_mtime)
+		return 1;
+
+	return a->pack_int_id - b->pack_int_id;
+}
+
+static void fill_pack_entry(uint32_t pack_int_id,
+			    struct packed_git *p,
+			    uint32_t cur_object,
+			    struct pack_midx_entry *entry)
+{
+	if (!nth_packed_object_oid(&entry->oid, p, cur_object))
+		die(_("failed to locate object %"PRIu32" in packfile"), cur_object);
+
+	entry->pack_int_id = pack_int_id;
+	entry->pack_mtime = p->mtime;
+
+	entry->offset = nth_packed_object_offset(p, cur_object);
+}
+
+/*
+ * It is possible to artificially get into a state where there are many
+ * duplicate copies of objects. That can create high memory pressure if
+ * we are to create a list of all objects before de-duplication. To reduce
+ * this memory pressure without a significant performance drop, automatically
+ * group objects by the first byte of their object id. Use the IDX fanout
+ * tables to group the data, copy to a local array, then sort.
+ *
+ * Copy only the de-duplicated entries (selected by most-recent modified time
+ * of a packfile containing the object).
+ */
+static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+						  uint32_t *perm,
+						  uint32_t nr_packs,
+						  uint32_t *nr_objects)
+{
+	uint32_t cur_fanout, cur_pack, cur_object;
+	uint32_t alloc_fanout, alloc_objects, total_objects = 0;
+	struct pack_midx_entry *entries_by_fanout = NULL;
+	struct pack_midx_entry *deduplicated_entries = NULL;
+
+	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++)
+		total_objects += p[cur_pack]->num_objects;
+
+	/*
+	 * As we de-duplicate by fanout value, we expect the fanout
+	 * slices to be evenly distributed, with some noise. Hence,
+	 * allocate slightly more than one 256th.
+	 */
+	alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
+
+	ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
+	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
+	*nr_objects = 0;
+
+	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
+		uint32_t nr_fanout = 0;
+
+		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
+			end = get_pack_fanout(p[cur_pack], cur_fanout);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				nr_fanout++;
+			}
+		}
+
+		QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
+
+		/*
+		 * The batch is now sorted by OID and then mtime (descending).
+		 * Take only the first duplicate.
+		 */
+		for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
+			if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
+						  &entries_by_fanout[cur_object].oid))
+				continue;
+
+			ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
+			memcpy(&deduplicated_entries[*nr_objects],
+			       &entries_by_fanout[cur_object],
+			       sizeof(struct pack_midx_entry));
+			(*nr_objects)++;
+		}
+	}
+
+	free(entries_by_fanout);
+	return deduplicated_entries;
+}
+
 static size_t write_midx_pack_names(struct hashfile *f,
 				    char **pack_names,
 				    uint32_t num_packs)
@@ -274,6 +396,8 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	uint32_t nr_entries;
+	struct pack_midx_entry *entries = NULL;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -299,6 +423,8 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
+	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -368,5 +494,6 @@ int write_midx_file(const char *object_dir)
 
 	free(packs.list);
 	free(packs.names);
+	free(entries);
 	return 0;
 }
diff --git a/packfile.c b/packfile.c
index ee1ab9b804..3d652212c6 100644
--- a/packfile.c
+++ b/packfile.c
@@ -196,6 +196,23 @@ int open_pack_index(struct packed_git *p)
 	return ret;
 }
 
+uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
+{
+	const uint32_t *level1_ofs = p->index_data;
+
+	if (!level1_ofs) {
+		if (open_pack_index(p))
+			return 0;
+		level1_ofs = p->index_data;
+	}
+
+	if (p->index_version > 1) {
+		level1_ofs += 2;
+	}
+
+	return ntohl(level1_ofs[value]);
+}
+
 static struct packed_git *alloc_packed_git(int extra)
 {
 	struct packed_git *p = xmalloc(st_add(sizeof(*p), extra));
diff --git a/packfile.h b/packfile.h
index d2ad30300a..b0eed44c0b 100644
--- a/packfile.h
+++ b/packfile.h
@@ -69,6 +69,8 @@ extern int open_pack_index(struct packed_git *);
  */
 extern void close_pack_index(struct packed_git *);
 
+extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
+
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
 extern void close_pack(struct packed_git *);
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 13/24] midx: write object ids in a chunk
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (11 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 12/24] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  5:04       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 14/24] midx: write object id fanout chunk Derrick Stolee
                       ` (11 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
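A reader's note: this patch only writes the OID Lookup chunk and checks that
the OIDs are strictly increasing. Because of that ordering, a reader could
locate an object by binary search over fixed-width rows. The sketch below
illustrates the idea under that assumption. It is not part of this patch, and
sketch_oid_lookup and SKETCH_HASH_LEN are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* mirrors MIDX_HASH_LEN; hypothetical name */

/*
 * Binary search over a table of nr rows, each SKETCH_HASH_LEN bytes,
 * sorted in strictly increasing order. Returns the row index of oid,
 * or -1 if it is not present.
 */
static int64_t sketch_oid_lookup(const unsigned char *table, uint32_t nr,
				 const unsigned char *oid)
{
	uint32_t lo = 0, hi = nr;

	assert(table || !nr);

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, table + (size_t)mid * SKETCH_HASH_LEN,
				 SKETCH_HASH_LEN);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

The row index returned here is also the position one would use to index any
other per-object chunk written in the same order.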
 Documentation/technical/pack-format.txt |  4 +++
 midx.c                                  | 47 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/helper/test-read-midx.c               |  2 ++
 t/t5319-multi-pack-index.sh             |  4 +--
 5 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6c5a77475f..78ee0489c6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -302,6 +302,10 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Lookup (ID: {'O', 'I', 'D', 'L'})
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/midx.c b/midx.c
index 028e3aa5e9..7606addab6 100644
--- a/midx.c
+++ b/midx.c
@@ -18,9 +18,10 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 1
+#define MIDX_MAX_CHUNKS 2
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
 static char *get_midx_filename(const char *object_dir)
@@ -103,6 +104,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				m->chunk_oid_lookup = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
 				break;
@@ -118,6 +123,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
+	if (!m->chunk_oid_lookup)
+		die(_("multi-pack-index missing required OID lookup chunk"));
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
 
@@ -384,6 +391,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		if (i < nr_objects - 1) {
+			struct pack_midx_entry *next = list;
+			if (oidcmp(&obj->oid, &next->oid) >= 0)
+				BUG("OIDs not in order: %s >= %s",
+				oid_to_hex(&obj->oid),
+				oid_to_hex(&next->oid));
+		}
+
+		hashwrite(f, obj->oid.hash, (int)hash_len);
+		written += hash_len;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -430,7 +463,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 1;
+	num_chunks = 2;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -438,9 +471,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -470,6 +507,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, packs.names, packs.nr);
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 88169b33e9..25f8530eb4 100644
--- a/object-store.h
+++ b/object-store.h
@@ -98,6 +98,7 @@ struct multi_pack_index {
 	uint32_t num_objects;
 
 	const unsigned char *chunk_pack_names;
+	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 0b53a9e8b5..60bca5b668 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -25,6 +25,8 @@ static int read_midx_file(const char *object_dir)
 
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_lookup)
+		printf(" oid_lookup");
 
 	printf("\n");
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 4610352b69..cbe84c74fc 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect () {
 	NUM_PACKS=$1
 	cat >expect <<-EOF
-	header: 4d494458 1 1 $NUM_PACKS
-	chunks: pack_names
+	header: 4d494458 1 2 $NUM_PACKS
+	chunks: pack_names oid_lookup
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 14/24] midx: write object id fanout chunk
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (12 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 13/24] midx: write object ids in a chunk Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 15/24] midx: write object offsets Derrick Stolee
                       ` (10 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
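A reader's note: the fanout chunk added here stores, for each first byte i,
the number of OIDs whose first byte is at most i, so F[255] is the total
object count. The sketch below builds such a table from a sorted list of
first bytes and shows how two adjacent entries bound the rows for one byte
value. It keeps values in host byte order, whereas the patch writes them
big-endian. The names sketch_build_fanout and sketch_fanout_range are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill fanout[i] with the number of entries whose first byte is <= i.
 * first_bytes must be sorted in ascending order.
 */
static void sketch_build_fanout(const unsigned char *first_bytes, size_t nr,
				uint32_t fanout[256])
{
	size_t pos = 0;
	int i;

	for (i = 0; i < 256; i++) {
		while (pos < nr && first_bytes[pos] == i)
			pos++;
		fanout[i] = (uint32_t)pos;
	}
	assert(pos == nr); /* fails only if the input was not sorted */
}

/* Half-open range [*lo, *hi) of rows whose first byte equals b. */
static void sketch_fanout_range(const uint32_t fanout[256], unsigned char b,
				uint32_t *lo, uint32_t *hi)
{
	*lo = b ? fanout[b - 1] : 0;
	*hi = fanout[b];
}
```

Restricting a binary search to [lo, hi) is what lets a reader skip the first
eight halving steps that the comment in write_midx_oid_fanout() mentions.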
 Documentation/technical/pack-format.txt |  5 +++
 midx.c                                  | 53 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/helper/test-read-midx.c               |  4 +-
 t/t5319-multi-pack-index.sh             | 16 ++++----
 5 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 78ee0489c6..3215f7bfcd 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -302,6 +302,11 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Fanout (ID: {'O', 'I', 'D', 'F'})
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects.
+
 	OID Lookup (ID: {'O', 'I', 'D', 'L'})
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
diff --git a/midx.c b/midx.c
index 7606addab6..404147bb9f 100644
--- a/midx.c
+++ b/midx.c
@@ -18,11 +18,13 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 2
+#define MIDX_MAX_CHUNKS 3
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -104,6 +106,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
@@ -123,9 +129,13 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
+	if (!m->chunk_oid_fanout)
+		die(_("multi-pack-index missing required OID fanout chunk"));
 	if (!m->chunk_oid_lookup)
 		die(_("multi-pack-index missing required OID lookup chunk"));
 
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
 
 	cur_pack_name = (const char *)m->chunk_pack_names;
@@ -391,6 +401,35 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_fanout(struct hashfile *f,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	struct pack_midx_entry *last = objects + nr_objects;
+	uint32_t count = 0;
+	uint32_t i;
+
+	/*
+	* Write the first-level table (the list is sorted,
+	* but we use a 256-entry lookup to be able to avoid
+	* having to do eight extra binary search iterations).
+	*/
+	for (i = 0; i < 256; i++) {
+		struct pack_midx_entry *next = list;
+
+		while (next < last && next->oid.hash[0] == i) {
+			count++;
+			next++;
+		}
+
+		hashwrite_be32(f, count);
+		list = next;
+	}
+
+	return MIDX_CHUNK_FANOUT_SIZE;
+}
+
 static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 				    struct pack_midx_entry *objects,
 				    uint32_t nr_objects)
@@ -463,7 +502,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 2;
+	num_chunks = 3;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -471,9 +510,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
@@ -507,6 +550,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, packs.names, packs.nr);
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				written += write_midx_oid_fanout(f, entries, nr_entries);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
diff --git a/object-store.h b/object-store.h
index 25f8530eb4..3357e51100 100644
--- a/object-store.h
+++ b/object-store.h
@@ -98,6 +98,7 @@ struct multi_pack_index {
 	uint32_t num_objects;
 
 	const unsigned char *chunk_pack_names;
+	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 60bca5b668..d1bb7290ae 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -25,10 +25,12 @@ static int read_midx_file(const char *object_dir)
 
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_fanout)
+		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
 
-	printf("\n");
+	printf("\nnum_objects: %"PRIu32"\n", m->num_objects);
 
 	printf("packs:\n");
 	for (i = 0; i < m->num_packs; i++)
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index cbe84c74fc..23f653473a 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -5,9 +5,11 @@ test_description='multi-pack-indexes'
 
 midx_read_expect () {
 	NUM_PACKS=$1
+	NUM_OBJECTS=$2
 	cat >expect <<-EOF
-	header: 4d494458 1 2 $NUM_PACKS
-	chunks: pack_names oid_lookup
+	header: 4d494458 1 3 $NUM_PACKS
+	chunks: pack_names oid_fanout oid_lookup
+	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
@@ -22,7 +24,7 @@ midx_read_expect () {
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 0
+	midx_read_expect 0 0
 '
 
 test_expect_success 'create objects' '
@@ -53,13 +55,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'Add more objects' '
@@ -89,7 +91,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2
+	midx_read_expect 2 33
 '
 
 test_expect_success 'Add more packs' '
@@ -120,7 +122,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12
+	midx_read_expect 12 73
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 15/24] midx: write object offsets
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (13 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 14/24] midx: write object id fanout chunk Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  5:27       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 16/24] config: create core.multiPackIndex setting Derrick Stolee
                       ` (9 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

The final pair of chunks for the multi-pack-index file stores the object
offsets. We default to using 32-bit offsets as in the pack-index version
1 format, but if any offset does not fit in 32 bits, we use a trick
similar to the pack-index version 2 format: every offset of at least
2^31 is stored in a 64-bit table, and the 32-bit table points into that
64-bit table as necessary.

We only store these 64-bit offsets if necessary, so create a test that
manipulates a version 2 pack-index to fake a large offset. This allows
us to test that the large offset table is created, but the data does not
match the actual packfile offsets. The multi-pack-index offset does match
the (corrupted) pack-index offset, so a future feature will compare these
offsets during a 'verify' step.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
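A reader's note on the encoding described above. The sketch below mirrors the
writer's rule for one 32-bit offset slot and adds a matching decoder. The
decoder is an assumption about how a reader would interpret the slot, since
this patch only writes the chunks. All sketch_* names and
SKETCH_LARGE_OFFSET_FLAG are hypothetical, and the in-memory large[] array
stands in for the large offset chunk.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LARGE_OFFSET_FLAG 0x80000000u

/*
 * Encode one offset into its 32-bit slot. When the large offset table is
 * in use, any offset >= 2^31 is appended to large[] and the slot stores
 * its row with the high bit set. Without the table, the offset must fit
 * in 32 bits and is stored as-is.
 */
static uint32_t sketch_encode_offset(uint64_t offset, int use_large,
				     uint64_t *large, uint32_t *nr_large)
{
	if (use_large && (offset >> 31)) {
		large[*nr_large] = offset;
		return SKETCH_LARGE_OFFSET_FLAG | (*nr_large)++;
	}
	assert(!(offset >> 32));
	return (uint32_t)offset;
}

/*
 * Decode a slot. The high bit is only treated as a flag when the large
 * offset table exists; otherwise the slot is a plain 32-bit offset.
 */
static uint64_t sketch_decode_offset(uint32_t slot, int have_large,
				     const uint64_t *large)
{
	if (have_large && (slot & SKETCH_LARGE_OFFSET_FLAG))
		return large[slot & ~SKETCH_LARGE_OFFSET_FLAG];
	return slot;
}
```

Note the asymmetry this implies: the table is only created when some offset
exceeds 2^32-1, but once it exists, every offset of at least 2^31 is routed
through it so that the high bit is unambiguous.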
 Documentation/technical/pack-format.txt |  15 +++-
 midx.c                                  | 100 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/helper/test-read-midx.c               |   4 +
 t/t5319-multi-pack-index.sh             |  44 ++++++++---
 5 files changed, 150 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 3215f7bfcd..cab5bdd2ff 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -311,7 +311,20 @@ CHUNK DATA:
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
 
-	(This section intentionally left incomplete.)
+	Object Offsets (ID: {'O', 'O', 'F', 'F'})
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^31, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the 31st bit is on, then removing that bit reveals
+		the row in the large offsets containing the 8-byte offset of
+		this object.
+
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
 
 TRAILER:
 
diff --git a/midx.c b/midx.c
index 404147bb9f..cc35abe7a2 100644
--- a/midx.c
+++ b/midx.c
@@ -18,13 +18,18 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 3
+#define MIDX_MAX_CHUNKS 5
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -114,6 +119,14 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				m->chunk_object_offsets = m->data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				m->chunk_large_offsets = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
 				break;
@@ -133,6 +146,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 		die(_("multi-pack-index missing required OID fanout chunk"));
 	if (!m->chunk_oid_lookup)
 		die(_("multi-pack-index missing required OID lookup chunk"));
+	if (!m->chunk_object_offsets)
+		die(_("multi-pack-index missing required object offsets chunk"));
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
@@ -456,6 +471,56 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 	return written;
 }
 
+static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i, nr_large_offset = 0;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		hashwrite_be32(f, obj->pack_int_id);
+
+		if (large_offset_needed && obj->offset >> 31)
+			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
+		else if (!large_offset_needed && obj->offset >> 32)
+			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
+			    oid_to_hex(&obj->oid),
+			    obj->offset);
+		else
+			hashwrite_be32(f, (uint32_t)obj->offset);
+
+		written += MIDX_CHUNK_OFFSET_WIDTH;
+	}
+
+	return written;
+}
+
+static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
+				       struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	size_t written = 0;
+
+	while (nr_large_offset) {
+		struct pack_midx_entry *obj = list++;
+		uint64_t offset = obj->offset;
+
+		if (!(offset >> 31))
+			continue;
+
+		hashwrite_be32(f, offset >> 32);
+		hashwrite_be32(f, offset & 0xffffffffUL);
+		written += 2 * sizeof(uint32_t);
+
+		nr_large_offset--;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -468,8 +533,9 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries;
+	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
+	int large_offsets_needed = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -496,13 +562,19 @@ int write_midx_file(const char *object_dir)
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
 	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+	for (i = 0; i < nr_entries; i++) {
+		if (entries[i].offset > 0x7fffffff)
+			num_large_offsets++;
+		if (entries[i].offset > 0xffffffff)
+			large_offsets_needed = 1;
+	}
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 3;
+	num_chunks = large_offsets_needed ? 5 : 4;
 
 	written = write_midx_header(f, num_chunks, packs.nr);
 
@@ -518,9 +590,21 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
 
+	cur_chunk++;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	if (large_offsets_needed) {
+		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
+
+		cur_chunk++;
+		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
+					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+	}
+
+	chunk_ids[cur_chunk] = 0;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -558,6 +642,14 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 3357e51100..07bcc80e02 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,6 +100,8 @@ struct multi_pack_index {
 	const unsigned char *chunk_pack_names;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_object_offsets;
+	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index d1bb7290ae..20771d1c1d 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -29,6 +29,10 @@ static int read_midx_file(const char *object_dir)
 		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
+	if (m->chunk_object_offsets)
+		printf(" object_offsets");
+	if (m->chunk_large_offsets)
+		printf(" large_offsets");
 
 	printf("\nnum_objects: %d\n", m->num_objects);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 23f653473a..ae6c9d4d02 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -6,25 +6,28 @@ test_description='multi-pack-indexes'
 midx_read_expect () {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
+	NUM_CHUNKS=$3
+	OBJECT_DIR=$4
+	EXTRA_CHUNKS="$5"
 	cat >expect <<-EOF
-	header: 4d494458 1 3 $NUM_PACKS
-	chunks: pack_names oid_fanout oid_lookup
+	header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
+	chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
 	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
 	then
-		ls pack/ | grep idx | sort >> expect
+		ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
 	fi
-	printf "object_dir: .\n" >>expect &&
-	test-tool read-midx . >actual &&
+	printf "object_dir: $OBJECT_DIR\n" >>expect &&
+	test-tool read-midx $OBJECT_DIR >actual &&
 	test_cmp expect actual
 }
 
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 0 0
+	midx_read_expect 0 0 4 .
 '
 
 test_expect_success 'create objects' '
@@ -55,13 +58,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 4 .
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 4 .
 '
 
 test_expect_success 'Add more objects' '
@@ -91,7 +94,7 @@ test_expect_success 'Add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2 33
+	midx_read_expect 2 33 4 .
 '
 
 test_expect_success 'Add more packs' '
@@ -122,7 +125,28 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12 73
+	midx_read_expect 12 73 4 .
+'
+
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+# Force 64-bit offsets by manipulating the idx file.
+# This makes the IDX file _incorrect_ so be careful to clean up after!
+test_expect_success 'force some 64-bit offsets with pack-objects' '
+	mkdir objects64 &&
+	mkdir objects64/pack &&
+	pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
+	idx64=objects64/pack/test-64-$pack64.idx &&
+	chmod u+w $idx64 &&
+	corrupt_data $idx64 2899 "\02" &&
+	midx64=$(git multi-pack-index write --object-dir=objects64) &&
+	midx_read_expect 1 62 5 objects64 " large_offsets"
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (14 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 15/24] midx: write object offsets Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  5:39       ` Eric Sunshine
  2018-07-11  9:48       ` SZEDER Gábor
  2018-07-06  0:53     ` [PATCH v3 17/24] midx: prepare midxed_git struct Derrick Stolee
                       ` (8 subsequent siblings)
  24 siblings, 2 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

The core.multiPackIndex config setting controls the multi-pack-
index (MIDX) feature. If false, the setting will disable all reads
from the multi-pack-index file.

Add comparison commands in t5319-multi-pack-index.sh to check that
typical Git behavior remains the same as the config setting is turned
on and off. This currently includes 'git rev-list' and 'git log'
commands to trigger several object database reads.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt    |  4 +++
 cache.h                     |  1 +
 config.c                    |  5 ++++
 environment.c               |  1 +
 t/t5319-multi-pack-index.sh | 56 +++++++++++++++++++++++++++++++------
 5 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..ab895ebb32 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -908,6 +908,10 @@ core.commitGraph::
 	Enable git commit graph feature. Allows reading from the
 	commit-graph file.
 
+core.multiPackIndex::
+	Use the multi-pack-index file to track multiple packfiles using a
+	single index. See linkgit:technical/multi-pack-index[1].
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 89a107a7f7..d12aa49710 100644
--- a/cache.h
+++ b/cache.h
@@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
+extern int core_multi_pack_index;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index fbbf0f8e9f..95d8da4243 100644
--- a/config.c
+++ b/config.c
@@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.multipackindex")) {
+		core_multi_pack_index = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2a6de2330b..b9bc919cdb 100644
--- a/environment.c
+++ b/environment.c
@@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
+int core_multi_pack_index;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ae6c9d4d02..fc582c9a59 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,8 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+objdir=.git/objects
+
 midx_read_expect () {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
@@ -61,12 +63,42 @@ test_expect_success 'write midx with one v1 pack' '
 	midx_read_expect 1 17 4 .
 '
 
+midx_git_two_modes() {
+	git -c core.multiPackIndex=false $1 >expect &&
+	git -c core.multiPackIndex=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
 test_expect_success 'write midx with one v2 pack' '
-	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17 4 .
+	git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 1 17 4 $objdir
 '
 
+midx_git_two_modes() {
+	git -c core.multiPackIndex=false $1 >expect &&
+	git -c core.multiPackIndex=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
+compare_results_with_midx "one v2 pack"
+
 test_expect_success 'Add more objects' '
 	for i in $(test_seq 6 10)
 	do
@@ -92,11 +124,13 @@ test_expect_success 'Add more objects' '
 '
 
 test_expect_success 'write midx with two packs' '
-	git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 2 33 4 .
+	git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list2 &&
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 2 33 4 $objdir
 '
 
+compare_results_with_midx "two packs"
+
 test_expect_success 'Add more packs' '
 	for j in $(test_seq 1 10)
 	do
@@ -117,17 +151,21 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 pack/test-pack <obj-list &&
+		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
 '
 
+compare_results_with_midx "mixed mode (two packs + extra)"
+
 test_expect_success 'write midx with twelve packs' '
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 12 73 4 .
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 12 73 4 $objdir
 '
 
+compare_results_with_midx "twelve packs"
+
 # usage: corrupt_data <file> <pos> [<data>]
 corrupt_data() {
 	file=$1
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 17/24] midx: prepare midxed_git struct
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (15 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 16/24] config: create core.multiPackIndex setting Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  5:41       ` Eric Sunshine
  2018-07-06  0:53     ` [PATCH v3 18/24] midx: read objects from multi-pack-index Derrick Stolee
                       ` (7 subsequent siblings)
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
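A minimal standalone sketch of the bookkeeping that
prepare_multi_pack_index_one() performs in the diff below: one list node
per object directory, no duplicate loads, and new nodes pushed to the
front. The names sketch_midx and sketch_prepare_one, the "enabled"
parameter (standing in for the core_multi_pack_index global), and the
loader that only fails on allocation failure are assumptions made for
illustration. The real code calls load_multi_pack_index(), which can also
return NULL when no multi-pack-index file exists.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct multi_pack_index: one node per object dir. */
struct sketch_midx {
	struct sketch_midx *next;
	char *object_dir;
};

/*
 * Returns 1 if a node for object_dir is on the list afterwards, 0 otherwise.
 * "enabled" plays the role of the core_multi_pack_index global.
 */
int sketch_prepare_one(struct sketch_midx **head, int enabled,
		       const char *object_dir)
{
	struct sketch_midx *m;
	size_t len;

	if (!enabled)
		return 0;

	/* Already loaded for this directory: nothing to do. */
	for (m = *head; m; m = m->next)
		if (!strcmp(object_dir, m->object_dir))
			return 1;

	/* Stand-in for load_multi_pack_index(); here it only fails on OOM. */
	m = malloc(sizeof(*m));
	if (!m)
		return 0;
	len = strlen(object_dir) + 1;
	m->object_dir = malloc(len);
	if (!m->object_dir) {
		free(m);
		return 0;
	}
	memcpy(m->object_dir, object_dir, len);

	/* Push to the front of the list, as the patch does. */
	m->next = *head;
	*head = m;
	return 1;
}
```

Calling the function twice with the same directory leaves a single node
on the list, which is why prepare_packed_git() can call it for the main
object directory and for every alternate without extra checks.
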
 midx.c         | 22 ++++++++++++++++++++++
 midx.h         |  3 +++
 object-store.h |  9 +++++++++
 packfile.c     |  6 +++++-
 4 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index cc35abe7a2..d5a61c0c53 100644
--- a/midx.c
+++ b/midx.c
@@ -180,6 +180,28 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return NULL;
 }
 
+int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
+{
+	struct multi_pack_index *m = r->objects->multi_pack_index;
+	struct multi_pack_index *m_search;
+
+	if (!core_multi_pack_index)
+		return 0;
+
+	for (m_search = m; m_search; m_search = m_search->next)
+		if (!strcmp(object_dir, m_search->object_dir))
+			return 1;
+
+	r->objects->multi_pack_index = load_multi_pack_index(object_dir);
+
+	if (r->objects->multi_pack_index) {
+		r->objects->multi_pack_index->next = m;
+		return 1;
+	}
+
+	return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index 2d83dd9ec1..731ad6f094 100644
--- a/midx.h
+++ b/midx.h
@@ -1,9 +1,12 @@
 #ifndef __MIDX_H__
 #define __MIDX_H__
 
+#include "repository.h"
+
 struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 07bcc80e02..7d67ad7aa9 100644
--- a/object-store.h
+++ b/object-store.h
@@ -85,6 +85,8 @@ struct packed_git {
 };
 
 struct multi_pack_index {
+	struct multi_pack_index *next;
+
 	int fd;
 
 	const unsigned char *data;
@@ -126,6 +128,13 @@ struct raw_object_store {
 	 */
 	struct oidmap *replace_map;
 
+	/*
+	 * private data
+	 *
+	 * should only be accessed directly by packfile.c and midx.c
+	 */
+	struct multi_pack_index *multi_pack_index;
+
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 3d652212c6..5d4493dbf4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -935,10 +936,13 @@ static void prepare_packed_git(struct repository *r)
 
 	if (r->objects->packed_git_initialized)
 		return;
+	prepare_multi_pack_index_one(r, r->objects->objectdir);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+		prepare_multi_pack_index_one(r, alt->path);
 		prepare_packed_git_one(r, alt->path, 0);
+	}
 	rearrange_packed_git(r);
 	prepare_packed_git_mru(r);
 	r->objects->packed_git_initialized = 1;
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 18/24] midx: read objects from multi-pack-index
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (16 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 17/24] midx: prepare midxed_git struct Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 19/24] midx: use midx in abbreviation calculations Derrick Stolee
                       ` (6 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
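A minimal standalone sketch of how nth_midxed_pack_int_id() and
nth_midxed_offset() in the diff below decode one entry of the object
offsets chunk. It assumes each entry is 8 bytes (a big-endian 4-byte
pack-int-id followed by a big-endian 4-byte offset) and that the
large-offset flag is the high bit, 0x80000000. Those constant values are
not visible in this diff, so treat them as assumptions. The helper names
(read_be32, read_be64, sketch_*) are hypothetical stand-ins for
get_be32(), get_be64() and the real functions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OFFSET_WIDTH 8u                  /* assumed entry width */
#define SKETCH_LARGE_OFFSET_NEEDED 0x80000000u  /* assumed flag bit */

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* First half of the entry: which pack holds the object. */
uint32_t sketch_pack_int_id(const unsigned char *offsets, uint32_t pos)
{
	return read_be32(offsets + (size_t)pos * SKETCH_OFFSET_WIDTH);
}

/*
 * Second half of the entry: either the offset itself, or (when the flag
 * bit is set and a large-offsets chunk exists) an index into a table of
 * 64-bit offsets.
 */
uint64_t sketch_object_offset(const unsigned char *offsets,
			      const unsigned char *large, uint32_t pos)
{
	uint32_t off32 = read_be32(offsets +
				   (size_t)pos * SKETCH_OFFSET_WIDTH + 4);

	if (large && (off32 & SKETCH_LARGE_OFFSET_NEEDED)) {
		off32 ^= SKETCH_LARGE_OFFSET_NEEDED;
		return read_be64(large + (size_t)off32 * sizeof(uint64_t));
	}
	return off32;
}
```

The sketch returns uint64_t so it needs no off_t size check. The real
function returns off_t and therefore has to die() on platforms where
off_t cannot hold a 64-bit value.
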
 midx.c         | 91 +++++++++++++++++++++++++++++++++++++++++++++++++-
 midx.h         |  2 ++
 object-store.h |  1 +
 packfile.c     |  8 ++++-
 4 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index d5a61c0c53..84b045060a 100644
--- a/midx.c
+++ b/midx.c
@@ -4,7 +4,7 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "object-store.h"
-#include "packfile.h"
+#include "sha1-lookup.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -152,6 +152,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
+	m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
 
 	cur_pack_name = (const char *)m->chunk_pack_names;
 	for (i = 0; i < m->num_packs; i++) {
@@ -180,6 +181,94 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return NULL;
 }
 
+static int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->num_packs)
+		BUG("bad pack-int-id");
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir,
+		    m->pack_names[pack_int_id]);
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
+{
+	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
+			    MIDX_HASH_LEN, result);
+}
+
+static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
+{
+	const unsigned char *offset_data;
+	uint32_t offset32;
+
+	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset32 = get_be32(offset_data + sizeof(uint32_t));
+
+	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
+		if (sizeof(off_t) < sizeof(uint64_t))
+			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
+
+		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
+	}
+
+	return offset32;
+}
+
+static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
+{
+	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+}
+
+static int nth_midxed_pack_entry(struct multi_pack_index *m, struct pack_entry *e, uint32_t pos)
+{
+	uint32_t pack_int_id;
+	struct packed_git *p;
+
+	if (pos >= m->num_objects)
+		return 0;
+
+	pack_int_id = nth_midxed_pack_int_id(m, pos);
+
+	if (prepare_midx_pack(m, pack_int_id))
+		die(_("error preparing packfile from multi-pack-index"));
+	p = m->packs[pack_int_id];
+
+	/*
+	 * We are about to tell the caller where they can locate the
+	 * requested object.  We better make sure the packfile is
+	 * still here and can be accessed before supplying that
+	 * answer, as it may have been deleted since the MIDX was
+	 * loaded!
+	 */
+	if (!is_pack_valid(p))
+		return 0;
+
+	e->offset = nth_midxed_offset(m, pos);
+	e->p = p;
+
+	return 1;
+}
+
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m)
+{
+	uint32_t pos;
+
+	if (!bsearch_midx(oid, m, &pos))
+		return 0;
+
+	return nth_midxed_pack_entry(m, e, pos);
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
 {
 	struct multi_pack_index *m = r->objects->multi_pack_index;
diff --git a/midx.h b/midx.h
index 731ad6f094..6b74a0640f 100644
--- a/midx.h
+++ b/midx.h
@@ -6,6 +6,8 @@
 struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/object-store.h b/object-store.h
index 7d67ad7aa9..03cc278758 100644
--- a/object-store.h
+++ b/object-store.h
@@ -106,6 +106,7 @@ struct multi_pack_index {
 	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
+	struct packed_git **packs;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/packfile.c b/packfile.c
index 5d4493dbf4..bc763d91b9 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1902,11 +1902,17 @@ static int fill_pack_entry(const struct object_id *oid,
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
+	struct multi_pack_index *m;
 
 	prepare_packed_git(r);
-	if (!r->objects->packed_git)
+	if (!r->objects->packed_git && !r->objects->multi_pack_index)
 		return 0;
 
+	for (m = r->objects->multi_pack_index; m; m = m->next) {
+		if (fill_midx_entry(oid, e, m))
+			return 1;
+	}
+
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
 		if (fill_pack_entry(oid, e, p)) {
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 19/24] midx: use midx in abbreviation calculations
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (17 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 18/24] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 20/24] midx: use existing midx when writing new one Derrick Stolee
                       ` (5 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
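A minimal standalone sketch of the idea behind
find_abbrev_len_for_midx() in the diff below: in a sorted table of
object IDs, only the immediate neighbors of the search position can
share the longest prefix with the target, so at most two comparisons
decide the abbreviation length. The function names, the flat byte-array
table, and the "shared hex digits + 1" formulation are assumptions for
illustration. The real code goes through bsearch_midx() and
extend_abbrev_len() instead.

```c
#include <stddef.h>
#include <string.h>

/* Number of leading hex digits shared by two hashes of len bytes. */
static unsigned common_hex_digits(const unsigned char *a,
				  const unsigned char *b, size_t len)
{
	unsigned n = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (a[i] == b[i]) {
			n += 2;
			continue;
		}
		if ((a[i] >> 4) == (b[i] >> 4))
			n++;
		break;
	}
	return n;
}

/*
 * Binary search over nr sorted hashes of len bytes each. Returns the
 * position of key, or the position where it would be inserted, and sets
 * *found accordingly.
 */
static size_t lower_bound(const unsigned char *table, size_t nr, size_t len,
			  const unsigned char *key, int *found)
{
	size_t lo = 0, hi = nr;

	*found = 0;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(key, table + mid * len, len);

		if (!cmp) {
			*found = 1;
			return mid;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Hex digits needed so that key's prefix matches no other table entry. */
unsigned sketch_min_abbrev_len(const unsigned char *table, size_t nr,
			       size_t len, const unsigned char *key)
{
	int found;
	size_t first = lower_bound(table, nr, len, key, &found);
	/* Successor is table[first] on a miss, table[first + 1] on a hit. */
	size_t succ = found ? first + 1 : first;
	unsigned need = 1, n;

	if (succ < nr) {
		n = common_hex_digits(key, table + succ * len, len) + 1;
		if (n > need)
			need = n;
	}
	if (first > 0) {
		n = common_hex_digits(key, table + (first - 1) * len, len) + 1;
		if (n > need)
			need = n;
	}
	return need;
}
```

With a MIDX this runs once per multi-pack-index rather than once per
packfile, which is where the savings in abbreviation calculations come
from.
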
 midx.c                      | 11 ++++++
 midx.h                      |  3 ++
 packfile.c                  |  6 ++++
 packfile.h                  |  1 +
 sha1-name.c                 | 70 +++++++++++++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh |  3 +-
 6 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 84b045060a..e66025f066 100644
--- a/midx.c
+++ b/midx.c
@@ -205,6 +205,17 @@ int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32
 			    MIDX_HASH_LEN, result);
 }
 
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct multi_pack_index *m,
+					uint32_t n)
+{
+	if (n >= m->num_objects)
+		return NULL;
+
+	hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n);
+	return oid;
+}
+
 static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
diff --git a/midx.h b/midx.h
index 6b74a0640f..f7c2ec7893 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,9 @@ struct multi_pack_index;
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct multi_pack_index *m,
+					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
diff --git a/packfile.c b/packfile.c
index bc763d91b9..c0eb5ac885 100644
--- a/packfile.c
+++ b/packfile.c
@@ -961,6 +961,12 @@ struct packed_git *get_packed_git(struct repository *r)
 	return r->objects->packed_git;
 }
 
+struct multi_pack_index *get_multi_pack_index(struct repository *r)
+{
+	prepare_packed_git(r);
+	return r->objects->multi_pack_index;
+}
+
 struct list_head *get_packed_git_mru(struct repository *r)
 {
 	prepare_packed_git(r);
diff --git a/packfile.h b/packfile.h
index b0eed44c0b..046280caf3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -45,6 +45,7 @@ extern void install_packed_git(struct repository *r, struct packed_git *pack);
 
 struct packed_git *get_packed_git(struct repository *r);
 struct list_head *get_packed_git_mru(struct repository *r);
+struct multi_pack_index *get_multi_pack_index(struct repository *r);
 
 /*
  * Give a rough count of objects in the repository. This sacrifices accuracy
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..7dc71201e6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -12,6 +12,7 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "repository.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, const unsigned char *
 	return 1;
 }
 
+static void unique_in_midx(struct multi_pack_index *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+	num = m->num_objects;
+
+	if (!num)
+		return;
+
+	bsearch_midx(&ds->bin_pfx, m, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx".  See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void unique_in_pack(struct packed_git *p,
 			   struct disambiguate_state *ds)
 {
@@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p,
 
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
+	struct multi_pack_index *m;
 	struct packed_git *p;
 
+	for (m = get_multi_pack_index(the_repository); m && !ds->ambiguous;
+	     m = m->next)
+		unique_in_midx(m, ds);
 	for (p = get_packed_git(the_repository); p && !ds->ambiguous;
 	     p = p->next)
 		unique_in_pack(p, ds);
@@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct multi_pack_index *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, first = 0;
+	struct object_id oid;
+	const struct object_id *mad_oid;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	mad_oid = mad->oid;
+	match = bsearch_midx(mad_oid, m, &first);
+
+	/*
+	 * first is now the position in the multi-pack-index where we would
+	 * insert mad->oid if it does not exist (or the position of mad->oid
+	 * if it does exist). Hence, we consider a maximum of two objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		if (nth_midxed_object_oid(&oid, m, first))
+			extend_abbrev_len(&oid, mad);
+	} else if (first < num - 1) {
+		if (nth_midxed_object_oid(&oid, m, first + 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		if (nth_midxed_object_oid(&oid, m, first - 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
+	struct multi_pack_index *m;
 	struct packed_git *p;
 
+	for (m = get_multi_pack_index(the_repository); m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = get_packed_git(the_repository); p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index fc582c9a59..4c630ecab4 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -93,7 +93,8 @@ compare_results_with_midx() {
 	MSG=$1
 	test_expect_success "check normal git operations: $MSG" '
 		midx_git_two_modes "rev-list --objects --all" &&
-		midx_git_two_modes "log --raw"
+		midx_git_two_modes "log --raw" &&
+		midx_git_two_modes "log --oneline"
 	'
 }
 
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v3 20/24] midx: use existing midx when writing new one
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (18 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 19/24] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 21/24] midx: use midx in approximate_object_count Derrick Stolee
                       ` (4 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

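When writing a new multi-pack-index, load the existing one (if any) and
reuse its pack names and object entries, so that only packs not yet
covered need to be read. If no new packs are found, skip the write.

Due to how Windows handles replacing a lockfile when there is an open
handle, create the close_midx() function to close the existing midx
before writing the new one.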

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
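A minimal standalone sketch of the fanout arithmetic that
get_sorted_entries() uses in the diff below to copy entries out of the
existing MIDX one first-byte bucket at a time. It assumes a fanout table
already converted to host byte order (the patch applies ntohl() to the
on-disk big-endian values). The name sketch_fanout_range is
hypothetical.

```c
#include <stdint.h>

/*
 * fanout[b] holds the number of objects whose first hash byte is <= b,
 * so the objects starting with byte b occupy positions [*start, *end).
 */
void sketch_fanout_range(const uint32_t fanout[256], unsigned char first_byte,
			 uint32_t *start, uint32_t *end)
{
	*start = first_byte ? fanout[first_byte - 1] : 0;
	*end = fanout[first_byte];
}
```

Because both the existing MIDX and each new IDX file expose the same
kind of table, the writer can gather all candidates for one bucket,
sort and de-duplicate them, and move on, without holding every object
in a single array at once.
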
 midx.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 midx.h |   1 +
 2 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/midx.c b/midx.c
index e66025f066..7c00b02436 100644
--- a/midx.c
+++ b/midx.c
@@ -181,6 +181,23 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return NULL;
 }
 
+static void close_midx(struct multi_pack_index *m)
+{
+	uint32_t i;
+	munmap((unsigned char *)m->data, m->data_len);
+	close(m->fd);
+	m->fd = -1;
+
+	for (i = 0; i < m->num_packs; i++) {
+		if (m->packs[i]) {
+			close_pack(m->packs[i]);
+			free(m->packs[i]);
+		}
+	}
+	FREE_AND_NULL(m->packs);
+	FREE_AND_NULL(m->pack_names);
+}
+
 static int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id)
 {
 	struct strbuf pack_name = STRBUF_INIT;
@@ -280,6 +297,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mu
 	return nth_midxed_pack_entry(m, e, pos);
 }
 
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(idx_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
 {
 	struct multi_pack_index *m = r->objects->multi_pack_index;
@@ -326,6 +366,7 @@ struct pack_list {
 	uint32_t alloc_list;
 	uint32_t alloc_names;
 	size_t pack_name_concat_len;
+	struct multi_pack_index *m;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -334,6 +375,9 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 	struct pack_list *packs = (struct pack_list *)data;
 
 	if (ends_with(file_name, ".idx")) {
+		if (packs->m && midx_contains_pack(packs->m, file_name))
+			return;
+
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
@@ -418,6 +462,23 @@ static int midx_oid_compare(const void *_a, const void *_b)
 	return a->pack_int_id - b->pack_int_id;
 }
 
+static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
+				      uint32_t *pack_perm,
+				      struct pack_midx_entry *e,
+				      uint32_t pos)
+{
+	if (pos >= m->num_objects)
+		return 1;
+
+	nth_midxed_object_oid(&e->oid, m, pos);
+	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->offset = nth_midxed_offset(m, pos);
+
+	/* consider objects in midx to be from "old" packs */
+	e->pack_mtime = 0;
+	return 0;
+}
+
 static void fill_pack_entry(uint32_t pack_int_id,
 			    struct packed_git *p,
 			    uint32_t cur_object,
@@ -443,7 +504,8 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * Copy only the de-duplicated entries (selected by most-recent modified time
  * of a packfile containing the object).
  */
-static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
+						  struct packed_git **p,
 						  uint32_t *perm,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
@@ -452,8 +514,9 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	uint32_t alloc_fanout, alloc_objects, total_objects = 0;
 	struct pack_midx_entry *entries_by_fanout = NULL;
 	struct pack_midx_entry *deduplicated_entries = NULL;
+	uint32_t start_pack = m ? m->num_packs : 0;
 
-	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++)
+	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++)
 		total_objects += p[cur_pack]->num_objects;
 
 	/*
@@ -470,7 +533,23 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
 		uint32_t nr_fanout = 0;
 
-		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (m) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]);
+			end = ntohl(m->chunk_oid_fanout[cur_fanout]);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				nth_midxed_pack_midx_entry(m, perm,
+							   &entries_by_fanout[nr_fanout],
+							   cur_object);
+				nr_fanout++;
+			}
+		}
+
+		for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
@@ -666,16 +745,34 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	packs.m = load_multi_pack_index(object_dir);
+
 	packs.nr = 0;
-	packs.alloc_list = 16;
-	packs.alloc_names = 16;
+	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
+	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
+	packs.names = NULL;
 	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
+	if (packs.m) {
+		for (i = 0; i < packs.m->num_packs; i++) {
+			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
+			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+
+			packs.list[packs.nr] = NULL;
+			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
+			packs.nr++;
+		}
+	}
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
+	if (packs.m && packs.nr == packs.m->num_packs)
+		goto cleanup;
+
 	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
 					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
@@ -683,7 +780,8 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
-	entries = get_sorted_entries(packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
@@ -695,6 +793,9 @@ int write_midx_file(const char *object_dir)
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
+	if (packs.m)
+		close_midx(packs.m);
+
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
@@ -786,6 +887,7 @@ int write_midx_file(const char *object_dir)
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+cleanup:
 	for (i = 0; i < packs.nr; i++) {
 		if (packs.list[i]) {
 			close_pack(packs.list[i]);
@@ -797,5 +899,7 @@ int write_midx_file(const char *object_dir)
 	free(packs.list);
 	free(packs.names);
 	free(entries);
+	free(pack_perm);
+	free(midx_name);
 	return 0;
 }
diff --git a/midx.h b/midx.h
index f7c2ec7893..5faffb7bc6 100644
--- a/midx.h
+++ b/midx.h
@@ -11,6 +11,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct multi_pack_index *m,
 					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m);
+int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
-- 
2.18.0.118.gd4f65b8d14
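
Editorial sketch (not part of the patch): the new loop in
get_sorted_entries() reads a half-open range [start, end) out of the
existing fanout chunk for each first byte. The helpers below are
hypothetical names that show the same arithmetic on a host-byte-order
table, so the ntohl() calls from the patch are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a cumulative fanout table from the first
 * bytes of n object ids. fanout[b] ends up holding the number of ids
 * whose first byte is <= b.
 */
void build_fanout(const unsigned char *first_bytes, size_t n,
		  uint32_t fanout[256])
{
	uint32_t counts[256] = { 0 };
	uint32_t running = 0;
	size_t i;
	int b;

	for (i = 0; i < n; i++)
		counts[first_bytes[i]]++;
	for (b = 0; b < 256; b++) {
		running += counts[b];
		fanout[b] = running;
	}
}

/*
 * Hypothetical helper: report the half-open range [*start, *end) of
 * positions whose object id begins with first_byte, mirroring the
 * "if (cur_fanout) start = ...; end = ...;" logic in the patch.
 */
void fanout_range(const uint32_t *fanout, unsigned char first_byte,
		  uint32_t *start, uint32_t *end)
{
	*start = first_byte ? fanout[first_byte - 1] : 0;
	*end = fanout[first_byte];
}
```

An empty bucket shows up as start == end, which is why the inner
for-loop in the patch needs no special case for it.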


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v3 21/24] midx: use midx in approximate_object_count
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (19 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 20/24] midx: use existing midx when writing new one Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 22/24] midx: prevent duplicate packfile loads Derrick Stolee
                       ` (3 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/packfile.c b/packfile.c
index c0eb5ac885..97e7812b6b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -861,10 +861,13 @@ unsigned long approximate_object_count(void)
 {
 	if (!the_repository->objects->approximate_object_count_valid) {
 		unsigned long count;
+		struct multi_pack_index *m;
 		struct packed_git *p;
 
 		prepare_packed_git(the_repository);
 		count = 0;
+		for (m = get_multi_pack_index(the_repository); m; m = m->next)
+			count += m->num_objects;
 		for (p = the_repository->objects->packed_git; p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
-- 
2.18.0.118.gd4f65b8d14


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v3 22/24] midx: prevent duplicate packfile loads
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (20 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 21/24] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
                       ` (2 subsequent siblings)
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

The multi-pack-index, when present, tracks the existence of objects and
their offsets within a list of packfiles. This allows us to use the
multi-pack-index for object lookups, abbreviations, and object counts.

When the multi-pack-index tracks a packfile, then we do not need to add
that packfile to the packed_git linked list or the MRU list.

We still need to load the packfiles that are not tracked by the
multi-pack-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/packfile.c b/packfile.c
index 97e7812b6b..2c819a0ad8 100644
--- a/packfile.c
+++ b/packfile.c
@@ -795,6 +795,7 @@ struct prepare_pack_data {
 	struct repository *r;
 	struct string_list *garbage;
 	int local;
+	struct multi_pack_index *m;
 };
 
 static void prepare_pack(const char *full_name, size_t full_name_len,
@@ -805,6 +806,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	size_t base_len = full_name_len;
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		if (data->m && midx_contains_pack(data->m, file_name))
+			return;
 		/* Don't reopen a pack we already have. */
 		for (p = data->r->objects->packed_git; p; p = p->next) {
 			size_t len;
@@ -839,6 +842,12 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	struct prepare_pack_data data;
 	struct string_list garbage = STRING_LIST_INIT_DUP;
 
+	data.m = r->objects->multi_pack_index;
+
+	/* look for the multi-pack-index for this object directory */
+	while (data.m && strcmp(data.m->object_dir, objdir))
+		data.m = data.m->next;
+
 	data.r = r;
 	data.garbage = &garbage;
 	data.local = local;
-- 
2.18.0.118.gd4f65b8d14
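
Editorial sketch (not part of the patch): the two steps above are
"find the multi-pack-index that belongs to this object directory" and
"skip the .idx file if that multi-pack-index already tracks it". The
struct and function names below are hypothetical stand-ins. The binary
search is an assumption based on the pack names being sorted at write
time; the body of the real midx_contains_pack() is not shown in this
part of the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct multi_pack_index. */
struct midx_sketch {
	struct midx_sketch *next;
	const char *object_dir;
	const char **pack_names;	/* assumed sorted with strcmp() */
	uint32_t num_packs;
};

/* Walk the list until the entry for objdir is found; NULL if none. */
struct midx_sketch *find_midx_for_dir(struct midx_sketch *m,
				      const char *objdir)
{
	while (m && strcmp(m->object_dir, objdir))
		m = m->next;
	return m;
}

/* Binary search over the sorted names; returns 1 if idx_name is tracked. */
int midx_sketch_contains_pack(const struct midx_sketch *m,
			      const char *idx_name)
{
	uint32_t first = 0, last = m->num_packs;

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = strcmp(idx_name, m->pack_names[mid]);

		if (!cmp)
			return 1;
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return 0;
}
```

A caller in the style of prepare_pack() would return early when
find_midx_for_dir() yields a non-NULL entry and
midx_sketch_contains_pack() reports 1 for the file name.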


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v3 23/24] packfile: skip loading index if in multi-pack-index
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (21 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 22/24] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  0:53     ` [PATCH v3 24/24] midx: clear midx on repack Derrick Stolee
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
  24 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 2c819a0ad8..e6ecf12ab5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -469,8 +469,19 @@ static int open_packed_git_1(struct packed_git *p)
 	ssize_t read_result;
 	const unsigned hashsz = the_hash_algo->rawsz;
 
-	if (!p->index_data && open_pack_index(p))
-		return error("packfile %s index unavailable", p->pack_name);
+	if (!p->index_data) {
+		struct multi_pack_index *m;
+		const char *pack_name = strrchr(p->pack_name, '/');
+
+		for (m = the_repository->objects->multi_pack_index;
+		     m; m = m->next) {
+			if (midx_contains_pack(m, pack_name))
+				break;
+		}
+
+		if (!m && open_pack_index(p))
+			return error("packfile %s index unavailable", p->pack_name);
+	}
 
 	if (!pack_max_fds) {
 		unsigned int max_fds = get_max_fd_limit();
@@ -521,6 +532,10 @@ static int open_packed_git_1(struct packed_git *p)
 			" supported (try upgrading GIT to a newer version)",
 			p->pack_name, ntohl(hdr.hdr_version));
 
+	/* Skip index checking if in multi-pack-index */
+	if (!p->index_data)
+		return 0;
+
 	/* Verify the pack matches its index. */
 	if (p->num_objects != ntohl(hdr.hdr_entries))
 		return error("packfile %s claims to have %"PRIu32" objects"
-- 
2.18.0.118.gd4f65b8d14


^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH v3 24/24] midx: clear midx on repack
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (22 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 23/24] packfile: skip loading index if in multi-pack-index Derrick Stolee
@ 2018-07-06  0:53     ` Derrick Stolee
  2018-07-06  5:52       ` Eric Sunshine
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
  24 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-06  0:53 UTC (permalink / raw)
  To: git; +Cc: gitster, sbeller, pclouds, avarab, dstolee

If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c |  8 ++++++++
 midx.c           | 12 ++++++++++++
 midx.h           |  1 +
 3 files changed, 21 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int midx_cleared = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				continue;
 			}
 
+			if (!midx_cleared) {
+				/* if we move a packfile, it will invalidate the midx */
+				clear_midx_file(get_object_directory());
+				midx_cleared = 1;
+			}
+
 			fname_old = mkpathdup("%s/old-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index 7c00b02436..8afd08f3fe 100644
--- a/midx.c
+++ b/midx.c
@@ -903,3 +903,15 @@ int write_midx_file(const char *object_dir)
 	free(midx_name);
 	return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+	char *midx = get_midx_filename(object_dir);
+
+	if (remove_path(midx)) {
+		UNLEAK(midx);
+		die(_("failed to clear multi-pack-index at %s"), midx);
+	}
+
+	free(midx);
+}
diff --git a/midx.h b/midx.h
index 5faffb7bc6..5a42cbed1d 100644
--- a/midx.h
+++ b/midx.h
@@ -15,5 +15,6 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.118.gd4f65b8d14


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 03/24] multi-pack-index: add builtin
  2018-07-06  0:53     ` [PATCH v3 03/24] multi-pack-index: add builtin Derrick Stolee
@ 2018-07-06  3:54       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  3:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:53 PM Derrick Stolee <stolee@gmail.com> wrote:
> This new 'git multi-pack-index' builtin will be the plumbing access
> for writing, reading, and checking multi-pack-index files. The
> initial implementation is a no-op.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> +SYNOPSIS
> +--------
> +'git multi-pack-index' [--object-dir <dir>]

In Git documentation, this is more typically written: [--object-dir=<dir>]

> +OPTIONS
> +-------
> +--object-dir <dir>::

Ditto: --object-dir=<dir>::

> +       Use given directory for the location of Git objects. We check
> +       `<dir>/packs/multi-pack-index` for the current MIDX file, and
> +       `<dir>/packs` for the pack-files to index.
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> @@ -0,0 +1,38 @@
> +static char const * const builtin_multi_pack_index_usage[] = {
> +       N_("git multi-pack-index [--object-dir <dir>]"),

Likewise.

> +int cmd_multi_pack_index(int argc, const char **argv,
> +                        const char *prefix)
> +{
> +       static struct option builtin_multi_pack_index_options[] = {
> +               OPT_FILENAME(0, "object-dir", &opts.object_dir,
> +                 N_("The object directory containing set of packfile and pack-index pairs")),

It's more typical not to capitalize these. Also, keep them short, if
possible, so perhaps drop "The".

> +               OPT_END(),
> +       };
> +
> +       if (argc == 2 && !strcmp(argv[1], "-h"))
> +               usage_with_options(builtin_multi_pack_index_usage,
> +                                  builtin_multi_pack_index_options);

Unless you are planning on adding a short "-h <something>" option
later in the series, you can do away with this conditional
altogether since the below parse_options() will give you "-h" as help
for free.

> +       git_config(git_default_config, NULL);
> +
> +       argc = parse_options(argc, argv, prefix,
> +                            builtin_multi_pack_index_options,
> +                            builtin_multi_pack_index_usage, 0);

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 04/24] multi-pack-index: add 'write' verb
  2018-07-06  0:53     ` [PATCH v3 04/24] multi-pack-index: add 'write' verb Derrick Stolee
@ 2018-07-06  4:07       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  4:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:53 PM Derrick Stolee <stolee@gmail.com> wrote:
> In anticipation of writing multi-pack-indexes, add a
> 'git multi-pack-index write' subcommand and send the options to a
> write_midx_file() method.

Since the 'write' command is a no-op at this point, perhaps say so in
the commit message. Something like:

    ... add a skeleton 'git multi-pack-index write' subcommand,
    which will be fleshed out by a later commit.

The bit about sending options to write_midx_file() is superfluous;
it's a mere implementation detail which is clearly seen by reading the
patch.

> Also create a basic test file that tests
> the 'write' subcommand.

Maybe: s/file/script

And, as above, perhaps mention that this is a _skeleton_ test script
so as to avoid confusing readers into thinking that something
significant is happening at this stage.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> +* Write a MIDX file for the packfiles in an alternate.

In an alternate what?

> +-----------------------------------------------
> +$ git multi-pack-index --object-dir <alt> write
> +-----------------------------------------------
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> @@ -2,9 +2,10 @@
>  static char const * const builtin_multi_pack_index_usage[] = {
> -       N_("git multi-pack-index [--object-dir <dir>]"),
> +       N_("git multi-pack-index [--object-dir <dir>] [write]"),

Is there going to be some default behavior when no verb is provided?
The below implementation seems to suggest that the verb is required,
so this probably ought to be typeset as:

    git multi-pack-index [--object-dir=<dir>] write

Later, when you add more (mutually exclusive) verbs, change the typesetting to:

    git multi-pack-index [--object-dir=<dir>] (write|...|...)

Alternately, just use:

    git multi-pack-index [--object-dir=<dir>] <verb>

> @@ -34,5 +35,12 @@ int cmd_multi_pack_index(int argc, const char **argv,
> +       if (argc == 0)
> +               usage_with_options(builtin_multi_pack_index_usage,
> +                                  builtin_multi_pack_index_options);
> +
> +       if (!strcmp(argv[0], "write"))
> +               return write_midx_file(opts.object_dir);
> +
>         return 0;

This should be throwing an error when an unrecognized verb is provided.

It also should be throwing an error when 'write' is given more
arguments than it accepts (which, at this point, appears to be none).
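
Editorial sketch (not from the thread): one way the dispatch could
reject a missing verb, an unknown verb, and extra arguments. Function
names, return values, and message wording here are hypothetical;
write_midx_file_sketch() merely stands in for write_midx_file().

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for write_midx_file(); always succeeds here. */
int write_midx_file_sketch(const char *object_dir)
{
	(void)object_dir;
	return 0;
}

/*
 * argc/argv are what remains after option parsing, so argv[0] is the
 * verb. Returns 0 on success and -1 on any usage error.
 */
int dispatch_verb_sketch(int argc, const char **argv, const char *object_dir)
{
	if (argc == 0) {
		fprintf(stderr, "usage: git multi-pack-index [--object-dir=<dir>] write\n");
		return -1;
	}
	if (argc > 1) {
		fprintf(stderr, "error: too many parameters\n");
		return -1;
	}
	if (!strcmp(argv[0], "write"))
		return write_midx_file_sketch(object_dir);

	fprintf(stderr, "error: unrecognized verb: %s\n", argv[0]);
	return -1;
}
```

In the real builtin the error paths would more likely go through
usage_with_options() or die() than return a code.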

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 06/24] multi-pack-index: load into memory
  2018-07-06  0:53     ` [PATCH v3 06/24] multi-pack-index: load into memory Derrick Stolee
@ 2018-07-06  4:19       ` Eric Sunshine
  2018-07-06  5:18         ` Eric Sunshine
  2018-07-09 19:08       ` Junio C Hamano
  1 sibling, 1 reply; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  4:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:53 PM Derrick Stolee <stolee@gmail.com> wrote:
> Create a new multi_pack_index struct for loading multi-pack-indexes into
> memory. Create a test-tool builtin for reading basic information about
> that multi-pack-index to verify the correct data is written.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
> @@ -0,0 +1,34 @@
> +/*
> + * test-mktemp.c: code to exercise the creation of temporary files
> + */

Meh. Copy/paste botch.

> +static int read_midx_file(const char *object_dir)
> +{
> +       struct multi_pack_index *m = load_multi_pack_index(object_dir);
> +
> +       if (!m)
> +               return 0;

Should this 'return 0' be a die() or BUG() or something?

> +       printf("header: %08x %d %d %d\n",
> +              m->signature,
> +              m->version,
> +              m->num_chunks,
> +              m->num_packs);
> +
> +       printf("object_dir: %s\n", m->object_dir);
> +
> +       return 0;
> +}
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -3,9 +3,19 @@
> +midx_read_expect () {
> +       cat >expect <<-EOF

I guess you're planning on interpolating some variables here in a
later patch, which is why you used -EOF rather than -\EOF?

> +       header: 4d494458 1 0 0
> +       object_dir: .
> +       EOF
> +       test-tool read-midx . >actual &&
> +       test_cmp expect actual
> +}
> +
>  test_expect_success 'write midx with no packs' '
>         git multi-pack-index --object-dir=. write &&
> -       test_path_is_file pack/multi-pack-index
> +       test_path_is_file pack/multi-pack-index &&
> +       midx_read_expect
>  '

Kind of a do-nothing change. I wonder if this step would better be
delayed until a later patch. (Not necessarily a big deal.)

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-06  0:53     ` [PATCH v3 07/24] multi-pack-index: expand test data Derrick Stolee
@ 2018-07-06  4:36       ` Eric Sunshine
  2018-07-06  5:20         ` Eric Sunshine
  2018-07-12 14:10         ` Derrick Stolee
  0 siblings, 2 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  4:36 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> multi-pack-index: expand test data

Since this patch is touching only t5319, a more typical title would be:

    t5319: expand test data

> As we build the multi-pack-index file format, we want to test the format
> on real repoasitories. Add tests to t5319-multi-pack-index.sh that

s/repoasitories/repositories/

And, since the title now mentions t5319, this can become simply:

    Add tests that...

> create repository data including multiple packfiles with both version 1
> and version 2 formats.
>
> The current 'git multi-pack-index write' command will always write the
> same file with no "real" data. This will be expanded in future commits,
> along with the test expectations.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -13,9 +13,108 @@ midx_read_expect () {
>  test_expect_success 'write midx with no packs' '
> +       test_when_finished rm -f pack/multi-pack-index &&

Should this test_when_finished() have been added in an earlier patch?
It seems out of place here.

>         git multi-pack-index --object-dir=. write &&
>         test_path_is_file pack/multi-pack-index &&
>         midx_read_expect
>  '
>
> +test_expect_success 'create objects' '
> +       for i in $(test_seq 1 5)
> +       do
> +               iii=$(printf '%03i' $i)
> +               test-tool genrandom "bar" 200 >wide_delta_$iii &&
> +               test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&

Alternately:

    {
        test-tool genrandom "bar" 200 &&
        test-tool genrandom "baz $iii" 50
    } >wide_delta_$iii &&

which makes it easier to see at a glance that both commands are
populating the same file. Same comment for the other files. (Not worth
a re-roll.)

> +               test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
> +               test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
> +               test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&

Or, just use POSIX arithmetic expansion:

    $(( $i + 1 ))

> +               echo $iii >file_$iii &&
> +               test-tool genrandom "$iii" 8192 >>file_$iii &&
> +               git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
> +               i=$(expr $i + 1) || return 1

Ditto, POSIX arithmetic expansion:

    i=$(( $i + 1 ))

(Not worth a re-roll.)

> +       done &&
> +       { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
> +       git update-index --add file_101 &&
> +       tree=$(git write-tree) &&
> +       commit=$(git commit-tree $tree </dev/null) && {
> +       echo $tree &&
> +       git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)        .*/\\1/"
> +       } >obj-list &&

Perhaps indent the content of the {...} block?

> +       git update-ref HEAD $commit
> +'
> +
> +test_expect_success 'write midx with one v1 pack' '
> +       pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
> +       test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
> +       git multi-pack-index --object-dir=. write &&
> +       midx_read_expect
> +'

It's odd to see all these tests ending by creating an 'expect' file
but not actually doing anything with that file.

> +test_expect_success 'Add more objects' '

s/Add/add/

> +       for i in $(test_seq 6 10)
> +       do
> +               iii=$(printf '%03i' $i)
> +               test-tool genrandom "bar" 200 >wide_delta_$iii &&
> +               test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
> +               test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
> +               test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
> +               test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
> +               echo $iii >file_$iii &&
> +               test-tool genrandom "$iii" 8192 >>file_$iii &&
> +               git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
> +               i=$(expr $i + 1) || return 1
> +       done &&
> +       { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
> +       git update-index --add file_101 &&
> +       tree=$(git write-tree) &&
> +       commit=$(git commit-tree $tree -p HEAD</dev/null) && {
> +       echo $tree &&
> +       git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)        .*/\\1/"
> +       } >obj-list2 &&
> +       git update-ref HEAD $commit
> +'

There seems to be a fair bit of duplication in these tests which
create objects. Is it possible to factor out some of this code into a
shell function?

> +test_expect_success 'write midx with two packs' '
> +       git pack-objects --index-version=1 pack/test-2 <obj-list2 &&
> +       git multi-pack-index --object-dir=. write &&
> +       midx_read_expect
> +'
> +
> +test_expect_success 'Add more packs' '

s/Add/add/

> +       for j in $(test_seq 1 10)
> +       do
> +               [...]
> +       done
> +'

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 11/24] midx: read pack names into array
  2018-07-06  0:53     ` [PATCH v3 11/24] midx: read pack names into array Derrick Stolee
@ 2018-07-06  4:58       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  4:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -8,8 +8,13 @@ midx_read_expect () {
>         cat >expect <<-EOF

Broken &&-chain.

>         header: 4d494458 1 1 $NUM_PACKS
>         chunks: pack_names
> -       object_dir: .
> +       packs:
>         EOF
> +       if [ $NUM_PACKS -ge 1 ]

On this project, use 'test' rather than '['.

> +       then
> +               ls pack/ | grep idx | sort >> expect
> +       fi

Broken &&-chain.

> +       printf "object_dir: .\n" >>expect &&

All this code building up 'expect' could be in a {...} block to make
it clearer and less noisy:

    {
        cat <<-EOF &&
        ...
        EOF
        if test $NUM_PACKS -ge 1
        then
            ls -1 pack/ | ...
        fi &&
        printf "..."
    } >expect &&

And, some pointless bike-shedding while here: perhaps dashes instead
of underscores? "pack-names", "object-dir"

>         test-tool read-midx . >actual &&
>         test_cmp expect actual
>  }

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 13/24] midx: write object ids in a chunk
  2018-07-06  0:53     ` [PATCH v3 13/24] midx: write object ids in a chunk Derrick Stolee
@ 2018-07-06  5:04       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/midx.c b/midx.c
> @@ -18,9 +18,10 @@
> @@ -384,6 +391,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
> +static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
> +                                   struct pack_midx_entry *objects,
> +                                   uint32_t nr_objects)
> +{
> +       struct pack_midx_entry *list = objects;
> +       uint32_t i;
> +       size_t written = 0;
> +
> +       for (i = 0; i < nr_objects; i++) {
> +               struct pack_midx_entry *obj = list++;
> +
> +               if (i < nr_objects - 1) {
> +                       struct pack_midx_entry *next = list;
> +                       if (oidcmp(&obj->oid, &next->oid) >= 0)
> +                               BUG("OIDs not in order: %s >= %s",
> +                               oid_to_hex(&obj->oid),
> +                               oid_to_hex(&next->oid));

The above two lines are arguments to BUG(), thus should be indented more.

> +               }
> +
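
Editorial sketch (not from the thread): the quoted loop asserts that
the object ids are strictly increasing before they are written, which
also rules out duplicates. The standalone version below uses a flat
byte buffer and a hypothetical 20-byte id length instead of struct
pack_midx_entry and oidcmp().

```c
#include <stddef.h>
#include <string.h>

#define OID_RAWSZ_SKETCH 20	/* SHA-1 length; an assumption for this sketch */

/*
 * oids holds n ids back to back, OID_RAWSZ_SKETCH bytes each.
 * Return the index of the first id that is not strictly smaller than
 * its successor, or n if the whole list is strictly increasing.
 */
size_t first_out_of_order(const unsigned char *oids, size_t n)
{
	size_t i;

	for (i = 0; i + 1 < n; i++) {
		const unsigned char *cur = oids + i * OID_RAWSZ_SKETCH;
		const unsigned char *next = cur + OID_RAWSZ_SKETCH;

		if (memcmp(cur, next, OID_RAWSZ_SKETCH) >= 0)
			return i;
	}
	return n;
}
```

Writing the bound as "i + 1 < n" keeps the loop safe for an empty list,
where "n - 1" would wrap around for an unsigned type.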

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 06/24] multi-pack-index: load into memory
  2018-07-06  4:19       ` Eric Sunshine
@ 2018-07-06  5:18         ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Fri, Jul 6, 2018 at 12:19 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Thu, Jul 5, 2018 at 8:53 PM Derrick Stolee <stolee@gmail.com> wrote:
> > +midx_read_expect () {
> > +       cat >expect <<-EOF
> > +       header: 4d494458 1 0 0
> > +       object_dir: .
> > +       EOF
> > +       test-tool read-midx . >actual &&
> > +       test_cmp expect actual
> > +}
> > +
> >  test_expect_success 'write midx with no packs' '
> >         git multi-pack-index --object-dir=. write &&
> > -       test_path_is_file pack/multi-pack-index
> > +       test_path_is_file pack/multi-pack-index &&
> > +       midx_read_expect
>
> Kind of a do-nothing change. I wonder if this step would better be
> delayed until a later patch. (Not necessarily a big deal.)

Never mind. I missed that midx_read_expect() is comparing the 'expect'
and 'actual' files, so this is not a do-nothing change. The function
name may have helped to mislead me (or I was just unobservant or too
focused on its creation of the 'expect' file). I wonder if a different
name would have helped. midx_check(), midx_verify()? Meh.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-06  4:36       ` Eric Sunshine
@ 2018-07-06  5:20         ` Eric Sunshine
  2018-07-12 14:10         ` Derrick Stolee
  1 sibling, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Fri, Jul 6, 2018 at 12:36 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> > +test_expect_success 'write midx with one v1 pack' '
> > +       pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
> > +       test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
> > +       git multi-pack-index --object-dir=. write &&
> > +       midx_read_expect
> > +'
>
> It's odd to see all these tests ending by creating an 'expect' file
> but not actually doing anything with that file.

Ignore this comment. As mentioned in my follow-up to 6/24, I missed
the fact that midx_read_expect() is doing more than just creating the
'expect' file.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH v3 15/24] midx: write object offsets
  2018-07-06  0:53     ` [PATCH v3 15/24] midx: write object offsets Derrick Stolee
@ 2018-07-06  5:27       ` Eric Sunshine
  2018-07-12 16:33         ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> The final pair of chunks for the multi-pack-index file stores the object
> offsets. We default to using 32-bit offsets as in the pack-index version
> 1 format, but if there exists an offset larger than 32-bits, we use a
> trick similar to the pack-index version 2 format by storing all offsets
> at least 2^31 in a 64-bit table; we use the 32-bit table to point into
> that 64-bit table as necessary.
>
> We only store these 64-bit offsets if necessary, so create a test that
> manipulates a version 2 pack-index to fake a large offset. This allows
> us to test that the large offset table is created, but the data does not
> match the actual packfile offsets. The multi-pack-index offset does match
> the (corrupted) pack-index offset, so a future feature will compare these
> offsets during a 'verify' step.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -6,25 +6,28 @@ test_description='multi-pack-indexes'
> +# usage: corrupt_data <file> <pos> [<data>]
> +corrupt_data() {

Style: corrupt_data () {

> +       file=$1
> +       pos=$2
> +       data="${3:-\0}"
> +       printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
> +}
> +
> +# Force 64-bit offsets by manipulating the idx file.
> +# This makes the IDX file _incorrect_ so be careful to clean up after!
> +test_expect_success 'force some 64-bit offsets with pack-objects' '
> +       mkdir objects64 &&
> +       mkdir objects64/pack &&
> +       pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
> +       idx64=objects64/pack/test-64-$pack64.idx &&
> +       chmod u+w $idx64 &&

I guess you don't have to worry about the POSIXPERM prerequisite here
because the file is already writable on Windows, right?

> +       corrupt_data $idx64 2899 "\02" &&
> +       midx64=$(git multi-pack-index write --object-dir=objects64) &&
> +       midx_read_expect 1 62 5 objects64 " large_offsets"
>  '
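
For readers following along, a minimal standalone sketch of what
corrupt_data does, using the helper as quoted above (with the
'corrupt_data () {' spacing). The scratch file and its contents are
assumptions for illustration only; the real test patches byte 2899 of a
pack-index. Silencing dd's diagnostics is also an addition that the
quoted helper does not have.

```shell
#!/bin/sh
# usage: corrupt_data <file> <pos> [<data>]
# Overwrites bytes in place starting at byte offset <pos>. conv=notrunc
# keeps the rest of the file intact. <data> is used as a printf format
# string, so "\02" means the single byte 0x02; the default is one NUL.
corrupt_data () {
	file=$1
	pos=$2
	data="${3:-\0}"
	# 2>/dev/null is an addition for this sketch, to hide dd statistics.
	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc 2>/dev/null
}

# Demo on a scratch file (hypothetical), not on a real pack-index.
demo=$(mktemp) &&
printf 'ABCDEF' >"$demo" &&
corrupt_data "$demo" 2 "xy" &&
cat "$demo"
```

Only the two bytes at offsets 2 and 3 are replaced; the file length and
the remaining bytes are untouched, which is what lets the test fake a
large offset without rewriting the whole idx file.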


* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-06  0:53     ` [PATCH v3 16/24] config: create core.multiPackIndex setting Derrick Stolee
@ 2018-07-06  5:39       ` Eric Sunshine
  2018-07-12 13:19         ` Derrick Stolee
  2018-07-11  9:48       ` SZEDER Gábor
  1 sibling, 1 reply; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> The core.multiPackIndex config setting controls the multi-pack-
> index (MIDX) feature. If false, the setting will disable all reads
> from the multi-pack-index file.
>
> Add comparison commands in t5319-multi-pack-index.sh to check
> typical Git behavior remains the same as the config setting is turned
> on and off. This currently includes 'git rev-list' and 'git log'
> commands to trigger several object database reads.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> diff --git a/cache.h b/cache.h
> @@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
> +extern int core_multi_pack_index;
> diff --git a/config.c b/config.c
> @@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
> +       if (!strcmp(var, "core.multipackindex")) {
> +               core_multi_pack_index = git_config_bool(var, value);
> +               return 0;
> +       }

This is a rather unusual commit. This new configuration is assigned,
but it's never actually consulted, which means that it's impossible
for it to have any impact on functionality, yet tests are being added
to check whether it _did_ have any impact on functionality. Confusing.

Patch 17 does consult 'core_multi_pack_index', so it's only at that
point that it could have any impact. This situation would be less
confusing if you swapped patches 16 and 17 (and, of course, move the
declaration of 'core_multi_pack_index' to patch 17 with a reasonable
default value).

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -61,12 +63,42 @@ test_expect_success 'write midx with one v1 pack' '
> +midx_git_two_modes() {
> +       git -c core.multiPackIndex=false $1 >expect &&
> +       git -c core.multiPackIndex=true $1 >actual &&
> +       test_cmp expect actual
> +}
> +
> +compare_results_with_midx() {
> +       MSG=$1
> +       test_expect_success "check normal git operations: $MSG" '
> +               midx_git_two_modes "rev-list --objects --all" &&
> +               midx_git_two_modes "log --raw"
> +       '
> +}

Here, you define midx_git_two_modes() and compare_results_with_midx()...

>  test_expect_success 'write midx with one v2 pack' '
> -       git pack-objects --index-version=2,0x40 pack/test <obj-list &&
> -       git multi-pack-index --object-dir=. write &&
> -       midx_read_expect 1 17 4 .
> +       git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
> +       git multi-pack-index --object-dir=$objdir write &&
> +       midx_read_expect 1 17 4 $objdir
>  '
>
> +midx_git_two_modes() {
> +       git -c core.multiPackIndex=false $1 >expect &&
> +       git -c core.multiPackIndex=true $1 >actual &&
> +       test_cmp expect actual
> +}
> +
> +compare_results_with_midx() {
> +       MSG=$1
> +       test_expect_success "check normal git operations: $MSG" '
> +               midx_git_two_modes "rev-list --objects --all" &&
> +               midx_git_two_modes "log --raw"
> +       '
> +}

... and here you define both functions again.

> +
> +compare_results_with_midx "one v2 pack"


* Re: [PATCH v3 17/24] midx: prepare midxed_git struct
  2018-07-06  0:53     ` [PATCH v3 17/24] midx: prepare midxed_git struct Derrick Stolee
@ 2018-07-06  5:41       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> midx: prepare midxed_git struct

What's a "midxed_git"? I don't see it in the code anywhere.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>


* Re: [PATCH v3 24/24] midx: clear midx on repack
  2018-07-06  0:53     ` [PATCH v3 24/24] midx: clear midx on repack Derrick Stolee
@ 2018-07-06  5:52       ` Eric Sunshine
  0 siblings, 0 replies; 192+ messages in thread
From: Eric Sunshine @ 2018-07-06  5:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
> If a 'git repack' command replaces existing packfiles, then we must
> clear the existing multi-pack-index before moving the packfiles it
> references.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/repack.c |  8 ++++++++
>  midx.c           | 12 ++++++++++++
>  midx.h           |  1 +
>  3 files changed, 21 insertions(+)

This seems like a pretty important bit of functionality. Is there any
way to add a test of it to t5319 or would it be too difficult (or not
important enough)?


* Re: [PATCH v2 06/24] multi-pack-index: load into memory
  2018-07-05 18:58         ` Eric Sunshine
@ 2018-07-06 19:20           ` Junio C Hamano
  0 siblings, 0 replies; 192+ messages in thread
From: Junio C Hamano @ 2018-07-06 19:20 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Derrick Stolee, Git List, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Thu, Jul 5, 2018 at 10:20 AM Derrick Stolee <stolee@gmail.com> wrote:
>> On 6/25/2018 3:38 PM, Junio C Hamano wrote:
>> While I don't use substitutions in this patch, I do use them in later
>> patches. Here is the final version of this method:
>>
>> midx_read_expect () {
>>          NUM_PACKS=$1
>>          NUM_OBJECTS=$2
>>          NUM_CHUNKS=$3
>>          EXTRA_CHUNKS="$5"
>>          cat >expect <<-\EOF
>>          header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
>>          chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
>>          num_objects: $NUM_OBJECTS
>>          packs:
>>          EOF
>>
>> Using <<-\EOF causes these substitutions to fail. Is there a different
>> way I should construct this method?
>
> When you need to interpolate variables into the here-doc, use <<-EOF;
> when you don't, use <<-\EOF.

I think what was said is "in an early step there is no need to
interpolate but the same here-doc will need substitution in a later
step, and that is why I started an early step with a form without
quoting", which is different from "when should I use the form with
and without quoting?"

I think a reasonable response would have been "then please do use
the quoted form in the early step to let reviewers know that there
are not yet any substitutions, and then switch to the quote-less form
at the step that starts needing substitution.  That way, reviewers can
see that the test started to become 'interesting', by interpolating
variable bits in the test vector, when the line with '<<EOF' appears
in the patch as modified".
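
To make the two forms concrete, here is a small sketch of the difference
between a quoted and an unquoted here-doc delimiter. NUM_PACKS and the
function names are made up for this illustration and are not taken from
the patch. The '<<-' variants discussed above behave the same way with
respect to expansion; they additionally strip leading tabs from the
body, which is left out here so the example does not depend on tab
characters surviving mail transport.

```shell
#!/bin/sh
# Illustrative value only.
NUM_PACKS=3

# Quoted delimiter (<<\EOF): the body is taken literally, nothing is
# expanded.
quoted_heredoc () {
	cat <<\EOF
packs: $NUM_PACKS
EOF
}

# Unquoted delimiter (<<EOF): the shell expands $NUM_PACKS in the body.
unquoted_heredoc () {
	cat <<EOF
packs: $NUM_PACKS
EOF
}

quoted_heredoc     # prints: packs: $NUM_PACKS
unquoted_heredoc   # prints: packs: 3
```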




* Re: [PATCH v3 06/24] multi-pack-index: load into memory
  2018-07-06  0:53     ` [PATCH v3 06/24] multi-pack-index: load into memory Derrick Stolee
  2018-07-06  4:19       ` Eric Sunshine
@ 2018-07-09 19:08       ` Junio C Hamano
  2018-07-12 16:06         ` Derrick Stolee
  1 sibling, 1 reply; 192+ messages in thread
From: Junio C Hamano @ 2018-07-09 19:08 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, pclouds, avarab, dstolee

Derrick Stolee <stolee@gmail.com> writes:

> +struct multi_pack_index *load_multi_pack_index(const char *object_dir)
> +{
> +	struct multi_pack_index *m = NULL;
> +	int fd;
> +	struct stat st;
> +	size_t midx_size;
> +	void *midx_map = NULL;
> +	uint32_t hash_version;
> +	char *midx_name = get_midx_filename(object_dir);
> +
> +	fd = git_open(midx_name);
> +
> +	if (fd < 0)
> +		goto cleanup_fail;
> +	if (fstat(fd, &st)) {
> +		error_errno(_("failed to read %s"), midx_name);
> +		goto cleanup_fail;
> +	}
> +
> +	midx_size = xsize_t(st.st_size);
> +
> +	if (midx_size < MIDX_MIN_SIZE) {
> +		close(fd);

With the use of "do things normally and jump to cleanup-fail label"
pattern, I think you do not want the close() here (unless you also
assign -1 to fd yourself, but that is a pointless workaround).
Another goto we see above after fstat() failure correctly omits it.

> +		error(_("multi-pack-index file %s is too small"), midx_name);
> +		goto cleanup_fail;
> +	}
> +
> +	FREE_AND_NULL(midx_name);

This correctly calls free-and-null not just free (otherwise we'd
break the cleanup-fail procedure below), which is good.

> +	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> +	strcpy(m->object_dir, object_dir);

Hmph, I thought we had FLEX_ALLOC_*() convenience functions exactly
for doing things like this more safely.

> +	m->fd = fd;
> +	m->data = midx_map;
> +	m->data_len = midx_size;
> +
> +	m->signature = get_be32(m->data);
> +	if (m->signature != MIDX_SIGNATURE) {
> +		error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
> +		      m->signature, MIDX_SIGNATURE);
> +		goto cleanup_fail;
> +	}
> +
> +	m->version = m->data[MIDX_BYTE_FILE_VERSION];
> +	if (m->version != MIDX_VERSION) {
> +		error(_("multi-pack-index version %d not recognized"),
> +		      m->version);
> +		goto cleanup_fail;
> +	}
> +
> +	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
> +	if (hash_version != MIDX_HASH_VERSION) {
> +		error(_("hash version %u does not match"), hash_version);
> +		goto cleanup_fail;
> +	}
> +	m->hash_len = MIDX_HASH_LEN;
> +
> +	m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
> +
> +	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
> +
> +	return m;

> +cleanup_fail:
> +	/* no need to check for NULL when freeing */

I wonder who the target reader of this comment is.  We certainly are
not in the business of C language newbies.

If this _were_ a commit that looked like this:

	-	if (ptr)
	-		free(ptr);
	+	/* no need to check for NULL when freeing */
	+	free(ptr);

then it might be more understandable, but it still is wrong (such a
comment does not help understanding the new code, which is the only
thing the people who read the comment see, without knowing what was
there previously---it belongs in the commit log message as a rationale
to make that change).

> diff --git a/midx.h b/midx.h
> index dbdbe9f873..2d83dd9ec1 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -1,6 +1,10 @@
>  #ifndef __MIDX_H__
>  #define __MIDX_H__
>  
> +struct multi_pack_index;

I actually was quite surprised that this struct is defined in
object-store.h and not here.  It feels the other way around.

The raw_object_store needs to know that such an in-core structure
might exist as an optional feature in an object store, but as an
optional feature, I suspect that it has a pointer to an instance of
multi_pack_index, instead of embedding the struct itself in it, so I
would have expected to see an "I am only telling you that there is a
struct with this name, but I am leaving it opaque as you do not have
any business looking inside the struct yourself.  You only need to
be aware of the type's existence and a pointer to it so that you can
call helpers that know what's inside and that should be sufficient
for your needs." decl like this in object-store.h and instead an
actual implementation in here.


* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-06  0:53     ` [PATCH v3 16/24] config: create core.multiPackIndex setting Derrick Stolee
  2018-07-06  5:39       ` Eric Sunshine
@ 2018-07-11  9:48       ` SZEDER Gábor
  2018-07-12 13:01         ` Derrick Stolee
  1 sibling, 1 reply; 192+ messages in thread
From: SZEDER Gábor @ 2018-07-11  9:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, sbeller, pclouds, avarab, dstolee


> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ab641bf5a9..ab895ebb32 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -908,6 +908,10 @@ core.commitGraph::
>  	Enable git commit graph feature. Allows reading from the
>  	commit-graph file.
>  
> +core.multiPackIndex::
> +	Use the multi-pack-index file to track multiple packfiles using a
> +	single index. See linkgit:technical/multi-pack-index[1].

The 'linkgit' macro should be used to create links to other man pages,
but 'technical/multi-pack-index' is not a man page and this causes
'make check-docs' to complain:

      LINT lint-docs
  ./config.txt:929: nongit link: technical/multi-pack-index[1]
  Makefile:456: recipe for target 'lint-docs' failed
  make[1]: *** [lint-docs] Error 1


> +
>  core.sparseCheckout::
>  	Enable "sparse checkout" feature. See section "Sparse checkout" in
>  	linkgit:git-read-tree[1] for more information.


* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-11  9:48       ` SZEDER Gábor
@ 2018-07-12 13:01         ` Derrick Stolee
  2018-07-12 13:31           ` SZEDER Gábor
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 13:01 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, gitster, sbeller, pclouds, avarab, dstolee

On 7/11/2018 5:48 AM, SZEDER Gábor wrote:
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index ab641bf5a9..ab895ebb32 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -908,6 +908,10 @@ core.commitGraph::
>>   	Enable git commit graph feature. Allows reading from the
>>   	commit-graph file.
>>   
>> +core.multiPackIndex::
>> +	Use the multi-pack-index file to track multiple packfiles using a
>> +	single index. See linkgit:technical/multi-pack-index[1].
> The 'linkgit' macro should be used to create links to other man pages,
> but 'technical/multi-pack-index' is not a man page and this causes
> 'make check-docs' to complain:
>
>        LINT lint-docs
>    ./config.txt:929: nongit link: technical/multi-pack-index[1]
>    Makefile:456: recipe for target 'lint-docs' failed
>    make[1]: *** [lint-docs] Error 1
>
Thanks for this point. It seems to work using 
"link:technical/multi-pack-index[1]", which is what I'll use in the next 
version.

-Stolee



* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-06  5:39       ` Eric Sunshine
@ 2018-07-12 13:19         ` Derrick Stolee
  2018-07-12 16:30           ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 13:19 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/6/2018 1:39 AM, Eric Sunshine wrote:
> On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>> The core.multiPackIndex config setting controls the multi-pack-
>> index (MIDX) feature. If false, the setting will disable all reads
>> from the multi-pack-index file.
>>
>> Add comparison commands in t5319-multi-pack-index.sh to check
>> typical Git behavior remains the same as the config setting is turned
>> on and off. This currently includes 'git rev-list' and 'git log'
>> commands to trigger several object database reads.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>> diff --git a/cache.h b/cache.h
>> @@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
>> +extern int core_multi_pack_index;
>> diff --git a/config.c b/config.c
>> @@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
>> +       if (!strcmp(var, "core.multipackindex")) {
>> +               core_multi_pack_index = git_config_bool(var, value);
>> +               return 0;
>> +       }
> This is a rather unusual commit. This new configuration is assigned,
> but it's never actually consulted, which means that it's impossible
> for it to have any impact on functionality, yet tests are being added
> to check whether it _did_ have any impact on functionality. Confusing.
>
> Patch 17 does consult 'core_multi_pack_index', so it's only at that
> point that it could have any impact. This situation would be less
> confusing if you swapped patches 16 and 17 (and, of course, move the
> declaration of 'core_multi_pack_index' to patch 17 with a reasonable
> default value).

You're right that this commit is a bit too aware of the future, but I 
disagree with the recommendation to change it.

Yes, in this commit there is no possible way that these tests could 
fail. The point is that patches 17-23 all change behavior if this 
setting is on, and we want to make sure we do not break at any point 
along that journey (or in future iterations of the multi-pack-index 
feature).

With this in mind, I don't think there is a better commit to place these 
tests.
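
The comparison pattern itself can be sketched without a repository. In
the sketch below, FAST_PATH and list_numbers are hypothetical stand-ins
for "git -c core.multiPackIndex=<bool>" and a Git command; only the
shape (run the same command in both modes, then compare the outputs) is
meant to mirror midx_git_two_modes. The mode is set inside a subshell
because a one-shot "VAR=value shell_function" assignment does not behave
the same way in every shell.

```shell
#!/bin/sh
# Hypothetical command that must give the same answer whether or not
# the imaginary FAST_PATH feature is enabled.
list_numbers () {
	if [ "$FAST_PATH" = true ]
	then
		printf '1\n2\n3\n'         # "optimized" path
	else
		printf '%s\n' 1 2 3        # reference path
	fi
}

# Run the same command in both modes; succeed only if outputs match.
two_modes () {
	( FAST_PATH=false; "$@" ) >expect &&
	( FAST_PATH=true; "$@" ) >actual &&
	cmp -s expect actual
}

# Work in a scratch directory so expect/actual do not litter the cwd.
dir=$(mktemp -d) && cd "$dir" &&
two_modes list_numbers && echo "outputs match"
```

A check like this passes trivially while nothing reads the setting, and
starts to carry weight once later changes make the two modes take
different code paths.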

>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> @@ -61,12 +63,42 @@ test_expect_success 'write midx with one v1 pack' '
>> +midx_git_two_modes() {
>> +       git -c core.multiPackIndex=false $1 >expect &&
>> +       git -c core.multiPackIndex=true $1 >actual &&
>> +       test_cmp expect actual
>> +}
>> +
>> +compare_results_with_midx() {
>> +       MSG=$1
>> +       test_expect_success "check normal git operations: $MSG" '
>> +               midx_git_two_modes "rev-list --objects --all" &&
>> +               midx_git_two_modes "log --raw"
>> +       '
>> +}
> Here, you define midx_git_two_modes() and compare_results_with_midx()...
>
>>   test_expect_success 'write midx with one v2 pack' '
>> -       git pack-objects --index-version=2,0x40 pack/test <obj-list &&
>> -       git multi-pack-index --object-dir=. write &&
>> -       midx_read_expect 1 17 4 .
>> +       git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
>> +       git multi-pack-index --object-dir=$objdir write &&
>> +       midx_read_expect 1 17 4 $objdir
>>   '
>>
>> +midx_git_two_modes() {
>> +       git -c core.multiPackIndex=false $1 >expect &&
>> +       git -c core.multiPackIndex=true $1 >actual &&
>> +       test_cmp expect actual
>> +}
>> +
>> +compare_results_with_midx() {
>> +       MSG=$1
>> +       test_expect_success "check normal git operations: $MSG" '
>> +               midx_git_two_modes "rev-list --objects --all" &&
>> +               midx_git_two_modes "log --raw"
>> +       '
>> +}
> ... and here you define both functions again.

This was a mistake. Thanks for catching it.


Thanks,

-Stolee



* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-12 13:01         ` Derrick Stolee
@ 2018-07-12 13:31           ` SZEDER Gábor
  2018-07-12 15:40             ` Derrick Stolee
  2018-07-12 17:29             ` Junio C Hamano
  0 siblings, 2 replies; 192+ messages in thread
From: SZEDER Gábor @ 2018-07-12 13:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git mailing list, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 12, 2018 at 3:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/11/2018 5:48 AM, SZEDER Gábor wrote:
> >> diff --git a/Documentation/config.txt b/Documentation/config.txt
> >> index ab641bf5a9..ab895ebb32 100644
> >> --- a/Documentation/config.txt
> >> +++ b/Documentation/config.txt
> >> @@ -908,6 +908,10 @@ core.commitGraph::
> >>      Enable git commit graph feature. Allows reading from the
> >>      commit-graph file.
> >>
> >> +core.multiPackIndex::
> >> +    Use the multi-pack-index file to track multiple packfiles using a
> >> +    single index. See linkgit:technical/multi-pack-index[1].
> > The 'linkgit' macro should be used to create links to other man pages,
> > but 'technical/multi-pack-index' is not a man page and this causes
> > 'make check-docs' to complain:
> >
> >        LINT lint-docs
> >    ./config.txt:929: nongit link: technical/multi-pack-index[1]
> >    Makefile:456: recipe for target 'lint-docs' failed
> >    make[1]: *** [lint-docs] Error 1
> >
> Thanks for this point. It seems to work using
> "link:technical/multi-pack-index[1]", which is what I'll use in the next
> version.

It doesn't work, it merely works around the build failure.

The generated man page looks like this:

  core.multiPackIndex
      Use the multi-pack-index file to track multiple packfiles using a
      single index. See 1[1].

And the resulting html page looks similar:

  core.multiPackIndex

      Use the multi-pack-index file to track multiple packfiles using a
      single index. See 1.

where that "1" is a link pointing to the non-existing URL
file:///home/me/src/git/Documentation/technical/multi-pack-index


* Re: [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-06  4:36       ` Eric Sunshine
  2018-07-06  5:20         ` Eric Sunshine
@ 2018-07-12 14:10         ` Derrick Stolee
  2018-07-12 18:02           ` Eric Sunshine
  1 sibling, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 14:10 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/6/2018 12:36 AM, Eric Sunshine wrote:
> On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>
>> +       for i in $(test_seq 6 10)
>> +       do
>> +               iii=$(printf '%03i' $i)
>> +               test-tool genrandom "bar" 200 >wide_delta_$iii &&
>> +               test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
>> +               test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
>> +               test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
>> +               test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
>> +               echo $iii >file_$iii &&
>> +               test-tool genrandom "$iii" 8192 >>file_$iii &&
>> +               git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
>> +               i=$(expr $i + 1) || return 1
>> +       done &&
>> +       { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
>> +       git update-index --add file_101 &&
>> +       tree=$(git write-tree) &&
>> +       commit=$(git commit-tree $tree -p HEAD</dev/null) && {
>> +       echo $tree &&
>> +       git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)        .*/\\1/"
>> +       } >obj-list2 &&
>> +       git update-ref HEAD $commit
>> +'
> There seems to be a fair bit of duplication in these tests which
> create objects. Is it possible to factor out some of this code into a
> shell function?

In addition to the other small changes, this refactor in particular was 
a big change (but a good one). I'm sending my current progress in this 
direction, as I expect this can be improved.

To make the commit_and_list_objects method more generic to all 
situations, I had to add an extra commit, which will cause some of the 
numbers to change in the later 'midx_read_expect' calls.
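
As a rough illustration of the two shell idioms the refactor leans on
(a single redirection applied to a { ...; } group, and $(( )) arithmetic
in place of expr), here is a sketch that runs outside Git's test suite.
fake_genrandom and make_deep_delta are hypothetical names; the former
only stands in for "test-tool genrandom", which is not available
outside the Git source tree.

```shell
#!/bin/sh
# Stand-in for "test-tool genrandom <seed> <n>": emit the seed text
# instead of pseudo-random bytes.
fake_genrandom () {
	printf '%s\n' "$1"
}

# Hypothetical helper with the same shape as generate_objects: the group
# is redirected once, so the file is truncated a single time and all
# three commands write to the same open stream.
make_deep_delta () {
	i=$1
	iii=$(printf '%03d' "$i")
	{
		fake_genrandom "foo$i" &&
		fake_genrandom "foo$(( i + 1 ))" &&
		fake_genrandom "foo$(( i + 2 ))"
	} >"deep_delta_$iii"
}

# Demo in a scratch directory.
dir=$(mktemp -d) && cd "$dir" &&
make_deep_delta 7 &&
cat deep_delta_007
```

With '>' on the group, calling the helper twice for the same index
leaves the same three lines in the file; with '>>' each call would
append another three.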

Thanks,

-Stolee

-->8--

 From cb38bb284fd05cf2230725b6cb9ead5795c913f2 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Thu, 31 May 2018 15:05:00 -0400
Subject: [PATCH] t5319: expand test data

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests that create repository data including
multiple packfiles with both version 1 and version 2 formats.

The current 'git multi-pack-index write' command will always write the
same file with no "real" data. This will be expanded in future commits,
along with the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
  t/t5319-multi-pack-index.sh | 83 +++++++++++++++++++++++++++++++++++++
  1 file changed, 83 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 2ecc369529..a50be41bc0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -13,9 +13,92 @@ midx_read_expect () {
  }

  test_expect_success 'write midx with no packs' '
+       test_when_finished rm -f pack/multi-pack-index &&
         git multi-pack-index --object-dir=. write &&
         test_path_is_file pack/multi-pack-index &&
         midx_read_expect
  '

+generate_objects () {
+       i=$1
+       iii=$(printf '%03i' $i)
+       {
+               test-tool genrandom "bar" 200 &&
+               test-tool genrandom "baz $iii" 50
+       } >wide_delta_$iii &&
+       {
+               test-tool genrandom "foo"$i 100 &&
+               test-tool genrandom "foo"$(( $i + 1 )) 100 &&
+               test-tool genrandom "foo"$(( $i + 2 )) 100
+       } >>deep_delta_$iii &&
+       echo $iii >file_$iii &&
+       test-tool genrandom "$iii" 8192 >>file_$iii &&
+       git update-index --add file_$iii deep_delta_$iii wide_delta_$iii
+}
+
+commit_and_list_objects () {
+       {
+               echo 101 &&
+               test-tool genrandom 100 8192;
+       } >file_101 &&
+       git update-index --add file_101 &&
+       tree=$(git write-tree) &&
+       commit=$(git commit-tree $tree -p HEAD</dev/null) &&
+       {
+               echo $tree &&
+               git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)        .*/\\1/"
+       } >obj-list &&
+       git reset --hard $commit
+}
+
+test_expect_success 'create objects' '
+       test_commit initial &&
+       for i in $(test_seq 1 5)
+       do
+               generate_objects $i
+       done &&
+       commit_and_list_objects
+'
+
+test_expect_success 'write midx with one v1 pack' '
+       pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+       test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+       git multi-pack-index --object-dir=. write &&
+       midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+       git pack-objects --index-version=2,0x40 pack/test <obj-list &&
+       git multi-pack-index --object-dir=. write &&
+       midx_read_expect
+'
+
+test_expect_success 'add more objects' '
+       for i in $(test_seq 6 10)
+       do
+               generate_objects $i
+       done &&
+       commit_and_list_objects
+'
+
+test_expect_success 'write midx with two packs' '
+       git pack-objects --index-version=1 pack/test-2 <obj-list &&
+       git multi-pack-index --object-dir=. write &&
+       midx_read_expect
+'
+
+test_expect_success 'add more packs' '
+       for j in $(test_seq 1 10)
+       do
+               generate_objects $j &&
+               commit_and_list_objects &&
+               git pack-objects --index-version=2 test-pack <obj-list
+       done
+'
+
+test_expect_success 'write midx with twelve packs' '
+       git multi-pack-index --object-dir=. write &&
+       midx_read_expect
+'
+
  test_done
--
2.18.0.118.gd4f65b8d14



* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-12 13:31           ` SZEDER Gábor
@ 2018-07-12 15:40             ` Derrick Stolee
  2018-07-12 17:29             ` Junio C Hamano
  1 sibling, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 15:40 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Git mailing list, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/12/2018 9:31 AM, SZEDER Gábor wrote:
> On Thu, Jul 12, 2018 at 3:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 7/11/2018 5:48 AM, SZEDER Gábor wrote:
>>>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>>>> index ab641bf5a9..ab895ebb32 100644
>>>> --- a/Documentation/config.txt
>>>> +++ b/Documentation/config.txt
>>>> @@ -908,6 +908,10 @@ core.commitGraph::
>>>>       Enable git commit graph feature. Allows reading from the
>>>>       commit-graph file.
>>>>
>>>> +core.multiPackIndex::
>>>> +    Use the multi-pack-index file to track multiple packfiles using a
>>>> +    single index. See linkgit:technical/multi-pack-index[1].
>>> The 'linkgit' macro should be used to create links to other man pages,
>>> but 'technical/multi-pack-index' is not a man page and this causes
>>> 'make check-docs' to complain:
>>>
>>>         LINT lint-docs
>>>     ./config.txt:929: nongit link: technical/multi-pack-index[1]
>>>     Makefile:456: recipe for target 'lint-docs' failed
>>>     make[1]: *** [lint-docs] Error 1
>>>
>> Thanks for this point. It seems to work using
>> "link:technical/multi-pack-index[1]", which is what I'll use in the next
>> version.
> It doesn't work, it merely works around the build failure.
>
> The generated man page looks like this:
>
>    core.multiPackIndex
>        Use the multi-pack-index file to track multiple packfiles using a
>        single index. See 1[1].
>
> And the resulting html page looks similar:
>
>    core.multiPackIndex
>
>        Use the multi-pack-index file to track multiple packfiles using a
>        single index. See 1.
>
> where that "1" is a link pointing to the non-existing URL
> file:///home/me/src/git/Documentation/technical/multi-pack-index

Right. Sorry. I also see that I use the correct kind of links in 
Documentation/git-multi-pack-index.txt (see below) so I will use it 
here, too.

SEE ALSO
--------
See link:technical/multi-pack-index.html[The Multi-Pack-Index Design
Document] and link:technical/pack-format.html[The Multi-Pack-Index
Format] for more information on the multi-pack-index feature.

Thanks,

-Stolee



* Re: [PATCH v3 06/24] multi-pack-index: load into memory
  2018-07-09 19:08       ` Junio C Hamano
@ 2018-07-12 16:06         ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 16:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, sbeller, pclouds, avarab, dstolee

On 7/9/2018 3:08 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> diff --git a/midx.h b/midx.h
>> index dbdbe9f873..2d83dd9ec1 100644
>> --- a/midx.h
>> +++ b/midx.h
>> @@ -1,6 +1,10 @@
>>   #ifndef __MIDX_H__
>>   #define __MIDX_H__
>>   
>> +struct multi_pack_index;
> I actually was quite surprised that this struct is defined in
> object-store.h and not here.  It feels the other way around.
>
> The raw_object_store needs to know that such an in-core structure
> might exist as an optional feature in an object store, but as an
> optional feature, I suspect that it has a pointer to an instance of
> multi_pack_index, instead of embedding the struct itself in it, so I
> would have expected to see an "I am only telling you that there is a
> struct with this name, but I am leaving it opaque as you do not have
> any business looking inside the struct yourself.  You only need to
> be aware of the type's existence and a pointer to it so that you can
> call helpers that know what's inside and that should be sufficient
> for your needs." decl like this in object-store.h and instead an
> actual implementation in here.

I thought it natural to include the struct definition next to the 
definition for 'struct packed_git', but I like your separation of 
concerns. Perhaps we could move the packed_git definition to packfile.h 
as well (separately). Of course, that sounds like unnecessary churn.
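
For reference, the separation of concerns suggested above is the usual opaque-pointer pattern in C. The following is only a minimal single-file sketch of that pattern; the names `midx_sketch`, `store_sketch`, and the three helpers are hypothetical and are not taken from the patch series.

```c
#include <stdlib.h>
#include <string.h>

/* Part 1: what a consumer header (think object-store.h) would carry. */
struct midx_sketch;			/* name only; layout stays hidden */

struct store_sketch {
	struct midx_sketch *midx;	/* optional feature: NULL when absent */
};

/* Helpers the consumer may call without knowing the layout. */
struct midx_sketch *midx_sketch_load(const char *object_dir, unsigned num_packs);
unsigned midx_sketch_num_packs(const struct midx_sketch *m);
void midx_sketch_free(struct midx_sketch *m);

/* Part 2: what the owning module (think midx.h and midx.c) would carry. */
struct midx_sketch {
	unsigned num_packs;
	char object_dir[];		/* C99 flexible array member */
};

struct midx_sketch *midx_sketch_load(const char *object_dir, unsigned num_packs)
{
	size_t len = strlen(object_dir);
	struct midx_sketch *m = calloc(1, sizeof(*m) + len + 1);

	if (!m)
		return NULL;
	m->num_packs = num_packs;
	/* calloc zeroed the buffer, so the trailing NUL is already there */
	memcpy(m->object_dir, object_dir, len);
	return m;
}

unsigned midx_sketch_num_packs(const struct midx_sketch *m)
{
	return m->num_packs;
}

void midx_sketch_free(struct midx_sketch *m)
{
	free(m);
}
```

Code that only sees Part 1 can store and pass the pointer around but cannot dereference it, which is the property being asked for.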

Thanks,

-Stolee



* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-12 13:19         ` Derrick Stolee
@ 2018-07-12 16:30           ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 16:30 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/12/2018 9:19 AM, Derrick Stolee wrote:
> On 7/6/2018 1:39 AM, Eric Sunshine wrote:
>> On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>>> The core.multiPackIndex config setting controls the multi-pack-
>>> index (MIDX) feature. If false, the setting will disable all reads
>>> from the multi-pack-index file.
>>>
>>> Add comparison commands in t5319-multi-pack-index.sh to check
>>> typical Git behavior remains the same as the config setting is turned
>>> on and off. This currently includes 'git rev-list' and 'git log'
>>> commands to trigger several object database reads.
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>>> ---
>>> diff --git a/cache.h b/cache.h
>>> @@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
>>> +extern int core_multi_pack_index;
>>> diff --git a/config.c b/config.c
>>> @@ -1313,6 +1313,11 @@ static int git_default_core_config(const char 
>>> *var, const char *value)
>>> +       if (!strcmp(var, "core.multipackindex")) {
>>> +               core_multi_pack_index = git_config_bool(var, value);
>>> +               return 0;
>>> +       }
>> This is a rather unusual commit. This new configuration is assigned,
>> but it's never actually consulted, which means that it's impossible
>> for it to have any impact on functionality, yet tests are being added
>> to check whether it _did_ have any impact on functionality. Confusing.
>>
>> Patch 17 does consult 'core_multi_pack_index', so it's only at that
>> point that it could have any impact. This situation would be less
>> confusing if you swapped patches 16 and 17 (and, of course, move the
>> declaration of 'core_multi_pack_index' to patch 17 with a reasonable
>> default value).
>
> You're right that this commit is a bit too aware of the future, but I 
> disagree with the recommendation to change it.
>
> Yes, in this commit there is no possible way that these tests could 
> fail. The point is that patches 17-23 all change behavior if this 
> setting is on, and we want to make sure we do not break at any point 
> along that journey (or in future iterations of the multi-pack-index 
> feature).
>
> With this in mind, I don't think there is a better commit to place 
> these tests.

Of course, as I convert this global config variable into an on-demand 
check as promised [1] this commit seems even more trivial. I'm going to 
squash it with PATCH 17.

[1] 
https://public-inbox.org/git/b5733625-29c8-4317-ff44-d27c2fca11ce@gmail.com/



* Re: [PATCH v3 15/24] midx: write object offsets
  2018-07-06  5:27       ` Eric Sunshine
@ 2018-07-12 16:33         ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 16:33 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/6/2018 1:27 AM, Eric Sunshine wrote:
> On Thu, Jul 5, 2018 at 8:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>> The final pair of chunks for the multi-pack-index file stores the object
>> offsets. We default to using 32-bit offsets as in the pack-index version
>> 1 format, but if there exists an offset larger than 32-bits, we use a
>> trick similar to the pack-index version 2 format by storing all offsets
>> at least 2^31 in a 64-bit table; we use the 32-bit table to point into
>> that 64-bit table as necessary.
>>
>> We only store these 64-bit offsets if necessary, so create a test that
>> manipulates a version 2 pack-index to fake a large offset. This allows
>> us to test that the large offset table is created, but the data does not
>> match the actual packfile offsets. The multi-pack-index offset does match
>> the (corrupted) pack-index offset, so a future feature will compare these
>> offsets during a 'verify' step.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> @@ -6,25 +6,28 @@ test_description='multi-pack-indexes'
>> +# usage: corrupt_data <file> <pos> [<data>]
>> +corrupt_data() {
> Style: corrupt_data () {
>
>> +       file=$1
>> +       pos=$2
>> +       data="${3:-\0}"
>> +       printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
>> +}
>> +
>> +# Force 64-bit offsets by manipulating the idx file.
>> +# This makes the IDX file _incorrect_ so be careful to clean up after!
>> +test_expect_success 'force some 64-bit offsets with pack-objects' '
>> +       mkdir objects64 &&
>> +       mkdir objects64/pack &&
>> +       pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
>> +       idx64=objects64/pack/test-64-$pack64.idx &&
>> +       chmod u+w $idx64 &&
> I guess you don't have to worry about the POSIXPERM prerequisite here
> because the file is already writable on Windows, right?

Correct. And I want this test to still run on Windows.
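
As background for the quoted commit message, the two-table offset scheme can be sketched roughly as follows. This is an illustration only, not the code from the patch: the function names are hypothetical, and the use of the most significant bit as the "look in the 64-bit table" marker is an assumption borrowed from the pack-index version 2 format that the commit message refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed marker: MSB set means "low 31 bits index the 64-bit table". */
#define LARGE_OFFSET_FLAG 0x80000000u

/* The 64-bit table is only written if some offset does not fit in 32 bits. */
int need_large_table(const uint64_t *offsets, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (offsets[i] > UINT32_MAX)
			return 1;
	return 0;
}

/*
 * Produce the 32-bit table entry for one offset. When the large table
 * is in use, every offset of at least 2^31 is appended to large[] and
 * the entry stores its position with the flag bit set.
 */
uint32_t encode_offset(uint64_t offset, int use_large_table,
		       uint64_t *large, size_t *nr_large)
{
	if (use_large_table && offset >= LARGE_OFFSET_FLAG) {
		uint32_t pos = (uint32_t)*nr_large;

		large[pos] = offset;
		*nr_large += 1;
		return LARGE_OFFSET_FLAG | pos;
	}
	return (uint32_t)offset;
}

/* Reverse of encode_offset(). */
uint64_t decode_offset(uint32_t entry, int use_large_table,
		       const uint64_t *large)
{
	if (use_large_table && (entry & LARGE_OFFSET_FLAG))
		return large[entry & ~LARGE_OFFSET_FLAG];
	return entry;
}
```

In this sketch, a repository whose packs all stay below 4GB never pays for the second table, which matches the "only store these 64-bit offsets if necessary" remark in the commit message.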

Thanks,

-Stolee



* Re: [PATCH v3 16/24] config: create core.multiPackIndex setting
  2018-07-12 13:31           ` SZEDER Gábor
  2018-07-12 15:40             ` Derrick Stolee
@ 2018-07-12 17:29             ` Junio C Hamano
  1 sibling, 0 replies; 192+ messages in thread
From: Junio C Hamano @ 2018-07-12 17:29 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee, Git mailing list, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> Thanks for this point. It seems to work using
>> "link:technical/multi-pack-index[1]", which is what I'll use in the next
>> version.
>
> It doesn't work, it merely works around the build failure.

Sorry. I fell into the same trap X-<.

link:technical/multi-pack-index.html[the technical documentation
for it]

or something like that, perhaps.


* Re: [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-12 14:10         ` Derrick Stolee
@ 2018-07-12 18:02           ` Eric Sunshine
  2018-07-12 18:06             ` Derrick Stolee
  0 siblings, 1 reply; 192+ messages in thread
From: Eric Sunshine @ 2018-07-12 18:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Thu, Jul 12, 2018 at 10:10 AM Derrick Stolee <stolee@gmail.com> wrote:
> On 7/6/2018 12:36 AM, Eric Sunshine wrote:
> > There seems to be a fair bit of duplication in these tests which
> > create objects. Is it possible to factor out some of this code into a
> > shell function?
>
> In addition to the other small changes, this refactor in particular was
> a big change (but a good one). I'm sending my current progress in this
> direction, as I expect this can be improved.

I like the amount of code reduction. A couple minor comments...

> +generate_objects () {
> +       i=$1
> +       iii=$(printf '%03i' $i)
> +       {
> +               test-tool genrandom "bar" 200 &&
> +               test-tool genrandom "baz $iii" 50
> +       } >wide_delta_$iii &&
> +       {
> +               test-tool genrandom "foo"$i 100 &&
> +               test-tool genrandom "foo"$(( $i + 1 )) 100 &&
> +               test-tool genrandom "foo"$(( $i + 2 )) 100
> +       } >>deep_delta_$iii &&

I think this should be: s/>>/>/

> +       echo $iii >file_$iii &&
> +       test-tool genrandom "$iii" 8192 >>file_$iii &&

And this: s/>>/>/

> +       git update-index --add file_$iii deep_delta_$iii wide_delta_$iii
> +}


* Re: [PATCH v3 07/24] multi-pack-index: expand test data
  2018-07-12 18:02           ` Eric Sunshine
@ 2018-07-12 18:06             ` Derrick Stolee
  0 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 18:06 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Stefan Beller,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On 7/12/2018 2:02 PM, Eric Sunshine wrote:
> On Thu, Jul 12, 2018 at 10:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>> On 7/6/2018 12:36 AM, Eric Sunshine wrote:
>>> There seems to be a fair bit of duplication in these tests which
>>> create objects. Is it possible to factor out some of this code into a
>>> shell function?
>> In addition to the other small changes, this refactor in particular was
>> a big change (but a good one). I'm sending my current progress in this
>> direction, as I expect this can be improved.
> I like the amount of code reduction. A couple minor comments...
>
>> +generate_objects () {
>> +       i=$1
>> +       iii=$(printf '%03i' $i)
>> +       {
>> +               test-tool genrandom "bar" 200 &&
>> +               test-tool genrandom "baz $iii" 50
>> +       } >wide_delta_$iii &&
>> +       {
>> +               test-tool genrandom "foo"$i 100 &&
>> +               test-tool genrandom "foo"$(( $i + 1 )) 100 &&
>> +               test-tool genrandom "foo"$(( $i + 2 )) 100
>> +       } >>deep_delta_$iii &&
> I think this should be: s/>>/>/

It should!

>> +       echo $iii >file_$iii &&
>> +       test-tool genrandom "$iii" 8192 >>file_$iii &&
> And this: s/>>/>/

In addition, I should wrap these two commands in { } like the files above.

Thanks,

-Stolee



* [PATCH v4 00/23] Multi-pack-index (MIDX)
  2018-07-06  0:52   ` [PATCH v3 00/24] Multi-pack-index (MIDX) Derrick Stolee
                       ` (23 preceding siblings ...)
  2018-07-06  0:53     ` [PATCH v3 24/24] midx: clear midx on repack Derrick Stolee
@ 2018-07-12 19:39     ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 01/23] multi-pack-index: add design document Derrick Stolee
                         ` (23 more replies)
  24 siblings, 24 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

v3 had a lot of interesting feedback, most of which was non-functional,
but made a big impact on the shape of the patch, especially the test
script.

These are the important changes:

* 'git multi-pack-index' will report usage if the 'write' verb is not
  provided, or if extra parameters are provided. A later series will
  create the 'verify' verb.

* t5319-multi-pack-index.sh has a reorganized way to generate object
  data, so it has fewer code clones.

* 'test-tool read-midx' uses '-' instead of '_'.

* The global 'core_multi_pack_index' is replaced with a one-time call to
  git_config_bool() per repository that loads a multi-pack-index.

* 'struct multi_pack_index' is now defined in midx.h and kept anonymous
  to object-store.h.

* Added a test that 'git repack' removes the multi-pack-index.

* Fixed a doc bug when linking to the technical docs.
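
To illustrate the first item in the list above, the verb check amounts to a rule like the one sketched here. This is a standalone sketch rather than the builtin itself; `classify_verb` and the `verb_result` enum are hypothetical names, and `argv` is assumed to hold only what remains after option parsing.

```c
#include <string.h>

enum verb_result { VERB_WRITE, VERB_USAGE };

/* argv[0] is the verb; anything after it counts as an extra parameter. */
enum verb_result classify_verb(int argc, const char **argv)
{
	if (argc == 0)
		return VERB_USAGE;	/* no verb given */

	if (!strcmp(argv[0], "write")) {
		if (argc > 1)
			return VERB_USAGE;	/* extra parameters */
		return VERB_WRITE;
	}

	return VERB_USAGE;		/* unknown verb */
}
```

Falling through to the usage case for unknown verbs leaves room for a later 'verify' verb to be added as one more branch.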

I included the diff between the latest ds/multi-pack-index and this
series as part of this message.

You can see the CI builds for Linux, Mac, and Windows linked from the
GitHub pull request [1].

Thanks,
-Stolee

[1] https://github.com/gitgitgadget/git/pull/5

Derrick Stolee (23):
  multi-pack-index: add design document
  multi-pack-index: add format details
  multi-pack-index: add builtin
  multi-pack-index: add 'write' verb
  midx: write header information to lockfile
  multi-pack-index: load into memory
  t5319: expand test data
  packfile: generalize pack directory list
  multi-pack-index: read packfile list
  multi-pack-index: write pack names in chunk
  midx: read pack names into array
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  config: create core.multiPackIndex setting
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  packfile: skip loading index if in multi-pack-index
  midx: clear midx on repack

 .gitignore                                   |   3 +-
 Documentation/config.txt                     |   5 +
 Documentation/git-multi-pack-index.txt       |  56 ++
 Documentation/technical/multi-pack-index.txt | 109 +++
 Documentation/technical/pack-format.txt      |  77 ++
 Makefile                                     |   3 +
 builtin.h                                    |   1 +
 builtin/multi-pack-index.c                   |  47 +
 builtin/repack.c                             |   9 +
 command-list.txt                             |   1 +
 git.c                                        |   1 +
 midx.c                                       | 918 +++++++++++++++++++
 midx.h                                       |  44 +
 object-store.h                               |   9 +
 packfile.c                                   | 169 +++-
 packfile.h                                   |   9 +
 sha1-name.c                                  |  70 ++
 t/helper/test-read-midx.c                    |  51 ++
 t/helper/test-tool.c                         |   1 +
 t/helper/test-tool.h                         |   1 +
 t/t5319-multi-pack-index.sh                  | 179 ++++
 21 files changed, 1720 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 Documentation/technical/multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100644 t/helper/test-read-midx.c
 create mode 100755 t/t5319-multi-pack-index.sh


base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
-- 
2.18.0.118.gd4f65b8d14

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9dcde07a34..25f817ca42 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -910,7 +910,8 @@ core.commitGraph::
 
 core.multiPackIndex::
 	Use the multi-pack-index file to track multiple packfiles using a
-	single index. See link:technical/multi-pack-index[1].
+	single index. See link:technical/multi-pack-index.html[the
+	multi-pack-index design document].
 
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index be97c9372e..a62af1caca 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir <dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -18,7 +18,7 @@ Write or verify a multi-pack-index (MIDX) file.
 OPTIONS
 -------
 
---object-dir <dir>::
+--object-dir=<dir>::
 	Use given directory for the location of Git objects. We check
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
@@ -37,7 +37,7 @@ EXAMPLES
 $ git multi-pack-index write
 -----------------------------------------------
 
-* Write a MIDX file for the packfiles in an alternate.
+* Write a MIDX file for the packfiles in an alternate object store.
 +
 -----------------------------------------------
 $ git multi-pack-index --object-dir <alt> write
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 14b32e1373..6a7aa00cf2 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir <dir>] [write]"),
+	N_("git multi-pack-index [--object-dir=<dir>] write"),
 	NULL
 };
 
@@ -18,14 +18,10 @@ int cmd_multi_pack_index(int argc, const char **argv,
 {
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
-		  N_("The object directory containing set of packfile and pack-index pairs")),
+		  N_("object directory containing set of packfile and pack-index pairs")),
 		OPT_END(),
 	};
 
-	if (argc == 2 && !strcmp(argv[1], "-h"))
-		usage_with_options(builtin_multi_pack_index_usage,
-				   builtin_multi_pack_index_options);
-
 	git_config(git_default_config, NULL);
 
 	argc = parse_options(argc, argv, prefix,
@@ -36,11 +32,16 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		opts.object_dir = get_object_directory();
 
 	if (argc == 0)
-		usage_with_options(builtin_multi_pack_index_usage,
-				   builtin_multi_pack_index_options);
+		goto usage;
+
+	if (!strcmp(argv[0], "write")) {
+		if (argc > 1)
+			goto usage;
 
-	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
+	}
 
-	return 0;
+usage:
+	usage_with_options(builtin_multi_pack_index_usage,
+			   builtin_multi_pack_index_options);
 }
diff --git a/builtin/repack.c b/builtin/repack.c
index 66a7d8e8ea..7f7cdc8b17 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -335,12 +335,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	for_each_string_list_item(item, &names) {
 		for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
 			char *fname, *fname_old;
-			fname = mkpathdup("%s/pack-%s%s", packdir,
-						item->string, exts[ext].name);
-			if (!file_exists(fname)) {
-				free(fname);
-				continue;
-			}
 
 			if (!midx_cleared) {
 				/* if we move a packfile, it will invalidated the midx */
@@ -348,6 +342,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				midx_cleared = 1;
 			}
 
+			fname = mkpathdup("%s/pack-%s%s", packdir,
+						item->string, exts[ext].name);
+			if (!file_exists(fname)) {
+				free(fname);
+				continue;
+			}
+
 			fname_old = mkpathdup("%s/old-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (file_exists(fname_old))
diff --git a/cache.h b/cache.h
index d12aa49710..89a107a7f7 100644
--- a/cache.h
+++ b/cache.h
@@ -814,7 +814,6 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
-extern int core_multi_pack_index;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index 95d8da4243..fbbf0f8e9f 100644
--- a/config.c
+++ b/config.c
@@ -1313,11 +1313,6 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
-	if (!strcmp(var, "core.multipackindex")) {
-		core_multi_pack_index = git_config_bool(var, value);
-		return 0;
-	}
-
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index b9bc919cdb..2a6de2330b 100644
--- a/environment.c
+++ b/environment.c
@@ -67,7 +67,6 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
-int core_multi_pack_index;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/midx.c b/midx.c
index 8afd08f3fe..19b7df338e 100644
--- a/midx.c
+++ b/midx.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "config.h"
 #include "csum-file.h"
 #include "dir.h"
 #include "lockfile.h"
@@ -60,7 +61,6 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	midx_size = xsize_t(st.st_size);
 
 	if (midx_size < MIDX_MIN_SIZE) {
-		close(fd);
 		error(_("multi-pack-index file %s is too small"), midx_name);
 		goto cleanup_fail;
 	}
@@ -69,8 +69,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
 
-	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
-	strcpy(m->object_dir, object_dir);
+	FLEX_ALLOC_MEM(m, object_dir, object_dir, strlen(object_dir));
 	m->fd = fd;
 	m->data = midx_map;
 	m->data_len = midx_size;
@@ -171,7 +170,6 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	return m;
 
 cleanup_fail:
-	/* no need to check for NULL when freeing */
 	free(m);
 	free(midx_name);
 	if (midx_map)
@@ -324,8 +322,10 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir)
 {
 	struct multi_pack_index *m = r->objects->multi_pack_index;
 	struct multi_pack_index *m_search;
+	int config_value;
 
-	if (!core_multi_pack_index)
+	if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+	    !config_value)
 		return 0;
 
 	for (m_search = m; m_search; m_search = m_search->next)
@@ -382,7 +382,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
-							full_path_len, 0);
+							full_path_len,
+							0);
 
 		if (!packs->list[packs->nr]) {
 			warning(_("failed to add packfile '%s'"),
@@ -661,8 +662,8 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
-				oid_to_hex(&obj->oid),
-				oid_to_hex(&next->oid));
+				    oid_to_hex(&obj->oid),
+				    oid_to_hex(&next->oid));
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
diff --git a/midx.h b/midx.h
index 5a42cbed1d..e3b07f1586 100644
--- a/midx.h
+++ b/midx.h
@@ -3,7 +3,31 @@
 
 #include "repository.h"
 
-struct multi_pack_index;
+struct multi_pack_index {
+	struct multi_pack_index *next;
+
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	const unsigned char *chunk_pack_names;
+	const uint32_t *chunk_oid_fanout;
+	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_object_offsets;
+	const unsigned char *chunk_large_offsets;
+
+	const char **pack_names;
+	struct packed_git **packs;
+	char object_dir[FLEX_ARRAY];
+};
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/object-store.h b/object-store.h
index 03cc278758..c2b162489a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,31 +84,7 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
-struct multi_pack_index {
-	struct multi_pack_index *next;
-
-	int fd;
-
-	const unsigned char *data;
-	size_t data_len;
-
-	uint32_t signature;
-	unsigned char version;
-	unsigned char hash_len;
-	unsigned char num_chunks;
-	uint32_t num_packs;
-	uint32_t num_objects;
-
-	const unsigned char *chunk_pack_names;
-	const uint32_t *chunk_oid_fanout;
-	const unsigned char *chunk_oid_lookup;
-	const unsigned char *chunk_object_offsets;
-	const unsigned char *chunk_large_offsets;
-
-	const char **pack_names;
-	struct packed_git **packs;
-	char object_dir[FLEX_ARRAY];
-};
+struct multi_pack_index;
 
 struct raw_object_store {
 	/*
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 20771d1c1d..8e19972e89 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -1,6 +1,3 @@
-/*
- * test-mktemp.c: code to exercise the creation of temporary files
- */
 #include "test-tool.h"
 #include "cache.h"
 #include "midx.h"
@@ -13,7 +10,7 @@ static int read_midx_file(const char *object_dir)
 	struct multi_pack_index *m = load_multi_pack_index(object_dir);
 
 	if (!m)
-		return 0;
+		return 1;
 
 	printf("header: %08x %d %d %d\n",
 	       m->signature,
@@ -24,15 +21,15 @@ static int read_midx_file(const char *object_dir)
 	printf("chunks:");
 
 	if (m->chunk_pack_names)
-		printf(" pack_names");
+		printf(" pack-names");
 	if (m->chunk_oid_fanout)
-		printf(" oid_fanout");
+		printf(" oid-fanout");
 	if (m->chunk_oid_lookup)
-		printf(" oid_lookup");
+		printf(" oid-lookup");
 	if (m->chunk_object_offsets)
-		printf(" object_offsets");
+		printf(" object-offsets");
 	if (m->chunk_large_offsets)
-		printf(" large_offsets");
+		printf(" large-offsets");
 
 	printf("\nnum_objects: %d\n", m->num_objects);
 
@@ -40,7 +37,7 @@ static int read_midx_file(const char *object_dir)
 	for (i = 0; i < m->num_packs; i++)
 		printf("%s\n", m->pack_names[i]);
 
-	printf("object_dir: %s\n", m->object_dir);
+	printf("object-dir: %s\n", m->object_dir);
 
 	return 0;
 }
@@ -48,7 +45,7 @@ static int read_midx_file(const char *object_dir)
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (argc != 2)
-		usage("read-midx <object_dir>");
+		usage("read-midx <object-dir>");
 
 	return read_midx_file(argv[1]);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 4c630ecab4..5ad6614465 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -11,17 +11,19 @@ midx_read_expect () {
 	NUM_CHUNKS=$3
 	OBJECT_DIR=$4
 	EXTRA_CHUNKS="$5"
-	cat >expect <<-EOF
-	header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
-	chunks: pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
-	num_objects: $NUM_OBJECTS
-	packs:
-	EOF
-	if [ $NUM_PACKS -ge 1 ]
-	then
-		ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
-	fi
-	printf "object_dir: $OBJECT_DIR\n" >>expect &&
+	{
+		cat <<-EOF &&
+		header: 4d494458 1 $NUM_CHUNKS $NUM_PACKS
+		chunks: pack-names oid-fanout oid-lookup object-offsets$EXTRA_CHUNKS
+		num_objects: $NUM_OBJECTS
+		packs:
+		EOF
+		if test $NUM_PACKS -ge 1
+		then
+			ls $OBJECT_DIR/pack/ | grep idx | sort
+		fi &&
+		printf "object-dir: $OBJECT_DIR\n"
+	} >expect &&
 	test-tool read-midx $OBJECT_DIR >actual &&
 	test_cmp expect actual
 }
@@ -32,35 +34,55 @@ test_expect_success 'write midx with no packs' '
 	midx_read_expect 0 0 4 .
 '
 
+generate_objects () {
+	i=$1
+	iii=$(printf '%03i' $i)
+	{
+		test-tool genrandom "bar" 200 &&
+		test-tool genrandom "baz $iii" 50
+	} >wide_delta_$iii &&
+	{
+		test-tool genrandom "foo"$i 100 &&
+		test-tool genrandom "foo"$(( $i + 1 )) 100 &&
+		test-tool genrandom "foo"$(( $i + 2 )) 100
+	} >deep_delta_$iii &&
+	{
+		echo $iii &&
+		test-tool genrandom "$iii" 8192
+	} >file_$iii &&
+	git update-index --add file_$iii deep_delta_$iii wide_delta_$iii
+}
+
+commit_and_list_objects () {
+	{
+		echo 101 &&
+		test-tool genrandom 100 8192;
+	} >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) &&
+	{
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git reset --hard $commit
+}
+
 test_expect_success 'create objects' '
+	test_commit initial &&
 	for i in $(test_seq 1 5)
 	do
-		iii=$(printf '%03i' $i)
-		test-tool genrandom "bar" 200 >wide_delta_$iii &&
-		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
-		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
-		echo $iii >file_$iii &&
-		test-tool genrandom "$iii" 8192 >>file_$iii &&
-		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
-		i=$(expr $i + 1) || return 1
+		generate_objects $i
 	done &&
-	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
-	git update-index --add file_101 &&
-	tree=$(git write-tree) &&
-	commit=$(git commit-tree $tree </dev/null) && {
-	echo $tree &&
-	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
-	} >obj-list &&
-	git update-ref HEAD $commit
+	commit_and_list_objects
 '
 
 test_expect_success 'write midx with one v1 pack' '
-	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
-	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
-	git multi-pack-index --object-dir=. write &&
-	midx_read_expect 1 17 4 .
+	pack=$(git pack-objects --index-version=1 $objdir/pack/test <obj-list) &&
+	test_when_finished rm $objdir/pack/test-$pack.pack \
+		$objdir/pack/test-$pack.idx $objdir/pack/multi-pack-index &&
+	git multi-pack-index --object-dir=$objdir write &&
+	midx_read_expect 1 18 4 $objdir
 '
 
 midx_git_two_modes() {
@@ -80,81 +102,33 @@ compare_results_with_midx() {
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
 	git multi-pack-index --object-dir=$objdir write &&
-	midx_read_expect 1 17 4 $objdir
+	midx_read_expect 1 18 4 $objdir
 '
 
-midx_git_two_modes() {
-	git -c core.multiPackIndex=false $1 >expect &&
-	git -c core.multiPackIndex=true $1 >actual &&
-	test_cmp expect actual
-}
-
-compare_results_with_midx() {
-	MSG=$1
-	test_expect_success "check normal git operations: $MSG" '
-		midx_git_two_modes "rev-list --objects --all" &&
-		midx_git_two_modes "log --raw" &&
-		midx_git_two_modes "log --oneline"
-	'
-}
-
 compare_results_with_midx "one v2 pack"
 
-test_expect_success 'Add more objects' '
+test_expect_success 'add more objects' '
 	for i in $(test_seq 6 10)
 	do
-		iii=$(printf '%03i' $i)
-		test-tool genrandom "bar" 200 >wide_delta_$iii &&
-		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
-		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
-		echo $iii >file_$iii &&
-		test-tool genrandom "$iii" 8192 >>file_$iii &&
-		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
-		i=$(expr $i + 1) || return 1
+		generate_objects $i
 	done &&
-	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
-	git update-index --add file_101 &&
-	tree=$(git write-tree) &&
-	commit=$(git commit-tree $tree -p HEAD</dev/null) && {
-	echo $tree &&
-	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
-	} >obj-list2 &&
-	git update-ref HEAD $commit
+	commit_and_list_objects
 '
 
 test_expect_success 'write midx with two packs' '
-	git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list2 &&
+	git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list &&
 	git multi-pack-index --object-dir=$objdir write &&
-	midx_read_expect 2 33 4 $objdir
+	midx_read_expect 2 34 4 $objdir
 '
 
 compare_results_with_midx "two packs"
 
-test_expect_success 'Add more packs' '
-	for j in $(test_seq 1 10)
+test_expect_success 'add more packs' '
+	for j in $(test_seq 11 20)
 	do
-		iii=$(printf '%03i' $i)
-		test-tool genrandom "bar" 200 >wide_delta_$iii &&
-		test-tool genrandom "baz $iii" 50 >>wide_delta_$iii &&
-		test-tool genrandom "foo"$i 100 >deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 1) 100 >>deep_delta_$iii &&
-		test-tool genrandom "foo"$(expr $i + 2) 100 >>deep_delta_$iii &&
-		echo $iii >file_$iii &&
-		test-tool genrandom "$iii" 8192 >>file_$iii &&
-		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
-		{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
-		git update-index --add file_101 &&
-		tree=$(git write-tree) &&
-		commit=$(git commit-tree $tree -p HEAD</dev/null) && {
-		echo $tree &&
-		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
-		} >obj-list &&
-		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list &&
-		i=$(expr $i + 1) || return 1 &&
-		j=$(expr $j + 1) || return 1
+		generate_objects $j &&
+		commit_and_list_objects &&
+		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list
 	done
 '
 
@@ -162,13 +136,22 @@ compare_results_with_midx "mixed mode (two packs + extra)"
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=$objdir write &&
-	midx_read_expect 12 73 4 $objdir
+	midx_read_expect 12 74 4 $objdir
 '
 
 compare_results_with_midx "twelve packs"
 
+test_expect_success 'repack removes multi-pack-index' '
+	test_path_is_file $objdir/pack/multi-pack-index &&
+	git repack -adf &&
+	test_path_is_missing $objdir/pack/multi-pack-index
+'
+
+compare_results_with_midx "after repack"
+
+
 # usage: corrupt_data <file> <pos> [<data>]
-corrupt_data() {
+corrupt_data () {
 	file=$1
 	pos=$2
 	data="${3:-\0}"
@@ -180,12 +163,17 @@ corrupt_data() {
 test_expect_success 'force some 64-bit offsets with pack-objects' '
 	mkdir objects64 &&
 	mkdir objects64/pack &&
+	for i in $(test_seq 1 11)
+	do
+		generate_objects 11
+	done &&
+	commit_and_list_objects &&
 	pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
 	idx64=objects64/pack/test-64-$pack64.idx &&
 	chmod u+w $idx64 &&
-	corrupt_data $idx64 2899 "\02" &&
-	midx64=$(git multi-pack-index write --object-dir=objects64) &&
-	midx_read_expect 1 62 5 objects64 " large_offsets"
+	corrupt_data $idx64 2999 "\02" &&
+	midx64=$(git multi-pack-index --object-dir=objects64 write) &&
+	midx_read_expect 1 63 5 objects64 " large-offsets"
 '
 
 test_done


* [PATCH v4 01/23] multi-pack-index: add design document
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 02/23] multi-pack-index: add format details Derrick Stolee
                         ` (22 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 109 +++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt
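
Reader's note (not part of the patch): the design document below claims
O(log N) lookup because one sorted list of object IDs covers every
packfile. The following is a minimal sketch of that lookup, assuming the
three lists are already in memory. The struct midx_view type and the
midx_view_lookup() name are hypothetical and are not Git's
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* SHA-1 length, matching the 20-byte hashes in this series */

/* Hypothetical in-memory view of the lists named in the design notes. */
struct midx_view {
	const unsigned char *oids; /* num_objects * OID_LEN bytes, sorted */
	const uint32_t *pack_ids;  /* pack_ids[i] = j, the jth packfile name */
	const uint64_t *offsets;   /* offsets[i] = offset inside packfile j */
	uint32_t num_objects;
};

/*
 * Binary search over the sorted object IDs. Returns 1 and fills in
 * *pack_id and *offset when the object is present, 0 otherwise.
 */
int midx_view_lookup(const struct midx_view *m, const unsigned char *oid,
		     uint32_t *pack_id, uint64_t *offset)
{
	uint32_t lo = 0, hi;

	assert(m && oid && pack_id && offset);
	hi = m->num_objects;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, m->oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pack_id = m->pack_ids[mid];
			*offset = m->offsets[mid];
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

The search cost depends only on the total number of objects, not on how
many packfiles they are spread across, which is the property the design
notes rely on.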

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.multiPackIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.multiPackIndex to false, or
+  downgrade without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git multi-pack-index' builtin to
+  verify that the contents of the multi-pack-index file match the
+  offsets listed in the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 02/23] multi-pack-index: add format details
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 01/23] multi-pack-index: add design document Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 03/23] multi-pack-index: add builtin Derrick Stolee
                         ` (21 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

The multi-pack-index feature generalizes the existing pack-index
feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to create breaking changes by
advancing the version number. We can also change the hash algorithm
using a different version value.

We will add the individual chunk format information as we introduce
the code that writes that information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
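
Reader's note (not part of the patch): the CHUNK LOOKUP section below
describes (C + 1) rows of 12 bytes, each a 4-byte chunk id followed by
an 8-byte file offset, with a zero id on the final row. The sketch below
shows how a reader could find a chunk and infer its length from the next
row. It assumes the 8-byte offsets are also in network order (the text
only says so for 4-byte numbers), and the find_chunk() helper and the
MIDX_CHUNKLOOKUP_WIDTH name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_HEADER_SIZE 12
#define MIDX_CHUNKLOOKUP_WIDTH 12 /* 4-byte id + 8-byte offset */

/* Read big-endian (network order) integers from a byte buffer. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Scan the lookup table that follows the header for chunk_id. On
 * success, returns 1 and sets *start to the chunk's file offset and
 * *length to the distance to the next row's offset. Returns 0 if the
 * table does not fit in the buffer or the id is absent.
 */
int find_chunk(const unsigned char *data, size_t len,
	       unsigned char num_chunks, uint32_t chunk_id,
	       uint64_t *start, uint64_t *length)
{
	size_t table_len = ((size_t)num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
	const unsigned char *entry;
	unsigned int i;

	assert(data && start && length);
	if (len < MIDX_HEADER_SIZE + table_len)
		return 0;

	entry = data + MIDX_HEADER_SIZE;
	for (i = 0; i < num_chunks; i++, entry += MIDX_CHUNKLOOKUP_WIDTH) {
		if (read_be32(entry) != chunk_id)
			continue;
		*start = read_be64(entry + 4);
		/* row i + 1 always exists: the table has num_chunks + 1 rows */
		*length = read_be64(entry + MIDX_CHUNKLOOKUP_WIDTH + 4) - *start;
		return 1;
	}
	return 0;
}
```

A production reader would also want to check that each offset lies
inside the file and that offsets do not decrease; the sketch leaves
those checks out for brevity.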

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..e060e693f4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== multi-pack-index (MIDX) files have the following format:
+
+The multi-pack-index files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	1-byte version number:
+	    Git only writes or recognizes version 1.
+
+	1-byte Object Id Version
+	    Git only writes or recognizes version 1 (SHA1).
+
+	1-byte number of "chunks"
+
+	1-byte number of base multi-pack-index files:
+	    This value is currently always zero.
+
+	4-byte number of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	(This section intentionally left incomplete.)
+
+TRAILER:
+
+	20-byte SHA1-checksum of the above contents.
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 03/23] multi-pack-index: add builtin
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 01/23] multi-pack-index: add design document Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 02/23] multi-pack-index: add format details Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-20 18:22         ` Junio C Hamano
  2018-07-12 19:39       ` [PATCH v4 04/23] multi-pack-index: add 'write' verb Derrick Stolee
                         ` (20 subsequent siblings)
  23 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

This new 'git multi-pack-index' builtin will be the plumbing access
for writing, reading, and checking multi-pack-index files. The
initial implementation is a no-op.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                             |  3 ++-
 Documentation/git-multi-pack-index.txt | 36 ++++++++++++++++++++++++++
 Makefile                               |  1 +
 builtin.h                              |  1 +
 builtin/multi-pack-index.c             | 34 ++++++++++++++++++++++++
 command-list.txt                       |  1 +
 git.c                                  |  1 +
 7 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/git-multi-pack-index.txt
 create mode 100644 builtin/multi-pack-index.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..25633bc515 100644
--- a/.gitignore
+++ b/.gitignore
@@ -99,8 +99,9 @@
 /git-mergetool--lib
 /git-mktag
 /git-mktree
-/git-name-rev
+/git-multi-pack-index
 /git-mv
+/git-name-rev
 /git-notes
 /git-p4
 /git-pack-redundant
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
new file mode 100644
index 0000000000..ec9982cbfc
--- /dev/null
+++ b/Documentation/git-multi-pack-index.txt
@@ -0,0 +1,36 @@
+git-multi-pack-index(1)
+======================
+
+NAME
+----
+git-multi-pack-index - Write and verify multi-pack-indexes
+
+
+SYNOPSIS
+--------
+[verse]
+'git multi-pack-index' [--object-dir=<dir>]
+
+DESCRIPTION
+-----------
+Write or verify a multi-pack-index (MIDX) file.
+
+OPTIONS
+-------
+
+--object-dir=<dir>::
+	Use given directory for the location of Git objects. We check
+	`<dir>/pack/multi-pack-index` for the current MIDX file, and
+	`<dir>/pack` for the pack-files to index.
+
+
+SEE ALSO
+--------
+See link:technical/multi-pack-index.html[The Multi-Pack-Index Design
+Document] and link:technical/pack-format.html[The Multi-Pack-Index
+Format] for more information on the multi-pack-index feature.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index e4b503d259..54610875ec 100644
--- a/Makefile
+++ b/Makefile
@@ -1047,6 +1047,7 @@ BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
+BUILTIN_OBJS += builtin/multi-pack-index.o
 BUILTIN_OBJS += builtin/mv.o
 BUILTIN_OBJS += builtin/name-rev.o
 BUILTIN_OBJS += builtin/notes.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..70997d7ace 100644
--- a/builtin.h
+++ b/builtin.h
@@ -191,6 +191,7 @@ extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
+extern int cmd_multi_pack_index(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
 extern int cmd_name_rev(int argc, const char **argv, const char *prefix);
 extern int cmd_notes(int argc, const char **argv, const char *prefix);
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
new file mode 100644
index 0000000000..3161ddae86
--- /dev/null
+++ b/builtin/multi-pack-index.c
@@ -0,0 +1,34 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_multi_pack_index_usage[] = {
+	N_("git multi-pack-index [--object-dir=<dir>]"),
+	NULL
+};
+
+static struct opts_multi_pack_index {
+	const char *object_dir;
+} opts;
+
+int cmd_multi_pack_index(int argc, const char **argv,
+			 const char *prefix)
+{
+	static struct option builtin_multi_pack_index_options[] = {
+		OPT_FILENAME(0, "object-dir", &opts.object_dir,
+		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_END(),
+	};
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_multi_pack_index_options,
+			     builtin_multi_pack_index_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..61071f8fa2 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators           complete
 git-merge-tree                          ancillaryinterrogators
+git-multi-pack-index                    plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c2f48d53dd..a7509fa5f7 100644
--- a/git.c
+++ b/git.c
@@ -505,6 +505,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
 	{ "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
 	{ "mktree", cmd_mktree, RUN_SETUP },
+	{ "multi-pack-index", cmd_multi_pack_index, RUN_SETUP_GENTLY },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
 	{ "name-rev", cmd_name_rev, RUN_SETUP },
 	{ "notes", cmd_notes, RUN_SETUP },
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 04/23] multi-pack-index: add 'write' verb
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (2 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 03/23] multi-pack-index: add builtin Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 22:56         ` Eric Sunshine
  2018-07-12 19:39       ` [PATCH v4 05/23] midx: write header information to lockfile Derrick Stolee
                         ` (19 subsequent siblings)
  23 siblings, 1 reply; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

In anticipation of writing multi-pack-indexes, add a skeleton
'git multi-pack-index write' subcommand and send the options to a
write_midx_file() method. Also create a skeleton test script that
tests the 'write' subcommand.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 22 +++++++++++++++++++++-
 Makefile                               |  1 +
 builtin/multi-pack-index.c             | 17 +++++++++++++++--
 midx.c                                 |  7 +++++++
 midx.h                                 |  6 ++++++
 t/t5319-multi-pack-index.sh            | 10 ++++++++++
 6 files changed, 60 insertions(+), 3 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-multi-pack-index.sh
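
Reader's note (not part of the patch): the builtin change below accepts
an optional --object-dir and then requires exactly one verb, falling
through to the usage message otherwise. The sketch below restates that
control flow in plain C without Git's parse-options API. The
parse_midx_args() function and the midx_verb enum are hypothetical, and
only the --object-dir=<dir> spelling is handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum midx_verb { MIDX_VERB_USAGE, MIDX_VERB_WRITE };

/*
 * Hypothetical stand-in for the dispatch in cmd_multi_pack_index().
 * argv[0] is the first argument after the command name. *object_dir is
 * left NULL when the option is absent, so the caller can fall back to
 * the default object directory.
 */
enum midx_verb parse_midx_args(int argc, const char **argv,
			       const char **object_dir)
{
	static const char opt[] = "--object-dir=";
	int i = 0;

	assert(argv && object_dir);
	*object_dir = NULL;

	if (i < argc && !strncmp(argv[i], opt, strlen(opt)))
		*object_dir = argv[i++] + strlen(opt);

	/* exactly one verb must remain, and "write" is the only one so far */
	if (argc - i != 1 || strcmp(argv[i], "write"))
		return MIDX_VERB_USAGE;
	return MIDX_VERB_WRITE;
}
```

Returning an enum instead of calling the writer directly keeps the
sketch testable in isolation; the patch itself calls write_midx_file()
from the matching branch.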

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index ec9982cbfc..a62af1caca 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>]
+'git multi-pack-index' [--object-dir=<dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	`<dir>/pack/multi-pack-index` for the current MIDX file, and
 	`<dir>/pack` for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	`<dir>/pack/multi-pack-index`.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-----------------------------------------------
+$ git multi-pack-index write
+-----------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate object store.
++
+-----------------------------------------------
+$ git multi-pack-index --object-dir <alt> write
+-----------------------------------------------
+
 
 SEE ALSO
 --------
diff --git a/Makefile b/Makefile
index 54610875ec..f5636c711d 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 3161ddae86..6a7aa00cf2 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -2,9 +2,10 @@
 #include "cache.h"
 #include "config.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>]"),
+	N_("git multi-pack-index [--object-dir=<dir>] write"),
 	NULL
 };
 
@@ -30,5 +31,17 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
-	return 0;
+	if (argc == 0)
+		goto usage;
+
+	if (!strcmp(argv[0], "write")) {
+		if (argc > 1)
+			goto usage;
+
+		return write_midx_file(opts.object_dir);
+	}
+
+usage:
+	usage_with_options(builtin_multi_pack_index_usage,
+			   builtin_multi_pack_index_options);
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..32468db1a2
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,7 @@
+#include "cache.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..dbdbe9f873
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,6 @@
+#ifndef __MIDX_H__
+#define __MIDX_H__
+
+int write_midx_file(const char *object_dir);
+
+#endif
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
new file mode 100755
index 0000000000..ec3ddbe79c
--- /dev/null
+++ b/t/t5319-multi-pack-index.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no packs' '
+	git multi-pack-index --object-dir=. write
+'
+
+test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 05/23] midx: write header information to lockfile
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (3 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 04/23] multi-pack-index: add 'write' verb Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 06/23] multi-pack-index: load into memory Derrick Stolee
                         ` (18 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. The
rest of the format will be added in small increments on top of this.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but is not checked. We will
check the return value in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 50 +++++++++++++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh |  4 ++-
 2 files changed, 53 insertions(+), 1 deletion(-)
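
Reader's note (not part of the patch): the commit message above
describes having each writer return the number of bytes it wrote so the
total can be compared against the expected offset. The sketch below
applies that idea to the 12-byte header, writing into a plain buffer
instead of a hashfile. The fill_midx_header() and put_be32_at() names
are hypothetical; the constants are copied from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define MIDX_HASH_VERSION 1
#define MIDX_HEADER_SIZE 12

/* Store v at p in big-endian (network) order; returns bytes written. */
size_t put_be32_at(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
	return 4;
}

/*
 * Fill out[0..11] with the header and return the number of bytes
 * written. The running total is checked against the expected size,
 * which is the bookkeeping technique the commit message describes.
 */
size_t fill_midx_header(unsigned char *out, unsigned char num_chunks,
			uint32_t num_packs)
{
	size_t written = 0;

	assert(out);
	written += put_be32_at(out + written, MIDX_SIGNATURE);
	out[written++] = MIDX_VERSION;
	out[written++] = MIDX_HASH_VERSION;
	out[written++] = num_chunks;
	out[written++] = 0; /* number of base multi-pack-index files */
	written += put_be32_at(out + written, num_packs);

	assert(written == MIDX_HEADER_SIZE);
	return written;
}
```

Deriving the return value from the writes, rather than returning the
constant, means a mismatch between the layout and MIDX_HEADER_SIZE
would be caught as soon as the function runs.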

diff --git a/midx.c b/midx.c
index 32468db1a2..f85f2d334d 100644
--- a/midx.c
+++ b/midx.c
@@ -1,7 +1,57 @@
 #include "cache.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+	return xstrfmt("%s/pack/multi-pack-index", object_dir);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+				unsigned char num_chunks,
+				uint32_t num_packs)
+{
+	unsigned char byte_values[4];
+
+	hashwrite_be32(f, MIDX_SIGNATURE);
+	byte_values[0] = MIDX_VERSION;
+	byte_values[1] = MIDX_HASH_VERSION;
+	byte_values[2] = num_chunks;
+	byte_values[3] = 0; /* unused */
+	hashwrite(f, byte_values, sizeof(byte_values));
+	hashwrite_be32(f, num_packs);
+
+	return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+	unsigned char num_chunks = 0;
+	char *midx_name;
+	struct hashfile *f = NULL;
+	struct lock_file lk;
+
+	midx_name = get_midx_filename(object_dir);
+	if (safe_create_leading_directories(midx_name)) {
+		UNLEAK(midx_name);
+		die_errno(_("unable to create leading directories of %s"),
+			  midx_name);
+	}
+
+	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	FREE_AND_NULL(midx_name);
+
+	write_midx_header(f, num_chunks, 0);
+
+	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	commit_lock_file(&lk);
+
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ec3ddbe79c..50e80f8f2c 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,7 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 test_expect_success 'write midx with no packs' '
-	git multi-pack-index --object-dir=. write
+	test_when_finished rm -f pack/multi-pack-index &&
+	git multi-pack-index --object-dir=. write &&
+	test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 06/23] multi-pack-index: load into memory
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (4 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 05/23] midx: write header information to lockfile Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 07/23] t5319: expand test data Derrick Stolee
                         ` (17 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

Create a new multi_pack_index struct for loading multi-pack-indexes into
memory. Create a test-tool builtin for reading basic information about
that multi-pack-index to verify the correct data is written.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile                    |  1 +
 midx.c                      | 79 +++++++++++++++++++++++++++++++++++++
 midx.h                      | 18 +++++++++
 object-store.h              |  2 +
 t/helper/test-read-midx.c   | 31 +++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 t/t5319-multi-pack-index.sh | 11 +++++-
 8 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 t/helper/test-read-midx.c
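
Reader's note (not part of the patch): load_multi_pack_index() below
validates the file in a fixed order: minimum size, signature, file
version, then hash version, before reading the chunk and pack counts.
The sketch below performs the same checks on an in-memory buffer so the
order is easy to see. The parse_midx_header() function, the midx_header
struct and the result enum are hypothetical; the constants and byte
positions are copied from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define MIDX_HASH_VERSION 1
#define MIDX_BYTE_FILE_VERSION 4
#define MIDX_BYTE_HASH_VERSION 5
#define MIDX_BYTE_NUM_CHUNKS 6
#define MIDX_BYTE_NUM_PACKS 8
#define MIDX_HEADER_SIZE 12
#define MIDX_HASH_LEN 20
#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)

enum midx_parse_result {
	MIDX_OK,
	MIDX_TOO_SMALL,
	MIDX_BAD_SIGNATURE,
	MIDX_BAD_VERSION,
	MIDX_BAD_HASH_VERSION
};

struct midx_header {
	unsigned char version;
	unsigned char num_chunks;
	uint32_t num_packs;
};

uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Validate the fixed header fields, then copy out the counts. */
enum midx_parse_result parse_midx_header(const unsigned char *data,
					 size_t len,
					 struct midx_header *out)
{
	assert(data && out);

	/* a valid file holds at least the header plus the trailing checksum */
	if (len < MIDX_MIN_SIZE)
		return MIDX_TOO_SMALL;
	if (load_be32(data) != MIDX_SIGNATURE)
		return MIDX_BAD_SIGNATURE;
	if (data[MIDX_BYTE_FILE_VERSION] != MIDX_VERSION)
		return MIDX_BAD_VERSION;
	if (data[MIDX_BYTE_HASH_VERSION] != MIDX_HASH_VERSION)
		return MIDX_BAD_HASH_VERSION;

	out->version = data[MIDX_BYTE_FILE_VERSION];
	out->num_chunks = data[MIDX_BYTE_NUM_CHUNKS];
	out->num_packs = load_be32(data + MIDX_BYTE_NUM_PACKS);
	return MIDX_OK;
}
```

Checking the size first matters because every later check reads fixed
byte positions; the patch does the same before calling xmmap().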

diff --git a/Makefile b/Makefile
index f5636c711d..0b801d1b16 100644
--- a/Makefile
+++ b/Makefile
@@ -717,6 +717,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-path-utils.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-read-cache.o
+TEST_BUILTINS_OBJS += test-read-midx.o
 TEST_BUILTINS_OBJS += test-ref-store.o
 TEST_BUILTINS_OBJS += test-regex.o
 TEST_BUILTINS_OBJS += test-revision-walking.o
diff --git a/midx.c b/midx.c
index f85f2d334d..c1ff5acf85 100644
--- a/midx.c
+++ b/midx.c
@@ -1,18 +1,97 @@
 #include "cache.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
+#define MIDX_BYTE_FILE_VERSION 4
+#define MIDX_BYTE_HASH_VERSION 5
+#define MIDX_BYTE_NUM_CHUNKS 6
+#define MIDX_BYTE_NUM_PACKS 8
 #define MIDX_HASH_VERSION 1
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+struct multi_pack_index *load_multi_pack_index(const char *object_dir)
+{
+	struct multi_pack_index *m = NULL;
+	int fd;
+	struct stat st;
+	size_t midx_size;
+	void *midx_map = NULL;
+	uint32_t hash_version;
+	char *midx_name = get_midx_filename(object_dir);
+
+	fd = git_open(midx_name);
+
+	if (fd < 0)
+		goto cleanup_fail;
+	if (fstat(fd, &st)) {
+		error_errno(_("failed to read %s"), midx_name);
+		goto cleanup_fail;
+	}
+
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < MIDX_MIN_SIZE) {
+		error(_("multi-pack-index file %s is too small"), midx_name);
+		goto cleanup_fail;
+	}
+
+	FREE_AND_NULL(midx_name);
+
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	FLEX_ALLOC_MEM(m, object_dir, object_dir, strlen(object_dir));
+	m->fd = fd;
+	m->data = midx_map;
+	m->data_len = midx_size;
+
+	m->signature = get_be32(m->data);
+	if (m->signature != MIDX_SIGNATURE) {
+		error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
+		      m->signature, MIDX_SIGNATURE);
+		goto cleanup_fail;
+	}
+
+	m->version = m->data[MIDX_BYTE_FILE_VERSION];
+	if (m->version != MIDX_VERSION) {
+		error(_("multi-pack-index version %d not recognized"),
+		      m->version);
+		goto cleanup_fail;
+	}
+
+	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
+	if (hash_version != MIDX_HASH_VERSION) {
+		error(_("hash version %u does not match"), hash_version);
+		goto cleanup_fail;
+	}
+	m->hash_len = MIDX_HASH_LEN;
+
+	m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
+
+	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
+
+	return m;
+
+cleanup_fail:
+	free(m);
+	free(midx_name);
+	if (midx_map)
+		munmap(midx_map, midx_size);
+	if (0 <= fd)
+		close(fd);
+	return NULL;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index dbdbe9f873..0e05051bca 100644
--- a/midx.h
+++ b/midx.h
@@ -1,6 +1,24 @@
 #ifndef __MIDX_H__
 #define __MIDX_H__
 
+struct multi_pack_index {
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	char object_dir[FLEX_ARRAY];
+};
+
+struct multi_pack_index *load_multi_pack_index(const char *object_dir);
+
 int write_midx_file(const char *object_dir);
 
 #endif
diff --git a/object-store.h b/object-store.h
index d683112fd7..13a766aea8 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,6 +84,8 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
+struct multi_pack_index;
+
 struct raw_object_store {
 	/*
 	 * Path to the repository's object store.
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
new file mode 100644
index 0000000000..988a487169
--- /dev/null
+++ b/t/helper/test-read-midx.c
@@ -0,0 +1,31 @@
+#include "test-tool.h"
+#include "cache.h"
+#include "midx.h"
+#include "repository.h"
+#include "object-store.h"
+
+static int read_midx_file(const char *object_dir)
+{
+	struct multi_pack_index *m = load_multi_pack_index(object_dir);
+
+	if (!m)
+		return 1;
+
+	printf("header: %08x %d %d %d\n",
+	       m->signature,
+	       m->version,
+	       m->num_chunks,
+	       m->num_packs);
+
+	printf("object-dir: %s\n", m->object_dir);
+
+	return 0;
+}
+
+int cmd__read_midx(int argc, const char **argv)
+{
+	if (argc != 2)
+		usage("read-midx <object-dir>");
+
+	return read_midx_file(argv[1]);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 805a45de9c..1c3ab36e6c 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -27,6 +27,7 @@ static struct test_cmd cmds[] = {
 	{ "path-utils", cmd__path_utils },
 	{ "prio-queue", cmd__prio_queue },
 	{ "read-cache", cmd__read_cache },
+	{ "read-midx", cmd__read_midx },
 	{ "ref-store", cmd__ref_store },
 	{ "regex", cmd__regex },
 	{ "revision-walking", cmd__revision_walking },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 7116ddfb94..6af8c08a66 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -21,6 +21,7 @@ int cmd__online_cpus(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__read_cache(int argc, const char **argv);
+int cmd__read_midx(int argc, const char **argv);
 int cmd__ref_store(int argc, const char **argv);
 int cmd__regex(int argc, const char **argv);
 int cmd__revision_walking(int argc, const char **argv);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 50e80f8f2c..506bd8abb8 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,10 +3,19 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+midx_read_expect () {
+	cat >expect <<-EOF
+	header: 4d494458 1 0 0
+	object-dir: .
+	EOF
+	test-tool read-midx . >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index
+	midx_read_expect
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 07/23] t5319: expand test data
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (5 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 06/23] multi-pack-index: load into memory Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 08/23] packfile: generalize pack directory list Derrick Stolee
                         ` (16 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests that create repository data including
multiple packfiles with both version 1 and version 2 pack-index formats.

The current 'git multi-pack-index write' command will always write the
same file with no "real" data. This will be expanded in future commits,
along with the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 84 +++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 506bd8abb8..1240127ec1 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -18,4 +18,88 @@ test_expect_success 'write midx with no packs' '
 	midx_read_expect
 '
 
+generate_objects () {
+	i=$1
+	iii=$(printf '%03i' $i)
+	{
+		test-tool genrandom "bar" 200 &&
+		test-tool genrandom "baz $iii" 50
+	} >wide_delta_$iii &&
+	{
+		test-tool genrandom "foo"$i 100 &&
+		test-tool genrandom "foo"$(( $i + 1 )) 100 &&
+		test-tool genrandom "foo"$(( $i + 2 )) 100
+	} >deep_delta_$iii &&
+	{
+		echo $iii &&
+		test-tool genrandom "$iii" 8192
+	} >file_$iii &&
+	git update-index --add file_$iii deep_delta_$iii wide_delta_$iii
+}
+
+commit_and_list_objects () {
+	{
+		echo 101 &&
+		test-tool genrandom 100 8192;
+	} >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) &&
+	{
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git reset --hard $commit
+}
+
+test_expect_success 'create objects' '
+	test_commit initial &&
+	for i in $(test_seq 1 5)
+	do
+		generate_objects $i
+	done &&
+	commit_and_list_objects
+'
+
+test_expect_success 'write midx with one v1 pack' '
+	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'add more objects' '
+	for i in $(test_seq 6 10)
+	do
+		generate_objects $i
+	done &&
+	commit_and_list_objects
+'
+
+test_expect_success 'write midx with two packs' '
+	git pack-objects --index-version=1 pack/test-2 <obj-list &&
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'add more packs' '
+	for j in $(test_seq 11 20)
+	do
+		generate_objects $j &&
+		commit_and_list_objects &&
+		git pack-objects --index-version=2 test-pack <obj-list
+	done
+'
+
+test_expect_success 'write midx with twelve packs' '
+	git multi-pack-index --object-dir=. write &&
+	midx_read_expect
+'
+
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 08/23] packfile: generalize pack directory list
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (6 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 07/23] t5319: expand test data Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 09/23] multi-pack-index: read packfile list Derrick Stolee
                         ` (15 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

In anticipation of sharing the pack directory listing with the
multi-pack-index, generalize prepare_packed_git_one() into
for_each_file_in_pack_dir().
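
As a rough illustration of the callback shape only (not the code in this
patch), the sketch below walks "<objdir>/pack" with plain POSIX calls and
hands every entry to a caller-supplied function. The names walk_pack_dir,
each_file_fn and count_idx are hypothetical, and the fixed-size path buffer
is a simplification; the real code uses a strbuf.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type, mirroring the four-argument shape above. */
typedef void (*each_file_fn)(const char *full_path, size_t full_path_len,
			     const char *file_name, void *data);

/* Returns 0 on success, -1 if "<objdir>/pack/" cannot be opened. */
static int walk_pack_dir(const char *objdir, each_file_fn fn, void *data)
{
	char path[4096];
	DIR *dir;
	struct dirent *de;
	int dirlen;

	assert(objdir && fn);

	dirlen = snprintf(path, sizeof(path), "%s/pack/", objdir);
	if (dirlen < 0 || (size_t)dirlen >= sizeof(path))
		return -1;

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		int n;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;

		/* Overwrite the previous entry name after the directory prefix. */
		n = snprintf(path + dirlen, sizeof(path) - dirlen, "%s", de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path) - dirlen)
			continue; /* too long for this sketch's fixed buffer */

		fn(path, (size_t)dirlen + n, de->d_name, data);
	}

	closedir(dir);
	return 0;
}

/* Example callback: count entries whose name ends in ".idx". */
static void count_idx(const char *full_path, size_t full_path_len,
		      const char *file_name, void *data)
{
	size_t len = strlen(file_name);

	(void)full_path;
	(void)full_path_len;
	if (len >= 4 && !strcmp(file_name + len - 4, ".idx"))
		(*(unsigned *)data)++;
}
```

A caller would pass a pointer to an unsigned counter as the data argument,
which is the same way prepare_pack() receives its prepare_pack_data.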

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 101 +++++++++++++++++++++++++++++++++--------------------
 packfile.h |   6 ++++
 2 files changed, 69 insertions(+), 38 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7cd45aa4b2..ee1ab9b804 100644
--- a/packfile.c
+++ b/packfile.c
@@ -738,13 +738,14 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
-static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data)
 {
 	struct strbuf path = STRBUF_INIT;
 	size_t dirnamelen;
 	DIR *dir;
 	struct dirent *de;
-	struct string_list garbage = STRING_LIST_INIT_DUP;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -759,53 +760,77 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	strbuf_addch(&path, '/');
 	dirnamelen = path.len;
 	while ((de = readdir(dir)) != NULL) {
-		struct packed_git *p;
-		size_t base_len;
-
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		strbuf_setlen(&path, dirnamelen);
 		strbuf_addstr(&path, de->d_name);
 
-		base_len = path.len;
-		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
-			/* Don't reopen a pack we already have. */
-			for (p = r->objects->packed_git; p;
-			     p = p->next) {
-				size_t len;
-				if (strip_suffix(p->pack_name, ".pack", &len) &&
-				    len == base_len &&
-				    !memcmp(p->pack_name, path.buf, len))
-					break;
-			}
-			if (p == NULL &&
-			    /*
-			     * See if it really is a valid .idx file with
-			     * corresponding .pack file that we can map.
-			     */
-			    (p = add_packed_git(path.buf, path.len, local)) != NULL)
-				install_packed_git(r, p);
-		}
-
-		if (!report_garbage)
-			continue;
-
-		if (ends_with(de->d_name, ".idx") ||
-		    ends_with(de->d_name, ".pack") ||
-		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep") ||
-		    ends_with(de->d_name, ".promisor"))
-			string_list_append(&garbage, path.buf);
-		else
-			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
+		fn(path.buf, path.len, de->d_name, data);
 	}
+
 	closedir(dir);
-	report_pack_garbage(&garbage);
-	string_list_clear(&garbage, 0);
 	strbuf_release(&path);
 }
 
+struct prepare_pack_data {
+	struct repository *r;
+	struct string_list *garbage;
+	int local;
+};
+
+static void prepare_pack(const char *full_name, size_t full_name_len,
+			 const char *file_name, void *_data)
+{
+	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
+	struct packed_git *p;
+	size_t base_len = full_name_len;
+
+	if (strip_suffix_mem(full_name, &base_len, ".idx")) {
+		/* Don't reopen a pack we already have. */
+		for (p = data->r->objects->packed_git; p; p = p->next) {
+			size_t len;
+			if (strip_suffix(p->pack_name, ".pack", &len) &&
+			    len == base_len &&
+			    !memcmp(p->pack_name, full_name, len))
+				break;
+		}
+
+		if (!p) {
+			p = add_packed_git(full_name, full_name_len, data->local);
+			if (p)
+				install_packed_git(data->r, p);
+		}
+	}
+
+	if (!report_garbage)
+		return;
+
+	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".pack") ||
+	    ends_with(file_name, ".bitmap") ||
+	    ends_with(file_name, ".keep") ||
+	    ends_with(file_name, ".promisor"))
+		string_list_append(data->garbage, full_name);
+	else
+		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
+}
+
+static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
+{
+	struct prepare_pack_data data;
+	struct string_list garbage = STRING_LIST_INIT_DUP;
+
+	data.r = r;
+	data.garbage = &garbage;
+	data.local = local;
+
+	for_each_file_in_pack_dir(objdir, prepare_pack, &data);
+
+	report_pack_garbage(data.garbage);
+	string_list_clear(data.garbage, 0);
+}
+
 static void prepare_packed_git(struct repository *r);
 /*
  * Give a fast, rough count of the number of objects in the repository. This
diff --git a/packfile.h b/packfile.h
index e0a38aba93..d2ad30300a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -28,6 +28,12 @@ extern char *sha1_pack_index_name(const unsigned char *sha1);
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
+typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
+				      const char *file_name, void *data);
+void for_each_file_in_pack_dir(const char *objdir,
+			       each_file_in_pack_dir_fn fn,
+			       void *data);
+
 /* A hook to report invalid files in pack directory */
 #define PACKDIR_FILE_PACK 1
 #define PACKDIR_FILE_IDX 2
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 09/23] multi-pack-index: read packfile list
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (7 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 08/23] packfile: generalize pack directory list Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 10/23] multi-pack-index: write pack names in chunk Derrick Stolee
                         ` (14 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

When constructing a multi-pack-index file for a given object directory,
scan the enclosed pack directory for files that end with ".idx" and load
the paired packfile for each one using add_packed_git().
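
The filtering step can be sketched in isolation as follows. This is only an
approximation under stated assumptions: name_list, has_suffix, add_if_idx
and name_list_clear are hypothetical stand-ins, the list stores copied file
names rather than struct packed_git pointers, and realloc() doubling stands
in for ALLOC_GROW.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable list of file names. */
struct name_list {
	char **names;
	size_t nr, alloc;
};

static int has_suffix(const char *s, const char *suffix)
{
	size_t slen = strlen(s), suflen = strlen(suffix);

	return slen >= suflen && !strcmp(s + slen - suflen, suffix);
}

/* Returns 1 if the name was added, 0 if skipped, -1 on allocation failure. */
static int add_if_idx(struct name_list *list, const char *file_name)
{
	char *copy;

	assert(list && file_name);

	if (!has_suffix(file_name, ".idx"))
		return 0;

	/* Grow before storing, so a failed add leaves nr untouched. */
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? list->alloc * 2 : 16;
		char **grown = realloc(list->names, new_alloc * sizeof(*grown));

		if (!grown)
			return -1;
		list->names = grown;
		list->alloc = new_alloc;
	}

	copy = malloc(strlen(file_name) + 1);
	if (!copy)
		return -1;
	strcpy(copy, file_name);

	list->names[list->nr++] = copy;
	return 1;
}

static void name_list_clear(struct name_list *list)
{
	size_t i;

	for (i = 0; i < list->nr; i++)
		free(list->names[i]);
	free(list->names);
	list->names = NULL;
	list->nr = list->alloc = 0;
}
```

As in add_pack_to_midx(), the count is only incremented once the entry has
been stored successfully, so a failed entry never leaves a hole in the list.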

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 48 ++++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh | 15 ++++++------
 2 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/midx.c b/midx.c
index c1ff5acf85..f742d7ccd7 100644
--- a/midx.c
+++ b/midx.c
@@ -1,6 +1,8 @@
 #include "cache.h"
 #include "csum-file.h"
+#include "dir.h"
 #include "lockfile.h"
+#include "packfile.h"
 #include "object-store.h"
 #include "midx.h"
 
@@ -109,12 +111,41 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_list {
+	struct packed_git **list;
+	uint32_t nr;
+	uint32_t alloc_list;
+};
+
+static void add_pack_to_midx(const char *full_path, size_t full_path_len,
+			     const char *file_name, void *data)
+{
+	struct pack_list *packs = (struct pack_list *)data;
+
+	if (ends_with(file_name, ".idx")) {
+		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
+
+		packs->list[packs->nr] = add_packed_git(full_path,
+							full_path_len,
+							0);
+		if (!packs->list[packs->nr]) {
+			warning(_("failed to add packfile '%s'"),
+				full_path);
+			return;
+		}
+
+		packs->nr++;
+	}
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char num_chunks = 0;
 	char *midx_name;
+	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
+	struct pack_list packs;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -123,14 +154,29 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	packs.nr = 0;
+	packs.alloc_list = 16;
+	packs.list = NULL;
+	ALLOC_ARRAY(packs.list, packs.alloc_list);
+
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, 0);
+	write_midx_header(f, num_chunks, packs.nr);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+	for (i = 0; i < packs.nr; i++) {
+		if (packs.list[i]) {
+			close_pack(packs.list[i]);
+			free(packs.list[i]);
+		}
+	}
+
+	free(packs.list);
 	return 0;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1240127ec1..54117a7f49 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect () {
+	NUM_PACKS=$1
 	cat >expect <<-EOF
-	header: 4d494458 1 0 0
+	header: 4d494458 1 0 $NUM_PACKS
 	object-dir: .
 	EOF
 	test-tool read-midx . >actual &&
@@ -15,7 +16,7 @@ midx_read_expect () {
 test_expect_success 'write midx with no packs' '
 	test_when_finished rm -f pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 0
 '
 
 generate_objects () {
@@ -65,13 +66,13 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	git pack-objects --index-version=2,0x40 pack/test <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'add more objects' '
@@ -85,7 +86,7 @@ test_expect_success 'add more objects' '
 test_expect_success 'write midx with two packs' '
 	git pack-objects --index-version=1 pack/test-2 <obj-list &&
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 2
 '
 
 test_expect_success 'add more packs' '
@@ -93,13 +94,13 @@ test_expect_success 'add more packs' '
 	do
 		generate_objects $j &&
 		commit_and_list_objects &&
-		git pack-objects --index-version=2 test-pack <obj-list
+		git pack-objects --index-version=2 pack/test-pack <obj-list
 	done
 '
 
 test_expect_success 'write midx with twelve packs' '
 	git multi-pack-index --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 12
 '
 
 test_done
-- 
2.18.0.118.gd4f65b8d14



* [PATCH v4 10/23] multi-pack-index: write pack names in chunk
  2018-07-12 19:39     ` [PATCH v4 00/23] Multi-pack-index (MIDX) Derrick Stolee
                         ` (8 preceding siblings ...)
  2018-07-12 19:39       ` [PATCH v4 09/23] multi-pack-index: read packfile list Derrick Stolee
@ 2018-07-12 19:39       ` Derrick Stolee
  2018-07-12 19:39       ` [PATCH v4 11/23] midx: read pack names into array Derrick Stolee
                         ` (13 subsequent siblings)
  23 siblings, 0 replies; 192+ messages in thread
From: Derrick Stolee @ 2018-07-12 19:39 UTC (permalink / raw)
  To: git, dstolee; +Cc: gitster, sbeller, pclouds, avarab, sunshine, szeder.dev

The multi-pack-index needs to track which packfiles it indexes. Store
these in our first required chunk. Since filenames vary in length, pad
the chunk to a multiple of four bytes to keep later chunks aligned.
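
To make the padding arithmetic concrete, here is a small sketch that lays
out the names in a memory buffer instead of a hashfile. The names pad_len
and pack_names_chunk and the macro CHUNK_ALIGNMENT are hypothetical; only
the four-byte alignment and the NUL-terminated concatenation are taken from
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_ALIGNMENT 4

/* Number of zero bytes needed to round 'written' up to the alignment. */
static size_t pad_len(size_t written)
{
	size_t rem = written % CHUNK_ALIGNMENT;

	return rem ? CHUNK_ALIGNMENT - rem : 0;
}

/*
 * Concatenate the NUL-terminated names into buf and pad with zero bytes.
 * Returns the total chunk length, or 0 if buf is too small.
 */
static size_t pack_names_chunk(char *buf, size_t buflen,
			       const char *const *names, size_t nr)
{
	size_t written = 0, i, pad;

	assert(buf && names);

	for (i = 0; i < nr; i++) {
		size_t len = strlen(names[i]) + 1; /* include the NUL */

		if (len > buflen - written)
			return 0;
		memcpy(buf + written, names[i], len);
		written += len;
	}

	pad = pad_len(written);
	if (pad > buflen - written)
		return 0;
	memset(buf + written, 0, pad);

	return written + pad;
}
```

For example, "a.idx" and "bb.idx" occupy 6 + 7 = 13 bytes with their
terminators, so three zero bytes are appended to reach 16.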

Modify the 'test-tool read-midx' helper to output the existence of the
pack-file name chunk. Modify t5319-multi-pack-index.sh to reflect this
new output and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks that verify
that the values we are writing fit the format definitions. This mainly
helps developers while working on the feature, but it can also identify
issues that only appear when dealing with very large data sets. These
large sets are hard to encode into test cases.
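
One such write-time check is the ordering test on pack names. A minimal
sketch, assuming a standalone helper (first_order_violation is a
hypothetical name) that reports the offending position rather than calling
BUG() as the patch does:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the first name that is not strictly greater than
 * its predecessor, or nr if the whole list is in strict order.
 */
static size_t first_order_violation(const char *const *names, size_t nr)
{
	size_t i;

	assert(names || !nr);

	for (i = 1; i < nr; i++)
		if (strcmp(names[i], names[i - 1]) <= 0)
			return i;

	return nr;
}
```

Using "<= 0" rejects duplicates as well as misordered names, which matches
the comparison in write_midx_pack_names().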

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |   6 +
 midx.c                                  | 174 +++++++++++++++++++++++-
 midx.h                                  |   2 +
 t/helper/test-read-midx.c               |   7 +
 t/t5319-multi-pack-index.sh             |   3 +-
 5 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index e060e693f4..6c5a77475f 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/midx.c b/midx.c
index f742d7ccd7..ca7a32bf95 100644
--- a/midx.c
+++ b/midx.c
@@ -17,6 +17,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -31,6 +36,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 	void *midx_map = NULL;
 	uint32_t hash_version;
 	char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 
@@ -82,6 +88,33 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir)
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
+					     MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
+						 MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKNAMES:
+				m->chunk_pack_names = m->data + chunk_offset;
+				break;
+
+			case 0:
+				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
+				break;
+
+			default:
+				/*
+				 * Do nothing on unrecognized chunks, allowing future
+				 * extensions to add optional chunks.
+				 */
+				break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die(_("multi-pack-index missing required pack-name chunk"));
+
 	return m;
 
 cleanup_fail:
@@ -113,8 +146,11 @@ static size_t write_midx_header(struct hashfile *f,
 
 struct pack_list {
 	struct packed_git **list;
+	char **names;
 	uint32_t nr;
 	uint32_t alloc_list;
+	uint32_t alloc_names;
+	size_t pack_name_concat_len;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -124,6 +160,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 	if (ends_with(file_name, ".idx")) {
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
+		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
 							full_path_len,
@@ -134,18 +171,89 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}
 
+		packs->names[packs->nr] = xstrdup(file_name);
+		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
 	}
 }
 
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+
+	free(pairs);
+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+				    char **pack_names,
+				    uint32_t num_packs)
+{
+	uint32_t i;
+	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+	size_t written = 0;
+
+	for (i = 0; i < num_packs; i++) {
+		size_t writelen = strlen(pack_names[i]) + 1;
+
+		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+			BUG("incorrect pack-file order: %s before %s",
+			    pack_names[i - 1],
+			    pack_names[i]);
+
+		hashwrite(f, pack_names[i], writelen);
+		written += writelen;
+	}
+
+	/* add padding to be aligned */
+	i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+	if (i < MIDX_CHUNK_ALIGNMENT) {
+		memset(padding, 0, sizeof(padding));
+		hashwrite(f, padding, i);
+		written += i;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
-	unsigned char num_chunks = 0;
+	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct pack_list packs;
+	uint32_t *pack_perm = NULL;
+	uint64_t written = 0;
+	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
+	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -156,16 +264,76 @@ int write_midx_file(const char *object_dir)
 
 	packs.nr = 0;
 	packs.alloc_list = 16;
+	packs.alloc_names = 16;
 	packs.list = NULL;
+	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
+	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
+	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	sort_packs_by_name(packs.names, packs.nr, pack_perm);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, packs.nr);
+	cur_chunk = 0;
+	num_chunks = 1;
+
+	written = write_midx_header(f, num_chunks, packs.nr);
+
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_c