git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/23] Multi-pack-index (MIDX)
@ 2018-06-07 14:03 Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
                   ` (24 more replies)
  0 siblings, 25 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

This patch series includes a rewrite of the previous
multi-pack-index RFC [1] using the feedback from the
commit-graph feature.

I based this series on 'next' as it requires the
recent object-store patches.

The multi-pack-index (MIDX) is explained fully in
the design document 'Documentation/technical/midx.txt'.
The short description is that the MIDX stores the
information from all of the IDX files in a pack
directory. The crucial design decision is that the
IDX files still exist, so we can fall back to the IDX
files if there is any issue with the MIDX (or core.midx
is set to false, or a user downgrades Git, etc.).

The MIDX feature has been part of our GVFS releases
for a few months (since the RFC). It has behaved well,
indexing over 31 million commits and trees across up
to 250 packfiles. These MIDX files are nearly 1GB in
size and take ~20 seconds to rewrite when adding new
IDX information. This ~20s mark is something I'd like
to improve, and I mention how to make the file
incremental (similar to split-index) in the design
document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time
after both the MIDX and commit-graph are stable.


Lookup Speedups
---------------

When looking for an object, Git uses a most-recently-
used (MRU) cache of packfiles. This does pretty well to
minimize the number of misses when searching through
packfiles for an object, especially if there is one
"big" packfile that contains most of the objects (so it
will rarely miss and is usually one of the first two
packfiles in the list). The MIDX does provide a way
to remove these misses, improving lookup time. However,
this lookup time greatly depends on the arrangement of
the packfiles.

For instance, if you take the Linux repository and repack
using `git repack -adfF --max-pack-size=128m` then all
commits will be in one packfile, all trees will be in
a small set of packfiles and organized well so 'git
rev-list --objects HEAD^{tree}' only inspects one or two
packfiles.

GVFS has the notion of a "prefetch packfile". These are
packfiles that are precomputed by cache servers to
contain the commits and trees introduced to the remote
each day. GVFS downloads these packfiles and places them
in an alternate. Since these are organized by "first
time introduced" and the working directory is so large,
the MRU misses are significant when performing a checkout
and updating the .git/index file.

To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

        Before: 0.17s
        After:  0.13s
        % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.
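
The hit/miss counting described above can be pictured with a
small standalone sketch. It is not the instrumentation used for
these numbers: `toy_pack`, `probe_stats`, `toy_bsearch_pack` and
`toy_find_object` are hypothetical names, and 32-bit integers
stand in for object IDs. Each per-pack binary search counts as
one call, and a search that finds the object counts as a hit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pack-index: sorted 32-bit "object ids". */
struct toy_pack {
	const uint32_t *ids; /* sorted ascending */
	size_t nr;
};

struct probe_stats {
	unsigned long calls; /* binary searches performed */
	unsigned long hits;  /* searches that found the object */
};

/* Binary-search one pack, recording the call and whether it hit. */
static int toy_bsearch_pack(const struct toy_pack *p, uint32_t id,
			    struct probe_stats *st)
{
	size_t lo = 0, hi;

	assert(p && st);
	hi = p->nr;
	st->calls++;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (p->ids[mid] == id) {
			st->hits++;
			return 1;
		}
		if (p->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/*
 * Probe packs front to back. On a hit, move that pack to the front
 * so it is tried first next time (the MRU behavior). Returns 1 if
 * the object was found in some pack, 0 otherwise.
 */
static int toy_find_object(const struct toy_pack **packs, size_t nr_packs,
			   uint32_t id, struct probe_stats *st)
{
	size_t i, j;

	for (i = 0; i < nr_packs; i++) {
		if (toy_bsearch_pack(packs[i], id, st)) {
			const struct toy_pack *found = packs[i];

			for (j = i; j > 0; j--)
				packs[j] = packs[j - 1];
			packs[0] = found;
			return 1;
		}
	}
	return 0;
}
```

With a single combined index there is exactly one search per
lookup, so every call for an object that exists is a hit.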



Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate
through all packfiles to ensure no collision exists in
any packfile. This requires O(P log N) time, where P is
the number of packfiles and N is the number of objects.
With the MIDX, this is only O(log N) time. Our standard
test [2] is 'git log --oneline --parents --raw' because
it writes many abbreviations while also doing a lot of
other work (walking commits and trees to compute the raw
diff).

For a copy of the Linux repository with 50 packfiles
split by time, we observed the following:

        Before: 100.5 s
        After:   58.2 s
        % Diff: -42.1%
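
To illustrate why one sorted list makes disambiguation cheap,
here is a minimal sketch (not code from this series). It assumes
object IDs are distinct, equal-length hex strings sorted in
strcmp() order, and it ignores any configured minimum
abbreviation length. `common_prefix_len` and
`min_unique_abbrev` are hypothetical helper names. Once an
object's position in the list is known (an O(log N) binary
search), only its two neighbors can share its longest prefix.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading characters shared by two NUL-terminated strings. */
static size_t common_prefix_len(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/*
 * Shortest prefix length of sorted[i] that no other entry shares.
 * Because the list is sorted, the entries sharing the longest
 * prefix with sorted[i] are its immediate neighbors.
 */
static size_t min_unique_abbrev(const char **sorted, size_t nr, size_t i)
{
	size_t need = 1;

	assert(sorted && i < nr);
	if (i > 0) {
		size_t c = common_prefix_len(sorted[i], sorted[i - 1]) + 1;

		if (c > need)
			need = c;
	}
	if (i + 1 < nr) {
		size_t c = common_prefix_len(sorted[i], sorted[i + 1]) + 1;

		if (c > need)
			need = c;
	}
	return need;
}
```

Without a combined list, the same neighbor check has to be
repeated in each of the P pack-indexes.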


Request for Review Attention
----------------------------

I tried my best to take the feedback from the commit-graph
feature and apply it to this feature. I also worked to
follow the object-store refactoring as best I could. I also have
some local commits that create a 'verify' subcommand and
integrate with 'fsck' similar to the commit-graph, but I'll
leave those for a later series (and review is still underway
for that part of the commit-graph).

One place where I could use some guidance is related to the
current state of 'the_hash_algo' patches. The file format
allows a different "hash version" which then indicates the
length of the hash. What's the best way to ensure this
feature doesn't cause extra pain in the hash-agnostic series?
This will inform how I go back and make the commit-graph
feature better in this area, too.


Thanks,
-Stolee

[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/T/#u
    Previous MIDX RFC.

[2] https://public-inbox.org/git/20171012120220.226427-1-dstolee@microsoft.com/
    A patch series on abbreviation speedups


Derrick Stolee (23):
  midx: add design document
  midx: add midx format details to pack-format.txt
  midx: add midx builtin
  midx: add 'write' subcommand and basic wiring
  midx: write header information to lockfile
  midx: struct midxed_git and 'read' subcommand
  midx: expand test data
  midx: read packfiles from pack directory
  midx: write pack names in chunk
  midx: write a lookup into the pack names chunk
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  midx: create core.midx config setting
  midx: prepare midxed_git struct
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  midx: use midx to find ref-deltas
  midx: clear midx on repack

 .gitignore                              |   1 +
 Documentation/config.txt                |   4 +
 Documentation/git-midx.txt              |  60 ++
 Documentation/technical/midx.txt        | 109 +++
 Documentation/technical/pack-format.txt |  82 +++
 Makefile                                |   2 +
 builtin.h                               |   1 +
 builtin/midx.c                          |  88 +++
 builtin/repack.c                        |   8 +
 cache.h                                 |   1 +
 command-list.txt                        |   1 +
 config.c                                |   5 +
 environment.c                           |   1 +
 git.c                                   |   1 +
 midx.c                                  | 923 ++++++++++++++++++++++++
 midx.h                                  |  23 +
 object-store.h                          |  35 +
 packfile.c                              |  47 +-
 packfile.h                              |   1 +
 sha1-name.c                             |  70 ++
 t/t5319-midx.sh                         | 192 +++++
 21 files changed, 1652 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 Documentation/technical/midx.txt
 create mode 100644 builtin/midx.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

-- 
2.18.0.rc1



* [PATCH 01/23] midx: add design document
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-11 19:04   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/midx.txt

diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
new file mode 100644
index 0000000000..789f410d71
--- /dev/null
+++ b/Documentation/technical/midx.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.midx config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+  contents of the multi-pack-index file match the offsets listed in
+  the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+    An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
-- 
2.18.0.rc1
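
The lists described in the design document above (sorted object
IDs, each with a pack number j and an offset) can be pictured
with a small C sketch. This is an illustration only, not code
from the series: the `toy_midx` and `toy_midx_entry` names and
the array-of-structs layout are assumptions made for brevity,
and large offsets are simply held in a 64-bit field. One binary
search answers "which pack, and where" regardless of how many
packfiles exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_OID_LEN 20 /* SHA-1 length in bytes */

/* Metadata for the ith object id (hypothetical layout). */
struct toy_midx_entry {
	unsigned char oid[TOY_OID_LEN];
	uint32_t pack_id; /* j: index into pack_names */
	uint64_t offset;  /* offset of the object within pack j */
};

struct toy_midx {
	const char **pack_names;
	uint32_t nr_packs;
	const struct toy_midx_entry *entries; /* sorted by oid */
	size_t nr_entries;
};

/* Binary search over the sorted entries; NULL if oid is absent. */
static const struct toy_midx_entry *
toy_midx_lookup(const struct toy_midx *m, const unsigned char *oid)
{
	size_t lo = 0, hi = m->nr_entries;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(m->entries[mid].oid, oid, TOY_OID_LEN);

		if (!cmp) {
			/* every entry must refer to a listed packfile */
			assert(m->entries[mid].pack_id < m->nr_packs);
			return &m->entries[mid];
		}
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}
```

A caller would use the returned pack_id to pick the packfile
name and then read the object at the given offset.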



* [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-11 19:19   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The multi-pack-index (MIDX) feature generalizes the existing pack-
index (IDX) feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to create breaking changes by
advancing the version number. We can also change the hash algorithm
using a different version value.

We will add the individual chunk format information as we introduce
the code that writes that information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..17666b4bfc 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The meta-index files refer to multiple pack-files and loose objects.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	1-byte version number:
+	    Git only writes or recognizes version 1
+
+	1-byte Object Id Version
+	    Git only writes or recognizes version 1 (SHA-1)
+
+	1-byte number (C) of "chunks"
+
+	1-byte number (I) of base multi-pack-index files:
+	    This value is currently always zero.
+
+	4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	(This section intentionally left incomplete.)
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
-- 
2.18.0.rc1
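
As a reading aid for the format described above, the following
sketch decodes the 12-byte header and one row of the chunk
lookup table from a memory buffer. It is not the parser from
this series. The `toy_` names are hypothetical, and the sketch
assumes the 8-byte chunk offsets are stored in network order
like the 4-byte fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define TOY_MIDX_HEADER_SIZE 12
#define TOY_CHUNK_ROW_SIZE 12 /* 4-byte id + 8-byte offset */

struct toy_midx_header {
	uint32_t signature;
	uint8_t version;
	uint8_t hash_version;
	uint8_t num_chunks; /* C */
	uint8_t num_base;   /* I, currently always zero */
	uint32_t num_packs; /* P */
};

static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t toy_get_be64(const unsigned char *p)
{
	return ((uint64_t)toy_get_be32(p) << 32) | toy_get_be32(p + 4);
}

/* Returns 0 on success, -1 on short input, bad signature or version. */
static int toy_parse_header(const unsigned char *buf, size_t len,
			    struct toy_midx_header *h)
{
	if (len < TOY_MIDX_HEADER_SIZE)
		return -1;
	h->signature = toy_get_be32(buf);
	h->version = buf[4];
	h->hash_version = buf[5];
	h->num_chunks = buf[6];
	h->num_base = buf[7];
	h->num_packs = toy_get_be32(buf + 8);
	if (h->signature != TOY_MIDX_SIGNATURE || h->version != 1)
		return -1;
	return 0;
}

/*
 * Read row i of the chunk lookup table, which starts right after
 * the header. The caller must ensure buf holds the header plus
 * (num_chunks + 1) rows; the last row has id 0 as a terminator.
 */
static void toy_chunk_row(const unsigned char *buf, unsigned int i,
			  uint32_t *id, uint64_t *offset)
{
	const unsigned char *row;

	assert(buf && id && offset);
	row = buf + TOY_MIDX_HEADER_SIZE + (size_t)i * TOY_CHUNK_ROW_SIZE;
	*id = toy_get_be32(row);
	*offset = toy_get_be64(row + 4);
}
```

Since chunks are listed in file order, the length of chunk i can
be inferred as the offset in row i + 1 minus the offset in row i.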



* [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:20   ` Duy Nguyen
  2018-06-11 21:02   ` Stefan Beller
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

This new 'git midx' builtin will be the plumbing access for writing,
reading, and checking multi-pack-index (MIDX) files. The initial
implementation is a no-op.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                 |  1 +
 Documentation/git-midx.txt | 29 +++++++++++++++++++++++++++++
 Makefile                   |  1 +
 builtin.h                  |  1 +
 builtin/midx.c             | 38 ++++++++++++++++++++++++++++++++++++++
 command-list.txt           |  1 +
 git.c                      |  1 +
 7 files changed, 72 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..e309644d6b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -97,6 +97,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 0000000000..2bd886f1a2
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,29 @@
+git-midx(1)
+============
+
+NAME
+----
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+
+SYNOPSIS
+--------
+[verse]
+'git midx' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a MIDX file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+	Use given directory for the location of Git objects. We check
+	<dir>/packs/multi-pack-index for the current MIDX file, and
+	<dir>/packs for the pack-files to index.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 1d27f36365..88958c7b42 100644
--- a/Makefile
+++ b/Makefile
@@ -1045,6 +1045,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..7b5bd46c7d 100644
--- a/builtin.h
+++ b/builtin.h
@@ -189,6 +189,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 0000000000..59ea92178f
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "parse-options.h"
+
+static char const * const builtin_midx_usage[] ={
+	N_("git midx [--object-dir <dir>]"),
+	NULL
+};
+
+static struct opts_midx {
+	const char *object_dir;
+} opts;
+
+int cmd_midx(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_midx_options[] = {
+		{ OPTION_STRING, 0, "object-dir", &opts.object_dir,
+		  N_("dir"),
+		  N_("The object directory containing set of packfile and pack-index pairs.") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_midx_usage, builtin_midx_options);
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_midx_options,
+			     builtin_midx_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..a21bd7470e 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators           complete
 git-merge-tree                          ancillaryinterrogators
+git-midx                                plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c2f48d53dd..400fadd677 100644
--- a/git.c
+++ b/git.c
@@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
+	{ "midx", cmd_midx, RUN_SETUP },
 	{ "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
 	{ "mktree", cmd_mktree, RUN_SETUP },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
2.18.0.rc1



* [PATCH 04/23] midx: add 'write' subcommand and basic wiring
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:27   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

In anticipation of writing multi-pack-indexes (MIDX files), add a
'git midx write' subcommand and send the options to a write_midx_file()
method. Also create a basic test file that tests the 'write' subcommand.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 22 +++++++++++++++++++++-
 Makefile                   |  1 +
 builtin/midx.c             |  9 ++++++++-
 midx.c                     |  9 +++++++++
 midx.h                     |  4 ++++
 t/t5319-midx.sh            | 10 ++++++++++
 6 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 2bd886f1a2..dcaeb1a91b 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' [--object-dir <dir>]
+'git midx' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	<dir>/packs/multi-pack-index.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-------------------------------------------
+$ git midx write
+-------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate.
++
+-------------------------------------------
+$ git midx --object-dir <alt> write
+-------------------------------------------
+
 
 GIT
 ---
diff --git a/Makefile b/Makefile
index 88958c7b42..aa86fcd8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/midx.c b/builtin/midx.c
index 59ea92178f..dc0a5acd3f 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -3,9 +3,10 @@
 #include "config.h"
 #include "git-compat-util.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-	N_("git midx [--object-dir <dir>]"),
+	N_("git midx [--object-dir <dir>] [write]"),
 	NULL
 };
 
@@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
+	if (argc == 0)
+		return 0;
+
+	if (!strcmp(argv[0], "write"))
+		return write_midx_file(opts.object_dir);
+
 	return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..616af66b13
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,9 @@
+#include "git-compat-util.h"
+#include "cache.h"
+#include "dir.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..3a63673952
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,4 @@
+#include "cache.h"
+#include "packfile.h"
+
+int write_midx_file(const char *object_dir);
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
new file mode 100755
index 0000000000..a590137af7
--- /dev/null
+++ b/t/t5319-midx.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no pakcs' '
+	git midx --object-dir=. write
+'
+
+test_done
-- 
2.18.0.rc1



* [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:35   ` Duy Nguyen
  2018-06-12 15:00   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
                   ` (19 subsequent siblings)
  24 siblings, 2 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. This
lets us add the rest of the format in small increments.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but the return value is
not checked. We will check it in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c          | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 t/t5319-midx.sh |  5 +++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 616af66b13..3e55422a21 100644
--- a/midx.c
+++ b/midx.c
@@ -1,9 +1,62 @@
 #include "git-compat-util.h"
 #include "cache.h"
 #include "dir.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1 /* SHA-1 */
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+	struct strbuf midx_name = STRBUF_INIT;
+	strbuf_addstr(&midx_name, object_dir);
+	strbuf_addstr(&midx_name, "/pack/multi-pack-index");
+	return strbuf_detach(&midx_name, NULL);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+				unsigned char num_chunks,
+				uint32_t num_packs)
+{
+	char byte_values[4];
+	hashwrite_be32(f, MIDX_SIGNATURE);
+	byte_values[0] = MIDX_VERSION;
+	byte_values[1] = MIDX_HASH_VERSION;
+	byte_values[2] = num_chunks;
+	byte_values[3] = 0; /* unused */
+	hashwrite(f, byte_values, sizeof(byte_values));
+	hashwrite_be32(f, num_packs);
+
+	return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+	unsigned char num_chunks = 0;
+	uint32_t num_packs = 0;
+	char *midx_name;
+	struct hashfile *f;
+	struct lock_file lk;
+
+	midx_name = get_midx_filename(object_dir);
+	if (safe_create_leading_directories(midx_name)) {
+		UNLEAK(midx_name);
+		die_errno(_("unable to create leading directories of %s"),
+			  midx_name);
+	}
+
+	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	FREE_AND_NULL(midx_name);
+
+	write_midx_header(f, num_chunks, num_packs);
+
+	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	commit_lock_file(&lk);
+
 	return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index a590137af7..80f9389837 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,8 +3,9 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
-test_expect_success 'write midx with no pakcs' '
-	git midx --object-dir=. write
+test_expect_success 'write midx with no packs' '
+	git midx --object-dir=. write &&
+	test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.rc1
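
The byte-counting technique from the commit message above can be
sketched outside of Git as follows. This is a simplified
illustration, not the patch's code: it writes into a plain
buffer instead of a hashfile, and `toy_put_be32` and
`toy_write_header` are hypothetical names. Each writer returns
how many bytes it produced, and the caller compares the running
total against the expected offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define TOY_MIDX_VERSION 1
#define TOY_MIDX_HASH_VERSION 1 /* SHA-1 */
#define TOY_MIDX_HEADER_SIZE 12

/* Store v in network byte order; returns the number of bytes written. */
static size_t toy_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return 4;
}

/*
 * Write the 12-byte header into out (which must have room for it)
 * and return the number of bytes written, so the caller can track
 * its position in the file.
 */
static size_t toy_write_header(unsigned char *out, uint8_t num_chunks,
			       uint32_t num_packs)
{
	size_t written = 0;

	written += toy_put_be32(out + written, TOY_MIDX_SIGNATURE);
	out[written++] = TOY_MIDX_VERSION;
	out[written++] = TOY_MIDX_HASH_VERSION;
	out[written++] = num_chunks;
	out[written++] = 0; /* number of base MIDX files, always zero */
	written += toy_put_be32(out + written, num_packs);

	/* the computed size must match what was actually written */
	assert(written == TOY_MIDX_HEADER_SIZE);
	return written;
}
```

Later writers can follow the same pattern, adding their return
values to a running offset that is checked against the chunk
lookup table.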



* [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 17:54   ` Duy Nguyen
  2018-06-07 18:31   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
                   ` (18 subsequent siblings)
  24 siblings, 2 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we build the multi-pack-index feature one chunk at a time,
we want to test that the data is being written correctly.

Create struct midxed_git to store an in-memory representation of a
multi-pack-index and a memory-map of the binary file. Initialize this
struct in load_midxed_git(object_dir).

Create the 'git midx read' subcommand to output basic information about
the multi-pack-index file. This will be expanded as more information is
written to the file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 11 +++++++
 builtin/midx.c             | 23 +++++++++++++-
 midx.c                     | 65 ++++++++++++++++++++++++++++++++++++++
 midx.h                     |  9 ++++++
 object-store.h             | 19 +++++++++++
 t/t5319-midx.sh            | 12 ++++++-
 6 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index dcaeb1a91b..919283fdd8 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -23,6 +23,11 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+read::
+	When given as the verb, read the current MIDX file and output
+	basic information about its contents. Used for debugging
+	purposes only.
+
 write::
 	When given as the verb, write a new MIDX file to
 	<dir>/packs/multi-pack-index.
@@ -43,6 +48,12 @@ $ git midx write
 $ git midx --object-dir <alt> write
 -------------------------------------------
 
+* Read the MIDX file in the .git/objects folder.
++
+-------------------------------------------
+$ git midx read
+-------------------------------------------
+
 
 GIT
 ---
diff --git a/builtin/midx.c b/builtin/midx.c
index dc0a5acd3f..c7002f664a 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -6,7 +6,7 @@
 #include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-	N_("git midx [--object-dir <dir>] [write]"),
+	N_("git midx [--object-dir <dir>] [read|write]"),
 	NULL
 };
 
@@ -14,6 +14,25 @@ static struct opts_midx {
 	const char *object_dir;
 } opts;
 
+static int read_midx_file(const char *object_dir)
+{
+	struct midxed_git *m = load_midxed_git(object_dir);
+
+	if (!m)
+		return 0;
+
+	printf("header: %08x %d %d %d %d\n",
+	       m->signature,
+	       m->version,
+	       m->hash_version,
+	       m->num_chunks,
+	       m->num_packs);
+
+	printf("object_dir: %s\n", m->object_dir);
+
+	return 0;
+}
+
 int cmd_midx(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_midx_options[] = {
@@ -38,6 +57,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	if (argc == 0)
 		return 0;
 
+	if (!strcmp(argv[0], "read"))
+		return read_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 
diff --git a/midx.c b/midx.c
index 3e55422a21..fa18770f1d 100644
--- a/midx.c
+++ b/midx.c
@@ -3,12 +3,15 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
 #define MIDX_HASH_VERSION 1 /* SHA-1 */
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -18,6 +21,68 @@ static char *get_midx_filename(const char *object_dir)
 	return strbuf_detach(&midx_name, NULL);
 }
 
+struct midxed_git *load_midxed_git(const char *object_dir)
+{
+	struct midxed_git *m;
+	int fd;
+	struct stat st;
+	size_t midx_size;
+	void *midx_map;
+	const char *midx_name = get_midx_filename(object_dir);
+
+	fd = git_open(midx_name);
+	if (fd < 0)
+		return NULL;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return NULL;
+	}
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < MIDX_MIN_SIZE) {
+		close(fd);
+		die("multi-pack-index file %s is too small", midx_name);
+	}
+
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
+	strcpy(m->object_dir, object_dir);
+	m->data = midx_map;
+
+	m->signature = get_be32(m->data);
+	if (m->signature != MIDX_SIGNATURE) {
+		error("multi-pack-index signature %X does not match signature %X",
+		      m->signature, MIDX_SIGNATURE);
+		goto cleanup_fail;
+	}
+
+	m->version = *(m->data + 4);
+	if (m->version != MIDX_VERSION) {
+		error("multi-pack-index version %d not recognized",
+		      m->version);
+		goto cleanup_fail;
+	}
+
+	m->hash_version = *(m->data + 5);
+	if (m->hash_version != MIDX_HASH_VERSION) {
+		error("hash version %d not recognized", m->hash_version);
+		goto cleanup_fail;
+	}
+	m->hash_len = MIDX_HASH_LEN;
+
+	m->num_chunks = *(m->data + 6);
+	m->num_packs = get_be32(m->data + 8);
+
+	return m;
+
+cleanup_fail:
+	FREE_AND_NULL(m);
+	munmap(midx_map, midx_size);
+	close(fd);
+	exit(1);
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index 3a63673952..a1d18ed991 100644
--- a/midx.h
+++ b/midx.h
@@ -1,4 +1,13 @@
+#ifndef MIDX_H
+#define MIDX_H
+
+#include "git-compat-util.h"
 #include "cache.h"
+#include "object-store.h"
 #include "packfile.h"
 
+struct midxed_git *load_midxed_git(const char *object_dir);
+
 int write_midx_file(const char *object_dir);
+
+#endif
diff --git a/object-store.h b/object-store.h
index d683112fd7..77cb82621a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -84,6 +84,25 @@ struct packed_git {
 	char pack_name[FLEX_ARRAY]; /* more */
 };
 
+struct midxed_git {
+	struct midxed_git *next;
+
+	int fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	uint32_t signature;
+	unsigned char version;
+	unsigned char hash_version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+	uint32_t num_objects;
+
+	char object_dir[FLEX_ARRAY];
+};
+
 struct raw_object_store {
 	/*
 	 * Path to the repository's object store.
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 80f9389837..e78514d8e9 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,9 +3,19 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+midx_read_expect() {
+	cat >expect <<- EOF
+	header: 4d494458 1 1 0 0
+	object_dir: .
+	EOF
+	git midx read --object-dir=. >actual &&
+	test_cmp expect actual
+}
+
 test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
-	test_path_is_file pack/multi-pack-index
+	test_path_is_file pack/multi-pack-index &&
+	midx_read_expect
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 07/23] midx: expand test data
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests to t5319-midx.sh that create repository
data including multiple packfiles with both version 1 and version 2
formats.

The current 'git midx write' command will always write the same file
with no "real" data. This will be expanded in future commits, along with
the test expectations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-midx.sh | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index e78514d8e9..2c25a69744 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -14,8 +14,109 @@ midx_read_expect() {
 
 test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
+	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
 	midx_read_expect
 '
 
+test_expect_success 'create objects' '
+	for i in `test_seq 1 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree </dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with one v1 pack' '
+	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'write midx with one v2 pack' '
+	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
+	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more objects' '
+	for i in `test_seq 6 5`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		i=$(expr $i + 1) || return 1
+	done &&
+	{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+	git update-index --add file_101 &&
+	tree=$(git write-tree) &&
+	commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+	echo $tree &&
+	git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+	} >obj-list2 &&
+	git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with two packs' '
+	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
+	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
+test_expect_success 'Add more packs' '
+	for j in `test_seq 1 10`
+	do
+		iii=$(printf '%03i' $i)
+		test-tool genrandom "bar" 200 > wide_delta_$iii &&
+		test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+		test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+		test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+		echo $iii >file_$iii &&
+		test-tool genrandom "$iii" 8192 >>file_$iii &&
+		git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
+		{ echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+		git update-index --add file_101 &&
+		tree=$(git write-tree) &&
+		commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+		echo $tree &&
+		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+		} >obj-list &&
+		git update-ref HEAD $commit &&
+		git pack-objects --index-version=2 test-pack <obj-list &&
+		i=$(expr $i + 1) || return 1 &&
+		j=$(expr $j + 1) || return 1
+	done
+'
+
+test_expect_success 'write midx with twelve packs' '
+	git midx --object-dir=. write &&
+	midx_read_expect
+'
+
 test_done
-- 
2.18.0.rc1



* [PATCH 08/23] midx: read packfiles from pack directory
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 18:03   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

When constructing a multi-pack-index file for a given object directory,
scan the enclosed pack directory for files whose names end with ".idx"
and load the packfile paired with each one using add_packed_git().
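
The name filter in that directory scan can be sketched in isolation as
follows. This is only an illustration: sketch_ends_with() is a
hypothetical stand-in for Git's ends_with(), and
sketch_count_idx_names() works on an in-memory list rather than on
readdir() results.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return non-zero if 'str' ends with 'suffix'. */
static int sketch_ends_with(const char *str, const char *suffix)
{
	size_t len = strlen(str), suflen = strlen(suffix);

	return len >= suflen && !strcmp(str + len - suflen, suffix);
}

/* Count the directory entries that look like pack index files. */
static size_t sketch_count_idx_names(const char **names, size_t nr)
{
	size_t i, count = 0;

	assert(names || !nr);
	for (i = 0; i < nr; i++) {
		/* Skip "." and "..", as the real loop does. */
		if (!strcmp(names[i], ".") || !strcmp(names[i], ".."))
			continue;
		if (sketch_ends_with(names[i], ".idx"))
			count++;
	}
	return count;
}
```

In the patch itself a matching name is only counted once
add_packed_git() succeeds, so the real pack count can be lower than
this filter alone would suggest.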

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c          | 51 +++++++++++++++++++++++++++++++++++++++++++++++--
 t/t5319-midx.sh | 15 ++++++++-------
 2 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index fa18770f1d..9fb89c80a2 100644
--- a/midx.c
+++ b/midx.c
@@ -102,10 +102,15 @@ static size_t write_midx_header(struct hashfile *f,
 int write_midx_file(const char *object_dir)
 {
 	unsigned char num_chunks = 0;
-	uint32_t num_packs = 0;
 	char *midx_name;
 	struct hashfile *f;
 	struct lock_file lk;
+	struct packed_git **packs = NULL;
+	uint32_t i, nr_packs = 0, alloc_packs = 0;
+	DIR *dir;
+	struct dirent *de;
+	struct strbuf pack_dir = STRBUF_INIT;
+	size_t pack_dir_len;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	strbuf_addf(&pack_dir, "%s/pack", object_dir);
+	dir = opendir(pack_dir.buf);
+
+	if (!dir) {
+		error_errno("unable to open pack directory: %s",
+			    pack_dir.buf);
+		strbuf_release(&pack_dir);
+		return 1;
+	}
+
+	strbuf_addch(&pack_dir, '/');
+	pack_dir_len = pack_dir.len;
+	ALLOC_ARRAY(packs, alloc_packs);
+	while ((de = readdir(dir)) != NULL) {
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		if (ends_with(de->d_name, ".idx")) {
+			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+
+			strbuf_setlen(&pack_dir, pack_dir_len);
+			strbuf_addstr(&pack_dir, de->d_name);
+
+			packs[nr_packs] = add_packed_git(pack_dir.buf,
+							 pack_dir.len,
+							 0);
+			if (!packs[nr_packs])
+				warning("failed to add packfile '%s'",
+					pack_dir.buf);
+			else
+				nr_packs++;
+		}
+	}
+	closedir(dir);
+	strbuf_release(&pack_dir);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, num_packs);
+	write_midx_header(f, num_chunks, nr_packs);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
+	for (i = 0; i < nr_packs; i++) {
+		close_pack(packs[i]);
+		FREE_AND_NULL(packs[i]);
+	}
+
+	FREE_AND_NULL(packs);
 	return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 2c25a69744..abe545c7c4 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect() {
+	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 0 0
+	header: 4d494458 1 1 0 $NUM_PACKS
 	object_dir: .
 	EOF
 	git midx read --object-dir=. >actual &&
@@ -16,7 +17,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect
+	midx_read_expect 0
 '
 
 test_expect_success 'create objects' '
@@ -47,14 +48,14 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 1
 '
 
 test_expect_success 'Add more objects' '
@@ -85,7 +86,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 2
 '
 
 test_expect_success 'Add more packs' '
@@ -108,7 +109,7 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 test-pack <obj-list &&
+		git pack-objects --index-version=2 pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
@@ -116,7 +117,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect
+	midx_read_expect 12
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 09/23] midx: write pack names in chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 18:26   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The multi-pack-index (MIDX) needs to track which pack-files are covered
by the MIDX file. Store these in our first required chunk. Since
filenames are not well structured, add padding to keep good alignment in
later chunks.
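
As a rough illustration of the padding rule, the following standalone
sketch computes the padded length of the pack-name chunk, assuming the
4-byte alignment used in this patch. The sketch_* names are
hypothetical and not part of Git.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CHUNK_ALIGNMENT 4

/* Number of zero bytes needed to round 'len' up to the alignment. */
static size_t sketch_padding_for(size_t len)
{
	size_t rem = len % SKETCH_CHUNK_ALIGNMENT;

	return rem ? SKETCH_CHUNK_ALIGNMENT - rem : 0;
}

/*
 * Length of the pack-name chunk: every name followed by its
 * terminating NUL, then padding up to the next aligned offset.
 */
static size_t sketch_pack_names_chunk_len(const char **names, size_t nr)
{
	size_t i, len = 0;

	assert(names || !nr);
	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	return len + sketch_padding_for(len);
}
```

For example, the names "a.idx" and "bb.idx" occupy 6 + 7 = 13 bytes
including their terminators, so three bytes of padding bring the chunk
to 16 bytes.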

Modify the 'git midx read' subcommand to output the existence of the
pack-file name chunk. Modify t5319-midx.sh to reflect this new output
and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks during the
write of a MIDX file that double-check that the values we are writing
fit the format definitions. If any value is incorrect, then we notice
before writing invalid data. This mainly helps developers while working
on the feature, but it can also identify issues that only appear when
dealing with very large data sets. These large sets are hard to encode
into test cases.
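
One such write-time check, that chunk offsets never decrease and are
all aligned, can be sketched on its own as follows. The function name
and the return-code convention are assumptions made for illustration;
the patch calls BUG() instead of returning an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CHUNK_ALIGNMENT 4

/*
 * Return 0 if the offsets are non-decreasing and every offset is a
 * multiple of the chunk alignment, -1 otherwise.
 */
static int sketch_check_chunk_offsets(const uint64_t *offsets, size_t nr)
{
	size_t i;

	assert(offsets || !nr);
	for (i = 0; i < nr; i++) {
		if (i && offsets[i] < offsets[i - 1])
			return -1;
		if (offsets[i] % SKETCH_CHUNK_ALIGNMENT)
			return -1;
	}
	return 0;
}
```

With a single chunk, the first offset is the 12-byte header plus two
12-byte lookup entries (the chunk and the terminator), that is 36,
which satisfies the alignment check.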

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |   6 +
 builtin/midx.c                          |   7 +
 midx.c                                  | 176 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/t5319-midx.sh                         |   3 +-
 5 files changed, 188 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 17666b4bfc..2b37be7b33 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index c7002f664a..fe56560853 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -28,6 +28,13 @@ static int read_midx_file(const char *object_dir)
 	       m->num_chunks,
 	       m->num_packs);
 
+	printf("chunks:");
+
+	if (m->chunk_pack_names)
+		printf(" pack_names");
+
+	printf("\n");
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/midx.c b/midx.c
index 9fb89c80a2..d4f4a01a51 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	struct strbuf midx_name = STRBUF_INIT;
@@ -29,6 +34,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	size_t midx_size;
 	void *midx_map;
 	const char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 	if (fd < 0)
@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	m->num_chunks = *(m->data + 6);
 	m->num_packs = get_be32(m->data + 8);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKNAMES:
+				m->chunk_pack_names = m->data + chunk_offset;
+				break;
+
+			case 0:
+				die("terminating MIDX chunk id appears earlier than expected");
+				break;
+
+			default:
+				/*
+				 * Do nothing on unrecognized chunks, allowing future
+				 * extensions to add optional chunks.
+				 */
+				break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die("MIDX missing required pack-name chunk");
+
 	return m;
 
 cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+}
+
+static size_t write_midx_pack_names(struct hashfile *f,
+				    char **pack_names,
+				    uint32_t num_packs)
+{
+	uint32_t i;
+	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
+	size_t written = 0;
+
+	for (i = 0; i < num_packs; i++) {
+		size_t writelen = strlen(pack_names[i]) + 1;
+
+		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+			BUG("incorrect pack-file order: %s before %s",
+			    pack_names[i - 1],
+			    pack_names[i]);
+
+		hashwrite(f, pack_names[i], writelen);
+		written += writelen;
+	}
+
+	/* add padding to be aligned */
+	i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
+	if (i < MIDX_CHUNK_ALIGNMENT) {
+		memset(padding, 0, sizeof(padding));
+		hashwrite(f, padding, i);
+		written += i;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
-	unsigned char num_chunks = 0;
+	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	struct hashfile *f;
 	struct lock_file lk;
 	struct packed_git **packs = NULL;
+	char **pack_names = NULL;
+	uint32_t *pack_perm;
 	uint32_t i, nr_packs = 0, alloc_packs = 0;
+	uint32_t alloc_pack_names = 0;
 	DIR *dir;
 	struct dirent *de;
 	struct strbuf pack_dir = STRBUF_INIT;
 	size_t pack_dir_len;
+	uint64_t pack_name_concat_len = 0;
+	uint64_t written = 0;
+	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
+	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -132,12 +233,14 @@ int write_midx_file(const char *object_dir)
 	strbuf_addch(&pack_dir, '/');
 	pack_dir_len = pack_dir.len;
 	ALLOC_ARRAY(packs, alloc_packs);
+	ALLOC_ARRAY(pack_names, alloc_pack_names);
 	while ((de = readdir(dir)) != NULL) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		if (ends_with(de->d_name, ".idx")) {
 			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
 
 			strbuf_setlen(&pack_dir, pack_dir_len);
 			strbuf_addstr(&pack_dir, de->d_name);
@@ -145,21 +248,83 @@ int write_midx_file(const char *object_dir)
 			packs[nr_packs] = add_packed_git(pack_dir.buf,
 							 pack_dir.len,
 							 0);
-			if (!packs[nr_packs])
+			if (!packs[nr_packs]) {
 				warning("failed to add packfile '%s'",
 					pack_dir.buf);
-			else
-				nr_packs++;
+				continue;
+			}
+
+			pack_names[nr_packs] = xstrdup(de->d_name);
+			pack_name_concat_len += strlen(de->d_name) + 1;
+			nr_packs++;
 		}
 	}
+
 	closedir(dir);
 	strbuf_release(&pack_dir);
 
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
+	ALLOC_ARRAY(pack_perm, nr_packs);
+	sort_packs_by_name(pack_names, nr_packs, pack_perm);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	write_midx_header(f, num_chunks, nr_packs);
+	cur_chunk = 0;
+	num_chunks = 1;
+
+	written = write_midx_header(f, num_chunks, nr_packs);
+
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
+
+	for (i = 0; i <= num_chunks; i++) {
+		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
+			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
+			    chunk_offsets[i - 1],
+			    chunk_offsets[i]);
+
+		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
+			BUG("chunk offset %"PRIu64" is not properly aligned",
+			    chunk_offsets[i]);
+
+		hashwrite_be32(f, chunk_ids[i]);
+		hashwrite_be32(f, chunk_offsets[i] >> 32);
+		hashwrite_be32(f, chunk_offsets[i]);
+
+		written += MIDX_CHUNKLOOKUP_WIDTH;
+	}
+
+	for (i = 0; i < num_chunks; i++) {
+		if (written != chunk_offsets[i])
+			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
+			    chunk_offsets[i],
+			    written,
+			    chunk_ids[i]);
+
+		switch (chunk_ids[i]) {
+			case MIDX_CHUNKID_PACKNAMES:
+				written += write_midx_pack_names(f, pack_names, nr_packs);
+				break;
+
+			default:
+				BUG("trying to write unknown chunk id %"PRIx32,
+				    chunk_ids[i]);
+		}
+	}
+
+	if (written != chunk_offsets[num_chunks])
+		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
+		    written,
+		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
@@ -170,5 +335,6 @@ int write_midx_file(const char *object_dir)
 	}
 
 	FREE_AND_NULL(packs);
+	FREE_AND_NULL(pack_names);
 	return 0;
 }
diff --git a/object-store.h b/object-store.h
index 77cb82621a..199cf4bd44 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,6 +100,8 @@ struct midxed_git {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const unsigned char *chunk_pack_names;
+
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index abe545c7c4..fdf4f84a90 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,7 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 0 $NUM_PACKS
+	header: 4d494458 1 1 1 $NUM_PACKS
+	chunks: pack_names
 	object_dir: .
 	EOF
 	git midx read --object-dir=. >actual &&
-- 
2.18.0.rc1



* [PATCH 10/23] midx: write a lookup into the pack names chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 16:43   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c                          |  7 ++++
 midx.c                                  | 56 +++++++++++++++++++++++--
 object-store.h                          |  2 +
 t/t5319-midx.sh                         | 11 +++--
 5 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 2b37be7b33..29bf87283a 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,11 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+	    P * 4 bytes storing the offset in the packfile name chunk for
+	    the null-terminated string containing the filename for the
+	    ith packfile.
+
 	Packfile Names (ID: {'P', 'N', 'A', 'M'})
 	    Stores the packfile names as concatenated, null-terminated strings.
 	    Packfiles must be listed in lexicographic order for fast lookups by
diff --git a/builtin/midx.c b/builtin/midx.c
index fe56560853..3a261e9bbf 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,6 +16,7 @@ static struct opts_midx {
 
 static int read_midx_file(const char *object_dir)
 {
+	uint32_t i;
 	struct midxed_git *m = load_midxed_git(object_dir);
 
 	if (!m)
@@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
 
 	printf("chunks:");
 
+	if (m->chunk_pack_lookup)
+		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
 
 	printf("\n");
 
+	printf("packs:\n");
+	for (i = 0; i < m->num_packs; i++)
+		printf("%s\n", m->pack_names[i]);
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/midx.c b/midx.c
index d4f4a01a51..923acda72e 100644
--- a/midx.c
+++ b/midx.c
@@ -13,8 +13,9 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 1
+#define MIDX_MAX_CHUNKS 2
 #define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
@@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
 
 		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKLOOKUP:
+				m->chunk_pack_lookup = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_PACKNAMES:
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
@@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		}
 	}
 
+	if (!m->chunk_pack_lookup)
+		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
 
+	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+	for (i = 0; i < m->num_packs; i++) {
+		if (i) {
+			if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
+				error("MIDX pack lookup value %d before %d",
+				      ntohl(m->chunk_pack_lookup[i - 1]),
+				      ntohl(m->chunk_pack_lookup[i]));
+				goto cleanup_fail;
+			}
+		}
+
+		m->pack_names[i] = (const char *)(m->chunk_pack_names + ntohl(m->chunk_pack_lookup[i]));
+
+		if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
+			error("MIDX pack names out of order: '%s' before '%s'",
+			      m->pack_names[i - 1],
+			      m->pack_names[i]);
+			goto cleanup_fail;
+		}
+	}
+
 	return m;
 
 cleanup_fail:
@@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	}
 }
 
+static size_t write_midx_pack_lookup(struct hashfile *f,
+				     char **pack_names,
+				     uint32_t nr_packs)
+{
+	uint32_t i, cur_len = 0;
+
+	for (i = 0; i < nr_packs; i++) {
+		hashwrite_be32(f, cur_len);
+		cur_len += strlen(pack_names[i]) + 1;
+	}
+
+	return sizeof(uint32_t) * (size_t)nr_packs;
+}
+
 static size_t write_midx_pack_names(struct hashfile *f,
 				    char **pack_names,
 				    uint32_t num_packs)
@@ -275,13 +317,17 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 1;
+	num_chunks = 2;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKLOOKUP;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
@@ -311,6 +357,10 @@ int write_midx_file(const char *object_dir)
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
+			case MIDX_CHUNKID_PACKLOOKUP:
+				written += write_midx_pack_lookup(f, pack_names, nr_packs);
+				break;
+
 			case MIDX_CHUNKID_PACKNAMES:
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
diff --git a/object-store.h b/object-store.h
index 199cf4bd44..1ba50459ca 100644
--- a/object-store.h
+++ b/object-store.h
@@ -100,8 +100,10 @@ struct midxed_git {
 	uint32_t num_packs;
 	uint32_t num_objects;
 
+	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
 
+	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index fdf4f84a90..a31c387c8f 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,10 +6,15 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 1 $NUM_PACKS
-	chunks: pack_names
-	object_dir: .
+	header: 4d494458 1 1 2 $NUM_PACKS
+	chunks: pack_lookup pack_names
+	packs:
 	EOF
+	if [ $NUM_PACKS -ge 1 ]
+	then
+		ls pack/ | grep idx | sort >> expect
+	fi
+	printf "object_dir: .\n" >>expect &&
 	git midx read --object-dir=. >actual &&
 	test_cmp expect actual
 }
-- 
2.18.0.rc1



* [PATCH 11/23] midx: sort and deduplicate objects from packfiles
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (9 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:07   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Before writing a list of objects and their offsets to a multi-pack-index
(MIDX), we need to collect the list of objects contained in the
packfiles. There may be multiple copies of some objects, so this list
must be deduplicated.

It is possible to artificially get into a state where there are many
duplicate copies of objects. That can create high memory pressure if we
build a list of all objects before de-duplication. To reduce this
memory pressure without a significant performance drop, group objects
by the first byte of their object id: use the IDX fanout tables to
select one group at a time, copy it to a local array, then sort.

Copy only the de-duplicated entries. When an object has several copies,
keep the one from the packfile with the most recent modified time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
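For readers following along, the per-batch step described in the commit
message (sort by OID, break ties by newest pack mtime, keep the first
entry of each run) can be sketched outside of Git as below. This is an
illustration only, not the code in this patch: demo_entry,
demo_entry_cmp and demo_dedup_batch are hypothetical names, the 20-byte
hash length is an assumption, and plain qsort() stands in for Git's
QSORT().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define DEMO_HASH_LEN 20 /* assumed hash width, mirroring MIDX_HASH_LEN */

/* Hypothetical stand-in for struct pack_midx_entry. */
struct demo_entry {
	unsigned char oid[DEMO_HASH_LEN];
	uint32_t pack_int_id;
	time_t pack_mtime;
};

/* Order by OID ascending, then pack mtime descending, then pack id. */
static int demo_entry_cmp(const void *va, const void *vb)
{
	const struct demo_entry *a = va;
	const struct demo_entry *b = vb;
	int cmp = memcmp(a->oid, b->oid, DEMO_HASH_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	if (a->pack_int_id != b->pack_int_id)
		return a->pack_int_id < b->pack_int_id ? -1 : 1;
	return 0;
}

/*
 * Sort one batch in place and squeeze out duplicate OIDs, keeping the
 * first entry of each run (the copy from the newest pack).
 * Returns the number of entries kept.
 */
static size_t demo_dedup_batch(struct demo_entry *batch, size_t nr)
{
	size_t i, kept = 0;

	assert(batch || !nr);
	qsort(batch, nr, sizeof(*batch), demo_entry_cmp);

	for (i = 0; i < nr; i++) {
		if (kept && !memcmp(batch[kept - 1].oid, batch[i].oid,
				    DEMO_HASH_LEN))
			continue; /* same OID as the entry just kept */
		batch[kept++] = batch[i];
	}
	return kept;
}
```

Running this once per first-byte value, as the patch does with the IDX
fanout tables, bounds the scratch array to roughly one 256th of the
objects instead of all of them.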
 midx.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/midx.c b/midx.c
index 923acda72e..b20d52713c 100644
--- a/midx.c
+++ b/midx.c
@@ -4,6 +4,7 @@
 #include "csum-file.h"
 #include "lockfile.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	}
 }
 
+static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
+{
+	const uint32_t *level1_ofs = p->index_data;
+
+	if (!level1_ofs) {
+		if (open_pack_index(p))
+			return 0;
+		level1_ofs = p->index_data;
+	}
+
+	if (p->index_version > 1) {
+		level1_ofs += 2;
+	}
+
+	return ntohl(level1_ofs[value]);
+}
+
+struct pack_midx_entry {
+	struct object_id oid;
+	uint32_t pack_int_id;
+	time_t pack_mtime;
+	uint64_t offset;
+};
+
+static int midx_oid_compare(const void *_a, const void *_b)
+{
+	struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
+	struct pack_midx_entry *b = (struct pack_midx_entry *)_b;
+	int cmp = oidcmp(&a->oid, &b->oid);
+
+	if (cmp)
+		return cmp;
+
+	if (a->pack_mtime > b->pack_mtime)
+		return -1;
+	else if (a->pack_mtime < b->pack_mtime)
+		return 1;
+
+	return a->pack_int_id - b->pack_int_id;
+}
+
+static void fill_pack_entry(uint32_t pack_int_id,
+			    struct packed_git *p,
+			    uint32_t cur_object,
+			    struct pack_midx_entry *entry)
+{
+	if (!nth_packed_object_oid(&entry->oid, p, cur_object))
+		die("failed to locate object %"PRIu32" in packfile", cur_object);
+
+	entry->pack_int_id = pack_int_id;
+	entry->pack_mtime = p->mtime;
+
+	entry->offset = nth_packed_object_offset(p, cur_object);
+}
+
+/*
+ * It is possible to artificially get into a state where there are many
+ * duplicate copies of objects. That can create high memory pressure if
+ * we are to create a list of all objects before de-duplication. To reduce
+ * this memory pressure without a significant performance drop, automatically
+ * group objects by the first byte of their object id. Use the IDX fanout
+ * tables to group the data, copy to a local array, then sort.
+ *
+ * Copy only the de-duplicated entries (selected by most-recent modified time
+ * of a packfile containing the object).
+ */
+static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+						  uint32_t *perm,
+						  uint32_t nr_packs,
+						  uint32_t *nr_objects)
+{
+	uint32_t cur_fanout, cur_pack, cur_object;
+	uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
+	struct pack_midx_entry *entries_by_fanout = NULL;
+	struct pack_midx_entry *deduplicated_entries = NULL;
+
+	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (open_pack_index(p[cur_pack]))
+			continue;
+
+		total_objects += p[cur_pack]->num_objects;
+	}
+
+	/*
+	 * As we de-duplicate by fanout value, we expect the fanout
+	 * slices to be evenly distributed, with some noise. Hence,
+	 * allocate slightly more than one 256th.
+	 */
+	alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
+
+	ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
+	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
+	*nr_objects = 0;
+
+	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
+		nr_fanout = 0;
+
+		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
+			end = get_pack_fanout(p[cur_pack], cur_fanout);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				nr_fanout++;
+			}
+		}
+
+		QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
+
+		/*
+		 * The batch is now sorted by OID and then mtime (descending).
+		 * Take only the first duplicate.
+		 */
+		for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
+			if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
+						  &entries_by_fanout[cur_object].oid))
+				continue;
+
+			ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
+			memcpy(&deduplicated_entries[*nr_objects],
+			       &entries_by_fanout[cur_object],
+			       sizeof(struct pack_midx_entry));
+			(*nr_objects)++;
+		}
+	}
+
+	FREE_AND_NULL(entries_by_fanout);
+	return deduplicated_entries;
+}
+
 static size_t write_midx_pack_lookup(struct hashfile *f,
 				     char **pack_names,
 				     uint32_t nr_packs)
@@ -254,6 +389,7 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	uint32_t nr_entries;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -312,6 +448,8 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
+	get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
-- 
2.18.0.rc1



* [PATCH 12/23] midx: write object ids in a chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (10 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:25   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
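As background for the chunk_offsets[] changes in this patch: each entry
in the chunk lookup table is the previous entry plus the size of the
previous chunk, and a terminating entry marks the end of the last
chunk. A minimal sketch of that accumulation follows. It is not taken
from the patch; demo_chunk_offsets is a hypothetical helper, and the
position of the first chunk is left to the caller because it depends on
the header and lookup table sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill offsets[0..nr_chunks]: offsets[i] is where chunk i starts and
 * offsets[nr_chunks] is the terminating entry (end of the last chunk).
 * chunk_sizes[] is assumed to already include any alignment padding.
 */
static void demo_chunk_offsets(uint64_t first_chunk_offset,
			       const uint64_t *chunk_sizes,
			       size_t nr_chunks,
			       uint64_t *offsets)
{
	size_t i;

	assert(offsets && (chunk_sizes || !nr_chunks));

	offsets[0] = first_chunk_offset;
	for (i = 0; i < nr_chunks; i++)
		offsets[i + 1] = offsets[i] + chunk_sizes[i];
}
```

With this shape, adding the OID lookup chunk only means appending one
more size (nr_entries * MIDX_HASH_LEN in the patch) before the
terminating entry.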
 Documentation/technical/pack-format.txt |  4 ++
 builtin/midx.c                          |  2 +
 midx.c                                  | 50 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/t5319-midx.sh                         |  4 +-
 5 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 29bf87283a..de9ac778b6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,10 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+
 	(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index 3a261e9bbf..86edd30174 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
 		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_lookup)
+		printf(" oid_lookup");
 
 	printf("\n");
 
diff --git a/midx.c b/midx.c
index b20d52713c..d06bc6876a 100644
--- a/midx.c
+++ b/midx.c
@@ -14,10 +14,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 2
+#define MIDX_MAX_CHUNKS 3
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
 static char *get_midx_filename(const char *object_dir)
@@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				m->chunk_oid_lookup = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die("terminating MIDX chunk id appears earlier than expected");
 				break;
@@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
+	if (!m->chunk_oid_lookup)
+		die("MIDX missing required OID lookup chunk");
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 	for (i = 0; i < m->num_packs; i++) {
@@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		if (i < nr_objects - 1) {
+			struct pack_midx_entry *next = list;
+			if (oidcmp(&obj->oid, &next->oid) >= 0)
+				BUG("OIDs not in order: %s >= %s",
+				oid_to_hex(&obj->oid),
+				oid_to_hex(&next->oid));
+		}
+
+		hashwrite(f, obj->oid.hash, (int)hash_len);
+		written += hash_len;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir)
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	struct pack_midx_entry *entries;
 	uint32_t nr_entries;
 
 	midx_name = get_midx_filename(object_dir);
@@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
-	get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 2;
+	num_chunks = 3;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -467,9 +501,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = 0;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -503,6 +541,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
 
+			case MIDX_CHUNKID_OIDLOOKUP:
+				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index 1ba50459ca..7d14d3586e 100644
--- a/object-store.h
+++ b/object-store.h
@@ -102,6 +102,7 @@ struct midxed_git {
 
 	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
+	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index a31c387c8f..e71aa52b80 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	cat >expect <<- EOF
-	header: 4d494458 1 1 2 $NUM_PACKS
-	chunks: pack_lookup pack_names
+	header: 4d494458 1 1 3 $NUM_PACKS
+	chunks: pack_lookup pack_names oid_lookup
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
-- 
2.18.0.rc1



* [PATCH 13/23] midx: write object id fanout chunk
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (11 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:28   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
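To make the purpose of the fanout chunk concrete, here is a standalone
sketch of how F[i] can be built from a sorted OID list and then used to
narrow a binary search to the OIDs sharing a first byte. It is not code
from this series (the lookup side is not part of this patch);
demo_build_fanout and demo_lookup are hypothetical names, the values
are kept in host byte order rather than the on-disk network order, and
a 20-byte hash is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_LEN 20 /* assumed hash width, mirroring MIDX_HASH_LEN */

/*
 * oids holds nr sorted OIDs back to back. After the call, fanout[b] is
 * the number of OIDs whose first byte is <= b, so fanout[255] == nr.
 */
static void demo_build_fanout(const unsigned char *oids, uint32_t nr,
			      uint32_t fanout[256])
{
	uint32_t i = 0;
	int b;

	assert(oids || !nr);

	for (b = 0; b < 256; b++) {
		while (i < nr && oids[(size_t)i * DEMO_HASH_LEN] == b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Binary search restricted to the slice [fanout[b - 1], fanout[b])
 * where b is the first byte of the wanted OID.
 * Returns 1 and sets *pos on a hit, 0 on a miss.
 */
static int demo_lookup(const unsigned char *oids,
		       const uint32_t fanout[256],
		       const unsigned char *want, uint32_t *pos)
{
	uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t hi = fanout[want[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * DEMO_HASH_LEN,
				 DEMO_HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

Starting from a slice of about N/256 entries rather than all N is what
saves the "eight extra binary search iterations" mentioned in the
comment in write_midx_oid_fanout().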
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c                          |  4 +-
 midx.c                                  | 53 +++++++++++++++++++++++--
 object-store.h                          |  1 +
 t/t5319-midx.sh                         | 18 +++++----
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index de9ac778b6..77e88f85e4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,11 @@ CHUNK DATA:
 	    name. This is the only chunk not guaranteed to be a multiple of four
 	    bytes in length, so should be the last chunk for alignment reasons.
 
+	OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects (N).
+
 	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
diff --git a/builtin/midx.c b/builtin/midx.c
index 86edd30174..e1fd0e0de4 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,10 +35,12 @@ static int read_midx_file(const char *object_dir)
 		printf(" pack_lookup");
 	if (m->chunk_pack_names)
 		printf(" pack_names");
+	if (m->chunk_oid_fanout)
+		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
 
-	printf("\n");
+	printf("\nnum_objects: %d\n", m->num_objects);
 
 	printf("packs:\n");
 	for (i = 0; i < m->num_packs; i++)
diff --git a/midx.c b/midx.c
index d06bc6876a..9458ced208 100644
--- a/midx.c
+++ b/midx.c
@@ -14,12 +14,14 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 3
+#define MIDX_MAX_CHUNKS 4
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -96,6 +98,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_pack_names = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
@@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required pack lookup chunk");
 	if (!m->chunk_pack_names)
 		die("MIDX missing required pack-name chunk");
+	if (!m->chunk_oid_fanout)
+		die("MIDX missing required OID fanout chunk");
 	if (!m->chunk_oid_lookup)
 		die("MIDX missing required OID lookup chunk");
 
+	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+
 	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
 	for (i = 0; i < m->num_packs; i++) {
 		if (i) {
@@ -377,6 +387,35 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	return written;
 }
 
+static size_t write_midx_oid_fanout(struct hashfile *f,
+				    struct pack_midx_entry *objects,
+				    uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	struct pack_midx_entry *last = objects + nr_objects;
+	uint32_t count = 0;
+	uint32_t i;
+
+	/*
+	* Write the first-level table (the list is sorted,
+	* but we use a 256-entry lookup to be able to avoid
+	* having to do eight extra binary search iterations).
+	*/
+	for (i = 0; i < 256; i++) {
+		struct pack_midx_entry *next = list;
+
+		while (next < last && next->oid.hash[0] == i) {
+			count++;
+			next++;
+		}
+
+		hashwrite_be32(f, count);
+		list = next;
+	}
+
+	return MIDX_CHUNK_FANOUT_SIZE;
+}
+
 static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 				    struct pack_midx_entry *objects,
 				    uint32_t nr_objects)
@@ -489,7 +528,7 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 3;
+	num_chunks = 4;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
+	cur_chunk++;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
+
 	cur_chunk++;
 	chunk_ids[cur_chunk] = 0;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
@@ -541,6 +584,10 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_pack_names(f, pack_names, nr_packs);
 				break;
 
+			case MIDX_CHUNKID_OIDFANOUT:
+				written += write_midx_oid_fanout(f, entries, nr_entries);
+				break;
+
 			case MIDX_CHUNKID_OIDLOOKUP:
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
diff --git a/object-store.h b/object-store.h
index 7d14d3586e..c613ff2571 100644
--- a/object-store.h
+++ b/object-store.h
@@ -102,6 +102,7 @@ struct midxed_git {
 
 	const uint32_t *chunk_pack_lookup;
 	const unsigned char *chunk_pack_names;
+	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 
 	const char **pack_names;
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index e71aa52b80..d4ae988479 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -5,9 +5,11 @@ test_description='multi-pack-indexes'
 
 midx_read_expect() {
 	NUM_PACKS=$1
+	NUM_OBJECTS=$2
 	cat >expect <<- EOF
-	header: 4d494458 1 1 3 $NUM_PACKS
-	chunks: pack_lookup pack_names oid_lookup
+	header: 4d494458 1 1 4 $NUM_PACKS
+	chunks: pack_lookup pack_names oid_fanout oid_lookup
+	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
@@ -23,7 +25,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect 0
+	midx_read_expect 0 0
 '
 
 test_expect_success 'create objects' '
@@ -54,18 +56,18 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1
+	midx_read_expect 1 17
 '
 
 test_expect_success 'Add more objects' '
-	for i in `test_seq 6 5`
+	for i in `test_seq 6 10`
 	do
 		iii=$(printf '%03i' $i)
 		test-tool genrandom "bar" 200 > wide_delta_$iii &&
@@ -92,7 +94,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect 2
+	midx_read_expect 2 33
 '
 
 test_expect_success 'Add more packs' '
@@ -123,7 +125,7 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect 12
+	midx_read_expect 12 73
 '
 
 test_done
-- 
2.18.0.rc1



* [PATCH 14/23] midx: write object offsets
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (12 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:41   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The final pair of chunks for the multi-pack-index (MIDX) file stores the
object offsets. We default to using 32-bit offsets as in the pack-index
version 1 format, but if any offset does not fit in 32 bits, we use a
trick similar to the pack-index version 2 format: every offset of at
least 2^31 is stored in a 64-bit table, and the 32-bit table points
into that 64-bit table as necessary.

We only store these 64-bit offsets if necessary, so create a test that
manipulates a version 2 pack-index to fake a large offset. This allows
us to test that the large offset table is created, but the data does not
match the actual packfile offsets. The MIDX offset does match the
(corrupted) pack-index offset, so a later commit will compare these
offsets during a 'verify' step.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
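The encoding described in the commit message can be illustrated with a
small round-trip sketch. This is not the code in the patch: it works on
in-memory arrays instead of a hashfile, and demo_encode_offset and
demo_decode_offset are hypothetical names. The reader side in
particular is an assumption about how the two chunks would be consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LARGE_OFFSET_NEEDED 0x80000000u /* mirrors MIDX_LARGE_OFFSET_NEEDED */

/*
 * Writer side: return the 4-byte value for the object offsets chunk.
 * When the large offset chunk is in use, any offset >= 2^31 is appended
 * to large_table and replaced by its row number with the top bit set.
 */
static uint32_t demo_encode_offset(uint64_t offset, int large_offsets_needed,
				   uint64_t *large_table, uint32_t *nr_large)
{
	if (large_offsets_needed && (offset >> 31)) {
		large_table[*nr_large] = offset;
		return DEMO_LARGE_OFFSET_NEEDED | (*nr_large)++;
	}

	/* Without a large offset chunk, everything must fit in 32 bits. */
	assert(!(offset >> 32));
	return (uint32_t)offset;
}

/*
 * Reader side: large_table is NULL when the large offset chunk is
 * absent, in which case the stored value is the offset itself.
 */
static uint64_t demo_decode_offset(uint32_t stored, const uint64_t *large_table)
{
	if (large_table && (stored & DEMO_LARGE_OFFSET_NEEDED))
		return large_table[stored & ~DEMO_LARGE_OFFSET_NEEDED];
	return stored;
}
```

Note that the top bit is only treated as a flag when the large offset
chunk exists; otherwise offsets between 2^31 and 2^32-1 are stored
directly, which matches the two thresholds (0x7fffffff and 0xffffffff)
checked in write_midx_file().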
 Documentation/technical/pack-format.txt |  15 +++-
 builtin/midx.c                          |   4 +
 midx.c                                  | 100 +++++++++++++++++++++++-
 object-store.h                          |   2 +
 t/t5319-midx.sh                         |  45 ++++++++---
 5 files changed, 151 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 77e88f85e4..0256cfb5e0 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -316,7 +316,20 @@ CHUNK DATA:
 	    The OIDs for all objects in the MIDX are stored in lexicographic
 	    order in this chunk.
 
-	(This section intentionally left incomplete.)
+	Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^32, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the most-significant bit is set, then clearing that
+		bit reveals the row in the large offsets containing the 8-byte
+		offset of this object.
+
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
 
 TRAILER:
 
diff --git a/builtin/midx.c b/builtin/midx.c
index e1fd0e0de4..607d2b3544 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -39,6 +39,10 @@ static int read_midx_file(const char *object_dir)
 		printf(" oid_fanout");
 	if (m->chunk_oid_lookup)
 		printf(" oid_lookup");
+	if (m->chunk_object_offsets)
+		printf(" object_offsets");
+	if (m->chunk_large_offsets)
+		printf(" large_offsets");
 
 	printf("\nnum_objects: %d\n", m->num_objects);
 
diff --git a/midx.c b/midx.c
index 9458ced208..a49300bf75 100644
--- a/midx.c
+++ b/midx.c
@@ -14,14 +14,19 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 4
+#define MIDX_MAX_CHUNKS 6
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -106,6 +111,14 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 				m->chunk_oid_lookup = m->data + chunk_offset;
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				m->chunk_object_offsets = m->data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				m->chunk_large_offsets = m->data + chunk_offset;
+				break;
+
 			case 0:
 				die("terminating MIDX chunk id appears earlier than expected");
 				break;
@@ -127,6 +140,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 		die("MIDX missing required OID fanout chunk");
 	if (!m->chunk_oid_lookup)
 		die("MIDX missing required OID lookup chunk");
+	if (!m->chunk_object_offsets)
+		die("MIDX missing required object offsets chunk");
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
@@ -442,6 +457,56 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 	return written;
 }
 
+static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	uint32_t i, nr_large_offset = 0;
+	size_t written = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = list++;
+
+		hashwrite_be32(f, obj->pack_int_id);
+
+		if (large_offset_needed && obj->offset >> 31)
+			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
+		else if (!large_offset_needed && obj->offset >> 32)
+			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
+			    oid_to_hex(&obj->oid),
+			    obj->offset);
+		else
+			hashwrite_be32(f, (uint32_t)obj->offset);
+
+		written += MIDX_CHUNK_OFFSET_WIDTH;
+	}
+
+	return written;
+}
+
+static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
+				       struct pack_midx_entry *objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry *list = objects;
+	size_t written = 0;
+
+	while (nr_large_offset) {
+		struct pack_midx_entry *obj = list++;
+		uint64_t offset = obj->offset;
+
+		if (!(offset >> 31))
+			continue;
+
+		hashwrite_be32(f, offset >> 32);
+		hashwrite_be32(f, offset & 0xffffffff);
+		written += 2 * sizeof(uint32_t);
+
+		nr_large_offset--;
+	}
+
+	return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
 	unsigned char cur_chunk, num_chunks = 0;
@@ -462,7 +527,8 @@ int write_midx_file(const char *object_dir)
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct pack_midx_entry *entries;
-	uint32_t nr_entries;
+	uint32_t nr_entries, num_large_offsets = 0;
+	int large_offsets_needed = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -522,13 +588,19 @@ int write_midx_file(const char *object_dir)
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
 	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	for (i = 0; i < nr_entries; i++) {
+		if (entries[i].offset > 0x7fffffff)
+			num_large_offsets++;
+		if (entries[i].offset > 0xffffffff)
+			large_offsets_needed = 1;
+	}
 
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
 	cur_chunk = 0;
-	num_chunks = 4;
+	num_chunks = large_offsets_needed ? 6 : 5;
 
 	written = write_midx_header(f, num_chunks, nr_packs);
 
@@ -548,9 +620,21 @@ int write_midx_file(const char *object_dir)
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
 
 	cur_chunk++;
-	chunk_ids[cur_chunk] = 0;
+	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
 
+	cur_chunk++;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	if (large_offsets_needed) {
+		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
+
+		cur_chunk++;
+		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
+					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+	}
+
+	chunk_ids[cur_chunk] = 0;
+
 	for (i = 0; i <= num_chunks; i++) {
 		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
 			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
@@ -592,6 +676,14 @@ int write_midx_file(const char *object_dir)
 				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
 				break;
 
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				break;
+
 			default:
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
diff --git a/object-store.h b/object-store.h
index c613ff2571..9b671f1b0a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -104,6 +104,8 @@ struct midxed_git {
 	const unsigned char *chunk_pack_names;
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_object_offsets;
+	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
 	char object_dir[FLEX_ARRAY];
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index d4ae988479..709652c635 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -6,18 +6,21 @@ test_description='multi-pack-indexes'
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
+	NUM_CHUNKS=$3
+	OBJECT_DIR=$4
+	EXTRA_CHUNKS="$5"
 	cat >expect <<- EOF
-	header: 4d494458 1 1 4 $NUM_PACKS
-	chunks: pack_lookup pack_names oid_fanout oid_lookup
+	header: 4d494458 1 1 $NUM_CHUNKS $NUM_PACKS
+	chunks: pack_lookup pack_names oid_fanout oid_lookup object_offsets$EXTRA_CHUNKS
 	num_objects: $NUM_OBJECTS
 	packs:
 	EOF
 	if [ $NUM_PACKS -ge 1 ]
 	then
-		ls pack/ | grep idx | sort >> expect
+		ls $OBJECT_DIR/pack/ | grep idx | sort >> expect
 	fi
-	printf "object_dir: .\n" >>expect &&
-	git midx read --object-dir=. >actual &&
+	printf "object_dir: $OBJECT_DIR\n" >>expect &&
+	git midx read --object-dir=$OBJECT_DIR >actual &&
 	test_cmp expect actual
 }
 
@@ -25,7 +28,7 @@ test_expect_success 'write midx with no packs' '
 	git midx --object-dir=. write &&
 	test_when_finished rm pack/multi-pack-index &&
 	test_path_is_file pack/multi-pack-index &&
-	midx_read_expect 0 0
+	midx_read_expect 0 0 5 .
 '
 
 test_expect_success 'create objects' '
@@ -56,14 +59,14 @@ test_expect_success 'write midx with one v1 pack' '
 	pack=$(git pack-objects --index-version=1 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx pack/multi-pack-index &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 5 .
 '
 
 test_expect_success 'write midx with one v2 pack' '
 	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
 	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
 	git midx --object-dir=. write &&
-	midx_read_expect 1 17
+	midx_read_expect 1 17 5 .
 '
 
 test_expect_success 'Add more objects' '
@@ -94,7 +97,7 @@ test_expect_success 'write midx with two packs' '
 	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
 	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
 	git midx --object-dir=. write &&
-	midx_read_expect 2 33
+	midx_read_expect 2 33 5 .
 '
 
 test_expect_success 'Add more packs' '
@@ -125,7 +128,29 @@ test_expect_success 'Add more packs' '
 
 test_expect_success 'write midx with twelve packs' '
 	git midx --object-dir=. write &&
-	midx_read_expect 12 73
+	midx_read_expect 12 73 5 .
+'
+
+
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+# Force 64-bit offsets by manipulating the idx file.
+# This makes the IDX file _incorrect_ so be careful to clean up after!
+test_expect_success 'force some 64-bit offsets with pack-objects' '
+	mkdir objects64 &&
+	mkdir objects64/pack &&
+	pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
+	idx64=objects64/pack/test-64-$pack64.idx &&
+	chmod u+w $idx64 &&
+	corrupt_data $idx64 2899 "\02" &&
+	midx64=$(git midx write --object-dir=objects64) &&
+	midx_read_expect 1 62 6 objects64 " large_offsets"
 '
 
 test_done
-- 
2.18.0.rc1
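
Reading aid, not part of the patch: a minimal sketch of how the writer
above appears to pack an offset into the 32-bit slot of the object
offsets chunk when the large offsets chunk is in use, and how a reader
could recover it. It assumes MIDX_LARGE_OFFSET_NEEDED is the high bit
(0x80000000), which this message does not show. The sketch_ and SKETCH_
names are hypothetical, and the sketch works on host-order integers
rather than the big-endian file layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the flag is the high bit of the 32-bit slot. */
#define SKETCH_LARGE_OFFSET_NEEDED 0x80000000u

/*
 * Encode one offset for the case where the large offsets chunk is
 * written: an offset that needs bit 31 or above is replaced by a
 * flagged index into large[], which receives the full 64-bit value.
 */
uint32_t sketch_encode_offset(uint64_t offset, uint64_t *large,
			      uint32_t *nr_large)
{
	if (offset >> 31) {
		assert(*nr_large < SKETCH_LARGE_OFFSET_NEEDED);
		large[*nr_large] = offset;
		return SKETCH_LARGE_OFFSET_NEEDED | (*nr_large)++;
	}
	return (uint32_t)offset;
}

/* Reverse the encoding: follow the index when the flag bit is set. */
uint64_t sketch_decode_offset(uint32_t slot, const uint64_t *large)
{
	if (slot & SKETCH_LARGE_OFFSET_NEEDED)
		return large[slot ^ SKETCH_LARGE_OFFSET_NEEDED];
	return slot;
}
```

Small offsets round-trip unchanged, while each large offset costs one
extra 8-byte entry in the side table, so the common case stays at 4
bytes per object.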



* [PATCH 15/23] midx: create core.midx config setting
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (13 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

The core.midx config setting controls the multi-pack-index (MIDX)
feature. If false, the setting will disable all reads from the
multi-pack-index file.

Add comparison commands in t5319-midx.sh to check that typical Git
behavior remains the same whether the config setting is on or off. This
currently includes 'git rev-list' and 'git log' commands to trigger
several object database reads.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt |  4 +++
 cache.h                  |  1 +
 config.c                 |  5 ++++
 environment.c            |  1 +
 t/t5319-midx.sh          | 57 ++++++++++++++++++++++++++++++++--------
 5 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..e78150e452 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -908,6 +908,10 @@ core.commitGraph::
 	Enable git commit graph feature. Allows reading from the
 	commit-graph file.
 
+core.midx::
+	Enable the multi-pack-index feature. Allows reading from the
+	multi-pack-index file.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 89a107a7f7..c7967f7643 100644
--- a/cache.h
+++ b/cache.h
@@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
+extern int core_midx;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index fbbf0f8e9f..0df3dbdf74 100644
--- a/config.c
+++ b/config.c
@@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.midx")) {
+		core_midx = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2a6de2330b..dcb4417604 100644
--- a/environment.c
+++ b/environment.c
@@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
+int core_midx;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 709652c635..1a50987778 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,6 +3,8 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+objdir=.git/objects
+
 midx_read_expect() {
 	NUM_PACKS=$1
 	NUM_OBJECTS=$2
@@ -62,13 +64,42 @@ test_expect_success 'write midx with one v1 pack' '
 	midx_read_expect 1 17 5 .
 '
 
+midx_git_two_modes() {
+	git -c core.midx=false $1 >expect &&
+	git -c core.midx=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
 test_expect_success 'write midx with one v2 pack' '
-	pack=$(git pack-objects --index-version=2,0x40 pack/test <obj-list) &&
-	test_when_finished rm pack/test-$pack.pack pack/test-$pack.idx &&
-	git midx --object-dir=. write &&
-	midx_read_expect 1 17 5 .
+	pack=$(git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list) &&
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 1 17 5 $objdir
 '
 
+midx_git_two_modes() {
+	git -c core.midx=false $1 >expect &&
+	git -c core.midx=true $1 >actual &&
+	test_cmp expect actual
+}
+
+compare_results_with_midx() {
+	MSG=$1
+	test_expect_success "check normal git operations: $MSG" '
+		midx_git_two_modes "rev-list --objects --all" &&
+		midx_git_two_modes "log --raw"
+	'
+}
+
+compare_results_with_midx "one v2 pack"
+
 test_expect_success 'Add more objects' '
 	for i in `test_seq 6 10`
 	do
@@ -94,12 +125,13 @@ test_expect_success 'Add more objects' '
 '
 
 test_expect_success 'write midx with two packs' '
-	pack1=$(git pack-objects --index-version=1 pack/test-1 <obj-list) &&
-	pack2=$(git pack-objects --index-version=1 pack/test-2 <obj-list2) &&
-	git midx --object-dir=. write &&
-	midx_read_expect 2 33 5 .
+	pack2=$(git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list2) &&
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 2 33 5 $objdir
 '
 
+compare_results_with_midx "two packs"
+
 test_expect_success 'Add more packs' '
 	for j in `test_seq 1 10`
 	do
@@ -120,17 +152,20 @@ test_expect_success 'Add more packs' '
 		git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
 		} >obj-list &&
 		git update-ref HEAD $commit &&
-		git pack-objects --index-version=2 pack/test-pack <obj-list &&
+		git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list &&
 		i=$(expr $i + 1) || return 1 &&
 		j=$(expr $j + 1) || return 1
 	done
 '
 
+compare_results_with_midx "mixed mode (two packs + extra)"
+
 test_expect_success 'write midx with twelve packs' '
-	git midx --object-dir=. write &&
-	midx_read_expect 12 73 5 .
+	git midx --object-dir=$objdir write &&
+	midx_read_expect 12 73 5 $objdir
 '
 
+compare_results_with_midx "twelve packs"
 
 # usage: corrupt_data <file> <pos> [<data>]
 corrupt_data() {
-- 
2.18.0.rc1



* [PATCH 16/23] midx: prepare midxed_git struct
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (14 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:47   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c         | 22 ++++++++++++++++++++++
 midx.h         |  2 ++
 object-store.h |  7 +++++++
 packfile.c     |  6 +++++-
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index a49300bf75..5e9290ca8f 100644
--- a/midx.c
+++ b/midx.c
@@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	exit(1);
 }
 
+int prepare_midxed_git_one(struct repository *r, const char *object_dir)
+{
+	struct midxed_git *m = r->objects->midxed_git;
+	struct midxed_git *m_search;
+
+	if (!core_midx)
+		return 0;
+
+	for (m_search = m; m_search; m_search = m_search->next)
+		if (!strcmp(object_dir, m_search->object_dir))
+			return 1;
+
+	m_search = load_midxed_git(object_dir);
+
+	if (m_search) {
+		m_search->next = m;
+		r->objects->midxed_git = m_search;
+		return 1;
+	}
+	return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index a1d18ed991..793203fc4a 100644
--- a/midx.h
+++ b/midx.h
@@ -5,8 +5,10 @@
 #include "cache.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 9b671f1b0a..7908d46e34 100644
--- a/object-store.h
+++ b/object-store.h
@@ -130,6 +130,13 @@ struct raw_object_store {
 	 */
 	struct oidmap *replace_map;
 
+	/*
+	 * private data
+	 *
+	 * should only be accessed directly by packfile.c and midx.c
+	 */
+	struct midxed_git *midxed_git;
+
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 1a714fbde9..b91ca9b9f5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
 
 	if (r->objects->packed_git_initialized)
 		return;
+	prepare_midxed_git_one(r, r->objects->objectdir);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+		prepare_midxed_git_one(r, alt->path);
 		prepare_packed_git_one(r, alt->path, 0);
+	}
 	rearrange_packed_git(r);
 	prepare_packed_git_mru(r);
 	r->objects->packed_git_initialized = 1;
-- 
2.18.0.rc1
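
Reading aid, not part of the patch: a minimal sketch of the list
handling in prepare_midxed_git_one(), which registers each object
directory at most once and pushes new entries onto the front of a
singly linked list. struct sketch_midx and sketch_prepare_one() are
hypothetical stand-ins; the real function loads a MIDX file instead of
allocating a bare node. This sketch leaves the list untouched when it
cannot create a node.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct midxed_git: a name and a next pointer. */
struct sketch_midx {
	struct sketch_midx *next;
	char *object_dir;
};

/*
 * Register object_dir at most once. Returns 1 if the directory is (now)
 * on the list and 0 if a new node could not be created. New nodes go on
 * the front of the list.
 */
int sketch_prepare_one(struct sketch_midx **head, const char *object_dir)
{
	struct sketch_midx *m;

	assert(head && object_dir);

	/* Already known? Then there is nothing to load. */
	for (m = *head; m; m = m->next)
		if (!strcmp(object_dir, m->object_dir))
			return 1;

	m = malloc(sizeof(*m));
	if (!m)
		return 0;
	m->object_dir = malloc(strlen(object_dir) + 1);
	if (!m->object_dir) {
		free(m);
		return 0;
	}
	strcpy(m->object_dir, object_dir);

	/* Link the new node in front of the existing list. */
	m->next = *head;
	*head = m;
	return 1;
}
```

Because new nodes are prepended, the most recently registered directory
is searched first when the list is walked.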



* [PATCH 17/23] midx: read objects from multi-pack-index
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (15 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 17:56   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++--
 midx.h         |  2 ++
 object-store.h |  1 +
 packfile.c     |  8 ++++-
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 5e9290ca8f..6eca8f1b12 100644
--- a/midx.c
+++ b/midx.c
@@ -3,6 +3,7 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "sha1-lookup.h"
 #include "object-store.h"
 #include "packfile.h"
 #include "midx.h"
@@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
 	m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
 	strcpy(m->object_dir, object_dir);
-	m->data = midx_map;
+	m->data = (const unsigned char*)midx_map;
 
 	m->signature = get_be32(m->data);
 	if (m->signature != MIDX_SIGNATURE) {
@@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
-	m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+	m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
+
+	ALLOC_ARRAY(m->pack_names, m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (i) {
 			if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
@@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	exit(1);
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->num_packs)
+		BUG("bad pack-int-id");
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addstr(&pack_name, m->object_dir);
+	strbuf_addstr(&pack_name, "/pack/");
+	strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result)
+{
+	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
+			    MIDX_HASH_LEN, result);
+}
+
+static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+{
+	const unsigned char *offset_data;
+	uint32_t offset32;
+
+	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset32 = get_be32(offset_data + sizeof(uint32_t));
+
+	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
+		if (sizeof(off_t) < sizeof(uint64_t))
+			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
+
+		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
+	}
+
+	return offset32;
+}
+
+static uint32_t nth_midxed_pack_int_id(struct midxed_git *m, uint32_t pos)
+{
+	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+}
+
+static int nth_midxed_pack_entry(struct midxed_git *m, struct pack_entry *e, uint32_t pos)
+{
+	uint32_t pack_int_id;
+	struct packed_git *p;
+
+	if (pos >= m->num_objects)
+		return 0;
+
+	pack_int_id = nth_midxed_pack_int_id(m, pos);
+
+	if (prepare_midx_pack(m, pack_int_id))
+		die(_("error preparing packfile from multi-pack-index"));
+	p = m->packs[pack_int_id];
+
+	/*
+	 * We are about to tell the caller where they can locate the
+	 * requested object.  We better make sure the packfile is
+	 * still here and can be accessed before supplying that
+	 * answer, as it may have been deleted since the MIDX was
+	 * loaded!
+	 */
+	if (!is_pack_valid(p))
+		return 0;
+
+	e->offset = nth_midxed_offset(m, pos);
+	e->p = p;
+
+	return 1;
+}
+
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m)
+{
+	uint32_t pos;
+
+	if (!bsearch_midx(oid, m, &pos))
+		return 0;
+
+	return nth_midxed_pack_entry(m, e, pos);
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
 	struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 793203fc4a..0c66812229 100644
--- a/midx.h
+++ b/midx.h
@@ -8,6 +8,8 @@
 #include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/object-store.h b/object-store.h
index 7908d46e34..5af2a852bc 100644
--- a/object-store.h
+++ b/object-store.h
@@ -108,6 +108,7 @@ struct midxed_git {
 	const unsigned char *chunk_large_offsets;
 
 	const char **pack_names;
+	struct packed_git **packs;
 	char object_dir[FLEX_ARRAY];
 };
 
diff --git a/packfile.c b/packfile.c
index b91ca9b9f5..73f8cc28ee 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1857,11 +1857,17 @@ static int fill_pack_entry(const struct object_id *oid,
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
+	struct midxed_git *m;
 
 	prepare_packed_git(r);
-	if (!r->objects->packed_git)
+	if (!r->objects->packed_git && !r->objects->midxed_git)
 		return 0;
 
+	for (m = r->objects->midxed_git; m; m = m->next) {
+		if (fill_midx_entry(oid, e, m))
+			return 1;
+	}
+
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
 		if (fill_pack_entry(oid, e, p)) {
-- 
2.18.0.rc1
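
Reading aid, not part of the patch: bsearch_midx() hands the OID fanout
and OID lookup chunks to bsearch_hash(). The sketch below shows the
general fanout-bounded binary search technique, not Git's actual
implementation. It assumes a hypothetical 4-byte hash
(SKETCH_HASH_LEN) and a fanout table in host byte order, whereas the
MIDX stores the fanout in network byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 4 /* hypothetical; stands in for MIDX_HASH_LEN */

/*
 * Look up hash in a sorted table of fixed-width hashes. fanout[b] holds
 * the number of entries whose first byte is <= b, so it bounds the
 * binary search to a single first-byte bucket. Returns 1 and sets
 * *result to the matching position, or returns 0 and sets *result to
 * the position where hash would be inserted.
 */
int sketch_bsearch_hash(const unsigned char *hash, const uint32_t *fanout,
			const unsigned char *table, uint32_t *result)
{
	uint32_t lo = hash[0] ? fanout[hash[0] - 1] : 0;
	uint32_t hi = fanout[hash[0]];

	assert(lo <= hi);
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(table + (size_t)mid * SKETCH_HASH_LEN, hash,
				 SKETCH_HASH_LEN);

		if (!cmp) {
			*result = mid;
			return 1;
		}
		if (cmp > 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	*result = lo;
	return 0;
}
```

The insertion position reported on a miss is what the abbreviation code
later in the series relies on to find the nearest neighbors of an
object name.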



* [PATCH 18/23] midx: use midx in abbreviation calculations
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (16 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:01   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c          | 11 ++++++++
 midx.h          |  3 +++
 packfile.c      |  6 +++++
 packfile.h      |  1 +
 sha1-name.c     | 70 +++++++++++++++++++++++++++++++++++++++++++++++++
 t/t5319-midx.sh |  3 ++-
 6 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 6eca8f1b12..25d8142c2a 100644
--- a/midx.c
+++ b/midx.c
@@ -203,6 +203,17 @@ int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *re
 			    MIDX_HASH_LEN, result);
 }
 
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct midxed_git *m,
+					uint32_t n)
+{
+	if (n >= m->num_objects)
+		return NULL;
+
+	hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n);
+	return oid;
+}
+
 static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
diff --git a/midx.h b/midx.h
index 0c66812229..497bdcc77c 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,9 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					struct midxed_git *m,
+					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
diff --git a/packfile.c b/packfile.c
index 73f8cc28ee..638e113972 100644
--- a/packfile.c
+++ b/packfile.c
@@ -919,6 +919,12 @@ struct packed_git *get_packed_git(struct repository *r)
 	return r->objects->packed_git;
 }
 
+struct midxed_git *get_midxed_git(struct repository *r)
+{
+	prepare_packed_git(r);
+	return r->objects->midxed_git;
+}
+
 struct list_head *get_packed_git_mru(struct repository *r)
 {
 	prepare_packed_git(r);
diff --git a/packfile.h b/packfile.h
index e0a38aba93..01e14b93fd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -39,6 +39,7 @@ extern void install_packed_git(struct repository *r, struct packed_git *pack);
 
 struct packed_git *get_packed_git(struct repository *r);
 struct list_head *get_packed_git_mru(struct repository *r);
+struct midxed_git *get_midxed_git(struct repository *r);
 
 /*
  * Give a rough count of objects in the repository. This sacrifices accuracy
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..d975a186c9 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -12,6 +12,7 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "repository.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, const unsigned char *
 	return 1;
 }
 
+static void unique_in_midx(struct midxed_git *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+	num = m->num_objects;
+
+	if (!num)
+		return;
+
+	bsearch_midx(&ds->bin_pfx, m, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx".  See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void unique_in_pack(struct packed_git *p,
 			   struct disambiguate_state *ds)
 {
@@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p,
 
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
+	struct midxed_git *m;
 	struct packed_git *p;
 
+	for (m = get_midxed_git(the_repository); m && !ds->ambiguous;
+	     m = m->next)
+		unique_in_midx(m, ds);
 	for (p = get_packed_git(the_repository); p && !ds->ambiguous;
 	     p = p->next)
 		unique_in_pack(p, ds);
@@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, first = 0;
+	struct object_id oid;
+	const struct object_id *mad_oid;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	mad_oid = mad->oid;
+	match = bsearch_midx(mad_oid, m, &first);
+
+	/*
+	 * first is now the position in the MIDX where we would insert
+	 * mad->oid if it does not exist (or the position of mad->oid if
+	 * it does exist). Hence, we consider a maximum of two objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		if (nth_midxed_object_oid(&oid, m, first))
+			extend_abbrev_len(&oid, mad);
+	} else if (first < num - 1) {
+		if (nth_midxed_object_oid(&oid, m, first + 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		if (nth_midxed_object_oid(&oid, m, first - 1))
+			extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
+	struct midxed_git *m;
 	struct packed_git *p;
 
+	for (m = get_midxed_git(the_repository); m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = get_packed_git(the_repository); p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 1a50987778..e3873da7d6 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -94,7 +94,8 @@ compare_results_with_midx() {
 	MSG=$1
 	test_expect_success "check normal git operations: $MSG" '
 		midx_git_two_modes "rev-list --objects --all" &&
-		midx_git_two_modes "log --raw"
+		midx_git_two_modes "log --raw" &&
+		midx_git_two_modes "log --oneline"
 	'
 }
 
-- 
2.18.0.rc1
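
Reading aid, not part of the patch: find_abbrev_len_for_midx() only
inspects the entries adjacent to the search position. The sketch below
illustrates why that is enough, for the case where the object is
present in the list. It uses hex strings instead of binary object ids,
and the sketch_ functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading characters shared by two NUL-terminated hex strings. */
size_t sketch_common_prefix(const char *a, const char *b)
{
	size_t n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/*
 * Given a sorted list of distinct, equal-length hex object names and the
 * index pos of one of them, return the shortest prefix length that no
 * other name in the list shares. Only the two neighbors need checking:
 * in sorted order, no other name can share a longer prefix with
 * sorted[pos] than the closer of its two neighbors does.
 */
size_t sketch_min_abbrev(const char **sorted, size_t nr, size_t pos)
{
	size_t len = 1, shared;

	assert(pos < nr);
	if (pos > 0) {
		shared = sketch_common_prefix(sorted[pos], sorted[pos - 1]);
		if (shared + 1 > len)
			len = shared + 1;
	}
	if (pos + 1 < nr) {
		shared = sketch_common_prefix(sorted[pos], sorted[pos + 1]);
		if (shared + 1 > len)
			len = shared + 1;
	}
	return len;
}
```

With one sorted list covering many packs, this check runs once per
MIDX rather than once per packfile.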



* [PATCH 19/23] midx: use existing midx when writing new one
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (17 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/midx.c b/midx.c
index 25d8142c2a..388d79b7d9 100644
--- a/midx.c
+++ b/midx.c
@@ -389,6 +389,23 @@ static int midx_oid_compare(const void *_a, const void *_b)
 	return a->pack_int_id - b->pack_int_id;
 }
 
+static int nth_midxed_pack_midx_entry(struct midxed_git *m,
+				      uint32_t *pack_perm,
+				      struct pack_midx_entry *e,
+				      uint32_t pos)
+{
+	if (pos >= m->num_objects)
+		return 1;
+
+	nth_midxed_object_oid(&e->oid, m, pos);
+	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->offset = nth_midxed_offset(m, pos);
+
+	/* consider objects in midx to be from "old" packs */
+	e->pack_mtime = 0;
+	return 0;
+}
+
 static void fill_pack_entry(uint32_t pack_int_id,
 			    struct packed_git *p,
 			    uint32_t cur_object,
@@ -414,7 +431,8 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * Copy only the de-duplicated entries (selected by most-recent modified time
  * of a packfile containing the object).
  */
-static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+static struct pack_midx_entry *get_sorted_entries(struct midxed_git *m,
+						  struct packed_git **p,
 						  uint32_t *perm,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
@@ -423,8 +441,9 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
 	struct pack_midx_entry *entries_by_fanout = NULL;
 	struct pack_midx_entry *deduplicated_entries = NULL;
+	uint32_t start_pack = m ? m->num_packs : 0;
 
-	for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 		if (open_pack_index(p[cur_pack]))
 			continue;
 
@@ -445,7 +464,23 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
 	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
 		nr_fanout = 0;
 
-		for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+		if (m) {
+			uint32_t start = 0, end;
+
+			if (cur_fanout)
+				start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]);
+			end = ntohl(m->chunk_oid_fanout[cur_fanout]);
+
+			for (cur_object = start; cur_object < end; cur_object++) {
+				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+				nth_midxed_pack_midx_entry(m, perm,
+							   &entries_by_fanout[nr_fanout],
+							   cur_object);
+				nr_fanout++;
+			}
+		}
+
+		for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
@@ -654,6 +689,7 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries;
 	uint32_t nr_entries, num_large_offsets = 0;
 	int large_offsets_needed = 0;
+	struct midxed_git *m = NULL;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -662,6 +698,8 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
+	m = load_midxed_git(object_dir);
+
 	strbuf_addf(&pack_dir, "%s/pack", object_dir);
 	dir = opendir(pack_dir.buf);
 
@@ -676,11 +714,27 @@ int write_midx_file(const char *object_dir)
 	pack_dir_len = pack_dir.len;
 	ALLOC_ARRAY(packs, alloc_packs);
 	ALLOC_ARRAY(pack_names, alloc_pack_names);
+
+	if (m) {
+		for (i = 0; i < m->num_packs; i++) {
+			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
+
+			packs[nr_packs] = NULL;
+			pack_names[nr_packs] = xstrdup(m->pack_names[i]);
+			pack_name_concat_len += strlen(pack_names[nr_packs]) + 1;
+			nr_packs++;
+		}
+	}
+
 	while ((de = readdir(dir)) != NULL) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
 		if (ends_with(de->d_name, ".idx")) {
+			if (m && midx_contains_pack(m, de->d_name))
+				continue;
+
 			ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
 			ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
 
@@ -705,6 +759,9 @@ int write_midx_file(const char *object_dir)
 	closedir(dir);
 	strbuf_release(&pack_dir);
 
+	if (m && nr_packs == m->num_packs)
+		goto cleanup;
+
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
 					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
@@ -712,7 +769,7 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(pack_perm, nr_packs);
 	sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
-	entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+	entries = get_sorted_entries(m, packs, pack_perm, nr_packs, &nr_entries);
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
@@ -823,7 +880,8 @@ int write_midx_file(const char *object_dir)
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
 
-	for (i = 0; i < nr_packs; i++) {
+cleanup:
+	for (i = m ? m->num_packs : 0; i < nr_packs; i++) {
 		close_pack(packs[i]);
 		FREE_AND_NULL(packs[i]);
 	}
-- 
2.18.0.rc1
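
To illustrate the fanout arithmetic used in get_sorted_entries() above: the
256-entry fanout table holds cumulative object counts in network byte order,
so the objects whose first hash byte is b sit in the half-open range
[fanout[b - 1], fanout[b]), with 0 as the lower bound when b is 0. Below is a
minimal standalone sketch of that lookup. It is not part of the patch, and
fanout_range() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): compute the half-open
 * range [*start, *end) of object positions whose first hash byte
 * equals 'first_byte'. 'fanout_be' holds 256 cumulative counts in
 * network byte order, mirroring the chunk_oid_fanout reads above.
 */
static void fanout_range(const uint32_t *fanout_be, unsigned first_byte,
			 uint32_t *start, uint32_t *end)
{
	*start = first_byte ? ntohl(fanout_be[first_byte - 1]) : 0;
	*end = ntohl(fanout_be[first_byte]);
}
```

An empty range (start == end) simply means no object in the existing
multi-pack-index starts with that byte.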


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 20/23] midx: use midx in approximate_object_count
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (18 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:03   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 638e113972..059b2aa097 100644
--- a/packfile.c
+++ b/packfile.c
@@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
 {
 	if (!the_repository->objects->approximate_object_count_valid) {
 		unsigned long count;
+		struct midxed_git *m;
 		struct packed_git *p;
 
 		prepare_packed_git(the_repository);
 		count = 0;
-		for (p = the_repository->objects->packed_git; p; p = p->next) {
+		for (m = get_midxed_git(the_repository); m; m = m->next)
+			count += m->num_objects;
+		for (p = get_packed_git(the_repository); p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
 			count += p->num_objects;
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 21/23] midx: prevent duplicate packfile loads
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (19 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:05   ` Duy Nguyen
  2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

If the multi-pack-index contains a packfile, then we do not need to add
that packfile to the packed_git linked list or the MRU list.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     | 23 +++++++++++++++++++++++
 midx.h     |  1 +
 packfile.c |  7 +++++++
 3 files changed, 31 insertions(+)

diff --git a/midx.c b/midx.c
index 388d79b7d9..3242646fe0 100644
--- a/midx.c
+++ b/midx.c
@@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mi
 	return nth_midxed_pack_entry(m, e, pos);
 }
 
+int midx_contains_pack(struct midxed_git *m, const char *idx_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(idx_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
 	struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 497bdcc77c..c1db58d8c4 100644
--- a/midx.h
+++ b/midx.h
@@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct midxed_git *m,
 					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
+int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/packfile.c b/packfile.c
index 059b2aa097..479cb69b9f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 	DIR *dir;
 	struct dirent *de;
 	struct string_list garbage = STRING_LIST_INIT_DUP;
+	struct midxed_git *m = r->objects->midxed_git;
+
+	/* look for the multi-pack-index for this object directory */
+	while (m && strcmp(m->object_dir, objdir))
+		m = m->next;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
 		base_len = path.len;
 		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
 			/* Don't reopen a pack we already have. */
+			if (m && midx_contains_pack(m, de->d_name))
+				continue;
 			for (p = r->objects->packed_git; p;
 			     p = p->next) {
 				size_t len;
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 22/23] midx: use midx to find ref-deltas
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (20 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     |  2 +-
 midx.h     |  1 +
 packfile.c | 15 +++++++++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 3242646fe0..e46f392fa4 100644
--- a/midx.c
+++ b/midx.c
@@ -214,7 +214,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
-static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
 	uint32_t offset32;
diff --git a/midx.h b/midx.h
index c1db58d8c4..6996b5ff6b 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,7 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t n);
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct midxed_git *m,
 					uint32_t n);
diff --git a/packfile.c b/packfile.c
index 479cb69b9f..9b814c89c7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1794,6 +1794,21 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	uint32_t result;
 
 	if (!index) {
+		/*
+		 * If we have a MIDX, then we want to
+		 * check the MIDX for the offset instead.
+		 */
+		struct midxed_git *m;
+
+		for (m = get_midxed_git(the_repository); m; m = m->next) {
+			if (midx_contains_pack(m, p->pack_name)) {
+				if (bsearch_midx(&oid, m, &result))
+					return nth_midxed_offset(m, result);
+
+				break;
+			}
+		}
+
 		if (open_pack_index(p))
 			return 0;
 	}
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 23/23] midx: clear midx on repack
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (21 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
@ 2018-06-07 14:03 ` Derrick Stolee
  2018-06-09 18:13   ` Duy Nguyen
  2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
  24 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:03 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 8 ++++++++
 midx.c           | 8 ++++++++
 midx.h           | 1 +
 3 files changed, 17 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int midx_cleared = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				continue;
 			}
 
+			if (!midx_cleared) {
+				/* if we move a packfile, it will invalidate the midx */
+				clear_midx_file(get_object_directory());
+				midx_cleared = 1;
+			}
+
 			fname_old = mkpathdup("%s/old-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index e46f392fa4..1043c01fa7 100644
--- a/midx.c
+++ b/midx.c
@@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(pack_names);
 	return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+	char *midx = get_midx_filename(object_dir);
+
+	if (remove_path(midx))
+		die(_("failed to clear multi-pack-index at %s"), midx);
+}
diff --git a/midx.h b/midx.h
index 6996b5ff6b..46f9f44c94 100644
--- a/midx.h
+++ b/midx.h
@@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.rc1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (22 preceding siblings ...)
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
@ 2018-06-07 14:06 ` Derrick Stolee
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
  24 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:06 UTC (permalink / raw)
  To: git; +Cc: sbeller, dstolee, avarab, jrnieder, jonathantanmy, mfick

On 6/7/2018 10:03 AM, Derrick Stolee wrote:
> This patch series includes a rewrite of the previous
> multi-pack-index RFC [1] using the feedback from the
> commit-graph feature.

Sorry to everyone who got a duplicate copy of this series. I misspelled 
'kernel.org' and it didn't go to the list.

I also have this series available as a GitHub PR [1]

[1] https://github.com/derrickstolee/git/pull/7


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
                   ` (23 preceding siblings ...)
  2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
@ 2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
  2018-06-07 14:54   ` Derrick Stolee
  24 siblings, 1 reply; 62+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-07 14:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, sbeller, dstolee, jrnieder, jonathantanmy, mfick


On Thu, Jun 07 2018, Derrick Stolee wrote:

> To test the performance in this situation, I created a
> script that organizes the Linux repository in a similar
> fashion. I split the commit history into 50 parts by
> creating branches on every 10,000 commits of the first-
> parent history. Then, `git rev-list --objects A ^B`
> provides the list of objects reachable from A but not B,
> so I could send that to `git pack-objects` to create
> these "time-based" packfiles. With these 50 packfiles
> (deleting the old one from my fresh clone, and deleting
> all tags as they were no longer on-disk) I could then
> test 'git rev-list --objects HEAD^{tree}' and see:
>
>         Before: 0.17s
>         After:  0.13s
>         % Diff: -23.5%
>
> By adding logic to count hits and misses to bsearch_pack,
> I was able to see that the command above calls that
> method 266,930 times with a hit rate of 33%. The MIDX
> has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.

How does this & the numbers below compare to just a naïve
--max-pack-size=<similar size> on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.

> Abbreviation Speedups
> ---------------------
>
> To fully disambiguate an abbreviation, we must iterate
> through all packfiles to ensure no collision exists in
> any packfile. This requires O(P log N) time. With the
> MIDX, this is only O(log N) time. Our standard test [2]
> is 'git log --oneline --parents --raw' because it writes
> many abbreviations while also doing a lot of other work
> (walking commits and trees to compute the raw diff).
>
> For a copy of the Linux repository with 50 packfiles
> split by time, we observed the following:
>
>         Before: 100.5 s
>         After:   58.2 s
>         % Diff: -42.1%
>
>
> Request for Review Attention
> ----------------------------
>
> I tried my best to take the feedback from the commit-graph
> feature and apply it to this feature. I also worked to
> follow the object-store refactoring as I could. I also have
> some local commits that create a 'verify' subcommand and
> integrate with 'fsck' similar to the commit-graph, but I'll
> leave those for a later series (and review is still underway
> for that part of the commit-graph).
>
> One place where I could use some guidance is related to the
> current state of 'the_hash_algo' patches. The file format
> allows a different "hash version" which then indicates the
> length of the hash. What's the best way to ensure this
> feature doesn't cause extra pain in the hash-agnostic series?
> This will inform how I go back and make the commit-graph
> feature better in this area, too.
>
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/23] Multi-pack-index (MIDX)
  2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
@ 2018-06-07 14:54   ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-07 14:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, sbeller, dstolee, jrnieder, jonathantanmy, mfick

On 6/7/2018 10:45 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jun 07 2018, Derrick Stolee wrote:
>
>> To test the performance in this situation, I created a
>> script that organizes the Linux repository in a similar
>> fashion. I split the commit history into 50 parts by
>> creating branches on every 10,000 commits of the first-
>> parent history. Then, `git rev-list --objects A ^B`
>> provides the list of objects reachable from A but not B,
>> so I could send that to `git pack-objects` to create
>> these "time-based" packfiles. With these 50 packfiles
>> (deleting the old one from my fresh clone, and deleting
>> all tags as they were no longer on-disk) I could then
>> test 'git rev-list --objects HEAD^{tree}' and see:
>>
>>          Before: 0.17s
>>          After:  0.13s
>>          % Diff: -23.5%
>>
>> By adding logic to count hits and misses to bsearch_pack,
>> I was able to see that the command above calls that
>> method 266,930 times with a hit rate of 33%. The MIDX
>> has the same number of calls with a 100% hit rate.
> Do you have the script you used for this? It would be very interesting
> as something we could stick in t/perf/ to test this use-case in the
> future.
>
> How does this & the numbers below compare to just a naïve
> --max-pack-size=<similar size> on linux.git?
>
> Is it possible for you to tar this test repo up and share it as a
> one-off? I've been polishing the core.validateAbbrev series I have, and
> it would be interesting to compare some of the (abbrev) numbers.

Here is what I used. You will want to adjust the constants for whatever
repo you are using. This is for the Linux kernel, which has a
first-parent history of ~50,000 commits. It also leaves a bunch of extra
files around, so it is nowhere near ready to be incorporated into the code.

#!/bin/bash

for i in `seq 1 50`
do
         ORDER=$((51 - $i))
         NUM_BACK=$((1000 * ($i - 1)))
         echo creating batch/$ORDER
         git branch -f batch/$ORDER HEAD~$NUM_BACK
         echo batch/$ORDER
         git rev-parse batch/$ORDER
done

lastbranch=""
for i in `seq 1 50`
do
         branch=batch/$i
         # the first batch has no predecessor, so take everything reachable
         if [ -z "$lastbranch" ]
         then
                 echo "$branch"
                 git rev-list --objects $branch | sed 's/ .*//' >objects-$i.txt
         else
                 echo "$lastbranch"
                 echo "$branch"
                 git rev-list --objects $branch ^$lastbranch | sed 's/ .*//' >objects-$i.txt
         fi

         git pack-objects --no-reuse-delta .git/objects/pack/branch-split2 <objects-$i.txt
         lastbranch=$branch
done


for tag in `git tag --list`
do
         git tag -d $tag
done

rm -rf .git/objects/pack/pack-*
git midx write


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
@ 2018-06-07 17:20   ` Duy Nguyen
  2018-06-18 19:23     ` Derrick Stolee
  2018-06-11 21:02   ` Stefan Beller
  1 sibling, 1 reply; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> new file mode 100644
> index 0000000000..2bd886f1a2
> --- /dev/null
> +++ b/Documentation/git-midx.txt
> @@ -0,0 +1,29 @@
> +git-midx(1)
> +============
> +
> +NAME
> +----
> +git-midx - Write and verify multi-pack-indexes (MIDX files).

No full stop. This head line is collected automatically with others,
and its having a full stop while the rest do not looks strange.

> diff --git a/builtin/midx.c b/builtin/midx.c
> new file mode 100644
> index 0000000000..59ea92178f
> --- /dev/null
> +++ b/builtin/midx.c
> @@ -0,0 +1,38 @@
> +#include "builtin.h"
> +#include "cache.h"
> +#include "config.h"
> +#include "git-compat-util.h"

You only need either cache.h or git-compat-util.h. If cache.h is here,
git-compat-util can be removed.

> +#include "parse-options.h"
> +
> +static char const * const builtin_midx_usage[] ={
> +       N_("git midx [--object-dir <dir>]"),
> +       NULL
> +};
> +
> +static struct opts_midx {
> +       const char *object_dir;
> +} opts;
> +
> +int cmd_midx(int argc, const char **argv, const char *prefix)
> +{
> +       static struct option builtin_midx_options[] = {
> +               { OPTION_STRING, 0, "object-dir", &opts.object_dir,

For paths (including dir), OPTION_FILENAME may be a better option to
handle correctly when the command is run in a subdir. See df217ed643
(parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23)
for more info.

> +                 N_("dir"),
> +                 N_("The object directory containing set of packfile and pack-index pairs.") },

Other help strings do not have a full stop either (I only checked a
couple of commands, though).

Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME
for some reason)?

> +               OPT_END(),
> +       };
> +
> +       if (argc == 2 && !strcmp(argv[1], "-h"))
> +               usage_with_options(builtin_midx_usage, builtin_midx_options);
> +
> +       git_config(git_default_config, NULL);
> +
> +       argc = parse_options(argc, argv, prefix,
> +                            builtin_midx_options,
> +                            builtin_midx_usage, 0);
> +
> +       if (!opts.object_dir)
> +               opts.object_dir = get_object_directory();
> +
> +       return 0;
> +}

> diff --git a/git.c b/git.c
> index c2f48d53dd..400fadd677 100644
> --- a/git.c
> +++ b/git.c
> @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
>         { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>         { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>         { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
> +       { "midx", cmd_midx, RUN_SETUP },

If it's a plumbing command and can take an --object-dir, then I don't
think you should require it to run in a repo (with RUN_SETUP).
RUN_SETUP_GENTLY may be better. You could even leave it empty here and
call setup_git_directory() only when --object-dir is not set.

>         { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
>         { "mktree", cmd_mktree, RUN_SETUP },
>         { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/23] midx: add 'write' subcommand and basic wiring
  2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
@ 2018-06-07 17:27   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 59ea92178f..dc0a5acd3f 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -3,9 +3,10 @@
>  #include "config.h"
>  #include "git-compat-util.h"
>  #include "parse-options.h"
> +#include "midx.h"
>
>  static char const * const builtin_midx_usage[] ={
> -       N_("git midx [--object-dir <dir>]"),
> +       N_("git midx [--object-dir <dir>] [write]"),
>         NULL
>  };
>
> @@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
>         if (!opts.object_dir)
>                 opts.object_dir = get_object_directory();
>
> +       if (argc == 0)
> +               return 0;

Isn't it better to die here when no verb is given? I don't see any
good use case for running a no-op "git midx" without verbs. It's more
likely a mistake (e.g. "git midx $foo" where foo happens to be empty)
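
Something along these lines, as a standalone sketch: write_stub() and
dispatch_verb() are made-up names, and the real builtin would more likely
call usage_with_options() or die() than return -1.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for write_midx_file(); always succeeds here. */
static int write_stub(const char *object_dir)
{
	(void)object_dir;
	return 0;
}

/*
 * Dispatch on the verb left in argv[0] after option parsing.
 * Returns the verb's exit code, or -1 (after printing a message)
 * when the verb is missing or unknown, instead of silently
 * reporting success.
 */
static int dispatch_verb(int argc, const char **argv, const char *object_dir)
{
	if (argc == 0) {
		fprintf(stderr, "error: no verb given\n");
		return -1;
	}
	if (!strcmp(argv[0], "write"))
		return write_stub(object_dir);

	fprintf(stderr, "error: unknown verb '%s'\n", argv[0]);
	return -1;
}
```

That way an unknown verb is also rejected, not just a missing one.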

> +
> +       if (!strcmp(argv[0], "write"))
> +               return write_midx_file(opts.object_dir);
> +
>         return 0;
>  }
> diff --git a/midx.c b/midx.c
> new file mode 100644
> index 0000000000..616af66b13
> --- /dev/null
> +++ b/midx.c
> @@ -0,0 +1,9 @@
> +#include "git-compat-util.h"
> +#include "cache.h"

Only one of the two is needed

> +#include "dir.h"

Not needed yet. It's better to include it in the patch that actually needs it.

> +#include "midx.h"
> +
> +int write_midx_file(const char *object_dir)
> +{
> +       return 0;
> +}
> diff --git a/midx.h b/midx.h
> new file mode 100644
> index 0000000000..3a63673952
> --- /dev/null
> +++ b/midx.h
> @@ -0,0 +1,4 @@
> +#include "cache.h"
> +#include "packfile.h"

These includes are not needed, at least not now. And please protect
the header file with #ifndef __MIDX_H__ .. #endif.

> +
> +int write_midx_file(const char *object_dir);
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> new file mode 100755
> index 0000000000..a590137af7
> --- /dev/null
> +++ b/t/t5319-midx.sh
> @@ -0,0 +1,10 @@
> +#!/bin/sh
> +
> +test_description='multi-pack-indexes'
> +. ./test-lib.sh
> +
> +test_expect_success 'write midx with no pakcs' '

no packs


> +       git midx --object-dir=. write
> +'
> +
> +test_done
> --
> 2.18.0.rc1
>



-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
@ 2018-06-07 17:35   ` Duy Nguyen
  2018-06-12 15:00   ` Duy Nguyen
  1 sibling, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> +static char *get_midx_filename(const char *object_dir)
> +{
> +       struct strbuf midx_name = STRBUF_INIT;
> +       strbuf_addstr(&midx_name, object_dir);
> +       strbuf_addstr(&midx_name, "/pack/multi-pack-index");
> +       return strbuf_detach(&midx_name, NULL);
> +}

I think this whole function can be written as
xstrfmt("%s/pack/multi-pack-index", object_dir);
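
For anyone reading along outside of git.git, here is a rough standalone
approximation of what that one-liner does, using plain snprintf() and
malloc(). The function name and the NULL-on-failure behaviour are
assumptions made for this sketch; xstrfmt() itself dies on allocation
failure.

```c
#include <stdio.h>
#include <stdlib.h>

#define MIDX_PATH_FMT "%s/pack/multi-pack-index"

/*
 * Standalone approximation of the xstrfmt() one-liner: format the
 * path into freshly allocated memory that the caller must free().
 * Returns NULL on formatting or allocation failure.
 */
static char *midx_filename_sketch(const char *object_dir)
{
	/* first pass measures the formatted length, without the NUL */
	int len = snprintf(NULL, 0, MIDX_PATH_FMT, object_dir);
	char *buf;

	if (len < 0)
		return NULL;
	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;
	snprintf(buf, (size_t)len + 1, MIDX_PATH_FMT, object_dir);
	return buf;
}
```

Either way the caller owns the returned string and has to free it.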

> +
> +static size_t write_midx_header(struct hashfile *f,
> +                               unsigned char num_chunks,
> +                               uint32_t num_packs)
> +{
> +       char byte_values[4];

unsigned char just to be on the safe side? Whether plain 'char' is
signed is implementation-defined (it is unsigned on ARM, if I remember
correctly, and signed on x86).
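
Where the signedness bites is usually when such bytes are read back or
compared. A small standalone sketch of the kind of code that stays correct
only with 'unsigned char' (be32_from_bytes() is a made-up name for this
example):

```c
#include <stdint.h>

/*
 * Assemble a big-endian 32-bit value from four bytes. Reading the
 * bytes through 'unsigned char' keeps values >= 0x80 from being
 * sign-extended on platforms where plain 'char' is signed.
 */
static uint32_t be32_from_bytes(const unsigned char *b)
{
	return ((uint32_t)b[0] << 24) |
	       ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) |
	       (uint32_t)b[3];
}
```

With a signed 'char' and no cast, a byte such as 0xff would be promoted to
a negative int before the shift.
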
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
@ 2018-06-07 17:54   ` Duy Nguyen
  2018-06-20 13:13     ` Derrick Stolee
  2018-06-07 18:31   ` Duy Nguyen
  1 sibling, 1 reply; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 17:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> As we build the multi-pack-index feature by adding chunks at a time,
> we want to test that the data is being written correctly.
>
> Create struct midxed_git to store an in-memory representation of a

A word play on 'packed_git'? Amusing. Some more descriptive name would
be better though. midxed looks almost like random letters thrown
together.

> multi-pack-index and a memory-map of the binary file. Initialize this
> struct in load_midxed_git(object_dir).

> +static int read_midx_file(const char *object_dir)
> +{
> +       struct midxed_git *m = load_midxed_git(object_dir);
> +
> +       if (!m)
> +               return 0;

This looks like an error case, please don't just return zero,
typically used to say "success". I don't know if this command stays
"for debugging purposes" until the end. Of course in that case it does
not really matter.

> +struct midxed_git *load_midxed_git(const char *object_dir)
> +{
> +       struct midxed_git *m;
> +       int fd;
> +       struct stat st;
> +       size_t midx_size;
> +       void *midx_map;
> +       const char *midx_name = get_midx_filename(object_dir);

mem leak? This function returns allocated memory if I remember correctly.

> +
> +       fd = git_open(midx_name);
> +       if (fd < 0)
> +               return NULL;

do an error_errno() so we know what went wrong at least.

> +       if (fstat(fd, &st)) {
> +               close(fd);
> +               return NULL;

same here, we should know why fstat() fails.

> +       }
> +       midx_size = xsize_t(st.st_size);
> +
> +       if (midx_size < MIDX_MIN_SIZE) {
> +               close(fd);
> +               die("multi-pack-index file %s is too small", midx_name);

_()

The use of die() should be discouraged though. Many people still try
(or wish) to libify code and new die() does not help. I think error()
here would be enough then you can return NULL. Or you can go fancier
and store the error string in a strbuf like refs code.

> +       }
> +
> +       midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +       m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> +       strcpy(m->object_dir, object_dir);
> +       m->data = midx_map;
> +
> +       m->signature = get_be32(m->data);
> +       if (m->signature != MIDX_SIGNATURE) {
> +               error("multi-pack-index signature %X does not match signature %X",
> +                     m->signature, MIDX_SIGNATURE);

_(). Maybe 0x%08x instead of %x

> +               goto cleanup_fail;
> +       }
> +
> +       m->version = *(m->data + 4);

m->data[4] instead? shorter and easier to understand.

Same comment on "*(m->data + x)" and error() without _() for the rest.

> +       if (m->version != MIDX_VERSION) {
> +               error("multi-pack-index version %d not recognized",
> +                     m->version);

_()

> +               goto cleanup_fail;
> +       }
> +
> +       m->hash_version = *(m->data + 5);

m->data[5]

> +cleanup_fail:
> +       FREE_AND_NULL(m);
> +       munmap(midx_map, midx_size);
> +       close(fd);
> +       exit(1);

It's bad enough that you die() but exit() in this code seems too much.
Please just return NULL and let the caller handle the error.
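
To make the suggestion concrete, here is a standalone sketch of header
parsing that reports failure to the caller instead of calling die() or
exit(). The struct, the function name and all three constants are
hypothetical and chosen only for this example; the real values live in
midx.c. Only the byte positions (signature in bytes 0-3, version in byte 4,
hash version in byte 5) are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical values for illustration only. */
#define SKETCH_SIGNATURE 0x4d494458u /* the bytes "MIDX" */
#define SKETCH_VERSION 1
#define SKETCH_MIN_SIZE 12

struct header_sketch {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
};

/*
 * Parse the leading bytes of a mapped file into 'out'. On failure,
 * write a message into 'err' and return -1 so that the caller can
 * decide whether to warn, fall back to the .idx files, or abort.
 */
static int parse_header_sketch(const unsigned char *data, size_t len,
			       struct header_sketch *out,
			       char *err, size_t errlen)
{
	if (len < SKETCH_MIN_SIZE) {
		snprintf(err, errlen, "file is too small");
		return -1;
	}
	out->signature = ((uint32_t)data[0] << 24) |
			 ((uint32_t)data[1] << 16) |
			 ((uint32_t)data[2] << 8) |
			 (uint32_t)data[3];
	if (out->signature != SKETCH_SIGNATURE) {
		snprintf(err, errlen, "signature 0x%08x does not match 0x%08x",
			 (unsigned)out->signature, (unsigned)SKETCH_SIGNATURE);
		return -1;
	}
	out->version = data[4];
	if (out->version != SKETCH_VERSION) {
		snprintf(err, errlen, "version %d not recognized",
			 out->version);
		return -1;
	}
	out->hash_version = data[5];
	return 0;
}
```

A caller that gets -1 back can print the message with error() and carry on
without the multi-pack-index.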

> diff --git a/midx.h b/midx.h
> index 3a63673952..a1d18ed991 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -1,4 +1,13 @@
> +#ifndef MIDX_H
> +#define MIDX_H
> +
> +#include "git-compat-util.h"
>  #include "cache.h"
> +#include "object-store.h"

I don't really think you need object-store here (git-compat-util.h
too). "struct midxed_git;" would be enough for load_midxed_git
declaration below.

>  #include "packfile.h"
>
> +struct midxed_git *load_midxed_git(const char *object_dir);
> +
>  int write_midx_file(const char *object_dir);
> +
> +#endif
> diff --git a/object-store.h b/object-store.h
> index d683112fd7..77cb82621a 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -84,6 +84,25 @@ struct packed_git {
>         char pack_name[FLEX_ARRAY]; /* more */
>  };
>
> +struct midxed_git {
> +       struct midxed_git *next;

Do we really have multiple midx files?

> +
> +       int fd;
> +
> +       const unsigned char *data;
> +       size_t data_len;
> +
> +       uint32_t signature;
> +       unsigned char version;
> +       unsigned char hash_version;
> +       unsigned char hash_len;
> +       unsigned char num_chunks;
> +       uint32_t num_packs;
> +       uint32_t num_objects;
> +
> +       char object_dir[FLEX_ARRAY];

Why do you need to keep object_dir when it could be easily retrieved
when the repo is available?

> +};
> +
>  struct raw_object_store {
>         /*
>          * Path to the repository's object store.
-- 
Duy

* Re: [PATCH 08/23] midx: read packfiles from pack directory
  2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
@ 2018-06-07 18:03   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> @@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
>                           midx_name);
>         }
>
> +       strbuf_addf(&pack_dir, "%s/pack", object_dir);
> +       dir = opendir(pack_dir.buf);
> +
> +       if (!dir) {
> +               error_errno("unable to open pack directory: %s",
> +                           pack_dir.buf);

_()

> +               strbuf_release(&pack_dir);
> +               return 1;
> +       }
> +
> +       strbuf_addch(&pack_dir, '/');
> +       pack_dir_len = pack_dir.len;
> +       ALLOC_ARRAY(packs, alloc_packs);
> +       while ((de = readdir(dir)) != NULL) {
> +               if (is_dot_or_dotdot(de->d_name))
> +                       continue;
> +
> +               if (ends_with(de->d_name, ".idx")) {
> +                       ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
> +
> +                       strbuf_setlen(&pack_dir, pack_dir_len);
> +                       strbuf_addstr(&pack_dir, de->d_name);
> +
> +                       packs[nr_packs] = add_packed_git(pack_dir.buf,
> +                                                        pack_dir.len,
> +                                                        0);
> +                       if (!packs[nr_packs])
> +                               warning("failed to add packfile '%s'",
> +                                       pack_dir.buf);
> +                       else
> +                               nr_packs++;
> +               }
> +       }
> +       closedir(dir);
> +       strbuf_release(&pack_dir);

Can we refactor and share this scanning-for-packs code with
packfile.c? I'm pretty sure it does something similar in there.

> -       write_midx_header(f, num_chunks, num_packs);
> +       write_midx_header(f, num_chunks, nr_packs);

Hmm.. could have stuck to one name from the beginning...
-- 
Duy

* Re: [PATCH 09/23] midx: write pack names in chunk
  2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
@ 2018-06-07 18:26   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         m->num_chunks = *(m->data + 6);
>         m->num_packs = get_be32(m->data + 8);
>
> +       for (i = 0; i < m->num_chunks; i++) {
> +               uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
> +               uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);

Would be good to reduce magic numbers like 12 and 16, I think you have
some header length constants for those already.
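
A rough sketch of what that could look like, with the offsets derived from named constants instead of 12 and 16. The SKETCH_* names and helpers are hypothetical; the 12-byte header size is an assumption read off the "m->data + 12" in the quoted hunk, and the big-endian readers are local stand-ins for get_be32()/get_be64().

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEADER_SIZE 12      /* assumed: bytes before the chunk table */
#define SKETCH_CHUNK_ID_WIDTH 4    /* hypothetical name: 4-byte chunk id */
#define SKETCH_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))

uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

/* Read entry i of the chunk lookup table that follows the header. */
void sketch_read_chunk_entry(const unsigned char *data, size_t i,
			     uint32_t *chunk_id, uint64_t *chunk_offset)
{
	const unsigned char *entry =
		data + SKETCH_HEADER_SIZE + SKETCH_CHUNKLOOKUP_WIDTH * i;

	*chunk_id = sketch_get_be32(entry);
	*chunk_offset = sketch_get_be64(entry + SKETCH_CHUNK_ID_WIDTH);
}
```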

> +               switch (chunk_id) {
> +                       case MIDX_CHUNKID_PACKNAMES:
> +                               m->chunk_pack_names = m->data + chunk_offset;
> +                               break;
> +
> +                       case 0:
> +                               die("terminating MIDX chunk id appears earlier than expected");

_()

> +                               break;
> +
> +                       default:
> +                               /*
> +                                * Do nothing on unrecognized chunks, allowing future
> +                                * extensions to add optional chunks.
> +                                */

I wrote about the chunk term reminding me of the PNG format, then
deleted it. But it may help to do something similar to PNG here: the
first letter can tell us whether the chunk is optional and can be
safely ignored. E.g. a chunk whose id starts with an uppercase letter
cannot be ignored, while lowercase ones can be skipped freely.
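
Roughly what I have in mind, as a sketch only. Nothing in the proposed MIDX format defines this; the function names are made up, and the rule is borrowed from PNG, where bit 5 of the first byte of a chunk type (the ASCII case bit) marks a chunk as ancillary.

```c
#include <stdint.h>

/*
 * PNG-style convention (hypothetical for MIDX): bit 5 of the first
 * byte of the chunk id distinguishes lowercase from uppercase ASCII.
 * Lowercase first letter: optional chunk. Uppercase: required chunk.
 */
int sketch_chunk_is_optional(uint32_t chunk_id)
{
	unsigned char first = (unsigned char)(chunk_id >> 24);

	return (first & 0x20) != 0;
}

/* Return 0 if an unrecognized chunk may be skipped, -1 if it must be rejected. */
int sketch_check_unknown_chunk(uint32_t chunk_id)
{
	return sketch_chunk_is_optional(chunk_id) ? 0 : -1;
}
```

With something like this, the "default:" case could reject an unknown required chunk instead of silently ignoring every unknown id.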

> +                               break;
> +               }
> +       }
> +
> +       if (!m->chunk_pack_names)
> +               die("MIDX missing required pack-name chunk");

_()

> +
>         return m;
>
>  cleanup_fail:
> @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
>         return MIDX_HEADER_SIZE;
>  }
>
> +struct pack_pair {
> +       uint32_t pack_int_id;

can this be just pack_id?

> +       char *pack_name;
> +};
> +
> +static int pack_pair_compare(const void *_a, const void *_b)
> +{
> +       struct pack_pair *a = (struct pack_pair *)_a;
> +       struct pack_pair *b = (struct pack_pair *)_b;
> +       return strcmp(a->pack_name, b->pack_name);
> +}
> +
> +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
> +{
> +       uint32_t i;
> +       struct pack_pair *pairs;
> +
> +       ALLOC_ARRAY(pairs, nr_packs);
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               pairs[i].pack_int_id = i;
> +               pairs[i].pack_name = pack_names[i];
> +       }
> +
> +       QSORT(pairs, nr_packs, pack_pair_compare);
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               pack_names[i] = pairs[i].pack_name;
> +               perm[pairs[i].pack_int_id] = i;
> +       }

pairs[] is leaked?
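
For reference, a self-contained sketch of the same routine with the temporary array released. It uses plain malloc()/qsort() rather than ALLOC_ARRAY/QSORT so it stands alone, and the sketch_* names are hypothetical. The comparator also keeps "const" on its casts.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct sketch_pack_pair {
	uint32_t pack_id;
	char *pack_name;
};

int sketch_pack_pair_compare(const void *_a, const void *_b)
{
	const struct sketch_pack_pair *a = _a;
	const struct sketch_pack_pair *b = _b;

	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort pack_names in place and record in perm[] where each original
 * index ended up. Returns 0 on success, -1 if allocation fails.
 */
int sketch_sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
{
	uint32_t i;
	struct sketch_pack_pair *pairs;

	if (!nr_packs)
		return 0;

	pairs = malloc(nr_packs * sizeof(*pairs));
	if (!pairs)
		return -1;

	for (i = 0; i < nr_packs; i++) {
		pairs[i].pack_id = i;
		pairs[i].pack_name = pack_names[i];
	}

	qsort(pairs, nr_packs, sizeof(*pairs), sketch_pack_pair_compare);

	for (i = 0; i < nr_packs; i++) {
		pack_names[i] = pairs[i].pack_name;
		perm[pairs[i].pack_id] = i;
	}

	free(pairs); /* the temporary array is no longer needed */
	return 0;
}
```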

> +}
> +
> +static size_t write_midx_pack_names(struct hashfile *f,
> +                                   char **pack_names,
> +                                   uint32_t num_packs)
> +{
> +       uint32_t i;
> +       unsigned char padding[MIDX_CHUNK_ALIGNMENT];
> +       size_t written = 0;
> +
> +       for (i = 0; i < num_packs; i++) {
> +               size_t writelen = strlen(pack_names[i]) + 1;
> +
> +               if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
> +                       BUG("incorrect pack-file order: %s before %s",
> +                           pack_names[i - 1],
> +                           pack_names[i]);
> +
> +               hashwrite(f, pack_names[i], writelen);
> +               written += writelen;

Side note: this pattern happens a lot. It may be a good idea to make
hashwrite() return writelen so we can just write

written += hashwrite(f, ..., writelen);
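
A small sketch of the idea, using a hypothetical in-memory sink rather than struct hashfile (hashwrite() itself does not return a length today, which is the point of the suggestion). All names here are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory sink standing in for struct hashfile. */
struct sketch_sink {
	unsigned char buf[256];
	size_t len;
};

/* Append len bytes and return len, so callers can accumulate directly. */
size_t sketch_write(struct sketch_sink *s, const void *data, size_t len)
{
	if (len > sizeof(s->buf) - s->len)
		return 0; /* sketch only: refuse writes that do not fit */
	memcpy(s->buf + s->len, data, len);
	s->len += len;
	return len;
}

/* Write each name with its terminating NUL; return total bytes written. */
size_t sketch_write_names(struct sketch_sink *s, const char **names, size_t nr)
{
	size_t i, written = 0;

	for (i = 0; i < nr; i++)
		written += sketch_write(s, names[i], strlen(names[i]) + 1);

	return written;
}
```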

> +       }
> +
> +       /* add padding to be aligned */
> +       i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
> +       if (i < MIDX_CHUNK_ALIGNMENT) {
> +               bzero(padding, sizeof(padding));
> +               hashwrite(f, padding, i);
> +               written += i;
> +       }
> +
> +       return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
> -       unsigned char num_chunks = 0;
> +       unsigned char cur_chunk, num_chunks = 0;
>         char *midx_name;
>         struct hashfile *f;
>         struct lock_file lk;
>         struct packed_git **packs = NULL;
> +       char **pack_names = NULL;
> +       uint32_t *pack_perm;
>         uint32_t i, nr_packs = 0, alloc_packs = 0;
> +       uint32_t alloc_pack_names = 0;
>         DIR *dir;
>         struct dirent *de;
>         struct strbuf pack_dir = STRBUF_INIT;
>         size_t pack_dir_len;
> +       uint64_t pack_name_concat_len = 0;
> +       uint64_t written = 0;
> +       uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> +       uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];

This long list of local vars may be a good indicator that this
function needs to be split up into smaller ones.

>
>         midx_name = get_midx_filename(object_dir);
>         if (safe_create_leading_directories(midx_name)) {
> @@ -132,12 +233,14 @@ int write_midx_file(const char *object_dir)
>         strbuf_addch(&pack_dir, '/');
>         pack_dir_len = pack_dir.len;
>         ALLOC_ARRAY(packs, alloc_packs);
> +       ALLOC_ARRAY(pack_names, alloc_pack_names);
>         while ((de = readdir(dir)) != NULL) {
>                 if (is_dot_or_dotdot(de->d_name))
>                         continue;
>
>                 if (ends_with(de->d_name, ".idx")) {
>                         ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
> +                       ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
>
>                         strbuf_setlen(&pack_dir, pack_dir_len);
>                         strbuf_addstr(&pack_dir, de->d_name);
> @@ -145,21 +248,83 @@ int write_midx_file(const char *object_dir)
>                         packs[nr_packs] = add_packed_git(pack_dir.buf,
>                                                          pack_dir.len,
>                                                          0);
> -                       if (!packs[nr_packs])
> +                       if (!packs[nr_packs]) {
>                                 warning("failed to add packfile '%s'",
>                                         pack_dir.buf);
> -                       else
> -                               nr_packs++;
> +                               continue;
> +                       }
> +
> +                       pack_names[nr_packs] = xstrdup(de->d_name);
> +                       pack_name_concat_len += strlen(de->d_name) + 1;
> +                       nr_packs++;
>                 }
>         }
> +
>         closedir(dir);
>         strbuf_release(&pack_dir);
>
> +       if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
> +               pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
> +                                       (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
> +
> +       ALLOC_ARRAY(pack_perm, nr_packs);
> +       sort_packs_by_name(pack_names, nr_packs, pack_perm);
> +
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
>
> -       write_midx_header(f, num_chunks, nr_packs);
> +       cur_chunk = 0;
> +       num_chunks = 1;
> +
> +       written = write_midx_header(f, num_chunks, nr_packs);
> +
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
> +
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = 0;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
> +
> +       for (i = 0; i <= num_chunks; i++) {
> +               if (i && chunk_offsets[i] < chunk_offsets[i - 1])
> +                       BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
> +                           chunk_offsets[i - 1],
> +                           chunk_offsets[i]);
> +
> +               if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
> +                       BUG("chunk offset %"PRIu64" is not properly aligned",
> +                           chunk_offsets[i]);
> +
> +               hashwrite_be32(f, chunk_ids[i]);
> +               hashwrite_be32(f, chunk_offsets[i] >> 32);
> +               hashwrite_be32(f, chunk_offsets[i]);
> +
> +               written += MIDX_CHUNKLOOKUP_WIDTH;
> +       }
> +
> +       for (i = 0; i < num_chunks; i++) {
> +               if (written != chunk_offsets[i])
> +                       BUG("inccrrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,

incorrect

> +                           chunk_offsets[i],
> +                           written,
> +                           chunk_ids[i]);
> +
> +               switch (chunk_ids[i]) {
> +                       case MIDX_CHUNKID_PACKNAMES:
> +                               written += write_midx_pack_names(f, pack_names, nr_packs);
> +                               break;
> +
> +                       default:
> +                               BUG("trying to write unknown chunk id %"PRIx32,
> +                                   chunk_ids[i]);
> +               }
> +       }
> +
> +       if (written != chunk_offsets[num_chunks])
> +               BUG("incorrect final offset %"PRIu64" != %"PRIu64,
> +                   written,
> +                   chunk_offsets[num_chunks]);
>
>         finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>         commit_lock_file(&lk);
> @@ -170,5 +335,6 @@ int write_midx_file(const char *object_dir)
>         }
>
>         FREE_AND_NULL(packs);
> +       FREE_AND_NULL(pack_names);

What about the strings in this array? I think they are xstrdup() but I
didn't spot them being freed.

And maybe just use string_list...
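
Without string_list, the cleanup would need to look something like the sketch below: free each duplicated string, then the array. The helpers are hypothetical and use plain malloc()/free() so the example stands alone; sketch_strdup() is only a stand-in for xstrdup().

```c
#include <stdlib.h>
#include <string.h>

/* Duplicate src into a freshly allocated string; NULL on failure. */
char *sketch_strdup(const char *src)
{
	size_t len = strlen(src) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, src, len);
	return copy;
}

/* Free every string the array owns, then the array itself. */
void sketch_free_names(char **names, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		free(names[i]);
	free(names);
}
```

A string_list that owns its strings would fold both steps into a single clear call, which is why it may be the simpler choice here.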

>         return 0;
>  }
-- 
Duy

* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
  2018-06-07 17:54   ` Duy Nguyen
@ 2018-06-07 18:31   ` Duy Nguyen
  2018-06-20 13:33     ` Derrick Stolee
  1 sibling, 1 reply; 62+ messages in thread
From: Duy Nguyen @ 2018-06-07 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> index dcaeb1a91b..919283fdd8 100644
> --- a/Documentation/git-midx.txt
> +++ b/Documentation/git-midx.txt
> @@ -23,6 +23,11 @@ OPTIONS
>         <dir>/packs/multi-pack-index for the current MIDX file, and
>         <dir>/packs for the pack-files to index.
>
> +read::
> +       When given as the verb, read the current MIDX file and output
> +       basic information about its contents. Used for debugging
> +       purposes only.

On second thought, if you just need a temporary debugging interface,
adding a program in t/helper may be a better option. In the end we
might still need 'read' to dump a file out, but then we should have a
stable output format (and json might be a good choice).

That's it I'm done for today. I will continue on the rest some day, hopefully.
-- 
Duy

* Re: [PATCH 10/23] midx: write a lookup into the pack names chunk
  2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
@ 2018-06-09 16:43   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 16:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt |  5 +++
>  builtin/midx.c                          |  7 ++++
>  midx.c                                  | 56 +++++++++++++++++++++++--
>  object-store.h                          |  2 +
>  t/t5319-midx.sh                         | 11 +++--
>  5 files changed, 75 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 2b37be7b33..29bf87283a 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -296,6 +296,11 @@ CHUNK LOOKUP:
>
>  CHUNK DATA:
>
> +       Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
> +           P * 4 bytes storing the offset in the packfile name chunk for
> +           the null-terminated string containing the filename for the
> +           ith packfile.
> +

Commit message is too light on this one. Why does this need to be
stored? Isn't the cost of rebuilding this in-core cheap?

Adding this chunk on disk in my opinion only adds more burden. Now you
have to verify that these offsets actually point to the right place.
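
To illustrate why I think rebuilding is cheap: one pass over the pack-names chunk is enough to recover a pointer to every name. This is only a sketch under the assumption that the chunk is a run of NUL-terminated strings, as the format text says; the function name and signature are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a chunk of concatenated NUL-terminated names and record a
 * pointer to the start of each one. Returns 0 on success, -1 if the
 * chunk ends before num_packs terminated names were found.
 */
int sketch_index_pack_names(const char *chunk, size_t chunk_len,
			    uint32_t num_packs, const char **pack_names)
{
	const char *cur = chunk;
	const char *end = chunk + chunk_len;
	uint32_t i;

	for (i = 0; i < num_packs; i++) {
		const char *nul;

		if (cur >= end)
			return -1;
		nul = memchr(cur, '\0', (size_t)(end - cur));
		if (!nul)
			return -1;

		pack_names[i] = cur;
		cur = nul + 1;
	}
	return 0;
}
```

The walk is linear in the size of the names chunk, and it bounds-checks as it goes, so there are no stored offsets left to validate.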

>         Packfile Names (ID: {'P', 'N', 'A', 'M'})
>             Stores the packfile names as concatenated, null-terminated strings.
>             Packfiles must be listed in lexicographic order for fast lookups by
> diff --git a/builtin/midx.c b/builtin/midx.c
> index fe56560853..3a261e9bbf 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -16,6 +16,7 @@ static struct opts_midx {
>
>  static int read_midx_file(const char *object_dir)
>  {
> +       uint32_t i;
>         struct midxed_git *m = load_midxed_git(object_dir);
>
>         if (!m)
> @@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
>
>         printf("chunks:");
>
> +       if (m->chunk_pack_lookup)
> +               printf(" pack_lookup");
>         if (m->chunk_pack_names)
>                 printf(" pack_names");
>
>         printf("\n");
>
> +       printf("packs:\n");
> +       for (i = 0; i < m->num_packs; i++)
> +               printf("%s\n", m->pack_names[i]);
> +
>         printf("object_dir: %s\n", m->object_dir);
>
>         return 0;
> diff --git a/midx.c b/midx.c
> index d4f4a01a51..923acda72e 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -13,8 +13,9 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 1
> +#define MIDX_MAX_CHUNKS 2
>  #define MIDX_CHUNK_ALIGNMENT 4
> +#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
> @@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
>
>                 switch (chunk_id) {
> +                       case MIDX_CHUNKID_PACKLOOKUP:
> +                               m->chunk_pack_lookup = (uint32_t *)(m->data + chunk_offset);
> +                               break;
> +
>                         case MIDX_CHUNKID_PACKNAMES:
>                                 m->chunk_pack_names = m->data + chunk_offset;
>                                 break;
> @@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 }
>         }
>
> +       if (!m->chunk_pack_lookup)
> +               die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
>
> +       m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +       for (i = 0; i < m->num_packs; i++) {
> +               if (i) {
> +                       if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
> +                               error("MIDX pack lookup value %d before %d",
> +                                     ntohl(m->chunk_pack_lookup[i - 1]),
> +                                     ntohl(m->chunk_pack_lookup[i]));
> +                               goto cleanup_fail;
> +                       }
> +               }
> +
> +               m->pack_names[i] = (const char *)(m->chunk_pack_names + ntohl(m->chunk_pack_lookup[i]));
> +
> +               if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
> +                       error("MIDX pack names out of order: '%s' before '%s'",
> +                             m->pack_names[i - 1],
> +                             m->pack_names[i]);
> +                       goto cleanup_fail;
> +               }
> +       }
> +
>         return m;
>
>  cleanup_fail:
> @@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
>         }
>  }
>
> +static size_t write_midx_pack_lookup(struct hashfile *f,
> +                                    char **pack_names,
> +                                    uint32_t nr_packs)
> +{
> +       uint32_t i, cur_len = 0;
> +
> +       for (i = 0; i < nr_packs; i++) {
> +               hashwrite_be32(f, cur_len);
> +               cur_len += strlen(pack_names[i]) + 1;
> +       }
> +
> +       return sizeof(uint32_t) * (size_t)nr_packs;
> +}
> +
>  static size_t write_midx_pack_names(struct hashfile *f,
>                                     char **pack_names,
>                                     uint32_t num_packs)
> @@ -275,13 +317,17 @@ int write_midx_file(const char *object_dir)
>         FREE_AND_NULL(midx_name);
>
>         cur_chunk = 0;
> -       num_chunks = 1;
> +       num_chunks = 2;
>
>         written = write_midx_header(f, num_chunks, nr_packs);
>
> -       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKLOOKUP;
>         chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
> +
>         cur_chunk++;
>         chunk_ids[cur_chunk] = 0;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
> @@ -311,6 +357,10 @@ int write_midx_file(const char *object_dir)
>                             chunk_ids[i]);
>
>                 switch (chunk_ids[i]) {
> +                       case MIDX_CHUNKID_PACKLOOKUP:
> +                               written += write_midx_pack_lookup(f, pack_names, nr_packs);
> +                               break;
> +
>                         case MIDX_CHUNKID_PACKNAMES:
>                                 written += write_midx_pack_names(f, pack_names, nr_packs);
>                                 break;
> diff --git a/object-store.h b/object-store.h
> index 199cf4bd44..1ba50459ca 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -100,8 +100,10 @@ struct midxed_git {
>         uint32_t num_packs;
>         uint32_t num_objects;
>
> +       const uint32_t *chunk_pack_lookup;
>         const unsigned char *chunk_pack_names;
>
> +       const char **pack_names;
>         char object_dir[FLEX_ARRAY];
>  };
>
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> index fdf4f84a90..a31c387c8f 100755
> --- a/t/t5319-midx.sh
> +++ b/t/t5319-midx.sh
> @@ -6,10 +6,15 @@ test_description='multi-pack-indexes'
>  midx_read_expect() {
>         NUM_PACKS=$1
>         cat >expect <<- EOF
> -       header: 4d494458 1 1 1 $NUM_PACKS
> -       chunks: pack_names
> -       object_dir: .
> +       header: 4d494458 1 1 2 $NUM_PACKS
> +       chunks: pack_lookup pack_names
> +       packs:
>         EOF
> +       if [ $NUM_PACKS -ge 1 ]
> +       then
> +               ls pack/ | grep idx | sort >> expect
> +       fi
> +       printf "object_dir: .\n" >>expect &&
>         git midx read --object-dir=. >actual &&
>         test_cmp expect actual
>  }
> --
> 2.18.0.rc1
>


-- 
Duy

* Re: [PATCH 11/23] midx: sort and deduplicate objects from packfiles
  2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
@ 2018-06-09 17:07   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Before writing a list of objects and their offsets to a multi-pack-index
> (MIDX), we need to collect the list of objects contained in the
> packfiles. There may be multiple copies of some objects, so this list
> must be deduplicated.

Can you just do merge-sort with a slight modification to ignore duplicates?
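
The merge step I am thinking of, sketched on two sorted arrays of integers instead of OIDs from N packs. Each .idx is already sorted, so a merge that drops repeats would give the deduplicated list without a separate sort. The function is hypothetical; preferring "a" on ties stands in for whatever rule picks the preferred pack.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Merge two ascending arrays into out, emitting each value once even
 * if it appears in both inputs (or repeatedly within one input).
 * out must have room for na + nb entries. Returns the number written.
 */
size_t sketch_merge_dedup(const uint32_t *a, size_t na,
			  const uint32_t *b, size_t nb,
			  uint32_t *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		uint32_t next;

		/* On ties, take from a first. */
		if (j >= nb || (i < na && a[i] <= b[j]))
			next = a[i++];
		else
			next = b[j++];

		if (n && out[n - 1] == next)
			continue; /* duplicate of the last emitted value */
		out[n++] = next;
	}
	return n;
}
```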

>
> It is possible to artificially get into a state where there are many
> duplicate copies of objects. That can create high memory pressure if we
> are to create a list of all objects before de-duplication. To reduce
> this memory pressure without a significant performance drop,
> automatically group objects by the first byte of their object id. Use
> the IDX fanout tables to group the data, copy to a local array, then
> sort.
>
> Copy only the de-duplicated entries. Select the duplicate based on the
> most-recent modified time of a packfile containing the object.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 138 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 923acda72e..b20d52713c 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -4,6 +4,7 @@
>  #include "csum-file.h"
>  #include "lockfile.h"
>  #include "object-store.h"
> +#include "packfile.h"
>  #include "midx.h"
>
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> @@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
>         }
>  }
>
> +static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
> +{
> +       const uint32_t *level1_ofs = p->index_data;
> +
> +       if (!level1_ofs) {
> +               if (open_pack_index(p))
> +                       return 0;
> +               level1_ofs = p->index_data;
> +       }
> +
> +       if (p->index_version > 1) {
> +               level1_ofs += 2;
> +       }
> +
> +       return ntohl(level1_ofs[value]);
> +}

Maybe keep this in packfile.c, refactoring the fanout code in there if
necessary, so that .idx file format knowledge stays in that file
instead of spreading out more.

> +
> +struct pack_midx_entry {
> +       struct object_id oid;
> +       uint32_t pack_int_id;
> +       time_t pack_mtime;
> +       uint64_t offset;
> +};
> +
> +static int midx_oid_compare(const void *_a, const void *_b)
> +{
> +       struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
> +       struct pack_midx_entry *b = (struct pack_midx_entry *)_b;

Try not to lose "const" while typecasting.
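
Something along these lines, as a standalone sketch. The struct is a simplified stand-in for pack_midx_entry with a raw 20-byte hash (an assumption), and memcmp() stands in for oidcmp(). A void pointer converts implicitly to a pointer-to-const, so no cast is needed at all.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SKETCH_HASH_LEN 20 /* assumed hash length */

struct sketch_entry {
	unsigned char oid[SKETCH_HASH_LEN];
	uint32_t pack_id;
	time_t pack_mtime;
};

/* Order by object id, then newest pack first, then pack id. */
int sketch_entry_compare(const void *_a, const void *_b)
{
	const struct sketch_entry *a = _a;
	const struct sketch_entry *b = _b;
	int cmp = memcmp(a->oid, b->oid, SKETCH_HASH_LEN);

	if (cmp)
		return cmp;

	if (a->pack_mtime > b->pack_mtime)
		return -1;
	if (a->pack_mtime < b->pack_mtime)
		return 1;

	/* Compare instead of subtracting to avoid unsigned wraparound. */
	return (a->pack_id > b->pack_id) - (a->pack_id < b->pack_id);
}
```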

> +       int cmp = oidcmp(&a->oid, &b->oid);
> +
> +       if (cmp)
> +               return cmp;
> +
> +       if (a->pack_mtime > b->pack_mtime)
> +               return -1;
> +       else if (a->pack_mtime < b->pack_mtime)
> +               return 1;
> +
> +       return a->pack_int_id - b->pack_int_id;
> +}
> +
> +static void fill_pack_entry(uint32_t pack_int_id,
> +                           struct packed_git *p,
> +                           uint32_t cur_object,
> +                           struct pack_midx_entry *entry)
> +{
> +       if (!nth_packed_object_oid(&entry->oid, p, cur_object))
> +               die("failed to located object %d in packfile", cur_object);

_()

> +
> +       entry->pack_int_id = pack_int_id;
> +       entry->pack_mtime = p->mtime;
> +
> +       entry->offset = nth_packed_object_offset(p, cur_object);
> +}
> +
> +/*
> + * It is possible to artificially get into a state where there are many
> + * duplicate copies of objects. That can create high memory pressure if
> + * we are to create a list of all objects before de-duplication. To reduce
> + * this memory pressure without a significant performance drop, automatically
> + * group objects by the first byte of their object id. Use the IDX fanout
> + * tables to group the data, copy to a local array, then sort.
> + *
> + * Copy only the de-duplicated entries (selected by most-recent modified time
> + * of a packfile containing the object).
> + */
> +static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
> +                                                 uint32_t *perm,
> +                                                 uint32_t nr_packs,
> +                                                 uint32_t *nr_objects)
> +{
> +       uint32_t cur_fanout, cur_pack, cur_object;
> +       uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
> +       struct pack_midx_entry *entries_by_fanout = NULL;
> +       struct pack_midx_entry *deduplicated_entries = NULL;
> +
> +       for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
> +               if (open_pack_index(p[cur_pack]))
> +                       continue;

Is it a big problem if you fail to open the .idx for a certain pack?
Should we error out and abort instead of continuing on? Later on, in
the second loop over the packs, when get_pack_fanout() returns zero
(failure), you don't seem to catch it and skip the pack.

> +
> +               total_objects += p[cur_pack]->num_objects;
> +       }
> +
> +       /*
> +        * As we de-duplicate by fanout value, we expect the fanout
> +        * slices to be evenly distributed, with some noise. Hence,
> +        * allocate slightly more than one 256th.
> +        */
> +       alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
> +
> +       ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
> +       ALLOC_ARRAY(deduplicated_entries, alloc_objects);
> +       *nr_objects = 0;
> +
> +       for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
> +               nr_fanout = 0;

Keep variable scope small, declare nr_fanout here instead of at the
top of the function.

> +
> +               for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
> +                       uint32_t start = 0, end;
> +
> +                       if (cur_fanout)
> +                               start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
> +                       end = get_pack_fanout(p[cur_pack], cur_fanout);
> +
> +                       for (cur_object = start; cur_object < end; cur_object++) {
> +                               ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
> +                               fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
> +                               nr_fanout++;
> +                       }
> +               }
> +
> +               QSORT(entries_by_fanout, nr_fanout, midx_oid_compare);
> +
> +               /*
> +                * The batch is now sorted by OID and then mtime (descending).
> +                * Take only the first duplicate.
> +                */
> +               for (cur_object = 0; cur_object < nr_fanout; cur_object++) {
> +                       if (cur_object && !oidcmp(&entries_by_fanout[cur_object - 1].oid,
> +                                                 &entries_by_fanout[cur_object].oid))
> +                               continue;
> +
> +                       ALLOC_GROW(deduplicated_entries, *nr_objects + 1, alloc_objects);
> +                       memcpy(&deduplicated_entries[*nr_objects],
> +                              &entries_by_fanout[cur_object],
> +                              sizeof(struct pack_midx_entry));
> +                       (*nr_objects)++;
> +               }
> +       }
> +
> +       FREE_AND_NULL(entries_by_fanout);
> +       return deduplicated_entries;
> +}
> +
>  static size_t write_midx_pack_lookup(struct hashfile *f,
>                                      char **pack_names,
>                                      uint32_t nr_packs)
> @@ -254,6 +389,7 @@ int write_midx_file(const char *object_dir)
>         uint64_t written = 0;
>         uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>         uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +       uint32_t nr_entries;
>
>         midx_name = get_midx_filename(object_dir);
>         if (safe_create_leading_directories(midx_name)) {
> @@ -312,6 +448,8 @@ int write_midx_file(const char *object_dir)
>         ALLOC_ARRAY(pack_perm, nr_packs);
>         sort_packs_by_name(pack_names, nr_packs, pack_perm);
>
> +       get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);

Intentionally ignoring the return value (and temporarily leaking as a
result) should have at least a comment to acknowledge it and save
reviewers some head scratching. Or even better, just free it now, even
if you don't use it.

> +
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
> --
> 2.18.0.rc1
>


-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 12/23] midx: write object ids in a chunk
  2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
@ 2018-06-09 17:25   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt |  4 ++
>  builtin/midx.c                          |  2 +
>  midx.c                                  | 50 +++++++++++++++++++++++--
>  object-store.h                          |  1 +
>  t/t5319-midx.sh                         |  4 +-
>  5 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 29bf87283a..de9ac778b6 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -307,6 +307,10 @@ CHUNK DATA:
>             name. This is the only chunk not guaranteed to be a multiple of four
>             bytes in length, so should be the last chunk for alignment reasons.
>
> +       OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)

So N is the number of objects and H is hash size? Please don't let me guess.

> +           The OIDs for all objects in the MIDX are stored in lexicographic
> +           order in this chunk.

The reason we keep all hashes together, packed tightly, is to reduce
cache footprint. Another observation is that it usually takes just 12
bytes or less to uniquely identify an object, which means we could
pack even tighter if we split the object hash into two chunks: do
bsearch in the first chunk with just <n> bytes, then verify that the
remaining 20-<n> bytes match in the second chunk. This may matter
more when we move to larger hashes. The split would of course be
configurable since different projects may have different optimal
values, but the default could be 10/10 bytes.
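
To make the idea concrete, here is a rough sketch of such a split
lookup. It is not part of the patch: the two-array layout, the name
split_lookup() and HASH_LEN are assumptions for illustration only. It
binary-searches the prefix chunk, then checks the suffix chunk for
every entry that shares the prefix (prefixes can collide).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20	/* SHA-1 size; an assumption for this sketch */

/*
 * Hypothetical layout: 'prefixes' holds nr sorted entries of prefix_len
 * bytes each; 'suffixes' holds the remaining HASH_LEN - prefix_len
 * bytes of each object at the same position.
 * Returns the position of 'hash', or -1 if it is not present.
 */
static long split_lookup(const unsigned char *prefixes,
			 const unsigned char *suffixes,
			 size_t nr, size_t prefix_len,
			 const unsigned char *hash)
{
	size_t suffix_len, lo = 0, hi = nr;

	assert(prefix_len > 0 && prefix_len <= HASH_LEN);
	suffix_len = HASH_LEN - prefix_len;

	/* Lower bound: first entry whose prefix is >= the wanted prefix. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (memcmp(prefixes + mid * prefix_len, hash, prefix_len) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Walk every entry sharing the prefix and verify the suffix. */
	for (; lo < nr; lo++) {
		if (memcmp(prefixes + lo * prefix_len, hash, prefix_len))
			break;
		if (!memcmp(suffixes + lo * suffix_len,
			    hash + prefix_len, suffix_len))
			return (long)lo;
	}
	return -1;
}
```

Only the prefix chunk is touched during the search itself, which is
where the cache footprint saving would come from; the suffix chunk is
read once per candidate.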

> +
>         (This section intentionally left incomplete.)
>
>  TRAILER:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 3a261e9bbf..86edd30174 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
>                 printf(" pack_lookup");
>         if (m->chunk_pack_names)
>                 printf(" pack_names");
> +       if (m->chunk_oid_lookup)
> +               printf(" oid_lookup");
>
>         printf("\n");
>
> diff --git a/midx.c b/midx.c
> index b20d52713c..d06bc6876a 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -14,10 +14,11 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 2
> +#define MIDX_MAX_CHUNKS 3
>  #define MIDX_CHUNK_ALIGNMENT 4
>  #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
> +#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
>  static char *get_midx_filename(const char *object_dir)
> @@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                                 m->chunk_pack_names = m->data + chunk_offset;
>                                 break;
>
> +                       case MIDX_CHUNKID_OIDLOOKUP:
> +                               m->chunk_oid_lookup = m->data + chunk_offset;
> +                               break;


I just now realized: how do you protect against duplicate chunks? From
this patch, it looks like you could accept two oidlookup chunks just
fine and then silently ignore the first one.
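
One possible shape for such a check, as a standalone sketch rather than
a change to load_midxed_git(): struct chunk_table, set_chunk_once() and
record_chunks() are made-up names, and skipping unknown chunk ids is my
assumption. The real loader would presumably die() where this returns
-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
#define CHUNKID_PACKNAMES  0x504e414d /* "PNAM" */
#define CHUNKID_OIDLOOKUP  0x4f49444c /* "OIDL" */

struct chunk_table {	/* hypothetical stand-in for struct midxed_git */
	const unsigned char *pack_lookup;
	const unsigned char *pack_names;
	const unsigned char *oid_lookup;
};

/* Store 'pos' in '*slot' unless the slot was already filled. */
static int set_chunk_once(const unsigned char **slot,
			  const unsigned char *pos)
{
	if (*slot)
		return -1;	/* this chunk id was seen before */
	*slot = pos;
	return 0;
}

/* Returns 0 on success, -1 if a known chunk id appears twice. */
static int record_chunks(struct chunk_table *t, const unsigned char *data,
			 const uint32_t *ids, const uint64_t *offsets,
			 size_t nr)
{
	size_t i;

	assert(t && data && ids && offsets);

	for (i = 0; i < nr; i++) {
		const unsigned char **slot;

		switch (ids[i]) {
		case CHUNKID_PACKLOOKUP:
			slot = &t->pack_lookup;
			break;
		case CHUNKID_PACKNAMES:
			slot = &t->pack_names;
			break;
		case CHUNKID_OIDLOOKUP:
			slot = &t->oid_lookup;
			break;
		default:
			continue;	/* unknown chunks are skipped */
		}
		if (set_chunk_once(slot, data + offsets[i]))
			return -1;
	}
	return 0;
}
```

With this shape the first occurrence wins and the second is reported,
instead of the later chunk quietly replacing the earlier one.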

>                         case 0:
>                                 die("terminating MIDX chunk id appears earlier than expected");
>                                 break;
> @@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
> +       if (!m->chunk_oid_lookup)
> +               die("MIDX missing required OID lookup chunk");

_()

>
>         m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
>         for (i = 0; i < m->num_packs; i++) {
> @@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
>         return written;
>  }
>
> +static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
> +                                   struct pack_midx_entry *objects,
> +                                   uint32_t nr_objects)
> +{
> +       struct pack_midx_entry *list = objects;
> +       uint32_t i;
> +       size_t written = 0;
> +
> +       for (i = 0; i < nr_objects; i++) {
> +               struct pack_midx_entry *obj = list++;
> +
> +               if (i < nr_objects - 1) {
> +                       struct pack_midx_entry *next = list;
> +                       if (oidcmp(&obj->oid, &next->oid) >= 0)
> +                               BUG("OIDs not in order: %s >= %s",
> +                               oid_to_hex(&obj->oid),
> +                               oid_to_hex(&next->oid));

Indentation. I almost thought oid_to_hex() was a separate statement.

> +               }
> +
> +               hashwrite(f, obj->oid.hash, (int)hash_len);

Is (int) really necessary? There's no loss in implicitly converting
unsigned char to int. But I didn't check the C spec, maybe there are
some rules...
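
For what it's worth, a tiny standalone sketch of why I think the cast
is redundant. takes_int() is a made-up stand-in for any function with
an int length parameter; it is not meant to mirror hashwrite()'s real
signature. The sketch assumes int is wider than unsigned char, which
holds on the platforms I know of.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a function taking an int length. */
static int takes_int(int len)
{
	return len;
}

static int pass_without_cast(unsigned char hash_len)
{
	/* Assumption of this sketch: every unsigned char fits in int. */
	assert(UCHAR_MAX <= INT_MAX);

	/* The argument is converted to int implicitly, value preserved. */
	return takes_int(hash_len);
}
```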

> +               written += hash_len;
> +       }
> +
> +       return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
>         unsigned char cur_chunk, num_chunks = 0;
> @@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir)
>         uint64_t written = 0;
>         uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
>         uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +       struct pack_midx_entry *entries;
>         uint32_t nr_entries;
>
>         midx_name = get_midx_filename(object_dir);
> @@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir)
>         ALLOC_ARRAY(pack_perm, nr_packs);
>         sort_packs_by_name(pack_names, nr_packs, pack_perm);
>
> -       get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
> +       entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);

This change should belong to the previous patch. This patch alone
can't tell me that entries is a new allocation. If I didn't remember
the last patch, I would not realize that entries should be freed (and
it does not look like it is freed here).

>
>         hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>         f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>         FREE_AND_NULL(midx_name);
>
>         cur_chunk = 0;
> -       num_chunks = 2;
> +       num_chunks = 3;
>
>         written = write_midx_header(f, num_chunks, nr_packs);
>
> @@ -467,9 +501,13 @@ int write_midx_file(const char *object_dir)
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
>
>         cur_chunk++;
> -       chunk_ids[cur_chunk] = 0;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = 0;
> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
> +
>         for (i = 0; i <= num_chunks; i++) {
>                 if (i && chunk_offsets[i] < chunk_offsets[i - 1])
>                         BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
> @@ -503,6 +541,10 @@ int write_midx_file(const char *object_dir)
>                                 written += write_midx_pack_names(f, pack_names, nr_packs);
>                                 break;
>
> +                       case MIDX_CHUNKID_OIDLOOKUP:
> +                               written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
> +                               break;
> +
>                         default:
>                                 BUG("trying to write unknown chunk id %"PRIx32,
>                                     chunk_ids[i]);
> diff --git a/object-store.h b/object-store.h
> index 1ba50459ca..7d14d3586e 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -102,6 +102,7 @@ struct midxed_git {
>
>         const uint32_t *chunk_pack_lookup;
>         const unsigned char *chunk_pack_names;
> +       const unsigned char *chunk_oid_lookup;
>
>         const char **pack_names;
>         char object_dir[FLEX_ARRAY];
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> index a31c387c8f..e71aa52b80 100755
> --- a/t/t5319-midx.sh
> +++ b/t/t5319-midx.sh
> @@ -6,8 +6,8 @@ test_description='multi-pack-indexes'
>  midx_read_expect() {
>         NUM_PACKS=$1
>         cat >expect <<- EOF
> -       header: 4d494458 1 1 2 $NUM_PACKS
> -       chunks: pack_lookup pack_names
> +       header: 4d494458 1 1 3 $NUM_PACKS
> +       chunks: pack_lookup pack_names oid_lookup
>         packs:
>         EOF
>         if [ $NUM_PACKS -ge 1 ]
> --
> 2.18.0.rc1
>


-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 13/23] midx: write object id fanout chunk
  2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
@ 2018-06-09 17:28   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
> @@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>                 die("MIDX missing required pack lookup chunk");
>         if (!m->chunk_pack_names)
>                 die("MIDX missing required pack-name chunk");
> +       if (!m->chunk_oid_fanout)
> +               die("MIDX missing required OID fanout chunk");

_()

> @@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * sizeof(uint32_t);
>
>         cur_chunk++;
> -       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;

Err.. mistake?

>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
>
> +       cur_chunk++;
> +       chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;

Same here.

> +       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
> +
>         cur_chunk++;
>         chunk_ids[cur_chunk] = 0;
>         chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
>
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 14/23] midx: write object offsets
  2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
@ 2018-06-09 17:41   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote:
> +static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
> +                                      struct pack_midx_entry *objects, uint32_t nr_objects)
> +{
> +       struct pack_midx_entry *list = objects;
> +       size_t written = 0;
> +
> +       while (nr_large_offset) {
> +               struct pack_midx_entry *obj = list++;
> +               uint64_t offset = obj->offset;
> +
> +               if (!(offset >> 31))
> +                       continue;
> +
> +               hashwrite_be32(f, offset >> 32);
> +               hashwrite_be32(f, offset & 0xffffffff);

Not sure if you need a UL suffix or something here on 32-bit platforms.
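
Thinking about it a bit more, a standalone sketch suggests the suffix
is probably not needed as long as the left operand is uint64_t.
split_offset() is a made-up helper, not something in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit offset into its high and low 32-bit words. */
static void split_offset(uint64_t offset, uint32_t *hi, uint32_t *lo)
{
	assert(hi && lo);

	*hi = (uint32_t)(offset >> 32);
	/*
	 * An unsuffixed hex constant takes the first integer type that
	 * can represent it, so 0xffffffff is never negative. Because the
	 * other operand is uint64_t, the constant is converted to
	 * uint64_t before the AND, with or without a suffix.
	 */
	*lo = (uint32_t)(offset & 0xffffffff);
}
```

The suffix would start to matter if the shifted or masked value were
itself a plain int or long, which is not the case here.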

> +               written += 2 * sizeof(uint32_t);
> +
> +               nr_large_offset--;
> +       }
> +
> +       return written;
> +}
> +
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/23] midx: prepare midxed_git struct
  2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
@ 2018-06-09 17:47   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c         | 22 ++++++++++++++++++++++
>  midx.h         |  2 ++
>  object-store.h |  7 +++++++
>  packfile.c     |  6 +++++-
>  4 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/midx.c b/midx.c
> index a49300bf75..5e9290ca8f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         exit(1);
>  }
>
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir)
> +{
> +       struct midxed_git *m = r->objects->midxed_git;
> +       struct midxed_git *m_search;
> +
> +       if (!core_midx)
> +               return 0;
> +
> +       for (m_search = m; m_search; m_search = m_search->next)
> +               if (!strcmp(object_dir, m_search->object_dir))
> +                       return 1;
> +
> +       r->objects->midxed_git = load_midxed_git(object_dir);
> +
> +       if (r->objects->midxed_git) {
> +               r->objects->midxed_git->next = m;
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
>  static size_t write_midx_header(struct hashfile *f,
>                                 unsigned char num_chunks,
>                                 uint32_t num_packs)
> diff --git a/midx.h b/midx.h
> index a1d18ed991..793203fc4a 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -5,8 +5,10 @@
>  #include "cache.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "repository.h"
>
>  struct midxed_git *load_midxed_git(const char *object_dir);
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
>
> diff --git a/object-store.h b/object-store.h
> index 9b671f1b0a..7908d46e34 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -130,6 +130,13 @@ struct raw_object_store {
>          */
>         struct oidmap *replace_map;
>
> +       /*
> +        * private data
> +        *
> +        * should only be accessed directly by packfile.c and midx.c
> +        */
> +       struct midxed_git *midxed_git;
> +
>         /*
>          * private data
>          *
> diff --git a/packfile.c b/packfile.c
> index 1a714fbde9..b91ca9b9f5 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -15,6 +15,7 @@
>  #include "tree-walk.h"
>  #include "tree.h"
>  #include "object-store.h"
> +#include "midx.h"
>
>  char *odb_pack_name(struct strbuf *buf,
>                     const unsigned char *sha1,
> @@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
>
>         if (r->objects->packed_git_initialized)
>                 return;
> +       prepare_midxed_git_one(r, r->objects->objectdir);
>         prepare_packed_git_one(r, r->objects->objectdir, 1);
>         prepare_alt_odb(r);
> -       for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
> +       for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> +               prepare_midxed_git_one(r, alt->path);
>                 prepare_packed_git_one(r, alt->path, 0);
> +       }

Ah, so the object path and the linked list in midxed_git are for
alternates. Makes sense. It would have saved me the trouble if you had
only introduced those fields now, when they are actually used (and
become self-explanatory).

>         rearrange_packed_git(r);
>         prepare_packed_git_mru(r);
>         r->objects->packed_git_initialized = 1;
> --
> 2.18.0.rc1
>
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 17/23] midx: read objects from multi-pack-index
  2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
@ 2018-06-09 17:56   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 17:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 6:55 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  midx.h         |  2 ++
>  object-store.h |  1 +
>  packfile.c     |  8 ++++-
>  4 files changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 5e9290ca8f..6eca8f1b12 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -3,6 +3,7 @@
>  #include "dir.h"
>  #include "csum-file.h"
>  #include "lockfile.h"
> +#include "sha1-lookup.h"
>  #include "object-store.h"
>  #include "packfile.h"
>  #include "midx.h"
> @@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
>         m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
>         strcpy(m->object_dir, object_dir);
> -       m->data = midx_map;
> +       m->data = (const unsigned char*)midx_map;

Hmm? Why is this typecast only needed now? Or is it not really needed at all?

>
>         m->signature = get_be32(m->data);
>         if (m->signature != MIDX_SIGNATURE) {
> @@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
>         m->num_objects = ntohl(m->chunk_oid_fanout[255]);
>
> -       m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +       m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
> +
> +       ALLOC_ARRAY(m->pack_names, m->num_packs);

Please make this ALLOC_ARRAY change in the patch that adds
xcalloc(m->num_packs).

>         for (i = 0; i < m->num_packs; i++) {
>                 if (i) {
>                         if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) {
> @@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>         exit(1);
>  }
>
> +static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
> +{
> +       struct strbuf pack_name = STRBUF_INIT;
> +
> +       if (pack_int_id >= m->num_packs)
> +               BUG("bad pack-int-id");
> +
> +       if (m->packs[pack_int_id])
> +               return 0;
> +
> +       strbuf_addstr(&pack_name, m->object_dir);
> +       strbuf_addstr(&pack_name, "/pack/");
> +       strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);

Just use strbuf_addf()

> +
> +       m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
> +       strbuf_release(&pack_name);
> +       return !m->packs[pack_int_id];

This is a weird return value convention. Normally we go zero/negative
or non-zero/zero for success/failure.
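
A sketch of the zero/negative flavor, using made-up types (struct pack,
struct pack_cache, load_pack()) instead of the real packed_git
machinery. Unlike the patch, an out-of-range id returns -1 here rather
than calling BUG(), only to keep the sketch standalone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pack {		/* hypothetical stand-in for struct packed_git */
	size_t id;
};

struct pack_cache {	/* hypothetical per-midx cache of opened packs */
	struct pack **packs;
	size_t nr;
};

/* Hypothetical loader; fails for odd ids to exercise the error path. */
static struct pack *load_pack(size_t id)
{
	struct pack *p;

	if (id % 2)
		return NULL;
	p = malloc(sizeof(*p));
	if (p)
		p->id = id;
	return p;
}

/* Returns 0 if the pack is (or already was) loaded, -1 on failure. */
static int prepare_pack(struct pack_cache *c, size_t id)
{
	assert(c && c->packs);

	if (id >= c->nr)
		return -1;
	if (c->packs[id])
		return 0;

	c->packs[id] = load_pack(id);
	return c->packs[id] ? 0 : -1;
}
```

Callers can then write "if (prepare_pack(c, id) < 0)" for the error
path, which reads the same as most other int-returning helpers.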

> +}
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 18/23] midx: use midx in abbreviation calculations
  2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
@ 2018-06-09 18:01   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
> @@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
>
>  static void find_abbrev_len_packed(struct min_abbrev_data *mad)
>  {
> +       struct midxed_git *m;
>         struct packed_git *p;
>
> +       for (m = get_midxed_git(the_repository); m; m = m->next)
> +               find_abbrev_len_for_midx(m, mad);

If all the packs are in midx, we don't need to run the second loop
below, do we? Otherwise I don't see why we waste cycles on finding
abbrev length on midx at all.

>         for (p = get_packed_git(the_repository); p; p = p->next)
>                 find_abbrev_len_for_pack(p, mad);
>  }
-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 20/23] midx: use midx in approximate_object_count
  2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
@ 2018-06-09 18:03   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  packfile.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/packfile.c b/packfile.c
> index 638e113972..059b2aa097 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
>  {
>         if (!the_repository->objects->approximate_object_count_valid) {
>                 unsigned long count;
> +               struct midxed_git *m;
>                 struct packed_git *p;
>
>                 prepare_packed_git(the_repository);
>                 count = 0;
> -               for (p = the_repository->objects->packed_git; p; p = p->next) {
> +               for (m = get_midxed_git(the_repository); m; m = m->next)
> +                       count += m->num_objects;
> +               for (p = get_packed_git(the_repository); p; p = p->next) {

Please don't change this line, it's not related to this patch. The
same concern applies: if we have already counted objects in midx we
should ignore packs that belong to it, or we double count.
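
Roughly what I have in mind, as a standalone sketch that assumes packs
covered by the midx can still show up on the pack list at this point.
struct toy_midx, struct toy_pack, covers() and count_objects() are
made-up names, not the real midxed_git/packed_git API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_midx {	/* hypothetical, not struct midxed_git */
	const char **pack_names;	/* sorted names of covered packs */
	size_t num_packs;
	unsigned long num_objects;
};

struct toy_pack {	/* hypothetical, not struct packed_git */
	const char *name;
	unsigned long num_objects;
};

/* Binary search over the sorted pack names; 1 if found, 0 otherwise. */
static int covers(const struct toy_midx *m, const char *name)
{
	size_t lo = 0, hi = m->num_packs;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, m->pack_names[mid]);

		if (!cmp)
			return 1;
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Count midx objects once, then add only packs the midx does not cover. */
static unsigned long count_objects(const struct toy_midx *m,
				   const struct toy_pack *packs, size_t nr)
{
	unsigned long count = m ? m->num_objects : 0;
	size_t i;

	assert(packs || !nr);

	for (i = 0; i < nr; i++) {
		if (m && covers(m, packs[i].name))
			continue;	/* already counted via the midx */
		count += packs[i].num_objects;
	}
	return count;
}
```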

>                         if (open_pack_index(p))
>                                 continue;
>                         count += p->num_objects;
> --
> 2.18.0.rc1
>


-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 21/23] midx: prevent duplicate packfile loads
  2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
@ 2018-06-09 18:05   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> If the multi-pack-index contains a packfile, then we do not need to add
> that packfile to the packed_git linked list or the MRU list.

Because...?

I think I see the reason, but I'd like it spelled out to avoid any
misunderstanding.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c     | 23 +++++++++++++++++++++++
>  midx.h     |  1 +
>  packfile.c |  7 +++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 388d79b7d9..3242646fe0 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mi
>         return nth_midxed_pack_entry(m, e, pos);
>  }
>
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name)
> +{
> +       uint32_t first = 0, last = m->num_packs;
> +
> +       while (first < last) {
> +               uint32_t mid = first + (last - first) / 2;
> +               const char *current;
> +               int cmp;
> +
> +               current = m->pack_names[mid];
> +               cmp = strcmp(idx_name, current);
> +               if (!cmp)
> +                       return 1;
> +               if (cmp > 0) {
> +                       first = mid + 1;
> +                       continue;
> +               }
> +               last = mid;
> +       }
> +
> +       return 0;
> +}
> +
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir)
>  {
>         struct midxed_git *m = r->objects->midxed_git;
> diff --git a/midx.h b/midx.h
> index 497bdcc77c..c1db58d8c4 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
>                                         struct midxed_git *m,
>                                         uint32_t n);
>  int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> diff --git a/packfile.c b/packfile.c
> index 059b2aa097..479cb69b9f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
>         DIR *dir;
>         struct dirent *de;
>         struct string_list garbage = STRING_LIST_INIT_DUP;
> +       struct midxed_git *m = r->objects->midxed_git;
> +
> +       /* look for the multi-pack-index for this object directory */
> +       while (m && strcmp(m->object_dir, objdir))
> +               m = m->next;
>
>         strbuf_addstr(&path, objdir);
>         strbuf_addstr(&path, "/pack");
> @@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local
>                 base_len = path.len;
>                 if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
>                         /* Don't reopen a pack we already have. */
> +                       if (m && midx_contains_pack(m, de->d_name))
> +                               continue;
>                         for (p = r->objects->packed_git; p;
>                              p = p->next) {
>                                 size_t len;
> --
> 2.18.0.rc1
>


-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 23/23] midx: clear midx on repack
  2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
@ 2018-06-09 18:13   ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-09 18:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> If a 'git repack' command replaces existing packfiles, then we must
> clear the existing multi-pack-index before moving the packfiles it
> references.

I think there are other places where we add or remove pack files and
need to reprepare_packed_git(). Any midx invalidation should be part
of that as well.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/repack.c | 8 ++++++++
>  midx.c           | 8 ++++++++
>  midx.h           | 1 +
>  3 files changed, 17 insertions(+)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 6c636e159e..66a7d8e8ea 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -8,6 +8,7 @@
>  #include "strbuf.h"
>  #include "string-list.h"
>  #include "argv-array.h"
> +#include "midx.h"
>
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> @@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>         int no_update_server_info = 0;
>         int quiet = 0;
>         int local = 0;
> +       int midx_cleared = 0;
>
>         struct option builtin_repack_options[] = {
>                 OPT_BIT('a', NULL, &pack_everything,
> @@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>                                 continue;
>                         }
>
> +                       if (!midx_cleared) {
> +                               /* if we move a packfile, it will invalidated the midx */

What about removing packs, which also happens in repack? If the
removed pack is part of midx, then midx becomes invalid as well.

> +                               clear_midx_file(get_object_directory());
> +                               midx_cleared = 1;
> +                       }
> +
>                         fname_old = mkpathdup("%s/old-%s%s", packdir,
>                                                 item->string, exts[ext].name);
>                         if (file_exists(fname_old))
> diff --git a/midx.c b/midx.c
> index e46f392fa4..1043c01fa7 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
>         FREE_AND_NULL(pack_names);
>         return 0;
>  }
> +
> +void clear_midx_file(const char *object_dir)

delete_ may be more obvious than clear_

> +{
> +       char *midx = get_midx_filename(object_dir);
> +
> +       if (remove_path(midx))
> +               die(_("failed to clear multi-pack-index at %s"), midx);

die_errno()

> +}
> diff --git a/midx.h b/midx.h
> index 6996b5ff6b..46f9f44c94 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> +void clear_midx_file(const char *object_dir);
>
>  #endif
> --
> 2.18.0.rc1
>


-- 
Duy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/23] midx: add design document
  2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
@ 2018-06-11 19:04   ` Stefan Beller
  2018-06-18 18:48     ` Derrick Stolee
  0 siblings, 1 reply; 62+ messages in thread
From: Stefan Beller @ 2018-06-11 19:04 UTC (permalink / raw)
  To: stolee; +Cc: git, dstolee, avarab, jrnieder, jonathantanmy, mfick

On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/technical/midx.txt
>
> diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
> new file mode 100644
> index 0000000000..789f410d71
> --- /dev/null
> +++ b/Documentation/technical/midx.txt
> @@ -0,0 +1,109 @@
> +Multi-Pack-Index (MIDX) Design Notes
> +====================================
> +
> +The Git object directory contains a 'pack' directory containing
> +packfiles (with suffix ".pack") and pack-indexes (with suffix
> +".idx"). The pack-indexes provide a way to lookup objects and
> +navigate to their offset within the pack, but these must come
> +in pairs with the packfiles. This pairing depends on the file
> +names, as the pack-index differs only in suffix with its pack-
> +file. While the pack-indexes provide fast lookup per packfile,
> +this performance degrades as the number of packfiles increases,
> +because abbreviations need to inspect every packfile and we are
> +more likely to have a miss on our most-recently-used packfile.
> +For some large repositories, repacking into a single packfile
> +is not feasible due to storage space or excessive repack times.

This leads to the question how MIDX will cope with large repos or
a large number of packs. As it is just an index and not a pack itself,
I guess it is smaller by some orders of magnitude, such that it
is ok for now.

> +The multi-pack-index (MIDX for short) stores a list of objects
> +and their offsets into multiple packfiles. It contains:
> +
> +- A list of packfile names.
> +- A sorted list of object IDs.
> +- A list of metadata for the ith object ID including:
> +  - A value j referring to the jth packfile.
> +  - An offset within the jth packfile for the object.
> +- If large offsets are required, we use another list of large
> +  offsets similar to version 2 pack-indexes.
> +
> +Thus, we can provide O(log N) lookup time for any number
> +of packfiles.

This sounds great for the lookup case!
Though that is for the repo-read case.
Let's read on how the dynamics of a repository are dealt with,
e.g. integrating new packs into the MIDX, or how we deal with
objects in multiple packs.
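
To make the O(log N) claim concrete, here is a minimal sketch in C of the
read path, assuming the sorted object IDs and the per-object metadata are
already in memory as two parallel arrays. The struct and function names
(sketch_midx, sketch_entry, sketch_midx_lookup) are hypothetical and do not
come from the series; the real on-disk layout is defined by later patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* SHA-1 length; hypothetical constant for this sketch */

/* Hypothetical per-object metadata: which pack, and where in that pack. */
struct sketch_entry {
	uint32_t pack_id; /* index j into the list of packfile names */
	uint64_t offset;  /* offset of the object within pack j */
};

/* Hypothetical in-memory view of a MIDX; not the layout used by the series. */
struct sketch_midx {
	size_t num_objects;
	const unsigned char *oids;          /* num_objects * SKETCH_HASH_LEN bytes, sorted */
	const struct sketch_entry *entries; /* entries[i] describes the ith object ID */
};

/*
 * Binary search over the sorted object IDs. The cost depends only on the
 * number of objects, not on the number of packfiles. Returns the matching
 * entry, or NULL when the object is not in the MIDX.
 */
static const struct sketch_entry *sketch_midx_lookup(const struct sketch_midx *m,
						      const unsigned char *oid)
{
	size_t lo = 0, hi;

	assert(m && oid);
	hi = m->num_objects;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, m->oids + mid * SKETCH_HASH_LEN, SKETCH_HASH_LEN);

		if (!cmp)
			return &m->entries[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

A single search replaces one search per packfile, which is where the win
over per-pack IDX lookups would come from.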

> +
> +Design Details
> +--------------
> +
> +- The MIDX is stored in a file named 'multi-pack-index' in the
> +  .git/objects/pack directory. This could be stored in the pack
> +  directory of an alternate. It refers only to packfiles in that
> +  same directory.

So there is one and only one multi pack index?
That makes the case of preparing the next MIDX that contains more
pack references more interesting, as then we have to atomically update
that file.

> +- The core.midx config setting must be on to consume MIDX files.

Looking through current config options, I would rename this to a more
suggestive name. I searched for the core.idx counterpart that enables
idx files -- it turns out that is named pack.indexVersion.

So maybe pack.MultiIndex ? That could start out as a boolean as in this
series and then evolve into a version number or such later.

> +- The file format includes parameters for the object ID hash
> +  function, so a future change of hash algorithm does not require
> +  a change in format.
> +
> +- The MIDX keeps only one record per object ID. If an object appears
> +  in multiple packfiles, then the MIDX selects the copy in the most-
> +  recently modified packfile.

Okay. That answers the question from above. Though this is just the tie
breaking decision and not a hard limitation? (i.e. we could change this
later to pick the pack that has e.g. the shortest delta chain for that
object or such)

> +- If there exist packfiles in the pack directory not registered in
> +  the MIDX, then those packfiles are loaded into the `packed_git`
> +  list and `packed_git_mru` cache.

Not sure I understand the implications of this?
Does that mean we first look at the multi index and if an object is not
found, we'll search linearly through all packs that are not part of the
MIDX? That would require the MIDX to be kept reasonably up to date
to be useful.

> +- The pack-indexes (.idx files) remain in the pack directory so we
> +  can delete the MIDX file, set core.midx to false, or downgrade
> +  without any loss of information.

In the future will it be possible to have no .idx files and just have the .midx?
(I guess that depends on the strategy of how to integrate new packs into
the MIDX?)

> +- The MIDX file format uses a chunk-based approach (similar to the
> +  commit-graph file) that allows optional data to be added.

... or the index files v2 (or reftable files)? Sure, you are most familiar with
commit-graph files, but others may find it easier to have some older
file formats to relate to.

> +Future Work
> +-----------
> +
> +- Add a 'verify' subcommand to the 'git midx' builtin to verify the
> +  contents of the multi-pack-index file match the offsets listed in
> +  the corresponding pack-indexes.
> +
> +- The multi-pack-index allows many packfiles, especially in a context
> +  where repacking is expensive (such as a very large repo), or
> +  unexpected maintenance time is unacceptable (such as a high-demand
> +  build machine).

Supposedly maintenance (git gc) can be run in the background without
interfering with day-to-day life. How does the regeneration of commit-graph
or MIDX files impact the work here?

>     However, the multi-pack-index needs to be rewritten
> +  in full every time. We can extend the format to be incremental, so
> +  writes are fast. By storing a small "tip" multi-pack-index that
> +  points to large "base" MIDX files, we can keep writes fast while
> +  still reducing the number of binary searches required for object
> +  lookups.

So we can have multiple MIDX files? How would that work? Would there
be a chunk that refers to other MIDX files?

> +- The reachability bitmap is currently paired directly with a single
> +  packfile, using the pack-order as the object order to hopefully
> +  compress the bitmaps well using run-length encoding. This could be
> +  extended to pair a reachability bitmap with a multi-pack-index. If
> +  the multi-pack-index is extended to store a "stable object order"
> +  (a function Order(hash) = integer that is constant for a given hash,

This stable object order doesn't fly well with integrating new packs?

> +  even as the multi-pack-index is updated) then a reachability bitmap
> +  could point to a multi-pack-index and be updated independently.
> +
> +- Packfiles can be marked as "special" using empty files that share
> +  the initial name but replace ".pack" with ".keep" or ".promisor".
> +  We can add an optional chunk of data to the multi-pack-index that
> +  records flags of information about the packfiles. This allows new
> +  states, such as 'repacked' or 'redeltified', that can help with
> +  pack maintenance in a multi-pack environment. It may also be
> +  helpful to organize packfiles by object type (commit, tree, blob,
> +  etc.) and use this metadata to help that maintenance.
> +
> +- The partial clone feature records special "promisor" packs that
> +  may point to objects that are not stored locally, but available
> +  on request to a server. The multi-pack-index does not currently
> +  track these promisor packs.

* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
@ 2018-06-11 19:19   ` Stefan Beller
  2018-06-18 19:01     ` Derrick Stolee
  0 siblings, 1 reply; 62+ messages in thread
From: Stefan Beller @ 2018-06-11 19:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

Hi Derrick,
On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> The multi-pack-index (MIDX) feature generalizes the existing pack-
> index (IDX) feature by indexing objects across multiple pack-files.
>
> Describe the basic file format, using a 12-byte header followed by
> a lookup table for a list of "chunks" which will be described later.
> The file ends with a footer containing a checksum using the hash
> algorithm.
>
> The header allows later versions to create breaking changes by
> advancing the version number. We can also change the hash algorithm
> using a different version value.
>
> We will add the individual chunk format information as we introduce
> the code that writes that information.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 70a99fd142..17666b4bfc 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -252,3 +252,52 @@ Pack file entry: <+
>      corresponding packfile.
>
>      20-byte SHA-1-checksum of all of the above.
> +
> +== midx-*.midx files have the following format:
> +
> +The meta-index files refer to multiple pack-files and loose objects.

So is it meta or multi?

> +In order to allow extensions that add extra data to the MIDX, we organize
> +the body into "chunks" and provide a lookup table at the beginning of the
> +body. The header includes certain length values, such as the number of packs,
> +the number of base MIDX files, hash lengths and types.
> +
> +All 4-byte numbers are in network order.
> +
> +HEADER:
> +
> +       4-byte signature:
> +           The signature is: {'M', 'I', 'D', 'X'}
> +
> +       1-byte version number:
> +           Git only writes or recognizes version 1
> +
> +       1-byte Object Id Version
> +           Git only writes or recognizes verion 1 (SHA-1)

s/verion/version/

> +       1-byte number (C) of "chunks"
> +
> +       1-byte number (I) of base multi-pack-index files:
> +           This value is currently always zero.

Oh? Are meta-index and multi-index files different things?

> +       4-byte number (P) of pack files
> +
> +CHUNK LOOKUP:
> +
> +       (C + 1) * 12 bytes providing the chunk offsets:
> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
> +           Other 8 bytes provide offset in current file for chunk to start.
> +           (Chunks are provided in file-order, so you can infer the length
> +           using the next chunk position if necessary.)

It is so nice to have the header also have 12 bytes, so it fits right into the
lookup table. So an alternative point of view:

  If a chunk needs to store more than 8 bytes, we'll have an offset after
  the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
  of information directly after the 4 bytes.
   "MIDX" is a special chunk and must come first (does it?) and only once
  as it contains the version number.
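
For reference, the header and chunk lookup as described in the patch could
be read roughly as follows. This is only a sketch under stated assumptions:
the helper names are hypothetical, the 8-byte chunk offset is assumed to be
big-endian like the 4-byte fields, and the chunk id used in any example is
made up, since the chunk section is intentionally left incomplete here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIDX_SIGNATURE 0x4d494458u /* "MIDX" */
#define SKETCH_HEADER_SIZE 12
#define SKETCH_CHUNK_ROW_SIZE 12 /* 4-byte id + 8-byte offset */

/* Hypothetical decoded form of the 12-byte header. */
struct sketch_header {
	uint8_t version;
	uint8_t oid_version;
	uint8_t num_chunks;    /* C */
	uint8_t num_base_midx; /* I, currently always zero */
	uint32_t num_packs;    /* P */
};

static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Assumption: 8-byte offsets use network order as well. */
static uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

/* Returns 0 on success, -1 if the buffer is too short or the signature is wrong. */
static int sketch_parse_header(const unsigned char *buf, size_t len,
			       struct sketch_header *h)
{
	assert(buf && h);
	if (len < SKETCH_HEADER_SIZE ||
	    sketch_get_be32(buf) != SKETCH_MIDX_SIGNATURE)
		return -1;
	h->version = buf[4];
	h->oid_version = buf[5];
	h->num_chunks = buf[6];
	h->num_base_midx = buf[7];
	h->num_packs = sketch_get_be32(buf + 8);
	return 0;
}

/*
 * Scan the (C + 1) lookup rows for chunk_id. Returns the chunk's file
 * offset, or 0 if it is absent (0 can never be a chunk start because the
 * header lives there).
 */
static uint64_t sketch_find_chunk(const unsigned char *buf, size_t len,
				  const struct sketch_header *h, uint32_t chunk_id)
{
	size_t i, rows = (size_t)h->num_chunks + 1;
	const unsigned char *row = buf + SKETCH_HEADER_SIZE;

	if (len < SKETCH_HEADER_SIZE + rows * SKETCH_CHUNK_ROW_SIZE)
		return 0;
	for (i = 0; i < rows; i++, row += SKETCH_CHUNK_ROW_SIZE) {
		uint32_t id = sketch_get_be32(row);

		if (!id)
			break; /* value 0 is the terminating label */
		if (id == chunk_id)
			return sketch_get_be64(row + 4);
	}
	return 0;
}
```

Keeping the header separate from the lookup rows, as the patch does, means
the reader never has to special-case a row that carries inline data.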

> +       The remaining data in the body is described one chunk at a time, and
> +       these chunks may be given in any order. Chunks are required unless
> +       otherwise specified.
> +
> +CHUNK DATA:
> +
> +       (This section intentionally left incomplete.)
> +
> +TRAILER:
> +
> +       H-byte HASH-checksum of all of the above.

This means we have to rehash the whole file for updating its contents.
okay.

* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
  2018-06-07 17:20   ` Duy Nguyen
@ 2018-06-11 21:02   ` Stefan Beller
  2018-06-18 19:40     ` Derrick Stolee
  1 sibling, 1 reply; 62+ messages in thread
From: Stefan Beller @ 2018-06-11 21:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

Hi Derrick,
On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> This new 'git midx' builtin will be the plumbing access for writing,
> reading, and checking multi-pack-index (MIDX) files. The initial
> implementation is a no-op.

Let's talk about the name for a second:

.idx files are written by git-index-pack or as part of
git-pack-objects (which just calls write_idx_file as part
of finish_tmp_packfile), and the name actually suggests
it writes the index files. I have a hard time understanding
what the git-midx command does[1].

With both commit graph as well as multi index we introduce
a command that is centered around that concept (similar to
git-remote or git-config that are centered around a concept,
that is closely resembled by a file), but for indexes for packs
it was integrated differently into Git. So I am not sure if I want
to suggest to integrate it into the packfile commands as that
doesn't really fit. But maybe we can have a name that is human
readable instead of the file suffix? Maybe

  git multi-pack-index ?

I suppose that eventually this command is not really used by
users as it will be used by other porcelain commands in the
background or even as part of repack/gc so I am not worried
about a long name, but I'd be more worried about understandability.

[1] While these names are not perfect for the layman, it is okay?
  I am sure you are aware of https://git-man-page-generator.lokaltog.net/


> new file mode 100644
> index 0000000000..2bd886f1a2
> --- /dev/null
> +++ b/Documentation/git-midx.txt
> @@ -0,0 +1,29 @@
> +git-midx(1)
> +============
> +
> +NAME
> +----
> +git-midx - Write and verify multi-pack-indexes (MIDX files).

The reading is done as part of all other commands.

> +
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git midx' [--object-dir <dir>]
> +
> +DESCRIPTION
> +-----------
> +Write or verify a MIDX file.
> +
> +OPTIONS
> +-------
> +
> +--object-dir <dir>::
> +       Use given directory for the location of Git objects. We check
> +       <dir>/packs/multi-pack-index for the current MIDX file, and
> +       <dir>/packs for the pack-files to index.
> +
> +

Maybe we could have a SEE ALSO section that points at
the explanation of multi index files?
(c.f. man git-submodule that has a  SEE ALSO
gitsubmodules(7), gitmodules(5) explaining concepts(7)
and the file(5))

But as this is plumbing and users should not need to worry about it
this is optional, I would think.

* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
  2018-06-07 17:35   ` Duy Nguyen
@ 2018-06-12 15:00   ` Duy Nguyen
  2018-06-19 12:54     ` Derrick Stolee
  1 sibling, 1 reply; 62+ messages in thread
From: Duy Nguyen @ 2018-06-12 15:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
> diff --git a/midx.c b/midx.c
> index 616af66b13..3e55422a21 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1,9 +1,62 @@
>  #include "git-compat-util.h"
>  #include "cache.h"
>  #include "dir.h"
> +#include "csum-file.h"
> +#include "lockfile.h"
>  #include "midx.h"
>
> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> +#define MIDX_VERSION 1
> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
...
> +static size_t write_midx_header(struct hashfile *f,
> +                               unsigned char num_chunks,
> +                               uint32_t num_packs)
> +{
> +       char byte_values[4];
> +       hashwrite_be32(f, MIDX_SIGNATURE);
> +       byte_values[0] = MIDX_VERSION;
> +       byte_values[1] = MIDX_HASH_VERSION;

Quoting from "State of NewHash work, future directions, and discussion" [1]

* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.

[1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60

> +       byte_values[2] = num_chunks;
> +       byte_values[3] = 0; /* unused */
> +       hashwrite(f, byte_values, sizeof(byte_values));
> +       hashwrite_be32(f, num_packs);
> +
> +       return MIDX_HEADER_SIZE;
> +}
-- 
Duy

* Re: [PATCH 01/23] midx: add design document
  2018-06-11 19:04   ` Stefan Beller
@ 2018-06-18 18:48     ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-18 18:48 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, dstolee, avarab, jrnieder, jonathantanmy, mfick

On 6/11/2018 3:04 PM, Stefan Beller wrote:
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/technical/midx.txt | 109 +++++++++++++++++++++++++++++++
>>   1 file changed, 109 insertions(+)
>>   create mode 100644 Documentation/technical/midx.txt
>>
>> diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
>> new file mode 100644
>> index 0000000000..789f410d71
>> --- /dev/null
>> +++ b/Documentation/technical/midx.txt
>> @@ -0,0 +1,109 @@
>> +Multi-Pack-Index (MIDX) Design Notes
>> +====================================
>> +
>> +The Git object directory contains a 'pack' directory containing
>> +packfiles (with suffix ".pack") and pack-indexes (with suffix
>> +".idx"). The pack-indexes provide a way to lookup objects and
>> +navigate to their offset within the pack, but these must come
>> +in pairs with the packfiles. This pairing depends on the file
>> +names, as the pack-index differs only in suffix with its pack-
>> +file. While the pack-indexes provide fast lookup per packfile,
>> +this performance degrades as the number of packfiles increases,
>> +because abbreviations need to inspect every packfile and we are
>> +more likely to have a miss on our most-recently-used packfile.
>> +For some large repositories, repacking into a single packfile
>> +is not feasible due to storage space or excessive repack times.
> This leads to the question of how MIDX will cope with large repos or
> a large number of packs. As it is just an index and not a pack itself,
> I guess it is smaller by some orders of magnitude, such that it
> is ok for now.

The MIDX file is only slightly larger than the union of the IDX files 
for those packfiles.

>> +The multi-pack-index (MIDX for short) stores a list of objects
>> +and their offsets into multiple packfiles. It contains:
>> +
>> +- A list of packfile names.
>> +- A sorted list of object IDs.
>> +- A list of metadata for the ith object ID including:
>> +  - A value j referring to the jth packfile.
>> +  - An offset within the jth packfile for the object.
>> +- If large offsets are required, we use another list of large
>> +  offsets similar to version 2 pack-indexes.
>> +
>> +Thus, we can provide O(log N) lookup time for any number
>> +of packfiles.
> This sounds great for the lookup case!
> Though that is for the repo-read case.
> Let's read on how the dynamics of a repository are dealt with,
> e.g. integrating new packs into the MIDX, or how we deal with
> objects in multiple packs.
>
>> +
>> +Design Details
>> +--------------
>> +
>> +- The MIDX is stored in a file named 'multi-pack-index' in the
>> +  .git/objects/pack directory. This could be stored in the pack
>> +  directory of an alternate. It refers only to packfiles in that
>> +  same directory.
> So there is one and only one multi pack index?
> That makes the case of preparing the next MIDX that contains more
> pack references more interesting, as then we have to atomically update
> that file.

There is only one, but we can make the file incremental without changing 
this name similar to how the split index works.

>
>> +- The core.midx config setting must be on to consume MIDX files.
> Looking through current config options, I would rename this to a more
> suggestive name. I searched for the core.idx counterpart that enables
> idx files -- it turns out that is named pack.indexVersion.
>
> So maybe pack.MultiIndex ? That could start out as a boolean as in this
> series and then evolve into a version number or such later.

I'll use that name and rename this file to 
Documentation/technical/multi-pack-index.txt

>
>> +- The file format includes parameters for the object ID hash
>> +  function, so a future change of hash algorithm does not require
>> +  a change in format.
>> +
>> +- The MIDX keeps only one record per object ID. If an object appears
>> +  in multiple packfiles, then the MIDX selects the copy in the most-
>> +  recently modified packfile.
> Okay. That answers the question from above. Though this is just the tie
> breaking decision and not a hard limitation? (i.e. we could change this
> later to pick the pack that has e.g. the shortest delta chain for that
> object or such)

This is a soft requirement. It is an easy thing to track at the moment. 
We can compute the MIDX without opening a packfile, for instance.
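
As an illustration of that tie-break, here is a small C sketch that
collapses duplicate object IDs to the copy from the most recently modified
packfile, using only data available from the IDX files plus a modification
time per pack. All names (sketch_candidate, sketch_dedupe) are hypothetical,
and sorting then skipping duplicates is just one way to do it, not
necessarily what the series implements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* SHA-1 length; hypothetical constant for this sketch */

/* One (object, pack) pair gathered from an IDX file. */
struct sketch_candidate {
	unsigned char oid[SKETCH_HASH_LEN];
	uint32_t pack_id;
	uint64_t offset;
	int64_t pack_mtime; /* modification time of the containing packfile */
};

/* Order by object ID, then by pack mtime with the newest pack first. */
static int sketch_candidate_cmp(const void *a_, const void *b_)
{
	const struct sketch_candidate *a = a_, *b = b_;
	int cmp = memcmp(a->oid, b->oid, SKETCH_HASH_LEN);

	if (cmp)
		return cmp;
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime ? -1 : 1;
	return 0;
}

/*
 * Sort the candidates and keep only the first entry of each run of equal
 * object IDs, i.e. the copy in the most recently modified pack. Returns
 * the number of entries kept at the front of the array.
 */
static size_t sketch_dedupe(struct sketch_candidate *c, size_t nr)
{
	size_t i, out = 0;

	assert(c || !nr);
	if (nr)
		qsort(c, nr, sizeof(*c), sketch_candidate_cmp);
	for (i = 0; i < nr; i++) {
		if (out && !memcmp(c[out - 1].oid, c[i].oid, SKETCH_HASH_LEN))
			continue; /* older duplicate of the entry just kept */
		if (out != i)
			c[out] = c[i];
		out++;
	}
	return out;
}
```

Swapping in a different policy later (shortest delta chain, say) would only
change the secondary key in the comparison, which matches the "soft
requirement" framing.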

>
>> +- If there exist packfiles in the pack directory not registered in
>> +  the MIDX, then those packfiles are loaded into the `packed_git`
>> +  list and `packed_git_mru` cache.
> Not sure I understand the implications of this?
> Does that mean we first look at the multi index and if an object is not
> found, we'll search linearly through all packs that are not part of the
> MIDX? That would require the MIDX to be kept reasonably up to date
> to be useful.

If you add a packfile to the pack directory, you can immediately start 
consuming it. You do not need to wait for the MIDX to be updated. The 
more asynchronous these auxiliary data structures (MIDX, commit-graph) 
can be, the better. This is in direct contrast to the reachability 
bitmap which is useless without its corresponding packfile.
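
A rough C sketch of that behavior follows, assuming the MIDX is consulted
first and any packfiles it does not cover are then searched one by one. The
lookup order is an assumption on my part, object IDs are reduced to 64-bit
integers for brevity, and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an index (a MIDX or a single IDX): a sorted array of IDs. */
struct sketch_index {
	const uint64_t *ids;
	size_t nr;
};

/* Binary search; returns 1 if id is present, 0 otherwise. */
static int sketch_index_has(const struct sketch_index *idx, uint64_t id)
{
	size_t lo = 0, hi = idx->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx->ids[mid] == id)
			return 1;
		if (idx->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

struct sketch_store {
	const struct sketch_index *midx;  /* NULL when there is no MIDX */
	const struct sketch_index *extra; /* packs not registered in the MIDX */
	size_t nr_extra;
};

#define SKETCH_NOT_FOUND (-1)
#define SKETCH_IN_MIDX 0

/*
 * Returns SKETCH_IN_MIDX if the MIDX knows the object, 1 + k if it was
 * found in unregistered pack k, or SKETCH_NOT_FOUND.
 */
static int sketch_find_object(const struct sketch_store *s, uint64_t id)
{
	size_t k;

	assert(s);
	if (s->midx && sketch_index_has(s->midx, id))
		return SKETCH_IN_MIDX;
	for (k = 0; k < s->nr_extra; k++)
		if (sketch_index_has(&s->extra[k], id))
			return 1 + (int)k;
	return SKETCH_NOT_FOUND;
}
```

The linear part only covers packs added since the last MIDX write, so a
stale MIDX costs some lookup time but never correctness.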

>
>> +- The pack-indexes (.idx files) remain in the pack directory so we
>> +  can delete the MIDX file, set core.midx to false, or downgrade
>> +  without any loss of information.
> In the future will it be possible to have no .idx files and just have the .midx?
> (I guess that depends on the strategy of how to integrate new packs into
> the MIDX?)

This may be part of a future plan, but we need to know a user will never 
set pack.multiIndex to false if they deleted their IDX files.

>> +- The MIDX file format uses a chunk-based approach (similar to the
>> +  commit-graph file) that allows optional data to be added.
> ... or the index files v2 (or reftable files)? Sure, you are most familiar with
> commit-graph files, but others may find it easier to have some older
> file formats to relate to.

I specifically mean that we have a "table of contents" describing the 
list of chunks. IDX v2 relies on a fixed ordering of the tables, and the 
offsets are computed by consuming the last fanout value (number of 
objects). Also, I'm not sure how to add optional data (data that can 
safely be ignored by an earlier version of Git) to an IDX without 
incrementing the version.

>> +Future Work
>> +-----------
>> +
>> +- Add a 'verify' subcommand to the 'git midx' builtin to verify the
>> +  contents of the multi-pack-index file match the offsets listed in
>> +  the corresponding pack-indexes.
>> +
>> +- The multi-pack-index allows many packfiles, especially in a context
>> +  where repacking is expensive (such as a very large repo), or
>> +  unexpected maintenance time is unacceptable (such as a high-demand
>> +  build machine).
> Supposedly maintenance (git gc) can be run in the background without
> interfering with day-to-day life. How does the regeneration of commit-graph
> or MIDX files impact the work here?

Assuming infinite RAM and disk, then yes, maintenance would not interfere
with daily life. A big problem we see is that users frequently don't have the
disk space to store a second copy of their packfiles on disk, even if we 
could organize a new packfile in reasonable time.

The MIDX is only invalid when a packfile it references is deleted.

The commit-graph is never invalid, except if a commit is deleted by GC. 
But even in that case, how did we "see" the commit ID? As long as we 
don't access these commits, the commit-graph feature doesn't violate 
expectations and can be generated asynchronously with a GC and repack.

>
>>      However, the multi-pack-index needs to be rewritten
>> +  in full every time. We can extend the format to be incremental, so
>> +  writes are fast. By storing a small "tip" multi-pack-index that
>> +  points to large "base" MIDX files, we can keep writes fast while
>> +  still reducing the number of binary searches required for object
>> +  lookups.
> So we can have multiple MIDX files? How would that work? Would there
> be a chunk that refers to other MIDX files?

We can have an optional chunk that refers to a list of "base" MIDX 
files. We then load that full list into multiple 'midxed_git' structs, 
and iterate through the list. VSTS keeps this list to a maximum length
of 3 (LARGE, Medium, tiny), merging files as necessary.

>
>> +- The reachability bitmap is currently paired directly with a single
>> +  packfile, using the pack-order as the object order to hopefully
>> +  compress the bitmaps well using run-length encoding. This could be
>> +  extended to pair a reachability bitmap with a multi-pack-index. If
>> +  the multi-pack-index is extended to store a "stable object order"
>> +  (a function Order(hash) = integer that is constant for a given hash,
> This stable object order doesn't fly well with integrating new packs?

When you integrate new packs, the lexicographic order changes as the new 
objects are inserted into the list. However, if we track a separate 
integer value (order[obj]) associated with the object, and keep that 
constant, we can track a stable order for an object across multiple 
generations of MIDX files. New objects are assigned order values larger 
than the previous order values. We can select a "good" ordering of the 
new objects as we extend the list.

This requires a new chunk in the file format. It also helps to store the 
reverse-lookup lex[i] which provides the lexicographic position of the 
object 'obj' with stable-order order[obj] == i.
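
To sketch the idea in C: suppose order[p] holds the stable-order value of
the object at lexicographic position p in the merged list, with a sentinel
for objects that are new in this generation. New objects get the next
unused values, and lex[] is then the inverse mapping. The sentinel, the
function names, and the choice to number new objects in lexicographic order
are all assumptions for illustration, not part of the proposed format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NO_ORDER UINT32_MAX /* hypothetical marker: object is new */

/*
 * order[p] is the stable-order value of the object at lexicographic
 * position p. Objects carried over from the previous MIDX keep their
 * value; new objects receive first_new, first_new + 1, ... so their
 * values are larger than every previously assigned one.
 */
static void sketch_assign_new_orders(uint32_t *order, size_t nr, uint32_t first_new)
{
	size_t p;

	for (p = 0; p < nr; p++)
		if (order[p] == SKETCH_NO_ORDER)
			order[p] = first_new++;
}

/*
 * Build the reverse lookup: lex[i] is the lexicographic position of the
 * object whose stable-order value is i. Requires order[] to be a
 * permutation of 0..nr-1.
 */
static void sketch_build_lex(const uint32_t *order, size_t nr, uint32_t *lex)
{
	size_t p;

	for (p = 0; p < nr; p++) {
		assert(order[p] < nr);
		lex[order[p]] = (uint32_t)p;
	}
}
```

A bitmap indexed by stable order would then stay valid across MIDX
rewrites, because bit i keeps referring to the same object even though its
lexicographic position moves.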

I'm being intentionally vague in this document to hint towards a 
valuable feature without giving robust details of something that may 
never get built. But, I do think this is one feature of the MIDX that 
would be of the most value for services that use Git as a server 
process, as it allows the reachability bitmap to be set to this stable 
order instead of a single pack order. This is speculation on my part, as 
I don't know how such servers are maintained in the background.

>
>> +  even as the multi-pack-index is updated) then a reachability bitmap
>> +  could point to a multi-pack-index and be updated independently.
>> +
>> +- Packfiles can be marked as "special" using empty files that share
>> +  the initial name but replace ".pack" with ".keep" or ".promisor".
>> +  We can add an optional chunk of data to the multi-pack-index that
>> +  records flags of information about the packfiles. This allows new
>> +  states, such as 'repacked' or 'redeltified', that can help with
>> +  pack maintenance in a multi-pack environment. It may also be
>> +  helpful to organize packfiles by object type (commit, tree, blob,
>> +  etc.) and use this metadata to help that maintenance.
>> +
>> +- The partial clone feature records special "promisor" packs that
>> +  may point to objects that are not stored locally, but available
>> +  on request to a server. The multi-pack-index does not currently
>> +  track these promisor packs.


* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-11 19:19   ` Stefan Beller
@ 2018-06-18 19:01     ` Derrick Stolee
  2018-06-18 19:41       ` Stefan Beller
  0 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:01 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/11/2018 3:19 PM, Stefan Beller wrote:
> Hi Derrick,
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> The multi-pack-index (MIDX) feature generalizes the existing pack-
>> index (IDX) feature by indexing objects across multiple pack-files.
>>
>> Describe the basic file format, using a 12-byte header followed by
>> a lookup table for a list of "chunks" which will be described later.
>> The file ends with a footer containing a checksum using the hash
>> algorithm.
>>
>> The header allows later versions to create breaking changes by
>> advancing the version number. We can also change the hash algorithm
>> using a different version value.
>>
>> We will add the individual chunk format information as we introduce
>> the code that writes that information.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/technical/pack-format.txt | 49 +++++++++++++++++++++++++
>>   1 file changed, 49 insertions(+)
>>
>> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
>> index 70a99fd142..17666b4bfc 100644
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -252,3 +252,52 @@ Pack file entry: <+
>>       corresponding packfile.
>>
>>       20-byte SHA-1-checksum of all of the above.
>> +
>> +== midx-*.midx files have the following format:
>> +
>> +The meta-index files refer to multiple pack-files and loose objects.
> So is it meta or multi?

Good catch. We were calling this the meta-index internally before 
changing to "multi-pack-index" (helps to not change the acronym).

>
>> +In order to allow extensions that add extra data to the MIDX, we organize
>> +the body into "chunks" and provide a lookup table at the beginning of the
>> +body. The header includes certain length values, such as the number of packs,
>> +the number of base MIDX files, hash lengths and types.
>> +
>> +All 4-byte numbers are in network order.
>> +
>> +HEADER:
>> +
>> +       4-byte signature:
>> +           The signature is: {'M', 'I', 'D', 'X'}
>> +
>> +       1-byte version number:
>> +           Git only writes or recognizes version 1
>> +
>> +       1-byte Object Id Version
>> +           Git only writes or recognizes verion 1 (SHA-1)
> s/verion/version/
>
>> +       1-byte number (C) of "chunks"
>> +
>> +       1-byte number (I) of base multi-pack-index files:
>> +           This value is currently always zero.
> Oh? Are meta-index and multi-index files different things?

Not intended to be different things, but this number is related to 
making the feature incremental.

>
>> +       4-byte number (P) of pack files
>> +
>> +CHUNK LOOKUP:
>> +
>> +       (C + 1) * 12 bytes providing the chunk offsets:
>> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
>> +           Other 8 bytes provide offset in current file for chunk to start.
>> +           (Chunks are provided in file-order, so you can infer the length
>> +           using the next chunk position if necessary.)
> It is so nice to have the header also have 12 bytes, so it fits right into the
> lookup table. So an alternative point of view:
>
>    If a chunk needs to store more than 8 bytes, we'll have an offset after
>    the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
>    of information directly after the 4 bytes.
>     "MIDX" is a special chunk and must come first (does it?) and only once
>    as it contains the version number.

This sounds feasible, but unnecessarily complicated. I don't think any 
other chunk will be this small.

>> +       The remaining data in the body is described one chunk at a time, and
>> +       these chunks may be given in any order. Chunks are required unless
>> +       otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> +       (This section intentionally left incomplete.)
>> +
>> +TRAILER:
>> +
>> +       H-byte HASH-checksum of all of the above.
> This means we have to rehash the whole file for updating its contents.
> okay.


* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-07 17:20   ` Duy Nguyen
@ 2018-06-18 19:23     ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:23 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 1:20 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
>> new file mode 100644
>> index 0000000000..2bd886f1a2
>> --- /dev/null
>> +++ b/Documentation/git-midx.txt
>> @@ -0,0 +1,29 @@
>> +git-midx(1)
>> +============
>> +
>> +NAME
>> +----
>> +git-midx - Write and verify multi-pack-indexes (MIDX files).
> No full stop. This head line is collected automatically with others
> and its having a full stop while the rest does not looks strange.
>
>> diff --git a/builtin/midx.c b/builtin/midx.c
>> new file mode 100644
>> index 0000000000..59ea92178f
>> --- /dev/null
>> +++ b/builtin/midx.c
>> @@ -0,0 +1,38 @@
>> +#include "builtin.h"
>> +#include "cache.h"
>> +#include "config.h"
>> +#include "git-compat-util.h"
> You only need either cache.h or git-compat-util.h. If cache.h is here,
> git-compat-util can be removed.
>
>> +#include "parse-options.h"
>> +
>> +static char const * const builtin_midx_usage[] ={
>> +       N_("git midx [--object-dir <dir>]"),
>> +       NULL
>> +};
>> +
>> +static struct opts_midx {
>> +       const char *object_dir;
>> +} opts;
>> +
>> +int cmd_midx(int argc, const char **argv, const char *prefix)
>> +{
>> +       static struct option builtin_midx_options[] = {
>> +               { OPTION_STRING, 0, "object-dir", &opts.object_dir,
> For paths (including dir), OPTION_FILENAME may be a better option to
> handle correctly when the command is run in a subdir. See df217ed643
> (parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23)
> for more info.
Thanks for the pointer!

>
>> +                 N_("dir"),
>> +                 N_("The object directory containing set of packfile and pack-index pairs.") },
> Other help strings do not have full stop either (I only checked a
> couple commands though)
>
> Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME
> for some reason)?
>
>> +               OPT_END(),
>> +       };
>> +
>> +       if (argc == 2 && !strcmp(argv[1], "-h"))
>> +               usage_with_options(builtin_midx_usage, builtin_midx_options);
>> +
>> +       git_config(git_default_config, NULL);
>> +
>> +       argc = parse_options(argc, argv, prefix,
>> +                            builtin_midx_options,
>> +                            builtin_midx_usage, 0);
>> +
>> +       if (!opts.object_dir)
>> +               opts.object_dir = get_object_directory();
>> +
>> +       return 0;
>> +}
>> diff --git a/git.c b/git.c
>> index c2f48d53dd..400fadd677 100644
>> --- a/git.c
>> +++ b/git.c
>> @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
>>          { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>>          { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>>          { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
>> +       { "midx", cmd_midx, RUN_SETUP },
> If it's plumbing and can take an --object-dir, then I don't think
> you should require it to run in a repo (with RUN_SETUP).
> RUN_SETUP_GENTLY may be better. You could even leave it empty here and
> call setup_git_directory() only when --object-dir is not set.

I agree. Good point. This could be run to maintain an alternate without 
any .git folder.

>
>>          { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
>>          { "mktree", cmd_mktree, RUN_SETUP },
>>          { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },



* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-11 21:02   ` Stefan Beller
@ 2018-06-18 19:40     ` Derrick Stolee
  2018-06-18 19:55       ` Stefan Beller
  0 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:40 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/11/2018 5:02 PM, Stefan Beller wrote:
> Hi Derrick,
> On Thu, Jun 7, 2018 at 7:03 AM Derrick Stolee <stolee@gmail.com> wrote:
>> This new 'git midx' builtin will be the plumbing access for writing,
>> reading, and checking multi-pack-index (MIDX) files. The initial
>> implementation is a no-op.
> Let's talk about the name for a second:
>
> .idx files are written by git-index-pack or as part of
> git-pack-objects (which just calls write_idx_file as part
> of finish_tmp_packfile), and the name actually suggests
> it writes the index files. I have a hard time understanding
> what the git-midx command does[1].
>
> With both commit graph as well as multi index we introduce
> a command that is centered around that concept (similar to
> git-remote or git-config that are centered around a concept,
> that is closely resembled by a file), but for indexes for packs
> it was integrated differently into Git. So I am not sure if I want
> to suggest to integrate it into the packfile commands as that
> doesn't really fit. But maybe we can have a name that is human
> readable instead of the file suffix? Maybe
>
>    git multi-pack-index ?
>
> I suppose that eventually this command is not really used by
> users as it will be used by other porcelain commands in the
> background or even as part of repack/gc so I am not worried
> about a long name, but I'd be more worried about understandability.

I'll use "git multi-pack-index" in v2. I'll keep "midx.c" in the root, 
though, if that is OK.

> [1] While these names are not perfect for the layman, it is okay?
>    I am sure you are aware of https://git-man-page-generator.lokaltog.net/

I was not, and enjoyed that quite a bit.

Thanks,
-Stolee

>
>
>> new file mode 100644
>> index 0000000000..2bd886f1a2
>> --- /dev/null
>> +++ b/Documentation/git-midx.txt
>> @@ -0,0 +1,29 @@
>> +git-midx(1)
>> +============
>> +
>> +NAME
>> +----
>> +git-midx - Write and verify multi-pack-indexes (MIDX files).
> The reading is done as part of all other commands.

I like to think the 'read' verb is a subset of "verify" because we are 
checking for information about the MIDX, and mostly for tests or debugging.

>
>> +
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'git midx' [--object-dir <dir>]
>> +
>> +DESCRIPTION
>> +-----------
>> +Write or verify a MIDX file.
>> +
>> +OPTIONS
>> +-------
>> +
>> +--object-dir <dir>::
>> +       Use given directory for the location of Git objects. We check
>> +       <dir>/packs/multi-pack-index for the current MIDX file, and
>> +       <dir>/packs for the pack-files to index.
>> +
>> +
> Maybe we could have a SEE ALSO section that points at
> the explanation of multi index files?
> (c.f. man git-submodule that has a  SEE ALSO
> gitsubmodules(7), gitmodules(5) explaining concepts(7)
> and the file(5))
>
> But as this is plumbing and users should not need to worry about it
> this is optional, I would think.

The design document is also in 'Documentation/technical' instead of just 
'Documentation/'. Do we have a pattern of linking to the technical 
documents?

Thanks,
-Stolee


* Re: [PATCH 02/23] midx: add midx format details to pack-format.txt
  2018-06-18 19:01     ` Derrick Stolee
@ 2018-06-18 19:41       ` Stefan Beller
  0 siblings, 0 replies; 62+ messages in thread
From: Stefan Beller @ 2018-06-18 19:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

> >> +       (C + 1) * 12 bytes providing the chunk offsets:
> >> +           First 4 bytes describe chunk id. Value 0 is a terminating label.
> >> +           Other 8 bytes provide offset in current file for chunk to start.
> >> +           (Chunks are provided in file-order, so you can infer the length
> >> +           using the next chunk position if necessary.)
> > It is so nice to have the header also have 12 bytes, so it fits right into the
> > lookup table. So an alternative point of view:
> >
> >    If a chunk needs to store more than 8 bytes, we'll have an offset after
> >    the first 4 bytes that describe the chunk, otherwise you can store the 8 bytes
> >    of information directly after the 4 bytes.
> >     "MIDX" is a special chunk and must come first (does it?) and only once
> >    as it contains the version number.
>
> This sounds feasible, but unnecessarily complicated. I don't think any
> other chunk will be this small.

I was just writing it as a way to test if I really understood what you said
in the doc, not as a suggestion to incorporate it.
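
To make the lookup-table reading above concrete, here is a minimal sketch of
how a reader might walk the (C + 1) * 12-byte table and infer a chunk's length
from the next entry's offset. The helper names (read_be32, read_be64,
find_chunk) and the constant CHUNK_LOOKUP_WIDTH are hypothetical and are not
taken from the patch; the sketch only assumes the layout quoted above (4-byte
id, 8-byte offset, entries in file order, one terminating entry).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_LOOKUP_WIDTH 12 /* 4-byte chunk id + 8-byte file offset */

/* Read a 32-bit big-endian integer from p. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit big-endian integer from p. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Search a lookup table holding num_chunks entries followed by one
 * terminating entry. On a match, store the chunk's offset and its
 * length (next entry's offset minus this one) and return 1.
 * Return 0 if no entry carries the requested id.
 */
static int find_chunk(const unsigned char *table, unsigned num_chunks,
		      uint32_t id, uint64_t *offset, uint64_t *length)
{
	unsigned i;

	assert(table && offset && length);
	for (i = 0; i < num_chunks; i++) {
		const unsigned char *entry =
			table + (size_t)i * CHUNK_LOOKUP_WIDTH;

		if (read_be32(entry) != id)
			continue;
		*offset = read_be64(entry + 4);
		/* The terminating entry guarantees entry i + 1 exists. */
		*length = read_be64(entry + CHUNK_LOOKUP_WIDTH + 4) - *offset;
		return 1;
	}
	return 0;
}
```

Because the terminating entry records where the last chunk ends, no separate
length field is needed, which is what the parenthetical in the format
description relies on.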

Stefan


* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-18 19:40     ` Derrick Stolee
@ 2018-06-18 19:55       ` Stefan Beller
  2018-06-18 19:58         ` Derrick Stolee
  0 siblings, 1 reply; 62+ messages in thread
From: Stefan Beller @ 2018-06-18 19:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

> > But as this is plumbing and users should not need to worry about it
> > this is optional, I would think.
>
> The design document is also in 'Documentation/technical' instead of just
> 'Documentation/'. Do we have a pattern of linking to the technical
> documents?

Apparently we do (and I was not aware of it):

    $ git -C Documentation/ grep link:technical
    git-credential.txt:23:link:technical/api-credentials.html[the Git
credential API] for more
    git.txt:839:link:technical/api-index.html[Git API documentation].
    gitcredentials.txt:184:link:technical/api-credentials.html[credentials
API] for details.
    technical/http-protocol.txt:517:link:technical/pack-protocol.html
    technical/http-protocol.txt:518:link:technical/protocol-capabilities.html
    user-manual.txt:3220:found in link:technical/pack-format.html[pack format].


* Re: [PATCH 03/23] midx: add midx builtin
  2018-06-18 19:55       ` Stefan Beller
@ 2018-06-18 19:58         ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-18 19:58 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Jonathan Tan, Martin Fick

On 6/18/2018 3:55 PM, Stefan Beller wrote:
>>> But as this is plumbing and users should not need to worry about it
>>> this is optional, I would think.
>> The design document is also in 'Documentation/technical' instead of just
>> 'Documentation/'. Do we have a pattern of linking to the technical
>> documents?
> Apparently we do (and I was not aware of it):
>
>      $ git -C Documentation/ grep link:technical
>      git-credential.txt:23:link:technical/api-credentials.html[the Git
> credential API] for more
>      git.txt:839:link:technical/api-index.html[Git API documentation].
>      gitcredentials.txt:184:link:technical/api-credentials.html[credentials
> API] for details.
>      technical/http-protocol.txt:517:link:technical/pack-protocol.html
>      technical/http-protocol.txt:518:link:technical/protocol-capabilities.html
>      user-manual.txt:3220:found in link:technical/pack-format.html[pack format].

Thanks! I'll add some links.


* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-12 15:00   ` Duy Nguyen
@ 2018-06-19 12:54     ` Derrick Stolee
  2018-06-19 14:59       ` Duy Nguyen
  0 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-19 12:54 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/12/2018 11:00 AM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/midx.c b/midx.c
>> index 616af66b13..3e55422a21 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -1,9 +1,62 @@
>>   #include "git-compat-util.h"
>>   #include "cache.h"
>>   #include "dir.h"
>> +#include "csum-file.h"
>> +#include "lockfile.h"
>>   #include "midx.h"
>>
>> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>> +#define MIDX_VERSION 1
>> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
> ...
>> +static size_t write_midx_header(struct hashfile *f,
>> +                               unsigned char num_chunks,
>> +                               uint32_t num_packs)
>> +{
>> +       char byte_values[4];
>> +       hashwrite_be32(f, MIDX_SIGNATURE);
>> +       byte_values[0] = MIDX_VERSION;
>> +       byte_values[1] = MIDX_HASH_VERSION;
> Quoting from "State of NewHash work, future directions, and discussion" [1]
>
> * If you need to serialize an algorithm identifier into your data
>    format, use the format_id field of struct git_hash_algo.  It's
>    designed specifically for that purpose.
>
> [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60

Thanks! I'll also use the_hash_algo->rawsz to infer the length of the 
hash function.
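
For reference, a rough sketch of the 12-byte header with a 1-byte hash
version, and of deriving the hash length from that byte rather than
hard-coding it. This is not the patch's code: fill_midx_header and
hash_len_for_version are hypothetical names, it writes into a plain buffer
instead of a hashfile, and the placement of num_chunks and the padding byte
(bytes 6 and 7) is an assumption, since the quoted hunk stops after
byte_values[1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define MIDX_HASH_VERSION 1 /* SHA-1 */
#define MIDX_HEADER_SIZE 12

/* Store v at p as a 32-bit big-endian integer. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Hypothetical helper: raw hash length in bytes for a 1-byte hash
 * version, or 0 if the version is not recognized.
 */
static size_t hash_len_for_version(unsigned char hash_version)
{
	return hash_version == MIDX_HASH_VERSION ? 20 : 0;
}

/* Fill out[0..11] with the header and return the bytes written. */
static size_t fill_midx_header(unsigned char *out,
			       unsigned char num_chunks,
			       uint32_t num_packs)
{
	assert(out);
	put_be32(out, MIDX_SIGNATURE);
	out[4] = MIDX_VERSION;
	out[5] = MIDX_HASH_VERSION;
	out[6] = num_chunks;	/* assumed position */
	out[7] = 0;		/* assumed padding byte */
	put_be32(out + 8, num_packs);
	return MIDX_HEADER_SIZE;
}
```

A reader can then size the trailer and the object-id entries from
hash_len_for_version(out[5]) and reject the file when that returns 0.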


* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-19 12:54     ` Derrick Stolee
@ 2018-06-19 14:59       ` Duy Nguyen
  2018-06-19 15:24         ` Derrick Stolee
  0 siblings, 1 reply; 62+ messages in thread
From: Duy Nguyen @ 2018-06-19 14:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Tue, Jun 19, 2018 at 2:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/12/2018 11:00 AM, Duy Nguyen wrote:
> > On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
> >> diff --git a/midx.c b/midx.c
> >> index 616af66b13..3e55422a21 100644
> >> --- a/midx.c
> >> +++ b/midx.c
> >> @@ -1,9 +1,62 @@
> >>   #include "git-compat-util.h"
> >>   #include "cache.h"
> >>   #include "dir.h"
> >> +#include "csum-file.h"
> >> +#include "lockfile.h"
> >>   #include "midx.h"
> >>
> >> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> >> +#define MIDX_VERSION 1
> >> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
> > ...
> >> +static size_t write_midx_header(struct hashfile *f,
> >> +                               unsigned char num_chunks,
> >> +                               uint32_t num_packs)
> >> +{
> >> +       char byte_values[4];
> >> +       hashwrite_be32(f, MIDX_SIGNATURE);
> >> +       byte_values[0] = MIDX_VERSION;
> >> +       byte_values[1] = MIDX_HASH_VERSION;
> > Quoting from "State of NewHash work, future directions, and discussion" [1]
> >
> > * If you need to serialize an algorithm identifier into your data
> >    format, use the format_id field of struct git_hash_algo.  It's
> >    designed specifically for that purpose.
> >
> > [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60
>
> Thanks! I'll also use the_hash_algo->rawsz to infer the length of the
> hash function.

BTW, since you're the author of commit-graph.c, you may notice it has
the same problem. Don't touch that code. Brian already has some WIP
changes [1]. We just need to make sure new code does not add extra work
for him. I expect he'll send all those patches out soon.

[1] https://github.com/bk2204/git/commit/3f9031e06cfb21534eb7dfff7b54e7598ac1149f

-- 
Duy


* Re: [PATCH 05/23] midx: write header information to lockfile
  2018-06-19 14:59       ` Duy Nguyen
@ 2018-06-19 15:24         ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-19 15:24 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/19/2018 10:59 AM, Duy Nguyen wrote:
> On Tue, Jun 19, 2018 at 2:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 6/12/2018 11:00 AM, Duy Nguyen wrote:
>>> On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>>>> diff --git a/midx.c b/midx.c
>>>> index 616af66b13..3e55422a21 100644
>>>> --- a/midx.c
>>>> +++ b/midx.c
>>>> @@ -1,9 +1,62 @@
>>>>    #include "git-compat-util.h"
>>>>    #include "cache.h"
>>>>    #include "dir.h"
>>>> +#include "csum-file.h"
>>>> +#include "lockfile.h"
>>>>    #include "midx.h"
>>>>
>>>> +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>>>> +#define MIDX_VERSION 1
>>>> +#define MIDX_HASH_VERSION 1 /* SHA-1 */
>>> ...
>>>> +static size_t write_midx_header(struct hashfile *f,
>>>> +                               unsigned char num_chunks,
>>>> +                               uint32_t num_packs)
>>>> +{
>>>> +       char byte_values[4];
>>>> +       hashwrite_be32(f, MIDX_SIGNATURE);
>>>> +       byte_values[0] = MIDX_VERSION;
>>>> +       byte_values[1] = MIDX_HASH_VERSION;
>>> Quoting from "State of NewHash work, future directions, and discussion" [1]
>>>
>>> * If you need to serialize an algorithm identifier into your data
>>>     format, use the format_id field of struct git_hash_algo.  It's
>>>     designed specifically for that purpose.
>>>
>>> [1] https://public-inbox.org/git/20180612024252.GA141166@aiede.svl.corp.google.com/T/#m5fdd09dcaf31266c45343fb6c0beaaa3e928bc60
>> Thanks! I'll also use the_hash_algo->rawsz to infer the length of the
>> hash function.
> BTW, since you're the author of commit-graph.c, you may notice it has
> the same problem. Don't touch that code. Brian already has some WIP
> changes [1]. We just need to make sure new code does not add extra work
> for him. I expect he'll send all those patches out soon.
>
> [1] https://github.com/bk2204/git/commit/3f9031e06cfb21534eb7dfff7b54e7598ac1149f

Thanks for the link. It seems he is creating an oid_version() method 
that returns a 1-byte version for the hash version instead of the 4-byte 
signature of the_hash_algo->format_id. I look forward to incorporating 
that into the MIDX format. I'll keep my macros for now, as we work out 
the other details, and while Brian's patch is cooking.

Thanks,
-Stolee


* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 17:54   ` Duy Nguyen
@ 2018-06-20 13:13     ` Derrick Stolee
  0 siblings, 0 replies; 62+ messages in thread
From: Derrick Stolee @ 2018-06-20 13:13 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 1:54 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> As we build the multi-pack-index feature by adding chunks at a time,
>> we want to test that the data is being written correctly.
>>
>> Create struct midxed_git to store an in-memory representation of a
> A word play on 'packed_git'? Amusing. Some more descriptive name would
> be better though. midxed looks almost like random letters thrown
> together.

I'll use 'struct multi_pack_index'.

>
>> multi-pack-index and a memory-map of the binary file. Initialize this
>> struct in load_midxed_git(object_dir).
>> +static int read_midx_file(const char *object_dir)
>> +{
>> +       struct midxed_git *m = load_midxed_git(object_dir);
>> +
>> +       if (!m)
>> +               return 0;
> This looks like an error case, please don't just return zero,
> typically used to say "success". I don't know if this command stays
> "for debugging purposes" until the end. Of course in that case it does
> not really matter.

It is intended for debugging and testing. Generally, it is not an error 
to not have a MIDX in an object directory.

>> +struct midxed_git *load_midxed_git(const char *object_dir)
>> +{
>> +       struct midxed_git *m;
>> +       int fd;
>> +       struct stat st;
>> +       size_t midx_size;
>> +       void *midx_map;
>> +       const char *midx_name = get_midx_filename(object_dir);
> mem leak? This function returns allocated memory if I remember correctly.
>
>> +
>> +       fd = git_open(midx_name);
>> +       if (fd < 0)
>> +               return NULL;
> do an error_errno() so we know what went wrong at least.
>
>> +       if (fstat(fd, &st)) {
>> +               close(fd);
>> +               return NULL;
> same here, we should know why fstat() fails.
>
>> +       }
>> +       midx_size = xsize_t(st.st_size);
>> +
>> +       if (midx_size < MIDX_MIN_SIZE) {
>> +               close(fd);
>> +               die("multi-pack-index file %s is too small", midx_name);
> _()
>
> The use of die() should be discouraged though. Many people still try
> (or wish) to libify code and new die() does not help. I think error()
> here would be enough then you can return NULL. Or you can go fancier
> and store the error string in a strbuf like refs code.
>
>> +       }
>> +
>> +       midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
>> +
>> +       m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
>> +       strcpy(m->object_dir, object_dir);
>> +       m->data = midx_map;
>> +
>> +       m->signature = get_be32(m->data);
>> +       if (m->signature != MIDX_SIGNATURE) {
>> +               error("multi-pack-index signature %X does not match signature %X",
>> +                     m->signature, MIDX_SIGNATURE);
> _(). Maybe 0x%08x instead of %x
>
>> +               goto cleanup_fail;
>> +       }
>> +
>> +       m->version = *(m->data + 4);
> m->data[4] instead? shorter and easier to understand.
>
> Same comment on "*(m->data + x)" and error() without _() for the rest.
>
>> +       if (m->version != MIDX_VERSION) {
>> +               error("multi-pack-index version %d not recognized",
>> +                     m->version);
> _()
>> +               goto cleanup_fail;
>> +       }
>> +
>> +       m->hash_version = *(m->data + 5);
> m->data[5]
>
>> +cleanup_fail:
>> +       FREE_AND_NULL(m);
>> +       munmap(midx_map, midx_size);
>> +       close(fd);
>> +       exit(1);
> It's bad enough that you die() but exit() in this code seems too much.
> Please just return NULL and let the caller handle the error.

Will do.
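
Roughly, the shape I take from the comments above is: report the problem,
clean up, and return NULL so the caller decides what to do. A minimal
standalone sketch follows; it is not the v2 code. report_error stands in for
Git's error(), and parse_midx_header, struct midx_header, and
SKETCH_HEADER_SIZE are hypothetical names. It parses an in-memory buffer so
it can run without the mmap and file handling, and it omits the _()
translation markers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
#define SKETCH_HEADER_SIZE 12 /* assumed minimum size for this sketch */

struct midx_header {
	uint32_t signature;
	unsigned char version;
	unsigned char hash_version;
};

/* Stand-in for Git's error(): print a message to stderr, return -1. */
static int report_error(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	fputs("error: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	va_end(ap);
	return -1;
}

/* Parse the header; return NULL (never die or exit) on any failure. */
static struct midx_header *parse_midx_header(const unsigned char *data,
					     size_t len)
{
	struct midx_header *h;

	assert(data);
	if (len < SKETCH_HEADER_SIZE) {
		report_error("multi-pack-index data is too small (%lu bytes)",
			     (unsigned long)len);
		return NULL;
	}

	h = calloc(1, sizeof(*h));
	if (!h) {
		report_error("out of memory");
		return NULL;
	}

	h->signature = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
		       ((uint32_t)data[2] << 8) | (uint32_t)data[3];
	if (h->signature != MIDX_SIGNATURE) {
		report_error("multi-pack-index signature 0x%08x does not match signature 0x%08x",
			     (unsigned)h->signature, (unsigned)MIDX_SIGNATURE);
		goto cleanup_fail;
	}

	h->version = data[4];
	if (h->version != MIDX_VERSION) {
		report_error("multi-pack-index version %d not recognized",
			     h->version);
		goto cleanup_fail;
	}

	h->hash_version = data[5];
	return h;

cleanup_fail:
	free(h);
	return NULL;
}
```

With this shape, a caller such as the 'read' subcommand can treat NULL as
"no usable MIDX" and fall back to the IDX files instead of aborting.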

>
>> diff --git a/midx.h b/midx.h
>> index 3a63673952..a1d18ed991 100644
>> --- a/midx.h
>> +++ b/midx.h
>> @@ -1,4 +1,13 @@
>> +#ifndef MIDX_H
>> +#define MIDX_H
>> +
>> +#include "git-compat-util.h"
>>   #include "cache.h"
>> +#include "object-store.h"
> I don't really think you need object-store here (git-compat-util.h
> too). "struct midxed_git;" would be enough for load_midxed_git
> declaration below.
>
>>   #include "packfile.h"
>>
>> +struct midxed_git *load_midxed_git(const char *object_dir);
>> +
>>   int write_midx_file(const char *object_dir);
>> +
>> +#endif
>> diff --git a/object-store.h b/object-store.h
>> index d683112fd7..77cb82621a 100644
>> --- a/object-store.h
>> +++ b/object-store.h
>> @@ -84,6 +84,25 @@ struct packed_git {
>>          char pack_name[FLEX_ARRAY]; /* more */
>>   };
>>
>> +struct midxed_git {
>> +       struct midxed_git *next;
> Do we really have multiple midx files?

There is one per object directory currently, but you may have one 
locally and one in each of your alternates. I do need to double-check 
that we populate this list later in the series. (And I'll remove it from 
this commit and save it for when it is needed.)

>
>> +
>> +       int fd;
>> +
>> +       const unsigned char *data;
>> +       size_t data_len;
>> +
>> +       uint32_t signature;
>> +       unsigned char version;
>> +       unsigned char hash_version;
>> +       unsigned char hash_len;
>> +       unsigned char num_chunks;
>> +       uint32_t num_packs;
>> +       uint32_t num_objects;
>> +
>> +       char object_dir[FLEX_ARRAY];
> Why do you need to keep object_dir when it could be easily retrieved
> when the repo is available?
>
>> +};
>> +
>>   struct raw_object_store {
>>          /*
>>           * Path to the repository's object store.



* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-07 18:31   ` Duy Nguyen
@ 2018-06-20 13:33     ` Derrick Stolee
  2018-06-20 15:07       ` Duy Nguyen
  0 siblings, 1 reply; 62+ messages in thread
From: Derrick Stolee @ 2018-06-20 13:33 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On 6/7/2018 2:31 PM, Duy Nguyen wrote:
> On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
>> index dcaeb1a91b..919283fdd8 100644
>> --- a/Documentation/git-midx.txt
>> +++ b/Documentation/git-midx.txt
>> @@ -23,6 +23,11 @@ OPTIONS
>>          <dir>/packs/multi-pack-index for the current MIDX file, and
>>          <dir>/packs for the pack-files to index.
>>
>> +read::
>> +       When given as the verb, read the current MIDX file and output
>> +       basic information about its contents. Used for debugging
>> +       purposes only.
> On second thought. If you just need a temporary debugging interface,
> adding a program in t/helper may be a better option. In the end we
> might still need 'read' to dump a file out, but we should have some
> stable output format (and json might be a good choice).

My intention with this 'read' pattern in the MIDX (and commit-graph) is 
two-fold:

1. We can test that we are writing the correct data in our test suite. A 
test-tool builtin would suffice for this purpose.

2. We can help trouble-shoot users who may be having trouble with their 
MIDX files. Having the subcommand in a plumbing command allows us to do 
this in the shipped versions of Git.

Maybe this second purpose isn't enough to justify the feature in Git and 
we should move this to the test-tool, especially with the 'verify' mode 
coming in a second series. Note that a 'verify' mode doesn't satisfy 
item (1).

Thanks,
-Stolee


* Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
  2018-06-20 13:33     ` Derrick Stolee
@ 2018-06-20 15:07       ` Duy Nguyen
  0 siblings, 0 replies; 62+ messages in thread
From: Duy Nguyen @ 2018-06-20 15:07 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Stefan Beller, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jonathan Tan, Martin Fick

On Wed, Jun 20, 2018 at 3:33 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/7/2018 2:31 PM, Duy Nguyen wrote:
> > On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee <stolee@gmail.com> wrote:
> >> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> >> index dcaeb1a91b..919283fdd8 100644
> >> --- a/Documentation/git-midx.txt
> >> +++ b/Documentation/git-midx.txt
> >> @@ -23,6 +23,11 @@ OPTIONS
> >>          <dir>/packs/multi-pack-index for the current MIDX file, and
> >>          <dir>/packs for the pack-files to index.
> >>
> >> +read::
> >> +       When given as the verb, read the current MIDX file and output
> >> +       basic information about its contents. Used for debugging
> >> +       purposes only.
> > On second thought. If you just need a temporary debugging interface,
> > adding a program in t/helper may be a better option. In the end we
> > might still need 'read' to dump a file out, but we should have some
> > stable output format (and json might be a good choice).
>
> My intention with this 'read' pattern in the MIDX (and commit-graph) is
> two-fold:
>
> 1. We can test that we are writing the correct data in our test suite. A
> test-tool builtin would suffice for this purpose.
>
> 2. We can help trouble-shoot users who may be having trouble with their
> MIDX files. Having the subcommand in a plumbing command allows us to do
> this in the shipped versions of Git.
>
> Maybe this second purpose isn't enough to justify the feature in Git and
> we should move this to the test-tool, especially with the 'verify' mode
> coming in a second series. Note that a 'verify' mode doesn't satisfy
> item (1).

Yeah I think normally we just have some "fsck" thing to verify when
things go bad. If you need more than that I think you just ask the
user to send the .midx to you (with full understanding of potentially
revealing confidential info and stuff). It'll be faster than
instructing them to "run this command", "ok, run another command"....
I thought of suggesting a command to dump the midx file in readable
form (like json), but I think if fsck fails then chances of that
command successfully dumping may be very low.

Either way, if the command is meant for troubleshooting, I think it
should be added at the end when the whole midx file is implemented and
understood and we see what we need to troubleshoot. Adding small
pieces of changes from patch to patch makes it really hard to see if
it helps troubleshooting at all; it only serves the first purpose.
-- 
Duy


Thread overview: 62+ messages
2018-06-07 14:03 [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
2018-06-07 14:03 ` [PATCH 01/23] midx: add design document Derrick Stolee
2018-06-11 19:04   ` Stefan Beller
2018-06-18 18:48     ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 02/23] midx: add midx format details to pack-format.txt Derrick Stolee
2018-06-11 19:19   ` Stefan Beller
2018-06-18 19:01     ` Derrick Stolee
2018-06-18 19:41       ` Stefan Beller
2018-06-07 14:03 ` [PATCH 03/23] midx: add midx builtin Derrick Stolee
2018-06-07 17:20   ` Duy Nguyen
2018-06-18 19:23     ` Derrick Stolee
2018-06-11 21:02   ` Stefan Beller
2018-06-18 19:40     ` Derrick Stolee
2018-06-18 19:55       ` Stefan Beller
2018-06-18 19:58         ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 04/23] midx: add 'write' subcommand and basic wiring Derrick Stolee
2018-06-07 17:27   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 05/23] midx: write header information to lockfile Derrick Stolee
2018-06-07 17:35   ` Duy Nguyen
2018-06-12 15:00   ` Duy Nguyen
2018-06-19 12:54     ` Derrick Stolee
2018-06-19 14:59       ` Duy Nguyen
2018-06-19 15:24         ` Derrick Stolee
2018-06-07 14:03 ` [PATCH 06/23] midx: struct midxed_git and 'read' subcommand Derrick Stolee
2018-06-07 17:54   ` Duy Nguyen
2018-06-20 13:13     ` Derrick Stolee
2018-06-07 18:31   ` Duy Nguyen
2018-06-20 13:33     ` Derrick Stolee
2018-06-20 15:07       ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 07/23] midx: expand test data Derrick Stolee
2018-06-07 14:03 ` [PATCH 08/23] midx: read packfiles from pack directory Derrick Stolee
2018-06-07 18:03   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 09/23] midx: write pack names in chunk Derrick Stolee
2018-06-07 18:26   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 10/23] midx: write a lookup into the pack names chunk Derrick Stolee
2018-06-09 16:43   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 11/23] midx: sort and deduplicate objects from packfiles Derrick Stolee
2018-06-09 17:07   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 12/23] midx: write object ids in a chunk Derrick Stolee
2018-06-09 17:25   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 13/23] midx: write object id fanout chunk Derrick Stolee
2018-06-09 17:28   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 14/23] midx: write object offsets Derrick Stolee
2018-06-09 17:41   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 15/23] midx: create core.midx config setting Derrick Stolee
2018-06-07 14:03 ` [PATCH 16/23] midx: prepare midxed_git struct Derrick Stolee
2018-06-09 17:47   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 17/23] midx: read objects from multi-pack-index Derrick Stolee
2018-06-09 17:56   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 18/23] midx: use midx in abbreviation calculations Derrick Stolee
2018-06-09 18:01   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 19/23] midx: use existing midx when writing new one Derrick Stolee
2018-06-07 14:03 ` [PATCH 20/23] midx: use midx in approximate_object_count Derrick Stolee
2018-06-09 18:03   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 21/23] midx: prevent duplicate packfile loads Derrick Stolee
2018-06-09 18:05   ` Duy Nguyen
2018-06-07 14:03 ` [PATCH 22/23] midx: use midx to find ref-deltas Derrick Stolee
2018-06-07 14:03 ` [PATCH 23/23] midx: clear midx on repack Derrick Stolee
2018-06-09 18:13   ` Duy Nguyen
2018-06-07 14:06 ` [PATCH 00/23] Multi-pack-index (MIDX) Derrick Stolee
2018-06-07 14:45 ` Ævar Arnfjörð Bjarmason
2018-06-07 14:54   ` Derrick Stolee
