[RFC PATCH 00/18] Multi-pack index (MIDX)

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC PATCH 00/18] Multi-pack index (MIDX)
@ 2018-01-07 18:14 Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
                   ` (19 more replies)
  0 siblings, 20 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
          MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

* Design a proper command-line interface for constructing and checking
  MIDX files. The current "midx" builtin is probably inadequate.

* Determine what additional changes are needed before the feature can
  be merged. Specifically, I'm interested in the interactions with
  repack and fsck. The current patch also does not update the MIDX on
  a fetch (which adds a packfile) but would be valuable. Whenever
  possible, I tried to leave out features that could be added in a
  later patch.

* Consider splitting this patch into multiple patches, such as:

    i. The MIDX design document.
   ii. The command-line interface for building and reading MIDX files.
  iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following comand:

    git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+----------
        1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |  +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

    git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache workes very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

    git midx --write --update-head --delete-expired --pack-dir=<alt>

As that release deploys we will gather more specific numbers on the
performance improvements and report them in this thread.

Derrick Stolee (18):
  docs: Multi-Pack Index (MIDX) Design Notes
  midx: specify midx file format
  midx: create core.midx config setting
  midx: write multi-pack indexes for an object list
  midx: create midx builtin with --write mode
  midx: add t5318-midx.sh test script
  midx: teach midx --write to update midx-head
  midx: teach git-midx to read midx file details
  midx: find details of nth object in midx
  midx: use existing midx when writing
  midx: teach git-midx to clear midx files
  midx: teach git-midx to delete expired files
  t5318-midx.h: confirm git actions are stable
  midx: load midx files when loading packs
  midx: use midx for approximate object count
  midx: nth_midxed_object_oid() and bsearch_midx()
  sha1_name: use midx for abbreviations
  packfile: use midx for object loads

 .gitignore                                   |   1 +
 Documentation/config.txt                     |   3 +
 Documentation/git-midx.txt                   | 106 ++++
 Documentation/technical/multi-pack-index.txt | 149 +++++
 Documentation/technical/pack-format.txt      |  85 +++
 Makefile                                     |   2 +
 builtin.h                                    |   1 +
 builtin/midx.c                               | 352 +++++++++++
 cache.h                                      |   1 +
 command-list.txt                             |   1 +
 config.c                                     |   5 +
 environment.c                                |   2 +
 git.c                                        |   1 +
 midx.c                                       | 850 +++++++++++++++++++++++++++
 midx.h                                       | 136 +++++
 packfile.c                                   |  79 ++-
 packfile.h                                   |   2 +
 sha1_name.c                                  |  70 ++-
 t/t5318-midx.sh                              | 189 ++++++
 19 files changed, 2020 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 Documentation/technical/multi-pack-index.txt
 create mode 100644 builtin/midx.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5318-midx.sh

-- 
2.15.0

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-08 19:32   ` Jonathan Tan
  2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Commentary: This file format uses the large offsets from the pack-index
version 2 format, but drops the CRC32 hashes from that format.

Also: I included the HASH footer at the end only because it is already in
the pack and pack-index formats, but not because it is particularly useful
here. If possible, I'd like to remove it and speed up MIDX writes.

-- >8 --

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 149 +++++++++++++++++++++++++++
 1 file changed, 149 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d31b03dec5
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short, with suffix ".midx")
+stores a list of objects and their offsets into multiple pack-
+files. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+A new config setting 'core.midx' must be enabled before writing
+or reading MIDX files.
+
+The MIDX files are updated by the 'midx' builtin with the
+following common parameter combinations:
+
+- 'git midx' gives the hash of the current MIDX head.
+- 'git midx --write --update-head --delete-expired' writes a new
+  MIDX file, points the MIDX head to that file, and deletes the
+  existing MIDX file if out-of-date.
+- 'git midx --read' lists some basic information about the current
+  MIDX head. Used for basic tests.
+- 'git midx --clear' deletes the current MIDX head.
+
+Design Details
+--------------
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.
+
+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+- If an object appears in multiple packfiles, then only one copy
+  is stored in the MIDX. This has a possible performance issue:
+  If an object appears as the delta-base of multiple objects from
+  multiple packs, then cross-pack delta calculations may slow down.
+  This is currently only theoretical and has not been demonstrated
+  to be a measurable issue.
+
+Current Limitations
+-------------------
+
+- MIDX files are managed only by the midx builtin and is not
+  automatically updated on clone or fetch.
+
+- There is no '--verify' option for the midx builtin to verify
+  the contents of the MIDX file against the pack contents.
+
+- Constructing a MIDX file currently requires the single-pack
+  index for every pack being added to the MIDX.
+
+- The fsck builtin does not check MIDX files, but should.
+
+- The repack builtin is not aware of the MIDX files, and may
+  invalidate the MIDX files by deleting existing packfiles. The
+  MIDX may also be extended in the future to store metadata about
+  a packfile that can be used for faster repack commands.
+
+- The naive Git HTTP server advertises lists of packfiles using
+  the file system directly.
+
+Future Work
+-----------
+
+- The current file-format requires between 28 and 36 bytes per
+  object. As the repository grows, the MIDX file can become
+  very large and become a bottleneck when updating the file. To
+  fix this "big write" problem, we can make the MIDX file
+  incremental. Instead of just one MIDX file, we will have a
+  sequence of MIDX files that can be unioned together. Then
+  on write we take the new objects to add and consider how many
+  existing files should be merged into a new file containing
+  the latest objects.
+
+  This list of "base indexes" will be presented as an optional
+  chunk in the MIDX format and contains the OIDs for the base
+  files. Thus, the `midx_head` file only stores the OID for the
+  "tip" MIDX file and then the rest are loaded based on those
+  pointers, such as the following figure:
+
+  [     BIG     ] <- [ MEDIUM ] <- [tiny] <- midx_head
+	 ^___________________________|
+
+  The plan being that every write replaces the "tiny" index,
+  and when that index becomes large enough it merges with the
+  "medium" index and a new tiny index is created in the next
+  write. Very rarely, the "big" index would be updated, causing
+  a slow write.
+
+- After the MIDX feature is sufficiently hardened and widely used,
+  consider making Git more fully depend on the MIDX file. If MIDX
+  is the default, then we can delete the single-pack-indexes from
+  the pack directory. We could also allow thin packs in the pack
+  directory.
+
+- The MIDX could be extended to store a "stable object order" such
+  that adding objects to the order does not change the existing
+  objects. This would enable re-using the reachability bitmaps after
+  repacking and updating the MIDX file.
+
+Related Links
+-------------
+
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/CB5074CF.3AD7A%25joshua.redstone@fb.com/T/#u
+    Subject: Git performance results on a large repository
+    Date: 3 Feb 2012
+
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 02/18] midx: specify midx file format
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that lead us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/pack-format.txt | 85 +++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8e5bf60be3..ab459ef142 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The multi-pack-index (MIDX) files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX format, we
+organize the body into "chunks" and provide a lookup table at the beginning
+of the body. The header includes certain length values, such as the number
+of packs, the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+	4-byte signature:
+	    The signature is: {'M', 'I', 'D', 'X'}
+
+	4-byte version number:
+	    Git currently only supports version 1.
+
+	1-byte Object Id Version (1 = SHA-1)
+
+	1-byte Object Id Length (H)
+
+	1-byte number (I) of base multi-pack-index files:
+	    This value is currently always zero.
+
+	1-byte number (C) of "chunks"
+
+	4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+	(C + 1) * 12 bytes providing the chunk offsets:
+	    First 4 bytes describe chunk id. Value 0 is a terminating label.
+	    Other 8 bytes provide offset in current file for chunk to start.
+	    (Chunks are provided in file-order, so you can infer the length
+	    using the next chunk position if necessary.)
+
+	The remaining data in the body is described one chunk at a time, and
+	these chunks may be given in any order. Chunks are required unless
+	otherwise specified.
+
+CHUNK DATA:
+
+	OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+	    The ith entry, F[i], stores the number of OIDs with first
+	    byte at most i. Thus F[255] stores the total
+	    number of objects (N). The number of objects with first byte
+	    value i is (F[i] - F[i-1]) for i > 0.
+
+	OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+	    The OIDs for all objects in the MIDX are stored in lexicographic
+	    order in this chunk.
+
+	Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+	    Stores two 4-byte values for every object.
+	    1: The pack-int-id for the pack storing this object.
+	    2: The offset within the pack.
+		If all offsets are less than 2^31, then the large offset chunk
+		will not exist and offsets are stored as in IDX v1.
+		If there is at least one offset value larger than 2^32-1, then
+		the large offset chunk must exist. If the large offset chunk
+		exists and the 31st bit is on, then removing that bit reveals
+		the row in the large offsets containing the 8-byte offset of
+		this object.
+
+	[Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+	    8-byte offsets into large packfiles.
+
+	Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+	    P * 4 bytes storing the offset in the packfile name chunk for
+	    the null-terminated string containing the filename for the
+	    ith packfile. The filename is relative to the MIDX file's parent
+	    directory.
+
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so it should be the last chunk for alignment reasons.
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 03/18] midx: create core.midx config setting
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

As the multi-pack-index feature is being developed, we use a config
setting 'core.midx' to enable all use of MIDX files.

Since MIDX files are designed as a way to augment the existing data
stores in Git, turning this setting off will revert to previous
behavior without needing to downgrade. This can also be a repo-
specific setting if the MIDX is misbehaving in only one repo.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 2 ++
 4 files changed, 11 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 64c1dbba94..dc7cb4b900 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -896,6 +896,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.midx::
+	Enable "multi-pack-index" feature. Set to true to read and write MIDX files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index a2ec8c0b55..f4943d3136 100644
--- a/cache.h
+++ b/cache.h
@@ -820,6 +820,7 @@ extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
 extern const char *core_fsmonitor;
+extern int core_midx;
 
 /*
  * Include broken refs in all ref iterations, which will
diff --git a/config.c b/config.c
index e617c2018d..17f560ddc4 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.midx")) {
+		core_midx = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..57a3943849 100644
--- a/environment.c
+++ b/environment.c
@@ -78,6 +78,8 @@ int protect_hfs = PROTECT_HFS_DEFAULT;
 int protect_ntfs = PROTECT_NTFS_DEFAULT;
 const char *core_fsmonitor;
 
+int core_midx;
+
 /*
  * The character that begins a commented line in user-editable file
  * that is subject to stripspace.
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 04/18] midx: write multi-pack indexes for an object list
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

The write_midx_file() method takes a list of packfiles and indexed
objects with offset information and writes according to the format
in Documentation/technical/pack-format.txt. The chunks are separated
into methods.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile |   1 +
 midx.c   | 412 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 midx.h   |  42 +++++++
 3 files changed, 455 insertions(+)
 create mode 100644 midx.c
 create mode 100644 midx.h

diff --git a/Makefile b/Makefile
index 2a81ae22e9..d0d810951f 100644
--- a/Makefile
+++ b/Makefile
@@ -827,6 +827,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += mru.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..5c320726ed
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,412 @@
+#include "cache.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "midx.h"
+
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
+
+#define MIDX_VERSION_1 1
+#define MIDX_VERSION MIDX_VERSION_1
+
+#define MIDX_OID_VERSION_SHA1 1
+#define MIDX_OID_LEN_SHA1 20
+#define MIDX_OID_VERSION MIDX_OID_VERSION_SHA1
+#define MIDX_OID_LEN MIDX_OID_LEN_SHA1
+
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
+
+char* get_midx_filename_oid(const char *pack_dir,
+			    struct object_id *oid)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	strbuf_addstr(&head_path, pack_dir);
+	strbuf_addstr(&head_path, "/midx-");
+	strbuf_addstr(&head_path, oid_to_hex(oid));
+	strbuf_addstr(&head_path, ".midx");
+
+	return strbuf_detach(&head_path, NULL);
+}
+
+struct pack_midx_details_internal {
+	uint32_t pack_int_id;
+	uint32_t internal_offset;
+};
+
+static int midx_sha1_compare(const void *_a, const void *_b)
+{
+	struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
+	struct pack_midx_entry *b = *(struct pack_midx_entry **)_b;
+	return oidcmp(&a->oid, &b->oid);
+}
+
+static void write_midx_chunk_packlookup(
+	struct sha1file *f,
+	const char **pack_names, uint32_t nr_packs)
+{
+	uint32_t i, cur_len = 0;
+
+	for (i = 0; i < nr_packs; i++) {
+		uint32_t swap_len = htonl(cur_len);
+		sha1write(f, &swap_len, 4);
+		cur_len += strlen(pack_names[i]) + 1;
+	}
+}
+
+static void write_midx_chunk_packnames(
+	struct sha1file *f,
+	const char **pack_names, uint32_t nr_packs)
+{
+	uint32_t i;
+	for (i = 0; i < nr_packs; i++)
+		sha1write(f, pack_names[i], strlen(pack_names[i]) + 1);
+}
+
+static void write_midx_chunk_oidfanout(
+	struct sha1file *f,
+	struct pack_midx_entry **objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry **list = objects;
+	struct pack_midx_entry **last = objects + nr_objects;
+	uint32_t count_distinct = 0;
+	uint32_t i;
+
+	/*
+	* Write the first-level table (the list is sorted,
+	* but we use a 256-entry lookup to be able to avoid
+	* having to do eight extra binary search iterations).
+	*/
+	for (i = 0; i < 256; i++) {
+		struct pack_midx_entry **next = list;
+		struct pack_midx_entry *prev = 0;
+		uint32_t swap_distinct;
+
+		while (next < last) {
+			struct pack_midx_entry *obj = *next;
+			if (obj->oid.hash[0] != i)
+				break;
+
+			if (!prev || oidcmp(&(prev->oid), &(obj->oid)))
+				count_distinct++;
+
+			prev = obj;
+			next++;
+		}
+
+		swap_distinct = htonl(count_distinct);
+		sha1write(f, &swap_distinct, 4);
+		list = next;
+	}
+}
+
+static void write_midx_chunk_oidlookup(
+	struct sha1file *f, unsigned char hash_len,
+	struct pack_midx_entry **objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry **list = objects;
+	struct object_id *last_oid = 0;
+	uint32_t i;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_entry *obj = *list++;
+
+		if (last_oid && !oidcmp(last_oid, &obj->oid))
+			continue;
+
+		last_oid = &obj->oid;
+		sha1write(f, obj->oid.hash, (int)hash_len);
+	}
+}
+
+static void write_midx_chunk_objectoffsets(
+	struct sha1file *f, int large_offset_needed,
+	struct pack_midx_entry **objects, uint32_t nr_objects, uint32_t *pack_perm)
+{
+	struct pack_midx_entry **list = objects;
+	struct object_id *last_oid = 0;
+	uint32_t i, nr_large_offset = 0;
+
+	for (i = 0; i < nr_objects; i++) {
+		struct pack_midx_details_internal details;
+		struct pack_midx_entry *obj = *list++;
+
+		if (last_oid && !oidcmp(last_oid, &obj->oid))
+			continue;
+
+		last_oid = &obj->oid;
+
+		details.pack_int_id = htonl(pack_perm[obj->pack_int_id]);
+
+		if (large_offset_needed && obj->offset >> 31)
+			details.internal_offset = (MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
+		else
+			details.internal_offset = (uint32_t)obj->offset;
+
+		details.internal_offset = htonl(details.internal_offset);
+		sha1write(f, &details, 8);
+	}
+}
+
+static void write_midx_chunk_largeoffsets(
+	struct sha1file *f, uint32_t nr_large_offset,
+	struct pack_midx_entry **objects, uint32_t nr_objects)
+{
+	struct pack_midx_entry **list = objects;
+	struct object_id *last_oid = 0;
+
+	while (nr_large_offset) {
+		struct pack_midx_entry *obj = *list++;
+		uint64_t offset = obj->offset;
+		uint32_t split[2];
+
+		if (last_oid && !oidcmp(last_oid, &obj->oid))
+			continue;
+
+		last_oid = &obj->oid;
+
+		if (!(offset >> 31))
+			continue;
+
+		split[0] = htonl(offset >> 32);
+		split[1] = htonl(offset & 0xffffffff);
+
+		sha1write(f, split, 8);
+		nr_large_offset--;
+	}
+}
+
+struct pack_pair {
+	uint32_t pack_int_id;
+	const char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(const char **pack_names, uint32_t nr_packs, uint32_t *perm)
+{
+	uint32_t i;
+	struct pack_pair *pairs;
+
+	ALLOC_ARRAY(pairs, nr_packs);
+
+	for (i = 0; i < nr_packs; i++) {
+		pairs[i].pack_int_id = i;
+		pairs[i].pack_name = pack_names[i];
+	}
+
+	QSORT(pairs, nr_packs, pack_pair_compare);
+
+	for (i = 0; i < nr_packs; i++) {
+		pack_names[i] = pairs[i].pack_name;
+		perm[pairs[i].pack_int_id] = i;
+	}
+}
+
+const char *write_midx_file(const char *pack_dir,
+			    const char *midx_name,
+			    const char **pack_names,
+			    uint32_t nr_packs,
+			    struct pack_midx_entry **objects,
+			    uint32_t nr_objects)
+{
+	struct sha1file *f;
+	struct pack_midx_entry **sorted_by_sha;
+	int i, chunk, fd;
+	struct pack_midx_header hdr;
+	uint32_t chunk_ids[7];
+	uint64_t chunk_offsets[7];
+	unsigned char large_offset_needed = 0;
+	unsigned int nr_large_offset = 0;
+	unsigned char final_hash[GIT_MAX_RAWSZ];
+	const char *final_hex;
+	int rename_needed = 0;
+	uint32_t count_distinct = 0;
+	int total_name_len = 0;
+	uint32_t *pack_perm;
+
+	if (!core_midx)
+		return 0;
+
+	/* Sort packs */
+	if (nr_packs) {
+		ALLOC_ARRAY(pack_perm, nr_packs);
+		sort_packs_by_name(pack_names, nr_packs, pack_perm);
+	} else {
+		pack_perm = 0;
+	}
+
+	/* Sort objects */
+	if (nr_objects) {
+		sorted_by_sha = objects;
+
+		QSORT(sorted_by_sha, nr_objects, midx_sha1_compare);
+
+		for (i = 0; i < nr_objects; i++) {
+			if (i &&
+			    !oidcmp(&sorted_by_sha[i-1]->oid, &sorted_by_sha[i]->oid))
+				continue;
+
+			count_distinct++;
+
+			if (sorted_by_sha[i]->offset > 0x7fffffff)
+				nr_large_offset++;
+			if (sorted_by_sha[i]->offset > 0xffffffff)
+				large_offset_needed = 1;
+		}
+	} else {
+		sorted_by_sha = NULL;
+	}
+
+	for (i = 0; i < nr_packs; i++)
+		total_name_len += strlen(pack_names[i]) + 1;
+
+	/* open temp file, or direct file if given */
+	if (!midx_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		strbuf_addstr(&tmp_file, pack_dir);
+		strbuf_addstr(&tmp_file, "/tmp_midx_XXXXXX");
+
+		fd = git_mkstemp_mode(tmp_file.buf, 0444);
+		if (fd < 0)
+			die_errno("unable to create '%s'", tmp_file.buf);
+
+		midx_name = strbuf_detach(&tmp_file, NULL);
+		rename_needed = 1;
+	} else {
+		unlink(midx_name);
+		fd = open(midx_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+		if (fd < 0)
+			die_errno("unable to create '%s'", midx_name);
+	}
+	f = sha1fd(fd, midx_name);
+
+	/* fill header info */
+	hdr.midx_signature = htonl(MIDX_SIGNATURE);
+	hdr.midx_version = htonl(MIDX_VERSION);
+
+	hdr.hash_version = MIDX_OID_VERSION;
+	hdr.hash_len = MIDX_OID_LEN;
+	hdr.num_base_midx = 0;
+	hdr.num_packs = htonl(nr_packs);
+
+	/*
+	 * We expect the following chunks, which are required:
+	 *
+	 * Packfile Name Lookup
+	 * Packfile Names
+	 * OID Fanout
+	 * OID Lookup
+	 * Object Offsets
+	 */
+	hdr.num_chunks = large_offset_needed ? 6 : 5;
+
+	/* write header to file */
+	assert(sizeof(hdr) == 16);
+	sha1write(f, &hdr, sizeof(hdr));
+
+	/*
+	 * Fill initial chunk values using offsets
+	 * relative to first chunk.
+	 */
+	chunk_offsets[0] = sizeof(hdr) + 12 * (hdr.num_chunks + 1);
+	chunk_ids[0] = MIDX_CHUNKID_PACKLOOKUP;
+	chunk_offsets[1] = chunk_offsets[0] + nr_packs * 4;
+	chunk_ids[1] = MIDX_CHUNKID_OIDFANOUT;
+	chunk_offsets[2] = chunk_offsets[1] + 256 * 4;
+	chunk_ids[2] = MIDX_CHUNKID_OIDLOOKUP;
+	chunk_offsets[3] = chunk_offsets[2] + (uint64_t)count_distinct
+					    * (uint64_t)hdr.hash_len;
+	chunk_ids[3] = MIDX_CHUNKID_OBJECTOFFSETS;
+	chunk_offsets[4] = chunk_offsets[3] + 8 * (uint64_t)count_distinct;
+
+	if (large_offset_needed) {
+		chunk_ids[4] = MIDX_CHUNKID_LARGEOFFSETS;
+		chunk_offsets[5] = chunk_offsets[4] + 8 * (uint64_t)nr_large_offset;
+		chunk_ids[5] = MIDX_CHUNKID_PACKNAMES;
+		chunk_offsets[6] = chunk_offsets[5] + total_name_len;
+		chunk_ids[6] = 0;
+	} else {
+		chunk_ids[4] = MIDX_CHUNKID_PACKNAMES;
+		chunk_offsets[5] = chunk_offsets[4] + total_name_len;
+		chunk_ids[5] = 0;
+	}
+
+	for (i = 0; i <= hdr.num_chunks; i++) {
+		uint32_t chunk_write[3];
+
+		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
+		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		sha1write(f, chunk_write, 12);
+	}
+
+	for (chunk = 0; chunk < hdr.num_chunks; chunk++) {
+		switch (chunk_ids[chunk]) {
+		case MIDX_CHUNKID_PACKLOOKUP:
+			write_midx_chunk_packlookup(f, pack_names, nr_packs);
+			break;
+
+		case MIDX_CHUNKID_PACKNAMES:
+			write_midx_chunk_packnames(f, pack_names, nr_packs);
+			break;
+
+		case MIDX_CHUNKID_OIDFANOUT:
+			write_midx_chunk_oidfanout(f, sorted_by_sha, nr_objects);
+			break;
+
+		case MIDX_CHUNKID_OIDLOOKUP:
+			write_midx_chunk_oidlookup(f, hdr.hash_len, sorted_by_sha,
+						   nr_objects);
+			break;
+
+		case MIDX_CHUNKID_OBJECTOFFSETS:
+			write_midx_chunk_objectoffsets(f, large_offset_needed,
+						       sorted_by_sha, nr_objects,
+						       pack_perm);
+			break;
+
+		case MIDX_CHUNKID_LARGEOFFSETS:
+			write_midx_chunk_largeoffsets(f, nr_large_offset,
+						      sorted_by_sha, nr_objects);
+			break;
+
+		case 0:
+			break;
+
+		default:
+			die("unrecognized MIDX chunk id: %08x", chunk_ids[chunk]);
+		}
+	}
+
+	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
+
+	if (rename_needed)
+	{
+		struct object_id oid;
+		char *fname;
+
+		memcpy(oid.hash, final_hash, GIT_MAX_RAWSZ);
+		fname = get_midx_filename_oid(pack_dir, &oid);
+		final_hex = sha1_to_hex(final_hash);
+
+		if (rename(midx_name, fname))
+			die("failed to rename %s to %s", midx_name, fname);
+
+		free(fname);
+	} else {
+		final_hex = midx_name;
+	}
+
+	return final_hex;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..4b00463651
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,42 @@
+#ifndef MIDX_H
+#define MIDX_H
+
+#include "git-compat-util.h"
+#include "object.h"
+#include "csum-file.h"
+
+extern char *get_midx_filename_oid(const char *pack_dir,
+				   struct object_id *oid);
+
+struct pack_midx_entry {
+	struct object_id oid;
+	uint32_t pack_int_id;
+	off_t offset;
+};
+
+struct pack_midx_header {
+	uint32_t midx_signature;
+	uint32_t midx_version;
+	unsigned char hash_version;
+	unsigned char hash_len;
+	unsigned char num_base_midx;
+	unsigned char num_chunks;
+	uint32_t num_packs;
+};
+
+/*
+ * Write a single MIDX file storing the given entries for the
+ * given list of packfiles. If midx_name is null, then a temp
+ * file will be created and swapped using the result hash value.
+ * Otherwise, write directly to midx_name.
+ *
+ * Returns the final name of the MIDX file within pack_dir.
+ */
+extern const char *write_midx_file(const char *pack_dir,
+				   const char *midx_name,
+				   const char **pack_names,
+				   uint32_t nr_packs,
+				   struct pack_midx_entry **objects,
+				   uint32_t nr_objects);
+
+#endif
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 05/18] midx: create midx builtin with --write mode
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Commentary: As we extend the function of the midx builtin, I expand the
SYNOPSIS row of "git-midx.txt" but do not create multiple rows. If this
builtin doesn't change too much, I will rewrite the SYNOPSIS to be multi-
lined, such as in "git-branch.txt".

-- >8 --

Create, document, and implement the first ability of the midx builtin.

The --write subcommand creates a multi-pack-index for all indexed
packfiles within a given pack directory. If none is provided, the
objects/pack directory is implied. The arguments allow specifying the
pack directory so we can add MIDX files to alternates.

The packfiles are expected to be paired with pack-indexes and are
otherwise ignored. This simplifies the implementation and also keeps
compatibility with older versions of Git (or changing core.midx to
false).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                 |   1 +
 Documentation/git-midx.txt |  54 +++++++++++++
 Makefile                   |   1 +
 builtin.h                  |   1 +
 builtin/midx.c             | 195 +++++++++++++++++++++++++++++++++++++++++++++
 command-list.txt           |   1 +
 git.c                      |   1 +
 7 files changed, 254 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..545e195f2a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -95,6 +95,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 0000000000..17464222c1
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,54 @@
+git-midx(1)
+============
+
+NAME
+----
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+
+SYNOPSIS
+--------
+[verse]
+'git midx' --write [--pack-dir <pack_dir>]
+
+DESCRIPTION
+-----------
+Write a MIDX file.
+
+OPTIONS
+-------
+
+--pack-dir <pack_dir>::
+	Use given directory for the location of packfiles, pack-indexes,
+	and MIDX files.
+
+--write::
+	If specified, write a new midx file to the pack directory using
+	the packfiles present. Outputs the hash of the result midx file.
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in your local .git folder.
++
+------------------------------------------------
+$ git midx --write
+------------------------------------------------
+
+* Write a MIDX file for the packfiles in a different folder
++
+---------------------------------------------------------
+$ git midx --write --pack-dir ../../alt/pack/
+---------------------------------------------------------
+
+CONFIGURATION
+-------------
+
+core.midx::
+	The midx command will fail if core.midx is false.
+	Also, the written MIDX files will be ignored by other commands
+	unless core.midx is true.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index d0d810951f..5c458705c1 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..880383e341 100644
--- a/builtin.h
+++ b/builtin.h
@@ -188,6 +188,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 0000000000..4aae14cf8e
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,195 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "midx.h"
+
+static char const * const builtin_midx_usage[] = {
+	N_("git midx --write [--pack-dir <packdir>]"),
+	NULL
+};
+
+static struct opts_midx {
+	const char *pack_dir;
+	int write;
+} opts;
+
+static int build_midx_from_packs(
+	const char *pack_dir,
+	const char **pack_names, uint32_t nr_packs,
+	const char **midx_id)
+{
+	struct packed_git **packs;
+	const char **installed_pack_names;
+	uint32_t i, j, nr_installed_packs = 0;
+	uint32_t nr_objects = 0;
+	struct pack_midx_entry *objects;
+	struct pack_midx_entry **obj_ptrs;
+	uint32_t nr_total_packs = nr_packs;
+	uint32_t pack_offset = 0;
+	struct strbuf pack_path = STRBUF_INIT;
+	int baselen;
+
+	if (!nr_total_packs) {
+		*midx_id = NULL;
+		return 0;
+	}
+
+	ALLOC_ARRAY(packs, nr_total_packs);
+	ALLOC_ARRAY(installed_pack_names, nr_total_packs);
+
+	strbuf_addstr(&pack_path, pack_dir);
+	strbuf_addch(&pack_path, '/');
+	baselen = pack_path.len;
+	for (i = 0; i < nr_packs; i++) {
+		strbuf_setlen(&pack_path, baselen);
+		strbuf_addstr(&pack_path, pack_names[i]);
+
+		strbuf_strip_suffix(&pack_path, ".pack");
+		strbuf_addstr(&pack_path, ".idx");
+
+		packs[nr_installed_packs] = add_packed_git(pack_path.buf, pack_path.len, 0);
+
+		if (packs[nr_installed_packs] != NULL) {
+			if (open_pack_index(packs[nr_installed_packs]))
+				continue;
+
+			nr_objects += packs[nr_installed_packs]->num_objects;
+			installed_pack_names[nr_installed_packs] = pack_names[i];
+			nr_installed_packs++;
+		}
+	}
+	strbuf_release(&pack_path);
+
+	if (!nr_objects || !nr_installed_packs) {
+		FREE_AND_NULL(packs);
+		FREE_AND_NULL(installed_pack_names);
+		*midx_id = NULL;
+		return 0;
+	}
+
+	ALLOC_ARRAY(objects, nr_objects);
+	nr_objects = 0;
+
+	for (i = pack_offset; i < nr_installed_packs; i++) {
+		struct packed_git *p = packs[i];
+
+		for (j = 0; j < p->num_objects; j++) {
+			struct pack_midx_entry entry;
+
+			if (!nth_packed_object_oid(&entry.oid, p, j))
+				die("unable to get sha1 of object %u in %s",
+				    i, p->pack_name);
+
+			entry.pack_int_id = i;
+			entry.offset = nth_packed_object_offset(p, j);
+
+			objects[nr_objects] = entry;
+			nr_objects++;
+		}
+	}
+
+	ALLOC_ARRAY(obj_ptrs, nr_objects);
+	for (i = 0; i < nr_objects; i++)
+		obj_ptrs[i] = &objects[i];
+
+	*midx_id = write_midx_file(pack_dir, NULL,
+		installed_pack_names, nr_installed_packs,
+		obj_ptrs, nr_objects);
+
+	FREE_AND_NULL(packs);
+	FREE_AND_NULL(installed_pack_names);
+	FREE_AND_NULL(obj_ptrs);
+	FREE_AND_NULL(objects);
+
+	return 0;
+}
+
+static int midx_write(void)
+{
+	const char **pack_names = NULL;
+	uint32_t i, nr_packs = 0;
+	const char *midx_id = 0;
+	DIR *dir;
+	struct dirent *de;
+
+	dir = opendir(opts.pack_dir);
+	if (!dir) {
+		error_errno("unable to open object pack directory: %s",
+			    opts.pack_dir);
+		return 1;
+	}
+
+	nr_packs = 256;
+	ALLOC_ARRAY(pack_names, nr_packs);
+
+	i = 0;
+	while ((de = readdir(dir)) != NULL) {
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		if (ends_with(de->d_name, ".pack")) {
+			ALLOC_GROW(pack_names, i + 1, nr_packs);
+			pack_names[i++] = xstrdup(de->d_name);
+		}
+	}
+
+	nr_packs = i;
+	closedir(dir);
+
+	if (!nr_packs)
+		goto cleanup;
+
+	if (build_midx_from_packs(opts.pack_dir, pack_names, nr_packs, &midx_id))
+		die("failed to build MIDX");
+
+	if (midx_id == NULL)
+		goto cleanup;
+
+	printf("%s\n", midx_id);
+
+cleanup:
+	if (pack_names)
+		FREE_AND_NULL(pack_names);
+	return 0;
+}
+
+int cmd_midx(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_midx_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory containing set of packfile and pack-index pairs.") },
+		OPT_BOOL('w', "write", &opts.write,
+			N_("write midx file")),
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_midx_usage, builtin_midx_options);
+
+	git_config(git_default_config, NULL);
+	if (!core_midx)
+		die(_("git-midx requires core.midx=true"));
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_midx_options,
+			     builtin_midx_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	if (opts.write)
+		return midx_write();
+
+	usage_with_options(builtin_midx_usage, builtin_midx_options);
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..a7b9412182 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -87,6 +87,7 @@ git-merge-index                         plumbingmanipulators
 git-merge-one-file                      purehelpers
 git-mergetool                           ancillarymanipulators
 git-merge-tree                          ancillaryinterrogators
+git-midx                                plumbingmanipulators
 git-mktag                               plumbingmanipulators
 git-mktree                              plumbingmanipulators
 git-mv                                  mainporcelain           worktree
diff --git a/git.c b/git.c
index c870b9719c..87fbda8463 100644
--- a/git.c
+++ b/git.c
@@ -431,6 +431,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE },
 	{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE },
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP },
+	{ "midx", cmd_midx, RUN_SETUP },
 	{ "mktag", cmd_mktag, RUN_SETUP },
 	{ "mktree", cmd_mktree, RUN_SETUP },
 	{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 06/18] midx: add t5318-midx.sh test script
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Test interactions between the midx builtin and other Git operations.

Use both a full repo and a bare repo to ensure the pack directory
redirection works correctly.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5318-midx.sh | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100755 t/t5318-midx.sh

diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
new file mode 100755
index 0000000000..869bbea29c
--- /dev/null
+++ b/t/t5318-midx.sh
@@ -0,0 +1,100 @@
+#!/bin/sh
+
+test_description='multi-pack indexes'
+. ./test-lib.sh
+
+test_expect_success 'config' \
+    'rm -rf .git &&
+     mkdir full &&
+     cd full &&
+     git init &&
+     git config core.midx true &&
+     git config pack.threads 1 &&
+     packdir=.git/objects/pack'
+
+test_expect_success 'write-midx with no packs' \
+    'midx0=$(git midx --write) &&
+     test "a$midx0" = "a"'
+
+test_expect_success 'create objects' \
+    'for i in $(test_seq 100)
+     do
+         echo $i >file-1-$i
+     done &&
+     git add file-* &&
+     test_tick &&
+     git commit -m "test data 1" &&
+     git branch commit1 HEAD'
+
+test_expect_success 'write-midx from index version 1' \
+    'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) &&
+     midx1=$(git midx --write) &&
+     test_path_is_file ${packdir}/midx-${midx1}.midx'
+
+test_expect_success 'write-midx from index version 2' \
+    'rm "${packdir}/test-1-${pack1}.pack" &&
+     pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 ${packdir}/test-2) &&
+     midx2=$(git midx --write) &&
+     test_path_is_file ${packdir}/midx-${midx2}.midx'
+
+test_expect_success 'Create more objects' \
+    'for i in $(test_seq 100)
+     do
+         echo $i >file-2-$i
+     done &&
+     git add file-* &&
+     test_tick &&
+     git commit -m "test data 2" &&
+     git branch commit2 HEAD'
+
+test_expect_success 'write-midx with two packs' \
+    'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) &&
+     midx3=$(git midx --write) &&
+     test_path_is_file ${packdir}/midx-${midx3}.midx'
+
+test_expect_success 'Add more packs' \
+    'for j in $(test_seq 10)
+     do
+         jjj=$(printf '%03i' $j)
+         test-genrandom "bar" 200 > wide_delta_$jjj &&
+         test-genrandom "baz $jjj" 50 >> wide_delta_$jjj &&
+         test-genrandom "foo"$j 100 > deep_delta_$jjj &&
+         test-genrandom "foo"$(expr $j + 1) 100 >> deep_delta_$jjj &&
+         test-genrandom "foo"$(expr $j + 2) 100 >> deep_delta_$jjj &&
+         echo $jjj >file_$jjj &&
+         test-genrandom "$jjj" 8192 >>file_$jjj &&
+         git update-index --add file_$jjj deep_delta_$jjj wide_delta_$jjj &&
+         { echo 101 && test-genrandom 100 8192; } >file_101 &&
+         git update-index --add file_101 &&
+         commit=$(git commit-tree $EMPTY_TREE -p HEAD</dev/null) && {
+         echo $EMPTY_TREE &&
+         git ls-tree $EMPTY_TREE | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+         } >obj-list &&
+         echo commit_packs_$j = $commit &&
+	 git branch commit_packs_$j $commit &&
+         git update-ref HEAD $commit &&
+         git pack-objects --index-version=2 ${packdir}/test-pack <obj-list
+     done'
+
+test_expect_success 'write-midx with twelve packs' \
+    'midx4=$(git midx --write) &&
+     test_path_is_file ${packdir}/midx-${midx4}.midx'
+
+test_expect_success 'write-midx with no new packs' \
+    'midx5=$(git midx --write) &&
+     test_path_is_file ${packdir}/midx-${midx5}.midx &&
+     test "a$midx4" = "a$midx5"'
+
+test_expect_success 'create bare repo' \
+    'cd .. &&
+     git clone --bare full bare &&
+     cd bare &&
+     git config core.midx true &&
+     git config pack.threads 1 &&
+     baredir=objects/pack'
+
+test_expect_success 'write-midx in bare repo' \
+    'midxbare=$(git midx --write) &&
+     test_path_is_file ${baredir}/midx-${midxbare}.midx'
+
+test_done
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 07/18] midx: teach midx --write to update midx-head
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

There may be multiple MIDX files in a single pack directory. The primary
file is pointed to by a pointer file "midx-head" that contains an OID.
The MIDX file to load is then given by "midx-<OID>.midx".

This head file will be especially important when the MIDX files are
extended to be incremental and we expect multiple MIDX files at any
point.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 19 ++++++++++++++++++-
 builtin/midx.c             | 32 ++++++++++++++++++++++++++++++--
 midx.c                     | 31 +++++++++++++++++++++++++++++++
 midx.h                     |  3 +++
 t/t5318-midx.sh            | 33 ++++++++++++++++++++++-----------
 5 files changed, 104 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 17464222c1..01f79cbba5 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' --write [--pack-dir <pack_dir>]
+'git midx' --write <options> [--pack-dir <pack_dir>]
 
 DESCRIPTION
 -----------
@@ -26,15 +26,32 @@ OPTIONS
 	If specified, write a new midx file to the pack directory using
 	the packfiles present. Outputs the hash of the result midx file.
 
+--update-head::
+	If specified with --write, update the midx-head file to point to
+	the written midx file.
+
 EXAMPLES
 --------
 
+* Read the midx-head file and output the OID of the head MIDX file.
++
+------------------------------------------------
+$ git midx
+------------------------------------------------
+
 * Write a MIDX file for the packfiles in your local .git folder.
 +
 ------------------------------------------------
 $ git midx --write
 ------------------------------------------------
 
+* Write a MIDX file for the packfiles in your local .git folder and
+* update the midx-head file.
++
+------------------------------------------------
+$ git midx --write --update-head
+------------------------------------------------
+
 * Write a MIDX file for the packfiles in a different folder
 +
 ---------------------------------------------------------
diff --git a/builtin/midx.c b/builtin/midx.c
index 4aae14cf8e..84ce6588a2 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -9,13 +9,17 @@
 #include "midx.h"
 
 static char const * const builtin_midx_usage[] = {
-	N_("git midx --write [--pack-dir <packdir>]"),
+	N_("git midx [--pack-dir <packdir>]"),
+	N_("git midx --write [--update-head] [--pack-dir <packdir>]"),
 	NULL
 };
 
 static struct opts_midx {
 	const char *pack_dir;
 	int write;
+	int update_head;
+	int has_existing;
+	struct object_id old_midx_oid;
 } opts;
 
 static int build_midx_from_packs(
@@ -109,6 +113,22 @@ static int build_midx_from_packs(
 	return 0;
 }
 
+static void update_head_file(const char *pack_dir, const char *midx_id)
+{
+	int fd;
+	struct lock_file lk = LOCK_INIT;
+	char *head_path = get_midx_head_filename(pack_dir);
+
+	fd = hold_lock_file_for_update(&lk, head_path, LOCK_DIE_ON_ERROR);
+	FREE_AND_NULL(head_path);
+
+	if (fd < 0)
+		die_errno("unable to open midx-head");
+
+	write_in_full(fd, midx_id, GIT_MAX_HEXSZ);
+	commit_lock_file(&lk);
+}
+
 static int midx_write(void)
 {
 	const char **pack_names = NULL;
@@ -152,6 +172,9 @@ static int midx_write(void)
 
 	printf("%s\n", midx_id);
 
+	if (opts.update_head)
+		update_head_file(opts.pack_dir, midx_id);
+
 cleanup:
 	if (pack_names)
 		FREE_AND_NULL(pack_names);
@@ -166,6 +189,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			N_("The pack directory containing set of packfile and pack-index pairs.") },
 		OPT_BOOL('w', "write", &opts.write,
 			N_("write midx file")),
+		OPT_BOOL('u', "update-head", &opts.update_head,
+			N_("update midx-head to written midx file")),
 		OPT_END(),
 	};
 
@@ -187,9 +212,12 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 		opts.pack_dir = strbuf_detach(&path, NULL);
 	}
 
+	opts.has_existing = !!get_midx_head_oid(opts.pack_dir, &opts.old_midx_oid);
+
 	if (opts.write)
 		return midx_write();
 
-	usage_with_options(builtin_midx_usage, builtin_midx_options);
+	if (opts.has_existing)
+		printf("%s\n", oid_to_hex(&opts.old_midx_oid));
 	return 0;
 }
diff --git a/midx.c b/midx.c
index 5c320726ed..f4178c1b81 100644
--- a/midx.c
+++ b/midx.c
@@ -34,6 +34,37 @@ char* get_midx_filename_oid(const char *pack_dir,
 	return strbuf_detach(&head_path, NULL);
 }
 
+char *get_midx_head_filename(const char *pack_dir)
+{
+	struct strbuf head_filename = STRBUF_INIT;
+	strbuf_addstr(&head_filename, pack_dir);
+	strbuf_addstr(&head_filename, "/midx-head");
+	return strbuf_detach(&head_filename, NULL);
+}
+
+struct object_id *get_midx_head_oid(const char *pack_dir,
+				    struct object_id *oid)
+{
+	char oid_hex[GIT_MAX_HEXSZ + 1];
+	FILE *f;
+	char *head_filename = get_midx_head_filename(pack_dir);
+
+	f = fopen(head_filename, "r");
+	FREE_AND_NULL(head_filename);
+
+	if (!f)
+		return 0;
+
+	if (!fgets(oid_hex, sizeof(oid_hex), f))
+		die("failed to read midx-head");
+
+	fclose(f);
+
+	if (get_oid_hex(oid_hex, oid))
+		return 0;
+	return oid;
+}
+
 struct pack_midx_details_internal {
 	uint32_t pack_int_id;
 	uint32_t internal_offset;
diff --git a/midx.h b/midx.h
index 4b00463651..9d9ab85261 100644
--- a/midx.h
+++ b/midx.h
@@ -7,6 +7,9 @@
 
 extern char *get_midx_filename_oid(const char *pack_dir,
 				   struct object_id *oid);
+extern char *get_midx_head_filename(const char *pack_dir);
+
+extern struct object_id *get_midx_head_oid(const char *pack_dir, struct object_id *oid);
 
 struct pack_midx_entry {
 	struct object_id oid;
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 869bbea29c..b66efcdce9 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -29,13 +29,16 @@ test_expect_success 'create objects' \
 test_expect_success 'write-midx from index version 1' \
     'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) &&
      midx1=$(git midx --write) &&
-     test_path_is_file ${packdir}/midx-${midx1}.midx'
+     test_path_is_file ${packdir}/midx-${midx1}.midx &&
+     test_path_is_missing ${packdir}/midx-head'
 
 test_expect_success 'write-midx from index version 2' \
     'rm "${packdir}/test-1-${pack1}.pack" &&
      pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 ${packdir}/test-2) &&
-     midx2=$(git midx --write) &&
-     test_path_is_file ${packdir}/midx-${midx2}.midx'
+     midx2=$(git midx --write --update-head) &&
+     test_path_is_file ${packdir}/midx-${midx2}.midx &&
+     test_path_is_file ${packdir}/midx-head &&
+     test $(cat ${packdir}/midx-head) = "$midx2"'
 
 test_expect_success 'Create more objects' \
     'for i in $(test_seq 100)
@@ -49,8 +52,10 @@ test_expect_success 'Create more objects' \
 
 test_expect_success 'write-midx with two packs' \
     'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) &&
-     midx3=$(git midx --write) &&
-     test_path_is_file ${packdir}/midx-${midx3}.midx'
+     midx3=$(git midx --write --update-head) &&
+     test_path_is_file ${packdir}/midx-${midx3}.midx &&
+     test_path_is_file ${packdir}/midx-head &&
+     test $(cat ${packdir}/midx-head) = "$midx3"'
 
 test_expect_success 'Add more packs' \
     'for j in $(test_seq 10)
@@ -77,13 +82,17 @@ test_expect_success 'Add more packs' \
      done'
 
 test_expect_success 'write-midx with twelve packs' \
-    'midx4=$(git midx --write) &&
-     test_path_is_file ${packdir}/midx-${midx4}.midx'
+    'midx4=$(git midx --write --update-head) &&
+     test_path_is_file ${packdir}/midx-${midx4}.midx &&
+     test_path_is_file ${packdir}/midx-head &&
+     test $(cat ${packdir}/midx-head) = "$midx4"'
 
 test_expect_success 'write-midx with no new packs' \
-    'midx5=$(git midx --write) &&
+    'midx5=$(git midx --write --update-head) &&
      test_path_is_file ${packdir}/midx-${midx5}.midx &&
-     test "a$midx4" = "a$midx5"'
+     test "a$midx4" = "a$midx5" &&
+     test_path_is_file ${packdir}/midx-head &&
+     test $(cat ${packdir}/midx-head) = "$midx4"'
 
 test_expect_success 'create bare repo' \
     'cd .. &&
@@ -94,7 +103,9 @@ test_expect_success 'create bare repo' \
      baredir=objects/pack'
 
 test_expect_success 'write-midx in bare repo' \
-    'midxbare=$(git midx --write) &&
-     test_path_is_file ${baredir}/midx-${midxbare}.midx'
+    'midxbare=$(git midx --write --update-head) &&
+     test_path_is_file ${baredir}/midx-${midxbare}.midx  &&
+     test_path_is_file ${baredir}/midx-head &&
+     test $(cat ${baredir}/midx-head) = "$midxbare"'
 
 test_done
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 08/18] midx: teach git-midx to read midx file details
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Commentary: I included the pack directory of the MIDX file as a FLEX_ARRAY
at the end of the midxed_git struct, similar to how the pack name appears
at the end of the packed_git struct. A colleague mentioned this pattern is
confusing and possibly dangerous so I should consider changing it. If there
is no strong reason for this, then I will modify the struct before the v1
patch to use a char*.

-- >8 --

Add a "--read" subcommand to the midx builtin to report summary information
on the head MIDX file or a MIDX file specified by the supplied "--midx-id"
parameter.

This subcommand is used by t5318-midx.sh to verify the indexed objects are
as expected.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt |  23 +++++++-
 builtin/midx.c             |  59 ++++++++++++++++++++
 midx.c                     | 132 +++++++++++++++++++++++++++++++++++++++++++++
 midx.h                     |  58 ++++++++++++++++++++
 t/t5318-midx.sh            |  79 +++++++++++++++++++--------
 5 files changed, 328 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 01f79cbba5..3eeed1d969 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' --write <options> [--pack-dir <pack_dir>]
+'git midx' [--write|--read] <options> [--pack-dir <pack_dir>]
 
 DESCRIPTION
 -----------
@@ -22,9 +22,18 @@ OPTIONS
 	Use given directory for the location of packfiles, pack-indexes,
 	and MIDX files.
 
+--read::
+	If specified, read a midx file specified by the midx-head file
+	and output basic details about the midx file. (Cannot be combined
+	with --write.)
+
+--midx-id <oid>::
+	If specified with --read, use the given oid to read midx-[oid].midx
+	instead of using midx-head.
 --write::
 	If specified, write a new midx file to the pack directory using
 	the packfiles present. Outputs the hash of the result midx file.
+	(Cannot be combined with --read.)
 
 --update-head::
 	If specified with --write, update the midx-head file to point to
@@ -58,6 +67,18 @@ $ git midx --write --update-head
 $ git midx --write --pack-dir ../../alt/pack/
 ---------------------------------------------------------
 
+* Read the current midx-head.
++
+-----------------------------------------------
+$ git midx --read
+-----------------------------------------------
+
+* Read a specific MIDX file in the local .git folder.
++
+--------------------------------------------------------------------
+$ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
+--------------------------------------------------------------------
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/midx.c b/builtin/midx.c
index 84ce6588a2..ee9234583d 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,12 +16,60 @@ static char const * const builtin_midx_usage[] = {
 
 static struct opts_midx {
 	const char *pack_dir;
+	int read;
+	const char *midx_id;
 	int write;
 	int update_head;
 	int has_existing;
 	struct object_id old_midx_oid;
 } opts;
 
+static int midx_read(void)
+{
+	struct object_id midx_oid;
+	struct midxed_git *midx;
+	uint32_t i;
+
+	if (opts.midx_id && strlen(opts.midx_id) == GIT_MAX_HEXSZ)
+		get_oid_hex(opts.midx_id, &midx_oid);
+	else if (!get_midx_head_oid(opts.pack_dir, &midx_oid))
+		die("No midx-head exists.");
+
+	midx = get_midxed_git(opts.pack_dir, &midx_oid);
+
+	printf("header: %08x %x %d %d %d %d %d\n",
+		ntohl(midx->hdr->midx_signature),
+		ntohl(midx->hdr->midx_version),
+		midx->hdr->hash_version,
+		midx->hdr->hash_len,
+		midx->hdr->num_base_midx,
+		midx->hdr->num_chunks,
+		ntohl(midx->hdr->num_packs));
+	printf("num_objects: %d\n", midx->num_objects);
+	printf("chunks:");
+
+	if (midx->chunk_pack_lookup)
+		printf(" pack_lookup");
+	if (midx->chunk_pack_names)
+		printf(" pack_names");
+	if (midx->chunk_oid_fanout)
+		printf(" oid_fanout");
+	if (midx->chunk_oid_lookup)
+		printf(" oid_lookup");
+	if (midx->chunk_object_offsets)
+		printf(" object_offsets");
+	if (midx->chunk_large_offsets)
+		printf(" large_offsets");
+	printf("\n");
+
+	printf("pack_names:\n");
+	for (i = 0; i < midx->num_packs; i++)
+		printf("%s\n", midx->pack_names[i]);
+
+	printf("pack_dir: %s\n", midx->pack_dir);
+	return 0;
+}
+
 static int build_midx_from_packs(
 	const char *pack_dir,
 	const char **pack_names, uint32_t nr_packs,
@@ -187,6 +235,12 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory containing set of packfile and pack-index pairs.") },
+		OPT_BOOL('r', "read", &opts.read,
+			N_("read midx file")),
+		{ OPTION_STRING, 'M', "midx-id", &opts.midx_id,
+			N_("oid"),
+			N_("An OID for a specific midx file in the pack-dir."),
+			PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
 		OPT_BOOL('w', "write", &opts.write,
 			N_("write midx file")),
 		OPT_BOOL('u', "update-head", &opts.update_head,
@@ -205,6 +259,9 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			     builtin_midx_options,
 			     builtin_midx_usage, 0);
 
+	if (opts.write + opts.read > 1)
+		usage_with_options(builtin_midx_usage, builtin_midx_options);
+
 	if (!opts.pack_dir) {
 		struct strbuf path = STRBUF_INIT;
 		strbuf_addstr(&path, get_object_directory());
@@ -214,6 +271,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 
 	opts.has_existing = !!get_midx_head_oid(opts.pack_dir, &opts.old_midx_oid);
 
+	if (opts.read)
+		return midx_read();
 	if (opts.write)
 		return midx_write();
 
diff --git a/midx.c b/midx.c
index f4178c1b81..c631be451f 100644
--- a/midx.c
+++ b/midx.c
@@ -65,6 +65,138 @@ struct object_id *get_midx_head_oid(const char *pack_dir,
 	return oid;
 }
 
+static struct midxed_git *alloc_midxed_git(int extra)
+{
+	struct midxed_git *m = xmalloc(st_add(sizeof(*m), extra));
+	memset(m, 0, sizeof(*m));
+	m->midx_fd = -1;
+
+	return m;
+}
+
+static struct midxed_git *load_midxed_git_one(const char *midx_file, const char *pack_dir)
+{
+	void *midx_map;
+	const unsigned char *data;
+	struct pack_midx_header *hdr;
+	size_t midx_size, packs_len;
+	struct stat st;
+	uint32_t i;
+	struct midxed_git *midx;
+	int fd = git_open(midx_file);
+
+	if (fd < 0)
+		return 0;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return 0;
+	}
+	midx_size = xsize_t(st.st_size);
+
+	if (midx_size < 16 + 8 * 5 + 4 * 256 + GIT_MAX_RAWSZ) {
+		close(fd);
+		die("midx file %s is too small", midx_file);
+	}
+	midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	data = (const unsigned char *)midx_map;
+
+	hdr = midx_map;
+	if (ntohl(hdr->midx_signature) != MIDX_SIGNATURE) {
+		munmap(midx_map, midx_size);
+		close(fd);
+		die("MIDX signature %X does not match signature %X",
+		    ntohl(hdr->midx_signature), MIDX_SIGNATURE);
+	}
+
+	if (ntohl(hdr->midx_version) != MIDX_VERSION) {
+		munmap(midx_map, midx_size);
+		die("MIDX version %X does not match version %X",
+		    ntohl(hdr->midx_version), MIDX_VERSION);
+	}
+
+	midx = alloc_midxed_git(strlen(pack_dir) + 1);
+
+	midx->hdr = hdr;
+	midx->midx_fd = fd;
+	midx->data = midx_map;
+	midx->data_len = midx_size;
+
+	for (i = 0; i <= hdr->num_chunks; i++) {
+		uint32_t chunk_id = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i));
+		uint64_t chunk_offset1 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 4));
+		uint32_t chunk_offset2 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 8));
+		uint64_t chunk_offset = (chunk_offset1 << 32) | chunk_offset2;
+
+		if (sizeof(data) == 4 && chunk_offset >> 32) {
+			munmap(midx_map, midx_size);
+			close(fd);
+			die(_("unable to memory-map in 32-bit address space"));
+		}
+
+		switch (chunk_id) {
+			case MIDX_CHUNKID_PACKLOOKUP:
+				midx->chunk_pack_lookup = data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_PACKNAMES:
+				midx->chunk_pack_names = data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_OIDFANOUT:
+				midx->chunk_oid_fanout = data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_OIDLOOKUP:
+				midx->chunk_oid_lookup = data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_OBJECTOFFSETS:
+				midx->chunk_object_offsets = data + chunk_offset;
+				break;
+
+			case MIDX_CHUNKID_LARGEOFFSETS:
+				midx->chunk_large_offsets = data + chunk_offset;
+				break;
+
+			case 0:
+				break;
+
+			default:
+				munmap(midx_map, midx_size);
+				close(fd);
+				die("unrecognized MIDX chunk id: %08x", chunk_id);
+		}
+	}
+
+	midx->num_objects = ntohl(*((uint32_t*)(midx->chunk_oid_fanout + 255 * 4)));
+	midx->num_packs = ntohl(midx->hdr->num_packs);
+
+	packs_len = st_mult(sizeof(struct packed_git*), midx->num_packs);
+
+	if (packs_len) {
+		ALLOC_ARRAY(midx->packs, midx->num_packs);
+		ALLOC_ARRAY(midx->pack_names, midx->num_packs);
+		memset(midx->packs, 0, packs_len);
+
+		for (i = 0; i < midx->num_packs; i++) {
+			uint32_t name_offset = ntohl(*(uint32_t*)(midx->chunk_pack_lookup + 4 * i));
+			midx->pack_names[i] = (const char*)(midx->chunk_pack_names + name_offset);
+		}
+	}
+
+	strcpy(midx->pack_dir, pack_dir);
+	return midx;
+}
+
+struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid)
+{
+	struct midxed_git *m;
+	char *fname = get_midx_filename_oid(pack_dir, oid);
+	m = load_midxed_git_one(fname, pack_dir);
+	free(fname);
+	return m;
+}
+
 struct pack_midx_details_internal {
 	uint32_t pack_int_id;
 	uint32_t internal_offset;
diff --git a/midx.h b/midx.h
index 9d9ab85261..92b74e49db 100644
--- a/midx.h
+++ b/midx.h
@@ -27,6 +27,64 @@ struct pack_midx_header {
 	uint32_t num_packs;
 };
 
+struct midxed_git {
+	struct midxed_git *next;
+
+	int midx_fd;
+
+	/* the mmap'd data for the midx file */
+	const unsigned char *data;
+	size_t data_len;
+
+	/* points into the mmap'd data */
+	struct pack_midx_header *hdr;
+
+	/* can construct filename from obj_dir + "/packs/midx-" + oid + ".midx" */
+	struct object_id oid;
+
+	/* derived from the fanout chunk */
+	uint32_t num_objects;
+
+	/* converted number of packs */
+	uint32_t num_packs;
+
+	/* hdr->num_packs * 4 bytes */
+	const unsigned char *chunk_pack_lookup;
+	const unsigned char *chunk_pack_names;
+
+	/* 256 * 4 bytes */
+	const unsigned char *chunk_oid_fanout;
+
+	/* num_objects * hdr->hash_len bytes */
+	const unsigned char *chunk_oid_lookup;
+
+	/* num_objects * 8 bytes */
+	const unsigned char *chunk_object_offsets;
+
+	/*
+	 * 8 bytes per large offset.
+	 * (Optional: may be null.)
+	 */
+	const unsigned char *chunk_large_offsets;
+
+	/*
+	 * Points into mmap'd data storing the pack filenames.
+	 */
+	const char **pack_names;
+
+	/*
+	 * Store an array of pack-pointers. If NULL, then the
+	 * pack has not been loaded yet. The array indices
+	 * correspond to the pack_int_ids from the midx storage.
+	 */
+	struct packed_git **packs;
+
+	/* something like ".git/objects/pack" */
+	char pack_dir[FLEX_ARRAY]; /* more */
+};
+
+extern struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid);
+
 /*
  * Write a single MIDX file storing the given entries for the
  * given list of packfiles. If midx_name is null, then a temp
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index b66efcdce9..2e52389442 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -26,11 +26,27 @@ test_expect_success 'create objects' \
      git commit -m "test data 1" &&
      git branch commit1 HEAD'
 
+_midx_read_expect() {
+	cat >expect <<- EOF
+	header: 4d494458 1 1 20 0 5 $1
+	num_objects: $2
+	chunks: pack_lookup pack_names oid_fanout oid_lookup object_offsets
+	pack_names:
+	$(ls $3 | grep pack | grep -v idx | sort)
+	pack_dir: $3
+	EOF
+}
+
 test_expect_success 'write-midx from index version 1' \
     'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) &&
      midx1=$(git midx --write) &&
      test_path_is_file ${packdir}/midx-${midx1}.midx &&
-     test_path_is_missing ${packdir}/midx-head'
+     test_path_is_missing ${packdir}/midx-head &&
+     _midx_read_expect \
+         "1" "102" \
+         "${packdir}" &&
+     git midx --read --midx-id=${midx1} >output &&
+     cmp output expect'
 
 test_expect_success 'write-midx from index version 2' \
     'rm "${packdir}/test-1-${pack1}.pack" &&
@@ -38,12 +54,17 @@ test_expect_success 'write-midx from index version 2' \
      midx2=$(git midx --write --update-head) &&
      test_path_is_file ${packdir}/midx-${midx2}.midx &&
      test_path_is_file ${packdir}/midx-head &&
-     test $(cat ${packdir}/midx-head) = "$midx2"'
+     test $(cat ${packdir}/midx-head) = "$midx2" &&
+     _midx_read_expect \
+         "1" "102" \
+         "${packdir}" &&
+     git midx --read> output &&
+     cmp output expect'
 
 test_expect_success 'Create more objects' \
     'for i in $(test_seq 100)
      do
-         echo $i >file-2-$i
+         echo extra-$i >file-2-$i
      done &&
      git add file-* &&
      test_tick &&
@@ -55,28 +76,32 @@ test_expect_success 'write-midx with two packs' \
      midx3=$(git midx --write --update-head) &&
      test_path_is_file ${packdir}/midx-${midx3}.midx &&
      test_path_is_file ${packdir}/midx-head &&
-     test $(cat ${packdir}/midx-head) = "$midx3"'
+     test $(cat ${packdir}/midx-head) = "$midx3" &&
+     _midx_read_expect \
+         "2" "204" \
+	 "${packdir}" &&
+     git midx --read >output &&
+     cmp output expect'
 
 test_expect_success 'Add more packs' \
-    'for j in $(test_seq 10)
+    'for i in $(test_seq 10)
      do
-         jjj=$(printf '%03i' $j)
-         test-genrandom "bar" 200 > wide_delta_$jjj &&
-         test-genrandom "baz $jjj" 50 >> wide_delta_$jjj &&
-         test-genrandom "foo"$j 100 > deep_delta_$jjj &&
-         test-genrandom "foo"$(expr $j + 1) 100 >> deep_delta_$jjj &&
-         test-genrandom "foo"$(expr $j + 2) 100 >> deep_delta_$jjj &&
-         echo $jjj >file_$jjj &&
-         test-genrandom "$jjj" 8192 >>file_$jjj &&
-         git update-index --add file_$jjj deep_delta_$jjj wide_delta_$jjj &&
+         iii=$(printf '%03i' $i)
+         test-genrandom "bar" 200 > wide_delta_$iii &&
+         test-genrandom "baz $iii" 50 >> wide_delta_$iii &&
+         test-genrandom "foo"$i 100 > deep_delta_$iii &&
+         test-genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii &&
+         test-genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii &&
+         echo $iii >file_$iii &&
+         test-genrandom "$iii" 8192 >>file_$iii &&
+         git update-index --add file_$iii deep_delta_$iii wide_delta_$iii &&
          { echo 101 && test-genrandom 100 8192; } >file_101 &&
          git update-index --add file_101 &&
-         commit=$(git commit-tree $EMPTY_TREE -p HEAD</dev/null) && {
-         echo $EMPTY_TREE &&
-         git ls-tree $EMPTY_TREE | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
+         tree=$(git write-tree) &&
+         commit=$(git commit-tree $tree -p HEAD</dev/null) && {
+         echo $tree &&
+         git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
          } >obj-list &&
-         echo commit_packs_$j = $commit &&
-	 git branch commit_packs_$j $commit &&
          git update-ref HEAD $commit &&
          git pack-objects --index-version=2 ${packdir}/test-pack <obj-list
      done'
@@ -85,7 +110,12 @@ test_expect_success 'write-midx with twelve packs' \
     'midx4=$(git midx --write --update-head) &&
      test_path_is_file ${packdir}/midx-${midx4}.midx &&
      test_path_is_file ${packdir}/midx-head &&
-     test $(cat ${packdir}/midx-head) = "$midx4"'
+     test $(cat ${packdir}/midx-head) = "$midx4" &&
+     _midx_read_expect \
+         "12" "245" \
+         "${packdir}" &&
+     git midx --read >output &&
+     cmp output expect'
 
 test_expect_success 'write-midx with no new packs' \
     'midx5=$(git midx --write --update-head) &&
@@ -100,12 +130,17 @@ test_expect_success 'create bare repo' \
      cd bare &&
      git config core.midx true &&
      git config pack.threads 1 &&
-     baredir=objects/pack'
+     baredir=./objects/pack'
 
 test_expect_success 'write-midx in bare repo' \
     'midxbare=$(git midx --write --update-head) &&
      test_path_is_file ${baredir}/midx-${midxbare}.midx  &&
      test_path_is_file ${baredir}/midx-head &&
-     test $(cat ${baredir}/midx-head) = "$midxbare"'
+     test $(cat ${baredir}/midx-head) = "$midxbare" &&
+     _midx_read_expect \
+         "12" "245" \
+         "${baredir}" &&
+     git midx --read >output &&
+     cmp output expect'
 
 test_done
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 09/18] midx: find details of nth object in midx
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

The MIDX file stores pack offset information for a list of objects. The
nth_midxed_object_* methods provide ways to extract this information.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 midx.h | 15 +++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/midx.c b/midx.c
index c631be451f..4e0df0285a 100644
--- a/midx.c
+++ b/midx.c
@@ -202,6 +202,61 @@ struct pack_midx_details_internal {
 	uint32_t internal_offset;
 };
 
+struct pack_midx_details *nth_midxed_object_details(struct midxed_git *m,
+						    uint32_t n,
+						    struct pack_midx_details *d)
+{
+	struct pack_midx_details_internal *d_internal;
+	const unsigned char *details = m->chunk_object_offsets;
+
+	if (n >= m->num_objects)
+		return NULL;
+
+	d_internal = (struct pack_midx_details_internal*)(details + 8 * n);
+	d->pack_int_id = ntohl(d_internal->pack_int_id);
+	d->offset = ntohl(d_internal->internal_offset);
+
+	if (m->chunk_large_offsets && d->offset & MIDX_LARGE_OFFSET_NEEDED) {
+		uint32_t large_offset = d->offset ^ MIDX_LARGE_OFFSET_NEEDED;
+		const unsigned char *large_offsets = m->chunk_large_offsets + 8 * large_offset;
+
+		d->offset =  (((uint64_t)ntohl(*((uint32_t *)(large_offsets + 0)))) << 32) |
+					 ntohl(*((uint32_t *)(large_offsets + 4)));
+	}
+
+	return d;
+}
+
+struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m,
+						uint32_t n,
+						struct pack_midx_entry *e)
+{
+	struct pack_midx_details details;
+	const unsigned char *index = m->chunk_oid_lookup;
+
+	if (!nth_midxed_object_details(m, n, &details))
+		return NULL;
+
+	memcpy(e->oid.hash, index + m->hdr->hash_len * n, m->hdr->hash_len);
+	e->pack_int_id = details.pack_int_id;
+	e->offset = details.offset;
+
+	return e;
+}
+
+const struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					      struct midxed_git *m,
+					      uint32_t n)
+{
+	struct pack_midx_entry e;
+
+	if (!nth_midxed_object_entry(m, n, &e))
+		return 0;
+
+	hashcpy(oid->hash, e.oid.hash);
+	return oid;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
 	struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
diff --git a/midx.h b/midx.h
index 92b74e49db..9255909ae8 100644
--- a/midx.h
+++ b/midx.h
@@ -85,6 +85,21 @@ struct midxed_git {
 
 extern struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid);
 
+struct pack_midx_details {
+	uint32_t pack_int_id;
+	off_t offset;
+};
+
+extern struct pack_midx_details *nth_midxed_object_details(struct midxed_git *m,
+							   uint32_t n,
+							   struct pack_midx_details *d);
+extern struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m,
+						       uint32_t n,
+						       struct pack_midx_entry *e);
+extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
+						     struct midxed_git *m,
+						     uint32_t n);
+
 /*
  * Write a single MIDX file storing the given entries for the
  * given list of packfiles. If midx_name is null, then a temp
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 10/18] midx: use existing midx when writing
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

When writing a new MIDX file, it is faster to use an existing MIDX file
to load the object list and pack offsets and to only inspect pack-indexes
for packs not already covered by the MIDX file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/midx.c | 34 +++++++++++++++++++++++++++++++---
 midx.c         | 23 +++++++++++++++++++++++
 midx.h         |  2 ++
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/builtin/midx.c b/builtin/midx.c
index ee9234583d..aff2085771 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -73,7 +73,7 @@ static int midx_read(void)
 static int build_midx_from_packs(
 	const char *pack_dir,
 	const char **pack_names, uint32_t nr_packs,
-	const char **midx_id)
+	const char **midx_id, struct midxed_git *midx)
 {
 	struct packed_git **packs;
 	const char **installed_pack_names;
@@ -86,6 +86,9 @@ static int build_midx_from_packs(
 	struct strbuf pack_path = STRBUF_INIT;
 	int baselen;
 
+	if (midx)
+		nr_total_packs += midx->num_packs;
+
 	if (!nr_total_packs) {
 		*midx_id = NULL;
 		return 0;
@@ -94,6 +97,12 @@ static int build_midx_from_packs(
 	ALLOC_ARRAY(packs, nr_total_packs);
 	ALLOC_ARRAY(installed_pack_names, nr_total_packs);
 
+	if (midx) {
+		for (i = 0; i < midx->num_packs; i++)
+			installed_pack_names[nr_installed_packs++] = midx->pack_names[i];
+		pack_offset = midx->num_packs;
+	}
+
 	strbuf_addstr(&pack_path, pack_dir);
 	strbuf_addch(&pack_path, '/');
 	baselen = pack_path.len;
@@ -101,6 +110,9 @@ static int build_midx_from_packs(
 		strbuf_setlen(&pack_path, baselen);
 		strbuf_addstr(&pack_path, pack_names[i]);
 
+		if (midx && contains_pack(midx, pack_names[i]))
+			continue;
+
 		strbuf_strip_suffix(&pack_path, ".pack");
 		strbuf_addstr(&pack_path, ".idx");
 
@@ -120,13 +132,24 @@ static int build_midx_from_packs(
 	if (!nr_objects || !nr_installed_packs) {
 		FREE_AND_NULL(packs);
 		FREE_AND_NULL(installed_pack_names);
-		*midx_id = NULL;
+
+		if (opts.has_existing)
+			*midx_id = oid_to_hex(&opts.old_midx_oid);
+		else
+			*midx_id = NULL;
+
 		return 0;
 	}
 
+	if (midx)
+		nr_objects += midx->num_objects;
+
 	ALLOC_ARRAY(objects, nr_objects);
 	nr_objects = 0;
 
+	for (i = 0; midx && i < midx->num_objects; i++)
+		nth_midxed_object_entry(midx, i, &objects[nr_objects++]);
+
 	for (i = pack_offset; i < nr_installed_packs; i++) {
 		struct packed_git *p = packs[i];
 
@@ -184,6 +207,10 @@ static int midx_write(void)
 	const char *midx_id = 0;
 	DIR *dir;
 	struct dirent *de;
+	struct midxed_git *midx = NULL;
+
+	if (opts.has_existing)
+		midx = get_midxed_git(opts.pack_dir, &opts.old_midx_oid);
 
 	dir = opendir(opts.pack_dir);
 	if (!dir) {
@@ -212,7 +239,8 @@ static int midx_write(void)
 	if (!nr_packs)
 		goto cleanup;
 
-	if (build_midx_from_packs(opts.pack_dir, pack_names, nr_packs, &midx_id))
+	if (build_midx_from_packs(opts.pack_dir, pack_names,
+				  nr_packs, &midx_id, midx))
 		die("failed to build MIDX");
 
 	if (midx_id == NULL)
diff --git a/midx.c b/midx.c
index 4e0df0285a..53eb29dac3 100644
--- a/midx.c
+++ b/midx.c
@@ -257,6 +257,29 @@ const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
+int contains_pack(struct midxed_git *m, const char *pack_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(pack_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
 	struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
diff --git a/midx.h b/midx.h
index 9255909ae8..1e7a94651c 100644
--- a/midx.h
+++ b/midx.h
@@ -100,6 +100,8 @@ extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 						     struct midxed_git *m,
 						     uint32_t n);
 
+extern int contains_pack(struct midxed_git *m, const char *pack_name);
+
 /*
  * Write a single MIDX file storing the given entries for the
  * given list of packfiles. If midx_name is null, then a temp
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 11/18] midx: teach git-midx to clear midx files
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (9 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

As a way to troubleshoot unforeseen problems with MIDX files, provide
a way to delete "midx-head" and the MIDX it references.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt | 12 +++++++++++-
 builtin/midx.c             | 31 ++++++++++++++++++++++++++++++-
 t/t5318-midx.sh            |  9 +++++++++
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 3eeed1d969..c184d3a593 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' [--write|--read] <options> [--pack-dir <pack_dir>]
+'git midx' [--write|--read|--clear] <options> [--pack-dir <pack_dir>]
 
 DESCRIPTION
 -----------
@@ -22,6 +22,10 @@ OPTIONS
 	Use given directory for the location of packfiles, pack-indexes,
 	and MIDX files.
 
+--clear::
+	If specified, delete the midx file specified by midx-head, and
+	midx-head. (Cannot be combined with --write or --read.)
+
 --read::
 	If specified, read a midx file specified by the midx-head file
 	and output basic details about the midx file. (Cannot be combined
@@ -79,6 +83,12 @@ $ git midx --read
 $ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
 --------------------------------------------------------------------
 
+* Delete the current midx-head and the file it references.
++
+-----------------------------------------------
+$ git midx --clear
+-----------------------------------------------
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/midx.c b/builtin/midx.c
index aff2085771..b30ef36ff8 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -11,11 +11,13 @@
 static char const * const builtin_midx_usage[] = {
 	N_("git midx [--pack-dir <packdir>]"),
 	N_("git midx --write [--update-head] [--pack-dir <packdir>]"),
+	N_("git midx --clear [--pack-dir <packdir>]"),
 	NULL
 };
 
 static struct opts_midx {
 	const char *pack_dir;
+	int clear;
 	int read;
 	const char *midx_id;
 	int write;
@@ -24,6 +26,29 @@ static struct opts_midx {
 	struct object_id old_midx_oid;
 } opts;
 
+static int midx_clear(void)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	char *old_path;
+
+	if (!opts.has_existing)
+		return 0;
+
+	strbuf_addstr(&head_path, opts.pack_dir);
+	strbuf_addstr(&head_path, "/");
+	strbuf_addstr(&head_path, "midx-head");
+	if (remove_path(head_path.buf))
+		die("failed to remove path %s", head_path.buf);
+	strbuf_release(&head_path);
+
+	old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+	if (remove_path(old_path))
+		die("failed to remove path %s", old_path);
+	free(old_path);
+
+	return 0;
+}
+
 static int midx_read(void)
 {
 	struct object_id midx_oid;
@@ -263,6 +288,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory containing set of packfile and pack-index pairs.") },
+		OPT_BOOL('c', "clear", &opts.clear,
+			N_("clear midx file and midx-head")),
 		OPT_BOOL('r', "read", &opts.read,
 			N_("read midx file")),
 		{ OPTION_STRING, 'M', "midx-id", &opts.midx_id,
@@ -287,7 +314,7 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			     builtin_midx_options,
 			     builtin_midx_usage, 0);
 
-	if (opts.write + opts.read > 1)
+	if (opts.write + opts.read + opts.clear > 1)
 		usage_with_options(builtin_midx_usage, builtin_midx_options);
 
 	if (!opts.pack_dir) {
@@ -299,6 +326,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 
 	opts.has_existing = !!get_midx_head_oid(opts.pack_dir, &opts.old_midx_oid);
 
+	if (opts.clear)
+		return midx_clear();
 	if (opts.read)
 		return midx_read();
 	if (opts.write)
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 2e52389442..9337355ab3 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -143,4 +143,13 @@ test_expect_success 'write-midx in bare repo' \
      git midx --read >output &&
      cmp output expect'
 
+test_expect_success 'midx --clear' \
+    'git midx --clear &&
+     test_path_is_missing "${baredir}/midx-${midx4}.midx" &&
+     test_path_is_missing "${baredir}/midx-head" &&
+     cd ../full &&
+     git midx --clear &&
+     test_path_is_missing "${packdir}/midx-${midx4}.midx" &&
+     test_path_is_missing "${packdir}/midx-head"'
+
 test_done
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 12/18] midx: teach git-midx to delete expired files
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (10 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

As we write new MIDX files, the existing files are probably not needed. Supply
the "--delete-expired" flag to remove these files during the "--write" sub-
command.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-midx.txt |  4 ++++
 builtin/midx.c             | 15 ++++++++++++++-
 midx.c                     | 26 ++++++++++++++++++++++++++
 midx.h                     |  2 ++
 packfile.c                 |  2 +-
 packfile.h                 |  1 +
 t/t5318-midx.sh            |  9 ++++++---
 7 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index c184d3a593..4635247d0d 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -43,6 +43,10 @@ OPTIONS
 	If specified with --write, update the midx-head file to point to
 	the written midx file.
 
+--delete-expired::
+	If specified with --write and --update-head, delete the midx file
+	previously pointed to by midx-head (if changed).
+
 EXAMPLES
 --------
 
diff --git a/builtin/midx.c b/builtin/midx.c
index b30ef36ff8..6f56f39390 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -10,7 +10,7 @@
 
 static char const * const builtin_midx_usage[] = {
 	N_("git midx [--pack-dir <packdir>]"),
-	N_("git midx --write [--update-head] [--pack-dir <packdir>]"),
+	N_("git midx --write [--update-head [--delete-expired]] [--pack-dir <packdir>]"),
 	N_("git midx --clear [--pack-dir <packdir>]"),
 	NULL
 };
@@ -22,6 +22,7 @@ static struct opts_midx {
 	const char *midx_id;
 	int write;
 	int update_head;
+	int delete_expired;
 	int has_existing;
 	struct object_id old_midx_oid;
 } opts;
@@ -276,6 +277,16 @@ static int midx_write(void)
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, midx_id);
 
+	if (opts.delete_expired && opts.update_head && opts.has_existing &&
+	    strcmp(midx_id, oid_to_hex(&opts.old_midx_oid))) {
+		char *old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+		close_midx(midx);
+		if (remove_path(old_path))
+			die("failed to remove path %s", old_path);
+
+		free(old_path);
+	}
+
 cleanup:
 	if (pack_names)
 		FREE_AND_NULL(pack_names);
@@ -300,6 +311,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			N_("write midx file")),
 		OPT_BOOL('u', "update-head", &opts.update_head,
 			N_("update midx-head to written midx file")),
+		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+			N_("delete expired head midx file")),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index 53eb29dac3..3ce2b736ea 100644
--- a/midx.c
+++ b/midx.c
@@ -651,3 +651,29 @@ const char *write_midx_file(const char *pack_dir,
 
 	return final_hex;
 }
+
+int close_midx(struct midxed_git *m)
+{
+	int i;
+	if (m->midx_fd < 0)
+		return 0;
+
+	for (i = 0; i < m->num_packs; i++) {
+		if (m->packs[i]) {
+			close_pack(m->packs[i]);
+			free(m->packs[i]);
+			m->packs[i] = NULL;
+		}
+	}
+
+	munmap((void *)m->data, m->data_len);
+	m->data = 0;
+
+	close(m->midx_fd);
+	m->midx_fd = -1;
+
+	free(m->packs);
+	free(m->pack_names);
+
+	return 1;
+}
diff --git a/midx.h b/midx.h
index 1e7a94651c..27d48163e9 100644
--- a/midx.h
+++ b/midx.h
@@ -117,4 +117,6 @@ extern const char *write_midx_file(const char *pack_dir,
 				   struct pack_midx_entry **objects,
 				   uint32_t nr_objects);
 
+extern int close_midx(struct midxed_git *m);
+
 #endif
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..c36420b33f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
diff --git a/packfile.h b/packfile.h
index 0cdeb54dcd..7cf4771029 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *p);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 9337355ab3..42d103c879 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -75,6 +75,7 @@ test_expect_success 'write-midx with two packs' \
     'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) &&
      midx3=$(git midx --write --update-head) &&
      test_path_is_file ${packdir}/midx-${midx3}.midx &&
+     test_path_is_file ${packdir}/midx-${midx2}.midx &&
      test_path_is_file ${packdir}/midx-head &&
      test $(cat ${packdir}/midx-head) = "$midx3" &&
      _midx_read_expect \
@@ -107,8 +108,10 @@ test_expect_success 'Add more packs' \
      done'
 
 test_expect_success 'write-midx with twelve packs' \
-    'midx4=$(git midx --write --update-head) &&
+    'midx4=$(git midx --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/midx-${midx4}.midx &&
+     test_path_is_missing ${packdir}/midx-${midx3}.midx &&
+     test_path_is_file ${packdir}/midx-${midx2}.midx &&
      test_path_is_file ${packdir}/midx-head &&
      test $(cat ${packdir}/midx-head) = "$midx4" &&
      _midx_read_expect \
@@ -118,7 +121,7 @@ test_expect_success 'write-midx with twelve packs' \
      cmp output expect'
 
 test_expect_success 'write-midx with no new packs' \
-    'midx5=$(git midx --write --update-head) &&
+    'midx5=$(git midx --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/midx-${midx5}.midx &&
      test "a$midx4" = "a$midx5" &&
      test_path_is_file ${packdir}/midx-head &&
@@ -133,7 +136,7 @@ test_expect_success 'create bare repo' \
      baredir=./objects/pack'
 
 test_expect_success 'write-midx in bare repo' \
-    'midxbare=$(git midx --write --update-head) &&
+    'midxbare=$(git midx --write --update-head --delete-expired) &&
      test_path_is_file ${baredir}/midx-${midxbare}.midx  &&
      test_path_is_file ${baredir}/midx-head &&
      test $(cat ${baredir}/midx-head) = "$midxbare" &&
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (11 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Perform some basic read-only operations that load objects and find
abbreviations. As this functionality begins to reference MIDX files,
ensure the output matches when using MIDX files and when not using them.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5318-midx.sh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 42d103c879..00be852ed3 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -37,6 +37,19 @@ _midx_read_expect() {
 	EOF
 }
 
+_midx_git_two_modes() {
+	git -c core.midx=true $1 >output
+	git -c core.midx=false $1 >expect
+}
+
+_midx_git_behavior() {
+	test_expect_success 'check normal git operations' \
+	    '_midx_git_two_modes "log --patch master" &&
+	     cmp output expect &&
+	     _midx_git_two_modes "rev-list --all --objects" &&
+	     cmp output expect'
+}
+
 test_expect_success 'write-midx from index version 1' \
     'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) &&
      midx1=$(git midx --write) &&
@@ -48,6 +61,8 @@ test_expect_success 'write-midx from index version 1' \
      git midx --read --midx-id=${midx1} >output &&
      cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx from index version 2' \
     'rm "${packdir}/test-1-${pack1}.pack" &&
      pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 ${packdir}/test-2) &&
@@ -61,6 +76,8 @@ test_expect_success 'write-midx from index version 2' \
      git midx --read> output &&
      cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'Create more objects' \
     'for i in $(test_seq 100)
      do
@@ -71,6 +88,8 @@ test_expect_success 'Create more objects' \
      git commit -m "test data 2" &&
      git branch commit2 HEAD'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx with two packs' \
     'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) &&
      midx3=$(git midx --write --update-head) &&
@@ -84,6 +103,8 @@ test_expect_success 'write-midx with two packs' \
      git midx --read >output &&
      cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'Add more packs' \
     'for i in $(test_seq 10)
      do
@@ -107,6 +128,8 @@ test_expect_success 'Add more packs' \
          git pack-objects --index-version=2 ${packdir}/test-pack <obj-list
      done'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx with twelve packs' \
     'midx4=$(git midx --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/midx-${midx4}.midx &&
@@ -120,6 +143,8 @@ test_expect_success 'write-midx with twelve packs' \
      git midx --read >output &&
      cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx with no new packs' \
     'midx5=$(git midx --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/midx-${midx5}.midx &&
@@ -127,6 +152,8 @@ test_expect_success 'write-midx with no new packs' \
      test_path_is_file ${packdir}/midx-head &&
      test $(cat ${packdir}/midx-head) = "$midx4"'
 
+_midx_git_behavior
+
 test_expect_success 'create bare repo' \
     'cd .. &&
      git clone --bare full bare &&
@@ -146,6 +173,8 @@ test_expect_success 'write-midx in bare repo' \
      git midx --read >output &&
      cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'midx --clear' \
     'git midx --clear &&
      test_path_is_missing "${baredir}/midx-${midx4}.midx" &&
@@ -155,4 +184,6 @@ test_expect_success 'midx --clear' \
      test_path_is_missing "${packdir}/midx-${midx4}.midx" &&
      test_path_is_missing "${packdir}/midx-head"'
 
+_midx_git_behavior
+
 test_done
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 14/18] midx: load midx files when loading packs
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (12 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Replace prepare_packed_git() with prepare_packed_git_internal(use_midx) to
allow some consumers of prepare_packed_git() with a way to load MIDX files.
Consumers should only use the new method if they are prepared to use the
midxed_git struct alongside the packed_git struct.

If a packfile is found that is not referenced by the current MIDX, then add
it to the packed_git struct. This is important to keep the MIDX useful after
adding packs due to "fetch" commands and when third-party tools (such as
libgit2) add packs directly to the repo.

If prepare_packed_git_internal is called with use_midx = 0, then unload the
MIDX file and reload the packfiles in to the packed_git struct.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 midx.h     |  6 ++++--
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 packfile.h |  1 +
 4 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 3ce2b736ea..a66763b9e3 100644
--- a/midx.c
+++ b/midx.c
@@ -22,6 +22,9 @@
 
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+/* MIDX-git global storage */
+struct midxed_git *midxed_git = 0;
+
 char* get_midx_filename_oid(const char *pack_dir,
 			    struct object_id *oid)
 {
@@ -197,6 +200,45 @@ struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid)
 	return m;
 }
 
+static char* get_midx_filename_dir(const char *pack_dir)
+{
+	struct object_id oid;
+	if (!get_midx_head_oid(pack_dir, &oid))
+		return 0;
+
+	return get_midx_filename_oid(pack_dir, &oid);
+}
+
+static int prepare_midxed_git_head(char *pack_dir, int local)
+{
+	struct midxed_git *m = midxed_git;
+	char *midx_head_path = get_midx_filename_dir(pack_dir);
+
+	if (!core_midx)
+		return 1;
+
+	if (midx_head_path) {
+		midxed_git = load_midxed_git_one(midx_head_path, pack_dir);
+		midxed_git->next = m;
+		FREE_AND_NULL(midx_head_path);
+		return 1;
+	}
+
+	return 0;
+}
+
+int prepare_midxed_git_objdir(char *obj_dir, int local)
+{
+	int ret;
+	struct strbuf pack_dir = STRBUF_INIT;
+	strbuf_addstr(&pack_dir, obj_dir);
+	strbuf_add(&pack_dir, "/pack", 5);
+
+	ret = prepare_midxed_git_head(pack_dir.buf, local);
+	strbuf_release(&pack_dir);
+	return ret;
+}
+
 struct pack_midx_details_internal {
 	uint32_t pack_int_id;
 	uint32_t internal_offset;
@@ -677,3 +719,18 @@ int close_midx(struct midxed_git *m)
 
 	return 1;
 }
+
+void close_all_midx(void)
+{
+	struct midxed_git *m = midxed_git;
+	struct midxed_git *next;
+
+	while (m) {
+		next = m->next;
+		close_midx(m);
+		free(m);
+		m = next;
+	}
+
+	midxed_git = 0;
+}
diff --git a/midx.h b/midx.h
index 27d48163e9..d8ede8121c 100644
--- a/midx.h
+++ b/midx.h
@@ -27,7 +27,7 @@ struct pack_midx_header {
 	uint32_t num_packs;
 };
 
-struct midxed_git {
+extern struct midxed_git {
 	struct midxed_git *next;
 
 	int midx_fd;
@@ -81,9 +81,10 @@ struct midxed_git {
 
 	/* something like ".git/objects/pack" */
 	char pack_dir[FLEX_ARRAY]; /* more */
-};
+} *midxed_git;
 
 extern struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid);
+extern int prepare_midxed_git_objdir(char *obj_dir, int local);
 
 struct pack_midx_details {
 	uint32_t pack_int_id;
@@ -118,5 +119,6 @@ extern const char *write_midx_file(const char *pack_dir,
 				   uint32_t nr_objects);
 
 extern int close_midx(struct midxed_git *m);
+extern void close_all_midx(void);
 
 #endif
diff --git a/packfile.c b/packfile.c
index c36420b33f..1c0822878b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,7 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -309,10 +310,22 @@ void close_pack(struct packed_git *p)
 void close_all_packs(void)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
+
+	for (m = midxed_git; m; m = m->next) {
+		int i;
+		for (i = 0; i < m->num_packs; i++) {
+			p = m->packs[i];
+			if (p && p->do_not_close)
+				BUG("want to close pack marked 'do-not-close'");
+			else if (p)
+				close_pack(p);
+		}
+	}
 
 	for (p = packed_git; p; p = p->next)
 		if (p->do_not_close)
-			die("BUG: want to close pack marked 'do-not-close'");
+			BUG("want to close pack marked 'do-not-close'");
 		else
 			close_pack(p);
 }
@@ -748,6 +761,7 @@ static void prepare_packed_git_one(char *objdir, int local)
 	dirnamelen = path.len;
 	while ((de = readdir(dir)) != NULL) {
 		struct packed_git *p;
+		struct midxed_git *m;
 		size_t base_len;
 
 		if (is_dot_or_dotdot(de->d_name))
@@ -758,15 +772,23 @@ static void prepare_packed_git_one(char *objdir, int local)
 
 		base_len = path.len;
 		if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
+			strbuf_setlen(&path, base_len + 1);
+			strbuf_add(&path, "pack", 4);
+
 			/* Don't reopen a pack we already have. */
+			for (m = midxed_git; m; m = m->next)
+				if (!memcmp(m->pack_dir, path.buf, dirnamelen - 1) &&
+				    contains_pack(m, path.buf + dirnamelen))
+					break;
 			for (p = packed_git; p; p = p->next) {
-				size_t len;
-				if (strip_suffix(p->pack_name, ".pack", &len) &&
-				    len == base_len &&
-				    !memcmp(p->pack_name, path.buf, len))
+				if (!strcmp(p->pack_name, path.buf))
 					break;
 			}
-			if (p == NULL &&
+
+			strbuf_setlen(&path, base_len + 1);
+			strbuf_add(&path, "idx", 3);
+
+			if (m == NULL && p == NULL &&
 			    /*
 			     * See if it really is a valid .idx file with
 			     * corresponding .pack file that we can map.
@@ -872,21 +894,45 @@ static void prepare_packed_git_mru(void)
 }
 
 static int prepare_packed_git_run_once = 0;
-void prepare_packed_git(void)
+static int prepare_midxed_git_run_once = 0;
+void prepare_packed_git_internal(int use_midx)
 {
 	struct alternate_object_database *alt;
+	char *obj_dir;
+
+	if (prepare_midxed_git_run_once) {
+		if (!use_midx) {
+			prepare_midxed_git_run_once = 0;
+			close_all_midx();
+			reprepare_packed_git();
+		}
+		return;
+	}
 
 	if (prepare_packed_git_run_once)
 		return;
-	prepare_packed_git_one(get_object_directory(), 1);
+
+	obj_dir = get_object_directory();
+
+	if (use_midx)
+		prepare_midxed_git_objdir(obj_dir, 1);
+	prepare_packed_git_one(obj_dir, 1);
 	prepare_alt_odb();
-	for (alt = alt_odb_list; alt; alt = alt->next)
+	for (alt = alt_odb_list; alt; alt = alt->next) {
+		if (use_midx)
+			prepare_midxed_git_objdir(alt->path, 0);
 		prepare_packed_git_one(alt->path, 0);
+	}
 	rearrange_packed_git();
 	prepare_packed_git_mru();
 	prepare_packed_git_run_once = 1;
+	prepare_midxed_git_run_once = use_midx;
 }
 
+void prepare_packed_git(void)
+{
+	prepare_packed_git_internal(0);
+}
 void reprepare_packed_git(void)
 {
 	approximate_object_count_valid = 0;
diff --git a/packfile.h b/packfile.h
index 7cf4771029..25bac91efb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -32,6 +32,7 @@ extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_
 #define PACKDIR_FILE_GARBAGE 4
 extern void (*report_garbage)(unsigned seen_bits, const char *path);
 
+extern void prepare_packed_git_internal(int use_midx);
 extern void prepare_packed_git(void);
 extern void reprepare_packed_git(void);
 extern void install_packed_git(struct packed_git *pack);
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 15/18] midx: use midx for approximate object count
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (13 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

The MIDX files contain a complete object count, so we can report the number
of objects in the MIDX. The count remains approximate as there may be overlap
between the packfiles not referenced by the MIDX.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 packfile.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 1c0822878b..866a1f30dd 100644
--- a/packfile.c
+++ b/packfile.c
@@ -803,7 +803,8 @@ static void prepare_packed_git_one(char *objdir, int local)
 		if (ends_with(de->d_name, ".idx") ||
 		    ends_with(de->d_name, ".pack") ||
 		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep"))
+		    ends_with(de->d_name, ".keep") ||
+		    ends_with(de->d_name, ".midx"))
 			string_list_append(&garbage, path.buf);
 		else
 			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -828,9 +829,12 @@ unsigned long approximate_object_count(void)
 	static unsigned long count;
 	if (!approximate_object_count_valid) {
 		struct packed_git *p;
+		struct midxed_git *m;
 
-		prepare_packed_git();
+		prepare_packed_git_internal(1);
 		count = 0;
+		for (m = midxed_git; m; m = m->next)
+			count += m->num_objects;
 		for (p = packed_git; p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx()
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (14 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Using a binary search, we can navigate to the position n within a
MIDX file where an object appears in the ordered list of objects.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 30 ++++++++++++++++++++++++++++++
 midx.h |  9 +++++++++
 2 files changed, 39 insertions(+)

diff --git a/midx.c b/midx.c
index a66763b9e3..8c643caa92 100644
--- a/midx.c
+++ b/midx.c
@@ -299,6 +299,36 @@ const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
+int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos)
+{
+	uint32_t last, first = 0;
+
+	if (sha1[0])
+		first = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * (sha1[0] - 1)));
+	last = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * sha1[0]));
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = m->chunk_oid_lookup + m->hdr->hash_len * mid;
+		cmp = hashcmp(sha1, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
 int contains_pack(struct midxed_git *m, const char *pack_name)
 {
 	uint32_t first = 0, last = m->num_packs;
diff --git a/midx.h b/midx.h
index d8ede8121c..5598799189 100644
--- a/midx.h
+++ b/midx.h
@@ -101,6 +101,15 @@ extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 						     struct midxed_git *m,
 						     uint32_t n);
 
+/*
+ * Perform a binary search on the object list in a MIDX file for the given sha1.
+ *
+ * If the object exists, then return 1 and set *pos to the position of the sha1.
+ * Otherwise, return 0 and set *pos to the position of the lex-first object greater
+ * than the given sha1.
+ */
+extern int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos);
+
 extern int contains_pack(struct midxed_git *m, const char *pack_name);
 
 /*
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 17/18] sha1_name: use midx for abbreviations
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (15 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

Create unique_in_midx() to mimic behavior of unique_in_pack().

Create find_abbrev_len_for_midx() to mimic behavior of
find_abbrev_len_for_pack().

Consume these methods when interacting with abbreviations.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sha1_name.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..2f426e136e 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -10,6 +10,7 @@
 #include "dir.h"
 #include "sha1-array.h"
 #include "packfile.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -190,11 +191,40 @@ static void unique_in_pack(struct packed_git *p,
 	}
 }
 
+static void unique_in_midx(struct midxed_git *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	bsearch_midx(m, ds->bin_pfx.hash, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx".  See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
 
-	prepare_packed_git();
+	prepare_packed_git_internal(1);
+	for (m = midxed_git; m && !ds->ambiguous; m = m->next)
+		unique_in_midx(m, ds);
 	for (p = packed_git; p && !ds->ambiguous; p = p->next)
 		unique_in_pack(p, ds);
 }
@@ -508,6 +538,39 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t first = 0;
+	struct object_id oid;
+
+	if (!m->num_objects)
+		return;
+
+	match = bsearch_midx(m, mad->hash, &first);
+
+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_midxed_object_oid(&oid, m, first);
+		extend_abbrev_len(&oid, mad);
+	} else if (first < m->num_objects - 1) {
+		nth_midxed_object_oid(&oid, m, first + 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		nth_midxed_object_oid(&oid, m, first - 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -563,8 +626,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
 
-	prepare_packed_git();
+	prepare_packed_git_internal(1);
+	for (m = midxed_git; m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = packed_git; p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 18/18] packfile: use midx for object loads
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (16 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
@ 2018-01-07 18:14 ` Derrick Stolee
  2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
  2018-01-10 18:25 ` Martin Fick
  19 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-07 18:14 UTC (permalink / raw)
  To: git; +Cc: dstolee, stolee, git, peff, gitster, Johannes.Shindelin, jrnieder

When looking for a packed object, first check the MIDX for that object.
This reduces thrashing in the MRU list of packfiles.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c     | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 midx.h     |  3 +++
 packfile.c |  5 +++-
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 8c643caa92..4b2398b3ee 100644
--- a/midx.c
+++ b/midx.c
@@ -329,6 +329,90 @@ int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos)
 	return 0;
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->hdr->num_packs)
+		return 1;
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addstr(&pack_name, m->pack_dir);
+	strbuf_addstr(&pack_name, "/");
+	strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+	strbuf_strip_suffix(&pack_name, ".pack");
+	strbuf_addstr(&pack_name, ".idx");
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+static int find_pack_entry_midx(const unsigned char *sha1,
+				struct midxed_git *m,
+				struct packed_git **p,
+				off_t *offset)
+{
+	uint32_t pos;
+	struct pack_midx_details d;
+
+	if (!bsearch_midx(m, sha1, &pos) ||
+	    !nth_midxed_object_details(m, pos, &d))
+		return 0;
+
+	if (d.pack_int_id >= m->num_packs)
+		die(_("bad pack-int-id %d"), d.pack_int_id);
+
+	/* load packfile, if necessary */
+	if (prepare_midx_pack(m, d.pack_int_id))
+		return 0;
+
+	*p = m->packs[d.pack_int_id];
+	*offset = d.offset;
+
+	return 1;
+}
+
+int fill_pack_entry_midx(const unsigned char *sha1,
+			 struct pack_entry *e)
+{
+	struct packed_git *p;
+	struct midxed_git *m;
+
+	if (!core_midx)
+		return 0;
+
+	m = midxed_git;
+	while (m)
+	{
+		off_t offset;
+		if (!find_pack_entry_midx(sha1, m, &p, &offset)) {
+			m = m->next;
+			continue;
+		}
+
+		/*
+		* We are about to tell the caller where they can locate the
+		* requested object.  We better make sure the packfile is
+		* still here and can be accessed before supplying that
+		* answer, as it may have been deleted since the MIDX was
+		* loaded!
+		*/
+		if (!is_pack_valid(p))
+			return 0;
+
+		e->offset = offset;
+		e->p = p;
+		hashcpy(e->sha1, sha1);
+
+		return 1;
+	}
+
+	return 0;
+}
+
 int contains_pack(struct midxed_git *m, const char *pack_name)
 {
 	uint32_t first = 0, last = m->num_packs;
diff --git a/midx.h b/midx.h
index 5598799189..b7e8b15fe4 100644
--- a/midx.h
+++ b/midx.h
@@ -11,6 +11,9 @@ extern char *get_midx_head_filename(const char *pack_dir);
 
 extern struct object_id *get_midx_head_oid(const char *pack_dir, struct object_id *oid);
 
+extern int fill_pack_entry_midx(const unsigned char *sha1,
+				struct pack_entry *e);
+
 struct pack_midx_entry {
 	struct object_id oid;
 	uint32_t pack_int_id;
diff --git a/packfile.c b/packfile.c
index 866a1f30dd..9ec39a83e9 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1883,7 +1883,10 @@ int find_pack_entry(const unsigned char *sha1, struct pack_entry *e)
 {
 	struct mru_entry *p;
 
-	prepare_packed_git();
+	prepare_packed_git_internal(1);
+	if (fill_pack_entry_midx(sha1, e))
+		return 1;
+
 	if (!packed_git)
 		return 0;
 
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (17 preceding siblings ...)
  2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
@ 2018-01-07 22:42 ` Ævar Arnfjörð Bjarmason
  2018-01-08  0:08   ` Derrick Stolee
  2018-01-10 18:25 ` Martin Fick
  19 siblings, 1 reply; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-01-07 22:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder


On Sun, Jan 07 2018, Derrick Stolee jotted:

>     git log --oneline --raw --parents
>
> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
> ----------+-------------+------------+--------+----------
>         1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>        24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>       127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>
> The last column is the relative difference between the MIDX-enabled repo
> and the single-pack repo. The goal of the MIDX feature is to present the
> ODB as if it was fully repacked, so there is still room for improvement.
>
> Changing the command to
>
>     git log --oneline --raw --parents --abbrev=40
>
> has no observable difference (sub 1% change in all cases). This is likely
> due to the repack I used putting commits and trees in a small number of
> packfiles so the MRU cache workes very well. On more naturally-created
> lists of packfiles, there can be up to 20% improvement on this command.
>
> We are using a version of this patch with an upcoming release of GVFS.
> This feature is particularly important in that space since GVFS performs
> a "prefetch" step that downloads a pack of commits and trees on a daily
> basis. These packfiles are placed in an alternate that is shared by all
> enlistments. Some users have 150+ packfiles and the MRU misses and
> abbreviation computations are significant. Now, GVFS manages the MIDX file
> after adding new prefetch packfiles using the following command:
>
>     git midx --write --update-head --delete-expired --pack-dir=<alt>

(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The reason we do that at all is because it makes for a prettier UI.

Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.

If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambigous after the
next commit.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
@ 2018-01-08  0:08   ` Derrick Stolee
  2018-01-08 10:20     ` Jeff King
                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-01-08  0:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, dstolee, git, peff, gitster, johannes.schindelin, jrnieder

On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Jan 07 2018, Derrick Stolee jotted:
> 
>>      git log --oneline --raw --parents
>>
>> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
>> ----------+-------------+------------+--------+----------
>>          1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>>         24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>>        127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>>
>> The last column is the relative difference between the MIDX-enabled repo
>> and the single-pack repo. The goal of the MIDX feature is to present the
>> ODB as if it was fully repacked, so there is still room for improvement.
>>
>> Changing the command to
>>
>>      git log --oneline --raw --parents --abbrev=40
>>
>> has no observable difference (sub 1% change in all cases). This is likely
>> due to the repack I used putting commits and trees in a small number of
>> packfiles so the MRU cache workes very well. On more naturally-created
>> lists of packfiles, there can be up to 20% improvement on this command.
>>
>> We are using a version of this patch with an upcoming release of GVFS.
>> This feature is particularly important in that space since GVFS performs
>> a "prefetch" step that downloads a pack of commits and trees on a daily
>> basis. These packfiles are placed in an alternate that is shared by all
>> enlistments. Some users have 150+ packfiles and the MRU misses and
>> abbreviation computations are significant. Now, GVFS manages the MIDX file
>> after adding new prefetch packfiles using the following command:
>>
>>      git midx --write --update-head --delete-expired --pack-dir=<alt>
> 
> (Not a critique of this, just a (stupid) question)
> 
> What's the practical use-case for this feature? Since it doesn't help
> with --abbrev=40 the speedup is all in the part that ensures we don't
> show an ambiguous SHA-1.

The point of including the --abbrev=40 is to point out that object 
lookups do not get slower with the MIDX feature. Using these "git log" 
options is a good way to balance object lookups and abbreviations with 
object parsing and diff machinery. And while the public data shape I 
shared did not show a difference, our private testing of the Windows 
repository did show a valuable improvement when isolating to object 
lookups and ignoring abbreviation calculations.

> The reason we do that at all is because it makes for a prettier UI.

We tried setting core.abbrev=40 on GVFS enlistments to speed up 
performance and the users rebelled against the hideous output. They 
would rather have slower speeds than long hashes.

> Are there things that both want the pretty SHA-1 and also care about the
> throughput? I'd have expected machine parsing to just use
> --no-abbrev-commit.

The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes.

> If something cares about both throughput and e.g. is saving the
> abbreviated SHA-1s isn't it better off picking some arbitrary size
> (e.g. --abbrev=20), after all the default abbreviation is going to show
> something as small as possible, which may soon become ambigous after the
> next commit.

Unfortunately, with the way the abbreviation algorithms work, using 
--abbrev=20 will have similar performance problems because you still 
need to inspect all packfiles to ensure there isn't a collision in the 
first 20 hex characters.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08  0:08   ` Derrick Stolee
@ 2018-01-08 10:20     ` Jeff King
  2018-01-08 10:27       ` Jeff King
                         ` (2 more replies)
  2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
  2 siblings, 3 replies; 48+ messages in thread
From: Jeff King @ 2018-01-08 10:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, git,
	gitster, johannes.schindelin, jrnieder

On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:

> > (Not a critique of this, just a (stupid) question)
> > 
> > What's the practical use-case for this feature? Since it doesn't help
> > with --abbrev=40 the speedup is all in the part that ensures we don't
> > show an ambiguous SHA-1.
> 
> The point of including the --abbrev=40 is to point out that object lookups
> do not get slower with the MIDX feature. Using these "git log" options is a
> good way to balance object lookups and abbreviations with object parsing and
> diff machinery. And while the public data shape I shared did not show a
> difference, our private testing of the Windows repository did show a
> valuable improvement when isolating to object lookups and ignoring
> abbreviation calculations.

Just to make sure I'm parsing this correctly: normal lookups do get faster
when you have a single index, given the right setup?

I'm curious what that setup looked like. Is it just tons and tons of
packs? Is it ones where the packs do not follow the mru patterns very
well?

I think it's worth thinking a bit about, because...

> > If something cares about both throughput and e.g. is saving the
> > abbreviated SHA-1s isn't it better off picking some arbitrary size
> > (e.g. --abbrev=20), after all the default abbreviation is going to show
> > something as small as possible, which may soon become ambigous after the
> > next commit.
> 
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still need to
> inspect all packfiles to ensure there isn't a collision in the first 20 hex
> characters.

...if what we primarily care about speeding up is abbreviations, is it
crazy to consider disabling the disambiguation step entirely?

The results of find_unique_abbrev are already a bit of a probability
game. They're guaranteed at the moment of generation, but as more
objects are added, ambiguities may be introduced. Likewise, what's
unambiguous for you may not be for somebody else you're communicating
with, if they have their own clone.

Since we now scale the default abbreviation with the size of the repo,
that gives us a bounded and pretty reasonable probability that we won't
hit a collision at all[1].

I.e., what if we did something like this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..04c661ba85 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;

+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10)
+		return len;
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;

If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The
patch above drops "git log --oneline --raw" on the resulting repo from
~150s to ~30s.

With a single pack, it goes from ~33s ~29s. Less impressive, but there's
still some benefit.

There may be other reasons to want MIDX or something like it, but I just
wonder if we can do this much simpler thing to cover the abbreviation
case. I guess the question is whether somebody is going to be annoyed in
the off chance that they hit a collision.

-Peff

[1] I'd have to check the numbers, but IIRC we've set the scaling so
    that the chance of having a _single_ collision in the repository is
    less than 50%, and rounding to the conservative side (since each hex
    char gives us 4 bits). And indeed, "git log --oneline --raw" on
    linux.git does not seem to have any collisions at its default of 12
    characters, at least in my clone.

    We could also consider switching core.disambiguate to "commit",
    which makes even a collision less likely to annoy the user.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 10:20     ` Jeff King
@ 2018-01-08 10:27       ` Jeff King
  2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
  2018-01-08 13:43       ` Johannes Schindelin
  2018-01-08 13:43       ` Derrick Stolee
  2 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2018-01-08 10:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, git,
	gitster, johannes.schindelin, jrnieder

On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:

> I.e., what if we did something like this:
> 
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..04c661ba85 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>  
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10)
> +		return len;
> +

Oops, this really needs to terminate the string in addition to returning
the length (so it was always printing 40 characters in most cases). The
correct patch is below, but it performs the same.

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..5921298a80 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10) {
+		hex[len] = 0;
+		return len;
+	}
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08  0:08   ` Derrick Stolee
  2018-01-08 10:20     ` Jeff King
@ 2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-01-08 11:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, johannes.schindelin, jrnieder


On Mon, Jan 08 2018, Derrick Stolee jotted:

> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> If something cares about both throughput and e.g. is saving the
>> abbreviated SHA-1s isn't it better off picking some arbitrary size
>> (e.g. --abbrev=20), after all the default abbreviation is going to show
>> something as small as possible, which may soon become ambigous after the
>> next commit.
>
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still
> need to inspect all packfiles to ensure there isn't a collision in the
> first 20 hex characters.

I meant (but forgot to write) that this would be some new mode,
e.g. --abbrev=20 --no-abbrev-check which would just perform a substr()
of the 40 character SHA-1. It might be interesting to add that for
reasons completely unrelated to your series.

Thanks for answering the rest.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 10:27       ` Jeff King
@ 2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-01-08 12:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, git, dstolee, git, gitster, johannes.schindelin,
	jrnieder, Linus Torvalds


On Mon, Jan 08 2018, Jeff King jotted:

> On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:
>
>> I.e., what if we did something like this:
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 611c7d24dd..04c661ba85 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>>  	if (len == GIT_SHA1_HEXSZ || !len)
>>  		return GIT_SHA1_HEXSZ;
>>
>> +	/*
>> +	 * A default length of 10 implies a repository big enough that it's
>> +	 * getting expensive to double check the ambiguity of each object,
>> +	 * and the chance that any particular object of interest has a
>> +	 * collision is low.
>> +	 */
>> +	if (len >= 10)
>> +		return len;
>> +
>
> Oops, this really needs to terminate the string in addition to returning
> the length (so it was always printing 40 characters in most cases). The
> correct patch is below, but it performs the same.
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..5921298a80 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10) {
> +		hex[len] = 0;
> +		return len;
> +	}
> +
>  	mad.init_len = len;
>  	mad.cur_len = len;
>  	mad.hex = hex;

That looks much more sensible, leaving aside other potential benefits of
MIDX.

Given the argument Linus made in e6c587c733 ("abbrev: auto size the
default abbreviation", 2016-09-30) maybe we should add a small integer
to the length for good measure, i.e. something like:

	if (len >= 10) {
		int extra = 2; /* or  just 1? or maybe 0 ... */
		hex[len + extra] = 0;
		return len + extra;
	}

I tried running:

    git log --pretty=format:%h --abbrev=7 | perl -nE 'chomp; say length'|sort|uniq -c|sort -nr

On several large repos, which forces something like the disambiguation
we had before Linus's patch, on e.g. David Turner's
2015-04-03-1M-git.git test repo it's:

     952858 7
      44541 8
       2861 9
        168 10
         17 11
          2 12

And the default abbreviation picks 12. I haven't yet found a case where
it's wrong, but if we wanted to be extra safe we could just add a byte
or two to the SHA-1.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 10:20     ` Jeff King
  2018-01-08 10:27       ` Jeff King
@ 2018-01-08 13:43       ` Johannes Schindelin
  2018-01-09  6:50         ` Jeff King
  2018-01-08 13:43       ` Derrick Stolee
  2 siblings, 1 reply; 48+ messages in thread
From: Johannes Schindelin @ 2018-01-08 13:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, git,
	dstolee, git, gitster, jrnieder

Hi Peff,

On Mon, 8 Jan 2018, Jeff King wrote:

> On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:
> 
> > > (Not a critique of this, just a (stupid) question)
> > > 
> > > What's the practical use-case for this feature? Since it doesn't
> > > help with --abbrev=40 the speedup is all in the part that ensures we
> > > don't show an ambiguous SHA-1.
> > 
> > The point of including the --abbrev=40 is to point out that object
> > lookups do not get slower with the MIDX feature. Using these "git log"
> > options is a good way to balance object lookups and abbreviations with
> > object parsing and diff machinery. And while the public data shape I
> > shared did not show a difference, our private testing of the Windows
> > repository did show a valuable improvement when isolating to object
> > lookups and ignoring abbreviation calculations.
> 
> Just to make sure I'm parsing this correctly: normal lookups do get
> faster when you have a single index, given the right setup?
> 
> I'm curious what that setup looked like. Is it just tons and tons of
> packs? Is it ones where the packs do not follow the mru patterns very
> well?
> 
> I think it's worth thinking a bit about, because...
> 
> > > If something cares about both throughput and e.g. is saving the
> > > abbreviated SHA-1s isn't it better off picking some arbitrary size
> > > (e.g. --abbrev=20), after all the default abbreviation is going to show
> > > something as small as possible, which may soon become ambigous after the
> > > next commit.
> > 
> > Unfortunately, with the way the abbreviation algorithms work, using
> > --abbrev=20 will have similar performance problems because you still need to
> > inspect all packfiles to ensure there isn't a collision in the first 20 hex
> > characters.
> 
> ...if what we primarily care about speeding up is abbreviations, is it
> crazy to consider disabling the disambiguation step entirely?

Not crazy. But it would break stuff. Because...

> The results of find_unique_abbrev are already a bit of a probability
> game. They're guaranteed at the moment of generation, but as more
> objects are added, ambiguities may be introduced. Likewise, what's
> unambiguous for you may not be for somebody else you're communicating
> with, if they have their own clone.

... this is only a probability game in the long term, when you consider
new objects to enter from *somewhere*.

But in purely local settings, when we expect no new objects to be
introduced, we do use known-unambiguous abbreviations.

Take the interactive rebase for example. It generates todo lists with
abbreviated commit names, for readability (and it is *really* important to
keep this readable). As we expect new objects to be introduced by the
interactive rebase, we convert that todo list to unabbreviated commit
names before executing the interactive rebase.

Your idea (to not care about unambiguous abbreviations) would break that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 10:20     ` Jeff King
  2018-01-08 10:27       ` Jeff King
  2018-01-08 13:43       ` Johannes Schindelin
@ 2018-01-08 13:43       ` Derrick Stolee
  2018-01-09  7:12         ` Jeff King
  2 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2018-01-08 13:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, git,
	gitster, johannes.schindelin, jrnieder

On 1/8/2018 5:20 AM, Jeff King wrote:
> On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:
>
>>> (Not a critique of this, just a (stupid) question)
>>>
>>> What's the practical use-case for this feature? Since it doesn't help
>>> with --abbrev=40 the speedup is all in the part that ensures we don't
>>> show an ambiguous SHA-1.
>> The point of including the --abbrev=40 is to point out that object lookups
>> do not get slower with the MIDX feature. Using these "git log" options is a
>> good way to balance object lookups and abbreviations with object parsing and
>> diff machinery. And while the public data shape I shared did not show a
>> difference, our private testing of the Windows repository did show a
>> valuable improvement when isolating to object lookups and ignoring
>> abbreviation calculations.
> Just to make sure I'm parsing this correctly: normal lookups do get faster
> when you have a single index, given the right setup?
>
> I'm curious what that setup looked like. Is it just tons and tons of
> packs? Is it ones where the packs do not follow the mru patterns very
> well?

The way I repacked the Linux repo creates an artificially good set of 
packs for the MRU cache. When the packfiles are partitioned instead by 
the time the objects were pushed to a remote, the MRU cache performs 
poorly. Improving these object lookups are a primary reason for the MIDX 
feature, and almost all commands improve because of it. 'git log' is 
just the simplest to use for demonstration.

> I think it's worth thinking a bit about, because...
>
>>> If something cares about both throughput and e.g. is saving the
>>> abbreviated SHA-1s isn't it better off picking some arbitrary size
>>> (e.g. --abbrev=20), after all the default abbreviation is going to show
>>> something as small as possible, which may soon become ambigous after the
>>> next commit.
>> Unfortunately, with the way the abbreviation algorithms work, using
>> --abbrev=20 will have similar performance problems because you still need to
>> inspect all packfiles to ensure there isn't a collision in the first 20 hex
>> characters.
> ...if what we primarily care about speeding up is abbreviations, is it
> crazy to consider disabling the disambiguation step entirely?
>
> The results of find_unique_abbrev are already a bit of a probability
> game. They're guaranteed at the moment of generation, but as more
> objects are added, ambiguities may be introduced. Likewise, what's
> unambiguous for you may not be for somebody else you're communicating
> with, if they have their own clone.
>
> Since we now scale the default abbreviation with the size of the repo,
> that gives us a bounded and pretty reasonable probability that we won't
> hit a collision at all[1].
>
> I.e., what if we did something like this:
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..04c661ba85 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>   	if (len == GIT_SHA1_HEXSZ || !len)
>   		return GIT_SHA1_HEXSZ;
>   
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10)
> +		return len;
> +
>   	mad.init_len = len;
>   	mad.cur_len = len;
>   	mad.hex = hex;
>
> If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The
> patch above drops "git log --oneline --raw" on the resulting repo from
> ~150s to ~30s.
>
> With a single pack, it goes from ~33s ~29s. Less impressive, but there's
> still some benefit.
>
> There may be other reasons to want MIDX or something like it, but I just
> wonder if we can do this much simpler thing to cover the abbreviation
> case. I guess the question is whether somebody is going to be annoyed in
> the off chance that they hit a collision.

No only are users going to be annoyed when they hit collisions after 
copy-pasting an abbreviated hash, there are also a large number of tools 
that people build that use abbreviated hashes (either for presenting to 
users or because they didn't turn off abbreviations).

Abbreviations cause performance issues in other commands, too (like 
'fetch'!), so whatever short-circuit you put in, it would need to be 
global. A flag on one builtin would not suffice.

> -Peff
>
> [1] I'd have to check the numbers, but IIRC we've set the scaling so
>      that the chance of having a _single_ collision in the repository is
>      less than 50%, and rounding to the conservative side (since each hex
>      char gives us 4 bits). And indeed, "git log --oneline --raw" on
>      linux.git does not seem to have any collisions at its default of 12
>      characters, at least in my clone.
>
>      We could also consider switching core.disambiguate to "commit",
>      which makes even a collision less likely to annoy the user.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
  2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
@ 2018-01-08 19:32   ` Jonathan Tan
  2018-01-08 20:35     ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Tan @ 2018-01-08 19:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On Sun,  7 Jan 2018 13:14:42 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> +Design Details
> +--------------
> +
> +- The MIDX file refers only to packfiles in the same directory
> +  as the MIDX file.
> +
> +- A special file, 'midx-head', stores the hash of the latest
> +  MIDX file so we can load the file without performing a dirstat.
> +  This file is especially important with incremental MIDX files,
> +  pointing to the newest file.

I presume that the actual MIDX files are named by hash? (You might have
written this somewhere that I somehow missed.)

Also, I notice that in the "Future Work" section, the possibility of
multiple MIDX files is raised. Could this 'midx-head' file be allowed to
store multiple such files? That way, we avoid a bit of file format
churn (in that we won't need to define a new "chunk" in the future).

> +- If a packfile exists in the pack directory but is not referenced
> +  by the MIDX file, then the packfile is loaded into the packed_git
> +  list and Git can access the objects as usual. This behavior is
> +  necessary since other tools could add packfiles to the pack
> +  directory without notifying Git.
> +
> +- The MIDX file should be only a supplemental structure. If a
> +  user downgrades or disables the `core.midx` config setting,
> +  then the existing .idx and .pack files should be sufficient
> +  to operate correctly.

Let me try to summarize: so, at this point, there are no
backwards-incompatible changes to the repo disk format. Unupdated code
paths (and old versions of Git) can just read the .idx and .pack files,
as always. Updated code paths will look at the .midx and .idx files, and
will sort them as follows:
 - .midx files go into a data structure
 - .idx files not referenced by any .midx files go into the
   existing packed_git data structure

A writer can either merely write a new packfile (like old versions of
Git) or write a packfile and update the .midx file, and everything above
will still work. In the event that a writer deletes an existing packfile
referenced by a .midx (for example, old versions of Git during a
repack), we will lose the advantages of the .midx file - we will detect
that the .midx no longer works when attempting to read an object given
its information, but in this case, we can recover by dropping the .midx
file and loading all the .idx files it references that still exist.

As a reviewer, I think this is a very good approach, and this does make
things easier to review (as opposed to, say, an approach where a lot of
the code must be aware of .midx files).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
  2018-01-08 19:32   ` Jonathan Tan
@ 2018-01-08 20:35     ` Derrick Stolee
  2018-01-08 22:06       ` Jonathan Tan
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2018-01-08 20:35 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On 1/8/2018 2:32 PM, Jonathan Tan wrote:
> On Sun,  7 Jan 2018 13:14:42 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> +Design Details
>> +--------------
>> +
>> +- The MIDX file refers only to packfiles in the same directory
>> +  as the MIDX file.
>> +
>> +- A special file, 'midx-head', stores the hash of the latest
>> +  MIDX file so we can load the file without performing a dirstat.
>> +  This file is especially important with incremental MIDX files,
>> +  pointing to the newest file.
> I presume that the actual MIDX files are named by hash? (You might have
> written this somewhere that I somehow missed.)
>
> Also, I notice that in the "Future Work" section, the possibility of
> multiple MIDX files is raised. Could this 'midx-head' file be allowed to
> store multiple such files? That way, we avoid a bit of file format
> churn (in that we won't need to define a new "chunk" in the future).

I hadn't considered this idea, and I like it. I'm not sure this is a 
robust solution, since isolated MIDX files don't contain information 
that they could use other MIDX files, or what order they should be in. I 
think the "order" of incremental MIDX files is important in a few ways 
(such as the "stable object order" idea).

I will revisit this idea when I come back with the incremental MIDX 
feature. For now, the only reference to "number of base MIDX files" is 
in one byte of the MIDX header. We should consider changing that byte 
for this patch.

>> +- If a packfile exists in the pack directory but is not referenced
>> +  by the MIDX file, then the packfile is loaded into the packed_git
>> +  list and Git can access the objects as usual. This behavior is
>> +  necessary since other tools could add packfiles to the pack
>> +  directory without notifying Git.
>> +
>> +- The MIDX file should be only a supplemental structure. If a
>> +  user downgrades or disables the `core.midx` config setting,
>> +  then the existing .idx and .pack files should be sufficient
>> +  to operate correctly.
> Let me try to summarize: so, at this point, there are no
> backwards-incompatible changes to the repo disk format. Unupdated code
> paths (and old versions of Git) can just read the .idx and .pack files,
> as always. Updated code paths will look at the .midx and .idx files, and
> will sort them as follows:
>   - .midx files go into a data structure
>   - .idx files not referenced by any .midx files go into the
>     existing packed_git data structure
>
> A writer can either merely write a new packfile (like old versions of
> Git) or write a packfile and update the .midx file, and everything above
> will still work. In the event that a writer deletes an existing packfile
> referenced by a .midx (for example, old versions of Git during a
> repack), we will lose the advantages of the .midx file - we will detect
> that the .midx no longer works when attempting to read an object given
> its information, but in this case, we can recover by dropping the .midx
> file and loading all the .idx files it references that still exist.
>
> As a reviewer, I think this is a very good approach, and this does make
> things easier to review (as opposed to, say, an approach where a lot of
> the code must be aware of .midx files).

Thanks! That is certainly the idea. If you know about MIDX, then you can 
benefit from it. If you do not, then you have all the same data 
available to you do to your work. Having a MIDX file will not break 
other tools (libgit2, JGit, etc.).

One thing I'd like to determine before this patch goes to v1 is how much 
we should make the other packfile-aware commands also midx-aware. My gut 
reaction right now is to have git-repack call 'git midx --clear' if 
core.midx=true and a packfile was deleted. However, this could easily be 
changed with 'git midx --clear' followed by 'git midx --write 
--update-head' if midx-head exists.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
  2018-01-08 20:35     ` Derrick Stolee
@ 2018-01-08 22:06       ` Jonathan Tan
  0 siblings, 0 replies; 48+ messages in thread
From: Jonathan Tan @ 2018-01-08 22:06 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On Mon, 8 Jan 2018 15:35:59 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> Thanks! That is certainly the idea. If you know about MIDX, then you can 
> benefit from it. If you do not, then you have all the same data 
> available to you do to your work. Having a MIDX file will not break 
> other tools (libgit2, JGit, etc.).
> 
> One thing I'd like to determine before this patch goes to v1 is how much 
> we should make the other packfile-aware commands also midx-aware. My gut 
> reaction right now is to have git-repack call 'git midx --clear' if 
> core.midx=true and a packfile was deleted. However, this could easily be 
> changed with 'git midx --clear' followed by 'git midx --write 
> --update-head' if midx-head exists.

My opinion is that these are sufficient:
 (a) functionality to create a .midx file from scratch (deleting any
     existing ones)
 (b) either:
     - update of packfile.c to read (one or more) midx files (like in
       patch 18), and possibly usage in a benchmark, or
     - any other usage of midx file (e.g. abbreviations, like in patch
       17)

In general, a way to create them (so that they can be used from a
cronjob, etc.), and a way to consume them to show that the new way works
and is indeed faster. This functionality in itself might be sufficient
to merge in - this would already be useful in situations where it is
undesirable for repacks to happen often, and we can "bridge" the
intervals between repacks using a cronjob that periodically generates
.midx files in order to keep up the object lookup performance.

In particular, I think that it is fine to omit a more sophisticated
"repack" for now - the .midx mechanism must tolerate packs referenced by
.midx files suddenly disappearing anyway, and in this way, at least we
can demonstrate that the .midx mechanism still works in this case.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 13:43       ` Johannes Schindelin
@ 2018-01-09  6:50         ` Jeff King
  2018-01-09 13:05           ` Johannes Schindelin
  0 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2018-01-09  6:50 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, git,
	dstolee, git, gitster, jrnieder

On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:

> Take the interactive rebase for example. It generates todo lists with
> abbreviated commit names, for readability (and it is *really* important to
> keep this readable). As we expect new objects to be introduced by the
> interactive rebase, we convert that todo list to unabbreviated commit
> names before executing the interactive rebase.
> 
> Your idea (to not care about unambiguous abbreviations) would break that.

I think that could be easily worked around for rebase by asking git to
check ambiguity during the conversion. Speed is much less of a problem
there, because we're doing a relatively small number of abbreviations
(compared to "git log --oneline --raw", which is abbreviating almost
every object in the repository).

But I agree it's a potential problem for other scripts that we might not
have control over. I hadn't really intended this to be the default
behavior (my patch was just trying to show the direction). But it does
make for a pretty awful interface if callers have to opt into it
manually ("git log --oneline --no-really-go-fast").

I am a bit curious if there's a bounded probability that people would
find acceptable for Git to give an ambiguous abbreviation. We already
accept 1 in 2^160, of course. But would, e.g., one in a million be OK?

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08 13:43       ` Derrick Stolee
@ 2018-01-09  7:12         ` Jeff King
  0 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2018-01-09  7:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, git,
	gitster, johannes.schindelin, jrnieder

On Mon, Jan 08, 2018 at 08:43:44AM -0500, Derrick Stolee wrote:

> > Just to make sure I'm parsing this correctly: normal lookups do get faster
> > when you have a single index, given the right setup?
> > 
> > I'm curious what that setup looked like. Is it just tons and tons of
> > packs? Is it ones where the packs do not follow the mru patterns very
> > well?
> 
> The way I repacked the Linux repo creates an artificially good set of packs
> for the MRU cache. When the packfiles are partitioned instead by the time
> the objects were pushed to a remote, the MRU cache performs poorly.
> Improving these object lookups are a primary reason for the MIDX feature,
> and almost all commands improve because of it. 'git log' is just the
> simplest to use for demonstration.

Interesting.

The idea of the pack mru (and the "last pack looked in" before it) is
that there would be good locality for time-segmented packs (like those
generated by pushes), since most operations also tend to visit history
in chronological-ish order (e.g., log). Tree-wide operations would be an
exception there, though, since files would have many ages across the
tree (though in practice, one would hope that pretty-ancient history
would eventually be replaced in a modern tree).

I've often wondered, though, if the mru (and "last pack") work mainly
because the "normal" distribution for a repository is to have one big
pack with most of history, and then a couple of smaller ones (what
hasn't been packed yet). Even something as naive as "look in the last
pack" works there, because it turns into "look in the biggest pack".  If
it has most of the objects, it's the most likely place for the previous
and the next objects to be.

But from my understanding of how you handle the Windows repository, you
have tons of equal-sized packs that are never coalesced. Which is quite
a different pattern.

I would expect your "--max-pack-size" thing to end up with something
roughly segmented like pushes, though, just because we do order the
write phase in reverse chronological order. Well, mostly. We do each
object type in its own chunks, and there's some reordering based on
deltas. So given the sizes, it's likely that most of your trees all end
up in a few packs.

> > There may be other reasons to want MIDX or something like it, but I just
> > wonder if we can do this much simpler thing to cover the abbreviation
> > case. I guess the question is whether somebody is going to be annoyed in
> > the off chance that they hit a collision.
> 
> No only are users going to be annoyed when they hit collisions after
> copy-pasting an abbreviated hash, there are also a large number of tools
> that people build that use abbreviated hashes (either for presenting to
> users or because they didn't turn off abbreviations).
>
> Abbreviations cause performance issues in other commands, too (like
> 'fetch'!), so whatever short-circuit you put in, it would need to be global.
> A flag on one builtin would not suffice.

Yeah, I agree it's potentially problematic for a lot of reasons. I was
just hoping we could bound the probabilities in a way that is
acceptable. Maybe it's a fool's errand.

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09  6:50         ` Jeff King
@ 2018-01-09 13:05           ` Johannes Schindelin
  2018-01-09 19:51             ` Stefan Beller
  2018-01-10 10:57             ` Jeff King
  0 siblings, 2 replies; 48+ messages in thread
From: Johannes Schindelin @ 2018-01-09 13:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, git,
	dstolee, git, gitster, jrnieder

Hi Peff,

On Tue, 9 Jan 2018, Jeff King wrote:

> On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:
> 
> > Take the interactive rebase for example. It generates todo lists with
> > abbreviated commit names, for readability (and it is *really* important to
> > keep this readable). As we expect new objects to be introduced by the
> > interactive rebase, we convert that todo list to unabbreviated commit
> > names before executing the interactive rebase.
> > 
> > Your idea (to not care about unambiguous abbreviations) would break that.
> 
> I think that could be easily worked around for rebase by asking git to
> check ambiguity during the conversion.

Sure.

It also points to a flaw in your reasoning, and you should take my example
further: previously, we guaranteed unique abbreviations, and who is to say
that there is no script out there relying on that guarantee? You?

> But I agree it's a potential problem for other scripts that we might not
> have control over. I hadn't really intended this to be the default
> behavior (my patch was just trying to show the direction). But it does
> make for a pretty awful interface if callers have to opt into it
> manually ("git log --oneline --no-really-go-fast").

Exactly.

> I am a bit curious if there's a bounded probability that people would
> find acceptable for Git to give an ambiguous abbreviation. We already
> accept 1 in 2^160, of course. But would, e.g., one in a million be OK?

What is that going to solve?

I think a better alternative would be to introduce a new abbreviation mode
that is *intended* to stop caring about unique abbreviations.

In web interfaces, for example, it makes tons of sense to show, say, 8
digits in link texts and have the full name in the actual link URL.

Currently, we have no way to tell Git: look, I want to have a short label
for this, but I do not really care that this is unambiguous, I just want
something to talk about.

(And you are correct, of course, that the uniqueness of abbreviations will
no longer be guaranteed for partial clones, and it is already not
guaranteed for shallow clones, and it will only be possible to guarantee
this, ever, for one particular repository at a particular time.)

I am just hesitant to change things that would break existing setups.

Just to throw this out there: --abbrev=8! would be one possible convention
to state "I want exactly 8 hex digits, don't bother checking for
uniqueness".

Not sure how urgent such a feature is.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 13:05           ` Johannes Schindelin
@ 2018-01-09 19:51             ` Stefan Beller
  2018-01-09 20:12               ` Junio C Hamano
  2018-01-10 17:05               ` Johannes Schindelin
  2018-01-10 10:57             ` Jeff King
  1 sibling, 2 replies; 48+ messages in thread
From: Stefan Beller @ 2018-01-09 19:51 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jeff King, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	git, Derrick Stolee, Jeff Hostetler, Junio C Hamano,
	Jonathan Nieder

On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Peff,
>
> On Tue, 9 Jan 2018, Jeff King wrote:
>
>> On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:
>>
>> > Take the interactive rebase for example. It generates todo lists with
>> > abbreviated commit names, for readability (and it is *really* important to
>> > keep this readable). As we expect new objects to be introduced by the
>> > interactive rebase, we convert that todo list to unabbreviated commit
>> > names before executing the interactive rebase.
>> >
>> > Your idea (to not care about unambiguous abbreviations) would break that.
>>
>> I think that could be easily worked around for rebase by asking git to
>> check ambiguity during the conversion.
>
> Sure.
>
> It also points to a flaw in your reasoning, and you should take my example
> further: previously, we guaranteed unique abbreviations, and who is to say
> that there is no script out there relying on that guarantee? You?
>
>> But I agree it's a potential problem for other scripts that we might not
>> have control over. I hadn't really intended this to be the default
>> behavior (my patch was just trying to show the direction). But it does
>> make for a pretty awful interface if callers have to opt into it
>> manually ("git log --oneline --no-really-go-fast").
>
> Exactly.
>
>> I am a bit curious if there's a bounded probability that people would
>> find acceptable for Git to give an ambiguous abbreviation. We already
>> accept 1 in 2^160, of course. But would, e.g., one in a million be OK?
>
> What is that going to solve?
>
> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
>
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.
>
> Currently, we have no way to tell Git: look, I want to have a short label
> for this, but I do not really care that this is unambiguous, I just want
> something to talk about.
>
> (And you are correct, of course, that the uniqueness of abbreviations will
> no longer be guaranteed for partial clones, and it is already not
> guaranteed for shallow clones, and it will only be possible to guarantee
> this, ever, for one particular repository at a particular time.)
>
> I am just hesitant to change things that would break existing setups.
>
> Just to throw this out there: --abbrev=8! would be one possible convention
> to state "I want exactly 8 hex digits, don't bother checking for
> uniqueness".
>
> Not sure how urgent such a feature is.

You seem to have spent some time on rebase, including lamenting on the
in-expressiveness of the instruction sheet (c.f. "rebase a mergy history sucks")

And in that light, I'd like to propose a new naming scheme:

(a) assume that we "tag" HEAD at the start of the rebase
(b) any abbreviation must be given as committish anchored to said ref:

pick REBASE_HEAD~1 commit subject
pick REBASE_HEAD~2 distak the gostim
pick REBASE_HEAD~3 Document foo
pick REBASE_HEAD~4 Testing the star-gazer

And as we have the full descriptiveness of the committishs there,
each commit can be described in a unique way using the graph relationship.
I am just throwing the name REBASE_HEAD out there to trigger some associations
("similar to FETCH_HEAD"), but I dislike the name.

(c) this would not solve the problem of mergy history, yet. For that we'd need
    to introduce more keywords, that allow us to move around in the DAG,
    such that we can reset to a specific revision or name revisions on the fly.
    So maybe all we need is "reset", "name" (== short lived tag),
    "merge" (rewrite parents of HEAD) to be expressive enough, but still keep
    the line oriented interface?

Stefan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 19:51             ` Stefan Beller
@ 2018-01-09 20:12               ` Junio C Hamano
  2018-01-09 20:16                 ` Stefan Beller
  2018-01-10 17:05               ` Johannes Schindelin
  1 sibling, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2018-01-09 20:12 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Johannes Schindelin, Jeff King, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, git, Derrick Stolee,
	Jeff Hostetler, Jonathan Nieder

Stefan Beller <sbeller@google.com> writes:

> And in that light, I'd like to propose a new naming scheme:
>
> (a) assume that we "tag" HEAD at the start of the rebase
> (b) any abbreviation must be given as committish anchored to said ref:
>
> pick REBASE_HEAD~1 commit subject
> pick REBASE_HEAD~2 distak the gostim
> pick REBASE_HEAD~3 Document foo
> pick REBASE_HEAD~4 Testing the star-gazer
>
> And as we have the full descriptiveness of the committishs there,
> each commit can be described in a unique way using the graph relationship.
> I am just throwing the name REBASE_HEAD out there to trigger some associations
> ("similar to FETCH_HEAD"), but I dislike the name.
>
> (c) this would not solve the problem of mergy history, yet. For that we'd need
>     to introduce more keywords, that allow us to move around in the DAG,
>     such that we can reset to a specific revision or name revisions on the fly.
>     So maybe all we need is "reset", "name" (== short lived tag),
>     "merge" (rewrite parents of HEAD) to be expressive enough, but still keep
>     the line oriented interface?

It is correct to say that (c) is an issue that is not solved by (b),
but with the current scheme, the commits are internally referenced
by full object names, and just before it is presented to the end
users, these names are abbreviated down to unique prefix.  The
machinery expands these abbreviated names back to the full names
before going forward, so it is not an issue that we are creating new
commits during the rebase.

Does it make it easier to read if we used ~$n notation based on a
fixed point, instead of shortened unique object names?  What
improvement is (b) trying to achieve?




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 20:12               ` Junio C Hamano
@ 2018-01-09 20:16                 ` Stefan Beller
  2018-01-09 21:31                   ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Stefan Beller @ 2018-01-09 20:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Jeff King, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, git, Derrick Stolee,
	Jeff Hostetler, Jonathan Nieder

On Tue, Jan 9, 2018 at 12:12 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> And in that light, I'd like to propose a new naming scheme:
>>
>> (a) assume that we "tag" HEAD at the start of the rebase
>> (b) any abbreviation must be given as committish anchored to said ref:
>>
>> pick REBASE_HEAD~1 commit subject
>> pick REBASE_HEAD~2 distak the gostim
>> pick REBASE_HEAD~3 Document foo
>> pick REBASE_HEAD~4 Testing the star-gazer
>>
>> And as we have the full descriptiveness of the committishs there,
>> each commit can be described in a unique way using the graph relationship.
>> I am just throwing the name REBASE_HEAD out there to trigger some associations
>> ("similar to FETCH_HEAD"), but I dislike the name.
>>
>> (c) this would not solve the problem of mergy history, yet. For that we'd need
>>     to introduce more keywords, that allow us to move around in the DAG,
>>     such that we can reset to a specific revision or name revisions on the fly.
>>     So maybe all we need is "reset", "name" (== short lived tag),
>>     "merge" (rewrite parents of HEAD) to be expressive enough, but still keep
>>     the line oriented interface?
>
> It is correct to say that (c) is an issue that is not solved by (b),
> but with the current scheme, the commits are internally referenced
> by full object names, and just before it is presented to the end
> users, these names are abbreviated down to unique prefix.  The
> machinery expands these abbreviated names back to the full names
> before going forward, so it is not an issue that we are creating new
> commits during the rebase.
>
> Does it make it easier to read if we used ~$n notation based on a
> fixed point, instead of shortened unique object names?  What
> improvement is (b) trying to achieve?
>

Johannes wrote:
> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
>
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.

And that is what (b) would solve, as it is shorter than the full hash and
yet exact.

(c) was mostly speculation on top of (b) if we can take it any further.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 20:16                 ` Stefan Beller
@ 2018-01-09 21:31                   ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2018-01-09 21:31 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Johannes Schindelin, Jeff King, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, git, Derrick Stolee,
	Jeff Hostetler, Jonathan Nieder

Stefan Beller <sbeller@google.com> writes:

> Johannes wrote:
>> I think a better alternative would be to introduce a new abbreviation mode
>> that is *intended* to stop caring about unique abbreviations.
>>
>> In web interfaces, for example, it makes tons of sense to show, say, 8
>> digits in link texts and have the full name in the actual link URL.
>
> And that is what (b) would solve, as it is shorter than the full hash and
> yet exact.

I still do not get it, even though I fully agree that in Web UI what
Dscho envisions makes tons of sense.  Use some short handle that
does not need to be unique inside repository to display, but have a
full information that can be used by machines.  The shortened ones
need to be unique _within_ a given todo list, to be displayed as
text to be "clicked", where the A element's href attribute that
surrounds that "clickable" text has fully unambiguous information.

And that fully unambiguous information, because it is for machine
consumption, can be a full object name without any shortening.

I do not see a need for REBASE_HEAD~$n to make it less robust
(i.e. we now need to worry about making sure it is not lost or moved
while we need it and clean it up when we are done, etc.)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 13:05           ` Johannes Schindelin
  2018-01-09 19:51             ` Stefan Beller
@ 2018-01-10 10:57             ` Jeff King
  1 sibling, 0 replies; 48+ messages in thread
From: Jeff King @ 2018-01-10 10:57 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, git,
	dstolee, git, gitster, jrnieder

On Tue, Jan 09, 2018 at 02:05:09PM +0100, Johannes Schindelin wrote:

> > I think that could be easily worked around for rebase by asking git to
> > check ambiguity during the conversion.
> 
> Sure.
> 
> It also points to a flaw in your reasoning, and you should take my example
> further: previously, we guaranteed unique abbreviations, and who is to say
> that there is no script out there relying on that guarantee? You?

I'm not sure what you think my reasoning was, since I made the same
point directly after. ;)

> > I am a bit curious if there's a bounded probability that people would
> > find acceptable for Git to give an ambiguous abbreviation. We already
> > accept 1 in 2^160, of course. But would, e.g., one in a million be OK?
> 
> What is that going to solve?

I'm not sure I understand your question. We can bound the probability of
a collision by increasing the size of the abbreviation. My question was
whether there is a size where that tradeoff makes sense. So it "solves"
the problem of worrying about collisions.

> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
> 
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.

I'm not sure if that would be useful option or not. I certainly agree
that static abbreviations are a useful thing to want. But for scripted
callers, it's usually easy to cut down the abbreviation yourself. I
think your web interface example is a good one. The caller will ask Git
for the full 40-hex hash, and then cut it down itself when generating
the link (and I just so happen to have worked on a web interface that
does this[1]).

So would it be useful for humans to use? That's what I'm not sure about.
It seems cumbersome to have to add "--abbrev=8!" on the command line to
each invocation. But if it were triggered by config, it seems like we
would risk breaking scripts.

> I am just hesitant to change things that would break existing setups.

Me too. I'm not sure if I'm seriously advocating the "bound the
probability" thing. It's just something I think is worth pondering a
little.

-Peff

[1] Actually, there _is_ something that it's useful for Git to tell the
    caller, which is the approximate number of objects in the
    repository. Eight digits is not enough for the kernel, for example,
    and these days we use 12 digits by default. I'm not sure if such a
    web interface would rather ask for "%H %h" and generate the link
    text from the second half, or it would rather just get a ballpark on
    the number of objects, and do the simple abbreviation math itself
    (right now GitHub does not do either; I don't know about other
    interfaces).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-09 19:51             ` Stefan Beller
  2018-01-09 20:12               ` Junio C Hamano
@ 2018-01-10 17:05               ` Johannes Schindelin
  1 sibling, 0 replies; 48+ messages in thread
From: Johannes Schindelin @ 2018-01-10 17:05 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	git, Derrick Stolee, Jeff Hostetler, Junio C Hamano,
	Jonathan Nieder

Hi Stefan,

On Tue, 9 Jan 2018, Stefan Beller wrote:

> On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>
> > Just to throw this out there: --abbrev=8! would be one possible
> > convention to state "I want exactly 8 hex digits, don't bother
> > checking for uniqueness".
> >
> > Not sure how urgent such a feature is.
> 
> You seem to have spent some time on rebase, including lamenting on the
> in-expressiveness of the instruction sheet (c.f. "rebase a mergy history
> sucks")

Let's call it "todo list" again?

And yes, I spend a *lot* of time in rebase. It's part of my duties as Git
for Windows maintainer.

> And in that light, I'd like to propose a new naming scheme:
> 
> (a) assume that we "tag" HEAD at the start of the rebase
> (b) any abbreviation must be given as committish anchored to said ref:
> 
> pick REBASE_HEAD~1 commit subject
> pick REBASE_HEAD~2 distak the gostim
> pick REBASE_HEAD~3 Document foo
> pick REBASE_HEAD~4 Testing the star-gazer

I do not necessarily agree that this is better because it is valid only
locally, only in the current rebase (and you enter new problems because
you implicitly require REBASE_HEAD to be worktree-local lest you prevent
simultaneous rebases in related worktrees).

That prevents you from, say, chatting to your colleague and mentioning
that commit that you are uncertain about. Imagine asking "Hey Todd, do you
know what REBASE_HEAD~2 is all about?". Granted, if you ask about
"deadbeef" and it just so happens that this shortened name is ambiguous in
Todd's repository. But that is pretty unlikely, isn't it?

> And as we have the full descriptiveness of the committishs there, each
> commit can be described in a unique way using the graph relationship.  I
> am just throwing the name REBASE_HEAD out there to trigger some
> associations ("similar to FETCH_HEAD"), but I dislike the name.

Not only that. It would also be confusing to read after reordering...

> (c) this would not solve the problem of mergy history, yet. For that
> we'd need to introduce more keywords, that allow us to move around in
> the DAG, [...]

You probably missed my hint that I have a working solution for that.
Please have a look at

	https://github.com/git/git/pull/447

Short version: for a commit topology like this:

	A - B - C
	  \   /
	    D

the generated list would look like this:

	# branch D
	pick 0123 A
	label branch-point
	pick 1234 D
	label D

	reset branch-point
	pick 2345 B
	merge 3456 D C

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
                   ` (18 preceding siblings ...)
  2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
@ 2018-01-10 18:25 ` Martin Fick
  2018-01-10 19:39   ` Derrick Stolee
  19 siblings, 1 reply; 48+ messages in thread
From: Martin Fick @ 2018-01-10 18:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee 
wrote:
> This RFC includes a new way to index the objects in
> multiple packs using one file, called the multi-pack
> index (MIDX).
...
> The main goals of this RFC are:
> 
> * Determine interest in this feature.
> 
> * Find other use cases for the MIDX feature.

My interest in this feature would be to speed up fetches 
when there is more than one large pack-file with many of the 
same objects that are in other pack-files.   What does your 
MIDX design do when it encounters multiple copies of the 
same object in different pack files?  Does it index them all, 
or does it keep a single copy?

In our Gerrit instance (Gerrit uses jgit), we have multiple 
copies of the linux kernel repos linked together via the 
alternatives file mechanism.  These repos have many different 
references (mostly Gerrit change references), but they share 
most of the common objects from the mainline.  I have found 
that during a large fetch such as a clone, jgit spends a 
significant amount of extra time by having the extra large 
pack-files from the other repos visible to it, usually around 
an extra minute per instance of these (without them, the 
clone takes around 7mins).  This adds up easily with a few 
repos extra repos, it can almost double the time.

My investigations have shown that this is due to jgit 
searching each of these pack files to decide which version of 
each object to send.  I don't fully understand its selection 
criteria, however if I shortcut it to just pick the first 
copy of an object that it finds, I regain my lost time.  I 
don't know if git suffers from a similar problem?  If git 
doesn't suffer from this then it likely just uses the first 
copy of an object it finds (which may not be the best object 
to send?)

It would be nice if this use case could be improved with 
MIDX.  To do so, it seems that it would either require that 
MIDX either only put "the best" version of an object (i.e. 
pre-select which one to use), or include the extra 
information to help make the selection process of which copy 
to use (perhaps based on the operation being performed) 
fast.

This also leads me to ask, what other additional information 
(bitmaps?) for other operations, besides object location, 
might suddenly be valuable in an index that potentially 
points to multiple copies of objects?  Would such 
information be appropriate in MIDX, or would it be better in 
another index?

Thanks,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-10 18:25 ` Martin Fick
@ 2018-01-10 19:39   ` Derrick Stolee
  2018-01-10 21:01     ` Martin Fick
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2018-01-10 19:39 UTC (permalink / raw)
  To: Martin Fick
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On 1/10/2018 1:25 PM, Martin Fick wrote:
> On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
> wrote:
>> This RFC includes a new way to index the objects in
>> multiple packs using one file, called the multi-pack
>> index (MIDX).
> ...
>> The main goals of this RFC are:
>>
>> * Determine interest in this feature.
>>
>> * Find other use cases for the MIDX feature.
> My interest in this feature would be to speed up fetches
> when there is more than one large pack-file with many of the
> same objects that are in other pack-files.   What does your
> MIDX design do when it encounters multiple copies of the
> same object in different pack files?  Does it index them all,
> or does it keep a single copy?

The MIDX currently keeps only one reference to each object. Duplicates 
are dropped during writing. (See the care taken in commit 04/18 to avoid 
duplicates.) Since midx_sha1_compare() does not use anything other than 
the OID to order the objects, there is no decision being made about 
which pack is "better". The MIDX writes the first copy it finds and 
discards the others.

It would not be difficult to include a check in midx_sha1_compare() to 
favor one packfile over another based on some measurement (size? 
mtime?). Since this would be a heuristic at best, I left it out of the 
current patch.

> In our Gerrit instance (Gerrit uses jgit), we have multiple
> copies of the linux kernel repos linked together via the
> alternatives file mechanism.

GVFS also uses alternates for sharing packfiles across multiple copies 
of the repo. The MIDX is designed to cover all packfiles in the same 
directory, but is not designed to cover packfiles in multiple 
alternates; currently, each alternate would need its own MIDX file. Does 
that cause issues with your setup?

>    These repos have many different
> references (mostly Gerrit change references), but they share
> most of the common objects from the mainline.  I have found
> that during a large fetch such as a clone, jgit spends a
> significant amount of extra time by having the extra large
> pack-files from the other repos visible to it, usually around
> an extra minute per instance of these (without them, the
> clone takes around 7mins).  This adds up easily with a few
> repos extra repos, it can almost double the time.
>
> My investigations have shown that this is due to jgit
> searching each of these pack files to decide which version of
> each object to send.  I don't fully understand its selection
> criteria, however if I shortcut it to just pick the first
> copy of an object that it finds, I regain my lost time.  I
> don't know if git suffers from a similar problem?  If git
> doesn't suffer from this then it likely just uses the first
> copy of an object it finds (which may not be the best object
> to send?)
>
> It would be nice if this use case could be improved with
> MIDX.  To do so, it seems that it would either require that
> MIDX either only put "the best" version of an object (i.e.
> pre-select which one to use), or include the extra
> information to help make the selection process of which copy
> to use (perhaps based on the operation being performed)
> fast.

I'm not sure if there is sufficient value in storing multiple references 
to the same object stored in multiple packfiles. There could be value in 
carefully deciding which copy is "best" during the MIDX write, but 
during read is not a good time to make such a decision. It also 
increases the size of the file to store multiple copies.

> This also leads me to ask, what other additional information
> (bitmaps?) for other operations, besides object location,
> might suddenly be valuable in an index that potentially
> points to multiple copies of objects?  Would such
> information be appropriate in MIDX, or would it be better in
> another index?

For applications to bitmaps, it is probably best that we only include 
one copy of each object. Otherwise, we need to include extra bits in the 
bitmaps for those copies (when asking "is this object in the bitmap?").

Thanks for the context with Gerrit's duplicate object problem. I'll try 
to incorporate it in to the design document (commit 01/18) for the v1 patch.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-10 19:39   ` Derrick Stolee
@ 2018-01-10 21:01     ` Martin Fick
  0 siblings, 0 replies; 48+ messages in thread
From: Martin Fick @ 2018-01-10 21:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, Johannes.Shindelin, jrnieder

On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee 
wrote:
> On 1/10/2018 1:25 PM, Martin Fick wrote:
> > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
> > 
> > wrote:
> >> This RFC includes a new way to index the objects in
> >> multiple packs using one file, called the multi-pack
> >> index (MIDX).
> > 
> > ...
> > 
> >> The main goals of this RFC are:
> >> 
> >> * Determine interest in this feature.
> >> 
> >> * Find other use cases for the MIDX feature.
> > 
> > My interest in this feature would be to speed up fetches
> > when there is more than one large pack-file with many of
> > the same objects that are in other pack-files.   What
> > does your MIDX design do when it encounters multiple
> > copies of the same object in different pack files? 
> > Does it index them all, or does it keep a single copy?
> 
> The MIDX currently keeps only one reference to each
> object. Duplicates are dropped during writing. (See the
> care taken in commit 04/18 to avoid duplicates.) Since
> midx_sha1_compare() does not use anything other than the
> OID to order the objects, there is no decision being made
> about which pack is "better". The MIDX writes the first
> copy it finds and discards the others.

This would likely speed things up then, even if the chosen 
objects are suboptimal.

> It would not be difficult to include a check in
> midx_sha1_compare() to favor one packfile over another
> based on some measurement (size? mtime?). Since this
> would be a heuristic at best, I left it out of the
> current patch.

Yeah, I didn't know what heuristic to use either, I tended 
to think that the bigger pack-file would be valuable because 
it is more likely to share deltas with other objects in that 
pack, so more easy to send them.  However, that is likely 
only true during clones or other large fetches when we want 
most objects.  During small "update" fetches, the newer 
packs might be better?

I also thought that objects in alternates should be 
considered less valuable for my use case, however in the 
github fork use case, the alternates might be more valuable?

So yes heuristics, and I don't know what is best.  Perhaps 
some config options could be used to set heuristics like 
this.  Whatever the heuristics are, since they would be a 
part of the MIDX packing process it would be easy to change.  
This assumes that keeping only one copy in the index is the 
right thing.  The question would be, what if we need 
different heuristics for different operations?  Would it make 
sense to have multiple MIDX files covering the same packs 
then, one for fetch, one for merge...?

> > In our Gerrit instance (Gerrit uses jgit), we have
> > multiple copies of the linux kernel repos linked
> > together via the alternatives file mechanism.
> 
> GVFS also uses alternates for sharing packfiles across
> multiple copies of the repo. The MIDX is designed to
> cover all packfiles in the same directory, but is not
> designed to cover packfiles in multiple alternates;
> currently, each alternate would need its own MIDX file.
> Does that cause issues with your setup?

No, since the other large packfiles are all in other repos 
(alternates).  Is there a reason the MIDX would not want to 
cover the alternates?  If you don't then you would seemingly 
loose any benefits of the MIDX when you have alternates in 
use.

...
> > It would be nice if this use case could be improved with
> > MIDX.  To do so, it seems that it would either require
> > that MIDX either only put "the best" version of an
> > object (i.e. pre-select which one to use), or include
> > the extra information to help make the selection
> > process of which copy to use (perhaps based on the
> > operation being performed) fast.
> 
> I'm not sure if there is sufficient value in storing
> multiple references to the same object stored in multiple
> packfiles. There could be value in carefully deciding
> which copy is "best" during the MIDX write, but during
> read is not a good time to make such a decision. It also
> increases the size of the file to store multiple copies.

Yes, I am not sure either, it would be good to have input 
from experts here.

> > This also leads me to ask, what other additional
> > information (bitmaps?) for other operations, besides
> > object location, might suddenly be valuable in an index
> > that potentially points to multiple copies of objects? 
> > Would such information be appropriate in MIDX, or would
> > it be better in another index?
> 
> For applications to bitmaps, it is probably best that we
> only include one copy of each object. Otherwise, we need
> to include extra bits in the bitmaps for those copies
> (when asking "is this object in the bitmap?").

Agreed.  Would the MIDX potentially pave the way to create 
multi-pack bitmaps?

> Thanks for the context with Gerrit's duplicate object
> problem. I'll try to incorporate it in to the design
> document (commit 01/18) for the v1 patch.

FYI, this is not a typical Gerrit thing, it is something 
special we do in our installation.

Thanks for your thoughts,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-01-08  0:08   ` Derrick Stolee
  2018-01-08 10:20     ` Jeff King
  2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
@ 2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
                         ` (3 more replies)
  2 siblings, 4 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-06  8:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, gitster, johannes.schindelin, jrnieder


On Mon, Jan 08 2018, Derrick Stolee wrote:

> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Sun, Jan 07 2018, Derrick Stolee jotted:
>>
>>>      git log --oneline --raw --parents
>>>
>>> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
>>> ----------+-------------+------------+--------+----------
>>>          1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>>>         24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>>>        127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>>>
>>> The last column is the relative difference between the MIDX-enabled repo
>>> and the single-pack repo. The goal of the MIDX feature is to present the
>>> ODB as if it was fully repacked, so there is still room for improvement.
>>>
>>> Changing the command to
>>>
>>>      git log --oneline --raw --parents --abbrev=40
>>>
>>> has no observable difference (sub 1% change in all cases). This is likely
>>> due to the repack I used putting commits and trees in a small number of
>>> packfiles so the MRU cache workes very well. On more naturally-created
>>> lists of packfiles, there can be up to 20% improvement on this command.
>>>
>>> We are using a version of this patch with an upcoming release of GVFS.
>>> This feature is particularly important in that space since GVFS performs
>>> a "prefetch" step that downloads a pack of commits and trees on a daily
>>> basis. These packfiles are placed in an alternate that is shared by all
>>> enlistments. Some users have 150+ packfiles and the MRU misses and
>>> abbreviation computations are significant. Now, GVFS manages the MIDX file
>>> after adding new prefetch packfiles using the following command:
>>>
>>>      git midx --write --update-head --delete-expired --pack-dir=<alt>
>>
>> (Not a critique of this, just a (stupid) question)
>>
>> What's the practical use-case for this feature? Since it doesn't help
>> with --abbrev=40 the speedup is all in the part that ensures we don't
>> show an ambiguous SHA-1.
>
> The point of including the --abbrev=40 is to point out that object
> lookups do not get slower with the MIDX feature. Using these "git log"
> options is a good way to balance object lookups and abbreviations with
> object parsing and diff machinery.[...]

[snip]

> [...]And while the public data shape I shared did not show a
> difference, our private testing of the Windows repository did show a
> valuable improvement when isolating to object lookups and ignoring
> abbreviation calculations.

Replying to this old thread since I see you're prepearing the MIDX for
submission again and this seemed like the best venue.

Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only
references the speedups in abbreviation calculations, but here you
allude to other improvements. It would be very nice to have some summary
of that in docs / commit messages when you submit this.

I've been meaning to get around to submitting something like I mentioned
in https://public-inbox.org/git/87efn0bkls.fsf@evledraar.gmail.com/
i.e. a way to expand the abbrev mode to not check disambiguations, which
would look something like:

    core.abbrev = 20
    core.validateAbbrev = false

Or:

    core.abbrev = +2
    core.validateAbbrev = false

So, using the example from the above referenced E-Mail +2 would make
linux.git emit hashes of 14 characters, without any abbreviation
checking (just trusting in statistics to work in your favor).

As noted by JS in this thread that wouldn't be acceptable for your
use-case, but there's plenty of people (including me) who'd appreciate
the speedup without being a 100% sure we're emitting unambiguous hashes,
since that trade-off is better than time spent generating another index
on-disk. So I see it as a complimentary & orthogonal feature.

But with that implemented I wouldn't get any benefit from things that
use the MIDX that aren't abbreviations, so what are those?

>> The reason we do that at all is because it makes for a prettier UI.
>
> We tried setting core.abbrev=40 on GVFS enlistments to speed up
> performance and the users rebelled against the hideous output. They
> would rather have slower speeds than long hashes.
>
>> Are there things that both want the pretty SHA-1 and also care about the
>> throughput? I'd have expected machine parsing to just use
>> --no-abbrev-commit.
>
> The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes.
>
>> If something cares about both throughput and e.g. is saving the
>> abbreviated SHA-1s isn't it better off picking some arbitrary size
>> (e.g. --abbrev=20), after all the default abbreviation is going to show
>> something as small as possible, which may soon become ambigous after the
>> next commit.
>
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still
> need to inspect all packfiles to ensure there isn't a collision in the
> first 20 hex characters.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
@ 2018-06-06 10:27       ` Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-06 10:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, dstolee, git, peff, johannes.schindelin, jrnieder,
	Linus Torvalds, Ævar Arnfjörð Bjarmason

On Wed, Jun 06 2018, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Jan 08 2018, Derrick Stolee wrote:
>
>> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Sun, Jan 07 2018, Derrick Stolee jotted:
>>>
>>>>      git log --oneline --raw --parents
>>>>
>>>> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
>>>> ----------+-------------+------------+--------+----------
>>>>          1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>>>>         24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>>>>        127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>>>>
>>>> The last column is the relative difference between the MIDX-enabled repo
>>>> and the single-pack repo. The goal of the MIDX feature is to present the
>>>> ODB as if it was fully repacked, so there is still room for improvement.
>>>>
>>>> Changing the command to
>>>>
>>>>      git log --oneline --raw --parents --abbrev=40
>>>>
>>>> has no observable difference (sub 1% change in all cases). This is likely
>>>> due to the repack I used putting commits and trees in a small number of
>>>> packfiles so the MRU cache workes very well. On more naturally-created
>>>> lists of packfiles, there can be up to 20% improvement on this command.
>>>>
>>>> We are using a version of this patch with an upcoming release of GVFS.
>>>> This feature is particularly important in that space since GVFS performs
>>>> a "prefetch" step that downloads a pack of commits and trees on a daily
>>>> basis. These packfiles are placed in an alternate that is shared by all
>>>> enlistments. Some users have 150+ packfiles and the MRU misses and
>>>> abbreviation computations are significant. Now, GVFS manages the MIDX file
>>>> after adding new prefetch packfiles using the following command:
>>>>
>>>>      git midx --write --update-head --delete-expired --pack-dir=<alt>
>>>
>>> (Not a critique of this, just a (stupid) question)
>>>
>>> What's the practical use-case for this feature? Since it doesn't help
>>> with --abbrev=40 the speedup is all in the part that ensures we don't
>>> show an ambiguous SHA-1.
>>
>> The point of including the --abbrev=40 is to point out that object
>> lookups do not get slower with the MIDX feature. Using these "git log"
>> options is a good way to balance object lookups and abbreviations with
>> object parsing and diff machinery.[...]
>
> [snip]
>
>> [...]And while the public data shape I shared did not show a
>> difference, our private testing of the Windows repository did show a
>> valuable improvement when isolating to object lookups and ignoring
>> abbreviation calculations.
>
> Replying to this old thread since I see you're prepearing the MIDX for
> submission again and this seemed like the best venue.
>
> Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only
> references the speedups in abbreviation calculations, but here you
> allude to other improvements. It would be very nice to have some summary
> of that in docs / commit messages when you submit this.
>
> I've been meaning to get around to submitting something like I mentioned
> in https://public-inbox.org/git/87efn0bkls.fsf@evledraar.gmail.com/
> i.e. a way to expand the abbrev mode to not check disambiguations, which
> would look something like:
>
>     core.abbrev = 20
>     core.validateAbbrev = false
>
> Or:
>
>     core.abbrev = +2
>     core.validateAbbrev = false
>
> So, using the example from the above referenced E-Mail +2 would make
> linux.git emit hashes of 14 characters, without any abbreviation
> checking (just trusting in statistics to work in your favor).
>
> As noted by JS in this thread that wouldn't be acceptable for your
> use-case, but there's plenty of people (including me) who'd appreciate
> the speedup without being a 100% sure we're emitting unambiguous hashes,
> since that trade-off is better than time spent generating another index
> on-disk. So I see it as a complimentary & orthogonal feature.
>
> But with that implemented I wouldn't get any benefit from things that
> use the MIDX that aren't abbreviations, so what are those?

I won't have time to finish this today, but it's already in a shape
that I think is useful for discussion to see what others think.

I still need to make this be supported by --abbrev=* and have
e.g. --abbrev=+2 work. I got as far as this with that:

    diff --git a/parse-options-cb.c b/parse-options-cb.c
    index 0f9f311a7a..7cc9d3dfe6 100644
    --- a/parse-options-cb.c
    +++ b/parse-options-cb.c
    @@ -16,13 +16,23 @@ int parse_opt_abbrev_cb(const struct option *opt, const char *arg, int unset)
     	if (!arg) {
     		v = unset ? 0 : DEFAULT_ABBREV;
     	} else {
    +		const char *origarg = arg;
     		v = strtol(arg, (char **)&arg, 10);
     		if (*arg)
     			return opterror(opt, "expects a numerical value", 0);
    -		if (v && v < MINIMUM_ABBREV)
    +		if (*origarg == '+' || *origarg == '-') {
    +			if (v == 0) {
    +				return opterror(opt, "relative abbrev must be non-zero", 0);
    +			} else {
    +				default_abbrev_relative = v;
    +				v = -1;
    +			}
    +		} else if (v && v < MINIMUM_ABBREV) {
     			v = MINIMUM_ABBREV;
    -		else if (v > 40)
    +		} else if (v > 40) {
     			v = 40;
    +		}
     	}
     	*(int *)(opt->value) = v;
     	return 0;

But e.g. blame would print 40 character SHA-1s on +2, I didn't have
time to dig into why.

This'll also need tests, I haven't added any yet, and finally it's
probably a good idea to split off the core.abbrev=[+-]NUM feature into
a seperate patch from core.validateAbbrev, although with my 2/2 the
two can be used in isolation, or together.

Ævar Arnfjörð Bjarmason (2):
  config.c: use braces on multiple conditional arms
  sha1-name: add core.validateAbbrev & relative core.abbrev

 Documentation/config.txt | 46 ++++++++++++++++++++++++++++++++++++++++
 cache.h                  |  2 ++
 config.c                 | 18 ++++++++++++++--
 environment.c            |  2 ++
 sha1-name.c              | 15 +++++++++++++
 5 files changed, 81 insertions(+), 2 deletions(-)

-- 
2.17.0.290.gded63e768a


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH 1/2] config.c: use braces on multiple conditional arms
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
@ 2018-06-06 10:27       ` Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
  2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
  3 siblings, 0 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-06 10:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, dstolee, git, peff, johannes.schindelin, jrnieder,
	Linus Torvalds, Ævar Arnfjörð Bjarmason

Adjust this code that'll be modified in a subsequent change per the
CodingGuidelines.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 config.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index fbbf0f8e9f..12f762ad92 100644
--- a/config.c
+++ b/config.c
@@ -1149,9 +1149,9 @@ static int git_default_core_config(const char *var, const char *value)
 	if (!strcmp(var, "core.abbrev")) {
 		if (!value)
 			return config_error_nonbool(var);
-		if (!strcasecmp(value, "auto"))
+		if (!strcasecmp(value, "auto")) {
 			default_abbrev = -1;
-		else {
+		} else {
 			int abbrev = git_config_int(var, value);
 			if (abbrev < minimum_abbrev || abbrev > 40)
 				return error("abbrev length out of range: %d", abbrev);
-- 
2.17.0.290.gded63e768a


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
  2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
@ 2018-06-06 10:27       ` Ævar Arnfjörð Bjarmason
  2018-06-06 12:04         ` Christian Couder
  2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
  3 siblings, 1 reply; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-06 10:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, dstolee, git, peff, johannes.schindelin, jrnieder,
	Linus Torvalds, Ævar Arnfjörð Bjarmason

Since Linus added auto-sizing for abbreviations in
e6c587c733 ("abbrev: auto size the default abbreviation", 2016-09-30)
we've been less likely to produce a short SHA-1 today that'll collide
on the same repository tomorrow, since before we'd always pick the
bare minimum we could get away with.

But we still do a full disambiguation check, which can be very
expensive in some cases. There's a work-in-progress MIDX
implementation to address that[1].

This change adds an alternate method of achieving some of the same
ends (but possibly not all, see [2] and replies to the original thread
at [1]).

Now, as described in the docs the user can set a relative abbreviation
length via core.abbrev, e.g. on linux.git core.abbrev=+2 will produce
SHA-1s that are 14 characters (as opposed to the implicit 12).

This in combination with core.validateAbbrev=false (off by default)
allows for picking a trade-off between performance, and the odds that
future or remote (or current and local) short SHA-1 will be ambiguous.

1. https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
2. https://public-inbox.org/git/87lgbsz61p.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config.txt | 46 ++++++++++++++++++++++++++++++++++++++++
 cache.h                  |  2 ++
 config.c                 | 14 ++++++++++++
 environment.c            |  2 ++
 sha1-name.c              | 15 +++++++++++++
 5 files changed, 79 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..8624110818 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -919,6 +919,52 @@ core.abbrev::
 	in your repository, which hopefully is enough for
 	abbreviated object names to stay unique for some time.
 	The minimum length is 4.
++
+This can also be set to relative values such as `+2` or `-2`, which
+means to add or subtract N characters from the SHA-1 that Git would
+otherwise print. This is useful in combination with the
+`core.validateAbbrev` setting, or to get more future-proof hashes to
+reference in the future in a repository whose number of objects is
+expected to grow.
+
+core.validateAbbrev::
+	If set to false (true by default) don't do any validation when
+	printing abbreviated object names to see if they're really
+	unique. This makes printing objects more performant at the
+	cost of potentially printing object names that aren't unique
+	within the current repository.
++
+When printing abbreviated object names Git needs to look through the
+local object store. This is an `O(log N)` operation assuming all the
+objects are in a single pack file, but `X * O(log N)` given `X` pack
+files, which can get expensive on some larger repositories.
++
+This setting changes that to `O(1)`, but with the trade-off that
+depending on the value of `core.abbrev` way may be printing
+abbreviated hashes that collide. Too see how likely this is, try
+running:
++
+-----------------------------------------------------------------------------------------------------------
+git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr
+-----------------------------------------------------------------------------------------------------------
++
+This shows how many commits were found at each abbreviation length. On
+linux.git in June 2018 this shows a bit more than 750,000 commits,
+with just 4 needing 11 characters to be fully abbreviated, and the
+default heuristic picks a length of 12.
++
+Even without `core.validateAbbrev=false` the results abbreviation
+already a bit of a probability game. They're guaranteed at the moment
+of generation, but as more objects are added, ambiguities may be
+introduced. Likewise, what's unambiguous for you may not be for
+somebody else you're communicating with, if they have their own clone.
++
+Therefore the default of `core.validateAbbrev=true` may not save you
+in practice if you're sharing the SHA-1 or noting it now to use after
+a `git fetch`. You may be better off setting `core.abbrev` to
+e.g. `+2` to add 2 extra characters to the SHA-1 in combination with
+`core.validateAbbrev=false` to get a reasonable trade-off between
+safety and performance.
 
 add.ignoreErrors::
 add.ignore-errors (deprecated)::
diff --git a/cache.h b/cache.h
index 89a107a7f7..6dc5af9482 100644
--- a/cache.h
+++ b/cache.h
@@ -772,6 +772,8 @@ extern int check_stat;
 extern int quote_path_fully;
 extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
+extern int default_abbrev_relative;
+extern int validate_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
diff --git a/config.c b/config.c
index 12f762ad92..b6e0d17af1 100644
--- a/config.c
+++ b/config.c
@@ -1146,11 +1146,25 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.validateabbrev")) {
+		if (!value)
+			return config_error_nonbool(var);
+		validate_abbrev =  git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.abbrev")) {
 		if (!value)
 			return config_error_nonbool(var);
 		if (!strcasecmp(value, "auto")) {
 			default_abbrev = -1;
+		} else if (*value == '+' || *value == '-') {
+			int relative = git_config_int(var, value);
+			if (relative == 0)
+				die(_("bad core.abbrev value %s. "
+				      "relative values must be non-zero"),
+				    value);
+			default_abbrev_relative = relative;
 		} else {
 			int abbrev = git_config_int(var, value);
 			if (abbrev < minimum_abbrev || abbrev > 40)
diff --git a/environment.c b/environment.c
index 2a6de2330b..4a24d8126b 100644
--- a/environment.c
+++ b/environment.c
@@ -22,6 +22,8 @@ int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = -1;
+int default_abbrev_relative = 0;
+int validate_abbrev = 1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..aa7ccea14d 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -576,6 +576,7 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
 	struct disambiguate_state ds;
 	struct min_abbrev_data mad;
 	struct object_id oid_ret;
+	int dar = default_abbrev_relative;
 	if (len < 0) {
 		unsigned long count = approximate_object_count();
 		/*
@@ -602,6 +603,20 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	if (dar) {
+		if (len + dar < MINIMUM_ABBREV) {
+			len = MINIMUM_ABBREV;
+			dar = 0;
+		}
+
+		if (validate_abbrev) {
+			len += dar;
+		} else {
+			hex[len + dar] = 0;
+			return len + dar;
+		}
+	}
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;
-- 
2.17.0.290.gded63e768a


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
  2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
                         ` (2 preceding siblings ...)
  2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
@ 2018-06-06 11:24       ` Derrick Stolee
  3 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2018-06-06 11:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, dstolee, git, peff, gitster, johannes.schindelin, jrnieder

On 6/6/2018 4:13 AM, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Jan 08 2018, Derrick Stolee wrote:
>
>> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>> On Sun, Jan 07 2018, Derrick Stolee jotted:
>>>
>>>>       git log --oneline --raw --parents
>>>>
>>>> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
>>>> ----------+-------------+------------+--------+----------
>>>>           1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>>>>          24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>>>>         127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>>>>
>>>> The last column is the relative difference between the MIDX-enabled repo
>>>> and the single-pack repo. The goal of the MIDX feature is to present the
>>>> ODB as if it was fully repacked, so there is still room for improvement.
>>>>
>>>> Changing the command to
>>>>
>>>>       git log --oneline --raw --parents --abbrev=40
>>>>
>>>> has no observable difference (sub 1% change in all cases). This is likely
>>>> due to the repack I used putting commits and trees in a small number of
>>>> packfiles so the MRU cache workes very well. On more naturally-created
>>>> lists of packfiles, there can be up to 20% improvement on this command.
>>>>
>>>> We are using a version of this patch with an upcoming release of GVFS.
>>>> This feature is particularly important in that space since GVFS performs
>>>> a "prefetch" step that downloads a pack of commits and trees on a daily
>>>> basis. These packfiles are placed in an alternate that is shared by all
>>>> enlistments. Some users have 150+ packfiles and the MRU misses and
>>>> abbreviation computations are significant. Now, GVFS manages the MIDX file
>>>> after adding new prefetch packfiles using the following command:
>>>>
>>>>       git midx --write --update-head --delete-expired --pack-dir=<alt>
>>> (Not a critique of this, just a (stupid) question)
>>>
>>> What's the practical use-case for this feature? Since it doesn't help
>>> with --abbrev=40 the speedup is all in the part that ensures we don't
>>> show an ambiguous SHA-1.
>> The point of including the --abbrev=40 is to point out that object
>> lookups do not get slower with the MIDX feature. Using these "git log"
>> options is a good way to balance object lookups and abbreviations with
>> object parsing and diff machinery.[...]
> [snip]
>
>> [...]And while the public data shape I shared did not show a
>> difference, our private testing of the Windows repository did show a
>> valuable improvement when isolating to object lookups and ignoring
>> abbreviation calculations.
> Replying to this old thread since I see you're prepearing the MIDX for
> submission again and this seemed like the best venue.

You're really good at tracking new commits in my GitHub page ;)

>
> Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only
> references the speedups in abbreviation calculations, but here you
> allude to other improvements. It would be very nice to have some summary
> of that in docs / commit messages when you submit this.

The new version is essentially a complete rewrite of the feature, since 
we learned a lot about how to add a new data store from the commit-graph 
series. The design document [1] refers to some of the immediate benefits 
and future benefits. Some of these future benefits were discussed at the 
contributor's summit [2].

[1] 
https://github.com/derrickstolee/git/blob/midx/upstream/Documentation/technical/midx.txt

[2] 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
     Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)

>
> I've been meaning to get around to submitting something like I mentioned
> in https://public-inbox.org/git/87efn0bkls.fsf@evledraar.gmail.com/
> i.e. a way to expand the abbrev mode to not check disambiguations, which
> would look something like:
>
>      core.abbrev = 20
>      core.validateAbbrev = false
>
> Or:
>
>      core.abbrev = +2
>      core.validateAbbrev = false
>
> So, using the example from the above referenced E-Mail +2 would make
> linux.git emit hashes of 14 characters, without any abbreviation
> checking (just trusting in statistics to work in your favor).
>
> As noted by JS in this thread that wouldn't be acceptable for your
> use-case, but there's plenty of people (including me) who'd appreciate
> the speedup without being a 100% sure we're emitting unambiguous hashes,
> since that trade-off is better than time spent generating another index
> on-disk. So I see it as a complimentary & orthogonal feature.
>
> But with that implemented I wouldn't get any benefit from things that
> use the MIDX that aren't abbreviations, so what are those?

The MIDX is built for handling many packfiles. As opposed to the 
commit-graph feature, your repo needs to be _really_big_ to need the 
MIDX. Most just repack into one packfile on a regular basis.

One case for vanilla Git: we've heard from lots of customers disabling 
gc.auto in their build machines because they didn't want to wait for a 
repack/gc after a fetch and before a build. Then, they end up in a 
many-pack situation because they never scheduled time for that repack/gc.

For GVFS, we virtualize the repo, so objects can be downloaded from the 
server on-demand. Since round-trips to the server are expensive, we have 
the notion of a "prefetch pack". Our cache servers precompute packfiles 
containing all of the commits and trees introduced to the server that 
day. On a 'git fetch' command, GVFS intercepts via a hook and downloads 
the list of daily prefetch packs since the last prefetch. These are 
placed in an alternate (the "shared object cache"), so multiple 
enlistments can point to the same, fixed list of packs. The cache 
servers fold these packs together after 30 days, so users on a fresh 
clone start with 31 packs and users who go on vacation for a month get 
at most 31 packs in a single download.

The performance gains from the MIDX for abbrevations are the easiest to 
demonstrate. It's not hard to see why the single binary search to find 
the two closest OIDs is better than iterating through all of the 
packfiles and running a binary search on each.

The other thing that the MIDX does right now is be a single source for 
object lookups. This is harder to demonstrate the benefit in Git on its 
own, because using `git repack -adfF --max-pack-size=128m` repacks in a 
"good" way. This repack actually makes the MRU packfile list really 
effective.

One reason the MIDX is important for object lookups in the GVFS case is 
two-fold:

1. The time-based organization of the prefetch packs makes the trees 
necessary for a 'checkout' spread across many packfiles. Even with the 
working directory virtualization, every 'checkout' needs to recompute 
the .git/index, and that requires walking lots of trees.

2. When we are looking for a blob that is not in the local cache, we 
need to check that the blob is on-disk before requesting it from the 
server. In a many-pack scenario, those object misses are as expensive as 
the abbreviation case. This may affect partial clone, so I'll talk with 
Jeff H about that situation.

I expect to send my v1 today. I'm just trying to create a synthetic 
environment that demonstrates this benefit to object lookups using the 
Linux repository.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev
  2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
@ 2018-06-06 12:04         ` Christian Couder
  0 siblings, 0 replies; 48+ messages in thread
From: Christian Couder @ 2018-06-06 12:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, dstolee, Jeff Hostetler, Jeff King,
	Johannes Schindelin, Jonathan Nieder, Linus Torvalds

On Wed, Jun 6, 2018 at 12:27 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:

> +This setting changes that to `O(1)`, but with the trade-off that
> +depending on the value of `core.abbrev` way may be printing

s/way may be printing/we may be printing/

> +abbreviated hashes that collide. Too see how likely this is, try

s/Too see/To see/

> +running:

[...]

> +Even without `core.validateAbbrev=false` the results abbreviation
> +already a bit of a probability game.

s/the results abbreviation already a bit of/the resulting abbreviation
is already a bit of/ maybe?

> diff --git a/sha1-name.c b/sha1-name.c
> index 60d9ef3c7e..aa7ccea14d 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -576,6 +576,7 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
>         struct disambiguate_state ds;
>         struct min_abbrev_data mad;
>         struct object_id oid_ret;
> +       int dar = default_abbrev_relative;
>         if (len < 0) {
>                 unsigned long count = approximate_object_count();
>                 /*
> @@ -602,6 +603,20 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
>         if (len == GIT_SHA1_HEXSZ || !len)
>                 return GIT_SHA1_HEXSZ;
>
> +       if (dar) {
> +               if (len + dar < MINIMUM_ABBREV) {
> +                       len = MINIMUM_ABBREV;
> +                       dar = 0;
> +               }
> +
> +               if (validate_abbrev) {
> +                       len += dar;
> +               } else {
> +                       hex[len + dar] = 0;
> +                       return len + dar;
> +               }

I wonder what happens if len + dar > GIT_SHA1_HEXSZ

> +       }

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2018-06-06 12:05 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
2018-01-08 19:32   ` Jonathan Tan
2018-01-08 20:35     ` Derrick Stolee
2018-01-08 22:06       ` Jonathan Tan
2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
2018-01-08  0:08   ` Derrick Stolee
2018-01-08 10:20     ` Jeff King
2018-01-08 10:27       ` Jeff King
2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
2018-01-08 13:43       ` Johannes Schindelin
2018-01-09  6:50         ` Jeff King
2018-01-09 13:05           ` Johannes Schindelin
2018-01-09 19:51             ` Stefan Beller
2018-01-09 20:12               ` Junio C Hamano
2018-01-09 20:16                 ` Stefan Beller
2018-01-09 21:31                   ` Junio C Hamano
2018-01-10 17:05               ` Johannes Schindelin
2018-01-10 10:57             ` Jeff King
2018-01-08 13:43       ` Derrick Stolee
2018-01-09  7:12         ` Jeff King
2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
2018-06-06 12:04         ` Christian Couder
2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-10 18:25 ` Martin Fick
2018-01-10 19:39   ` Derrick Stolee
2018-01-10 21:01     ` Martin Fick

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).