git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git@vger.kernel.org, dstolee@microsoft.com,
	git@jeffhostetler.com, gitster@pobox.com,
	johannes.schindelin@gmx.de, jrnieder@gmail.com
Subject: Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Date: Tue, 9 Jan 2018 02:12:08 -0500	[thread overview]
Message-ID: <20180109071208.GB32257@sigill.intra.peff.net> (raw)
In-Reply-To: <d00ba196-715d-4613-530c-ef7f7c4562f4@gmail.com>

On Mon, Jan 08, 2018 at 08:43:44AM -0500, Derrick Stolee wrote:

> > Just to make sure I'm parsing this correctly: normal lookups do get faster
> > when you have a single index, given the right setup?
> > 
> > I'm curious what that setup looked like. Is it just tons and tons of
> > packs? Is it ones where the packs do not follow the mru patterns very
> > well?
> 
> The way I repacked the Linux repo creates an artificially good set of packs
> for the MRU cache. When the packfiles are partitioned instead by the time
> the objects were pushed to a remote, the MRU cache performs poorly.
> Improving these object lookups are a primary reason for the MIDX feature,
> and almost all commands improve because of it. 'git log' is just the
> simplest to use for demonstration.

Interesting.

The idea of the pack mru (and the "last pack looked in" before it) is
that there would be good locality for time-segmented packs (like those
generated by pushes), since most operations also tend to visit history
in chronological-ish order (e.g., log). Tree-wide operations would be an
exception there, though, since files would have many ages across the
tree (though in practice, one would hope that pretty-ancient history
would eventually be replaced in a modern tree).

I've often wondered, though, if the mru (and "last pack") work mainly
because the "normal" distribution for a repository is to have one big
pack with most of history, and then a couple of smaller ones (what
hasn't been packed yet). Even something as naive as "look in the last
pack" works there, because it turns into "look in the biggest pack".  If
it has most of the objects, it's the most likely place for the previous
and the next objects to be.

But from my understanding of how you handle the Windows repository, you
have tons of equal-sized packs that are never coalesced. Which is quite
a different pattern.

I would expect your "--max-pack-size" thing to end up with something
roughly segmented like pushes, though, just because we do order the
write phase in reverse chronological order. Well, mostly. We do each
object type in its own chunks, and there's some reordering based on
deltas. So given the sizes, it's likely that most of your trees all end
up in a few packs.

> > There may be other reasons to want MIDX or something like it, but I just
> > wonder if we can do this much simpler thing to cover the abbreviation
> > case. I guess the question is whether somebody is going to be annoyed in
> > the off chance that they hit a collision.
> 
> No only are users going to be annoyed when they hit collisions after
> copy-pasting an abbreviated hash, there are also a large number of tools
> that people build that use abbreviated hashes (either for presenting to
> users or because they didn't turn off abbreviations).
>
> Abbreviations cause performance issues in other commands, too (like
> 'fetch'!), so whatever short-circuit you put in, it would need to be global.
> A flag on one builtin would not suffice.

Yeah, I agree it's potentially problematic for a lot of reasons. I was
just hoping we could bound the probabilities in a way that is
acceptable. Maybe it's a fool's errand.

-Peff

  reply	other threads:[~2018-01-09  7:12 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
2018-01-08 19:32   ` Jonathan Tan
2018-01-08 20:35     ` Derrick Stolee
2018-01-08 22:06       ` Jonathan Tan
2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
2018-01-08  0:08   ` Derrick Stolee
2018-01-08 10:20     ` Jeff King
2018-01-08 10:27       ` Jeff King
2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
2018-01-08 13:43       ` Johannes Schindelin
2018-01-09  6:50         ` Jeff King
2018-01-09 13:05           ` Johannes Schindelin
2018-01-09 19:51             ` Stefan Beller
2018-01-09 20:12               ` Junio C Hamano
2018-01-09 20:16                 ` Stefan Beller
2018-01-09 21:31                   ` Junio C Hamano
2018-01-10 17:05               ` Johannes Schindelin
2018-01-10 10:57             ` Jeff King
2018-01-08 13:43       ` Derrick Stolee
2018-01-09  7:12         ` Jeff King [this message]
2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
2018-06-06 12:04         ` Christian Couder
2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-10 18:25 ` Martin Fick
2018-01-10 19:39   ` Derrick Stolee
2018-01-10 21:01     ` Martin Fick

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180109071208.GB32257@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=jrnieder@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).