git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Martin Fick <mfick@codeaurora.org>
Cc: git@vger.kernel.org, dstolee@microsoft.com,
	git@jeffhostetler.com, peff@peff.net, gitster@pobox.com,
	Johannes.Shindelin@gmx.de, jrnieder@gmail.com
Subject: Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Date: Wed, 10 Jan 2018 14:39:13 -0500	[thread overview]
Message-ID: <7b3df316-8a35-0dcf-da17-64202f600821@gmail.com> (raw)
In-Reply-To: <2648064.NJaMxHkljU@mfick-lnx>

On 1/10/2018 1:25 PM, Martin Fick wrote:
> On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
> wrote:
>> This RFC includes a new way to index the objects in
>> multiple packs using one file, called the multi-pack
>> index (MIDX).
> ...
>> The main goals of this RFC are:
>>
>> * Determine interest in this feature.
>>
>> * Find other use cases for the MIDX feature.
> My interest in this feature would be to speed up fetches
> when there is more than one large pack-file with many of the
> same objects that are in other pack-files.   What does your
> MIDX design do when it encounters multiple copies of the
> same object in different pack files?  Does it index them all,
> or does it keep a single copy?

The MIDX currently keeps only one reference to each object. Duplicates 
are dropped during writing. (See the care taken in commit 04/18 to avoid 
duplicates.) Since midx_sha1_compare() does not use anything other than 
the OID to order the objects, there is no decision being made about 
which pack is "better". The MIDX writes the first copy it finds and 
discards the others.

It would not be difficult to include a check in midx_sha1_compare() to 
favor one packfile over another based on some measurement (size? 
mtime?). Since this would be a heuristic at best, I left it out of the 
current patch.

> In our Gerrit instance (Gerrit uses jgit), we have multiple
> copies of the linux kernel repos linked together via the
> alternatives file mechanism.

GVFS also uses alternates for sharing packfiles across multiple copies 
of the repo. The MIDX is designed to cover all packfiles in the same 
directory, but is not designed to cover packfiles in multiple 
alternates; currently, each alternate would need its own MIDX file. Does 
that cause issues with your setup?

>    These repos have many different
> references (mostly Gerrit change references), but they share
> most of the common objects from the mainline.  I have found
> that during a large fetch such as a clone, jgit spends a
> significant amount of extra time by having the extra large
> pack-files from the other repos visible to it, usually around
> an extra minute per instance of these (without them, the
> clone takes around 7mins).  This adds up easily with a few
> repos extra repos, it can almost double the time.
>
> My investigations have shown that this is due to jgit
> searching each of these pack files to decide which version of
> each object to send.  I don't fully understand its selection
> criteria, however if I shortcut it to just pick the first
> copy of an object that it finds, I regain my lost time.  I
> don't know if git suffers from a similar problem?  If git
> doesn't suffer from this then it likely just uses the first
> copy of an object it finds (which may not be the best object
> to send?)
>
> It would be nice if this use case could be improved with
> MIDX.  To do so, it seems that it would either require that
> MIDX either only put "the best" version of an object (i.e.
> pre-select which one to use), or include the extra
> information to help make the selection process of which copy
> to use (perhaps based on the operation being performed)
> fast.

I'm not sure if there is sufficient value in storing multiple references 
to the same object stored in multiple packfiles. There could be value in 
carefully deciding which copy is "best" during the MIDX write, but 
during read is not a good time to make such a decision. It also 
increases the size of the file to store multiple copies.

> This also leads me to ask, what other additional information
> (bitmaps?) for other operations, besides object location,
> might suddenly be valuable in an index that potentially
> points to multiple copies of objects?  Would such
> information be appropriate in MIDX, or would it be better in
> another index?

For applications to bitmaps, it is probably best that we only include 
one copy of each object. Otherwise, we need to include extra bits in the 
bitmaps for those copies (when asking "is this object in the bitmap?").

Thanks for the context with Gerrit's duplicate object problem. I'll try 
to incorporate it in to the design document (commit 01/18) for the v1 patch.

Thanks,
-Stolee


  reply	other threads:[~2018-01-10 19:39 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Derrick Stolee
2018-01-08 19:32   ` Jonathan Tan
2018-01-08 20:35     ` Derrick Stolee
2018-01-08 22:06       ` Jonathan Tan
2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
2018-01-08  0:08   ` Derrick Stolee
2018-01-08 10:20     ` Jeff King
2018-01-08 10:27       ` Jeff King
2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
2018-01-08 13:43       ` Johannes Schindelin
2018-01-09  6:50         ` Jeff King
2018-01-09 13:05           ` Johannes Schindelin
2018-01-09 19:51             ` Stefan Beller
2018-01-09 20:12               ` Junio C Hamano
2018-01-09 20:16                 ` Stefan Beller
2018-01-09 21:31                   ` Junio C Hamano
2018-01-10 17:05               ` Johannes Schindelin
2018-01-10 10:57             ` Jeff King
2018-01-08 13:43       ` Derrick Stolee
2018-01-09  7:12         ` Jeff King
2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
2018-06-06 12:04         ` Christian Couder
2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-10 18:25 ` Martin Fick
2018-01-10 19:39   ` Derrick Stolee [this message]
2018-01-10 21:01     ` Martin Fick

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7b3df316-8a35-0dcf-da17-64202f600821@gmail.com \
    --to=stolee@gmail.com \
    --cc=Johannes.Shindelin@gmx.de \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=mfick@codeaurora.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).