git@vger.kernel.org mailing list mirror (one of many)
From: Martin Fick <mfick@codeaurora.org>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com,
	git@jeffhostetler.com, peff@peff.net, gitster@pobox.com,
	Johannes.Shindelin@gmx.de, jrnieder@gmail.com
Subject: Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Date: Wed, 10 Jan 2018 14:01:28 -0700
Message-ID: <7273121.kKAycuMGzH@mfick-lnx>
In-Reply-To: <7b3df316-8a35-0dcf-da17-64202f600821@gmail.com>

On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee wrote:
> On 1/10/2018 1:25 PM, Martin Fick wrote:
> > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote:
> >> This RFC includes a new way to index the objects in
> >> multiple packs using one file, called the multi-pack
> >> index (MIDX).
> > 
> > ...
> > 
> >> The main goals of this RFC are:
> >> 
> >> * Determine interest in this feature.
> >> 
> >> * Find other use cases for the MIDX feature.
> > 
> > My interest in this feature would be to speed up fetches
> > when there is more than one large pack-file with many of
> > the same objects that are in other pack-files.   What
> > does your MIDX design do when it encounters multiple
> > copies of the same object in different pack files? 
> > Does it index them all, or does it keep a single copy?
> 
> The MIDX currently keeps only one reference to each
> object. Duplicates are dropped during writing. (See the
> care taken in commit 04/18 to avoid duplicates.) Since
> midx_sha1_compare() does not use anything other than the
> OID to order the objects, there is no decision being made
> about which pack is "better". The MIDX writes the first
> copy it finds and discards the others.

This would likely speed things up then, even if the chosen 
objects are suboptimal.
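
To make the behavior described above concrete, here is a minimal 
sketch in Python (a model only, not git's C code).  The Entry type 
and write_midx_entries() are hypothetical names, and the use of a 
stable sort is an assumption made so that "first copy found" is 
well defined:

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    oid: str      # hex object ID
    pack: str     # name of the packfile holding this copy (hypothetical)
    offset: int   # byte offset of the object inside that pack


def write_midx_entries(entries: Iterable[Entry]) -> List[Entry]:
    """Sort by OID only and keep the first copy of each object."""
    # sorted() is stable, so copies of one OID keep their input order;
    # the surviving copy depends only on the order the packs were scanned.
    ordered = sorted(entries, key=lambda e: e.oid)
    result: List[Entry] = []
    for entry in ordered:
        if result and result[-1].oid == entry.oid:
            continue  # duplicate OID: drop this copy
        result.append(entry)
    return result
```

Nothing about the pack (size, age, location) enters the comparison, 
so which copy survives is an accident of scan order.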

> It would not be difficult to include a check in
> midx_sha1_compare() to favor one packfile over another
> based on some measurement (size? mtime?). Since this
> would be a heuristic at best, I left it out of the
> current patch.

Yeah, I didn't know what heuristic to use either.  I tended 
to think that the bigger pack-file would be more valuable 
because its objects are more likely to share deltas with 
other objects in that pack, which makes them easier to send.  
However, that is likely only true during clones or other 
large fetches when we want most objects.  During small 
"update" fetches, the newer packs might be better?

I also thought that objects in alternates should be 
considered less valuable for my use case.  However, in the 
GitHub fork use case, the alternates might be more valuable?

So yes, it comes down to heuristics, and I don't know which 
is best.  Perhaps some config options could be used to select 
heuristics like these.  Whatever the heuristics are, since 
they would be part of the MIDX writing process, they would be 
easy to change.  This assumes that keeping only one copy in 
the index is the right thing.  The question would be, what if 
we need different heuristics for different operations?  Would 
it make sense to have multiple MIDX files covering the same 
packs then, one for fetch, one for merge...?
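
As an illustration of what a selectable heuristic could look like, 
here is a small Python sketch (again a model, not git code).  The 
heuristic names "largest", "newest" and "local-first" are invented 
for this example and do not correspond to any existing config 
option; the Pack fields are likewise assumptions:

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class Pack(NamedTuple):
    name: str
    size: int            # pack size in bytes
    mtime: float         # modification time, seconds since the epoch
    is_alternate: bool   # True if the pack lives in an alternate


class Entry(NamedTuple):
    oid: str
    pack: Pack


# Each heuristic maps a pack to a rank; the lowest rank wins.
HEURISTICS: Dict[str, Callable[[Pack], float]] = {
    "largest": lambda p: -p.size,
    "newest": lambda p: -p.mtime,
    "local-first": lambda p: 1 if p.is_alternate else 0,
}


def select_copies(entries: Iterable[Entry], heuristic: str) -> List[Entry]:
    """Keep one copy per OID, chosen by the named heuristic."""
    rank = HEURISTICS[heuristic]
    best: Dict[str, Entry] = {}
    for entry in entries:
        current = best.get(entry.oid)
        # Strict '<' keeps the earlier copy when two packs rank equally.
        if current is None or rank(entry.pack) < rank(current.pack):
            best[entry.oid] = entry
    return [best[oid] for oid in sorted(best)]
```

Running select_copies() once per heuristic over the same packs 
would produce the "one index per operation" arrangement asked 
about above, at the cost of one extra file per heuristic.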

> > In our Gerrit instance (Gerrit uses jgit), we have
> > multiple copies of the linux kernel repos linked
> > together via the alternates file mechanism.
> 
> GVFS also uses alternates for sharing packfiles across
> multiple copies of the repo. The MIDX is designed to
> cover all packfiles in the same directory, but is not
> designed to cover packfiles in multiple alternates;
> currently, each alternate would need its own MIDX file.
> Does that cause issues with your setup?

No, since the other large packfiles are all in other repos 
(alternates).  Is there a reason the MIDX would not want to 
cover the alternates?  If it doesn't, then you would seemingly 
lose any benefits of the MIDX when you have alternates in 
use.
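
A rough Python model of the per-alternate arrangement, to show 
where the benefit goes.  ObjectDir and find_object() are 
hypothetical names, and each index is reduced to a plain dict:

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ObjectDir(NamedTuple):
    path: str                          # hypothetical object directory
    midx: Dict[str, Tuple[str, int]]   # oid -> (pack name, offset)


def find_object(
    oid: str, dirs: List[ObjectDir]
) -> Optional[Tuple[str, str, int]]:
    """Probe each directory's own index in turn.

    Returns (directory, pack, offset) for the first hit, or None.
    """
    for d in dirs:
        hit = d.midx.get(oid)
        if hit is not None:
            pack, offset = hit
            return (d.path, pack, offset)
    return None
```

With one index per directory, a lookup costs one probe per 
alternate, and no single index ever sees the duplicate copies 
that live in different directories, so it cannot choose among 
them.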

...
> > It would be nice if this use case could be improved with
> > MIDX.  To do so, it seems that it would require
> > that MIDX either only put "the best" version of an
> > object (i.e. pre-select which one to use), or include
> > the extra information to help make the selection
> > process of which copy to use (perhaps based on the
> > operation being performed) fast.
> 
> I'm not sure if there is sufficient value in storing
> multiple references to the same object stored in multiple
> packfiles. There could be value in carefully deciding
> which copy is "best" during the MIDX write, but during
> read is not a good time to make such a decision. It also
> increases the size of the file to store multiple copies.

Yes, I am not sure either.  It would be good to have input 
from experts here.

> > This also leads me to ask, what other additional
> > information (bitmaps?) for other operations, besides
> > object location, might suddenly be valuable in an index
> > that potentially points to multiple copies of objects? 
> > Would such information be appropriate in MIDX, or would
> > it be better in another index?
> 
> For applications to bitmaps, it is probably best that we
> only include one copy of each object. Otherwise, we need
> to include extra bits in the bitmaps for those copies
> (when asking "is this object in the bitmap?").

Agreed.  Would the MIDX potentially pave the way to create 
multi-pack bitmaps?
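
To show why one copy per object keeps this simple, here is a toy 
Python sketch.  MidxBitmap is a hypothetical class, and numbering 
bits by position in the sorted OID list is an assumption made for 
illustration; it is not a description of git's bitmap format:

```python
from bisect import bisect_left
from typing import Sequence


class MidxBitmap:
    """One bit per object, numbered by position in a sorted OID list."""

    def __init__(self, sorted_oids: Sequence[str]) -> None:
        self.oids = list(sorted_oids)  # deduplicated, ascending
        self.bits = 0                  # arbitrary-size int used as bitset

    def _position(self, oid: str) -> int:
        i = bisect_left(self.oids, oid)
        if i == len(self.oids) or self.oids[i] != oid:
            raise KeyError(oid)
        return i

    def add(self, oid: str) -> None:
        self.bits |= 1 << self._position(oid)

    def __contains__(self, oid: str) -> bool:
        try:
            pos = self._position(oid)
        except KeyError:
            return False  # object is not in the index at all
        return bool((self.bits >> pos) & 1)
```

Because each OID appears once, "is this object in the bitmap?" is 
a single bit test regardless of how many packs hold a copy.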

> Thanks for the context with Gerrit's duplicate object
> problem. I'll try to incorporate it into the design
> document (commit 01/18) for the v1 patch.

FYI, this is not a typical Gerrit thing; it is something 
special we do in our installation.

Thanks for your thoughts,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

