* Git packs friendly to block-level deduplication
From: Ævar Arnfjörð Bjarmason @ 2018-01-24 22:03 UTC
  To: Git mailing list; +Cc: Junio C Hamano, Jeff King

If you have a bunch of git repositories cloned from the same project on
the same filesystem, it would be nice if the packs that are produced
were friendly to block-level deduplication.

This would save space, and the blocks would be more likely to be in
cache when you access them, likely speeding up git operations even if
the packing itself is less efficient.

Here's a hacky one-liner that clones git/git and peff/git (almost the
same content) and md5sums each 4k packed block, and sort | uniq -c's
them to see how many are the same:

    (
       cd /tmp &&
       rm -rf git*;
       git clone --reference ~/g/git --dissociate git@github.com:git/git.git git1 &&
       git clone --reference ~/g/git --dissociate git@github.com:peff/git.git git2 &&
       for repo in git1 git2
       do
           (
               cd $repo &&
               git repack -A -d --max-pack-size=10m
           )
       done &&
       parallel "perl -MDigest::MD5=md5_hex -wE 'open my \$fh, q[<], shift; my \$s; while (read \$fh, \$s, 2**12) { say md5_hex(\$s) }' {}" ::: \
           $(find /tmp/git*/.git/objects/pack -type f)|sort|uniq -c|sort -nr|awk '{print $1}'|sort|uniq -c|sort -nr
    )

This produces a total of 0 blocks that are the same. If, after the
repack, we throw this in there:

    echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual

Just over 8% of the blocks are the same, and of course this pack
entirely duplicates the existing packs, and I don't know how to coerce
repack/pack-objects into keeping this manual-* pack and re-packing the
rest, removing any objects that exist in the manual-* pack.

Documentation/technical/pack-heuristics.txt goes over some of the ideas
behind the algorithm, and Junio's 1b4bb16b9e ("pack-objects: optimize
"recency order"", 2011-06-30) seems to be the last major tweak to it.

I couldn't find any references to someone trying to get this particular
use-case working on-list, i.e. packing different repositories with a
shared history in such a way as to maximize the number of identical
blocks across their packs.

It should be possible to produce such a pack, e.g. by having a repack
mode that would say:

 1. Find what the main branch is
 2. Get its commits in reverse order, produce packs of some chunk-size
    of commit batches.
 3. Pack all the remaining content

This would delta much less efficiently, but as noted above the
block-level deduplication might make up for it, and in any case some
might want to use less disk space.
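
Something like this completely untested sketch; it assumes "master" is
the main branch, a chunk-size of 1000 commits, and a made-up "chunk"
pack base name:

    # Steps 1 & 2: walk master oldest-first, take every 1000th commit
    # as a snapshot tip, and pack each slice of history on top of the
    # previous one.
    git rev-list --reverse master |
    awk 'NR % 1000 == 0' |
    (
        prev=
        while read tip
        do
            {
                echo "$tip"
                test -n "$prev" && echo "^$prev"
            } | git pack-objects --revs .git/objects/pack/chunk
            prev=$tip
        done
    )

    # Step 3: keep the chunk packs and fold everything else (the last
    # partial batch, other branches, etc.) into one final pack.
    for p in .git/objects/pack/chunk-*.pack
    do
        touch "${p%.pack}.keep"
    done
    git repack -a -d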

Has anyone here barked up this tree before? Suggestions? Tips on where
to start hacking the repack code to accomplish this would be most
welcome.


* Re: Git packs friendly to block-level deduplication
From: Mike Hommey @ 2018-01-24 22:19 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: Git mailing list, Junio C Hamano, Jeff King

On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:
> If you have a bunch of git repositories cloned from the same project on
> the same filesystem, it would be nice if the packs that are produced
> were friendly to block-level deduplication.
> 
> This would save space, and the blocks would be more likely to be in
> cache when you access them, likely speeding up git operations even if
> the packing itself is less efficient.
> 
> Here's a hacky one-liner that clones git/git and peff/git (almost the
> same content) and md5sums each 4k packed block, and sort | uniq -c's
> them to see how many are the same:
> 
>     (
>        cd /tmp &&
>        rm -rf git*;
>        git clone --reference ~/g/git --dissociate git@github.com:git/git.git git1 &&
>        git clone --reference ~/g/git --dissociate git@github.com:peff/git.git git2 &&
>        for repo in git1 git2
>        do
>            (
>                cd $repo &&
>                git repack -A -d --max-pack-size=10m
>            )
>        done &&
>        parallel "perl -MDigest::MD5=md5_hex -wE 'open my \$fh, q[<], shift; my \$s; while (read \$fh, \$s, 2**12) { say md5_hex(\$s) }' {}" ::: \
>            $(find /tmp/git*/.git/objects/pack -type f)|sort|uniq -c|sort -nr|awk '{print $1}'|sort|uniq -c|sort -nr
>     )
> 
> This produces a total of 0 blocks that are the same. If, after the
> repack, we throw this in there:
> 
>     echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual
> 
> Just over 8% of the blocks are the same, and of course this pack
> entirely duplicates the existing packs, and I don't know how to coerce
> repack/pack-objects into keeping this manual-* pack and re-packing the
> rest, removing any objects that exist in the manual-* pack.
> 
> Documentation/technical/pack-heuristics.txt goes over some of the ideas
> behind the algorithm, and Junio's 1b4bb16b9e ("pack-objects: optimize
> "recency order"", 2011-06-30) seems to be the last major tweak to it.
> 
> I couldn't find any references to someone trying to get this particular
> use-case working on-list, i.e. packing different repositories with a
> shared history in such a way as to maximize the number of identical
> blocks across their packs.
> 
> It should be possible to produce such a pack, e.g. by having a repack
> mode that would say:
> 
>  1. Find what the main branch is
>  2. Get its commits in reverse order, produce packs of some chunk-size
>     of commit batches.
>  3. Pack all the remaining content
> 
> This would delta much less efficiently, but as noted above the
> block-level deduplication might make up for it, and in any case some
> might want to use less disk space.
> 
> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

FWIW, I sidestep the problem entirely by using alternates.

Mike


* Re: Git packs friendly to block-level deduplication
From: Junio C Hamano @ 2018-01-24 22:23 UTC
  To: Mike Hommey
  Cc: Ævar Arnfjörð Bjarmason, Git mailing list, Jeff King

Mike Hommey <mh@glandium.org> writes:

> FWIW, I sidestep the problem entirely by using alternates.

That's a funny way to use the word "side-step", I would say, as the
alternate object store support is there exactly for this use case.


* Re: Git packs friendly to block-level deduplication
From: Eric Wong @ 2018-01-24 22:25 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Junio C Hamano, Jeff King

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> If you have a bunch of git repositories cloned from the same project on
> the same filesystem, it would be nice if the packs that are produced
> were friendly to block-level deduplication.

Fwiw, I currently get around this when mirroring by having all
the remotes I care about mirrored to a "hidden" root repo, and
having other repos point to the hidden root objects via
objects/info/alternates.  Not that great for usability, and
potentially racy...

> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

The Debian version of gzip(1) has an --rsyncable patch which
might be of help.


* Re: Git packs friendly to block-level deduplication
From: Mike Hommey @ 2018-01-24 22:30 UTC
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Git mailing list, Jeff King

On Wed, Jan 24, 2018 at 02:23:57PM -0800, Junio C Hamano wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > FWIW, I sidestep the problem entirely by using alternates.
> 
> That's a funny way to use the word "side-step", I would say, as the
> alternate object store support is there exactly for this use case.

It's footgunny, though.

Mike


* Re: Git packs friendly to block-level deduplication
From: Elijah Newren @ 2018-01-24 22:37 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: Git mailing list, Junio C Hamano, Jeff King

On Wed, Jan 24, 2018 at 2:03 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> If you have a bunch of git repositories cloned from the same project on
> the same filesystem, it would be nice if the packs that are produced
> were friendly to block-level deduplication.
>
> This would save space, and the blocks would be more likely to be in
> cache when you access them, likely speeding up git operations even if
> the packing itself is less efficient.
>
> Here's a hacky one-liner that clones git/git and peff/git (almost the
> same content) and md5sums each 4k packed block, and sort | uniq -c's
> them to see how many are the same:

<snip>

>
> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

Does this overlap with the desire to have resumable clones?  I'm
curious what would happen if you did the same experiment with two
separate clones of git/git, cloned one right after the other so that
hopefully the upstream git/git didn't receive any updates between your
two separate clones.  (In other words, how much do packfiles differ in
practice for different packings of the same data?)


* Re: Git packs friendly to block-level deduplication
From: Ævar Arnfjörð Bjarmason @ 2018-01-24 22:47 UTC
  To: Junio C Hamano; +Cc: Mike Hommey, Git mailing list, Jeff King, Eric Wong


On Wed, Jan 24 2018, Junio C Hamano jotted:

> Mike Hommey <mh@glandium.org> writes:
>
>> FWIW, I sidestep the problem entirely by using alternates.
>
> That's a funny way to use the word "side-step", I would say, as the
> alternate object store support is there exactly for this use case.

Things block-level de-duplication gives you that you can't get with
alternates:

 1. Your filesystem may be mounted from some NFS host that does
    block-level deduplication internally against other content you don't
    have permission to access; think the /home of a bunch of dev VMs you
    know will have the same repos cloned (along with most of the same FS
    content, e.g. the OS).

    In this case the storage can de-duplicate blocks purely as an
    implementation detail, without git knowing about it, as long as git
    (or any other program using the FS) can be coerced into writing the
    same blocks other gits on other machines write, at least most of the
    time.

 2. Ditto NFS, but e.g. chroot'd /home on a local non-NFS.

 3. Even if the repos are all on the same host they may just be ad-hoc
    cloned in /home by different users. It's easy to write something in
    /etc/gitconfig to give them all the same repack settings (see the
    sketch after this list), but much harder to maintain some git-clone
    wrapper that implicitly adds --reference to all clones (users won't
    know about it, or will forget), or that goes hunting around for
    checkouts and adds alternates after the fact.

 4. With alternates you always need to maintain some blessed "clone from
    this" repo that can't go away, lest everything cloned from it become
    corrupt and need manual repair. If you're aiming to just save
    storage, block-level deduplication may be a better trade-off.
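
A minimal sketch of what such shared settings could look like; the
config keys are real, but the values are only illustrative:

    # run by some provisioning script on every host (hypothetical)
    git config --system pack.threads 1
    git config --system repack.packKeptObjects false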

Also, once you clone with --reference, doesn't the local clone only add
new objects as you "git fetch", never pruning them if the same objects
later appear in the alternate? Or am I misremembering things?

I mainly have use-cases #1 & #3. Although they could both be made to
use alternates with some hassle (e.g. for #1, exposing a separate
read-only copy of the alternates to each VM), it seemed worthwhile to
see if repack could be made more friendly to block-level deduplication,
as deploying that is easier.


* Re: Git packs friendly to block-level deduplication
From: Ævar Arnfjörð Bjarmason @ 2018-01-24 23:06 UTC
  To: Elijah Newren; +Cc: Git mailing list, Junio C Hamano, Jeff King


On Wed, Jan 24 2018, Elijah Newren jotted:

> On Wed, Jan 24, 2018 at 2:03 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> If you have a bunch of git repositories cloned from the same project on
>> the same filesystem, it would be nice if the packs that are produced
>> were friendly to block-level deduplication.
>>
>> This would save space, and the blocks would be more likely to be in
>> cache when you access them, likely speeding up git operations even if
>> the packing itself is less efficient.
>>
>> Here's a hacky one-liner that clones git/git and peff/git (almost the
>> same content) and md5sums each 4k packed block, and sort | uniq -c's
>> them to see how many are the same:
>
> <snip>
>
>>
>> Has anyone here barked up this tree before? Suggestions? Tips on where
>> to start hacking the repack code to accomplish this would be most
>> welcome.
>
> Does this overlap with the desire to have resumable clones?  I'm
> curious what would happen if you did the same experiment with two
> separate clones of git/git, cloned one right after the other so that
> hopefully the upstream git/git didn't receive any updates between your
> two separate clones.  (In other words, how much do packfiles differ in
> practice for different packings of the same data?)

If you clone git/git from GitHub twice in a row you get the exact same
pack, and AFAICT this is true of git in general (but may change between
versions).

If you make a local commit to that, copy the dir, and "repack -A -d",
you get the exact same packs again.

If you then make just one local commit to one copy (even with
--allow-empty) and repack, you get entirely different packs; in my test
only 2.5% of the blocks remain the same.

Obviously you could pack *that* new content incrementally and keep the
existing pack, but that won't help you with de-duping the initially
cloned data, which is what matters.
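
For what it's worth, that comparison amounts to something like this
sketch; if the packs really are identical their names and checksums
will match:

    git clone --bare https://github.com/git/git.git one.git
    git clone --bare https://github.com/git/git.git two.git
    sha1sum one.git/objects/pack/*.pack two.git/objects/pack/*.pack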


* Re: Git packs friendly to block-level deduplication
From: Jeff King @ 2018-01-24 23:22 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git mailing list, Junio C Hamano

On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:

> This produces a total of 0 blocks that are the same. If, after the
> repack, we throw this in there:
> 
>     echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual
> 
> Just over 8% of the blocks are the same, and of course this pack
> entirely duplicates the existing packs, and I don't know how to coerce
> repack/pack-objects into keeping this manual-* pack and re-packing the
> rest, removing any objects that exist in the manual-* pack.

I think touching manual-*.keep would do what you want (followed by
"repack -ad" to drop the duplicate objects).

You may also want to use "--threads=1" to avoid non-determinism in the
generated packs. In theory, both repos would then produce identical base
packs, though they do not seem to in practice (I didn't dig in to
what the difference may be).
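
I.e., something like this sketch, where the loop just derives the
.keep name from whatever pack-objects wrote above:

    for p in .git/objects/pack/manual-*.pack
    do
        touch "${p%.pack}.keep"
    done
    git -c pack.threads=1 repack -a -d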

> I couldn't find any references to someone trying to get this particular
> use-case working on-list, i.e. packing different repositories with a
> shared history in such a way as to maximize the number of identical
> blocks across their packs.

I don't recall any discussion on this topic before.

I think you're fighting against two things here:

  - the order in which we find deltas; obviously a delta of A against B
    is quite different than B against A

  - the order of objects written to disk

Those mostly work backwards through the history graph, so adding new
history on top of old will cause changes at the beginning of the file,
and "shift" the rest so that the blocks don't match.

If you reverse the order of those, then the shared history is more
likely to provide a common start to the pack. See compute_write_order()
and the final line of type_size_sort().
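
One way to eyeball the resulting write order is to list each pack's
objects in offset order and diff the listings (one.idx and two.idx
being stand-ins for the real file names):

    diff -u <(git show-index <one.idx | sort -n) \
            <(git show-index <two.idx | sort -n)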

> It should be possible to produce such a pack, e.g. by having a repack
> mode that would say:
> 
>  1. Find what the main branch is
>  2. Get its commits in reverse order, produce packs of some chunk-size
>     of commit batches.
>  3. Pack all the remaining content
> 
> This would delta much less efficiently, but as noted above the
> block-level deduplication might make up for it, and in any case some
> might want to use less disk space.

We do something a bit like this at GitHub. There we have a single pack
holding all of the objects for many forks. So the deduplication is done
already, but we want to avoid deltas that cross fork boundaries (since
they mean throwing away the delta and recomputing from scratch when
somebody fetches). And then we write the result in layers, although
right now there are only 2 layers (some "base" fork gets all of its
objects, and then everybody else's objects are dumped on top).

I suspect some of the same concepts could be applied. If you're
interested in playing with it, I happened to extract it into a single
patch recently (it's on my list of "stuff to send upstream" but I
haven't gotten around to polishing it fully). It's the
"jk/delta-islands" branch of https://github.com/peff/git (which I happen
to know you already have a clone of ;) ).

-Peff


* Re: Git packs friendly to block-level deduplication
From: Jeff King @ 2018-01-24 23:32 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren, Git mailing list, Junio C Hamano

On Thu, Jan 25, 2018 at 12:06:59AM +0100, Ævar Arnfjörð Bjarmason wrote:

> >> Has anyone here barked up this tree before? Suggestions? Tips on where
> >> to start hacking the repack code to accomplish this would be most
> >> welcome.
> >
> > Does this overlap with the desire to have resumable clones?  I'm
> > curious what would happen if you did the same experiment with two
> > separate clones of git/git, cloned one right after the other so that
> > hopefully the upstream git/git didn't receive any updates between your
> > two separate clones.  (In other words, how much do packfiles differ in
> > practice for different packings of the same data?)
> 
> If you clone git/git from GitHub twice in a row you get the exact same
> pack, and AFAICT this is true of git in general (but may change between
> versions).

That's definitely not guaranteed. It _tends_ to be the case over the
short term because we use --threads=1 on the server. But it may differ
if:

  - we repack on the server, which we do based on pushes

  - somebody pushes, even to another fork. The exact results depend
    on the packs in which we find the objects, and a new push may
    duplicate some existing objects but with a different representation
    (e.g., a different delta base).

I'm actually interested in adding an etags-like protocol extension that
would work something like this:

  - server says "here's a pack, and its opaque tag is XYZ".

  - on resume, the client says "can I resume pack with tag XYZ"?

  - the server then decides if the on-disk state is sufficient for it to
    agree to recreate XYZ (e.g., number and identity of packs). If yes,
    then it resumes. If no, then it says "nope" and the two sides go
    through a normal fetch again.

The important thing is that the tag is opaque to the client. So a stock
implementation could use the on-disk state to decide. But a server could
choose to cache the packs it sends for a period of time (especially if
the client hangs up before we've sent the whole thing). We already do
this to a limited degree at GitHub in order to efficiently serve
multiple clients simultaneously fetching the same pack (e.g., imagine a
fleet of AWS machines all triggering "git fetch" at once).

I think that's a tangent to what you're looking for in this thread,
though.

-Peff


* Re: Git packs friendly to block-level deduplication
From: Ævar Arnfjörð Bjarmason @ 2018-01-25  0:03 UTC
  To: Jeff King; +Cc: Git mailing list, Junio C Hamano


On Wed, Jan 24 2018, Jeff King jotted:

> On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This produces a total of 0 blocks that are the same. If, after the
>> repack, we throw this in there:
>>
>>     echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual
>>
>> Just over 8% of the blocks are the same, and of course this pack
>> entirely duplicates the existing packs, and I don't know how to coerce
>> repack/pack-objects into keeping this manual-* pack and re-packing the
>> rest, removing any objects that exist in the manual-* pack.
>
> I think touching manual-*.keep would do what you want (followed by
> "repack -ad" to drop the duplicate objects).

Thanks, that got the number of identical blocks just north of 15%...

> You may also want to use "--threads=1" to avoid non-determinism in the
> generated packs. In theory, both repos would then produce identical base
> packs, though they do not seem to in practice (I didn't dig in to
> what the difference may be).

..and north of 20% with --threads=1.

>> I couldn't find any references to someone trying to get this particular
>> use-case working on-list, i.e. packing different repositories with a
>> shared history in such a way as to maximize the number of identical
>> blocks across their packs.
>
> I don't recall any discussion on this topic before.
>
> I think you're fighting against two things here:
>
>   - the order in which we find deltas; obviously a delta of A against B
>     is quite different than B against A
>
>   - the order of objects written to disk
>
> Those mostly work backwards through the history graph, so adding new
> history on top of old will cause changes at the beginning of the file,
> and "shift" the rest so that the blocks don't match.
>
> If you reverse the order of those, then the shared history is more
> likely to provide a common start to the pack. See compute_write_order()
> and the final line of type_size_sort().

I'll have to poke at what compute_write_order() is doing, but FWIW this
change to type_size_sort() got shared blocks down to 3%:

    diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
    index 81ad914cfc..c9ada1bd1c 100644
    --- a/builtin/pack-objects.c
    +++ b/builtin/pack-objects.c
    @@ -1764,7 +1764,7 @@ static int type_size_sort(const void *_a, const void *_b)
                    return -1;
            if (a->size < b->size)
                    return 1;
    -       return a < b ? -1 : (a > b);  /* newest first */
    +       return b < a ? -1 : (b > a);  /* oldest first */
     }

     struct unpacked {

>> It should be possible to produce such a pack, e.g. by having a repack
>> mode that would say:
>>
>>  1. Find what the main branch is
>>  2. Get its commits in reverse order, produce packs of some chunk-size
>>     of commit batches.
>>  3. Pack all the remaining content
>>
>> This would delta much less efficiently, but as noted above the
>> block-level deduplication might make up for it, and in any case some
>> might want to use less disk space.
>
> We do something a bit like this at GitHub. There we have a single pack
> holding all of the objects for many forks. So the deduplication is done
> already, but we want to avoid deltas that cross fork boundaries (since
> they mean throwing away the delta and recomputing from scratch when
> somebody fetches). And then we write the result in layers, although
> right now there are only 2 layers (some "base" fork gets all of its
> objects, and then everybody else's objects are dumped on top).
>
> I suspect some of the same concepts could be applied. If you're
> interested in playing with it, I happened to extract it into a single
> patch recently (it's on my list of "stuff to send upstream" but I
> haven't gotten around to polishing it fully). It's the
> "jk/delta-islands" branch of https://github.com/peff/git (which I happen
> to know you already have a clone of ;) ).

Thanks. I'll look into that, although the above results (sans hacking on
the core pack-objects logic) suggest that even once I create an island
I'm getting at most 20%.


* Re: Git packs friendly to block-level deduplication
From: Jeff King @ 2018-01-25  0:10 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git mailing list, Junio C Hamano

On Thu, Jan 25, 2018 at 01:03:25AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > You may also want to use "--threads=1" to avoid non-determinism in the
> > generated packs. In theory, both repos would then produce identical base
> > packs, though they do not seem to in practice (I didn't dig in to
> > what the difference may be).
> 
> ..and north of 20% with --threads=1.
>
> [...]
>
> Thanks. I'll look into that, although the above results (sans hacking on
> the core pack-objects logic) suggest that even once I create an island
> I'm getting at most 20%.

I think it may be worth figuring out where the two differ. With
--no-reuse-object and --no-reuse-delta, I'd think that the pack
generated for a particular apex commit would be totally deterministic,
regardless of other objects available in the repo. But it's not for some
reason.

-Peff


* Re: Git packs friendly to block-level deduplication
From: Jeff King @ 2018-01-25  0:29 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git mailing list, Junio C Hamano

On Wed, Jan 24, 2018 at 07:10:15PM -0500, Jeff King wrote:

> On Thu, Jan 25, 2018 at 01:03:25AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> > > You may also want to use "--threads=1" to avoid non-determinism in the
> > > generated packs. In theory, both repos would then produce identical base
> > > packs, though they do not seem to in practice (I didn't dig in to
> > > what the difference may be).
> > 
> > ..and north of 20% with --threads=1.
> >
> > [...]
> >
> > Thanks. I'll look into that, although the above results (sans hacking on
> > the core pack-objects logic) suggest that even once I create an island
> > I'm getting at most 20%.
> 
> I think it may be worth figuring out where the two differ. With
> --no-reuse-object and --no-reuse-delta, I'd think that the pack
> generated for a particular apex commit would be totally deterministic,
> regardless of other objects available in the repo. But it's not for some
> reason.

I think I see it. If I compare the objects in pack-order of the two
packs:

  diff -u <(git show-index <$one | sort -n) <(git show-index <$two | sort -n)

I get:

  6281 fac64e011f1b1ecabcccf7ad2511efcac3e26bdc (9381a4f9)
  6585 59c276cf4da0705064c32c9dba54baefa282ea55 (1b02a1c0)
  6869 8279ed033f703d4115bee620dccd32a9ec94d9aa (ca7acf33)
 -7042 298d861208d71089dd308761ae96738e81ad3e68 (135aefd7)
 -7222 ea7b5de1c1187294d3d4dca93b129e049ca7ca76 (3708bbfc)
 -7512 b6947af2294ea0c814f5b4cb8737c782895519b2 (719fad14)
 -8004 e26f7f19b6c7485f04234946a59ab8f4fd21d6d1 (59127876)
 -8826 2512f15446149235156528dafbe75930c712b29e (fc72468e)
  [...]
 +7042 2512f15446149235156528dafbe75930c712b29e (fc72468e)
 +7215 c6c75c93aaeb97bab6fd25b672a641c84dd85d59 (68fe0b8d)
 +7391 36438dc19dd2a305dddebd44bf7a65f1a220075b (c424e726)
 +7565 1eaabe34fc6f486367a176207420378f587d3b48 (4a27450b)

So the write order differs; one pack stuck 2512f15446 much earlier.

That's probably due to this line from compute_write_order():

     /*
      * Mark objects that are at the tip of tags.
      */
     for_each_tag_ref(mark_tagged, NULL);

If I remove that line, then your manual pack (with --threads=1) gives me
identical packs for each repo.

And if I then mark each with a .keep and run "git repack -ad" to remove
the duplicates, your block-dedup check yields:

   17742 2
    1939 1

which is 90%, and about what I'd expect between those two repos. So I
think the idea of computing a "base" pack (either as a separate pack, or
as a layer in the pack) is a viable strategy.
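
So the whole dance, as an untested sketch (assuming a git patched to
drop that for_each_tag_ref() call, and reusing 5be1f00a9a as the shared
base commit from your earlier message):

    # build a deterministic "base" pack of the shared history
    echo 5be1f00a9a |
    git pack-objects --revs --no-reuse-delta --no-reuse-object \
        --threads=1 .git/objects/pack/manual

    # protect it from repack, then fold everything else into one pack
    for p in .git/objects/pack/manual-*.pack
    do
        touch "${p%.pack}.keep"
    done
    git repack -a -d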

-Peff

