git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Ramsay Jones <ramsay@ramsayjones.plus.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Duy Nguyen <pclouds@gmail.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Stefan Beller <sbeller@google.com>,
	Christian Couder <chriscool@tuxfamily.org>,
	SZEDER Gabor <szeder.dev@gmail.com>
Subject: Re: [PATCH v4 1/7] Add delta-islands.{c,h}
Date: Mon, 13 Aug 2018 05:33:59 +0200	[thread overview]
Message-ID: <CAP8UFD1tX+rAxQc47o-50Kzo6hnX9mTWH2BPSq9HiO_OgBzYTw@mail.gmail.com> (raw)
In-Reply-To: <2113c74c-b230-6ea1-a980-d6d008bac759@ramsayjones.plus.com>

On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones
<ramsay@ramsayjones.plus.com> wrote:
> On 12/08/18 06:11, Christian Couder wrote:

>> Because the inefficiency primarily arises when an
>> object is delitified against another object that does
>
> s/delitified/deltified/ ?

Ok, this will be in the next reroll if any.

>> not exist in the same fork, we partition objects into
>> sets that appear in the same fork, and define
>> "delta islands". When finding delta base, we do not
>> allow an object outside the same island to be
>> considered as its base.

>> --- /dev/null
>> +++ b/delta-islands.c
>> @@ -0,0 +1,493 @@
>> +#include "cache.h"
>> +#include "attr.h"
>> +#include "object.h"
>> +#include "blob.h"
>> +#include "commit.h"
>> +#include "tag.h"
>> +#include "tree.h"
>> +#include "delta.h"
>> +#include "pack.h"
>> +#include "tree-walk.h"
>> +#include "diff.h"
>> +#include "revision.h"
>> +#include "list-objects.h"
>> +#include "progress.h"
>> +#include "refs.h"
>> +#include "khash.h"
>
> I was wondering how many copies of the inline functions
> introduced by this header we had, so:
>
>   $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_
>         3 kh_resize_sha1
>         3 kh_put_sha1
>         3 kh_init_sha1
>         3 kh_get_sha1
>         1 kh_resize_str
>         1 kh_resize_sha1_pos
>         1 kh_put_str
>         1 kh_put_sha1_pos
>         1 kh_init_str
>         1 kh_init_sha1_pos
>         1 kh_get_str
>         1 kh_get_sha1_pos
>         1 kh_destroy_sha1
>   $
>
> Looking at the individual object files, we see:
>
>   $ nm pack-bitmap-write.o | grep ' t ' | grep kh_
>   00000000000001cc t kh_get_sha1
>   00000000000001b7 t kh_init_sha1
>   000000000000085e t kh_put_sha1
>   0000000000000310 t kh_resize_sha1
>   $
>
> So, the two instances of the sha1 hash-map are never
> destroyed (kh_destroy_sha1 is not present in the object
> file).

This is interesting (even though it seems related to more code than
the current patch series).

As those hash maps are in 'struct bitmap_writer' and a static instance is used:

  static struct bitmap_writer writer;

it maybe ok.

>   $ nm pack-bitmap.o | grep ' t ' | grep kh_
>   00000000000002d9 t kh_destroy_sha1
>   000000000000032b t kh_get_sha1
>   0000000000000daa t kh_get_sha1_pos
>   00000000000002c4 t kh_init_sha1
>   0000000000000d95 t kh_init_sha1_pos
>   00000000000009bd t kh_put_sha1
>   0000000000001432 t kh_put_sha1_pos
>   000000000000046f t kh_resize_sha1
>   0000000000000eee t kh_resize_sha1_pos
>   $
>
> The sha1_pos hash-map is not destroyed here.

Yeah, maybe a line like:

kh_destroy_pos(b->ext_index.positions);

is missing from free_bitmap_index()?

Adding that should be in a separate patch from this series though.

>   $ nm delta-islands.o | grep ' t ' | grep kh_
>   00000000000002be t kh_get_sha1
>   0000000000000e52 t kh_get_str
>   00000000000002a9 t kh_init_sha1
>   0000000000000e3d t kh_init_str
>   0000000000000950 t kh_put_sha1
>   00000000000014e4 t kh_put_str
>   0000000000000402 t kh_resize_sha1
>   0000000000000f96 t kh_resize_str
>   $
>
> And neither the sha1 or str hash-maps are destroyed here.
> (That is not necessarily a problem, of course! ;-) )

The instances are declared as static:

  static khash_sha1 *island_marks;

  static kh_str_t *remote_islands;

so it maybe ok.

>> +struct island_bitmap {
>> +     uint32_t refcount;
>> +     uint32_t bits[];
>
> Use FLEX_ARRAY here? We are slowly moving toward requiring
> certain C99 features, but I can't remember a flex array
> weather-balloon patch.

This was already discussed by Junio and Peff there:

https://public-inbox.org/git/20180727130229.GB18599@sigill.intra.peff.net/

>> +};

>> +int in_same_island(const struct object_id *trg_oid, const struct object_id *src_oid)
>
> Hmm, what does the trg_ prefix stand for?
>
>> +{
>> +     khiter_t trg_pos, src_pos;
>> +
>> +     /* If we aren't using islands, assume everything goes together. */
>> +     if (!island_marks)
>> +             return 1;
>> +
>> +     /*
>> +      * If we don't have a bitmap for the target, we can delta it
>
> ... Ah, OK, trg_ => target.

I am ok to replace "trg" with "target" (or maybe "dst"? or something
else) and "src" with "source" if you think it would make things
clearer.

>> +static void add_ref_to_island(const char *island_name, const struct object_id *oid)
>> +{
>> +     uint64_t sha_core;
>> +     struct remote_island *rl = NULL;
>> +
>> +     int hash_ret;
>> +     khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);
>> +
>> +     if (hash_ret) {
>> +             kh_key(remote_islands, pos) = xstrdup(island_name);
>> +             kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct remote_island));
>> +     }
>> +
>> +     rl = kh_value(remote_islands, pos);
>> +     oid_array_append(&rl->oids, oid);
>> +
>> +     memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>> +     rl->hash += sha_core;
>
> Hmm, so the first 64-bits of the oid of each ref that is part of
> this island is added together as a 'hash' for the island. And this
> is used to de-duplicate the islands? Any false positives? (does it
> matter - it would only affect performance, not correctness, right?)

I would think that a false positive from pure chance is very unlikely.
We would need to approach billions of delta islands (as 2 to the power
64/2 is in the order of billions) for the probability to be
significant. GitHub has less than 50 millions users and it is very
unlikely that a significant proportion of these users will fork the
same repo.

Now if there is a false positive because two forks have exactly the
same refs, then it is not a problem if they are considered the same,
because they are actually the same.

>> +}

> Sorry, I spent so long reading this patch, I have run out of
> time tonight (and I am busy tomorrow) to read the rest of the
> series.

Thank you for your interesting review,
Christian.

  reply	other threads:[~2018-08-13  3:34 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-12  5:11 [PATCH v4 0/7] Add delta islands support Christian Couder
2018-08-12  5:11 ` [PATCH v4 1/7] Add delta-islands.{c,h} Christian Couder
2018-08-13  1:14   ` Ramsay Jones
2018-08-13  3:33     ` Christian Couder [this message]
2018-08-13 12:17       ` Ramsay Jones
2018-08-13 18:11         ` Jeff King
2018-08-16  6:02           ` Christian Couder
2018-08-13 19:00       ` Jeff King
2018-08-16  6:04         ` Christian Couder
2018-08-12  5:11 ` [PATCH v4 2/7] pack-objects: refactor code into compute_layer_order() Christian Couder
2018-08-12  5:11 ` [PATCH v4 3/7] pack-objects: add delta-islands support Christian Couder
2018-08-12  5:11 ` [PATCH v4 4/7] repack: " Christian Couder
2018-08-12  5:11 ` [PATCH v4 5/7] t: add t5319-delta-islands.sh Christian Couder
2018-08-12  5:11 ` [PATCH v4 6/7] pack-objects: move tree_depth into 'struct packing_data' Christian Couder
2018-08-12  5:11 ` [PATCH v4 7/7] pack-objects: move 'layer' " Christian Couder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAP8UFD1tX+rAxQc47o-50Kzo6hnX9mTWH2BPSq9HiO_OgBzYTw@mail.gmail.com \
    --to=christian.couder@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=sbeller@google.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).