git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Geoffrey Irving" <irving@naml.us>
To: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: [PATCH] Adding a cache of commit to patch-id pairs to speed up git-cherry
Date: Mon, 2 Jun 2008 07:50:52 -0700	[thread overview]
Message-ID: <7f9d599f0806020750g78e6816dl884d36bb903c707b@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.1.00.0806020649110.13507@racer.site.net>

On Sun, Jun 1, 2008 at 11:13 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 1 Jun 2008, Geoffrey Irving wrote:
>
>> The dominant cost of git-cherry is the computation of patch-ids for each
>> relevant commit.  Once computed, these pairs are now stored in a hash
>> table in $GIT_DIR/patch-id-cache to speed up repeated invocations.
>>
>> The basic structure of patch-id-cache.c was cannibalized from Johannes
>> Schindelin's notes-index structure, though most of the code was
>> rewritten. The hash table is still kept strictly sorted by commit, but
>> the entire table is now read into memory.
>
> I do not think that this "read-the-entire-table-into-memory" paradigm is a
> wise choice. mmap()ing, I would have understood, but reading a potentially
> pretty large table into memory?

I'll switch to mmapping.

>> See commit d56dbe8cb56ce9b4697d1f1c2425e2e133a932a5 for the original
>> code.
>
> This is not in any official git.git repository.  And my branches are prone
> to be rebased.  So this is not a good reference.  The mailing list,
> however, would be one.

That was added only because I knew it wouldn't be the final patch, and
that it'd be useful to you.

>> diff --git a/Makefile b/Makefile
>> index cce5a6e..3a5396d 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -435,6 +435,7 @@ LIB_OBJS += pager.o
>>  LIB_OBJS += parse-options.o
>>  LIB_OBJS += patch-delta.o
>>  LIB_OBJS += patch-ids.o
>> +LIB_OBJS += patch-id-cache.o
>
> If all you do is a hashmap from SHA-1 to SHA-1, then I think
> "patch-id-cache" is a misnomer for that file and structure.
>
> Let's not repeat the same naming mistakes as we have for path_list and
> decoration.

Is persistent_sha1_map a good name?

>> +static void read_patch_id_cache()
>> +{
>> +     int fd = open(git_path("patch-id-cache"), O_RDONLY);
>> +
>> +     if (fd < 0) { // allocate a new cache
>> +             size_t size = 64;
>> +             cache = xcalloc(get_hash_size(size), 1);
>> +             cache->key_size = cache->actual_size = size;
>> +             return;
>> +     }
>> +
>> +     struct patch_id_cache header;
>> +     size_t len = read_in_full(fd, &header, sizeof(header));
>> +     if (len != sizeof(header))
>> +             die("cannot read patch-id-cache header");
>> +     if (header.key_size & (header.key_size-1))
>> +             die("patch-id-cache key size %lld is not a power of two", (long
>> long)header.key_size);
>
> Very long line.
>
> Why does the key_size have to be a power of two?

It doesn't, but the current code creates it that way, so that was a
sanity check.  Well, actually the current code uses & instead of %, so
it does require it unless I switch back to %.

>> +     free(cache);
>> +     cache = xmalloc(get_hash_size(header.actual_size));
>> +     *cache = header;
>
> This free() kind of defies the xmalloc in the if() clause above.
>
> Besides, those three lines are a pretty awkward way to avoid writing
> xrealloc(), don't you think?
>
>> +     size_t entry_size = header.actual_size * sizeof(struct patch_id_entry);
>
> Declaration after statement.
>
>> +     len = read_in_full(fd, cache->entries, entry_size);
>> +     if (len != entry_size)
>> +             die("cannot read patch-id-cache entries");
>> +     written_count = cache->count;
>
> Huh?  It is not written_count.  Besides, why do you use a global here?
> (Yes, I know, you probably just copied it from my code, but you could
> clean it up in the process ;-)
>
>> +     // It's impossible to check whether the cache is actually accurate,
>
> C++ style comments.
>
>> +void write_patch_id_cache()
>> +{
>> +     if (!cache || cache->count == written_count)
>> +             return;
>
> Does that mean that the patch_id_cache is not updated when the number
> of commits stays the same after committing, branch -d and gc?

The patch_id_cache is only updated when it changes.  Since entries are
immutable and are never deleted, all changes increase the count.

>> +     struct lock_file update_lock;
>> +     int fd = hold_lock_file_for_update(&update_lock,
>> +                     git_path("patch-id-cache"), 0);
>> +
>> +     if (fd < 0) {
>> +             error("could not construct patch-id-cache");
>> +             return;
>> +     }
>
> Not good.  Printing an error, but not returning any indicator of it.
>
>> +int get_cached_commit_patch_id(const unsigned char *commit_sha1,
>> +             unsigned char *patch_id_sha1)
>
> Again, I do not think it is wise to name this function so focused on this
> particular purpose, when its use is clearly not limited to patch ids.
>
>> +void cache_commit_patch_id(const unsigned char *commit_sha1,
>> +             const unsigned char *patch_id_sha1)
>> +{
>> +     if (4*cache->count > 3*cache->key_size) { // resize
>
> C++ comment.
>
>> +             struct patch_id_cache *old_cache = cache;
>> +             size_t new_key_size = 2*cache->key_size,
>> +                     new_actual_size = cache->key_size + cache->actual_size;
>> +             cache = xcalloc(get_hash_size(new_actual_size), 1);
>> +             cache->key_size = new_key_size;
>> +             cache->actual_size = new_actual_size;
>> +
>> +             // reinsert all entries
>> +             size_t i;
>> +             for (i = 0; i < old_cache->actual_size; i++)
>> +                     cache_commit_patch_id(old_cache->entries[i].commit_sha1,
>> +                             old_cache->entries[i].patch_id_sha1);
>> +     }
>
> Does this not just _beg_ to live in an own function ("grow_hash")?
>
>> +     struct patch_id_entry entry;
>
> Declaration after statement.
>
>> +     hashcpy(entry.commit_sha1, commit_sha1);
>> +     hashcpy(entry.patch_id_sha1, patch_id_sha1);
>
> It would be more elegant to copy the SHA-1s _after_ finding where to write
> them.

Alas, that would break the elegance of my loop, since the loop swaps
in new entries to keep the table sorted.  I can remove the sorting if
you want: I only left it in there to be more similar to your code.

>> +     size_t i = get_hash_index(commit_sha1);
>
> Declaration after statement.
>
>> +     for (;;) {
>> +             if (i == cache->actual_size) {
>> +                     cache->actual_size++; // adding one suffices for expected O(1)
>
> Long line.  C++ comment.  "one suffices"?
>
>> +                     cache = xrealloc(cache, get_hash_size(cache->actual_size));
>> +                     cache->entries[cache->actual_size-1] = entry;
>> +                     return;
>> +             }
>> +             int cmp = hashcmp(entry.commit_sha1, cache->entries[i].commit_sha1);
>
> Declar... ah, well, suffice to say that you should read the
> CodingGuidelines, and try to fix up all the offending sites in your code.

I'd be happy to do that, but I don't see mention of either C++
comments or declarations after statements in the CodingGuidelines.

>> +             if (!cmp) {
>> +                     if (hashcmp(entry.patch_id_sha1, cache->entries[i].patch_id_sha1))
>> +                             die("found mismatched entry in patch-id-cache");
>
> I wonder if that potentially expensive operation should not rather be
> wrapped in an assert() call (since I recently learnt that Git's source
> code has more than one instance of assert()).

A 20 byte comparison doesn't seem potentially expensive to me.

>> +                     return;
>> +             } else if (cmp < 0) {
>> +                     i++;
>> +             } else if (is_null_sha1(cache->entries[i].commit_sha1)) {
>> +                     cache->entries[i] = entry;
>> +                     cache->count++;
>> +                     return;
>> +             } else {
>> +                     struct patch_id_entry tmp = cache->entries[i];
>> +                     cache->entries[i] = entry;
>> +                     entry = tmp;
>> +                     i++;
>> +             }
>
> You seem to be wanting to implement a "sorted hashmap"... this is
> non-standard, and AFAICT affects performance negatively.

I'll strip it out.  I actually stripped it out initially, and then
added it back to mirror your code.

>> diff --git a/patch-ids.c b/patch-ids.c
>> index 3be5d31..91fabeb 100644
>> --- a/patch-ids.c
>> +++ b/patch-ids.c
>> @@ -3,16 +3,30 @@
>>  #include "commit.h"
>>  #include "patch-ids.h"
>>
>> +extern void write_patch_id_cache();
>> +extern int get_cached_commit_patch_id(const unsigned char
>> *commit_sha1, unsigned char *patch_id_sha1);
>> +extern void cache_commit_patch_id(const unsigned char *commit_sha1,
>> const unsigned char *patch_id_sha1);
>> +
>
> Uh oh.  This is not good.  If you do not have declarations like these in
> header files (that are included from the implementing .c file, of course),
> it is all too easy to have the signatures of the implementation and the
> declaration diverge.
>
> So move them into an own header file, where they belong.  IIRC I did this
> in the notes code, so why did you choose not to imitate that?

No good reason.

>>  static uint32_t take2(const unsigned char *id)
>> @@ -136,6 +150,8 @@ int free_patch_ids(struct patch_ids *ids)
>>               next = patches->next;
>>               free(patches);
>>       }
>> +
>> +     write_patch_id_cache();
>>       return 0;
>
> That's cute.

Thanks (assuming cute means good).

I'll wait to actually make any of these changes until you and Jeff
decide whether I should stick to hashes or use binary search.  It
seems sad not to use hashes for a map when we get the best hash keys
in the world for free, so I'd prefer not switching.

Geoffrey

  parent reply	other threads:[~2008-06-02 14:51 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-02  3:54 [PATCH] Adding a cache of commit to patch-id pairs to speed up git-cherry Geoffrey Irving
2008-06-02  6:13 ` Johannes Schindelin
2008-06-02  6:42   ` Jeff King
2008-06-02 14:35     ` Geoffrey Irving
2008-06-02 15:37       ` Johannes Schindelin
2008-06-02 15:49         ` Geoffrey Irving
2008-06-02 15:56           ` Shawn O. Pearce
2008-06-02 16:18           ` Johannes Schindelin
2008-06-02 16:26             ` Geoffrey Irving
2008-06-02 18:15               ` Johannes Schindelin
2008-06-07 23:50                 ` Geoffrey Irving
2008-06-08 16:10                   ` Johannes Schindelin
2008-06-02 14:50   ` Geoffrey Irving [this message]
2008-06-02 15:52     ` Johannes Schindelin
2008-06-02 17:23     ` Geoffrey Irving
2008-06-02 18:22       ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7f9d599f0806020750g78e6816dl884d36bb903c707b@mail.gmail.com \
    --to=irving@naml.us \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).