Re: Typesafer git hash patch

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Typesafer git hash patch
Date: Tue, 28 Feb 2017 12:25:20 -0800
Message-ID: <CA+55aFzUhWinWqK30GBc1BKy-v6QtDdO2BLUODkiqg9XoKLrwA@mail.gmail.com>
In-Reply-To: <xmqqvarujdmv.fsf@gitster.mtv.corp.google.com>

On Tue, Feb 28, 2017 at 11:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>
>> Having the hashes be more encapsulated does seem to make things better
>> in many ways. What I did was to also just unify the notion of "hash_t"
>> and "struct object_id", so the two are entirely interchangeable.
>
> Sorry, but at this point in your description, you completely lost
> me.  I thought "struct object_id" was what you call "hash_t" in the
> above.

So what happened was that I started out just encapsulating

   unsigned char sha1[20];

as a

   hash_t hash;

and that made sense in a lot of situations. I always thought that code that used

    struct object_id oid;

is just too ugly to live, so I'm not actually all that big of a fan of
the oid approach.

But the two approaches really are pretty much equivalent logically,
even if they don't look the same.

So I wanted to unify things: "One type to bring them all and in the
darkness bind them".

So I just basically made this:

    typedef struct object_id {
            unsigned char hash[GIT_HASH_SIZE];
    } hash_t;

to create one single data structure that doesn't make my eyes bleed.
That "struct object_id" still exists, but I don't generally have to
look at it when doing the conversion, and any current users "just
work".
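
To make the "one type, two names" point concrete, here is a minimal sketch. The typedef is the one quoted above; GIT_HASH_SIZE being 20 and the two helper functions are assumptions made up for illustration, not code from the patch:

```c
#include <assert.h>
#include <string.h>

#define GIT_HASH_SIZE 20  /* assumption: SHA-1 sized, like the old sha1[20] */

/* One definition, two spellings: "hash_t" and "struct object_id". */
typedef struct object_id {
        unsigned char hash[GIT_HASH_SIZE];
} hash_t;

/* Hypothetical helper written against the new name. */
static int hash_is_null(const hash_t *h)
{
        static const hash_t null_hash = {{0}};

        assert(h != NULL);
        return !memcmp(h->hash, null_hash.hash, GIT_HASH_SIZE);
}

/* Hypothetical existing caller, still spelled the old way. */
static int old_style_caller(const struct object_id *oid)
{
        /* Same type under both names, so no cast is needed here. */
        return hash_is_null(oid);
}
```

Because both names refer to the same struct, a "struct object_id *" can be handed to anything that takes a "hash_t *" and the other way around, which is all that "just work" means here.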

>> turns into
>>
>> +               const hash_t *mb = &result->item->object.oid;
>> +               if (!hashcmp(mb, current_bad_oid)) {
>
> Hmph.  I somehow thought the longer term direction for the above code
> would be to turn it into
>
>                 if (!oidcmp(&result->item->object.oid, &current_bad_oid))

Well, you can actually do it with my patch, since I left "oidcmp()"
alone and it's just an alias for "hashcmp()" in my tree.

Except I think "oid" is an odious name, and really confusing and not
at all descriptive.

Using a three-letter acronym when we have a four-letter actual word to
say it feels stupid and wrong to me.

So what my conversion does is basically say that the name is *hash*.
So instead of using "oidcmp", you use "hashcmp":

        if (!hashcmp(&result->item->object.oid, &current_bad_oid))

and functions take a "hash_t *" argument rather than a "struct
object_id *" argument, and when there was any kind of confusion and
mixing of use, I converted to "hash_t".

Both oid and "unsigned char *" users got converted.

In other words, what I was aiming for was getting rid - entirely - of
the "two different types", and I disliked both "oid" and "unsigned
char []", so neither replaces the other.
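
As a rough sketch of what "oidcmp() is just an alias for hashcmp()" can look like: the function names are the ones used in this thread, but the bodies and the macro are assumptions for illustration, not the actual patch:

```c
#include <assert.h>
#include <string.h>

#define GIT_HASH_SIZE 20  /* assumption: SHA-1 sized */

typedef struct object_id {
        unsigned char hash[GIT_HASH_SIZE];
} hash_t;

/* The comparison takes the encapsulated type, not a bare byte pointer. */
static inline int hashcmp(const hash_t *a, const hash_t *b)
{
        assert(a && b);
        return memcmp(a->hash, b->hash, GIT_HASH_SIZE);
}

/* Assumption: one simple way to keep the old name around as an alias. */
#define oidcmp(a, b) hashcmp((a), (b))

/*
 * The type safety part: passing a plain "unsigned char *" now gets a
 * compiler diagnostic (incompatible pointer type) instead of silently
 * comparing whatever 20 bytes happen to be there:
 *
 *         unsigned char raw[20];
 *         hashcmp(raw, &some_hash);
 */
```

With something like this in place both spellings compare the same bytes, and the only thing that changes for a caller is which name it prefers to type.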

> Having said all that, I do not offhand see a huge benefit of the
> current layout that has one layer between the hash (i.e. oid.hash)
> and the object name (i.e. oid) over "there is no need for oid.hash;
> oid is just a hash", which you seem to be doing.

Yes exactly.

>> And as part of the type safety, I do think I may have found a bug:
>>
>> show_one_mergetag():
>>
>>                 strbuf_addf(&verify_message, "tag %s names a non-parent %s\n",
>>                                     tag->tag, tag->tagged->oid.hash);
>>
>> note how it prints out the "non-parent %s", but that's a SHA1 hash
>> that hasn't been converted to hex. Hmm?
>
> Yup.  That needs fixing, obviously.

I suspect nobody has ever hit that case - I tried to google for "names
a non-parent" and "tag" and "git" and the only thing that I found was
hits to git source.

So I was actually fairly impressed that the only thing I found was one
totally insignificant bug in a printout.
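
For reference, the fix for that printout amounts to converting the raw bytes to hex before handing them to "%s". A minimal sketch, where hash_to_hex() and format_non_parent() are hypothetical stand-ins (git has its own hex helpers, such as the sha1_to_hex() that shows up in the grep output further down):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define GIT_HASH_SIZE 20  /* assumption: SHA-1 sized */

typedef struct object_id {
        unsigned char hash[GIT_HASH_SIZE];
} hash_t;

/* Hypothetical helper: writes 2 hex digits per byte plus a NUL into out. */
static char *hash_to_hex(const hash_t *h, char out[2 * GIT_HASH_SIZE + 1])
{
        static const char digits[] = "0123456789abcdef";
        int i;

        assert(h && out);
        for (i = 0; i < GIT_HASH_SIZE; i++) {
                out[2 * i]     = digits[h->hash[i] >> 4];
                out[2 * i + 1] = digits[h->hash[i] & 0x0f];
        }
        out[2 * GIT_HASH_SIZE] = '\0';
        return out;
}

/* Hypothetical formatter: same message text, but with a hex string for %s. */
static int format_non_parent(char *buf, size_t len,
                             const char *tag, const hash_t *tagged)
{
        char hex[2 * GIT_HASH_SIZE + 1];

        return snprintf(buf, len, "tag %s names a non-parent %s\n",
                        tag, hash_to_hex(tagged, hex));
}
```

The raw array is not NUL-terminated and is mostly unprintable, so the original "%s" would print garbage or stop early at the first zero byte; the hex string has neither problem.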

I did find a lot of cases where we really do mix a buffer of memory
("unsigned char *") with the hash. Not surprisingly, most of them
were in pack-file handling and in the tree parsing.

And some things do the reverse, and really walk a hash name byte by
byte. Things like "find_pack_entry_one()" really do walk the bytes
of the hash.

With the conversion in place, those painful things are a bit more
obvious. So there's a couple of places where I just did a hard
conversion from a "unsigned char *" to a hash_t, but they are now
obvious casts and there's only 17 of them:

  [torvalds@i7 git]$ git grep '(hash_t \*)'
  builtin/index-pack.c:           hashcpy(ref_hash, (hash_t *) fill(20));
  builtin/pack-redundant.c:               hash_t *h1 = (hash_t *)(p1_base + p1_off);
  builtin/pack-redundant.c:               hash_t *h2 = (hash_t *)(p2_base + p2_off);
  builtin/pack-redundant.c:               hash_t *h1 = (hash_t *)(p1_base + p1_off);
  builtin/pack-redundant.c:               hash_t *h2 = (hash_t *)(p2_base + p2_off);
  builtin/pack-redundant.c:               hash_t *h = (hash_t *)(base + off);
  dir.c:  hashcpy(&ud->exclude_sha1, (hash_t *)rd->data);
  fast-import.c:          hashcpy(&e->versions[0].hash, (hash_t *)c);
  fast-import.c:          hashcpy(&e->versions[1].hash, (hash_t *)c);
  match-trees.c:  hashcpy((hash_t *)rewrite_here, rewrite_with);
  sha1-lookup.c:                      lo, mi, hi, sha1_to_hex((hash_t *)key));
  sha1_file.c:    return (hash_t *)(base + idx * GIT_SHA1_RAWSZ);
  sha1_file.c:            return (hash_t *)base;
  sha1_file.c:            return (hash_t *) (index + 24 * n + 4);
  sha1_file.c:            return (hash_t *) (index + 20 * n);
  sha1_file.c:            int cmp = hashcmp((hash_t *)(index + mi * stride), (hash_t *)sha1);
  split-index.c:  hashcpy(&si->base_sha1, (hash_t *)data);

and there are basically an equal number of cases where I do the
reverse (by doing hash->hash to get the byte array data of the hash).
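
Both directions can be sketched in a few lines. This is only an illustration: nth_hash() and first_byte() are hypothetical names, the table layout (back-to-back raw 20-byte entries) is an assumption loosely modeled on the pack index casts above, and the cast leans on the struct holding nothing but an unsigned char array:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GIT_HASH_SIZE 20  /* assumption: SHA-1 sized */

typedef struct object_id {
        unsigned char hash[GIT_HASH_SIZE];
} hash_t;

/* The cast below is only sane if the struct adds no padding. */
_Static_assert(sizeof(hash_t) == GIT_HASH_SIZE, "hash_t must be exactly the raw bytes");

/* Buffer to hash: hypothetical lookup of the n-th raw entry in a table. */
static const hash_t *nth_hash(const unsigned char *base, size_t n)
{
        assert(base != NULL);
        return (const hash_t *)(base + n * GIT_HASH_SIZE);  /* the greppable cast */
}

/* Hash to bytes: code that walks the hash reaches into ->hash explicitly. */
static unsigned int first_byte(const hash_t *h)
{
        assert(h != NULL);
        return h->hash[0];
}
```

The point is not that the cast is pretty, but that every place crossing the boundary between "a buffer of memory" and "a hash" is now spelled out and shows up in a grep.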

So the patch doesn't *fix* anything, but it does, I think, make it
easier to see the problems.

And the *bulk* of the code doesn't look inside the hashes at all.

                     Linus
