On Sun, Feb 26, 2017 at 07:09:44AM +0900, Mike Hommey wrote:
> On Sat, Feb 25, 2017 at 02:26:56PM -0500, Jeff King wrote:
> > I looked at that earlier, because I think it's a reasonable idea for
> > future-proofing. The first byte is a "varint", but I couldn't find where
> > they defined that format.
> >
> > The closest I could find is:
> >
> >   https://github.com/multiformats/unsigned-varint
> >
> > whose README says:
> >
> >   This unsigned varint (VARiable INTeger) format is for the use in all
> >   the multiformats.
> >
> >   - We have not yet decided on a format yet. When we do, this readme
> >     will be updated.
> >
> >   - We have time. All multiformats are far from requiring this varint.
> >
> > which is not exactly confidence inspiring. They also put the length at
> > the front of the hash. That's probably convenient if you're parsing an
> > unknown set of hashes, but I'm not sure it's helpful inside Git objects.
> > And there's an incentive to minimize header data at the front of a hash,
> > because every byte is one more byte that every single hash will collide
> > over, and people will have to type when passing hashes to "git show",
> > etc.

The multihash spec also says that it's not necessary to implement
varints until we have 127 hashes, and considering that will be in the
far future, I'm quite happy to punt that problem down the road to
someone else[0].

> > I'd almost rather use something _really_ verbose like
> >
> >   sha256:1234abcd...
> >
> > in all of the objects. And then when we get an unadorned hash from the
> > user, we guess it's sha256 (or whatever), and fallback to treating it as
> > a sha1.
> >
> > Using a syntactically-obvious name like that also solves one other
> > problem: there are sha1 hashes whose first bytes will encode as a "this
> > is sha256" multihash, creating some ambiguity.
>
> Indeed, multihash only really is interesting when *all* hashes use it.
> And obviously, git can't change the existing sha1s.

Well, that's why I said in new objects. If we're going to default to a
new hash, we can store it inside the object format, but not actually
expose it to the user.

In other words, if we used SHA-256, a tree object would refer to the
SHA-1 empty blob as 1114e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 and the
SHA-256 empty blob as
1220473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813,
but user-visible code would parse them as e69d... and 473a... (or as
sha1:e69d and 473a, or something).

There's very little code which actually parses objects, so it's easy
enough to introduce a few new functions to read and write the prefixed
versions within the objects, and leave the rest to work in the same old
user-visible way (or in the way that you've proposed).

Note also that we need some way to distinguish objects in binary form,
since if we mix hashes, we need to be able to read data directly from
pack files and other locations where we serialize data that way.
Multihash would do that, even if we didn't expose that to the user.

[0] And for the record, I'm a maintenance programmer, and I dislike it
when people punt the problem down the road to someone else, because
that's usually me.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
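
For illustration, here is a minimal sketch of the prefixed in-object
encoding described in the message above. It assumes the one-byte
multihash function codes implied by the examples (0x11 for SHA-1, 0x12
for SHA-256) followed by a one-byte digest length, and it does not
handle varints. The helper names (git_blob_digest, encode_prefixed,
decode_prefixed) are hypothetical and are not Git functions.

```python
import hashlib
from typing import Tuple

# One-byte function codes taken from the examples in the message:
# 0x11 = SHA-1, 0x12 = SHA-256. Varints are deliberately not handled.
CODES = {"sha1": 0x11, "sha256": 0x12}
NAMES = {code: name for name, code in CODES.items()}


def git_blob_digest(algo: str, data: bytes) -> bytes:
    """Hash data as a Git blob: the header 'blob <size>' plus a NUL byte,
    followed by the content."""
    h = hashlib.new(algo)
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.digest()


def encode_prefixed(algo: str, digest: bytes) -> bytes:
    """Hypothetical writer: function code, digest length, then the digest."""
    return bytes([CODES[algo], len(digest)]) + digest


def decode_prefixed(raw: bytes) -> Tuple[str, bytes]:
    """Hypothetical reader: split a prefixed hash into (algorithm, digest)."""
    if len(raw) < 2:
        raise ValueError("prefixed hash is too short")
    code, length = raw[0], raw[1]
    if code not in NAMES:
        raise ValueError("unknown hash function code 0x%02x" % code)
    digest = raw[2:2 + length]
    if len(digest) != length:
        raise ValueError("truncated digest")
    return NAMES[code], digest


if __name__ == "__main__":
    for algo in ("sha1", "sha256"):
        stored = encode_prefixed(algo, git_blob_digest(algo, b""))
        name, digest = decode_prefixed(stored)
        # Stored form is what an object would contain; the bare digest is
        # what user-visible code would show.
        print(name, stored.hex(), digest.hex()[:4])
```

Only the reader and writer for object contents would need to know about
the two prefix bytes; everything user-visible would keep working with
the bare digest returned by the decoder.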