On Mon, Feb 27, 2017 at 01:00:01PM +0000, Ian Jackson wrote:
> I said I was working on a transition plan.  Here it is.  This is
> obviously a draft for review, and I have no official status in the git
> project.  But I have extensive experience of protocol compatibility
> engineering, and I hope this will be helpful.
> 
> Ian.
> 
> 
> Subject: Transition plan for git to move to a new hash function
> 
> 
> BASIC PRINCIPLES
> ================
> 
> We run multiple hashes in parallel.  Each object is named by exactly
> one hash.  We define that objects with identical content, but named by
> different hash functions, are different objects.

I think this is fine.

> Objects of one hash may refer to objects named by a different hash
> function to their own.  Preference rules arrange that normally, new
> hash objects refer to other new hash objects.

The existing codebase isn't really intended with that in mind.

It's not that I am arguing against this because I think it's a bad idea,
I'm arguing against it because as a contributor, I'm doubtful that this
is easily achievable given the state of the codebase.

> The intention is that for most projects, the existing SHA-1 based
> history will be retained and a new history built on top of it.
> (Rewriting is also possible but means a per-project hard switch.)

I like Peff's suggested approach in which we essentially rewrite history
under the hood, but have a lookup table which looks up the old hash
based on the new hash.  That allows us to refer to old objects, but not
have to share serialized data that mentions both hashes.

Obviously only the SHA-1 versions of old tags and commits will be able
to be validated, but that shouldn't be an issue.  We can hook that code
into a conversion routine that can handle on-the-fly object conversion.

We also can implement (optionally disabled) fallback functionality to
look up old SHA-1 hash names based on the new hash.

> We extend the textual object name syntax to explicitly name the hash
> used.  Every program that invokes git or speaks git protocols will
> need to understand the extended object name syntax.
> 
> Packfiles need to be extended to be able to contain objects named by
> new hash functions.  Blob objects with identical contents but named by
> different hash functions would ideally share storage.
> 
> Safety catches preferent accidental incorporation into a project of
> incompatibly-new objects, or additional deprecatedly-old objects.
> This allows for incremental deployment.

We have a compatibility mechanism already in place: if the
repositoryFormatVersion option is set to 1, but an unknown extension
flag is set, Git will bail out.

For network protocols, we have the server offer a hash=foo extension,
and make the client echo it back, and either bail or convert on the fly.
This makes it fast for new clients, and slow for old clients, which
encourages migration.

We could also store old-style packs for easy fetch by clients.

> TEXTUAL SYNTAX
> ==============
> 
> The object name textual syntax is extended.  The new syntax may be
> used in all textual git objects and protocols (commits, tags, command
> lines, etc.).
> 
> We declare that the object name syntax is henceforth
>   [A-Z]+[0-9a-z]+ | [0-9a-f]+
> and that names [A-Z].* are deprecated as ref name components.

I'd simply say that we have data always be in the new format if it's
available, and tag the old SHA-1 versions instead.  Otherwise, as Peff
pointed out, we're going to be stuck typing a bunch of identical stuff
every time.  Again, this encourages migration.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204