On Mon, Feb 27, 2017 at 01:00:01PM +0000, Ian Jackson wrote:
> I said I was working on a transition plan.  Here it is.  This is
> obviously a draft for review, and I have no official status in the git
> project.  But I have extensive experience of protocol compatibility
> engineering, and I hope this will be helpful.
>
> Ian.
>
>
> Subject: Transition plan for git to move to a new hash function
>
>
> BASIC PRINCIPLES
> ================
>
> We run multiple hashes in parallel.  Each object is named by exactly
> one hash.  We define that objects with identical content, but named by
> different hash functions, are different objects.

I think this is fine.

> Objects of one hash may refer to objects named by a different hash
> function to their own.  Preference rules arrange that normally, new
> hash objects refer to other new hash objects.

The existing codebase isn't really designed with that in mind.  I'm not
arguing against this because I think it's a bad idea; I'm arguing
against it because, as a contributor, I doubt that it is easily
achievable given the state of the codebase.

> The intention is that for most projects, the existing SHA-1 based
> history will be retained and a new history built on top of it.
> (Rewriting is also possible but means a per-project hard switch.)

I like Peff's suggested approach in which we essentially rewrite history
under the hood, but have a lookup table which looks up the old hash
based on the new hash.  That allows us to refer to old objects, but not
have to share serialized data that mentions both hashes.

Obviously only the SHA-1 versions of old tags and commits will be able
to be validated, but that shouldn't be an issue.  We can hook that code
into a conversion routine that can handle on-the-fly object conversion.
We can also implement fallback functionality (which can optionally be
disabled) to look up old SHA-1 hash names based on the new hash.

> We extend the textual object name syntax to explicitly name the hash
> used.  Every program that invokes git or speaks git protocols will
> need to understand the extended object name syntax.
>
> Packfiles need to be extended to be able to contain objects named by
> new hash functions.  Blob objects with identical contents but named by
> different hash functions would ideally share storage.
>
> Safety catches prevent accidental incorporation into a project of
> incompatibly-new objects, or additional deprecatedly-old objects.
> This allows for incremental deployment.

We have a compatibility mechanism already in place: if the
repositoryFormatVersion option is set to 1, but an unknown extension
flag is set, Git will bail out.

For network protocols, we have the server offer a hash=foo extension and
make the client echo it back, and then either bail or convert on the
fly.  This makes it fast for new clients and slow for old clients, which
encourages migration.  We could also store old-style packs for easy
fetch by clients.

> TEXTUAL SYNTAX
> ==============
>
> The object name textual syntax is extended.  The new syntax may be
> used in all textual git objects and protocols (commits, tags, command
> lines, etc.).
>
> We declare that the object name syntax is henceforth
>    [A-Z]+[0-9a-z]+ | [0-9a-f]+
> and that names [A-Z].* are deprecated as ref name components.

I'd simply say that data is always in the new format when it's
available, and that we tag the old SHA-1 versions instead.  Otherwise,
as Peff pointed out, we're going to be stuck typing a bunch of identical
stuff every time.  Again, this encourages migration.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
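
To make the lookup-table idea above concrete, here is a rough sketch in
Python.  It is an illustration only, not git code: the HashLookup class
and its method names are hypothetical, and SHA-256 is assumed as the new
hash purely for the example, since no replacement function is chosen in
this message.  It relies only on the fact that git names a blob by
hashing a "blob <size>\0" header followed by the contents.

```python
import hashlib


def blob_name(contents: bytes, algo: str) -> str:
    """Name a blob git-style: hash 'blob <size>\\0' followed by the contents."""
    header = f"blob {len(contents)}".encode() + b"\0"
    return hashlib.new(algo, header + contents).hexdigest()


class HashLookup:
    """Hypothetical table mapping new-hash object names to old SHA-1 names."""

    def __init__(self, new_algo: str = "sha256"):
        # Assumption: SHA-256 stands in for whatever new hash is selected.
        self.new_algo = new_algo
        self._new_to_old = {}

    def record_blob(self, contents: bytes) -> str:
        """Register a blob under both names and return its new-hash name."""
        old = blob_name(contents, "sha1")
        new = blob_name(contents, self.new_algo)
        self._new_to_old[new] = old
        return new

    def old_name(self, new_name: str):
        """Fallback lookup: the SHA-1 name for a new-hash name, or None."""
        return self._new_to_old.get(new_name)


table = HashLookup()
new = table.record_blob(b"hello\n")
print(new, "->", table.old_name(new))
```

The sketch covers blobs only, where the contents are identical under
both hashes.  Trees, commits, and tags embed other object names, so
their serialized form would first have to go through the conversion
routine before being hashed under the new function.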