git@vger.kernel.org mailing list mirror (one of many)
* Git Merge contributor summit notes
@ 2018-03-10  0:06 Alex Vandiver
  2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alex Vandiver @ 2018-03-10  0:06 UTC (permalink / raw)
  To: git
  Cc: git, jonathantanmy, bmwill, stolee, sbeller, avarab, peff,
	johannes.schindelin

It was great to meet some of you in person!  Some notes from the
Contributor Summit at Git Merge are below.  Taken in haste, so
my apologies if there are any mis-statements.

 - Alex

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


  "Does anyone think there's a compelling reason for git to exist?"
    - peff


Partial clone (Jeff Hostetler / Jonathan Tan)
---------------------------------------------
 - Request that the server not send everything
 - Motivated by getting Windows into git
 - Also by not having to fetch large blobs that are in-tree
 - Allows client to request a clone that excludes some set of objects, with incomplete packfiles
 - Decoration on objects that include promise for later on-demand backfill
 - In `master`, have a way of:
   - omitting all blobs
   - omitting large blobs
   - sparse checkout specification stored on server
 - Hook in read_object to fetch objects in bulk
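
   A rough, self-contained sketch of the read_object fallback mentioned
   above: try the local store first, backfill from the promisor remote on a
   miss.  The in-memory "odb" and fetch_from_promisor() are invented for
   illustration; this is not git's actual API.

      #include <stdio.h>
      #include <string.h>

      #define MAX_OBJECTS 16

      static char local_odb[MAX_OBJECTS][48];   /* object ids we have locally */
      static int local_count;

      static int have_locally(const char *oid)
      {
          for (int i = 0; i < local_count; i++)
              if (!strcmp(local_odb[i], oid))
                  return 1;
          return 0;
      }

      /* Stand-in for one bulk round trip to the promisor remote. */
      static void fetch_from_promisor(const char **oids, int n)
      {
          for (int i = 0; i < n; i++) {
              printf("backfilling %s from promisor\n", oids[i]);
              strcpy(local_odb[local_count++], oids[i]);
          }
      }

      static int read_object(const char *oid)
      {
          if (!have_locally(oid))
              fetch_from_promisor(&oid, 1);   /* ideally batched, not one per object */
          return have_locally(oid);
      }

      int main(void)
      {
          strcpy(local_odb[local_count++], "aaaa1111");   /* cloned with most blobs omitted */
          read_object("aaaa1111");                        /* already local: no round trip */
          read_object("bbbb2222");                        /* omitted blob: triggers backfill */
          return 0;
      }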

 - Future work:
   - A way to fetch blobsizes for virtual checkouts
   - Give me new blobs that this tree references relative to now
   - Omit some subset of trees
   - Modify other commits to exclude omitted blobs
   - Protocol v2 may have better verbs for sparse specification, etc

Questions:
 - Reference server implementation?
   - In git itself
   - VSTS does not support
 - What happens if a commit becomes unreachable?  Does promise still apply?
   - Probably yes?
   - If the promise is broken, probably crashes
   - Can differentiate between promise that was made, and one that wasn't
   => Demanding commitment from server to never GC seems like a strong promise
 - Interactions with external object db
   - promises include bulk fetches, as opposed to external db, which is one-at-a-time
   - dry-run semantics to determine which objects will be needed
   - very important for small objects, like commits/trees (omitting those is not in `master`, only blobs)
   - perhaps for protocol V2
 - server has to promise more, requires some level of online operation
   - annotate that only some refs are forever?
   - requires enabling the "fetch any SHA" flags
   - rebasing might require now-missing objects?
     - No, to build on them you must have fetched them
     - Well, building on someone else's work may mean you don't have all of them
   - server is less aggressive about GC'ing by keeping "weak references" when there are promises?
   - hosting requires that you be able to forcibly remove information
 - being able to know where a reference came from?
   - as being able to know why an object was needed, for more advanced logic
 - Does `git grep` attempt to fetch blobs that are deferred?
   - will always attempt to fetch
   - one fetch per object, even!
   - might not be true for sparse checkouts
   - Maybe limit to skipping "binary files"?
   - Currently sparse checkout grep "works" because grep defaults to looking at the index, not the commit
   - Grepping revisions does exhibit the above fetch-per-object behavior
   - Don't yet have a flag to exclude grep on non-fetched objects
   - Should `git grep -L` die if it can't fetch the file?
   - Need a config option for "should we die, or try to move on"?
 - What's the endgame?  Only a few codepaths that are aware, or threaded through everywhere?
   - Fallback to fetch on demand means there's an almost-reasonable fallback
   - Better prediction with bulk fetching
   - Are most commands going to _need_ to be sensitive to it?
   - GVFS has a caching server in the building
   - A few git commands have been disabled (see recent mail from Stolee); those are likely candidates for code that needs to be aware of de-hydrated objects
 - Is there an API to know what objects are actually local?
   - No external API
   - GVFS has a REST API
 - Some way to later ask about files?
   - "virtualized filesystem"?
   - hook to say "focus on this world of files"
   - GVFS writes out your index currently
 - Will this always require turning off reachability checks?
   - Possibly
 - Shallow clones, instead of partial?
   - Don't download the history, just the objects
   - More of a protocol V2 property
   - Having all of the trees/commits makes this reasonable
 - GVFS vs this?
   - GVFS was a first pass
   - Now trying to productize that for mainstream git
   - Goal is to remove features from GVFS and replace with this

Protocol V2 (Brandon)
---------------------
 - Main problem is that forward compatibility negotiation wasn't possible
 - Found a way to sneak in the V2 negotiation via side-channel in all transports
 - "environment variable" GIT_PROTOCOL which server can detect
 - Ability to transmit and ignore, or not transmit, means forward/backward compat
 - HTTP header / environment variable
 - ...soooo now what?
 - Keep as similar as possible, but more layout changes to remove bad characteristics
 - Like fixing flush semantics
 - Remove ref advertisement (250M of refs every fetch from Android!)
 - Capabilities are currently in first packet, 1K limit
 - First response is capabilities from the server, like commands or features
 - Or server-side grep/log, for sparse checkouts
 - Client then issues request for one of those commands
 - Server executes command, sends back result
 - HTTP vs everything else -- HTTP is very complicated, git -> CURL / CURL -> git
 - Clear flush semantics make the protocol stateless (?)
 - Add in ability to make protocol stateful
 - Already deployed for local repositories
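
   As a concrete illustration of the framing the new protocol keeps, here is
   a minimal sketch of pkt-line output for a v2-style request: each packet is
   a 4-hex-digit length (including those 4 bytes) plus payload, "0001" is the
   delimiter v2 adds, "0000" is a flush.  Illustration only, not the full
   protocol.

      #include <stdio.h>
      #include <string.h>

      /* A pkt-line is a 4-hex-digit length (which counts those 4 bytes)
       * followed by the payload. */
      static void pkt_line(const char *payload)
      {
          printf("%04x%s", (unsigned)(strlen(payload) + 4), payload);
      }

      int main(void)
      {
          pkt_line("command=ls-refs\n");      /* pick one of the advertised commands */
          printf("0001");                     /* delim-pkt: end of capability section */
          pkt_line("ref-prefix refs/heads/\n");
          printf("0000");                     /* flush-pkt: end of the request */
          return 0;
      }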

Questions
 - How does fetch differ?
   - Similar to what HTTP does today
   - Still stateless
   - Because otherwise people forget about HTTP
   - Force everyone to think about stateless behavior
   - May want to tweak to explicitly allow stateful requests
   - Microsoft wanted SSH to be a proxy for HTTP
   - Most other verbs other than fetch, are easy to make stateless
   - Fetch is inherently multi-round; ignoring HTTP makes it hard to implement

 - ls-remote vs ls-refs
   - no necessary consistency across datacenters; multi-round makes this complicated for consistency
   - might want requirements like push, for consistency
   - branch -> object-id gives possibility for race conditions
   - Jonathan Tan wrote patches to ask for refs, not oids; server says "in this packfile, master is 0xdeadbeef" after ref advertisement
   - (peff) But that's a hack; the only reason is because protocol is stateless
   - Difference between protocol v2, and fetch V2; build protocol V2 first, then think about negotiation for fetch V2
   - Seems like optimizing for one case -- fetching refs/heads/* requires two rounds and still has a race condition
   - Could optimize for that, too, but then why have ref advertisement?
   - (brandon) Do you have to rethink about how fetch works _before_ upgrading the protocol?

 - How often do you contact two different HTTP servers?  keepalive should solve that
   - google doesn't do that; keepalive is just to the frontend, which is just a proxy

 - jgit implementation in progress
   - start some portion of real traffic using it
   - turned on for internal traffic in the client
   - local operations use the new protocol

 - concerns from the server folks?
   - (stolee) making it stateless is good
   - (stolee) microsoft will start working on it as soon as it's final
   - (jon) github has not started implementation, will once landed

 - (peff) Time to deprecate the git anonymous protocol?
   - Biggest pain to sneak information into
   - Shawn/Johannes added in additional parameters after a null byte
   - Bug; if there's anything other than host, then die
   - But it doesn't check anything after _two_ null bytes.
   - "Two null bytes, for protocol V2"
   - Only in use by github and individual users
   - Would not be too sad if daemon went away
   - Git for Windows has interest in daemon
   - Still interested in simple HTTP wrapper?
   - HTTP deployment could be made easier
   - Useful for unauthenticated internal push
   - Perhaps make the daemon use HTTPS?  Because it needs to be _simple_
   - Currently run out of inittab
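
   A sketch of the trick described above: a git:// daemon request with extra
   data hidden after the second NUL byte, which is where old daemons stop
   looking.  The exact layout here is illustrative; the pack-protocol
   documentation is authoritative.

      #include <stdio.h>

      int main(void)
      {
          char buf[256];
          int len = 0;

          len += sprintf(buf + len, "git-upload-pack /project.git");
          buf[len++] = '\0';
          len += sprintf(buf + len, "host=example.com");
          buf[len++] = '\0';
          buf[len++] = '\0';                  /* old daemons look no further than this */
          len += sprintf(buf + len, "version=2");
          buf[len++] = '\0';

          printf("%04x", len + 4);            /* the request is itself a pkt-line */
          fwrite(buf, 1, len, stdout);
          return 0;
      }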

 - Series as currently out
   - Only used for local operations
   - Not confident on remote CURL
   - Once jgit implementation is done, should be more confident
   - e.g. authentication may be messed up
   - only file:// is currently in production
   - test scripts to exercise HTTP, so only thing unknown is auth
   - May need interop tests? there is one, but not part of the standard tests
   - Dscho can set up something in VSTS infra to allow these two versions to be tested
   - Tests should specify their versions; might be as simple as `cd ...; make` and maybe they should be in Travis


Serialized commit graph (stolee)
--------------------------------
 - current patch includes a way to write a file which is commit history
 - replaces object database lookups by separate file
 - stores object id, commit date, parent information
 - parent information includes reference id
 - `git log --graph` is 10% of the time it used to be
 - putting generation numbers in speeds things up further
 - queued up for the next series
 - generation number is an integer; guarantees that everything a commit can reach has a lower number
 - root commits get 1, every child gets max(parents)+1 (see the sketch below)
 - can give nice guarantees if one knows there are no merges in a commit range
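
   A toy illustration of the generation-number rule above (a root commit gets
   1, every other commit gets max over its parents plus 1); not git's actual
   commit-graph code.

      #include <stdio.h>

      #define MAX_PARENTS 2

      struct commit {
          int nr_parents;
          struct commit *parent[MAX_PARENTS];
          int generation;                 /* 0 = not computed yet */
      };

      static int generation(struct commit *c)
      {
          int max = 0;
          if (c->generation)
              return c->generation;
          for (int i = 0; i < c->nr_parents; i++) {
              int g = generation(c->parent[i]);
              if (g > max)
                  max = g;
          }
          return c->generation = max + 1; /* root commits end up with 1 */
      }

      int main(void)
      {
          struct commit root = { 0 }, a = { 1, { &root } }, b = { 1, { &a } };
          struct commit merge = { 2, { &a, &b } };
          /* If gen(x) >= gen(y), then x cannot be reachable from y: a cheap
           * cutoff for --contains, merge-base and ahead/behind walks. */
          printf("root=%d a=%d b=%d merge=%d\n", generation(&root),
                 generation(&a), generation(&b), generation(&merge));
          return 0;
      }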

 - partial clones?
   - all commits are there
 - shallow clones
   - won't work
   - can define generation numbers based on current information
   - if deepened, would need to toss
   - server might be able to give this information
   - or maybe the commit graph
   - and the partial clone protocol could then be used to look up e.g. author information
 - VSTS can do sub-second graph requests
 - one caveat: topo order in revision walk
   - because computes in-degrees, /then/ topo sort; merging those two to be incremental is hard
 - (peff) custom contains traversals, can be changed to use generation numbers (+ reachability bitmaps?)
 - commit graph is easier to maintain than reachability bitmap
 - bitbucket has similar code that predates reachability bitmaps; generate reachability DAG in a text file; diminishing returns to improve further
 - generation numbers are useful for computing merge bases
 - status does a merge-base calculation (e.g. --no-ahead-behind)
 - derived data, can be extended further
 - VSTS adds bloom filters to know which paths have changed in the commit (sketched below)
 - tree-same check in the bloom filter is fast; speeds up file history checks
 - might be useful in the client as well, since limited-traversal is common
 - if the file history is _very_ sparse, then bloom filter is useful
 - but needs pre-compute, so useful to do once
 - first make the client do it, then think about how to serve it centrally
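
   A toy version of the changed-path Bloom filter idea above: record the
   paths a commit touches in a small bit set, so "did this commit touch path
   X?" can usually be answered without a tree diff (false positives are
   possible, false negatives are not).  Sketch only.

      #include <stdint.h>
      #include <stdio.h>

      #define BITS 256

      static unsigned char filter[BITS / 8];

      static uint32_t hash_path(const char *path, uint32_t seed)
      {
          uint32_t h = 2166136261u ^ seed;        /* FNV-1a, seeded */
          for (; *path; path++) {
              h ^= (unsigned char)*path;
              h *= 16777619u;
          }
          return h % BITS;
      }

      static void add_path(const char *path)
      {
          for (uint32_t k = 0; k < 3; k++) {      /* three probes per path */
              uint32_t bit = hash_path(path, k);
              filter[bit / 8] |= 1u << (bit % 8);
          }
      }

      static int maybe_changed(const char *path)
      {
          for (uint32_t k = 0; k < 3; k++) {
              uint32_t bit = hash_path(path, k);
              if (!(filter[bit / 8] & (1u << (bit % 8))))
                  return 0;                       /* definitely tree-same for this path */
          }
          return 1;                               /* probably changed */
      }

      int main(void)
      {
          add_path("Documentation/technical/partial-clone.txt");
          printf("%d %d\n",
                 maybe_changed("Documentation/technical/partial-clone.txt"),
                 maybe_changed("builtin/grep.c"));
          return 0;
      }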

Questions:
 - How to make sure that graph stays in sync?
   - Not all commits need to be in the graph
   - Can still parse the commits to get the data
   - Assume infinite generation number if not in the graph
   - Can even include commits that are in odb
   - Graph need not be 100% synced to object data
   - Hook is in parse_commit; if it doesn't find it, it falls back to the object database
 - What about rebase?
   - commits are not mutable; contains no refs
   - same caveats and fallbacks as bitmap index
 - What about when the commits are not immutable?  E.g. grafts
   - There is logic to check for grafts
   - Code is copy/pasted
   - If you build the graph file, and you then add grafts, it will break
   - Perhaps if you have grafts, then should disable
   - But shallow is kinda a graft
   - Needs hardening around shallow clones
 - Work with multiple packfiles?
   - Not tied to packfiles at all
   - Works on commits, not packfiles
 - Incremental updates?
   - Can be; has "additive", which is still O(n), not O(new)
   - Small enough that not really useful to do the split-index trick for later merge
   - 1/10 of .idx content
   - Nothing to update on commit, etc because it's optional
 - GVFS has daily prefetch; this also gets the new DAG cache
 - If one knows that the timestamps are perfect, then is there a need for generation numbers?
   - Two reasons: timestamps need to be 64-bit in the priority queue; also need a way to promise that there is no clock skew
   - Could add a config option to promise timestamps are right?
   - We already do that; 5 in a row that are wrong will give a wrong answer in `git log`
   - Even without generation numbers, still useful
   - All about being able to make promises, instead of worrying about that one bad commit
   - Suspect because they push everything down as far as possible; generation numbers are probably small, vs broad for timestamps?
 - Does it now have a gc/fsck?
   - Requires running it yourself now
   - GC to generate it, fsck needs strong guarantees about the file
   - Those files can happen at the same time as generation numbers
   - Probably `git-commit-graph.auto` that can be turned on
   - Eventually opt-out and not opt-in


Pitch of research on Git (Gabriel)
----------------------------------
 - PhD student researching version control and how it has changed practices in software development
 - Understanding open source projects, working on large-scale codebases
 - Takes into account the tools that allow it to work
 - Working on it for 1y
 - Focused on the history of git first
 - Why was it invented when it was?
 - Historical flame wars about CVS, BitKeeper, etc
 - Combine graph analysis and time analysis looking for patterns where new features changed how development happens
 - Also now meeting git developers, and how brainstorming happens
 - Knowing why you contribute to git, how it relates to your work
 - Research is ongoing, no results yet

Questions:
 - Results that are of interest to sociologists -- also to the community?
   - Most will be obvious to you
   - But some data analysis may also prove some intuitions
   - e.g. how much does one section of code stay unchanged over time?
 - Statistics on the git project itself?  Program manager asked why google pays to work on git
   - One of the main reasons is to generate traces of work
   - Proof of what has happened
   - Tips about how best to extract data from the logs
 - Git community exclusively?
   - Focus on git itself
   - Input from other version control systems; CVS/Subversion/etc
 - Interested in seeing what made git popular


Repository OO (Stefan)
----------------------
 - submodules were hard to get into git
 - shell into the submodule, which is inefficient
 - want to run in the same process
 - but need to have abstractions for repositories
 - how to get help / reviews / refactoring
 - a lot of grunt work
 - refactoring is not just for submodules, but also for multi-threading; need structure to put mutex
 - advice for people who are adding globals, about where to put them
 - passing repository down to the methods, instead of a global (sketched below)
 - except not all methods have that parameter
 - like sha1 -> objectid; pick one part, and work on it
 - use coccinelle
 - unclear what the action item is here?
 - most of the work is in finding boundaries to split out
 - have a "the repository" pointer, work from the deepest levels of the stack up
   - that's what is currently done
 - libgit2 might be able to speak to some difficulties -- stashes / refs
 - not about inheritance, more about stack-global state
 - not buying into OO C pain, but being able to swap out the object backend
 - about being reusable and not shelling out, not really "true OO"
 - grafts + replacement refs are loaded just once, then submodules are loaded, so submodules can be grafted in the main repo (!)
 - some places we merge object stores from the subrepo into the main object store
 - also need to change attributes, config, etc
 - also need to include the index
 - the_index can become a macro for the_repository's index
 - endgame is to have all builtins be given a repository and configset to work on (and then remove the_repository)
 - have the_repository be a pointer that you push onto as you enter the submodule codepath
 - implement it as dynamic scoping
 - but that doesn't help with threading
 - hacky, but might give some quick gains
 - exception handling within a submodule?
 - doing them in-process gives better error handling than exec and parse the error code
 - do it like refs code, which builds up error code in a strbuf
 - how to get it into next in a way that doesn't cause a lot of conflicts
 - could be merged directly into master?
 - any way to feed them in incrementally?
 - junio is currently treating as not a special series
 - conflict with objectid
 - trickling patches means needs reviewers
 => Please help review this patch series
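
   A minimal sketch of the direction described above: pass a repository
   struct down the call chain instead of reaching for globals, so the same
   code can run on the main repository or a submodule.  Struct layout and
   names are invented, not git's actual API.

      #include <stdio.h>

      struct index_state { int nr_entries; };

      struct repository {
          const char *gitdir;
          struct index_state index;
      };

      static struct repository main_repo = { ".git", { 1000 } };
      static struct repository *the_repository = &main_repo;

      /* New-style API: no hidden global; works for any repository. */
      static void report_index(struct repository *repo)
      {
          printf("%s has %d index entries\n", repo->gitdir, repo->index.nr_entries);
      }

      int main(void)
      {
          struct repository submodule = { "sub/.git", { 42 } };
          report_index(the_repository);   /* builtins start from the_repository ... */
          report_index(&submodule);       /* ... but submodule code passes its own */
          return 0;
      }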


Index format / API (Jeff H)
---------------------------
 - referenced everywhere via macros
 - needs above refactoring to fix thread safety
 - index is great big linear thing
 - about 40 places that just do a serial scan
 - list is ordered by pathname, difficult to search on
 - no way to "work on a directory" without a bunch of locking/coordination
 - invest in an index API which would allow higher-level operations than a for loop (sketched below)
 - expose directories as the API
 - have API methods to work on directories
 - shallow clone recurses with a prefix to change, and does things deep in the recursion
 - directory iterators that know (once loaded) what a directory is, let sparse skip by directory
 - as iterating, need to insert; fine with single-threaded, horrible with multi-threaded
 => Do we want higher-level iterators on the index?
 - hierarchical information baked into the format, don't need to load entire thing to write out
 - don't need to read whole thing (e.g. 450M for Windows) to insert a single row
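
   A sketch of what a higher-level index iterator could look like, here as
   "iterate over everything under a directory prefix" on the path-sorted
   entry list.  API names are invented for illustration.

      #include <stdio.h>
      #include <string.h>

      static const char *index_paths[] = {        /* the index is sorted by path */
          "Makefile", "builtin/grep.c", "builtin/log.c", "t/t0001-init.sh",
      };
      #define NR_PATHS (sizeof(index_paths) / sizeof(index_paths[0]))

      struct dir_iter {
          const char *prefix;
          size_t pos;
      };

      static const char *dir_iter_next(struct dir_iter *it)
      {
          size_t len = strlen(it->prefix);
          while (it->pos < NR_PATHS) {
              const char *path = index_paths[it->pos++];
              if (!strncmp(path, it->prefix, len))
                  return path;
              /* Because entries are sorted, a real version could binary-search
               * to the first match and stop at the first mismatch. */
          }
          return NULL;
      }

      int main(void)
      {
          struct dir_iter it = { "builtin/", 0 };
          const char *path;
          while ((path = dir_iter_next(&it)))
              printf("%s\n", path);
          return 0;
      }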

Questions:
 - Index serves a dual purpose -- staging area for commit, also an optimization for ext2
   Split into differential from HEAD, other part is acceleration (fsmonitor)
    - having a staging area that is less cache makes it easier to explain
 - Refs code has an iterator concept to build on?
    - iterators are composable, though somewhat painful in C
 - Packed refs is similar in the problems and size and hierarchy
    - almost every git command searches through refs/replace/
    - invalidation of tree objects as soon as their contents are changed (?)
 - split index just solves the write problem, not the read problem
 - filesystem mtime cache + difference from HEAD
 - but both need to mention filesystem paths
 - alter format to have hierarchy at the same time?
 - reftable might share an object model?
 - there was a GSOC series for partially-read index, but needs better index abstraction
 - abstracted iterator for the for loops, then look at the other places
 - probably 4-5 different iterators


Performance misc (Ævar)
-----------------------
 - Status update on what helps performance
 - traversal stuff, protocol v2
 - other small performance tweaks
 - strbuf %s took up 12% of CPU
 - abbreviation fix is in 2.16
 - gprof / visual studio build to profile
 - delayed checkout
   - 2.14 git batches these clean/smudge commands
   - possible to clean/smudge "not now, try later"
   - helps with downloads with LFS
 - central error reporting for git
   - `git status` logging
   - git config that collects data, pushes to known endpoint with `git push`
   - pre_command and post_command hooks, for logs
   - `gvfs diagnose` that looks at packfiles, etc
   - detect BSODs, etc
   - Dropbox writes out json with index properties and command-line information for status/fetch/push, fork/execs external tool to upload
   - windows trace facility; would be nice to have cross-platform
   - would hosting providers care?
   - zipfile of logs to give when debugging
   - sanitizing data is harder
   - more in a company setting
   - fileshare to upload zipfile
   - most of the errors are using a proxy when they shouldn't, the wrong proxy, or a proxy specific to a particular URL; so an upload endpoint wouldn't work
   - GIT_TRACE is supposed to be that (for proxy)
   - but we need more trace variables
   - series to make tracing cheaper
   - except that curl selects the proxy
   - trace should have an API, so it can call an executable
   - dump to .git/traces/... and everything else happens externally
   - tools like visual studio can't set GIT_TRACE, so
   - sourcetree has seen user environments where commands just take forever
   - third-party tools like perf/strace - could we be better leveraging those?
   - distribute turn-key solution to handout to collect more data?

 - git-sizer reports various measures (trees, biggest checkout, etc)

 - fsmonitor code has some rough edges still, but has proven useful
   - alexmv to upstream the faster Go watchman client, rewrite in C for portability


Multipack index (stolee)
------------------------
 - can't repack the Windows repo, it's too big
 - 150-200 packfiles, because they don't have enough space to have 2 copies to repack
 - searches are no longer O(log N), but rather O(M * log N)
 - one multipack index that tells you which packfile to look in and where in it (sketched below)
 - RFC now; queued up after the commit graph work
 - needs to have interactions with fsck and gc
 - at very least delete the .midx
 - larger file; sum of all idx; would be good to be incremental
 - do it like the split index
 - also for build machines that turn off auto gc, but don't ever gc
 - repo with big blobs that don't delta compress well, isolate into their own packfiles
 - VSTS splits into 4M index files by type
 - takes apart incoming packfiles; this does not re-delta
 - if you order them by traversal order...
 - use thin packs
 - does not delete idx files, supplements them
 - might envision dropping idx files, storing thin packs
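
   A sketch of the lookup a multipack index enables: one table, sorted by
   object id, mapping each object to (pack, offset), so a search is a single
   binary search rather than one per pack.  Data and layout are made up for
   illustration.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct midx_entry {
          const char *oid;    /* abbreviated hex id; the table is sorted by oid */
          int pack_id;
          long offset;
      };

      static const struct midx_entry midx[] = {   /* merged from every pack's .idx */
          { "1a2b3c4d", 0, 1200 },
          { "5e6f7a8b", 2,  400 },
          { "9c0d1e2f", 1, 7040 },
      };

      static int cmp_oid(const void *key, const void *elem)
      {
          return strcmp(key, ((const struct midx_entry *)elem)->oid);
      }

      int main(void)
      {
          const struct midx_entry *hit =
              bsearch("5e6f7a8b", midx, sizeof(midx) / sizeof(midx[0]),
                      sizeof(midx[0]), cmp_oid);
          if (hit)
              printf("found in pack %d at offset %ld\n", hit->pack_id, hit->offset);
          return 0;
      }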


Conservancy update (peff)
-------------------------
 - same business as usual
 - some money ($25k), more than we need, not enough to do anything
 - trademark stuff has settled down
 - clarified policies around trademark
 - project leadership committee of three; Shawn Pearce died, leaving a vacancy
 - charter says a simple majority vote of remaining members to choose new one
 - will take it to the mailing list
 - not a lot of responsibilities; authority over the resources that the project owns (bank account), website
 - explicitly no authority over the development; conservancy says that coding is the domain of the project itself
 - spend on travel to git merge, etc
 - from GSOC / Amazon royalties / donations
 - "enforce" the trademark; lawyer from conservancy informs committee 1/month
 - most requests don't get permission (commercial "git foo"); some do (git logo keycaps)


Git website (peff)
------------------
 - less in danger of falling over than before


New hash (Stefan, etc)
----------------------
 - discussed on the mailing list
 - actual plan checked in to Documentation/technical/hash-function-transition.txt
 - lots of work renaming
 - any actual work with the transition plan?
 - local conversion first; fetch/push have translation table
 - like git-svn
 - also modified pack and index format to have lookup/translation efficiently
 - brian's series to eliminate SHA1 strings from the codebase
 - testsuite is not working well because hardcoded SHA1 values
 - flip a bit in the sha1 computation and see what breaks in the testsuite
 - will also need a way to do the conversion itself; traverse and write out new version
 - without that, can start new repos, but not work on old ones
 - on-disk formats will need to change -- something to keep in mind with new index work
 - documentation describes packfile and index formats
 - what time frame are we talking?
 - public perception question
 - signing commits doesn't help (just signs commit object) unless you "recursive sign"
 - switched to SHA1dc; we detect and reject known collision technique
 - do it now because it takes too long if we start when the collision drops
 - always call it "new hash" to reduce bikeshedding
 - is translation table a backdoor? has it been reviewed by crypto folks?
   - no, but everything gets translated
 - meant to avoid a flag day for entire repositories
 - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
 - will need a wire protocol change
 - v2 might add a capability for newhash
 - "now that you mention md5, it's a good idea"
 - can use md5 to test the conversion
 - is there a technical reason for why not /n/ hashes?
 - the slow step goes away as people converge to the new hash
 - beneficial to make up some fake hash function for testing
 - is there a plan on how we decide which hash function?
 - trust junio to merge commits when appropriate
 - conservancy committee explicitly does not make code decisions
 - waiting will just give better data
 - some hash functions are in silicon (e.g. microsoft cares)
 - any movement in libgit2 / jgit?
   - basic stuff for libgit2; same testsuite problems
   - no work in jgit
 - most optimistic forecast?
   - could be done in 1-2y
 - submodules with one hash function?
   - unable to convert project unless all submodules are converted
   - OO-ing is not a prereq
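
   A sketch of the translation-table idea from the transition plan: objects
   live locally under the new hash, with a bidirectional mapping to their
   SHA-1 names so fetch/push can still talk to old peers.  The fixed table
   and shortened ids below are purely illustrative.

      #include <stdio.h>
      #include <string.h>

      struct oid_pair {
          const char *sha1;       /* name of the object as old peers know it */
          const char *newhash;    /* name of the same object after conversion */
      };

      static const struct oid_pair translation[] = {
          { "aaaa1111", "ffff9999" },
          { "bbbb2222", "eeee8888" },
      };

      static const char *sha1_to_newhash(const char *sha1)
      {
          for (size_t i = 0; i < sizeof(translation) / sizeof(translation[0]); i++)
              if (!strcmp(translation[i].sha1, sha1))
                  return translation[i].newhash;
          return NULL;            /* unknown: object not converted or fetched yet */
      }

      int main(void)
      {
          printf("%s\n", sha1_to_newhash("bbbb2222"));
          return 0;
      }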


Resumable clone (peff)
----------------------
 - also resumable fetch
 - a lot of proposals use a bundle, do that over resumable protocol
 - might be able to put it into the protocol
 - exact byte-for-byte to serve a fetch depends on many things (packfiles, deltas, versions, etc)
 - hash all server variability down
 - if client is cut off halfway, provide the token
 - server asks if on-disk state has changed; throwing up your hands is no worse than now
 - so depends on how often those things change
 - new packs break this, of course
 - reads:writes is 100:1
 - most major hosts have caching infrastructure; protocol change is generic
 - hosting providers can decide to deal with this primitive as they will
 - token can provide a high-water mark for the packs to use
 - how does the client know how far it got?
   - byte offset
   - use same ref advertisement
 - likely `git clone --continue` to resume instead of auto-retry
 - overlaps with ability to send multiple packfiles; or using CDN
   - this is not helpful for CDN use case
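
   A sketch of the "hash all server variability down" idea above: the server
   folds everything that affects the exact pack bytes into a token, and a
   resume is honored only if the token still matches.  FNV-1a stands in for a
   real hash and all the inputs are invented.

      #include <stdint.h>
      #include <stdio.h>

      static uint64_t fnv1a_mix(uint64_t h, const char *s)
      {
          for (; *s; s++) {
              h ^= (unsigned char)*s;
              h *= 1099511628211ull;
          }
          return h;
      }

      static uint64_t resume_token(const char **state, int n)
      {
          uint64_t h = 14695981039346656037ull;   /* FNV-1a offset basis */
          for (int i = 0; i < n; i++)
              h = fnv1a_mix(h, state[i]);
          return h;
      }

      int main(void)
      {
          /* Everything that could change the bytes the server would send. */
          const char *state[] = { "pack-1234.pack", "pack-5678.pack",
                                  "server-version", "delta-settings" };
          printf("resume token: %016llx\n",
                 (unsigned long long)resume_token(state, 4));
          /* The client retries with this token plus a byte offset; if any pack
           * was repacked the token changes and the clone restarts from scratch. */
          return 0;
      }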


Ref table (Stefan)
------------------
 - file format invented by Shawn, used server-side
 - efficient for compressing refs
 - smaller than packed refs file by 60%
 - hashes are stored in binary, with prefix compression on ref names (sketched below)
 - still binary-searchable; gives the block in which a ref would be stored
 - from infrastructure point of view, roughly like packfiles
 - eventually GC'd down to one reftable; so need a GC operation
 - gives a new feature to developers: "what did I fetch this morning"
 - reflog can answer updates; reftable can give the contents of one transaction
 - list made reference to some geometric compression
 - also gives atomic reads with all of its references
 - also gives reflogs for deleted references
 - would mean that reftable only has one global lock
 - but contention likely doesn't matter all that much
 - might get a 20ms delay waiting for the lock
 - also helps with case sensitivity of refs
 - need to keep file/directory conflict checking for backwards compatibility
 - jgit has a reference implementation of reftable
 => multiple people are interested in it; please announce to list if you start working on it
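
   A toy version of the prefix compression mentioned above: refnames are
   sorted, so each entry only needs to record how many leading bytes it
   shares with the previous name plus the differing suffix.  Real reftable
   blocks are more involved; this is just the idea.

      #include <stdio.h>

      static size_t common_prefix(const char *a, const char *b)
      {
          size_t n = 0;
          while (a[n] && a[n] == b[n])
              n++;
          return n;
      }

      int main(void)
      {
          const char *refs[] = {                  /* sorted, as on disk */
              "refs/heads/maint",
              "refs/heads/master",
              "refs/tags/v2.16.0",
              "refs/tags/v2.16.1",
          };
          const char *prev = "";
          for (size_t i = 0; i < sizeof(refs) / sizeof(refs[0]); i++) {
              size_t shared = common_prefix(prev, refs[i]);
              printf("shared=%2zu suffix=%s\n", shared, refs[i] + shared);
              prev = refs[i];
          }
          return 0;
      }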


Recreate merges (Johannes)
--------------------------
 - git for windows has patches atop core git (~570 on 70 branches)
 - steadily working towards upstreaming
 - currently have to be forward-ported each time
 - doesn't use rebase because would linearize the commits
 - git "garden shears" to snip the weeds out of the thicket of branches
 - backend to interactive rebase
 - use `exec` verb to make branch structure
 - works fine for a couple years
 - sequencer is now in C, and more performant / cross-platform
 - patch series has two parts
   - implement new verbs: "label" a work-tree-local ref / "reset" can reset to worktree / "merge"
   - write out interactive rebase script
 - problem with merge is how to represent the merge commit
 - `merge -C deadbeef cafe`
 - can create new merges by omitting `-C`
 - useful for patch series splitting into multiple merges
 - "evil" merges because, e.g. upstream changed a signature of a function
 - newest idea is to use duality of rebase/merge
 - pretend that a rebase was a merge; gives you base commit to merge
 - take the old merge commit; merge new tips as if they had been merged into upstream, with old merge as merge-base
 - intend to introduce atop recreate-merges
 - who needs this, besides GVFS / git for windows?
 - git-imerge to help with rebasing evil-merged mess into a clean series?
 - want a way to train from merge commits and re-apply those?
 - rerere isn't sufficient to push the merge conflicts back; it's too language-agnostic
 - `git rebase -p` would become `--preserve-merges`
 - useful for admins that need to rebase onto rewritten history


Submodules (Stefan)
-------------------
 - want to deprecate `repo` tool
 - google has an internal fork of git that has a submodule workflow
 - doing refactoring first, then modify to add workflows as users want it
 - gave `--recurse-submodules` to more commands
 - in far future want submodules to be more transparent
 - `git commit` should tell you to write two commit messages if you touch things in a submodule
 - unless you specify, you work on the whole super-repo
 - any planned changes to gitlink?
   - no planned changes here
   - `repo` users sometimes want to follow latest master; tells you nothing about a point in time
 - gerrit or CI system can move the subrepo pointers
 - ~1y ago there was a follow-branch feature that confuses users
 - might just be a documentation gap
 - `git submodule update` does let you follow master, though it will show as differences
 - `git mv` does sortof work for submodules
 - submodule merge strategy can solve some of the merge conflicts
 - pull requests to two repos are problematic?  `git pull --rebase --recurse-submodules` will rebase the submodule
 - shared library code does not evolve as fast as main project code


Rebase modality (Stefan)
------------------------
 - problem is `git rebase` forces you into a mode
 - multiple tree entries in a commit which represent the index
 - merge conflict in a rebase could record multiple trees and then keep going
 - would that cause more conflicts later if you defer one? depends on the series
 - if you have a merge, it complains; could instead store higher-stage entries as metadata
 - hardest part is the user interaction; would like to see a side-by-side from the user interaction point of view
 - make sure that earlier conflicts get carried forward
 - some way to dry-run and see which would have conflicts
 - rebase is perhaps not the main gain from this; more that multiple trees may be useful
 - But this leads to an explosion of possibilities, which still doesn't actually include all of them!
 - But the dry run still has value

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-10  0:06 Git Merge contributor summit notes Alex Vandiver
@ 2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
  2018-03-11  0:02   ` Junio C Hamano
  2018-03-12 23:40   ` Jeff King
  2018-03-12 23:33 ` Jeff King
  2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-03-10 13:01 UTC (permalink / raw)
  To: Alex Vandiver
  Cc: git, git, jonathantanmy, bmwill, stolee, sbeller, peff,
	johannes.schindelin


On Sat, Mar 10 2018, Alex Vandiver jotted:

> It was great to meet some of you in person!  Some notes from the
> Contributor Summit at Git Merge are below.  Taken in haste, so
> my apologies if there are any mis-statements.

Thanks a lot for taking these notes. I've read them over and they're all
accurate per my wetware recollection. Adding some things I remember
about various discussions below where I think it may help to clarify
things a bit.

>  - Alex
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>   "Does anyone think there's a compelling reason for git to exist?"
>     - peff
>
>
> Partial clone (Jeff Hostetler / Jonathan Tan)
> ---------------------------------------------
>  - Request that the server not send everything
>  - Motivated by getting Windows into git
>  - Also by not having to fetch large blobs that are in-tree
>  - Allows client to request a clone that excludes some set of objects, with incomplete packfiles
>  - Decoration on objects that include promise for later on-demand backfill
>  - In `master`, have a way of:
>    - omitting all blobs
>    - omitting large blobs
>    - sparse checkout specification stored on server
>  - Hook in read_object to fetch objects in bulk
>
>  - Future work:
>    - A way to fetch blobsizes for virtual checkouts
>    - Give me new blobs that this tree references relative to now
>    - Omit some subset of trees
>    - Modify other commits to exclude omitted blobs
>    - Protocol v2 may have better verbs for sparse specification, etc
>
> Questions:
>  - Reference server implementation?
>    - In git itself
>    - VSTS does not support
>  - What happens if a commit becomes unreachable?  Does promise still apply?
>    - Probably yes?
>    - If the promise is broken, probably crashes
>    - Can differentiate between promise that was made, and one that wasn't
>    => Demanding commitment from server to never GC seems like a strong promise
>  - Interactions with external object db
>    - promises include bulk fetches, as opposed to external db, which is one-at-a-time
>    - dry-run semantics to determine which objects will be needed
>    - very important for small objects, like commits/trees (which is not in `master`, only blobs)
>    - perhaps for protocol V2
>  - server has to promise more, requires some level of online operation
>    - annotate that only some refs are forever?
>    - requires enabling the "fetch any SHA" flags
>    - rebasing might require now-missing objects?
>      - No, to build on them you must have fetched them
>      - Well, building on someone else's work may mean you don't have all of them
>    - server is less aggressive about GC'ing by keeping "weak references" when there are promises?
>    - hosting requires that you be able to forcibly remove information
>  - being able to know where a reference came from?
>    - as being able to know why an object was needed, for more advanced logic
>  - Does `git grep` attempt to fetch blobs that are deferred?
>    - will always attempt to fetch
>    - one fetch per object, even!
>    - might not be true for sparse checkouts
>    - Maybe limit to skipping "binary files"?
>    - Currently sparse checkout grep "works" because grep defaults to looking at the index, not the commit
>    - Does the above behavior for grepping revisions
>    - Don't yet have a flag to exclude grep on non-fetched objects
>    - Should `git grep -L` die if it can't fetch the file?
>    - Need a config option for "should we die, or try to move on"?
>  - What's the endgame?  Only a few codepaths that are aware, or threaded through everywhere?
>    - Fallback to fetch on demand means there's an almost-reasonable fallback
>    - Better prediction with bulk fetching
>    - Are most commands going to _need_ to be sensitive to it?
>    - GVFS has a caching server in the building
>    - A few git commands have been disabled (see recent mail from Stolee); those are likely candidates for code that needs to be aware of de-hydrated objects
>  - Is there an API to know what objects are actually local?
>    - No external API
>    - GVFS has a REST API
>  - Some way to later ask about files?
>    - "virtualized filesystem"?
>    - hook to say "focus on this world of files"
>    - GVFS writes out your index currently
>  - Will this always require turning off reachability checks?
>    - Possibly
>  - Shallow clones, instead of partial?
>    - Don't download the history, just the objects
>    - More of a protocol V2 property
>    - Having all of the trees/commits make this reasonable
>  - GVFS vs this?
>    - GVFS was a first pass
>    - Now trying to mainstream productize that
>    - Goal is to remove features from GVFS and replace with this

As I understood it Microsoft deploys this in a mode where they're not
vulnerable to the caveats noted above, i.e. the server serving this up
only has branches that are fast-forwarded (and never deleted).

However, if you were to build history on a server where you're counting
on lazily getting a blob later and the server breaks that promise, we're
in a state of having corrupted the local repo (most git commands will
just fail).

Some sub-mode where you can declare that only some branches should
implicitly promise that they have lazy blobs would be useful, but it
wasn't clear to me whether such a thing would be very hard to implement.

In any case, this is something that needs active server cooperation, and
is very unlikely to be deployed by people who don't know the caveats
involved, so I for one am all for getting this in even if there's some
significant caveats like that.

> Protocol V2 (Brandon)
> [...]
>  - (peff) Time to deprecate the git anonymous protocol?
>    - Biggest pain to sneak information into
>    - Shawn/Johannes added in additional parameters after a null byte
>    - Bug; if there's anything other than host, then die
>    - But it doesn't check anything after _two_ null bytes.
>    - "Two null bytes, for protocol V2"
>    - Only in use by github and individual users
>    - Would not be too sad if daemon went away
>    - Git for Windows has interest in daemon
>    - Still interested in simple HTTP wrapper?
>    - HTTP deployment could be made eaiser
>    - Useful for unauthenticated internal push
>    - Perhaps make the daemon use HTTPS?  Because it needs to be _simple_
>    - Currently run out of inittab

I think the conclusion was that nobody cares about the git:// protocol,
but people do care about it being super easy to spin up a server, and
currently it's easiest to spin up git://, but we could also ship with
some git-daemon mode that had a stand-alone webserver (or ssh server) to
get around that.

>  - Series as currently out
>    - Only used for local operations
>    - Not confident on remote CURL
>    - Once jgit implementation is done, should be more confident
>    - e.g. authentication may be messed up
>    - only file:// is currently in production
>    - test scripts to exercise HTTP, so only thing unknown is auth
>    - May need interop tests? there is one, but not part as standard tests
>    - Dscho can set up something in VSTS infra to allow these two versions to be tested
>    - Tests should specify their versions; might be as simple as `cd ...; make` and maybe they should be in Travis

FWIW "local operations" here refers to `git clone file://` and the like
which Google apparently does a lot of with git, and is stress testing the
v2 protocol.

> [...]
>  - some hash functions are in silicon (e.g. microsoft cares)

FWIW this refers to https://en.wikipedia.org/wiki/Intel_SHA_extensions &
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500e/CJHDEBAF.html
among others. Previous on-list discussion at
https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
@ 2018-03-11  0:02   ` Junio C Hamano
  2018-03-12 23:40   ` Jeff King
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-03-11  0:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Alex Vandiver, git, git, jonathantanmy, bmwill, stolee, sbeller,
	peff, johannes.schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Sat, Mar 10 2018, Alex Vandiver jotted:
>
>> It was great to meet some of you in person!  Some notes from the
>> Contributor Summit at Git Merge are below.  Taken in haste, so
>> my apologies if there are any mis-statements.
>
> Thanks a lot for taking these notes. I've read them over and they're all
> accurate per my wetware recollection. Adding some things I remember
> about various discussions below where I think it may help to clarify
> things a bit.
>
>>  - Alex

Thanks, both, for sharing.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-10  0:06 Git Merge contributor summit notes Alex Vandiver
  2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
@ 2018-03-12 23:33 ` Jeff King
  2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2018-03-12 23:33 UTC (permalink / raw)
  To: Alex Vandiver
  Cc: git, git, jonathantanmy, bmwill, stolee, sbeller, avarab,
	johannes.schindelin

On Fri, Mar 09, 2018 at 04:06:18PM -0800, Alex Vandiver wrote:

> It was great to meet some of you in person!  Some notes from the
> Contributor Summit at Git Merge are below.  Taken in haste, so
> my apologies if there are any mis-statements.

Thanks very much for these notes!

I think in future years we should do a better job of making sure we have
an official note-taker so that this stuff makes it onto the list. I was
very happy when you announced part-way through the summit that you had
already been taking notes. :)

>   "Does anyone think there's a compelling reason for git to exist?"
>     - peff

Heh, those words did indeed escape my mouth.

Your notes look accurate overall from a brief skim. I'm still post-trip
recovering, but I may try to follow-up and expand on a few areas where I
have thoughts. And I'd encourage others to do the same as a way of
bridging the discussion back to the list.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
  2018-03-11  0:02   ` Junio C Hamano
@ 2018-03-12 23:40   ` Jeff King
  2018-03-13  0:49     ` Brandon Williams
  1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2018-03-12 23:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Alex Vandiver, git, git, jonathantanmy, bmwill, stolee, sbeller,
	johannes.schindelin

On Sat, Mar 10, 2018 at 02:01:14PM +0100, Ævar Arnfjörð Bjarmason wrote:

> >  - (peff) Time to deprecate the git anonymous protocol?
> [...]
> 
> I think the conclusion was that nobody cares about the git:// protocol,
> but people do care about it being super easy to spin up a server, and
> currently it's easiest to spin up git://, but we could also ship with
> some git-daemon mode that had a stand-alone webserver (or ssh server) to
> get around that.

I don't think keeping support for git:// is too onerous at this point
(especially because it should make the jump to protocol v2 with the
rest). But it really is a pretty dated protocol, lacking any kind of
useful security properties (yes, I know, if we're all verifying signed
tags it's great, but realistically people are fetching the tip of master
over a hijack-able TCP connection and running arbitrary code on the
result). It might be nice if it went away completely so we don't have to
warn people off of it.

The only thing git:// really has going over git-over-http right now is
that it doesn't suffer from the stateless-rpc overhead. But if we unify
that behavior in v2, then any advantage goes away.

I do agree we should have _something_ that is easy to spin up. But it
would be wonderful if git-over-http could become that, and we could just
deprecate git://. I suppose it's possible people build clients without
curl, but I suspect that's an extreme minority these days (most third
party hosters don't seem to offer git:// at all).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-12 23:40   ` Jeff King
@ 2018-03-13  0:49     ` Brandon Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Brandon Williams @ 2018-03-13  0:49 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Alex Vandiver, git, git,
	jonathantanmy, stolee, sbeller, johannes.schindelin

On 03/12, Jeff King wrote:
> On Sat, Mar 10, 2018 at 02:01:14PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> > >  - (peff) Time to deprecate the git anonymous protocol?
> > [...]
> > 
> > I think the conclusion was that nobody cares about the git:// protocol,
> > but people do care about it being super easy to spin up a server, and
> > currently it's easiest to spin up git://, but we could also ship with
> > some git-daemon mode that had a stand-alone webserver (or ssh server) to
> > get around that.
> 
> I don't think keeping support for git:// is too onerous at this point
> (especially because it should make the jump to protocol v2 with the
> rest). But it really is a pretty dated protocol, lacking any kind of
> useful security properties (yes, I know, if we're all verifying signed
> tags it's great, but realistically people are fetching the tip of master
> over a hijack-able TCP connection and running arbitrary code on the
> result). It might be nice if it went away completely so we don't have to
> warn people off of it.
> 
> The only thing git:// really has going over git-over-http right now is
> that it doesn't suffer from the stateless-rpc overhead. But if we unify
> that behavior in v2, then any advantage goes away.

It's still my intention to unify this behavior in v2 but then begin
working on improving negotiation as a whole (once v2 is in) so that we
can hopefully get rid of the nasty corner cases that exist in http://.
Since v2 will be hidden behind a config anyway, it may be prudent to
wait until negotiation gets better before we entertain making v2 default
(well there's also needing to wait for hosting providers to begin
supporting it).

> 
> I do agree we should have _something_ that is easy to spin up. But it
> would be wonderful if git-over-http could become that, and we could just
> deprecate git://. I suppose it's possible people build clients without
> curl, but I suspect that's an extreme minority these days (most third
> party hosters don't seem to offer git:// at all).
> 
> -Peff

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-10  0:06 Git Merge contributor summit notes Alex Vandiver
  2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
  2018-03-12 23:33 ` Jeff King
@ 2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
  2018-03-26 17:33   ` Jeff Hostetler
  2018-03-26 20:54   ` Per-object encryption " Jonathan Nieder
  2 siblings, 2 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-03-25 22:58 UTC (permalink / raw)
  To: Alex Vandiver
  Cc: git, git, jonathantanmy, bmwill, stolee, sbeller, peff,
	johannes.schindelin, Jonathan Nieder, Michael Haggerty


On Sat, Mar 10 2018, Alex Vandiver wrote:

> New hash (Stefan, etc)
> ----------------------
>  - discussed on the mailing list
>  - actual plan checked in to Documentation/technical/hash-function-transition.txt
>  - lots of work renaming
>  - any actual work with the transition plan?
>  - local conversion first; fetch/push have translation table
>  - like git-svn
>  - also modified pack and index format to have lookup/translation efficiently
>  - brian's series to eliminate SHA1 strings from the codebase
>  - testsuite is not working well because hardcoded SHA1 values
>  - flip a bit in the sha1 computation and see what breaks in the testsuite
>  - will also need a way to do the conversion itself; traverse and write out new version
>  - without that, can start new repos, but not work on old ones
>  - on-disk formats will need to change -- something to keep in mind with new index work
>  - documentation describes packfile and index formats
>  - what time frame are we talking?
>  - public perception question
>  - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>  - switched to SHA1dc; we detect and reject known collision technique
>  - do it now because it takes too long if we start when the collision drops
>  - always call it "new hash" to reduce bikeshedding
>  - is translation table a backdoor? has it been reviewed by crypto folks?
>    - no, but everything gets translated
>  - meant to avoid a flag day for entire repositories
>  - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>  - will need a wire protocol change
>  - v2 might add a capability for newhash
>  - "now that you mention md5, it's a good idea"
>  - can use md5 to test the conversion
>  - is there a technical reason for why not /n/ hashes?
>  - the slow step goes away as people converge to the new hash
>  - beneficial to make up some fake hash function for testing
>  - is there a plan on how we decide which hash function?
>  - trust junio to merge commits when appropriate
>  - conservancy committee explicitly does not make code decisions
>  - waiting will just give better data
>  - some hash functions are in silicon (e.g. microsoft cares)
>  - any movement in libgit2 / jgit?
>    - basic stuff for libgit2; same testsuite problems
>    - no work in jgit
>  - most optimistic forecast?
>    - could be done in 1-2y
>  - submodules with one hash function?
>    - unable to convert project unless all submodules are converted
>    - OO-ing is not a prereq

Late reply, but one thing I brought up at the time is that we'll want to
keep this code around even after the NewHash migration at least for
testing purposes, should we ever need to move to NewNewHash.

It occurred to me recently that once we have such a layer it could be
(ab)used with some relatively minor changes to do any arbitrary
local-to-remote object content translation, unless I've missed something
(but I just re-read hash-function-transition.txt now...).

E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
remote server so that you upload a GPG encrypted version of all your
blobs, and have your trees reference those blobs.

Because we'd be doing arbitrary translations for all of
commits/trees/blobs this could go further than other bolted-on
encryption solutions for Git. E.g. paths in trees could be encrypted
too, as well as all the content of the commit object that isn't parent
info & the like (but that would have different hashes).

Basically clean/smudge filters on steroids, but for every object in the
repo. Anyone who got a hold of it would still see the shape of the repo
& approximate content size, but other than that it wouldn't be more info
than they'd get via `fast-export --anonymize` now.

I mainly find it interesting because it presents an intersection between a
feature we might want to offer anyway, and something that would stress
the hash transition codepath going forward, to make sure it hasn't all
bitrotted by the time we'll need NewHash->NewNewHash.

Git hosting providers would hate it, but they should probably be
charging users by how much Michael Haggerty's git-sizer tool hates their
repo anyway :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
@ 2018-03-26 17:33   ` Jeff Hostetler
  2018-03-26 17:56     ` Stefan Beller
                       ` (2 more replies)
  2018-03-26 20:54   ` Per-object encryption " Jonathan Nieder
  1 sibling, 3 replies; 17+ messages in thread
From: Jeff Hostetler @ 2018-03-26 17:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Alex Vandiver
  Cc: git, jonathantanmy, bmwill, stolee, sbeller, peff,
	johannes.schindelin, Jonathan Nieder, Michael Haggerty



On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sat, Mar 10 2018, Alex Vandiver wrote:
> 
>> New hash (Stefan, etc)
>> ----------------------
>>   - discussed on the mailing list
>>   - actual plan checked in to Documentation/technical/hash-function-transition.txt
>>   - lots of work renaming
>>   - any actual work with the transition plan?
>>   - local conversion first; fetch/push have translation table
>>   - like git-svn
>>   - also modified pack and index format to have lookup/translation efficiently
>>   - brian's series to eliminate SHA1 strings from the codebase
>>   - testsuite is not working well because hardcoded SHA1 values
>>   - flip a bit in the sha1 computation and see what breaks in the testsuite
>>   - will also need a way to do the conversion itself; traverse and write out new version
>>   - without that, can start new repos, but not work on old ones
>>   - on-disk formats will need to change -- something to keep in mind with new index work
>>   - documentation describes packfile and index formats
>>   - what time frame are we talking?
>>   - public perception question
>>   - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>>   - switched to SHA1dc; we detect and reject known collision technique
>>   - do it now because it takes too long if we start when the collision drops
>>   - always call it "new hash" to reduce bikeshedding
>>   - is translation table a backdoor? has it been reviewed by crypto folks?
>>     - no, but everything gets translated
>>   - meant to avoid a flag day for entire repositories
>>   - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>>   - will need a wire protocol change
>>   - v2 might add a capability for newhash
>>   - "now that you mention md5, it's a good idea"
>>   - can use md5 to test the conversion
>>   - is there a technical reason for why not /n/ hashes?
>>   - the slow step goes away as people converge to the new hash
>>   - beneficial to make up some fake hash function for testing
>>   - is there a plan on how we decide which hash function?
>>   - trust junio to merge commits when appropriate
>>   - conservancy committee explicitly does not make code decisions
>>   - waiting will just give better data
>>   - some hash functions are in silicon (e.g. microsoft cares)
>>   - any movement in libgit2 / jgit?
>>     - basic stuff for libgit2; same testsuite problems
>>     - no work in jgit
>>   - most optimistic forecast?
>>     - could be done in 1-2y
>>   - submodules with one hash function?
>>     - unable to convert project unless all submodules are converted
>>     - OO-ing is not a prereq
> 
> Late reply, but one thing I brought up at the time is that we'll want to
> keep this code around even after the NewHash migration at least for
> testing purposes, should we ever need to move to NewNewHash.
> 
> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
> 
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.
> 
> Because we'd be doing arbitrary translations for all of
> commits/trees/blobs this could go further than other bolted-on
> encryption solutions for Git. E.g. paths in trees could be encrypted
> too, as well as all the content of the commit object that isn't parent
> info & the like (but that would have different hashes).
> 
> Basically clean/smudge filters on steroids, but for every object in the
> repo. Anyone who got a hold of it would still see the shape of the repo
> & approximate content size, but other than that it wouldn't be more info
> than they'd get via `fast-export --anonymize` now.
> 
> I mainly find it interesting because presents an intersection between a
> feature we might want to offer anyway, and something that would stress
> the hash transition codepath going forward, to make sure it hasn't all
> bitrotted by the time we'll need NewHash->NewNewHash.
> 
> Git hosting providers would hate it, but they should probably be
> charging users by how much Michael Haggerty's git-sizer tool hates their
> repo anyway :)
> 

While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID:  the object
type and the raw uncompressed object size.

It would be nice if we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size), and
just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type it is
or how big it is.  This requires uncompressing/undeltafying the object
(or at least decoding enough to get the header).  In the case of missing
objects (partial clone or a gvfs-like projection) it requires either
dynamically fetching the object or asking an object-size-server for the
data.

All of these cases could be eliminated if the type/size were available
in the OID.
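
Purely to illustrate the shape I have in mind (nothing below exists in
git; the names, widths, and layout are made up), something like:

    /* Hypothetical sketch only: an "extended OID" as described above,
     * i.e. hash + 8-bit type + 40-bit raw object size (6 extra bytes). */
    #include <stdint.h>

    #define NEWHASH_RAWSZ 32               /* e.g. a 256-bit hash */

    struct ext_oid {
        unsigned char hash[NEWHASH_RAWSZ];
        unsigned char extra[6];            /* type + size, big-endian */
    };

    static void ext_oid_set_meta(struct ext_oid *oid, unsigned type,
                                 uint64_t size)
    {
        int i;
        oid->extra[0] = type & 0xff;
        for (i = 0; i < 5; i++)            /* 40-bit size, big-endian */
            oid->extra[1 + i] = (size >> (8 * (4 - i))) & 0xff;
    }

    static unsigned ext_oid_type(const struct ext_oid *oid)
    {
        return oid->extra[0];
    }

    static uint64_t ext_oid_size(const struct ext_oid *oid)
    {
        uint64_t size = 0;
        int i;
        for (i = 0; i < 5; i++)
            size = (size << 8) | oid->extra[1 + i];
        return size;                       /* caps at 2^40 - 1, ~1 TiB */
    }

With 40 bits the encodable size caps out at about 1 TiB, which is the
kind of compromise I mean by using only 6 extra bytes.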

Just a thought.  While we are converting to a new hash it seems like
this would be a good time to at least discuss it.

Jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-26 17:33   ` Jeff Hostetler
@ 2018-03-26 17:56     ` Stefan Beller
  2018-03-26 18:54       ` Jeff Hostetler
  2018-03-26 18:05     ` Brandon Williams
  2018-03-26 21:00     ` Including object type and size in object id (Re: Git Merge contributor summit notes) Jonathan Nieder
  2 siblings, 1 reply; 17+ messages in thread
From: Stefan Beller @ 2018-03-26 17:56 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Ævar Arnfjörð Bjarmason, alexmv, git, Jonathan Tan,
	Brandon Williams, Derrick Stolee, Jeff King, Johannes Schindelin,
	Jonathan Nieder, Michael Haggerty

On Mon, Mar 26, 2018 at 10:33 AM Jeff Hostetler <git@jeffhostetler.com>
wrote:



> On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Sat, Mar 10 2018, Alex Vandiver wrote:
> >
> >> New hash (Stefan, etc)
> >> ----------------------
> >>   - discussed on the mailing list
> >>   - actual plan checked in to Documentation/technical/hash-function-transition.txt
> >>   - lots of work renaming
> >>   - any actual work with the transition plan?
> >>   - local conversion first; fetch/push have translation table
> >>   - like git-svn
> >>   - also modified pack and index format to have lookup/translation efficiently
> >>   - brian's series to eliminate SHA1 strings from the codebase
> >>   - testsuite is not working well because hardcoded SHA1 values
> >>   - flip a bit in the sha1 computation and see what breaks in the testsuite
> >>   - will also need a way to do the conversion itself; traverse and write out new version
> >>   - without that, can start new repos, but not work on old ones
> >>   - on-disk formats will need to change -- something to keep in mind with new index work
> >>   - documentation describes packfile and index formats
> >>   - what time frame are we talking?
> >>   - public perception question
> >>   - signing commits doesn't help (just signs commit object) unless you "recursive sign"
> >>   - switched to SHA1dc; we detect and reject known collision technique
> >>   - do it now because it takes too long if we start when the collision drops
> >>   - always call it "new hash" to reduce bikeshedding
> >>   - is translation table a backdoor? has it been reviewed by crypto folks?
> >>     - no, but everything gets translated
> >>   - meant to avoid a flag day for entire repositories
> >>   - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
> >>   - will need a wire protocol change
> >>   - v2 might add a capability for newhash
> >>   - "now that you mention md5, it's a good idea"
> >>   - can use md5 to test the conversion
> >>   - is there a technical reason for why not /n/ hashes?
> >>   - the slow step goes away as people converge to the new hash
> >>   - beneficial to make up some fake hash function for testing
> >>   - is there a plan on how we decide which hash function?
> >>   - trust junio to merge commits when appropriate
> >>   - conservancy committee explicitly does not make code decisions
> >>   - waiting will just give better data
> >>   - some hash functions are in silicon (e.g. microsoft cares)
> >>   - any movement in libgit2 / jgit?
> >>     - basic stuff for libgit2; same testsuite problems
> >>     - no work in jgit
> >>   - most optimistic forecast?
> >>     - could be done in 1-2y
> >>   - submodules with one hash function?
> >>     - unable to convert project unless all submodules are converted
> >>     - OO-ing is not a prereq
> >
> > Late reply, but one thing I brought up at the time is that we'll want to
> > keep this code around even after the NewHash migration at least for
> > testing purposes, should we ever need to move to NewNewHash.
> >
> > It occurred to me recently that once we have such a layer it could be
> > (ab)used with some relatively minor changes to do any arbitrary
> > local-to-remote object content translation, unless I've missed something
> > (but I just re-read hash-function-transition.txt now...).
> >
> > E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> > remote server so that you upload a GPG encrypted version of all your
> > blobs, and have your trees reference those blobs.
> >
> > Because we'd be doing arbitrary translations for all of
> > commits/trees/blobs this could go further than other bolted-on
> > encryption solutions for Git. E.g. paths in trees could be encrypted
> > too, as well as all the content of the commit object that isn't parent
> > info & the like (but that would have different hashes).
> >
> > Basically clean/smudge filters on steroids, but for every object in the
> > repo. Anyone who got a hold of it would still see the shape of the repo
> > & approximate content size, but other than that it wouldn't be more info
> > than they'd get via `fast-export --anonymize` now.
> >
> > I mainly find it interesting because presents an intersection between a
> > feature we might want to offer anyway, and something that would stress
> > the hash transition codepath going forward, to make sure it hasn't all
> > bitrotted by the time we'll need NewHash->NewNewHash.
> >
> > Git hosting providers would hate it, but they should probably be
> > charging users by how much Michael Haggerty's git-sizer tool hates their
> > repo anyway :)
> >

> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.

This would allow crafting invalid OIDs, i.e. the correct hash value
with the wrong object type. (This is a different kind of "invalid"
than we have today, where we either have or do not have the object
named by the hash value. If we don't have it, it may just be unknown
to us, but not "wrong".)

> If would be nice if we could extend the OID to include 6 bytes of data
> (4 or 8 bits for the type and the rest for the raw object size), and
> just say that an OID is a {hash,type,size} tuple.

My suspicion is that lookup cost grows with the size of the OID
(actually worse than linearly, due to CPU caches being finite),
especially given Stolee's work on walking the DAG.  Hence I would
appreciate it if an OID did not carry redundant information and kept
a high information density.

> There are lots of places where we open an object to see what type it is
> or how big it is.  This requires uncompressing/undeltafying the object
> (or at least decoding enough to get the header).  In the case of missing
> objects (partial clone or a gvfs-like projection) it requires either
> dynamically fetching the object or asking an object-size-server for the
> data.

The commit graph could carry this information in another column, too?
Then we would have to add the promised objects to that data structure
as well, but that seems like a much better design IMHO.

> Just a thought.  While we are converting to a new hash it seems like
> this would be a good time to at least discuss it.

I'd agree.

Stefan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-26 17:33   ` Jeff Hostetler
  2018-03-26 17:56     ` Stefan Beller
@ 2018-03-26 18:05     ` Brandon Williams
  2018-04-07 20:37       ` Jakub Narebski
  2018-03-26 21:00     ` Including object type and size in object id (Re: Git Merge contributor summit notes) Jonathan Nieder
  2 siblings, 1 reply; 17+ messages in thread
From: Brandon Williams @ 2018-03-26 18:05 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Ævar Arnfjörð Bjarmason, Alex Vandiver, git,
	jonathantanmy, stolee, sbeller, peff, johannes.schindelin,
	Jonathan Nieder, Michael Haggerty

On 03/26, Jeff Hostetler wrote:
> 
> 
> On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> > 
> > On Sat, Mar 10 2018, Alex Vandiver wrote:
> > 
> > > New hash (Stefan, etc)
> > > ----------------------
> > >   - discussed on the mailing list
> > >   - actual plan checked in to Documentation/technical/hash-function-transition.txt
> > >   - lots of work renaming
> > >   - any actual work with the transition plan?
> > >   - local conversion first; fetch/push have translation table
> > >   - like git-svn
> > >   - also modified pack and index format to have lookup/translation efficiently
> > >   - brian's series to eliminate SHA1 strings from the codebase
> > >   - testsuite is not working well because hardcoded SHA1 values
> > >   - flip a bit in the sha1 computation and see what breaks in the testsuite
> > >   - will also need a way to do the conversion itself; traverse and write out new version
> > >   - without that, can start new repos, but not work on old ones
> > >   - on-disk formats will need to change -- something to keep in mind with new index work
> > >   - documentation describes packfile and index formats
> > >   - what time frame are we talking?
> > >   - public perception question
> > >   - signing commits doesn't help (just signs commit object) unless you "recursive sign"
> > >   - switched to SHA1dc; we detect and reject known collision technique
> > >   - do it now because it takes too long if we start when the collision drops
> > >   - always call it "new hash" to reduce bikeshedding
> > >   - is translation table a backdoor? has it been reviewed by crypto folks?
> > >     - no, but everything gets translated
> > >   - meant to avoid a flag day for entire repositories
> > >   - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
> > >   - will need a wire protocol change
> > >   - v2 might add a capability for newhash
> > >   - "now that you mention md5, it's a good idea"
> > >   - can use md5 to test the conversion
> > >   - is there a technical reason for why not /n/ hashes?
> > >   - the slow step goes away as people converge to the new hash
> > >   - beneficial to make up some fake hash function for testing
> > >   - is there a plan on how we decide which hash function?
> > >   - trust junio to merge commits when appropriate
> > >   - conservancy committee explicitly does not make code decisions
> > >   - waiting will just give better data
> > >   - some hash functions are in silicon (e.g. microsoft cares)
> > >   - any movement in libgit2 / jgit?
> > >     - basic stuff for libgit2; same testsuite problems
> > >     - no work in jgit
> > >   - most optimistic forecast?
> > >     - could be done in 1-2y
> > >   - submodules with one hash function?
> > >     - unable to convert project unless all submodules are converted
> > >     - OO-ing is not a prereq
> > 
> > Late reply, but one thing I brought up at the time is that we'll want to
> > keep this code around even after the NewHash migration at least for
> > testing purposes, should we ever need to move to NewNewHash.
> > 
> > It occurred to me recently that once we have such a layer it could be
> > (ab)used with some relatively minor changes to do any arbitrary
> > local-to-remote object content translation, unless I've missed something
> > (but I just re-read hash-function-transition.txt now...).
> > 
> > E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> > remote server so that you upload a GPG encrypted version of all your
> > blobs, and have your trees reference those blobs.
> > 
> > Because we'd be doing arbitrary translations for all of
> > commits/trees/blobs this could go further than other bolted-on
> > encryption solutions for Git. E.g. paths in trees could be encrypted
> > too, as well as all the content of the commit object that isn't parent
> > info & the like (but that would have different hashes).
> > 
> > Basically clean/smudge filters on steroids, but for every object in the
> > repo. Anyone who got a hold of it would still see the shape of the repo
> > & approximate content size, but other than that it wouldn't be more info
> > than they'd get via `fast-export --anonymize` now.
> > 
> > I mainly find it interesting because presents an intersection between a
> > feature we might want to offer anyway, and something that would stress
> > the hash transition codepath going forward, to make sure it hasn't all
> > bitrotted by the time we'll need NewHash->NewNewHash.
> > 
> > Git hosting providers would hate it, but they should probably be
> > charging users by how much Michael Haggerty's git-sizer tool hates their
> > repo anyway :)
> > 
> 
> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.
> 
> If would be nice if we could extend the OID to include 6 bytes of data
> (4 or 8 bits for the type and the rest for the raw object size), and
> just say that an OID is a {hash,type,size} tuple.
> 
> There are lots of places where we open an object to see what type it is
> or how big it is.  This requires uncompressing/undeltafying the object
> (or at least decoding enough to get the header).  In the case of missing
> objects (partial clone or a gvfs-like projection) it requires either
> dynamically fetching the object or asking an object-size-server for the
> data.
> 
> All of these cases could be eliminated if the type/size were available
> in the OID.
> 
> Just a thought.  While we are converting to a new hash it seems like
> this would be a good time to at least discuss it.

Echoing what Stefan said.  I don't think it's a good idea to embed this
sort of data into the OID.  There are a lot of reasons, but one of them
is that it would gate access to this data on completing the hash
transition (which could very well still be years away).

I think a much better approach would be to create a metadata structure
(much like the commit graph that stolee has been working on) which can
store this data alongside the objects (but not in the packfiles
themselves).  It could be a stacking structure which is periodically
coalesced, and we could add a wire feature to fetch this metadata from
the server upon fetching objects.
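
As a rough sketch of the sort of stacked, out-of-band table I mean (all
of the names below are made up; nothing like this exists in git today):

    /* Hypothetical sketch: an out-of-band {oid -> type,size} table,
     * kept in layers that can be coalesced periodically. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_RAWSZ 32

    struct objmeta_entry {
        unsigned char oid[HASH_RAWSZ];
        uint8_t type;                   /* commit/tree/blob/tag */
        uint64_t size;                  /* raw uncompressed size */
    };

    struct objmeta_layer {
        struct objmeta_entry *entries;  /* sorted by oid */
        size_t nr;
        struct objmeta_layer *older;    /* next layer down the stack */
    };

    /* Binary-search each layer, newest first. */
    static const struct objmeta_entry *
    objmeta_lookup(const struct objmeta_layer *layer,
                   const unsigned char *oid)
    {
        for (; layer; layer = layer->older) {
            size_t lo = 0, hi = layer->nr;
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                int cmp = memcmp(oid, layer->entries[mid].oid, HASH_RAWSZ);
                if (!cmp)
                    return &layer->entries[mid];
                if (cmp < 0)
                    hi = mid;
                else
                    lo = mid + 1;
            }
        }
        return NULL;  /* fall back to opening the object, or the wire */
    }

Coalescing would just merge layers into one sorted array, and the wire
feature would ship entries for whatever objects are being fetched.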

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-26 17:56     ` Stefan Beller
@ 2018-03-26 18:54       ` Jeff Hostetler
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff Hostetler @ 2018-03-26 18:54 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Ævar Arnfjörð Bjarmason, alexmv, git, Jonathan Tan,
	Brandon Williams, Derrick Stolee, Jeff King, Johannes Schindelin,
	Jonathan Nieder, Michael Haggerty



On 3/26/2018 1:56 PM, Stefan Beller wrote:
> On Mon, Mar 26, 2018 at 10:33 AM Jeff Hostetler <git@jeffhostetler.com>
> wrote:
> 
> 
> 
>> On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Sat, Mar 10 2018, Alex Vandiver wrote:
>>>
>>>> New hash (Stefan, etc)
>>>> ----------------------
>>>>    - discussed on the mailing list
>>>>    - actual plan checked in to Documentation/technical/hash-function-transition.txt
>>>>    - lots of work renaming
>>>>    - any actual work with the transition plan?
>>>>    - local conversion first; fetch/push have translation table
>>>>    - like git-svn
>>>>    - also modified pack and index format to have lookup/translation efficiently
>>>>    - brian's series to eliminate SHA1 strings from the codebase
>>>>    - testsuite is not working well because hardcoded SHA1 values
>>>>    - flip a bit in the sha1 computation and see what breaks in the testsuite
>>>>    - will also need a way to do the conversion itself; traverse and write out new version
>>>>    - without that, can start new repos, but not work on old ones
>>>>    - on-disk formats will need to change -- something to keep in mind with new index work
>>>>    - documentation describes packfile and index formats
>>>>    - what time frame are we talking?
>>>>    - public perception question
>>>>    - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>>>>    - switched to SHA1dc; we detect and reject known collision technique
>>>>    - do it now because it takes too long if we start when the collision drops
>>>>    - always call it "new hash" to reduce bikeshedding
>>>>    - is translation table a backdoor? has it been reviewed by crypto folks?
>>>>      - no, but everything gets translated
>>>>    - meant to avoid a flag day for entire repositories
>>>>    - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>>>>    - will need a wire protocol change
>>>>    - v2 might add a capability for newhash
>>>>    - "now that you mention md5, it's a good idea"
>>>>    - can use md5 to test the conversion
>>>>    - is there a technical reason for why not /n/ hashes?
>>>>    - the slow step goes away as people converge to the new hash
>>>>    - beneficial to make up some fake hash function for testing
>>>>    - is there a plan on how we decide which hash function?
>>>>    - trust junio to merge commits when appropriate
>>>>    - conservancy committee explicitly does not make code decisions
>>>>    - waiting will just give better data
>>>>    - some hash functions are in silicon (e.g. microsoft cares)
>>>>    - any movement in libgit2 / jgit?
>>>>      - basic stuff for libgit2; same testsuite problems
>>>>      - no work in jgit
>>>>    - most optimistic forecast?
>>>>      - could be done in 1-2y
>>>>    - submodules with one hash function?
>>>>      - unable to convert project unless all submodules are converted
>>>>      - OO-ing is not a prereq
>>>
>>> Late reply, but one thing I brought up at the time is that we'll want to
>>> keep this code around even after the NewHash migration at least for
>>> testing purposes, should we ever need to move to NewNewHash.
>>>
>>> It occurred to me recently that once we have such a layer it could be
>>> (ab)used with some relatively minor changes to do any arbitrary
>>> local-to-remote object content translation, unless I've missed something
>>> (but I just re-read hash-function-transition.txt now...).
>>>
>>> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
>>> remote server so that you upload a GPG encrypted version of all your
>>> blobs, and have your trees reference those blobs.
>>>
>>> Because we'd be doing arbitrary translations for all of
>>> commits/trees/blobs this could go further than other bolted-on
>>> encryption solutions for Git. E.g. paths in trees could be encrypted
>>> too, as well as all the content of the commit object that isn't parent
>>> info & the like (but that would have different hashes).
>>>
>>> Basically clean/smudge filters on steroids, but for every object in the
>>> repo. Anyone who got a hold of it would still see the shape of the repo
>>> & approximate content size, but other than that it wouldn't be more info
>>> than they'd get via `fast-export --anonymize` now.
>>>
>>> I mainly find it interesting because presents an intersection between a
>>> feature we might want to offer anyway, and something that would stress
>>> the hash transition codepath going forward, to make sure it hasn't all
>>> bitrotted by the time we'll need NewHash->NewNewHash.
>>>
>>> Git hosting providers would hate it, but they should probably be
>>> charging users by how much Michael Haggerty's git-sizer tool hates their
>>> repo anyway :)
>>>
> 
>> While we are converting to a new hash function, it would be nice
>> if we could add a couple of fields to the end of the OID:  the object
>> type and the raw uncompressed object size.
> 
> This would allow to craft invalid OIDs, i.e. the correct hash value with
> the wrong object type. (This is different field of "invalid" compared to
> today, where we either have or do not have the object named by the
> hash value. If we don't have it, it may be just unknown to us, but not
> "wrong".)

An invalid OID (such as a wrong object type) could be detected as soon
as we open the object and read the header -- just as the hash can be
verified when the object is read.

The hash value that we use to ask for an object came from somewhere.
Either we have the raw content and are asking if the ODB already has
a copy -or- we have a commit or tree object that references an OID
and we want to dive into objects it references, such as parent-commits,
sub-trees, or blobs.  In the former, we can compute the correct OID tuple
as we compute the OID-hash now.  In the latter, all of the containing
objects would already carry the augmented OID tuple, and we can ask the
have/not-have question on the OID tuple as before.  In all of those
cases, we would not ask for an object with the correct hash but the
wrong type.

If the containing-objects have the OID tuple, then the size/type is
explicitly baked into the hash of the parent object and we guard against
another form of collision/extension attack.
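
To sketch that check (the "<type> <size>" header is the usual loose
object header; the function name and tuple fields are made up):

    /* Hypothetical sketch: after opening an object, compare its header
     * against the type/size the extended OID claimed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int check_oid_claim(const char *hdr,     /* e.g. "blob 1234" */
                               const char *claimed_type,
                               uint64_t claimed_size)
    {
        char type[16];
        unsigned long long size;

        if (sscanf(hdr, "%15s %llu", type, &size) != 2)
            return -1;                  /* malformed header */
        if (strcmp(type, claimed_type) || size != claimed_size)
            return -1;                  /* OID lied about type/size */
        return 0;
    }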

> 
>> If would be nice if we could extend the OID to include 6 bytes of data
>> (4 or 8 bits for the type and the rest for the raw object size), and
>> just say that an OID is a {hash,type,size} tuple.
> 
> My suspicion is that the size of the OID is directly proportional to
> the speed of lookup (actually worse than linear, due to CPU caches
> being finite), specifically given Stollees work on walking the DAG.
> Hence I would appreciate if an OID would not contain redundant
> information and have a high information density.
> 
>> There are lots of places where we open an object to see what type it is
>> or how big it is.  This requires uncompressing/undeltafying the object
>> (or at least decoding enough to get the header).  In the case of missing
>> objects (partial clone or a gvfs-like projection) it requires either
>> dynamically fetching the object or asking an object-size-server for the
>> data.
> 
> The commit graph could have these infos in another column, too?
> Then we would have to add the promised objects
> to that data structure as well, but that would look like
> a way better design IMHO.
> 
>> Just a thought.  While we are converting to a new hash it seems like
>> this would be a good time to at least discuss it.
> 
> I'd agree.
> 
> Stefan
> 

Thanks
Jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Per-object encryption (Re: Git Merge contributor summit notes)
  2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
  2018-03-26 17:33   ` Jeff Hostetler
@ 2018-03-26 20:54   ` Jonathan Nieder
  2018-03-26 21:22     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 17+ messages in thread
From: Jonathan Nieder @ 2018-03-26 20:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Alex Vandiver, git, git, jonathantanmy, bmwill, stolee, sbeller,
	peff, johannes.schindelin, Michael Haggerty

Hi Ævar,

Ævar Arnfjörð Bjarmason wrote:

> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
>
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.

Interesting!

To be clear, this would only work with deterministic encryption.
Normal GPG encryption would not have the round-tripping properties
required by the design.

If I understand correctly, it also requires both sides of the
connection to have access to the encryption key.  Otherwise they
cannot perform ordinary operations like revision walks.  So I'm not
seeing a huge advantage over ordinary transport-layer encryption.

That said, it's an interesting idea --- thanks for that.  I'm changing
the subject line since otherwise there's no way I'll find this again. :)

Jonathan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Including object type and size in object id (Re: Git Merge contributor summit notes)
  2018-03-26 17:33   ` Jeff Hostetler
  2018-03-26 17:56     ` Stefan Beller
  2018-03-26 18:05     ` Brandon Williams
@ 2018-03-26 21:00     ` Jonathan Nieder
  2018-03-26 21:42       ` Jeff Hostetler
  2018-03-26 22:40       ` Junio C Hamano
  2 siblings, 2 replies; 17+ messages in thread
From: Jonathan Nieder @ 2018-03-26 21:00 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Ævar Arnfjörð Bjarmason, Alex Vandiver, git,
	jonathantanmy, bmwill, stolee, sbeller, peff, johannes.schindelin,
	Michael Haggerty

(administrivia: please omit parts of the text you are replying to that
 are not relevant to the reply.  This makes it easier to see what you're
 replying to, especially in mail readers that don't hide quoted text
 by default.)

Hi Jeff,

Jeff Hostetler wrote:
[long quote snipped]

> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.
>
> If would be nice if we could extend the OID to include 6 bytes of data
> (4 or 8 bits for the type and the rest for the raw object size), and
> just say that an OID is a {hash,type,size} tuple.
>
> There are lots of places where we open an object to see what type it is
> or how big it is.  This requires uncompressing/undeltafying the object
> (or at least decoding enough to get the header).  In the case of missing
> objects (partial clone or a gvfs-like projection) it requires either
> dynamically fetching the object or asking an object-size-server for the
> data.
>
> All of these cases could be eliminated if the type/size were available
> in the OID.

This implies a limit on the object size (e.g. 5 bytes in your
example).  What happens when someone wants to encode an object larger
than that limit?

This also decreases the number of bits available for the hash, but
that shouldn't be a big issue.

Aside from those two, I don't see any downsides.  It would mean that
tree objects contain information about the sizes of blobs contained
there, which helps with virtual file systems.  It's also possible to
do that without putting the size in the object id, but maybe having it
in the object id is simpler.

Will think more about this.

Thanks for the idea,
Jonathan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-object encryption (Re: Git Merge contributor summit notes)
  2018-03-26 20:54   ` Per-object encryption " Jonathan Nieder
@ 2018-03-26 21:22     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-03-26 21:22 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Alex Vandiver, git, git, jonathantanmy, bmwill, stolee, sbeller,
	peff, johannes.schindelin, Michael Haggerty


On Mon, Mar 26 2018, Jonathan Nieder wrote:

> Hi Ævar,
>
> Ævar Arnfjörð Bjarmason wrote:
>
>> It occurred to me recently that once we have such a layer it could be
>> (ab)used with some relatively minor changes to do any arbitrary
>> local-to-remote object content translation, unless I've missed something
>> (but I just re-read hash-function-transition.txt now...).
>>
>> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
>> remote server so that you upload a GPG encrypted version of all your
>> blobs, and have your trees reference those blobs.
>
> Interesting!
>
> To be clear, this would only work with deterministic encryption.
> Normal GPG encryption would not have the round-tripping properties
> required by the design.

Right, sorry. I was being lazy. For simplicity let's say rot13 or some
other deterministic algorithm.

> If I understand correctly, it also requires both sides of the
> connection to have access to the encryption key.  Otherwise they
> cannot perform ordinary operations like revision walks.  So I'm not
> seeing a huge advantage over ordinary transport-layer encryption.
>
> That said, it's an interesting idea --- thanks for that.  I'm changing
> the subject line since otherwise there's no way I'll find this again. :)

In the specific implementation I have in mind, only one side would have
the key; we'd encrypt just up to the point where the repository would
still pass fsck. But of course once we had that facility we could do
any arbitrary translation.

I.e. consider the latest commit in git.git:

    commit 90bbd502d54fe920356fa9278055dc9c9bfe9a56
    tree 5539308dc384fd11055be9d6a0cc1cce7d495150
    parent 085f5f95a2723e8f9f4d037c01db5b786355ba49
    parent d32eb83c1db7d0a8bb54fe743c6d1dd674d372c5
    author Junio C Hamano <gitster@pobox.com> 1521754611 -0700
    committer Junio C Hamano <gitster@pobox.com> 1521754611 -0700

        Sync with Git 2.16.3

With rot13 "encryption" it would be:

    commit <different hash>
    tree <different hash>
    parent <different hash>
    parent <different hash>
    author Whavb P Unznab <tvgfgre@cbobk.pbz> 1521754611 -0700
    committer Whavb P Unznab <tvgfgre@cbobk.pbz> 1521754611 -0700

        Flap jvgu Tvg 2.16.3

And an ls-tree on that tree hash would instead of README.md give you:

    100644 blob <different hash> ERNQZR.zq

And inspecting that blob would give you:

    # Rot13'd "Hello, World!"
    Uryyb, Jbeyq!

So obviously for the encryption use-case such a repo would leak a lot of
info compared to just uploading the fast-export version of it
periodically as one big encrypted blob to store somewhere, but the
advantages would be:

 * It's better than existing "just munge the blobs" encryption solutions
   bolted on top of git, because at least you encrypt the commit
   message, author names & filenames.

 * Since it would be a valid repo even without the key, you could use
   git hosting solutions for it, similar to checking in encrypted blobs
   in existing git repos.

 * As noted, it could be a permanent stress test on the SHA-1<->NewHash
   codepath.

   I can't think of a reason why, once we have that, we couldn't add
   the equivalent of clean/smudge filters.

   We need to unpack & repack & re-hash all the stuff we send over the
   wire anyway, so we can munge it as it goes in/out as long as the same
   input values always yield the same output values (a toy sketch of
   that munge step follows below).
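
To make that munge step concrete, a toy sketch (rot13 only; the hard
part -- rewriting the OIDs each object references and re-hashing
everything -- is not shown):

    /* Toy deterministic "cipher" for the per-object translation layer
     * discussed above.  rot13 is its own inverse, so the same function
     * encodes and decodes. */
    #include <stddef.h>

    static void rot13_buf(char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            char c = buf[i];
            if (c >= 'a' && c <= 'z')
                buf[i] = 'a' + (c - 'a' + 13) % 26;
            else if (c >= 'A' && c <= 'Z')
                buf[i] = 'A' + (c - 'A' + 13) % 26;
        }
    }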

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Including object type and size in object id (Re: Git Merge contributor summit notes)
  2018-03-26 21:00     ` Including object type and size in object id (Re: Git Merge contributor summit notes) Jonathan Nieder
@ 2018-03-26 21:42       ` Jeff Hostetler
  2018-03-26 22:40       ` Junio C Hamano
  1 sibling, 0 replies; 17+ messages in thread
From: Jeff Hostetler @ 2018-03-26 21:42 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, Alex Vandiver, git,
	jonathantanmy, bmwill, stolee, sbeller, peff, johannes.schindelin,
	Michael Haggerty



On 3/26/2018 5:00 PM, Jonathan Nieder wrote:
> Jeff Hostetler wrote:
> [long quote snipped]
> 
>> While we are converting to a new hash function, it would be nice
>> if we could add a couple of fields to the end of the OID:  the object
>> type and the raw uncompressed object size.
>>
>> If would be nice if we could extend the OID to include 6 bytes of data
>> (4 or 8 bits for the type and the rest for the raw object size), and
>> just say that an OID is a {hash,type,size} tuple.
>>
>> There are lots of places where we open an object to see what type it is
>> or how big it is.  This requires uncompressing/undeltafying the object
>> (or at least decoding enough to get the header).  In the case of missing
>> objects (partial clone or a gvfs-like projection) it requires either
>> dynamically fetching the object or asking an object-size-server for the
>> data.
>>
>> All of these cases could be eliminated if the type/size were available
>> in the OID.
> 
> This implies a limit on the object size (e.g. 5 bytes in your
> example).  What happens when someone wants to encode an object larger
> than that limit?

We could add a full uint64 to the tail end of the hash, but we
currently don't handle blobs/objects larger than 4GB anyway, right?

5 bytes for the size is just a compromise -- 1TB blobs would be
terrible to think about...
  
> 
> This also decreases the number of bits available for the hash, but
> that shouldn't be a big issue.

I was suggesting extending the OIDs by 6 bytes while we are changing
the hash function.

> Aside from those two, I don't see any downsides.  It would mean that
> tree objects contain information about the sizes of blobs contained
> there, which helps with virtual file systems.  It's also possible to
> do that without putting the size in the object id, but maybe having it
> in the object id is simpler.
> 
> Will think more about this.
> 
> Thanks for the idea,
> Jonathan
> 

Thanks
Jeff


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Including object type and size in object id (Re: Git Merge contributor summit notes)
  2018-03-26 21:00     ` Including object type and size in object id (Re: Git Merge contributor summit notes) Jonathan Nieder
  2018-03-26 21:42       ` Jeff Hostetler
@ 2018-03-26 22:40       ` Junio C Hamano
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-03-26 22:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff Hostetler, Ævar Arnfjörð Bjarmason,
	Alex Vandiver, git, jonathantanmy, bmwill, stolee, sbeller, peff,
	johannes.schindelin, Michael Haggerty

Jonathan Nieder <jrnieder@gmail.com> writes:

> This implies a limit on the object size (e.g. 5 bytes in your
> example).  What happens when someone wants to encode an object larger
> than that limit?
>
> This also decreases the number of bits available for the hash, but
> that shouldn't be a big issue.

I actually thought that the latter "downside" makes the object name
a tad larger.

But let's not go there, really.

"X is handy if we can get it on the surface without looking into it"
will grow.  Somebody may want to have the generation number of a
commit in the commit object name.  Yet another somebody may want to
be able to quickly learn the object name for the top-level tree from
the commit object name alone.  We need to stop somewhere, and as
already suggested in the thread(s), an auxiliary look-up table is the
better way to go, encoding nothing in the name; we are going to need
such a look-up table anyway, because it is unrealistic to encode
everything we would want in the name.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Git Merge contributor summit notes
  2018-03-26 18:05     ` Brandon Williams
@ 2018-04-07 20:37       ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-04-07 20:37 UTC (permalink / raw)
  To: Brandon Williams
  Cc: Jeff Hostetler, Ævar Arnfjörð Bjarmason,
	Alex Vandiver, git, jonathantanmy, stolee, sbeller, peff,
	johannes.schindelin, Jonathan Nieder, Michael Haggerty

Brandon Williams <bmwill@google.com> writes:
> On 03/26, Jeff Hostetler wrote:

[...]
>> All of these cases could be eliminated if the type/size were available
>> in the OID.
>> 
>> Just a thought.  While we are converting to a new hash it seems like
>> this would be a good time to at least discuss it.
>
> Echoing what Stefan said.  I don't think its a good idea to embed this
> sort of data into the OID.  There are a lot of reasons but one of them
> being that would block having access to this data behind completing the
> hash transition (which could very well still be years away from
> completing).
>
> I think that a much better approach would be to create a meta-data data
> structure, much like the commit graph that stolee has been working on)
> which can store this data along side the objects (but not in the
> packfiles themselves).  It could be a stacking structure which is
> periodically coalesced and we could add in a wire feature to fetch this
> meta data from the server upon fetching objects.

Well, from what I remember, the type of the object is available in the
bitmap file for a packfile (if one enables creating them).  There are
four compressed bit vectors, one per object type, with the i-th bit set
to 1 if the i-th object in the packfile is of the given type.
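
Conceptually (ignoring the EWAH compression the real .bitmap file uses,
and with made-up names) the lookup is just:

    /* Conceptual sketch only: four per-type bit vectors indexed by the
     * object's position in the packfile. */
    #include <stdint.h>

    enum obj_kind { KIND_COMMIT, KIND_TREE, KIND_BLOB, KIND_TAG, KIND_NR };

    struct type_bitmaps {
        const uint8_t *bits[KIND_NR];   /* one vector per object type */
        uint32_t nr_objects;
    };

    /* Return the kind of the i-th object in the packfile, or -1. */
    static int object_kind_at(const struct type_bitmaps *b, uint32_t i)
    {
        int k;
        if (i >= b->nr_objects)
            return -1;
        for (k = 0; k < KIND_NR; k++)
            if (b->bits[k][i / 8] & (1u << (i % 8)))
                return k;
        return -1;
    }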

Just FYI.
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-04-07 20:37 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-10  0:06 Git Merge contributor summit notes Alex Vandiver
2018-03-10 13:01 ` Ævar Arnfjörð Bjarmason
2018-03-11  0:02   ` Junio C Hamano
2018-03-12 23:40   ` Jeff King
2018-03-13  0:49     ` Brandon Williams
2018-03-12 23:33 ` Jeff King
2018-03-25 22:58 ` Ævar Arnfjörð Bjarmason
2018-03-26 17:33   ` Jeff Hostetler
2018-03-26 17:56     ` Stefan Beller
2018-03-26 18:54       ` Jeff Hostetler
2018-03-26 18:05     ` Brandon Williams
2018-04-07 20:37       ` Jakub Narebski
2018-03-26 21:00     ` Including object type and size in object id (Re: Git Merge contributor summit notes) Jonathan Nieder
2018-03-26 21:42       ` Jeff Hostetler
2018-03-26 22:40       ` Junio C Hamano
2018-03-26 20:54   ` Per-object encryption " Jonathan Nieder
2018-03-26 21:22     ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).