* Notes from the Git Contributor's Summit, 2022
From: Taylor Blau @ 2022-09-29 19:17 UTC
  To: git

It was wonderful to see everybody in person again a couple of weeks ago
at Git Merge :-).

Now that things have begun to settle after the conference, I polished up
the notes that were taken during the Contributor's Summit to share on
the list.

The notes are also available in Google Docs, for folks who prefer to
view them there, at the following link:

    https://docs.google.com/document/d/1gVGZtkCLF3CWPt3xQnIJUy8XP702zGSxvOPk1r-6_8s

At the Contributor's Summit, we discussed the following topics:

  - Bundle URIs (12 votes)
  - State of sha256 transition (8 votes)
  - Timeline: Deleting merge-recursive, remapping 'recursive' to 'ort' (8 votes)
  - git clone --filter=commit:0 (8 votes)
  - Managing ever growing pack index sizes on servers (8 votes)
  - The next year of bitmap work (5 votes)
  - Server side merges and rebases (& new rebase/cherry-pick UI?) (7 votes)
  - State of sparsity developments and future plans (7 votes)
  - Ideas for speeding up object connectivity checks in git-receive-pack (6 votes)
  - Alternative ways to write to VFS-backed worktrees (6 votes)
  - How to run git securely in shared services (6 votes)

The list of all topics proposed (and the number of votes each received)
is here:

    https://docs.google.com/spreadsheets/d/1QhkUkYvqtGJtN7ViiTmrfcP3s0cXgqyAIACRD5Q24Mg

Some topics were combined and others didn't have a note-taker, but the
vast majority did.

I'll send the broken-out notes for each topic that did have a note-taker
in a response to this message for posterity, and so folks can continue
the discussion on the list.

(As an aside, if you have any feedback about how the Contributor's
Summit went, please feel free to share it with me off-list, as we are
already starting to put together plans for next year's Git Merge :-)).

Thanks,
Taylor


* [TOPIC 1/8] Bundle URIs
From: Taylor Blau @ 2022-09-29 19:19 UTC
  To: git

# Bundle URIs (Stolee)

- Unlike packfile URIs, bundles include refs and do not need to be
	deltified against what the server sends
- Doc checked into Documentation/technical
- The URI can be provided by the user at the CLI or advertised by the
	server (see the example after these notes)
- Most users won't notice anything different when they git-clone; the
	benefit goes mainly to Git hosting providers, allowing them to offload
	data to CDNs closer to the client.
- With bundle files, you can download them, start off from there, and
	fetch the objects you're missing in the regular manner.
- Jrnieder: Packfile URIs and bundle URIs are trying to achieve the
	same thing. How can we avoid duplicating effort? E.g., how can we
	prevent the client from leaking information to a possibly untrusted
	server?
- Stolee: Are you wanting to provide a way to do authentication?
- Jrnieder: Analogy to the web - you don't want to leak information to
	websites you don't trust. The security model is pretty complicated, we
	don't want to replicate things like same origin policies.
- Stolee: So, e.g. the server provides a hash of the content expected at
	the bundle URI and the client can verify? We wanted to explicitly
	avoid that because we don't want the server and bundle provider to
	need to know anything about each other.
- Jrnieder: Compare to packfile URIs - packfile URIs are only
	advertised by the server, so the security model is mostly the same as
	a "regular" fetch/clone
- Jonathantanmy: Another difference: the objects in bundles must be
	associated with refs, you can't just have e.g. large objects.
	Packfiles can contain arbitrary objects.
- Stolee: Let's talk about the security model more on the mailing list
- Ævar: We're also open for a breakout session on this topic
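
For reference, the client-side entry point mentioned above shipped as
`git clone --bundle-uri` in Git 2.38, released around the time of this
summit; the URLs below are placeholders:

    $ git clone --bundle-uri=https://cdn.example.com/repo.bundle \
          https://git.example.com/repo.git

The client downloads and unbundles the bundle first, then runs a normal
fetch against the origin server for whatever is still missing.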


* [TOPIC 2/8] State of SHA-256 transition
From: Taylor Blau @ 2022-09-29 19:19 UTC
  To: git

# SHA-256 transition (brian)

- (brian) Functional version of "state four" implementation with only
	SHA-256 in the repository
- Interop work (to use sha1 and sha256) is mostly stalled, brian is
	mostly not working on it at the moment
- Current implementation is partially functional, though failing a lot
	of tests. Can write SHA-256 objects into the repo; per the transition
	plan, it will write a loose mapping between SHA-1 and SHA-256, along
	with pack index v3 containing the hashes for both
- When you index a pack, Git computes both hashes and stores them in
	the loose object store or pack
- Tricky part is when you're indexing a pack, you don't always get all
	blobs before all trees, before all commits, etc.
- In order to rewrite a commit from SHA-256 -> SHA-1, you need all of
	the objects it references in order to compute the hash. Try to look it
	up in a temporary lookup table ahead of time, and lazily hash the
	object we're going to get, coming back to it later.
- "Rewind the pack" to compute the proper objects, which works
- For submodules (currently unwritten), going to send both hashes over
	the wire, but unfortunately no way to validate those in real time. If
	your submodules are checked out, they are rewritten automatically.
- brian is working on it slowly as they get to it, and hopes that their
	employer will devote more time to it
- Wants to also work on libgit2 at the same time, since it doesn't yet
	understand SHA-256, though they hope that somebody else will work on
	it, since they are tired of writing SEGVs :-).
- (demetr): what if you have a remote that speaks only SHA-1?
	 - Goal is to have that information come over the pipe, and rewrite
		 into SHA-256 upon entering the new objects into the repository
- (demetr): can you then push a converted-into-SHA-256 repository back
	to a SHA-1 repo?
	 - Goal is to be able to do that, unless you have a SHA-1 collision,
		 in which case it won't work.
	 - No major hosting platform yet supports only SHA-256 repositories,
		 though maybe Gitolite and CGit do
- (Peff): so, in the worst case, index-pack takes twice as long?
	 - brian: depends on how many are blob objects, since blobs only take
		 a single pass
	 - Will try to rewrite objects in as few passes as possible
	 - May need multiple passes in order to visit objects in topological
		 order
	 - Actually: worst case is N passes, where N is the maximum tree depth
- (Stolee): what you really need is reverse-topo order on the object
	graph
	 - brian: yes, would be nice if the server sent them in that order.
		 But the server doesn't know how to do that.
- (Emily): so for something like shallow/partial-clone, the server needs
	to be able to do SHA-256 for you to compute it yourself?
	 - brian: there will be a capability, since data needs to come over
		 the pipe for submodules, and could be extended for shallow and
		 partial clones as well. Would fit into protocol v2, and will be
		 essential for submodules, so will have to exist regardless.
	 - Hopefully the server has that information, though how expensive
		 that will be to compute is highly situation-dependent.
- (jrn): submodules have to be updated, do you have an idea of what that
	protocol change will look like?
	 - brian: fuzzy idea, but nothing concrete yet
	 - (jrn): this reminds me of the early days of partial clones where we
		 talked about "promised" objects at the edge and associated metadata
- (Toon): so no interop, but is there a way to do a single step
	conversion from SHA-1 to SHA-256?
	 - brian: yes, you can use fast-export and fast-import (see the
		 sketch after these notes). Currently any signatures and references
		 are broken, but in the future we would like to update them (that
		 code exists, but it hasn't been upstreamed)
	 - doesn't quite work smoothly with submodules, since you have to
		 rewrite them first, then generate a set of marks, and then export
		 and import
	 - verified with git/git, resulting index isn't substantially larger
		 (basically 32 bytes per object, along with slightly larger commit
		 and tree objects)
- (demetr): Could be significantly larger if you have a zillion commits
	 - brian: we'd have other problems before then :-).
- (Elijah): common in commit messages to refer back to earlier commits.
	Do we want to rewrite those?
	 - brian: maybe, depends on future plans if/when we deprecate earlier
		 hash algos
	 - (jrn): Don't have a good way to retroactively change commit
		 messages, but we do have git notes. First instinct is to use notes
		 for this kind of historical reference info
	 - (Terry): annotated tags?
	 - (Elijah): filter-repo does this kind of commit message munging
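
A minimal sketch of the one-shot conversion brian describes, assuming a
repository without submodules (directory names are placeholders;
--signed-tags=warn-strip drops the tag signatures that, as noted above,
don't survive the conversion):

    $ git init --object-format=sha256 converted
    $ git -C original fast-export --all --signed-tags=warn-strip |
          git -C converted fast-import
    $ git -C converted log --oneline -1   # commits now named by SHA-256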


* [TOPIC 3/8] Merge ORT timeline
From: Taylor Blau @ 2022-09-29 19:20 UTC
  To: git

# Merge ORT (Elijah)

- Git has multiple merge backends - default was "recursive", now
	"merge-ort"
- When merge-ort was written, it was intended to be a replacement
- Would people be okay with removing the old merge-recursive code? ORT
	was meant as a drop-in replacement, but it does have some differences
	in behavior
	 - If it is okay, what's the timeline?
- (Taylor): a gradual rollout guarded by config might be a good
	approach (e.g., "merge.recursiveIsORT")
- (Johannes): describing the differences in behavior
	 - Merge-ort rename detection is always on, but merge-recursive's is
		 opt-in
	 - ORT uses histogram diffs (arguably more correct, matching unique
		 lines in diffs). This sometimes leads to merge conflicts that
		 merge-recursive didn't find but ort did. Still probably more
		 correct, though. Much more rarely, it's the opposite: recursive
		 found a conflict, ORT did not
	 - Conclusion: wait 2 major versions to deprecate
- (Taylor): maybe we should add "turn off rename detection" option
- (Johannes): should we even give users that option in the first place?
	If they don't have it, they won't be as upset when we get rid of it ;)
- (Peff) how long has ort been the default? 2 versions. Now people have
	recursive as an escape hatch, but we don't know if/when people use it.
	Recursive with find-renames is also an escape hatch (see the example
	after these notes).
- (Emily): we know how often people are using escape hatches (at
	Google), could take the same approach with this option
- (jrn): do we have other signals for how often this escape hatch is
	used? Stack Overflow posts?
	 - No one's named it as a solution on the mailing list, though, so we
		 don't know from that medium
	 - Agree with Johannes, might be best to not give the option to users
		 because this way we have a better chance of getting a signal.
- (Peff): the mailing list isn't representative of the larger Git
	community, so people bringing it up to the mailing list might not be
	indicative of usage
	 - Leaving the hatch doesn't seem like it'd incur a huge maintenance
		 burden
- (Terry): some distros might have significant propagation delay, should
	probably bake in extra time because of slow adoption
- (Ævar): I'm happy to follow your decision
	 - Some behavior difference, but it's working -better-
	 - At some point, we should be willing to say "if you need an old
		 feature, use an old version"
- Some observed differences might be libgit2 recursive vs. ORT, unclear
	which ones though
- (Johannes): We can bake in a deprecation notice, like: if you see
	something wrong, now is the time to bring it up! We'd rather fix it
	now
- (Jeff): most users won't be bothered - they'll see a conflict and
	resolve it, without thinking about which algorithm generated the
	conflict
- (jrn): for most usage, this is a completely safe change. The
	discussion comes up because of the fear that some user might have some
	use of merge they regularly do that hangs, that kind of thing, rather
	than the subtle merge resolution changes that we've discussed. I think
	we're safe. :)
- (brian): observing something like Debian or other LTS versions of
	projects - if there isn't a lot of screaming after a couple months,
	it's probably safe.  Even if you wait a decade, though, there'll
	always be one or two who suddenly encounter issues with the new
	feature after the old is long deprecated.
- (Ævar): haven't seen any huge warning in the merge docs saying we've
	changed something, but with how many users are already using it, it's
	likely very few people will ever notice
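
For reference, the escape hatches discussed above are spelled like this
today (all of these flags and config keys already exist):

    $ git merge -s recursive topic          # old backend, per-merge
    $ git merge -X find-renames=50 topic    # tune rename detection
    $ git -c pull.twohead=recursive pull    # old backend as pull default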


* [TOPIC 4/8] Commit `--filter`'s
From: Taylor Blau @ 2022-09-29 19:20 UTC
  To: git

# git clone --filter=commit:0 (jonathantanmy)

- Partial clone that can omit commits (we already support trees and
	blobs)
- Pros:
	 - Don't need all commits, save network and disk I/O. There's a repo
		 at Google that grows so quickly that having just commits is too
		 much
- Cons
	 - Git assumes that all of the commits are present locally; very
		 pervasive assumption.
			- Blobs don't have outlinks, so they're not a problem. Tree depth
				is somewhat limited. Commits go all the way back to the
				beginning of the repo
			- `git bisect` doesn't work without commits
			- Lose out on optimizations like `fetch`'s skipping negotiator
				and commit-graph generation numbers
- Has everyone else thought about this?
- Peff: Compare to shallow clone (create a 'graft' and pretend that the
	commit has no more parents). How do we handle the continuous N + 1
	fetching? Jonathantanmy: Not a big issue, we can batch fetch. It's
	jumping around that's a problem
- Peff: What if the server sends the commit graph?
	 - Taylor: we could just send the generation number(s) of the parents
		 of the commits on the boundary of what you're sending.
	 - Emily: We can't verify it though, we'd have to just trust the
		 server
	 - Taylor: true, but that's the case even if you send the whole commit
		 graph, too
- Jrn: Partial clone - we know the server is there, so we still have a
	"full clone", but part of the "full clone" lives in the server. There
	are git services that don't need a full copy of the repo, e.g. for CI,
	we only need a view of the directories we're building.
- Two use cases for partial clone
	 - Shallow clone replacement: user only cares about a single commit
	 - Operation that involves history walking (e.g. git describe). We
		 might as well fetch all the commits (i.e., convert to tree or blob
		 filtering when we notice this operation). Are these operations
		 distinguishable?
- Rdamazio: What if all of the history walking happens only on the
	server? (e.g.  git blame). Jrnieder: For git blame specifically, that
	makes sense, but are you thinking of other things like git log?
	Rodrigo: Yes
- Johannes: That doesn't sound like it will scale. Stolee: At GitHub,
	we already run blame on the server. Rodrigo: At Google, we precompute
	that
- Terry: More and more things want "stateless" operations (don't care
	about history) - that's probably the majority of use cases. There's
	also a popular use case of "1 week/month" of history. It would be
	great to not pay the penalty of fetching all commits. Today, we only
	have shallow clone, which pretends that history is different from what
	it actually is, and it's very difficult to maintain this on the server
	(sending not enough objects, sending too many objects). But filters
	are much easier to maintain.
- Victoria: Is this a replacement for shallow clones then? Terry: Yes.
- Stolee: 2 technical areas of apprehension
	 - VFS for git tried to do this by having only the initial commit and
		 fetching later objects one-by-one. Didn't work at all, was very
		 slow.
	 - Treeless clones - when traversing the tree, we keep refetching the
		 tree when we traverse it.
- We would have to drastically rework how Git interacts with partial
	clones
- Taylor: Or we could teach the server to preempt the operations ("I'm
	going to run git log, send me the right things")
	 - Stolee: Or run it on the server
	 - Taylor: yes, that would be the other extreme approach to this
- Jrnieder: With treeless clones, we don't propagate the filter on the
	catch up fetch, and there are some code locations that assume that if
	we have a tree, we have all of its children. If anything, commit
	filters are even easier because we have nothing - so we can do "all or
	nothing"
	 - Stolee: I agree that it's simpler, but I still think it'll be
		 really slow. So either we need to do something much smarter than
		 object-by-object fetches, or we need to prevent users from running
		 problematic commands. We would eventually have to fix the problem
		 for treeless clones, so what if we start with the full commit
		 history but not all of the trees? We can fix that first before
		 starting on commit filters
	 - Jrnieder: I can see the need for all of the commits up until a
		 certain point in time, but I don't know if there's a need to solve
		 the general problem of omitting arbitrary commits e.g. jumping
		 around in bisect
- Rodrigo: We have some experience doing this with Mercurial at Google
	- we hide the full history; users know the commits exist, and they can
	refetch if they wish. Stolee/Peff: That sounds like reimplementing
	shallow clone.
- Taylor: Is there any other kind of filter other than commit:0?
	Jonathantanmy: No plans yet.
- Peff: Wouldn't you need to implement the general case to do batching
	of commits in "git log"? Jonathantanmy: Maybe not, we could e.g. reuse
	the shallow clone protocol.
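
For context, here are the two object filters that already ship, next to
the proposed spelling from the topic title (not implemented in any
release); the repository URL is a placeholder:

    $ git clone --filter=blob:none https://example.com/repo.git  # omit blobs
    $ git clone --filter=tree:0 https://example.com/repo.git     # omit trees and blobs
    $ git clone --filter=commit:0 https://example.com/repo.git   # proposed: omit commits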

# Managing ever growing pack index sizes on servers

- Some repositories have over 15 years of history with 1000 active
	developers, so pack indices can be between 1 and 2 GB. The "GC pack"
	contains everything reachable from refs/heads/* and refs/tags/*
- Time-based slicing for repositories to allow smaller repositories.
	"Remove" history from before a certain point. Done by taking a shallow
	clone and using that as the new repository.
- What about folks who are only interested in the last week's history?
- Pack repositories based on time-based slicing.  Moving back to older
	history can fall back to older packs as necessary.
- Some people, like documentation folks, don't need the entire history
	and might be fine with a more limited environment.
- Chromium repacks into three packs: one is a cruft/garbage pack, and
	the other two contain reachable objects. refs/heads is packed into one
	pack, and refs/changes (the PR-equivalent) into the other.
- JGit doesn't have a reverse index yet
- Taylor: considering packing reverse index into main index.  The
	tension is that we need to make using multiple packs more flexible.
	Introduce bitmap chains when repacking to make things more stable and
	less expensive.
- Stolee: Consider older packs that are stable and only repack newer
	things.
- Peff: One reason not to have lots of packs on disk is missing out on
	deltas.  We could use thin packs on disk.
- Stolee: Future goal is to only include full delta chains in the stable
	packs.
- git gc --aggressive used to make really deep deltas and has been
	fixed to be less aggressive to avoid runtime performance costs. A
	depth between 10 and 50 shows real performance improvements, but the
	old default was like 200 (see the config sketch after these notes).
- The original numbers were picked randomly without measurement.
- Patrick: GitLab's maintenance architecture is evolving. Each push
	triggers either an incremental repack (new objects into one pack) or a
	full repack (everything into one pack with deltas).
- Stable ordering for determining preferred objects (SHA ordering is not
	suitable).
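
A sketch of the delta-depth knobs referenced above (both the config key
and the repack flags already exist; the values are illustrative):

    $ git config gc.aggressiveDepth 50          # current default for gc --aggressive
    $ git repack -adf --depth=50 --window=250   # full repack with bounded chains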


* [TOPIC 5/8] Server side merges and rebases
From: Taylor Blau @ 2022-09-29 19:21 UTC
  To: git

# Server side merges and rebases (& new rebase/cherry-pick UI?) (Elijah)

- Elijah: tried to implement the Git side of cherry-pick as flags to
	the git merge subcommand, but everything turned out to be
	incompatible. Used git merge-tree instead, much better, but this
	doesn't create a new commit, only a new top-level tree (see the sketch
	after these notes).
- Rebase and cherry-pick is even more tricky because we need sequences
	of commits. Does the current UI make sense?
- I want to create commits on a not-checked out branch, or rebase, or
	cherry-pick. Not only on the server, but on any client.
- Rebase skips cherry-picks, but that is probably just an optimization
	from when rebase was a shell script. Always doing cherry-picks is
	faster these days, but is a behavior change
- Creating a new commit and modifying the working tree - this lets
	hooks run, but I don't want them to run on the server
- Rebase and cherry-pick are typically centered around HEAD, I would
	prefer to replace with just a commit range. If you don't make the
	assumption around HEAD, cherry-pick and rebase aren't that different.
- How do we display conflicts generated on the server side? We don't
	have a representation for that. Taylor: Probably just block the
	operation on the server. Elijah: That's my intuition too.
- We have a lot of users who want to cherry-pick a commit on a bunch of
	LTS branches, it would be great if they don't have to check out those
	branches.
- What about cherry-picking to older branches? It's super slow to check
	out the old branch and it's a big pain to update.
- Want to be able to replay merges. Not just like rebase
	--rebase-merges, but with extra content/resolutions
- Emily: Rebase has famously bad UX. Could we create a new command
	that fixes the problems, like checkout and switch? Elijah: I'm worried
	that I'll copy the old terminology, so I'd need feedback on that.
- Stolee: We could rework the underlying API that supports rebase and
	cherry-pick and use that for the new UX.
	 - Jrnieder: We don't have plumbing commands for this yet, which would
		 be very nice to have. For changes motivated by "cherry-pick has
		 this bad behavior", if we're not making an overall better UX then
		 I'd encourage "go ahead and make cherry-pick no longer have that
		 bad behavior"
	 - Jonathantanmy: I think base + theirs + ours is good enough. Elijah:
		 Sounds like git merge-tree, I don't think that's enough for the
		 server case. I'm sometimes porting over multiple commits instead of
		 just one, ort can do some optimizations on that, but one-by-one
		 invocations would lose that info. Also, this isn't enough to replay
		 merges.
- Peff: It would be good to have a machine-readable representation of a
	conflict that the server can serve, but also can be materialized by
	client tools.  Taylor: It would be even cooler if we could push that
	representation and have "collaborative" merge resolution. Elijah:
	Merge-tree can output files with conflict markers. We'd have to add
	info to represent the index conflict. With rebase, we'd need to
	represent different conflicts at different points.
- Martin: Does ort handle conflicts with renames? E.g. renaming two
	files to the same name. Elijah: Yes
- Elijah: One output format would be input for git update-ref --stdin;
	instead of making all of the changes itself, the command could output
	data that git update-ref can ingest later.
- Waleed: Do you support rebasing non-linear sequences? Elijah: Yes,
	but..  (didn't hear)
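
The merge-tree plumbing Elijah mentions landed in Git 2.38; a minimal
sketch of a server-side merge that never touches a worktree or index
(branch names are placeholders):

    $ TREE=$(git merge-tree --write-tree main topic)  # fails, with details, on conflict
    $ COMMIT=$(git commit-tree -p main -p topic -m "Merge topic" "$TREE")
    $ git update-ref refs/heads/main "$COMMIT"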


* [TOPIC 6/8] State of sparsity work
From: Taylor Blau @ 2022-09-29 19:21 UTC
  To: git

# State of sparsity developments and future plans (Victoria)

- (Victoria) Integrating commands with the sparse index - making them
	compatible with sparse indexes so they don't un-sparsify the index
	before executing
- Have worked on some more recently
- GSoC student has worked on a handful as well
- Near-term future is about finding commands that need to touch the
	index but don't support the sparse index, and making them compatible
- In some cases, that is going to require expanding the index to be
	non-sparse, especially if you are touching something outside of the
	sparsity cone
	 - That is somewhat straightforward
- More interesting questions: what is the future of sparsity? Recently,
	Elijah pushed a change to make sparse-checkout's cone mode the default
	(nice, since it is required by sparse index)
- As we move forward, what should we change the defaults of?
	 - Sparse index for sparse checkouts in cone mode?
- Scalar as a testing ground for larger features, including sparse index
	 - Could make sparse index the default in Scalar for cone-mode sparse
		 checkouts, and then see how it goes
	 - Or, could just go for it sooner (after we have integrated sparse
		 index with enough commands)
- A handful of internal, logistical things that would have to happen for
	sparse index to become the default. Currently, commands are assumed to
	not work with the sparse index.
- Question for everybody: what is a good balance between pushing the
	sparse index, and waiting to introduce it to more users by holding off
	on changing the default?
- (Stolee): sparse checkout and submodules became a difficulty when
	mentoring their GSoC student.
- (JTan): is it possible to decouple the sparse index and cone-mode
	sparse checkouts from each other? This would be easy to test - turn it
	on, and all of the test suite automatically uses it. Jrnieder: This
	sounds OK for the filesystem, but I don't know how this would work for
	the "VFS-backed Git" idea on the spreadsheet. (other things…)
- (Stolee): We need cone mode today because cone patterns are the only
	way to definitively say that we've reached the boundary. But we can
	also expand the idea of "cone" to allow more paths (files instead of
	directories) in the cone.
- (Taylor): What do we need to tell subcommands to assume that sparse
	index is supported? (Victoria) Gut feeling for the most part.
	(Jrnieder): I'd prefer this to happen sooner rather than later. This
	is easier for maintainability since we don't have to worry about
	commands being in two possible modes of operation. We can break these
	incompatible APIs by renaming them to prevent them from being misused
	by new commands.
- (Victoria): So just break things that always use the full index?
	Sounds OK. (Stolee): This sounds similar to the the_index macros,
	which we've mostly tried to remove, though that effort stalled. Doing
	this conversion everywhere sounds extremely difficult - we've done an
	audit on this. (Jrnieder): Oh, I just meant renaming the API without
	changing semantics. Intentionally break everything.
- (Victoria): We'd need to write new tests for lots of commands because
	the existing tests don't actually interact with the "sparse" parts of
	the index.
- (Ævar): Is this just a matter of telling `git init` to initialize a
	sparse index? (Stolee): No, we need to force the tests to work on
	sparse directory entries.
- (Jrnieder): This sounds like a good fit for feature.experimental
- (Victoria): Is the sparse index a good default for Git in general,
	instead of just "for large repos"? I'd think yes. (Jrnieder): Yes, I
	think any sparse checkout user would want this. (Stolee): Literally
	every command that touches the index has been converted (used for the
	Microsoft Office monorepo), so it's just a matter of doing this for
	the whole project.
- (Elijah): I would like partial clone filters built from sparse
	patterns. --filter=blob:none doesn't let you disconnect from the
	server.
	 - (Jrnieder) The DX of sparse checkout + blob:none has been pretty
		 good (see the sketch after these notes). (Elijah): but you need to
		 stay connected to the server. (Jrnieder) Ah, thanks for explaining,
		 sorry for the confusion (Elijah) It would be great to have "sparse
		 clones" and have commands that work just inside of that cone when
		 disconnected. Make "grep", "log", etc. respect the sparse pattern
	 - (Stolee): We've thought about this, but it is very expensive on the
		 server side and makes bitmaps unusable. Alternatively, we could
		 start with blob:none and then backfill. That sounds more promising,
		 but that's not just a plain partial clone.
	 - (Jonathantanmy): FYI there's a protocol feature that already allows
		 clones to specify a sparse filter (referencing a blob with sparse
		 patterns that's present on the server), but I don't know of any
		 implementation that has this enabled.
	 - (Jrnieder): Can we delete this? (GitHub folks): We don't like it;
		 we invented the uploadpackfilter config to disable it :) Is this
		 just cleanup? (Jrnieder): Yes, and this dead end will stop being a
		 distraction.
	 - (Peff) We could already implement most of the backfilling using
		 current commands, but that might skip over some delta-ing
		 optimizations. We could have a protocol change to provide the path
		 as a hint to the server.
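
A sketch of the sparse checkout + blob:none combination praised above
(the URL and paths are placeholders). The in-protocol sparse filter
Jonathan mentions is spelled --filter=sparse:oid=<blob>, but as noted,
servers generally keep it disabled:

    $ git clone --filter=blob:none --sparse https://example.com/big-repo.git
    $ cd big-repo
    $ git sparse-checkout set --cone src/client docs   # blobs backfilled on demand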


* [TOPIC 7/8] Speeding up the connectivity check
From: Taylor Blau @ 2022-09-29 19:21 UTC
  To: git

# Ideas for speeding up object connectivity checks in git-receive-pack
(Patrick Steinhardt)

- Patrick: At GitLab we've noticed that some repositories with a lot of
	references have become slow.
- Initially I thought it was the object negotiation.
- The connectivity checks seem to be the culprit. The connectivity check
	is implemented quite naively. "git rev-list <commits> --not --all"
- A while ago I tried to optimize it. Stop walking when you reach an
	existing object. List feedback was that I had not gotten the semantics
	right, since that existing object is not necessarily pointed to by a
	ref and not everything it references is necessarily present in the
	repository.
- Trial 2: optimize rev-list. Sped up connectivity check by 60-70%! But
	the motivating repositories had grown, so it was still too slow.
- How do other people do it, with millions of references?
- Terry: Yes, we've seen this problem.
- Just the setup time for the revision walk takes a long time.
	Initially we weren't using reachability bitmaps; using those also
	helped. Then using subsets of references - first HEAD, then
	refs/heads/*, …
- Jrnieder: Initially we did this for reachability checks; do
	connectivity checks do it, too? Terry: Yes, Demetr did that.
- Demetr: When the connectivity check fails, we fall back to a fuller
	set of refs.
- Patrick: Many of our references are not even visible to the public.
	If Git made it configurable which refs are part of the connectivity
	check, that would already make things faster in our case.
- Taylor: Did you experiment with reachability bitmaps? Am I remembering
	correctly that they made things slower?
- Patrick: We did some experiments with reachability bitmaps, but not
	specifically for this problem. In those experiments they did make some
	things slower.
	 - Taylor: one thing that you could do is make the bitmap traversal by
		 building up a complete bitmap of the boundary between haves and
		 wants instead of a bitmap of all of the haves. Involves far less
		 object traversal, and there are some clever ways to do this.
	 - Taylor: As long as you can quickly determine the boundary between
		 the haves and wants for a given request, the connectivity check
		 should be as fast as you need.
- Peff: One difference between how Git and JGit do this is that JGit
	builds up a structure of "what is reachable". You could persist a
	bitmap representing "here's everything that's reachable from this
	repository" and subtract that out; that would help with many cases.
	The problem with one reachability bitmap like that is that it goes
	stale whenever someone pushes something. But you could make a set of
	"pseudo-merge bitmaps" for each group of 10,000 refs or so. Especially
	if you're clever about which refs you do that for, that can be a
	significant improvement ("these 2,000,000 refs haven't been touched
	for a year, so I can use this bitmap and don't even have to examine
	them").
- Terry: There are two ways that unreachable objects appear. One is by
	branches being deleted or rewound. The other is failed pushes where
	objects were persisted but the ref update didn't succeed. I wanted to
	distinguish between objects that are "applied" and "unapplied", but it
	became very thorny. With that, on a branch rewind, we can calculate
	what just became unreachable.
- Waleed Khan: Object negotiation with bloom filters paper
	 - Kleppman and Howard 2020
	 - Blog post
		 https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html
	 - Paper https://arxiv.org/abs/2012.00472
	 - Quote: "Unlike the skipping algorithm used by Git, our algorithm
		 never unnecessarily sends any commits that the other side already
		 has, and the Bloom filters are very compact, even for large commit
		 histories." Alternative ways to write to VFS-backed worktrees (e.g.
		 write($HASH) instead of write(<bytes>))
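
For reference, the naive check Patrick describes boils down to roughly
the following traversal (the real implementation in connected.c feeds
the new tips over stdin; $NEW_TIPS is a placeholder):

    # Walk everything reachable from the new tips but not from any
    # existing ref; rev-list fails if a needed object is missing.
    $ git rev-list --objects --quiet $NEW_TIPS --not --all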

# Alternative ways to write to VFS-backed worktrees (e.g. write($HASH)
instead of write(<bytes>))

- Google tried to use the sparse index to control what gets materialized
	in a VFS-like system. What if we replaced the write() syscall +
	writing bytes with writing a hash instead? It could save a lot, e.g.
	in an online IDE.
- Stolee: The "write" analog of FSMonitor for VFS-for-Git is the
	post-index-change hook. We suppress the writes by manipulating the
	index and then communicating the write back to the VFS.
- Jrnieder: This is bigger than just the "write()" call; if we have a
	git-aware filesystem, we can be much less wasteful than we are today.
	E.g. FSMonitor can only make `git status` so fast because we still
	have to stat(), but with a VFS, we could ask the VFS what has changed.
- Do we think this is useful for anything other than a VFS? Do we still
	want this even if it's only useful for VFS?
- Stolee: We could make FSMonitor more Git-aware, e.g. it doesn't know
	about writes that we make. JeffH: We also can't say "ignore .o files,
	I only care about source files". That would also help greatly (see the
	example after these notes). Writing a hash instead of the contents is
	probably about as expensive; most of the savings are in avoiding the
	stat() calls. This also sounds racy.
- Emily: Another reason to do this work is that this is a good jumping
	off point to libify the Git internals. Is there any reason not to do
	that? Jrnieder: To make this concrete: do you mean for example
	creating a worktree.h with a vtable of worktree operations and having
	things talk to that instead of the FS? Emily: Yes. Things like that
	are the reason why we have libgit2, so what if Git could just ship its
	own library. Bmc: Libgit2 is used in lots of places like the Rust
	package manager (Cargo). The problem is that Git is GPLv2, which is
	not usable by lots of folks.
- Stolee: The stance of the Git project is that the API is the CLI, not
	individual files. But I think this is a good thing to have for the
	project as a whole, even just internally.
- We can finally have unit tests!!??!
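
For context, the existing FSMonitor integration referenced above can be
enabled like this (the builtin daemon shipped in Git 2.37):

    $ git config core.fsmonitor true        # starts the daemon on demand
    $ git config core.untrackedCache true   # pairs well with FSMonitor
    $ git status                            # now skips most stat() calls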


* [TOPIC 8/8] Using Git securely in shared services
From: Taylor Blau @ 2022-09-29 19:22 UTC
  To: git

# How to run Git securely in shared services (Kevin)

- Kevin: What do we think about Git on a shared device? E.g., Git
	trusts the repo config more than the global config, but the repo might
	not be trustworthy. What do we think of, say, inverting this
	precedence?
- bmc: Repo config overriding global config is an important feature we
	should not lose. But I could imagine some global option that affects
	that behavior — making it very explicit, on that particular machine.
- Stolee: I wrote an email to the mailing list months ago about this
	subject.  Title "What is Git's security boundary?" Concrete proposal:
	anything that executes an executable could go through a hook that is
	installed at the global or system level, and local config can refer to
	that. E.g. "please run vim, which has been controlled at the system
	level".
	 - I got zero traction, but there's some prior art.
- Taylor: We think of Git as special in this way. For "make", we
	wouldn't ask this same question.
- Jrnieder: With "make", it's very clear to users that arbitrary
	commands might be run, but users don't have that same expectation when
	just browsing code.
- Taylor: We could create a mode that ignores repo config, and that
	makes prompts safer. But we're inherently make-like.
- Jrnieder: That's obvious to us, but not to most users. I think we're
	quite far from having Git's security model match users' mental model
	and I think it's hard to change the behavior but would be even harder
	to change users' mental model in this example.
- Jrnieder. I wish we could separate out "repo properties" from "actual
	config": keep my user preferences separate from the things that Git
	needs to run.
- Ævar: Emacs has solved this. Emacs can run arbitrary code for all
	kinds of things, but it prompts users to approve the code first. We
	could allowlist harmless config, and then only prompt users for
	sketchy things. Taylor: This kind of allowlisting sounds impossible
	though. Bmc: Can we do this just for core.repositoryformatversion and
	extensions.*?
- (much spirited discussion, did not hear)
- JTan: If we want repositories to still be movable, how would we
	maintain this allowlist?
- Ævar: There are ways to do this for just certain variables, or certain
	variables in certain paths, etc. We have a lot of space to do the
	right thing.
- Bmc: It's important to make this behavior configurable from the
	command line.
- Ævar: I was experimenting with this because I wanted a way to have
	config in-repo. It would be very useful even if we could only control
	a subset of config
- Jrnieder: We already have some defense against hooks-and-config in
	special cases e.g. uploadpack doesn't trust the local repo's hooks.
	Suppose we have completely solved the problem of protecting against
	those; are we comfortable with changing the threat model to encompass
	normal commands in local repositories?
- Peff: I think the safest option is to just ignore the in-repo config
	altogether. Johannes: But the unsafe thing isn't parsing the config,
	it's executing code. We could just shift the boundary to "don't
	execute code outside of safe.directory" (see the example after these
	notes).
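
For reference, the safe.directory knob mentioned above already exists
(added in the Git 2.35.2 security release); the path is a placeholder:

    $ git config --global --add safe.directory /srv/shared/repo
    $ git config --global --add safe.directory '*'   # trust all paths; single-user machines only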

