From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Subject: [TOPIC 5/12] Replacing Git LFS using multiple promisor remotes
Date: Mon, 2 Oct 2023 11:19:32 -0400
Message-ID: <ZRrfhMbExAa7cmX0@nand.local>
In-Reply-To: <ZRregi3JJXFs4Msb@nand.local>
(Presenter: Christian Couder, Notetaker: Jonathan Nieder)
* Idea: Git LFS has some downsides
* Not integrated into Git, that's a problem in itself
* Not easy to change decisions after the fact about which blobs to offload
  into LFS storage
* So I started work some years ago on multiple promisor remotes as an
alternative to Git LFS
* Works! Requires some pieces
* Filtering objects when repacking (git repack --filter, due to be merged
hopefully soon)
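
As a rough sketch of the client-side pieces (the remote name "bigblobs", the
URLs, and the 1m size threshold below are made up for illustration), the setup
could look something like this:

    # Clone without large blobs; "origin" is recorded as a promisor
    # remote that can be asked for them later.
    git clone --filter=blob:limit=1m https://example.com/repo.git
    cd repo

    # Optionally add a second remote specialized in large blobs and
    # mark it as a promisor remote too.
    git remote add bigblobs https://bigblobs.example.com/repo.git
    git config remote.bigblobs.promisor true
    git config remote.bigblobs.partialclonefilter blob:limit=1m
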
* I'm curious about issues related to Git LFS - what leads people not to use Git
LFS and to do things in other, less efficient ways?
* Choices
* We can discuss details of a demo I worked on a few years ago
* We can discuss Git LFS, how it works, and how we can do better
* brian: Sounds like this is a mostly server-side improvement. How does this
  work on the client side for avoiding the need to keep old versions of huge
  files?
* Christian: On the client side, you can get those files when you need them
(using partial clone), and repack --filter allows you to remove your local
copy when you don't need them any more
* There could be more options and commands to manage that kind of removal
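
A sketch of that client-side flow, assuming the setup above and the
still-in-flight repack --filter option (the path below is a placeholder):

    # Reading a missing blob triggers a lazy fetch from a promisor remote.
    git cat-file blob HEAD:assets/texture.bin > /tmp/texture.bin

    # Later, drop local copies of the large blobs again; they remain
    # fetchable from the promisor remote.
    git repack -a -d --filter=blob:limit=1m
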
* Terry: with multiple promisor remotes, does gc write the large files as their
own separate packfiles? What does the setup look like in practice?
* Christian: You can do that. But you can also use a remote helper to access
the remotes where the large files live. Such a cache server can be a plain
http server hosting the large files, and the remote helper can know how to
do a basic HTTP GET or RANGE request to get that file.
* It can also work if the separate remote is a git remote specialized in
  handling large files.
* Terry: So it can behave more like an LFS server, but as a native part of
the git protocol. How flexible is it?
* Christian: yes. Remote helpers can be scripts, they don't need to know a
lot of things when they're just being used to get a few objects.
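
A very rough sketch of such a helper, for a plain HTTP server that serves raw
blob contents keyed by object id (the "bigblobs" transport name and the URL
are hypothetical, and the real lazy-fetch wiring has more details than shown
here):

    #!/bin/sh
    # git-remote-bigblobs: invoked by git for URLs of the form
    # bigblobs::https://cdn.example.com/objects, with the address as $2.
    BASEURL=$2

    while read -r cmd oid name; do
        case "$cmd" in
        capabilities)
            printf 'fetch\n\n' ;;
        list)
            printf '\n' ;;           # no refs to offer, only objects
        fetch)
            # Download the raw content and store it; hashing the bytes
            # reproduces the same object id.
            curl -fsS "$BASEURL/$oid" | git hash-object -w --stdin >/dev/null ;;
        '')
            printf '\n' ;;           # end of a fetch batch
        esac
    done
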
* Jonathan Tan: is it important for this use case that the server serve regular
files instead of git packfiles?
* Christian: not so important, but it can be useful because some people may
want to access their large objects in different ways. As they're large,
it's expensive to store them; using the same server to store them for all
purposes can make things less expensive. E.g. "just stick the file on
Google Drive".
* Taylor: in concept, this seems like a sensible direction. My concern would be
immaturity of partial clone client behavior in these multiple-promisor
scenarios
* I don't think we have a lot of these users at GitHub. Have others had heavy
use of partial clone? Have there been heavy issues on the client side?
* Terry: Within the Android world, partial clone is heavily used by users and
CI/CD and it's working well.
* jrnieder: Two qualifications to add: we've been using it with blob filters
  and not tree filters, and we haven't been using multiple promisor remotes.
* Patrick: What's nice about LFS is that it's able to easily offload objects
to a CDN. Reduce strain on the Git server itself. We might need a protocol
addition here to redirect to a CDN.
* Jonathan Tan: if we have a protocol addition (server-side option for blob-only
fetch or something), we can use a remote helper to do the appropriate logic,
not necessarily involving a git server
* The issue, though, is that Git expects packfiles, as the way it stores
things in its object store.
* As long as the CDN supports serving packfiles, this would all be doable
using current Git.
* If the file format differs, may need more work.
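
For example, if the CDN serves a ready-made packfile (URL hypothetical),
current Git can ingest it directly:

    # Download a packfile over plain HTTP and index it into the local
    # object store.
    curl -fsS https://cdn.example.com/big-blobs.pack | git index-pack --stdin
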
* jrn: Going back to Terry's question on the distinction between this and using
  an LFS server. One key difference is that with Git LFS, the identifier is not
  the Git object ID but some other hash. Are there any other fundamental
  differences?
* Christian: With Git LFS, if you want to change which blobs are stored in LFS
  (move them into or out of LFS storage), you have to rewrite the history.
* Using the git object ID gives you that flexibility
* brian: One thing Git LFS has that Git doesn't is deduping
* On macOS and Windows and btrfs on Linux, having only one underlying copy of
the file
* That's possible because we store the file uncompressed
* That's a feature some people would like to have at some point. It's not out
  of the question to do in Git, but it would require a change to how objects
  are stored in the git object store
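
For reference, this is the kind of dedup brian is describing: on a filesystem
with reflink support (e.g. btrfs or XFS), a copy can share extents with the
original (file names are placeholders):

    # Copy-on-write copy: both files share the same extents on disk
    # until one of them is modified.
    cp --reflink=always texture.bin texture-copy.bin
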
* jrn: Is anyone using the demonstrated setup?
* Christian: Doesn't seem so. It was considered interesting when demoed in
GitLab.
* Jonathan Tan: is the COW thing brian mentioned part of what this would be
intended to support?
* Christian: Ultimately that would be possible.
* brian: To replace Git LFS, you need the ability to store uncompressed
objects in the git object store. E.g. game textures. Avoids waste of CPU
and lets you use reflinks (ioctl to share extents).
* Patrick: objects need the header prefix to denote the object type.
* brian: Yes, you'd need the blobs + metadata. That's part of what Git LFS
gives us within GitHub, avoiding having to spend CPU on compressing these
large objects to serve to the user.
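
The metadata in question is small: a blob's id is the hash of "blob <size>"
plus a NUL byte followed by the raw content, so an uncompressed on-disk
representation still needs that header. Assuming a SHA-1 repository and a
placeholder file name:

    # Compute a blob id by hand...
    { printf 'blob %d\0' "$(wc -c < texture.bin)"; cat texture.bin; } | sha1sum

    # ...and compare with what git computes.
    git hash-object texture.bin
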
* jrn: Going back to the discussion of multiple promisor remotes. When people
  turn on multiple promisors by mistake, the level of flexibility has been a
  problem: it causes a lot of failed/slow requests, because git is very
  optimistic and tries to fetch objects from everywhere. The approach Jonathan
  suggested, where the helper is responsible for choosing where to get objects
  from, might help mitigate these issues.
* Christian: yes
* Minh: can the server say "here are most of the objects you asked for, but
these other objects I'd encourage you to get from elsewhere"?
* Christian: you can configure the same promisor remote on the server. If the
client doesn't use the promisor remote and only contacts the main server,
the server will contact the promisor remote, get the object, and send it to
the client. It's not very efficient, but it works. Another downside is that
if this happens, that object from the promisor remote is now also on the
server, so you need to remove it if you don't want to keep it there.
* Minh: it seems someone has to pack the object with the header and compute
the git blob id for it, which is itself expensive
* Christian: if the promisor remote is a regular git server, then yes, the
  objects will be compressed in git packfile format. But if it's a plain HTTP
  server accessed through a helper, they don't need to be. Of course, if the
  objects are ever fetched by the main server, they end up in packfile or
  loose object format there.
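
A sketch of the arrangement Christian describes, on the main server's side
(repository path and remote name are hypothetical): the server has the same
promisor remote configured, and can later drop large objects it had to fetch
on a client's behalf.

    # On the main server's bare repository:
    git -C /srv/repo.git remote add bigblobs https://bigblobs.example.com/repo.git
    git -C /srv/repo.git config remote.bigblobs.promisor true

    # Periodically push re-fetched large blobs back out of the local
    # object store (again relying on git repack --filter).
    git -C /srv/repo.git repack -a -d --filter=blob:limit=1m
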