From: Taylor Blau <>
Subject: [TOPIC 5/12] Replacing Git LFS using multiple promisor remotes
Date: Mon, 2 Oct 2023 11:19:32 -0400	[thread overview]
Message-ID: <ZRrfhMbExAa7cmX0@nand.local> (raw)
In-Reply-To: <ZRregi3JJXFs4Msb@nand.local>

(Presenter: Christian Couder, Notetaker: Jonathan Nieder)

* Idea: Git LFS has some downsides
   * Not integrated into Git, that's a problem in itself
   * Not easy to change decisions after the fact about which blobs to offload
     into LFS storage
* So I started work some years ago on multiple promisor remotes as an
  alternative to Git LFS
* Works! Requires some pieces
   * Filtering objects when repacking (git repack --filter, due to be merged
     hopefully soon)
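As a rough sketch of that piece (the `--filter` option was still in flight at the time of these notes, so the demo checks whether the local git supports it before using it; the /tmp path is just for illustration):

```shell
set -e
rm -rf /tmp/filter-demo
git init -q /tmp/filter-demo
cd /tmp/filter-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init
if git repack -h 2>&1 | grep -q -- '--filter'; then
    # Repack everything, keeping only blobs at or below 1 MiB locally;
    # larger blobs are dropped and must be refetched from a promisor remote.
    git repack -a -d --filter=blob:limit=1m
    echo "filtered repack done"
else
    echo "git repack --filter not supported by this git"
fi
```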
* I'm curious about issues related to Git LFS - what leads people not to use Git
  LFS and to do things in other, less efficient ways?
* Choices
   * We can discuss details of a demo I worked on a few years ago
   * We can discuss Git LFS, how it works, and how we can do better
* brian: Sounds like this is a mostly server-side improvement. How does this
  work on the client side to avoid needing old versions of huge files?
   * Christian: On the client side, you can get those files when you need them
     (using partial clone), and repack --filter allows you to remove your local
     copy when you don't need them any more
   * There could be more options and commands to manage that kind of removal
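As a small illustration of that client side (paths under /tmp are just for the demo), a blob-less partial clone over the local file:// transport records the origin as a promisor remote and defers blob transfer:

```shell
set -e
rm -rf /tmp/pc-origin /tmp/pc-clone
git init -q /tmp/pc-origin
cd /tmp/pc-origin
# Allow upload-pack on this "server" to honor filter requests.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true
printf 'pretend this is huge\n' > big.bin
git add big.bin
git -c user.name=demo -c user.email=demo@example.com -c commit.gpgsign=false \
    commit -qm 'add big file'
# Blob-less clone: only commits and trees are transferred up front;
# blobs are fetched lazily from the promisor remote when needed.
git clone -q --no-local --no-checkout --filter=blob:none \
    file:///tmp/pc-origin /tmp/pc-clone
# The origin is recorded as a promisor remote:
git -C /tmp/pc-clone config remote.origin.promisor
```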
* Terry: with multiple promisor remotes, does gc write the large files as their
  own separate packfiles? What does the setup look like in practice?
   * Christian: You can do that. But you can also use a remote helper to access
     the remotes where the large files live. Such a cache server can be a plain
     http server hosting the large files, and the remote helper can know how to
     do a basic HTTP GET or RANGE request to get that file.
   * The separate remote can also be a git remote, specialized in handling
     large files.
   * Terry: So it can behave more like an LFS server, but as a native part of
     the git protocol. How flexible is it?
   * Christian: yes. Remote helpers can be scripts, they don't need to know a
     lot of things when they're just being used to get a few objects.
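A helper like that can indeed be tiny. The sketch below (Python; the HTTP-fetch body and the example URL are purely hypothetical, not a real API) shows the command loop a `git-remote-<name>` helper speaks on stdin/stdout, per gitremote-helpers(7):

```python
import sys

def handle(line, base_url="https://example.com/objects"):
    """Return the helper's reply to one command from git (sketch only)."""
    cmd = line.strip().split()
    if not cmd:
        return "\n"
    if cmd[0] == "capabilities":
        # Advertise only 'fetch': git will then ask us for individual
        # objects by sha1 and ref name, one batch at a time.
        return "fetch\n\n"
    if cmd[0] == "list":
        # A helper used purely as an object cache has no refs to list.
        return "\n"
    if cmd[0] == "fetch":
        sha1 = cmd[1]
        # A real helper would GET f"{base_url}/{sha1}" (a plain GET or a
        # Range request) and store the bytes, e.g. via `git hash-object -w`.
        return "\n"
    return "\n"

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(handle(line))
        sys.stdout.flush()
```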
* Jonathan Tan: is it important for this use case that the server serve regular
  files instead of git packfiles?
   * Christian: not so important, but it can be useful because some people may
     want to access their large objects in different ways. As they're large,
     it's expensive to store them; using the same server to store them for all
     purposes can make things less expensive. E.g. "just stick the file on
     Google Drive".
* Taylor: in concept, this seems like a sensible direction. My concern would be
  the immaturity of partial clone client behavior in these multiple-promisor
  setups.
   * I don't think we have a lot of these users at GitHub. Have others had heavy
     use of partial clone? Have there been significant issues on the client side?
   * Terry: Within the Android world, partial clone is heavily used by users and
     CI/CD and it's working well.
   * jrnieder: Two qualifications to add, we've been using it with blob filters
     and not tree filters. Haven't been using multiple promisor remotes.
   * Patrick: What's nice about LFS is that it's able to easily offload objects
     to a CDN. Reduce strain on the Git server itself. We might need a protocol
     addition here to redirect to a CDN.
* Jonathan Tan: if we have a protocol addition (server-side option for blob-only
  fetch or something), we can use a remote helper to do the appropriate logic,
  not necessarily involving a git server
   * The issue, though, is that Git expects packfiles, as the way it stores
     things in its object store.
   * As long as the CDN supports serving packfiles, this would all be doable
     using current Git.
   * If the file format differs, may need more work.
* jrn: Going back to Terry's question on the distinction between this and using
  an LFS server. One key difference is that with Git LFS the identifier is not
  the git object ID, it's some other hash. Are there any other fundamental
  differences?
   * Christian: With Git LFS, if you want to change which blobs are stored in
     LFS (including moving blobs out of LFS storage), you have to rewrite
     history.
   * Using the git object ID gives you that flexibility
* brian: One thing Git LFS has that Git doesn't is deduping
   * On macOS and Windows and btrfs on Linux, having only one underlying copy of
     the file
   * That's possible because we store the file uncompressed
   * That's a feature some people would like to have at some point. Not out of
     the question to do in Git, but it would require a change to how objects
     are stored in the git object store
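For context, Git LFS gets that uncompressed on-disk copy by keeping only a small pointer file in the repository, while the actual content lives uncompressed in LFS storage, keyed by SHA-256. A pointer file looks like this (hash and size are the illustrative values from the LFS pointer spec):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```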
* jrn: Is anyone using the demonstrated setup?
   * Christian: Doesn't seem so. It was considered interesting when it was
     demoed.
* Jonathan Tan: is the COW thing brian mentioned part of what this would be
  intended to support?
   * Christian: Ultimately that would be possible.
   * brian: To replace Git LFS, you need the ability to store uncompressed
     objects in the git object store. E.g. game textures. Avoids waste of CPU
     and lets you use reflinks (ioctl to share extents).
   * Patrick: objects need the header prefix to denote the object type.
   * brian: Yes, you'd need the blobs + metadata. That's part of what Git LFS
     gives us within GitHub, avoiding having to spend CPU on compressing these
     large objects to serve to the user.
* jrn: Going back to the discussion of multiple promisors. When people turn on
  multiple promisors by mistake, the level of flexibility has been a problem:
  git is very optimistic and tries to fetch objects from everywhere, which
  causes a lot of failed/slow requests. The approach Jonathan suggested, where
  the helper is responsible for choosing where to get objects from, might help
  mitigate these issues.
   * Christian: yes
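For reference, wiring up an extra promisor remote by hand is just a couple of config keys (the remote name and URL below are illustrative):

```shell
set -e
rm -rf /tmp/promisor-demo
git init -q /tmp/promisor-demo
cd /tmp/promisor-demo
# Declare a second remote that git may lazily fetch missing objects from.
git remote add largefiles https://example.com/largefiles.git
git config remote.largefiles.promisor true
# Optionally record which objects that remote is expected to hold:
git config remote.largefiles.partialclonefilter blob:limit=1m
git config remote.largefiles.promisor
```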
* Minh: can the server say "here are most of the objects you asked for, but
  these other objects I'd encourage you to get from elsewhere"?
   * Christian: you can configure the same promisor remote on the server. If the
     client doesn't use the promisor remote and only contacts the main server,
     the server will contact the promisor remote, get the object, and send it to
     the client. It's not very efficient, but it works. Another downside is that
     if this happens, that object from the promisor remote is now also on the
     server, so you need to remove it if you don't want to keep it there.
   * Minh: it seems someone has to pack the object with the header and compute
     the git blob id for it, which is itself expensive
   * Christian: if the promisor remote is a regular git server, then yes, the
     objects will be compressed in git packfile format. But if it's a plain HTTP
     server and you access with a helper, it doesn't need to. But of course, if
     the objects are ever fetched by the main server, then it's in packfile or
     loose object format there.
