From: Taylor Blau <>
Subject: [TOPIC 4/12] Scaling Git from a forge's perspective
Date: Mon, 2 Oct 2023 11:19:06 -0400	[thread overview]
Message-ID: <ZRrfamSepdiQU9CH@nand.local> (raw)
In-Reply-To: <ZRregi3JJXFs4Msb@nand.local>

(Presenter: Taylor Blau, Notetaker: Karthik Nayak)

* Things on my mind!
* There's been a bunch of work from the forges over the last few years -
  bitmaps, commit-graphs, etc.
* Q: What should we do next? Curious to hear from everyone, including Keanen's.
* Boundary-based bitmap traversals: already spoke about this last year. It
  helps when you have lots of tips that you're excluding from the rev-list
  query. There's a backlog item to check the performance of this.
   * Patrick: We still haven't activated it in production. We faced some
     issues the last time it was activated, but we do plan to experiment
     with it.
   * Taylor: Curious about the impact.
   * In almost all cases they perform better; in some, equal; and in very
     few, worse.
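As a concrete reference point, the boundary-based traversal is controlled by
the `pack.useBitmapBoundaryTraversal` config (Git 2.42 and later); a minimal
sketch of the kind of "many excluded tips" query it targets, using an invented
demo repo:

```shell
# Build a toy repo with a bitmapped pack, then run the kind of
# "HEAD --not <tips>" query the boundary-based traversal speeds up.
# pack.useBitmapBoundaryTraversal is a real config knob; the repo,
# identity, and tag name below are invented for illustration.
git init -q demo && cd demo
git config user.email you@example.com && git config user.name You
git commit -q --allow-empty -m base
git tag old-tip
git commit -q --allow-empty -m newer
git repack -adb            # single pack plus a .bitmap
git -c pack.useBitmapBoundaryTraversal=true \
    rev-list --use-bitmap-index --count HEAD --not old-tip
# prints: 1
```

In a real forge workload the `--not` side would carry thousands of tips,
which is where the boundary-based walk pays off.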
* (Jonathan Nieder) Two open-ended questions:
   * Different forges run into the same problems. Maybe it's worth comparing
     notes. Do we have a good way to do this? In the Git Discord there is a
     server-operator channel, but it has only two messages.
      * Taylor and Patrick discuss this over email.
      * Keanen: We used to have a quarterly meeting, but attendance was low.
      * Opportunistically, when people want to compare notes, it currently
        happens in 1:1 conversations; there hasn't been a wider-group forum.
      * A monthly server-operator meeting might be fun to revive.
      * The Git Contributor Summit is where this generally happens. :)
      * Git contributor summit is where this generally happens. :)
   * At the last Git Merge, Stolee gave a talk about Git as a database and
     how, as a user, that framing can guide you in scaling. Could that be a
     roadmap for what a Git server could do automatically? For example,
     sharding by time: gc could automatically generate a pack for serving
     shallow clones of recent history.
      * Extending the cruft-pack implementation to more organically enforce
        a threshold on the number of bytes. The current scheme of rewriting
        the entire cruft pack might not be the best for big repositories.
      * Patrick: We currently have such a mechanism for geometric repacking.
* (Taylor Blau) Geometric repacking was done a number of years ago, to more
  gradually compress the repository from many packfiles down to few. We still
  have periodic cases where the repository is reduced to two packs: one cruft
  pack and one with the live objects. If you had some set of packs containing
  disjoint objects (no duplicates), could we extend verbatim pack reuse to
  work across these multiple packs? Has anyone had similar issues?
   * Jonathan: One problem is knowing whether a pack has a non-redundant
     reachable object without worrying about things like TTLs. In Git there
     is "push quarantine" code: if a hook rejects the push, its objects never
     get added to the repository. JGit has nothing similar yet, so someone
     could push a bunch of objects that get stored even though a pre-receive
     hook rejects them, and you can end up with packs full of unreachable
     objects. With history rewriting we also run into complexity in knowing
     which packs are "live".
      * Patrick: Deterministically pruning objects from the repository is
        hard to solve. In GitLab it's a problem that replicas of the
        repository contain objects which probably need to be deleted.
      * Jeff H: Can we classify refs such that some refs are transient and
        some are long-term?
         * Jeff King: There are a bunch of heuristic inputs that can help
           with this, e.g. older objects are less likely to change than
           newer ones.
         * Taylor: Order by recency, so older objects are in one bitmap and
           newer, changeable ones could be in one clump of bitmaps.
* Minh: I have a question about Taylor's proposal of a single pack composed of
  multiple disjoint packs. The midx can notice duplicate objects. Does that
  help with knowing what can be streamed through?
   * Taylor: The pack reuse code is a bit too naive at this point, but
     conceptually this would work. We already have tools for working with packs
     like this. But this does give more flexibility.
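For context, the midx machinery Minh refers to is driven by
`git multi-pack-index`; a minimal sketch over an invented repo with two
coexisting packs (identity and repo name are illustrative):

```shell
# Write and verify a multi-pack index across several packs. The midx
# records each object once even when it appears in multiple packs,
# which is the duplicate handling referred to above.
git init -q demo && cd demo
git config user.email you@example.com && git config user.name You
git commit -q --allow-empty -m one
git repack -d                  # first pack
git commit -q --allow-empty -m two
git repack -d                  # second pack (no -a, so both remain)
git multi-pack-index write
git multi-pack-index verify
```

A midx bitmap can additionally be layered on top with
`git multi-pack-index write --bitmap`.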
* Taylor: GitHub recently switched to merge-ort for test merges: tremendous
  improvements, but it sometimes creates a bunch of loose objects. We'd like
  an option to have merge-ort sidestep loose objects (write to fast-import,
  or write a pack).
   * Things slow down when writing to the filesystem so much.
   * Jonathan Tan: one thing we've discussed is having support in git for a pack
     handle representing a still-open pack file that you can append to and read
     from in the context of an operation.
   * Dscho: That sounds like the sanest thing to do. There's a long-standing
     invariant that you need an idx for a pack file to work with it
     efficiently, and building one requires the pack file to be closed. So
     there are some things to figure out there; I'm interested to follow it.
   * Junio: There was a patch sent to the list to restrict the streaming
     interface. I wonder if that moves in the opposite direction of what
     we're describing.
   * brian: In the SHA-256 work I noticed it currently works only on blobs.
     But I don't think adapting it to other object types would be a major
     departure. As long as we don't make the interop harder, I don't see a
     big problem with doing that. Conversion happens at pack-indexing time.
   * Elijah: Did I understand correctly that this produces a lot of cruft?
   * Dscho: Yes. We perform test merges and then no ref points to them.
   * Elijah: Nice. "git log --remerge-diff" similarly produces objects that
     don't need to be stored when it performs test merges; that code path is
     careful not to commit them to the object store. You might be able to reuse
     some of that code.
   * Dscho: Thanks! I'll take a look.

