git@vger.kernel.org mailing list mirror (one of many)
* Reducing CPU load on git server
@ 2016-08-28 19:42 W. David Jarvis
  2016-08-28 21:20 ` Jakub Narębski
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: W. David Jarvis @ 2016-08-28 19:42 UTC (permalink / raw)
  To: git

Hi all -

I've run into a problem that I'm looking for some help with. Let me
describe the situation, and then some thoughts.

The company I work for uses git. We use GitHub Enterprise as a
frontend for our primary git server. We're using Chef solo to manage a
fleet of upwards of 10,000 hosts, which regularly pull from our Chef
git repository so as to converge.

A few years ago, in order to reduce the load on the primary server,
the firm set up a fleet of replication servers. This allowed the
majority of our infrastructure to target the replication servers for
all pull-based activity rather than the primary git server.

The actual replication process works as follows:

1. The primary git server receives a push and sends a webhook with the
details of the push (repo, ref, sha, some metadata) to a "publisher"
box

2. The publisher enqueues the details of the webhook into a queue

3. A fleet of "subscriber" (replica) boxes each reads the payload of
the enqueued message. Each of these then tries to either clone the
repository if they don't already have it, or they run `git fetch`.

During the course of either of these operations we use a
repository-level lockfile. If another request comes in while the repo
is locked, we re-enqueue it. When that request comes in later, if the
push event time is earlier than the most recent successful fetch, we
don't do any further work.
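
For concreteness, each subscriber's handler is roughly equivalent to the
following shell sketch (the paths, the flock-based lock, and the
`requeue` helper are illustrative stand-ins for our actual code, and the
push-timestamp check is omitted):

  #!/bin/sh
  # invoked once per dequeued push event; $REPO comes from the payload
  repo_dir=/srv/replica/$REPO.git
  if [ ! -d "$repo_dir" ]; then
      git clone --mirror "git@primary:$REPO.git" "$repo_dir"
      exit 0
  fi
  # repository-level lock; if it is already held, re-enqueue and bail
  exec 9>"$repo_dir/replication.lock"
  if ! flock -n 9; then
      requeue "$REPO"   # hypothetical helper that re-enqueues the event
      exit 0
  fi
  cd "$repo_dir" && git fetch --prune origin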

---

The problem we've been running into has been that as we've continued
to add developers, the CPU load on our primary instance has gone
through the roof. We've upgraded the machine to some of the punchiest
hardware we can get and it's regularly exceeding 70% CPU load. We know
that the overwhelming driver of this CPU load is near-constant git
operations on some of our larger and more active repositories.

Migrating off of Chef solo is essentially a non-starter at this point
in time, and we've also added a considerable number of non-Chef git
dependencies such that getting rid of the replication service would be
a massive undertaking. Even if we were to kill the replication fleet,
we'd have to figure out where to point that traffic in such a way that
we didn't overwhelm the primary git server anyway.

So I'm looking for some help in figuring out what to do next. At the
moment, I have two thoughts:

1. We currently run a blanket `git fetch` rather than specifically
fetching the ref that was pushed. My understanding from poking around
the git source code is that this causes the replication server to send
a list of all of its ref tips to the primary server, and the primary
server then has to verify and compare each of these tips to the ref
tips residing on the server.

That might not be a ton of work for an individual fetch operation, but
some of our repositories have over 5,000 branches and receive on the
order of 1,000 pushes a day. Algorithmically, this suggests the cost of
serving fetches scales with both N (branches) and M (pushes per day),
i.e. roughly N*M, and both numbers grow as we hire developers. That
implies superlinear -- roughly quadratic -- growth in load as we add
engineers, which is...not good.

Also worth noting - if we assume that the replication servers are
reasonably up to date (see the lockfile logic described above), a
typical fetch should only need to pack objects for a single updated
ref, or at most a small number (<5).

My hypothesis is that moving to fetching the specific branch rather
than doing a blanket fetch would have a significant and material
impact on server load.
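
Concretely, instead of a blanket fetch the subscriber would run
something along these lines, with $BRANCH taken from the webhook
payload (refspec illustrative):

  git fetch origin "+refs/heads/$BRANCH:refs/heads/$BRANCH"

so that only the pushed ref gets negotiated and packed.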

If we do go down this route, we'll potentially need to do some
refactoring around how we handle "failed fetches", which relates both
to our locking logic and to the actual potential failure of a git
fetch. Discussion of the locking mechanism follows:

2. Our current locking mechanism seems to me to be unnecessary. I'm
aware that git uses a few different locking mechanisms internally, and
the use of a repo-level lockfile would seem to only guarantee that
we're using coarser-grained locking than git actually supports. But I
don't know if git supports specific fetch-level ref locking that would
permit concurrent ref fetch operations. If it does, our current
architecture would seem to prevent git from being able to take
advantage of those mechanisms.

In other words, let's imagine a world in which we ditch our current
repo-level locking mechanism entirely. Let's also presume we move to
fetching specific refs rather than using blanket fetches. Does that
mean that if a fetch for ref A and a fetch for ref B are issued at
roughly the exact same time, the two will be able to be executed at
once without running into some git-internal locking mechanism on a
granularity coarser than the ref? i.e. are fetch A and fetch B going
to be blocked on the other's completion in any way? (let's presume
that ref A and ref B are not parents of each other).

---

I'm neither a contributor nor an expert in git, so my inferences thus
far are based purely on what I would describe as "stumbling around
the source code like a drunken baby".

The ultimate goal for us is just figuring out how we can best reduce
the CPU load on the primary instance so that we don't find ourselves
in a situation where we're not able to run basic git operations
anymore.

If I'm barking up the wrong tree, or if there are other optimizations
we should be considering, I'd be eager to learn about those as well.
Of course, if what I'm describing sounds about right, I'd like
confirmation of that from some people who actually know what they're
talking about (i.e., not me :) ).

Thanks.

 - Venantius

-- 
============
venanti.us
203.918.2328
============


* Re: Reducing CPU load on git server
  2016-08-28 19:42 Reducing CPU load on git server W. David Jarvis
@ 2016-08-28 21:20 ` Jakub Narębski
  2016-08-28 23:18   ` W. David Jarvis
  2016-08-29  5:47 ` Jeff King
  2016-08-29 20:14 ` Reducing CPU load on git server Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 16+ messages in thread
From: Jakub Narębski @ 2016-08-28 21:20 UTC (permalink / raw)
  To: W. David Jarvis, git

On 28.08.2016 at 21:42, W. David Jarvis wrote:

> The ultimate goal for us is just figuring out how we can best reduce
> the CPU load on the primary instance so that we don't find ourselves
> in a situation where we're not able to run basic git operations
> anymore.

I assume that you have turned on pack bitmaps?  See for example
"Counting Objects" blog post on GitHub Engineering blog
http://githubengineering.com/counting-objects/

There are a few other articles there worth reading in your
situation.
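
For stock Git on a bare repository that would be roughly (GitHub
Enterprise may well already do the equivalent for you):

  git config repack.writeBitmaps true
  git repack -a -d   # bitmaps get written during the next full repack
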
-- 
Jakub Narębski


* Re: Reducing CPU load on git server
  2016-08-28 21:20 ` Jakub Narębski
@ 2016-08-28 23:18   ` W. David Jarvis
  0 siblings, 0 replies; 16+ messages in thread
From: W. David Jarvis @ 2016-08-28 23:18 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git

My assumption is that pack bitmaps are enabled since the primary
server is a GitHub Enterprise instance, but I'll have to confirm.

On Sun, Aug 28, 2016 at 2:20 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> On 28.08.2016 at 21:42, W. David Jarvis wrote:
>
>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.
>
> I assume that you have turned on pack bitmaps?  See for example
> "Counting Objects" blog post on GitHub Engineering blog
> http://githubengineering.com/counting-objects/
>
> There are a few other articles there worth reading in your
> situation.
> --
> Jakub Narębski



-- 
============
venanti.us
203.918.2328
============


* Re: Reducing CPU load on git server
  2016-08-28 19:42 Reducing CPU load on git server W. David Jarvis
  2016-08-28 21:20 ` Jakub Narębski
@ 2016-08-29  5:47 ` Jeff King
  2016-08-29 10:46   ` Jakub Narębski
  2016-08-29 19:16   ` W. David Jarvis
  2016-08-29 20:14 ` Reducing CPU load on git server Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 16+ messages in thread
From: Jeff King @ 2016-08-29  5:47 UTC (permalink / raw)
  To: W. David Jarvis; +Cc: git

On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:

> The actual replication process works as follows:
> 
> 1. The primary git server receives a push and sends a webhook with the
> details of the push (repo, ref, sha, some metadata) to a "publisher"
> box
> 
> 2. The publisher enqueues the details of the webhook into a queue
> 
> 3. A fleet of "subscriber" (replica) boxes each reads the payload of
> the enqueued message. Each of these then tries to either clone the
> repository if they don't already have it, or they run `git fetch`.

So your load is probably really spiky, as you get thundering herds of
fetchers after every push (the spikes may have a long flatline at the
top, as it takes time to process the whole herd).

> 1. We currently run a blanket `git fetch` rather than specifically
> fetching the ref that was pushed. My understanding from poking around
> the git source code is that this causes the replication server to send
> a list of all of its ref tips to the primary server, and the primary
> server then has to verify and compare each of these tips to the ref
> tips residing on the server.

Yes, though I'd be surprised if this negotiation is that expensive in
practice. In my experience it's not generally, and even if we ended up
traversing every commit in the repository, that's on the order of a few
seconds even for large, active repositories.

In my experience, the problem in a mass-fetch like this ends up being
pack-objects preparing the packfile. It has to do a similar traversal,
but _also_ look at all of the trees and blobs reachable from there, and
then search for likely delta-compression candidates.

Do you know which processes are generating the load? git-upload-pack
does the negotiation, and then pack-objects does the actual packing.
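
Something as simple as the following, run on the server while it is
busy, can give a rough breakdown (generic ps invocation, nothing
GHE-specific):

  ps -eo pcpu,etime,args --sort=-pcpu |
      grep -E 'upload-pack|pack-objects' | head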

> My hypothesis is that moving to fetching the specific branch rather
> than doing a blanket fetch would have a significant and material
> impact on server load.

Maybe. If pack-objects is where your load is coming from, then
counter-intuitively things sometimes get _worse_ as you fetch less. The
problem is that git will generally re-use deltas it has on disk when
sending to the clients. But if the clients are missing some of the
objects (because they don't fetch all of the branches), then we cannot
use those deltas and may need to recompute new ones. So you might see
some parts of the fetch get cheaper (negotiation, pack-object's
"Counting objects" phase), but "Compressing objects" gets more
expensive.

This is particularly noticeable with shallow fetches, which in my
experience are much more expensive to serve.

Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they
are enabled. But they won't really help here. They are essentially
cached information that git generates at repack time. But if we _just_
got a push, then the new objects to fetch won't be part of the cache,
and we'll fall back to traversing them as normal.  On the other hand,
this should be a relatively small bit of history to traverse, so I'd
doubt that "Counting objects" is that expensive in your case (but you
should be able to get a rough sense by watching the progress meter
during a fetch).

I'd suspect more that delta compression is expensive (we know we just
got some new objects, but we don't know if we can make good deltas
against the objects the client already has). That's a gut feeling,
though.

If the fetch is small, that _also_ shouldn't be too expensive. But
things add up when you have a large number of machines all making the
same request at once. So it's entirely possible that the machine just
gets hit with a lot of 5s CPU tasks all at once. If you only have a
couple cores, that takes many multiples of 5s to clear out.

There's nothing in upstream git to help smooth these loads, but since
you mentioned GitHub Enterprise, I happen to know that it does have a
system for coalescing multiple fetches into a single pack-objects. I
_think_ it's in GHE 2.5, so you might check which version you're
running (and possibly also talk to GitHub Support, who might have more
advice; there are also tools for finding out which git processes are
generating the most load, etc).

> In other words, let's imagine a world in which we ditch our current
> repo-level locking mechanism entirely. Let's also presume we move to
> fetching specific refs rather than using blanket fetches. Does that
> mean that if a fetch for ref A and a fetch for ref B are issued at
> roughly the exact same time, the two will be able to be executed at
> once without running into some git-internal locking mechanism on a
> granularity coarser than the ref? i.e. are fetch A and fetch B going
> to be blocked on the other's completion in any way? (let's presume
> that ref A and ref B are not parents of each other).

Generally no, they should not conflict. Writes into the object database
can happen simultaneously. Ref updates take a per-ref lock, so you
should generally be able to write two unrelated refs at once. The big
exception is that ref deletion required taking a repo-wide lock, but
that presumably wouldn't be a problem for your case.

I'm still not convinced that the single-ref fetching will really help,
though.

> The ultimate goal for us is just figuring out how we can best reduce
> the CPU load on the primary instance so that we don't find ourselves
> in a situation where we're not able to run basic git operations
> anymore.

I suspect there's room for improvement and tuning of the primary. But
barring that, one option would be to have a hierarchy of replicas. Have
"k" first-tier replicas fetch from the primary, then "k" second-tier
replicas fetch from them, and so on. Trade propagation delay for
distributing the load. :)

-Peff


* Re: Reducing CPU load on git server
  2016-08-29  5:47 ` Jeff King
@ 2016-08-29 10:46   ` Jakub Narębski
  2016-08-29 17:18     ` Jeff King
  2016-08-29 19:16   ` W. David Jarvis
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Narębski @ 2016-08-29 10:46 UTC (permalink / raw)
  To: git; +Cc: git

On 29.08.2016 at 07:47, Jeff King wrote:
> On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:
> 
>> The actual replication process works as follows:
>>
>> 1. The primary git server receives a push and sends a webhook with the
>> details of the push (repo, ref, sha, some metadata) to a "publisher"
>> box
>>
>> 2. The publisher enqueues the details of the webhook into a queue
>>
>> 3. A fleet of "subscriber" (replica) boxes each reads the payload of
>> the enqueued message. Each of these then tries to either clone the
>> repository if they don't already have it, or they run `git fetch`.
> 
> So your load is probably really spiky, as you get thundering herds of
> fetchers after every push (the spikes may have a long flatline at the
> top, as it takes time to process the whole herd).

One solution I have heard about, in the context of web caching, to reduce
the thundering herd problem (there caused by caches expiring at the same
time in many clients) is to add some random or quasi-random jitter
to the expiration time.  In your situation adding a random delay with some
specified deviation could help.
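
In the simplest form that could be just a randomized sleep in the
subscriber (assuming a bash script) before it fetches, e.g.:

  sleep $(( RANDOM % 60 ))   # spread the herd over up to a minute
  git fetch --prune origin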

Note however that it is, I think, incompatible (to some extent) with
"caching" solution, where the 'thundering herd' get served the same
packfile.  Or at least one solution can reduce the positive effect
of the other.

>> 1. We currently run a blanket `git fetch` rather than specifically
>> fetching the ref that was pushed. My understanding from poking around
>> the git source code is that this causes the replication server to send
>> a list of all of its ref tips to the primary server, and the primary
>> server then has to verify and compare each of these tips to the ref
>> tips residing on the server.

[...]
> There's nothing in upstream git to help smooth these loads, but since
> you mentioned GitHub Enterprise, I happen to know that it does have a
> system for coalescing multiple fetches into a single pack-objects. I
> _think_ it's in GHE 2.5, so you might check which version you're
> running (and possibly also talk to GitHub Support, who might have more
> advice; there are also tools for finding out which git processes are
> generating the most load, etc).

I wonder if this system for coalescing multiple fetches is something
generic, or is it something specific to GitHub / GitHub Enterprise
architecture?  If it is the former, would it be considered for
upstreaming, and if so, when would it be in Git itself?


One thing to note: if you have repositories which are to have the
same contents, you can distribute the pack-file to them and update
references without going through Git.  It can be done on push
(push to master, distribute to mirrors), or as part of fetch
(master fetches from central repository, distributes to mirrors).
I think; I have never managed large set of replicated Git repositories.

If mirrors can get out of sync, you would need to ensure that the
repository doing the actual fetch / receiving the actual push is
the least common denominator, i.e. that it looks like it lags behind
all the other mirrors in the set.  There is no problem if a repository
gets a packfile with more objects than it needs.

>> In other words, let's imagine a world in which we ditch our current
>> repo-level locking mechanism entirely. Let's also presume we move to
>> fetching specific refs rather than using blanket fetches. Does that
>> mean that if a fetch for ref A and a fetch for ref B are issued at
>> roughly the exact same time, the two will be able to be executed at
>> once without running into some git-internal locking mechanism on a
>> granularity coarser than the ref? i.e. are fetch A and fetch B going
>> to be blocked on the other's completion in any way? (let's presume
>> that ref A and ref B are not parents of each other).
> 
> Generally no, they should not conflict. Writes into the object database
> can happen simultaneously. Ref updates take a per-ref lock, so you
> should generally be able to write two unrelated refs at once. The big
> exception is that ref deletion required taking a repo-wide lock, but
> that presumably wouldn't be a problem for your case.

Doesn't Git avoid taking locks, and use lockless synchronization
mechanisms (though possibly equivalent to locks)?  I think it takes
a lockfile to update the reflog together with the reference, but if
reflogs are turned off (and I think they are off for bare repositories
by default), a ref update uses an "atomic file write" (write + rename)
and a compare-and-swap primitive.  Updating a repository is lock-free:
first update the object database, then the reference.
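
The user-visible form of that compare-and-swap is, I believe, the
three-argument update-ref call, which fails if the ref no longer
points at the expected old value ($OLD_SHA/$NEW_SHA are placeholders):

  git update-ref refs/heads/topic $NEW_SHA $OLD_SHA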

That said, it might be that the per-repository global lock that you
use is beneficial, limiting the amount of concurrent access; or
it could be detrimental, with global-lock contention being the cause
of stalls and latency.

>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.
> 
> I suspect there's room for improvement and tuning of the primary. But
> barring that, one option would be to have a hierarchy of replicas. Have
> "k" first-tier replicas fetch from the primary, then "k" second-tier
> replicas fetch from them, and so on. Trade propagation delay for
> distributing the load. :)

I guess that trying to replicate the DGit approach that GitHub uses, see
"Introducing DGit" (http://githubengineering.com/introducing-dgit),
is currently out of the question?

I wonder if you can propagate pushes, instead of fetches...

Best regards,
-- 
Jakub Narębski
 




* Re: Reducing CPU load on git server
  2016-08-29 10:46   ` Jakub Narębski
@ 2016-08-29 17:18     ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2016-08-29 17:18 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: W. David Jarvis, git

On Mon, Aug 29, 2016 at 12:46:27PM +0200, Jakub Narębski wrote:

> > So your load is probably really spiky, as you get thundering herds of
> > fetchers after every push (the spikes may have a long flatline at the
> > top, as it takes time to process the whole herd).
> 
> One solution I have heard about, in the context of web caching, to reduce
> the thundering herd problem (there caused by caches expiring at the same
> time in many clients) is to add some random or quasi-random jitter
> to the expiration time.  In your situation adding a random delay with some
> specified deviation could help.

That smooths the spikes, but you still have to serve all of the requests
eventually. So if your problem is that the load spikes and the system
slows to a crawl as a result (or runs out of RAM, etc), then
distributing the load helps. But if you have enough load that your
system is constantly busy, queueing the load in a different order just
shifts it around.

GHE will also introduce delays into starting git when load spikes, but
that's a separate system from the one that coalesces identical requests.

> I wonder if this system for coalescing multiple fetches is something
> generic, or is it something specific to GitHub / GitHub Enterprise
> architecture?  If it is the former, would it be considered for
> upstreaming, and if so, when would it be in Git itself?

I've already sent upstream the patch for a "hook" that sits between
upload-pack and pack-objects (and it will be in v2.10). So that can call
an arbitrary script which can then make scheduling policy for
pack-objects, coalesce similar requests, etc.
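
Concretely, on the serving side that looks something like pointing
uploadpack.packObjectsHook at a wrapper (it has to live in the system
or global config; repo-level config is ignored for safety), e.g.:

  git config --system uploadpack.packObjectsHook /usr/local/bin/pack-objects-wrap

The wrapper gets the full "git pack-objects ..." command line as its
arguments, so a trivial (illustrative) version is just:

  #!/bin/sh
  # /usr/local/bin/pack-objects-wrap: deprioritize, coalesce, etc. here
  exec nice -n 10 "$@"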

GHE has a generic tool for coalescing program invocations that is not
Git-specific at all (it compares its stdin and command line arguments to
decide when two requests are identical, runs the command on its
arguments, and then passes the output to all members of the herd). That
_might_ be open-sourced in the future, but I don't have a specific
timeline.

> One thing to note: if you have repositories which are to have the
> same contents, you can distribute the pack-file to them and update
> references without going through Git.  It can be done on push
> (push to master, distribute to mirrors), or as part of fetch
> (master fetches from central repository, distributes to mirrors).
> I think; I have never managed large set of replicated Git repositories.

Doing it naively has some gotchas, because you want to make sure you
have all of the necessary objects. But if you are going this route,
distributing a git-bundle is probably the simplest way.
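
Roughly (a full bundle shown for simplicity; incremental bundles need
a basis):

  # on the one box that talks to the primary:
  git bundle create /tmp/repo.bundle --all
  # ship repo.bundle around however you like, then on each mirror:
  git bundle verify /tmp/repo.bundle
  git fetch /tmp/repo.bundle '+refs/heads/*:refs/heads/*' \
      '+refs/tags/*:refs/tags/*'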

> > Generally no, they should not conflict. Writes into the object database
> > can happen simultaneously. Ref updates take a per-ref lock, so you
> > should generally be able to write two unrelated refs at once. The big
> > exception is that ref deletion required taking a repo-wide lock, but
> > that presumably wouldn't be a problem for your case.
> 
> Doesn't Git avoid taking locks, and use lockless synchronization
> mechanisms (though possibly equivalent to locks)?  I think it takes
> a lockfile to update the reflog together with the reference, but if
> reflogs are turned off (and I think they are off for bare repositories
> by default), a ref update uses an "atomic file write" (write + rename)
> and a compare-and-swap primitive.  Updating a repository is lock-free:
> first update the object database, then the reference.

There is a lockfile to make the compare-and-swap atomic, but yes, it's
fundamentally based around the compare-and-swap. I don't think that
matters to the end user though. Fundamentally they will see "I hoped to
move from X to Y, but somebody else wrote Z, aborting", which is the
same as "I did not win the lock race, aborting".

The point is that updating two different refs is generally independent,
and updating the same ref is not.

> I guess that trying to replicate the DGit approach that GitHub uses, see
> "Introducing DGit" (http://githubengineering.com/introducing-dgit),
> is currently out of the question?

Minor nitpick (that you don't even have any way of knowing about, so
maybe more of a public service announcement). GitHub will stop using the
"DGit" name because it's too confusingly similar to "Git" (and "Git" is
trademarked by the project). There's a new blog post coming that
mentions the name change, and that historic one will have a note added
retroactively. The new name is "GitHub Spokes" (get it, Hub, Spokes?).

But in response to your question, I'll caution that replicating it is a
lot of work. :)

Since the original problem report mentions GHE, I'll note that newer
versions of GHE do support clustering and can share the git load across
multiple Spokes servers. So in theory that could make the replica layer
go away entirely, because it all happens behind the scenes.

-Peff

PS Sorry, I generally try to avoid hawking GitHub wares on the list, but
   since the OP mentioned GHE specifically, and because there aren't
   really generic solutions to most of these things, I do think it's a
   viable path for a solution for him.


* Re: Reducing CPU load on git server
  2016-08-29  5:47 ` Jeff King
  2016-08-29 10:46   ` Jakub Narębski
@ 2016-08-29 19:16   ` W. David Jarvis
  2016-08-29 21:31     ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: W. David Jarvis @ 2016-08-29 19:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> So your load is probably really spiky, as you get thundering herds of
> fetchers after every push (the spikes may have a long flatline at the
> top, as it takes time to process the whole herd).

It is quite spiky, yes. At the moment, however, the replication fleet
is relatively small (just 4 machines). We had 6
machines earlier this month and we had hoped that terminating two of
them would lead to a material drop in CPU usage, but we didn't see a
really significant reduction.

> Yes, though I'd be surprised if this negotiation is that expensive in
> practice. In my experience it's not generally, and even if we ended up
> traversing every commit in the repository, that's on the order of a few
> seconds even for large, active repositories.
>
> In my experience, the problem in a mass-fetch like this ends up being
> pack-objects preparing the packfile. It has to do a similar traversal,
> but _also_ look at all of the trees and blobs reachable from there, and
> then search for likely delta-compression candidates.
>
> Do you know which processes are generating the load? git-upload-pack
> does the negotiation, and then pack-objects does the actual packing.

When I look at expensive operations (ones that I can see consuming
90%+ of a CPU for more than a second), there are often pack-objects
processes running that will consume an entire core for multiple
seconds (I also saw one pack-object counting process run for several
minutes while using up a full core). rev-list shows up as a pretty
active CPU consumer, as do prune and blame-tree.

I'd say overall that in terms of high-CPU consumption activities,
`prune` and `rev-list` show up the most frequently.

On the subject of prune - I forgot to mention that the `git fetch`
calls from the subscribers are running `git fetch --prune`. I'm not
sure if that changes the projected load profile.

> Maybe. If pack-objects is where your load is coming from, then
> counter-intuitively things sometimes get _worse_ as you fetch less. The
> problem is that git will generally re-use deltas it has on disk when
> sending to the clients. But if the clients are missing some of the
> objects (because they don't fetch all of the branches), then we cannot
> use those deltas and may need to recompute new ones. So you might see
> some parts of the fetch get cheaper (negotiation, pack-object's
> "Counting objects" phase), but "Compressing objects" gets more
> expensive.

I might be misunderstanding this, but if the subscriber is already "up
to date" modulo a single updated ref tip, then this problem shouldn't
occur, right? Concretely: if ref A is built off of ref B, and the
subscriber already has B when it requests A, that shouldn't cause this
behavior, but it would cause this behavior if the subscriber didn't
have B when it requested A.

> This is particularly noticeable with shallow fetches, which in my
> experience are much more expensive to serve.

I don't think we're doing shallow fetches anywhere in this system.

> Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they
> are enabled. But they won't really help here. They are essentially
> cached information that git generates at repack time. But if we _just_
> got a push, then the new objects to fetch won't be part of the cache,
> and we'll fall back to traversing them as normal.  On the other hand,
> this should be a relatively small bit of history to traverse, so I'd
> doubt that "Counting objects" is that expensive in your case (but you
> should be able to get a rough sense by watching the progress meter
> during a fetch).

See comment above about a long-running counting objects process. I
couldn't tell which of our repositories it was counting, but it went
for about 3 minutes with full core utilization. I didn't see us
counting pack-objects frequently but it's an expensive operation.

> I'd suspect more that delta compression is expensive (we know we just
> got some new objects, but we don't know if we can make good deltas
> against the objects the client already has). That's a gut feeling,
> though.
>
> If the fetch is small, that _also_ shouldn't be too expensive. But
> things add up when you have a large number of machines all making the
> same request at once. So it's entirely possible that the machine just
> gets hit with a lot of 5s CPU tasks all at once. If you only have a
> couple cores, that takes many multiples of 5s to clear out.

I think this would show up if I was sitting around running `top` on
the machine, but that doesn't end up being what I see. That might just
be a function of there being a relatively small number of replication
machines, I'm not sure. But I'm not noticing 4 of the same tasks get
spawned simultaneously, which says to me that we're either utilizing a
cache or there's some locking behavior involved.

> There's nothing in upstream git to help smooth these loads, but since
> you mentioned GitHub Enterprise, I happen to know that it does have a
> system for coalescing multiple fetches into a single pack-objects. I
> _think_ it's in GHE 2.5, so you might check which version you're
> running (and possibly also talk to GitHub Support, who might have more
> advice; there are also tools for finding out which git processes are
> generating the most load, etc).

We're on 2.6.4 at the moment.

> I suspect there's room for improvement and tuning of the primary. But
> barring that, one option would be to have a hierarchy of replicas. Have
> "k" first-tier replicas fetch from the primary, then "k" second-tier
> replicas fetch from them, and so on. Trade propagation delay for
> distributing the load. :)

Yep, this was the first thought we had. I started fleshing out the
architecture for it but wasn't sure if the majority of the CPU load
was being triggered by the first request to make it in, in which case
moving to a multi-tiered architecture wouldn't help us. When we didn't
notice a material reduction in CPU load after reducing the replication
fleet by 1/3, I started to worry about this and stopped moving forward
on the multi-tiered architecture until I understood the behavior
better.

 - V


* Re: Reducing CPU load on git server
  2016-08-28 19:42 Reducing CPU load on git server W. David Jarvis
  2016-08-28 21:20 ` Jakub Narębski
  2016-08-29  5:47 ` Jeff King
@ 2016-08-29 20:14 ` Ævar Arnfjörð Bjarmason
  2016-08-29 20:57   ` W. David Jarvis
  2 siblings, 1 reply; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-08-29 20:14 UTC (permalink / raw)
  To: W. David Jarvis; +Cc: Git

On Sun, Aug 28, 2016 at 9:42 PM, W. David Jarvis
<william.d.jarvis@gmail.com> wrote:
> I've run into a problem that I'm looking for some help with. Let me
> describe the situation, and then some thoughts.

Just a few points that you may not have considered, and I didn't see
mentioned in this thread:

 * Consider having that queue of yours just send the pushed payload
instead of "pull this", see git-bundle. This can turn this sync entire
thing into a static file distribution problem.

 * It's not clear from your post why you have to worry about all these
branches, surely your Chef instances just need the "master" branch,
just push that around.

 * If you do need branches consider archiving stale tags/branches
after some time. I implemented this where I work, we just have a
$REPO-archive.git with every tag/branch ever created for a given
$REPO.git, and delete refs after a certain time (rough sketch after
this list).

 * If your problem is that you're CPU bound on the master have you
considered maybe solving this with something like NFS, i.e. replace
your ad-hoc replication with just a bunch of "slave" boxes that mount
the remote filesystem.

 * Or, if you're willing to deal with occasional transitory repo
corruption (just retry?): rsync.

 * There's no reason why your replication chain needs to be
single-level if master CPU is really the issue. You could have master
-> N slaves -> N*X slaves, or some combination thereof.

 * Does it really even matter that your "slave" machines are all
up-to-date? We have something similar at work but it's just a minutely
cronjob that does "git fetch" on some repos, since the downstream
thing (e.g. the chef run) doesn't run more than once every 30m or
whatever anyway.
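
A rough sketch of the archive-then-delete idea from the third point
above, run in the bare $REPO.git on the server (the archive path, the
90-day cutoff and the lack of a branch whitelist are illustrative; our
actual implementation differs in the details):

  git push /repos/$REPO-archive.git '+refs/heads/*:refs/heads/*'
  git push /repos/$REPO-archive.git '+refs/tags/*:refs/tags/*'
  cutoff=$(( $(date +%s) - 90*86400 ))
  git for-each-ref --format='%(committerdate:raw) %(refname)' refs/heads/ |
  while read ts tz ref; do
      # you would want to exclude master and other protected branches
      [ "$ts" -lt "$cutoff" ] && git update-ref -d "$ref"
  done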


* Re: Reducing CPU load on git server
  2016-08-29 20:14 ` Reducing CPU load on git server Ævar Arnfjörð Bjarmason
@ 2016-08-29 20:57   ` W. David Jarvis
  2016-08-29 21:31     ` Dennis Kaarsemaker
  0 siblings, 1 reply; 16+ messages in thread
From: W. David Jarvis @ 2016-08-29 20:57 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git

>  * Consider having that queue of yours just send the pushed payload
> instead of "pull this", see git-bundle. This can turn this sync entire
> thing into a static file distribution problem.

As far as I know, GHE doesn't support this out of the box. We've asked
them for essentially this, though. Due to the nature of our license we
may not be able to configure something like this on the server
instance ourselves.

>  * It's not clear from your post why you have to worry about all these
> branches, surely your Chef instances just need the "master" branch,
> just push that around.

We allow deployments from non-master branches, so we do need multiple
branches. We also use the replication fleet as the target for our
build system, which needs to be able to build essentially any branch
on any repository.

>  * If you do need branches consider archiving stale tags/branches
> after some time. I implemented this where I work, we just have a
> $REPO-archive.git with every tag/branch ever created for a given
> $REPO.git, and delete refs after a certain time.

This is something else that we're actively considering. Why did your
company implement this -- was it to reduce load, or just to clean up
your repositories? Did you notice any change in server load?

>  * If your problem is that you're CPU bound on the master have you
> considered maybe solving this with something like NFS, i.e. replace
> your ad-hoc replication with just a bunch of "slave" boxes that mount
> the remote filesystem.

This is definitely an interesting idea. It'd be a significant
architectural change, though, and not one I'm sure we'd be able to get
support for.

>  * Or, if you're willing to deal with occasional transitory repo
> corruption (just retry?): rsync.

I think this is a cost we're not necessarily okay with having to deal with.

>  * There's no reason why your replication chain needs to be
> single-level if master CPU is really the issue. You could have master
> -> N slaves -> N*X slaves, or some combination thereof.

This was discussed above - if the primary driver of load is the first
fetch, then moving to a multi-tiered architecture will not solve our
problems.

>  * Does it really even matter that your "slave" machines are all
> up-to-date? We have something similar at work but it's just a minutely
> cronjob that does "git fetch" on some repos, since the downstream
> thing (e.g. the chef run) doesn't run more than once every 30m or
> whatever anyway.

It does, because we use the replication fleet for our build server.

 - V

-- 
============
venanti.us
203.918.2328
============


* Re: Reducing CPU load on git server
  2016-08-29 19:16   ` W. David Jarvis
@ 2016-08-29 21:31     ` Jeff King
  2016-08-29 22:41       ` W. David Jarvis
  2016-08-30 10:46       ` git blame <directory> [was: Reducing CPU load on git server] Jakub Narębski
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff King @ 2016-08-29 21:31 UTC (permalink / raw)
  To: W. David Jarvis; +Cc: git

On Mon, Aug 29, 2016 at 12:16:20PM -0700, W. David Jarvis wrote:

> > Do you know which processes are generating the load? git-upload-pack
> > does the negotiation, and then pack-objects does the actual packing.
> 
> When I look at expensive operations (ones that I can see consuming
> 90%+ of a CPU for more than a second), there are often pack-objects
> processes running that will consume an entire core for multiple
> seconds (I also saw one pack-object counting process run for several
> minutes while using up a full core).

Pegging CPU for a few seconds doesn't sound out-of-place for
pack-objects serving a fetch or clone on a large repository. And I can
certainly believe "minutes", especially if it was not serving a fetch,
but doing repository maintenance on a large repository.

Talk to GitHub Enterprise support folks about what kind of process
monitoring and accounting is available. Recent versions of GHE can
easily tell things like which repositories and processes are using the
most CPU, RAM, I/O, and network, which ones are seeing a lot of
parallelism, etc.

> rev-list shows up as a pretty active CPU consumer, as do prune and
> blame-tree.
> 
> I'd say overall that in terms of high-CPU consumption activities,
> `prune` and `rev-list` show up the most frequently.

None of those operations is triggered by client fetches. You'll see
"rev-list" for a variety of operations, so that's hard to pinpoint. But
I'm surprised that "prune" is a common one for you. It is run as part of
the post-push, but I happen to know that the version that ships on GHE
is optimized to use bitmaps, and to avoid doing any work when there are
no loose objects that need pruning in the first place.

Blame-tree is a GitHub-specific command (it feeds the main repository
view page), and is a known CPU hog. There's more clever caching for that
coming down the pipe, but it's not shipped yet.

> On the subject of prune - I forgot to mention that the `git fetch`
> calls from the subscribers are running `git fetch --prune`. I'm not
> sure if that changes the projected load profile.

That shouldn't change anything; the pruning is purely a client side
thing.

> > Maybe. If pack-objects is where your load is coming from, then
> > counter-intuitively things sometimes get _worse_ as you fetch less. The
> > problem is that git will generally re-use deltas it has on disk when
> > sending to the clients. But if the clients are missing some of the
> > objects (because they don't fetch all of the branches), then we cannot
> > use those deltas and may need to recompute new ones. So you might see
> > some parts of the fetch get cheaper (negotiation, pack-object's
> > "Counting objects" phase), but "Compressing objects" gets more
> > expensive.
> 
> I might be misunderstanding this, but if the subscriber is already "up
> to date" modulo a single updated ref tip, then this problem shouldn't
> occur, right? Concretely: if ref A is built off of ref B, and the
> subscriber already has B when it requests A, that shouldn't cause this
> behavior, but it would cause this behavior if the subscriber didn't
> have B when it requested A.

Correct. So this shouldn't be a thing you are running into now, but it's
something that might be made worse if you switch to fetching only single
refs.

> See comment above about a long-running counting objects process. I
> couldn't tell which of our repositories it was counting, but it went
> for about 3 minutes with full core utilization. I didn't see us
> counting pack-objects frequently but it's an expensive operation.

That really sounds like repository maintenance. Repacks of
torvalds/linux (including all of its forks) on github.com take ~15
minutes of CPU. There may be some optimization opportunities there (I
have a few things I'd like to explore in the next few months), but most
of it is pretty fundamental. It literally takes a few minutes just to
walk the entire object graph for that repo (that's one of the more
extreme cases, of course, but presumably you are hosting some large
repositories).

Maintenance like that should be a very occasional operation, but it's
possible that you have a very busy repo.

> > There's nothing in upstream git to help smooth these loads, but since
> > you mentioned GitHub Enterprise, I happen to know that it does have a
> > system for coalescing multiple fetches into a single pack-objects. I
> > _think_ it's in GHE 2.5, so you might check which version you're
> > running (and possibly also talk to GitHub Support, who might have more
> > advice; there are also tools for finding out which git processes are
> > generating the most load, etc).
> 
> We're on 2.6.4 at the moment.

OK, I double-checked, and your version should be coalescing identical
fetches.

Given that, and that a lot of the load you mentioned above is coming
from non-fetch sources, it sounds like switching anything with your
replica fetch strategy isn't likely to help much. And a multi-tiered
architecture won't help if the load is being generated by requests that
are serving the web-views directly on the box.

I'd really encourage you to talk with GitHub Support about performance
and clustering. It sounds like there may be some GitHub-specific things
to tweak. And it may be that the load is just too much for a single
machine, and would benefit from spreading the load across multiple git
servers.

-Peff


* Re: Reducing CPU load on git server
  2016-08-29 20:57   ` W. David Jarvis
@ 2016-08-29 21:31     ` Dennis Kaarsemaker
  0 siblings, 0 replies; 16+ messages in thread
From: Dennis Kaarsemaker @ 2016-08-29 21:31 UTC (permalink / raw)
  To: W. David Jarvis, Ævar Arnfjörð Bjarmason; +Cc: Git

On Mon, 2016-08-29 at 13:57 -0700, W. David Jarvis wrote:

> >  * If you do need branches consider archiving stale tags/branches
> > after some time. I implemented this where I work, we just have a
> > $REPO-archive.git with every tag/branch ever created for a given
> > $REPO.git, and delete refs after a certain time.
> 
> This is something else that we're actively considering. Why did your
> company implement this -- was it to reduce load, or just to clean up
> your repositories? Did you notice any change in server load?

At 50k refs, ref negotiation gets expensive, especially over http

D. 


* Re: Reducing CPU load on git server
  2016-08-29 21:31     ` Jeff King
@ 2016-08-29 22:41       ` W. David Jarvis
  2016-08-31  6:02         ` Jeff King
  2016-08-30 10:46       ` git blame <directory> [was: Reducing CPU load on git server] Jakub Narębski
  1 sibling, 1 reply; 16+ messages in thread
From: W. David Jarvis @ 2016-08-29 22:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> Pegging CPU for a few seconds doesn't sound out-of-place for
> pack-objects serving a fetch or clone on a large repository. And I can
> certainly believe "minutes", especially if it was not serving a fetch,
> but doing repository maintenance on a large repository.
>
> Talk to GitHub Enterprise support folks about what kind of process
> monitoring and accounting is available. Recent versions of GHE can
> easily tell things like which repositories and processes are using the
> most CPU, RAM, I/O, and network, which ones are seeing a lot of
> parallelism, etc.

We have an open support thread with the GHE support folks, but the
only feedback we've gotten so far on this subject is that they believe
our CPU load is being driven by the quantity of fetch operations (an
average of 16 fetch requests per second during normal business hours,
so 4 requests per second per subscriber box). About 4,000 fetch
requests on our main repository per day.

> None of those operations is triggered by client fetches. You'll see
> "rev-list" for a variety of operations, so that's hard to pinpoint. But
> I'm surprised that "prune" is a common one for you. It is run as part of
> the post-push, but I happen to know that the version that ships on GHE
> is optimized to use bitmaps, and to avoid doing any work when there are
> no loose objects that need pruning in the first place.

Would regular force-pushing trigger prune operations? Our engineering
body loves to force-push.

>> I might be misunderstanding this, but if the subscriber is already "up
>> to date" modulo a single updated ref tip, then this problem shouldn't
>> occur, right? Concretely: if ref A is built off of ref B, and the
>> subscriber already has B when it requests A, that shouldn't cause this
>> behavior, but it would cause this behavior if the subscriber didn't
>> have B when it requested A.
>
> Correct. So this shouldn't be a thing you are running into now, but it's
> something that might be made worse if you switch to fetching only single
> refs.

But in theory if we were always up-to-date (since we'd always fetch
any updated ref) we wouldn't run into this problem? We could have a
cron job to ensure that we run a full git fetch every once in a while
but I'd hope that if this was written properly we'd almost always have
the most recent commit for any dependency ref.
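
Something like an hourly crontab entry on each subscriber would do as
that safety net (path illustrative), on top of the event-driven
single-ref fetches:

  0 * * * *  cd /srv/replica/main.git && git fetch --prune origin >/dev/null 2>&1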

> That really sounds like repository maintenance. Repacks of
> torvalds/linux (including all of its forks) on github.com take ~15
> minutes of CPU. There may be some optimization opportunities there (I
> have a few things I'd like to explore in the next few months), but most
> of it is pretty fundamental. It literally takes a few minutes just to
> walk the entire object graph for that repo (that's one of the more
> extreme cases, of course, but presumably you are hosting some large
> repositories).
>
> Maintenance like that should be a very occasional operation, but it's
> possible that you have a very busy repo.

Our primary repository is fairly busy. It has about 1/3 the commits of
Linux and about 1/3 the refs, but seems otherwise to be on the same
scale. And, of course, it both hasn't been around for as long as Linux
has and has been experiencing exponential growth, which means its
current activity is higher than it has ever been before -- might put
it on a similar scale to Linux's current activity.

> OK, I double-checked, and your version should be coalescing identical
> fetches.
>
> Given that, and that a lot of the load you mentioned above is coming
> from non-fetch sources, it sounds like switching anything with your
> replica fetch strategy isn't likely to help much. And a multi-tiered
> architecture won't help if the load is being generated by requests that
> are serving the web-views directly on the box.
>
> I'd really encourage you to talk with GitHub Support about performance
> and clustering. It sounds like there may be some GitHub-specific things
> to tweak. And it may be that the load is just too much for a single
> machine, and would benefit from spreading the load across multiple git
> servers.

What surprised us is that we had been running this on an r3.4xlarge
(16vCPU) on AWS for two years without too much issue. Then in a span
of months we started experiencing massive CPU load, which forced us to
upgrade the box to one with 32 vCPU (and better CPUs). We just don't
understand what the precise driver of load is here.

As noted above, we are talking with GitHub about performance -- we've
also been pushing them to start working on a clustering plan, but my
impression has been that they're reluctant to go down that path. I
suspect that we use GHE much more aggressively than the typical GHE
client, but I could be wrong about that.

 - V

-- 
============
venanti.us
203.918.2328
============


* git blame <directory> [was: Reducing CPU load on git server]
  2016-08-29 21:31     ` Jeff King
  2016-08-29 22:41       ` W. David Jarvis
@ 2016-08-30 10:46       ` Jakub Narębski
  2016-08-31  5:42         ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Narębski @ 2016-08-30 10:46 UTC (permalink / raw)
  To: Jeff King, W. David Jarvis; +Cc: git

On 29.08.2016 at 23:31, Jeff King wrote:

> Blame-tree is a GitHub-specific command (it feeds the main repository
> view page), and is a known CPU hog. There's more clever caching for that
> coming down the pipe, but it's not shipped yet.

I wonder if having support for 'git blame <directory>' in Git core would
be something interesting to Git users.  I once tried to implement it,
but it went nowhere.  Would it be hard to implement?


I guess that GitHub offering this view as a part of the home page view
for a repository is something of a historical artifact; namely, web
interfaces for other version control systems used this.  But I
wonder how useful it is.  Neither cgit nor gitweb offers such a view,
and both GitLab and Bitbucket start with a summary page with no blame-tree.

-- 
Jakub Narębski



* Re: git blame <directory> [was: Reducing CPU load on git server]
  2016-08-30 10:46       ` git blame <directory> [was: Reducing CPU load on git server] Jakub Narębski
@ 2016-08-31  5:42         ` Jeff King
  2016-08-31  7:28           ` Dennis Kaarsemaker
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-08-31  5:42 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Dennis Kaarsemaker, git

On Tue, Aug 30, 2016 at 12:46:20PM +0200, Jakub Narębski wrote:

> On 29.08.2016 at 23:31, Jeff King wrote:
> 
> > Blame-tree is a GitHub-specific command (it feeds the main repository
> > view page), and is a known CPU hog. There's more clever caching for that
> > coming down the pipe, but it's not shipped yet.
> 
> I wonder if having support for 'git blame <directory>' in Git core would
> be something interesting to Git users.  I once tried to implement it,
> but it went nowhere.  Would it be hard to implement?

I think there's some interest; I have received a few off-list emails
over the years about it. There was some preliminary discussion long ago:

  http://public-inbox.org/git/20110302164031.GA18233@sigill.intra.peff.net/

The code that runs on GitHub is available in my fork of git. I haven't
submitted it upstream because there are some lingering issues. I
mentioned them on-list in the first few items of:

  http://public-inbox.org/git/20130318121243.GC14789@sigill.intra.peff.net/

That code is in the jk/blame-tree branch of https://github.com/peff/git
if you are interested in addressing them (note that I haven't touched
that code in a few years except for rebasing it forward, so it may have
bitrotted a little).

Here's a snippet from an off-list conversation I had with Dennis (cc'd)
in 2014 (I think he uses that blame-tree code as part of a custom git
web interface):

> The things I think it needs are:
> 
>   1. The max-depth patches need to be reconciled with Duy's pathspec
>      work upstream. The current implementation works only for tree
>      diffs, and is not really part of the pathspec at all.
> 
>   2. Docs/output formats for blame-tree need to be friendlier, as you
>      noticed.
> 
>   3. Blame-tree does not use revision pathspecs at all. This makes it
>      take way longer than it could, because it does not prune away side
>      branches deep in history that affect only paths whose blame we have
>      already found. But the current pathspec code is so slow that using
>      it outweighs the pruning benefit.
> 
>      I have a series, which I just pushed up to jk/faster-blame-tree,
>      which tries to improve this.  But it's got a lot of new, untested
>      code itself (we are not even running it at GitHub yet). It's also
>      based on v1.9.4; I think there are going to be a lot of conflicts
>      with the combine-tree work done in v2.0.
> 
> [...]
> 
> I also think it would probably make sense for blame-tree to support the
> same output formats as git-blame (e.g., to have an identical --porcelain
> mode, to have a reasonable human-readable format by default, etc).

That's all I could dig out of my archives. I'd be happy if somebody
wanted to pick it up and run with it. Polishing for upstream has been on
my list for several years now, but there's usually something more
important (or interesting) to work on at any given moment.

You might also look at how GitLab does it. I've never talked to them
about it, and as far as I know they do not use blame-tree.

-Peff


* Re: Reducing CPU load on git server
  2016-08-29 22:41       ` W. David Jarvis
@ 2016-08-31  6:02         ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2016-08-31  6:02 UTC (permalink / raw)
  To: W. David Jarvis; +Cc: git

On Mon, Aug 29, 2016 at 03:41:59PM -0700, W. David Jarvis wrote:

> We have an open support thread with the GHE support folks, but the
> only feedback we've gotten so far on this subject is that they believe
> our CPU load is being driven by the quantity of fetch operations (an
> average of 16 fetch requests per second during normal business hours,
> so 4 requests per second per subscriber box). About 4,000 fetch
> requests on our main repository per day.

Hmm. That might well be, but I'd have to see the numbers to say more. At
this point I think we're exhausting what is useful to talk about on the
Git list.  I'll point GHE Support at this thread, which might help your
conversation with them (and they might pull me in behind the scenes).

I'll try to answer any Git-specific questions here, though.

> > None of those operations is triggered by client fetches. You'll see
> > "rev-list" for a variety of operations, so that's hard to pinpoint. But
> > I'm surprised that "prune" is a common one for you. It is run as part of
> > the post-push, but I happen to know that the version that ships on GHE
> > is optimized to use bitmaps, and to avoid doing any work when there are
> > no loose objects that need pruning in the first place.
> 
> Would regular force-pushing trigger prune operations? Our engineering
> body loves to force-push.

No, it shouldn't make a difference. For stock git, "prune" will only be
run occasionally as part of "git gc". On GitHub Enterprise, every push
kicks off a "sync" job that moves objects from a specific fork into
storage shared by all of the related forks. So GHE will run prune more
often than stock git would, but force-pushing wouldn't have any effect
on that.

There are also some custom patches to optimize prune on GHE, so it
shouldn't generally be very expensive. Unless perhaps for some reason
the reachability bitmaps on your repository aren't performing very well.

You could try something like comparing:

  time git rev-list --objects --all >/dev/null

and

  time git rev-list --objects --all --use-bitmap-index >/dev/null

on your server. The second should be a lot faster. If it's not, that may
be an indication that Git could be doing a better job of selecting
bitmap commits (that code is not GitHub-specific at all).

> >> I might be misunderstanding this, but if the subscriber is already "up
> >> to date" modulo a single updated ref tip, then this problem shouldn't
> >> occur, right? Concretely: if ref A is built off of ref B, and the
> >> subscriber already has B when it requests A, that shouldn't cause this
> >> behavior, but it would cause this behavior if the subscriber didn't
> >> have B when it requested A.
> >
> > Correct. So this shouldn't be a thing you are running into now, but it's
> > something that might be made worse if you switch to fetching only single
> > refs.
> 
> But in theory if we were always up-to-date (since we'd always fetch
> any updated ref) we wouldn't run into this problem? We could have a
> cron job to ensure that we run a full git fetch every once in a while
> but I'd hope that if this was written properly we'd almost always have
> the most recent commit for any dependency ref.

It's a little more complicated than that. What you're really going for
is letting git reuse on-disk deltas when serving fetches. But depending
on when the last repack was run, we might be cobbling together the fetch
from multiple packs on disk, in which case there will still be some
delta search. In my experience that's not _generally_ a big deal,
though. Small fetches don't have that many deltas to find.

> Our primary repository is fairly busy. It has about 1/3 the commits of
> Linux and about 1/3 the refs, but seems otherwise to be on the same
> scale. And, of course, it both hasn't been around for as long as Linux
> has and has been experiencing exponential growth, which means its
> current activity is higher than it has ever been before -- might put
> it on a similar scale to Linux's current activity.

Most of the work for repacking scales with the number of total objects
(not quite linearly, though).  For torvalds/linux (and its forks),
that's around 16 million objects. You might try "git count-objects -v"
on your server for comparison (but do it in the "network.git" directory,
as that's the shared object storage).

-Peff


* Re: git blame <directory> [was: Reducing CPU load on git server]
  2016-08-31  5:42         ` Jeff King
@ 2016-08-31  7:28           ` Dennis Kaarsemaker
  0 siblings, 0 replies; 16+ messages in thread
From: Dennis Kaarsemaker @ 2016-08-31  7:28 UTC (permalink / raw)
  To: Jeff King, Jakub Narębski; +Cc: git

On Wed, 2016-08-31 at 01:42 -0400, Jeff King wrote:
> On Tue, Aug 30, 2016 at 12:46:20PM +0200, Jakub Narębski wrote:
> 
> > I wonder if having support for 'git blame <directory>' in Git core would
> > be something interesting to Git users.  I once tried to implement it,
> > but it went nowhere.  Would it be hard to implement?
> I think there's some interest; I have received a few off-list emails
> over the years about it. There was some preliminary discussion long ago:
> 
>8 <snip>
>
> Here's a snippet from an off-list conversation I had with Dennis (cc'd)
> in 2014 (I think he uses that blame-tree code as part of a custom git
> web interface):

I still do, but haven't worked on that web interface in a while. I also
still build git with blame-tree merged in (which does occasionally
require a rebase with some tinkering).

> That's all I could dig out of my archives. I'd be happy if somebody
> wanted to pick it up and run with it. Polishing for upstream has been on
> my list for several years now, but there's usually something more
> important (or interesting) to work on at any given moment.

Same here. I've been meaning to pick this up, but it touches some parts
of git I'm still not too familiar with and there are always other
things to do :)

D.

