git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [Question] clone performance
@ 2019-08-24  1:59 randall.s.becker
  2019-08-24 21:00 ` Bryan Turner
  0 siblings, 1 reply; 5+ messages in thread
From: randall.s.becker @ 2019-08-24  1:59 UTC (permalink / raw)
  To: git

Hi All,

I'm trying to answer a question for a customer on clone performance. They
are doing at least 2-3 clones a day, of repositories with about 2500 files
and 10Gb of content. This is stressing the file system. I have tried to
convince them that their process is not reasonable and should stick with
existing clones, using branch checkout rather that re=cloning for each
feature branch. Sadly, I have not been successful - not for a lack of
trying. Is there any way to improve raw clone performance in a situation
like this, where status really doesn't matter, because the clone's life span
is under 48 hours.

TIA,
Randall



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] clone performance
  2019-08-24  1:59 [Question] clone performance randall.s.becker
@ 2019-08-24 21:00 ` Bryan Turner
  2019-08-26 14:16   ` randall.s.becker
  0 siblings, 1 reply; 5+ messages in thread
From: Bryan Turner @ 2019-08-24 21:00 UTC (permalink / raw)
  To: randall.s.becker; +Cc: Git Users

On Fri, Aug 23, 2019 at 6:59 PM <randall.s.becker@rogers.com> wrote:
>
> Hi All,
>
> I'm trying to answer a question for a customer on clone performance. They
> are doing at least 2-3 clones a day, of repositories with about 2500 files
> and 10Gb of content. This is stressing the file system.

Can you go into a bit more detail about what "stress" means? Using too
much disk space? Too many IOPS reading/packing? Since you specifically
called out the filesystem, does that mean the CPU/memory usage is
acceptable?

Depending on how well-packed the repository is, Git will reuse a lot
of the existing pack (and a "perfectly" packed repository can achieve
complete reuse, with no "Compressing objects" phase at all). Delta
islands[1] can help increase reuse and reduce the need for on-the-fly
compression, if the repository includes a lot of refs that aren't
generally cloned.

Another relatively recent addition is uploadpack.packobjectshook[2],
which can simplify caching of packfiles so they can be reused on
subsequent requests. Whether or not this will be beneficial is likely
to be influenced by how many times the exact same commits are cloned
and how much extra disk space is available for storing cached packs.

Not sure if any of this is helpful, but I hope it will be!
Bryan

[1] https://git-scm.com/docs/git-pack-objects#_delta_islands
[2] https://git-scm.com/docs/git-config#Documentation/git-config.txt-uploadpackpackObjectsHook
> I have tried to
> convince them that their process is not reasonable and should stick with
> existing clones, using branch checkout rather that re=cloning for each
> feature branch. Sadly, I have not been successful - not for a lack of
> trying. Is there any way to improve raw clone performance in a situation
> like this, where status really doesn't matter, because the clone's life span
> is under 48 hours.
>
> TIA,
> Randall
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [Question] clone performance
  2019-08-24 21:00 ` Bryan Turner
@ 2019-08-26 14:16   ` randall.s.becker
  2019-08-26 18:21     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: randall.s.becker @ 2019-08-26 14:16 UTC (permalink / raw)
  To: 'Bryan Turner'; +Cc: 'Git Users'

On August 24, 2019 5:00 PM, Bryan Turner wrote:
> On Fri, Aug 23, 2019 at 6:59 PM <randall.s.becker@rogers.com> wrote:
> >
> > Hi All,
> >
> > I'm trying to answer a question for a customer on clone performance.
> > They are doing at least 2-3 clones a day, of repositories with about
> > 2500 files and 10Gb of content. This is stressing the file system.
> 
> Can you go into a bit more detail about what "stress" means? Using too
> much disk space? Too many IOPS reading/packing? Since you specifically
> called out the filesystem, does that mean the CPU/memory usage is
> acceptable?

The upstream is BitBucket, which does a gc frequently. I'm not sure any of this is relating to the pack structure. Git is spending most of its time writing the large number of large files into the working directory - it is stress mostly the disk, with a bit on the CPU (neither is acceptable to the customer). I am really unsure there is any way to make things better. The core issue is that the customer insists on doing a clone for every feature branch instead of using pull/checkout. I have been unable to change their mind - to this point anyway.

We are going to be setting up a detailed performance analysis that may lead to some data the git team can use.

Regards,
Randall


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] clone performance
  2019-08-26 14:16   ` randall.s.becker
@ 2019-08-26 18:21     ` Jeff King
  2019-08-26 19:27       ` Elijah Newren
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff King @ 2019-08-26 18:21 UTC (permalink / raw)
  To: randall.s.becker; +Cc: 'Bryan Turner', 'Git Users'

On Mon, Aug 26, 2019 at 10:16:48AM -0400, randall.s.becker@rogers.com wrote:

> On August 24, 2019 5:00 PM, Bryan Turner wrote:
> > On Fri, Aug 23, 2019 at 6:59 PM <randall.s.becker@rogers.com> wrote:
> > >
> > > Hi All,
> > >
> > > I'm trying to answer a question for a customer on clone performance.
> > > They are doing at least 2-3 clones a day, of repositories with about
> > > 2500 files and 10Gb of content. This is stressing the file system.
> > 
> > Can you go into a bit more detail about what "stress" means? Using too
> > much disk space? Too many IOPS reading/packing? Since you specifically
> > called out the filesystem, does that mean the CPU/memory usage is
> > acceptable?
> 
> The upstream is BitBucket, which does a gc frequently. I'm not sure
> any of this is relating to the pack structure. Git is spending most of
> its time writing the large number of large files into the working
> directory - it is stress mostly the disk, with a bit on the CPU
> (neither is acceptable to the customer). I am really unsure there is
> any way to make things better. The core issue is that the customer
> insists on doing a clone for every feature branch instead of using
> pull/checkout. I have been unable to change their mind - to this point
> anyway.

Yeah, at the point of checkout there's basically no impact from anything
the server is doing or has done (technically it could make things worse
for you by returning a pack with absurdly long delta chains or
something, but that would be CPU and not disk stress).

I doubt there's much to optimize in Git here. It's literally just
writing files to disk as quickly as it can, and it sounds like disk
performance is your bottleneck.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Question] clone performance
  2019-08-26 18:21     ` Jeff King
@ 2019-08-26 19:27       ` Elijah Newren
  0 siblings, 0 replies; 5+ messages in thread
From: Elijah Newren @ 2019-08-26 19:27 UTC (permalink / raw)
  To: Jeff King; +Cc: randall.s.becker, Bryan Turner, Git Users

On Mon, Aug 26, 2019 at 12:04 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Aug 26, 2019 at 10:16:48AM -0400, randall.s.becker@rogers.com wrote:
>
> > On August 24, 2019 5:00 PM, Bryan Turner wrote:
> > > On Fri, Aug 23, 2019 at 6:59 PM <randall.s.becker@rogers.com> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I'm trying to answer a question for a customer on clone performance.
> > > > They are doing at least 2-3 clones a day, of repositories with about
> > > > 2500 files and 10Gb of content. This is stressing the file system.
> > >
> > > Can you go into a bit more detail about what "stress" means? Using too
> > > much disk space? Too many IOPS reading/packing? Since you specifically
> > > called out the filesystem, does that mean the CPU/memory usage is
> > > acceptable?
> >
> > The upstream is BitBucket, which does a gc frequently. I'm not sure
> > any of this is relating to the pack structure. Git is spending most of
> > its time writing the large number of large files into the working
> > directory - it is stress mostly the disk, with a bit on the CPU
> > (neither is acceptable to the customer). I am really unsure there is
> > any way to make things better. The core issue is that the customer
> > insists on doing a clone for every feature branch instead of using
> > pull/checkout. I have been unable to change their mind - to this point
> > anyway.
>
> Yeah, at the point of checkout there's basically no impact from anything
> the server is doing or has done (technically it could make things worse
> for you by returning a pack with absurdly long delta chains or
> something, but that would be CPU and not disk stress).
>
> I doubt there's much to optimize in Git here. It's literally just
> writing files to disk as quickly as it can, and it sounds like disk
> performance is your bottleneck.

Well, if it's just checkout, Stolee's sparse-checkout series he just
posted may be of interest to them...once it's polished up and included
in git, of course.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-08-26 19:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-24  1:59 [Question] clone performance randall.s.becker
2019-08-24 21:00 ` Bryan Turner
2019-08-26 14:16   ` randall.s.becker
2019-08-26 18:21     ` Jeff King
2019-08-26 19:27       ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).