git@vger.kernel.org list mirror (unofficial, one of many)
From: Geert Jansen <gerardu@amazon.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: Matheus Tavares <matheus.bernardino@usp.br>, <git@vger.kernel.org>
Subject: Re: RFC: auto-enabling parallel-checkout on NFS
Date: Mon, 23 Nov 2020 23:18:18 +0000
Message-ID: <20201123231817.GA28189@dev-dsk-gerardu-1d-54592b62.us-east-1.amazon.com>
In-Reply-To: <212a2def-6811-b6e4-0550-ecae2fe0c02c@jeffhostetler.com>

Hi Jeff,

On Thu, Nov 19, 2020 at 09:04:34AM -0500, Jeff Hostetler wrote:

> On 11/18/20 11:01 PM, Matheus Tavares wrote:
> >
> >On Mon, Nov 16, 2020 at 12:19 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
> >>
> >>I can't really speak to NFS performance, but I have to wonder if there's
> >>not something else affecting the results -- 4 and/or 8 core results are
> >>better than 16+ results in some columns.  And we get diminishing returns
> >>after ~16.
> >
> >Yeah, that's a good point. I'm not sure yet what's causing the
> >diminishing returns, but Geert and I are investigating. Maybe we are
> >hitting some limit for parallelism in this scenario.
> 
> I seem to recall back when I was working on this problem that
> the unzip of each blob was a major pain point.  Combine this with
> long delta-chains and each worker would need multiple rounds of
> read/memmap, unzip, and de-delta before it had the complete blob
> and could then smudge and write.

I think that there are two cases here:

1) (CPU bound case) On local machines with multiple cores and SSDs,
   checkout is CPU bound, and parallel checkout works because the unzipping
   can now run on multiple CPUs in parallel. Shorter chains would use less
   CPU time, and we'd see a similar benefit on both parallel and sequential
   checkout.

2) (IO bound case) On networked file systems, file system IO is pretty much
   always the bottleneck for git and similar applications that use small files.
   On NFS, calling open() is always a round trip, and so is close() (in the
   absence of delegations and O_CREAT). The latency of these calls depends on
   the NFS server and network distance, but 1ms is a reasonable order of
   magnitude when thinking about this. Because this 1ms is a lot more than the
   typical CPU time to process a single blob, checkout will be IO bound.
   Parallel checkout works by allowing the application to maintain an IO
   depth > 1 for these workloads, which amortizes the network latency over
   multiple requests.
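
To put rough numbers on the IO bound case, here is a back-of-the-envelope
sketch in Python. It is a model, not a measurement. The function name, the
100,000-file working tree, and the assumption of exactly two round trips per
file (one open(), one close()) are illustrative and not taken from git. The
model also assumes the server scales perfectly with IO depth, which is
exactly what the diminishing returns call into question.

```python
def io_bound_checkout_seconds(num_files, latency_s=0.001,
                              round_trips_per_file=2, io_depth=1):
    """Estimate wall-clock time when each round trip costs `latency_s`
    and `io_depth` requests are kept in flight at once.

    Assumption: CPU time per blob is negligible next to the latency,
    and the server handles concurrent requests without contention.
    """
    if io_depth < 1:
        raise ValueError("io_depth must be at least 1")
    total_round_trips = num_files * round_trips_per_file
    return total_round_trips * latency_s / io_depth


if __name__ == "__main__":
    # Hypothetical working tree of 100,000 small files at ~1ms per round trip.
    for depth in (1, 4, 16, 64):
        secs = io_bound_checkout_seconds(100_000, io_depth=depth)
        print(f"io_depth={depth:3d}: ~{secs:.1f}s")
```

Under these assumptions a sequential checkout spends about 200 seconds waiting
on the network, and an IO depth of 16 brings that down to roughly 12.5 seconds.
Any real result that falls short of this ideal points at a limit the model
leaves out.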

Regarding the diminishing returns: I did some initial analysis of Matheus'
code, and I'm not sure yet what causes them. I see the code achieving a high
IO depth in our server logs, which would indicate that the diminishing returns
are caused by file system contention. This would have to be some kind of
general contention since it happens on both NFS and EFS. I will do a deeper
investigation on this and will report what I find.
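
One way to look at this kind of scaling outside of git is a small
file-creation benchmark. The sketch below is a hypothetical standalone script,
not git's parallel-checkout code: it creates many small files with a varying
number of Python threads and prints the elapsed time for each worker count.
The file count, payload size, and worker counts are arbitrary choices, and the
temporary directory must be placed on the file system under test (for example
an NFS mount) for the numbers to mean anything.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_small_file(path, payload=b"x" * 1024):
    # open() and close() are the calls that each cost a round trip on NFS.
    with open(path, "wb") as f:
        f.write(payload)


def time_parallel_writes(directory, num_files, workers):
    """Create `num_files` small files in `directory` using `workers` threads
    and return the elapsed wall-clock time in seconds."""
    paths = [os.path.join(directory, f"w{workers}_f{i}")
             for i in range(num_files)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all writes and re-raises any worker exception.
        list(pool.map(write_small_file, paths))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Pass dir=... to TemporaryDirectory to target a specific mount point.
    with tempfile.TemporaryDirectory() as d:
        for workers in (1, 4, 16, 64):
            elapsed = time_parallel_writes(d, 2000, workers)
            print(f"workers={workers:3d}: {elapsed:.3f}s")
```

If the elapsed time stops improving well before the worker count does, that is
at least consistent with contention somewhere below the application, although
this script alone cannot say where.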

Best regards,
Geert
