git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Nicolas Pitre <nico@fluxnic.net>
To: Zygo Blaxell <zblaxell@gibbs.hungrycats.org>
Cc: Junio C Hamano <gitster@pobox.com>,
	Dmitry Potapov <dpotapov@gmail.com>,
	Ilari Liusvaara <ilari.liusvaara@elisanet.fi>,
	Thomas Rast <trast@student.ethz.ch>,
	Jonathan Nieder <jrnieder@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] Teach "git add" and friends to be paranoid
Date: Mon, 22 Feb 2010 13:01:13 -0500 (EST)	[thread overview]
Message-ID: <alpine.LFD.2.00.1002221238110.1946@xanadu.home> (raw)
In-Reply-To: <20100222173122.GG11733@gibbs.hungrycats.org>

On Mon, 22 Feb 2010, Zygo Blaxell wrote:

> On Mon, Feb 22, 2010 at 10:40:59AM -0500, Nicolas Pitre wrote:
> > On Sun, 21 Feb 2010, Junio C Hamano wrote:
> > > Dmitry Potapov <dpotapov@gmail.com> writes:
> > > > But overall the outcome is clear -- read() is always a winner.
> > > 
> > > "... a winner, below 128kB; above that the difference is within noise and
> > > measurement error"?
> > 
> > read() is not always a winner.  A read() call will always have the data 
> > duplicated in memory.  Especially with large files, it is more efficient 
> > on the system as a whole to mmap() a 50 MB file rather than allocating 
> > an extra 50 MB of anonymous memory that cannot be paged out (except to 
> > the swap file which would be yet another data duplication).  With mmap() 
> > when there is memory pressure the read-only mapped memory is simply 
> > dropped with no extra IO.
> 
> That holds if you're comparing read() and mmap() of the entire file as a
> single chunk, instead of in fixed-size chunks at the sweet spot between
> syscall overhead and CPU cache size.

Obviously.  But we currently don't have the infrastructure to do chunked 
read of the input data.  I think we should do that eventually, by 
applying the pack windowing code to input files as well.  That would 
make memory usage constant even for huge files, but this is much more 
complicated to support especially for data fed through stdin.

> If you're read()ing a chunk at a time into a fixed size buffer, and
> doing sha1 and deflate in chunks, the data should be copied once into CPU
> cache, processed with both algorithms, and replaced with new data from
> the next chunk.  The data will be copied from the page cache instead
> of directly mapped, which is a small overhead, but setting up the page
> map in mmap() also a small overhead, so you have to use benchmarks to
> know which of the overheads is smaller.  It might be that there's no
> one answer that applies to all CPU configurations.

Normally mmap() has more overhead than read().  However mmap() provides 
much nicer properties than read() by simplifying the code a lot, and by 
letting the OS manage memory pressure much more gracefully.

> If you're doing mmap() and sha1 and deflate of a 50MB file in two
> separate passes that are the same size as the file, you load 50MB of
> data into CPU cache at least twice, you get two sets of associated
> things like TLB misses, and if the file is very large, you page it from
> disk twice.  So it might make sense to process in chunks regardless
> of read() vs mmap() fetching the data.

We do have to make two separate passes anyway.  The first pass is to 
hash the data only, and if that hash already exists in the object store 
then we call it done and skip over the deflate process which is still 
the dominant cost.  And that happens quite often.

However, with a really large file, then it becomes advantageous to 
simply do the hash and deflate in parallel one chunk at a time, and 
simply discard the newly created objects if it happens to already 
exists.  That's the whole idea behind the newly introduced 
core.bigFileThreshold config variable (but the code to honor it in 
sha1_file.c doesn't exist yet).

> If you're malloc()ing 50MB, you're wasting memory and CPU bandwidth
> making up pages full of zeros before you've even processed the first byte.
> I don't see how that could ever be faster for large file cases.

It can't.  This is why read() is not much better than mmap() in those 
cases.


Nicolas

  reply	other threads:[~2010-02-22 18:01 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20100211234753.22574.48799.reportbug@gibbs.hungrycats.org>
2010-02-12  0:27 ` Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs Jonathan Nieder
2010-02-12  1:23   ` Zygo Blaxell
2010-02-13 12:12     ` Jonathan Nieder
2010-02-13 13:39       ` Ilari Liusvaara
2010-02-13 14:39         ` Thomas Rast
2010-02-13 16:29           ` Ilari Liusvaara
2010-02-13 22:09             ` Dmitry Potapov
2010-02-13 22:37               ` Zygo Blaxell
2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
2010-02-14  1:37                   ` Junio C Hamano
2010-02-14  2:18                     ` Dmitry Potapov
2010-02-14  3:14                       ` Junio C Hamano
2010-02-14 11:14                         ` Thomas Rast
2010-02-14 11:46                           ` Junio C Hamano
2010-02-14  1:53                   ` Johannes Schindelin
2010-02-14  2:00                     ` Junio C Hamano
2010-02-14  2:42                     ` Dmitry Potapov
2010-02-14 11:07                       ` Jakub Narebski
2010-02-14 11:55                       ` Paolo Bonzini
2010-02-14 18:10                       ` Johannes Schindelin
2010-02-14 19:06                         ` Dmitry Potapov
2010-02-14 19:22                           ` Johannes Schindelin
2010-02-14 19:28                             ` Johannes Schindelin
2010-02-14 19:56                               ` Dmitry Potapov
2010-02-14 23:52                                 ` Zygo Blaxell
2010-02-15  5:05                                 ` Nicolas Pitre
2010-02-15 12:23                                   ` Dmitry Potapov
2010-02-15  7:48                                 ` Paolo Bonzini
2010-02-15 12:25                                   ` Dmitry Potapov
2010-02-14 19:55                             ` Dmitry Potapov
2010-02-14 23:13                           ` Avery Pennarun
2010-02-15  4:16                             ` Nicolas Pitre
2010-02-15  5:01                               ` Avery Pennarun
2010-02-15  5:48                                 ` Nicolas Pitre
2010-02-15 19:19                                   ` Avery Pennarun
2010-02-15 19:29                                     ` Nicolas Pitre
2010-02-14  3:05                   ` [PATCH v2] " Dmitry Potapov
2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
2010-02-18  1:20                     ` Junio C Hamano
2010-02-18 15:32                       ` Zygo Blaxell
2010-02-19 17:51                         ` Junio C Hamano
2010-02-18  1:38                     ` Jeff King
2010-02-18  4:55                       ` Nicolas Pitre
2010-02-18  5:36                         ` Junio C Hamano
2010-02-18  7:27                           ` Wincent Colaiuta
2010-02-18 16:18                             ` Zygo Blaxell
2010-02-18 18:12                               ` Jonathan Nieder
2010-02-18 18:35                                 ` Junio C Hamano
2010-02-22 12:59                           ` Paolo Bonzini
2010-02-22 13:33                             ` Dmitry Potapov
2010-02-18 10:14                     ` Thomas Rast
2010-02-18 18:16                       ` Junio C Hamano
2010-02-18 19:58                         ` Nicolas Pitre
2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
2010-02-18 20:58                             ` Nicolas Pitre
2010-02-19  9:27                               ` Erik Faye-Lund
2010-02-22 22:20                               ` Bill Lear
2010-02-22 22:31                                 ` Nicolas Pitre
2010-02-18 20:14                           ` [PATCH] Teach "git add" and friends to be paranoid Peter Harris
2010-02-18 20:17                           ` Junio C Hamano
2010-02-18 21:30                             ` Nicolas Pitre
2010-02-19  1:04                               ` Jonathan Nieder
2010-02-19 15:26                                 ` Zygo Blaxell
2010-02-19 17:52                                   ` Junio C Hamano
2010-02-19 19:08                                     ` Zygo Blaxell
2010-02-19  8:28                     ` Dmitry Potapov
2010-02-19 17:52                       ` Junio C Hamano
2010-02-20 19:23                         ` Junio C Hamano
2010-02-21  7:21                           ` Dmitry Potapov
2010-02-21 19:32                             ` Junio C Hamano
2010-02-22  3:35                               ` Dmitry Potapov
2010-02-22  6:59                                 ` Junio C Hamano
2010-02-22 12:25                                   ` Dmitry Potapov
2010-02-22 15:40                                   ` Nicolas Pitre
2010-02-22 16:01                                     ` Dmitry Potapov
2010-02-22 17:31                                     ` Zygo Blaxell
2010-02-22 18:01                                       ` Nicolas Pitre [this message]
2010-02-22 19:56                                         ` Junio C Hamano
2010-02-22 20:52                                           ` Nicolas Pitre
2010-02-22 18:05                                       ` Dmitry Potapov
2010-02-22 18:14                                         ` Nicolas Pitre
2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
2010-02-14  1:53     ` mmap with MAP_PRIVATE is useless Junio C Hamano
2010-02-14  2:11       ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.2.00.1002221238110.1946@xanadu.home \
    --to=nico@fluxnic.net \
    --cc=dpotapov@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=ilari.liusvaara@elisanet.fi \
    --cc=jrnieder@gmail.com \
    --cc=trast@student.ethz.ch \
    --cc=zblaxell@gibbs.hungrycats.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).