git@vger.kernel.org mailing list mirror (one of many)
From: Avery Pennarun <apenwarr@gmail.com>
To: Dmitry Potapov <dpotapov@gmail.com>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Zygo Blaxell <zblaxell@esightcorp.com>,
	Ilari Liusvaara <ilari.liusvaara@elisanet.fi>,
	Thomas Rast <trast@student.ethz.ch>,
	Jonathan Nieder <jrnieder@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] don't use mmap() to hash files
Date: Sun, 14 Feb 2010 18:13:13 -0500
Message-ID: <32541b131002141513m29f9a796ma8fb5855a45f91e9@mail.gmail.com>
In-Reply-To: <37fcd2781002141106v761ce6e0kc5c5bdd5001f72a9@mail.gmail.com>

On Sun, Feb 14, 2010 at 2:06 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> On Sun, Feb 14, 2010 at 9:10 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>> That's comparing oranges to apples. In one case, the address space runs
>> out, in the other the available memory. The latter is much more likely.
>
> "much more likely" is not a very quantitative characteristic... I would
> prefer to see numbers.

Well, the numbers are rather easy to calculate of course.  On a 32-bit
machine, your (ideal) maximum address space size is 4GB.  On a 64-bit
machine, it's a heck of a lot bigger.  And in either case, a single
process consuming it all doesn't matter since it won't hurt other
processes.  But the available RAM is frequently less than 4GB and that
has to be shared between *all* your processes.

> BTW, probably, it is not difficult to stream a large file in chunks (and
> it may be even much faster, because we work on CPU cache), but I suspect
> it will not resolve all issues with huge files, because eventually we
> need to store them in a pack file. So we need to develop some strategy
> how to deal with them.

It definitely doesn't resolve all the issues.  There are different
ways of looking at this; one is to not bother making git-add work
smoothly with large files, because calculating the deltas will later
cause a disastrous meltdown anyway.  In fact, arguably you should
prevent git-add from adding large files at all, because at least then
you don't get the repository into a hard-to-recover-from state with
huge files.  (This happened at work a few months ago; most people have
no idea what to do in such a situation.)

The other way to look at it is that if we want git to *eventually*
work with huge files, we have to fix each bug one at a time, and we
can't go making things worse.

For my own situation, I think I'm more likely to (and I know people
who are more likely to) try storing huge files in git than I am likely
to modify a file *while* I'm trying to store it in git.
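
As a rough illustration of the chunked streaming mentioned above, here is
a minimal Python sketch that computes a blob's object ID without mapping
or loading the whole file. It is not the code from the patch under
discussion. The function name git_blob_id and the chunk_size default are
made up for this example; the only fact it relies on is that a blob ID is
the SHA-1 of "blob <size>\0" followed by the file contents.

```python
import hashlib

def git_blob_id(stream, size, chunk_size=1 << 20):
    """Return the hex blob ID for `size` bytes read from `stream`.

    The size has to be known up front because it is part of the hashed
    header. Only `size` bytes are consumed, so a file that grows while
    being read does not change the result; a file that shrinks is
    reported as an error instead of being hashed short.
    """
    h = hashlib.sha1()
    h.update(b"blob %d\0" % size)
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            raise IOError("input ended before the declared size")
        h.update(chunk)
        remaining -= len(chunk)
    return h.hexdigest()
```

Memory use is bounded by chunk_size regardless of file size, which is the
property that matters when RAM rather than address space is the limit.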

> One way to deal with them is to stream directly into a separate pack.
> Still, it does not resolve all problems, because each pack file should
> be mapped into memory, and this may be a problem for 32-bit systems
> (or even 64-bit systems where a sysadmin has set a limit on the amount
> of virtual memory available to a single program).
>
> The other way to handle huge files is to split them into chunks.
> http://article.gmane.org/gmane.comp.version-control.git/120112

I have a bit of experience splitting files into chunks:
http://groups.google.com/group/bup-list/browse_thread/thread/812031efd4c5f7e4

It works.  Also note that the speed gain from mmap'ing packs appears
to be much less than the gain from mmap'ing indexes.  You could
probably sacrifice most or all of the former and never really notice.
Caching expanded deltas can be pretty valuable, though.  (bup
presently avoids that whole question by not using deltas.)
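
To make "splitting files into chunks" concrete, here is a small Python
sketch of content-defined chunking with a rolling checksum. It is only an
illustration of the general idea and should not be read as bup's actual
algorithm or parameters. The name split_chunks, the 64-byte window and the
10-bit mask are all assumptions chosen for the example.

```python
def split_chunks(data, window=64, mask=(1 << 10) - 1):
    """Split `data` (bytes) at boundaries chosen by its own content.

    An rsync-style pair of running sums (a, b) is maintained over the
    last `window` bytes. A chunk ends after any byte where the low bits
    of b selected by `mask` are all set.
    """
    chunks = []
    start = 0
    a = 0
    b = 0
    for i, new in enumerate(data):
        # Byte leaving the window; treated as zero until the window fills.
        old = data[i - window] if i >= window else 0
        a = (a - old + new) & 0xFFFF
        b = (b - window * old + a) & 0xFFFF
        if (b & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because each boundary depends only on the preceding window of bytes,
inserting data near the start of a file shifts the later boundaries but
leaves the later chunks byte-for-byte identical, so they can be stored
once. A practical version would also enforce minimum and maximum chunk
sizes, which this sketch omits.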

I can also confirm that streaming objects directly into packs is a
massive performance increase when dealing with big files.  However,
you then start to run into git's heuristics that often assume (for
example) that if an object is in a pack, it should never (or rarely)
be pruned.  This is normally a fine assumption, because if it was
likely to get pruned, it probably never would have been put into a
pack in the first place.
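
For readers unfamiliar with what "streaming an object directly into a
pack" involves, the following Python sketch writes a single blob into a
version-2 pack while reading the input in chunks. It is a simplified
illustration, not git's or bup's implementation: the function names and
chunk_size are invented here, it stores the blob undeltified, and it does
not produce the .idx file that a usable pack also needs.

```python
import hashlib
import struct
import zlib

OBJ_BLOB = 3  # type code used for blobs in pack entries

def encode_object_header(obj_type, size):
    """Encode a pack entry header: 3-bit type plus variable-length size."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def stream_blob_to_pack(src, size, dst, chunk_size=1 << 20):
    """Write a one-object pack holding `size` bytes read from `src`."""
    digest = hashlib.sha1()

    def emit(data):
        digest.update(data)
        dst.write(data)

    # Pack header: signature, version 2, object count 1.
    emit(b"PACK" + struct.pack(">II", 2, 1))
    emit(encode_object_header(OBJ_BLOB, size))

    # Deflate the contents incrementally so memory use stays bounded.
    compressor = zlib.compressobj()
    remaining = size
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise IOError("input ended before the declared size")
        emit(compressor.compress(chunk))
        remaining -= len(chunk)
    emit(compressor.flush())

    # Trailer: SHA-1 of everything written so far.
    dst.write(digest.digest())
```

The object never exists as a loose file and is never held in memory in
full, which is where the speedup for big files comes from. The flip side
is the one described above: the object now lives in a pack, so anything
that assumes packed objects are long-lived will treat it that way.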

Have fun,

Avery
