git@vger.kernel.org mailing list mirror (one of many)
From: Avery Pennarun <apenwarr@gmail.com>
To: Nicolas Pitre <nico@fluxnic.net>
Cc: Dmitry Potapov <dpotapov@gmail.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Zygo Blaxell <zblaxell@esightcorp.com>,
	Ilari Liusvaara <ilari.liusvaara@elisanet.fi>,
	Thomas Rast <trast@student.ethz.ch>,
	Jonathan Nieder <jrnieder@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] don't use mmap() to hash files
Date: Mon, 15 Feb 2010 00:01:49 -0500
Message-ID: <32541b131002142101i226663cfk90d1ba14f1031788@mail.gmail.com>
In-Reply-To: <alpine.LFD.2.00.1002142252020.1946@xanadu.home>

On Sun, Feb 14, 2010 at 11:16 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Sun, 14 Feb 2010, Avery Pennarun wrote:
>> In fact, arguably you should prevent git-add from adding large files
>> at all, because at least then you don't get the repository into a
>> hard-to-recover-from state with huge files.  (This happened at work a
>> few months ago; most people have no idea what to do in such a
>> situation.)
>
> Git needs to be fixed in that case, not be crippled.

That would be ideal, but is more work than disabling imports for large
files by default (for example), which would be easy.  In any case, my
solution at work was to say "if it hurts, don't do that" and it seems
to have worked out okay for now.
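
To make "disabling imports for large files" concrete, here is a minimal
sketch of the kind of check a wrapper script or hook could run before
adding files.  The 50 MiB limit and the function name `oversized` are
assumptions made up for illustration; this is not an existing git
option or how git itself behaves.

```python
import os

# Hypothetical default limit; any real threshold would be a policy decision.
MAX_BYTES = 50 * 1024 * 1024


def oversized(paths, limit=MAX_BYTES):
    """Return the regular files in `paths` whose size exceeds `limit` bytes."""
    return [
        p for p in paths
        if os.path.isfile(p) and os.path.getsize(p) > limit
    ]


if __name__ == "__main__":
    import sys

    too_big = oversized(sys.argv[1:])
    for path in too_big:
        print(f"refusing to add {path}: larger than {MAX_BYTES} bytes")
    sys.exit(1 if too_big else 0)
```

A caller would pass it the candidate paths and refuse to continue when
the returned list is non-empty.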

>> For my own situation, I think I'm more likely to (and I know people
>> who are more likely to) try storing huge files in git than I am likely
>> to modify a file *while* I'm trying to store it in git.
>
> And fancy operations on huge files are pretty unlikely.  Blame, diff,
> etc, are suited for text files, which are by nature relatively small.
> And if your source code is all pasted in one single huge file that Git
> can't handle right now, then the compiler is unlikely to cope either.

Well, I'm thinking of things like textual database dumps, such as
those produced by mysqldump.  It would be nice to be able to diff
those efficiently, even if they're several gigs in size.  bup's
hierarchical chunking allows this.

>> > The other way to handle huge files is to split them into chunks.
>> > http://article.gmane.org/gmane.comp.version-control.git/120112
>
> No.  The chunk idea doesn't fit the Git model well enough without many
> corner cases all over the place which is a major drawback.  I think that
> was discussed in that thread already.
>
>> I have a bit of experience splitting files into chunks:
>> http://groups.google.com/group/bup-list/browse_thread/thread/812031efd4c5f7e4

Note that bup's rolling-checksum-based hierarchical chunking is not
the same as the chunking that was discussed in that thread, and it
resolves most of the problems.  Unless I'm missing something.

Also note that bup just uses normal tree objects (for better or worse)
instead of introducing a new object type.

>> It works.  Also note that the speed gain from mmap'ing packs appears
>> to be much less than the gain from mmap'ing indexes.  You could
>> probably sacrifice most or all of the former and never really notice.
>> Caching expanded deltas can be pretty valuable, though.  (bup
>> presently avoids that whole question by not using deltas.)
>
> We do have a cache of expanded deltas already.

Yes, sorry to have implied otherwise.  I was just comparing the
performance advantage of the delta expansion cache (which should be a
lot) with that of mmap'ing packfiles (which probably isn't much since
the packfile data is typically needed in expanded form anyway).

>> I can also confirm that streaming objects directly into packs is a
>> massive performance increase when dealing with big files.  However,
>> you then start to run into git's heuristics that often assume (for
>> example) that if an object is in a pack, it should never (or rarely)
>> be pruned.  This is normally a fine assumption, because if it was
>> likely to get pruned, it probably never would have been put into a
>> pack in the first place.
>
> Would you please for my own sanity tell me where we do such thing.  I
> thought I had a firm grip on the pack model but you're casting a shadow
> of doubts on some code I might have written myself.

Sorry, I didn't hunt down the code, but I ran into it while
experimenting before.  The rules are something like:

- git-prune only prunes unpacked objects

- git-repack claims to be willing to explode unreachable objects back
into loose objects with -A, but I'm not quite sure if its definition
of "unreachable" is the same as mine.  And I'm not sure rewriting a
pack with -A makes the old pack reliably unreachable according to -d.
It's possible I was just being dense.

- there seems to be no documented situation in which you can ever
delete unused objects from a pack without using repack -a or -A, which
can be amazingly slow if your packs are huge.  (Ideally you'd only
repack the particular packs that you want to shrink.)  For example, my
bup repo is currently 200 GB.

Anyway, I didn't have much luck when playing with it earlier, but
didn't investigate since I assumed it's just a workflow that nobody
much cares about.  Which I think is a reasonable position for git
developers to take anyway.

Have fun,

Avery

