git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Andrew Ardill <andrew.ardill@gmail.com>
Cc: Farshid Zavareh <fhzavareh@gmail.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Should I store large text files on Git LFS?
Date: Tue, 25 Jul 2017 15:13:47 -0400
Message-ID: <20170725191347.e2p7goxho2rcemz4@sigill.intra.peff.net>
In-Reply-To: <CAH5451nbY+Xo0Fpe2OdsxwJeRV1ddZmYX7v-bPYgRsbS2kNJSg@mail.gmail.com>

On Tue, Jul 25, 2017 at 06:06:49PM +1000, Andrew Ardill wrote:

> Let's have a look:
> 
> $ git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
> 174 262 0000000000000000000000000000000000000000
> 171 260 0000000000000000000000000000000000000000
> 139 212 0000000000000000000000000000000000000000
> 47 36 0000000000000000000000000000000000000000
> 377503831 2310238304 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 500182546 3740427683 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 447340264 3357717475 0000000000000000000000000000000000000000 data.txt
> 
> Yep, all zlib.

OK, that makes sense.

> What do you think is a reasonable config for storing text files this
> large, to get good delta compression, or is it more of a trial and
> error to find out what works best?

I think it would really depend on what's in your repo. If you just have
gigantic text files and no big binaries, and you have enough RAM to do
diffs on the text files, it's not unreasonable to just set
core.bigfilethreshold to something really big and not worry about it.

In general, a diff is going to want memory at least 2x the size of the
file (for the old and new images). And we tend to keep in memory all of
the images for a single tree-diff at one time (so if you touched two
gigantic files in one commit, then "git log -p" is probably going to
peak at having all four before/after images in memory at once).
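
As a back-of-the-envelope check on the 2x figure, the sketch below adds
the uncompressed sizes of two of the data.txt versions from the listing
quoted above. The helper name estimate_diff_bytes is made up for
illustration, and the result is only a lower bound for one before/after
pair; which versions actually get diffed against each other depends on
the history.

```shell
# estimate_diff_bytes is a hypothetical helper, not a git command.
# It returns a rough lower bound on diff memory: old image + new image, in bytes.
estimate_diff_bytes() {
  old_size=$1
  new_size=$2
  echo $((old_size + new_size))
}

# Uncompressed sizes (%(objectsize)) of two data.txt versions from the listing.
estimate_diff_bytes 2310238304 3740427683   # prints 6050665987
```

So diffing that one pair would want roughly 6 GB for the two images
alone, before any overhead from the diff itself.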

If you just want deltas but not diffs, you can probably do:

  echo '*.gigantic -diff' >.gitattributes
  git config core.bigfilethreshold 10G

I think that will turn off streaming of the blobs in some code paths,
too. But hopefully a _single_ copy of each file would be OK to hold in
RAM. If it's not, you might also be able to get away with packing once
with:

  git -c core.bigfilethreshold=10G repack -adf

and then further repacks will carry those deltas forward. I think we
only apply the limit when actively searching for new deltas, not when
reusing existing ones.
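
One way to check whether a repack actually produced deltas is to rerun
the cat-file listing quoted above and count the entries whose delta base
is not the all-zero object ID. A minimal sketch, assuming a helper named
count_deltified (hypothetical, not part of git) that reads that listing
on stdin:

```shell
# count_deltified is a hypothetical helper. It reads lines of
#   <disk-size> <size> <deltabase> [<path>]
# on stdin and counts those whose delta base is not all zeros.
count_deltified() {
  awk '$3 !~ /^0+$/ { n++ } END { print n + 0 }'
}

# Inside a repository, after repacking:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)' |
#     count_deltified
```

For the listing quoted above this would print 0, since every object
there has a null delta base.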

As you can see, core.bigfilethreshold is a pretty blunt instrument. It
might be nice if .gitattributes understood other types of patterns
besides filenames, so you could do something like:

  echo '[size > 500MB] delta -diff' >.gitattributes

or something like that. I don't think it's come up enough for anybody to
care too much about it or work on it.
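
In the absence of such a feature, a crude approximation is to list the
files in the working tree above some size and emit an explicit
.gitattributes line for each. This is only a sketch: large_file_attrs is
a made-up helper name, the output is a snapshot that has to be
regenerated as files grow, and paths containing spaces or glob
characters would need quoting that is not handled here.

```shell
# large_file_attrs is a hypothetical helper, not a git command.
# It prints one "/<path> -diff" line for every regular file under the
# current directory that is larger than the given number of bytes.
large_file_attrs() {
  min_bytes=$1
  find . -type f ! -path './.git/*' -size +"${min_bytes}c" |
    sed 's|^\./\(.*\)$|/\1 -diff|'
}

# From the top of the working tree, for files over 500MB (500 * 1024 * 1024 bytes):
#   large_file_attrs 524288000 >>.gitattributes
```

The leading slash anchors each pattern to the directory holding the
.gitattributes file, so a top-level big file does not also switch off
diffs for same-named files in subdirectories.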

-Peff

Thread overview: 14+ messages
2017-07-24  2:01 Should I store large text files on Git LFS? Farshid Zavareh
2017-07-24  2:29 ` Andrew Ardill
2017-07-24  3:46   ` Farshid Zavareh
2017-07-24  4:13     ` David Lang
2017-07-24  4:18       ` Farshid Zavareh
     [not found]       ` <CANENsPpdQzBqStGjq4jUsAB0-7U8_SQq+=kjmJe6pJtiXxnYFg@mail.gmail.com>
2017-07-24  4:19         ` David Lang
     [not found]   ` <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>
2017-07-24  4:58     ` Andrew Ardill
2017-07-24 18:11       ` Jeff King
2017-07-24 19:41         ` Junio C Hamano
2017-07-25  8:06         ` Andrew Ardill
2017-07-25 19:13           ` Jeff King [this message]
2017-07-25 20:52             ` Junio C Hamano
2017-07-25 21:13               ` Jeff King
2017-07-25 21:38                 ` Stefan Beller
