git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Andrew Ardill <andrew.ardill@gmail.com>
To: Farshid Zavareh <fhzavareh@gmail.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Should I store large text files on Git LFS?
Date: Mon, 24 Jul 2017 14:58:38 +1000	[thread overview]
Message-ID: <CAH5451m0P4eZMXj8ojgbd+q-8scoJpwn9UcZLvqYKM=+8hhWPg@mail.gmail.com> (raw)
In-Reply-To: <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>

Hi Farshid,

On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@gmail.com> wrote:
> I'll probably test this myself, but would modifying and committing a 4GB
> text file actually add 4GB to the repository's size? I anticipate that it
> won't, since Git keeps track of the changes only, instead of storing a copy
> of the whole file (whereas this is not the case with binary files, hence the
> need for LFS).

I decided to do a little test myself. I add three versions of the same
data set (sometimes slightly different cuts of the parent data set,
which I don't have) each between 2 and 4GB in size.
Each time I added a new version it added ~500MB to the repository, and
operations on the repository took 35-45 seconds to complete.
Running `git gc` compressed the objects fairly well, saving ~400MB of
space. I would imagine that even more space would be saved
(proportionally) if there were a lot more similar files in the repo.
The time to checkout different commits didn't change much, I presume
that most of the time is spent copying the large file into the working
directory, but I didn't test that. I did test adding some other small
files, and sometimes it was slow (when cold I think?) and other times
fast.

Overall, I think as long as the files change rarely, and the
repository remains responsive, having these large files in the
repository is ok. They're still big, and if most people will never use
them it will be annoying for people to clone and checkout updated
versions of the files. If you have a lot of the files, or they update
often, or most people don't need all the files, using something like
LFS will help a lot.

$ git version  # running on my windows machine at work
git version 2.6.3.windows.1

$ git init git-csv-test && cd git-csv-test
$ du -h --max-depth=2  # including here to compare after large data
files are added
35K     ./.git/hooks
1.0K    ./.git/info
0       ./.git/objects
0       ./.git/refs
43K     ./.git
43K     .

$ git add data.txt  # first version of the data file, 3.2 GB
$ git commit
$ du -h --max-depth=2  # the data gets compressed down to ~580M of
objects in the git store
35K     ./.git/hooks
1.0K    ./.git/info
2.0K    ./.git/logs
580M    ./.git/objects
1.0K    ./.git/refs
581M    ./.git
3.7G    .


$ git add data.txt  # second version of the data file, 3.6 GB
$ git commit
$ du -h --max-depth=1  # an extra ~520M of objects added
1.2G    ./.git
4.7G    .


$ time git add data.txt  # 42.344s - second version of the data file, 2.2 GB
$ git commit  # takes about 30 seconds to load editor
$ du -h --max-depth=1
1.7G    ./.git
3.9G    .

$ time git checkout HEAD^  # 36.509s
$ time git checkout HEAD^  # 44.658s
$ time git checkout master  # 38.267s

$ git gc
$ du -h --max-depth=1
1.3G    ./.git
3.4G    .

$ time git checkout HEAD^  # 34.743s
$ time git checkout HEAD^  # 41.226s

Regards,

Andrew Ardill

  parent reply	other threads:[~2017-07-24  4:59 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-24  2:01 Should I store large text files on Git LFS? Farshid Zavareh
2017-07-24  2:29 ` Andrew Ardill
2017-07-24  3:46   ` Farshid Zavareh
2017-07-24  4:13     ` David Lang
2017-07-24  4:18       ` Farshid Zavareh
     [not found]       ` <CANENsPpdQzBqStGjq4jUsAB0-7U8_SQq+=kjmJe6pJtiXxnYFg@mail.gmail.com>
2017-07-24  4:19         ` David Lang
     [not found]   ` <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>
2017-07-24  4:58     ` Andrew Ardill [this message]
2017-07-24 18:11       ` Jeff King
2017-07-24 19:41         ` Junio C Hamano
2017-07-25  8:06         ` Andrew Ardill
2017-07-25 19:13           ` Jeff King
2017-07-25 20:52             ` Junio C Hamano
2017-07-25 21:13               ` Jeff King
2017-07-25 21:38                 ` Stefan Beller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAH5451m0P4eZMXj8ojgbd+q-8scoJpwn9UcZLvqYKM=+8hhWPg@mail.gmail.com' \
    --to=andrew.ardill@gmail.com \
    --cc=fhzavareh@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).