git@vger.kernel.org mailing list mirror (one of many)
From: Jakub Narebski <jnareb@gmail.com>
To: Nicolas Pitre <nico@cam.org>
Cc: Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>
Subject: Re: Continue git clone after interruption
Date: Tue, 18 Aug 2009 23:02:09 +0200
Message-ID: <200908182302.10619.jnareb@gmail.com>
In-Reply-To: <alpine.LFD.2.00.0908181537360.6044@xanadu.home>

On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>> Nicolas Pitre <nico@cam.org> writes:

>>> That won't buy you much.  You should realize that a pack is made of:
>>> 
>>> 1) Commit objects.  Yes they're all put together at the front of the pack,
>>>    but they roughly are the equivalent of:
>>> 
>>> 	git log --pretty=raw | gzip | wc -c
>>> 
>>>    For the Linux repo as of now that is around 32 MB.
>> 
>> For my clone of Git repository this gives 3.8 MB
>>  
>>> 2) Tree and blob objects.  Those are the bulk of the content for the top 
>>>    commit. [...]  You can estimate the size of this data with:
>>> 
>>> 	git archive --format=tar HEAD | gzip | wc -c
>>> 
>>>    On the same Linux repo this is currently 75 MB.
>> 
>> On the same Git repository this gives 2.5 MB
> 
> Interesting to see that the commit history is larger than the latest 
> source tree.  Probably that would be the same with the Linux kernel as 
> well if all versions since the beginning with adequate commit logs were 
> included in the repo.

Note that using reflogs and/or a patch management interface like StGit,
and frequently reworking commits (e.g. with rebase), means more commit
objects in the repository.

Also, the Git repository has 3 independent branches, 'man', 'html' and
'todo', whose objects are not included in "git archive HEAD".

> 
>>> 3) Delta objects.  Those are making the rest of the pack, plus a couple 
>>>    tree/blob objects that were not found in the top commit and are 
>>>    different enough from any object in that top commit not to be 
>>>    represented as deltas.  Still, the majority of objects for all the 
>>>    remaining commits are delta objects.
>> 
>> You forgot that delta chains are bound by pack.depth limit, which
>> defaults to 50.  You would have then additional full objects.
> 
> Sure, but that's probably not significant.  The delta chain depth is 
> limited, but not the width.  A given base object can have unlimited 
> delta "children", and so on at each depth level.

You can probably get the number and total size of delta and non-delta
(base) objects in the packfile somehow.  Neither "git verify-pack -v
<packfile>" nor contrib/stats/packinfo.pl helped me arrive at this data.
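
A rough sketch of one way the per-object listing could be summarized.
It assumes the "git verify-pack -v" line format described in
git-verify-pack(1): five fields for a non-delta object (name, type,
size, size in pack, offset) and two extra fields (depth, base name)
for a delta.  The function name pack_delta_stats is made up for this
example, and the sizes it sums are the in-pack sizes only.

```shell
# Summarize `git verify-pack -v` output read on stdin.
# Assumption: object lines have 5 fields (non-delta) or 7 fields (delta);
# summary lines such as "non delta: N objects" are skipped by the type check.
pack_delta_stats() {
    awk '
        $2 ~ /^(commit|tree|blob|tag)$/ && NF == 5 { base_n++;  base_sz  += $4 }
        $2 ~ /^(commit|tree|blob|tag)$/ && NF == 7 { delta_n++; delta_sz += $4 }
        END {
            printf "non-delta: %d objects, %d bytes in pack\n", base_n, base_sz
            printf "delta: %d objects, %d bytes in pack\n", delta_n, delta_sz
        }
    '
}
```

It would be fed with something like
"git verify-pack -v .git/objects/pack/pack-*.idx | pack_delta_stats".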

>> The single packfile for this (just gc'ed) Git repository is 37 MB.
>> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
> 
> What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to 
> be occupied by deltas.

True.
 
>> [cut]
>> 
>> There is another way which we can go to implement resumable clone.
>> Let git first try to clone the whole repository (single pack; BTW what
>> happens if this pack is larger than the file size limit for the given
>> filesystem?).
> 
> We currently fail.  Seems that no one ever had a problem with that so 
> far. We'd have to split the pack stream into multiple packs on the 
> receiving end.  But frankly, if you have a repository large enough to 
> bust your filesystem's file size limit then maybe you should seriously 
> reconsider your choice of development environment.

Do we fail gracefully (with an error message), or does git crash then?

If I remember correctly, FAT28^W FAT32 has a maximum file size of just
under 4 GB.  FAT is often used on SSDs and USB drives.  Although if you
have a packfile that large, you are doing something wrong, or UGFWIINI
(Using Git For What It Is Not Intended).
 
>> If it fails, the client asks first for the first half of the
>> repository (half as in bisect, but it is the server that has to
>> calculate it).  If that downloads, it will ask the server for the rest
>> of the repository.  If it fails, it would reduce the size by half
>> again, and ask for 1/4 of the repository in a packfile first.
> 
> The problem people with slow links have won't be helped at all by this.  
> What if the network connection gets broken only after 49% of the 
> transfer and that took 3 hours to download?  You'll attempt a 25% size 
> transfer which would take 1.5 hour despite the fact that you already 
> spent that much time downloading that first 1/4 of the repository 
> already.  And yet what if you're unlucky and now the network craps on 
> you after 23% of that second attempt?

A modification, then.

First try an ordinary clone.  If it fails because the network is
unreliable, check how much we did download, and ask the server for a
packfile of slightly smaller size; this means that we are asking the
server for an approximate pack size limit, not for a bisect-like
partitioning of the revision list.
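
To make the "slightly smaller" request concrete, here is a minimal
sketch of the arithmetic only.  The helper name next_pack_limit and the
default factor of 90% are invented for this illustration; this is the
proposal's bookkeeping, not an existing git feature.

```shell
# Hypothetical helper: given the number of bytes received before the
# transfer broke, suggest the pack size limit to request next.
# $1 = bytes received, $2 = percentage to keep (default 90, arbitrary).
next_pack_limit() {
    echo $(( $1 * ${2:-90} / 100 ))
}
```

For example, after receiving 18000000 bytes, "next_pack_limit 18000000"
would suggest asking for a pack of at most 16200000 bytes, instead of
dropping to half or a quarter of the whole repository.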

> I think it is better to "prime" the repository with the content of the 
> top commit in the most straightforward manner using git-archive which 
> has the potential to be fully restartable at any point with little 
> complexity on the server side.

But doesn't that make only a 2.5 MB part of the 37 MB packfile fully
restartable?

A question about pack protocol negotiation.  If a client presents some
objects as "have", the server can and does assume that the client has
all prerequisites for such objects, e.g. for a tree object, that it has
all objects for the files and directories inside the tree; for a commit,
it means all ancestors and all objects in the snapshot (the top tree and
its prerequisites).  Do I understand this correctly?
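
For the commit case, the assumption can be pictured as set subtraction
over reachable objects.  The sketch below uses "git rev-list --objects"
to list what is reachable from a "want" but not from a "have"; the
function name objects_to_send is made up, and this only mirrors the
reasoning, not necessarily what the server does internally.

```shell
# List the objects reachable from the "want" ($1) that are not already
# reachable from the "have" ($2).  Everything reachable from the "have"
# (ancestors, their trees and blobs) is treated as present on the client.
objects_to_send() {
    git rev-list --objects "$1" --not "$2"
}
```

In a local repository, "objects_to_send HEAD HEAD~1" would list just
the newest commit plus the trees and blobs it introduced.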

If we have a partial packfile whose download was interrupted, can we
extract some full objects (including blobs) from it?  Can we pass
tree and blob objects as "have" to the server, and are they taken into
account?  Perhaps instead of a separate step of resumably downloading
the top commit's objects (the snapshot), we can tell the server what we
did download in full?


BTW, because of compression it might be more difficult to resume
archive creation in the middle, I think...

-- 
Jakub Narebski
Poland
