Re: [PATCH] RFC: git lazy clone proof-of-concept

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jakub Narebski <jnareb@gmail.com>
To: Jan Holesovsky <kendy@suse.cz>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] RFC: git lazy clone proof-of-concept
Date: Mon, 11 Feb 2008 02:20:07 +0100	[thread overview]
Message-ID: <200802110220.08078.jnareb@gmail.com> (raw)
In-Reply-To: <200802091627.25913.kendy@suse.cz>

Hi, Jan!

On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> 
>> It was not implemented because it was thought to be hard; git assumes
>> in many places that if it has an object, it has all objects referenced
>> by it.
>>
>> But it is very nice of you to [try to] implement 'lazy clone'/'remote
>> alternates'.
>>
>> Could you provide some benchmarks (time, network throughtput, latency)
>> for your implementation?
> 
> Unfortunately not yet :-(  The only data I have that clone done on 
> git://localhost/ooo.git took 10 minutes without the lazy clone, and 7.5 
> minutes with it - and then I sent the patch for review here ;-)  The deadline 
> for our SVN vs. git comparison for OOo is the next Friday, so I'll definitely 
> have some better data by then.

Here perhaps another optimization which wasn't done because git is
fast enough on moderately-sized repositories, namely that IIRC git-clone
(and git-fetch for sure) over native (smart) protocol recreates pack,
even if sometimes better and simplier would be to just copy (transfer)
existing pack.

But this would need multi-pack "extension". (it should work just now
without transport protocol extension, receiver must only be aware
of the need to split resulting pack, and index them all).

>> Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
>> you would need machine with large amount of memory to repack it
>> tightly in sensible time!
> 
> As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> machine (x86-64), so...  But still trying.

I hope that would work better...

>>> Shallow clone is not a possibility - we don't get patches through
>>> mailing lists, so we need the pull/push, and also thanks to the OOo
>>> development cycle, we have too many living heads which causes the
>>> shallow clone to download about 1.5G even with --depth 1.
>>
>> Wouldn't be easier to try to fix shallow clone implementation to allow
>> for pushing from shallow to full clone (fetching from full to shallow
>> is implemented), and perhaps also push/pull between two shallow
>> clones?
> 
> I tried to look into it a bit, but unfortunately did not see a clear way how 
> to do it transparently for the user - say you pull a branch that is based off 
> a commit you do not have.  But of course, I could have missed something ;-)

If I remember correctly fetching _into_ shallow clone works correctly,
as deepening depth of shallow clone. What is not implemented AFAIK, but
should be not too hard would be to allow to push from shallow clone
to full clone. This way the network of full clones (functioning as
centres to publish your work) and shallow + few branches repos (working
repositories).

I don't know if that would be enough.

For better support git would need to exchange graft-like information,
and use union of restrictions to get correct commits.

Perhaps it would be best to mail 'shallow clone' author...

>> As to many living heads: first, you don't need to fetch all
>> heads. Currently git-clone has no option to select subset of heads to
>> clone, but you can always use git-init + hand configuration +
>> git-remote and git-fetch for actual fetching.
> 
> Right, might be interesting as well.  But still the missing push/pull is 
> problematic for us [or at least I see it as a problem ;-)].

You can configure separate 'remote's for the same repository
with different heads. This would work both for pull and for push.

I think the solution proposed by Marco Costalba, namely of creating
"archive" repository, and "live" repository, joining them if needed
by grafts, similarly to how linux kernel has live repo, and historical
import repo, would be good alternative to shallow or lazy clone.

There would be "archive" repo (or repos), read only, with whole history,
very tightly packed with kept packs, with all branches and all tags,
and "live" repo, with only current history (a year, or since major
API change, or from today, or something like that), with only important
branches (or repos, each containg important for a team set of branches).
There would be prepared graft file to join two histories, if you have
to examine full history. Hopefully repo would be smaller.

>> By the way, did you try to split OpenOffice.org repository at the
>> components boundary into submodules (subprojects)? This would also
>> limit amount of needed download, as you don't neeed to download and
>> checkout all subprojects.
> 
> Yes, and got to much nicer repositories by that ;-) - by only moving some 
> binary stuff out of the CVS to a separate tree.  The problem is that the deal 
> is to compare the same stuff in SVN and git - so no choice for me in fact.

Sidenote: due to (from what I have read) heavy use of topic branches
in OOo development, Subversion would have to be used with svnmerge
extension, or together with SVK, to make work with it not complete
pain.

In CVS you could have ad-hoc modules, and ad-hoc partial checkouts
(so called 'modules'), but that plays merry hell with whole tree,
atomic, recoverable state commits. In Git you have to plan carefully
boundaries between submodules / subprojects. Additional advantage
is that you would have boundaries more clear, and better modularity
usually leads to better code.

Comparing directly Subversion and Git is a bit stupid: they promote
different workflows. From what I've read Git with its ability to very
easily create branches, with easy _merging_ of branches, and ability
to easily create _private_ branches (testing branches) have much
common witch chosen OOo SCM workflow. Playing to strentghs of
Subversion because that is why you used because of limits of previously
used tools is not smart.

But if you have to, then you have to. Git would hopefully get lazy
clone support from your effort. But perhaps it would be possible
(if additional work) to prepare two repositories: first the same
as Subversion (and same as now in CVS), second one "how it should
be done with Git".

>> The problem of course is _how_ to split repository into
>> submodules. Submodules should be enough self contained so the
>> whole-tree commit is alsays (or almost always) only about submodule.
> 
> I hope it will be doable _if_ the git wins & will be chosen for OOo.

I hope that ability to work with submodules (with ability to not
clone / checkout modules if not needed), i.e. "svn:externals
done right" to para[hrase SVN slogan, would be one of reasons to
chose Git over Subversion.

>>> Lazy clone sounded like the right idea to me.  With this
>>> proof-of-concept implementation, just about 550M from the 2.5G is
>>> downloaded, which is still about twice as much in comparison with
>>> downloading a tarball, but bearable.
>>
>> Do you have any numbers for OOo repository like number of revisions,
>> depth of DAG of commits (maximum number of revisions in one line of
>> commits), number of files, size of checkout, average size of file,
>> etc.?
> 
> I'll try to provide the data ASAP.

For example what is the size of full checkout (all version-control
managed files). Of for example it is 0.5 GB it would be hard to go
to less that 0.5GB or so with pack size, even with compression
of objects themselves in pack file.

Such large repositories, like Mozilla, GCC, or now OpenOffice.org
tests the limits of Git. Perhaps snapshot-based distributed SCMs
cannot deal sensibly with such large projects; I hope this is not
the case.

I wonder if packv4 improvements, which development stalled because
(if I understand correctly) because it didn't brough as much
improvements, and what is now was good enough for up-till-now
projects, would help with OpenOffice.org repository...

P.S. From what I have read OOo uses CVS + some branch DB; does
your importer make use of this branch info database?

-- 
Jakub Narebski
Poland

next prev parent reply	other threads:[~2008-02-11  1:20 UTC|newest]

Thread overview: 85+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-08 17:28 [PATCH] RFC: git lazy clone proof-of-concept Jan Holesovsky
2008-02-08 18:03 ` Nicolas Pitre
2008-02-09 14:25   ` Jan Holesovsky
2008-02-09 22:05     ` Mike Hommey
2008-02-09 23:38       ` Nicolas Pitre
2008-02-10  7:23     ` Marco Costalba
2008-02-10 12:08       ` Johannes Schindelin
2008-02-10 16:46         ` David Symonds
2008-02-10 17:45           ` Johannes Schindelin
2008-02-10 19:45             ` Nicolas Pitre
2008-02-10 20:32               ` Johannes Schindelin
2008-02-08 18:14 ` Harvey Harrison
2008-02-09 14:27   ` Jan Holesovsky
2008-02-08 18:20 ` Johannes Schindelin
2008-02-08 18:49 ` Mike Hommey
2008-02-08 19:04   ` Johannes Schindelin
2008-02-09 15:06   ` Jan Holesovsky
2008-02-08 19:00 ` Jakub Narebski
2008-02-08 19:26   ` Jon Smirl
2008-02-08 20:09     ` Nicolas Pitre
2008-02-11 10:13       ` Andreas Ericsson
2008-02-12  2:55         ` [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically Brandon Casey
2008-02-12  5:53           ` Andreas Ericsson
     [not found]         ` <1202784078-23700-1-git-send-email-casey@nrlssc.navy.mil>
2008-02-12  2:59           ` [PATCH 2/2] pack-objects: Default to zero threads, meaning auto-assign to #cpus Brandon Casey
2008-02-12  4:57             ` Nicolas Pitre
2008-02-08 20:19     ` [PATCH] RFC: git lazy clone proof-of-concept Harvey Harrison
2008-02-08 20:24       ` Jon Smirl
2008-02-08 20:25         ` Harvey Harrison
2008-02-08 20:41           ` Jon Smirl
2008-02-09 15:27   ` Jan Holesovsky
2008-02-10  3:10     ` Nicolas Pitre
2008-02-10  4:59       ` Sean
2008-02-10  5:22         ` Nicolas Pitre
2008-02-10  5:35           ` Sean
2008-02-11  1:42             ` Jakub Narebski
2008-02-11  2:04               ` Nicolas Pitre
2008-02-11 10:11                 ` Jakub Narebski
2008-02-10  9:34         ` Joachim B Haga
2008-02-10 16:43       ` Johannes Schindelin
2008-02-10 17:01         ` Jon Smirl
2008-02-10 17:36           ` Johannes Schindelin
2008-02-10 18:47         ` Johannes Schindelin
2008-02-10 19:42           ` Nicolas Pitre
2008-02-10 20:11             ` Jon Smirl
2008-02-12 20:37           ` Johannes Schindelin
2008-02-12 21:05             ` Nicolas Pitre
2008-02-12 21:08             ` Linus Torvalds
2008-02-12 21:36               ` Jon Smirl
2008-02-12 21:59                 ` Linus Torvalds
2008-02-12 22:25                   ` Linus Torvalds
2008-02-12 22:43                     ` Jon Smirl
2008-02-12 23:39                       ` Linus Torvalds
2008-02-12 21:25             ` Jon Smirl
2008-02-14 19:20             ` Johannes Schindelin
2008-02-14 20:05               ` Jakub Narebski
2008-02-14 20:16                 ` Nicolas Pitre
2008-02-14 21:04                 ` Johannes Schindelin
2008-02-14 21:59                   ` Jakub Narebski
2008-02-14 23:38                     ` Johannes Schindelin
2008-02-14 23:51                       ` Brian Downing
2008-02-14 23:57                         ` Brian Downing
2008-02-15  0:08                         ` Johannes Schindelin
2008-02-15  1:41                           ` Nicolas Pitre
2008-02-17  8:18                             ` Shawn O. Pearce
2008-02-17  9:05                               ` Junio C Hamano
2008-02-17 18:44                               ` Nicolas Pitre
2008-02-15  1:07                       ` Jakub Narebski
2008-02-15  9:43                     ` Jan Holesovsky
2008-02-14 21:08                 ` Brandon Casey
2008-02-15  9:34               ` Jan Holesovsky
2008-02-10 19:50         ` Nicolas Pitre
2008-02-14 19:41           ` Brandon Casey
2008-02-14 19:58             ` Johannes Schindelin
2008-02-14 20:11             ` Nicolas Pitre
2008-02-11  1:20     ` Jakub Narebski [this message]
2008-02-08 20:16 ` Johannes Schindelin
2008-02-08 21:35   ` Jakub Narebski
2008-02-08 21:52     ` Johannes Schindelin
2008-02-08 22:03       ` Mike Hommey
2008-02-08 22:34         ` Johannes Schindelin
2008-02-08 22:50           ` Mike Hommey
2008-02-08 23:14             ` Johannes Schindelin
2008-02-08 23:38               ` Mike Hommey
2008-02-09 21:20                 ` Jan Hudec
2008-02-09 15:54       ` Jan Holesovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200802110220.08078.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=kendy@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).