git@vger.kernel.org mailing list mirror (one of many)
From: Kapil Jain <jkapil.cs@gmail.com>
To: unlisted-recipients:; (no To-header on input)
Cc: git@vger.kernel.org, Thomas Gummerer <t.gummerer@gmail.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: New Ft. for Git : Allow resumable cloning of repositories.
Date: Sun, 10 Mar 2019 21:29:33 +0530
Message-ID: <CAMknYEMP73D=LSKKvYKpmTdR3LAxc5UMgT3gxiQDZBghkLFo_g@mail.gmail.com>
In-Reply-To: <20190308174314.129611-1-jonathantanmy@google.com>

On Fri, Mar 8, 2019 at 11:13 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> This is indeed a nice feature to have, and thanks for details of how
> this would be accomplished.
>
> One issue is that when cloning a repository, we do not download many
> files - we only download one dynamically generated packfile containing
> all the objects we want.

Is it because the packfile is generated dynamically for each client
request, and is discarded by the server as soon as the connection
closes, that we cannot pause and resume the download the way download
managers do?

I read through the 'Git Internals' chapter of the Pro Git book and the
following thought came to me:

Assume a pack file as follows:
---
$ git verify-pack -v .git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.idx
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 \
  b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
---
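
To make the columns above easier to read, here is a rough Python sketch
that parses `git verify-pack -v` output. It assumes the documented
layout: five fields (SHA-1, type, size, size in packfile, offset) for a
non-delta object, and seven fields (the same five plus delta depth and
base SHA-1) for a delta. The names PackEntry and parse_verify_pack_line
are mine, not part of git, and the sample joins the continued line into
one.

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    type: str
    size: int            # size of the object once expanded
    size_in_pack: int    # bytes it occupies inside the packfile
    offset: int          # byte offset inside the packfile
    depth: Optional[int] = None   # delta chain depth (deltas only)
    base: Optional[str] = None    # SHA-1 of the delta base (deltas only)


def parse_verify_pack_line(line: str) -> Optional[PackEntry]:
    """Parse one object line; return None for trailer lines such as '...: ok'."""
    fields = line.split()
    if len(fields) == 5:
        sha, typ, size, packed, offset = fields
        return PackEntry(sha, typ, int(size), int(packed), int(offset))
    if len(fields) == 7:
        sha, typ, size, packed, offset, depth, base = fields
        return PackEntry(sha, typ, int(size), int(packed), int(offset),
                         int(depth), base)
    return None


SAMPLE = """\
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
"""

entries = [e for e in map(parse_verify_pack_line, SAMPLE.splitlines()) if e]
for e in entries:
    kind = f"delta against {e.base[:4]}" if e.base else "stored in full"
    print(e.sha[:4], e.type, kind)
```

Read this way, b042 is stored in full and 033b is a delta against it.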

Here the 033b blob refers to the b042 blob as its delta base, and both
blobs are different versions of the same file.

Before this pack was made, both blobs were stored separately as loose
objects and thus took more space.
The packfile saves space by storing only the latest version in full,
plus a delta from which the earlier version is reconstructed. Both the
delta and the latest version are stored in compressed form, right?

Now, here is another approach to save space without needing to create a pack:

Earlier, both blobs had their own object files:

.git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e
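
For reference, the two paths above follow the loose object layout as I
understand it from the book: the object ID is the SHA-1 of a
"<type> <size>\0" header plus the content, the first two hex digits
become the directory name, and the file holds the zlib-compressed
header and content. A small Python sketch (loose_object is my own
helper name, and it assumes a SHA-1 repository):

```python
import hashlib
import zlib
from typing import Tuple


def loose_object(content: bytes, obj_type: str = "blob") -> Tuple[str, str, bytes]:
    """Return (object id, loose path, compressed bytes) for the given content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    store = header + content
    sha = hashlib.sha1(store).hexdigest()
    path = f".git/objects/{sha[:2]}/{sha[2:]}"
    return sha, path, zlib.compress(store)


sha, path, compressed = loose_object(b"")
print(path)  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

So every loose object is compressed on its own, but each one holds the
complete content of that version.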

Let's say b042 is the latest version and 033b is the earlier one.

What git does in the packfile can be done right here by:

storing the latest version in
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and its delta
in .git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5. To the
delta object we can add a header that tells git to look up
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and apply the
delta to it to get the earlier version.

Doing this eliminates the big packfile, and all the objects are
spread across directories. We can now make this resumable, right?
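
To make the idea concrete, here is a toy Python sketch of what I have
in mind. Nothing in it is git's real format: the in-memory dict stands
in for .git/objects, the "delta" is a simple list of copy/insert
operations, and put_full, put_delta and read_object are names I made
up for illustration.

```python
import hashlib
from typing import Dict, List, Tuple, Union

# ("copy", offset, length) copies bytes from the base;
# ("insert", data) adds literal bytes.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]

# Hypothetical stand-in for the .git/objects directory.
objects: Dict[str, dict] = {}


def object_id(content: bytes) -> str:
    """Blob ID computed as SHA-1 over 'blob <size>\\0' + content."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


def put_full(content: bytes) -> str:
    oid = object_id(content)
    objects[oid] = {"kind": "full", "data": content}
    return oid


def put_delta(base_oid: str, ops: List[Op], result: bytes) -> str:
    """Store only the ops, plus a header field naming the base object."""
    oid = object_id(result)
    objects[oid] = {"kind": "delta", "base": base_oid, "ops": ops}
    return oid


def read_object(oid: str) -> bytes:
    entry = objects[oid]
    if entry["kind"] == "full":
        return entry["data"]
    base = read_object(entry["base"])  # the base must already be present
    out = bytearray()
    for op in entry["ops"]:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    data = bytes(out)
    if object_id(data) != oid:
        raise ValueError("delta did not reproduce the expected object")
    return data


latest = b"line one\nline two\nline three\n"
earlier = b"line one\nline two\n"

latest_id = put_full(latest)
# The earlier version is just the first 18 bytes of the latest one.
earlier_id = put_delta(latest_id, [("copy", 0, 18)], earlier)

print(read_object(earlier_id) == earlier)
```

One thing the sketch makes visible: an object stored this way cannot be
read until its base has arrived, so a resumed transfer would have to
keep track of which bases are still missing.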

Please point out what I missed here.
Is it possible to do the above? If yes, then what was the reason for
introducing the concept of the packfile?

> You might be interested in some work I'm doing to offload part of the
> packfile response to CDNs:
>
> https://public-inbox.org/git/cover.1550963965.git.jonathantanmy@google.com/
>
> This means that when cloning/fetching, multiple files could be
> downloaded, meaning that a scheme like you suggest would be more
> worthwhile. (In fact, I allude to such a scheme in the design document
> in patch 5.)

I am currently reading through all the discussion on this strategy.

Thread overview: 3+ messages
2019-03-08 15:43 New Ft. for Git : Allow resumable cloning of repositories Kapil Jain
2019-03-08 17:43 ` Jonathan Tan
2019-03-10 15:59   ` Kapil Jain [this message]
