git@vger.kernel.org mailing list mirror (one of many)
* New Ft. for Git : Allow resumable cloning of repositories.
From: Kapil Jain @ 2019-03-08 15:43 UTC
  To: git, Thomas Gummerer

Objective: Allow pause and resume functionality while cloning repositories.

Below is a rough idea on how this may be achieved.

1) Create a repository_name.json file.
2) repository_name.json will be an index file containing a list of all
the files in the repository, each with a default status of "False".
   A "False" status signifies that the file is not yet fully
downloaded.

Something like this:

{
  "file1.ext" : "False",
  "file2.ext" : "False",
  "file3.ext" : "False"
}

3) As each file finishes downloading, its status changes to "True".
For example, once 'file1.ext' and 'file2.ext' have finished
downloading, the index would look something like this:

{
  "file1.ext" : "True",
  "file2.ext" : "True",
  "file3.ext" : "False"
}

4) Suppose that, for some reason, cloning is interrupted before
'file3.ext' finishes downloading.
5) After the interruption, repository_name.json and the downloaded
files are preserved.
6) The next time cloning of the same repository begins, only the files
still marked "False" in repository_name.json are downloaded.
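
The steps above could be sketched roughly as follows, in Python. This
is only an illustration of the bookkeeping: `fetch_file` is a
hypothetical callback standing in for whatever would actually transfer
one file, the function names are made up for this sketch, and the
status file simply mirrors the JSON layout described above. Nothing
here reflects how Git transfers data today.

```python
import json
import os


def load_manifest(path, all_files):
    """Load the status file, or start one with every file marked "False"."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {name: "False" for name in all_files}


def save_manifest(path, manifest):
    """Write via a temporary file so an interruption cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(manifest, fh, indent=2)
    os.replace(tmp, path)


def resumable_clone(manifest_path, all_files, fetch_file):
    """Fetch only files still marked "False"; fetch_file is hypothetical."""
    manifest = load_manifest(manifest_path, all_files)
    save_manifest(manifest_path, manifest)
    for name, status in list(manifest.items()):
        if status == "True":
            continue  # finished in an earlier run, so skip it
        fetch_file(name)  # may raise if the connection drops
        manifest[name] = "True"
        save_manifest(manifest_path, manifest)  # record progress per file
    return manifest
```

Saving the status file after every completed file is what lets a later
run pick up where the interrupted one stopped.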

Note 1: Doing this for cloning is the main objective. Later, it could
be extended to fetching, pulling, and pushing as well.

Note 2: Since this is GSoC time, please don't take this to be a
project idea for GSoC, as it was pointed out on IRC that this would be
a time-intensive piece of functionality.

I want to work on building this functionality.
Please share your thoughts on this, so that we can draw up a
technically sound to-do list.

* Re: New Ft. for Git : Allow resumable cloning of repositories.
From: Jonathan Tan @ 2019-03-08 17:43 UTC
  To: jkapil.cs; +Cc: git, t.gummerer, Jonathan Tan

> Objective: Allow pause and resume functionality while cloning repositories.
> 
> Below is a rough idea on how this may be achieved.

This is indeed a nice feature to have, and thanks for the details of
how this would be accomplished.

> 1) Create a repository_name.json file.
> 2) repository_name.json will be an index file containing list of all
> the files in the repository with default status being "False".
>    "False" status of a file signifies that this file is not yet fully
> downloaded.
> 
> Something like this:
> 
> {
>   "file1.ext" : "False",
>   "file2.ext" : "False",
>   "file3.ext" : "False"
> }

One issue is that when cloning a repository, we do not download many
files - we only download one dynamically generated packfile containing
all the objects we want.

You might be interested in some work I'm doing to offload part of the
packfile response to CDNs:

https://public-inbox.org/git/cover.1550963965.git.jonathantanmy@google.com/

This means that when cloning/fetching, multiple files could be
downloaded, meaning that a scheme like you suggest would be more
worthwhile. (In fact, I allude to such a scheme in the design document
in patch 5.)

* Re: New Ft. for Git : Allow resumable cloning of repositories.
From: Kapil Jain @ 2019-03-10 15:59 UTC
  Cc: git, Thomas Gummerer, Jonathan Tan

On Fri, Mar 8, 2019 at 11:13 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> This is indeed a nice feature to have, and thanks for details of how
> this would be accomplished.
>
> One issue is that when cloning a repository, we do not download many
> files - we only download one dynamically generated packfile containing
> all the objects we want.

The packfile is dynamically generated for a specific client request
and is discarded by the server as soon as the connection between them
closes. Is this the reason why we cannot pause the download midway,
the way download managers can?

I read through the 'Git Internals' chapter of the Pro Git book and the
following thought came to me:

Assume a pack file as follows:
---
$ git verify-pack -v .git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.idx
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 \
  b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
---

Here the 033b blob refers to the b042 blob as its delta base, and the
two blobs are different versions of the same file.
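
To make that relationship explicit, here is a small Python sketch that
reads `git verify-pack -v` output and lists which objects are stored
as deltas and against which base. It assumes the documented line
layout (SHA-1, type, size, size in pack, offset, and for deltified
objects also depth and base SHA-1) and that each object sits on a
single line, so the wrapped 033b line above is joined in the sample.
`PackEntry` and `parse_verify_pack` are names invented for this
sketch.

```python
import re
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int
    size_in_pack: int
    offset: int
    depth: Optional[int] = None  # set only for deltified objects
    base: Optional[str] = None   # SHA-1 of the delta base, if any


def parse_verify_pack(output: str):
    """Parse object lines from `git verify-pack -v`, skipping other lines."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Object lines start with a 40-hex-digit SHA-1 and have 5 or 7 fields.
        if len(fields) not in (5, 7):
            continue
        if not re.fullmatch(r"[0-9a-f]{40}", fields[0]):
            continue
        sha, kind = fields[0], fields[1]
        size, size_in_pack, offset = (int(f) for f in fields[2:5])
        if len(fields) == 7:
            entries.append(PackEntry(sha, kind, size, size_in_pack, offset,
                                     int(fields[5]), fields[6]))
        else:
            entries.append(PackEntry(sha, kind, size, size_in_pack, offset))
    return entries


SAMPLE = """\
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 1463
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 7262 1 b042a60ef7dff760008df33cee372b945b6e884e
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack: ok
"""

for entry in parse_verify_pack(SAMPLE):
    if entry.base is not None:
        print(f"{entry.sha[:4]} is a delta against {entry.base[:4]}")
```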

Before this pack was made, the two blobs were stored separately and
therefore took up more space.
The packfile saves space by storing only the latest version in full,
plus its delta against the earlier version. Both the delta and the
latest version are stored in compressed form, right?

Now, here is another approach to save space without needing to create a pack:

Earlier both the blobs had their object files as:

.git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e
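
For reference, these paths follow from the object IDs: in a SHA-1
repository, Git hashes a header of the form "blob <size>\0" followed
by the content, uses the first two hex digits of the result as the
directory name and the remaining 38 as the file name, and stores the
zlib-compressed data there. A minimal Python sketch (the function
names are made up for illustration):

```python
import hashlib
import zlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path under .git/objects."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes that would be written to that path."""
    return zlib.compress(b"blob %d\0" % len(content) + content)


print(loose_object_path("033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5"))
```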

Let's say b042 is the latest version and 033b is the earlier one.

What git does in a packfile could be done right here by storing the
latest version in
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and its delta
in .git/objects/03/3b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5. The delta
object would carry a header telling git to look up
.git/objects/b0/42a60ef7dff760008df33cee372b945b6e884e and apply the
delta to it to get the earlier version.

Doing this eliminates the big packfile, and all the objects are
spread across directories. We could then make this resumable, right?
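
A rough Python sketch of that idea, under loud assumptions: the
in-memory `store`, its "kind" / "base" / "ops" fields, and the
("copy", offset, length) / ("insert", data) tuples are all
hypothetical. Git's real delta encoding is a binary copy/insert
instruction stream, and this is only a simplified stand-in for it.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a version from its base using simplified copy/insert ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown delta op: {op[0]!r}")
    return bytes(out)


def read_object(store, object_id):
    """Return full content, following the hypothetical delta header."""
    entry = store[object_id]
    if entry["kind"] == "full":
        return entry["data"]
    # The "base" field plays the role of the proposed header: it names the
    # object that must be read first. A missing base raises KeyError.
    base = read_object(store, entry["base"])
    return apply_delta(base, entry["ops"])


# Hypothetical store: the latest version in full, the earlier one as a delta.
store = {
    "b042": {"kind": "full", "data": b"hello, resumable world\n"},
    "033b": {
        "kind": "delta",
        "base": "b042",
        "ops": [("copy", 0, 7), ("copy", 17, 6)],
    },
}

earlier = read_object(store, "033b")
```

One thing the sketch makes visible is that a delta object is only
usable once its base has also arrived, so a resumed transfer would
need to keep track of that dependency.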

Please point out what I missed here.
Is it possible to do the above? If yes, what was the reason for
introducing the concept of a packfile?

> You might be interested in some work I'm doing to offload part of the
> packfile response to CDNs:
>
> https://public-inbox.org/git/cover.1550963965.git.jonathantanmy@google.com/
>
> This means that when cloning/fetching, multiple files could be
> downloaded, meaning that a scheme like you suggest would be more
> worthwhile. (In fact, I allude to such a scheme in the design document
> in patch 5.)

I am currently reading through all the discussion on this strategy.
