git@vger.kernel.org mailing list mirror (one of many)
From: Kevin Wern <kevin.m.wern@gmail.com>
To: git@vger.kernel.org
Subject: [PATCH 00/11] Resumable clone
Date: Thu, 15 Sep 2016 20:12:11 -0400
Message-ID: <1473984742-12516-1-git-send-email-kevin.m.wern@gmail.com>

Hey, all,

It's been a while (sent a very short patch in May), but I've
still been working on the resumable clone feature and checking up on
the mailing list for any updates. After submitting the prime-clone
service alone, I figured implementing the whole thing would be the best
way to understand the full scope of the problem (this is my first real
contribution here, and learning while working on such an involved
feature has not been easy). 

This is a functional implementation handling a direct http/ftp URI to a
single, fully connected packfile (i.e. the link is a direct path to the
file, not a prefix or guess). My hope is that this acts as a bare
minimum cross-section spanning the full requirements that can expand in
width as more cases are added (.info file, split bundle, daemon
download service). This is certainly not perfect, but I think it at
least prototypes each component involved in the workflow.

This patch series is based on jc/bundle, because the logic to find the
tips of a pack's history already exists there (I call index-pack
--clone-bundle on the downloaded file, and read the file to write the
references to a temporary directory). If I need to re-implement this
logic or base it on another branch, let me know. For ease of pulling
and testing, I included the branch here:

https://github.com/kevinwern/git/tree/feature/prime-clone

Although there are a few changes internally from the last patch,
the "alternate resource" url to download is configured on the
server side in exactly the same way:

[primeclone]
	url = http://location/pack-$NAME.pack
	filetype = pack

The prime-clone service simply outputs the components as:

####url filetype
0000
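
As a rough illustration of the framing above (a sketch, not code from the
patches): in git's pkt-line format, "####" is a 4-hex-digit length that
counts its own four bytes, and "0000" is the flush packet. The function
names below are hypothetical, and the exact payload layout (single space
separator, no trailing newline) is an assumption.

```python
FLUSH_PKT = b"0000"      # pkt-line flush packet
MAX_PKT_LEN = 65520      # largest total pkt-line length git allows


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length prefix counts its own 4 bytes
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def prime_clone_response(url: str, filetype: str) -> bytes:
    """Build a response shaped like the '####url filetype' + '0000' output."""
    # Assumption: one space between fields and no trailing newline.
    return pkt_line(f"{url} {filetype}".encode()) + FLUSH_PKT
```

For example, prime_clone_response("http://location/pack-1.pack", "pack")
produces a single framed line followed by the flush packet.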

On the client side, the transport_prime_clone and
transport_download_primer APIs are built to be more robust (i.e. read
messages without dying due to protocol errors), so that git clone can
always try them without being dependent on the capability output of
git-upload-pack. transport_download_primer is dependent on the success
of transport_prime_clone, but transport_prime_clone is always run on an
initial clone. Part of achieving this robustness involves adding
*_gentle functions to pkt_line, so that prime_clone can fail silently
without dying.
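
To make the "gentle" idea concrete, here is a minimal sketch (not the
actual pkt-line.c change) of a reader that reports protocol problems
through a status value instead of aborting, so a caller can quietly fall
back to a normal clone. All names are hypothetical, and the payload is
assumed to be "url filetype" separated by one space.

```python
from typing import Optional, Tuple

PKT_OK, PKT_FLUSH, PKT_ERROR = "ok", "flush", "error"
HEX_DIGITS = b"0123456789abcdefABCDEF"


def read_pkt_line_gentle(buf: bytes, pos: int = 0) -> Tuple[str, Optional[bytes], int]:
    """Read one pkt-line at pos; return (status, payload, next_pos), never raise."""
    header = buf[pos:pos + 4]
    if len(header) < 4 or any(c not in HEX_DIGITS for c in header):
        return PKT_ERROR, None, pos          # short read or non-hex length
    length = int(header.decode("ascii"), 16)
    if length == 0:
        return PKT_FLUSH, None, pos + 4      # "0000" flush packet
    if length < 4 or pos + length > len(buf):
        return PKT_ERROR, None, pos          # impossible or truncated length
    return PKT_OK, buf[pos + 4:pos + length], pos + length


def parse_prime_clone(buf: bytes) -> Optional[Tuple[str, str]]:
    """Return (url, filetype), or None so the caller can fall through."""
    status, payload, _ = read_pkt_line_gentle(buf)
    if status != PKT_OK:
        return None
    parts = payload.decode("utf-8", "replace").strip().split(" ")
    if len(parts) != 2:
        return None
    return parts[0], parts[1]
```

A server that does not know the service at all would typically produce
output that fails the header check, and parse_prime_clone() then simply
returns None.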

The transport_download_primer function uses a resumable download,
which supports both automatic and manual resuming. Automatic resuming
means programmatically reconnecting to the resource after being
interrupted (up to a set number of times). Manual resuming uses a
newly taught --resume option on the command line:

git clone --resume <resumable_work_or_git_dir>
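
The two resume modes can be sketched with one loop (an illustration under
stated assumptions, not the http.c implementation). The fetch_from
callback is hypothetical: it stands in for an HTTP client that requests
the resource from a byte offset using a standard Range header and raises
ConnectionError when interrupted. The default of 5 attempts mirrors the
"repeat 5 times" behavior described below.

```python
from typing import BinaryIO, Callable, Iterable, Tuple

# Hypothetical callback: yields body chunks starting at the given offset.
FetchFrom = Callable[[int], Iterable[bytes]]


def range_header(offset: int) -> dict:
    """HTTP header asking the server to skip the bytes we already have."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def download_resumable(fetch_from: FetchFrom, out: BinaryIO,
                       offset: int = 0, max_attempts: int = 5) -> Tuple[bool, int]:
    """Append to out, reconnecting from the current offset after each drop.

    Returns (finished, offset). A nonzero starting offset models a manual
    resume; the retry loop models automatic resuming.
    """
    for _ in range(max_attempts):
        try:
            for chunk in fetch_from(offset):
                out.write(chunk)
                offset += len(chunk)
            return True, offset
        except ConnectionError:
            continue  # reconnect and ask for the remaining bytes
    return False, offset
```

If the loop gives up, the returned offset is what a later manual resume
would pass back in, together with the partially written file.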

Right now, a manually resumable directory is left behind only if the
*client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
is set (right before the download). For an initial clone, if the
connection fails after automatic resuming, the client erases the
partial resources and falls through to a normal clone. However, once a
resumable directory is left behind by the program, it is NEVER
deleted/abandoned after it is continued with --resume.

I think determining when a resource is "unsalvageable" should be more
nuanced, especially in a case where a connection is perpetually poor
and the user wishes to resume over a long period of time. The timeout
logic itself *definitely* needs more nuance than "repeat 5 times", such
as expanding wait times and using earlier successes when deciding to
try again. Right now, I think the most important part of this patch is
that these two paths (falling through after a failed download, exiting
to be manually resumed later) exist.
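
One possible shape for a more nuanced retry decision, sketched only to
make the idea concrete (nothing like this exists in the series): waits
grow exponentially, and any attempt that made progress resets the
failure count, so a slow but working connection is not abandoned. The
class name and all parameters are hypothetical.

```python
from typing import Optional


class RetryPolicy:
    """Exponential backoff that forgives failures after real progress."""

    def __init__(self, base: float = 1.0, factor: float = 2.0,
                 cap: float = 60.0, max_stalls: int = 5):
        self.base = base              # first wait, in seconds
        self.factor = factor          # growth per consecutive stall
        self.cap = cap                # upper bound on any single wait
        self.max_stalls = max_stalls  # consecutive no-progress attempts allowed
        self.stalls = 0

    def next_delay(self, made_progress: bool) -> Optional[float]:
        """Return seconds to wait before retrying, or None to give up."""
        if made_progress:
            self.stalls = 0           # an earlier success buys more patience
        else:
            self.stalls += 1
        if self.stalls >= self.max_stalls:
            return None               # caller falls through or exits resumable
        return min(self.base * self.factor ** self.stalls, self.cap)
```

A None result is the point where git clone would choose between the two
paths described above: fall through to a normal clone, or leave a
resumable directory and exit.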

Off the top of my head, outstanding issues/TODOs include:
	- The above issue of determining when to fall through, when to
	  reattempt, and when to write the resumable info and exit
	  in git clone.
	- Creating git-daemon service to download a resumable resource.
	  Pretty straightforward, I think, especially if
	  http.getanyfile already exists. This falls more under
	  "haven't gotten to yet" than dilemma.
	- Logic for git clone to determine when a full clone would
	  be superior, such as when a clone is local or a reference is
	  given.
	- Configuring prime-clone for multiple resources, in two
	  dimensions: (a) resources to choose from (e.g. fall back to
	  a second resource if the first one doesn't work) and (b)
	  resources to be downloaded together or in sequence (e.g.
	  download http://host/this, then http://host/that). Maybe
	  prime-clone could also handle client preferences in terms of
	  filetype or protocol. For this, I just have to re-read a few
	  discussions about the filetypes we use to see if there are
	  any outliers that aren't representable in this way. I think
	  this is another "haven't gotten to yet".
	- Related to the above, seeing if there are any outlying
	  resource types whose process can't be modularized into:
	  download to location, use, clean one way if failed, clean
	  another way if succeeded. The "split bundle," for example,
	  is retrieved (download), read for the pack location (use),
	  and then the packfile is retrieved (download). I believe, in
	  this case, all of that can be considered the "download," and
	  then indexing/writing can be considered "use." But I'm not
	  sure if there are more extreme cases.
	- Creating the logic to guess a packfile, and append that to a
	  prefix specified by the admin. Additionally, allowing the
	  admin to use a custom script to use their own logic to
	  output the URL.
	- Preventing the retry wait period (currently set by using
	  select()) from being interrupted by other system calls.
	  I believe there is a setting in libcurl, but I don't want
	  to make any potentially large-impact changes without
	  discussing it first. Plus, I believe changes to http.c were
	  up for discussion anyway.
	- Finding if there's a more elegant way to access the alternate
	  resource than invoking remote-helper with a url we don't care
	  about (the same url that will be specified later to stdin
	  with "download-primer").
	- Finding if there is a better way to suppress index-pack's
	  output than creating a run-command option specifically to
	  suppress stdout.
	- When running with ssh and a password, the credentials are
	  prompted for twice. I don't know if there is a way to
	  preserve credentials between executions. I couldn't find any
	  examples in git's source.
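
The "download, use, clean one way if failed, clean another way if
succeeded" modularization from the list above could look roughly like
this. It is a hypothetical sketch of the lifecycle only: the class and
method names are invented for illustration, and the real resource types
(pack, split bundle) would fill in the steps.

```python
class Resource:
    """Hypothetical interface for one alternate-resource type."""

    def download(self, dest: str) -> None:
        """Fetch everything needed (for a split bundle: header, then pack)."""
        raise NotImplementedError

    def use(self, dest: str) -> None:
        """Consume the download, e.g. index the pack and write refs."""
        raise NotImplementedError

    def clean_failed(self, dest: str) -> None:
        """Remove or preserve partial files after a failure."""

    def clean_succeeded(self, dest: str) -> None:
        """Remove temporary files once the resource has been used."""


def run_resource(res: Resource, dest: str) -> bool:
    """Drive one resource through its lifecycle; False means fall through."""
    try:
        res.download(dest)
        res.use(dest)
    except Exception:
        res.clean_failed(dest)
        return False
    res.clean_succeeded(dest)
    return True
```

A resource type fits this model if all of its network activity can be
folded into download() and all of its local processing into use().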

Some of these are issues I've been actively working on, but I'm
hitting a point where keeping everyone up-to-date trumps completeness.
Hopefully, the bulk of the 'learning and re-doing' is done and I can
update more frequently in smaller increments.

I will probably work on the git-daemon download service, the curl
timeout issue, and supporting other filetypes next.

Feedback is appreciated.

Kevin Wern (11):
  Resumable clone: create service git-prime-clone
  Resumable clone: add prime-clone endpoints
  pkt-line: create gentle packet_read_line functions
  Resumable clone: add prime-clone to remote-curl
  Resumable clone: add output parsing to connect.c
  Resumable clone: implement transport_prime_clone
  Resumable clone: add resumable download to http/curl
  Resumable clone: create transport_download_primer
  path: add resumable marker
  run command: add RUN_COMMAND_NO_STDOUT
  Resumable clone: implement primer logic in git-clone

 .gitignore                         |   1 +
 Documentation/git-clone.txt        |  16 +
 Documentation/git-daemon.txt       |   7 +
 Documentation/git-http-backend.txt |   7 +
 Documentation/git-prime-clone.txt  |  39 +++
 Makefile                           |   2 +
 builtin.h                          |   1 +
 builtin/clone.c                    | 590 +++++++++++++++++++++++++++++++------
 builtin/prime-clone.c              |  77 +++++
 cache.h                            |   1 +
 connect.c                          |  47 +++
 connect.h                          |  10 +-
 daemon.c                           |   7 +
 git.c                              |   1 +
 http-backend.c                     |  22 +-
 http.c                             |  86 +++++-
 http.h                             |   7 +-
 path.c                             |   1 +
 pkt-line.c                         |  47 ++-
 pkt-line.h                         |  16 +
 remote-curl.c                      | 192 +++++++++---
 run-command.c                      |   1 +
 run-command.h                      |   1 +
 t/t9904-git-prime-clone.sh         | 181 ++++++++++++
 transport-helper.c                 |  75 ++++-
 transport.c                        |  53 ++++
 transport.h                        |  27 ++
 27 files changed, 1361 insertions(+), 154 deletions(-)
 create mode 100644 Documentation/git-prime-clone.txt
 create mode 100644 builtin/prime-clone.c
 create mode 100755 t/t9904-git-prime-clone.sh

-- 
2.7.4


Thread overview: 39+ messages in thread
2016-09-16  0:12 Kevin Wern [this message]
2016-09-16  0:12 ` [PATCH 01/11] Resumable clone: create service git-prime-clone Kevin Wern
2016-09-16 20:53   ` Junio C Hamano
2016-09-28  4:40     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 02/11] Resumable clone: add prime-clone endpoints Kevin Wern
2016-09-19 13:15   ` Duy Nguyen
2016-09-28  4:43     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 03/11] pkt-line: create gentle packet_read_line functions Kevin Wern
2016-09-16 22:17   ` Junio C Hamano
2016-09-28  4:42     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 04/11] Resumable clone: add prime-clone to remote-curl Kevin Wern
2016-09-19 13:52   ` Duy Nguyen
2016-09-28  6:45     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 05/11] Resumable clone: add output parsing to connect.c Kevin Wern
2016-09-16  0:12 ` [PATCH 06/11] Resumable clone: implement transport_prime_clone Kevin Wern
2016-09-16  0:12 ` [PATCH 07/11] Resumable clone: add resumable download to http/curl Kevin Wern
2016-09-16 22:45   ` Junio C Hamano
2016-09-28  6:41     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 08/11] Resumable clone: create transport_download_primer Kevin Wern
2016-09-16  0:12 ` [PATCH 09/11] path: add resumable marker Kevin Wern
2016-09-19 13:24   ` Duy Nguyen
2016-09-16  0:12 ` [PATCH 10/11] run command: add RUN_COMMAND_NO_STDOUT Kevin Wern
2016-09-16 23:07   ` Junio C Hamano
2016-09-18 19:22     ` Johannes Schindelin
2016-09-28  4:46     ` Kevin Wern
2016-09-28 17:54       ` Junio C Hamano
2016-09-28 18:06         ` Kevin Wern
2016-09-16  0:12 ` [PATCH 11/11] Resumable clone: implement primer logic in git-clone Kevin Wern
2016-09-16 23:32   ` Junio C Hamano
2016-09-28  5:49     ` Kevin Wern
2016-09-19 14:04   ` Duy Nguyen
2016-09-19 17:16     ` Junio C Hamano
2016-09-28  4:44     ` Kevin Wern
2016-09-16 20:47 ` [PATCH 00/11] Resumable clone Junio C Hamano
2016-09-27 21:51 ` Eric Wong
2016-09-27 22:07   ` Junio C Hamano
2016-09-28 17:32     ` Junio C Hamano
2016-09-28 18:22       ` Junio C Hamano
2016-09-28 20:46     ` Eric Wong
