From: Robin Ruede <r.ruede@gmail.com>
To: git@vger.kernel.org
Cc: Robin Ruede <r.ruede@gmail.com>
Subject: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories
Date: Thu, 28 Jul 2016 18:02:19 +0200
Message-ID: <20160728160226.24018-1-r.ruede@gmail.com>

This patch series adds a `--sparse-prefix=` option to multiple commands,
which makes it possible to fetch only the contents of a single subdirectory
of a remote repository.

This works along with sparse-checkout, and is especially useful for repositories
where a subdirectory has meaning when standing alone.

* Motivation (example use cases)

1. Git repositories used for managing large/binary files
  My university has a repository containing lecture slides etc.
  as PDFs, with a subdirectory for each lecture. Fetching the whole
  repository (even with --depth=1) transfers 4GiB and takes significant
  processing time, whereas fetching the complete history of a single lecture
  transfers 25MiB and completes instantly.
2. package-manager-like repositories. Examples:
  a) Arch Linux package build files repository [1]
  b) Rust crates.io packages [2]
  c) TypeScript type definitions [3]
3. Excluding a specific directory containing e.g. large binary assets
  Not currently possible with this patch set, but could be added
  (see problem 2 below).
4. Getting the history of a single file
5. Other uses
  As a non-kernel developer, I wanted to quickly search through
  the code of only the btrfs filesystem using the git tools, but I do not have
  a local clone of the complete repository. Using `--depth=100` in combination
  with `--sparse-prefix=/fs/btrfs` keeps bandwidth usage low
  while still retaining some history.
6. This is trivial in SVN, and an internet search turns up multiple
  questions about this feature [4-7]

* Example usage:

Getting the source of the btrfs filesystem with a bit of history:

    $ git clone git@server:linux --depth=100 # shallow, not sparse
    Receiving objects: 100% (814945/814945), 438.55 MiB | 35.21 MiB/s, done.
    ...
    $ git clone git@server:linux --depth=100 --sparse-prefix=/fs/btrfs # sparse and shallow
    Receiving objects: 100% (503747/503747), 121.45 MiB | 59.75 MiB/s, done.
    ...
    $ cd linux && ls ./
    fs
    $ ls fs/
    btrfs
    $ git log --oneline
    (repo behaves the same as a full clone with sparse-checkout /fs/btrfs)



* Open problems:

1. Currently all trees are still included. It would be possible to
include only the trees relevant to the sparse files, which would significantly
reduce the pack sizes for repositories containing a lot of small files that
change often, for example package managers using git. I am not sure in how many
places all trees are presumed to be present.

2. This patch set implements the filter as a simple check against a single
prefix, given as a command line option.
Using the exclude_list format (same as in sparse-checkout) might be useful.
The server needs to check these patterns for all files in history, so I'm not
sure if allowing multiple/complex patterns is a good idea.

3. This patch set assumes the sparse-prefix and sparse-checkout do not change.
Both clone and every later fetch need to be given the --sparse-prefix= option,
otherwise complete packs will be fetched. Not sure what the best way to store
the information is; possibly create a new file `.git/sparse` similar to
`.git/shallow` containing the path(s).

4. Bitmap indices cannot be used, because they do not contain the paths of the
objects. So for creating packs, the whole DAG has to be walked.

5. Fsck complains about missing blobs. Should be fairly easy to fix.

6. Tests and documentation are missing.
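
To make the "single prefix check" from problem 2 concrete, here is a minimal
sketch of how such a test on a path could look. It is not taken from the
patches: the function name `path_in_sparse_prefix` and the exact matching
rules (optional leading slash as in `--sparse-prefix=/fs/btrfs`, trailing
slashes ignored, matches only on directory boundaries) are assumptions made
for illustration. Since all trees are still sent (problem 1), a check like
this would presumably only decide which blobs are left out.

```c
#include <string.h>

/*
 * Hypothetical helper, not part of this series.
 *
 * `prefix` is written as on the command line, e.g. "/fs/btrfs".
 * `path` is repository-relative without a leading slash,
 * e.g. "fs/btrfs/inode.c".
 *
 * Returns 1 if `path` is the prefix itself or lies below it, else 0.
 */
int path_in_sparse_prefix(const char *prefix, const char *path)
{
	size_t len;

	/* Accept the leading slash used in the examples above. */
	if (*prefix == '/')
		prefix++;

	/* Ignore trailing slashes so "/fs/btrfs/" equals "/fs/btrfs". */
	len = strlen(prefix);
	while (len > 0 && prefix[len - 1] == '/')
		len--;

	/* An empty prefix selects the whole repository. */
	if (len == 0)
		return 1;

	if (strncmp(path, prefix, len) != 0)
		return 0;

	/* Match only on a directory boundary: "fs/btrfs-progs" is out. */
	return path[len] == '\0' || path[len] == '/';
}
```

With these assumed rules, "fs/btrfs/inode.c" is selected by "/fs/btrfs",
while "fs/ext4/inode.c" and "fs/btrfs-progs/x" are not. A test this cheap has
to run for every path in history on the server, which is the cost concern
raised in problem 2 for more complex patterns.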

[1]: https://git.archlinux.org/svntogit/packages.git/
[2]: https://github.com/rust-lang/crates.io-index
[3]: https://github.com/DefinitelyTyped/DefinitelyTyped
[4]: https://stackoverflow.com/questions/600079/is-there-any-way-to-clone-a-git-repositorys-sub-directory-only
[5]: https://stackoverflow.com/questions/11834386/cloning-only-a-subdirectory-with-git
[6]: https://askubuntu.com/questions/460885/how-to-clone-git-repository-only-some-directories
[7]: https://coderwall.com/p/o2fasg/how-to-download-a-project-subdirectory-from-github

Robin Ruede (7):
  list-objects: add sparse-prefix option to rev_info
  pack-objects: add sparse-prefix
  Skip checking integrity of files ignored by sparse
  fetch-pack: add sparse prefix to smart protocol
  fetch: add sparse-prefix option
  clone: add sparse-prefix option
  remote-curl: add sparse prefix

 builtin/clone.c        | 27 ++++++++++++++++++++++++---
 builtin/fetch-pack.c   |  6 ++++++
 builtin/fetch.c        | 19 ++++++++++++++-----
 builtin/pack-objects.c | 11 +++++++++++
 cache-tree.c           |  3 ++-
 connected.c            |  7 ++++++-
 fetch-pack.c           |  4 ++++
 fetch-pack.h           |  1 +
 list-objects.c         |  4 +++-
 remote-curl.c          | 17 ++++++++++++++++-
 revision.c             |  4 ++++
 revision.h             |  1 +
 transport.c            |  4 ++++
 transport.h            |  4 ++++
 upload-pack.c          | 15 ++++++++++++++-
 15 files changed, 114 insertions(+), 13 deletions(-)

-- 
2.9.1.283.g3ca5b4c.dirty


Thread overview: 12+ messages
2016-07-28 16:02 Robin Ruede [this message]
2016-07-28 16:02 ` [PATCH/RFC 1/7] list-objects: add sparse-prefix option to rev_info Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 2/7] pack-objects: add sparse-prefix Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 3/7] Skip checking integrity of files ignored by sparse Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 4/7] fetch-pack: add sparse prefix to smart protocol Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 5/7] fetch: add sparse-prefix option Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 6/7] clone: " Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 7/7] remote-curl: add sparse prefix Robin Ruede
2016-07-28 16:59 ` [PATCH/RFC 0/7] Add possibility to clone specific subdirectories Duy Nguyen
2016-07-28 17:03   ` Duy Nguyen
2016-07-28 20:33   ` Junio C Hamano
2016-07-29 15:51     ` Duy Nguyen
