From: Duy Nguyen <pclouds@gmail.com>
To: Robin Ruede <r.ruede@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories
Date: Thu, 28 Jul 2016 18:59:16 +0200
Message-ID: <CACsJy8AW3Z+C81F6w7WiZXXvcLRv9PB4=Wjbze6eE_MPgikU8A@mail.gmail.com>
In-Reply-To: <20160728160226.24018-1-r.ruede@gmail.com>

On Thu, Jul 28, 2016 at 6:02 PM, Robin Ruede <r.ruede@gmail.com> wrote:
> This patch series adds a `--sparse-prefix=` option to multiple commands,
> allowing fetching repository contents from only a subdirectory of a remote.
>
> This works along with sparse-checkout, and is especially useful for repositories
> where a subdirectory has meaning when standing alone.

Ah.. this is what I call narrow checkout [1] (but gmane is down at the moment)

[1] http://thread.gmane.org/gmane.comp.version-control.git/155427

> * Motivation (example use cases)
>
> ...

nods nods.. all good stuff

> * Open problems:
>
> 1. Currently all trees are still included. It would be possible to
> include only the trees relevant to the sparse files, which would significantly
> reduce the pack sizes for repositories containing a lot of small files changing
> often. For example package managers using git. Not sure in how many places all
> trees are presumed present.

You can limit some trees by passing a pathspec to "git rev-list" (in
your "list-objects" patch). All trees completely outside sub/dir will
be excluded. Trees leading to it (e.g. the root tree and "sub") are
still included. Not having all trees opens up a new set of problems.
This is what I did in narrow clone: pass some directories (as
pathspec) to rev-list on the server side, then deal with the lack of
trees on the client side.
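
To make the tree selection described above concrete, here is a minimal
Python sketch. It models only the path rule (a tree is kept if it leads
to the prefix or lies inside it) and ignores everything else rev-list
does, such as history simplification. The function name tree_is_sent and
the sample paths are made up for illustration; none of this is git's
actual code.

```python
def tree_is_sent(tree_path: str, prefix: str) -> bool:
    """Return True if a tree at tree_path survives limiting by prefix.

    Paths are '/'-separated and relative to the root; '' is the root tree.
    This is a hypothetical model of the path rule only, not git's code.
    """
    tree_parts = [p for p in tree_path.split("/") if p]
    prefix_parts = [p for p in prefix.split("/") if p]
    shared = min(len(tree_parts), len(prefix_parts))
    # Kept when one path is a leading part of the other: the tree either
    # leads to the prefix (root, "sub") or lies inside it ("sub/dir/inner").
    return tree_parts[:shared] == prefix_parts[:shared]


# Illustrative tree paths (assumed, not taken from any real repository).
trees = ["", "sub", "sub/dir", "sub/dir/inner", "sub/other", "docs"]
kept = [t for t in trees if tree_is_sent(t, "sub/dir")]
print(kept)
```

Comparing whole path components rather than raw strings keeps a sibling
such as "subdir" from being mistaken for a tree leading to "sub/dir".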

> 2. This patch set implements it as a simple single prefix check command line
> option.
> Using the exclude_list format (same as in sparse-checkout) might be useful.
> The server needs to check these patterns for all files in history, so I'm not
> sure if allowing multiple/complex patterns is a good idea.

I would go with something other than sparse-checkout, which I call
narrow checkout: instead of flattening the entire tree in the index
and keeping only files there, we keep trees that we don't have as
trees. Those tree entries share some attributes with "sparse checkout"
entries (e.g. ignore the worktree) and some with submodules (e.g.
don't bother checking the associated hash). This approach [2]
eliminates changes in cache-tree.c (i.e. 3/7).

And you would need something like that, when you don't have all the
trees (from open problem 1), because you just can't flatten trees when
you don't have them.

[2] https://github.com/pclouds/git/commits/lanh/narrow-checkout (I
think core functionality is in place, but narrow operation still needs
more work)
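
A rough Python sketch of the index layout described above, under stated
assumptions: trees are modelled as nested dicts, only trees leading to
or inside the prefix are expanded, and every other tree stays as a
single unexpanded entry. The names narrow_index and repo, and the choice
to list files that sit in trees leading to the prefix, are assumptions
made for illustration, not what the lanh/narrow-checkout branch does.

```python
from typing import Dict, List, Tuple, Union

# A tree is modelled as a dict: name -> blob id (str) or nested tree (dict).
Tree = Dict[str, Union[str, "Tree"]]


def narrow_index(tree: Tree, prefix: str, base: str = "") -> List[Tuple[str, str]]:
    """Build (kind, path) index entries, expanding only trees we would have."""
    entries: List[Tuple[str, str]] = []
    for name in sorted(tree):
        path = f"{base}{name}"
        node = tree[name]
        if isinstance(node, dict):
            leads_to_prefix = prefix == path or prefix.startswith(path + "/")
            inside_prefix = path.startswith(prefix + "/")
            if leads_to_prefix or inside_prefix:
                # We have this tree, so flatten it as a normal index would.
                entries.extend(narrow_index(node, prefix, path + "/"))
            else:
                # We lack this tree's contents: keep it as one tree entry.
                entries.append(("tree", path))
        else:
            entries.append(("blob", path))
    return entries


# Hypothetical repository layout.
repo: Tree = {
    "README": "b1",
    "docs": {"a.txt": "b2"},
    "sub": {"dir": {"f.c": "b3"}, "other": {"g.c": "b4"}},
}
index = narrow_index(repo, "sub/dir")
print(index)
```

The point of the sketch is that "docs" and "sub/other" never need to be
read: they show up as single entries, so nothing has to be flattened
when the trees behind them are missing.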

> 3. This patch set assumes the sparse-prefix and sparse-checkout does not change.
> running clone and fetch both need to have the --sparse-prefix= option, otherwise
> complete packs will be fetched. Not sure what the best way to store the
> information is, possibly create a new file `.git/sparse` similar to
> `.git/shallow` containing the path(s).

Something like .git/shallow, yes. It's similar in nature anyway
(shallow cuts depth, you cut the side)

> 3. Bitmap indices cannot be used, because they do not contain the paths of the
> objects. So for creating packs, the whole DAG has to be walked.

And shallow clones have this same problem. Something to be sorted out :)

> 4. Fsck complains about missing blobs. Should be fairly easy to fix.

Not really. You'll have to associate path information with blobs
before you can decide whether a blob should exist or not. Sparse
patterns are just not designed for that (tree walking). If you narrow
(heh) down to just a path prefix, not full-blown sparse patterns, then
it's feasible to walk the tree and filter. A subset of pathspec would
be good because we can already filter by pathspec, but I would not go
full pathspec at the first step.
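
As an illustration of the walk-and-filter idea above, here is a minimal
Python sketch of a prefix-aware check: it walks a tree, pairs each blob
with its path, and reports a blob as missing only when its path falls
under the prefix. It assumes all trees are present (as in the current
patch set) and models trees as nested dicts; missing_blobs, under_prefix
and the sample data are hypothetical names, not fsck's real code.

```python
from typing import Dict, List, Set, Union

# A tree is modelled as a dict: name -> blob id (str) or nested tree (dict).
Tree = Dict[str, Union[str, "Tree"]]


def under_prefix(path: str, prefix: str) -> bool:
    """True if path equals prefix or lies below it, by whole components."""
    return path == prefix or path.startswith(prefix + "/")


def missing_blobs(tree: Tree, prefix: str, present: Set[str], base: str = "") -> List[str]:
    """Return paths of blobs that should exist under prefix but are absent."""
    missing: List[str] = []
    for name in sorted(tree):
        path = base + name
        node = tree[name]
        if isinstance(node, dict):
            # Descend only into trees that lead to, or lie inside, the prefix.
            if under_prefix(path, prefix) or under_prefix(prefix, path):
                missing.extend(missing_blobs(node, prefix, present, path + "/"))
        elif under_prefix(path, prefix) and node not in present:
            # Only blobs under the prefix are expected to be in the object store.
            missing.append(path)
    return missing


# Hypothetical repository and object store contents.
repo: Tree = {
    "README": "b1",
    "docs": {"a.txt": "b2"},
    "sub": {"dir": {"f.c": "b3", "h.c": "b5"}, "other": {"g.c": "b4"}},
}
present = {"b3"}
report = missing_blobs(repo, "sub/dir", present)
print(report)
```

Blobs outside the prefix (README, docs/a.txt, sub/other/g.c) are absent
here too, but the walk knows their paths and so does not complain about
them; only sub/dir/h.c is flagged.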

> 5. Tests and documentation is missing.

Personally I would go with my narrow clone approach, but the ability
to selectively exclude some large blobs is still good, I think.
However, another approach to excluding some blobs is the external
object database [3]. It gives you what you need with a lot less code
impact (but you will not be able to work offline 100% of the time, as
you can now with git).

[3] https://public-inbox.org/git/20160613085546.11784-1-chriscool%40tuxfamily.org/
-- 
Duy

Thread overview: 12+ messages
2016-07-28 16:02 [PATCH/RFC 0/7] Add possibility to clone specific subdirectories Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 1/7] list-objects: add sparse-prefix option to rev_info Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 2/7] pack-objects: add sparse-prefix Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 3/7] Skip checking integrity of files ignored by sparse Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 4/7] fetch-pack: add sparse prefix to smart protocol Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 5/7] fetch: add sparse-prefix option Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 6/7] clone: " Robin Ruede
2016-07-28 16:02 ` [PATCH/RFC 7/7] remote-curl: add sparse prefix Robin Ruede
2016-07-28 16:59 ` Duy Nguyen [this message]
2016-07-28 17:03   ` [PATCH/RFC 0/7] Add possibility to clone specific subdirectories Duy Nguyen
2016-07-28 20:33   ` Junio C Hamano
2016-07-29 15:51     ` Duy Nguyen
