git@vger.kernel.org mailing list mirror (one of many)
* Removing Partial Clone / Filtered Clone on a repo
@ 2021-06-01 10:24 Tao Klerks
  2021-06-01 10:39 ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread
From: Tao Klerks @ 2021-06-01 10:24 UTC (permalink / raw)
  To: git

Hi folks,

I'm trying to deepen my understanding of the Partial Clone
functionality for a possible deployment at scale (with a large-ish
13GB project where we are using date-based shallow clones for the time
being), and one thing that I can't get my head around yet is how you
"unfilter" an existing filtered clone.

The gitlab intro document
(https://docs.gitlab.com/ee/topics/git/partial_clone.html#remove-partial-clone-filtering)
suggests that you need to get the full list of missing blobs, and pass
that into a fetch...:

git fetch origin $(git rev-list --objects --all --missing=print | grep
-oP '^\?\K\w+')

In my project's case, that would be millions of blob IDs! I tested
this with a path-based filter to rev-list, to see what getting 30,000
blobs might look like, and it took a looong while... I don't
understand much about the negotiation process, but I have to assume
there is a fixed per-blob cost in this scenario which is *much* higher
than in a "regular" fetch or clone.

Obviously one answer is to throw away the repo and start again with a
clean unfiltered clone... But between repo-local config, project
settings in IDEs / external tools, and unpushed local branches, this
is an awkward thing to ask people to do.

I initially thought it might be possible to add an extra remote
(without filter / promisor settings), mess with the negotiation
settings to make the new remote not know anything about what's local,
and then get a full set of refs and their blobs from that remote...
but I must have misunderstood how the negotiation-tip stuff works
because I can't get that to do anything (it always "sees" my existing
refs and I just get the new remote's refs "for free" without object
transfer).
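
(For reference, the kind of thing I tried looks roughly like this -
URL made up, using the "noop" negotiation algorithm to suppress "have"
lines:

  git remote add unfiltered https://git.example.com/big-repo.git
  git -c fetch.negotiationAlgorithm=noop fetch unfiltered

...but the fetch still concludes it already has everything it needs.)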

The official doc at https://git-scm.com/docs/partial-clone makes no
mention of plans or goals (or non-goals) related to this "unfiltering"
- is it something that we should expect a story to emerge around?

Thanks,
Tao Klerks

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 10:24 Removing Partial Clone / Filtered Clone on a repo Tao Klerks
@ 2021-06-01 10:39 ` Derrick Stolee
  2021-06-01 13:16   ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2021-06-01 10:39 UTC (permalink / raw)
  To: Tao Klerks, git

On 6/1/21 6:24 AM, Tao Klerks wrote:
> Hi folks,
> 
> I'm trying to deepen my understanding of the Partial Clone
> functionality for a possible deployment at scale (with a large-ish
> 13GB project where we are using date-based shallow clones for the time
> being), and one thing that I can't get my head around yet is how you
> "unfilter" an existing filtered clone.
> 
> The gitlab intro document
> (https://docs.gitlab.com/ee/topics/git/partial_clone.html#remove-partial-clone-filtering)
> suggests that you need to get the full list of missing blobs, and pass
> that into a fetch...:
> 
> git fetch origin $(git rev-list --objects --all --missing=print | grep
> -oP '^\?\K\w+')

I think the short answer is to split your "git rev-list" call
into batches by limiting the count. Perhaps pipe that command
to a file and then split it into batches of "reasonable" size.

Your definition of "reasonable" may vary, so try a few numbers.
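
For example, something along these lines (an untested sketch - adjust
the batch size to whatever your server tolerates):

  # collect the missing object ids once
  git rev-list --objects --all --missing=print \
    | grep -oP '^\?\K\w+' > missing-oids.txt
  # split into batches and fetch each one
  split -l 10000 missing-oids.txt oid-batch-
  for batch in oid-batch-*; do
    xargs git fetch origin < "$batch"
  done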

> The official doc at https://git-scm.com/docs/partial-clone makes no
> mention of plans or goals (or non-goals) related to this "unfiltering"
> - is it something that we should expect a story to emerge around?

The design is not intended for this kind of "unfiltering". The
feature is built for repositories where doing so would be too
expensive (both network time and disk space) to be valuable.

Also, asking for the objects one-by-one like this is very
inefficient on the server side. A fresh clone can make use of
existing delta compression in a way that this type of request
cannot (at least, not easily). You _would_ be better off making
a fresh clone and then adding its pack-file to the
.git/objects/pack directory of the repository you want to keep.
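
Roughly, and purely as an untested sketch (paths and URL illustrative):

  # fresh clone done separately, e.g. overnight
  git clone --bare <url> fresh.git
  # drop its pack(s) into the existing repository
  cp fresh.git/objects/pack/pack-*.pack \
     fresh.git/objects/pack/pack-*.idx \
     /path/to/existing/repo/.git/objects/pack/

At that point you would presumably also want to remove the
remote.<name>.partialclonefilter setting so that future fetches are
unfiltered.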

Could you describe more about your scenario and why you want to
get all objects?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 10:39 ` Derrick Stolee
@ 2021-06-01 13:16   ` Tao Klerks
  2021-06-01 13:40     ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread
From: Tao Klerks @ 2021-06-01 13:16 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@gmail.com> wrote:

> Could you describe more about your scenario and why you want to
> get all objects?

A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
you want to be able to get something useful to the user as fast as
possible (read: in less than the 4 hours it would take to download the
whole thing over a mediocre VPN, with corresponding risk of errors
partway), but where a user might later (eg overnight) want to get the
rest of the repo, to avoid history inconsistency issues.

In our current mode of operation (Shallow clones to 15 months' depth
by default), the initial clone can complete in well under an hour, but
the problem with the resulting clone is that normal git tooling will
see the shallow grafted commit as the "initial commit" of all older
files, and that causes no end of confusion on the part of users, eg on
"git blame". This is the main reason why we would like to consider
moving to full-history but filtered-blob clones.
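
(Concretely, the current mode is an initial clone along the lines of
"git clone --shallow-since=<date> <url>", with the date set roughly 15
months back.)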

(there are other reasons around manageability, eg the git server's
behavior around --shallow-since when some branches in refspec scope
are older than that date; it sends them with all their history,
effectively downloading the whole repo; similarly if a refspec is
expanded and the next fetch is run without explicit --shallow-since,
and finds new branches not already shallow-grafted, it will download
those in their entirety because the shallow-since date is not
persisted beyond the shallow grafts themselves).

With a (full-history all-trees no-blobs-except-HEAD) filtered clone,
the initial download can be quite a bit smaller than the shallow clone
scenario above (eg 1.5GB vs 2.2GB), and most of the disadvantages of
shallow clones are addressed: the just-in-time fetching can typically
work quite naturally, there are no "lies" in the history, nor are
there scenarios where you suddenly fetch an extra 10GB of history
without wanting/expecting to.
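
(Concretely, that means an initial clone along the lines of
"git clone --filter=blob:none <url>", or - as discussed below -
"git clone --filter=blob:limit=200k <url>" to keep the smaller blobs.)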

With the filtered clone there are still little edge-cases that might
motivate a user to "bite the bullet" and unfilter their clone,
however: The most obvious one I've found so far is "git blame" - it
loops fetch requests serially until it bottoms out, which on an older
poorly-factored file (hundreds or thousands of commits, each touching
different bits of a file) will effectively never complete, at
10s/fetch. And depending on the UI tooling the user is using, they may
have almost no visibility into why this "git blame" (or "annotate", or
whatever the given UI calls it) seems to hang forever.

You can work around this "git blame" issue for *most* situations, in
the case of our repo, by using a different initial filter spec, eg
"--filter=blob:limit=200k", which only costs you an extra 1GB or so...
But then you still have outliers - and in fact, the most "blameable"
files will tend to be the larger ones... :)

My working theory is that we should explain all the following to users:
* Your initial download is a nice compromise between functionality and
download delay
* You have almost all the useful history, and you have it within less
than an hour
* If you try to use "git blame" (or some other as-yet-undiscovered
scenarios) on a larger file, it may hang. In that case cancel, run a
magic command we provide which fetches all the blobs in that specific
file's history, and try again. (the magic command is a path-filtered
rev-list looking for missing objects, passed into fetch)
* If you ever get tired of the rare weird hangs, you have the option
of running *some process* that "unfilters" the repo, paying down that
initial compromise (and taking up a bit more HD space), eg overnight
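
The "magic command" in the third point would be something like the
following rough sketch, taking a path as its argument:

  # fetch all currently-missing blobs in the history of one path
  git rev-list --objects --all --missing=print -- "$path" \
    | grep -oP '^\?\K\w+' \
    | xargs -r git fetch origin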

This explanation is a little awkward, but less awkward than the
previous "'git blame' lies to you - it blames completely the wrong
person for the bulk of the history for the bulk of the files;
unshallow overnight if this bothers you", which is the current story
with shallow clone.

Of course, if unfiltering a large repo is impractical (and if it will
remain so), then we will probably need to err on the side of
generosity in the original clone - eg 1M instead of 200k as the blob
filter, 3GB vs 2.5GB as the initial download - and remove the last
line of the explanation! If unfiltering, or refiltering, were
practical, then we would probably err on the side of
less-blobs-by-default to optimize first download.

Over time, as we refactor the project itself to reduce the incidence
of megafiles, I would expect to be able to drop the
standard/recommended blob-size-limit too.

Sorry about the wall-of-text, hopefully I've answered the question!

Thanks,
Tao

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 13:16   ` Tao Klerks
@ 2021-06-01 13:40     ` Derrick Stolee
  2021-06-01 16:54       ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2021-06-01 13:40 UTC (permalink / raw)
  To: Tao Klerks; +Cc: git

On 6/1/2021 9:16 AM, Tao Klerks wrote:
> On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@gmail.com> wrote:
> 
>> Could you describe more about your scenario and why you want to
>> get all objects?
> 
> A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
> you want to be able to get something useful to the user as fast as
> possible (read: in less than the 4 hours it would take to download the
> whole thing over a mediocre VPN, with corresponding risk of errors
> partway), but where a user might later (eg overnight) want to get the
> rest of the repo, to avoid history inconsistency issues.

As you describe below, the inconsistency is in terms of performance,
not correctness. I thought it was worth a clarification.

...
> With the filtered clone there are still little edge-cases that might
> motivate a user to "bite the bullet" and unfilter their clone,
> however: The most obvious one I've found so far is "git blame" - it
> loops fetch requests serially until it bottoms out, which on an older
> poorly-factored file (hundreds or thousands of commits, each touching
> different bits of a file) will effectively never complete, at
> 10s/fetch. And depending on the UI tooling the user is using, they may
> have almost no visibility into why this "git blame" (or "annotate", or
> whatever the given UI calls it) seems to hang forever.

I'm aware that the first 'git blame' on a file is a bit slow in the
partial clone case. It's been on my list for improvement whenever I
get the "spare" time to do it. However, if someone else wants to work
on it I will briefly outline the approach I was going to investigate:

  During the history walk for 'git blame', it might be helpful to
  collect a batch of blobs to download in a single round trip. This
  requires refactoring the search to walk the commit history and
  collect a list of (commit id, blob id) pairs as if we were doing
  a simplified history walk. We can then ask for the list of blob id's
  in a single request and then perform the line-by-line blaming logic
  on that list. [If we ever hit a point where we would do a rename
  check, pause the walk and request all blobs so far and flush the
  line-by-line diff before continuing.]

This basic idea is likely difficult to implement, but it would
dramatically improve the first 'git blame' in a blobless clone. A
similar approach could maybe be used by the line-log logic
(git log -L).

> You can work around this "git blame" issue for *most* situations, in
> the case of our repo, by using a different initial filter spec, eg
> "--filter=blob:limit=200k", which only costs you an extra 1GB or so...
> But then you still have outliers - and in fact, the most "blameable"
> files will tend to be the larger ones... :)

I'm interested in this claim that 'the most "blameable" files will
tend to be the larger ones.' I typically expect blame to be used on
human-readable text files, and my initial reaction is that larger
files are harder to use with 'git blame'.

However, your 200k limit isn't so large that we can't expect _some_
files to reach that size. Looking at the root of git.git I see a
few files above 100k and files like diff.c reaching very close to
200k (uncompressed). I tend to find that the files in git.git are
smaller than the typical large project.
 
> My working theory is that we should explain all the following to users:
> * Your initial download is a nice compromise between functionality and
> download delay
> * You have almost all the useful history, and you have it within less
> than an hour
> * If you try to use "git blame" (or some other as-yet-undiscovered
> scenarios) on a larger file, it may hang. In that case cancel, run a
> magic command we provide which fetches all the blobs in that specific
> file's history, and try again. (the magic command is a path-filtered
> rev-list looking for missing objects, passed into fetch)
> * If you ever get tired of the rare weird hangs, you have the option
> of running *some process* that "unfilters" the repo, paying down that
> initial compromise (and taking up a bit more HD space), eg overnight

Partial clone is all about tradeoffs: you get faster clones that
download missing objects as needed. The user behavior dictates how
many objects are needed, so the user has the capability to adjust
that need. The fewer objects needed locally, the faster the repo
will be.

Your concern about slow commands is noted, but also blindly
downloading every file in history will slow the repo due to the
full size of the objects on disk.

I think there is merit to your back-filling history idea. There
are likely benefits to the "download everything missing" concept,
but also it would be good to design such a feature to have other
custom knobs, such as:

* Get only "recent" history, perhaps with a "--since=<date>"
  kind of flag. This would walk commits only to a certain date,
  then find all missing blobs reachable from their root trees.

* Get only a "cone" of history. This could work especially well
  with sparse-checkout, but other pathspecs could be used to
  limit the walk. 
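
The first of these can roughly be approximated today by scripting on
top of rev-list - an untested sketch, with an illustrative date:

  git rev-list --objects --missing=print --since=2021-01-01 --all \
    | grep -oP '^\?\K\w+' \
    | xargs -r git fetch origin

A built-in, properly batched version would of course be far friendlier.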

...
> Of course, if unfiltering a large repo is impractical (and if it will
> remain so), then we will probably need to err on the side of
> generosity in the original clone - eg 1M instead of 200k as the blob
> filter, 3GB vs 2.5GB as the initial download - and remove the last
> line of the explanation! If unfiltering, or refiltering, were
> practical, then we would probably err on the side of
> less-blobs-by-default to optimize first download.

I'm glad that you have self-discovered a workaround to handle
these cases. If we had a refiltering feature, then you could even
start with a blobless clone to have an extremely fast initial
clone, followed by a background job that downloads the remaining
objects.

> Over time, as we refactor the project itself to reduce the incidence
> of megafiles, I would expect to be able to drop the
> standard/recommended blob-size-limit too.

My experience working with large repos and partial clone is
similar: the new pain points introduced by these features make
users aware of "repository smells" in their organization and
they tend to self-correct by refactoring the repository. This
is a never-ending process as repos grow, especially with many
contributors.

Thank you for sharing your experience!

-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 13:40     ` Derrick Stolee
@ 2021-06-01 16:54       ` Tao Klerks
  2021-06-02  5:04         ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread
From: Tao Klerks @ 2021-06-01 16:54 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On Tue, Jun 1, 2021 at 3:40 PM Derrick Stolee <stolee@gmail.com> wrote:
> > you want to be able to get something useful to the user as fast as
> > possible [...] but where a user might later (eg overnight) want to get the
> > rest of the repo, to avoid history inconsistency issues.
>
> As you describe below, the inconsistency is in terms of performance,
> not correctness. I thought it was worth a clarification.

Sorry I was not clear here - I did not mean formal correctness or
performance when referring to the incentive to get the rest of the
repo - I was referring to the fact that a medium-shallow clone (eg 15
months of a 20-year project) provides an inconsistent perspective on
the code history:
 * On the one hand, most of the time you have everything you need, and
when you bump up against *available* history limits from a file or
branch history view, it's reasonably clear that's what's happening (in
some UI tools this is more explicit than in others).
 * On the other hand, when you happen to look at something older, it
is easy for the history to seem to "lie", showing changes made in a
file by a person that really *didn't* make those changes. Their commit
just happened to be selected as the shallow graft, and so seems to
have "added" all the files in the project. This reasonably
intelligible when looking at file history, but extremely non-obvious
when looking at git blame (in a medium-shallow clone).

> I'm aware that the first 'git blame' on a file is a bit slow in the
> partial clone case.

Without wanting to harp on about it, it can easily be pathologically
slow, eg in my case a random well-trafficked file has 300 in-scope
commits, at 10 seconds per independent blob fetch - and so ends up
taking an hour to git blame (the first time for such a file, as you
noted).

> It's been on my list for improvement whenever I
> get the "spare" time to do it. However, if someone else wants to work
> on it I will briefly outline the approach I was going to investigate:

One reason I wasn't asking about / angling for this, particularly, is
that I expect there will be other tools doing their own versions of
this. I haven't tested "tig" on this, for example, but I suspect it
doesn't do a plain git blame, given what I've seen of its instantly
showing the file contents and "gradually" filling in the authorship
data. I for one rarely use plain git blame, and I don't know much
about the usage patterns of other users. Most of "my" users will be
using IntelliJ IDEA, which seems to have a surprisingly solid/scalable
git integration (but I have not tested this case there yet).

There are also other related reasons to go for a "get most of the relevant
blobs across history" approach, specifically around tooling: there are
lots of tools & integrations that use git libraries (or even homebrew
implementations) rather than the git binaries / IPC, and many of those
tend to lag *far* behind in support for things like shallow clone,
partial clone, mailmap, core.splitindex, replace refs, etc etc. My
current beef is with Sublime Merge, which is snappy as one could wish
for, really lovely to use within its scope, but doesn't have any idea
what a promisor is, and simply says "nah, no content here" when you
look at a missing blob. (for the moment)

> > the most "blameable"
> > files will tend to be the larger ones... :)
>
> I'm interested in this claim that 'the most "blameable" files will
> tend to be the larger ones.' I typically expect blame to be used on
> human-readable text files, and my initial reaction is that larger
> files are harder to use with 'git blame'.

Absolutely, I meant "the larger text/code files", not including other
stuff that tends to accumulate in the higher filesize brackets. I
meant that I, for one, in this project at least, often find myself
using git blame (or equivalent) to "spelunk" into who touched a
specific line, in cases where looking at the plain history is useless
because there have been many hundreds or thousands of changes - and in
my limited experience, files with that many reasons to change tend to
be large.

> Your concern about slow commands is noted, but also blindly
> downloading every file in history will slow the repo due to the
> full size of the objects on disk.

I have in the past claimed that "larger repo" (specifically, a deeper
clone that gets many larger blobs) is slower, but haven't actually
found any significant evidence to back my claim. Obviously something
like "git gc" will be slower, but is there anything in the practical
day-to-day that cares whether the commit depth is 10,000 commits or
200,000 commits for a given branch, or whether you only have the blobs
at the "tip" of the branch/project, or all the blobs in history?
(besides GC, specifically)

> it would be good to design such a feature to have other
> custom knobs, such as:
> * Get only "recent" history, perhaps with a "--since=<date>"
>   kind of flag. This would walk commits only to a certain date,
>   then find all missing blobs reachable from their root trees.

As long as you know at initial clone time that this is what you want,
combining shallow clone with partial (filtered) clone already enables this today
(shallow clone, set up filter, unshallow, and potentially remove
filter). You can even do more complicated things like unshallowing
with different increasingly-aggressive filters in multiple
steps/fetches over different time periods. The main challenge that I
perceive at the moment is that you're effectively locked into "one
shot". As soon as you've retrieved the commits with blobs missing,
"filling them in" at scale seems to be orders of magnitude more
expensive than an equivalent clone would have been.
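
To be concrete, the sequence I mean is roughly the following (config
key names from memory, filter and date values purely illustrative, and
depending on git version you may need extensions.partialClone as well):

  git clone --shallow-since=2020-03-01 <url> repo && cd repo
  git config remote.origin.promisor true
  git config remote.origin.partialclonefilter blob:limit=200k
  git fetch --unshallow
  # and later, optionally, drop the filter again:
  # git config --unset remote.origin.partialclonefilter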

> If we had a refiltering feature, then you could even
> start with a blobless clone to have an extremely fast initial
> clone, followed by a background job that downloads the remaining
> objects.

Yes please!


I think one thing that I'm not clearly understanding yet in this
conversation is whether the tax on explicit and specialized blob list
fetching could be made much lower. As far as I can tell, in a blobless
clone with full trees we have most of the data one could want, to
decide what blobs to request - paths, filetypes, and commit dates.
This leaves three pain points that I am aware of:
* Filesizes are not (afaik) available in a blobless clone. This sounds
like a pretty deep limitation, which I'll gloss over.
* Blob paths are available in trees, but not trivially exposed by git
rev-list - could a new "--missing" option value make sense? Or does it
make just as much sense to expect the caller/scripter to iterate
ls-tree outputs? (I assume doing so would be much slower, but have not
tested)
* Something about the "git fetch <remote> blob-hash ..." pattern seems
to scale very poorly - is that something that might see change in
future, or is it a fundamental issue?
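
Re the second point, the scripted ls-tree route I have in mind would
be something like the following untested sketch (it lists *all*
blob/path pairs, which would then need intersecting with the
--missing=print output, and it chokes on paths containing whitespace):

  git rev-list --all \
    | while read c; do git ls-tree -r "$c"; done \
    | awk '$2 == "blob" { print $3 "\t" $4 }' \
    | sort -u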


Thanks again for the detailed feedback!
Tao

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 16:54       ` Tao Klerks
@ 2021-06-02  5:04         ` Tao Klerks
  0 siblings, 0 replies; 6+ messages in thread
From: Tao Klerks @ 2021-06-02  5:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

I understand replying to myself is bad form, but I need to add a
correction/clarification to a statement I made below:

On Tue, Jun 1, 2021 at 6:54 PM Tao Klerks <tao@klerks.biz> wrote:
> > it would be good to design such a feature to have other
> > custom knobs, such as:
> > * Get only "recent" history, perhaps with a "--since=<date>"
> >   kind of flag. This would walk commits only to a certain date,
> >   then find all missing blobs reachable from their root trees.
>
> As long as you know at initial clone time that this is what you want,
> combining shallow clone with partial (filtered) clone already enables this today
> (shallow clone, set up filter, unshallow, and potentially remove
> filter). You can even do more complicated things like unshallowing
> with different increasingly-aggressive filters in multiple
> steps/fetches over different time periods. The main challenge that I
> perceive at the moment is that you're effectively locked into "one
> shot". As soon as you've retrieved the commits with blobs missing,
> "filling them in" at scale seems to be orders of magnitude more
> expensive than an equivalent clone would have been.

As I just noted in another thread, there seems to be one extra step
needed to pull this off: you need to add a *.promisor file for the
initial shallow clone's packfile, because otherwise (at least with the
2.31 client that I am using) later "git fetch" calls take forever
doing something with rev-list that I don't understand, presumably due
to the relationship between promisor packfiles and non-promisor
packfiles...
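
For anyone trying the same thing: as far as I can tell the marker is
just an empty file next to each pack, i.e. something like

  for idx in .git/objects/pack/pack-*.idx; do
    touch "${idx%.idx}.promisor"  # only the file's existence matters
  done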

^ permalink raw reply	[flat|nested] 6+ messages in thread

Thread overview: 6+ messages
2021-06-01 10:24 Removing Partial Clone / Filtered Clone on a repo Tao Klerks
2021-06-01 10:39 ` Derrick Stolee
2021-06-01 13:16   ` Tao Klerks
2021-06-01 13:40     ` Derrick Stolee
2021-06-01 16:54       ` Tao Klerks
2021-06-02  5:04         ` Tao Klerks
