git@vger.kernel.org mailing list mirror (one of many)
* [RFC] Extending git-replace
@ 2020-01-14  5:33 Kaushik Srenevasan
  2020-01-14  6:55 ` Elijah Newren
  2020-01-14 18:19 ` David Turner
  0 siblings, 2 replies; 9+ messages in thread
From: Kaushik Srenevasan @ 2020-01-14  5:33 UTC (permalink / raw)
  To: git

We’ve been trying to get rid of objects larger than a certain size
from one of our repositories that contains tens of thousands of
branches and hundreds of thousands of commits. While we’re able to
accomplish this using BFG [0], it results in ~90% of the repository's
history being rewritten. This presents the following problems:
1. There are various systems (Phabricator for one) that use the commit
hash as a key in various databases. Rewriting history will require
that we update all of these systems.
2. We’ll have to force everyone to reclone a copy of this repository.

I was looking through the git code base to see if there is a way
around it when I chanced upon `git-replace`. While the basic idea of
`git-replace` is what I am looking for, it doesn’t quite fit the bill
due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS`
environment variable, and `--no-replace-objects` being the default for
certain git commands, namely fsck, upload-pack, pack/unpack-objects,
prune, and index-pack. Because Git may still try to load a replaced object
when a command is run with `--no-replace-objects`, the original object can
never be removed from the ODB permanently. Not being able
to run prune and fsck on a repository where we’ve deleted the object
that’s been replaced with git-replace effectively rules this option
out for us.

A feature that allowed such permanent replacement (say a
`git-blacklist` or a `git-replace --blacklist`) might work as follows:
1. Blacklisted objects are stored as references under a new namespace
-- `refs/blacklist`.
2. The object loader unconditionally translates a blacklisted OID into
the OID it’s been replaced with.
3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly
always a part of fetch and push transactions.

This essentially turns the blacklist references namespace into an
additional piece of metadata that gets transmitted to a client when a
repository is cloned and is kept updated automatically.
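
To make this concrete, a hypothetical session might look like the
following (`git blacklist` and the `refs/blacklist` namespace are the
proposed additions and don't exist today; `git hash-object` does):

    # Write a small explanatory blob and blacklist the oversized one.
    $ note=$(echo "Removed: exceeded repository size policy" |
        git hash-object -w --stdin)
    $ git blacklist $BIG_BLOB_OID "$note"     # hypothetical command
    $ git show $BIG_BLOB_OID                  # loader now returns the note
    Removed: exceeded repository size policy
    $ git for-each-ref refs/blacklist         # the mapping is just a ref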

I’ve been playing around with a prototype I wrote and haven’t observed
any breakage yet. I’m writing to seek advice on this approach and to
understand if this is something (if not in its current form, some
version of it) that has a chance of making it into the product if we
were to implement it. Happy to write up a more detailed design and
share my prototype as a starting point for discussion.

                                           -- Kaushik

[0] https://rtyley.github.io/bfg-repo-cleaner/


* Re: [RFC] Extending git-replace
  2020-01-14  5:33 [RFC] Extending git-replace Kaushik Srenevasan
@ 2020-01-14  6:55 ` Elijah Newren
  2020-01-14 19:11   ` Jonathan Tan
  2020-01-16  3:30   ` Kaushik Srenevasan
  2020-01-14 18:19 ` David Turner
  1 sibling, 2 replies; 9+ messages in thread
From: Elijah Newren @ 2020-01-14  6:55 UTC (permalink / raw)
  To: Kaushik Srenevasan; +Cc: Git Mailing List, Jonathan Tan

Hi Kaushik,

On Mon, Jan 13, 2020 at 9:39 PM Kaushik Srenevasan <kaushik@twitter.com> wrote:
>
> We’ve been trying to get rid of objects larger than a certain size
> from one of our repositories that contains tens of thousands of
> branches and hundreds of thousands of commits. While we’re able to
> accomplish this using BFG [0], it results in ~90% of the repository's
> history being rewritten. This presents the following problems:
> 1. There are various systems (Phabricator for one) that use the commit
> hash as a key in various databases. Rewriting history will require
> that we update all of these systems.

Not necessarily...

> 2. We’ll have to force everyone to reclone a copy of this repository.

True.

> I was looking through the git code base to see if there is a way
> around it when I chanced upon `git-replace`. While the basic idea of
> `git-replace` is what I am looking for, it doesn’t quite fit the bill
> due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS`
> environment variable, and `--no-replace-objects` being the default for
> certain git commands, namely fsck, upload-pack, pack/unpack-objects,
> prune, and index-pack. Because Git may still try to load a replaced object
> when a command is run with `--no-replace-objects`, the original object
> can never be removed from the ODB permanently. Not being able
> to run prune and fsck on a repository where we’ve deleted the object
> that’s been replaced with git-replace effectively rules this option
> out for us.
>
> A feature that allowed such permanent replacement (say a
> `git-blacklist` or a `git-replace --blacklist`) might work as follows:
> 1. Blacklisted objects are stored as references under a new namespace
> -- `refs/blacklist`.
> 2. The object loader unconditionally translates a blacklisted OID into
> the OID it’s been replaced with.
> 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly
> always a part of fetch and push transactions.
>
> This essentially turns the blacklist references namespace into an
> additional piece of metadata that gets transmitted to a client when a
> repository is cloned and is kept updated automatically.
>
> I’ve been playing around with a prototype I wrote and haven’t observed
> any breakage yet. I’m writing to seek advice on this approach and to
> understand if this is something (if not in its current form, some
> version of it) that has a chance of making it into the product if we
> were to implement it. Happy to write up a more detailed design and
> share my prototype as a starting point for discussion.

I'll get back to this in a minute, but wanted to point out a couple
other ideas for consideration:

1) You can rewrite history, and then use replace references to map old
commit IDs to new commit IDs.  This allows anyone to continue using
old commit IDs (which aren't even part of the new repository anymore)
in git commands and git automatically uses and shows the new commit
IDs.  No problems with fsck or prune or fetch either.  Creating these
replace refs is fairly simple if your repository rewriting program
(e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old
IDs to new IDs, and if you are using git-filter-repo it even creates
the replace refs for you.  (The one downside is that you can't use
abbreviated refs to refer to replace refs, thus you can't use
abbreviated old commit IDs in this scheme.)
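
As a sketch of that flow (the 10M threshold is arbitrary; git-filter-repo
writes the old-to-new mapping into refs/replace/* by default):

    # Rewrite history, dropping every blob over 10M; filter-repo records
    # refs/replace/<old-commit-id> pointing at each rewritten commit.
    $ git filter-repo --strip-blobs-bigger-than 10M
    $ git log <old-commit-id>           # old IDs still resolve locally
    # Publish the mapping so others can fetch it:
    $ git push origin 'refs/replace/*:refs/replace/*'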

The downside is that various repository hosting tools ignore replace
refs.  Thus if you try to browse to a commit in the web UI of Gerrit
or GitHub using the old commit IDs, it'll just show you a commit not
found page.  Phabricator and GitLab may well be the same (haven't
tried yet).  However, teaching these tools to pay attention to replace
refs would make this simple mechanism for rewriting feel close to
seamless other than asking people to reclone.  It's possible that
teaching these web-based tools to do so might not be
too difficult, at least for the open source systems, though I admit I
haven't dug into it myself.

2) Some folks might be okay with a clone that won't pass fsck or
prune, at least in special circumstances.  We're actually doing that
on purpose to deal with one of our large repositories.  We don't
provide that to normal developers, but we do use "cheap, fake clones"
in our CI systems.  These slim clones have 99% of all objects, but
happen to be missing the really big ones, resulting in only needing
1/7 of the time to download.  (And no, don't try to point out shallow
clones to me.  I hate those things, they're an awful hack, *and* they
don't work for us.  It's nice getting all commit history, all trees,
and most blobs including all for at least the last two years while
still saving lots of space.)

[For the curious, I did make a simple script to create these "cheap,
fake clones" for repositories of interest.  See
https://github.com/newren/sequester-old-big-blobs.  But they are
definitely a hack with some sharp corners, with failing fsck and
prunes only being part of the story.]


3) Back to your idea...

What you're proposing actually sounds very similar to partial clones,
whose idea is to make it okay to download a subset of history.  The
primary problems with partial clones are (a) they are still under
development and are just experimental, (b) they are currently
implemented with a "promisor" mode, meaning that if a command tries to
run over any piece of missing data then the command pauses while the
objects are downloaded from the server.  I want an offline mode (even
if I'm online) where only explicit downloading from the server (clone,
fetch, etc.) occurs.
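
For reference, the current experimental interface looks like this (the
filter syntax is real; the URL is a placeholder):

    # Clone without any blobs of 1 MB or larger; a missing blob is
    # fetched from the promisor remote the first time something reads it.
    $ git clone --filter=blob:limit=1m https://example.com/big-repo.git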

Instead of inventing yet another partial-clone-like thing, it'd be
nice if your new mechanism could just be implemented in terms of
partial clones, extending them as you need.  I don't like the idea of
supporting multiple competing implementations of partial clones
within git.git, but if it's just some extensions of the existing
capability then it sounds great.  But you may want to talk with
Jonathan Tan if you want to go this route (cc'd), since he's the
partial clone expert.


* Re: [RFC] Extending git-replace
  2020-01-14  5:33 [RFC] Extending git-replace Kaushik Srenevasan
  2020-01-14  6:55 ` Elijah Newren
@ 2020-01-14 18:19 ` David Turner
  2020-01-14 19:03   ` Jonathan Tan
  1 sibling, 1 reply; 9+ messages in thread
From: David Turner @ 2020-01-14 18:19 UTC (permalink / raw)
  To: Kaushik Srenevasan, git

On Mon, 2020-01-13 at 21:33 -0800, Kaushik Srenevasan wrote:
> A feature that allowed such permanent replacement (say a
> `git-blacklist` or a `git-replace --blacklist`) might work as
> follows:
> 1. Blacklisted objects are stored as references under a new namespace
> -- `refs/blacklist`.
> 2. The object loader unconditionally translates a blacklisted OID
> into
> the OID it’s been replaced with.
> 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly
> always a part of fetch and push transactions.

There are definitely some security implications here. I assume that
there's a config on the client to trust the server's refs/blacklist/*,
and that the documentation for this explains that it allows your repo
to be messed with in quite dangerous ways.  And on the server, I would
expect that only privileged users could push to refs/blacklist/*.
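
The server side could be enforced with an ordinary update hook; a rough
sketch (stock git doesn't know who is pushing, so $PUSHER below is a
stand-in for whatever identity the hosting layer provides):

    #!/bin/sh
    # update hook: invoked once per ref as <refname> <oldrev> <newrev>.
    refname=$1
    case "$refname" in
    refs/blacklist/*)
        # Allow the update only for users listed in an admins file.
        grep -qxF "$PUSHER" "$GIT_DIR/blacklist-admins" || {
            echo "only admins may update $refname" >&2
            exit 1
        }
        ;;
    esac
    exit 0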

To Elijah's point that this is related to partial clones and promisors,
I think Kaushik's idea is subtly different in that it involves
replacements, while promisors try to offer a seamless experience.  I
wonder whether Kaushik actually needs the replacement functionality?  

That is, would it be sufficient if every replaced file were replaced
with the exact text "me caga en la leche" instead of a custom hand-
crafted replacement?  I guess it's a bit complicated because while
that's a reasonable blob, it's not a valid commit.  So maybe this
mechanism would be limited to blobs.  I thought about whether we could
use a different flavor of replacement for commits, but those generally have
to be custom because they each have different parents. 

And if that would be sufficient, could promisors be used for this?  I
don't know how those interact with fsck and the other commands that
you're worried about.  Basically, the idea would be to use most of the
existing promisor code, and then have a mode where instead of visiting
the promisor, we just always return "me caga en la leche" (and this
does not have its SHA checked, of course).

This could work together with some sort of refs/blacklist mechanism to
enable the server to choose which objects the client replaces.




* Re: [RFC] Extending git-replace
  2020-01-14 18:19 ` David Turner
@ 2020-01-14 19:03   ` Jonathan Tan
  2020-01-14 20:39     ` Elijah Newren
  0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Tan @ 2020-01-14 19:03 UTC (permalink / raw)
  To: novalis; +Cc: kaushik, git, Jonathan Tan

> That is, would it be sufficient if every replaced file were replaced
> with the exact text "me caga en la leche" instead of a custom hand-
> crafted replacement?  I guess it's a bit complicated because while
> that's a reasonable blob, it's not a valid commit.  So maybe this
> mechanism would be limited to blobs.  I thought about whether we could
> use a different flavor of replacement for commits, but those generally have
> to be custom because they each have different parents.

Since the original email just discussed blobs, I'll confine myself to
discussing blobs. (Commits are trickier, as you said.)

> And if that would be sufficient, could promisors be used for this?  I
> don't know how those interact with fsck and the other commands that
> you're worried about.  Basically, the idea would be to use most of the
> existing promisor code, and then have a mode where instead of visiting
> the promisor, we just always return "me caga en la leche" (and this
> does not have its SHA checked, of course).

Missing promisor objects do not prevent fsck from passing - this is part
of the original design (any packfiles we download from the specifically
designated promisor remote are marked as such, and any objects that the
objects in the packfile refer to are considered OK to be missing).

Currently, when a missing object is read, it is first fetched (there are
some more details that I can go over if you have any specific
questions). What you're suggesting here is to return a fake blob with
wrong hash - I haven't looked at all the callers of read-object
functions in detail, but I don't think all of them are ready for such a
behavioral change. Maybe it would be sufficient to just make this work
in a more limited scope (e.g. checkout only - and if we need different
replacement blobs for different object IDs, maybe we could have
something similar to the clean/smudge filters).

> This could work together with some sort of refs/blacklist mechanism to
> enable the server to choose which objects the client replaces.

In the original email, Kaushik mentioned objects larger than a certain
size - we already have support for that (--filter=blob:limit=1000000,
for example). Having said that, Git is already able to tolerate any
exclusion (of tree or blob) from the server - we already need this in
order to support changing of filters, for example.
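
For example, inside such a partial clone you can already enumerate the
promised-but-absent objects without triggering any fetches:

    # Missing objects are printed with a '?' prefix.
    $ git rev-list --objects --all --missing=print | grep '^?'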


* Re: [RFC] Extending git-replace
  2020-01-14  6:55 ` Elijah Newren
@ 2020-01-14 19:11   ` Jonathan Tan
  2020-01-16  3:30   ` Kaushik Srenevasan
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Tan @ 2020-01-14 19:11 UTC (permalink / raw)
  To: newren; +Cc: kaushik, git, jonathantanmy

> 2) Some folks might be okay with a clone that won't pass fsck or
> prune, at least in special circumstances.  We're actually doing that
> on purpose to deal with one of our large repositories.  We don't
> provide that to normal developers, but we do use "cheap, fake clones"
> in our CI systems.  These slim clones have 99% of all objects, but
> happen to be missing the really big ones, resulting in only needing
> 1/7 of the time to download.  (And no, don't try to point out shallow
> clones to me.  I hate those things, they're an awful hack, *and* they
> don't work for us.  It's nice getting all commit history, all trees,
> and most blobs including all for at least the last two years while
> still saving lots of space.)
> 
> [For the curious, I did make a simple script to create these "cheap,
> fake clones" for repositories of interest.  See
> https://github.com/newren/sequester-old-big-blobs.  But they are
> definitely a hack with some sharp corners, with failing fsck and
> prunes only being part of the story.]

If you want to reduce the sharpness of the corners, it might be possible
to designate the pack with missing blobs as a promisor pack (add a
.promisor file - which is just like the .keep file except
s/keep/promisor/) and a fake promisor remote. That will make fsck and
repack (GC) work.
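
Concretely, that might look something like this (the pack name is a
placeholder; extensions.* settings require repository format version 1):

    # Mark the pack that's missing the big blobs as a promisor pack:
    $ touch .git/objects/pack/pack-<hash>.promisor
    # Designate a promisor remote for the partial clone machinery:
    $ git config core.repositoryformatversion 1
    $ git config extensions.partialClone origin
    $ git config remote.origin.promisor true
    $ git fsck    # objects referenced from that pack may now be missing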

> 3) Back to your idea...
> 
> What you're proposing actually sounds very similar to partial clones,
> whose idea is to make it okay to download a subset of history.  The
> primary problems with partial clones are (a) they are still under
> development and are just experimental, (b) they are currently
> implemented with a "promisor" mode, meaning that if a command tries to
> run over any piece of missing data then the command pauses while the
> objects are downloaded from the server.  I want an offline mode (even
> if I'm online) where only explicit downloading from the server (clone,
> fetch, etc.) occurs.

David Turner had an idea of what could be done (instead of fetching) in
such an offline mode [1], so I replied there.

[1] https://lore.kernel.org/git/d4361b6d34513a3fdefa564d10269f60d4732208.camel@novalis.org/

> Instead of inventing yet another partial-clone-like thing, it'd be
> nice if your new mechanism could just be implemented in terms of
> partial clones, extending them as you need.  I don't like the idea of
> supporting multiple competing implementations of partial clones
> within git.git, but if it's just some extensions of the existing
> capability then it sounds great.  But you may want to talk with
> Jonathan Tan if you want to go this route (cc'd), since he's the
> partial clone expert.

Ah, thanks for your kind words.


* Re: [RFC] Extending git-replace
  2020-01-14 19:03   ` Jonathan Tan
@ 2020-01-14 20:39     ` Elijah Newren
  2020-01-14 21:57       ` Jonathan Tan
  0 siblings, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2020-01-14 20:39 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List

On Tue, Jan 14, 2020 at 11:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > That is, would it be sufficient if every replaced file were replaced
> > with the exact text "me caga en la leche" instead of a custom hand-
> > crafted replacement?  I guess it's a bit complicated because while
> > that's a reasonable blob, it's not a valid commit.  So maybe this
> > mechanism would be limited to blobs.  I thought about whether we could
> > use a different flavor of replacement for commits, but those generally have
> > to be custom because they each have different parents.
>
> Since the original email just discussed blobs, I'll confine myself to
> discussing blobs. (Commits are trickier, as you said.)
>
> > And if that would be sufficient, could promisors be used for this?  I
> > don't know how those interact with fsck and the other commands that
> > you're worried about.  Basically, the idea would be to use most of the
> > existing promisor code, and then have a mode where instead of visiting
> > the promisor, we just always return "me caga en la leche" (and this
> > does not have its SHA checked, of course).

Maybe; it doesn't necessarily need to be the same object returned, and
these replacements could be user-specified via replace refs...

> Missing promisor objects do not prevent fsck from passing - this is part
> of the original design (any packfiles we download from the specifically
> designated promisor remote are marked as such, and any objects that the
> objects in the packfile refer to are considered OK to be missing).

Is there ever a risk that objects in the downloaded packfile come
across as deltas against other objects that are missing/excluded, or
does the partial clone machinery ensure that doesn't happen?  (Because
this was certainly the biggest pain-point with my "fake cheap clone"
hacks.)

> Currently, when a missing object is read, it is first fetched (there are
> some more details that I can go over if you have any specific
> questions). What you're suggesting here is to return a fake blob with
> wrong hash - I haven't looked at all the callers of read-object
> functions in detail, but I don't think all of them are ready for such a
> behavioral change.

git-replace already took care of that for you and provides that
guarantee, modulo the --no-replace-objects & fsck & prune & fetch &
whatnot cases that ignore replace objects as Kaushik mentioned.  I
took advantage of this to great effect with my "fake cheap clone"
hacks.  Based in part on your other email where you made a suggestion
about promisors, I'm starting to think a pretty good first cut
solution might look like the following:

  * user manually adds a bunch of replace refs to map the unwanted big
blobs to something else (e.g. a README about how the files were
stripped, or something similar to this)
  * a partial clone specification that says "exclude objects that are
referenced by replace refs"
  * add a fake promisor to the downloaded promisor pack so that if
anyone runs with --no-replace-objects or similar then they get an
error saying the specified objects don't exist and can't be
downloaded.

Anyone see any obvious problems with this?
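
For the first bullet, a minimal sketch (object IDs are placeholders;
blob-for-blob replacement is fine since both objects are the same type):

    # Create the explanatory blob and point the big blob's OID at it:
    $ note=$(echo "This blob was stripped from history." |
        git hash-object -w --stdin)
    $ git replace <big-blob-oid> "$note"
    $ git show <big-blob-oid>       # now prints the replacement text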

>  Maybe it would be sufficient to just make this work
> in a more limited scope (e.g. checkout only - and if we need different
> replacement blobs for different object IDs, maybe we could have
> something similar to the clean/smudge filters).

>
> > This could work together with some sort of refs/blacklist mechanism to
> > enable the server to choose which objects the client replaces.
>
> In the original email, Kaushik mentioned objects larger than a certain
> size - we already have support for that (--filter=blob:limit=1000000,
> for example). Having said that, Git is already able to tolerate any
> exclusion (of tree or blob) from the server - we already need this in
> order to support changing of filters, for example.


* Re: [RFC] Extending git-replace
  2020-01-14 20:39     ` Elijah Newren
@ 2020-01-14 21:57       ` Jonathan Tan
  2020-01-14 22:46         ` Elijah Newren
  0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Tan @ 2020-01-14 21:57 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, novalis, kaushik, git

> > Missing promisor objects do not prevent fsck from passing - this is part
> > of the original design (any packfiles we download from the specifically
> > designated promisor remote are marked as such, and any objects that the
> > objects in the packfile refer to are considered OK to be missing).
> 
> Is there ever a risk that objects in the downloaded packfile come
> across as deltas against other objects that are missing/excluded, or
> does the partial clone machinery ensure that doesn't happen?  (Because
> this was certainly the biggest pain-point with my "fake cheap clone"
> hacks.)

The server may send thin packs during a fetch or clone, but because the
client runs index-pack (which calculates the hash of every object
downloaded, necessitating having the full object, which in turn triggers
fetches of any delta bases), this should not happen.

But if you create the packfile in some other way and then manually set a
fake promisor remote (as I perhaps too naively suggested) then the
mechanism will attempt to fetch missing delta bases, which (I think) is
not what you want.

> > Currently, when a missing object is read, it is first fetched (there are
> > some more details that I can go over if you have any specific
> > questions). What you're suggesting here is to return a fake blob with
> > wrong hash - I haven't looked at all the callers of read-object
> > functions in detail, but I don't think all of them are ready for such a
> > behavioral change.
> 
> git-replace already took care of that for you and provides that
> guarantee, modulo the --no-replace-objects & fsck & prune & fetch &
> whatnot cases that ignore replace objects as Kaushik mentioned.  I
> took advantage of this to great effect with my "fake cheap clone"
> hacks.  Based in part on your other email where you made a suggestion
> about promisors, I'm starting to think a pretty good first cut
> solution might look like the following:
> 
>   * user manually adds a bunch of replace refs to map the unwanted big
> blobs to something else (e.g. a README about how the files were
> stripped, or something similar to this)
>   * a partial clone specification that says "exclude objects that are
> referenced by replace refs"
>   * add a fake promisor to the downloaded promisor pack so that if
> anyone runs with --no-replace-objects or similar then they get an
> error saying the specified objects don't exist and can't be
> downloaded.
> 
> Anyone see any obvious problems with this?

Looking at the list of commands given in the original email (fsck,
upload-pack, pack/unpack-objects, prune and index-pack), if we use a
filter by blob size (instead of the partial clone specification
suggested), this would satisfy the purposes of fsck and prune only.

If we had a partial clone specification that excludes objects referenced
by replace refs, then upload-pack from this partial repository (and
pack-objects) would work too.

But there might be non-obvious problems that I haven't thought of.


* Re: [RFC] Extending git-replace
  2020-01-14 21:57       ` Jonathan Tan
@ 2020-01-14 22:46         ` Elijah Newren
  0 siblings, 0 replies; 9+ messages in thread
From: Elijah Newren @ 2020-01-14 22:46 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: novalis, Kaushik Srenevasan, Git Mailing List

On Tue, Jan 14, 2020 at 1:57 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > > Missing promisor objects do not prevent fsck from passing - this is part
> > > of the original design (any packfiles we download from the specifically
> > > designated promisor remote are marked as such, and any objects that the
> > > objects in the packfile refer to are considered OK to be missing).
> >
> > Is there ever a risk that objects in the downloaded packfile come
> > across as deltas against other objects that are missing/excluded, or
> > does the partial clone machinery ensure that doesn't happen?  (Because
> > this was certainly the biggest pain-point with my "fake cheap clone"
> > hacks.)
>
> The server may send thin packs during a fetch or clone, but because the
> client runs index-pack (which calculates the hash of every object
> downloaded, necessitating having the full object, which in turn triggers
> fetches of any delta bases), this should not happen.

So if a user does a partial clone, filtering by blob size >= 1M, and
if they have several blobs of size just above and just below that
limit, then the partial clone will work but probably cause them to
still download several blobs above the limit size anyway?  (Which, if
I'm understanding correctly, happens because the blobs just smaller
than 1M likely will delta well against the blobs just larger than 1M.)

> But if you create the packfile in some other way and then manually set a
> fake promisor remote (as I perhaps too naively suggested) then the
> mechanism will attempt to fetch missing delta bases, which (I think) is
> not what you want.

Well, it's not optimal, but we're currently just dying with cryptic
errors whenever we have missing delta bases, and this happens whenever
we have an accidental fetch of older branches (although this does have
the nice side effect of notifying us of stray fetches in our CI
scripts).  Your promisor suggestion would at least permit gc's &
prunes if we use it in more places, so should be an improvement.  I
just wanted to verify whether this problem with delta bases would
remain.

> > > Currently, when a missing object is read, it is first fetched (there are
> > > some more details that I can go over if you have any specific
> > > questions). What you're suggesting here is to return a fake blob with
> > > wrong hash - I haven't looked at all the callers of read-object
> > > functions in detail, but I don't think all of them are ready for such a
> > > behavioral change.
> >
> > git-replace already took care of that for you and provides that
> > guarantee, modulo the --no-replace-objects & fsck & prune & fetch &
> > whatnot cases that ignore replace objects as Kaushik mentioned.  I
> > took advantage of this to great effect with my "fake cheap clone"
> > hacks.  Based in part on your other email where you made a suggestion
> > about promisors, I'm starting to think a pretty good first cut
> > solution might look like the following:
> >
> >   * user manually adds a bunch of replace refs to map the unwanted big
> > blobs to something else (e.g. a README about how the files were
> > stripped, or something similar to this)
> >   * a partial clone specification that says "exclude objects that are
> > referenced by replace refs"
> >   * add a fake promisor to the downloaded promisor pack so that if
> > anyone runs with --no-replace-objects or similar then they get an
> > error saying the specified objects don't exist and can't be
> > downloaded.
> >
> > Anyone see any obvious problems with this?
>
> Looking at the list of commands given in the original email (fsck,
> upload-pack, pack/unpack-objects, prune and index-pack), if we use a
> filter by blob size (instead of the partial clone specification
> suggested), this would satisfy the purposes of fsck and prune only.
>
> If we had a partial clone specification that excludes objects referenced
> by replace refs, then upload-pack from this partial repository (and
> pack-objects) would work too.
>
> But there might be non-obvious problems that I haven't thought of.

Cool, sounds like it's at least worth investigating.  Maybe Kaushik is
interested, or maybe I'll consider throwing it on my backlog and coming
back to it in a year or two.  :-)


* Re: [RFC] Extending git-replace
  2020-01-14  6:55 ` Elijah Newren
  2020-01-14 19:11   ` Jonathan Tan
@ 2020-01-16  3:30   ` Kaushik Srenevasan
  1 sibling, 0 replies; 9+ messages in thread
From: Kaushik Srenevasan @ 2020-01-16  3:30 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan

Hi Elijah,

On Mon, Jan 13, 2020 at 10:55 PM Elijah Newren <newren@gmail.com> wrote:
> 1) You can rewrite history, and then use replace references to map old
> commit IDs to new commit IDs.  This allows anyone to continue using
> old commit IDs (which aren't even part of the new repository anymore)
> in git commands and git automatically uses and shows the new commit
> IDs.  No problems with fsck or prune or fetch either.  Creating these
> replace refs is fairly simple if your repository rewriting program
> (e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old
> IDs to new IDs, and if you are using git-filter-repo it even creates
> the replace refs for you.  (The one downside is that you can't use
> abbreviated refs to refer to replace refs, thus you can't use
> abbreviated old commit IDs in this scheme.)
>

This is the path we're considering taking unless something easier
comes out of this (or other) proposal(s). We're working on determining
compatibility with tools. Thanks for the pointer to git-filter-repo.
It looks great!

> Instead of inventing yet another partial-clone-like thing, it'd be
> nice if your new mechanism could just be implemented in terms of
> partial clones, extending them as you need.  I don't like the idea of
> supporting multiple competing implementations of partial clones
> within git.git, but if it's just some extensions of the existing
> capability then it sounds great.  But you may want to talk with
> Jonathan Tan if you want to go this route (cc'd), since he's the
> partial clone expert.

I agree that it isn't worth inventing another partial-clone-like
feature. It sounds, however, like something based on partial clone will
not solve the problem on the server side, or perhaps I'm missing
something (as I've not had a chance to check out the implementation
yet). While I'm not at all insisting that `git-blacklist` be the way
to achieve it, we'd (Twitter) like to be able to permanently get rid
of the objects in question while retaining the ability to run GC and
FSCK on all copies of the repository, preferably without having to
rewrite history.

Even merely making `--no-replace-objects` be FALSE by default for GC
and FSCK (and printing a warning instead), while retaining existing
behavior when it is explicitly requested, would significantly improve
`git-replace`'s usability (for this purpose). The bits related to ref
transfer in my proposal are optional. Git users can either be required
to explicitly fetch the refs/replace namespace (as they do today),
or we could print a message (at the end of clone), letting the user
know that there are replacements available on the server. I'd only
proposed a new command because changing `git-replace` in this way would
break backward compatibility.
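
(Explicitly fetching the refs/replace namespace today is just

    $ git fetch origin '+refs/replace/*:refs/replace/*'

and the end-of-clone message could suggest exactly that.)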

                                -- Kaushik

