git@vger.kernel.org list mirror (unofficial, one of many)
* New command/tool: git filter-repo
@ 2019-01-31  8:57 Elijah Newren
  2019-01-31 19:09 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Elijah Newren @ 2019-01-31  8:57 UTC
  To: Git Mailing List

Hi everyone,

git-filter-repo[1], a filter-branch-like tool for rewriting repository
history, is ready for more widespread testing and feedback.  The rough
edges I previously mentioned have been fixed, and it has several useful
features already, though more development work is ongoing (docs are a
bit sparse right now, though -h provides some help).

Why filter-repo vs. filter-branch?

  * filter-branch is extremely slow, often unusably so (multiple orders of
    magnitude slower than it should be), for non-trivial repositories.

  * filter-branch made a number of usability choices that are okay for
    small repos, but these choices sometimes conflict as more options
    are combined, and the overall usability often causes difficulties
    for users trying to work with intermediate or larger repos.

  * filter-branch is missing some basic features.

The first two are intrinsic to filter-branch's design at this point
and cannot be backward-compatibly fixed.

Requirements:
  * Python 2 (for now?)
  * A version of git with en/fast-export-import topic (in master of git.git)
  * A version of git with the --combined-all-names option to diff-tree;
    I have submitted[2] this patch, but it hasn't been picked up yet.

What's the future?  (Core command of git.git?  place it in contrib?  keep it
in a separate repo?)  I'm hoping to discuss that at the contributor summit
today, but feedback on the list is also welcome.


Thanks,
Elijah

[1] https://github.com/newren/git-filter-repo; it's a ~2800 line
    single-file python script, depending only on the python standard
    library (and execution of git commands), all of which is designed
    to make build/installation trivial: you just need to add it to
    your $PATH.
[2] https://public-inbox.org/git/20190126221811.20241-1-newren@gmail.com/


* Re: New command/tool: git filter-repo
  2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
@ 2019-01-31 19:09 ` Junio C Hamano
  2019-01-31 20:43   ` Elijah Newren
  2019-01-31 20:47 ` Elijah Newren
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2019-01-31 19:09 UTC
  To: Elijah Newren; +Cc: Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.  The rough
> edges I previously mentioned have been fixed, and it has several useful
> features already, though more development work is ongoing (docs are a
> bit sparse right now, though -h provides some help).
>
> Why filter-repo vs. filter-branch?

How does it compare with bfg-repo-cleaner?  Somehow I was led to
believe that all serious users of filter-branch like functionality
are using bfg-repo-cleaner instead.


* Re: New command/tool: git filter-repo
  2019-01-31 19:09 ` Junio C Hamano
@ 2019-01-31 20:43   ` Elijah Newren
  2019-01-31 23:36     ` Roberto Tyley
  0 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2019-01-31 20:43 UTC
  To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback.  The rough
> > edges I previously mentioned have been fixed, and it has several useful
> > features already, though more development work is ongoing (docs are a
> > bit sparse right now, though -h provides some help).
> >
> > Why filter-repo vs. filter-branch?
>
> How does it compare with bfg-repo-cleaner?  Somehow I was led to
> believe that all serious users of filter-branch like functionality
> are using bfg-repo-cleaner instead.

No, bfg-repo-cleaner only covers an important subset of the usecases.
bfg-repo-cleaner does a really good job if your goal is to remove a
few big files and/or to remove some sensitive text (matched via
regexes) from all blobs.  It was designed for that specific role and
has more options in this area than filter-repo currently has.  But
even within this design space it was optimized for, it is missing two
things that I really want:

  * pruning of commits which become empty due to filtering
  * providing a way for the user to know what needs to be cleaned up.
It has options like --strip-blobs-bigger-than <size> or
--strip-biggest-blobs <NUM>, but no way for the user to figure out
what <size> or <NUM> should be.  Also, since it just focuses on really
big blobs, it misses cases like someone checking in directories with a
huge number of small-to-moderately sized files (e.g. bower_components/
or node_modules/, though these could also contain a few big blobs
too), or someone checking in a lot of moderately sized files of a
uniform extension (e.g. .webm, .tar.gz, .zip, .mp4, .avi).  I've seen
cases in the wild where the correct cleaning of history was more about
filtering out directories or extensions than a couple big files.
filter-repo's --analyze option creates some reports that help with
this tremendously.  Also, the options to delete files by glob/basename
overlook the fact that renames may have occurred.  Having a report
that mentions renames that have occurred in history (also part of
filter-repo's --analyze option) can be very helpful.
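
To illustrate the kind of report that helps here, below is a minimal
sketch in Python.  It is not filter-repo's --analyze code; the function
name summarize_sizes and the sample paths and sizes are made up for
illustration.  It only totals blob sizes by top-level directory and by
extension, so that a directory such as node_modules/ or an extension
such as .webm stands out even when no single blob is large.

```python
from collections import defaultdict
import posixpath


def summarize_sizes(blobs):
    """Sum blob sizes per top-level directory and per file extension.

    `blobs` is an iterable of (path, size_in_bytes) pairs, one per blob
    that ever existed in history (hypothetical input for this sketch).
    """
    by_dir = defaultdict(int)
    by_ext = defaultdict(int)
    for path, size in blobs:
        # Files at the top level are grouped under "(root)".
        top = path.split("/", 1)[0] + "/" if "/" in path else "(root)"
        # splitext only looks at the last suffix, so "x.tar.gz" counts as ".gz".
        ext = posixpath.splitext(path)[1] or "(none)"
        by_dir[top] += size
        by_ext[ext] += size
    return dict(by_dir), dict(by_ext)


# Made-up sample data: many small files in one directory, one big media file.
sample = [
    ("node_modules/a/index.js", 4000),
    ("node_modules/b/lib.js", 6000),
    ("media/intro.webm", 900000),
    ("src/main.py", 1200),
    ("README", 300),
]
dirs, exts = summarize_sizes(sample)
for name, total in sorted(dirs.items(), key=lambda item: -item[1]):
    print(f"{total:>8}  {name}")
```

A real report would also need to account for packed (compressed) sizes
and for renames, which this sketch ignores.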

Outside of this specific usecase, bfg-repo-cleaner is not very useful.
It simply lacks more general filtering capabilities:

  * While bfg-repo-cleaner has facilities to remove certain paths, it
has none to say you only want to keep certain paths.  Unlike
filter-branch where you can use a pipeline to list all files, grep to
remove the ones you want to keep from the list, then pipe the
remainder of paths to xargs git rm, bfg-repo-cleaner doesn't have a
facility for shell commands.  Instead in bfg-repo-cleaner you would
need to emulate this by exhaustively listing directories and
paths/globs of file basenames to delete, but that assumes the user
knows all paths that have ever existed, making this solution not only
onerous but error-prone.  More of the filterings I see these days are
about just keeping a directory (or perhaps a handful of them) rather
than just removing or cleaning a few files.  Also, this makes pruning
of commits which become empty much more important, but as noted above,
bfg-repo-cleaner lacks that ability.
  * It has no facilities for renaming paths.  You'd have to use a
different tool to do that, but then why not use the other tool to do
the whole job?  Even if you do decide to use both tools, some
capabilities of one tool can be neutered by such an approach (e.g.
bfg-repo-cleaner's carefully rewritten commit messages that tried to
ensure abbreviated commit shas referred to the new commit ids)
  * It has no facilities for affecting other parts of history, such as
changing author/committer/tagger names or emails, changing commit
timestamp or timezone, reparenting commits, splicing repository
histories together, filtering files differently based on commit
timestamp, etc. -- all of which can be done with filter-repo (though
some of those things require writing a small Python script; see basic
examples in t/lib-usage/*)
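
The difference between "delete what I list" and "keep only what I
list" can be sketched in a few lines of Python.  This is only an
illustration of the semantics, not code from either tool; filter_paths
and its parameters are hypothetical names.  The point is that a
keep-only filter never needs to know which other paths have existed.

```python
def filter_paths(paths, keep_prefixes, rename=None):
    """Return the paths that survive a keep-only filter.

    keep_prefixes: directory prefixes to keep, e.g. ["guides/", "tools/"].
    rename: optional (old_prefix, new_prefix) pair applied to kept paths.
    Both parameters are assumptions made for this sketch.
    """
    result = []
    for path in paths:
        if not any(path.startswith(prefix) for prefix in keep_prefixes):
            continue  # anything not explicitly kept is dropped
        if rename and path.startswith(rename[0]):
            path = rename[1] + path[len(rename[0]):]
        result.append(path)
    return result


# Made-up file list; only guides/ and tools/ are kept, tools/ is renamed.
files = ["guides/a.md", "tools/x.rb", "lib/y.rb", "Rakefile"]
print(filter_paths(files, ["guides/", "tools/"]))
print(filter_paths(files, ["guides/", "tools/"], rename=("tools/", "scripts/")))
```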

Personally, I also find it kind of annoying that bfg-repo-cleaner
doesn't automatically repack and shrink the repo when it is done and
instead prints multiple commands the user can run to achieve that,
even though it's the core use case for the tool.  Granted, they may
have had last-ditch recovery-of-the-original-repo in mind in case the
user ran in a repository they shouldn't have, but I much prefer to
have the tool just check if the repo looks like a fresh clone and bail
if not, so that users have a far easier recovery mechanism -- just
throw away the clone you were filtering and re-clone.  Once you do
that, auto repacking and shrinking is pretty natural.  (And you can
always provide a --force option to allow filtering & rewriting in a
repo that isn't a fresh clone.)


Elijah


* Re: New command/tool: git filter-repo
  2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
  2019-01-31 19:09 ` Junio C Hamano
@ 2019-01-31 20:47 ` Elijah Newren
  2019-02-08  1:25 ` Elijah Newren
  2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-01-31 20:47 UTC
  To: Git Mailing List

On Thu, Jan 31, 2019 at 9:57 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi everyone,
>
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.
...
> What's the future?  (Core command of git.git?  place it in contrib?  keep it
> in a separate repo?)  I'm hoping to discuss that at the contributor summit
> today, but feedback on the list is also welcome.

Turns out we didn't have enough time and didn't discuss it at the
contributor summit.  So, I'm even more interested in feedback from the
mailing list.

Thanks,
Elijah


* Re: New command/tool: git filter-repo
  2019-01-31 20:43   ` Elijah Newren
@ 2019-01-31 23:36     ` Roberto Tyley
  2019-02-01  7:38       ` Elijah Newren
  0 siblings, 1 reply; 13+ messages in thread
From: Roberto Tyley @ 2019-01-31 23:36 UTC
  To: Elijah Newren; +Cc: Junio C Hamano, Git Mailing List

On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@gmail.com> wrote:
> On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > history, is ready for more widespread testing and feedback.  The rough
> > > edges I previously mentioned have been fixed, and it has several useful
> > > features already, though more development work is ongoing (docs are a
> > > bit sparse right now, though -h provides some help).
> > >
> > > Why filter-repo vs. filter-branch?

I like the name! I think a lot of users are interested in filtering
their entire repo, rather than rewriting a single branch.

> > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > believe that all serious users of filter-branch like functionality
> > are using bfg-repo-cleaner instead.
>
> No, bfg-repo-cleaner only covers an important subset of the usecases.

That's true - the focus with BFG Repo-Cleaner is on removing unwanted
data - completely eradicating it from a repo's history. There are some
mistakes in history that repo owners just really *do not* want to
share (i.e. large files, private data/credentials), and they can be a
critical blocker to sharing or working with a Git repo. In terms of
rewriting history, my internal criterion for what features I really
want to be in the BFG is: is this unwanted data completely stopping
many users from sharing their code or doing their work?

I understand that when it comes to rewriting history, there are loads
of other operations that people sometimes want to perform, beyond
removing unwanted data - merging/splitting of history,
anonymization/renaming of committers, etc. Some of those might be nice
to add to the BFG - but as with many OSS-maintainers, I have limited
time, and a life to balance outside of software...!

> bfg-repo-cleaner does a really good job if your goal is to remove a
> few big files and/or to remove some sensitive text (matched via
> regexes) from all blobs.  It was designed for that specific role and
> has more options in this area than filter-repo currently has.  But
> even within this design space it was optimized for, it is missing two
> things that I really want:
>
>   * pruning of commits which become empty due to filtering

There certainly have been several users asking for this feature on the
BFG, and even a kindly contributed PR for the functionality which I've
yet to merge. As it doesn't actually stop users from doing work - so
far as I can see - it's something that I've done a poor job of
following up.

>   * providing a way for the user to know what needs to be cleaned up.
> It has options like --strip-blobs-bigger-than <size> or
> --strip-biggest-blobs <NUM>, but no way for the user to figure out
> what <size> or <NUM> should be.

For users of GitHub, it's normally 100MB with
--strip-blobs-bigger-than <size> :-)

> Also, since it just focuses on really
> big blobs, it misses cases like someone checking in directories with a
> huge number of small-to-moderately sized files (e.g. bower_components/
> or node_modules/, though these could also contain a few big blobs

For those use-cases, it might be that BFG's --delete-folders flag is
useful, especially given the protected-head-commit feature of the BFG.


It's getting late for me, must be even later in Brussels - I wish I
could have made it there to join in! Merry Git Merge to you all, and
good luck to you Elijah with git-filter-repo.

Roberto


* Re: New command/tool: git filter-repo
  2019-01-31 23:36     ` Roberto Tyley
@ 2019-02-01  7:38       ` Elijah Newren
  0 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-02-01  7:38 UTC
  To: Roberto Tyley; +Cc: Junio C Hamano, Git Mailing List

Hi Roberto,

First of all, thanks for the feedback, and for the awesome work on BFG
Repo-Cleaner!

On Fri, Feb 1, 2019 at 12:36 AM Roberto Tyley <roberto.tyley@gmail.com> wrote:
>
> On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@gmail.com> wrote:
> > On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > > history, is ready for more widespread testing and feedback.  The rough
> > > > edges I previously mentioned have been fixed, and it has several useful
> > > > features already, though more development work is ongoing (docs are a
> > > > bit sparse right now, though -h provides some help).
> > > >
> > > > Why filter-repo vs. filter-branch?
>
> I like the name! I think a lot of users are interested in filtering
> their entire repo, rather than rewriting a single branch.
>
> > > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > > believe that all serious users of filter-branch like functionality
> > > are using bfg-repo-cleaner instead.
> >
> > No, bfg-repo-cleaner only covers an important subset of the usecases.
>
> That's true - the focus with BFG Repo-Cleaner is on removing unwanted
> data - completely eradicating it from a repo's history. There are some
> mistakes in history that repo owners just really *do not* want to
> share (i.e. large files, private data/credentials), and they can be a
> critical blocker to sharing or working with a Git repo. In terms of
> rewriting history, my internal criterion for what features I really
> want to be in the BFG is: is this unwanted data completely stopping
> many users from sharing their code or doing their work?
>
> I understand that when it comes to rewriting history, there are loads
> of other operations that people sometimes want to perform, beyond
> removing unwanted data - merging/splitting of history,
> anonymization/renaming of committers, etc. Some of those might be nice
> to add to the BFG - but as with many OSS-maintainers, I have limited
> time, and a life to balance outside of software...!

Totally understand; you picked a certain set of usecases and provided
a really good tool for those ones.  I focused more on repository
migration (not between different version control systems, but some
other special flag-day event.  e.g. we have a whole bunch of plugins,
several of them are unmaintained, they are in lots of different code
hosting platforms; and someone decides we want to put all the widely
used plugins into a big monorepo for various reasons.)  A big
migration like that is also a good time to do cleanup, especially
when the history of a typical plugin should only be a few hundred
kilobytes but for some reason many are dozens or hundreds of
megabytes, so filter-repo provides some facilities for cleaning out
unwanted stuff as well.

> > bfg-repo-cleaner does a really good job if your goal is to remove a
> > few big files and/or to remove some sensitive text (matched via
> > regexes) from all blobs.  It was designed for that specific role and
> > has more options in this area than filter-repo currently has.  But
> > even within this design space it was optimized for, it is missing two
> > things that I really want:
> >
> >   * pruning of commits which become empty due to filtering
>
> There certainly have been several users asking for this feature on the
> BFG, and even a kindly contributed PR for the functionality which I've
> yet to merge. As it doesn't actually stop users from doing work - so
> far as I can see - it's something that I've done a poor job of
> following up.

Just as a heads up: It may be a lot uglier than it at first looks;
there's all kinds of special cases and I'm not quite sure I've got it
all right in filter-repo yet:

* Some projects intentionally create empty commits for versioning or
publishing or other reasons, and do not want these commits to be
removed.  Thus, it's important to remove commits that *become* empty
(due to other filtering rules), not commits which started empty.
* There's a special case for the above rule: if a user only wants the
history of a certain directory, then any empty commits that pre-dated
that directory aren't wanted/needed.  The way I handle this is that if
a commit which had no changes relative to its parent became an orphan
(its parent and all other ancestors were pruned due to becoming
empty), then the fact that it was orphaned implies that it *became*
empty and is thus prunable.
* There are also topological changes possible: If a parent and all
further ancestors on one side of history are pruned, a merge commit
may become a non-merge commit.  Often that will make the merge commit
itself prunable, though there is the possibility that the merge had
other changes tucked into it.
* Merges also may become degenerate:
  * Case 1: both sides of history may have commits pruned away (due to
becoming empty) all the way back to the merge base, meaning the merge
commit now has the merge base for both parents.  I think having
redundant parents is senseless and the redundant ones should be
pruned.  In most cases, this should also mean the merge commit is now
empty, as it'll likely have no changes relative to the merge
base.
  * Case 2: one side of history may have commits pruned away back to
the merge base, meaning that the merge commit now merges one parent
with an ancestor of that parent.  If the merge commit has no changes
relative to the newer parent, then it could potentially be pruned
away.  However: What if this merge commit *started* as a merge of some
commit with its own ancestor?  If the second parent is the newer
commit, this was likely a --no-ff merge and probably shouldn't be
pruned.  But, if a project has a strong policy of always doing --no-ff
merges to incorporate changes, then even if the merge commit didn't
start out looking like a --no-ff merge we should probably keep it even
if it ends up looking like one.  (But, of course, in either case if
the first parent is the newer one, then it's not a --no-ff merge and
unless the merge commit has extra file changes in it, we should be
able to prune it.)
* Also, if your scheme checks for changes against the first parent to
see what changes exist in a merge commit, if the first parent
history is pruned away due to filtering choices (e.g. it didn't touch
the directory of interest), then your methodology may miss the fact
that this merge commit has an empty set of changes relative to its
remaining parent and thus mistakenly retain some commits which became
empty and were expected to be pruned.  I believe I saw filter-branch
messing this up, though I'd have to double check.
* There might be other special cases I've overlooked, though I've
tried to cover them all.
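
A simplified sketch of the first few rules, in Python, may make them
more concrete.  This is not filter-repo's implementation; the
prune_history function and its (id, parents, changed_paths) commit
tuples are assumptions made for illustration.  It handles commits that
became empty, commits that started empty but were orphaned, and merges
whose parents collapse into one, and it deliberately ignores the
--no-ff and ancestor-of-parent subtleties described above.

```python
def prune_history(commits, keep):
    """Drop commits that became empty under a path filter.

    commits: list of (cid, parents, changed_paths) tuples, parents listed
             before children (hypothetical representation).
    keep:    predicate deciding whether a path survives filtering.
    Returns the surviving commits as (cid, new_parents, new_changes).
    """
    rewritten = {}  # original id -> surviving id, or None if nothing is left
    survivors = []
    for cid, parents, changed in commits:
        new_changes = [path for path in changed if keep(path)]
        # Map each parent to its nearest surviving ancestor, dropping
        # pruned-away and redundant (duplicate) parents.
        new_parents = []
        for parent in parents:
            mapped = rewritten[parent]
            if mapped is not None and mapped not in new_parents:
                new_parents.append(mapped)
        was_merge = len(parents) > 1
        started_empty = not changed and not was_merge
        orphaned = bool(parents) and not new_parents
        if new_changes or len(new_parents) > 1:
            prunable = False      # still has content, or is still a real merge
        elif started_empty:
            prunable = orphaned   # intentionally empty: prune only if orphaned
        else:
            prunable = True       # became empty because of filtering
        if prunable:
            rewritten[cid] = new_parents[0] if new_parents else None
        else:
            rewritten[cid] = cid
            survivors.append((cid, new_parents, new_changes))
    return survivors


# Made-up history, filtered to keep only src/.
history = [
    ("A", [], ["docs/readme"]),     # becomes empty
    ("B", ["A"], []),               # started empty, but ends up orphaned
    ("C", ["B"], ["src/main.c"]),   # survives as the new root
    ("D", ["C"], ["docs/x"]),       # becomes empty
    ("E", ["C"], ["docs/notes"]),   # becomes empty
    ("F", ["D", "E"], []),          # merge whose parents both collapse to C
    ("G", ["F"], []),               # intentionally empty commit, kept
]
for commit in prune_history(history, lambda path: path.startswith("src/")):
    print(commit)
```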

I'm currently thinking that perhaps a flag or pair of flags might be
useful here (perhaps --empty-pruning={always,never,auto}
--no-ff-pruning={always,never,auto}) with 'auto' perhaps being pruning
of commits which didn't start out that way (empty or a no-ff merge)
but became so.  I'm still undecided on this...
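
For concreteness, the three proposed modes could behave roughly as in
the Python sketch below.  The flag values are only the proposal from
the paragraph above, nothing here is a released option, and
should_prune_empty is a hypothetical helper name.

```python
def should_prune_empty(mode, is_empty_now, was_empty_originally):
    """Decide whether an empty commit is pruned under a proposed mode.

    mode is one of "always", "never" or "auto" (values of the proposed,
    still undecided --empty-pruning flag).
    """
    if mode not in ("always", "never", "auto"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not is_empty_now:
        return False                 # non-empty commits are never pruned
    if mode == "always":
        return True
    if mode == "never":
        return False
    return not was_empty_originally  # auto: prune only what *became* empty


# An intentionally empty commit is kept under "auto" but not under "always".
print(should_prune_empty("auto", True, True))
print(should_prune_empty("always", True, True))
```

The same shape would apply to the proposed --no-ff-pruning flag, with
"looks like a --no-ff merge" in place of "is empty".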

> >   * providing a way for the user to know what needs to be cleaned up.
> > It has options like --strip-blobs-bigger-than <size> or
> > --strip-biggest-blobs <NUM>, but no way for the user to figure out
> > what <size> or <NUM> should be.
>
> For users of GitHub, it's normally 100MB with
> --strip-blobs-bigger-than <size> :-)

If you're only interested in what GitHub won't allow for you to
continue working with your repo, sure.  If you are migrating your repo
for some reason and want to take the chance to clean it up at the same
time so everyone doesn't have to pay heavy clone costs, there are many
other interesting values.  :-)

> > Also, since it just focuses on really
> > big blobs, it misses cases like someone checking in directories with a
> > huge number of small-to-moderately sized files (e.g. bower_components/
> > or node_modules/, though these could also contain a few big blobs
>
> For those use-cases, it might be that BFG's --delete-folders flag is
> useful, especially given the protected-head-commit feature of the BFG.

Absolutely.  However, this somewhat assumes the user doing the
filtering knows what parts of history are extraneous.  I've seen many
cases where the original author(s) of some repo weren't that familiar
with version control and committed a whole bunch of things they
shouldn't.  They then left the company or project and a few years
later, someone else deletes those files in a new commit, possibly even
mentioning in the commit message that it was stuff that was never
needed.  Several more years go by, the second individual isn't with
the project any more either, and someone else comes along, possibly
not even familiar with the language or build tools the project uses
but was tasked with migrating the history anyway.  Having a tool which
can tell them why this small plugin happens to be much bigger than
expected with pointers to relevant bits ("what's this
bower_components/ directory that accounts for most of the size and was
deleted four years ago?") is very helpful.

I've also seen this with big repos where there are lots of ugly
travesties in all kinds of different places, and no one still around
knows about more than a couple of them.  The developers may be willing
to go through a flag day to have the history rewritten to expunge the
things that never should have been committed to get clone times down,
but only removing a few big blobs or expecting there to be an
individual who knows which directories can be nuked isn't always the
best answer.  Given enough time, anyone can run enough git commands to
kind of figure out what are the big things, but I wanted something
that made that job easier.


Granted, 'git filter-repo --analyze' is a separate read-only step
(other than the reports it writes), so even if people wanted to use
BFG repo cleaner or other filtering tools they could use this aspect
of filter-repo to guide decisions.

> It's getting late for me, must be even later in Brussels - I wish I
> could have made it there to join in! Merry Git Merge to you all, and
> good luck to you Elijah with git-filter-repo.

Thanks again for all the feedback; maybe we'll get to meet at a future
conference.  Best of luck to you as well.

Elijah


* Re: New command/tool: git filter-repo
  2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
  2019-01-31 19:09 ` Junio C Hamano
  2019-01-31 20:47 ` Elijah Newren
@ 2019-02-08  1:25 ` Elijah Newren
  2019-02-08 10:22   ` Johannes Schindelin
  2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2019-02-08  1:25 UTC
  To: Git Mailing List; +Cc: Lars Schneider

Hi,

On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.  The rough


Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
about performance numbers on known repositories for filter-repo and
how it compared to other tools; I gave extremely rough estimates, but
here I belatedly provide some more detailed figures.  In each case, I
report both filtering time, and cleanup (gc or clone) time[0]:


Testcase 1: Remove a single file (configure.ac) from each commit in git.git:

  * filter-branch[1a]:  2413.978s + 34.812s
  * BFG (8-core)[1b]:     38.743s + 30.333s
  * BFG (40-core)[1b]:    24.680s + 35.165s
  * filter-repo[1c]:      35.582s + 15.690s

  Caveats: filter-repo failed and needed workarounds; see [1d]

Testcase 2: Keep two directories (guides/ and tools/) from rails.git:

  * filter-branch[2a]: 14586.655s + 22.726s
  * BFG (8-core)[2b]:     27.675s + 15.786s
  * BFG (40-core)[2b]:    24.883s + 20.463s
  * filter-repo[2c]:      10.951s + 12.500s

  Caveats: filter-branch failed at the end of this operation; see [2d].
           AFAICT, BFG can't do this operation; used approximations instead[2e].

Testcase 3: Replacing one string with another throughout all files in linux.git:

  * filter-branch[3a]: Estimated at about 3.5 months (~8.9e6 seconds)
  * BFG (8-core)[3b]:   2144.904s + 693.79s
  * BFG (40-core)[3b]:  1178.577s + 636.887s
  * filter-repo[3c]:    1203.147s + 159.620s

  Caveats: filter-branch failed at ~12 hours; see [3d].


Other details about measurements at [4].  Take-aways and biased
opinions at [5].
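
To compare end-to-end cost, the filtering and cleanup times can simply
be added per tool.  The short Python sketch below does that for the
first two testcases, using the figures above (testcase 3 is left out
because its filter-branch number is only an estimate).  The dictionary
layout and function names are just for illustration.

```python
# (filtering seconds, cleanup seconds), copied from the tables above.
TIMINGS = {
    "git.git: remove configure.ac": {
        "filter-branch": (2413.978, 34.812),
        "BFG (8-core)": (38.743, 30.333),
        "BFG (40-core)": (24.680, 35.165),
        "filter-repo": (35.582, 15.690),
    },
    "rails.git: keep guides/ and tools/": {
        "filter-branch": (14586.655, 22.726),
        "BFG (8-core)": (27.675, 15.786),
        "BFG (40-core)": (24.883, 20.463),
        "filter-repo": (10.951, 12.500),
    },
}


def totals(case):
    """Total wall-clock seconds (filtering + cleanup) per tool."""
    return {tool: f + c for tool, (f, c) in TIMINGS[case].items()}


def speedup(case, baseline, tool):
    """How many times faster `tool` is than `baseline`, end to end."""
    t = totals(case)
    return t[baseline] / t[tool]


for case in TIMINGS:
    print(case)
    for tool, seconds in totals(case).items():
        print(f"  {tool:<14} {seconds:10.3f}s")
    ratio = speedup(case, "filter-branch", "filter-repo")
    print(f"  filter-repo vs. filter-branch: {ratio:.1f}x")
```

By these single-run totals, filter-repo comes out at roughly 48x
faster than filter-branch on the git.git case and roughly 620x on the
rails.git case, subject to the caveats noted for each testcase.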


Hope this was interesting,
Elijah



*************** Footnotes (Minutiae for the curious) ***************

[0] git-filter-branch's manpage suggests re-cloning to get rid of old objects,
    BFG as its last step provides the user commands to execute in order to
    clean out old objects, and filter-repo automatically runs such commands.
    As such, time of post-run gc seems like a relevant thing to report.
    Commands used and timed:

  * filter-branch: time git clone file://$(pwd) ../nuke-me-clone
  * BFG:           git reflog expire --expire=now --all && time git gc --prune=now
  * filter-repo:   N/A (internally runs same commands as I manually ran for BFG)


[1a] time git filter-branch \
       --index-filter 'git rm --quiet --cached --ignore-unmatch configure.ac' \
       --tag-name-filter cat --prune-empty -- --all

[1b] time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files configure.ac

[1c] git tag | grep v1.0rc | xargs git tag -d
     git tag -d junio-gpg-pub
     time git filter-repo --path configure.ac --invert-paths

[1d] git fast-export when run with certain flags will abort in repos
     with tags of blobs or tags of tags.  I had to first delete 7 tags
     to get this testcase to run, as shown in the commands above in
     [1c].  I'll probably patch fast-export to fix this.


[2a] time git filter-branch \
       --index-filter 'git ls-files -z | tr "\0" "\n" | grep -v -e ^guides/ -e ^tools/ | tr "\n" "\0" | xargs -0 git rm --quiet --cached --ignore-unmatch' \
       --tag-name-filter cat --prune-empty -- --all

[2b] git log --format=%n --name-only | sort | uniq | grep -v ^$ > all-files.txt
     time java -jar ~/Downloads/bfg-1.13.0.jar \
       --delete-folders "{$(grep / all-files.txt | sed -e 's/"//' -e s%/.*%% | uniq | grep -v -e guides -e tools | tr '\n' ,)}" \
       --delete-files "{$(comm -23 <(grep -v / all-files.txt) <(grep -e guides/ -e tools/ all-files.txt | sed -e s%.*/%% | sort) | tr '\n' ,)}"

[2c] time git filter-repo --path guides --path tools

[2d] filter-branch fails at the very end when noting which refs were
     deleted/rewritten with:
         error: cannot lock ref 'refs/tags/v0.10.0': is at b68b47672e613e94a7859c9549e9cd4b401f7b79 but expected e2724aa1856253f4fc48ddc251583042c5f06029
         Could not delete refs/tags/v0.10.0
     Turns out b68b47672e613e94a7859c9549e9cd4b401f7b79 is an
     annotated tag in the original repo pointing to the commit
     e2724aa1856253f4fc48ddc251583042c5f06029.  I do not know the
     cause of this bug, but since it was almost at the very end, I
     just reported the time used before it hit this error.

[2e] Unless I am misunderstanding, BFG is not capable of this
     filtering operation because it uses basenames for --delete-files
     and --delete-folders, and some names appear in several
     directories (e.g. .gitignore, Rakefile, tasks).  As such, with
     the BFG you either have to delete files/directories that
     shouldn't be, or leave files and folders around that you wanted
     to have deleted.  The command in [2b] has some of both, but
     should still give a good estimate of how long it would take BFG
     to do this kind of operation if file and directory basenames in
     the rails repository happened to be named uniquely.
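
     The basename problem can be seen with a small Python sketch.
     The paths below are invented for illustration (only the names
     .gitignore and Rakefile come from the observation above), and
     basename_collisions is a hypothetical helper, not part of any
     tool.  It lists basenames that occur both inside and outside the
     directories to keep, which are exactly the names a basename-only
     delete rule cannot handle correctly.

```python
import posixpath


def basename_collisions(paths, keep_prefixes):
    """Return basenames found both inside and outside the kept directories."""
    inside, outside = set(), set()
    for path in paths:
        name = posixpath.basename(path)
        if any(path.startswith(prefix) for prefix in keep_prefixes):
            inside.add(name)
        else:
            outside.add(name)
    # Deleting any of these by basename would also delete a wanted file;
    # not deleting them leaves an unwanted file behind.
    return sorted(inside & outside)


# Invented paths for illustration.
paths = [
    "guides/Rakefile",
    "guides/.gitignore",
    "Rakefile",
    "activerecord/.gitignore",
    "tools/profile",
    "railties/README",
]
print(basename_collisions(paths, ["guides/", "tools/"]))
```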

[3a] time git filter-branch -d /dev/shm/tmp \
       --tree-filter 'git ls-files | xargs sed -i s/secretly/covertly/' \
       --tag-name-filter cat -- --all

[3b] time java -jar ~/Downloads/bfg-1.13.0.jar \
       --replace-text <(echo 'secretly==>covertly')

[3c] time git filter-repo --replace-text <(echo 'secretly==>covertly')

[3d] filter-branch failed after 45704 seconds, predicting another
     8836429 seconds (~102 days) remaining at the time.  As commits
     earlier in history tend to be smaller, filter-branch nearly
     always underestimates the time required, sometimes considerably.
     filter-branch failed on commit
     af25e94d4dcfb9608846242fabdd4e6014e5c9f0 due to an empty ident.
     I possibly could have worked around it with --env-filter, but
     it's not like I'm going to wait for it to finish anyway.
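
     As a quick check of the arithmetic behind the estimate, here is a
     short Python sketch using only the two figures reported above;
     the variable and function names are just for illustration.

```python
ELAPSED_S = 45704       # seconds filter-branch ran before failing
REMAINING_S = 8836429   # seconds it predicted were still remaining

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def to_hours(seconds):
    return seconds / SECONDS_PER_HOUR


def to_days(seconds):
    return seconds / SECONDS_PER_DAY


total_s = ELAPSED_S + REMAINING_S
print(f"elapsed:   {to_hours(ELAPSED_S):.1f} hours")
print(f"remaining: {to_days(REMAINING_S):.1f} days")
print(f"total:     {total_s} s = {to_days(total_s):.1f} days")
```

     The total comes to about 8.9e6 seconds, or a little under 103
     days, which is where the "about 3.5 months" figure comes from,
     and as noted the prediction is probably an underestimate.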

[4] Other notes about timings:
  * All tests were run on an 8 cpu system, except for the "BFG
    40-core" tests which were run on a 40 core system.  (filter-branch
    and filter-repo are not multi-threaded and gain nothing from more
    cores.)
  * More precisely, I ran on AWS with an m4.2xlarge with two 50-GB GP2
    volumes (150 IOPS) for tests.  The 40-core system was an
    m4.10xlarge.
  * Before each command, to try to avoid warm disk caches helping or
    hurting depending on the order I ran commands in, I first ran:
    * rsync -az --delete ../$REPO-orig/ ./
    * git status
    * $TOOL -h
  * Testing was imperfect; I just ran once and recorded the time.  It took
    long enough to gather the data as it was.
  * When additional commands were needed for the filtering
    (e.g. getting the all-files.txt list to generate the BFG command,
    or deleting tags that fast-export couldn't handle for
    filter-repo), I did not include the times of those commands in the
    overall execution time.  It would have added a few hundredths of a
    second to filter-repo's git.git time, and about 5-6 seconds to BFG's
    rails.git time.
  * filter-repo self-reports time until filtering finishes and time
    until entirely done.  I took the difference between its self-report
    of overall time and the "time" command's report of overall time
    (which was typically on the order of 0.1s), and added that to
    filter-repo's filtering time, assuming that most of the discrepancy
    would be due to Python startup.
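
A minimal sketch of the two mechanical steps in the notes above: the commands run before each timed command, and the adjustment applied to filter-repo's self-reported filtering time. The function and parameter names are illustrative assumptions, and the numbers in the example call are made up rather than measured.

```python
import shlex


def warm_up_commands(repo, tool):
    """Return the three pre-run commands from the notes as argv lists.

    `repo` and `tool` stand in for $REPO and $TOOL.
    """
    return [
        ["rsync", "-az", "--delete", f"../{repo}-orig/", "./"],
        ["git", "status"],
        [*shlex.split(tool), "-h"],
    ]


def adjusted_filtering_time(self_filtering, self_overall, wall_overall):
    """Add the unreported overhead to the self-reported filtering time.

    The overhead is the wall-clock time from the "time" command minus the
    tool's self-reported overall time (assumed to be mostly Python startup).
    """
    overhead = wall_overall - self_overall
    return self_filtering + overhead


# Made-up numbers, purely to show the arithmetic.
print(adjusted_filtering_time(self_filtering=100.0,
                              self_overall=120.0,
                              wall_overall=120.1))
```

The warm-up helper only builds the command lists; running them (for example with subprocess.run) is left to the caller.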

[5] Performance is only one measurement.  Features, capabilities,
usability, etc. matter too.  filter-branch is a general purpose
filtering tool, but in my opinion, not a good one -- and not just
because of performance.  BFG Repo Cleaner is a good tool, but it is
special purpose; it is designed for a few particular use cases
(limiting the kinds of things I could try in my comparison above).  My
hope is that filter-repo serves as a good general purpose filtering
tool so that people can stop suffering from filter-branch.


* Re: New command/tool: git filter-repo
  2019-02-08  1:25 ` Elijah Newren
@ 2019-02-08 10:22   ` Johannes Schindelin
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Schindelin @ 2019-02-08 10:22 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Lars Schneider

Hi Elijah,

On Thu, 7 Feb 2019, Elijah Newren wrote:

> On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback.  The rough
> 
> Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
> about performance numbers on known repositories for filter-repo and
> how it compared to other tools; I gave extremely rough estimates, but
> here I belatedly provide some more detailed figures.  In each case, I
> report both filtering time, and cleanup (gc or clone) time[0]:
> 
> [...]

Those are pretty good numbers right there.

>  My hope is that filter-repo serves as a good general purpose filtering
>  tool so that people can stop suffering from filter-branch.

I agree. `git filter-branch` was simply pulled out of Cogito when that
project was declared discontinued, as the last nugget we could steal from
the carcass (we stole such a lot from Cogito that we're probably the
reason it died).

And when somebody started to rewrite it in C (calling it `git
rewrite-commits` if memory serves), I asked to change it a little so it
could be used as a drop-in replacement for `filter-branch`, and
unfortunately stopped that effort in its tracks.

So it is really good for me to see you picking up the initiative and making
that particular shell script obsolete.

Ciao,
Dscho

P.S.: No, I am not willing to even attempt to run filter-branch on
Windows.


* Re: New command/tool: git filter-repo
  2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
                   ` (2 preceding siblings ...)
  2019-02-08  1:25 ` Elijah Newren
@ 2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
  2019-02-08 20:13   ` Johannes Schindelin
                     ` (2 more replies)
  3 siblings, 3 replies; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-02-08 18:53 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List


On Thu, Jan 31 2019, Elijah Newren wrote:

> What's the future?  (Core command of git.git?  place it in contrib?  keep it
> in a separate repo?)  I'm hoping to discuss that at the contributor summit
> today, but feedback on the list is also welcome.

Some of this I may have mentioned at the summit, but here for the list:

* I think it should be a candidate for a core (not "just contrib")
  git.git command, given that we have someone willing to maintain it &
  deal with bugs etc. I'm not worried about that given the author.

* It's unfortunate, in terms of the API we need to support going forward,
  that this obligates us to support a fairly intricate Python API, so it's
  similar (but more detailed) to Git.pm (which I also tried to get rid of
  as an external API a while ago).

  However, as you correctly note that's the only way a command like this
  can be really fast, we already have the "no special API" command with
  git-filter-branch, and that's horribly slow.

  But perhaps there's ways we can in advance deal with a potential
  future breaking API change. E.g. some Pythonic way of versioning the
  API, or just prominently documenting whatever (low?) stability
  guarantees we're making.

  I imagine if we need to make breaking changes in the future that'll
  be less of a big deal than in other cases, since we'd expect the API use
  to be one-off migration scripts, although maybe it'll get used for
  all-the-time exports (e.g. mirroring internal->external repos with
  filtering).

* The rest of our commands are hooked up to the i18n framework. I don't
  think this should be a blocker, but it's worth thinking about what the
  plan for this is.

  Are we going to need the equivalent of Git::I18N for Python (which
  presumably will be a run-time dependency on something needing the
  Python API that links to gettext)?

  Or perhaps we could do the translated strings in C, by making the
  program you're invoking be a C command, invoking the Python part as a
  helper (which would need to re-invoke a helper if it prints its own
  messages).
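
One possible shape for "some Pythonic way of versioning the API" is sketched below. Everything in it is hypothetical: `API_VERSION`, `require_api`, and `IncompatibleAPIError` are names invented for illustration, not anything filter-repo provides. The idea is that a one-off migration script states which API version it was written against and fails early on a mismatch.

```python
# Hypothetical version tuple that a library module might export.
API_VERSION = (1, 2)  # (major, minor)


class IncompatibleAPIError(RuntimeError):
    """Raised when a script asks for an API version that is not provided."""


def require_api(major, minor=0, available=API_VERSION):
    """Check that `available` can serve a script written for major.minor.

    Assumed convention: a major bump is a breaking change, a minor bump
    only adds features.
    """
    have_major, have_minor = available
    if have_major != major or have_minor < minor:
        raise IncompatibleAPIError(
            "script needs API %d.%d, library provides %d.%d"
            % (major, minor, have_major, have_minor))


# A migration script would start with something like:
require_api(1, 0)
```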

Thanks for working on this!


* Re: New command/tool: git filter-repo
  2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
@ 2019-02-08 20:13   ` Johannes Schindelin
  2019-02-11 16:00     ` Elijah Newren
  2019-02-11 15:47   ` Elijah Newren
  2019-06-08 16:20   ` Elijah Newren
  2 siblings, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2019-02-08 20:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List

Hi Ævar,

On Fri, 8 Feb 2019, Ævar Arnfjörð Bjarmason wrote:

> [...]
>
>   But perhaps there's ways we can in advance deal with a potential
>   future breaking API change. E.g. some Pythonic way of versioning the
>   API, or just prominently documenting whatever (low?) stability
>   guarantees we're making.

Another thing to keep in mind: it being in Python prevents it from being
distributed with Git for Windows. The Git for Windows installer already
weighs way more than it used to (it used to be under 30MB, now it is
44MB), and I am simply not willing to increase the footprint dramatically
just for one rarely used command.

If only it were written as a built-in...

Ciao,
Dscho


* Re: New command/tool: git filter-repo
  2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
  2019-02-08 20:13   ` Johannes Schindelin
@ 2019-02-11 15:47   ` Elijah Newren
  2019-06-08 16:20   ` Elijah Newren
  2 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-02-11 15:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List

Hi Ævar,

On Fri, Feb 8, 2019 at 10:53 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Jan 31 2019, Elijah Newren wrote:
>
> > What's the future?  (Core command of git.git?  place it in contrib?  keep it
> > in a separate repo?)  I'm hoping to discuss that at the contributor summit
> > today, but feedback on the list is also welcome.
>
> Some of this I may have mentioned at the summit, but here for the list:
>
> * I think it should be a candidate for a core (not "just contrib")
>   git.git command, given that we have someone willing to maintain it &
>   deal with bugs etc. I'm not worried about that given the author.
>
> * It's unfortunate, in terms of the API we need to support going forward,
>   that this obligates us to support a fairly intricate Python API, so it's
>   similar (but more detailed) to Git.pm (which I also tried to get rid of
>   as an external API a while ago).
>
>   However, as you correctly note that's the only way a command like this
>   can be really fast, we already have the "no special API" command with
>   git-filter-branch, and that's horribly slow.
>
>   But perhaps there's ways we can in advance deal with a potential
>   future breaking API change. E.g. some Pythonic way of versioning the
>   API, or just prominently documenting whatever (low?) stability
>   guarantees we're making.
>
>   I imagine if we need to make breaking changes in the future that'll
>   be less of a big deal than in other cases, since we'd expect the API use
>   to be one-off migration scripts, although maybe it'll get used for
>   all-the-time exports (e.g. mirroring internal->external repos with
>   filtering).
>
> * The rest of our commands are hooked up to the i18n framework. I don't
>   think this should be a blocker, but it's worth thinking about what the
>   plan for this is.
>
>   Are we going to need the equivalent of Git::I18N for Python (which
>   presumably will be a run-time dependency on something needing the
>   Python API that links to gettext)?
>
>   Or perhaps we could do the translated strings in C, by making the
>   program you're invoking be a C command, invoking the Python part as a
>   helper (which would need to re-invoke a helper if it prints its own
>   messages).
>
> Thanks for working on this!

Good points.  I'll dig into the i18n story.  As you point out, the
API stability may be tricky, but you may be right that we just need to
prominently document whatever guarantee we want to make and that it's
designed more for one-off migration scripts than continuing exports.


* Re: New command/tool: git filter-repo
  2019-02-08 20:13   ` Johannes Schindelin
@ 2019-02-11 16:00     ` Elijah Newren
  0 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-02-11 16:00 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

Hi Dscho,

On Fri, Feb 8, 2019 at 12:13 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Ævar,
>
> On Fri, 8 Feb 2019, Ævar Arnfjörð Bjarmason wrote:
>
> > [...]
> >
> >   But perhaps there's ways we can in advance deal with a potential
> >   future breaking API change. E.g. some Pythonic way of versioning the
> >   API, or just prominently documenting whatever (low?) stability
> >   guarantees we're making.
>
> Another thing to keep in mind: it being in Python prevents it from being
> distributed with Git for Windows. The Git for Windows installer already
> weighs way more than it used to (it used to be under 30MB, now it is
> 44MB), and I am simply not willing to increase the footprint dramatically
> just for one rarely used command.

That would be unfortunate, though understandable.  I am curious,
though: do you include filter-branch, and does anyone use it, on Windows?
You mentioned elsewhere in this thread that you weren't even willing
to attempt to run filter-branch there.  If people aren't using
filter-branch on Windows, then there's nothing for me to save them
from anyway.  If they are, I'm curious to hear more about the use cases
and motivations, even if the cost of my tool is too high for you to
include.

Also, since filter-branch and filter-repo are meant mostly as one-shot
migration tools, it is already not uncommon for people to do it on a
different machine (perhaps one with more RAM, or faster disks), and at
most one person on the team needs to run it (sometimes folks even look
to an "expert" outside the team to run the migration for them).  Once
migrated, they push the results back and are done with the tool.

> If only it were written as a built-in...

A built-in would be great, IF it could provide all the same
capabilities and with at least the same speed.  However, making it a
built-in would fundamentally remove a significant chunk of its power
and flexibility, which was part of the driving force for creating this
tool.

Elijah


* Re: New command/tool: git filter-repo
  2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
  2019-02-08 20:13   ` Johannes Schindelin
  2019-02-11 15:47   ` Elijah Newren
@ 2019-06-08 16:20   ` Elijah Newren
  2 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-06-08 16:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List

Hi,

Now that there's a released version of git that has all necessary
flags and features[1] to run git filter-repo
(https://github.com/newren/git-filter-repo), I thought I'd send an
update...

On Fri, Feb 8, 2019 at 10:53 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Thu, Jan 31 2019, Elijah Newren wrote:
>
> > What's the future?  (Core command of git.git?  place it in contrib?  keep it
> > in a separate repo?)  I'm hoping to discuss that at the contributor summit
> > today, but feedback on the list is also welcome.
>
> Some of this I may have mentioned at the summit, but here for the list:
>
> * I think it should be a candidate for a core (not "just contrib")
>   git.git command, given that we have someone willing to maintain it &
>   deal with bugs etc. I'm not worried about that given the author.
>
> * It's unfortunate, in terms of the API we need to support going forward,
>   that this obligates us to support a fairly intricate Python API, so it's
>   similar (but more detailed) to Git.pm (which I also tried to get rid of
>   as an external API a while ago).
>
>   However, as you correctly note that's the only way a command like this
>   can be really fast, we already have the "no special API" command with
>   git-filter-branch, and that's horribly slow.
>
>   But perhaps there's ways we can in advance deal with a potential
>   future breaking API change. E.g. some Pythonic way of versioning the
>   API, or just prominently documenting whatever (low?) stability
>   guarantees we're making.
>
>   I imagine if we need to make breaking changes in the future that'll
>   be less of a big deal than in other cases, since we'd expect the API use
>   to be one-off migration scripts, although maybe it'll get used for
>   all-the-time exports (e.g. mirroring internal->external repos with
>   filtering).
>
> * The rest of our commands are hooked up to the i18n framework. I don't
>   think this should be a blocker, but it's worth thinking about what the
>   plan for this is.
>
>   Are we going to need the equivalent of Git::I18N for Python (which
>   presumably will be a run-time dependency on something needing the
>   Python API that links to gettext)?
>
>   Or perhaps we could do the translated strings in C, by making the
>   program you're invoking be a C command, invoking the Python part as a
>   helper (which would need to re-invoke a helper if it prints its own
>   messages).
>
> Thanks for working on this!

I've implemented these, and several other things too.  Changes since last time:

* Now i18n-ized
* Several disclaimers about API backcompat (this is more of a one-shot
  conversion tool) [2]
* Converted to Python3 (Python2 is EOL at EOY)
* Pruning of become-empty and become-degenerate+empty commits has been
  fixed up (I mentioned this as a concern last time)
* Testsuite has been fleshed out, including not only multiple small
  fixes to filter-repo, but more fixes to git itself[3]
* Usage, Examples, Internals, and Limitations documentation now exists
  (in README.md format and built-in -h help; no manpage yet)
* Several new filters and abilities have been added

Now that filter-repo is complete and more easily tested, what are
folks' thoughts on incorporating it in git.git?  (And if we do go that
route, can we avoid losing its history so I can bisect issues if
necessary?)

Thanks,
Elijah


[1] Well, except people will get an error if they use
--preserve-commit-encoding saying they don't have a new enough git
since en/fast-export-encoding is still sitting in next and didn't make
it into git-2.22.  But I only know of one repo that even uses special
commit encodings, so I suspect few if any will even care about that
flag.

[2] The warnings appear in several places to try to make sure people
notice them; a not quite complete list:
https://github.com/newren/git-filter-repo/blame/master/README.md#L620-L625
https://github.com/newren/git-filter-repo/blame/master/README.md#L727-L730
https://github.com/newren/git-filter-repo/blame/master/README.md#L989-L993
https://github.com/newren/git-filter-repo/blob/master/git-filter-repo#L13-L30
https://github.com/newren/git-filter-repo/blob/master/git-filter-repo#L1523,L1524
https://github.com/newren/git-filter-repo/blob/master/t/t9391/commit_info.py#L4-L6
https://github.com/newren/git-filter-repo/blob/master/t/t9391/create_fast_export_output.py#L4-L6
https://github.com/newren/git-filter-repo/blob/master/t/t9391/file_filter.py#L4-L6
https://github.com/newren/git-filter-repo/blob/master/t/t9391/splice_repos.py#L4-L6
https://github.com/newren/git-filter-repo/blob/master/t/t9391/strip-cvs-keywords.py#L4-L6

[3] https://github.com/newren/git-filter-repo/tree/develop#upstream-improvements


end of thread, other threads:[~2019-06-08 16:22 UTC | newest]

Thread overview: 13+ messages
2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
2019-01-31 19:09 ` Junio C Hamano
2019-01-31 20:43   ` Elijah Newren
2019-01-31 23:36     ` Roberto Tyley
2019-02-01  7:38       ` Elijah Newren
2019-01-31 20:47 ` Elijah Newren
2019-02-08  1:25 ` Elijah Newren
2019-02-08 10:22   ` Johannes Schindelin
2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
2019-02-08 20:13   ` Johannes Schindelin
2019-02-11 16:00     ` Elijah Newren
2019-02-11 15:47   ` Elijah Newren
2019-06-08 16:20   ` Elijah Newren