From: Elijah Newren <newren@gmail.com>
To: Git Mailing List <git@vger.kernel.org>
Cc: Lars Schneider <larsxschneider@gmail.com>
Subject: Re: New command/tool: git filter-repo
Date: Thu, 7 Feb 2019 17:25:01 -0800
Message-ID: <CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com>
In-Reply-To: <CABPp-BFC--s+D0ijRkFCRxP5Lxfi+__YF4EdxkpO5z+GoNW7Gg@mail.gmail.com>
Hi,
On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback. The rough
Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
about performance numbers on known repositories for filter-repo and
how it compared to other tools; I gave extremely rough estimates, but
here I belatedly provide some more detailed figures. In each case, I
report both filtering time and cleanup (gc or clone) time[0]:
Testcase 1: Remove a single file (configure.ac) from each commit in git.git:
* filter-branch[1a]: 2413.978s + 34.812s
* BFG (8-core)[1b]: 38.743s + 30.333s
* BFG (40-core)[1b]: 24.680s + 35.165s
* filter-repo[1c]: 35.582s + 15.690s
Caveats: filter-repo failed and needed workarounds; see [1d]
Testcase 2: Keep two directories (guides/ and tools/) from rails.git:
* filter-branch[2a]: 14586.655s + 22.726s
* BFG (8-core)[2b]: 27.675s + 15.786s
* BFG (40-core)[2b]: 24.883s + 20.463s
* filter-repo[2c]: 10.951s + 12.500s
Caveats: filter-branch failed at the end of this operation; see [2d].
AFAICT, BFG can't do this operation; used approximations instead[2e].
Testcase 3: Replacing one string with another throughout all files in linux.git:
* filter-branch[3a]: Estimated at about 3.5 months (~8.9e6 seconds)
* BFG (8-core)[3b]: 2144.904s + 693.79s
* BFG (40-core)[3b]: 1178.577s + 636.887s
* filter-repo[3c]: 1203.147s + 159.620s
Caveats: filter-branch failed at ~12 hours; see [3d].
Other details about measurements at [4]. Take-aways and biased
opinions at [5].
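
For readers who want to compare totals rather than the two components, the following is a minimal Python sketch that sums filtering and cleanup time for the measured runs above and expresses each total relative to filter-repo. The numbers are copied from the three testcases; the dictionary layout and the function name `totals_relative_to` are illustrative assumptions, and the estimated filter-branch figure for testcase 3 is left out because it was never measured.

```python
# Timings in seconds, copied from the testcases above: (filtering, cleanup).
# The data layout and helper name are illustrative, not part of any tool.
TIMINGS = {
    "git.git: remove configure.ac": {
        "filter-branch": (2413.978, 34.812),
        "BFG (8-core)": (38.743, 30.333),
        "BFG (40-core)": (24.680, 35.165),
        "filter-repo": (35.582, 15.690),
    },
    "rails.git: keep guides/ and tools/": {
        "filter-branch": (14586.655, 22.726),
        "BFG (8-core)": (27.675, 15.786),
        "BFG (40-core)": (24.883, 20.463),
        "filter-repo": (10.951, 12.500),
    },
    "linux.git: replace a string": {
        "BFG (8-core)": (2144.904, 693.79),
        "BFG (40-core)": (1178.577, 636.887),
        "filter-repo": (1203.147, 159.620),
    },
}


def totals_relative_to(runs, baseline="filter-repo"):
    """Return {tool: (total_seconds, total / baseline_total)}."""
    base_total = sum(runs[baseline])
    return {
        tool: (sum(parts), sum(parts) / base_total)
        for tool, parts in runs.items()
    }


if __name__ == "__main__":
    for testcase, runs in TIMINGS.items():
        print(testcase)
        for tool, (total, ratio) in totals_relative_to(runs).items():
            print(f"  {tool:<14} {total:10.3f}s  {ratio:7.2f}x")
```

Each run was timed only once (see [4]), so small differences between these ratios should not be over-interpreted.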
Hope this was interesting,
Elijah
*************** Footnotes (Minutiae for the curious) ***************
[0] git-filter-branch's manpage suggests re-cloning to get rid of old objects,
BFG as its last step provides the user commands to execute in order to
clean out old objects, and filter-repo automatically runs such commands.
As such, time of post-run gc seems like a relevant thing to report.
Commands used and timed:
* filter-branch: time git clone file://$(pwd) ../nuke-me-clone
* BFG: git reflog expire --expire=now --all && time git gc --prune=now
* filter-repo: N/A (internally runs same commands as I manually ran for BFG)
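
A minimal Python sketch of timing the cleanup step from a script, assuming the same two commands listed for BFG above are run inside the rewritten repository. The helper names `timed_run` and `cleanup_seconds` are hypothetical, and unlike the shell `time` keyword this records wall-clock time only.

```python
import subprocess
import time


def timed_run(cmd, cwd=None):
    """Run cmd (a list of arguments) and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def cleanup_seconds(repo_dir):
    """Expire reflogs (untimed), then time `git gc --prune=now`.

    Mirrors the BFG line above, where only the gc sits under `time`.
    """
    subprocess.run(
        ["git", "reflog", "expire", "--expire=now", "--all"],
        cwd=repo_dir,
        check=True,
    )
    return timed_run(["git", "gc", "--prune=now"], cwd=repo_dir)
```

Calling `cleanup_seconds(".")` from the top of a freshly filtered repository would return a figure comparable to the second number in each BFG result line.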
[1a] time git filter-branch --index-filter \
  'git rm --quiet --cached --ignore-unmatch configure.ac' \
  --tag-name-filter cat --prune-empty -- --all
[1b] time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files configure.ac
[1c] git tag | grep v1.0rc | xargs git tag -d
git tag -d junio-gpg-pub
time git filter-repo --path configure.ac --invert-paths
[1d] git fast-export when run with certain flags will abort in repos
with tags of blobs or tags of tags. I had to first delete 7 tags
to get this testcase to run, as shown in the commands above in
[1c]. I'll probably patch fast-export to fix this.
[2a] time git filter-branch --index-filter \
  'git ls-files -z | tr "\0" "\n" | grep -v -e ^guides/ -e ^tools/ |
   tr "\n" "\0" | xargs -0 git rm --quiet --cached --ignore-unmatch' \
  --tag-name-filter cat --prune-empty -- --all
[2b] git log --format=%n --name-only | sort | uniq | grep -v ^$ > all-files.txt
time java -jar ~/Downloads/bfg-1.13.0.jar \
  --delete-folders "{$(grep / all-files.txt | sed -e 's/"//' -e s%/.*%% |
    uniq | grep -v -e guides -e tools | tr '\n' ,)}" \
  --delete-files "{$(comm -23 <(grep -v / all-files.txt) \
    <(grep -e guides/ -e tools/ all-files.txt | sed -e s%.*/%% | sort) |
    tr '\n' ,)}"
[2c] time git filter-repo --path guides --path tools
[2d] filter-branch fails at the very end when noting which refs were
deleted/rewritten with:
error: cannot lock ref 'refs/tags/v0.10.0': is at
b68b47672e613e94a7859c9549e9cd4b401f7b79 but expected
e2724aa1856253f4fc48ddc251583042c5f06029
Could not delete refs/tags/v0.10.0
Turns out b68b47672e613e94a7859c9549e9cd4b401f7b79 is an
annotated tag in the original repo pointing to the commit
e2724aa1856253f4fc48ddc251583042c5f06029. I do not know the
cause of this bug, but since it was almost at the very end, I
just reported the time used before it hit this error.
[2e] Unless I am misunderstanding, BFG is not capable of this
filtering operation because it uses basenames for --delete-files
and --delete-folders, and some names appear in several
directories (e.g. .gitignore, Rakefile, tasks). As such, with
the BFG you either have to delete files/directories that
shouldn't be, or leave files and folders around that you wanted
to have deleted. The command in [2b] has some of both, but
should still give a good estimate of how long it would take BFG
to do this kind of operation if file and directory basenames in
the rails repository happened to be named uniquely.
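
The basename limitation described in [2e] can be illustrated with a short Python sketch. It assumes a plain list of repository paths (the sample below is hypothetical, not the real rails.git listing) and a made-up helper `ambiguous_basenames`; it covers file basenames only, not the directory-name case (e.g. tasks) that the footnote also mentions.

```python
import posixpath

# Directories to keep, as in testcase 2.
KEEP_DIRS = ("guides/", "tools/")


def ambiguous_basenames(paths, keep_dirs=KEEP_DIRS):
    """Basenames of files that occur both inside and outside keep_dirs.

    A delete rule that matches on basename alone cannot remove the outside
    copies of these files without also removing the inside copies.
    """
    inside, outside = set(), set()
    for path in paths:
        name = posixpath.basename(path)
        if path.startswith(keep_dirs):
            inside.add(name)
        else:
            outside.add(name)
    return sorted(inside & outside)


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    sample = [
        ".gitignore",
        "Rakefile",
        "guides/.gitignore",
        "guides/Rakefile",
        "tools/README.md",
        "activerecord/Rakefile",
        "activerecord/lib/active_record.rb",
    ]
    print(ambiguous_basenames(sample))
```

Any name this returns forces the trade-off the footnote describes: delete it everywhere (losing the copy under guides/ or tools/) or keep it everywhere (leaving stray copies elsewhere).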
[3a] time git filter-branch -d /dev/shm/tmp --tree-filter \
  'git ls-files | xargs sed -i s/secretly/covertly/' \
  --tag-name-filter cat -- --all
[3b] time java -jar ~/Downloads/bfg-1.13.0.jar \
  --replace-text <(echo 'secretly==>covertly')
[3c] time git filter-repo --replace-text <(echo 'secretly==>covertly')
[3d] filter-branch failed after 45704 seconds, predicting another
8836429 seconds (~102 days) remaining at the time. As commits
earlier in history tend to be smaller, filter-branch nearly
always underestimates the time required, sometimes considerably.
filter-branch failed on commit
af25e94d4dcfb9608846242fabdd4e6014e5c9f0 due to an empty ident.
I possibly could have worked around it with --env-filter, but
it's not like I'm going to wait for it to finish anyway.
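
As a quick check of the arithmetic behind the testcase 3 estimate, here is a small Python sketch that adds the elapsed time to filter-branch's own prediction from [3d]. The function name and the 30-day month are assumptions made for illustration; as the footnote notes, the real figure would likely have been higher.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

ELAPSED_S = 45_704        # seconds run before the failure, from [3d]
REMAINING_S = 8_836_429   # filter-branch's prediction at that point, from [3d]


def projected_total(elapsed_s, remaining_s):
    """Return (total_seconds, total_days) for an interrupted run."""
    total = elapsed_s + remaining_s
    return total, total / SECONDS_PER_DAY


if __name__ == "__main__":
    total_s, total_days = projected_total(ELAPSED_S, REMAINING_S)
    print(f"elapsed:   {ELAPSED_S / SECONDS_PER_HOUR:.1f} hours")
    print(f"projected: {total_s} s = {total_days:.1f} days "
          f"(about {total_days / 30:.1f} months of 30 days)")
```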
[4] Other notes about timings:
* All tests were run on an 8 cpu system, except for the "BFG
40-core" tests which were run on a 40 core system. (filter-branch
and filter-repo are not multi-threaded and gain nothing from more
cores.)
* More precisely, I ran on AWS with an m4.2xlarge with two 50-GB GP2
volumes (150 IOPS) for tests. The 40-core system was an
m4.10xlarge.
* Before each command, to try to avoid warm disk caches helping or
hurting depending on the order I ran commands in, I first ran:
* rsync -az --delete ../$REPO-orig/ ./
* git status
* $TOOL -h
* Testing was imperfect; I just ran once and recorded the time. It took
long enough to gather the data as it was.
* When additional commands were needed for the filtering
(e.g. getting the all-files.txt list to generate the BFG command,
or deleting tags that fast-export couldn't handle for
filter-repo), I did not include the times of those commands in the
overall execution time. It would have added a few hundredths of a
second to filter-repo's git.git time, and about 5-6 seconds to BFG's
rails.git time.
* filter-repo self-reports time until filtering finishes and time
until entirely done. I took the difference between its self-report of
overall time and the "time" command's report of overall time (which
was typically on the order of 0.1s), and added that to filter-repo's
filtering time, assuming that most of the discrepancy would be due to
Python startup.
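
The adjustment in the last bullet can be written out as a small Python sketch. The function name and the example numbers are hypothetical; treating cleanup time as the self-reported total minus the self-reported filtering time is an assumption drawn from the description above, not from filter-repo's code.

```python
def adjusted_times(reported_filtering, reported_total, wall_clock_total):
    """Split one filter-repo run into (filtering, cleanup) seconds.

    The gap between the `time` command's wall-clock total and the tool's
    self-reported total is attributed to filtering (assumed to be mostly
    interpreter startup). Cleanup is taken as the remainder of the
    self-reported total.
    """
    startup_gap = wall_clock_total - reported_total
    filtering = reported_filtering + startup_gap
    cleanup = reported_total - reported_filtering
    return filtering, cleanup


if __name__ == "__main__":
    # Hypothetical values: filtering done at 10.0s, everything done at
    # 15.0s, while `time` reported 15.25s of wall-clock time.
    filtering, cleanup = adjusted_times(10.0, 15.0, 15.25)
    print(f"filtering: {filtering:.3f}s  cleanup: {cleanup:.3f}s")
```

With this split, the two reported numbers always add up to the wall-clock total measured by `time`.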
[5] Performance is only one measurement. Features, capabilities,
usability, etc. matter too. filter-branch is a general purpose
filtering tool, but in my opinion, not a good one -- and not just
because of performance. BFG Repo Cleaner is a good tool, but it is
special purpose; it is designed for a few particular use cases
(limiting the kinds of things I could try in my comparison above). My
hope is that filter-repo serves as a good general purpose filtering
tool so that people can stop suffering from filter-branch.