git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Git Mailing List <git@vger.kernel.org>
Cc: Lars Schneider <larsxschneider@gmail.com>
Subject: Re: New command/tool: git filter-repo
Date: Thu, 7 Feb 2019 17:25:01 -0800	[thread overview]
Message-ID: <CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com> (raw)
In-Reply-To: <CABPp-BFC--s+D0ijRkFCRxP5Lxfi+__YF4EdxkpO5z+GoNW7Gg@mail.gmail.com>

Hi,

On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.  The rough


Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
about performance numbers on known repositories for filter-repo and
how it compared to other tools; I gave extremely rough estimates, but
here I belatedly provide some more detailed figures.  In each case, I
report both filtering time, and cleanup (gc or clone) time[0]:


Testcase 1: Remove a single file (configure.ac) from each commit in git.git:

  * filter-branch[1a]:  2413.978s + 34.812s
  * BFG (8-core)[1b]:     38.743s + 30.333s
  * BFG (40-core)[1b]:    24.680s + 35.165s
  * filter-repo[1c]:      35.582s + 15.690s

  Caveats: filter-repo failed and needed workarounds; see [1d]

Testcase 2: Keep two directories (guides/ and tools/) from rails.git:

  * filter-branch[2a]: 14586.655s + 22.726s
  * BFG (8-core)[2b]:     27.675s + 15.786s
  * BFG (40-core)[2b]:    24.883s + 20.463s
  * filter-repo[2c]:      10.951s + 12.500s

  Caveats: filter-branch failed at the end of this operation; see [2d].
           AFAICT, BFG can't do this operation; used approximations instead[2e].

Testcase 3: Replacing one string with another throughout all files in linux.git:

  * filter-branch[3a]: Estimated at about 3.5 months (~8.9e6 seconds)
  * BFG (8-core)[3b]:   2144.904s + 693.79s
  * BFG (40-core)[3b]:  1178.577s + 636.887s
  * filter-repo[3c]:    1203.147s + 159.620s

  Caveats: filter-branch failed at ~12 hours; see [3d].


Other details about measurements at [4].  Take-aways and biased
opinions at [5].


Hope this was interesting,
Elijah



*************** Footnotes (Minutiae for the curious) ***************

[0] git-filter-branch's manpage suggests re-cloning to get rid of old objects,
    BFG as its last step provides the user commands to execute in order to
    clean out old objects, and filter-repo automatically runs such commands.
    As such, time of post-run gc seems like a relevant thing to report.
    Commands used and timed:

  * filter-branch: time git clone file://$(pwd) ../nuke-me-clone
  * BFG:           git reflog expire --expire=now --all && time git gc
--prune=now
  * filter-repo:   N/A (internally runs same commands as I manually ran for BFG)


[1a] time git filter-branch --index-filter 'git rm --quiet --cached
--ignore-unmatch configure.ac' --tag-name-filter cat --prune-empty --
--all

[1b] time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files configure.ac

[1c] git tag | grep v1.0rc | xargs git tag -d
     git tag -d junio-gpg-pub
     time git filter-repo --path configure.ac --invert-paths

[1d] git fast-export when run with certain flags will abort in repos
     with tags of blobs or tags of tags.  I had to first delete 7 tags
     to get this testcase to run, as shown in the commands above in
     [1c].  I'll probably patch fast-export to fix this.


[2a] time git filter-branch --index-filter 'git ls-files -z | tr "\0"
"\n" | grep -v -e ^guides/ -e ^tools/ | tr "\n" "\0" | xargs -0 git rm
--quiet --cached --ignore-unmatch' --tag-name-filter cat --prune-empty
-- --all

[2b] git log --format=%n --name-only | sort | uniq | grep -v ^$ > all-files.txt
     time java -jar ~/Downloads/bfg-1.13.0.jar --delete-folders
"{$(grep / all-files.txt | sed -e 's/"//' -e s%/.*%% | uniq | grep -v
-e guides -e tools | tr '\n' ,)}" --delete-files "{$(comm -23 <(grep
-v / all-files.txt) <(grep -e guides/ -e tools/ all-files.txt | sed -e
s%.*/%% | sort) | tr '\n' ,)}"

[2c] time git filter-repo --path guides --path tools

[2d] filter-branch fails at the very end when noting which refs were
     deleted/rewritten with:
         error: cannot lock ref 'refs/tags/v0.10.0': is at
b68b47672e613e94a7859c9549e9cd4b401f7b79 but expected
e2724aa1856253f4fc48ddc251583042c5f06029
         Could not delete refs/tags/v0.10.0
     Turns out b68b47672e613e94a7859c9549e9cd4b401f7b79 is an
     annotated tag in the original repo pointing to the commit
     e2724aa1856253f4fc48ddc251583042c5f06029.  I do not know the
     cause of this bug, but since it was almost at the very end, I
     just reported the time used before it hit this error.

[2e] Unless I am misunderstanding, BFG is not capable of this
     filtering operation because it uses basenames for --delete-files
     and --delete-folders, and some names appear in several
     directories (e.g. .gitignore, Rakefile, tasks).  As such, with
     the BFG you either have to delete files/directories that
     shouldn't be, or leave files and folders around that you wanted
     to have deleted.  The command in [2b] has some of both, but
     should still give a good estimate of how long it would take BFG
     to do this kind of operation if file and directory basenames in
     the rails repository happened to be named uniquely.

[3a] time git filter-branch -d /dev/shm/tmp --tree-filter 'git
ls-files | xargs sed -i s/secretly/covertly/' --tag-name-filter cat --
--all

[3b] time java -jar ~/Downloads/bfg-1.13.0.jar --replace-text <(echo
'secretly==>covertly')

[3c] time git filter-repo --replace-text <(echo 'secretly==>covertly')

[3d] filter-branch failed after 45704 seconds, predicting another
     8836429 seconds (~102 days) remaining at the time.  As commits
     earlier in history tend to be smaller, filter-branch nearly
     always underestimates the time required, sometimes considerably.
     filter-branch failed on commit
     af25e94d4dcfb9608846242fabdd4e6014e5c9f0 due to an empty ident.
     I possibly could have worked around it with --env-filter, but
     it's not like I'm going to wait for it to finish anyway.

[4] Other notes about timings:
  * All tests were run on an 8 cpu system, except for the "BFG
    40-core" tests which were run on a 40 core system.  (filter-branch
    and filter-repo are not multi-threaded and gain nothing from more
    cores.)
  * More precisely, I ran on AWS with an m4.2xlarge with two 50-GB GP2
    volumes (150 Iops) for tests.  The 40-core system was an
    m4.10xlarge.
  * Before each command, to try to avoid warm disk caches helping or
    hurting depending on the order I ran commands in, I first ran:
    * rsync -az --delete ../$REPO-orig/ ./
    * git status
    * $TOOL -h
  * Testing was imperfect; I just ran once and recorded the time.  It took
    long enough to gather the data as it was.
  * when additional commands were needed for the filtering
    (e.g. getting the all-files.txt list to generate the BFG command,
    or deleting tags that fast-export couldn't handle for
    filter-repo), I did not include the times of those commands in the
    overall execution time.  It would have added a few hundredths of a
    second to filter-repo's git.git time, and about 5-6 seconds to BFG's
    rails.git time.
  * filter-repo self-reports time until filtering finishes and time
    until entirely done.  I took difference between its self-report of
    overall time and the "time" command's report of overall time (which
    was typically order ~ 0.1s), and added that to filter-repo's
    filtering time, assuming that most the discrepancy would be due to
    python startup.

[5] Performance is only one measurement.  Features, capabilities,
usability, etc. matter too.  filter-branch is a general purpose
filtering tool, but in my opinion, not a good one -- and not just
because of performance.  BFG Repo Cleaner is a good tool, but it is
special purpose; it is designed for a few particular usecases
(limiting the kinds of things I could try in my comparison above).  My
hope is that filter-repo serves as a good general purpose filtering
tool so that people can stop suffering from filter-branch.

  parent reply	other threads:[~2019-02-08  1:25 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
2019-01-31 19:09 ` Junio C Hamano
2019-01-31 20:43   ` Elijah Newren
2019-01-31 23:36     ` Roberto Tyley
2019-02-01  7:38       ` Elijah Newren
2019-01-31 20:47 ` Elijah Newren
2019-02-08  1:25 ` Elijah Newren [this message]
2019-02-08 10:22   ` Johannes Schindelin
2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
2019-02-08 20:13   ` Johannes Schindelin
2019-02-11 16:00     ` Elijah Newren
2019-02-11 15:47   ` Elijah Newren
2019-06-08 16:20   ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=larsxschneider@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).