* New command/tool: git filter-repo
@ 2019-01-31  8:57 Elijah Newren
  2019-01-31 19:09 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Elijah Newren @ 2019-01-31  8:57 UTC
To: Git Mailing List

Hi everyone,

git-filter-repo[1], a filter-branch-like tool for rewriting repository history, is ready for more widespread testing and feedback. The rough edges I previously mentioned have been fixed, and it has several useful features already, though more development work is ongoing (docs are a bit sparse right now, though -h provides some help).

Why filter-repo vs. filter-branch?

  * filter-branch ranges from extremely slow to unusably slow (multiple orders of magnitude slower than it should be) for non-trivial repositories.

  * filter-branch made a number of usability choices that are okay for small repos, but these choices sometimes conflict as more options are combined, and the overall usability often causes difficulties for users trying to work with intermediate or larger repos.

  * filter-branch is missing some basic features.

The first two are intrinsic to filter-branch's design at this point and cannot be backward-compatibly fixed.

Requirements:

  * Python 2 (for now?)

  * A version of git with the en/fast-export-import topic (in master of git.git)

  * A version of git with the --combined-all-names option to diff-tree; I have submitted[2] this patch, but it hasn't been picked up yet.

What's the future? (Core command of git.git? Place it in contrib? Keep it in a separate repo?) I'm hoping to discuss that at the contributor summit today, but feedback on the list is also welcome.

Thanks,
Elijah

[1] https://github.com/newren/git-filter-repo; it's a ~2800 line single-file python script, depending only on the python standard library (and execution of git commands), all of which is designed to make build/installation trivial: you just need to add it to your $PATH.

[2] https://public-inbox.org/git/20190126221811.20241-1-newren@gmail.com/

^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: New command/tool: git filter-repo
From: Junio C Hamano @ 2019-01-31 19:09 UTC
To: Elijah Newren; +Cc: Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.  The rough
> edges I previously mentioned have been fixed, and it has several useful
> features already, though more development work is ongoing (docs are a
> bit sparse right now, though -h provides some help).
>
> Why filter-repo vs. filter-branch?

How does it compare with bfg-repo-cleaner? Somehow I was led to believe that all serious users of filter-branch like functionality are using bfg-repo-cleaner instead.

* Re: New command/tool: git filter-repo
From: Elijah Newren @ 2019-01-31 20:43 UTC
To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback.  The rough
> > edges I previously mentioned have been fixed, and it has several useful
> > features already, though more development work is ongoing (docs are a
> > bit sparse right now, though -h provides some help).
> >
> > Why filter-repo vs. filter-branch?
>
> How does it compare with bfg-repo-cleaner?  Somehow I was led to
> believe that all serious users of filter-branch like functionality
> are using bfg-repo-cleaner instead.

No, bfg-repo-cleaner only covers an important subset of the use cases.

bfg-repo-cleaner does a really good job if your goal is to remove a few big files and/or to remove some sensitive text (matched via regexes) from all blobs. It was designed for that specific role and has more options in this area than filter-repo currently has. But even within this design space it was optimized for, it is missing two things that I really want:

  * pruning of commits which become empty due to filtering

  * providing a way for the user to know what needs to be cleaned up. It has options like --strip-blobs-bigger-than <size> or --strip-biggest-blobs <NUM>, but no way for the user to figure out what <size> or <NUM> should be. Also, since it just focuses on really big blobs, it misses cases like someone checking in directories with a huge number of small-to-moderately sized files (e.g. bower_components/ or node_modules/, though these could also contain a few big blobs too), or someone checking in a lot of moderately sized files of a uniform extension (e.g. .webm, .tar.gz, .zip, .mp4, .avi). I've seen cases in the wild where the correct cleaning of history was more about filtering out directories or extensions than a couple big files. filter-repo's --analyze option creates some reports that help with this tremendously. Also, the options to delete files by glob/basename overlook the fact that renames may have occurred. Having a report that mentions renames that have occurred in history (also part of filter-repo's --analyze option) can be very helpful.

Outside of this specific use case, bfg-repo-cleaner is not very useful. It simply lacks more general filtering capabilities:

  * While bfg-repo-cleaner has facilities to remove certain paths, it has none to say you only want to keep certain paths. Unlike filter-branch, where you can use a pipeline to list all files, grep to remove the ones you want to keep from the list, then pipe the remainder of paths to xargs git rm, bfg-repo-cleaner doesn't have a facility for shell commands. Instead, in bfg-repo-cleaner you would need to emulate this by exhaustively listing directories and paths/globs of file basenames to delete, but that assumes the user knows all paths that have ever existed, making this solution not only onerous but error prone. More of the filterings I see these days are about just keeping a directory (or perhaps a handful of them) rather than just removing or cleaning a few files. Also, this makes pruning of commits which become empty much more important, but as noted above, bfg-repo-cleaner lacks that ability.

  * It has no facilities for renaming paths. You'd have to use a different tool to do that, but then why not use the other tool to do the whole job? Even if you do decide to use both tools, some capabilities of one tool can be neutered by such an approach (e.g. bfg-repo-cleaner's carefully rewritten commit messages that tried to ensure abbreviated commit shas referred to the new commit ids).

  * It has no facilities for affecting other parts of history, such as changing author/committer/tagger names or emails, changing commit timestamp or timezone, reparenting commits, splicing repository histories together, filtering files differently based on commit timestamp, etc. -- all of which can be done with filter-repo (though some of those things require writing a small python script; see basic examples in t/lib-usage/*).

Personally, I also find it kind of annoying that bfg-repo-cleaner doesn't automatically repack and shrink the repo when it is done and instead prints multiple commands the user can run to achieve that, even though it's the core use case for the tool. Granted, they may have had last-ditch recovery-of-the-original-repo in mind in case the user ran in a repository they shouldn't have, but I much prefer to have the tool just check if the repo looks like a fresh clone and bail if not, so that users have a far easier recovery mechanism -- just throw away the clone you were filtering and re-clone. Once you do that, auto repacking and shrinking is pretty natural. (And you can always provide a --force option to allow filtering & rewriting in a repo that isn't a fresh clone.)

Elijah

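The point above about history being bloated by whole directories or by file extensions, rather than by a few huge blobs, can be illustrated with a small sketch. This is not how filter-repo's --analyze option is implemented; it only shows the kind of aggregation such a report relies on. The function names and the sample (path, size) pairs are made up for illustration.

```python
from collections import defaultdict
import posixpath


def summarize_sizes(blobs):
    """Aggregate blob sizes by top-level directory and by file extension.

    `blobs` is an iterable of (path, size_in_bytes) pairs, one per blob that
    ever appeared in history. Hypothetical input format, for illustration only.
    """
    by_dir = defaultdict(int)
    by_ext = defaultdict(int)
    for path, size in blobs:
        # Files at the repository root are grouped under ".".
        top = path.split("/", 1)[0] if "/" in path else "."
        by_dir[top] += size
        # splitext() only returns the last suffix, so "x.tar.gz" counts as ".gz".
        ext = posixpath.splitext(path)[1] or "<no extension>"
        by_ext[ext] += size
    return dict(by_dir), dict(by_ext)


def largest(totals, n=3):
    """Return the n largest (name, total_size) entries, biggest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up sample: no single blob is huge, but two directories dominate.
sample = [
    ("node_modules/a/index.js", 40000),
    ("node_modules/b/index.js", 35000),
    ("media/intro.mp4", 900000),
    ("media/demo.webm", 700000),
    ("src/main.py", 2000),
    ("README", 500),
]
by_dir, by_ext = summarize_sizes(sample)
print(largest(by_dir))
print(largest(by_ext))
```

A size threshold on individual blobs would not flag node_modules/ in this sample, while the per-directory totals make it visible.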
* Re: New command/tool: git filter-repo
From: Roberto Tyley @ 2019-01-31 23:36 UTC
To: Elijah Newren; +Cc: Junio C Hamano, Git Mailing List

On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@gmail.com> wrote:
> On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > history, is ready for more widespread testing and feedback.  The rough
> > > edges I previously mentioned have been fixed, and it has several useful
> > > features already, though more development work is ongoing (docs are a
> > > bit sparse right now, though -h provides some help).
> > >
> > > Why filter-repo vs. filter-branch?

I like the name! I think a lot of users are interested in filtering their entire repo, rather than rewriting a single branch.

> > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > believe that all serious users of filter-branch like functionality
> > are using bfg-repo-cleaner instead.
>
> No, bfg-repo-cleaner only covers an important subset of the usecases.

That's true - the focus with BFG Repo-Cleaner is on removing unwanted data - completely eradicating it from a repo's history. There are some mistakes in history that repo owners just really *do not* want to share (i.e. large files, private data/credentials), and they can be a critical blocker to sharing or working with a Git repo. In terms of rewriting history, my internal criterion for what features I really want to be in the BFG is: is this unwanted data completely stopping many users from sharing their code or doing their work?

I understand that when it comes to rewriting history, there are loads of other operations that people sometimes want to perform, beyond removing unwanted data - merging/splitting of history, anonymization/renaming of committers, etc. Some of those might be nice to add to the BFG - but as with many OSS-maintainers, I have limited time, and a life to balance outside of software...!

> bfg-repo-cleaner does a really good job if your goal is to remove a
> few big files and/or to remove some sensitive text (matched via
> regexes) from all blobs.  It was designed for that specific role and
> has more options in this area than filter-repo currently has.  But
> even within this design space it was optimized for, it is missing two
> things that I really want:
>
> * pruning of commits which become empty due to filtering

There certainly have been several users asking for this feature on the BFG, and even a kindly contributed PR for the functionality which I've yet to merge. As it doesn't actually stop users from doing work - so far as I can see - it's something that I've done a poor job of following up.

> * providing a way for the user to know what needs to be cleaned up.
> It has options like --strip-blobs-bigger-than <size> or
> --strip-biggest-blobs <NUM>, but no way for the user to figure out
> what <size> or <NUM> should be.

For users of GitHub, it's normally 100MB with --strip-blobs-bigger-than <size> :-)

> Also, since it just focuses on really
> big blobs, it misses cases like someone checking in directories with a
> huge number of small-to-moderately sized files (e.g. bower_components/
> or node_modules/, though these could also contain a few big blobs

For those use-cases, it might be that BFG's --delete-folders flag is useful, especially given the protected-head-commit feature of the BFG.

It's getting late for me, must be even later in Brussels - I wish I could have made it there to join in! Merry Git Merge to you all, and good luck to you Elijah with git-filter-repo.

Roberto

* Re: New command/tool: git filter-repo
From: Elijah Newren @ 2019-02-01  7:38 UTC
To: Roberto Tyley; +Cc: Junio C Hamano, Git Mailing List

Hi Roberto,

First of all, thanks for the feedback, and for the awesome work on BFG repo filter!

On Fri, Feb 1, 2019 at 12:36 AM Roberto Tyley <roberto.tyley@gmail.com> wrote:
>
> On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@gmail.com> wrote:
> > On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > > history, is ready for more widespread testing and feedback.  The rough
> > > > edges I previously mentioned have been fixed, and it has several useful
> > > > features already, though more development work is ongoing (docs are a
> > > > bit sparse right now, though -h provides some help).
> > > >
> > > > Why filter-repo vs. filter-branch?
>
> I like the name! I think a lot of users are interested in filtering
> their entire repo, rather than rewriting a single branch.
>
> > > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > > believe that all serious users of filter-branch like functionality
> > > are using bfg-repo-cleaner instead.
> >
> > No, bfg-repo-cleaner only covers an important subset of the usecases.
>
> That's true - the focus with BFG Repo-Cleaner is on removing unwanted
> data - completely eradicating it from a repo's history. There are some
> mistakes in history that repo owners just really *do not* want to
> share (i.e. large files, private data/credentials), and they can be a
> critical blocker to sharing or working with a Git repo. In terms of
> rewriting history, my internal criterion for what features I really
> want to be in the BFG is: is this unwanted data completely stopping
> many users from sharing their code or doing their work?
>
> I understand that when it comes to rewriting history, there are loads
> of other operations that people sometimes want to perform, beyond
> removing unwanted data - merging/splitting of history,
> anonymization/renaming of committers, etc. Some of those might be nice
> to add to the BFG - but as with many OSS-maintainers, I have limited
> time, and a life to balance outside of software...!

Totally understand; you picked a certain set of use cases and provided a really good tool for those ones. I focused more on repository migration (not between different version control systems, but some other special flag-day event; e.g. we have a whole bunch of plugins, several of them are unmaintained, they are on lots of different code hosting platforms, and someone decides we want to put all the widely used plugins into a big monorepo for various reasons). It's just that a big migration is a good time to do cleanup as well, especially when the typical plugin's history might be a few hundred kilobytes, but for some reason many are dozens or hundreds of megabytes, so filter-repo also provides some facilities for cleaning out unwanted stuff.

> > bfg-repo-cleaner does a really good job if your goal is to remove a
> > few big files and/or to remove some sensitive text (matched via
> > regexes) from all blobs.  It was designed for that specific role and
> > has more options in this area than filter-repo currently has.  But
> > even within this design space it was optimized for, it is missing two
> > things that I really want:
> >
> > * pruning of commits which become empty due to filtering
>
> There certainly have been several users asking for this feature on the
> BFG, and even a kindly contributed PR for the functionality which I've
> yet to merge. As it doesn't actually stop users from doing work - so
> far as I can see - it's something that I've done a poor job of
> following up.

Just as a heads up: it may be a lot uglier than it at first looks; there are all kinds of special cases and I'm not quite sure I've got it all right in filter-repo yet:

  * Some projects intentionally create empty commits for versioning or publishing or other reasons, and do not want these commits to be removed. Thus, it's important to remove commits that *become* empty (due to other filtering rules), not commits which started empty.

  * There's a special case for the above rule: if a user only wants the history of a certain directory, then any empty commits that pre-dated that directory aren't wanted/needed. The way I handle this is that if a commit which had no changes relative to its parent became an orphan (its parent and all other ancestors were pruned due to becoming empty), then the fact that it was orphaned implies that it *became* empty and is thus prunable.

  * There are also topological changes possible: if a parent and all further ancestors on one side of history are pruned, a merge commit may become a non-merge commit. Often that will make the merge commit itself prunable, though there is the possibility that the merge had other changes tucked into it.

  * Merges also may become degenerate:

      * Case 1: both sides of history may have commits pruned away (due to becoming empty) all the way back to the merge base, meaning the merge commit now has the merge base for both parents. I think having redundant parents is senseless and the redundant ones should be pruned. In most cases, this should also mean the merge commit is now empty, as it'll likely have no changes relative to the merge base.

      * Case 2: one side of history may have commits pruned away back to the merge base, meaning that the merge commit now merges one parent with an ancestor of that parent. If the merge commit has no changes relative to the newer parent, then it could potentially be pruned away. However: what if this merge commit *started* as a merge of some commit with its own ancestor? If the second parent is the newer commit, this was likely a --no-ff merge and probably shouldn't be pruned. But, if a project has a strong policy of always doing --no-ff merges to incorporate changes, then even if the merge commit didn't start out looking like a --no-ff merge we should probably keep it even if it ends up looking like one. (But, of course, in either case if the first parent is the newer one, then it's not a --no-ff merge and unless the merge commit has extra file changes in it, we should be able to prune it.)

  * Also, if your scheme checks for changes against the first parent to see what changes exist in a merge commit, and the first parent history is pruned away due to filtering choices (e.g. it didn't touch the directory of interest), then your methodology may miss the fact that this merge commit has an empty set of changes relative to its remaining parent and thus mistakenly retain some commits which became empty and were expected to be pruned. I believe I saw filter-branch messing this up, though I'd have to double check.

  * There might be other special cases I've overlooked, though I've tried to cover them all.

I'm currently thinking that perhaps a flag or pair of flags might be useful here (perhaps --empty-pruning={always,never,auto} --no-ff-pruning={always,never,auto}), with 'auto' perhaps being pruning of commits which didn't start out that way (empty or a no-ff merge) but became so. I'm still undecided on this...
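
The first two rules in the list above (prune commits that *became* empty, keep ones that started empty unless they were orphaned by pruning) and the redundant-parent cleanup from Case 1 can be written down as a small decision function. This is only a sketch of the rules as stated in this message, not filter-repo's actual code; the function and parameter names are hypothetical, and the --no-ff questions from Case 2 are deliberately left out.

```python
def should_prune(was_empty, is_empty, had_parents, has_parents):
    """Decide whether a non-merge commit should be dropped after filtering.

    was_empty:   the commit had no file changes in the original history
    is_empty:    the commit has no file changes after filtering
    had_parents: the commit had at least one parent originally
    has_parents: the commit still has a parent after its ancestors were filtered
    """
    if not is_empty:
        return False  # still carries changes, always keep
    if not was_empty:
        return True   # became empty because of the filtering rules
    # Started empty: normally intentional, so keep it, unless all of its
    # ancestors were pruned, which is treated as "became empty".
    return had_parents and not has_parents


def dedupe_parents(parents):
    """Drop repeated parent ids while preserving order (degenerate merge, Case 1)."""
    unique = []
    for parent in parents:
        if parent not in unique:
            unique.append(parent)
    return unique


# A commit emptied by filtering is pruned; an intentionally empty one is kept.
print(should_prune(was_empty=False, is_empty=True, had_parents=True, has_parents=True))
print(should_prune(was_empty=True, is_empty=True, had_parents=True, has_parents=True))
# An originally empty commit whose whole ancestry was pruned is dropped too.
print(should_prune(was_empty=True, is_empty=True, had_parents=True, has_parents=False))
# Both parents collapsed onto the merge base: one parent remains.
print(dedupe_parents(["merge-base", "merge-base"]))
```

An 'auto' mode like the one floated above would roughly correspond to this function; 'always' and 'never' would bypass the was_empty check in opposite directions.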

> > * providing a way for the user to know what needs to be cleaned up.
> > It has options like --strip-blobs-bigger-than <size> or
> > --strip-biggest-blobs <NUM>, but no way for the user to figure out
> > what <size> or <NUM> should be.
>
> For users of GitHub, It's normally 100MB with
> --strip-blobs-bigger-than <size> :-)

If you're only interested in what GitHub won't allow for you to continue working with your repo, sure. If you are migrating your repo for some reason and want to take the chance to clean it up at the same time so everyone doesn't have to pay heavy clone costs, there are many other interesting values. :-)

> > Also, since it just focuses on really
> > big blobs, it misses cases like someone checking in directories with a
> > huge number of small-to-moderately sized files (e.g. bower_components/
> > or node_modules/, though these could also contain a few big blobs
>
> For those use-cases, it might be that BFG's --delete-folders flag is
> useful, especially given the protected-head-commit feature of the BFG.

Absolutely. However, this somewhat assumes the user doing the filtering knows what parts of history are extraneous. I've seen many cases where the original author(s) of some repo weren't that familiar with version control and committed a whole bunch of things they shouldn't have. They then left the company or project and a few years later, someone else deletes those files in a new commit, possibly even mentioning in the commit message that it was stuff that was never needed. Several more years go by, the second individual isn't with the project any more either, and someone else comes along, possibly not even familiar with the language or build tools the project uses but tasked with migrating the history anyway. Having a tool which can tell them why this small plugin happens to be much bigger than expected, with pointers to relevant bits ("what's this bower_components/ directory that accounts for most of the size and was deleted four years ago?"), is very helpful.

I've also seen this with big repos where there are lots of ugly travesties in all kinds of different places, and no one still around knows about more than a couple of them. The developers may be willing to go through a flag day to have the history rewritten to expunge the things that never should have been committed to get clone times down, but only removing a few big blobs or expecting there to be an individual who knows which directories can be nuked isn't always the best answer.

Given enough time, anyone can run enough git commands to kind of figure out what the big things are, but I wanted something that made that job easier. Granted, 'git filter-repo --analyze' is a separate read-only step (other than the reports it writes), so even if people wanted to use BFG repo cleaner or other filtering tools they could use this aspect of filter-repo to guide decisions.

> It's getting late for me, must be even later in Brussels - I wish I
> could have made it there to join in! Merry Git Merge to you all, and
> good luck to you Elijah with git-filter-repo.

Thanks again for all the feedback; maybe we'll get to meet at a future conference. Best of luck to you as well.

Elijah

* Re: New command/tool: git filter-repo
From: Elijah Newren @ 2019-01-31 20:47 UTC
To: Git Mailing List

On Thu, Jan 31, 2019 at 9:57 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi everyone,
>
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.
...
> What's the future? (Core command of git.git? place it in contrib? keep it
> in a separate repo?) I'm hoping to discuss that at the contributor summit
> today, but feedback on the list is also welcome.

Turns out we didn't have enough time and didn't discuss it at the contributor summit. So, I'm even more interested in feedback from the mailing list.

Thanks,
Elijah

* Re: New command/tool: git filter-repo
From: Elijah Newren @ 2019-02-08  1:25 UTC
To: Git Mailing List; +Cc: Lars Schneider

Hi,

On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback.  The rough

Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me about performance numbers on known repositories for filter-repo and how it compared to other tools; I gave extremely rough estimates, but here I belatedly provide some more detailed figures. In each case, I report both filtering time, and cleanup (gc or clone) time[0]:

Testcase 1: Remove a single file (configure.ac) from each commit in git.git:
  * filter-branch[1a]: 2413.978s + 34.812s
  * BFG (8-core)[1b]:    38.743s + 30.333s
  * BFG (40-core)[1b]:   24.680s + 35.165s
  * filter-repo[1c]:     35.582s + 15.690s
  Caveats: filter-repo failed and needed workarounds; see [1d]

Testcase 2: Keep two directories (guides/ and tools/) from rails.git:
  * filter-branch[2a]: 14586.655s + 22.726s
  * BFG (8-core)[2b]:     27.675s + 15.786s
  * BFG (40-core)[2b]:    24.883s + 20.463s
  * filter-repo[2c]:      10.951s + 12.500s
  Caveats: filter-branch failed at the end of this operation; see [2d]. AFAICT, BFG can't do this operation; used approximations instead[2e].

Testcase 3: Replacing one string with another throughout all files in linux.git:
  * filter-branch[3a]: Estimated at about 3.5 months (~8.9e6 seconds)
  * BFG (8-core)[3b]:  2144.904s + 693.79s
  * BFG (40-core)[3b]: 1178.577s + 636.887s
  * filter-repo[3c]:   1203.147s + 159.620s
  Caveats: filter-branch failed at ~12 hours; see [3d].

Other details about measurements at [4]. Take-aways and biased opinions at [5].

Hope this was interesting,
Elijah

*************** Footnotes (Minutiae for the curious) ***************

[0] git-filter-branch's manpage suggests re-cloning to get rid of old objects, BFG as its last step provides the user commands to execute in order to clean out old objects, and filter-repo automatically runs such commands. As such, time of post-run gc seems like a relevant thing to report. Commands used and timed:
  * filter-branch:
      time git clone file://$(pwd) ../nuke-me-clone
  * BFG:
      git reflog expire --expire=now --all && time git gc --prune=now
  * filter-repo:
      N/A (internally runs same commands as I manually ran for BFG)

[1a]
    time git filter-branch --index-filter 'git rm --quiet --cached --ignore-unmatch configure.ac' --tag-name-filter cat --prune-empty -- --all

[1b]
    time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files configure.ac

[1c]
    git tag | grep v1.0rc | xargs git tag -d
    git tag -d junio-gpg-pub
    time git filter-repo --path configure.ac --invert-paths

[1d] git fast-export when run with certain flags will abort in repos with tags of blobs or tags of tags. I had to first delete 7 tags to get this testcase to run, as shown in the commands above in [1c]. I'll probably patch fast-export to fix this.

[2a]
    time git filter-branch --index-filter 'git ls-files -z | tr "\0" "\n" | grep -v -e ^guides/ -e ^tools/ | tr "\n" "\0" | xargs -0 git rm --quiet --cached --ignore-unmatch' --tag-name-filter cat --prune-empty -- --all

[2b]
    git log --format=%n --name-only | sort | uniq | grep -v ^$ > all-files.txt
    time java -jar ~/Downloads/bfg-1.13.0.jar --delete-folders "{$(grep / all-files.txt | sed -e 's/"//' -e s%/.*%% | uniq | grep -v -e guides -e tools | tr '\n' ,)}" --delete-files "{$(comm -23 <(grep -v / all-files.txt) <(grep -e guides/ -e tools/ all-files.txt | sed -e s%.*/%% | sort) | tr '\n' ,)}"

[2c]
    time git filter-repo --path guides --path tools

[2d] filter-branch fails at the very end when noting which refs were deleted/rewritten with:
    error: cannot lock ref 'refs/tags/v0.10.0': is at b68b47672e613e94a7859c9549e9cd4b401f7b79 but expected e2724aa1856253f4fc48ddc251583042c5f06029
    Could not delete refs/tags/v0.10.0
Turns out b68b47672e613e94a7859c9549e9cd4b401f7b79 is an annotated tag in the original repo pointing to the commit e2724aa1856253f4fc48ddc251583042c5f06029. I do not know the cause of this bug, but since it was almost at the very end, I just reported the time used before it hit this error.

[2e] Unless I am misunderstanding, BFG is not capable of this filtering operation because it uses basenames for --delete-files and --delete-folders, and some names appear in several directories (e.g. .gitignore, Rakefile, tasks). As such, with the BFG you either have to delete files/directories that shouldn't be, or leave files and folders around that you wanted to have deleted. The command in [2b] has some of both, but should still give a good estimate of how long it would take BFG to do this kind of operation if file and directory basenames in the rails repository happened to be named uniquely.
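
The basename problem described in [2e] can be shown with a few lines of Python. The sketch below assumes deletion rules that match on file basename only, as [2e] describes; it is not BFG code, the function name is made up, and the sample paths are illustrative rather than an actual listing of rails.git. Directory-name collisions (such as `tasks`) would need the same check applied to directory basenames and are not handled here.

```python
import posixpath


def basename_conflicts(all_paths, keep_prefixes):
    """Return file basenames that occur both inside and outside the kept directories.

    Any basename in the result cannot be handled correctly by a rule that
    matches on basename alone: deleting it removes wanted files, and sparing
    it leaves unwanted files behind.
    """
    def is_kept(path):
        return any(path == p or path.startswith(p + "/") for p in keep_prefixes)

    kept_names = {posixpath.basename(p) for p in all_paths if is_kept(p)}
    dropped_names = {posixpath.basename(p) for p in all_paths if not is_kept(p)}
    return sorted(kept_names & dropped_names)


# Illustrative paths only.
paths = [
    "guides/Rakefile",
    "guides/.gitignore",
    "tools/README",
    "railties/Rakefile",
    "activerecord/.gitignore",
    "activerecord/lib/base.rb",
]
print(basename_conflicts(paths, ["guides", "tools"]))
```

With full-path matching, as in the `--path guides --path tools` invocation in [2c], no such ambiguity arises because the directory prefix is part of the rule.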

[3a]
    time git filter-branch -d /dev/shm/tmp --tree-filter 'git ls-files | xargs sed -i s/secretly/covertly/' --tag-name-filter cat -- --all

[3b]
    time java -jar ~/Downloads/bfg-1.13.0.jar --replace-text <(echo 'secretly==>covertly')

[3c]
    time git filter-repo --replace-text <(echo 'secretly==>covertly')

[3d] filter-branch failed after 45704 seconds, predicting another 8836429 seconds (~102 days) remaining at the time. As commits earlier in history tend to be smaller, filter-branch nearly always underestimates the time required, sometimes considerably. filter-branch failed on commit af25e94d4dcfb9608846242fabdd4e6014e5c9f0 due to an empty ident. I possibly could have worked around it with --env-filter, but it's not like I'm going to wait for it to finish anyway.

[4] Other notes about timings:
  * All tests were run on an 8-CPU system, except for the "BFG 40-core" tests, which were run on a 40-core system. (filter-branch and filter-repo are not multi-threaded and gain nothing from more cores.)
  * More precisely, I ran on AWS with an m4.2xlarge with two 50-GB GP2 volumes (150 IOPS) for tests. The 40-core system was an m4.10xlarge.
  * Before each command, to try to avoid warm disk caches helping or hurting depending on the order I ran commands in, I first ran:
      * rsync -az --delete ../$REPO-orig/ ./
      * git status
      * $TOOL -h
  * Testing was imperfect; I just ran once and recorded the time. It took long enough to gather the data as it was.
  * When additional commands were needed for the filtering (e.g. getting the all-files.txt list to generate the BFG command, or deleting tags that fast-export couldn't handle for filter-repo), I did not include the times of those commands in the overall execution time. It would have added a few hundredths of a second to filter-repo's git.git time, and about 5-6 seconds to BFG's rails.git time.
  * filter-repo self-reports time until filtering finishes and time until entirely done. I took the difference between its self-report of overall time and the "time" command's report of overall time (which was typically on the order of 0.1s), and added that to filter-repo's filtering time, assuming that most of the discrepancy would be due to python startup.

[5] Performance is only one measurement. Features, capabilities, usability, etc. matter too. filter-branch is a general purpose filtering tool, but in my opinion, not a good one -- and not just because of performance. BFG Repo Cleaner is a good tool, but it is special purpose; it is designed for a few particular use cases (limiting the kinds of things I could try in my comparison above). My hope is that filter-repo serves as a good general purpose filtering tool so that people can stop suffering from filter-branch.

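For readers who want the reported timings as ratios, the arithmetic can be sketched as below. The numbers are copied from testcases 1 and 2 and footnote [3d] of this message; the variable and function names are illustrative. Keep the caveats in mind: each measurement was a single run, filter-branch errored out at the end of testcase 2, and the BFG figures for testcase 2 are approximations of a different operation.

```python
# (filtering seconds, cleanup seconds) as reported in the message above.
TIMINGS = {
    "git.git, remove configure.ac": {
        "filter-branch": (2413.978, 34.812),
        "BFG (8-core)": (38.743, 30.333),
        "BFG (40-core)": (24.680, 35.165),
        "filter-repo": (35.582, 15.690),
    },
    "rails.git, keep guides/ and tools/": {
        "filter-branch": (14586.655, 22.726),
        "BFG (8-core)": (27.675, 15.786),
        "BFG (40-core)": (24.883, 20.463),
        "filter-repo": (10.951, 12.500),
    },
}

# Footnote [3d]: elapsed time at failure plus filter-branch's own estimate
# of the time remaining for the linux.git testcase.
LINUX_ELAPSED_SECONDS = 45704
LINUX_REMAINING_SECONDS = 8836429


def total_seconds(case):
    """Sum filtering and cleanup time for every tool in one testcase."""
    return {tool: filtering + cleanup for tool, (filtering, cleanup) in case.items()}


def speedup(case, baseline="filter-branch", tool="filter-repo"):
    """How many times faster `tool` was than `baseline`, end to end."""
    totals = total_seconds(case)
    return totals[baseline] / totals[tool]


def seconds_to_days(seconds):
    return seconds / 86400.0


for name, case in TIMINGS.items():
    print("%s: filter-repo is %.1fx faster than filter-branch"
          % (name, speedup(case)))

linux_days = seconds_to_days(LINUX_ELAPSED_SECONDS + LINUX_REMAINING_SECONDS)
print("linux.git filter-branch estimate: %.1f days" % linux_days)
```

The last figure is a lower bound at best, since [3d] notes that filter-branch tends to underestimate the remaining time.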
* Re: New command/tool: git filter-repo
From: Johannes Schindelin @ 2019-02-08 10:22 UTC
To: Elijah Newren; +Cc: Git Mailing List, Lars Schneider

Hi Elijah,

On Thu, 7 Feb 2019, Elijah Newren wrote:

> On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@gmail.com> wrote:
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback.  The rough
>
> Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
> about performance numbers on known repositories for filter-repo and
> how it compared to other tools; I gave extremely rough estimates, but
> here I belatedly provide some more detailed figures.  In each case, I
> report both filtering time, and cleanup (gc or clone) time[0]:
>
> [...]

Those are pretty good numbers right there.

> My hope is that filter-repo serves as a good general purpose filtering
> tool so that people can stop suffering from filter-branch.

I agree. `git filter-branch` was simply pulled out of Cogito when that project was declared discontinued, as the last nugget we could steal from the carcass (we stole such a lot from Cogito that we're probably the reason it died).

And when somebody started to rewrite it in C (calling it `git rewrite-commits` if memory serves), I asked to change it a little so it could be used as a drop-in replacement for `filter-branch`, and unfortunately stopped that effort in its tracks.

So it is really good for me to see you picking up the initiative and make that particular shell script obsolete.

Ciao,
Dscho

P.S.: No, I am not willing to even attempt to run filter-branch on Windows.

* Re: New command/tool: git filter-repo
From: Ævar Arnfjörð Bjarmason @ 2019-02-08 18:53 UTC
To: Elijah Newren; +Cc: Git Mailing List

On Thu, Jan 31 2019, Elijah Newren wrote:

> What's the future? (Core command of git.git? place it in contrib? keep it
> in a separate repo?) I'm hoping to discuss that at the contributor summit
> today, but feedback on the list is also welcome.

Some of this I may have mentioned at the summit, but here for the list:

 * I think it should be a candidate for a core (not "just contrib")
   git.git command, given that we have someone willing to maintain it &
   deal with bugs etc. I'm not worried about that given the author.

 * It's unfortunate, in terms of the API we need to support, that this
   obligates us to support a fairly intricate Python API going forward,
   so it's similar to (but more detailed than) Git.pm (which I also
   tried to get rid of as an external API a while ago).

   However, as you correctly note, that's the only way a command like
   this can be really fast; we already have the "no special API" command
   with git-filter-branch, and that's horribly slow.

   But perhaps there are ways we can deal in advance with a potential
   future breaking API change. E.g. some Pythonic way of versioning the
   API, or just prominently documenting whatever (low?) stability
   guarantees we're making.

   I imagine that if we need to make breaking changes in the future,
   that'll be less of a big deal than in other cases, since we'd expect
   the API use to be one-off migration scripts, although maybe it'll get
   used for all-the-time exports (e.g. mirroring internal->external
   repos with filtering).

 * The rest of our commands are hooked up to the i18n framework. I don't
   think this should be a blocker, but it's worth thinking about what
   the plan for this is.

   Are we going to need the equivalent of Git::I18N for Python (which
   presumably will be a run-time dependency on something needing the
   Python API that links to gettext)?

   Or perhaps we could do the translated strings in C, by making the
   program you're invoking be a C command, invoking the Python part as a
   helper (which would need to re-invoke a helper if it prints its own
   messages).

Thanks for working on this!
* Re: New command/tool: git filter-repo 2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason @ 2019-02-08 20:13 ` Johannes Schindelin 2019-02-11 16:00 ` Elijah Newren 2019-02-11 15:47 ` Elijah Newren 2019-06-08 16:20 ` Elijah Newren 2 siblings, 1 reply; 13+ messages in thread From: Johannes Schindelin @ 2019-02-08 20:13 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 691 bytes --] Hi Ævar, On Fri, 8 Feb 2019, Ævar Arnfjörð Bjarmason wrote: > [...] > > But perhaps there's ways we can in advance deal with a potential > future breaking API change. E.g. some Pythonic way of versioning the > API, or just prominently documenting whatever (low?) stability > guarantees we're making. Another thing to keep in mind: it being in Python prevents it from being distributed with Git for Windows. The Git for Windows installer already weighs way more than it used to (it used to be under 30MB, now it is 44MB), and I am simply not willing to increase the footprint dramatically just for one rarely used command. If only it were written as a built-in... Ciao, Dscho ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: New command/tool: git filter-repo 2019-02-08 20:13 ` Johannes Schindelin @ 2019-02-11 16:00 ` Elijah Newren 0 siblings, 0 replies; 13+ messages in thread From: Elijah Newren @ 2019-02-11 16:00 UTC (permalink / raw) To: Johannes Schindelin Cc: Ævar Arnfjörð Bjarmason, Git Mailing List Hi Dscho, On Fri, Feb 8, 2019 at 12:13 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Ævar, > > On Fri, 8 Feb 2019, Ævar Arnfjörð Bjarmason wrote: > > > [...] > > > > But perhaps there's ways we can in advance deal with a potential > > future breaking API change. E.g. some Pythonic way of versioning the > > API, or just prominently documenting whatever (low?) stability > > guarantees we're making. > > Another thing to keep in mind: it being in Python prevents it from being > distributed with Git for Windows. The Git for Windows installer already > weighs way more than it used to (it used to be under 30MB, now it is > 44MB), and I am simply not willing to increase the footprint dramatically > just for one rarely used command. That would be unfortunate, though understandable. I am curious, though: do you include and does anyone use filter-branch on windows? You mentioned elsewhere in this thread that you weren't even willing to attempt to run filter-branch there. If people aren't using filter-branch on windows, then there's nothing for me to save them from anyway. If they are, I'm curious to hear more about the usecases and motivations, even if the cost of my tool is too high for you to include. Also, since filter-branch and filter-repo are meant mostly as one-shot migration tools, it is already not uncommon for people to do it on a different machine (perhaps one with more RAM, or faster disks), and at most one person on the team needs to run it (sometimes folks even look to an "expert" outside the team to run the migration for them). Once migrated, they push the results back and are done with the tool. > If only it were written as a built-in... 
A built-in would be great, IF it could provide all the same capabilities and with at least the same speed. However, making it a built-in would fundamentally remove a significant chunk of its power and flexibility, which was part of the driving force for creating this tool. Elijah ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: New command/tool: git filter-repo 2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason 2019-02-08 20:13 ` Johannes Schindelin @ 2019-02-11 15:47 ` Elijah Newren 2019-06-08 16:20 ` Elijah Newren 2 siblings, 0 replies; 13+ messages in thread From: Elijah Newren @ 2019-02-11 15:47 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List Hi Ævar, On Fri, Feb 8, 2019 at 10:53 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > On Thu, Jan 31 2019, Elijah Newren wrote: > > > What's the future? (Core command of git.git? place it in contrib? keep it > > in a separate repo?) I'm hoping to discuss that at the contributor summit > > today, but feedback on the list is also welcome. > > Some of this I may have mentioned at the summit, but here for the list: > > * I think it should be a candidate for a core (not "just contrib") > git.git command, given that we have someone willing to maintain it & > deal with bugs etc. I'm not worried about that given the author. > > * It's unfortunate in terms of API we need to support going forward that > this obligates us to support a fairly intricate python API going > forward, so it's similar (but more detailed) to Git.pm (which I also > tried to get rid of as an external API a while ago). > > However, as you correctly note that's the only way a command like this > can be really fast, we already have the "no special API" command with > git-filter-branch, and that's horribly slow. > > But perhaps there's ways we can in advance deal with a potential > future breaking API change. E.g. some Pythonic way of versioning the > API, or just prominently documenting whatever (low?) stability > guarantees we're making. > > I imagine if we need to make breaking changes in the future that'll > less big of a deal than in other cases, since we'd expect the API use > to be one-off migration scripts, although maybe it'll get used for > all-the-time exports (e.g. mirroring internal->external repos with > filtering). 
> > * The rest of our commands are hooked up to the i18n framework. I don't > think this should be a blocker, but it's worth thinking about what the > plan for this is. > > Are we going to need the equivalent of Git::I18N for Python (which > presumably will be a run-time dependency on something needing the > Python API that links to gettext). > > Or perhaps we could do the translated strings in C, by making the > program you're invoking be a C command, invoking the Python part as a > helper (which would need to re-invoke a helper if it prints its own > messages). > > Thanks for working on this! Good points. I'll dig in to the i18n story. As you point out, the API stability may be tricky, but you may be right that we just need to prominently document whatever guarantee we want to make and that it's designed more for one-off migration scripts than continuing exports. ^ permalink raw reply [flat|nested] 13+ messages in thread
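One possible reading of the "Pythonic way of versioning the API" suggestion is an explicit version tuple that one-off migration scripts check before doing any work. This is a hypothetical sketch: `API_VERSION`, `IncompatibleAPI`, and `require_api` are invented names and do not exist in git-filter-repo.

```python
# Hypothetical library-side declaration: (major, minor).
# A major bump signals a breaking change; a minor bump only adds features.
API_VERSION = (1, 2)


class IncompatibleAPI(RuntimeError):
    """Raised when a script expects an API the library does not provide."""


def require_api(major, minor=0):
    """Fail early if the library's API cannot satisfy the script's needs."""
    have_major, have_minor = API_VERSION
    if have_major != major or have_minor < minor:
        raise IncompatibleAPI(
            "script needs API %d.%d, library provides %d.%d"
            % (major, minor, have_major, have_minor)
        )


# A migration script written against API 1.0 would start with:
require_api(1, 0)
```

Failing at startup rather than midway through a rewrite matters for a history-rewriting tool, since a partial run can leave a repository in a confusing state.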
* Re: New command/tool: git filter-repo 2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason 2019-02-08 20:13 ` Johannes Schindelin 2019-02-11 15:47 ` Elijah Newren @ 2019-06-08 16:20 ` Elijah Newren 2 siblings, 0 replies; 13+ messages in thread From: Elijah Newren @ 2019-06-08 16:20 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List Hi, Now that there's a released version of git that has all necessary flags and features[1] to run git filter-repo (https://github.com/newren/git-filter-repo), I thought I'd send an update... On Fri, Feb 8, 2019 at 10:53 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > On Thu, Jan 31 2019, Elijah Newren wrote: > > > What's the future? (Core command of git.git? place it in contrib? keep it > > in a separate repo?) I'm hoping to discuss that at the contributor summit > > today, but feedback on the list is also welcome. > > Some of this I may have mentioned at the summit, but here for the list: > > * I think it should be a candidate for a core (not "just contrib") > git.git command, given that we have someone willing to maintain it & > deal with bugs etc. I'm not worried about that given the author. > > * It's unfortunate in terms of API we need to support going forward that > this obligates us to support a fairly intricate python API going > forward, so it's similar (but more detailed) to Git.pm (which I also > tried to get rid of as an external API a while ago). > > However, as you correctly note that's the only way a command like this > can be really fast, we already have the "no special API" command with > git-filter-branch, and that's horribly slow. > > But perhaps there's ways we can in advance deal with a potential > future breaking API change. E.g. some Pythonic way of versioning the > API, or just prominently documenting whatever (low?) stability > guarantees we're making. 
>
> I imagine if we need to make breaking changes in the future that'll
> be less of a big deal than in other cases, since we'd expect the API use
> to be one-off migration scripts, although maybe it'll get used for
> all-the-time exports (e.g. mirroring internal->external repos with
> filtering).
>
> * The rest of our commands are hooked up to the i18n framework. I don't
> think this should be a blocker, but it's worth thinking about what the
> plan for this is.
>
> Are we going to need the equivalent of Git::I18N for Python (which
> presumably will be a run-time dependency on something needing the
> Python API that links to gettext).
>
> Or perhaps we could do the translated strings in C, by making the
> program you're invoking be a C command, invoking the Python part as a
> helper (which would need to re-invoke a helper if it prints its own
> messages).
>
> Thanks for working on this!

I've implemented these, and several other things too. Changes since
last time:

  * Now i18n-ized
  * Several disclaimers about API backcompat (this is more of a
    one-shot conversion tool) [2]
  * Converted to Python3 (Python2 is EOL at EOY)
  * Pruning of become-empty and become-degenerate+empty commits has
    been fixed up (I mentioned this as a concern last time)
  * Testsuite has been fleshed out, including not only multiple small
    fixes to filter-repo, but more fixes to git itself[3]
  * Usage, Examples, Internals, and Limitations documentation now
    exists (in README.md format and built-in -h help; no manpage yet)
  * Several new filters and abilities have been added

Now that filter-repo is complete and more easily tested, what are folks'
thoughts on incorporating it in git.git? (And if we do go that route,
can we avoid losing its history so I can bisect issues if necessary?)

Thanks,
Elijah

[1] Well, except people will get an error if they use
--preserve-commit-encoding saying they don't have a new enough git,
since en/fast-export-encoding is still sitting in next and didn't make
it into git-2.22. But I only know of one repo that even uses special
commit encodings, so I suspect few if any will even care about that
flag.

[2] The warnings appear in several places to try to make sure people
notice them; a not quite complete list:
  https://github.com/newren/git-filter-repo/blame/master/README.md#L620-L625
  https://github.com/newren/git-filter-repo/blame/master/README.md#L727-L730
  https://github.com/newren/git-filter-repo/blame/master/README.md#L989-L993
  https://github.com/newren/git-filter-repo/blob/master/git-filter-repo#L13-L30
  https://github.com/newren/git-filter-repo/blob/master/git-filter-repo#L1523,L1524
  https://github.com/newren/git-filter-repo/blob/master/t/t9391/commit_info.py#L4-L6
  https://github.com/newren/git-filter-repo/blob/master/t/t9391/create_fast_export_output.py#L4-L6
  https://github.com/newren/git-filter-repo/blob/master/t/t9391/file_filter.py#L4-L6
  https://github.com/newren/git-filter-repo/blob/master/t/t9391/splice_repos.py#L4-L6
  https://github.com/newren/git-filter-repo/blob/master/t/t9391/strip-cvs-keywords.py#L4-L6

[3] https://github.com/newren/git-filter-repo/tree/develop#upstream-improvements