* [PATCH v2 1/4] t6006: simplify and optimize empty message test
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-08-28 0:22 ` Elijah Newren
2019-08-28 0:22 ` [PATCH v2 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
` (4 subsequent siblings)
5 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-28 0:22 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Elijah Newren
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
the commit messages to be empty. This test was written this way because
the --allow-empty-message option to git commit did not exist at the
time. Simplify this test and avoid the need to invoke filter-branch by
just using --allow-empty-message when creating the commit.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
./t6006-rev-list-format.sh`) by about 11% on Linux and by 13% on
Mac.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t6006-rev-list-format.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index da113d975b..d30e41c9f7 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
'
test_expect_success 'oneline with empty message' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
+ git commit --allow-empty --allow-empty-message &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
git rev-list --oneline --graph HEAD >testg.txt &&
--
2.23.0.3.gcc10030edf.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v2 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-28 0:22 ` [PATCH v2 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-08-28 0:22 ` Elijah Newren
2019-08-28 6:00 ` Eric Sunshine
2019-08-28 0:22 ` [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
` (3 subsequent siblings)
5 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-08-28 0:22 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Elijah Newren
fast-export and fast-import can easily handle the simple rewrite that
was being done by filter-branch, and should be significantly faster on
systems with a slow fork. Timings from before and after on two laptops
that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
i.e. including everything in this test -- not just the filter-branch or
fast-export/fast-import pair):
Linux: 4.305s -> 3.684s (~17% speedup)
Mac: 10.128s -> 7.038s (~30% speedup)
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t3427-rebase-subtree.sh | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/t/t3427-rebase-subtree.sh b/t/t3427-rebase-subtree.sh
index d8640522a0..943ae92226 100755
--- a/t/t3427-rebase-subtree.sh
+++ b/t/t3427-rebase-subtree.sh
@@ -11,6 +11,12 @@ commit_message() {
git log --pretty=format:%s -1 "$1"
}
+extract_files_subtree() {
+ git fast-export --no-data HEAD -- files_subtree/ \
+ | sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" \
+ | git fast-import --force --quiet
+}
+
test_expect_success 'setup' '
test_commit README &&
mkdir files &&
@@ -42,7 +48,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-preserve-merges-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master4"
@@ -53,7 +59,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-preserve-merges-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "files_subtree/master5"
@@ -64,7 +70,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-keep-empty-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -75,7 +81,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-keep-empty-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -86,7 +92,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto empty commit' '
reset_rebase &&
git checkout -b rebase-keep-empty-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
@@ -96,7 +102,7 @@ test_expect_failure REBASE_P \
test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
reset_rebase &&
git checkout -b rebase-onto-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -106,7 +112,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
reset_rebase &&
git checkout -b rebase-onto-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -115,7 +121,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
test_expect_failure 'Rebase -Xsubtree --onto empty commit' '
reset_rebase &&
git checkout -b rebase-onto-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
--
2.23.0.3.gcc10030edf.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v2 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-08-28 0:22 ` [PATCH v2 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-08-28 6:00 ` Eric Sunshine
0 siblings, 0 replies; 73+ messages in thread
From: Eric Sunshine @ 2019-08-28 6:00 UTC (permalink / raw)
To: Elijah Newren
Cc: Git List, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder
On Tue, Aug 27, 2019 at 8:22 PM Elijah Newren <newren@gmail.com> wrote:
> fast-export and fast-import can easily handle the simple rewrite that
> was being done by filter-branch, and should be significantly faster on
> systems with a slow fork. Timings from before and after on two laptops
> that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
> i.e. including everything in this test -- not just the filter-branch or
> fast-export/fast-import pair):
> [...]
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/t/t3427-rebase-subtree.sh b/t/t3427-rebase-subtree.sh
> @@ -11,6 +11,12 @@ commit_message() {
> +extract_files_subtree() {
Style nit: add space before opening '('
(However, commit_message() function just above this doesn't follow
that style, so...)
> + git fast-export --no-data HEAD -- files_subtree/ \
> + | sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" \
> + | git fast-import --force --quiet
This would be a bit less noisy if you ended each line with the pipe
operator, allowing you to drop the backslashes:
git fast-export --no-data HEAD -- files_subtree/ |
sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
git fast-import --force --quiet
> +}
Not sure any of this is worth a re-roll.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-28 0:22 ` [PATCH v2 1/4] t6006: simplify and optimize empty message test Elijah Newren
2019-08-28 0:22 ` [PATCH v2 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-08-28 0:22 ` Elijah Newren
2019-08-28 6:17 ` Eric Sunshine
2019-08-28 0:22 ` [RFC PATCH v2 4/4] Remove git-filter-branch, it is now external to git.git Elijah Newren
` (2 subsequent siblings)
5 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-08-28 0:22 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Elijah Newren
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.
Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix
these issues and to avoid recommending filter-branch.
Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
* In order to recommend that people quit using filter-branch, we need
to provide them a recomendation for something else to use that
can handle all the same types of rewrites. To my knowledge,
filter-repo is the only such tool. So it needs to be mentioned.
* I don't want to give conflicting recommendations to users
* If we recommend two tools, we shouldn't expect users to learn both
and pick which one to use; we should explain which problems one
can solve that the other can't or when one is much faster than
the other.
* BFG and filter-repo have similar performance
* All filtering types that BFG can do, filter-repo can also do. In
fact, filter-repo comes with a reimplementation of BFG named
bfg-ish which provides the same user-interface as BFG but with
several bugfixes and new features that are hard to implement in
BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
Documentation/git-fast-export.txt | 6 ++--
Documentation/git-filter-branch.txt | 45 +++++++++--------------------
Documentation/git-gc.txt | 17 +++++------
Documentation/git-rebase.txt | 2 +-
Documentation/git-replace.txt | 10 +++----
Documentation/git-svn.txt | 4 +--
Documentation/githooks.txt | 7 +++--
contrib/svn-fe/svn-fe.txt | 4 +--
git-filter-branch.sh | 13 +++++++++
9 files changed, 52 insertions(+), 56 deletions(-)
mode change 100755 => 100644 git-filter-branch.sh
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
into 'git fast-import'.
You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
OPTIONS
-------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..e4047d472e 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,20 @@ SYNOPSIS
[--original <namespace>] [-d <directory>] [-f | --force]
[--state-branch <branch>] [--] [<rev-list options>...]
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
+carefully read the "Safety" section of the message on the Git mailing list
+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
+the land mines of filter-branch] and vigilantly avoid as many of the
+hazards listed there as reasonably possible.
+
DESCRIPTION
-----------
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,37 +459,6 @@ warned.
(or if your git-gc is not new enough to support arguments to
`--prune`, use `git repack -ad; git prune` instead).
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
- unlike git-filter-branch, does not give you the opportunity to
- handle a file differently based on where or when it was committed
- within your history. This constraint gives the core performance
- benefit of The BFG, and is well-suited to the task of cleansing bad
- data - you don't care _where_ the bad data is, you just want it
- _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
- cleansing commit file-trees in parallel. git-filter-branch cleans
- commits sequentially (i.e. in a single-threaded manner), though it
- _is_ possible to write filters that include their own parallelism,
- in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
-
GIT
---
Part of the linkgit:git[1] suite
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@ NOTES
-----
'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace. If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..2f201d85d4 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,7 @@ Hard case: The changes are not the same.::
This happens if the 'subsystem' rebase had conflicts, or used
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
+ a full history rewriting command like `filter-repo`.
The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..35595a2cd3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@ The following format are available:
CREATING REPLACEMENT OBJECTS
----------------------------
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+linkgit:git-filter-repo[1], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
If you want to replace many blobs, trees or commits that are part of a
@@ -148,8 +148,8 @@ pending objects.
SEE ALSO
--------
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
+linkgit:git-filter-repo[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..f2762dd5d4 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,9 +769,9 @@ option for (hopefully) obvious reasons.
+
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
+reports, and archives. If you plan to eventually migrate from SVN to Git
and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
+linkgit:git-filter-repo[1] instead. filter-repo also allows
reformatting of metadata for ease-of-reading and rewriting authorship
info for non-"svn.authorsFile" users.
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..997548f5ed 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,9 +425,10 @@ post-rewrite
This hook is invoked by commands that rewrite commits
(linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or linkgit:git-filter-repo[1] typically do
+not call it!). Its first argument denotes the command it was invoked
+by: currently one of `amend` or `rebase`. Further command-dependent
arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@ line. This line has the form `git-svn-id: URL@REVNO UUID`.
The resulting repository will generally require further processing
to put each project in its own repository and to separate the history
-of each branch. The 'git filter-branch --subdirectory-filter' command
+of each branch. The 'git filter-repo --subdirectory-filter' command
may be useful for this purpose.
BUGS
@@ -67,5 +67,5 @@ The exit status does not reflect whether an error was detected.
SEE ALSO
--------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
old mode 100755
new mode 100644
index 5c5afa2b98..7b1865c1d5
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,19 @@ set_ident () {
finish_ident COMMITTER
}
+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
+ -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+ rewrites. Please use an alternative filtering tool such as 'git
+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+ See the filter-branch manual page for more details; to squelch
+ this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+ sleep 5
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
[--tree-filter <command>] [--index-filter <command>]
[--parent-filter <command>] [--msg-filter <command>]
--
2.23.0.3.gcc10030edf.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-28 0:22 ` [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-08-28 6:17 ` Eric Sunshine
2019-08-28 21:48 ` Elijah Newren
0 siblings, 1 reply; 73+ messages in thread
From: Eric Sunshine @ 2019-08-28 6:17 UTC (permalink / raw)
To: Elijah Newren
Cc: Git List, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder
On Tue, Aug 27, 2019 at 8:22 PM Elijah Newren <newren@gmail.com> wrote:
> filter-branch suffers from a deluge of disguised dangers that disfigure
> history rewrites (i.e. deviate from the deliberate changes). [...]
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> @@ -16,6 +16,20 @@ SYNOPSIS
> +WARNING
> +-------
> +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> +manglings of the intended history rewrite (and can leave you with little
> +time to investigate such problems since it has such abysmal performance).
> +These safety and performance issues cannot be backward compatibly fixed and
> +as such, its use is not recommended. Please use an alternative history
> +filtering tool such as https://github.com/newren/git-filter-repo/[git
> +filter-repo]. If you still need to use 'git filter-branch', please
> +carefully read the "Safety" section of the message on the Git mailing list
> +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> +the land mines of filter-branch] and vigilantly avoid as many of the
> +hazards listed there as reasonably possible.
Is there a good reason to not simply copy the "Safety" section from
that email directly into this document so that readers don't have to
go chasing down the link (especially those who are reading
documentation offline)?
> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> @@ -832,7 +832,7 @@ Hard case: The changes are not the same.::
> This happens if the 'subsystem' rebase had conflicts, or used
> `--interactive` to omit, edit, squash, or fixup commits; or
> if the upstream used one of `commit --amend`, `reset`, or
> - `filter-branch`.
> + a full history rewriting command like `filter-repo`.
Do we want a clickable link to `filter-repo` here?
> diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
> @@ -123,10 +123,10 @@ The following format are available:
> +linkgit:git-hash-object[1], linkgit:git-rebase[1], and
> +linkgit:git-filter-repo[1], among other git commands, can be used to
> [...]
> @@ -148,8 +148,8 @@ pending objects.
> linkgit:git-hash-object[1]
> linkgit:git-rebase[1]
> +linkgit:git-filter-repo[1]
Are these 'linkgit:' references to `filter-repo` going to be
meaningful if that tool is not incorporated into the Git project
proper? Perhaps use a generic clickable link instead.
Same comment applies to other 'linkgit:' invocations in the remainder
of the patch.
> diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> old mode 100755
> new mode 100644
Why lose the executable bit?
> @@ -83,6 +83,19 @@ set_ident () {
> +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> + -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
If this script didn't already have a mix of styles, I'd say something
about modern style being:
if test -z "$FILTER_BRANCH_SQUELCH_WARNING" &&
test -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
then
...
fi
> + cat <<EOF
> +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> + rewrites. Please use an alternative filtering tool such as 'git
> + filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> + See the filter-branch manual page for more details; to squelch
> + this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
The "and pause" threw me. There's more than a bit of ambiguity
surrounding it. Perhaps drop it?
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-28 6:17 ` Eric Sunshine
@ 2019-08-28 21:48 ` Elijah Newren
0 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-28 21:48 UTC (permalink / raw)
To: Eric Sunshine
Cc: Git List, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder
On Tue, Aug 27, 2019 at 11:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Tue, Aug 27, 2019 at 8:22 PM Elijah Newren <newren@gmail.com> wrote:
> > filter-branch suffers from a deluge of disguised dangers that disfigure
> > history rewrites (i.e. deviate from the deliberate changes). [...]
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> > diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> > @@ -16,6 +16,20 @@ SYNOPSIS
> > +WARNING
> > +-------
> > +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> > +manglings of the intended history rewrite (and can leave you with little
> > +time to investigate such problems since it has such abysmal performance).
> > +These safety and performance issues cannot be backward compatibly fixed and
> > +as such, its use is not recommended. Please use an alternative history
> > +filtering tool such as https://github.com/newren/git-filter-repo/[git
> > +filter-repo]. If you still need to use 'git filter-branch', please
> > +carefully read the "Safety" section of the message on the Git mailing list
> > +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> > +the land mines of filter-branch] and vigilantly avoid as many of the
> > +hazards listed there as reasonably possible.
>
> Is there a good reason to not simply copy the "Safety" section from
> that email directly into this document so that readers don't have to
> go chasing down the link (especially those who are reading
> documentation offline)?
Makes sense, I can include it. However, saying e.g. "the
git-filter-branch manpage is missing..." or "the git-filter-branch
manpage actually documents <crazy buggy behavior> as expected" feels
really weird to include on the git-filter-branch manpage. I'll try to
touch it up.
> > diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> > @@ -832,7 +832,7 @@ Hard case: The changes are not the same.::
> > This happens if the 'subsystem' rebase had conflicts, or used
> > `--interactive` to omit, edit, squash, or fixup commits; or
> > if the upstream used one of `commit --amend`, `reset`, or
> > - `filter-branch`.
> > + a full history rewriting command like `filter-repo`.
>
> Do we want a clickable link to `filter-repo` here?
I guess it can't hurt.
> > diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
> > @@ -123,10 +123,10 @@ The following format are available:
> > +linkgit:git-hash-object[1], linkgit:git-rebase[1], and
> > +linkgit:git-filter-repo[1], among other git commands, can be used to
> > [...]
> > @@ -148,8 +148,8 @@ pending objects.
> > linkgit:git-hash-object[1]
> > linkgit:git-rebase[1]
> > +linkgit:git-filter-repo[1]
>
> Are these 'linkgit:' references to `filter-repo` going to be
> meaningful if that tool is not incorporated into the Git project
> proper? Perhaps use a generic clickable link instead.
>
> Same comment applies to other 'linkgit:' invocations in the remainder
> of the patch.
I'm fixing them up.
> > diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> > old mode 100755
> > new mode 100644
>
> Why lose the executable bit?
Whoops. Did some rebasing and fixups, then continued editing my
buffer of the file after one of the rebases, realized the file was
deleted (because of the final patch in the series), moved the file out
of the way and rebased again and copied the file back into place, and
forgot to check the filemode.
> > @@ -83,6 +83,19 @@ set_ident () {
> > +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> > + -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
>
> If this script didn't already have a mix of styles, I'd say something
> about modern style being:
>
> if test -z "$FILTER_BRANCH_SQUELCH_WARNING" &&
> test -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
> then
> ...
> fi
>
> > + cat <<EOF
> > +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> > + rewrites. Please use an alternative filtering tool such as 'git
> > + filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> > + See the filter-branch manual page for more details; to squelch
> > + this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
>
> The "and pause" threw me. There's more than a bit of ambiguity
> surrounding it. Perhaps drop it?
Sure, will do.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [RFC PATCH v2 4/4] Remove git-filter-branch, it is now external to git.git
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (2 preceding siblings ...)
2019-08-28 0:22 ` [PATCH v2 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-08-28 0:22 ` Elijah Newren
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
5 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-28 0:22 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Elijah Newren
git-filter-branch still exists, still has the same regression tests,
etc., but it is now being tracked in a separate repo that users will
need to download separately.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
.gitignore | 1 -
Documentation/git-filter-branch.txt | 464 -------------------
Makefile | 1 -
command-list.txt | 1 -
git-filter-branch.sh | 675 ----------------------------
t/perf/p7000-filter-branch.sh | 24 -
t/t7003-filter-branch.sh | 505 ---------------------
t/t7009-filter-branch-null-sha1.sh | 55 ---
t/t9902-completion.sh | 12 +-
9 files changed, 6 insertions(+), 1732 deletions(-)
delete mode 100644 Documentation/git-filter-branch.txt
delete mode 100644 git-filter-branch.sh
delete mode 100755 t/perf/p7000-filter-branch.sh
delete mode 100755 t/t7003-filter-branch.sh
delete mode 100755 t/t7009-filter-branch-null-sha1.sh
diff --git a/.gitignore b/.gitignore
index 521d8f4fb4..97f5d8afea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -63,7 +63,6 @@
/git-fast-import
/git-fetch
/git-fetch-pack
-/git-filter-branch
/git-fmt-merge-msg
/git-for-each-ref
/git-format-patch
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
deleted file mode 100644
index e4047d472e..0000000000
--- a/Documentation/git-filter-branch.txt
+++ /dev/null
@@ -1,464 +0,0 @@
-git-filter-branch(1)
-====================
-
-NAME
-----
-git-filter-branch - Rewrite branches
-
-SYNOPSIS
---------
-[verse]
-'git filter-branch' [--setup <command>] [--subdirectory-filter <directory>]
- [--env-filter <command>] [--tree-filter <command>]
- [--index-filter <command>] [--parent-filter <command>]
- [--msg-filter <command>] [--commit-filter <command>]
- [--tag-name-filter <command>] [--prune-empty]
- [--original <namespace>] [-d <directory>] [-f | --force]
- [--state-branch <branch>] [--] [<rev-list options>...]
-
-WARNING
--------
-'git filter-branch' has a plethora of pitfalls that can produce non-obvious
-manglings of the intended history rewrite (and can leave you with little
-time to investigate such problems since it has such abysmal performance).
-These safety and performance issues cannot be backward compatibly fixed and
-as such, its use is not recommended. Please use an alternative history
-filtering tool such as https://github.com/newren/git-filter-repo/[git
-filter-repo]. If you still need to use 'git filter-branch', please
-carefully read the "Safety" section of the message on the Git mailing list
-https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
-the land mines of filter-branch] and vigilantly avoid as many of the
-hazards listed there as reasonably possible.
-
-DESCRIPTION
------------
-Lets you rewrite Git revision history by rewriting the branches mentioned
-in the <rev-list options>, applying custom filters on each revision.
-Those filters can modify each tree (e.g. removing a file or running
-a perl rewrite on all files) or information about each commit.
-Otherwise, all information (including original commit times or merge
-information) will be preserved.
-
-The command will only rewrite the _positive_ refs mentioned in the
-command line (e.g. if you pass 'a..b', only 'b' will be rewritten).
-If you specify no filters, the commits will be recommitted without any
-changes, which would normally have no effect. Nevertheless, this may be
-useful in the future for compensating for some Git bugs or such,
-therefore such a usage is permitted.
-
-*NOTE*: This command honors `.git/info/grafts` file and refs in
-the `refs/replace/` namespace.
-If you have any grafts or replacement refs defined, running this command
-will make them permanent.
-
-*WARNING*! The rewritten history will have different object names for all
-the objects and will not converge with the original branch. You will not
-be able to easily push and distribute the rewritten branch on top of the
-original branch. Please do not use this command if you do not know the
-full implications, and avoid using it anyway, if a simple single commit
-would suffice to fix your problem. (See the "RECOVERING FROM UPSTREAM
-REBASE" section in linkgit:git-rebase[1] for further information about
-rewriting published history.)
-
-Always verify that the rewritten version is correct: The original refs,
-if different from the rewritten ones, will be stored in the namespace
-'refs/original/'.
-
-Note that since this operation is very I/O expensive, it might
-be a good idea to redirect the temporary directory off-disk with the
-`-d` option, e.g. on tmpfs. Reportedly the speedup is very noticeable.
-
-
-Filters
-~~~~~~~
-
-The filters are applied in the order as listed below. The <command>
-argument is always evaluated in the shell context using the 'eval' command
-(with the notable exception of the commit filter, for technical reasons).
-Prior to that, the `$GIT_COMMIT` environment variable will be set to contain
-the id of the commit being rewritten. Also, GIT_AUTHOR_NAME,
-GIT_AUTHOR_EMAIL, GIT_AUTHOR_DATE, GIT_COMMITTER_NAME, GIT_COMMITTER_EMAIL,
-and GIT_COMMITTER_DATE are taken from the current commit and exported to
-the environment, in order to affect the author and committer identities of
-the replacement commit created by linkgit:git-commit-tree[1] after the
-filters have run.
-
-If any evaluation of <command> returns a non-zero exit status, the whole
-operation will be aborted.
-
-A 'map' function is available that takes an "original sha1 id" argument
-and outputs a "rewritten sha1 id" if the commit has been already
-rewritten, and "original sha1 id" otherwise; the 'map' function can
-return several ids on separate lines if your commit filter emitted
-multiple commits.
-
-
-OPTIONS
--------
-
---setup <command>::
- This is not a real filter executed for each commit but a one
- time setup just before the loop. Therefore no commit-specific
- variables are defined yet. Functions or variables defined here
- can be used or modified in the following filter steps except
- the commit filter, for technical reasons.
-
---subdirectory-filter <directory>::
- Only look at the history which touches the given subdirectory.
- The result will contain that directory (and only that) as its
- project root. Implies <<Remap_to_ancestor>>.
-
---env-filter <command>::
- This filter may be used if you only need to modify the environment
- in which the commit will be performed. Specifically, you might
- want to rewrite the author/committer name/email/time environment
- variables (see linkgit:git-commit-tree[1] for details).
-
---tree-filter <command>::
- This is the filter for rewriting the tree and its contents.
- The argument is evaluated in shell with the working
- directory set to the root of the checked out tree. The new tree
- is then used as-is (new files are auto-added, disappeared files
- are auto-removed - neither .gitignore files nor any other ignore
- rules *HAVE ANY EFFECT*!).
-
---index-filter <command>::
- This is the filter for rewriting the index. It is similar to the
- tree filter but does not check out the tree, which makes it much
- faster. Frequently used with `git rm --cached
- --ignore-unmatch ...`, see EXAMPLES below. For hairy
- cases, see linkgit:git-update-index[1].
-
---parent-filter <command>::
- This is the filter for rewriting the commit's parent list.
- It will receive the parent string on stdin and shall output
- the new parent string on stdout. The parent string is in
- the format described in linkgit:git-commit-tree[1]: empty for
- the initial commit, "-p parent" for a normal commit and
- "-p parent1 -p parent2 -p parent3 ..." for a merge commit.
-
---msg-filter <command>::
- This is the filter for rewriting the commit messages.
- The argument is evaluated in the shell with the original
- commit message on standard input; its standard output is
- used as the new commit message.
-
---commit-filter <command>::
- This is the filter for performing the commit.
- If this filter is specified, it will be called instead of the
- 'git commit-tree' command, with arguments of the form
- "<TREE_ID> [(-p <PARENT_COMMIT_ID>)...]" and the log message on
- stdin. The commit id is expected on stdout.
-+
-As a special extension, the commit filter may emit multiple
-commit ids; in that case, the rewritten children of the original commit will
-have all of them as parents.
-+
-You can use the 'map' convenience function in this filter, and other
-convenience functions, too. For example, calling 'skip_commit "$@"'
-will leave out the current commit (but not its changes! If you want
-that, use 'git rebase' instead).
-+
-You can also use the `git_commit_non_empty_tree "$@"` instead of
-`git commit-tree "$@"` if you don't wish to keep commits with a single parent
-and that makes no change to the tree.
-
---tag-name-filter <command>::
- This is the filter for rewriting tag names. When passed,
- it will be called for every tag ref that points to a rewritten
- object (or to a tag object which points to a rewritten object).
- The original tag name is passed via standard input, and the new
- tag name is expected on standard output.
-+
-The original tags are not deleted, but can be overwritten;
-use "--tag-name-filter cat" to simply update the tags. In this
-case, be very careful and make sure you have the old tags
-backed up in case the conversion has run afoul.
-+
-Nearly proper rewriting of tag objects is supported. If the tag has
-a message attached, a new tag object will be created with the same message,
-author, and timestamp. If the tag has a signature attached, the
-signature will be stripped. It is by definition impossible to preserve
-signatures. The reason this is "nearly" proper, is because ideally if
-the tag did not change (points to the same object, has the same name, etc.)
-it should retain any signature. That is not the case, signatures will always
-be removed, buyer beware. There is also no support for changing the
-author or timestamp (or the tag message for that matter). Tags which point
-to other tags will be rewritten to point to the underlying commit.
-
---prune-empty::
- Some filters will generate empty commits that leave the tree untouched.
- This option instructs git-filter-branch to remove such commits if they
- have exactly one or zero non-pruned parents; merge commits will
- therefore remain intact. This option cannot be used together with
- `--commit-filter`, though the same effect can be achieved by using the
- provided `git_commit_non_empty_tree` function in a commit filter.
-
---original <namespace>::
- Use this option to set the namespace where the original commits
- will be stored. The default value is 'refs/original'.
-
--d <directory>::
- Use this option to set the path to the temporary directory used for
- rewriting. When applying a tree filter, the command needs to
- temporarily check out the tree to some directory, which may consume
- considerable space in case of large projects. By default it
- does this in the `.git-rewrite/` directory but you can override
- that choice by this parameter.
-
--f::
---force::
- 'git filter-branch' refuses to start with an existing temporary
- directory or when there are already refs starting with
- 'refs/original/', unless forced.
-
---state-branch <branch>::
- This option will cause the mapping from old to new objects to
- be loaded from named branch upon startup and saved as a new
- commit to that branch upon exit, enabling incremental of large
- trees. If '<branch>' does not exist it will be created.
-
-<rev-list options>...::
- Arguments for 'git rev-list'. All positive refs included by
- these options are rewritten. You may also specify options
- such as `--all`, but you must use `--` to separate them from
- the 'git filter-branch' options. Implies <<Remap_to_ancestor>>.
-
-
-[[Remap_to_ancestor]]
-Remap to ancestor
-~~~~~~~~~~~~~~~~~
-
-By using linkgit:git-rev-list[1] arguments, e.g., path limiters, you can limit the
-set of revisions which get rewritten. However, positive refs on the command
-line are distinguished: we don't let them be excluded by such limiters. For
-this purpose, they are instead rewritten to point at the nearest ancestor that
-was not excluded.
-
-
-EXIT STATUS
------------
-
-On success, the exit status is `0`. If the filter can't find any commits to
-rewrite, the exit status is `2`. On any other error, the exit status may be
-any other non-zero value.
-
-
-EXAMPLES
---------
-
-Suppose you want to remove a file (containing confidential information
-or copyright violation) from all commits:
-
--------------------------------------------------------
-git filter-branch --tree-filter 'rm filename' HEAD
--------------------------------------------------------
-
-However, if the file is absent from the tree of some commit,
-a simple `rm filename` will fail for that tree and commit.
-Thus you may instead want to use `rm -f filename` as the script.
-
-Using `--index-filter` with 'git rm' yields a significantly faster
-version. Like with using `rm filename`, `git rm --cached filename`
-will fail if the file is absent from the tree of a commit. If you
-want to "completely forget" a file, it does not matter when it entered
-history, so we also add `--ignore-unmatch`:
-
---------------------------------------------------------------------------
-git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD
---------------------------------------------------------------------------
-
-Now, you will get the rewritten history saved in HEAD.
-
-To rewrite the repository to look as if `foodir/` had been its project
-root, and discard all other history:
-
--------------------------------------------------------
-git filter-branch --subdirectory-filter foodir -- --all
--------------------------------------------------------
-
-Thus you can, e.g., turn a library subdirectory into a repository of
-its own. Note the `--` that separates 'filter-branch' options from
-revision options, and the `--all` to rewrite all branches and tags.
-
-To set a commit (which typically is at the tip of another
-history) to be the parent of the current initial commit, in
-order to paste the other history behind the current history:
-
--------------------------------------------------------------------
-git filter-branch --parent-filter 'sed "s/^\$/-p <graft-id>/"' HEAD
--------------------------------------------------------------------
-
-(if the parent string is empty - which happens when we are dealing with
-the initial commit - add graftcommit as a parent). Note that this assumes
-history with a single root (that is, no merge without common ancestors
-happened). If this is not the case, use:
-
---------------------------------------------------------------------------
-git filter-branch --parent-filter \
- 'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD
---------------------------------------------------------------------------
-
-or even simpler:
-
------------------------------------------------
-git replace --graft $commit-id $graft-id
-git filter-branch $graft-id..HEAD
------------------------------------------------
-
-To remove commits authored by "Darl McBribe" from the history:
-
-------------------------------------------------------------------------------
-git filter-branch --commit-filter '
- if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
- then
- skip_commit "$@";
- else
- git commit-tree "$@";
- fi' HEAD
-------------------------------------------------------------------------------
-
-The function 'skip_commit' is defined as follows:
-
---------------------------
-skip_commit()
-{
- shift;
- while [ -n "$1" ];
- do
- shift;
- map "$1";
- shift;
- done;
-}
---------------------------
-
-The shift magic first throws away the tree id and then the -p
-parameters. Note that this handles merges properly! In case Darl
-committed a merge between P1 and P2, it will be propagated properly
-and all children of the merge will become merge commits with P1,P2
-as their parents instead of the merge commit.
-
-*NOTE* the changes introduced by the commits, and which are not reverted
-by subsequent commits, will still be in the rewritten branch. If you want
-to throw out _changes_ together with the commits, you should use the
-interactive mode of 'git rebase'.
-
-You can rewrite the commit log messages using `--msg-filter`. For
-example, 'git svn-id' strings in a repository created by 'git svn' can
-be removed this way:
-
--------------------------------------------------------
-git filter-branch --msg-filter '
- sed -e "/^git-svn-id:/d"
-'
--------------------------------------------------------
-
-If you need to add 'Acked-by' lines to, say, the last 10 commits (none
-of which is a merge), use this command:
-
---------------------------------------------------------
-git filter-branch --msg-filter '
- cat &&
- echo "Acked-by: Bugs Bunny <bunny@bugzilla.org>"
-' HEAD~10..HEAD
---------------------------------------------------------
-
-The `--env-filter` option can be used to modify committer and/or author
-identity. For example, if you found out that your commits have the wrong
-identity due to a misconfigured user.email, you can make a correction,
-before publishing the project, like this:
-
---------------------------------------------------------
-git filter-branch --env-filter '
- if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
- then
- GIT_AUTHOR_EMAIL=john@example.com
- fi
- if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
- then
- GIT_COMMITTER_EMAIL=john@example.com
- fi
-' -- --all
---------------------------------------------------------
-
-To restrict rewriting to only part of the history, specify a revision
-range in addition to the new branch name. The new branch name will
-point to the top-most revision that a 'git rev-list' of this range
-will print.
-
-Consider this history:
-
-------------------
- D--E--F--G--H
- / /
-A--B-----C
-------------------
-
-To rewrite only commits D,E,F,G,H, but leave A, B and C alone, use:
-
---------------------------------
-git filter-branch ... C..H
---------------------------------
-
-To rewrite commits E,F,G,H, use one of these:
-
-----------------------------------------
-git filter-branch ... C..H --not D
-git filter-branch ... D..H --not C
-----------------------------------------
-
-To move the whole tree into a subdirectory, or remove it from there:
-
----------------------------------------------------------------
-git filter-branch --index-filter \
- 'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
- GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
- git update-index --index-info &&
- mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
----------------------------------------------------------------
-
-
-
-CHECKLIST FOR SHRINKING A REPOSITORY
-------------------------------------
-
-git-filter-branch can be used to get rid of a subset of files,
-usually with some combination of `--index-filter` and
-`--subdirectory-filter`. People expect the resulting repository to
-be smaller than the original, but you need a few more steps to
-actually make it smaller, because Git tries hard not to lose your
-objects until you tell it to. First make sure that:
-
-* You really removed all variants of a filename, if a blob was moved
- over its lifetime. `git log --name-only --follow --all -- filename`
- can help you find renames.
-
-* You really filtered all refs: use `--tag-name-filter cat -- --all`
- when calling git-filter-branch.
-
-Then there are two ways to get a smaller repository. A safer way is
-to clone, that keeps your original intact.
-
-* Clone it with `git clone file:///path/to/repo`. The clone
- will not have the removed objects. See linkgit:git-clone[1]. (Note
- that cloning with a plain path just hardlinks everything!)
-
-If you really don't want to clone it, for whatever reasons, check the
-following points instead (in this order). This is a very destructive
-approach, so *make a backup* or go back to cloning it. You have been
-warned.
-
-* Remove the original refs backed up by git-filter-branch: say `git
- for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git
- update-ref -d`.
-
-* Expire all reflogs with `git reflog expire --expire=now --all`.
-
-* Garbage collect all unreferenced objects with `git gc --prune=now`
- (or if your git-gc is not new enough to support arguments to
- `--prune`, use `git repack -ad; git prune` instead).
-
-GIT
----
-Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index f9255344ae..20850def5d 100644
--- a/Makefile
+++ b/Makefile
@@ -607,7 +607,6 @@ unexport CDPATH
SCRIPT_SH += git-bisect.sh
SCRIPT_SH += git-difftool--helper.sh
-SCRIPT_SH += git-filter-branch.sh
SCRIPT_SH += git-merge-octopus.sh
SCRIPT_SH += git-merge-one-file.sh
SCRIPT_SH += git-merge-resolve.sh
diff --git a/command-list.txt b/command-list.txt
index a9ac72bef4..1ba65d9516 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -90,7 +90,6 @@ git-fast-export ancillarymanipulators
git-fast-import ancillarymanipulators
git-fetch mainporcelain remote
git-fetch-pack synchingrepositories
-git-filter-branch ancillarymanipulators
git-fmt-merge-msg purehelpers
git-for-each-ref plumbinginterrogators
git-format-patch mainporcelain
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
deleted file mode 100644
index 7b1865c1d5..0000000000
--- a/git-filter-branch.sh
+++ /dev/null
@@ -1,675 +0,0 @@
-#!/bin/sh
-#
-# Rewrite revision history
-# Copyright (c) Petr Baudis, 2006
-# Minimal changes to "port" it to core-git (c) Johannes Schindelin, 2007
-#
-# Lets you rewrite the revision history of the current branch, creating
-# a new branch. You can specify a number of filters to modify the commits,
-# files and trees.
-
-# The following functions will also be available in the commit filter:
-
-functions=$(cat << \EOF
-EMPTY_TREE=$(git hash-object -t tree /dev/null)
-
-warn () {
- echo "$*" >&2
-}
-
-map()
-{
- # if it was not rewritten, take the original
- if test -r "$workdir/../map/$1"
- then
- cat "$workdir/../map/$1"
- else
- echo "$1"
- fi
-}
-
-# if you run 'skip_commit "$@"' in a commit filter, it will print
-# the (mapped) parents, effectively skipping the commit.
-
-skip_commit()
-{
- shift;
- while [ -n "$1" ];
- do
- shift;
- map "$1";
- shift;
- done;
-}
-
-# if you run 'git_commit_non_empty_tree "$@"' in a commit filter,
-# it will skip commits that leave the tree untouched, commit the other.
-git_commit_non_empty_tree()
-{
- if test $# = 3 && test "$1" = $(git rev-parse "$3^{tree}"); then
- map "$3"
- elif test $# = 1 && test "$1" = $EMPTY_TREE; then
- :
- else
- git commit-tree "$@"
- fi
-}
-# override die(): this version puts in an extra line break, so that
-# the progress is still visible
-
-die()
-{
- echo >&2
- echo "$*" >&2
- exit 1
-}
-EOF
-)
-
-eval "$functions"
-
-finish_ident() {
- # Ensure non-empty id name.
- echo "case \"\$GIT_$1_NAME\" in \"\") GIT_$1_NAME=\"\${GIT_$1_EMAIL%%@*}\" && export GIT_$1_NAME;; esac"
- # And make sure everything is exported.
- echo "export GIT_$1_NAME"
- echo "export GIT_$1_EMAIL"
- echo "export GIT_$1_DATE"
-}
-
-set_ident () {
- parse_ident_from_commit author AUTHOR committer COMMITTER
- finish_ident AUTHOR
- finish_ident COMMITTER
-}
-
-if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
- -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
- cat <<EOF
-WARNING: git-filter-branch has a glut of gotchas generating mangled history
- rewrites. Please use an alternative filtering tool such as 'git
- filter-repo' (https://github.com/newren/git-filter-repo/) instead.
- See the filter-branch manual page for more details; to squelch
- this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
-
-EOF
- sleep 5
-fi
-
-USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
- [--tree-filter <command>] [--index-filter <command>]
- [--parent-filter <command>] [--msg-filter <command>]
- [--commit-filter <command>] [--tag-name-filter <command>]
- [--original <namespace>]
- [-d <directory>] [-f | --force] [--state-branch <branch>]
- [--] [<rev-list options>...]"
-
-OPTIONS_SPEC=
-. git-sh-setup
-
-if [ "$(is_bare_repository)" = false ]; then
- require_clean_work_tree 'rewrite branches'
-fi
-
-tempdir=.git-rewrite
-filter_setup=
-filter_env=
-filter_tree=
-filter_index=
-filter_parent=
-filter_msg=cat
-filter_commit=
-filter_tag_name=
-filter_subdir=
-state_branch=
-orig_namespace=refs/original/
-force=
-prune_empty=
-remap_to_ancestor=
-while :
-do
- case "$1" in
- --)
- shift
- break
- ;;
- --force|-f)
- shift
- force=t
- continue
- ;;
- --remap-to-ancestor)
- # deprecated ($remap_to_ancestor is set now automatically)
- shift
- remap_to_ancestor=t
- continue
- ;;
- --prune-empty)
- shift
- prune_empty=t
- continue
- ;;
- -*)
- ;;
- *)
- break;
- esac
-
- # all switches take one argument
- ARG="$1"
- case "$#" in 1) usage ;; esac
- shift
- OPTARG="$1"
- shift
-
- case "$ARG" in
- -d)
- tempdir="$OPTARG"
- ;;
- --setup)
- filter_setup="$OPTARG"
- ;;
- --subdirectory-filter)
- filter_subdir="$OPTARG"
- remap_to_ancestor=t
- ;;
- --env-filter)
- filter_env="$OPTARG"
- ;;
- --tree-filter)
- filter_tree="$OPTARG"
- ;;
- --index-filter)
- filter_index="$OPTARG"
- ;;
- --parent-filter)
- filter_parent="$OPTARG"
- ;;
- --msg-filter)
- filter_msg="$OPTARG"
- ;;
- --commit-filter)
- filter_commit="$functions; $OPTARG"
- ;;
- --tag-name-filter)
- filter_tag_name="$OPTARG"
- ;;
- --original)
- orig_namespace=$(expr "$OPTARG/" : '\(.*[^/]\)/*$')/
- ;;
- --state-branch)
- state_branch="$OPTARG"
- ;;
- *)
- usage
- ;;
- esac
-done
-
-case "$prune_empty,$filter_commit" in
-,)
- filter_commit='git commit-tree "$@"';;
-t,)
- filter_commit="$functions;"' git_commit_non_empty_tree "$@"';;
-,*)
- ;;
-*)
- die "Cannot set --prune-empty and --commit-filter at the same time"
-esac
-
-case "$force" in
-t)
- rm -rf "$tempdir"
-;;
-'')
- test -d "$tempdir" &&
- die "$tempdir already exists, please remove it"
-esac
-orig_dir=$(pwd)
-mkdir -p "$tempdir/t" &&
-tempdir="$(cd "$tempdir"; pwd)" &&
-cd "$tempdir/t" &&
-workdir="$(pwd)" ||
-die ""
-
-# Remove tempdir on exit
-trap 'cd "$orig_dir"; rm -rf "$tempdir"' 0
-
-ORIG_GIT_DIR="$GIT_DIR"
-ORIG_GIT_WORK_TREE="$GIT_WORK_TREE"
-ORIG_GIT_INDEX_FILE="$GIT_INDEX_FILE"
-ORIG_GIT_AUTHOR_NAME="$GIT_AUTHOR_NAME"
-ORIG_GIT_AUTHOR_EMAIL="$GIT_AUTHOR_EMAIL"
-ORIG_GIT_AUTHOR_DATE="$GIT_AUTHOR_DATE"
-ORIG_GIT_COMMITTER_NAME="$GIT_COMMITTER_NAME"
-ORIG_GIT_COMMITTER_EMAIL="$GIT_COMMITTER_EMAIL"
-ORIG_GIT_COMMITTER_DATE="$GIT_COMMITTER_DATE"
-
-GIT_WORK_TREE=.
-export GIT_DIR GIT_WORK_TREE
-
-# Make sure refs/original is empty
-git for-each-ref > "$tempdir"/backup-refs || exit
-while read sha1 type name
-do
- case "$force,$name" in
- ,$orig_namespace*)
- die "Cannot create a new backup.
-A previous backup already exists in $orig_namespace
-Force overwriting the backup with -f"
- ;;
- t,$orig_namespace*)
- git update-ref -d "$name" $sha1
- ;;
- esac
-done < "$tempdir"/backup-refs
-
-# The refs should be updated if their heads were rewritten
-git rev-parse --no-flags --revs-only --symbolic-full-name \
- --default HEAD "$@" > "$tempdir"/raw-refs || exit
-while read ref
-do
- case "$ref" in ^?*) continue ;; esac
-
- if git rev-parse --verify "$ref"^0 >/dev/null 2>&1
- then
- echo "$ref"
- else
- warn "WARNING: not rewriting '$ref' (not a committish)"
- fi
-done >"$tempdir"/heads <"$tempdir"/raw-refs
-
-test -s "$tempdir"/heads ||
- die "You must specify a ref to rewrite."
-
-GIT_INDEX_FILE="$(pwd)/../index"
-export GIT_INDEX_FILE
-
-# map old->new commit ids for rewriting parents
-mkdir ../map || die "Could not create map/ directory"
-
-if test -n "$state_branch"
-then
- state_commit=$(git rev-parse --no-flags --revs-only "$state_branch")
- if test -n "$state_commit"
- then
- echo "Populating map from $state_branch ($state_commit)" 1>&2
- perl -e'open(MAP, "-|", "git show $ARGV[0]:filter.map") or die;
- while (<MAP>) {
- m/(.*):(.*)/ or die;
- open F, ">../map/$1" or die;
- print F "$2" or die;
- close(F) or die;
- }
- close(MAP) or die;' "$state_commit" \
- || die "Unable to load state from $state_branch:filter.map"
- else
- echo "Branch $state_branch does not exist. Will create" 1>&2
- fi
-fi
-
-# we need "--" only if there are no path arguments in $@
-nonrevs=$(git rev-parse --no-revs "$@") || exit
-if test -z "$nonrevs"
-then
- dashdash=--
-else
- dashdash=
- remap_to_ancestor=t
-fi
-
-git rev-parse --revs-only "$@" >../parse
-
-case "$filter_subdir" in
-"")
- eval set -- "$(git rev-parse --sq --no-revs "$@")"
- ;;
-*)
- eval set -- "$(git rev-parse --sq --no-revs "$@" $dashdash \
- "$filter_subdir")"
- ;;
-esac
-
-git rev-list --reverse --topo-order --default HEAD \
- --parents --simplify-merges --stdin "$@" <../parse >../revs ||
- die "Could not get the commits"
-commits=$(wc -l <../revs | tr -d " ")
-
-test $commits -eq 0 && die_with_status 2 "Found nothing to rewrite"
-
-# Rewrite the commits
-report_progress ()
-{
- if test -n "$progress" &&
- test $git_filter_branch__commit_count -gt $next_sample_at
- then
- count=$git_filter_branch__commit_count
-
- now=$(date +%s)
- elapsed=$(($now - $start_timestamp))
- remaining=$(( ($commits - $count) * $elapsed / $count ))
- if test $elapsed -gt 0
- then
- next_sample_at=$(( ($elapsed + 1) * $count / $elapsed ))
- else
- next_sample_at=$(($next_sample_at + 1))
- fi
- progress=" ($elapsed seconds passed, remaining $remaining predicted)"
- fi
- printf "\rRewrite $commit ($count/$commits)$progress "
-}
-
-git_filter_branch__commit_count=0
-
-progress= start_timestamp=
-if date '+%s' 2>/dev/null | grep -q '^[0-9][0-9]*$'
-then
- next_sample_at=0
- progress="dummy to ensure this is not empty"
- start_timestamp=$(date '+%s')
-fi
-
-if test -n "$filter_index" ||
- test -n "$filter_tree" ||
- test -n "$filter_subdir"
-then
- need_index=t
-else
- need_index=
-fi
-
-eval "$filter_setup" < /dev/null ||
- die "filter setup failed: $filter_setup"
-
-while read commit parents; do
- git_filter_branch__commit_count=$(($git_filter_branch__commit_count+1))
-
- report_progress
- test -f "$workdir"/../map/$commit && continue
-
- case "$filter_subdir" in
- "")
- if test -n "$need_index"
- then
- GIT_ALLOW_NULL_SHA1=1 git read-tree -i -m $commit
- fi
- ;;
- *)
- # The commit may not have the subdirectory at all
- err=$(GIT_ALLOW_NULL_SHA1=1 \
- git read-tree -i -m $commit:"$filter_subdir" 2>&1) || {
- if ! git rev-parse -q --verify $commit:"$filter_subdir"
- then
- rm -f "$GIT_INDEX_FILE"
- else
- echo >&2 "$err"
- false
- fi
- }
- esac || die "Could not initialize the index"
-
- GIT_COMMIT=$commit
- export GIT_COMMIT
- git cat-file commit "$commit" >../commit ||
- die "Cannot read commit $commit"
-
- eval "$(set_ident <../commit)" ||
- die "setting author/committer failed for commit $commit"
- eval "$filter_env" < /dev/null ||
- die "env filter failed: $filter_env"
-
- if [ "$filter_tree" ]; then
- git checkout-index -f -u -a ||
- die "Could not checkout the index"
- # files that $commit removed are now still in the working tree;
- # remove them, else they would be added again
- git clean -d -q -f -x
- eval "$filter_tree" < /dev/null ||
- die "tree filter failed: $filter_tree"
-
- (
- git diff-index -r --name-only --ignore-submodules $commit -- &&
- git ls-files --others
- ) > "$tempdir"/tree-state || exit
- git update-index --add --replace --remove --stdin \
- < "$tempdir"/tree-state || exit
- fi
-
- eval "$filter_index" < /dev/null ||
- die "index filter failed: $filter_index"
-
- parentstr=
- for parent in $parents; do
- for reparent in $(map "$parent"); do
- case "$parentstr " in
- *" -p $reparent "*)
- ;;
- *)
- parentstr="$parentstr -p $reparent"
- ;;
- esac
- done
- done
- if [ "$filter_parent" ]; then
- parentstr="$(echo "$parentstr" | eval "$filter_parent")" ||
- die "parent filter failed: $filter_parent"
- fi
-
- {
- while IFS='' read -r header_line && test -n "$header_line"
- do
- # skip header lines...
- :;
- done
- # and output the actual commit message
- cat
- } <../commit |
- eval "$filter_msg" > ../message ||
- die "msg filter failed: $filter_msg"
-
- if test -n "$need_index"
- then
- tree=$(git write-tree)
- else
- tree=$(git rev-parse "$commit^{tree}")
- fi
- workdir=$workdir @SHELL_PATH@ -c "$filter_commit" "git commit-tree" \
- "$tree" $parentstr < ../message > ../map/$commit ||
- die "could not write rewritten commit"
-done <../revs
-
-# If we are filtering for paths, as in the case of a subdirectory
-# filter, it is possible that a specified head is not in the set of
-# rewritten commits, because it was pruned by the revision walker.
-# Ancestor remapping fixes this by mapping these heads to the unique
-# nearest ancestor that survived the pruning.
-
-if test "$remap_to_ancestor" = t
-then
- while read ref
- do
- sha1=$(git rev-parse "$ref"^0)
- test -f "$workdir"/../map/$sha1 && continue
- ancestor=$(git rev-list --simplify-merges -1 "$ref" "$@")
- test "$ancestor" && echo $(map $ancestor) >> "$workdir"/../map/$sha1
- done < "$tempdir"/heads
-fi
-
-# Finally update the refs
-
-_x40='[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]'
-_x40="$_x40$_x40$_x40$_x40$_x40$_x40$_x40$_x40"
-echo
-while read ref
-do
- # avoid rewriting a ref twice
- test -f "$orig_namespace$ref" && continue
-
- sha1=$(git rev-parse "$ref"^0)
- rewritten=$(map $sha1)
-
- test $sha1 = "$rewritten" &&
- warn "WARNING: Ref '$ref' is unchanged" &&
- continue
-
- case "$rewritten" in
- '')
- echo "Ref '$ref' was deleted"
- git update-ref -m "filter-branch: delete" -d "$ref" $sha1 ||
- die "Could not delete $ref"
- ;;
- $_x40)
- echo "Ref '$ref' was rewritten"
- if ! git update-ref -m "filter-branch: rewrite" \
- "$ref" $rewritten $sha1 2>/dev/null; then
- if test $(git cat-file -t "$ref") = tag; then
- if test -z "$filter_tag_name"; then
- warn "WARNING: You said to rewrite tagged commits, but not the corresponding tag."
- warn "WARNING: Perhaps use '--tag-name-filter cat' to rewrite the tag."
- fi
- else
- die "Could not rewrite $ref"
- fi
- fi
- ;;
- *)
- # NEEDSWORK: possibly add -Werror, making this an error
- warn "WARNING: '$ref' was rewritten into multiple commits:"
- warn "$rewritten"
- warn "WARNING: Ref '$ref' points to the first one now."
- rewritten=$(echo "$rewritten" | head -n 1)
- git update-ref -m "filter-branch: rewrite to first" \
- "$ref" $rewritten $sha1 ||
- die "Could not rewrite $ref"
- ;;
- esac
- git update-ref -m "filter-branch: backup" "$orig_namespace$ref" $sha1 ||
- exit
-done < "$tempdir"/heads
-
-# TODO: This should possibly go, with the semantics that all positive given
-# refs are updated, and their original heads stored in refs/original/
-# Filter tags
-
-if [ "$filter_tag_name" ]; then
- git for-each-ref --format='%(objectname) %(objecttype) %(refname)' refs/tags |
- while read sha1 type ref; do
- ref="${ref#refs/tags/}"
- # XXX: Rewrite tagged trees as well?
- if [ "$type" != "commit" -a "$type" != "tag" ]; then
- continue;
- fi
-
- if [ "$type" = "tag" ]; then
- # Dereference to a commit
- sha1t="$sha1"
- sha1="$(git rev-parse -q "$sha1"^{commit})" || continue
- fi
-
- [ -f "../map/$sha1" ] || continue
- new_sha1="$(cat "../map/$sha1")"
- GIT_COMMIT="$sha1"
- export GIT_COMMIT
- new_ref="$(echo "$ref" | eval "$filter_tag_name")" ||
- die "tag name filter failed: $filter_tag_name"
-
- echo "$ref -> $new_ref ($sha1 -> $new_sha1)"
-
- if [ "$type" = "tag" ]; then
- new_sha1=$( ( printf 'object %s\ntype commit\ntag %s\n' \
- "$new_sha1" "$new_ref"
- git cat-file tag "$ref" |
- sed -n \
- -e '1,/^$/{
- /^object /d
- /^type /d
- /^tag /d
- }' \
- -e '/^-----BEGIN PGP SIGNATURE-----/q' \
- -e 'p' ) |
- git hash-object -t tag -w --stdin) ||
- die "Could not create new tag object for $ref"
- if git cat-file tag "$ref" | \
- sane_grep '^-----BEGIN PGP SIGNATURE-----' >/dev/null 2>&1
- then
- warn "gpg signature stripped from tag object $sha1t"
- fi
- fi
-
- git update-ref "refs/tags/$new_ref" "$new_sha1" ||
- die "Could not write tag $new_ref"
- done
-fi
-
-unset GIT_DIR GIT_WORK_TREE GIT_INDEX_FILE
-unset GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_AUTHOR_DATE
-unset GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL GIT_COMMITTER_DATE
-test -z "$ORIG_GIT_DIR" || {
- GIT_DIR="$ORIG_GIT_DIR" && export GIT_DIR
-}
-test -z "$ORIG_GIT_WORK_TREE" || {
- GIT_WORK_TREE="$ORIG_GIT_WORK_TREE" &&
- export GIT_WORK_TREE
-}
-test -z "$ORIG_GIT_INDEX_FILE" || {
- GIT_INDEX_FILE="$ORIG_GIT_INDEX_FILE" &&
- export GIT_INDEX_FILE
-}
-test -z "$ORIG_GIT_AUTHOR_NAME" || {
- GIT_AUTHOR_NAME="$ORIG_GIT_AUTHOR_NAME" &&
- export GIT_AUTHOR_NAME
-}
-test -z "$ORIG_GIT_AUTHOR_EMAIL" || {
- GIT_AUTHOR_EMAIL="$ORIG_GIT_AUTHOR_EMAIL" &&
- export GIT_AUTHOR_EMAIL
-}
-test -z "$ORIG_GIT_AUTHOR_DATE" || {
- GIT_AUTHOR_DATE="$ORIG_GIT_AUTHOR_DATE" &&
- export GIT_AUTHOR_DATE
-}
-test -z "$ORIG_GIT_COMMITTER_NAME" || {
- GIT_COMMITTER_NAME="$ORIG_GIT_COMMITTER_NAME" &&
- export GIT_COMMITTER_NAME
-}
-test -z "$ORIG_GIT_COMMITTER_EMAIL" || {
- GIT_COMMITTER_EMAIL="$ORIG_GIT_COMMITTER_EMAIL" &&
- export GIT_COMMITTER_EMAIL
-}
-test -z "$ORIG_GIT_COMMITTER_DATE" || {
- GIT_COMMITTER_DATE="$ORIG_GIT_COMMITTER_DATE" &&
- export GIT_COMMITTER_DATE
-}
-
-if test -n "$state_branch"
-then
- echo "Saving rewrite state to $state_branch" 1>&2
- state_blob=$(
- perl -e'opendir D, "../map" or die;
- open H, "|-", "git hash-object -w --stdin" or die;
- foreach (sort readdir(D)) {
- next if m/^\.\.?$/;
- open F, "<../map/$_" or die;
- chomp($f = <F>);
- print H "$_:$f\n" or die;
- }
- close(H) or die;' || die "Unable to save state")
- state_tree=$(printf '100644 blob %s\tfilter.map\n' "$state_blob" | git mktree)
- if test -n "$state_commit"
- then
- state_commit=$(echo "Sync" | git commit-tree "$state_tree" -p "$state_commit")
- else
- state_commit=$(echo "Sync" | git commit-tree "$state_tree" )
- fi
- git update-ref "$state_branch" "$state_commit"
-fi
-
-cd "$orig_dir"
-rm -rf "$tempdir"
-
-trap - 0
-
-if [ "$(is_bare_repository)" = false ]; then
- git read-tree -u -m HEAD || exit
-fi
-
-exit 0
diff --git a/t/perf/p7000-filter-branch.sh b/t/perf/p7000-filter-branch.sh
deleted file mode 100755
index b029586ccb..0000000000
--- a/t/perf/p7000-filter-branch.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-#!/bin/sh
-
-test_description='performance of filter-branch'
-. ./perf-lib.sh
-
-test_perf_default_repo
-test_checkout_worktree
-
-test_expect_success 'mark bases for tests' '
- git tag -f tip &&
- git tag -f base HEAD~100
-'
-
-test_perf 'noop filter' '
- git checkout --detach tip &&
- git filter-branch -f base..HEAD
-'
-
-test_perf 'noop prune-empty' '
- git checkout --detach tip &&
- git filter-branch -f --prune-empty base..HEAD
-'
-
-test_done
diff --git a/t/t7003-filter-branch.sh b/t/t7003-filter-branch.sh
deleted file mode 100755
index e23de7d0b5..0000000000
--- a/t/t7003-filter-branch.sh
+++ /dev/null
@@ -1,505 +0,0 @@
-#!/bin/sh
-
-test_description='git filter-branch'
-. ./test-lib.sh
-. "$TEST_DIRECTORY/lib-gpg.sh"
-
-test_expect_success 'setup' '
- test_commit A &&
- GIT_COMMITTER_DATE="@0 +0000" GIT_AUTHOR_DATE="@0 +0000" &&
- test_commit --notick B &&
- git checkout -b branch B &&
- test_commit D &&
- mkdir dir &&
- test_commit dir/D &&
- test_commit E &&
- git checkout master &&
- test_commit C &&
- git checkout branch &&
- git merge C &&
- git tag F &&
- test_commit G &&
- test_commit H
-'
-# * (HEAD, branch) H
-# * G
-# * Merge commit 'C' into branch
-# |\
-# | * (master) C
-# * | E
-# * | dir/D
-# * | D
-# |/
-# * B
-# * A
-
-
-H=$(git rev-parse H)
-
-test_expect_success 'rewrite identically' '
- git filter-branch branch
-'
-test_expect_success 'result is really identical' '
- test $H = $(git rev-parse HEAD)
-'
-
-test_expect_success 'rewrite bare repository identically' '
- (git config core.bare true && cd .git &&
- git filter-branch branch > filter-output 2>&1 &&
- ! fgrep fatal filter-output)
-'
-git config core.bare false
-test_expect_success 'result is really identical' '
- test $H = $(git rev-parse HEAD)
-'
-
-TRASHDIR=$(pwd)
-test_expect_success 'correct GIT_DIR while using -d' '
- mkdir drepo &&
- ( cd drepo &&
- git init &&
- test_commit drepo &&
- git filter-branch -d "$TRASHDIR/dfoo" \
- --index-filter "cp \"$TRASHDIR\"/dfoo/backup-refs \"$TRASHDIR\"" \
- ) &&
- grep drepo "$TRASHDIR/backup-refs"
-'
-
-test_expect_success 'tree-filter works with -d' '
- git init drepo-tree &&
- (
- cd drepo-tree &&
- test_commit one &&
- git filter-branch -d "$TRASHDIR/dfoo" \
- --tree-filter "echo changed >one.t" &&
- echo changed >expect &&
- git cat-file blob HEAD:one.t >actual &&
- test_cmp expect actual &&
- test_cmp one.t actual
- )
-'
-
-test_expect_success 'Fail if commit filter fails' '
- test_must_fail git filter-branch -f --commit-filter "exit 1" HEAD
-'
-
-test_expect_success 'rewrite, renaming a specific file' '
- git filter-branch -f --tree-filter "mv D.t doh || :" HEAD
-'
-
-test_expect_success 'test that the file was renamed' '
- test D = "$(git show HEAD:doh --)" &&
- ! test -f D.t &&
- test -f doh &&
- test D = "$(cat doh)"
-'
-
-test_expect_success 'rewrite, renaming a specific directory' '
- git filter-branch -f --tree-filter "mv dir diroh || :" HEAD
-'
-
-test_expect_success 'test that the directory was renamed' '
- test dir/D = "$(git show HEAD:diroh/D.t --)" &&
- ! test -d dir &&
- test -d diroh &&
- ! test -d diroh/dir &&
- test -f diroh/D.t &&
- test dir/D = "$(cat diroh/D.t)"
-'
-
-V=$(git rev-parse HEAD)
-
-test_expect_success 'populate --state-branch' '
- git filter-branch --state-branch state -f --tree-filter "touch file || :" HEAD
-'
-
-W=$(git rev-parse HEAD)
-
-test_expect_success 'using --state-branch to skip already rewritten commits' '
- test_when_finished git reset --hard $V &&
- git reset --hard $V &&
- git filter-branch --state-branch state -f --tree-filter "touch file || :" HEAD &&
- test_cmp_rev $W HEAD
-'
-
-git tag oldD HEAD~4
-test_expect_success 'rewrite one branch, keeping a side branch' '
- git branch modD oldD &&
- git filter-branch -f --tree-filter "mv B.t boh || :" D..modD
-'
-
-test_expect_success 'common ancestor is still common (unchanged)' '
- test "$(git merge-base modD D)" = "$(git rev-parse B)"
-'
-
-test_expect_success 'filter subdirectory only' '
- mkdir subdir &&
- touch subdir/new &&
- git add subdir/new &&
- test_tick &&
- git commit -m "subdir" &&
- echo H > A.t &&
- test_tick &&
- git commit -m "not subdir" A.t &&
- echo A > subdir/new &&
- test_tick &&
- git commit -m "again subdir" subdir/new &&
- git rm A.t &&
- test_tick &&
- git commit -m "again not subdir" &&
- git branch sub &&
- git branch sub-earlier HEAD~2 &&
- git filter-branch -f --subdirectory-filter subdir \
- refs/heads/sub refs/heads/sub-earlier
-'
-
-test_expect_success 'subdirectory filter result looks okay' '
- test 2 = $(git rev-list sub | wc -l) &&
- git show sub:new &&
- test_must_fail git show sub:subdir &&
- git show sub-earlier:new &&
- test_must_fail git show sub-earlier:subdir
-'
-
-test_expect_success 'more setup' '
- git checkout master &&
- mkdir subdir &&
- echo A > subdir/new &&
- git add subdir/new &&
- test_tick &&
- git commit -m "subdir on master" subdir/new &&
- git rm A.t &&
- test_tick &&
- git commit -m "again subdir on master" &&
- git merge branch
-'
-
-test_expect_success 'use index-filter to move into a subdirectory' '
- git branch directorymoved &&
- git filter-branch -f --index-filter \
- "git ls-files -s | sed \"s- -&newsubdir/-\" |
- GIT_INDEX_FILE=\$GIT_INDEX_FILE.new \
- git update-index --index-info &&
- mv \"\$GIT_INDEX_FILE.new\" \"\$GIT_INDEX_FILE\"" directorymoved &&
- git diff --exit-code HEAD directorymoved:newsubdir
-'
-
-test_expect_success 'stops when msg filter fails' '
- old=$(git rev-parse HEAD) &&
- test_must_fail git filter-branch -f --msg-filter false HEAD &&
- test $old = $(git rev-parse HEAD) &&
- rm -rf .git-rewrite
-'
-
-test_expect_success 'author information is preserved' '
- : > i &&
- git add i &&
- test_tick &&
- GIT_AUTHOR_NAME="B V Uips" git commit -m bvuips &&
- git branch preserved-author &&
- (sane_unset GIT_AUTHOR_NAME &&
- git filter-branch -f --msg-filter "cat; \
- test \$GIT_COMMIT != $(git rev-parse master) || \
- echo Hallo" \
- preserved-author) &&
- git rev-list --author="B V Uips" preserved-author >actual &&
- test_line_count = 1 actual
-'
-
-test_expect_success "remove a certain author's commits" '
- echo i > i &&
- test_tick &&
- git commit -m i i &&
- git branch removed-author &&
- git filter-branch -f --commit-filter "\
- if [ \"\$GIT_AUTHOR_NAME\" = \"B V Uips\" ];\
- then\
- skip_commit \"\$@\";
- else\
- git commit-tree \"\$@\";\
- fi" removed-author &&
- cnt1=$(git rev-list master | wc -l) &&
- cnt2=$(git rev-list removed-author | wc -l) &&
- test $cnt1 -eq $(($cnt2 + 1)) &&
- git rev-list --author="B V Uips" removed-author >actual &&
- test_line_count = 0 actual
-'
-
-test_expect_success 'barf on invalid name' '
- test_must_fail git filter-branch -f master xy-problem &&
- test_must_fail git filter-branch -f HEAD^
-'
-
-test_expect_success '"map" works in commit filter' '
- git filter-branch -f --commit-filter "\
- parent=\$(git rev-parse \$GIT_COMMIT^) &&
- mapped=\$(map \$parent) &&
- actual=\$(echo \"\$@\" | sed \"s/^.*-p //\") &&
- test \$mapped = \$actual &&
- git commit-tree \"\$@\";" master~2..master &&
- git rev-parse --verify master
-'
-
-test_expect_success 'Name needing quotes' '
-
- git checkout -b rerere A &&
- mkdir foo &&
- name="れれれ" &&
- >foo/$name &&
- git add foo &&
- git commit -m "Adding a file" &&
- git filter-branch --tree-filter "rm -fr foo" &&
- test_must_fail git ls-files --error-unmatch "foo/$name" &&
- test $(git rev-parse --verify rerere) != $(git rev-parse --verify A)
-
-'
-
-test_expect_success 'Subdirectory filter with disappearing trees' '
- git reset --hard &&
- git checkout master &&
-
- mkdir foo &&
- touch foo/bar &&
- git add foo &&
- test_tick &&
- git commit -m "Adding foo" &&
-
- git rm -r foo &&
- test_tick &&
- git commit -m "Removing foo" &&
-
- mkdir foo &&
- touch foo/bar &&
- git add foo &&
- test_tick &&
- git commit -m "Re-adding foo" &&
-
- git filter-branch -f --subdirectory-filter foo &&
- git rev-list master >actual &&
- test_line_count = 3 actual
-'
-
-test_expect_success 'Tag name filtering retains tag message' '
- git tag -m atag T &&
- git cat-file tag T > expect &&
- git filter-branch -f --tag-name-filter cat &&
- git cat-file tag T > actual &&
- test_cmp expect actual
-'
-
-faux_gpg_tag='object XXXXXX
-type commit
-tag S
-tagger T A Gger <tagger@example.com> 1206026339 -0500
-
-This is a faux gpg signed tag.
------BEGIN PGP SIGNATURE-----
-Version: FauxGPG v0.0.0 (FAUX/Linux)
-
-gdsfoewhxu/6l06f1kxyxhKdZkrcbaiOMtkJUA9ITAc1mlamh0ooasxkH1XwMbYQ
-acmwXaWET20H0GeAGP+7vow=
-=agpO
------END PGP SIGNATURE-----
-'
-test_expect_success 'Tag name filtering strips gpg signature' '
- sha1=$(git rev-parse HEAD) &&
- sha1t=$(echo "$faux_gpg_tag" | sed -e s/XXXXXX/$sha1/ | git mktag) &&
- git update-ref "refs/tags/S" "$sha1t" &&
- echo "$faux_gpg_tag" | sed -e s/XXXXXX/$sha1/ | head -n 6 > expect &&
- git filter-branch -f --tag-name-filter cat &&
- git cat-file tag S > actual &&
- test_cmp expect actual
-'
-
-test_expect_success GPG 'Filtering retains message of gpg signed commit' '
- mkdir gpg &&
- touch gpg/foo &&
- git add gpg &&
- test_tick &&
- git commit -S -m "Adding gpg" &&
-
- git log -1 --format="%s" > expect &&
- git filter-branch -f --msg-filter "cat" &&
- git log -1 --format="%s" > actual &&
- test_cmp expect actual
-'
-
-test_expect_success 'Tag name filtering allows slashes in tag names' '
- git tag -m tag-with-slash X/1 &&
- git cat-file tag X/1 | sed -e s,X/1,X/2, > expect &&
- git filter-branch -f --tag-name-filter "echo X/2" &&
- git cat-file tag X/2 > actual &&
- test_cmp expect actual
-'
-test_expect_success 'setup --prune-empty comparisons' '
- git checkout --orphan master-no-a &&
- git rm -rf . &&
- unset test_tick &&
- test_tick &&
- GIT_COMMITTER_DATE="@0 +0000" GIT_AUTHOR_DATE="@0 +0000" &&
- test_commit --notick B B.t B Bx &&
- git checkout -b branch-no-a Bx &&
- test_commit D D.t D Dx &&
- mkdir dir &&
- test_commit dir/D dir/D.t dir/D dir/Dx &&
- test_commit E E.t E Ex &&
- git checkout master-no-a &&
- test_commit C C.t C Cx &&
- git checkout branch-no-a &&
- git merge Cx -m "Merge tag '\''C'\'' into branch" &&
- git tag Fx &&
- test_commit G G.t G Gx &&
- test_commit H H.t H Hx &&
- git checkout branch
-'
-
-test_expect_success 'Prune empty commits' '
- git rev-list HEAD > expect &&
- test_commit to_remove &&
- git filter-branch -f --index-filter "git update-index --remove to_remove.t" --prune-empty HEAD &&
- git rev-list HEAD > actual &&
- test_cmp expect actual
-'
-
-test_expect_success 'prune empty collapsed merges' '
- test_config merge.ff false &&
- git rev-list HEAD >expect &&
- test_commit to_remove_2 &&
- git reset --hard HEAD^ &&
- test_merge non-ff to_remove_2 &&
- git filter-branch -f --index-filter "git update-index --remove to_remove_2.t" --prune-empty HEAD &&
- git rev-list HEAD >actual &&
- test_cmp expect actual
-'
-
-test_expect_success 'prune empty works even without index/tree filters' '
- git rev-list HEAD >expect &&
- git commit --allow-empty -m empty &&
- git filter-branch -f --prune-empty HEAD &&
- git rev-list HEAD >actual &&
- test_cmp expect actual
-'
-
-test_expect_success '--prune-empty is able to prune root commit' '
- git rev-list branch-no-a >expect &&
- git branch testing H &&
- git filter-branch -f --prune-empty --index-filter "git update-index --remove A.t" testing &&
- git rev-list testing >actual &&
- git branch -D testing &&
- test_cmp expect actual
-'
-
-test_expect_success '--prune-empty is able to prune entire branch' '
- git branch prune-entire B &&
- git filter-branch -f --prune-empty --index-filter "git update-index --remove A.t B.t" prune-entire &&
- test_path_is_missing .git/refs/heads/prune-entire &&
- test_must_fail git reflog exists refs/heads/prune-entire
-'
-
-test_expect_success '--remap-to-ancestor with filename filters' '
- git checkout master &&
- git reset --hard A &&
- test_commit add-foo foo 1 &&
- git branch moved-foo &&
- test_commit add-bar bar a &&
- git branch invariant &&
- orig_invariant=$(git rev-parse invariant) &&
- git branch moved-bar &&
- test_commit change-foo foo 2 &&
- git filter-branch -f --remap-to-ancestor \
- moved-foo moved-bar A..master \
- -- -- foo &&
- test $(git rev-parse moved-foo) = $(git rev-parse moved-bar) &&
- test $(git rev-parse moved-foo) = $(git rev-parse master^) &&
- test $orig_invariant = $(git rev-parse invariant)
-'
-
-test_expect_success 'automatic remapping to ancestor with filename filters' '
- git checkout master &&
- git reset --hard A &&
- test_commit add-foo2 foo 1 &&
- git branch moved-foo2 &&
- test_commit add-bar2 bar a &&
- git branch invariant2 &&
- orig_invariant=$(git rev-parse invariant2) &&
- git branch moved-bar2 &&
- test_commit change-foo2 foo 2 &&
- git filter-branch -f \
- moved-foo2 moved-bar2 A..master \
- -- -- foo &&
- test $(git rev-parse moved-foo2) = $(git rev-parse moved-bar2) &&
- test $(git rev-parse moved-foo2) = $(git rev-parse master^) &&
- test $orig_invariant = $(git rev-parse invariant2)
-'
-
-test_expect_success 'setup submodule' '
- rm -fr ?* .git &&
- git init &&
- test_commit file &&
- mkdir submod &&
- submodurl="$PWD/submod" &&
- ( cd submod &&
- git init &&
- test_commit file-in-submod ) &&
- git submodule add "$submodurl" &&
- git commit -m "added submodule" &&
- test_commit add-file &&
- ( cd submod && test_commit add-in-submodule ) &&
- git add submod &&
- git commit -m "changed submodule" &&
- git branch original HEAD
-'
-
-orig_head=$(git show-ref --hash --head HEAD)
-
-test_expect_success 'rewrite submodule with another content' '
- git filter-branch --tree-filter "test -d submod && {
- rm -rf submod &&
- git rm -rf --quiet submod &&
- mkdir submod &&
- : > submod/file
- } || :" HEAD &&
- test $orig_head != $(git show-ref --hash --head HEAD)
-'
-
-test_expect_success 'replace submodule revision' '
- git reset --hard original &&
- git filter-branch -f --tree-filter \
- "if git ls-files --error-unmatch -- submod > /dev/null 2>&1
- then git update-index --cacheinfo 160000 0123456789012345678901234567890123456789 submod
- fi" HEAD &&
- test $orig_head != $(git show-ref --hash --head HEAD)
-'
-
-test_expect_success 'filter commit message without trailing newline' '
- git reset --hard original &&
- commit=$(printf "no newline" | git commit-tree HEAD^{tree}) &&
- git update-ref refs/heads/no-newline $commit &&
- git filter-branch -f refs/heads/no-newline &&
- echo $commit >expect &&
- git rev-parse refs/heads/no-newline >actual &&
- test_cmp expect actual
-'
-
-test_expect_success 'tree-filter deals with object name vs pathname ambiguity' '
- test_when_finished "git reset --hard original" &&
- ambiguous=$(git rev-list -1 HEAD) &&
- git filter-branch --tree-filter "mv file.t $ambiguous" HEAD^.. &&
- git show HEAD:$ambiguous
-'
-
-test_expect_success 'rewrite repository including refs that point at non-commit object' '
- test_when_finished "git reset --hard original" &&
- tree=$(git rev-parse HEAD^{tree}) &&
- test_when_finished "git replace -d $tree" &&
- echo A >new &&
- git add new &&
- new_tree=$(git write-tree) &&
- git replace $tree $new_tree &&
- git tag -a -m "tag to a tree" treetag $new_tree &&
- git reset --hard HEAD &&
- git filter-branch -f -- --all >filter-output 2>&1 &&
- ! fgrep fatal filter-output
-'
-
-test_done
diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7009-filter-branch-null-sha1.sh
deleted file mode 100755
index 9ba9f24ad2..0000000000
--- a/t/t7009-filter-branch-null-sha1.sh
+++ /dev/null
@@ -1,55 +0,0 @@
-#!/bin/sh
-
-test_description='filter-branch removal of trees with null sha1'
-. ./test-lib.sh
-
-test_expect_success 'setup: base commits' '
- test_commit one &&
- test_commit two &&
- test_commit three
-'
-
-test_expect_success 'setup: a commit with a bogus null sha1 in the tree' '
- {
- git ls-tree HEAD &&
- printf "160000 commit $ZERO_OID\\tbroken\\n"
- } >broken-tree &&
- echo "add broken entry" >msg &&
-
- tree=$(git mktree <broken-tree) &&
- test_tick &&
- commit=$(git commit-tree $tree -p HEAD <msg) &&
- git update-ref HEAD "$commit"
-'
-
-# we have to make one more commit on top removing the broken
-# entry, since otherwise our index does not match HEAD (and filter-branch will
-# complain). We could make the index match HEAD, but doing so would involve
-# writing a null sha1 into the index.
-test_expect_success 'setup: bring HEAD and index in sync' '
- test_tick &&
- git commit -a -m "back to normal"
-'
-
-test_expect_success 'noop filter-branch complains' '
- test_must_fail git filter-branch \
- --force --prune-empty \
- --index-filter "true"
-'
-
-test_expect_success 'filter commands are still checked' '
- test_must_fail git filter-branch \
- --force --prune-empty \
- --index-filter "git rm --cached --ignore-unmatch three.t"
-'
-
-test_expect_success 'removing the broken entry works' '
- echo three >expect &&
- git filter-branch \
- --force --prune-empty \
- --index-filter "git rm --cached --ignore-unmatch broken" &&
- git log -1 --format=%s >actual &&
- test_cmp expect actual
-'
-
-test_done
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 75512c3403..4e7f669c76 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -28,10 +28,10 @@ complete ()
#
# (2) A test makes sure that common subcommands are included in the
# completion for "git <TAB>", and a plumbing is excluded. "add",
-# "filter-branch" and "ls-files" are listed for this.
+# "rebase" and "ls-files" are listed for this.
-GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr filter-branch ls-files'
-GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout filter-branch'
+GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr rebase ls-files'
+GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout rebase'
. "$GIT_BUILD_DIR/contrib/completion/git-completion.bash"
@@ -1392,12 +1392,12 @@ test_expect_success 'basic' '
# built-in
grep -q "^add \$" out &&
# script
- grep -q "^filter-branch \$" out &&
+ grep -q "^rebase \$" out &&
# plumbing
! grep -q "^ls-files \$" out &&
- run_completion "git f" &&
- ! grep -q -v "^f" out
+ run_completion "git r" &&
+ ! grep -q -v "^r" out
'
test_expect_success 'double dash "git" itself' '
--
2.23.0.3.gcc10030edf.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (3 preceding siblings ...)
2019-08-28 0:22 ` [RFC PATCH v2 4/4] Remove git-filter-branch, it is now external to git.git Elijah Newren
@ 2019-08-29 0:06 ` Elijah Newren
2019-08-29 0:06 ` [PATCH v3 1/4] t6006: simplify and optimize empty message test Elijah Newren
` (4 more replies)
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
5 siblings, 5 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-29 0:06 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Here's a series warns about git-filter-branch usage and avoids it
ourselves.
Changes since v2 (full range-diff below):
* [Patch 2] testcase syntax cleanups
* [Patch 3] fixed "linkgit:" references to filter-repo to be url
links (or footnotes)
* [Patch 3] fixed the mode on filter-branch.sh (oops) and dropped
the ambiguous "and pause". Linkified "filter-repo" in a place
where there was no link.
* [Patch 3] As suggested by Eric (Sunshine), just make the manpage
and directly include the safety and performance sections of the
referenced email (the performance section was referenced by the
safety section). Being included directly in the manpage should
help with folks reading the documentation offline. Anyway, the
text is really long, so it took a while to format it nicely,
recheck for typos, reword based on the fact that it'll be in the
manpage (because it's weird to have the manpage refer to itself),
etc.
* [Patch 4] Dropped almost all the original patch 4; only including
the bits about t9902-completion.sh. Removed the RFC label, since
that one piece should be good for including now.
Elijah Newren (4):
t6006: simplify and optimize empty message test
t3427: accelerate this test by using fast-export and fast-import
Recommend git-filter-repo instead of git-filter-branch
t9902: use a non-deprecated command for testing
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 302 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
t/t3427-rebase-subtree.sh | 24 ++-
t/t6006-rev-list-format.sh | 5 +-
t/t9902-completion.sh | 12 +-
12 files changed, 339 insertions(+), 77 deletions(-)
Range-diff:
1: 7ddbeea2ca = 1: 7ddbeea2ca t6006: simplify and optimize empty message test
2: f18bd7a609 ! 2: e1e63189c1 t3427: accelerate this test by using fast-export and fast-import
@@ Commit message
Signed-off-by: Elijah Newren <newren@gmail.com>
## t/t3427-rebase-subtree.sh ##
-@@ t/t3427-rebase-subtree.sh: commit_message() {
+@@ t/t3427-rebase-subtree.sh: This test runs git rebase and tests the subtree strategy.
+ . ./test-lib.sh
+ . "$TEST_DIRECTORY"/lib-rebase.sh
+
+-commit_message() {
++commit_message () {
git log --pretty=format:%s -1 "$1"
}
-+extract_files_subtree() {
-+ git fast-export --no-data HEAD -- files_subtree/ \
-+ | sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" \
-+ | git fast-import --force --quiet
++extract_files_subtree () {
++ git fast-export --no-data HEAD -- files_subtree/ |
++ sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
++ git fast-import --force --quiet
+}
+
test_expect_success 'setup' '
3: 7008c16984 ! 3: 59c7446927 Recommend git-filter-repo instead of git-filter-branch
@@ Documentation/git-filter-branch.txt: SYNOPSIS
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
-+carefully read the "Safety" section of the message on the Git mailing list
++carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
++mines of filter-branch, and then vigilantly avoid as many of the hazards
++listed there as reasonably possible.
++
+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
-+the land mines of filter-branch] and vigilantly avoid as many of the
-+hazards listed there as reasonably possible.
++the land mines of filter-branch]
+
DESCRIPTION
-----------
@@ Documentation/git-filter-branch.txt: warned.
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
--
++[[PERFORMANCE]]
++PERFORMANCE
++-----------
++
++The performance of filter-branch is glacially slow; its design makes it
++impossible for a backward-compatible implementation to ever be fast:
++
++* In editing files, git-filter-branch by design checks out each and
++every commit as it existed in the original repo. If your repo has 10\^5
++files and 10\^5 commits, but each commit only modifies 5 files, then
++git-filter-branch will make you do 10\^10 modifications, despite only
++having (at most) 5*10^5 unique blobs.
++
++* If you try and cheat and try to make filter-branch only work on
++files modified in a commit, then two things happen
++
++ . you run into problems with deletions whenever the user is simply
++ trying to rename files (because attempting to delete files that
++ don't exist looks like a no-op; it takes some chicanery to remap
++ deletes across file renames when the renames happen via arbitrary
++ user-provided shell)
++
++ . even if you succeed at the map-deletes-for-renames chicanery, you
++ still technically violate backward compatibility because users are
++ allowed to filter files in ways that depend upon topology of commits
++ instead of filtering solely based on file contents or names (though
++ I have never seen any user ever do this).
++
++* Even if you don't need to edit files but only want to e.g. rename or
++remove some and thus can avoid checking out each file (i.e. you can use
++--index-filter), you still are passing shell snippets for your filters.
++This means that for every commit, you have to have a prepared git repo
++where users can run git commands. That's a lot of setup. It also means
++you have to fork at least one process to run the user-provided shell
++snippet, and odds are that the user's shell snippet invokes lots of
++commands in some long pipeline, so you will have lots and lots of forks.
++For every. single. commit. That's a massive amount of overhead to
++rename a few files.
++
++* filter-branch is written in shell, which is kind of slow. Naturally,
++it makes sense to want to rewrite that in some other language. However,
++filter-branch documentation states that several additional shell
++functions are provided for users to call, e.g. 'map', 'skip_commit',
++'git_commit_non_empty_tree'. If filter-branch itself isn't a shell
++script, then in order to make those shell functions available to the
++users' shell snippets you have to prepend the shell definitions of these
++functions to every one of the users' shell snippets and thus make these
++special shell functions be parsed with each and every commit.
++
++* filter-branch provides a --setup option which is a shell snippet that
++can be sourced to make shell functions and variables available to all
++other filters. If filter-branch is a shell script, it can simply eval
++this shell snippet once at the beginning. If you try to fix performance
++by making filter-branch not be a shell script, then you have to prepend
++the setup shell snippet to all other filters and parse it with every
++single commit.
++
++* filter-branch writes lots of files to $workdir/../map/ to keep a
++mapping of commits, which it uses pruning commits and remapping to
++ancestors and the map() command more generally. Other files like
++$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
++$tempdir/tree-state are all created internally too. It is possible
++(though strongly discouraged) that users could have accessed any of
++these directly. Users even had a pointer to follow in the form of
++Documentation that the 'map' command existed, which naturally uses the
++$workdir/../map/* files. So, even if you don't have to edit files, for
++strict backward compatibility you need to still write a bunch of files
++to disk somewhere and keep them updated for every commit. You can claim
++it was an implementation detail that users should not have depended
++upon, but the truth is they've had a decade where they could so. So, if
++you want full compatibility, it has to be there. Besides, the
++regression tests depend on at least one of these details, specifying an
++--index-filter that reaches down and grabs backup-refs from $tempdir,
++and thus provides resourceful users who do google searches an example
++that there are files there for them to read and grab and use. (And if
++you want to pass the existing regression tests, you have to at least put
++the backup-refs file there even if it's irrelevant to your
++implementation otherwise.)
++
++All of that said, performance of filter-branch could be improved by
++reimplementing it in a non-shell language and taking a couple small
++liberties with backward compatibility (such as having it only run
++filters on files changed within each commit). filter-repo provides a
++demo script named
++https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
++which does exactly that and which passes all the git-filter-branch
++regression tests. It's much faster than git-filter-branch, though it
++suffers from all the same safety issues as git-filter-branch, and is
++still glacially slow compared to
++https://github.com/newren/git-filter-repo/[git filter-repo].
++
++[[SAFETY]]
++SAFETY
++------
++
++filter-branch is riddled with gotchas resulting in various ways to
++easily corrupt repos or end up with a mess worse than what you started
++with:
++
++* Someone can have a set of "working and tested filters" which they
++document or provide to a coworker, who then runs them on a different OS
++where the same commands are not working/tested (some examples in the
++git-filter-branch manpage are also affected by this). BSD vs. GNU
++userland differences can really bite. If you're lucky, you get ugly
++error messages spewed. But just as likely, the commands either don't do
++the filtering requested, or silently corrupt making some unwanted
++change. The unwanted change may only affect a few commits, so it's not
++necessarily obvious either. (The fact that problems won't necessarily
++be obvious means they are likely to go unnoticed until the rewritten
++history is in use for quite a while, at which point it's really hard to
++justify another flag-day for another rewrite.)
++
++* Filenames with spaces (which are rare) are often mishandled by shell
++snippets since they cause problems for shell pipelines. Not everyone is
++familiar with find -print0, xargs -0, ls-files -z, etc. Even people who
++are familiar with these may assume such needs are not relevant because
++someone else renamed any such files in their repo back before the person
++doing the filtering joined the project. And, often, even those familiar
++with handling arguments with spaces my not do so just because they
++aren't in the mindset of thinking about everything that could possibly
++go wrong.
++
++* Non-ascii filenames (which are rare) can be silently removed despite
++being in a desired directory. The desire to select paths to keep often
++use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
++ls-files will only quote filenames if needed so folks may not notice
++that one of the files didn't match the regex, again until it's much too
++late. Yes, someone who knows about core.quotePath can avoid this
++(unless they have other special characters like \t, \n, or "), and
++people who use ls-files -z with something other than grep can avoid
++this, but that doesn't mean they will.
++
++* Similarly, when moving files around, one can find that filenames with
++non-ascii or special characters end up in a different directory, one
++that includes a double quote character. (This is technically the same
++issue as above with quoting, but perhaps an interesting different way
++that it can and has manifested as a problem.)
++
++* It's far too easy to accidentally mix up old and new history. It's
++still possible with any tool, but filter-branch almost invites it. If
++we're lucky, the only downside is users getting frustrated that they
++don't know how to shrink their repo and remove the old stuff. If we're
++unlucky, they merge old and new history and end up with multiple
++"copies" of each commit, some of which have unwanted or sensitive files
++and others which don't. This comes about in multiple different ways:
++
++ ** the default to only doing a partial history rewrite ('--all' is not
++ the default and over 80% of the examples in the manpage don't use
++ it)
++
++ ** the fact that there's no automatic post-run cleanup
++
++ ** the fact that --tag-name-filter (when used to rename tags) doesn't
++ remove the old tags but just adds new ones with the new name (this
++ manpage has documented this for a long time so it's presumably not
++ a "bug" even though it feels like it)
++
++ ** the fact that little educational information is provided to inform
++ users of the ramifications of a rewrite and how to avoid mixing old
++ and new history. For example, this man page discusses how users
++ need to understand that they need to rebase their changes for all
++ their branches on top of new history (or delete and reclone), but
++ that's only one of multiple concerns to consider. See the
++ "DISCUSSION" section of the git filter-repo manual page for more
++ details.
++
++* Annotated tags can be accidentally converted to lightweight tags, due
++to either of two issues:
++
++ . Someone can do a history rewrite, realize they messed up, restore
++ from the backups in refs/original/, and then redo their
++ filter-branch command. (The backup in refs/original/ is not a real
++ backup; it dereferences tags first.)
++
++ . Running filter-branch with either --tags or --all in your <rev-list
++ options>. In order to retain annotated tags as annotated, you must
++ use --tag-name-filter (and must not have restored from
++ refs/original/ in a previously botched rewrite).
++
++* Any commit messages that specify an encoding will become corrupted
++by the rewrite; filter-branch ignores the encoding, takes the original
++bytes, and feeds it to commit-tree without telling it the proper
++encoding. (This happens whether or not --msg-filter is used, though I
++suspect --msg-filter provides additional ways to really mess things
++up).
++
++* Commit messages (even if they are all UTF-8) by default become
++corrupted due to not being updated -- any references to other commit
++hashes in commit messages will now refer to no-longer-extant commits.
++
++* There are no facilities for helping users find what unwanted crud they
++should delete, which means they are much more likely to have incomplete
++or partial cleanups that sometimes result in confusion and people
++wasting time trying to understand. (For example, folks tend to just
++look for big files to delete instead of big directories or extensions,
++and once they do so, then sometime later folks using the new repository
++who are going through history will notice a build artifact directory
++that has some files but not others, or a cache of dependencies
++(node_modules or similar) which couldn't have ever been functional since
++it's missing some files.)
++
++* If --prune-empty isn't specified, then the filtering process can
++create hoards of confusing empty commits
++
++* If --prune-empty is specified, then intentionally placed empty
++commits from before the filtering operation are also pruned instead of
++just pruning commits that became empty due to filtering rules.
++
++* If --prune empty is specified, sometimes empty commits are missed
++and left around anyway (a somewhat rare bug, but it happens...)
++
++* A minor issue, but users who have a goal to update all names and
++emails in a repository may be led to --env-filter which will only update
++authors and committers, missing taggers.
++
++* If the user provides a --tag-name-filter that maps multiple tags to
++the same name, no warning or error is provided; filter-branch simply
++overwrites each tag in some undocumented pre-defined order resulting in
++only one tag at the end. If you try to "fix" this bug in filter-branch
++and make it error out and warn the user instead, one of the
++filter-branch regression tests will fail. (So, if you are trying to
++make a backward compatible reimplementation you have to add extra code
++to detect collisions and make sure that only the lexicographically last
++one is rewritten to avoid fast-import from seeing both since fast-import
++will naturally do the sane thing and error out if told to write the same
++tag more than once.)
++
++Also, the poor performance of filter-branch often leads to safety issues:
++
++* Coming up with the correct shell snippet to do the filtering you want
++is sometimes difficult unless you're just doing a trivial modification
++such as deleting a couple files. People have often come to me for help,
++so I should be practiced and an expert, but even for fairly simple cases
++I still sometimes take over 10 minutes and several iterations to get
++the right commands -- and that's assuming they are working on a tiny
++repository. Unfortunately, people often learn if the snippet is right
++or wrong by trying it out, but the rightness or wrongness can vary
++depending on special circumstances (spaces in filenames, non-ascii
++filenames, funny author names or emails, invalid timezones, presence of
++grafts or replace objects, etc.), meaning they may have to wait a long
++time, hit an error, then restart. The performance of filter-branch is
++so bad that this cycle is painful, reducing the time available to
++carefully re-check (to say nothing about what it does to the patience of
++the person doing the rewrite even if they do technically have more time
++available). This problem is extra compounded because errors from broken
++filters may not be shown for a long time and/or get lost in a sea of
++output. Even worse, broken filters often just result in silent
++incorrect rewrites.
++
++* To top it all off, even when users finally find working commands, they
++naturally want to share them. But they may be unaware that their repo
++didn't have some special cases that someone else's does. So, when
++someone else with a different repository runs the same commands, they
++get hit by the problems above. Or, the user just runs commands that
++really were vetted for special cases, but they run it on a different OS
++where it doesn't work, as noted above.
+
GIT
---
- Part of the linkgit:git[1] suite
## Documentation/git-gc.txt ##
@@ Documentation/git-gc.txt: NOTES
@@ Documentation/git-rebase.txt: Hard case: The changes are not the same.::
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
-+ a full history rewriting command like `filter-repo`.
++ a full history rewriting command like
++ https://github.com/newren/git-filter-repo[`filter-repo`].
The easy case
@@ Documentation/git-replace.txt: The following format are available:
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
-+linkgit:git-filter-repo[1], among other git commands, can be used to
++https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
@@ Documentation/git-replace.txt: pending objects.
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
-+linkgit:git-filter-repo[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
+ linkgit:git-var[1]
+ linkgit:git[1]
++https://github.com/newren/git-filter-repo[git-filter-repo]
+
+ GIT
+ ---
## Documentation/git-svn.txt ##
@@ Documentation/git-svn.txt: option for (hopefully) obvious reasons.
@@ Documentation/git-svn.txt: option for (hopefully) obvious reasons.
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
-+reports, and archives. If you plan to eventually migrate from SVN to Git
- and are certain about dropping SVN history, consider
+-and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
-+linkgit:git-filter-repo[1] instead. filter-repo also allows
- reformatting of metadata for ease-of-reading and rewriting authorship
- info for non-"svn.authorsFile" users.
+-reformatting of metadata for ease-of-reading and rewriting authorship
+-info for non-"svn.authorsFile" users.
++reports, and archives. If you plan to eventually migrate from SVN to
++Git and are certain about dropping SVN history, consider
++https://github.com/newren/git-filter-repo[git-filter-repo] instead.
++filter-repo also allows reformatting of metadata for ease-of-reading
++and rewriting authorship info for non-"svn.authorsFile" users.
+ svn.useSvmProps::
+ svn-remote.<name>.useSvmProps::
## Documentation/githooks.txt ##
@@ Documentation/githooks.txt: post-rewrite
@@ Documentation/githooks.txt: post-rewrite
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
+-arguments may be passed in the future.
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
-+linkgit:git-fast-import[1] or linkgit:git-filter-repo[1] typically do
-+not call it!). Its first argument denotes the command it was invoked
-+by: currently one of `amend` or `rebase`. Further command-dependent
- arguments may be passed in the future.
++linkgit:git-fast-import[1] or
++https://github.com/newren/git-filter-repo[git-filter-repo] typically
++do not call it!). Its first argument denotes the command it was
++invoked by: currently one of `amend` or `rebase`. Further
++command-dependent arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
+ format
## contrib/svn-fe/svn-fe.txt ##
@@ contrib/svn-fe/svn-fe.txt: line. This line has the form `git-svn-id: URL@REVNO UUID`.
@@ contrib/svn-fe/svn-fe.txt: The exit status does not reflect whether an error was
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
- ## git-filter-branch.sh (mode change 100755 => 100644) ##
+ ## git-filter-branch.sh ##
@@ git-filter-branch.sh: set_ident () {
finish_ident COMMITTER
}
@@ git-filter-branch.sh: set_ident () {
+ rewrites. Please use an alternative filtering tool such as 'git
+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+ See the filter-branch manual page for more details; to squelch
-+ this warning and pause, set FILTER_BRANCH_SQUELCH_WARNING=1.
++ this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+ sleep 5
4: ff3e04e558 < -: ---------- Remove git-filter-branch, it is now external to git.git
-: ---------- > 4: 1dbca82408 t9902: use a non-deprecated command for testing
--
2.23.0.3.g59c7446927.dirty
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v3 1/4] t6006: simplify and optimize empty message test
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-08-29 0:06 ` Elijah Newren
2019-08-29 0:06 ` [PATCH v3 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
` (3 subsequent siblings)
4 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-29 0:06 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
the commit messages to be empty. This test was written this way because
the --allow-empty-message option to git commit did not exist at the
time. Simplify this test and avoid the need to invoke filter-branch by
just using --allow-empty-message when creating the commit.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
./t6006-rev-list-format.sh`) by about 11% on Linux and by 13% on
Mac.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t6006-rev-list-format.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index da113d975b..d30e41c9f7 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
'
test_expect_success 'oneline with empty message' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
+ git commit --allow-empty --allow-empty-message &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
git rev-list --oneline --graph HEAD >testg.txt &&
--
2.23.0.3.g59c7446927.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v3 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-29 0:06 ` [PATCH v3 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-08-29 0:06 ` Elijah Newren
2019-08-29 0:06 ` [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
` (2 subsequent siblings)
4 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-29 0:06 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
fast-export and fast-import can easily handle the simple rewrite that
was being done by filter-branch, and should be significantly faster on
systems with a slow fork. Timings from before and after on two laptops
that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
i.e. including everything in this test -- not just the filter-branch or
fast-export/fast-import pair):
Linux: 4.305s -> 3.684s (~17% speedup)
Mac: 10.128s -> 7.038s (~30% speedup)
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t3427-rebase-subtree.sh | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/t/t3427-rebase-subtree.sh b/t/t3427-rebase-subtree.sh
index d8640522a0..c1f6102921 100755
--- a/t/t3427-rebase-subtree.sh
+++ b/t/t3427-rebase-subtree.sh
@@ -7,10 +7,16 @@ This test runs git rebase and tests the subtree strategy.
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-rebase.sh
-commit_message() {
+commit_message () {
git log --pretty=format:%s -1 "$1"
}
+extract_files_subtree () {
+ git fast-export --no-data HEAD -- files_subtree/ |
+ sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
+ git fast-import --force --quiet
+}
+
test_expect_success 'setup' '
test_commit README &&
mkdir files &&
@@ -42,7 +48,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-preserve-merges-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master4"
@@ -53,7 +59,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-preserve-merges-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "files_subtree/master5"
@@ -64,7 +70,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-keep-empty-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -75,7 +81,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-keep-empty-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -86,7 +92,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto empty commit' '
reset_rebase &&
git checkout -b rebase-keep-empty-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
@@ -96,7 +102,7 @@ test_expect_failure REBASE_P \
test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
reset_rebase &&
git checkout -b rebase-onto-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -106,7 +112,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
reset_rebase &&
git checkout -b rebase-onto-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -115,7 +121,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
test_expect_failure 'Rebase -Xsubtree --onto empty commit' '
reset_rebase &&
git checkout -b rebase-onto-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
--
2.23.0.3.g59c7446927.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-29 0:06 ` [PATCH v3 1/4] t6006: simplify and optimize empty message test Elijah Newren
2019-08-29 0:06 ` [PATCH v3 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-08-29 0:06 ` Elijah Newren
2019-08-29 18:10 ` Eric Sunshine
2019-08-29 0:06 ` [PATCH v3 4/4] t9902: use a non-deprecated command for testing Elijah Newren
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
4 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-08-29 0:06 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.
Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix
these issues and to avoid recommending filter-branch.
Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
* In order to recommend that people quit using filter-branch, we need
to provide them a recomendation for something else to use that
can handle all the same types of rewrites. To my knowledge,
filter-repo is the only such tool. So it needs to be mentioned.
* I don't want to give conflicting recommendations to users
* If we recommend two tools, we shouldn't expect users to learn both
and pick which one to use; we should explain which problems one
can solve that the other can't or when one is much faster than
the other.
* BFG and filter-repo have similar performance
* All filtering types that BFG can do, filter-repo can also do. In
fact, filter-repo comes with a reimplementation of BFG named
bfg-ish which provides the same user-interface as BFG but with
several bugfixes and new features that are hard to implement in
BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 302 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
9 files changed, 316 insertions(+), 59 deletions(-)
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
into 'git fast-import'.
You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
OPTIONS
-------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..c3f874b692 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,22 @@ SYNOPSIS
[--original <namespace>] [-d <directory>] [-f | --force]
[--state-branch <branch>] [--] [<rev-list options>...]
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
+carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
+mines of filter-branch, and then vigilantly avoid as many of the hazards
+listed there as reasonably possible.
+
+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
+the land mines of filter-branch]
+
DESCRIPTION
-----------
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +461,262 @@ warned.
(or if your git-gc is not new enough to support arguments to
`--prune`, use `git repack -ad; git prune` instead).
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
- unlike git-filter-branch, does not give you the opportunity to
- handle a file differently based on where or when it was committed
- within your history. This constraint gives the core performance
- benefit of The BFG, and is well-suited to the task of cleansing bad
- data - you don't care _where_ the bad data is, you just want it
- _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
- cleansing commit file-trees in parallel. git-filter-branch cleans
- commits sequentially (i.e. in a single-threaded manner), though it
- _is_ possible to write filters that include their own parallelism,
- in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
+[[PERFORMANCE]]
+PERFORMANCE
+-----------
+
+The performance of filter-branch is glacially slow; its design makes it
+impossible for a backward-compatible implementation to ever be fast:
+
+* In editing files, git-filter-branch by design checks out each and
+every commit as it existed in the original repo. If your repo has 10\^5
+files and 10\^5 commits, but each commit only modifies 5 files, then
+git-filter-branch will make you do 10\^10 modifications, despite only
+having (at most) 5*10^5 unique blobs.
+
+* If you try and cheat and try to make filter-branch only work on
+files modified in a commit, then two things happen
+
+ . you run into problems with deletions whenever the user is simply
+ trying to rename files (because attempting to delete files that
+ don't exist looks like a no-op; it takes some chicanery to remap
+ deletes across file renames when the renames happen via arbitrary
+ user-provided shell)
+
+ . even if you succeed at the map-deletes-for-renames chicanery, you
+ still technically violate backward compatibility because users are
+ allowed to filter files in ways that depend upon topology of commits
+ instead of filtering solely based on file contents or names (though
+ I have never seen any user ever do this).
+
+* Even if you don't need to edit files but only want to e.g. rename or
+remove some and thus can avoid checking out each file (i.e. you can use
+--index-filter), you still are passing shell snippets for your filters.
+This means that for every commit, you have to have a prepared git repo
+where users can run git commands. That's a lot of setup. It also means
+you have to fork at least one process to run the user-provided shell
+snippet, and odds are that the user's shell snippet invokes lots of
+commands in some long pipeline, so you will have lots and lots of forks.
+For every. single. commit. That's a massive amount of overhead to
+rename a few files.
+
+* filter-branch is written in shell, which is kind of slow. Naturally,
+it makes sense to want to rewrite that in some other language. However,
+filter-branch documentation states that several additional shell
+functions are provided for users to call, e.g. 'map', 'skip_commit',
+'git_commit_non_empty_tree'. If filter-branch itself isn't a shell
+script, then in order to make those shell functions available to the
+users' shell snippets you have to prepend the shell definitions of these
+functions to every one of the users' shell snippets and thus make these
+special shell functions be parsed with each and every commit.
+
+* filter-branch provides a --setup option which is a shell snippet that
+can be sourced to make shell functions and variables available to all
+other filters. If filter-branch is a shell script, it can simply eval
+this shell snippet once at the beginning. If you try to fix performance
+by making filter-branch not be a shell script, then you have to prepend
+the setup shell snippet to all other filters and parse it with every
+single commit.
+
+* filter-branch writes lots of files to $workdir/../map/ to keep a
+mapping of commits, which it uses pruning commits and remapping to
+ancestors and the map() command more generally. Other files like
+$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
+$tempdir/tree-state are all created internally too. It is possible
+(though strongly discouraged) that users could have accessed any of
+these directly. Users even had a pointer to follow in the form of
+Documentation that the 'map' command existed, which naturally uses the
+$workdir/../map/* files. So, even if you don't have to edit files, for
+strict backward compatibility you need to still write a bunch of files
+to disk somewhere and keep them updated for every commit. You can claim
+it was an implementation detail that users should not have depended
+upon, but the truth is they've had a decade where they could so. So, if
+you want full compatibility, it has to be there. Besides, the
+regression tests depend on at least one of these details, specifying an
+--index-filter that reaches down and grabs backup-refs from $tempdir,
+and thus provides resourceful users who do google searches an example
+that there are files there for them to read and grab and use. (And if
+you want to pass the existing regression tests, you have to at least put
+the backup-refs file there even if it's irrelevant to your
+implementation otherwise.)
+
+All of that said, performance of filter-branch could be improved by
+reimplementing it in a non-shell language and taking a couple small
+liberties with backward compatibility (such as having it only run
+filters on files changed within each commit). filter-repo provides a
+demo script named
+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
+which does exactly that and which passes all the git-filter-branch
+regression tests. It's much faster than git-filter-branch, though it
+suffers from all the same safety issues as git-filter-branch, and is
+still glacially slow compared to
+https://github.com/newren/git-filter-repo/[git filter-repo].
+
+[[SAFETY]]
+SAFETY
+------
+
+filter-branch is riddled with gotchas resulting in various ways to
+easily corrupt repos or end up with a mess worse than what you started
+with:
+
+* Someone can have a set of "working and tested filters" which they
+document or provide to a coworker, who then runs them on a different OS
+where the same commands are not working/tested (some examples in the
+git-filter-branch manpage are also affected by this). BSD vs. GNU
+userland differences can really bite. If you're lucky, you get ugly
+error messages spewed. But just as likely, the commands either don't do
+the filtering requested, or silently corrupt making some unwanted
+change. The unwanted change may only affect a few commits, so it's not
+necessarily obvious either. (The fact that problems won't necessarily
+be obvious means they are likely to go unnoticed until the rewritten
+history is in use for quite a while, at which point it's really hard to
+justify another flag-day for another rewrite.)
+
+* Filenames with spaces (which are rare) are often mishandled by shell
+snippets since they cause problems for shell pipelines. Not everyone is
+familiar with find -print0, xargs -0, ls-files -z, etc. Even people who
+are familiar with these may assume such needs are not relevant because
+someone else renamed any such files in their repo back before the person
+doing the filtering joined the project. And, often, even those familiar
+with handling arguments with spaces my not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
+* Non-ascii filenames (which are rare) can be silently removed despite
+being in a desired directory. The desire to select paths to keep often
+use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
+ls-files will only quote filenames if needed so folks may not notice
+that one of the files didn't match the regex, again until it's much too
+late. Yes, someone who knows about core.quotePath can avoid this
+(unless they have other special characters like \t, \n, or "), and
+people who use ls-files -z with something other than grep can avoid
+this, but that doesn't mean they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
+that includes a double quote character. (This is technically the same
+issue as above with quoting, but perhaps an interesting different way
+that it can and has manifested as a problem.)
+
+* It's far too easy to accidentally mix up old and new history. It's
+still possible with any tool, but filter-branch almost invites it. If
+we're lucky, the only downside is users getting frustrated that they
+don't know how to shrink their repo and remove the old stuff. If we're
+unlucky, they merge old and new history and end up with multiple
+"copies" of each commit, some of which have unwanted or sensitive files
+and others which don't. This comes about in multiple different ways:
+
+ ** the default to only doing a partial history rewrite ('--all' is not
+ the default and over 80% of the examples in the manpage don't use
+ it)
+
+ ** the fact that there's no automatic post-run cleanup
+
+ ** the fact that --tag-name-filter (when used to rename tags) doesn't
+ remove the old tags but just adds new ones with the new name (this
+ manpage has documented this for a long time so it's presumably not
+ a "bug" even though it feels like it)
+
+ ** the fact that little educational information is provided to inform
+ users of the ramifications of a rewrite and how to avoid mixing old
+ and new history. For example, this man page discusses how users
+ need to understand that they need to rebase their changes for all
+ their branches on top of new history (or delete and reclone), but
+ that's only one of multiple concerns to consider. See the
+ "DISCUSSION" section of the git filter-repo manual page for more
+ details.
+
+* Annotated tags can be accidentally converted to lightweight tags, due
+to either of two issues:
+
+ . Someone can do a history rewrite, realize they messed up, restore
+ from the backups in refs/original/, and then redo their
+ filter-branch command. (The backup in refs/original/ is not a real
+ backup; it dereferences tags first.)
+
+ . Running filter-branch with either --tags or --all in your <rev-list
+ options>. In order to retain annotated tags as annotated, you must
+ use --tag-name-filter (and must not have restored from
+ refs/original/ in a previously botched rewrite).
+
+* Any commit messages that specify an encoding will become corrupted
+by the rewrite; filter-branch ignores the encoding, takes the original
+bytes, and feeds it to commit-tree without telling it the proper
+encoding. (This happens whether or not --msg-filter is used, though I
+suspect --msg-filter provides additional ways to really mess things
+up).
+
+* Commit messages (even if they are all UTF-8) by default become
+corrupted due to not being updated -- any references to other commit
+hashes in commit messages will now refer to no-longer-extant commits.
+
+* There are no facilities for helping users find what unwanted crud they
+should delete, which means they are much more likely to have incomplete
+or partial cleanups that sometimes result in confusion and people
+wasting time trying to understand. (For example, folks tend to just
+look for big files to delete instead of big directories or extensions,
+and once they do so, then sometime later folks using the new repository
+who are going through history will notice a build artifact directory
+that has some files but not others, or a cache of dependencies
+(node_modules or similar) which couldn't have ever been functional since
+it's missing some files.)
+
+* If --prune-empty isn't specified, then the filtering process can
+create hoards of confusing empty commits
+
+* If --prune-empty is specified, then intentionally placed empty
+commits from before the filtering operation are also pruned instead of
+just pruning commits that became empty due to filtering rules.
+
+* If --prune empty is specified, sometimes empty commits are missed
+and left around anyway (a somewhat rare bug, but it happens...)
+
+* A minor issue, but users who have a goal to update all names and
+emails in a repository may be led to --env-filter which will only update
+authors and committers, missing taggers.
+
+* If the user provides a --tag-name-filter that maps multiple tags to
+the same name, no warning or error is provided; filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
+only one tag at the end. If you try to "fix" this bug in filter-branch
+and make it error out and warn the user instead, one of the
+filter-branch regression tests will fail. (So, if you are trying to
+make a backward compatible reimplementation you have to add extra code
+to detect collisions and make sure that only the lexicographically last
+one is rewritten to avoid fast-import from seeing both since fast-import
+will naturally do the sane thing and error out if told to write the same
+tag more than once.)
+
+Also, the poor performance of filter-branch often leads to safety issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
+such as deleting a couple files. People have often come to me for help,
+so I should be practiced and an expert, but even for fairly simple cases
+I still sometimes take over 10 minutes and several iterations to get
+the right commands -- and that's assuming they are working on a tiny
+repository. Unfortunately, people often learn if the snippet is right
+or wrong by trying it out, but the rightness or wrongness can vary
+depending on special circumstances (spaces in filenames, non-ascii
+filenames, funny author names or emails, invalid timezones, presence of
+grafts or replace objects, etc.), meaning they may have to wait a long
+time, hit an error, then restart. The performance of filter-branch is
+so bad that this cycle is painful, reducing the time available to
+carefully re-check (to say nothing about what it does to the patience of
+the person doing the rewrite even if they do technically have more time
+available). This problem is extra compounded because errors from broken
+filters may not be shown for a long time and/or get lost in a sea of
+output. Even worse, broken filters often just result in silent
+incorrect rewrites.
+
+* To top it all off, even when users finally find working commands, they
+naturally want to share them. But they may be unaware that their repo
+didn't have some special cases that someone else's does. So, when
+someone else with a different repository runs the same commands, they
+get hit by the problems above. Or, the user just runs commands that
+really were vetted for special cases, but they run it on a different OS
+where it doesn't work, as noted above.
GIT
---
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@ NOTES
-----
'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace. If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..a8cfc0ad82 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,8 @@ Hard case: The changes are not the same.::
This happens if the 'subsystem' rebase had conflicts, or used
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
+ a full history rewriting command like
+ https://github.com/newren/git-filter-repo[`filter-repo`].
The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..f271d758c3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@ The following format are available:
CREATING REPLACEMENT OBJECTS
----------------------------
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
SEE ALSO
--------
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
linkgit:git-var[1]
linkgit:git[1]
+https://github.com/newren/git-filter-repo[git-filter-repo]
GIT
---
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..53774f5b64 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
+
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
-and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
-reformatting of metadata for ease-of-reading and rewriting authorship
-info for non-"svn.authorsFile" users.
+reports, and archives. If you plan to eventually migrate from SVN to
+Git and are certain about dropping SVN history, consider
+https://github.com/newren/git-filter-repo[git-filter-repo] instead.
+filter-repo also allows reformatting of metadata for ease-of-reading
+and rewriting authorship info for non-"svn.authorsFile" users.
svn.useSvmProps::
svn-remote.<name>.useSvmProps::
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..5a789c91df 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,10 +425,12 @@ post-rewrite
This hook is invoked by commands that rewrite commits
(linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
-arguments may be passed in the future.
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or
+https://github.com/newren/git-filter-repo[git-filter-repo] typically
+do not call it!). Its first argument denotes the command it was
+invoked by: currently one of `amend` or `rebase`. Further
+command-dependent arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
format
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@ line. This line has the form `git-svn-id: URL@REVNO UUID`.
The resulting repository will generally require further processing
to put each project in its own repository and to separate the history
-of each branch. The 'git filter-branch --subdirectory-filter' command
+of each branch. The 'git filter-repo --subdirectory-filter' command
may be useful for this purpose.
BUGS
@@ -67,5 +67,5 @@ The exit status does not reflect whether an error was detected.
SEE ALSO
--------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 5c5afa2b98..f805965d87 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,19 @@ set_ident () {
finish_ident COMMITTER
}
+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
+ -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+ rewrites. Please use an alternative filtering tool such as 'git
+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+ See the filter-branch manual page for more details; to squelch
+ this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+ sleep 5
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
[--tree-filter <command>] [--index-filter <command>]
[--parent-filter <command>] [--msg-filter <command>]
--
2.23.0.3.g59c7446927.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-29 0:06 ` [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-08-29 18:10 ` Eric Sunshine
2019-08-30 0:04 ` Elijah Newren
0 siblings, 1 reply; 73+ messages in thread
From: Eric Sunshine @ 2019-08-29 18:10 UTC (permalink / raw)
To: Elijah Newren
Cc: Git List, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder
On Wed, Aug 28, 2019 at 8:07 PM Elijah Newren <newren@gmail.com> wrote:
> filter-branch suffers from a deluge of disguised dangers that disfigure
> history rewrites (i.e. deviate from the deliberate changes). [...]
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> @@ -16,6 +16,22 @@ SYNOPSIS
> +WARNING
> +-------
> +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> +manglings of the intended history rewrite (and can leave you with little
> +time to investigate such problems since it has such abysmal performance).
> +These safety and performance issues cannot be backward compatibly fixed and
> +as such, its use is not recommended. Please use an alternative history
> +filtering tool such as https://github.com/newren/git-filter-repo/[git
> +filter-repo]. If you still need to use 'git filter-branch', please
> +carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
> +mines of filter-branch, and then vigilantly avoid as many of the hazards
> +listed there as reasonably possible.
> +
> +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> +the land mines of filter-branch]
This stray link looks like leftover gunk from the previous revision.
> +PERFORMANCE
> +-----------
> +
> +The performance of filter-branch is glacially slow; its design makes it
The rest of this document spells it git-filter-branch or 'git
filter-branch', not plain filter-branch.
> +* In editing files, git-filter-branch by design checks out each and
> +every commit as it existed in the original repo. If your repo has 10\^5
> +files and 10\^5 commits, but each commit only modifies 5 files, then
> +git-filter-branch will make you do 10\^10 modifications, despite only
> +having (at most) 5*10^5 unique blobs.
> +
> +* If you try and cheat and try to make filter-branch only work on
> +files modified in a commit, then two things happen
s/filter-branch/git-&/
> +
> + . you run into problems with deletions whenever the user is simply
> + trying to rename files (because attempting to delete files that
> + don't exist looks like a no-op; it takes some chicanery to remap
> + deletes across file renames when the renames happen via arbitrary
> + user-provided shell)
> +
> + . even if you succeed at the map-deletes-for-renames chicanery, you
> + still technically violate backward compatibility because users are
> + allowed to filter files in ways that depend upon topology of commits
> + instead of filtering solely based on file contents or names (though
> + I have never seen any user ever do this).
Maybe avoid first-person:
... contents or names (though this has not been observed in
the wild).
> +* filter-branch is written in shell, which is kind of slow. Naturally,
> +it makes sense to want to rewrite that in some other language. However,
> +filter-branch documentation states that several additional shell
> +functions are provided for users to call, e.g. 'map', 'skip_commit',
> +'git_commit_non_empty_tree'. If filter-branch itself isn't a shell
> +script, then in order to make those shell functions available to the
> +users' shell snippets you have to prepend the shell definitions of these
> +functions to every one of the users' shell snippets and thus make these
> +special shell functions be parsed with each and every commit.
> +
> +* filter-branch provides a --setup option which is a shell snippet that
> +can be sourced to make shell functions and variables available to all
> +other filters. If filter-branch is a shell script, it can simply eval
> +this shell snippet once at the beginning. If you try to fix performance
> +by making filter-branch not be a shell script, then you have to prepend
> +the setup shell snippet to all other filters and parse it with every
> +single commit.
Even though they made sense in the context of the original email
message, these two bullet points may not belong in the man page since
someone reading the man page is doing so to learn about
git-filter-branch usage, not because he or she is thinking about
re-implementing it. It might make sense, however, to collapse these
points to some general statement about shell being slow and process
startup being costly.
Also, these bullet points and others below need a s/filter-branch/git-&/.
> +* filter-branch writes lots of files to $workdir/../map/ to keep a
Should that path have three dots "..." instead of two ".."?
> +mapping of commits, which it uses pruning commits and remapping to
> +ancestors and the map() command more generally. Other files like
> +$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
> +$tempdir/tree-state are all created internally too. It is possible
> +(though strongly discouraged) that users could have accessed any of
> +these directly. Users even had a pointer to follow in the form of
> +Documentation that the 'map' command existed, which naturally uses the
> +$workdir/../map/* files. So, even if you don't have to edit files, for
> +strict backward compatibility you need to still write a bunch of files
> +to disk somewhere and keep them updated for every commit. You can claim
> +it was an implementation detail that users should not have depended
> +upon, but the truth is they've had a decade where they could so. So, if
> +you want full compatibility, it has to be there. Besides, the
> +regression tests depend on at least one of these details, specifying an
> +--index-filter that reaches down and grabs backup-refs from $tempdir,
> +and thus provides resourceful users who do google searches an example
> +that there are files there for them to read and grab and use. (And if
> +you want to pass the existing regression tests, you have to at least put
> +the backup-refs file there even if it's irrelevant to your
> +implementation otherwise.)
As with the earlier comment, this bullet point is aimed at someone
thinking about re-implementing the command; it sounds out of place in
the "Performance" section of the man page. However, it does make sense
to mention all the files git-filter-branch creates since that can have
an impact on performance. So, perhaps this section can be collapsed so
it just talks about that.
> +All of that said, performance of filter-branch could be improved by
> +reimplementing it in a non-shell language and taking a couple small
> +liberties with backward compatibility (such as having it only run
> +filters on files changed within each commit). filter-repo provides a
> +demo script named
> +https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
> +which does exactly that and which passes all the git-filter-branch
> +regression tests. It's much faster than git-filter-branch, though it
> +suffers from all the same safety issues as git-filter-branch, and is
> +still glacially slow compared to
> +https://github.com/newren/git-filter-repo/[git filter-repo].
This paragraph could be collapsed to say merely that, for those with
existing tooling relying upon git-filter-branch, filter-repo's
"filter-lamely" provides a drop-in replacement with somewhat improved
performance and a few caveats.
Taking the above comments into consideration, here is a possible
rewrite of the final three bullet points and the closing paragraph:
* filter-branch is written in shell, which is kind of slow, and it
potentially can run many other commands which can slow down its
operation significantly, especially on platforms for which
process startup is costly.
* filter-branch writes lots of files to $workdir/.../map/ to keep
a mapping of commits, which it uses for pruning commits and
remapping to ancestors and for the map() command more generally.
Other files like $tempdir/backup-refs, $tempdir/raw-refs,
$tempdir/heads, $tempdir/tree-state are created internally too.
Such file creation can be costly in general, but especially on
platforms with slow filesystems.
The tool https://github.com/newren/git-filter-repo/[git
filter-repo] is an alternative to git-filter-branch which does not
suffer from these performance problems or the safety problems
(mentioned below). For those with existing tooling which relies
upon git-filter-branch, 'git repo-filter' also provides
https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
a drop-in git-filter-branch replacement (with a few caveats).
> +SAFETY
> +------
> +
> +* Non-ascii filenames (which are rare) can be silently removed despite
Perhaps drop "(which are rare)" to make this sound more formal and
less like an email message.
Comment below also are intended to make the prose sound a bit more formal.
> +being in a desired directory. The desire to select paths to keep often
> +use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
> +ls-files will only quote filenames if needed so folks may not notice
s/ls-files/git-&/
> +that one of the files didn't match the regex, again until it's much too
> +late. Yes, someone who knows about core.quotePath can avoid this
> +(unless they have other special characters like \t, \n, or "), and
> +people who use ls-files -z with something other than grep can avoid
> +this, but that doesn't mean they will.
> +
> +* It's far too easy to accidentally mix up old and new history. It's
> +still possible with any tool, but filter-branch almost invites it. If
> +we're lucky, the only downside is users getting frustrated that they
s/we're//
> +don't know how to shrink their repo and remove the old stuff. If we're
s/we're//
> +unlucky, they merge old and new history and end up with multiple
> +"copies" of each commit, some of which have unwanted or sensitive files
> +and others which don't. This comes about in multiple different ways:
> +
> + ** the default to only doing a partial history rewrite ('--all' is not
> + the default and over 80% of the examples in the manpage don't use
> + it)
Maybe just shorten this to:
('--all is not the default, and few examples show it)
> + ** the fact that there's no automatic post-run cleanup
> +
> + ** the fact that --tag-name-filter (when used to rename tags) doesn't
> + remove the old tags but just adds new ones with the new name (this
> + manpage has documented this for a long time so it's presumably not
> + a "bug" even though it feels like it)
Perhaps drop the final parenthetical comment.
> + ** the fact that little educational information is provided to inform
> + users of the ramifications of a rewrite and how to avoid mixing old
> + and new history. For example, this man page discusses how users
> + need to understand that they need to rebase their changes for all
> + their branches on top of new history (or delete and reclone), but
> + that's only one of multiple concerns to consider. See the
> + "DISCUSSION" section of the git filter-repo manual page for more
> + details.
> +
> +* Annotated tags can be accidentally converted to lightweight tags, due
> +to either of two issues:
> +
> + . Someone can do a history rewrite, realize they messed up, restore
> + from the backups in refs/original/, and then redo their
> + filter-branch command. (The backup in refs/original/ is not a real
> + backup; it dereferences tags first.)
> +
> + . Running filter-branch with either --tags or --all in your <rev-list
> + options>. In order to retain annotated tags as annotated, you must
> + use --tag-name-filter (and must not have restored from
> + refs/original/ in a previously botched rewrite).
Should these bullet points use "**" rather than "."?
> +* Any commit messages that specify an encoding will become corrupted
> +by the rewrite; filter-branch ignores the encoding, takes the original
> +bytes, and feeds it to commit-tree without telling it the proper
> +encoding. (This happens whether or not --msg-filter is used, though I
> +suspect --msg-filter provides additional ways to really mess things
> +up).
Perhaps shorten simply to:
(This happens whether or not --msg-filter is used.)
> +* If the user provides a --tag-name-filter that maps multiple tags to
> +the same name, no warning or error is provided; filter-branch simply
> +overwrites each tag in some undocumented pre-defined order resulting in
> +only one tag at the end. If you try to "fix" this bug in filter-branch
> +and make it error out and warn the user instead, one of the
> +filter-branch regression tests will fail. (So, if you are trying to
> +make a backward compatible reimplementation you have to add extra code
> +to detect collisions and make sure that only the lexicographically last
> +one is rewritten to avoid fast-import from seeing both since fast-import
> +will naturally do the sane thing and error out if told to write the same
> +tag more than once.)
Maybe drop everything from "If you try to 'fix'..." to the end of paragraph.
> +Also, the poor performance of filter-branch often leads to safety issues:
> +
> +* Coming up with the correct shell snippet to do the filtering you want
> +is sometimes difficult unless you're just doing a trivial modification
> +such as deleting a couple files. People have often come to me for help,
> +so I should be practiced and an expert, but even for fairly simple cases
> +I still sometimes take over 10 minutes and several iterations to get
> +the right commands -- and that's assuming they are working on a tiny
> +repository. Unfortunately, people often learn if the snippet is right
> +or wrong by trying it out, but the rightness or wrongness can vary
> +depending on special circumstances (spaces in filenames, non-ascii
> +filenames, funny author names or emails, invalid timezones, presence of
> +grafts or replace objects, etc.), meaning they may have to wait a long
> +time, hit an error, then restart. The performance of filter-branch is
> +so bad that this cycle is painful, reducing the time available to
> +carefully re-check (to say nothing about what it does to the patience of
> +the person doing the rewrite even if they do technically have more time
> +available). This problem is extra compounded because errors from broken
> +filters may not be shown for a long time and/or get lost in a sea of
> +output. Even worse, broken filters often just result in silent
> +incorrect rewrites.
Drop the "People have often come to me..." sentence from this paragraph.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-29 18:10 ` Eric Sunshine
@ 2019-08-30 0:04 ` Elijah Newren
0 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 0:04 UTC (permalink / raw)
To: Eric Sunshine
Cc: Git List, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder
Hi Eric,
Thanks for the careful and thoughtful review.
On Thu, Aug 29, 2019 at 11:11 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Wed, Aug 28, 2019 at 8:07 PM Elijah Newren <newren@gmail.com> wrote:
> > filter-branch suffers from a deluge of disguised dangers that disfigure
> > history rewrites (i.e. deviate from the deliberate changes). [...]
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> > diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> > @@ -16,6 +16,22 @@ SYNOPSIS
> > +WARNING
> > +-------
> > +'git filter-branch' has a plethora of pitfalls that can produce non-obvious
> > +manglings of the intended history rewrite (and can leave you with little
> > +time to investigate such problems since it has such abysmal performance).
> > +These safety and performance issues cannot be backward compatibly fixed and
> > +as such, its use is not recommended. Please use an alternative history
> > +filtering tool such as https://github.com/newren/git-filter-repo/[git
> > +filter-repo]. If you still need to use 'git filter-branch', please
> > +carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
> > +mines of filter-branch, and then vigilantly avoid as many of the hazards
> > +listed there as reasonably possible.
> > +
> > +https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
> > +the land mines of filter-branch]
>
> This stray link looks like leftover gunk from the previous revision.
Ugh, indeed.
>
> > +PERFORMANCE
> > +-----------
> > +
> > +The performance of filter-branch is glacially slow; its design makes it
>
> The rest of this document spells it git-filter-branch or 'git
> filter-branch', not plain filter-branch.
>
> > +* In editing files, git-filter-branch by design checks out each and
> > +every commit as it existed in the original repo. If your repo has 10\^5
> > +files and 10\^5 commits, but each commit only modifies 5 files, then
> > +git-filter-branch will make you do 10\^10 modifications, despite only
> > +having (at most) 5*10^5 unique blobs.
> > +
> > +* If you try and cheat and try to make filter-branch only work on
> > +files modified in a commit, then two things happen
>
> s/filter-branch/git-&/
I can fix these up.
>
> > +
> > + . you run into problems with deletions whenever the user is simply
> > + trying to rename files (because attempting to delete files that
> > + don't exist looks like a no-op; it takes some chicanery to remap
> > + deletes across file renames when the renames happen via arbitrary
> > + user-provided shell)
> > +
> > + . even if you succeed at the map-deletes-for-renames chicanery, you
> > + still technically violate backward compatibility because users are
> > + allowed to filter files in ways that depend upon topology of commits
> > + instead of filtering solely based on file contents or names (though
> > + I have never seen any user ever do this).
>
> Maybe avoid first-person:
>
> ... contents or names (though this has not been observed in
> the wild).
Thanks for providing alternative wording.
> > +* filter-branch is written in shell, which is kind of slow. Naturally,
> > +it makes sense to want to rewrite that in some other language. However,
> > +filter-branch documentation states that several additional shell
> > +functions are provided for users to call, e.g. 'map', 'skip_commit',
> > +'git_commit_non_empty_tree'. If filter-branch itself isn't a shell
> > +script, then in order to make those shell functions available to the
> > +users' shell snippets you have to prepend the shell definitions of these
> > +functions to every one of the users' shell snippets and thus make these
> > +special shell functions be parsed with each and every commit.
> > +
> > +* filter-branch provides a --setup option which is a shell snippet that
> > +can be sourced to make shell functions and variables available to all
> > +other filters. If filter-branch is a shell script, it can simply eval
> > +this shell snippet once at the beginning. If you try to fix performance
> > +by making filter-branch not be a shell script, then you have to prepend
> > +the setup shell snippet to all other filters and parse it with every
> > +single commit.
>
> Even though they made sense in the context of the original email
> message, these two bullet points may not belong in the man page since
> someone reading the man page is doing so to learn about
> git-filter-branch usage, not because he or she is thinking about
> re-implementing it. It might make sense, however, to collapse these
> points to some general statement about shell being slow and process
> startup being costly.
Hmm. I see where you're coming from, but the performance section
isn't really user actionable stuff anyway; it's just a warning. And I
have repeatedly seen over the years the question brought up on the
list of "Can we make filter-branch fast by making it a builtin?" (Or
"Can't _you_ make filter-branch fast by rewriting it in C?")
I could try to reword it so that there's some general statement about
shell being slow and process startup being costly, and then add these
two items as sub-bullets to try to stave off that obvious but
misguided question from coming up. Or maybe I just add a reference to
the original email?
> Also, these bullet points and others below need a s/filter-branch/git-&/.
Thanks, will fix.
> > +* filter-branch writes lots of files to $workdir/../map/ to keep a
>
> Should that path have three dots "..." instead of two ".."?
No, it's a literal parent directory reference. Users have access to
$workdir; it's where their commands run. There is no name for the
parent of that directory, other than by appending '/..' to wherever
they are. Maybe if I had spelled it as $(pwd)/../map/ it would be
better?
Or maybe I don't need to name the files at all; does it really matter
to the user?
> > +mapping of commits, which it uses pruning commits and remapping to
> > +ancestors and the map() command more generally. Other files like
> > +$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
> > +$tempdir/tree-state are all created internally too. It is possible
> > +(though strongly discouraged) that users could have accessed any of
> > +these directly. Users even had a pointer to follow in the form of
> > +Documentation that the 'map' command existed, which naturally uses the
> > +$workdir/../map/* files. So, even if you don't have to edit files, for
> > +strict backward compatibility you need to still write a bunch of files
> > +to disk somewhere and keep them updated for every commit. You can claim
> > +it was an implementation detail that users should not have depended
> > +upon, but the truth is they've had a decade where they could so. So, if
> > +you want full compatibility, it has to be there. Besides, the
> > +regression tests depend on at least one of these details, specifying an
> > +--index-filter that reaches down and grabs backup-refs from $tempdir,
> > +and thus provides resourceful users who do google searches an example
> > +that there are files there for them to read and grab and use. (And if
> > +you want to pass the existing regression tests, you have to at least put
> > +the backup-refs file there even if it's irrelevant to your
> > +implementation otherwise.)
>
> As with the earlier comment, this bullet point is aimed at someone
> thinking about re-implementing the command; it sounds out of place in
> the "Performance" section of the man page. However, it does make sense
> to mention all the files git-filter-branch creates since that can have
> an impact on performance. So, perhaps this section can be collapsed so
> it just talks about that.
I think there's both a how-performance-affects-user component and a
component addressing the common incorrect question/statement/thought
that filter-branch performance could just be fixed by making it a
builtin. But splitting this may make sense. And maybe the portions
addressing making-it-a-builtin-wouldn't-fix-it could be a short
sentence with a link to the original email for more details.
> > +All of that said, performance of filter-branch could be improved by
> > +reimplementing it in a non-shell language and taking a couple small
> > +liberties with backward compatibility (such as having it only run
> > +filters on files changed within each commit). filter-repo provides a
> > +demo script named
> > +https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
> > +which does exactly that and which passes all the git-filter-branch
> > +regression tests. It's much faster than git-filter-branch, though it
> > +suffers from all the same safety issues as git-filter-branch, and is
> > +still glacially slow compared to
> > +https://github.com/newren/git-filter-repo/[git filter-repo].
>
> This paragraph could be collapsed to say merely that, for those with
> existing tooling relying upon git-filter-branch, filter-repo's
> "filter-lamely" provides a drop-in replacement with somewhat improved
> performance and a few caveats.
Sounds good.
> Taking the above comments into consideration, here is a possible
> rewrite of the final three bullet points and the closing paragraph:
Oh, sweet, thanks for providing this. I really like the simplicity of
your suggested wording in general; it will be really helpful in
rewording. I do have a nitpick with each one, though...
> * filter-branch is written in shell, which is kind of slow, and it
> potentially can run many other commands which can slow down its
> operation significantly, especially on platforms for which
> process startup is costly.
Even if it's not the emphasis you intended, I'm worried this makes it
sound as if filter-branch performance is only bad on Windows or Mac.
Compared to invoking a function (even in a bytecode interpreted
language), creating and running another process is slow on any
platform.
> * filter-branch writes lots of files to $workdir/.../map/ to keep
> a mapping of commits, which it uses for pruning commits and
> remapping to ancestors and for the map() command more generally.
> Other files like $tempdir/backup-refs, $tempdir/raw-refs,
> $tempdir/heads, $tempdir/tree-state are created internally too.
> Such file creation can be costly in general, but especially on
> platforms with slow filesystems.
Again, it may not have been your intended emphasis, but I think this
may be read as singling out slow filesystems, and make people think
the performance problems from this bullet point only affects some
OSes. Filesystems are part of the problem. Disks being slow is part
of the problem. But it's not all of it. I guess part of what really
gets me with these is that they represent forced synchronization (e.g.
the kernel has to flush the data upon close() to make sure any other
processes can see the file contents and all further filtering is
blocked waiting for this to finish). By way of comparison, in
filter-repo I have to both write data to fast-import and read back
information from fast-import (in order to find out the new commit
names, for example). When I did the straightforward thing of writing
a commit, writing a 'get-mark' directive, and then reading the answer,
it ruined performance. So I had to be a bit smarter and defer reading
back the resulting sha1. There's no room for anything similarly
clever in filter-branch; writing these files out is a synchronization
point that is needed before the user's filter can be eval'ed.
> The tool https://github.com/newren/git-filter-repo/[git
> filter-repo] is an alternative to git-filter-branch which does not
> suffer from these performance problems or the safety problems
> (mentioned below). For those with existing tooling which relies
> upon git-filter-branch, 'git repo-filter' also provides
> https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
> a drop-in git-filter-branch replacement (with a few caveats).
This suggests filter-lamely doesn't suffer from performance or safety
problems, which is very misleading. filter-lamely doesn't improve the
safety story at all and only ameliorates the performance problems
somewhat.
> > +SAFETY
> > +------
> > +
> > +* Non-ascii filenames (which are rare) can be silently removed despite
>
> Perhaps drop "(which are rare)" to make this sound more formal and
> less like an email message.
Makes sense; and I'm guessing I should also drop it from the bullet
point above this one.
I'll stop commenting on the individual comments since there's not much
to say with most of them other than they look like obviously good
suggestions...
> Comment below also are intended to make the prose sound a bit more formal.
>
> > +being in a desired directory. The desire to select paths to keep often
> > +use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
> > +ls-files will only quote filenames if needed so folks may not notice
>
> s/ls-files/git-&/
>
> > +that one of the files didn't match the regex, again until it's much too
> > +late. Yes, someone who knows about core.quotePath can avoid this
> > +(unless they have other special characters like \t, \n, or "), and
> > +people who use ls-files -z with something other than grep can avoid
> > +this, but that doesn't mean they will.
> > +
> > +* It's far too easy to accidentally mix up old and new history. It's
> > +still possible with any tool, but filter-branch almost invites it. If
> > +we're lucky, the only downside is users getting frustrated that they
>
> s/we're//
>
> > +don't know how to shrink their repo and remove the old stuff. If we're
>
> s/we're//
>
> > +unlucky, they merge old and new history and end up with multiple
> > +"copies" of each commit, some of which have unwanted or sensitive files
> > +and others which don't. This comes about in multiple different ways:
> > +
> > + ** the default to only doing a partial history rewrite ('--all' is not
> > + the default and over 80% of the examples in the manpage don't use
> > + it)
>
> Maybe just shorten this to:
>
> ('--all is not the default, and few examples show it)
I know I said I'd not comment unless I disagreed, but I just wanted to
say thanks so much for providing concrete suggestions in so many
places. It's *very* helpful.
> > + ** the fact that there's no automatic post-run cleanup
> > +
> > + ** the fact that --tag-name-filter (when used to rename tags) doesn't
> > + remove the old tags but just adds new ones with the new name (this
> > + manpage has documented this for a long time so it's presumably not
> > + a "bug" even though it feels like it)
>
> Perhaps drop the final parenthetical comment.
>
> > + ** the fact that little educational information is provided to inform
> > + users of the ramifications of a rewrite and how to avoid mixing old
> > + and new history. For example, this man page discusses how users
> > + need to understand that they need to rebase their changes for all
> > + their branches on top of new history (or delete and reclone), but
> > + that's only one of multiple concerns to consider. See the
> > + "DISCUSSION" section of the git filter-repo manual page for more
> > + details.
> > +
> > +* Annotated tags can be accidentally converted to lightweight tags, due
> > +to either of two issues:
> > +
> > + . Someone can do a history rewrite, realize they messed up, restore
> > + from the backups in refs/original/, and then redo their
> > + filter-branch command. (The backup in refs/original/ is not a real
> > + backup; it dereferences tags first.)
> > +
> > + . Running filter-branch with either --tags or --all in your <rev-list
> > + options>. In order to retain annotated tags as annotated, you must
> > + use --tag-name-filter (and must not have restored from
> > + refs/original/ in a previously botched rewrite).
>
> Should these bullet points use "**" rather than "."?
I guess it could but the "either of two issues" above it made me think
of numbering them. I also had a couple sub-bullets in the performance
section that were numbered. But I guess it is slightly weird coming
so close after another section that used un-numbered subbullets. I
guess I'll just make them all un-numbered.
> > +* Any commit messages that specify an encoding will become corrupted
> > +by the rewrite; filter-branch ignores the encoding, takes the original
> > +bytes, and feeds it to commit-tree without telling it the proper
> > +encoding. (This happens whether or not --msg-filter is used, though I
> > +suspect --msg-filter provides additional ways to really mess things
> > +up).
>
> Perhaps shorten simply to:
>
> (This happens whether or not --msg-filter is used.)
>
> > +* If the user provides a --tag-name-filter that maps multiple tags to
> > +the same name, no warning or error is provided; filter-branch simply
> > +overwrites each tag in some undocumented pre-defined order resulting in
> > +only one tag at the end. If you try to "fix" this bug in filter-branch
> > +and make it error out and warn the user instead, one of the
> > +filter-branch regression tests will fail. (So, if you are trying to
> > +make a backward compatible reimplementation you have to add extra code
> > +to detect collisions and make sure that only the lexicographically last
> > +one is rewritten to avoid fast-import from seeing both since fast-import
> > +will naturally do the sane thing and error out if told to write the same
> > +tag more than once.)
>
> Maybe drop everything from "If you try to 'fix'..." to the end of paragraph.
Or just replace that long section you highlight with a parenthetical
comment, "(a git-filter-branch regression test requires this.)"
> > +Also, the poor performance of filter-branch often leads to safety issues:
> > +
> > +* Coming up with the correct shell snippet to do the filtering you want
> > +is sometimes difficult unless you're just doing a trivial modification
> > +such as deleting a couple files. People have often come to me for help,
> > +so I should be practiced and an expert, but even for fairly simple cases
> > +I still sometimes take over 10 minutes and several iterations to get
> > +the right commands -- and that's assuming they are working on a tiny
> > +repository. Unfortunately, people often learn if the snippet is right
> > +or wrong by trying it out, but the rightness or wrongness can vary
> > +depending on special circumstances (spaces in filenames, non-ascii
> > +filenames, funny author names or emails, invalid timezones, presence of
> > +grafts or replace objects, etc.), meaning they may have to wait a long
> > +time, hit an error, then restart. The performance of filter-branch is
> > +so bad that this cycle is painful, reducing the time available to
> > +carefully re-check (to say nothing about what it does to the patience of
> > +the person doing the rewrite even if they do technically have more time
> > +available). This problem is extra compounded because errors from broken
> > +filters may not be shown for a long time and/or get lost in a sea of
> > +output. Even worse, broken filters often just result in silent
> > +incorrect rewrites.
>
> Drop the "People have often come to me..." sentence from this paragraph.
Thanks again for the careful reading and many suggestions!
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v3 4/4] t9902: use a non-deprecated command for testing
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (2 preceding siblings ...)
2019-08-29 0:06 ` [PATCH v3 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-08-29 0:06 ` Elijah Newren
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
4 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-29 0:06 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
t9902 had a list of three random porcelain commands as a sanity check,
one of which was filter-branch. Since we are recommending people not
use filter-branch, let's update this test to use rebase instead of
filter-branch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t9902-completion.sh | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 75512c3403..4e7f669c76 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -28,10 +28,10 @@ complete ()
#
# (2) A test makes sure that common subcommands are included in the
# completion for "git <TAB>", and a plumbing is excluded. "add",
-# "filter-branch" and "ls-files" are listed for this.
+# "rebase" and "ls-files" are listed for this.
-GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr filter-branch ls-files'
-GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout filter-branch'
+GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr rebase ls-files'
+GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout rebase'
. "$GIT_BUILD_DIR/contrib/completion/git-completion.bash"
@@ -1392,12 +1392,12 @@ test_expect_success 'basic' '
# built-in
grep -q "^add \$" out &&
# script
- grep -q "^filter-branch \$" out &&
+ grep -q "^rebase \$" out &&
# plumbing
! grep -q "^ls-files \$" out &&
- run_completion "git f" &&
- ! grep -q -v "^f" out
+ run_completion "git r" &&
+ ! grep -q -v "^r" out
'
test_expect_success 'double dash "git" itself' '
--
2.23.0.3.g59c7446927.dirty
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (3 preceding siblings ...)
2019-08-29 0:06 ` [PATCH v3 4/4] t9902: use a non-deprecated command for testing Elijah Newren
@ 2019-08-30 5:57 ` Elijah Newren
2019-08-30 5:57 ` [PATCH v4 1/4] t6006: simplify and optimize empty message test Elijah Newren
` (3 more replies)
4 siblings, 4 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 5:57 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Here's a series warns about git-filter-branch usage and avoids it
ourselves.
Changes since v3
* Incorporated Eric's detailed feedback on the git-filter-branch
manpage, some notes:
* s/filter-branch/git-&/ (and similar for ls-files)
* Multiple sections removed (and existing sections had a
number of sentences removed)
* I ended up not linking to the original html, but just added
a small "Side Note" in a sub-bullet to address how fixing the
written-in-shell attribute of git-filter-branch would do less
than proponents expect.
* ...and lots of other miscellaneous wording fixes and cleanups
* The full range-diff is below, but it's kinda hard to read due to
line wrapping and such.
Elijah Newren (4):
t6006: simplify and optimize empty message test
t3427: accelerate this test by using fast-export and fast-import
Recommend git-filter-repo instead of git-filter-branch
t9902: use a non-deprecated command for testing
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 272 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
t/t3427-rebase-subtree.sh | 24 ++-
t/t6006-rev-list-format.sh | 5 +-
t/t9902-completion.sh | 12 +-
12 files changed, 309 insertions(+), 77 deletions(-)
Range-diff:
1: 7ddbeea2ca = 1: 7ddbeea2ca t6006: simplify and optimize empty message test
2: e1e63189c1 = 2: e1e63189c1 t3427: accelerate this test by using fast-export and fast-import
3: 59c7446927 ! 3: ed6505584f Recommend git-filter-repo instead of git-filter-branch
@@ Documentation/git-filter-branch.txt: SYNOPSIS
+carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
+mines of filter-branch, and then vigilantly avoid as many of the hazards
+listed there as reasonably possible.
-+
-+https://public-inbox.org/git/CABPp-BEDOH-row-hxY4u_cP30ptqOpcCvPibwyZ2wBu142qUbA@mail.gmail.com/[detailing
-+the land mines of filter-branch]
+
DESCRIPTION
-----------
@@ Documentation/git-filter-branch.txt: warned.
+PERFORMANCE
+-----------
+
-+The performance of filter-branch is glacially slow; its design makes it
++The performance of git-filter-branch is glacially slow; its design makes it
+impossible for a backward-compatible implementation to ever be fast:
+
+* In editing files, git-filter-branch by design checks out each and
@@ Documentation/git-filter-branch.txt: warned.
+git-filter-branch will make you do 10\^10 modifications, despite only
+having (at most) 5*10^5 unique blobs.
+
-+* If you try and cheat and try to make filter-branch only work on
++* If you try and cheat and try to make git-filter-branch only work on
+files modified in a commit, then two things happen
+
-+ . you run into problems with deletions whenever the user is simply
-+ trying to rename files (because attempting to delete files that
-+ don't exist looks like a no-op; it takes some chicanery to remap
-+ deletes across file renames when the renames happen via arbitrary
-+ user-provided shell)
++ ** you run into problems with deletions whenever the user is simply
++ trying to rename files (because attempting to delete files that
++ don't exist looks like a no-op; it takes some chicanery to remap
++ deletes across file renames when the renames happen via arbitrary
++ user-provided shell)
+
-+ . even if you succeed at the map-deletes-for-renames chicanery, you
-+ still technically violate backward compatibility because users are
-+ allowed to filter files in ways that depend upon topology of commits
-+ instead of filtering solely based on file contents or names (though
-+ I have never seen any user ever do this).
++ ** even if you succeed at the map-deletes-for-renames chicanery, you
++ still technically violate backward compatibility because users are
++ allowed to filter files in ways that depend upon topology of
++ commits instead of filtering solely based on file contents or names
++ (though this has not been observed in the wild).
+
+* Even if you don't need to edit files but only want to e.g. rename or
+remove some and thus can avoid checking out each file (i.e. you can use
+--index-filter), you still are passing shell snippets for your filters.
+This means that for every commit, you have to have a prepared git repo
-+where users can run git commands. That's a lot of setup. It also means
-+you have to fork at least one process to run the user-provided shell
-+snippet, and odds are that the user's shell snippet invokes lots of
-+commands in some long pipeline, so you will have lots and lots of forks.
-+For every. single. commit. That's a massive amount of overhead to
-+rename a few files.
-+
-+* filter-branch is written in shell, which is kind of slow. Naturally,
-+it makes sense to want to rewrite that in some other language. However,
-+filter-branch documentation states that several additional shell
-+functions are provided for users to call, e.g. 'map', 'skip_commit',
-+'git_commit_non_empty_tree'. If filter-branch itself isn't a shell
-+script, then in order to make those shell functions available to the
-+users' shell snippets you have to prepend the shell definitions of these
-+functions to every one of the users' shell snippets and thus make these
-+special shell functions be parsed with each and every commit.
-+
-+* filter-branch provides a --setup option which is a shell snippet that
-+can be sourced to make shell functions and variables available to all
-+other filters. If filter-branch is a shell script, it can simply eval
-+this shell snippet once at the beginning. If you try to fix performance
-+by making filter-branch not be a shell script, then you have to prepend
-+the setup shell snippet to all other filters and parse it with every
-+single commit.
-+
-+* filter-branch writes lots of files to $workdir/../map/ to keep a
-+mapping of commits, which it uses pruning commits and remapping to
-+ancestors and the map() command more generally. Other files like
-+$tempdir/backup-refs, $tempdir/raw-refs, $tempdir/heads,
-+$tempdir/tree-state are all created internally too. It is possible
-+(though strongly discouraged) that users could have accessed any of
-+these directly. Users even had a pointer to follow in the form of
-+Documentation that the 'map' command existed, which naturally uses the
-+$workdir/../map/* files. So, even if you don't have to edit files, for
-+strict backward compatibility you need to still write a bunch of files
-+to disk somewhere and keep them updated for every commit. You can claim
-+it was an implementation detail that users should not have depended
-+upon, but the truth is they've had a decade where they could so. So, if
-+you want full compatibility, it has to be there. Besides, the
-+regression tests depend on at least one of these details, specifying an
-+--index-filter that reaches down and grabs backup-refs from $tempdir,
-+and thus provides resourceful users who do google searches an example
-+that there are files there for them to read and grab and use. (And if
-+you want to pass the existing regression tests, you have to at least put
-+the backup-refs file there even if it's irrelevant to your
-+implementation otherwise.)
-+
-+All of that said, performance of filter-branch could be improved by
-+reimplementing it in a non-shell language and taking a couple small
-+liberties with backward compatibility (such as having it only run
-+filters on files changed within each commit). filter-repo provides a
-+demo script named
-+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely]
-+which does exactly that and which passes all the git-filter-branch
-+regression tests. It's much faster than git-filter-branch, though it
-+suffers from all the same safety issues as git-filter-branch, and is
-+still glacially slow compared to
-+https://github.com/newren/git-filter-repo/[git filter-repo].
++where those filters can be run. That's a significant setup.
++
++* Further, several additional files are created or updated per commit by
++git-filter-branch. Some of these are for supporting the convenience
++functions provided by git-filter-branch (such as map()), while others
++are for keeping track of internal state (but could have also been
++accessed by user filters; one of git-filter-branch's regression tests
++does so). This essentially amounts to using the filesystem as an IPC
++mechanism between git-filter-branch and the user-provided filters.
++Disks tend to be a slow IPC mechanism, and writing these files also
++effectively represents a forced synchronization point between separate
++processes that we hit with every commit.
++
++* The user-provided shell commands will likely involve a pipeline of
++commands, resulting in the creation of many processes per commit.
++Creating and running another process takes a widely varying amount of
++time between operating systems, but on any platform it is very slow
++relative to invoking a function.
++
++* git-filter-branch itself is written in shell, which is kind of slow.
++This is the one performance issue that could be backward-compatibly
++fixed, but compared to the above problems that are intrinsic to the
++design of git-filter-branch, the language of the tool itself is a
++relatively minor issue.
++
++ ** Side note: Unfortunately, people tend to fixate on the
++ written-in-shell aspect and periodically ask if git-filter-branch
++ could be rewritten in another language to fix the performance
++ issues. Not only does that ignore the bigger intrinsic problems
++ with the design, it'd help less than you'd expect: if
++ git-filter-branch itself were not shell, then the convenience
++ functions (map(), skip_commit(), etc) and the `--setup` argument
++ could no longer be executed once at the beginning of the program
++ but would instead need to be prepended to every user filter (and
++ thus re-executed with every commit).
++
++The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
++an alternative to git-filter-branch which does not suffer from these
++performance problems or the safety problems (mentioned below). For those
++with existing tooling which relies upon git-filter-branch, 'git
++repo-filter' also provides
++https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
++a drop-in git-filter-branch replacement (with a few caveats). While
++filter-lamely suffers from all the same safety issues as
++git-filter-branch, it at least ameloriates the performance issues a
++little.
+
+[[SAFETY]]
+SAFETY
+------
+
-+filter-branch is riddled with gotchas resulting in various ways to
++git-filter-branch is riddled with gotchas resulting in various ways to
+easily corrupt repos or end up with a mess worse than what you started
+with:
+
@@ Documentation/git-filter-branch.txt: warned.
+history is in use for quite a while, at which point it's really hard to
+justify another flag-day for another rewrite.)
+
-+* Filenames with spaces (which are rare) are often mishandled by shell
-+snippets since they cause problems for shell pipelines. Not everyone is
-+familiar with find -print0, xargs -0, ls-files -z, etc. Even people who
-+are familiar with these may assume such needs are not relevant because
++* Filenames with spaces are often mishandled by shell snippets since
++they cause problems for shell pipelines. Not everyone is familiar with
++find -print0, xargs -0, git-ls-files -z, etc. Even people who are
++familiar with these may assume such needs are not relevant because
+someone else renamed any such files in their repo back before the person
+doing the filtering joined the project. And, often, even those familiar
+with handling arguments with spaces my not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
-+* Non-ascii filenames (which are rare) can be silently removed despite
-+being in a desired directory. The desire to select paths to keep often
-+use pipelines like `git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`.
-+ls-files will only quote filenames if needed so folks may not notice
-+that one of the files didn't match the regex, again until it's much too
-+late. Yes, someone who knows about core.quotePath can avoid this
-+(unless they have other special characters like \t, \n, or "), and
-+people who use ls-files -z with something other than grep can avoid
-+this, but that doesn't mean they will.
++* Non-ascii filenames can be silently removed despite being in a desired
++directory. The desire to select paths to keep often use pipelines like
++`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
++only quote filenames if needed so folks may not notice that one of the
++files didn't match the regex, again until it's much too late. Yes,
++someone who knows about core.quotePath can avoid this (unless they have
++other special characters like \t, \n, or "), and people who use ls-files
++-z with something other than grep can avoid this, but that doesn't mean
++they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
@@ Documentation/git-filter-branch.txt: warned.
+that it can and has manifested as a problem.)
+
+* It's far too easy to accidentally mix up old and new history. It's
-+still possible with any tool, but filter-branch almost invites it. If
-+we're lucky, the only downside is users getting frustrated that they
-+don't know how to shrink their repo and remove the old stuff. If we're
-+unlucky, they merge old and new history and end up with multiple
-+"copies" of each commit, some of which have unwanted or sensitive files
-+and others which don't. This comes about in multiple different ways:
++still possible with any tool, but git-filter-branch almost invites it.
++If lucky, the only downside is users getting frustrated that they don't
++know how to shrink their repo and remove the old stuff. If unlucky,
++they merge old and new history and end up with multiple "copies" of each
++commit, some of which have unwanted or sensitive files and others which
++don't. This comes about in multiple different ways:
+
+ ** the default to only doing a partial history rewrite ('--all' is not
-+ the default and over 80% of the examples in the manpage don't use
-+ it)
++ the default and few examples show it)
+
+ ** the fact that there's no automatic post-run cleanup
+
+ ** the fact that --tag-name-filter (when used to rename tags) doesn't
-+ remove the old tags but just adds new ones with the new name (this
-+ manpage has documented this for a long time so it's presumably not
-+ a "bug" even though it feels like it)
++ remove the old tags but just adds new ones with the new name
+
+ ** the fact that little educational information is provided to inform
+ users of the ramifications of a rewrite and how to avoid mixing old
@@ Documentation/git-filter-branch.txt: warned.
+* Annotated tags can be accidentally converted to lightweight tags, due
+to either of two issues:
+
-+ . Someone can do a history rewrite, realize they messed up, restore
-+ from the backups in refs/original/, and then redo their
-+ filter-branch command. (The backup in refs/original/ is not a real
-+ backup; it dereferences tags first.)
++ ** Someone can do a history rewrite, realize they messed up, restore
++ from the backups in refs/original/, and then redo their
++ git-filter-branch command. (The backup in refs/original/ is not a
++ real backup; it dereferences tags first.)
+
-+ . Running filter-branch with either --tags or --all in your <rev-list
-+ options>. In order to retain annotated tags as annotated, you must
-+ use --tag-name-filter (and must not have restored from
-+ refs/original/ in a previously botched rewrite).
++ ** Running git-filter-branch with either --tags or --all in your
++ <rev-list options>. In order to retain annotated tags as
++ annotated, you must use --tag-name-filter (and must not have
++ restored from refs/original/ in a previously botched rewrite).
+
+* Any commit messages that specify an encoding will become corrupted
-+by the rewrite; filter-branch ignores the encoding, takes the original
++by the rewrite; git-filter-branch ignores the encoding, takes the original
+bytes, and feeds it to commit-tree without telling it the proper
-+encoding. (This happens whether or not --msg-filter is used, though I
-+suspect --msg-filter provides additional ways to really mess things
-+up).
++encoding. (This happens whether or not --msg-filter is used.)
+
+* Commit messages (even if they are all UTF-8) by default become
+corrupted due to not being updated -- any references to other commit
@@ Documentation/git-filter-branch.txt: warned.
+authors and committers, missing taggers.
+
+* If the user provides a --tag-name-filter that maps multiple tags to
-+the same name, no warning or error is provided; filter-branch simply
++the same name, no warning or error is provided; git-filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
-+only one tag at the end. If you try to "fix" this bug in filter-branch
-+and make it error out and warn the user instead, one of the
-+filter-branch regression tests will fail. (So, if you are trying to
-+make a backward compatible reimplementation you have to add extra code
-+to detect collisions and make sure that only the lexicographically last
-+one is rewritten to avoid fast-import from seeing both since fast-import
-+will naturally do the sane thing and error out if told to write the same
-+tag more than once.)
++only one tag at the end. (A git-filter-branch regression test requires
++this.)
+
-+Also, the poor performance of filter-branch often leads to safety issues:
++Also, the poor performance of git-filter-branch often leads to safety issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
-+such as deleting a couple files. People have often come to me for help,
-+so I should be practiced and an expert, but even for fairly simple cases
-+I still sometimes take over 10 minutes and several iterations to get
-+the right commands -- and that's assuming they are working on a tiny
-+repository. Unfortunately, people often learn if the snippet is right
-+or wrong by trying it out, but the rightness or wrongness can vary
-+depending on special circumstances (spaces in filenames, non-ascii
-+filenames, funny author names or emails, invalid timezones, presence of
-+grafts or replace objects, etc.), meaning they may have to wait a long
-+time, hit an error, then restart. The performance of filter-branch is
-+so bad that this cycle is painful, reducing the time available to
-+carefully re-check (to say nothing about what it does to the patience of
-+the person doing the rewrite even if they do technically have more time
-+available). This problem is extra compounded because errors from broken
-+filters may not be shown for a long time and/or get lost in a sea of
-+output. Even worse, broken filters often just result in silent
-+incorrect rewrites.
++such as deleting a couple files. Unfortunately, people often learn if
++the snippet is right or wrong by trying it out, but the rightness or
++wrongness can vary depending on special circumstances (spaces in
++filenames, non-ascii filenames, funny author names or emails, invalid
++timezones, presence of grafts or replace objects, etc.), meaning they
++may have to wait a long time, hit an error, then restart. The
++performance of git-filter-branch is so bad that this cycle is painful,
++reducing the time available to carefully re-check (to say nothing about
++what it does to the patience of the person doing the rewrite even if
++they do technically have more time available). This problem is extra
++compounded because errors from broken filters may not be shown for a
++long time and/or get lost in a sea of output. Even worse, broken
++filters often just result in silent incorrect rewrites.
+
+* To top it all off, even when users finally find working commands, they
+naturally want to share them. But they may be unaware that their repo
4: 1dbca82408 = 4: ca8e124cb3 t9902: use a non-deprecated command for testing
5: 762d63d6a5 < -: ---------- Remove git-filter-branch, it is now external to git.git
--
2.23.0.38.g892688c90e
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v4 1/4] t6006: simplify and optimize empty message test
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-08-30 5:57 ` Elijah Newren
2019-09-02 14:47 ` Johannes Schindelin
2019-08-30 5:57 ` [PATCH v4 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
` (2 subsequent siblings)
3 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 5:57 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
the commit messages to be empty. This test was written this way because
the --allow-empty-message option to git commit did not exist at the
time. Simplify this test and avoid the need to invoke filter-branch by
just using --allow-empty-message when creating the commit.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
./t6006-rev-list-format.sh`) by about 11% on Linux and by 13% on
Mac.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t6006-rev-list-format.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index da113d975b..d30e41c9f7 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
'
test_expect_success 'oneline with empty message' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
+ git commit --allow-empty --allow-empty-message &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
git rev-list --oneline --graph HEAD >testg.txt &&
--
2.23.0.38.g892688c90e
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v4 1/4] t6006: simplify and optimize empty message test
2019-08-30 5:57 ` [PATCH v4 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-09-02 14:47 ` Johannes Schindelin
0 siblings, 0 replies; 73+ messages in thread
From: Johannes Schindelin @ 2019-09-02 14:47 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Lars Schneider,
Jonathan Nieder, Eric Sunshine
Hi Elijah,
On Thu, 29 Aug 2019, Elijah Newren wrote:
> Despite only being one piece of the 71st test and there being 73 tests
> overall, this small change to just this one test speeds up the overall
> execution time of t6006 (as measured by the best of 3 runs of `time
> ./t6006-rev-list-format.sh`) by about 11% on Linux and by 13% on
> Mac.
A similar effect can be observed on my Windows laptop: from 25s to 21s,
i.e. ~15%.
Thanks,
Dscho
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v4 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-30 5:57 ` [PATCH v4 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-08-30 5:57 ` Elijah Newren
2019-09-02 14:45 ` Johannes Schindelin
2019-08-30 5:57 ` [PATCH v4 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
2019-08-30 5:57 ` [PATCH v4 4/4] t9902: use a non-deprecated command for testing Elijah Newren
3 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 5:57 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
fast-export and fast-import can easily handle the simple rewrite that
was being done by filter-branch, and should be significantly faster on
systems with a slow fork. Timings from before and after on two laptops
that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
i.e. including everything in this test -- not just the filter-branch or
fast-export/fast-import pair):
Linux: 4.305s -> 3.684s (~17% speedup)
Mac: 10.128s -> 7.038s (~30% speedup)
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t3427-rebase-subtree.sh | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/t/t3427-rebase-subtree.sh b/t/t3427-rebase-subtree.sh
index d8640522a0..c1f6102921 100755
--- a/t/t3427-rebase-subtree.sh
+++ b/t/t3427-rebase-subtree.sh
@@ -7,10 +7,16 @@ This test runs git rebase and tests the subtree strategy.
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-rebase.sh
-commit_message() {
+commit_message () {
git log --pretty=format:%s -1 "$1"
}
+extract_files_subtree () {
+ git fast-export --no-data HEAD -- files_subtree/ |
+ sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
+ git fast-import --force --quiet
+}
+
test_expect_success 'setup' '
test_commit README &&
mkdir files &&
@@ -42,7 +48,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-preserve-merges-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master4"
@@ -53,7 +59,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-preserve-merges-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "files_subtree/master5"
@@ -64,7 +70,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-keep-empty-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -75,7 +81,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-keep-empty-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -86,7 +92,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto empty commit' '
reset_rebase &&
git checkout -b rebase-keep-empty-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
@@ -96,7 +102,7 @@ test_expect_failure REBASE_P \
test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
reset_rebase &&
git checkout -b rebase-onto-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -106,7 +112,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
reset_rebase &&
git checkout -b rebase-onto-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -115,7 +121,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
test_expect_failure 'Rebase -Xsubtree --onto empty commit' '
reset_rebase &&
git checkout -b rebase-onto-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
--
2.23.0.38.g892688c90e
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v4 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-08-30 5:57 ` [PATCH v4 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-09-02 14:45 ` Johannes Schindelin
0 siblings, 0 replies; 73+ messages in thread
From: Johannes Schindelin @ 2019-09-02 14:45 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Lars Schneider,
Jonathan Nieder, Eric Sunshine
Hi Elijah,
On Thu, 29 Aug 2019, Elijah Newren wrote:
> fast-export and fast-import can easily handle the simple rewrite that
> was being done by filter-branch, and should be significantly faster on
> systems with a slow fork. Timings from before and after on two laptops
> that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
> i.e. including everything in this test -- not just the filter-branch or
> fast-export/fast-import pair):
>
> Linux: 4.305s -> 3.684s (~17% speedup)
> Mac: 10.128s -> 7.038s (~30% speedup)
This patch seems to accelerate t3427 on my Windows laptop, too, from
~1m37s to ~1m17s, i.e. ~20%.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v4 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-08-30 5:57 ` [PATCH v4 1/4] t6006: simplify and optimize empty message test Elijah Newren
2019-08-30 5:57 ` [PATCH v4 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-08-30 5:57 ` Elijah Newren
2019-08-30 5:57 ` [PATCH v4 4/4] t9902: use a non-deprecated command for testing Elijah Newren
3 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 5:57 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.
Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix
these issues and to avoid recommending filter-branch.
Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
* In order to recommend that people quit using filter-branch, we need
to provide them a recomendation for something else to use that
can handle all the same types of rewrites. To my knowledge,
filter-repo is the only such tool. So it needs to be mentioned.
* I don't want to give conflicting recommendations to users
* If we recommend two tools, we shouldn't expect users to learn both
and pick which one to use; we should explain which problems one
can solve that the other can't or when one is much faster than
the other.
* BFG and filter-repo have similar performance
* All filtering types that BFG can do, filter-repo can also do. In
fact, filter-repo comes with a reimplementation of BFG named
bfg-ish which provides the same user-interface as BFG but with
several bugfixes and new features that are hard to implement in
BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 272 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
9 files changed, 286 insertions(+), 59 deletions(-)
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
into 'git fast-import'.
You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
OPTIONS
-------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..c199f2ee20 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,19 @@ SYNOPSIS
[--original <namespace>] [-d <directory>] [-f | --force]
[--state-branch <branch>] [--] [<rev-list options>...]
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
+carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
+mines of filter-branch, and then vigilantly avoid as many of the hazards
+listed there as reasonably possible.
+
DESCRIPTION
-----------
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +458,235 @@ warned.
(or if your git-gc is not new enough to support arguments to
`--prune`, use `git repack -ad; git prune` instead).
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
- unlike git-filter-branch, does not give you the opportunity to
- handle a file differently based on where or when it was committed
- within your history. This constraint gives the core performance
- benefit of The BFG, and is well-suited to the task of cleansing bad
- data - you don't care _where_ the bad data is, you just want it
- _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
- cleansing commit file-trees in parallel. git-filter-branch cleans
- commits sequentially (i.e. in a single-threaded manner), though it
- _is_ possible to write filters that include their own parallelism,
- in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
+[[PERFORMANCE]]
+PERFORMANCE
+-----------
+
+The performance of git-filter-branch is glacially slow; its design makes it
+impossible for a backward-compatible implementation to ever be fast:
+
+* In editing files, git-filter-branch by design checks out each and
+every commit as it existed in the original repo. If your repo has 10\^5
+files and 10\^5 commits, but each commit only modifies 5 files, then
+git-filter-branch will make you do 10\^10 modifications, despite only
+having (at most) 5*10^5 unique blobs.
+
+* If you try and cheat and try to make git-filter-branch only work on
+files modified in a commit, then two things happen
+
+ ** you run into problems with deletions whenever the user is simply
+ trying to rename files (because attempting to delete files that
+ don't exist looks like a no-op; it takes some chicanery to remap
+ deletes across file renames when the renames happen via arbitrary
+ user-provided shell)
+
+ ** even if you succeed at the map-deletes-for-renames chicanery, you
+ still technically violate backward compatibility because users are
+ allowed to filter files in ways that depend upon topology of
+ commits instead of filtering solely based on file contents or names
+ (though this has not been observed in the wild).
+
+* Even if you don't need to edit files but only want to e.g. rename or
+remove some and thus can avoid checking out each file (i.e. you can use
+--index-filter), you still are passing shell snippets for your filters.
+This means that for every commit, you have to have a prepared git repo
+where those filters can be run. That's a significant setup.
+
+* Further, several additional files are created or updated per commit by
+git-filter-branch. Some of these are for supporting the convenience
+functions provided by git-filter-branch (such as map()), while others
+are for keeping track of internal state (but could have also been
+accessed by user filters; one of git-filter-branch's regression tests
+does so). This essentially amounts to using the filesystem as an IPC
+mechanism between git-filter-branch and the user-provided filters.
+Disks tend to be a slow IPC mechanism, and writing these files also
+effectively represents a forced synchronization point between separate
+processes that we hit with every commit.
+
+* The user-provided shell commands will likely involve a pipeline of
+commands, resulting in the creation of many processes per commit.
+Creating and running another process takes a widely varying amount of
+time between operating systems, but on any platform it is very slow
+relative to invoking a function.
+
+* git-filter-branch itself is written in shell, which is kind of slow.
+This is the one performance issue that could be backward-compatibly
+fixed, but compared to the above problems that are intrinsic to the
+design of git-filter-branch, the language of the tool itself is a
+relatively minor issue.
+
+ ** Side note: Unfortunately, people tend to fixate on the
+ written-in-shell aspect and periodically ask if git-filter-branch
+ could be rewritten in another language to fix the performance
+ issues. Not only does that ignore the bigger intrinsic problems
+ with the design, it'd help less than you'd expect: if
+ git-filter-branch itself were not shell, then the convenience
+ functions (map(), skip_commit(), etc) and the `--setup` argument
+ could no longer be executed once at the beginning of the program
+ but would instead need to be prepended to every user filter (and
+ thus re-executed with every commit).
+
+The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
+an alternative to git-filter-branch which does not suffer from these
+performance problems or the safety problems (mentioned below). For those
+with existing tooling which relies upon git-filter-branch, 'git
+repo-filter' also provides
+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
+a drop-in git-filter-branch replacement (with a few caveats). While
+filter-lamely suffers from all the same safety issues as
+git-filter-branch, it at least ameloriates the performance issues a
+little.
+
+[[SAFETY]]
+SAFETY
+------
+
+git-filter-branch is riddled with gotchas resulting in various ways to
+easily corrupt repos or end up with a mess worse than what you started
+with:
+
+* Someone can have a set of "working and tested filters" which they
+document or provide to a coworker, who then runs them on a different OS
+where the same commands are not working/tested (some examples in the
+git-filter-branch manpage are also affected by this). BSD vs. GNU
+userland differences can really bite. If you're lucky, you get ugly
+error messages spewed. But just as likely, the commands either don't do
+the filtering requested, or silently corrupt making some unwanted
+change. The unwanted change may only affect a few commits, so it's not
+necessarily obvious either. (The fact that problems won't necessarily
+be obvious means they are likely to go unnoticed until the rewritten
+history is in use for quite a while, at which point it's really hard to
+justify another flag-day for another rewrite.)
+
+* Filenames with spaces are often mishandled by shell snippets since
+they cause problems for shell pipelines. Not everyone is familiar with
+find -print0, xargs -0, git-ls-files -z, etc. Even people who are
+familiar with these may assume such needs are not relevant because
+someone else renamed any such files in their repo back before the person
+doing the filtering joined the project. And, often, even those familiar
+with handling arguments with spaces my not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
+* Non-ascii filenames can be silently removed despite being in a desired
+directory. The desire to select paths to keep often use pipelines like
+`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
+only quote filenames if needed so folks may not notice that one of the
+files didn't match the regex, again until it's much too late. Yes,
+someone who knows about core.quotePath can avoid this (unless they have
+other special characters like \t, \n, or "), and people who use ls-files
+-z with something other than grep can avoid this, but that doesn't mean
+they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
+that includes a double quote character. (This is technically the same
+issue as above with quoting, but perhaps an interesting different way
+that it can and has manifested as a problem.)
+
+* It's far too easy to accidentally mix up old and new history. It's
+still possible with any tool, but git-filter-branch almost invites it.
+If lucky, the only downside is users getting frustrated that they don't
+know how to shrink their repo and remove the old stuff. If unlucky,
+they merge old and new history and end up with multiple "copies" of each
+commit, some of which have unwanted or sensitive files and others which
+don't. This comes about in multiple different ways:
+
+ ** the default to only doing a partial history rewrite ('--all' is not
+ the default and few examples show it)
+
+ ** the fact that there's no automatic post-run cleanup
+
+ ** the fact that --tag-name-filter (when used to rename tags) doesn't
+ remove the old tags but just adds new ones with the new name
+
+ ** the fact that little educational information is provided to inform
+ users of the ramifications of a rewrite and how to avoid mixing old
+ and new history. For example, this man page discusses how users
+ need to understand that they need to rebase their changes for all
+ their branches on top of new history (or delete and reclone), but
+ that's only one of multiple concerns to consider. See the
+ "DISCUSSION" section of the git filter-repo manual page for more
+ details.
+
+* Annotated tags can be accidentally converted to lightweight tags, due
+to either of two issues:
+
+ ** Someone can do a history rewrite, realize they messed up, restore
+ from the backups in refs/original/, and then redo their
+ git-filter-branch command. (The backup in refs/original/ is not a
+ real backup; it dereferences tags first.)
+
+ ** Running git-filter-branch with either --tags or --all in your
+ <rev-list options>. In order to retain annotated tags as
+ annotated, you must use --tag-name-filter (and must not have
+ restored from refs/original/ in a previously botched rewrite).
+
+* Any commit messages that specify an encoding will become corrupted
+by the rewrite; git-filter-branch ignores the encoding, takes the original
+bytes, and feeds it to commit-tree without telling it the proper
+encoding. (This happens whether or not --msg-filter is used.)
+
+* Commit messages (even if they are all UTF-8) by default become
+corrupted due to not being updated -- any references to other commit
+hashes in commit messages will now refer to no-longer-extant commits.
+
+* There are no facilities for helping users find what unwanted crud they
+should delete, which means they are much more likely to have incomplete
+or partial cleanups that sometimes result in confusion and people
+wasting time trying to understand. (For example, folks tend to just
+look for big files to delete instead of big directories or extensions,
+and once they do so, then sometime later folks using the new repository
+who are going through history will notice a build artifact directory
+that has some files but not others, or a cache of dependencies
+(node_modules or similar) which couldn't have ever been functional since
+it's missing some files.)
+
+* If --prune-empty isn't specified, then the filtering process can
+create hoards of confusing empty commits
+
+* If --prune-empty is specified, then intentionally placed empty
+commits from before the filtering operation are also pruned instead of
+just pruning commits that became empty due to filtering rules.
+
+* If --prune empty is specified, sometimes empty commits are missed
+and left around anyway (a somewhat rare bug, but it happens...)
+
+* A minor issue, but users who have a goal to update all names and
+emails in a repository may be led to --env-filter which will only update
+authors and committers, missing taggers.
+
+* If the user provides a --tag-name-filter that maps multiple tags to
+the same name, no warning or error is provided; git-filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
+only one tag at the end. (A git-filter-branch regression test requires
+this.)
+
+Also, the poor performance of git-filter-branch often leads to safety issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
+such as deleting a couple files. Unfortunately, people often learn if
+the snippet is right or wrong by trying it out, but the rightness or
+wrongness can vary depending on special circumstances (spaces in
+filenames, non-ascii filenames, funny author names or emails, invalid
+timezones, presence of grafts or replace objects, etc.), meaning they
+may have to wait a long time, hit an error, then restart. The
+performance of git-filter-branch is so bad that this cycle is painful,
+reducing the time available to carefully re-check (to say nothing about
+what it does to the patience of the person doing the rewrite even if
+they do technically have more time available). This problem is extra
+compounded because errors from broken filters may not be shown for a
+long time and/or get lost in a sea of output. Even worse, broken
+filters often just result in silent incorrect rewrites.
+
+* To top it all off, even when users finally find working commands, they
+naturally want to share them. But they may be unaware that their repo
+didn't have some special cases that someone else's does. So, when
+someone else with a different repository runs the same commands, they
+get hit by the problems above. Or, the user just runs commands that
+really were vetted for special cases, but they run it on a different OS
+where it doesn't work, as noted above.
GIT
---
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@ NOTES
-----
'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace. If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..a8cfc0ad82 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,8 @@ Hard case: The changes are not the same.::
This happens if the 'subsystem' rebase had conflicts, or used
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
+ a full history rewriting command like
+ https://github.com/newren/git-filter-repo[`filter-repo`].
The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..f271d758c3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@ The following format are available:
CREATING REPLACEMENT OBJECTS
----------------------------
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
SEE ALSO
--------
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
linkgit:git-var[1]
linkgit:git[1]
+https://github.com/newren/git-filter-repo[git-filter-repo]
GIT
---
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..53774f5b64 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
+
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
-and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
-reformatting of metadata for ease-of-reading and rewriting authorship
-info for non-"svn.authorsFile" users.
+reports, and archives. If you plan to eventually migrate from SVN to
+Git and are certain about dropping SVN history, consider
+https://github.com/newren/git-filter-repo[git-filter-repo] instead.
+filter-repo also allows reformatting of metadata for ease-of-reading
+and rewriting authorship info for non-"svn.authorsFile" users.
svn.useSvmProps::
svn-remote.<name>.useSvmProps::
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..5a789c91df 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,10 +425,12 @@ post-rewrite
This hook is invoked by commands that rewrite commits
(linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
-arguments may be passed in the future.
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or
+https://github.com/newren/git-filter-repo[git-filter-repo] typically
+do not call it!). Its first argument denotes the command it was
+invoked by: currently one of `amend` or `rebase`. Further
+command-dependent arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
format
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@ line. This line has the form `git-svn-id: URL@REVNO UUID`.
The resulting repository will generally require further processing
to put each project in its own repository and to separate the history
-of each branch. The 'git filter-branch --subdirectory-filter' command
+of each branch. The 'git filter-repo --subdirectory-filter' command
may be useful for this purpose.
BUGS
@@ -67,5 +67,5 @@ The exit status does not reflect whether an error was detected.
SEE ALSO
--------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 5c5afa2b98..f805965d87 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,19 @@ set_ident () {
finish_ident COMMITTER
}
+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
+ -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+ rewrites. Please use an alternative filtering tool such as 'git
+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+ See the filter-branch manual page for more details; to squelch
+ this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+ sleep 5
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
[--tree-filter <command>] [--index-filter <command>]
[--parent-filter <command>] [--msg-filter <command>]
--
2.23.0.38.g892688c90e
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v4 4/4] t9902: use a non-deprecated command for testing
2019-08-30 5:57 ` [PATCH v4 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (2 preceding siblings ...)
2019-08-30 5:57 ` [PATCH v4 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-08-30 5:57 ` Elijah Newren
3 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-08-30 5:57 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
t9902 had a list of three random porcelain commands as a sanity check,
one of which was filter-branch. Since we are recommending people not
use filter-branch, let's update this test to use rebase instead of
filter-branch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t9902-completion.sh | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 75512c3403..4e7f669c76 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -28,10 +28,10 @@ complete ()
#
# (2) A test makes sure that common subcommands are included in the
# completion for "git <TAB>", and a plumbing is excluded. "add",
-# "filter-branch" and "ls-files" are listed for this.
+# "rebase" and "ls-files" are listed for this.
-GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr filter-branch ls-files'
-GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout filter-branch'
+GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr rebase ls-files'
+GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout rebase'
. "$GIT_BUILD_DIR/contrib/completion/git-completion.bash"
@@ -1392,12 +1392,12 @@ test_expect_success 'basic' '
# built-in
grep -q "^add \$" out &&
# script
- grep -q "^filter-branch \$" out &&
+ grep -q "^rebase \$" out &&
# plumbing
! grep -q "^ls-files \$" out &&
- run_completion "git f" &&
- ! grep -q -v "^f" out
+ run_completion "git r" &&
+ ! grep -q -v "^r" out
'
test_expect_success 'double dash "git" itself' '
--
2.23.0.38.g892688c90e
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it
2019-08-28 0:22 ` [PATCH v2 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (4 preceding siblings ...)
2019-08-29 0:06 ` [PATCH v3 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-09-03 18:55 ` Elijah Newren
2019-09-03 18:55 ` [PATCH v5 1/4] t6006: simplify and optimize empty message test Elijah Newren
` (4 more replies)
5 siblings, 5 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 18:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
It's been about 5 days with no further feedback, other than some timings
from Dscho for Windows showing that my fixes help there too. So, I did
one last re-read, made a couple small wording tweaks, and am resending as
ready for inclusion.
Changes since v4:
* Included the windows timings from Dscho in the commit messages for
the first two perf patches
* A few slight wording tweaks to the manpage
Elijah Newren (4):
t6006: simplify and optimize empty message test
t3427: accelerate this test by using fast-export and fast-import
Recommend git-filter-repo instead of git-filter-branch
t9902: use a non-deprecated command for testing
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 273 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
t/t3427-rebase-subtree.sh | 24 ++-
t/t6006-rev-list-format.sh | 5 +-
t/t9902-completion.sh | 12 +-
12 files changed, 310 insertions(+), 77 deletions(-)
Range-diff:
1: 7ddbeea2ca ! 1: ccea0e5846 t6006: simplify and optimize empty message test
@@ Commit message
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
- ./t6006-rev-list-format.sh`) by about 11% on Linux and by 13% on
- Mac.
+ ./t6006-rev-list-format.sh`) by about 11% on Linux, 13% on Mac, and
+ about 15% on Windows.
Signed-off-by: Elijah Newren <newren@gmail.com>
2: e1e63189c1 ! 2: 6d73135006 t3427: accelerate this test by using fast-export and fast-import
@@ Commit message
fast-export and fast-import can easily handle the simple rewrite that
was being done by filter-branch, and should be significantly faster on
- systems with a slow fork. Timings from before and after on two laptops
- that I have access to (measured via `time ./t3427-rebase-subtree.sh`,
- i.e. including everything in this test -- not just the filter-branch or
- fast-export/fast-import pair):
+ systems with a slow fork. Timings from before and after on a few
+ laptops that I or others measured on (measured via `time
+ ./t3427-rebase-subtree.sh`, i.e. including everything in this test --
+ not just the filter-branch or fast-export/fast-import pair):
- Linux: 4.305s -> 3.684s (~17% speedup)
- Mac: 10.128s -> 7.038s (~30% speedup)
+ Linux: 4.305s -> 3.684s (~17% speedup)
+ Mac: 10.128s -> 7.038s (~30% speedup)
+ Windows: 1m 37s -> 1m 17s (~26% speedup)
Signed-off-by: Elijah Newren <newren@gmail.com>
3: ed6505584f ! 3: 2f225c8697 Recommend git-filter-repo instead of git-filter-branch
@@ Documentation/git-filter-branch.txt: warned.
+document or provide to a coworker, who then runs them on a different OS
+where the same commands are not working/tested (some examples in the
+git-filter-branch manpage are also affected by this). BSD vs. GNU
-+userland differences can really bite. If you're lucky, you get ugly
-+error messages spewed. But just as likely, the commands either don't do
-+the filtering requested, or silently corrupt making some unwanted
-+change. The unwanted change may only affect a few commits, so it's not
-+necessarily obvious either. (The fact that problems won't necessarily
-+be obvious means they are likely to go unnoticed until the rewritten
-+history is in use for quite a while, at which point it's really hard to
-+justify another flag-day for another rewrite.)
++userland differences can really bite. If lucky, error messages are
++spewed. But just as likely, the commands either don't do the filtering
++requested, or silently corrupt by making some unwanted change. The
++unwanted change may only affect a few commits, so it's not necessarily
++obvious either. (The fact that problems won't necessarily be obvious
++means they are likely to go unnoticed until the rewritten history is in
++use for quite a while, at which point it's really hard to justify
++another flag-day for another rewrite.)
+
+* Filenames with spaces are often mishandled by shell snippets since
+they cause problems for shell pipelines. Not everyone is familiar with
+find -print0, xargs -0, git-ls-files -z, etc. Even people who are
-+familiar with these may assume such needs are not relevant because
++familiar with these may assume such flags are not relevant because
+someone else renamed any such files in their repo back before the person
-+doing the filtering joined the project. And, often, even those familiar
-+with handling arguments with spaces my not do so just because they
++doing the filtering joined the project. And often, even those familiar
++with handling arguments with spaces may not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
+* Non-ascii filenames can be silently removed despite being in a desired
-+directory. The desire to select paths to keep often use pipelines like
++directory. Keeping only wanted paths is often done using pipelines like
+`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
-+only quote filenames if needed so folks may not notice that one of the
-+files didn't match the regex, again until it's much too late. Yes,
-+someone who knows about core.quotePath can avoid this (unless they have
-+other special characters like \t, \n, or "), and people who use ls-files
-+-z with something other than grep can avoid this, but that doesn't mean
-+they will.
++only quote filenames if needed, so folks may not notice that one of the
++files didn't match the regex (at least not until it's much too late).
++Yes, someone who knows about core.quotePath can avoid this (unless they
++have other special characters like \t, \n, or "), and people who use
++ls-files -z with something other than grep can avoid this, but that
++doesn't mean they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
@@ Documentation/git-filter-branch.txt: warned.
+the same name, no warning or error is provided; git-filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
+only one tag at the end. (A git-filter-branch regression test requires
-+this.)
++this surprising behavior.)
+
-+Also, the poor performance of git-filter-branch often leads to safety issues:
++Also, the poor performance of git-filter-branch often leads to safety
++issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
4: ca8e124cb3 = 4: 048eba375b t9902: use a non-deprecated command for testing
--
2.23.0.39.gf92d9de5c3
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v5 1/4] t6006: simplify and optimize empty message test
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-09-03 18:55 ` Elijah Newren
2019-09-03 21:08 ` Junio C Hamano
2019-09-03 18:55 ` [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
` (3 subsequent siblings)
4 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 18:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
the commit messages to be empty. This test was written this way because
the --allow-empty-message option to git commit did not exist at the
time. Simplify this test and avoid the need to invoke filter-branch by
just using --allow-empty-message when creating the commit.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
./t6006-rev-list-format.sh`) by about 11% on Linux, 13% on Mac, and
about 15% on Windows.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t6006-rev-list-format.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index da113d975b..d30e41c9f7 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
'
test_expect_success 'oneline with empty message' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
+ git commit --allow-empty --allow-empty-message &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
git rev-list --oneline --graph HEAD >testg.txt &&
--
2.23.0.39.gf92d9de5c3
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v5 1/4] t6006: simplify and optimize empty message test
2019-09-03 18:55 ` [PATCH v5 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-09-03 21:08 ` Junio C Hamano
2019-09-03 21:58 ` Elijah Newren
0 siblings, 1 reply; 73+ messages in thread
From: Junio C Hamano @ 2019-09-03 21:08 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
Elijah Newren <newren@gmail.com> writes:
> Test t6006.71 ("oneline with empty message") was creating two commits
> with simple commit messages, and then running filter-branch to rewrite
> the commit messages to be empty. This test was written this way because
> the --allow-empty-message option to git commit did not exist at the
> time. Simplify this test and avoid the need to invoke filter-branch by
> just using --allow-empty-message when creating the commit.
The result of filter-branch seems to have one empty line as the body
(i.e. "echo X; git cat-file commit A; echo Y" will show two blank
lines between the committer line and Y), while "--allow-empty-message"
does not leave any body (i.e. the same will give you only one blank
line there).
Was this test verifying the right thing in the first place, I have
to wonder.
IOW,
git commit --allow-empty --cleanup=verbatim -m "$LF" &&
would be more faithful conversion of the original (and hopefully
just as performant).
> Despite only being one piece of the 71st test and there being 73 tests
> overall, this small change to just this one test speeds up the overall
> execution time of t6006 (as measured by the best of 3 runs of `time
> ./t6006-rev-list-format.sh`) by about 11% on Linux, 13% on Mac, and
> about 15% on Windows.
Quite an improvement ;-)
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> t/t6006-rev-list-format.sh | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
> index da113d975b..d30e41c9f7 100755
> --- a/t/t6006-rev-list-format.sh
> +++ b/t/t6006-rev-list-format.sh
> @@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
> '
>
> test_expect_success 'oneline with empty message' '
> - git commit -m "dummy" --allow-empty &&
> - git commit -m "dummy" --allow-empty &&
> - git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
> + git commit --allow-empty --allow-empty-message &&
> + git commit --allow-empty --allow-empty-message &&
> git rev-list --oneline HEAD >test.txt &&
> test_line_count = 5 test.txt &&
> git rev-list --oneline --graph HEAD >testg.txt &&
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v5 1/4] t6006: simplify and optimize empty message test
2019-09-03 21:08 ` Junio C Hamano
@ 2019-09-03 21:58 ` Elijah Newren
2019-09-03 22:25 ` Junio C Hamano
0 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 21:58 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
On Tue, Sep 3, 2019 at 2:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Test t6006.71 ("oneline with empty message") was creating two commits
> > with simple commit messages, and then running filter-branch to rewrite
> > the commit messages to be empty. This test was written this way because
> > the --allow-empty-message option to git commit did not exist at the
> > time. Simplify this test and avoid the need to invoke filter-branch by
> > just using --allow-empty-message when creating the commit.
>
> The result of filter-branch seems to have one empty line as the body
> (i.e. "echo X; git cat-file commit A; echo Y" will show two blank
> lines between the committer line and Y), while "--allow-empty-message"
> does not leave any body (i.e. the same will give you only one blank
> line there).
Ah, good catch. I checked out the commit before 1fb5fdd25f0
("rev-list: fix --pretty=oneline with empty message", 2010-03-21), to
try and see the error before that testcase was introduced. I tried it
on a repo with both an actual empty commit message, and one with a
commit message consisting solely of a newline. Both styles exhibited
the bug that the testcase was introduced to guard against.
> Was this test verifying the right thing in the first place, I have
> to wonder.
>
> IOW,
>
> git commit --allow-empty --cleanup=verbatim -m "$LF" &&
>
> would be more faithful conversion of the original (and hopefully
> just as performant).
Yeah, it'd be a more faithful conversion of the original, though the
original didn't match the testcase description nor the commit message
(it claimed it was testing with an empty message). Also, in terms of
future proofing, any code changes are more likely to omit a needed
trailing LF if the commit message doesn't have one than if it does, so
I think it's a more robust test with this change.
I can update the commit message to explain this, or, if you prefer, I
could duplicate the testcase and tweak the second as you suggest so we
test both with and without the LF. What's your preference?
> > Despite only being one piece of the 71st test and there being 73 tests
> > overall, this small change to just this one test speeds up the overall
> > execution time of t6006 (as measured by the best of 3 runs of `time
> > ./t6006-rev-list-format.sh`) by about 11% on Linux, 13% on Mac, and
> > about 15% on Windows.
>
> Quite an improvement ;-)
>
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> > t/t6006-rev-list-format.sh | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
> > index da113d975b..d30e41c9f7 100755
> > --- a/t/t6006-rev-list-format.sh
> > +++ b/t/t6006-rev-list-format.sh
> > @@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
> > '
> >
> > test_expect_success 'oneline with empty message' '
> > - git commit -m "dummy" --allow-empty &&
> > - git commit -m "dummy" --allow-empty &&
> > - git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
> > + git commit --allow-empty --allow-empty-message &&
> > + git commit --allow-empty --allow-empty-message &&
> > git rev-list --oneline HEAD >test.txt &&
> > test_line_count = 5 test.txt &&
> > git rev-list --oneline --graph HEAD >testg.txt &&
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v5 1/4] t6006: simplify and optimize empty message test
2019-09-03 21:58 ` Elijah Newren
@ 2019-09-03 22:25 ` Junio C Hamano
0 siblings, 0 replies; 73+ messages in thread
From: Junio C Hamano @ 2019-09-03 22:25 UTC (permalink / raw)
To: Elijah Newren
Cc: Git Mailing List, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
Elijah Newren <newren@gmail.com> writes:
> Ah, good catch. I checked out the commit before 1fb5fdd25f0
> ("rev-list: fix --pretty=oneline with empty message", 2010-03-21), to
> try and see the error before that testcase was introduced. I tried it
> on a repo with both an actual empty commit message, and one with a
> commit message consisting solely of a newline. Both styles exhibited
> the bug that the testcase was introduced to guard against.
That's a good thing to know to decide what is a reasonable
thing to do here.
As we are creating two commits, perhaps adding one with and another
without the extra blank line may give us more diversity, and
explaining why we are adding two slightly different one
(i.e. because the original bug was there for both shapes of commits)
would help us not wasting the time we already spent discussing this
change ;-)
Of course, we can alternatively just keep the patch as-is and update
the explanation as to why we are testing with commits different from
the original when we are supposed to be making this change for
performance reasons (i.e. the symptom manifests either way, so why
not using the form that is easier to create?).
Thanks for working on this ;-)
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-09-03 18:55 ` [PATCH v5 1/4] t6006: simplify and optimize empty message test Elijah Newren
@ 2019-09-03 18:55 ` Elijah Newren
2019-09-03 21:26 ` Junio C Hamano
2019-09-03 18:55 ` [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
` (2 subsequent siblings)
4 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 18:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
fast-export and fast-import can easily handle the simple rewrite that
was being done by filter-branch, and should be significantly faster on
systems with a slow fork. Timings from before and after on a few
laptops that I or others measured on (measured via `time
./t3427-rebase-subtree.sh`, i.e. including everything in this test --
not just the filter-branch or fast-export/fast-import pair):
Linux: 4.305s -> 3.684s (~17% speedup)
Mac: 10.128s -> 7.038s (~30% speedup)
Windows: 1m 37s -> 1m 17s (~26% speedup)
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t3427-rebase-subtree.sh | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/t/t3427-rebase-subtree.sh b/t/t3427-rebase-subtree.sh
index d8640522a0..c1f6102921 100755
--- a/t/t3427-rebase-subtree.sh
+++ b/t/t3427-rebase-subtree.sh
@@ -7,10 +7,16 @@ This test runs git rebase and tests the subtree strategy.
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-rebase.sh
-commit_message() {
+commit_message () {
git log --pretty=format:%s -1 "$1"
}
+extract_files_subtree () {
+ git fast-export --no-data HEAD -- files_subtree/ |
+ sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
+ git fast-import --force --quiet
+}
+
test_expect_success 'setup' '
test_commit README &&
mkdir files &&
@@ -42,7 +48,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-preserve-merges-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master4"
@@ -53,7 +59,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-preserve-merges-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "files_subtree/master5"
@@ -64,7 +70,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 4' '
reset_rebase &&
git checkout -b rebase-keep-empty-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -75,7 +81,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto commit 5' '
reset_rebase &&
git checkout -b rebase-keep-empty-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -86,7 +92,7 @@ test_expect_failure REBASE_P \
'Rebase -Xsubtree --keep-empty --preserve-merges --onto empty commit' '
reset_rebase &&
git checkout -b rebase-keep-empty-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --keep-empty --preserve-merges --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
@@ -96,7 +102,7 @@ test_expect_failure REBASE_P \
test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
reset_rebase &&
git checkout -b rebase-onto-4 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~2)" = "files_subtree/master4"
@@ -106,7 +112,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 4' '
test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
reset_rebase &&
git checkout -b rebase-onto-5 master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD~)" = "files_subtree/master5"
@@ -115,7 +121,7 @@ test_expect_failure 'Rebase -Xsubtree --onto commit 5' '
test_expect_failure 'Rebase -Xsubtree --onto empty commit' '
reset_rebase &&
git checkout -b rebase-onto-empty master &&
- git filter-branch --prune-empty -f --subdirectory-filter files_subtree &&
+ extract_files_subtree &&
git commit -m "Empty commit" --allow-empty &&
git rebase -Xsubtree=files_subtree --onto files-master master &&
verbose test "$(commit_message HEAD)" = "Empty commit"
--
2.23.0.39.gf92d9de5c3
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-09-03 18:55 ` [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-09-03 21:26 ` Junio C Hamano
2019-09-03 22:46 ` Junio C Hamano
0 siblings, 1 reply; 73+ messages in thread
From: Junio C Hamano @ 2019-09-03 21:26 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
Elijah Newren <newren@gmail.com> writes:
> +extract_files_subtree () {
> + git fast-export --no-data HEAD -- files_subtree/ |
> + sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
> + git fast-import --force --quiet
> +}
Clever, if a bit filthy ;-). We expect to see something like
M 100644 dead...beef files_subtree/bar
M 100755 c0f.....fee files_subtree/foo
in the --no-data output, and the assumption here is that 40-hex
followed by " files_subtree/" would never appear anywhere in the
stream other than these tree dump, so the sed script can rewrite
the above to
M 100644 dead...beef bar
M 100755 c0f.....fee foo
by getting rid of the leading directory name (plus the slash at the
end).
Thanks.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-09-03 21:26 ` Junio C Hamano
@ 2019-09-03 22:46 ` Junio C Hamano
2019-09-04 20:32 ` Elijah Newren
0 siblings, 1 reply; 73+ messages in thread
From: Junio C Hamano @ 2019-09-03 22:46 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
Junio C Hamano <gitster@pobox.com> writes:
> Elijah Newren <newren@gmail.com> writes:
>
>> +extract_files_subtree () {
>> + git fast-export --no-data HEAD -- files_subtree/ |
>> + sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
>> + git fast-import --force --quiet
>> +}
This change has obvious interactions with Dscho's d51b771d ("t3427:
move the `filter-branch` invocation into the `setup` case",
2019-07-31) that is still in flight, but in a good way. There only
needs a single callsite for the above helper function after that
step.
I think I'll discard this step from the "move us closer to deprecate
filter-branch" topic, and ask you and Dscho to work together to have
it or its moral equivalent included as part of js/rebase-r-strategy
topic.
Thanks.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import
2019-09-03 22:46 ` Junio C Hamano
@ 2019-09-04 20:32 ` Elijah Newren
0 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 20:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
On Tue, Sep 3, 2019 at 3:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Elijah Newren <newren@gmail.com> writes:
> >
> >> +extract_files_subtree () {
> >> + git fast-export --no-data HEAD -- files_subtree/ |
> >> + sed -e "s%\([0-9a-f]\{40\} \)files_subtree/%\1%" |
> >> + git fast-import --force --quiet
> >> +}
>
> This change has obvious interactions with Dscho's d51b771d ("t3427:
> move the `filter-branch` invocation into the `setup` case",
> 2019-07-31) that is still in flight, but in a good way. There only
> needs a single callsite for the above helper function after that
> step.
>
> I think I'll discard this step from the "move us closer to deprecate
> filter-branch" topic, and ask you and Dscho to work together to have
> it or its moral equivalent included as part of js/rebase-r-strategy
> topic.
Sounds good. I'll resubmit it separately as a patch on top of his topic.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-09-03 18:55 ` [PATCH v5 1/4] t6006: simplify and optimize empty message test Elijah Newren
2019-09-03 18:55 ` [PATCH v5 2/4] t3427: accelerate this test by using fast-export and fast-import Elijah Newren
@ 2019-09-03 18:55 ` Elijah Newren
2019-09-03 21:40 ` Junio C Hamano
2019-09-03 18:55 ` [PATCH v5 4/4] t9902: use a non-deprecated command for testing Elijah Newren
2019-09-04 22:32 ` [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it Elijah Newren
4 siblings, 1 reply; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 18:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.
Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix
these issues and to avoid recommending filter-branch.
Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
* In order to recommend that people quit using filter-branch, we need
to provide them a recomendation for something else to use that
can handle all the same types of rewrites. To my knowledge,
filter-repo is the only such tool. So it needs to be mentioned.
* I don't want to give conflicting recommendations to users
* If we recommend two tools, we shouldn't expect users to learn both
and pick which one to use; we should explain which problems one
can solve that the other can't or when one is much faster than
the other.
* BFG and filter-repo have similar performance
* All filtering types that BFG can do, filter-repo can also do. In
fact, filter-repo comes with a reimplementation of BFG named
bfg-ish which provides the same user-interface as BFG but with
several bugfixes and new features that are hard to implement in
BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 273 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 13 ++
9 files changed, 287 insertions(+), 59 deletions(-)
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
into 'git fast-import'.
You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
OPTIONS
-------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..5876598852 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,19 @@ SYNOPSIS
[--original <namespace>] [-d <directory>] [-f | --force]
[--state-branch <branch>] [--] [<rev-list options>...]
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
+carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
+mines of filter-branch, and then vigilantly avoid as many of the hazards
+listed there as reasonably possible.
+
DESCRIPTION
-----------
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +458,236 @@ warned.
(or if your git-gc is not new enough to support arguments to
`--prune`, use `git repack -ad; git prune` instead).
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
- unlike git-filter-branch, does not give you the opportunity to
- handle a file differently based on where or when it was committed
- within your history. This constraint gives the core performance
- benefit of The BFG, and is well-suited to the task of cleansing bad
- data - you don't care _where_ the bad data is, you just want it
- _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
- cleansing commit file-trees in parallel. git-filter-branch cleans
- commits sequentially (i.e. in a single-threaded manner), though it
- _is_ possible to write filters that include their own parallelism,
- in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
+[[PERFORMANCE]]
+PERFORMANCE
+-----------
+
+The performance of git-filter-branch is glacially slow; its design makes it
+impossible for a backward-compatible implementation to ever be fast:
+
+* In editing files, git-filter-branch by design checks out each and
+every commit as it existed in the original repo. If your repo has 10\^5
+files and 10\^5 commits, but each commit only modifies 5 files, then
+git-filter-branch will make you do 10\^10 modifications, despite only
+having (at most) 5*10^5 unique blobs.
+
+* If you try and cheat and try to make git-filter-branch only work on
+files modified in a commit, then two things happen
+
+ ** you run into problems with deletions whenever the user is simply
+ trying to rename files (because attempting to delete files that
+ don't exist looks like a no-op; it takes some chicanery to remap
+ deletes across file renames when the renames happen via arbitrary
+ user-provided shell)
+
+ ** even if you succeed at the map-deletes-for-renames chicanery, you
+ still technically violate backward compatibility because users are
+ allowed to filter files in ways that depend upon topology of
+ commits instead of filtering solely based on file contents or names
+ (though this has not been observed in the wild).
+
+* Even if you don't need to edit files but only want to e.g. rename or
+remove some and thus can avoid checking out each file (i.e. you can use
+--index-filter), you still are passing shell snippets for your filters.
+This means that for every commit, you have to have a prepared git repo
+where those filters can be run. That's a significant setup.
+
+* Further, several additional files are created or updated per commit by
+git-filter-branch. Some of these are for supporting the convenience
+functions provided by git-filter-branch (such as map()), while others
+are for keeping track of internal state (but could have also been
+accessed by user filters; one of git-filter-branch's regression tests
+does so). This essentially amounts to using the filesystem as an IPC
+mechanism between git-filter-branch and the user-provided filters.
+Disks tend to be a slow IPC mechanism, and writing these files also
+effectively represents a forced synchronization point between separate
+processes that we hit with every commit.
+
+* The user-provided shell commands will likely involve a pipeline of
+commands, resulting in the creation of many processes per commit.
+Creating and running another process takes a widely varying amount of
+time between operating systems, but on any platform it is very slow
+relative to invoking a function.
+
+* git-filter-branch itself is written in shell, which is kind of slow.
+This is the one performance issue that could be backward-compatibly
+fixed, but compared to the above problems that are intrinsic to the
+design of git-filter-branch, the language of the tool itself is a
+relatively minor issue.
+
+ ** Side note: Unfortunately, people tend to fixate on the
+ written-in-shell aspect and periodically ask if git-filter-branch
+ could be rewritten in another language to fix the performance
+ issues. Not only does that ignore the bigger intrinsic problems
+ with the design, it'd help less than you'd expect: if
+ git-filter-branch itself were not shell, then the convenience
+ functions (map(), skip_commit(), etc) and the `--setup` argument
+ could no longer be executed once at the beginning of the program
+ but would instead need to be prepended to every user filter (and
+ thus re-executed with every commit).
+
+The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
+an alternative to git-filter-branch which does not suffer from these
+performance problems or the safety problems (mentioned below). For those
+with existing tooling which relies upon git-filter-branch, 'git
+repo-filter' also provides
+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
+a drop-in git-filter-branch replacement (with a few caveats). While
+filter-lamely suffers from all the same safety issues as
+git-filter-branch, it at least ameloriates the performance issues a
+little.
+
+[[SAFETY]]
+SAFETY
+------
+
+git-filter-branch is riddled with gotchas resulting in various ways to
+easily corrupt repos or end up with a mess worse than what you started
+with:
+
+* Someone can have a set of "working and tested filters" which they
+document or provide to a coworker, who then runs them on a different OS
+where the same commands are not working/tested (some examples in the
+git-filter-branch manpage are also affected by this). BSD vs. GNU
+userland differences can really bite. If lucky, error messages are
+spewed. But just as likely, the commands either don't do the filtering
+requested, or silently corrupt by making some unwanted change. The
+unwanted change may only affect a few commits, so it's not necessarily
+obvious either. (The fact that problems won't necessarily be obvious
+means they are likely to go unnoticed until the rewritten history is in
+use for quite a while, at which point it's really hard to justify
+another flag-day for another rewrite.)
+
+* Filenames with spaces are often mishandled by shell snippets since
+they cause problems for shell pipelines. Not everyone is familiar with
+find -print0, xargs -0, git-ls-files -z, etc. Even people who are
+familiar with these may assume such flags are not relevant because
+someone else renamed any such files in their repo back before the person
+doing the filtering joined the project. And often, even those familiar
+with handling arguments with spaces may not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
+* Non-ascii filenames can be silently removed despite being in a desired
+directory. Keeping only wanted paths is often done using pipelines like
+`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
+only quote filenames if needed, so folks may not notice that one of the
+files didn't match the regex (at least not until it's much too late).
+Yes, someone who knows about core.quotePath can avoid this (unless they
+have other special characters like \t, \n, or "), and people who use
+ls-files -z with something other than grep can avoid this, but that
+doesn't mean they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
+that includes a double quote character. (This is technically the same
+issue as above with quoting, but perhaps an interesting different way
+that it can and has manifested as a problem.)
+
+* It's far too easy to accidentally mix up old and new history. It's
+still possible with any tool, but git-filter-branch almost invites it.
+If lucky, the only downside is users getting frustrated that they don't
+know how to shrink their repo and remove the old stuff. If unlucky,
+they merge old and new history and end up with multiple "copies" of each
+commit, some of which have unwanted or sensitive files and others which
+don't. This comes about in multiple different ways:
+
+ ** the default to only doing a partial history rewrite ('--all' is not
+ the default and few examples show it)
+
+ ** the fact that there's no automatic post-run cleanup
+
+ ** the fact that --tag-name-filter (when used to rename tags) doesn't
+ remove the old tags but just adds new ones with the new name
+
+ ** the fact that little educational information is provided to inform
+ users of the ramifications of a rewrite and how to avoid mixing old
+ and new history. For example, this man page discusses how users
+ need to understand that they need to rebase their changes for all
+ their branches on top of new history (or delete and reclone), but
+ that's only one of multiple concerns to consider. See the
+ "DISCUSSION" section of the git filter-repo manual page for more
+ details.
+
+* Annotated tags can be accidentally converted to lightweight tags, due
+to either of two issues:
+
+ ** Someone can do a history rewrite, realize they messed up, restore
+ from the backups in refs/original/, and then redo their
+ git-filter-branch command. (The backup in refs/original/ is not a
+ real backup; it dereferences tags first.)
+
+ ** Running git-filter-branch with either --tags or --all in your
+ <rev-list options>. In order to retain annotated tags as
+ annotated, you must use --tag-name-filter (and must not have
+ restored from refs/original/ in a previously botched rewrite).
+
+* Any commit messages that specify an encoding will become corrupted
+by the rewrite; git-filter-branch ignores the encoding, takes the original
+bytes, and feeds it to commit-tree without telling it the proper
+encoding. (This happens whether or not --msg-filter is used.)
+
+* Commit messages (even if they are all UTF-8) by default become
+corrupted due to not being updated -- any references to other commit
+hashes in commit messages will now refer to no-longer-extant commits.
+
+* There are no facilities for helping users find what unwanted crud they
+should delete, which means they are much more likely to have incomplete
+or partial cleanups that sometimes result in confusion and people
+wasting time trying to understand. (For example, folks tend to just
+look for big files to delete instead of big directories or extensions,
+and once they do so, then sometime later folks using the new repository
+who are going through history will notice a build artifact directory
+that has some files but not others, or a cache of dependencies
+(node_modules or similar) which couldn't have ever been functional since
+it's missing some files.)
+
+* If --prune-empty isn't specified, then the filtering process can
+create hoards of confusing empty commits
+
+* If --prune-empty is specified, then intentionally placed empty
+commits from before the filtering operation are also pruned instead of
+just pruning commits that became empty due to filtering rules.
+
+* If --prune empty is specified, sometimes empty commits are missed
+and left around anyway (a somewhat rare bug, but it happens...)
+
+* A minor issue, but users who have a goal to update all names and
+emails in a repository may be led to --env-filter which will only update
+authors and committers, missing taggers.
+
+* If the user provides a --tag-name-filter that maps multiple tags to
+the same name, no warning or error is provided; git-filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
+only one tag at the end. (A git-filter-branch regression test requires
+this surprising behavior.)
+
+Also, the poor performance of git-filter-branch often leads to safety
+issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
+such as deleting a couple files. Unfortunately, people often learn if
+the snippet is right or wrong by trying it out, but the rightness or
+wrongness can vary depending on special circumstances (spaces in
+filenames, non-ascii filenames, funny author names or emails, invalid
+timezones, presence of grafts or replace objects, etc.), meaning they
+may have to wait a long time, hit an error, then restart. The
+performance of git-filter-branch is so bad that this cycle is painful,
+reducing the time available to carefully re-check (to say nothing about
+what it does to the patience of the person doing the rewrite even if
+they do technically have more time available). This problem is extra
+compounded because errors from broken filters may not be shown for a
+long time and/or get lost in a sea of output. Even worse, broken
+filters often just result in silent incorrect rewrites.
+
+* To top it all off, even when users finally find working commands, they
+naturally want to share them. But they may be unaware that their repo
+didn't have some special cases that someone else's does. So, when
+someone else with a different repository runs the same commands, they
+get hit by the problems above. Or, the user just runs commands that
+really were vetted for special cases, but they run it on a different OS
+where it doesn't work, as noted above.
GIT
---
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@ NOTES
-----
'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace. If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..a8cfc0ad82 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,8 @@ Hard case: The changes are not the same.::
This happens if the 'subsystem' rebase had conflicts, or used
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
+ a full history rewriting command like
+ https://github.com/newren/git-filter-repo[`filter-repo`].
The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..f271d758c3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@ The following format are available:
CREATING REPLACEMENT OBJECTS
----------------------------
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
SEE ALSO
--------
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
linkgit:git-var[1]
linkgit:git[1]
+https://github.com/newren/git-filter-repo[git-filter-repo]
GIT
---
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..53774f5b64 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
+
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
-and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
-reformatting of metadata for ease-of-reading and rewriting authorship
-info for non-"svn.authorsFile" users.
+reports, and archives. If you plan to eventually migrate from SVN to
+Git and are certain about dropping SVN history, consider
+https://github.com/newren/git-filter-repo[git-filter-repo] instead.
+filter-repo also allows reformatting of metadata for ease-of-reading
+and rewriting authorship info for non-"svn.authorsFile" users.
svn.useSvmProps::
svn-remote.<name>.useSvmProps::
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..5a789c91df 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,10 +425,12 @@ post-rewrite
This hook is invoked by commands that rewrite commits
(linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
-arguments may be passed in the future.
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or
+https://github.com/newren/git-filter-repo[git-filter-repo] typically
+do not call it!). Its first argument denotes the command it was
+invoked by: currently one of `amend` or `rebase`. Further
+command-dependent arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
format
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@ line. This line has the form `git-svn-id: URL@REVNO UUID`.
The resulting repository will generally require further processing
to put each project in its own repository and to separate the history
-of each branch. The 'git filter-branch --subdirectory-filter' command
+of each branch. The 'git filter-repo --subdirectory-filter' command
may be useful for this purpose.
BUGS
@@ -67,5 +67,5 @@ The exit status does not reflect whether an error was detected.
SEE ALSO
--------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 5c5afa2b98..f805965d87 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,19 @@ set_ident () {
finish_ident COMMITTER
}
+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
+ -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+ rewrites. Please use an alternative filtering tool such as 'git
+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
+ See the filter-branch manual page for more details; to squelch
+ this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
+
+EOF
+ sleep 5
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
[--tree-filter <command>] [--index-filter <command>]
[--parent-filter <command>] [--msg-filter <command>]
--
2.23.0.39.gf92d9de5c3
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-09-03 18:55 ` [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-09-03 21:40 ` Junio C Hamano
2019-09-04 20:30 ` Elijah Newren
0 siblings, 1 reply; 73+ messages in thread
From: Junio C Hamano @ 2019-09-03 21:40 UTC (permalink / raw)
To: Elijah Newren
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
Elijah Newren <newren@gmail.com> writes:
> diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> index 5c5afa2b98..f805965d87 100755
> --- a/git-filter-branch.sh
> +++ b/git-filter-branch.sh
> @@ -83,6 +83,19 @@ set_ident () {
> finish_ident COMMITTER
> }
>
> +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> + -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
This is probably the only place where [] instead of "test" is used
in our shell scripts.
if test -z "$FILTER_BRANCH_SQUELCH_WARNING$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
then
...
> + cat <<EOF
> +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> + rewrites. Please use an alternative filtering tool such as 'git
> + filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> + See the filter-branch manual page for more details; to squelch
> + this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
> +
> +EOF
> + sleep 5
> +fi
This should say it is "sleeping while showing the message and can
safely be killed before starting to do any harm"; alternatively it
should lose the "sleep". The user would have fear against typing ^C
to get out of a bulk history rewrite command, and the message itself
is making the fear worse. If your goal is to discourage its use,
then it would be a good idea to make it clear when it is safe to
kill it before going and studying the alternative. Otherwise, the
sleep does not help that much---the main complaint is that filter
branch is too slow, so the user has plenty of time to read the
message anyway, right? ;-)
> USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
> [--tree-filter <command>] [--index-filter <command>]
> [--parent-filter <command>] [--msg-filter <command>]
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch
2019-09-03 21:40 ` Junio C Hamano
@ 2019-09-04 20:30 ` Elijah Newren
0 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 20:30 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine
On Tue, Sep 3, 2019 at 2:40 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > diff --git a/git-filter-branch.sh b/git-filter-branch.sh
> > index 5c5afa2b98..f805965d87 100755
> > --- a/git-filter-branch.sh
> > +++ b/git-filter-branch.sh
> > @@ -83,6 +83,19 @@ set_ident () {
> > finish_ident COMMITTER
> > }
> >
> > +if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
> > + -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
>
> This is probably the only place where [] instead of "test" is used
> in our shell scripts.
>
> if test -z "$FILTER_BRANCH_SQUELCH_WARNING$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
> then
> ...
Yeah, git-filter-branch.sh has approximately twice as many uses of []
than "test", so it seemed in line with its coding style. I can switch
it over.
> > + cat <<EOF
> > +WARNING: git-filter-branch has a glut of gotchas generating mangled history
> > + rewrites. Please use an alternative filtering tool such as 'git
> > + filter-repo' (https://github.com/newren/git-filter-repo/) instead.
> > + See the filter-branch manual page for more details; to squelch
> > + this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
> > +
> > +EOF
> > + sleep 5
> > +fi
>
> This should say it is "sleeping while showing the message and can
> safely be killed before starting to do any harm"; alternatively it
> should lose the "sleep". The user would have fear against typing ^C
> to get out of a bulk history rewrite command, and the message itself
> is making the fear worse. If your goal is to discourage its use,
> then it would be a good idea to make it clear when it is safe to
> kill it before going and studying the alternative. Otherwise, the
> sleep does not help that much---the main complaint is that filter
> branch is too slow, so the user has plenty of time to read the
> message anyway, right? ;-)
Makes sense; will fix.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v5 4/4] t9902: use a non-deprecated command for testing
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (2 preceding siblings ...)
2019-09-03 18:55 ` [PATCH v5 3/4] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-09-03 18:55 ` Elijah Newren
2019-09-04 22:32 ` [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it Elijah Newren
4 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-03 18:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
t9902 had a list of three random porcelain commands as a sanity check,
one of which was filter-branch. Since we are recommending people not
use filter-branch, let's update this test to use rebase instead of
filter-branch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t9902-completion.sh | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 75512c3403..4e7f669c76 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -28,10 +28,10 @@ complete ()
#
# (2) A test makes sure that common subcommands are included in the
# completion for "git <TAB>", and a plumbing is excluded. "add",
-# "filter-branch" and "ls-files" are listed for this.
+# "rebase" and "ls-files" are listed for this.
-GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr filter-branch ls-files'
-GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout filter-branch'
+GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr rebase ls-files'
+GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout rebase'
. "$GIT_BUILD_DIR/contrib/completion/git-completion.bash"
@@ -1392,12 +1392,12 @@ test_expect_success 'basic' '
# built-in
grep -q "^add \$" out &&
# script
- grep -q "^filter-branch \$" out &&
+ grep -q "^rebase \$" out &&
# plumbing
! grep -q "^ls-files \$" out &&
- run_completion "git f" &&
- ! grep -q -v "^f" out
+ run_completion "git r" &&
+ ! grep -q -v "^r" out
'
test_expect_success 'double dash "git" itself' '
--
2.23.0.39.gf92d9de5c3
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it
2019-09-03 18:55 ` [PATCH v5 0/4] Warn about git-filter-branch usage and avoid it Elijah Newren
` (3 preceding siblings ...)
2019-09-03 18:55 ` [PATCH v5 4/4] t9902: use a non-deprecated command for testing Elijah Newren
@ 2019-09-04 22:32 ` Elijah Newren
2019-09-04 22:32 ` [PATCH v6 1/3] t6006: simplify, fix, and optimize empty message test Elijah Newren
` (2 more replies)
4 siblings, 3 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 22:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Changes since v5 (full range-diff below):
* Dropped patch 3 (which was rebased on top of js/rebase-r-strategy and
submitted separately)[1]
* Updated t6006 to include both an empty commit message and a commit
message with just a line feed
* Made the two small tweaks Junio suggested to git-filter-branch.sh
[1] https://public-inbox.org/git/20190904214048.29331-1-newren@gmail.com/
Elijah Newren (3):
t6006: simplify, fix, and optimize empty message test
Recommend git-filter-repo instead of git-filter-branch
t9902: use a non-deprecated command for testing
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 273 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 14 ++
t/t6006-rev-list-format.sh | 5 +-
t/t9902-completion.sh | 12 +-
11 files changed, 296 insertions(+), 68 deletions(-)
Range-diff:
1: ccea0e5846 ! 1: d5370568a4 t6006: simplify and optimize empty message test
@@ Metadata
Author: Elijah Newren <newren@gmail.com>
## Commit message ##
- t6006: simplify and optimize empty message test
+ t6006: simplify, fix, and optimize empty message test
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
- the commit messages to be empty. This test was written this way because
- the --allow-empty-message option to git commit did not exist at the
- time. Simplify this test and avoid the need to invoke filter-branch by
- just using --allow-empty-message when creating the commit.
+ the commit messages to be "empty". This test was introduced in commit
+ 1fb5fdd25f0 ("rev-list: fix --pretty=oneline with empty message",
+ 2010-03-21) and written this way because the --allow-empty-message
+ option to git commit did not exist at the time.
+
+ However, the filter-branch invocation used differed slightly from
+ --allow-empty-message in that it would have a commit message consisting
+ solely of a single newline, and as such was not testing what the
+ original commit intended to test. Since both a truly empty commit
+ message and a commit message with a single linefeed could trigger the
+ original bug, modify the test slightly to include an example of each.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
@@ t/t6006-rev-list-format.sh: test_expect_success 'reflog identity' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
-+ git commit --allow-empty --allow-empty-message &&
++ git commit --allow-empty --cleanup=verbatim -m "$LF" &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
2: 6d73135006 < -: ---------- t3427: accelerate this test by using fast-export and fast-import
3: 2f225c8697 ! 2: 8635410b88 Recommend git-filter-repo instead of git-filter-branch
@@ git-filter-branch.sh: set_ident () {
finish_ident COMMITTER
}
-+if [ -z "$FILTER_BRANCH_SQUELCH_WARNING" -a \
-+ -z "$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS" ]; then
++if test -z "$FILTER_BRANCH_SQUELCH_WARNING$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
++then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
-+ rewrites. Please use an alternative filtering tool such as 'git
-+ filter-repo' (https://github.com/newren/git-filter-repo/) instead.
-+ See the filter-branch manual page for more details; to squelch
-+ this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
-+
++ rewrites. Hit Ctrl-C before proceeding to abort, then use an
++ alternative filtering tool such as 'git filter-repo'
++ (https://github.com/newren/git-filter-repo/) instead. See the
++ filter-branch manual page for more details; to squelch this warning,
++ set FILTER_BRANCH_SQUELCH_WARNING=1.
+EOF
-+ sleep 5
++ sleep 10
++ printf "Proceeding with filter-branch...\n\n"
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
4: 048eba375b = 3: 19edb94ec2 t9902: use a non-deprecated command for testing
--
2.23.0.3.g19edb94ec2
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCH v6 1/3] t6006: simplify, fix, and optimize empty message test
2019-09-04 22:32 ` [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it Elijah Newren
@ 2019-09-04 22:32 ` Elijah Newren
2019-09-04 22:32 ` [PATCH v6 2/3] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
2019-09-04 22:32 ` [PATCH v6 3/3] t9902: use a non-deprecated command for testing Elijah Newren
2 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 22:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
Test t6006.71 ("oneline with empty message") was creating two commits
with simple commit messages, and then running filter-branch to rewrite
the commit messages to be "empty". This test was introduced in commit
1fb5fdd25f0 ("rev-list: fix --pretty=oneline with empty message",
2010-03-21) and written this way because the --allow-empty-message
option to git commit did not exist at the time.
However, the filter-branch invocation used differed slightly from
--allow-empty-message in that it would have a commit message consisting
solely of a single newline, and as such was not testing what the
original commit intended to test. Since both a truly empty commit
message and a commit message with a single linefeed could trigger the
original bug, modify the test slightly to include an example of each.
Despite only being one piece of the 71st test and there being 73 tests
overall, this small change to just this one test speeds up the overall
execution time of t6006 (as measured by the best of 3 runs of `time
./t6006-rev-list-format.sh`) by about 11% on Linux, 13% on Mac, and
about 15% on Windows.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t6006-rev-list-format.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index da113d975b..cfb74d0e03 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -501,9 +501,8 @@ test_expect_success 'reflog identity' '
'
test_expect_success 'oneline with empty message' '
- git commit -m "dummy" --allow-empty &&
- git commit -m "dummy" --allow-empty &&
- git filter-branch --msg-filter "sed -e s/dummy//" HEAD^^.. &&
+ git commit --allow-empty --cleanup=verbatim -m "$LF" &&
+ git commit --allow-empty --allow-empty-message &&
git rev-list --oneline HEAD >test.txt &&
test_line_count = 5 test.txt &&
git rev-list --oneline --graph HEAD >testg.txt &&
--
2.23.0.3.g19edb94ec2
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v6 2/3] Recommend git-filter-repo instead of git-filter-branch
2019-09-04 22:32 ` [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-09-04 22:32 ` [PATCH v6 1/3] t6006: simplify, fix, and optimize empty message test Elijah Newren
@ 2019-09-04 22:32 ` Elijah Newren
2019-09-04 22:32 ` [PATCH v6 3/3] t9902: use a non-deprecated command for testing Elijah Newren
2 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 22:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.
Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix
these issues and to avoid recommending filter-branch.
Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this,
especially since I feel like I learned so much from BFG that I put to
good use in filter-repo (which is much more than I can say for
filter-branch), but keeping that section presented a few problems:
* In order to recommend that people quit using filter-branch, we need
to provide them a recomendation for something else to use that
can handle all the same types of rewrites. To my knowledge,
filter-repo is the only such tool. So it needs to be mentioned.
* I don't want to give conflicting recommendations to users
* If we recommend two tools, we shouldn't expect users to learn both
and pick which one to use; we should explain which problems one
can solve that the other can't or when one is much faster than
the other.
* BFG and filter-repo have similar performance
* All filtering types that BFG can do, filter-repo can also do. In
fact, filter-repo comes with a reimplementation of BFG named
bfg-ish which provides the same user-interface as BFG but with
several bugfixes and new features that are hard to implement in
BFG due to its technical underpinnings.
While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it
is just better to remove that section altogether.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
Documentation/git-fast-export.txt | 6 +-
Documentation/git-filter-branch.txt | 273 +++++++++++++++++++++++++---
Documentation/git-gc.txt | 17 +-
Documentation/git-rebase.txt | 3 +-
Documentation/git-replace.txt | 10 +-
Documentation/git-svn.txt | 10 +-
Documentation/githooks.txt | 10 +-
contrib/svn-fe/svn-fe.txt | 4 +-
git-filter-branch.sh | 14 ++
9 files changed, 288 insertions(+), 59 deletions(-)
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index cc940eb9ad..784e934009 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
into 'git fast-import'.
You can use it as a human-readable bundle replacement (see
-linkgit:git-bundle[1]), or as a kind of an interactive
-'git filter-branch'.
-
+linkgit:git-bundle[1]), or as a format that can be edited before being
+fed to 'git fast-import' in order to do history rewrites (an ability
+relied on by tools like 'git filter-repo').
OPTIONS
-------
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 6b53dd7e06..5876598852 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -16,6 +16,19 @@ SYNOPSIS
[--original <namespace>] [-d <directory>] [-f | --force]
[--state-branch <branch>] [--] [<rev-list options>...]
+WARNING
+-------
+'git filter-branch' has a plethora of pitfalls that can produce non-obvious
+manglings of the intended history rewrite (and can leave you with little
+time to investigate such problems since it has such abysmal performance).
+These safety and performance issues cannot be backward compatibly fixed and
+as such, its use is not recommended. Please use an alternative history
+filtering tool such as https://github.com/newren/git-filter-repo/[git
+filter-repo]. If you still need to use 'git filter-branch', please
+carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
+mines of filter-branch, and then vigilantly avoid as many of the hazards
+listed there as reasonably possible.
+
DESCRIPTION
-----------
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +458,236 @@ warned.
(or if your git-gc is not new enough to support arguments to
`--prune`, use `git repack -ad; git prune` instead).
-NOTES
------
-
-git-filter-branch allows you to make complex shell-scripted rewrites
-of your Git history, but you probably don't need this flexibility if
-you're simply _removing unwanted data_ like large files or passwords.
-For those operations you may want to consider
-http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
-a JVM-based alternative to git-filter-branch, typically at least
-10-50x faster for those use-cases, and with quite different
-characteristics:
-
-* Any particular version of a file is cleaned exactly _once_. The BFG,
- unlike git-filter-branch, does not give you the opportunity to
- handle a file differently based on where or when it was committed
- within your history. This constraint gives the core performance
- benefit of The BFG, and is well-suited to the task of cleansing bad
- data - you don't care _where_ the bad data is, you just want it
- _gone_.
-
-* By default The BFG takes full advantage of multi-core machines,
- cleansing commit file-trees in parallel. git-filter-branch cleans
- commits sequentially (i.e. in a single-threaded manner), though it
- _is_ possible to write filters that include their own parallelism,
- in the scripts executed against each commit.
-
-* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
- are much more restrictive than git-filter branch, and dedicated just
- to the tasks of removing unwanted data- e.g:
- `--strip-blobs-bigger-than 1M`.
+[[PERFORMANCE]]
+PERFORMANCE
+-----------
+
+The performance of git-filter-branch is glacially slow; its design makes it
+impossible for a backward-compatible implementation to ever be fast:
+
+* In editing files, git-filter-branch by design checks out each and
+every commit as it existed in the original repo. If your repo has 10\^5
+files and 10\^5 commits, but each commit only modifies 5 files, then
+git-filter-branch will make you do 10\^10 modifications, despite only
+having (at most) 5*10^5 unique blobs.
+
+* If you try and cheat and try to make git-filter-branch only work on
+files modified in a commit, then two things happen
+
+ ** you run into problems with deletions whenever the user is simply
+ trying to rename files (because attempting to delete files that
+ don't exist looks like a no-op; it takes some chicanery to remap
+ deletes across file renames when the renames happen via arbitrary
+ user-provided shell)
+
+ ** even if you succeed at the map-deletes-for-renames chicanery, you
+ still technically violate backward compatibility because users are
+ allowed to filter files in ways that depend upon topology of
+ commits instead of filtering solely based on file contents or names
+ (though this has not been observed in the wild).
+
+* Even if you don't need to edit files but only want to e.g. rename or
+remove some and thus can avoid checking out each file (i.e. you can use
+--index-filter), you still are passing shell snippets for your filters.
+This means that for every commit, you have to have a prepared git repo
+where those filters can be run. That's a significant setup.
+
+* Further, several additional files are created or updated per commit by
+git-filter-branch. Some of these are for supporting the convenience
+functions provided by git-filter-branch (such as map()), while others
+are for keeping track of internal state (but could have also been
+accessed by user filters; one of git-filter-branch's regression tests
+does so). This essentially amounts to using the filesystem as an IPC
+mechanism between git-filter-branch and the user-provided filters.
+Disks tend to be a slow IPC mechanism, and writing these files also
+effectively represents a forced synchronization point between separate
+processes that we hit with every commit.
+
+* The user-provided shell commands will likely involve a pipeline of
+commands, resulting in the creation of many processes per commit.
+Creating and running another process takes a widely varying amount of
+time between operating systems, but on any platform it is very slow
+relative to invoking a function.
+
+* git-filter-branch itself is written in shell, which is kind of slow.
+This is the one performance issue that could be backward-compatibly
+fixed, but compared to the above problems that are intrinsic to the
+design of git-filter-branch, the language of the tool itself is a
+relatively minor issue.
+
+ ** Side note: Unfortunately, people tend to fixate on the
+ written-in-shell aspect and periodically ask if git-filter-branch
+ could be rewritten in another language to fix the performance
+ issues. Not only does that ignore the bigger intrinsic problems
+ with the design, it'd help less than you'd expect: if
+ git-filter-branch itself were not shell, then the convenience
+ functions (map(), skip_commit(), etc) and the `--setup` argument
+ could no longer be executed once at the beginning of the program
+ but would instead need to be prepended to every user filter (and
+ thus re-executed with every commit).
+
+The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
+an alternative to git-filter-branch which does not suffer from these
+performance problems or the safety problems (mentioned below). For those
+with existing tooling which relies upon git-filter-branch, 'git
+repo-filter' also provides
+https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
+a drop-in git-filter-branch replacement (with a few caveats). While
+filter-lamely suffers from all the same safety issues as
+git-filter-branch, it at least ameloriates the performance issues a
+little.
+
+[[SAFETY]]
+SAFETY
+------
+
+git-filter-branch is riddled with gotchas resulting in various ways to
+easily corrupt repos or end up with a mess worse than what you started
+with:
+
+* Someone can have a set of "working and tested filters" which they
+document or provide to a coworker, who then runs them on a different OS
+where the same commands are not working/tested (some examples in the
+git-filter-branch manpage are also affected by this). BSD vs. GNU
+userland differences can really bite. If lucky, error messages are
+spewed. But just as likely, the commands either don't do the filtering
+requested, or silently corrupt by making some unwanted change. The
+unwanted change may only affect a few commits, so it's not necessarily
+obvious either. (The fact that problems won't necessarily be obvious
+means they are likely to go unnoticed until the rewritten history is in
+use for quite a while, at which point it's really hard to justify
+another flag-day for another rewrite.)
+
+* Filenames with spaces are often mishandled by shell snippets since
+they cause problems for shell pipelines. Not everyone is familiar with
+find -print0, xargs -0, git-ls-files -z, etc. Even people who are
+familiar with these may assume such flags are not relevant because
+someone else renamed any such files in their repo back before the person
+doing the filtering joined the project. And often, even those familiar
+with handling arguments with spaces may not do so just because they
+aren't in the mindset of thinking about everything that could possibly
+go wrong.
+
+* Non-ascii filenames can be silently removed despite being in a desired
+directory. Keeping only wanted paths is often done using pipelines like
+`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
+only quote filenames if needed, so folks may not notice that one of the
+files didn't match the regex (at least not until it's much too late).
+Yes, someone who knows about core.quotePath can avoid this (unless they
+have other special characters like \t, \n, or "), and people who use
+ls-files -z with something other than grep can avoid this, but that
+doesn't mean they will.
+
+* Similarly, when moving files around, one can find that filenames with
+non-ascii or special characters end up in a different directory, one
+that includes a double quote character. (This is technically the same
+issue as above with quoting, but perhaps an interesting different way
+that it can and has manifested as a problem.)
+
+* It's far too easy to accidentally mix up old and new history. It's
+still possible with any tool, but git-filter-branch almost invites it.
+If lucky, the only downside is users getting frustrated that they don't
+know how to shrink their repo and remove the old stuff. If unlucky,
+they merge old and new history and end up with multiple "copies" of each
+commit, some of which have unwanted or sensitive files and others which
+don't. This comes about in multiple different ways:
+
+ ** the default to only doing a partial history rewrite ('--all' is not
+ the default and few examples show it)
+
+ ** the fact that there's no automatic post-run cleanup
+
+ ** the fact that --tag-name-filter (when used to rename tags) doesn't
+ remove the old tags but just adds new ones with the new name
+
+ ** the fact that little educational information is provided to inform
+ users of the ramifications of a rewrite and how to avoid mixing old
+ and new history. For example, this man page discusses how users
+ need to understand that they need to rebase their changes for all
+ their branches on top of new history (or delete and reclone), but
+ that's only one of multiple concerns to consider. See the
+ "DISCUSSION" section of the git filter-repo manual page for more
+ details.
+
+* Annotated tags can be accidentally converted to lightweight tags, due
+to either of two issues:
+
+ ** Someone can do a history rewrite, realize they messed up, restore
+ from the backups in refs/original/, and then redo their
+ git-filter-branch command. (The backup in refs/original/ is not a
+ real backup; it dereferences tags first.)
+
+ ** Running git-filter-branch with either --tags or --all in your
+ <rev-list options>. In order to retain annotated tags as
+ annotated, you must use --tag-name-filter (and must not have
+ restored from refs/original/ in a previously botched rewrite).
+
+* Any commit messages that specify an encoding will become corrupted
+by the rewrite; git-filter-branch ignores the encoding, takes the original
+bytes, and feeds it to commit-tree without telling it the proper
+encoding. (This happens whether or not --msg-filter is used.)
+
+* Commit messages (even if they are all UTF-8) by default become
+corrupted due to not being updated -- any references to other commit
+hashes in commit messages will now refer to no-longer-extant commits.
+
+* There are no facilities for helping users find what unwanted crud they
+should delete, which means they are much more likely to have incomplete
+or partial cleanups that sometimes result in confusion and people
+wasting time trying to understand. (For example, folks tend to just
+look for big files to delete instead of big directories or extensions,
+and once they do so, then sometime later folks using the new repository
+who are going through history will notice a build artifact directory
+that has some files but not others, or a cache of dependencies
+(node_modules or similar) which couldn't have ever been functional since
+it's missing some files.)
+
+* If --prune-empty isn't specified, then the filtering process can
+create hoards of confusing empty commits
+
+* If --prune-empty is specified, then intentionally placed empty
+commits from before the filtering operation are also pruned instead of
+just pruning commits that became empty due to filtering rules.
+
+* If --prune empty is specified, sometimes empty commits are missed
+and left around anyway (a somewhat rare bug, but it happens...)
+
+* A minor issue, but users who have a goal to update all names and
+emails in a repository may be led to --env-filter which will only update
+authors and committers, missing taggers.
+
+* If the user provides a --tag-name-filter that maps multiple tags to
+the same name, no warning or error is provided; git-filter-branch simply
+overwrites each tag in some undocumented pre-defined order resulting in
+only one tag at the end. (A git-filter-branch regression test requires
+this surprising behavior.)
+
+Also, the poor performance of git-filter-branch often leads to safety
+issues:
+
+* Coming up with the correct shell snippet to do the filtering you want
+is sometimes difficult unless you're just doing a trivial modification
+such as deleting a couple files. Unfortunately, people often learn if
+the snippet is right or wrong by trying it out, but the rightness or
+wrongness can vary depending on special circumstances (spaces in
+filenames, non-ascii filenames, funny author names or emails, invalid
+timezones, presence of grafts or replace objects, etc.), meaning they
+may have to wait a long time, hit an error, then restart. The
+performance of git-filter-branch is so bad that this cycle is painful,
+reducing the time available to carefully re-check (to say nothing about
+what it does to the patience of the person doing the rewrite even if
+they do technically have more time available). This problem is extra
+compounded because errors from broken filters may not be shown for a
+long time and/or get lost in a sea of output. Even worse, broken
+filters often just result in silent incorrect rewrites.
+
+* To top it all off, even when users finally find working commands, they
+naturally want to share them. But they may be unaware that their repo
+didn't have some special cases that someone else's does. So, when
+someone else with a different repository runs the same commands, they
+get hit by the problems above. Or, the user just runs commands that
+really were vetted for special cases, but they run it on a different OS
+where it doesn't work, as noted above.
GIT
---
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..0c114ad1ca 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -115,15 +115,14 @@ NOTES
-----
'git gc' tries very hard not to delete objects that are referenced
-anywhere in your repository. In
-particular, it will keep not only objects referenced by your current set
-of branches and tags, but also objects referenced by the index,
-remote-tracking branches, refs saved by 'git filter-branch' in
-refs/original/, reflogs (which may reference commits in branches
-that were later amended or rewound), and anything else in the refs/* namespace.
-If you are expecting some objects to be deleted and they aren't, check
-all of those locations and decide whether it makes sense in your case to
-remove those references.
+anywhere in your repository. In particular, it will keep not only
+objects referenced by your current set of branches and tags, but also
+objects referenced by the index, remote-tracking branches, notes saved
+by 'git notes' under refs/notes/, reflogs (which may reference commits
+in branches that were later amended or rewound), and anything else in
+the refs/* namespace. If you are expecting some objects to be deleted
+and they aren't, check all of those locations and decide whether it
+makes sense in your case to remove those references.
On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 6156609cf7..a8cfc0ad82 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -832,7 +832,8 @@ Hard case: The changes are not the same.::
This happens if the 'subsystem' rebase had conflicts, or used
`--interactive` to omit, edit, squash, or fixup commits; or
if the upstream used one of `commit --amend`, `reset`, or
- `filter-branch`.
+ a full history rewriting command like
+ https://github.com/newren/git-filter-repo[`filter-repo`].
The easy case
diff --git a/Documentation/git-replace.txt b/Documentation/git-replace.txt
index 246dc9943c..f271d758c3 100644
--- a/Documentation/git-replace.txt
+++ b/Documentation/git-replace.txt
@@ -123,10 +123,10 @@ The following format are available:
CREATING REPLACEMENT OBJECTS
----------------------------
-linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
-linkgit:git-rebase[1], among other git commands, can be used to create
-replacement objects from existing objects. The `--edit` option can
-also be used with 'git replace' to create a replacement object by
+linkgit:git-hash-object[1], linkgit:git-rebase[1], and
+https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
+create replacement objects from existing objects. The `--edit` option
+can also be used with 'git replace' to create a replacement object by
editing an existing object.
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
SEE ALSO
--------
linkgit:git-hash-object[1]
-linkgit:git-filter-branch[1]
linkgit:git-rebase[1]
linkgit:git-tag[1]
linkgit:git-branch[1]
linkgit:git-commit[1]
linkgit:git-var[1]
linkgit:git[1]
+https://github.com/newren/git-filter-repo[git-filter-repo]
GIT
---
diff --git a/Documentation/git-svn.txt b/Documentation/git-svn.txt
index 30711625fd..53774f5b64 100644
--- a/Documentation/git-svn.txt
+++ b/Documentation/git-svn.txt
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
+
This option is NOT recommended as it makes it difficult to track down
old references to SVN revision numbers in existing documentation, bug
-reports and archives. If you plan to eventually migrate from SVN to Git
-and are certain about dropping SVN history, consider
-linkgit:git-filter-branch[1] instead. filter-branch also allows
-reformatting of metadata for ease-of-reading and rewriting authorship
-info for non-"svn.authorsFile" users.
+reports, and archives. If you plan to eventually migrate from SVN to
+Git and are certain about dropping SVN history, consider
+https://github.com/newren/git-filter-repo[git-filter-repo] instead.
+filter-repo also allows reformatting of metadata for ease-of-reading
+and rewriting authorship info for non-"svn.authorsFile" users.
svn.useSvmProps::
svn-remote.<name>.useSvmProps::
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 82cd573776..5a789c91df 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -425,10 +425,12 @@ post-rewrite
This hook is invoked by commands that rewrite commits
(linkgit:git-commit[1] when called with `--amend` and
-linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
-it!). Its first argument denotes the command it was invoked by:
-currently one of `amend` or `rebase`. Further command-dependent
-arguments may be passed in the future.
+linkgit:git-rebase[1]; however, full-history (re)writing tools like
+linkgit:git-fast-import[1] or
+https://github.com/newren/git-filter-repo[git-filter-repo] typically
+do not call it!). Its first argument denotes the command it was
+invoked by: currently one of `amend` or `rebase`. Further
+command-dependent arguments may be passed in the future.
The hook receives a list of the rewritten commits on stdin, in the
format
diff --git a/contrib/svn-fe/svn-fe.txt b/contrib/svn-fe/svn-fe.txt
index a3425f4770..19333fc8df 100644
--- a/contrib/svn-fe/svn-fe.txt
+++ b/contrib/svn-fe/svn-fe.txt
@@ -56,7 +56,7 @@ line. This line has the form `git-svn-id: URL@REVNO UUID`.
The resulting repository will generally require further processing
to put each project in its own repository and to separate the history
-of each branch. The 'git filter-branch --subdirectory-filter' command
+of each branch. The 'git filter-repo --subdirectory-filter' command
may be useful for this purpose.
BUGS
@@ -67,5 +67,5 @@ The exit status does not reflect whether an error was detected.
SEE ALSO
--------
-git-svn(1), svn2git(1), svk(1), git-filter-branch(1), git-fast-import(1),
+git-svn(1), svn2git(1), svk(1), git-filter-repo(1), git-fast-import(1),
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 5c5afa2b98..fea7964617 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -83,6 +83,20 @@ set_ident () {
finish_ident COMMITTER
}
+if test -z "$FILTER_BRANCH_SQUELCH_WARNING$GIT_TEST_DISALLOW_ABBREVIATED_OPTIONS"
+then
+ cat <<EOF
+WARNING: git-filter-branch has a glut of gotchas generating mangled history
+ rewrites. Hit Ctrl-C before proceeding to abort, then use an
+ alternative filtering tool such as 'git filter-repo'
+ (https://github.com/newren/git-filter-repo/) instead. See the
+ filter-branch manual page for more details; to squelch this warning,
+ set FILTER_BRANCH_SQUELCH_WARNING=1.
+EOF
+ sleep 10
+ printf "Proceeding with filter-branch...\n\n"
+fi
+
USAGE="[--setup <command>] [--subdirectory-filter <directory>] [--env-filter <command>]
[--tree-filter <command>] [--index-filter <command>]
[--parent-filter <command>] [--msg-filter <command>]
--
2.23.0.3.g19edb94ec2
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v6 3/3] t9902: use a non-deprecated command for testing
2019-09-04 22:32 ` [PATCH v6 0/3] Warn about git-filter-branch usage and avoid it Elijah Newren
2019-09-04 22:32 ` [PATCH v6 1/3] t6006: simplify, fix, and optimize empty message test Elijah Newren
2019-09-04 22:32 ` [PATCH v6 2/3] Recommend git-filter-repo instead of git-filter-branch Elijah Newren
@ 2019-09-04 22:32 ` Elijah Newren
2 siblings, 0 replies; 73+ messages in thread
From: Elijah Newren @ 2019-09-04 22:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Derrick Stolee, Eric Wong, Jeff King,
Ævar Arnfjörð Bjarmason, Johannes Schindelin,
Lars Schneider, Jonathan Nieder, Eric Sunshine, Elijah Newren
t9902 had a list of three random porcelain commands as a sanity check,
one of which was filter-branch. Since we are recommending people not
use filter-branch, let's update this test to use rebase instead of
filter-branch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
t/t9902-completion.sh | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 75512c3403..4e7f669c76 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -28,10 +28,10 @@ complete ()
#
# (2) A test makes sure that common subcommands are included in the
# completion for "git <TAB>", and a plumbing is excluded. "add",
-# "filter-branch" and "ls-files" are listed for this.
+# "rebase" and "ls-files" are listed for this.
-GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr filter-branch ls-files'
-GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout filter-branch'
+GIT_TESTING_ALL_COMMAND_LIST='add checkout check-attr rebase ls-files'
+GIT_TESTING_PORCELAIN_COMMAND_LIST='add checkout rebase'
. "$GIT_BUILD_DIR/contrib/completion/git-completion.bash"
@@ -1392,12 +1392,12 @@ test_expect_success 'basic' '
# built-in
grep -q "^add \$" out &&
# script
- grep -q "^filter-branch \$" out &&
+ grep -q "^rebase \$" out &&
# plumbing
! grep -q "^ls-files \$" out &&
- run_completion "git f" &&
- ! grep -q -v "^f" out
+ run_completion "git r" &&
+ ! grep -q -v "^r" out
'
test_expect_success 'double dash "git" itself' '
--
2.23.0.3.g19edb94ec2
^ permalink raw reply related [flat|nested] 73+ messages in thread