git@vger.kernel.org mailing list mirror (one of many)
From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: New command/tool: git filter-repo
Date: Thu, 31 Jan 2019 21:43:42 +0100
Message-ID: <CABPp-BH==w5APkz9cvUYq7m4qieJ3LWCsYySevgJuZ8bi2RzjQ@mail.gmail.com>
In-Reply-To: <xmqq36p83aq4.fsf@gitster-ct.c.googlers.com>

On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback.  The rough
> > edges I previously mentioned have been fixed, and it has several useful
> > features already, though more development work is ongoing (docs are a
> > bit sparse right now, though -h provides some help).
> >
> > Why filter-repo vs. filter-branch?
>
> How does it compare with bfg-repo-cleaner?  Somehow I was led to
> believe that all serious users of filter-branch like functionality
> are using bfg-repo-cleaner instead.

No, bfg-repo-cleaner only covers an important subset of the use cases.
bfg-repo-cleaner does a really good job if your goal is to remove a
few big files and/or to remove some sensitive text (matched via
regexes) from all blobs.  It was designed for that specific role and
has more options in this area than filter-repo currently has.  But
even within this design space it was optimized for, it is missing two
things that I really want:

  * pruning of commits which become empty due to filtering
  * providing a way for the user to know what needs to be cleaned up.
It has options like --strip-blobs-bigger-than <size> or
--strip-biggest-blobs <NUM>, but no way for the user to figure out
what <size> or <NUM> should be.  Also, since it just focuses on really
big blobs, it misses cases like someone checking in directories with a
huge number of small-to-moderately sized files (e.g. bower_components/
or node_modules/, though these could contain a few big blobs
too), or someone checking in a lot of moderately sized files of a
uniform extension (e.g. .webm, .tar.gz, .zip, .mp4, .avi).  I've seen
cases in the wild where the correct cleaning of history was more about
filtering out directories or extensions than a couple of big files.
filter-repo's --analyze option creates some reports that help with
this tremendously.  Also, the options to delete files by glob/basename
overlook the fact that renames may have occurred.  Having a report
that mentions renames that have occurred in history (also part of
filter-repo's --analyze option) can be very helpful.
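
As a rough illustration of why a per-directory and per-extension report
helps here, the sketch below totals blob sizes by top-level directory and
by file extension.  It is not filter-repo's --analyze implementation or
its report format; the function name summarize_sizes and the (path, size)
input are assumptions made for this example.

```python
from collections import defaultdict
import posixpath


def summarize_sizes(entries):
    """Total blob sizes by top-level directory and by file extension.

    `entries` is an iterable of (path, size_in_bytes) pairs, one per blob
    that ever appeared in history.  How the pairs are collected is left
    open; this input shape is hypothetical.
    """
    by_dir = defaultdict(int)
    by_ext = defaultdict(int)
    for path, size in entries:
        # Files in the repository root are grouped under ".".
        top = path.split("/", 1)[0] if "/" in path else "."
        # splitext only looks at the last suffix, so "x.tar.gz" counts as ".gz".
        ext = posixpath.splitext(path)[1] or "<no extension>"
        by_dir[top] += size
        by_ext[ext] += size
    return dict(by_dir), dict(by_ext)


if __name__ == "__main__":
    sample = [
        ("node_modules/a/index.js", 2000),
        ("node_modules/b/index.js", 3000),
        ("media/intro.webm", 50000),
        ("media/demo.webm", 70000),
        ("README.md", 500),
    ]
    by_dir, by_ext = summarize_sizes(sample)
    # Largest totals first, so directories or extensions that dominate the
    # history stand out even when no single blob is large.
    for name, total in sorted(by_dir.items(), key=lambda kv: -kv[1]):
        print(name, total)
```

With totals like these, a user can see that a directory or an extension
accounts for most of the size even though no individual file would cross
a size threshold.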

Outside of this specific use case, bfg-repo-cleaner is not very useful.
It simply lacks more general filtering capabilities:

  * While bfg-repo-cleaner has facilities to remove certain paths, it
has none to say you only want to keep certain paths.  Unlike
filter-branch where you can use a pipeline to list all files, grep to
remove the ones you want to keep from the list, then pipe the
remainder of paths to xargs git rm, bfg-repo-cleaner doesn't have a
facility for shell commands.  Instead, in bfg-repo-cleaner you would
need to emulate this by exhaustively listing directories and
paths/globs of file basenames to delete, but that assumes the user
knows all paths that have ever existed, making this solution not only
onerous but also error-prone.  More of the filterings I see these days are
about just keeping a directory (or perhaps a handful of them) rather
than just removing or cleaning a few files.  Also, this makes pruning
of commits which become empty much more important, but as noted above,
bfg-repo-cleaner lacks that ability.
  * It has no facilities for renaming paths.  You'd have to use a
different tool to do that, but then why not use the other tool to do
the whole job?  Even if you do decide to use both tools, some
capabilities of one tool can be neutered by such an approach (e.g.
bfg-repo-cleaner's carefully rewritten commit messages that tried to
ensure abbreviated commit shas referred to the new commit ids).
  * It has no facilities for affecting other parts of history, such as
changing author/committer/tagger names or emails, changing commit
timestamp or timezone, reparenting commits, splicing repository
histories together, filtering files differently based on commit
timestamp, etc. -- all of which can be done with filter-repo (though
some of those things require writing a small Python script; see basic
examples in t/lib-usage/*).
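
To make the "keep only certain paths" case concrete, and to show why it
makes pruning of empty commits matter, here is a minimal sketch on a toy
model of history.  It is not code from filter-repo, filter-branch, or
bfg-repo-cleaner: the names keep_only and filter_linear_history are
hypothetical, history is assumed to be linear, and each commit is
modelled as a (message, {path: blob id}) pair.

```python
def keep_only(tree, keep_prefixes):
    """Return the entries of `tree` (path -> blob id) under a kept prefix."""
    prefixes = [p.rstrip("/") for p in keep_prefixes]
    return {
        path: blob
        for path, blob in tree.items()
        # Match the path itself or anything below the directory, so that
        # keeping "lib" does not also keep "libfoo/...".
        if any(path == p or path.startswith(p + "/") for p in prefixes)
    }


def filter_linear_history(commits, keep_prefixes):
    """Filter every commit to the kept paths and drop commits left empty.

    `commits` is a list of (message, tree) pairs, oldest first.  A commit
    is treated as empty when its filtered tree is identical to the
    filtered tree of the commit before it (or has no files at all, for
    the first commit).
    """
    result = []
    previous_tree = {}
    for message, tree in commits:
        filtered = keep_only(tree, keep_prefixes)
        if filtered == previous_tree:
            continue  # nothing under the kept paths changed in this commit
        result.append((message, filtered))
        previous_tree = filtered
    return result


if __name__ == "__main__":
    history = [
        ("add lib", {"lib/a.c": "blob1"}),
        ("add docs", {"lib/a.c": "blob1", "docs/x.md": "blob2"}),
        ("fix lib", {"lib/a.c": "blob3", "docs/x.md": "blob2"}),
    ]
    for message, tree in filter_linear_history(history, ["lib"]):
        print(message, sorted(tree))
```

In this toy history the "add docs" commit touches nothing under lib/, so
without the pruning step it would survive as a commit with no changes.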

Personally, I also find it kind of annoying that bfg-repo-cleaner
doesn't automatically repack and shrink the repo when it is done and
instead prints multiple commands the user can run to achieve that,
even though it's the core use case for the tool.  Granted, they may
have had last-ditch recovery-of-the-original-repo in mind in case the
user ran it in a repository they shouldn't have, but I much prefer to
have the tool just check if the repo looks like a fresh clone and bail
if not, so that users have a far easier recovery mechanism -- just
throw away the clone you were filtering and re-clone.  Once you do
that, auto repacking and shrinking is pretty natural.  (And you can
always provide a --force option to allow filtering & rewriting in a
repo that isn't a fresh clone.)
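
The check-then-bail idea could look roughly like the sketch below.  The
message above does not say which checks filter-repo performs, so the two
heuristics used here (at most one packfile, at most one HEAD reflog
entry) are assumptions chosen for illustration, as are the function names
looks_like_fresh_clone and check_or_bail.

```python
from pathlib import Path


def looks_like_fresh_clone(git_dir):
    """Guess whether `git_dir` (a .git directory) is a fresh clone.

    Assumed heuristics, for illustration only: a fresh clone has at most
    one packfile and at most one entry in the HEAD reflog.
    """
    git_dir = Path(git_dir)
    pack_dir = git_dir / "objects" / "pack"
    packs = list(pack_dir.glob("*.pack")) if pack_dir.is_dir() else []
    if len(packs) > 1:
        return False
    head_log = git_dir / "logs" / "HEAD"
    if head_log.is_file():
        text = head_log.read_text(errors="replace")
        entries = [line for line in text.splitlines() if line.strip()]
        if len(entries) > 1:
            return False
    return True


def check_or_bail(git_dir, force=False):
    """Refuse to continue unless the repo looks fresh or `force` is set."""
    if not force and not looks_like_fresh_clone(git_dir):
        raise SystemExit(
            "refusing to rewrite history: this does not look like a fresh "
            "clone (pass force=True to override)"
        )
```

Bailing before any rewriting starts is what makes "throw away the clone
and re-clone" a sufficient recovery path.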


Elijah

Thread overview: 13+ messages
2019-01-31  8:57 New command/tool: git filter-repo Elijah Newren
2019-01-31 19:09 ` Junio C Hamano
2019-01-31 20:43   ` Elijah Newren [this message]
2019-01-31 23:36     ` Roberto Tyley
2019-02-01  7:38       ` Elijah Newren
2019-01-31 20:47 ` Elijah Newren
2019-02-08  1:25 ` Elijah Newren
2019-02-08 10:22   ` Johannes Schindelin
2019-02-08 18:53 ` Ævar Arnfjörð Bjarmason
2019-02-08 20:13   ` Johannes Schindelin
2019-02-11 16:00     ` Elijah Newren
2019-02-11 15:47   ` Elijah Newren
2019-06-08 16:20   ` Elijah Newren
