From: Elijah Newren <newren@gmail.com>
To: Roberto Tyley <roberto.tyley@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: New command/tool: git filter-repo
Date: Fri, 1 Feb 2019 08:38:40 +0100
Message-ID: <CABPp-BFvKk9D9h-XMjAnOrziKDhfGDG-rX410kb-pt+jF5cinA@mail.gmail.com>
In-Reply-To: <CAFY1edZ6hS5Dz9z5ZAhU59he9SjxetvfTN+ndzUZkjxhsuwEZA@mail.gmail.com>

Hi Roberto,

First of all, thanks for the feedback, and for the awesome work on BFG
repo filter!

On Fri, Feb 1, 2019 at 12:36 AM Roberto Tyley <roberto.tyley@gmail.com> wrote:
>
> On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@gmail.com> wrote:
> > On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > > history, is ready for more widespread testing and feedback.  The rough
> > > > edges I previously mentioned have been fixed, and it has several useful
> > > > features already, though more development work is ongoing (docs are a
> > > > bit sparse right now, though -h provides some help).
> > > >
> > > > Why filter-repo vs. filter-branch?
>
> I like the name! I think a lot of users are interested in filtering
> their entire repo, rather than rewriting a single branch.
>
> > > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > > believe that all serious users of filter-branch like functionality
> > > are using bfg-repo-cleaner instead.
> >
> > No, bfg-repo-cleaner only covers an important subset of the usecases.
>
> That's true - the focus with BFG Repo-Cleaner is on removing unwanted
> data - completely eradicating it from a repo's history. There are some
> mistakes in history that repo owners just really *do not* want to
> share (i.e. large files, private data/credentials), and they can be a
> critical blocker to sharing or working with a Git repo. In terms of
> rewriting history, my internal criterion for what features I really
> want to be in the BFG is: is this unwanted data completely stopping
> many users from sharing their code or doing their work?
>
> I understand that when it comes to rewriting history, there are loads
> of other operations that people sometimes want to perform, beyond
> removing unwanted data - merging/splitting of history,
> anonymization/renaming of committers, etc. Some of those might be nice
> to add to the BFG - but as with many OSS-maintainers, I have limited
> time, and a life to balance outside of software...!

Totally understand; you picked a certain set of usecases and provided
a really good tool for them.  I focused more on repository migration
(not between different version control systems, but some other
special flag-day event; e.g. we have a whole bunch of plugins, several
of them unmaintained, spread across lots of different code hosting
platforms, and someone decides we want to put all the widely used
plugins into a big monorepo for various reasons).  A big migration is
also a good time to do cleanup, especially when a typical plugin's
actual history might be a few hundred kilobytes but for some reason
many are dozens or hundreds of megabytes, so filter-repo also provides
some facilities for cleaning out unwanted stuff.

> > bfg-repo-cleaner does a really good job if your goal is to remove a
> > few big files and/or to remove some sensitive text (matched via
> > regexes) from all blobs.  It was designed for that specific role and
> > has more options in this area than filter-repo currently has.  But
> > even within this design space it was optimized for, it is missing two
> > things that I really want:
> >
> >   * pruning of commits which become empty due to filtering
>
> There certainly have been several users asking for this feature on the
> BFG, and even a kindly contributed PR for the functionality which I've
> yet to merge. As it doesn't actually stop users from doing work - so
> far as I can see - it's something that I've done a poor job of
> following up.

Just as a heads up: it may be a lot uglier than it looks at first;
there are all kinds of special cases and I'm not quite sure I've got
them all right in filter-repo yet:

* Some projects intentionally create empty commits for versioning or
publishing or other reasons, and do not want these commits to be
removed.  Thus, it's important to remove commits that *become* empty
(due to other filtering rules), not commits which started empty.
* There's a special case for the above rule: if a user only wants the
history of a certain directory, then any empty commits that pre-dated
that directory aren't wanted/needed.  The way I handle this is that if
a commit which had no changes relative to its parent became an orphan
(its parent and all other ancestors were pruned due to becoming
empty), then the fact that it was orphaned implies that it *became*
empty and is thus prunable.
* There are also topological changes possible: If a parent and all
further ancestors on one side of history are pruned, a merge commit
may become a non-merge commit.  Often that will make the merge commit
itself prunable, though there is the possibility that the merge had
other changes tucked into it.
* Merges also may become degenerate:
  * Case 1: both sides of history may have commits pruned away (due to
becoming empty) all the way back to the merge base, meaning the merge
commit now has the merge base for both parents.  I think having
redundant parents is senseless and the redundant ones should be
pruned.  In most cases, this should also mean the merge commit is now
empty, as it'll likely have no changes relative to the merge base.
  * Case 2: one side of history may have commits pruned away back to
the merge base, meaning that the merge commit now merges one parent
with an ancestor of that parent.  If the merge commit has no changes
relative to the newer parent, then it could potentially be pruned
away.  However: What if this merge commit *started* as a merge of some
commit with its own ancestor?  If the second parent is the newer
commit, this was likely a --no-ff merge and probably shouldn't be
pruned.  But, if a project has a strong policy of always doing --no-ff
merges to incorporate changes, then even if the merge commit didn't
start out looking like a --no-ff merge we should probably keep it even
if it ends up looking like one.  (But, of course, in either case if
the first parent is the newer one, then it's not a --no-ff merge and
unless the merge commit has extra file changes in it, we should be
able to prune it.)
* Also, if your scheme checks for changes against the first parent to
see what changes exist in a merge commit, if the first parent
history is pruned away due to filtering choices (e.g. it didn't touch
the directory of interest), then your methodology may miss the fact
that this merge commit has an empty set of changes relative to its
remaining parent and thus mistakenly retain some commits which became
empty and were expected to be pruned.  I believe I saw filter-branch
messing this up, though I'd have to double check.
* There might be other special cases I've overlooked, though I've
tried to cover them all.
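
To make the first few rules above concrete, here is a minimal sketch
of the "became empty" test.  It is not filter-repo's actual code; the
FilteredCommit type and all of its field names are hypothetical, and
it assumes the filtering pass has already recorded whether each commit
started empty and which rewritten parents it still has.  It only
handles redundant parents (Case 1) and deliberately leaves the --no-ff
question from Case 2 to policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FilteredCommit:
    # All field names are hypothetical, chosen for this sketch only.
    started_empty: bool     # had no file changes before any filtering
    empty_now: bool         # has no file changes after filtering
    had_parents: bool       # had at least one parent before filtering
    new_parents: List[str]  # rewritten parent ids (pruned ancestors dropped)


def dedupe_parents(parents: List[str]) -> List[str]:
    """Drop redundant parents while preserving order (Case 1 above)."""
    seen = set()
    result = []
    for parent in parents:
        if parent not in seen:
            seen.add(parent)
            result.append(parent)
    return result


def should_prune(commit: FilteredCommit) -> bool:
    """Prune commits that *became* empty, not ones that started empty."""
    if not commit.empty_now:
        return False
    if len(dedupe_parents(commit.new_parents)) > 1:
        # Still merges two distinct commits; Case 2 is left to policy.
        return False
    if not commit.started_empty:
        return True
    # Started empty: only prunable if filtering orphaned it, i.e. its
    # parent and all other ancestors were pruned away.
    return commit.had_parents and not commit.new_parents
```

With this shape, an intentionally empty commit that keeps its parent
is retained, while the same commit with every ancestor pruned away is
treated as having become empty.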

I'm currently thinking that a flag or pair of flags might be useful
here (perhaps --empty-pruning={always,never,auto}
--no-ff-pruning={always,never,auto}), with 'auto' meaning that only
commits which didn't start out that way (empty or a no-ff merge) but
became so are pruned.  I'm still undecided on this...
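
As a rough illustration of how such a pair of flags could behave, here
is a sketch using Python's argparse.  The flag names come from the
proposal above and are not settled; the program name, the default of
'auto', and the prune_decision helper are assumptions made for this
example only.

```python
import argparse

MODES = ("always", "never", "auto")


def build_parser() -> argparse.ArgumentParser:
    # "filter-sketch" is a placeholder program name, not a real tool.
    parser = argparse.ArgumentParser(prog="filter-sketch")
    # Defaulting to "auto" is an assumption of this sketch.
    parser.add_argument("--empty-pruning", choices=MODES, default="auto")
    parser.add_argument("--no-ff-pruning", choices=MODES, default="auto")
    return parser


def prune_decision(mode: str, started_that_way: bool,
                   is_that_way_now: bool) -> bool:
    """Decide whether to prune a commit that is empty (or a no-ff merge).

    'auto' prunes only commits that became that way due to filtering.
    """
    if not is_that_way_now:
        return False
    if mode == "always":
        return True
    if mode == "never":
        return False
    return not started_that_way
```

For example, build_parser().parse_args(["--empty-pruning=never"])
leaves the no-ff setting at 'auto', and the same prune_decision
function can serve both flags since the three modes mean the same
thing for each.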

> >   * providing a way for the user to know what needs to be cleaned up.
> > It has options like --strip-blobs-bigger-than <size> or
> > --strip-biggest-blobs <NUM>, but no way for the user to figure out
> > what <size> or <NUM> should be.
>
> For users of GitHub, it's normally 100MB with
> --strip-blobs-bigger-than <size> :-)

If you're only interested in removing what GitHub rejects so that you
can continue working with your repo, sure.  If you are migrating your
repo for some reason and want to take the chance to clean it up at the
same time so everyone doesn't have to pay heavy clone costs, there are
many other interesting values.  :-)

> > Also, since it just focuses on really
> > big blobs, it misses cases like someone checking in directories with a
> > huge number of small-to-moderately sized files (e.g. bower_components/
> > or node_modules/, though these could also contain a few big blobs
>
> For those use-cases, it might be that BFG's --delete-folders flag is
> useful, especially given the protected-head-commit feature of the BFG.

Absolutely.  However, this somewhat assumes the user doing the
filtering knows what parts of history are extraneous.  I've seen many
cases where the original author(s) of some repo weren't that familiar
with version control and committed a whole bunch of things they
shouldn't.  They then left the company or project and a few years
later, someone else deletes those files in a new commit, possibly even
mentioning in the commit message that it was stuff that was never
needed.  Several more years go by, the second individual isn't with
the project any more either, and someone else comes along, possibly
not even familiar with the language or build tools the project uses
but tasked with migrating the history anyway.  Having a tool which
can tell them why this small plugin happens to be much bigger than
expected, with pointers to relevant bits ("what's this
bower_components/ directory that accounts for most of the size and was
deleted four years ago?"), is very helpful.

I've also seen this with big repos where there are lots of ugly
travesties in all kinds of different places, and no one still around
knows about more than a couple of them.  The developers may be willing
to go through a flag day to have the history rewritten to expunge the
things that never should have been committed to get clone times down,
but only removing a few big blobs or expecting there to be an
individual who knows which directories can be nuked isn't always the
best answer.  Given enough time, anyone can run enough git commands to
kind of figure out what the big things are, but I wanted something
that made that job easier.
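
For a sense of what "running enough git commands" looks like, here is
a small sketch that totals blob sizes per top-level directory across
all of history.  It is not how 'git filter-repo --analyze' works; it
just pipes 'git rev-list --objects --all' into 'git cat-file
--batch-check' and sums the uncompressed sizes.  The function names
are made up for this example, and it assumes path names decode as
UTF-8.

```python
import subprocess
from collections import Counter
from typing import Iterable


def sizes_by_top_dir(lines: Iterable[str]) -> Counter:
    """Sum blob sizes per top-level directory.

    Each line is expected to look like "<type> <sha> <size> <path>".
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue  # skip commits, trees, and malformed lines
        size, path = int(parts[2]), parts[3]
        top = path.split("/", 1)[0] + "/" if "/" in path else path
        totals[top] += size
    return totals


def scan_repo(repo: str = ".") -> Counter:
    """Run the two git commands and aggregate their output."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    fmt = "%(objecttype) %(objectname) %(objectsize) %(rest)"
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch-check=" + fmt],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return sizes_by_top_dir(checked.splitlines())
```

Calling scan_repo().most_common(10) lists the ten heaviest top-level
entries, including directories that were deleted long ago.  Each
unique blob is counted once, under the first path rev-list reports
for it.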


Granted, 'git filter-repo --analyze' is a separate read-only step
(other than the reports it writes), so even if people wanted to use
BFG repo cleaner or other filtering tools they could use this aspect
of filter-repo to guide decisions.

> It's getting late for me, must be even later in Brussels - I wish I
> could have made it there to join in! Merry Git Merge to you all, and
> good luck to you Elijah with git-filter-repo.

Thanks again for all the feedback; maybe we'll get to meet at a future
conference.  Best of luck to you as well.

Elijah

