git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>, Jeff King <peff@peff.net>,
	Taylor Blau <me@ttaylorr.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: Import/Export as a fast way to purge files from Git?
Date: Thu, 1 Nov 2018 00:12:48 -0700
Message-ID: <CABPp-BEefqYADr8SVvh6uFWkp96PDv7qfKK1c9O1WUnPy3wqrw@mail.gmail.com> (raw)
In-Reply-To: <91771D9B-166D-403F-BB20-7E574444BB3B@gmail.com>

On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
<larsxschneider@gmail.com> wrote:
> > On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@gmail.com> wrote:
> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I recently had to purge files from large Git repos (many files, many commits).
> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> >> remove the `builtin` directory from git core). I realized that I can remove
> >> files *way* faster by exporting the repo, removing the file references,
> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
> >> the `builtin` directory from git core). Do you see any problem with this
> >> approach?
> >
> > It looks like others have pointed you at other tools, and you're
> > already shifting to that route.  But I think it's a useful question to
> > answer more generally, so for those that are really curious...
> >
> >
> > The basic approach is fine, though if you try to extend it much you
> > can run into a few possible edge/corner cases (more on that below).
> > I've been using this basic approach for years and even created a
> > mini-python library[1] designed specifically to allow people to create
> > "fast-filters", used as
> >   git fast-export <options> | your-fast-filter | git fast-import <options>
> >
> > But that library didn't really take off; even I have rarely used it,
> > often opting for filter-branch despite its horrible performance or a
> > simple fast-export | long-sed-command | fast-import (with some extra
> > pre-checking to make sure the sed wouldn't unintentionally munge other
> > data).  BFG is great, as long as you're only interested in removing a
> > few big items, but otherwise doesn't seem very useful (to be fair,
> > it's very upfront about only wanting to solve that problem).
> > Recently, due to continuing questions on filter-branch and folks still
> > getting confused with it, I looked at existing tools, decided I didn't
> > think any quite fit, and started looking into converting
> > git_fast_filter into a filter-branch-like tool instead of just a
> > libary.  Found some bugs and missing features in fast-export along the
> > way (and have some patches I still need to send in).  But I kind of
> > got stuck -- if the tool is in python, will that limit adoption too
> > much?  It'd be kind of nice to have this tool in core git.  But I kind
> > of like leaving open the possibility of using it as a tool _or_ as a
> > library, the latter for the special cases where case-specific
> > programmatic filtering is needed.  But a developer-convenience library
> > makes almost no sense unless in a higher level language, such as
> > python.  I'm still trying to make up my mind about what I want (and
> > what others might want), and have been kind of blocking on that.  (If
> > others have opinions, I'm all ears.)
>
> That library sounds like a very interesting idea. Unfortunately, the
> referenced repo seems not to be available anymore:
>     git://gitorious.org/git_fast_filter/mainline.git

Yeah, gitorious went down at a time when I was busy with enough other
things that I never bothered moving my repos to a new hosting site.
Sorry about that.

I've got a copy locally, but I've been editing it heavily, without the
testing I should have in place, so I hesitate to point you at it right
now.  (Also, the old version failed to handle things like --no-data
output, which is important.)  I'll post an updated copy soon; feel
free to ping me in a week if you haven't heard anything yet.

> I very much like Python. However, more recently I started to
> write Git tools in Perl as they work out of the box on every
> machine with Git installed ... and I think Perl can be quite
> readable if no shortcuts are used :-).

Yeah, when portability matters, perl makes sense.  I thought about
switching it over, but I'm not sure I want to rewrite 1-2k lines of
code.  Especially since repo-filtering tools are kind of one-shot by
nature, and only need to be done by one person of a team, on one
specific machine, and won't affect daily development thereafter.
(Also, since I don't depend on any libraries and use only stuff from
the default python library, it ought to be relatively portable
anyway.)

  reply index

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-23 13:04 Lars Schneider
2018-09-23 14:55 ` Eric Sunshine
2018-09-23 15:58   ` Lars Schneider
2018-09-23 15:53 ` brian m. carlson
2018-09-23 17:04   ` Jeff King
2018-09-24 17:24 ` Elijah Newren
2018-10-31 19:15   ` Lars Schneider
2018-11-01  7:12     ` Elijah Newren [this message]
2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-11  6:33           ` Jeff King
2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11  7:17             ` Elijah Newren
2018-11-13 23:25               ` Elijah Newren
2018-11-13 23:39                 ` Jonathan Nieder
2018-11-14  0:02                   ` Elijah Newren
2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
2018-11-12 11:31               ` Jeff King
2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-11  6:44           ` Jeff King
2018-11-11  7:38             ` Elijah Newren
2018-11-12 12:32               ` Jeff King
2018-11-12 22:50             ` brian m. carlson
2018-11-13 14:38               ` Jeff King
2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-11  6:47           ` Jeff King
2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-11  6:53           ` Jeff King
2018-11-11  8:01             ` Elijah Newren
2018-11-12 12:45               ` Jeff King
2018-11-12 15:36                 ` Elijah Newren
2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
2018-11-11  7:02           ` Jeff King
2018-11-11  8:20             ` Elijah Newren
2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-11  7:11           ` Jeff King
2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-11  7:20           ` Jeff King
2018-11-11  8:32             ` Elijah Newren
2018-11-12 12:53               ` Jeff King
2018-11-12 15:46                 ` Elijah Newren
2018-11-12 16:31                   ` Jeff King
2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-11  7:23           ` Jeff King
2018-11-11  8:42             ` Elijah Newren
2018-11-12 12:58               ` Jeff King
2018-11-12 18:08                 ` Elijah Newren
2018-11-13 14:45                   ` Jeff King
2018-11-13 17:10                     ` Elijah Newren
2018-11-14  7:14                       ` Jeff King
2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
2018-11-11  8:44           ` Elijah Newren
2018-11-12 13:00             ` Jeff King
2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-14  0:25           ` [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-14  0:25           ` [PATCH v2 03/11] fast-export: use value from correct enum Elijah Newren
2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-14 19:17             ` SZEDER Gábor
2018-11-14 23:13               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-14  0:25           ` [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-14  0:25           ` [PATCH v2 07/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-14 19:27             ` SZEDER Gábor
2018-11-14 23:16               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-14  0:25           ` [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-14  0:26           ` [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-14  7:25           ` [PATCH v2 00/11] fast export and import fixes and features Jeff King
2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
2018-11-16  7:59             ` [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-16  7:59             ` [PATCH v3 04/11] fast-export: use value from correct enum Elijah Newren
2018-11-16  7:59             ` [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-16  7:59             ` [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-16  7:59             ` [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-16  7:59             ` [PATCH v3 08/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-16  7:59             ` [PATCH v3 09/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-16 12:29               ` SZEDER Gábor
2018-11-16  8:50             ` [PATCH v3 00/11] fast export and import fixes and features Jeff King
2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
2018-11-12 15:34         ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BEefqYADr8SVvh6uFWkp96PDv7qfKK1c9O1WUnPy3wqrw@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=larsxschneider@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git