git@vger.kernel.org mailing list mirror (one of many)
From: Elijah Newren <newren@gmail.com>
To: "Ævar Arnfjörð" <avarab@gmail.com>
Cc: Lars Schneider <larsxschneider@gmail.com>,
	Git Mailing List <git@vger.kernel.org>, Jeff King <peff@peff.net>,
	Taylor Blau <me@ttaylorr.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: Import/Export as a fast way to purge files from Git?
Date: Mon, 12 Nov 2018 07:34:50 -0800
Message-ID: <CABPp-BGVhw6HeCb7wTUubEjqxfW3LopB8PXY1TdHrB9Gfd3_jw@mail.gmail.com>
In-Reply-To: <87r2fq3b9t.fsf@evledraar.gmail.com>

On Mon, Nov 12, 2018 at 1:17 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Nov 01 2018, Elijah Newren wrote:
>
> > On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
> > <larsxschneider@gmail.com> wrote:
> >> > On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@gmail.com> wrote:
> >> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I recently had to purge files from large Git repos (many files, many commits).
> >> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
> >> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> >> >> remove the `builtin` directory from git core). I realized that I can remove
> >> >> files *way* faster by exporting the repo, removing the file references,
> >> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
> >> >> the `builtin` directory from git core). Do you see any problem with this
> >> >> approach?
> >> >
> >> > It looks like others have pointed you at other tools, and you're
> >> > already shifting to that route.  But I think it's a useful question to
> >> > answer more generally, so for those that are really curious...
> >> >
> >> >
> >> > The basic approach is fine, though if you try to extend it much you
> >> > can run into a few possible edge/corner cases (more on that below).
> >> > I've been using this basic approach for years and even created a
> >> > mini-python library[1] designed specifically to allow people to create
> >> > "fast-filters", used as
> >> >   git fast-export <options> | your-fast-filter | git fast-import <options>
> >> >
> >> > But that library didn't really take off; even I have rarely used it,
> >> > often opting for filter-branch despite its horrible performance or a
> >> > simple fast-export | long-sed-command | fast-import (with some extra
> >> > pre-checking to make sure the sed wouldn't unintentionally munge other
> >> > data).  BFG is great, as long as you're only interested in removing a
> >> > few big items, but otherwise doesn't seem very useful (to be fair,
> >> > it's very upfront about only wanting to solve that problem).
> >> > Recently, due to continuing questions on filter-branch and folks still
> >> > getting confused with it, I looked at existing tools, decided I didn't
> >> > think any quite fit, and started looking into converting
> >> > git_fast_filter into a filter-branch-like tool instead of just a
> >> > library.  Found some bugs and missing features in fast-export along the
> >> > way (and have some patches I still need to send in).  But I kind of
> >> > got stuck -- if the tool is in python, will that limit adoption too
> >> > much?  It'd be kind of nice to have this tool in core git.  But I kind
> >> > of like leaving open the possibility of using it as a tool _or_ as a
> >> > library, the latter for the special cases where case-specific
> >> > programmatic filtering is needed.  But a developer-convenience library
> >> > makes almost no sense unless in a higher level language, such as
> >> > python.  I'm still trying to make up my mind about what I want (and
> >> > what others might want), and have been kind of blocking on that.  (If
> >> > others have opinions, I'm all ears.)
> >>
> >> That library sounds like a very interesting idea. Unfortunately, the
> >> referenced repo seems not to be available anymore:
> >>     git://gitorious.org/git_fast_filter/mainline.git
> >
> > Yeah, gitorious went down at a time when I was busy with enough other
> > things that I never bothered moving my repos to a new hosting site.
> > Sorry about that.
> >
> > I've got a copy locally, but I've been editing it heavily, without the
> > testing I should have in place, so I hesitate to point you at it right
> > now.  (Also, the old version failed to handle things like --no-data
> > output, which is important.)  I'll post an updated copy soon; feel
> > free to ping me in a week if you haven't heard anything yet.
> >
> >> I very much like Python. However, more recently I started to
> >> write Git tools in Perl as they work out of the box on every
> >> machine with Git installed ... and I think Perl can be quite
> >> readable if no shortcuts are used :-).
> >
> > Yeah, when portability matters, perl makes sense.  I thought about
> > switching it over, but I'm not sure I want to rewrite 1-2k lines of
> > code.  Especially since repo-filtering tools are kind of one-shot by
> > nature, and only need to be done by one person of a team, on one
> > specific machine, and won't affect daily development thereafter.
> > (Also, since I don't depend on any libraries and use only stuff from
> > the default python library, it ought to be relatively portable
> > anyway.)
>
> FWIW I'd be very happy to have this tool itself included in git.git
> if/when it's stable / useful enough, and as you point out the language
> doesn't really matter as much as what features it exposes.

Well, I'm happy to propose it for inclusion once it gets to that
point.  I'll bring it up on the list to get wider feedback once I've
removed at least some of the sharp edges.  I suspect it'll be at least
a few weeks.
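
For readers who want to see what the "fast-export | your-fast-filter |
fast-import" pipeline quoted above might look like, here is a minimal
sketch of a filter that drops every file change under one directory.
It is not git_fast_filter and not Lars's Perl script; the function
names and the script name are hypothetical.  It assumes the stream
comes straight from git fast-export without -M/-C (so no rename or
copy lines) and that no affected path needs C-style quoting.

```python
import sys


def changed_path(line):
    """Return the path named by an 'M' or 'D' filechange line, else None."""
    text = line.rstrip(b"\n")
    if text.startswith(b"M "):
        # Format: M <mode> <dataref> <path>
        fields = text.split(b" ", 3)
        return fields[3] if len(fields) == 4 else None
    if text.startswith(b"D "):
        # Format: D <path>
        return text[2:]
    return None


def filter_stream(src, dst, directory):
    """Copy a fast-export stream, omitting file changes under `directory`."""
    directory = directory.rstrip(b"/")
    prefix = directory + b"/"
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # Blob contents and commit messages follow a byte count.
            # Copy them verbatim so payload lines that merely look like
            # filechange commands are never touched.
            size = int(line[5:])
            dst.write(line)
            dst.write(src.read(size))
            continue
        path = changed_path(line)
        if path is not None and (path == directory or path.startswith(prefix)):
            continue  # drop this file change
        dst.write(line)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed invocation: python3 drop_path.py <directory>
    filter_stream(sys.stdin.buffer, sys.stdout.buffer, sys.argv[1].encode())
```

Saved under a hypothetical name such as drop_path.py, it could be run
from the original repository into a freshly initialized one:
  git fast-export --all | python3 drop_path.py builtin | git -C ../filtered fast-import
This sketch leaves behind commits that become empty and blobs that are
no longer referenced, which are among the corner cases a real tool
would have to handle.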
