From: Elijah Newren <newren@gmail.com>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [ANNOUNCE] git_fast_filter
Date: Wed, 8 Apr 2009 10:55:20 -0600
Message-ID: <51419b2c0904080955u71d18093rd3e8b6a29cf9bbfb@mail.gmail.com>
In-Reply-To: <alpine.DEB.1.00.0904081144580.9157@intel-tinevez-2-302>

Hi,

On Wed, Apr 8, 2009 at 3:45 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Tue, 7 Apr 2009, Elijah Newren wrote:
>
>> Just thought I'd make this available, in case there's others with niche
>> needs that find it useful...
>
> Have you seen
>
>        http://thread.gmane.org/gmane.comp.version-control.git/52323
>
> I was rather disappointed that skimo left the patch series in a rather
> half-useful state.

No, I had not.  Looks useful, though it appears to be missing a number
of options from git-filter-branch, such as --subdirectory-filter,
--tree-filter, and --prune-empty.  I'm guessing that's related to your
"half-useful" comment?

I was particularly interested in --tree-filter for my rewrite, since I
needed it (to remove all CVS keywords and a few other touch-ups) and
it was the slowest of the filters I'd be using.  Problem is, in a
repository with 40,000 commits and 70,000 files in the latest commits,
--tree-filter is unacceptably slow.  On average, a commit would have
about 35,000 files (assuming approximately linear growth over the
commit history), meaning that I'd have to modify 40,000 x 35,000 files
= 1,400,000,000 files[1].  However, on average, fewer than a dozen
files changed per commit, so there are fewer than 40,000 x 12 = 480,000
unique files that actually need to be rewritten.  git-fast-export
provides (and git-fast-import expects) just those half million files,
and rewriting half a million files instead of 1.4 billion files is the
difference between a 45-minute rewrite and a 3-month one.
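
To make the idea concrete, here is a minimal sketch of a filter that sits
between git-fast-export and git-fast-import and touches only blob contents.
It is not the git_fast_filter code; the names filter_stream, collapse_keywords
and KEYWORD_RE are made up for illustration, the keyword list is an assumed
subset, and it only understands the "data <count>" form that fast-export
emits (no inline file data, no binary-file check).

```python
import re

# Assumed, illustrative subset of CVS keywords; extend as needed.
KEYWORD_RE = re.compile(
    rb"\$(Id|Revision|Date|Author|Header|Source|RCSfile):[^$\n]*\$"
)


def collapse_keywords(blob: bytes) -> bytes:
    """Turn expanded keywords such as b'$Id: a.c,v 1.2 $' back into b'$Id$'."""
    return KEYWORD_RE.sub(rb"$\1$", blob)


def filter_stream(src, dst, rewrite=collapse_keywords):
    """Copy a fast-export stream from src to dst, rewriting blob payloads only.

    src and dst are binary file objects. Commit and tag messages also use
    'data <count>' but are passed through untouched.
    """
    in_blob = False
    while True:
        line = src.readline()
        if not line:
            break  # end of stream
        if line == b"blob\n":
            in_blob = True  # the next 'data' command belongs to this blob
            dst.write(line)
        elif line.startswith(b"data "):
            size = int(line[5:].strip())
            payload = src.read(size)  # exactly <count> bytes of raw content
            if in_blob:
                payload = rewrite(payload)
                in_blob = False
            # The byte count must be re-emitted, since the rewrite may
            # have changed the payload length.
            dst.write(b"data %d\n" % len(payload))
            dst.write(payload)
        else:
            dst.write(line)  # marks, commits, file changes, etc.
```

Saved as a script that calls filter_stream(sys.stdin.buffer,
sys.stdout.buffer), it could be used roughly like this, assuming
../rewritten is a freshly initialised, empty repository:

    git fast-export --all | python filter.py | git -C ../rewritten fast-import

Each blob appears once in the stream no matter how many commits contain it,
which is where the half-million versus 1.4 billion difference comes from.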

I didn't see a way to easily avoid the 1.4 billion file rewrite using
git-filter-branch (or git-rewrite-commits had I known about it), and
writing something to parse and modify git-fast-export output seemed
like the easiest solution.  Perhaps I could have written some fancy
index-filter script that recorded original and modified file sha1sums
somewhere and used that to only check out certain files and rewrite
them, but such an idea hadn't occurred to me (and I'm not sure it
would have been the better route even if it had).  Maybe there's
something I missed that would have made this easy, though?
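
For what it's worth, the bookkeeping part of that index-filter idea might
look something like the sketch below: a cache keyed on the original blob id,
so that each distinct file version is rewritten only once. This is purely
hypothetical (RewriteCache and git_blob_id are invented names), and a real
index-filter would additionally have to write the rewritten blobs into the
object database and update the index, which is not shown here.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class RewriteCache:
    """Remember which original blob id maps to which rewritten blob id."""

    def __init__(self, rewrite):
        self.rewrite = rewrite   # function: bytes -> bytes
        self.by_id = {}          # original blob id -> rewritten blob id
        self.rewrites_done = 0   # how many times the rewrite actually ran

    def lookup(self, data: bytes) -> str:
        old_id = git_blob_id(data)
        if old_id not in self.by_id:
            # First time this content is seen: do the expensive rewrite.
            self.rewrites_done += 1
            self.by_id[old_id] = git_blob_id(self.rewrite(data))
        return self.by_id[old_id]
```

With such a cache, a file that is unchanged across thousands of commits
costs one rewrite plus a dictionary lookup per commit, rather than one
rewrite per commit.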


Elijah

[1] For simplicity, I'm ignoring the 'binary' files that should not
have any CVS-keyword unmunging performed on them.  However, it does
present an issue, particularly with extra process forks, since you
need to determine which files are safe to modify.  I used libmagic
(the library behind the Unix 'file' command) to avoid the need to run
'file' repeatedly.
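
As an illustration of doing that check in-process rather than forking
'file' per blob, here is a small stand-in that does not use libmagic at all:
it simply looks for a NUL byte near the start of the content, a common
heuristic for "probably binary". The function name and the 8000-byte window
are assumptions for this sketch, and it is cruder than what libmagic
reports.

```python
def is_probably_binary(data: bytes, window: int = 8000) -> bool:
    """Heuristic: treat content as binary if a NUL byte appears early on.

    This is a swapped-in simplification, not libmagic; it will misjudge
    some files (e.g. UTF-16 text is flagged as binary).
    """
    return b"\0" in data[:window]


def maybe_rewrite(data: bytes, rewrite) -> bytes:
    """Apply rewrite only to content that looks like text."""
    if is_probably_binary(data):
        return data  # leave binary content byte-for-byte intact
    return rewrite(data)
```

Because the check runs on bytes already in memory, it adds no process
forks, which is the cost the footnote is concerned with.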

Thread overview: 3+ messages
2009-04-08  3:35 [ANNOUNCE] git_fast_filter Elijah Newren
2009-04-08  9:45 ` Johannes Schindelin
2009-04-08 16:55   ` Elijah Newren [this message]
