git@vger.kernel.org mailing list mirror (one of many)
* [ANNOUNCE] git_fast_filter
@ 2009-04-08  3:35 Elijah Newren
  2009-04-08  9:45 ` Johannes Schindelin
From: Elijah Newren @ 2009-04-08  3:35 UTC
  To: Git Mailing List

Just thought I'd make this available, in case there's others with
niche needs that find it useful...


git_fast_filter assists with quickly rewriting the history of a repository
by making it easy to write scripts whose purpose is to serve as safe
filters between fast-export and fast-import.  git_fast_filter comes with
example programs, a basic test-suite, and a double your money back
satisfaction guarantee.  (I love free software.)  You can get it from

  git://gitorious.org/git_fast_filter/mainline.git

In more detail...

=== Purpose ===

git_fast_filter is designed to make it easy to filter or rewrite the
history of a repository.  As such, it fills the same role as
git-filter-branch, and was written primarily to overcome the sometimes
severe speed shortcomings of git-filter-branch.  In particular, using
git_fast_filter can avoid thousands or millions of new process forks, and
can allow you to rewrite the same file only one time instead of 50,000
times.  However, while using git_fast_filter is fairly simple and quick, it
is hard to beat writing a simple git-filter-branch one-liner for efficiency
of human time.  Also, the two tools use very different methods of rewriting
history and do not have exactly overlapping feature sets, so the best tool
for a particular job is going to be very problem dependent.

As human time is often more important than computer time, especially for
one-shot rewrites, git-filter-branch will probably continue to be the more
common tool.  However, git_fast_filter is useful in cases where computer
time of a rewrite matters (particularly larger repositories and more
involved rewrites that need to be run and tested many times on large data
sets).  Also, git_fast_filter has a couple of features that may come in
handy in special cases (assisting with generating fast-export output from
scratch, interleaving commits from separate repositories, and bidirectional
collaboration between filtered and unfiltered repositories).

=== Idea ===

The way git_fast_filter works is by providing a simple python library,
git_fast_filter.py.  This library can be used in simple python scripts to
create a filter for the output of git-fast-export.  Thus, the typical
calling convention is of the form:

    git fast-export | filter_script.py | git fast-import

=== Example ===

An example script that renames the 'master' branch to 'other' is shown
below (this is similar to the example in the git-fast-export manpage, but
is safe against the string 'refs/heads/master' appearing in some file or
commit message in the repository):

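  #!/usr/bin/python

  from git_fast_filter import Commit, FastExportFilter

  # Called once per commit in the fast-export stream.  Only the commit's
  # branch is changed; file contents and commit messages are left alone.
  def my_commit_callback(commit):
    if commit.branch == "refs/heads/master":
      commit.branch = "refs/heads/other"

  # Named export_filter to avoid shadowing the built-in filter().
  export_filter = FastExportFilter(commit_callback = my_commit_callback)
  export_filter.run()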

The user can then run this script by:
  $ mkdir target && cd target && git init
  $ (cd /PATH/LEADING/TO/source && git fast-export --all) \
       | /PATH/TO/filter_script.py | git fast-import

(Note: The user can have the script take care of the git init, the cd's,
and the invocations of git fast-export and git fast-import by just passing
directory names to FastExportFilter.run; however, writing out the details
explicitly as in the above example makes it clearer what is going on.)
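
To illustrate why a plain text substitution over the stream is unsafe, here
is a minimal sketch of a rename that only touches command lines.  This is
not git_fast_filter's implementation; rename_ref and the sample stream are
hypothetical, and the sketch assumes the stream only uses the counted
'data <n>' form (no delimited 'data <<EOF' blocks).

```python
import io

def rename_ref(stream_in, stream_out, old, new):
    """Copy a fast-export stream, renaming `old` to `new` on command lines only."""
    while True:
        line = stream_in.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # 'data <n>' is followed by exactly n raw bytes (a blob or a
            # commit message); copy them through without inspecting them.
            count = int(line[5:])
            stream_out.write(line)
            stream_out.write(stream_in.read(count))
            continue
        for command in (b"commit ", b"reset "):
            if line == command + old + b"\n":
                line = command + new + b"\n"
        stream_out.write(line)

# Hypothetical sample: both the blob and the commit message mention the ref.
blob = b"commit refs/heads/master\n"
msg = b"mention refs/heads/master\n"
sample = (
    b"blob\nmark :1\ndata %d\n%s\n" % (len(blob), blob)
    + b"reset refs/heads/master\n"
    + b"commit refs/heads/master\nmark :2\n"
    + b"committer A <a@example.com> 0 +0000\n"
    + b"data %d\n%s" % (len(msg), msg)
    + b"M 100644 :1 note.txt\n\n"
)

out = io.BytesIO()
rename_ref(io.BytesIO(sample), out, b"refs/heads/master", b"refs/heads/other")
result = out.getvalue()
```

In this sketch the 'reset' and 'commit' commands are renamed, while the
blob contents and the commit message pass through byte for byte.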


Elijah


* Re: [ANNOUNCE] git_fast_filter
  2009-04-08  3:35 [ANNOUNCE] git_fast_filter Elijah Newren
@ 2009-04-08  9:45 ` Johannes Schindelin
  2009-04-08 16:55   ` Elijah Newren
From: Johannes Schindelin @ 2009-04-08  9:45 UTC
  To: Elijah Newren; +Cc: Git Mailing List

Hi,

On Tue, 7 Apr 2009, Elijah Newren wrote:

> Just thought I'd make this available, in case there's others with niche 
> needs that find it useful...

Have you seen

	http://thread.gmane.org/gmane.comp.version-control.git/52323

I was rather disappointed that skimo left the patch series in a rather 
half-useful state.

Ciao,
Dscho


* Re: [ANNOUNCE] git_fast_filter
  2009-04-08  9:45 ` Johannes Schindelin
@ 2009-04-08 16:55   ` Elijah Newren
From: Elijah Newren @ 2009-04-08 16:55 UTC
  To: Johannes Schindelin; +Cc: Git Mailing List

Hi,

On Wed, Apr 8, 2009 at 3:45 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Tue, 7 Apr 2009, Elijah Newren wrote:
>
>> Just thought I'd make this available, in case there's others with niche
>> needs that find it useful...
>
> Have you seen
>
>        http://thread.gmane.org/gmane.comp.version-control.git/52323
>
> I was rather disappointed that skimo left the patch series in a rather
> half-useful state.

No, I had not.  Looks useful, though it appears to be missing a number
of options from git-filter-branch, such as --subdirectory-filter,
--tree-filter, and --prune-empty.  I'm guessing that's related to your
"half-useful" comment?

I was particularly interested in --tree-filter for my rewrite, since I
needed it (to remove all cvs keywords and a few other touch-ups) and
it was the slowest of the filters I'd be using.  Problem is, in a
repository with 40,000 commits and 70,000 files in the latest commits,
--tree-filter is unacceptably slow.  On average, a commit would have
about 35,000 files (assuming approximately linear growth over the
commit history), meaning that I'd have to modify 40,000 x 35,000 files
= 1,400,000,000 files[1].  However, on average, fewer than a dozen
files changed per commit, so there are fewer than 40,000 x 12 = 480,000
unique files that actually need to be rewritten.  git-fast-export
provides (and git-fast-import expects) just those half million files,
and rewriting half a million files instead of 1.4 billion files is the
difference between a 45 minute rewrite and a 3 month one.
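
The arithmetic above can be checked with a few lines of Python.  The
variable names are made up for illustration, and the final step assumes
(as a rough approximation only) that rewrite time scales linearly with
the number of files rewritten.

```python
commits = 40_000
files_in_latest_commit = 70_000

# Approximately linear growth means the average commit has half as many files.
avg_files_per_commit = files_in_latest_commit // 2
changed_per_commit = 12  # upper bound: "a dozen"

tree_filter_rewrites = commits * avg_files_per_commit
fast_export_rewrites = commits * changed_per_commit
ratio = tree_filter_rewrites / fast_export_rewrites

# Scale the 45 minute rewrite by the same ratio and convert to days.
minutes_fast = 45
days_slow = minutes_fast * ratio / (60 * 24)

print(tree_filter_rewrites)   # 1400000000
print(fast_export_rewrites)   # 480000
print(round(days_slow))       # 91, i.e. about 3 months
```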

I didn't see a way to easily avoid the 1.4 billion file rewrite using
git-filter-branch (or git-rewrite-commits had I known about it), and
writing something to parse and modify git-fast-export output seemed
like the easiest solution.  Perhaps I could have written some fancy
index-filter script that recorded original and modified file sha1sums
somewhere and used that to only check out certain files and rewrite
them, but such an idea hadn't occurred to me (and I'm not sure it
would have been the better route even if it had).  Maybe there's
something I missed that would have made this easy, though?
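
For what it's worth, the "record original and modified sha1sums" idea
might look roughly like the sketch below.  It is only an illustration
under stated assumptions: blob_id, collapse_keywords and
make_cached_rewriter are hypothetical names, the keyword list is a small
subset chosen for the example, and nothing here is wired up to
git-filter-branch.

```python
import hashlib
import re

# Assumed subset of CVS keywords; matches both "$Id$" and "$Id: ... $".
KEYWORD = re.compile(rb"\$(Id|Revision|Date|Author)(:[^$\n]*)?\$")

def blob_id(data):
    """The sha1 git assigns to a blob with this content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def collapse_keywords(data):
    """Turn expanded keywords such as '$Id: foo $' back into '$Id$'."""
    return KEYWORD.sub(rb"$\1$", data)

def make_cached_rewriter(rewrite):
    """Wrap `rewrite` so each distinct blob is rewritten only once."""
    cache = {}
    def cached(data):
        key = blob_id(data)
        if key not in cache:
            cache[key] = rewrite(data)
        return cache[key]
    return cached

# Count how often the real rewrite runs when the same file is seen repeatedly.
calls = []
def counting_rewrite(data):
    calls.append(1)
    return collapse_keywords(data)

rewrite = make_cached_rewriter(counting_rewrite)
same_file = b"/* $Revision: 1.7 $ */\n"
results = [rewrite(same_file) for _ in range(50_000)]
```

Here the expensive step runs once even though the same content is
presented 50,000 times; every later request is a dictionary lookup.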


Elijah

[1] For simplicity, I'm ignoring the 'binary' files that should not
have any cvs-keyword unmunging performed on them.  However, it does
present an issue, particularly with extra process forks, since you
need to determine which files are safe to modify.  I used libmagic
(the library behind the unix 'file' command) to avoid the need to run
'file' repeatedly.
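
As a rough illustration of deciding in-process which files are safe to
modify, the sketch below swaps libmagic for a much cruder heuristic: a
NUL byte near the start of the content is taken to mean "binary".  This
is an assumption made for the example, not what was actually used, and
looks_binary, rewrite_if_text and strip_id are hypothetical names.

```python
def looks_binary(data, probe_bytes=8000):
    """Crude stand-in for libmagic: a NUL byte in the first bytes means binary."""
    return b"\0" in data[:probe_bytes]

def rewrite_if_text(data, rewrite):
    """Apply `rewrite` only to content that does not look binary."""
    return data if looks_binary(data) else rewrite(data)

def strip_id(data):
    # Toy rewrite for the example: collapse one specific expanded keyword.
    return data.replace(b"$Id: main.c,v 1.3 $", b"$Id$")

png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
source = b"/* $Id: main.c,v 1.3 $ */\n"

unchanged = rewrite_if_text(png_header, strip_id)
cleaned = rewrite_if_text(source, strip_id)
```

Because the check is an ordinary function call, no extra process is
forked per file.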

