git@vger.kernel.org mailing list mirror (one of many)
From: Sam Vilain <sam@vilain.net>
To: Stephen Bash <bash@genarts.com>
Cc: Nathan Gray <n8gray@n8gray.org>,
	Andrew Sayers <andrew-git@pileofstuff.org>,
	Jonathan Nieder <jrnieder@gmail.com>, Jeff King <peff@peff.net>,
	git@vger.kernel.org, Sverre Rabbelier <srabbelier@gmail.com>,
	Dmitry Ivankov <divanorama@gmail.com>,
	Ramkumar Ramachandra <artagnon@gmail.com>,
	David Barr <davidbarr@google.com>
Subject: Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
Date: Tue, 06 Mar 2012 15:59:27 -0800
Message-ID: <4F56A4DF.8060807@vilain.net>
In-Reply-To: <ab5eb5a7-a446-4dc3-b8e8-e3f7ec306452@mail>

On 3/6/12 12:35 PM, Stephen Bash wrote:
>> The problem of specifying and detecting branches is a major problem in
>> my upcoming conversion.  We've got toplevel trunk/branches/tags
>> directories but underneath "branches" it's a free-for-all:
>>
>> /branches/codenameA/{projectA,projectB,projectC}
>> /branches/codenameB   (actually a branch of projectA)
>> /branches/developers/joe/frobnicator-experiment (also a branch of
>> projectA)
>>
>> Clearly there's no simple regex that's going to capture this, so I'm
>> reduced to listing every branch of projectA, which is tedious and
>> error-prone.  However, what *would* work fabulously well for me is
>> "marker file" detection.  Every copy of projectA has a certain file at
>> its root.  Let's call it "markerFile.txt".  What I'd really love is a
>> way to say:
>>
>> my %branch_markers = ('/branches/**/markerFile.txt' =>
>>                        '/refs/heads/**');
>
> Ooo...  I like it.  I hadn't hit on this idea yet, but it certainly is a very helpful heuristic.  I doubt I'd have any sort of demo code for you in the near future, but it's definitely an idea to roll into the mix.
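
To make the marker-file idea concrete, here is a minimal sketch of how such a mapping might be evaluated against a list of repository paths. Everything in it is an assumption for illustration: the function name detect_branches, the rule that "**" captures one or more path components, and the ref template written without a leading slash. It is not taken from any existing importer.

```python
import re

# Hypothetical marker table, modelled on the %branch_markers idea above.
# "**" stands for one or more path components and is reused in the ref name.
BRANCH_MARKERS = {"/branches/**/markerFile.txt": "refs/heads/**"}


def detect_branches(paths, markers=BRANCH_MARKERS):
    """Return {project_root: ref_name} for every path that matches a marker."""
    found = {}
    for pattern, ref_template in markers.items():
        prefix, suffix = pattern.split("**", 1)
        regex = re.compile(re.escape(prefix) + r"(.+)" + re.escape(suffix))
        for path in paths:
            match = regex.fullmatch(path)
            if match:
                captured = match.group(1)
                found[prefix + captured] = ref_template.replace("**", captured)
    return found


paths = [
    "/branches/codenameA/projectA/markerFile.txt",
    "/branches/codenameA/projectB/README",
    "/branches/codenameB/markerFile.txt",
    "/branches/developers/joe/frobnicator-experiment/markerFile.txt",
    "/trunk/markerFile.txt",
]
branches = detect_branches(paths)
for root, ref in sorted(branches.items()):
    print(root, "=>", ref)
```

Note that a marker file nested deeper inside a branch would also match, so a real heuristic would need some rule for picking the outermost root.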

What I did for the Perl Perforce conversion is make this a multi-step 
process; first, the heuristic goes through and detects branches and 
merge parents.  Then you do the actual export.  If, however, the 
heuristic gets it wrong, then you can manually override the branch 
detection for a particular revision, which invalidates all of the 
_automatic_ decisions made for later revisions the next time you run it.
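
As a rough illustration of that invalidation rule, the sketch below keeps one decision per revision, tagged as automatic or manual. This is a simplification and an assumption on my part: the dict layout, the "auto"/"manual" tags and both function names are hypothetical, and it is not how git-p4raw stores its data.

```python
def apply_override(decisions, revision, branch):
    """Record a manual decision and drop automatic decisions for later revisions.

    decisions maps revision -> (branch, source), where source is
    "auto" or "manual".  Manual decisions are never discarded.
    """
    kept = {
        rev: (br, src)
        for rev, (br, src) in decisions.items()
        if rev <= revision or src == "manual"
    }
    kept[revision] = (branch, "manual")
    return kept


def revisions_to_redetect(decisions, up_to):
    """List the revisions the heuristic has to look at again on the next run."""
    return [rev for rev in range(1, up_to + 1) if rev not in decisions]


decisions = {
    1: ("trunk", "auto"),
    2: ("trunk", "auto"),
    3: ("branches/x", "auto"),
    4: ("branches/x", "manual"),
    5: ("branches/x", "auto"),
}
decisions = apply_override(decisions, 2, "branches/y")
pending = revisions_to_redetect(decisions, 5)
print(pending)
```

The point of the design is that earlier decisions and other manual overrides survive, so only the tail of the history needs to be recomputed.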

Even with all of the information in Postgres, and much of the hard work 
pushed into the Postgres engine, and Postgres tuned for OLAP, this was 
the slowest part of the operation for a 30,000-odd revision Perforce 
repository.

The manual input is extremely useful for bespoke conversions; there will 
always be warts in the history and no heuristic is perfect (even if you 
can supply your own set of expressions, a way to override it for just 
one revision is handy).

Just to recap, the steps in git-p4raw are:

* load metadata (git-p4raw load ; git-p4raw check)
* load blobs (git-p4raw export-blobs)
* find project roots (git-p4raw find-branches)

   Project root decisions can be overridden.  In git-p4raw this was done 
through a DB insert, but all this consisted of was inserting (revision, 
branch) tuples into the appropriate table, so a front-end would be 
trivial.  As you suggest, a custom heuristic is also an option, but the 
most flexible solution is just being able to override the decisions made 
for a particular revision.

* detect project merges (also done by git-p4raw find-branches)

Detecting merge parents used a heuristic based on the per-file 
integration records and a computation based on an internal diff-tree 
which produced a list of files that would have needed resolving.  This 
one I actually used enough to bother implementing a front-end for:

   git-p4raw graft REV PARENT PARENT

Where 'PARENT' could be another project root (revision/branch location), 
or it could be a git commit ID (for the inevitable occasion where you 
need to manually graft on some history).  This interface allows you to 
do several things:

   1. mark a merge which was not recorded correctly in history
   2. un-mark a merge which was detected/recorded incorrectly
   3. skip bad sections of history, for instance squash merging merges 
which happened over several commits (SVN and Perforce, of course, 
support insane piecemeal merging prohibited by git)

* the actual fast-import exporter.

   git-p4raw export-commits 1..5000

There was also an important reverse operation:

   git-p4raw unexport-commits 2500

This moved all of the exported refs backwards and deleted the ones which 
didn't exist at revision 2500.
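
The rewind semantics can be sketched roughly as follows, assuming each ref keeps a history of (revision, commit) pairs in ascending order. That data layout, the placeholder commit names and the function names are hypothetical, chosen only to show the "move back or delete" behaviour; git-p4raw itself may do this quite differently.

```python
def unexport(ref_history, revision):
    """Rewind every ref to its state at `revision`.

    ref_history maps ref name -> list of (revision, commit_id), ascending.
    Refs with no commit at or before `revision` are dropped entirely.
    """
    rewound = {}
    for ref, history in ref_history.items():
        kept = [(rev, commit) for rev, commit in history if rev <= revision]
        if kept:
            rewound[ref] = kept
    return rewound


def ref_tips(ref_history):
    """Return the commit each surviving ref currently points at."""
    return {ref: history[-1][1] for ref, history in ref_history.items()}


history = {
    "refs/heads/master": [(100, "aaa"), (2400, "bbb"), (3000, "ccc")],
    "refs/heads/feature": [(2600, "ddd")],
}
rewound = unexport(history, 2500)
tips = ref_tips(rewound)
print(tips)
```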

Once the data has been mined, the actual exporting can proceed very 
fast.  E.g., on my laptop I could easily be topping 300 commits per 
second, which makes for a nice export/examine/rewind/adjust cycle.

For more information,

   git clone git://github.com/samv/git-p4raw
   cd git-p4raw
   perldoc git-p4raw

The "Game plan." section of the POD is particularly relevant.  Remember 
that SVN is very similar to Perforce in virtually all of its design 
details so this tool, its database schema, and implementation are all 
very relevant to the design of the new svn-fe importer.

Sam
