git@vger.kernel.org mailing list mirror (one of many)
* [RFC] "Remote helper for Subversion" project
@ 2012-03-03 12:27 David Barr
  2012-03-03 12:41 ` David Barr
  0 siblings, 1 reply; 22+ messages in thread
From: David Barr @ 2012-03-03 12:27 UTC
  To: Jeff King
  Cc: git, Jonathan Nieder, Sverre Rabbelier, David Barr,
	Dmitry Ivankov

---
 SoC-2012-Ideas.md |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

 This is simply the direct translation of last year's project idea.
 This project makes significant incremental progress each year.
 I'm seeking feedback from all involved on setting the direction.

 --
 David Barr

diff --git a/SoC-2012-Ideas.md b/SoC-2012-Ideas.md
index 5e83342..4c2ab05 100644
--- a/SoC-2012-Ideas.md
+++ b/SoC-2012-Ideas.md
@@ -182,3 +182,29 @@ this project.
 
 Proposed by: Thomas Rast  
 Possible mentor(s): Thomas Rast
+
+Remote helper for Subversion
+------------------------------------
+
+Write a remote helper for Subversion. While a lot of the underlying
+infrastructure work was completed last year, the remote helper itself
+is essentially incomplete. Major work includes:
+
+* Understanding revision mapping and building a revision-commit mapper.
+
+* Working through transport and fast-import related plumbing, changing
+  whatever is necessary.
+
+* Getting a Git-to-SVN converter merged.
+
+* Building the remote helper itself.
+
+Goal: Build a full-featured bi-directional `git-remote-svn` and get it
+      merged into upstream Git.  
+Language: C  
+See: [A note on SVN history][SVN history], [svnrdump][].  
+Proposed by: David Barr  
+Possible mentors: Jonathan Nieder, Sverre Rabbelier, David Barr
+
+[SVN history]: http://article.gmane.org/gmane.comp.version-control.git/150007
+[svnrdump]: http://svn.apache.org/repos/asf/subversion/trunk/subversion/svnrdump
-- 
1.7.9


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-03 12:27 [RFC] "Remote helper for Subversion" project David Barr
@ 2012-03-03 12:41 ` David Barr
  2012-03-04  7:54   ` Jonathan Nieder
  0 siblings, 1 reply; 22+ messages in thread
From: David Barr @ 2012-03-03 12:41 UTC
  To: Jeff King
  Cc: git, Jonathan Nieder, Sverre Rabbelier, David Barr,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, Stephen Bash

On Sat, Mar 3, 2012 at 11:27 PM, David Barr <davidbarr@google.com> wrote:
> ---
>  SoC-2012-Ideas.md |   26 ++++++++++++++++++++++++++
>  1 files changed, 26 insertions(+), 0 deletions(-)
>
>  This is simply the direct translation of last year's project idea.
>  This project makes significant incremental progress each year.
>  I'm seeking feedback from all involved on setting the direction.
>
>  --
>  David Barr
>
> diff --git a/SoC-2012-Ideas.md b/SoC-2012-Ideas.md
> index 5e83342..4c2ab05 100644
> --- a/SoC-2012-Ideas.md
> +++ b/SoC-2012-Ideas.md
> @@ -182,3 +182,29 @@ this project.
>
>  Proposed by: Thomas Rast
>  Possible mentor(s): Thomas Rast
> +
> +Remote helper for Subversion
> +------------------------------------
> +
> +Write a remote helper for Subversion. While a lot of the underlying
> +infrastructure work was completed last year, the remote helper itself
> +is essentially incomplete. Major work includes:
> +
> +* Understanding revision mapping and building a revision-commit mapper.
> +
> +* Working through transport and fast-import related plumbing, changing
> +  whatever is necessary.
> +
> +* Getting a Git-to-SVN converter merged.
> +
> +* Building the remote helper itself.
> +
> +Goal: Build a full-featured bi-directional `git-remote-svn` and get it
> +      merged into upstream Git.
> +Language: C
> +See: [A note on SVN history][SVN history], [svnrdump][].
> +Proposed by: David Barr
> +Possible mentors: Jonathan Nieder, Sverre Rabbelier, David Barr
> +
> +[SVN history]: http://article.gmane.org/gmane.comp.version-control.git/150007
> +[svnrdump]: http://svn.apache.org/repos/asf/subversion/trunk/subversion/svnrdump
> --
> 1.7.9
>

+cc: Ramkumar Ramachandra, Sam Vilain, Stephen Bash
I wasn't even close to "all involved".

--
David Barr


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-03 12:41 ` David Barr
@ 2012-03-04  7:54   ` Jonathan Nieder
  2012-03-04 10:37     ` David Barr
  2012-03-27  3:58     ` Ramkumar Ramachandra
  0 siblings, 2 replies; 22+ messages in thread
From: Jonathan Nieder @ 2012-03-04  7:54 UTC
  To: David Barr
  Cc: Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, Stephen Bash

David Barr wrote:
> On Sat, Mar 3, 2012 at 11:27 PM, David Barr <davidbarr@google.com> wrote:

>> --- a/SoC-2012-Ideas.md
>> +++ b/SoC-2012-Ideas.md
>> @@ -182,3 +182,29 @@ this project.
>>
>>  Proposed by: Thomas Rast
>>  Possible mentor(s): Thomas Rast
>> +
>> +Remote helper for Subversion
>> +------------------------------------
>> +
>> +Write a remote helper for Subversion. While a lot of the underlying
>> +infrastructure work was completed last year, the remote helper itself
>> +is essentially incomplete. Major work includes:

By the way, didn't we have a remote-svn prototype?  I'm happy to merge
any old hacky thing for staging in contrib/svn-fe, as long as it is
not documented in a misleading way.

(More generally, if anyone wants to resend useful svn-fe patches, that
will help a lot.)

>> +
>> +* Understanding revision mapping and building a revision-commit mapper.

Does this mean creating commit notes to record which subversion rev
corresponds to each commit, and marks or lightweight tags going the
other way?

>> +
>> +* Working through transport and fast-import related plumbing, changing
>> +  whatever is necessary.

I think Dmitry and Sverre took care of most of this.

>> +
>> +* Getting a Git-to-SVN converter merged.

Probably could fill a summer in itself.  In previous starts I think
there was some complexity creep. :/

 http://thread.gmane.org/gmane.comp.version-control.git/170290
 http://thread.gmane.org/gmane.comp.version-control.git/170551

>> +
>> +* Building the remote helper itself.
>> +
>> +Goal: Build a full-featured bi-directional `git-remote-svn` and get it
>> +      merged into upstream Git.

Sure would be neat. ;-)  Another nice piece to build would be branch
tracking / follow_parent heuristics.

Thanks,
Jonathan


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-04  7:54   ` Jonathan Nieder
@ 2012-03-04 10:37     ` David Barr
  2012-03-04 13:36       ` Andrew Sayers
  2012-03-04 16:23       ` [RFC] "Remote helper for Subversion" project Jonathan Nieder
  2012-03-27  3:58     ` Ramkumar Ramachandra
  1 sibling, 2 replies; 22+ messages in thread
From: David Barr @ 2012-03-04 10:37 UTC
  To: Jonathan Nieder
  Cc: Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, Stephen Bash

On Sun, Mar 4, 2012 at 6:54 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> David Barr wrote:
>> On Sat, Mar 3, 2012 at 11:27 PM, David Barr <davidbarr@google.com> wrote:
>
>>> --- a/SoC-2012-Ideas.md
>>> +++ b/SoC-2012-Ideas.md
>>> @@ -182,3 +182,29 @@ this project.
>>>
>>>  Proposed by: Thomas Rast
>>>  Possible mentor(s): Thomas Rast
>>> +
>>> +Remote helper for Subversion
>>> +------------------------------------
>>> +
>>> +Write a remote helper for Subversion. While a lot of the underlying
>>> +infrastructure work was completed last year, the remote helper itself
>>> +is essentially incomplete. Major work includes:
>
> By the way, didn't we have a remote-svn prototype?  I'm happy to merge
> any old hacky thing for staging in contrib/svn-fe, as long as it is
> not documented in a misleading way.
>
> (More generally, if anyone wants to resend useful svn-fe patches, that
> will help a lot.)

Found at former SoC2011Projects wiki page:
(http://git.wiki.kernel.org/articles/s/o/c/SoC2011Projects_b1f9.html#Remote_helper_for_Subversion_and_git-svn)
[vcs-svn, svn-fe: add a couple of
options](http://thread.gmane.org/gmane.comp.version-control.git/176578)
[remote-svn-alpha
updates](http://thread.gmane.org/gmane.comp.version-control.git/176617)

The introduction should be rephrased to include Dmitry's progress.

>>> +* Understanding revision mapping and building a revision-commit mapper.
>
> Does this mean creating commit notes to record which subversion rev
> corresponds to each commit, and marks or lightweight tags going the
> other way?

Yes. I think once again, Dmitry produced a good prototype for this component.
However, I think it also potentially incorporates git-svn style
slicing of history.
That's a significant chunk of work.

>>> +* Working through transport and fast-import related plumbing, changing
>>> +  whatever is necessary.
>
> I think Dmitry and Sverre took care of most of this.

Ditto.

>>> +* Getting a Git-to-SVN converter merged.
>
> Probably could fill a summer in itself.  In previous starts I think
> there was some complexity creep. :/
>
>  http://thread.gmane.org/gmane.comp.version-control.git/170290
>  http://thread.gmane.org/gmane.comp.version-control.git/170551

This is my preferred focus, and is a sufficient project in its own right.

>>> +* Building the remote helper itself.
>>> +
>>> +Goal: Build a full-featured bi-directional `git-remote-svn` and get it
>>> +      merged into upstream Git.
>
> Sure would be neat. ;-)  Another nice piece to build would be branch
> tracking / follow_parent heuristics.

As noted earlier, the remote helper itself is now half-complete.
I do think the immediate goal should be bi-direction.
The remainder is porting git-svn logic to the new helper.
However, it would be interesting to see what's missing with respect to porting

--
David Barr


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-04 10:37     ` David Barr
@ 2012-03-04 13:36       ` Andrew Sayers
  2012-03-05 15:27         ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Stephen Bash
  2012-03-04 16:23       ` [RFC] "Remote helper for Subversion" project Jonathan Nieder
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Sayers @ 2012-03-04 13:36 UTC
  To: David Barr
  Cc: Jonathan Nieder, Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, Stephen Bash

Hi guys,

I made a few little git contributions a couple of years back, before
being swallowed up by my old job.  I've become quite interested in
svn->git translation since starting a new job, but I'm delurking earlier
than planned so please bear with me as all I have right now is an
interesting proof of concept and the possibility of some free time some
day.  If things go well, I hope to help out with the
branching-and-merging part of the problem, so I'll describe what I've
done and how it might affect SoC projects.  Apologies in advance for
hijacking the thread :)

I mentioned proof-of-concept code for the work I've done - I'd like to
get confirmation from my new employer before putting it all online, so
hopefully I can make it available in a few days.

While researching the problem, I found Stephen Bash's original
proposal[1] and snerp-vortex[2] quite inspiring, but wasn't able to find
any details on SoC-related work in the branching-and-merging department
- hopefully the following isn't just a retread of ideas developed since
then.  I've concentrated on importing from SVN so far, but have kept an
eye on update and half an eye on bi-direction in the hopes of being
useful there some day.

It seems to me the "svn export" and "git import" steps make most sense
as two unrelated projects.  Snerp-vortex and Stephen's scripts both cut
the history import problem at that point, as do svn-fe/git-fast-import
with code import.  Exporting SVN history is a messy and sometimes
project-specific job, so allowing a project to concentrate on that part
makes it possible for SVN experts to use all their skills without having
to learn git plumbing before they make their first commit (much respect
to Stephen for managing that feat BTW).


I've written a proof-of-concept history converter that can be split into
three parts: a format for describing SVN history; a large, often messy
Perl program that writes files in that format; and a small Perl program
that reads the format and translates it into git.  With hindsight, Perl
is the right language for the SVN exporter, but the git importer would
have been better written in C.


Personally, I think SVN export will always need a strong manual
component to get the best results, so I've put quite a bit of work into
designing a good SVN history format.  Like git-fast-import, it's an
ASCII format designed both for human and machine consumption:

In r1, create branch "trunk"
In r10, create branch "branches/foo" as "foo" from "trunk" r9
In r12, create tag "tags/version1" as "version1" from "trunk" r11
In r12, deactivate "tags/version1"
# blank lines and lines beginning with a '#' are ignored
In r15, merge "branches/foo" r14 into "trunk"
In r15, delete "branches/foo"

The above has been designed as an abstract representation of SVN
history, with as little git-specific content as possible.  This turns
out to make the problem a bit easier to think about, a bit easier to
implement, and potentially useful to other DVCSs some day.  I wanted to
enable svn-merge2git[3]-style extraction of merge info from logs, but
found that even a message as clear as "merge r123 from trunk" could mean
"cherry-pick revision 123", "merge everything up to revision 123",
"merge everything up to the last revision that touched trunk before
123", "merge everything up to 132", or any number of other things.  As
such, the format is designed to allow copious comments and ease of
bulk-editing with regexps and a powerful editor.
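
A minimal Python sketch of a reader for the format shown above. The
grammar here is inferred only from the seven sample lines, so it is an
assumption rather than the actual proof-of-concept parser; the function
name `parse_history` and the dictionary keys are hypothetical.

```python
import re

# Every event line starts with "In r<N>, ".
LINE = re.compile(r'^In r(?P<rev>\d+), (?P<rest>.+)$')

# One pattern per action seen in the sample; anything else is an error.
ACTIONS = [
    re.compile(r'^(?P<action>create) (?P<kind>branch|tag) "(?P<path>[^"]+)"'
               r'(?: as "(?P<name>[^"]+)")?'
               r'(?: from "(?P<source>[^"]+)" r(?P<source_rev>\d+))?$'),
    re.compile(r'^(?P<action>merge) "(?P<source>[^"]+)" r(?P<source_rev>\d+)'
               r' into "(?P<path>[^"]+)"$'),
    re.compile(r'^(?P<action>deactivate|delete) "(?P<path>[^"]+)"$'),
]

SAMPLE = '''\
In r1, create branch "trunk"
In r10, create branch "branches/foo" as "foo" from "trunk" r9
In r12, create tag "tags/version1" as "version1" from "trunk" r11
In r12, deactivate "tags/version1"
# blank lines and lines beginning with a '#' are ignored

In r15, merge "branches/foo" r14 into "trunk"
In r15, delete "branches/foo"
'''


def parse_history(text):
    """Return one dict per event line, in file order."""
    events = []
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no events
        head = LINE.match(line)
        if head is None:
            raise ValueError(f"line {lineno}: expected 'In r<N>, ...': {raw!r}")
        for pattern in ACTIONS:
            m = pattern.match(head["rest"])
            if m:
                # Drop optional groups that did not take part in the match.
                event = {k: v for k, v in m.groupdict().items() if v is not None}
                event["rev"] = int(head["rev"])
                if "source_rev" in event:
                    event["source_rev"] = int(event["source_rev"])
                events.append(event)
                break
        else:
            raise ValueError(f"line {lineno}: unrecognised action: {raw!r}")
    return events


if __name__ == "__main__":
    for event in parse_history(SAMPLE):
        print(event)
```

Rejecting unknown lines outright, rather than skipping them, seems the
safer choice for a file that is meant to be hand-edited in bulk.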

The format feels relatively complete to me, and in any case not a good
candidate for an SoC project because it's all about experience and
nothing to do with raw talent.


Once the format is defined, git import is fairly straightforward.
Proof-of-concept code to follow, but it's really just a wrapper around
git-commit-tree, git-mktag etc.  I wrote this in Perl thinking it would
relate somehow to git-svn, but eventually realised it didn't and that a
few hundred calls to (plumbing) processes per second isn't so good for
performance.  The only interesting part of the problem is how to tackle
SVN tags.  I went for an ambitious approach, making normal tags where
possible and downgrading them to lightweight tags when necessary.  This
does involve managing something that is effectively a branch in
refs/tags/, but what else is an SVN tag but a branch in the wrong namespace?

I'm afraid I don't know enough about SoC to say whether rewriting git
import in C would make a good project - on the one hand, it's a smallish
bit of work that would make a student learn a lot of good git code, on
the other hand it would mostly just involve fixing up a bit of ugly Perl
with little chance to show any creativity.


SVN export is much more complicated, and has taken most of my time.
Although there's no way to tell automatically which directories are
branches, I realised you can detect trunks automatically about 90% of
the time by looking for where files/directories are first created, and
can detect non-trunk branches at least 99% of the time once you know the
trunks.  Trunk detection is normally just a convenience, but can be a
lifesaver when importing from a sufficiently messy repository.  There's
a lot you can do with merge detection, but between svn:mergeinfo,
svnmerge.py[4], svk[5] and log messages I realise I've only scratched
the surface.  In many ways, this is a classic Perl problem - take a
bunch of messy textual input, string it all together and make some neat
textual output.  The code I've got right now seems fine for anything up
to about 20,000 commits, but eats way too much memory for huge repos.
Depending on how much time I get, I might have tackled that problem by
the time I put this code online.

Assuming I have enough time to work on SVN export, I would be loath to
put it on anyone else in the near future.  I could see endless
optimisations being spun off once the code is mature, but right now it's
an idiosyncratic collection of untested mostly-working bits that need
serious attention before I'd be happy warping a young mind on it.


I hope this information dump explains where I am and how I can help.  As
I say, I can't yet promise how much (if any) time I'll get to work on
this in future, but I hope to help with the history translation process
if circumstances allow.  More importantly to this thread, I hope this
gives some ideas about SoC projects and places (not) to put attention.

	- Andrew Sayers

[1] http://comments.gmane.org/gmane.comp.version-control.git/158940
[2] https://github.com/rcaputo/snerp-vortex
[3] http://repo.or.cz/w/svn-merge2git.git
[4] http://www.orcaware.com/svn/wiki/Svnmerge.py
[5] http://search.cpan.org/dist/SVK/


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-04 10:37     ` David Barr
  2012-03-04 13:36       ` Andrew Sayers
@ 2012-03-04 16:23       ` Jonathan Nieder
  1 sibling, 0 replies; 22+ messages in thread
From: Jonathan Nieder @ 2012-03-04 16:23 UTC
  To: David Barr
  Cc: Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, Stephen Bash

David Barr wrote:
> On Sun, Mar 4, 2012 at 6:54 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> (More generally, if anyone wants to resend useful svn-fe patches, that
>> will help a lot.)
>
> Found at former SoC2011Projects wiki page:
> (http://git.wiki.kernel.org/articles/s/o/c/SoC2011Projects_b1f9.html#Remote_helper_for_Subversion_and_git-svn)
> [vcs-svn, svn-fe: add a couple of
> options](http://thread.gmane.org/gmane.comp.version-control.git/176578)
> [remote-svn-alpha
> updates](http://thread.gmane.org/gmane.comp.version-control.git/176617)

Do you mean these are patches that should be applied?  New emails
containing a git url or, even better, the actual patch are best, since
it means I can be sure I am looking at the latest or at least the
intended version of the change.

[...]
> However, I think it also potentially incorporates git-svn style
> slicing of history.

Do I understand correctly that you mean paying attention to copy-from
information, like "svn log" does?  (For example, making cloning

	svn::http://svn.example.com/project/branches/feature

when branches/feature was originally copied from trunk involve
grabbing "http://svn.example.com/project/trunk" in early revs?)

[...]
> The remainder is porting git-svn logic to the new helper.
> However, it would be interesting to see what's missing with respect to porting

While git-svn can be useful for inspiration when wondering "how could
I possibly solve such-and-such problem", I'm not sure feature-parity
with git-svn is too important.  After all, people needing git-svn
features can still use git-svn.

I say this since git-svn has lots of features we are missing:
not discarding unhandled properties (important), shared history with
multiple branches, author mapping, fetching and pushing svn:mergeinfo
information, partial clone via a path-ignore regex, choice of
timezone, filename reencoding, manual svn:ignore-to-gitignore
conversion, svn-compatible "log" and "blame" output, custom git<->svn
branchname mappings, and so on.  The ability to track one branch,
including push support, with a linear history would be exciting
already and doesn't require all that.

Cheers,
Jonathan


* Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-04 13:36       ` Andrew Sayers
@ 2012-03-05 15:27         ` Stephen Bash
  2012-03-05 23:27           ` Approaches to SVN to Git conversion Andrew Sayers
  2012-03-06 19:29           ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Nathan Gray
  0 siblings, 2 replies; 22+ messages in thread
From: Stephen Bash @ 2012-03-05 15:27 UTC
  To: Andrew Sayers
  Cc: Jonathan Nieder, Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, David Barr

All-

This turned out to be longer than I intended, but actually summarizes some of my modern thoughts on SVN to Git conversion (as always, I'm more interested in one time migration than bidirectional operation, so read with a grain of salt).

More inline...

----- Original Message -----
> From: "Andrew Sayers" <andrew-git@pileofstuff.org>
> Sent: Sunday, March 4, 2012 8:36:41 AM
> Subject: Re: [RFC] "Remote helper for Subversion" project
> 
> ... snip ...
> 
> While researching the problem, I found Stephen Bash's original
> proposal[1] and snerp-vortex[2] quite inspiring, but wasn't able to
> find any details on SoC-related work in the branching-and-merging
> department - hopefully the following isn't just a retread of ideas
> developed since then.  I've concentrated on importing from SVN so far,
> but have kept an eye on update and half an eye on bi-direction in the
> hopes of being useful there some day.
> 
> It seems to me the "svn export" and "git import" steps make most sense
> as two unrelated projects.  Snerp-vortex and Stephen's scripts both
> cut the history import problem at that point, as do
> svn-fe/git-fast-import with code import.  Exporting SVN history is a
> messy and sometimes project-specific job, so allowing a project to
> concentrate on that part makes it possible for SVN experts to use all
> their skills without having to learn git plumbing before they make
> their first commit (much respect to Stephen for managing that feat
> BTW).

After many a long conversation with Ram and Jonathan (and others), I'm actually going the other direction.  My current thinking (and this is very much open for discussion) is that as long as the SVN properties are available (especially the copyfrom information) Git has just as much information (if not more) to reconstruct the SVN history as SVN does.  (And going through our messy history I haven't found any counterpoint to this yet)

> I've written a proof-of-concept history converter that can be split
> into three parts: a format for describing SVN history; a large, often
> messy Perl program that writes files in that format; and a small Perl
> program that reads the format and translates it into git.  With
> hindsight, Perl is the right language for the SVN exporter, but the
> git importer would have been better written in C.
> 
> Personally, I think SVN export will always need a strong manual
> component to get the best results, so I've put quite a bit of work
> into designing a good SVN history format.  Like git-fast-import, it's
> an ASCII format designed both for human and machine consumption...

First, I'm very impressed that you managed to get a language like this up and working.  It could prove very useful going forward.  On the flip side, from my experiments over the last year I've actually been leaning toward a solution that is more implicit than explicit.  Taking git-svn as a model, I've been trying to define a mapping system (in Perl):

  my %branch_spec = ( '/trunk/projname' => 'master',
                      '/branches/*/projname' => '/refs/heads/*' );
  my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );

(See [1] for notes on our SVN structure.  In a std-layout-style repo it would be:

  my %branch_spec = ( '/projname/trunk' => 'master',
                      '/projname/branches/*' => '/refs/heads/*' );
  my %tag_spec = ( '/projname/tags/*' => '/refs/tags/*' );)

Now I know this simple mapping will fail as I get further in our history -- in particular we have one branch that came from:

  svn cp $SVN_REPO/trunk/ $SVN_REPO/foo  # OOPS! not in branches!
  svn mv $SVN_REPO/foo $SVN_REPO/branches/foo

From an automation perspective, I expect the first svn operation to produce an error saying "Possible branch created at /foo from known branch trunk (master), but doesn't match any known branch spec", while the second (if "continue-on-error" is turned on) would error "Branch /branches/foo created from unknown branch located at /foo".  It's then up to the user to modify the branch map to something that accounts for this behavior:

  my %branch_spec = ( '/trunk/projname' => 'master',
                      '/branches/*/projname' => '/refs/heads/*',
                      '/foo' => '/refs/heads/foo' );
  my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );

So in this case I'm making an explicit branch mapping, but the use of glob-style syntax allows the user to catch larger classes of branches if desired (I'll also note that depending on the implementation /foo may need to map to /refs/heads/bad-foo so that /branches/foo can map to /refs/heads/foo, but my intention has been to squash empty commits, so it's possible the name conflict is a non-issue since content didn't change).  I also know that we have some copy operations that are just weird, so it might be helpful to have an ignore mechanism that tells the system to ignore copies into/out-of certain SVN paths.
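
A minimal Python sketch of how such a glob-style map could be applied to
individual SVN paths. It assumes each '*' stands for exactly one path
component and that the first matching entry wins; the function names
`compile_spec` and `map_path` are hypothetical, and the ref strings are
copied as written in the map above.

```python
import re

BRANCH_SPEC = {
    "/trunk/projname": "master",
    "/branches/*/projname": "/refs/heads/*",
    "/foo": "/refs/heads/foo",
}


def compile_spec(spec):
    """Turn each SVN-side glob into a regex anchored at a branch root."""
    compiled = []
    for pattern, ref in spec.items():
        # A '*' component captures one directory name; others match literally.
        parts = ("([^/]+)" if part == "*" else re.escape(part)
                 for part in pattern.split("/"))
        # The optional tail captures the path inside the branch.
        compiled.append((re.compile("/".join(parts) + r"(?:/(.*))?$"), ref))
    return compiled


def map_path(path, compiled):
    """Return (ref, path inside the branch), or None if nothing matches."""
    for regex, ref in compiled:
        m = regex.match(path)
        if m:
            *stars, rest = m.groups()
            for star in stars:
                # Substitute captured names into the ref, left to right.
                ref = ref.replace("*", star, 1)
            return ref, rest or ""
    return None


if __name__ == "__main__":
    compiled = compile_spec(BRANCH_SPEC)
    for svn_path in ("/trunk/projname/src/main.c",
                     "/branches/rel-1.0/projname/README",
                     "/oops/file"):
        print(svn_path, "->", map_path(svn_path, compiled))
```

A `None` result is the point at which a converter would raise the "doesn't
match any known branch spec" error and ask the user to extend the map.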

> Once the format is defined, git import is fairly straightforward.
> Proof-of-concept code to follow, but it's really just a wrapper around
> git-commit-tree, git-mktag etc.  I wrote this in Perl thinking it
> would relate somehow to git-svn, but eventually realised it didn't and
> that a few hundred calls to (plumbing) processes per second isn't so
> good for performance.  The only interesting part of the problem is how
> to tackle SVN tags.  I went for an ambitious approach, making normal
> tags where possible and downgrading them to lightweight tags when
> necessary.  This does involve managing something that is effectively a
> branch in refs/tags/, but what else is an SVN tag but a branch in the
> wrong namespace?

I don't understand how "normal" and "lightweight" apply in this situation?  As I mentioned before I'd like to squash empty commits (in the case of a one-time migration, in the bidirectional case it's probably easier not to), so many SVN tagging operations wouldn't produce new commits, and the (technically) correct commit is tagged.  In the case of actual content changes in a tag's life, I think it's up to the user to decide between three options:

  1) only retain the last SVN tag
  2) tag using the git-svn-style 'tagname@rev' for all but the last
  3) Do (2), but move older tags to some hidden namespace (refs/hidden/tags or the like)

Option (3) is predicated on gc accepting all subdirectories of refs/ as valid (it did this when I wrote my original scripts, and I don't believe this behavior has changed).  For a one-time migration I think all three of these options can be implemented using annotated tags.  In the bidirectional case things get murky (maybe always tag with tagname@rev and hope for tab completion?).
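
A small Python sketch of the ref names each of the three options would
produce for a tag whose content changed more than once. The function
name `tag_refs` and its input (a list of SVN revisions, oldest first)
are assumptions made for illustration; only the naming scheme comes from
the options listed above.

```python
def tag_refs(tagname, revs, option):
    """Map ref name -> SVN revision for one tag.

    revs lists the revisions in which the SVN tag's content changed,
    oldest first; option is 1, 2 or 3 as numbered above.
    """
    if option not in (1, 2, 3):
        raise ValueError("option must be 1, 2 or 3")
    *older, last = revs
    latest = {f"refs/tags/{tagname}": last}
    if option == 1:
        return latest  # only the last SVN tag is retained
    # Options 2 and 3 differ only in where the older versions live.
    prefix = "refs/tags" if option == 2 else "refs/hidden/tags"
    refs = {f"{prefix}/{tagname}@{rev}": rev for rev in older}
    refs.update(latest)
    return refs


if __name__ == "__main__":
    for option in (1, 2, 3):
        print(option, tag_refs("version1", [12, 20, 31], option))
```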

> ... snip ...
>
> SVN export is much more complicated, and has taken most of my time.
> Although there's no way to tell automatically which directories are
> branches, I realised you can detect trunks automatically about 90% of
> the time by looking for where files/directories are first created, and
> can detect non-trunk branches at least 99% of the time once you know
> the trunks.  Trunk detection is normally just a convenience, but can
> be a lifesaver when importing from a sufficiently messy repository.
> There's a lot you can do with merge detection, but between
> svn:mergeinfo, svnmerge.py[4], svk[5] and log messages I realise I've
> only scratched the surface.  In many ways, this is a classic Perl
> problem - take a bunch of messy textual input, string it all together
> and make some neat textual output.  The code I've got right now seems
> fine for anything up to about 20,000 commits, but eats way too much
> memory for huge repos.  Depending on how much time I get, I might have
> tackled that problem by the time I put this code online.

Branch detection falls into my branch mapping mentioned above.  I realized that for as much as I slaved over finding every last branch (regardless of location) in our repo, it really came down to "list the known branches, find copies of those, if any copies leave the list of known branches warn the user/revise branch spec, repeat".  Now there are probably SVN repositories out there that can't write a well-formed branch spec (I had one at a previous job where the definition of a branch changed halfway through history), but I'm attempting to convince myself something is better than nothing and we'll catch the exceptions as they come up.
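
The loop described above can be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not the actual
scripts: copies are given as hypothetical (revision, source, destination)
tuples, a move is treated as a copy, each '*' in a spec matches one path
component, and `find_branches` is an invented name.

```python
import re


def spec_regex(pattern):
    """Compile a glob where each '*' matches exactly one path component."""
    parts = ("[^/]+" if p == "*" else re.escape(p) for p in pattern.split("/"))
    return re.compile("/".join(parts) + "$")


def find_branches(copies, patterns, initial):
    """One pass over the copy operations; returns (known branches, warnings)."""
    regexes = [spec_regex(p) for p in patterns]
    known = set(initial)
    warnings = []
    for rev, src, dst in sorted(copies):
        in_spec = any(r.match(dst) for r in regexes)
        if src in known and in_spec:
            known.add(dst)  # a copy of a known branch into a known location
        elif src in known:
            warnings.append(f"r{rev}: possible branch created at {dst} from "
                            f"known branch {src}, but doesn't match any "
                            f"known branch spec")
        elif in_spec:
            warnings.append(f"r{rev}: branch {dst} created from unknown "
                            f"branch located at {src}")
    return known, warnings


if __name__ == "__main__":
    copies = [(10, "/trunk", "/foo"),           # the stray copy
              (11, "/foo", "/branches/foo"),    # the later move
              (12, "/trunk", "/branches/bar")]
    patterns = ["/trunk", "/branches/*"]
    known, warnings = find_branches(copies, patterns, {"/trunk"})
    print(sorted(known), *warnings, sep="\n")
    # The "revise branch spec, repeat" step: add the stray path and rerun.
    known, warnings = find_branches(copies, patterns + ["/foo"], {"/trunk"})
    print(sorted(known), warnings)
```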

Merge detection and translation is (IMO) not a well-formed problem.  As you point out, there's no real way to know if it's a real merge or a cherry-pick.  It occurred to me last night, after reading your e-mail, that a tool could attempt to look at the diffs in the branch history and compare them with the diff created by the merge, but that gets into all kinds of diff machinery/text processing hijinks that I don't want to contemplate (others might be more willing).
 
> [1] http://comments.gmane.org/gmane.comp.version-control.git/158940
> [2] https://github.com/rcaputo/snerp-vortex
> [3] http://repo.or.cz/w/svn-merge2git.git
> [4] http://www.orcaware.com/svn/wiki/Svnmerge.py
> [5] http://search.cpan.org/dist/SVK/

Thanks for the brain dump.  I hope this brain dump in response is helpful in some way.  I seem to revisit this topic about once every three months or so, and this was a good chance to share some of my recent revelations.

Thanks,
Stephen


* Re: Approaches to SVN to Git conversion
  2012-03-05 15:27         ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Stephen Bash
@ 2012-03-05 23:27           ` Andrew Sayers
  2012-03-06 14:36             ` Stephen Bash
  2012-03-06 19:29           ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Nathan Gray
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Sayers @ 2012-03-05 23:27 UTC
  To: Stephen Bash
  Cc: Jonathan Nieder, Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, David Barr

On 05/03/12 15:27, Stephen Bash wrote:
> All-
> 
> This turned out to be longer than I intended, but actually summarizes some of my modern thoughts on SVN to Git conversion (as always, I'm more interested in one time migration than bidirectional operation, so read with a grain of salt).
> 
> More inline...
> 
> ----- Original Message -----
>> From: "Andrew Sayers" <andrew-git@pileofstuff.org>
>> Sent: Sunday, March 4, 2012 8:36:41 AM
>> Subject: Re: [RFC] "Remote helper for Subversion" project
>>
>> ... snip ...
>>
>> While researching the problem, I found Stephen Bash's original
>> proposal[1] and snerp-vortex[2] quite inspiring, but wasn't able to
>> find any details on SoC-related work in the branching-and-merging
>> department - hopefully the following isn't just a retread of ideas
>> developed since then.  I've concentrated on importing from SVN so far,
>> but have kept an eye on update and half an eye on bi-direction in the
>> hopes of being useful there some day.
>>
>> It seems to me the "svn export" and "git import" steps make most sense
>> as two unrelated projects.  Snerp-vortex and Stephen's scripts both
>> cut the history import problem at that point, as do
>> svn-fe/git-fast-import with code import.  Exporting SVN history is a
>> messy and sometimes project-specific job, so allowing a project to
>> concentrate on that part makes it possible for SVN experts to use all
>> their skills without having to learn git plumbing before they make
>> their first commit (much respect to Stephen for managing that feat
>> BTW).
> 
> After many a long conversation with Ram and Jonathan (and others), I'm actually going the other direction.  My current thinking (and this is very much open for discussion) is that as long as the SVN properties are available (especially the copyfrom information) Git has just as much information (if not more) to reconstruct the SVN history as SVN does.  (And going through our messy history I haven't found any counterpoint to this yet)

I agree that git can be taught a superset of the information in SVN, but
you'll need absolutely all SVN properties available - somewhere out
there, someone has created a merge script that sets "my-merge-info:
revisions 1 to 10 from trunk (inclusive)".  Building on your point
below, a git-based converter could take an ambiguous message like "merge
revision 123 from trunk" and compare the diff for this commit against
the diff for r123, for r1:r123, for r1:r132 and so on until it found a
good match.
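
A rough sketch of that matching step, in Python for brevity.  Everything
here is an assumption made for illustration (the function names, the
threshold, and representing a diff as a set of changed lines); it is not
taken from any existing converter.  Each candidate revision range is
scored by how closely its diff matches the merge commit's diff, and the
best one is kept:

```python
def similarity(diff_a, diff_b):
    """Jaccard similarity between two diffs given as sets of changed lines."""
    if not diff_a and not diff_b:
        return 1.0
    return len(diff_a & diff_b) / len(diff_a | diff_b)


def best_merge_range(merge_diff, candidates, threshold=0.8):
    """Return the candidate range whose diff best matches merge_diff.

    candidates maps a (first_rev, last_rev) tuple to that range's diff.
    Returns None when nothing reaches the threshold, so a human can decide.
    """
    best_range, best_score = None, 0.0
    for rev_range, diff in candidates.items():
        score = similarity(merge_diff, diff)
        if score > best_score:
            best_range, best_score = rev_range, score
    return best_range if best_score >= threshold else None


# Hypothetical data: the commit message says "merge revision 123 from trunk".
merge_diff = {"+a", "+b", "-c", "+d"}
candidates = {
    (123, 123): {"+d"},
    (1, 123): {"+a", "+b", "-c", "+d"},
    (1, 132): {"+a", "+b", "-c", "+d", "+e", "+f"},
}
print(best_merge_range(merge_diff, candidates))
```

A real implementation would need a fuzzier comparison (conflict
resolutions change the merged diff), which is why the threshold is left
as a parameter.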

I'm personally more interested in extracting SVN history from an SVN
dump than from git (which I'll discuss in a moment), but these
approaches sound fundamentally compatible - it sounds like we agree that
one part of the problem is a simple process that needs some
optimisation work and the other is a good candidate for full employment
theorem[1].  Putting an interface between them means we can write one
implementation for the bit with an obvious right answer, and continue
experimenting with different solutions for the other bit.

I described the work I've done as a solution with three parts (SVN
export, file format, Git import), but it's equally valid to call them
three different projects that were initially developed in parallel.  I'd
be quite happy for the format and import parts to move towards becoming
part of the wider "remote helper" project, and the export part to become
a CPAN module that's just one of many programs that writes the
history-import format (a bit like how svn-fe is one of many programs
that writes the git-fast-import format).

I wrote my SVN exporter based on SVN dumps for three reasons - I figured
people switching from SVN would be more comfortable customising a
solution that only used technologies they understood, I figured it might
be useful to Mercurial or Bazaar some day if it was DVCS-neutral, and I
have to use SVN for my day job so I'm more interested in getting a good
migration story today than a great one tomorrow.

My instinct is that under ideal conditions, the SVN exporter I've got
will initially surge forward, build up a great collection of test cases,
then gradually sink under the weight of hacks needed to pass the tests.
 This strikes me as a good reason to isolate the exporter from git
itself, but also a necessary step in getting SVN import done - there's
no way to really know how to architect a good solution until we can
build a spec from real world test cases.

> 
>> I've written a proof-of-concept history converter that can be split
>> into three parts: a format for describing SVN history; a large, often
>> messy Perl program that writes files in that format; and a small Perl
>> program that reads the format and translates it into git.  With
>> hindsight, Perl is the right language for the SVN exporter, but the
>> git importer would have been better written in C.
>>
>> Personally, I think SVN export will always need a strong manual
>> component to get the best results, so I've put quite a bit of work
>> into designing a good SVN history format.  Like git-fast-import, it's
>> an ASCII format designed both for human and machine consumption...
> 
> First, I'm very impressed that you managed to get a language like this up and working.  It could prove very useful going forward.  On the flip side, from my experiments over the last year I've actually been leaning toward a solution that is more implicit than explicit.  Taking git-svn as a model, I've been trying to define a mapping system (in Perl):

Just to be clear, the language described above goes after the clever
branch-detection work described below - all branches need to be
specified explicitly in the format so that it can be a simple mechanical
job.

> 
>   my %branch_spec = ( '/trunk/projname' => 'master',
>                       '/branches/*/projname' => '/refs/heads/*' );
>   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> 
> (See [1] for notes on our SVN structure.  In a std-layout-style repo it would be:
> 
>   my %branch_spec = ( '/projname/trunk' => 'master',
>                       '/projname/branches/*' => '/refs/heads/*' );
>   my %tag_spec = ( '/projname/tags/*' => '/refs/tags/*' );)
> 
> Now I know this simple mapping will fail as I get further in our history -- in particular we have one branch that came from:
> 
>   svn cp $SVN_REPO/trunk/ $SVN_REPO/foo  # OOPS! not in branches!
>   svn mv $SVN_REPO/foo $SVN_REPO/branches/foo
> 
> From an automation perspective, I expect the first svn operation to produce an error saying "Possible branch created at /foo from known branch trunk (master), but doesn't match any known branch spec", while the second (if "continue-on-error" is turned on) would error "Branch /branches/foo created from unknown branch located at /foo".  It's then up to the user to modify the branch map to something that accounts for this behavior:
> 
>   my %branch_spec = ( '/trunk/projname' => 'master',
>                       '/branches/*/projname' => '/refs/heads/*',
>                       '/foo' => '/refs/heads/foo' );
>   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> 
> So in this case I'm making an explicit branch mapping, but the use of glob-style syntax allows the user to catch larger classes of branches if desired (I'll also note that depending on the implementation /foo may need to map to /refs/heads/bad-foo so that /branches/foo can map to /refs/heads/foo, but my intention has been to squash empty commits, so it's possible the name conflict is a non-issue since content didn't change).  I also know that we have some copy operations that are just weird, so it might be helpful to have an ignore mechanism that tells the system to ignore copies into/out-of certain SVN paths.
> 

I started with an approach like you describe, but as you say it winds up
in a mess of special cases.  A friend pointed me to Perl's catalyst
repository[2], which is a wonderful haven of every mad SVN thing ever
dreamt up.  That got me playing with more general heuristics, and while
writing this e-mail I think I've finally nailed it.  What do you say to
defining SVN branches like this:

A directory is a branch if...
1. it is not a subdirectory of an existing branch; and
2. either:
2a. it is in a list of branches specified by the user, or
2b. it is copied from a (subdirectory of a) branch

I'll have to go and play with the implementation details, but I don't
see any misbehaviour jumping out of this approach.  Rule 1 discounts the
"svn cp /branches/foo /trunk/libfoo" pattern, but I'm fine with that
because I don't think anyone really has a good answer there yet.  Rule
2a allows the user as much input as they want but only requires most
people to say that /trunk is a branch.  Finally, rule 2b counts the "svn
cp /trunk/libfoo /branches/foo" pattern - I'm personally not bothered by
the asymmetry there compared to rule 1.
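
As a minimal sketch of the three rules (in Python, and not the
proof-of-concept code itself; `is_branch`, `is_within` and the way
copyfrom information is passed in are names assumed for illustration):

```python
def is_within(path, parent):
    """True if path is parent itself or lies underneath it."""
    return path == parent or path.startswith(parent.rstrip("/") + "/")


def is_branch(path, branches, user_branches, copyfrom=None):
    """Decide whether a newly created directory should become a branch.

    branches      -- set of paths already known to be branches
    user_branches -- set of paths the user declared as branches (rule 2a)
    copyfrom      -- source path of the copy, or None for a plain add
    """
    # Rule 1: it must not be a subdirectory of an existing branch.
    if any(path != b and is_within(path, b) for b in branches):
        return False
    # Rule 2a: the user listed it explicitly.
    if path in user_branches:
        return True
    # Rule 2b: it was copied from a branch, or from a subdirectory of one.
    return copyfrom is not None and any(is_within(copyfrom, b) for b in branches)


user = {"/trunk"}
print(is_branch("/trunk", set(), user))                              # rule 2a
print(is_branch("/branches/foo", {"/trunk"}, user, "/trunk/libfoo")) # rule 2b
print(is_branch("/trunk/libfoo", {"/trunk", "/branches/foo"}, user,
                "/branches/foo"))                                    # rule 1
```

The third call is the "svn cp /branches/foo /trunk/libfoo" pattern, which
rule 1 deliberately rejects.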

I'm afraid I've spent all my free time tonight writing this e-mail, but
the proof-of-concept code is only based on a sloppy half-realisation of
the above so probably wouldn't be that enlightening anyway.

>> Once the format is defined, git import is fairly straightforward.
>> Proof-of-concept code to follow, but it's really just a wrapper around
>> git-commit-tree, git-mktag etc.  I wrote this in Perl thinking it
>> would relate somehow to git-svn, but eventually realised it didn't and
>> that a few hundred calls to (plumbing) processes per second isn't so
>> good for performance.  The only interesting part of the problem is how
>> to tackle SVN tags.  I went for an ambitious approach, making normal
>> tags where possible and downgrading them to lightweight tags when
>> necessary.  This does involve managing something that is effectively a
>> branch in refs/tags/, but what else is an SVN tag but a branch in the
>> wrong namespace?
> 
> I don't understand how "normal" and "lightweight" apply in this situation?  As I mentioned before I'd like to squash empty commits (in the case of a one-time migration, in the bidirectional case it's probably easier not to), so many SVN tagging operations wouldn't produce new commits, and the (technically) correct commit is tagged.  In the case of actual content changes in a tag's life, I think it's up to the user to decide between three options:
> 
>   1) only retain the last SVN tag
>   2) tag using the git-svn-style 'tagname@rev' for all but the last
>   3) Do (2), but move older tags to some hidden namespace (refs/hidden/tags or the like)
> 
> Option (3) is predicated on gc accepting all subdirectories of refs/ as valid (it did this when I wrote my original scripts, and I don't believe this behavior has changed).  For a one-time migration I think all three of these options can be implemented using annotated tags.  In the bidirectional case things get murky (maybe always tag with tagname@rev and hope for tab completion?).
> 

I didn't explain this particularly well, as it's based largely on the
vague desire to make update work some day.  Imagine the user does this:

* git svn-pull # get tags/foo, a candidate for an annotated tag
... time passes ...
* git svn-pull # tags/foo has now been updated in another revision

If we create an annotated tag in step 1, what do we do in step 2?  You
can't make the tag object the parent of a new revision, so you need to
do something unpleasant.  The solution I proposed was to convert the tag
message to a commit message (i.e. pretend a lightweight tag had been
created all along), then add another commit on top of it and make a
lightweight tag from the new commit (i.e. treat it like a branch).  In
retrospect that's far too much magic without user involvement - a better
solution would be to give the user this option along with the ones you
outlined, and let git-config remember their preference if they want.

	- Andrew

[1] http://en.wikipedia.org/wiki/Full_employment_theorem
[2] http://dev.catalyst.perl.org/repos/bast/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Approaches to SVN to Git conversion
  2012-03-05 23:27           ` Approaches to SVN to Git conversion Andrew Sayers
@ 2012-03-06 14:36             ` Stephen Bash
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen Bash @ 2012-03-06 14:36 UTC (permalink / raw)
  To: Andrew Sayers
  Cc: Jonathan Nieder, Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Ramkumar Ramachandra, Sam Vilain, David Barr

----- Original Message -----
> From: "Andrew Sayers" <andrew-git@pileofstuff.org>
> Sent: Monday, March 5, 2012 6:27:32 PM
> Subject: Re: Approaches to SVN to Git conversion
> 
> > My current thinking (and this is very much open for discussion) is
> > that as long as the SVN properties are available (especially the
> > copyfrom information) Git has just as much information (if not more)
> > to reconstruct the SVN history as SVN does.  (And going through our
> > messy history I haven't found any counterpoint to this yet)
> 
> I agree that git can be taught a superset of the information in SVN,
> but you'll need absolutely all SVN properties available...

I'm pretty sure Jonathan won't be happy with anything less ;)

> I wrote my SVN exporter based on SVN dumps for three reasons - I
> figured people switching from SVN would be more comfortable
> customising a solution that only used technologies they understood, I
> figured it might be useful to Mercurial or Bazaar some day if it was
> DVCS-neutral, and I have to use SVN for my day job so I'm more
> interested in getting a good migration story today than a great one
> tomorrow.

The multiple systems argument is a good one.

> >   my %branch_spec = ( '/trunk/projname' => 'master',
> >                       '/branches/*/projname' => '/refs/heads/*' );
> >   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> > 
> > Now I know this simple mapping will fail as I get further in our
> > history -- in particular we have one branch that came from:
> > 
> >   svn cp $SVN_REPO/trunk/ $SVN_REPO/foo  # OOPS! not in branches!
> >   svn mv $SVN_REPO/foo $SVN_REPO/branches/foo
> > 
> > It's then up to the user to modify the branch
> > map to something that accounts for this behavior:
> > 
> >   my %branch_spec = ( '/trunk/projname' => 'master',
> >                       '/branches/*/projname' => '/refs/heads/*',
> >                       '/foo' => '/refs/heads/foo' );
> >   my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> 
> I started with an approach like you describe, but as you say it winds
> up in a mess of special cases.  A friend pointed me to Perl's catalyst
> repository[2], which is a wonderful haven of every mad SVN thing ever
> dreamt up.  That got me playing with more general heuristics, and
> while writing this e-mail I think I've finally nailed it.  What do you
> say to defining SVN branches like this:
> 
> A directory is a branch if...
> 1. it is not a subdirectory of an existing branch; and
> 2. either:
> 2a. it is in a list of branches specified by the user, or
> 2b. it is copied from a (subdirectory of a) branch

I think I started with a very similar set of rules...  Looking at my code now I'm having a hard time summarizing them (probably because they evolved with the code, so what started simple morphed into something pretty complicated).  I guess as long as the user has the option to say "no, don't treat this copy as a branch" (or equivalently the Git side of things has a way to say "ignore this branch") these rules would be okay.  But at that point we're back to a list of exceptions -- really we're arguing white-list vs black-list... I eventually chose to go the white-list route for our conversion after starting with black-list (a white-list that still required a few manual edits before manipulating the Git history).  So take that single data point for what it's worth.

> > > Once the format is defined, git import is fairly straightforward.
> > > Proof-of-concept code to follow, but it's really just a wrapper
> > > around git-commit-tree, git-mktag etc.  I wrote this in Perl
> > > thinking it would relate somehow to git-svn, but eventually
> > > realised it didn't and that a few hundred calls to (plumbing)
> > > processes per second isn't so good for performance.  The only
> > > interesting part of the problem is how to tackle SVN tags.  I went
> > > for an ambitious approach, making normal tags where possible and
> > > downgrading them to lightweight tags when necessary.  This does
> > > involve managing something that is effectively a branch in
> > > refs/tags/, but what else is an SVN tag but a branch in the wrong
> > > namespace?
> > 
> > I don't understand how "normal" and "lightweight" apply in this
> > situation? ... In the case of actual content changes in a tag's
> > life, I think it's up to the user to decide between three options:
> > 
> >   1) only retain the last SVN tag
> >   2) tag using the git-svn-style 'tagname@rev' for all but the last
> >   3) Do (2), but move older tags to some hidden namespace
> >      (refs/hidden/tags or the like)
> > 
> > ... In the bidirectional case things get murky (maybe always tag
> > with tagname@rev and hope for tab completion?).
> 
> I didn't explain this particularly well, as it's based largely on the
> vague desire to make update work some day.  Imagine the user does
> this:
> 
> * git svn-pull # get tags/foo, a candidate for an annotated tag
> ... time passes ...
> * git svn-pull # tags/foo has now been updated in another revision
> 
> If we create an annotated tag in step 1, what do we do in step 2?  You
> can't make the tag object the parent of a new revision, so you need to
> do something unpleasant.  The solution I proposed was to convert the
> tag message to a commit message (i.e. pretend a lightweight tag had
> been created all along), then add another commit on top of it and make
> a lightweight tag from the new commit (i.e. treat it like a branch).
> In retrospect that's far too much magic without user involvement - a
> better solution would be to give the user this option along with the
> ones you outlined, and let git-config remember their preference if
> they want.

Okay, that's what I thought you meant (I had classified it as a bidirectional problem; strictly speaking it isn't one, but a one-time migration avoids it).  If you want to continue to update Git from SVN there are two cases to consider:

  1) Each Git repository *only* talks to SVN
  2) The Git repository is cloned for further use 
     (So the chain is something like SVN->Git->Git)

In (1) your lightweight tag solution is probably okay (but I'm pretty sure creating/deleting annotated tags would behave the same way because no one else sees the Git tag object).  In (2) I think there would still be a tag conflict when the upstream Git repo replaces a lightweight tag and the downstream repo attempts to fetch it.  I don't know what the fetch/pull machinery does when there's a lightweight tag conflict (I'm guessing it either bails out or keeps the local one?).  Case (2) motivates me to say always generate (annotated?) tags named tagname@rev so there can be no conflicts.  In that case the only difference I see is if we create an empty Git commit with the tag message plus a lightweight tag or tag the original commit with an annotated tag (I think it's fairly obvious I'm a fan of the latter).
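
To make the naming policies concrete, here is a small Python sketch.  The
function and policy names are invented for illustration, and it only
computes ref names; it says nothing about whether the tags end up
annotated or lightweight:

```python
def tag_refs(tagname, revisions, policy):
    """Map Git ref names to the SVN revisions they should point at.

    revisions lists every SVN revision in which the tag's content changed.
    """
    revs = sorted(revisions)
    if not revs:
        return {}
    older, last = revs[:-1], revs[-1]
    if policy == "always-suffix":      # every version becomes tagname@rev
        return {f"refs/tags/{tagname}@{r}": r for r in revs}
    refs = {f"refs/tags/{tagname}": last}
    if policy == "suffix-older":       # option 2
        refs.update({f"refs/tags/{tagname}@{r}": r for r in older})
    elif policy == "hide-older":       # option 3
        refs.update({f"refs/hidden/tags/{tagname}@{r}": r for r in older})
    elif policy != "last-only":        # option 1 keeps only the last tag
        raise ValueError(f"unknown policy: {policy}")
    return refs


print(tag_refs("foo", [10, 42], "hide-older"))
```

With "always-suffix" a later SVN change to tags/foo only ever adds a new
ref, so a downstream clone never sees an existing tag move.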

> [1] http://en.wikipedia.org/wiki/Full_employment_theorem
> [2] http://dev.catalyst.perl.org/repos/bast/

Thanks,
Stephen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-05 15:27         ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Stephen Bash
  2012-03-05 23:27           ` Approaches to SVN to Git conversion Andrew Sayers
@ 2012-03-06 19:29           ` Nathan Gray
  2012-03-06 20:35             ` Stephen Bash
  2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
  1 sibling, 2 replies; 22+ messages in thread
From: Nathan Gray @ 2012-03-06 19:29 UTC (permalink / raw)
  To: Stephen Bash
  Cc: Andrew Sayers, Jonathan Nieder, Jeff King, git, Sverre Rabbelier,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, David Barr

Hi everyone,

Soon I'm going to be undertaking a migration of a subproject from a
very messy multiproject SVN repo to git, so this is a topic that's
quite near to my heart at the moment.  More inline...

On Mon, Mar 5, 2012 at 7:27 AM, Stephen Bash <bash@genarts.com> wrote:
>
> ----- Original Message -----
>> From: "Andrew Sayers" <andrew-git@pileofstuff.org>
>> Sent: Sunday, March 4, 2012 8:36:41 AM
>> Subject: Re: [RFC] "Remote helper for Subversion" project
>>

[snip]

>> Personally, I think SVN export will always need a strong manual
>> component to get the best results, so I've put quite a bit of work
>> into designing a good SVN history format.  Like git-fast-import, it's
>> an ASCII format designed both for human and machine consumption...
>
> First, I'm very impressed that you managed to get a language like this up and working.  It could prove very useful going forward.  On the flip side, from my experiments over the last year I've actually been leaning toward a solution that is more implicit than explicit.  Taking git-svn as a model, I've been trying to define a mapping system (in Perl):
>
>  my %branch_spec = ( '/trunk/projname' => 'master',
>                      '/branches/*/projname' => '/refs/heads/*' );
>  my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );

The problem of specifying and detecting branches is a major problem in
my upcoming conversion.  We've got toplevel trunk/branches/tags
directories but underneath "branches" it's a free-for-all:

/branches/codenameA/{projectA,projectB,projectC}
/branches/codenameB   (actually a branch of projectA)
/branches/developers/joe/frobnicator-experiment (also a branch of projectA)

Clearly there's no simple regex that's going to capture this, so I'm
reduced to listing every branch of projectA, which is tedious and
error-prone.  However, what *would* work fabulously well for me is
"marker file" detection.  Every copy of projectA has a certain file at
its root.  Let's call it "markerFile.txt".  What I'd really love is a
way to say:

my %branch_markers = ('/branches/**/markerFile.txt' => '/refs/heads/**');

I'm using ** to signify that this may match multiple path components
(sorry, I don't know perl glob syntax).  A branch point is any
revision that creates a new file that matches the marker pattern.

Ideally one could use logical connectives like AND and OR to specify a
set of patterns that could account for marker files changing over the
history of the project, but for my purposes that wouldn't be necessary
-- we've got a well-defined marker that's always present.

For bonus points I'd like to be able to speed things up by excluding
known-bad markers.  Say projectB has a file "badMarker.txt" at its
root and I don't want to import projectB into my new repo.  Maybe I
could specify:

my %branch_spec = (
        '/branches/**/markerFile.txt' => '/refs/heads/**',
        '/branches/**/badMarker.txt' => '!');

I'm assuming that it would be helpful for the script to have this
information (e.g. it could stop recursive searches when badMarker.txt
is found), but maybe that's not the case.
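
To show what I mean, here's a rough sketch in Python (the language I can
actually write).  The function names are made up, and treating `**` as
"one or more path components" and `!` as "exclude" are just the
assumptions from my description above, not anything an existing tool
implements:

```python
import re


def compile_marker(pattern):
    """Turn a marker pattern into a regex; '**' matches one or more components."""
    prefix, suffix = pattern.split("**", 1)
    return re.compile("^" + re.escape(prefix) + "(.+)" + re.escape(suffix) + "$")


def match_branch(path, spec):
    """Return the Git ref for a newly added file, or None if it isn't a wanted marker."""
    for pattern, target in spec.items():
        m = compile_marker(pattern).match(path)
        if m:
            # '!' marks a known-bad marker: stop looking, import nothing.
            return None if target == "!" else target.replace("**", m.group(1))
    return None


spec = {
    "/branches/**/markerFile.txt": "/refs/heads/**",
    "/branches/**/badMarker.txt": "!",
}
print(match_branch(
    "/branches/developers/joe/frobnicator-experiment/markerFile.txt", spec))
print(match_branch("/branches/codenameA/projectB/badMarker.txt", spec))
```

A branch point would then be any revision that adds a path for which
`match_branch` returns a ref.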

I'd welcome any comments or (especially!) code to try out.  ;^)

Cheers,
-Nathan

-- 
http://n8gray.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-06 19:29           ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Nathan Gray
@ 2012-03-06 20:35             ` Stephen Bash
  2012-03-06 23:59               ` [spf:guess] " Sam Vilain
  2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
  1 sibling, 1 reply; 22+ messages in thread
From: Stephen Bash @ 2012-03-06 20:35 UTC (permalink / raw)
  To: Nathan Gray
  Cc: Andrew Sayers, Jonathan Nieder, Jeff King, git, Sverre Rabbelier,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, David Barr



----- Original Message -----
> From: "Nathan Gray" <n8gray@n8gray.org>
> Sent: Tuesday, March 6, 2012 2:29:59 PM
> Subject: Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
> 
> > > Personally, I think SVN export will always need a strong manual
> > > component to get the best results, so I've put quite a bit of work
> > > into designing a good SVN history format.  Like git-fast-import,
> > > it's an ASCII format designed both for human and machine
> > > consumption...
> >
> > First, I'm very impressed that you managed to get a language like
> > this up and working.  It could prove very useful going forward.   On
> > the flip side, from my experiments over the last year I've actually
> > been leaning toward a solution that is more implicit than explicit.
> > Taking git-svn as a model, I've been trying to define a mapping
> > system (in Perl):
> >
> >  my %branch_spec = ( '/trunk/projname' => 'master',
> >                      '/branches/*/projname' => '/refs/heads/*' );
> >  my %tag_spec = ( '/tags/*/projname' => '/refs/tags/*' );
> 
> The problem of specifying and detecting branches is a major problem in
> my upcoming conversion.  We've got toplevel trunk/branches/tags
> directories but underneath "branches" it's a free-for-all:
> 
> /branches/codenameA/{projectA,projectB,projectC}
> /branches/codenameB   (actually a branch of projectA)
> /branches/developers/joe/frobnicator-experiment (also a branch of
> projectA)
> 
> Clearly there's no simple regex that's going to capture this, so I'm
> reduced to listing every branch of projectA, which is tedious and
> error-prone.  However, what *would* work fabulously well for me is
> "marker file" detection.  Every copy of projectA has a certain file at
> its root.  Let's call it "markerFile.txt".  What I'd really love is a
> way to say:
> 
> my %branch_markers = ('/branches/**/markerFile.txt' =>
>                       '/refs/heads/**');

Ooo...  I like it.  I hadn't hit on this idea yet, but it certainly is a very helpful heuristic.  I doubt I'd have any sort of demo code for you in the near future, but it's definitely an idea to roll into the mix.

Thanks,
Stephen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Approaches to SVN to Git conversion
  2012-03-06 19:29           ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Nathan Gray
  2012-03-06 20:35             ` Stephen Bash
@ 2012-03-06 22:34             ` Andrew Sayers
  2012-03-07 15:38               ` Sam Vilain
                                 ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Andrew Sayers @ 2012-03-06 22:34 UTC (permalink / raw)
  To: Nathan Gray
  Cc: Stephen Bash, Jonathan Nieder, Jeff King, git, Sverre Rabbelier,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, David Barr

I've now added a bit of documentation and uploaded my code to github:
https://github.com/andrew-sayers/Proof-of-concept-History-Converter

I haven't attached it here because the code isn't at a stage where it
would be useful to review line-by-line.  Comments are welcome if you
really want to though :)

svn-branch-export.pl makes heavy use of SVN::Dump.  You may want to get
the latest version from github if speed is important to you:
https://github.com/book/SVN-Dump/ - many thanks to Philippe Bruhat for
accepting my performance patch so quickly.

Here are some particular gripes I have with the code I've uploaded:

git-branch-import.pl gets the revision number by parsing out the
"git-svn-id" in commit messages - as I mentioned earlier, I started off
thinking this script would be closely related to git-svn somehow.  In
hindsight it would be better to read revision numbers from the marks
file exported by git-fast-import.
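
For reference, the parsing in question amounts to something like the
following (sketched in Python rather than the Perl the script is written
in; `svn_revision` is a name made up for this example).  It relies on the
trailer git-svn appends to commit messages, "git-svn-id: URL@REV UUID":

```python
import re

# git-svn appends a trailer of the form "git-svn-id: <URL>@<rev> <UUID>".
GIT_SVN_ID = re.compile(r"^git-svn-id: (\S+)@(\d+) (\S+)$", re.MULTILINE)


def svn_revision(message):
    """Return the SVN revision recorded in a commit message, or None."""
    matches = GIT_SVN_ID.findall(message)
    # Use the last trailer in case an earlier message was pasted into the body.
    return int(matches[-1][1]) if matches else None


message = (
    "Fix the frobnicator\n"
    "\n"
    "git-svn-id: http://svn.example.org/repo/trunk@123 "
    "0123abcd-0000-0000-0000-000000000000\n"
)
print(svn_revision(message))
```

The URL and UUID above are placeholders.  Scraping commit messages like
this is fragile, which is part of why the marks file looks like the
better source.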

Branch History Format has some git-specific stuff in the setup section.
 I didn't think about this in too much detail while writing it, but
DVCS-neutrality would be better served by turning these into
command-line options.

As mentioned before, branch detection in svn-branch-export.pl is rather
muddled, as my understanding of the problem evolved significantly while
writing it.

svn-branch-export.pl half-heartedly uses a configure/make/make install
analogy to describe its behaviour - I'm increasingly sure this is
gimmicky and awful, rather than a neat explanatory trick.

svn-branch-export.pl exposes a lot of config values (e.g. "log_style")
that just bulk up the implementation and create space for bugs to creep
in without adding much actual value.  They should be removed.

On 06/03/12 19:29, Nathan Gray wrote:
<snip>
> 
> The problem of specifying and detecting branches is a major problem in
> my upcoming conversion.  We've got toplevel trunk/branches/tags
> directories but underneath "branches" it's a free-for-all:
> 
> /branches/codenameA/{projectA,projectB,projectC}
> /branches/codenameB   (actually a branch of projectA)
> /branches/developers/joe/frobnicator-experiment (also a branch of projectA)
> 
> Clearly there's no simple regex that's going to capture this, so I'm
> reduced to listing every branch of projectA, which is tedious and
> error-prone.  However, what *would* work fabulously well for me is
> "marker file" detection.  Every copy of projectA has a certain file at
> its root.  Let's call it "markerFile.txt".  What I'd really love is a
> way to say:

This is quite close to the implementation I've got.  The SVN exporter
runs in two stages:

In the first stage, the script treats any non-blacklisted file as a
marker file, but only looks for trunk branches.  It looks all through
the history, traces back through the copyfroms, and tries to find the
original directory associated with the file.  Usually it decides that
the only branch without a copyfrom is /trunk.  Searching just for trunks
with this weak heuristic makes it much easier to hand-verify the result.

In the second stage, the script looks through the history again, tracing
the copies of known branches in a slightly less clever way than
described in my previous e-mail.  There's no need for marker files this
time round, as we just assume any `svn cp /trunk
/directory/not/within/a/branch` is a new branch.  In my experiments this
has been a pretty solid way of detecting branches without too much human
input - I might be missing something (or have mis-explained something),
but I'd be interested to hear examples of where this would go wrong.
Having said that, here's a dodgy example I'd like to pre-emptively defend:

	svn add tronk
	svn ci -m "Created trunk" # r1
	svn cp tronk trunk
	svn ci -m "D'oh" # r2
	svn rm tronk
	svn add trunk/markerFile.txt
	svn ci -m "Double d'oh!" # r3

You could argue that the correct branch history description for the
above would be:

	In r3, create branch "trunk"

In other words, ignore everything that happened before the marker file
was created.  However, I would argue the following representation is
more correct:

	In r1, create branch "tronk"
	In r2, create branch "trunk" from "tronk" r1
	In r3, delete branch "tronk"

The branch history format supports the "delete branch" command (remove
the branch entirely) as well as the more common "deactivate branch"
(keep the branch but don't accept any new commits) specifically to deal
with this sort of weirdness.  Creating a branch then deleting it keeps
the r1 revision log intact as part of the "trunk" branch, without
leaving any useless branches lying around.
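
A toy model of that replay, in Python.  This is not the real Branch
History Format or importer; the event tuples and the `replay` function
are invented here just to show why create-then-delete leaves r1 in
trunk's history:

```python
def replay(events):
    """Replay branch events and return the surviving branches.

    Each event is (revision, action, branch, source), where source is
    None or a (source_branch, source_revision) tuple for a copy.
    """
    branches = {}  # name -> {"revs": [...], "active": bool}
    for rev, action, name, source in events:
        if action == "create":
            inherited = []
            if source is not None:
                src_name, src_rev = source
                # A copied branch inherits the source's history up to src_rev.
                inherited = [r for r in branches[src_name]["revs"] if r <= src_rev]
            branches[name] = {"revs": inherited + [rev], "active": True}
        elif action == "deactivate":
            branches[name]["active"] = False   # keep it, refuse new commits
        elif action == "delete":
            del branches[name]                 # remove the branch entirely
        else:
            raise ValueError(f"unknown action: {action}")
    return branches


events = [
    (1, "create", "tronk", None),
    (2, "create", "trunk", ("tronk", 1)),
    (3, "delete", "tronk", None),
]
print(replay(events))
```

Only "trunk" survives, and its history still starts at r1.  Ordinary
commits (such as adding markerFile.txt in r3) are left out of the model.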

	- Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-06 20:35             ` Stephen Bash
@ 2012-03-06 23:59               ` Sam Vilain
  2012-03-07 22:06                 ` Andrew Sayers
  0 siblings, 1 reply; 22+ messages in thread
From: Sam Vilain @ 2012-03-06 23:59 UTC (permalink / raw)
  To: Stephen Bash
  Cc: Nathan Gray, Andrew Sayers, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

On 3/6/12 12:35 PM, Stephen Bash wrote:
>> The problem of specifying and detecting branches is a major problem in
>> my upcoming conversion.  We've got toplevel trunk/branches/tags
>> directories but underneath "branches" it's a free-for-all:
>>
>> /branches/codenameA/{projectA,projectB,projectC}
>> /branches/codenameB   (actually a branch of projectA)
>> /branches/developers/joe/frobnicator-experiment (also a branch of
>> projectA)
>>
>> Clearly there's no simple regex that's going to capture this, so I'm
>> reduced to listing every branch of projectA, which is tedious and
>> error-prone.  However, what *would* work fabulously well for me is
>> "marker file" detection.  Every copy of projectA has a certain file at
>> its root.  Let's call it "markerFile.txt".  What I'd really love is a
>> way to say:
>>
>> my %branch_markers = ('/branches/**/markerFile.txt' =>
>>                        '/refs/heads/**');
>
> Ooo...  I like it.  I hadn't hit on this idea yet, but it certainly is a very helpful heuristic.  I doubt I'd have any sort of demo code for you in the near future, but it's definitely an idea to roll into the mix.

What I did for the Perl Perforce conversion is make this a multi-step
process; first, the heuristic goes through and detects branches and 
merge parents.  Then you do the actual export.  If, however, the 
heuristic gets it wrong, then you can manually override the branch 
detection for a particular revision, which invalidates all of the 
_automatic_ decisions made for later revisions the next time you run it.
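
To make the override behaviour concrete, here is a minimal Python sketch.
It is not git-p4raw's code or schema; the class and method names are
made up, and it assumes decisions are simply keyed by revision number.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionStore:
    """Hypothetical store of branch decisions keyed by revision number."""
    automatic: dict = field(default_factory=dict)  # rev -> branch (heuristic)
    manual: dict = field(default_factory=dict)     # rev -> branch (override)

    def record_automatic(self, rev, branch):
        # A manual override always wins over the heuristic.
        if rev not in self.manual:
            self.automatic[rev] = branch

    def override(self, rev, branch):
        self.manual[rev] = branch
        # Automatic decisions at or after the overridden revision can no
        # longer be trusted, so drop them; the next run recomputes them.
        for stale in [r for r in self.automatic if r >= rev]:
            del self.automatic[stale]

    def lookup(self, rev):
        return self.manual.get(rev, self.automatic.get(rev))
```

After an override at revision N, earlier automatic decisions survive, and
anything from N onwards has to be re-derived by the next heuristic run.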

Even with all of the information in Postgres, and much of the hard work 
pushed into the Postgres engine, and Postgres tuned for OLAP, this was 
the slowest part of the operation for a 30,000-odd revision Perforce
repository.

The manual input is extremely useful for bespoke conversions; there will 
always be warts in the history and no heuristic is perfect (even if you 
can supply your own set of expressions, a way to override it for just 
one revision is handy).

Just to recap, the steps in git-p4raw are:

* load metadata (git-p4raw load ; git-p4raw check)
* load blobs (git-p4raw export-blobs)
* find project roots (git-p4raw find-branches)

   Project root decisions can be overridden, in git-p4raw this was 
through a DB insert, but all this consisted of was inserting (revision, 
branch) tuples into the appropriate table so a front-end would be
trivial.  As you suggest, a custom heuristic is also an option but the 
most flexible solution is just being able to override the decisions made 
for a particular revision.

* detect project merges (also done by git-p4raw find-branches)

Detecting merge parents used a heuristic based on the per-file
integration records and a computation based on an internal diff-tree
which produced a list of files that would have needed resolving.  This
one I actually used enough to bother implementing a front-end for:

   git-p4raw graft REV PARENT PARENT

Where 'PARENT' could be another project root (revision/branch location), 
or it could be a git commit ID (for the inevitable occasion where you 
need to manually graft on some history).  This interface allows you to 
do several things:

   1. mark a merge which was not recorded correctly in history
   2. un-mark a merge which was detected/recorded incorrectly
   3. skip bad sections of history, for instance squash merging merges 
which happened over several commits (SVN and Perforce, of course, 
support insane piecemeal merging prohibited by git)

* the actual fast-import exporter.

   git-p4raw export-commits 1..5000

There was also an important reverse operation:

   git-p4raw unexport-commits 2500

This moved all of the exported refs backwards and deleted the ones which
didn't exist at revision 2500.
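
As an illustration of the rewind idea only (this is not how git-p4raw
implements it), a Python sketch could look like the following.  It
assumes a hypothetical log that records, per ref, every (revision,
commit) pair the exporter has written, in ascending revision order.

```python
def rewind_refs(ref_log, target_rev):
    """Return the ref -> commit mapping as it stood at target_rev.

    ref_log maps a ref name to a list of (revision, commit_id) pairs in
    ascending revision order.  Refs with no entry at or before
    target_rev are omitted, i.e. deleted by the rewind.
    """
    rewound = {}
    for ref, entries in ref_log.items():
        kept = [commit for rev, commit in entries if rev <= target_rev]
        if kept:
            rewound[ref] = kept[-1]
    return rewound
```

A real tool would also have to cope with refs that were deleted before
the target revision, which this sketch ignores.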

Once the data has been mined, the actual exporting can proceed very 
fast.  E.g., on my laptop I could easily be topping 300 commits per second
which makes for a nice export/examine/rewind/adjust cycle.

For more information,

   git clone git://github.com/samv/git-p4raw
   cd git-p4raw
   perldoc git-p4raw

The "Game plan." section of the POD is particularly relevant.  Remember 
that SVN is very similar to Perforce in virtually all of its design 
details so this tool, its database schema, and implementation are all 
very relevant to the design of the new svn-fe importer.

Sam


* Re: Approaches to SVN to Git conversion
  2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
@ 2012-03-07 15:38               ` Sam Vilain
  2012-03-07 20:28                 ` Andrew Sayers
  2012-03-07 22:33               ` Phil Hord
  2012-03-07 23:08               ` Nathan Gray
  2 siblings, 1 reply; 22+ messages in thread
From: Sam Vilain @ 2012-03-07 15:38 UTC (permalink / raw)
  To: Andrew Sayers
  Cc: Nathan Gray, Stephen Bash, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

On 3/6/12 2:34 PM, Andrew Sayers wrote:
> I've now added a bit of documentation and uploaded my code to github:
> https://github.com/andrew-sayers/Proof-of-concept-History-Converter
>
> I haven't attached it here because the code isn't at a stage where it
> would be useful to review line-by-line.  Comments are welcome if you
> really want to though :)

I just took a look at your readme—did you consider writing the tool to 
work against an svn-fe import, rather than using SVN::Dump? Do you think 
it could be adjusted to be like that?

Sam


* Re: Approaches to SVN to Git conversion
  2012-03-07 15:38               ` Sam Vilain
@ 2012-03-07 20:28                 ` Andrew Sayers
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Sayers @ 2012-03-07 20:28 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nathan Gray, Stephen Bash, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

On 07/03/12 15:38, Sam Vilain wrote:
> On 3/6/12 2:34 PM, Andrew Sayers wrote:
>> I've now added a bit of documentation and uploaded my code to github:
>> https://github.com/andrew-sayers/Proof-of-concept-History-Converter
>>
>> I haven't attached it here because the code isn't at a stage where it
>> would be useful to review line-by-line.  Comments are welcome if you
>> really want to though :)
> 
> I just took a look at your readme—did you consider writing the tool to
> work against an svn-fe import, rather than using SVN::Dump? Do you think
> it could be adjusted to be like that?

I did consider writing svn-branch-export.pl against a branch created by
svn-fe, but right now it doesn't provide enough information to do a good
job (e.g. copyfrom properties).  I understand that support is in the
works, but this project is more about getting a scrappy end-to-end
solution so we can see what the issues are (is there any demand for
DVCS-neutral SVN history export?  What are the hard cases and how do you
represent them?).  I'm keen to make sure that documentation and tests
are done in such a way that a future git-based exporter could use them
without relying on any of the actual code.

I also considered writing git-branch-import.pl against the raw svn-fe
output.  As well as the technical issues with this approach, I felt like
these were better tackled as orthogonal problems.  Producing an accurate
representation of the SVN history is a very different problem to
producing a user-friendly representation, and separating those concerns
seems like it will make life easier down the line.  For example, a
user-friendly representation might convert svn:ignore properties to
.gitignore files, but that would make bidirection hard to implement
without an accurate representation in the middle.

	- Andrew


* Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-06 23:59               ` [spf:guess] " Sam Vilain
@ 2012-03-07 22:06                 ` Andrew Sayers
  2012-03-07 23:15                   ` [spf:guess,iffy] " Sam Vilain
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Sayers @ 2012-03-07 22:06 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Stephen Bash, Nathan Gray, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

It sounds like we've approached two similar problems in similar ways, so
I'm curious about the differences where they exist.  I've been reading
this message of yours from 18 months ago alongside this thread:
http://article.gmane.org/gmane.comp.version-control.git/150007
Unfortunately these comprise everything I know about Perforce.

I notice that git-p4raw stores all of its data in Postgres and provides
a programmatic interface for querying it, whereas I've focussed on
providing ASCII interfaces at relevant points.  I can see how a DB store
would help manage the amount of data you'd need to process in a big
repository, but were there any other issues that drove you down this
route?  Did you consider a text-based interface?

On 06/03/12 23:59, Sam Vilain wrote:
<snip>
> What I did for the Perl Perforce conversion is make this a multi-step
> process; first, the heuristic goes through and detects branches and
> merge parents.  Then you do the actual export.  If, however, the
> heuristic gets it wrong, then you can manually override the branch
> detection for a particular revision, which invalidates all of the
> _automatic_ decisions made for later revisions the next time you run it.

Could you give an example of overriding branch/merge detection?  It
sounds like you're saying that if there's some problem detecting merge
parents in an early revision, then all future merges are ignored by the
script.

<snip>
> The manual input is extremely useful for bespoke conversions; there will
> always be warts in the history and no heuristic is perfect (even if you
> can supply your own set of expressions, a way to override it for just
> one revision is handy).

Again, would you mind providing a few examples?  It sounds like you have
some edge cases that could be handled by extending the branch history
format, but I'd like to pin it down a bit more before discussing solutions.

<snip>
>   3. skip bad sections of history, for instance squash merging merges
> which happened over several commits (SVN and Perforce, of course,
> support insane piecemeal merging prohibited by git)

This is an excellent point I've stumbled past in my experiments without
realising what I was seeing.  A simple SVN example might look like this:

	svn add trunk branches
	svn add trunk/foo trunk/bar
	svn ci -m "Initial revision" # r1

	svn cp trunk branches/my_branch
	svn ci -m "Created my_branch" # r2

	# edit files in my_branch

	svn merge branches/my_branch/foo trunk/foo
	svn ci -m "Merge my_branch -> trunk (1/3)" # r11

	svn merge branches/my_branch/bar trunk/bar
	svn ci -m "Merge my_branch -> trunk (2/3)" # r12

	svn cp branches/my_branch/new_file trunk/new_file
	svn ci -m "Merge my_branch -> trunk (3/3)" # r13

This strikes me as a sensibly cautious workflow in SVN, where merge
conflicts are common and changes are hard to revert.  The best
representation for this in the current branch history format would be
something like this:

	In r1, create branch "trunk"
	In r2, create branch "branches/my_branch" from "trunk"
	In r13, merge "branches/my_branch" r13 into "trunk"

In other words, pretend r11 and r12 are just normal commits, and that
r13 is a full merge.  A more useful (and arguably more accurate)
representation would be possible if we extended the format a bit:

	In r1, create branch "trunk"
	In r2, create branch "branches/my_branch" from "trunk"
	In r12, squash changes in "branches/my_branch"
	In r13, squash changes in "branches/my_branch"
	In r13, merge "branches/my_branch" r13 into "trunk"

Adding "squash" and "fixup" commands would let us represent the whole
messy business as a single commit, which is closer to what the user was
trying to say even if it's further from what they actually had to say.
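
For anyone who wants to experiment with these descriptions, here is a
small Python sketch that parses the line shapes shown above.  It is only
a guess at a grammar based on these examples: the regular expressions,
the tuple layout and the function name are all assumptions, and the
proposed "fixup" command is left out because no example of it appears
here.

```python
import re

# One pattern per line shape that appears in the examples above.
PATTERNS = [
    ("create", re.compile(
        r'In r(\d+), create branch "([^"]+)"'
        r'(?: from "([^"]+)"(?: r(\d+))?)?$')),
    ("merge", re.compile(
        r'In r(\d+), merge "([^"]+)" r(\d+) into "([^"]+)"$')),
    ("squash", re.compile(
        r'In r(\d+), squash changes in "([^"]+)"$')),
]


def parse_line(line):
    """Return (kind, *captured_fields) for one history line."""
    line = line.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Optional fields that are absent come back as None.
            return (kind,) + match.groups()
    raise ValueError(f"unrecognised line: {line!r}")
```

Revision numbers are returned as strings, and unknown commands raise an
error rather than being skipped, so a typo in a hand-edited file is
noticed early.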

	- Andrew


* Re: Approaches to SVN to Git conversion
  2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
  2012-03-07 15:38               ` Sam Vilain
@ 2012-03-07 22:33               ` Phil Hord
  2012-03-07 23:08               ` Nathan Gray
  2 siblings, 0 replies; 22+ messages in thread
From: Phil Hord @ 2012-03-07 22:33 UTC (permalink / raw)
  To: Andrew Sayers
  Cc: Nathan Gray, Stephen Bash, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	Sam Vilain, David Barr

On Tue, Mar 6, 2012 at 5:34 PM, Andrew Sayers
<andrew-git@pileofstuff.org> wrote:
> This is quite close to the implementation I've got.  The SVN exporter
> runs in two stages:
>
> In the first stage, the script treats any non-blacklisted file as a
> marker file, but only looks for trunk branches.  It looks all through
> the history, traces back through the copyfroms, and tries to find the
> original directory associated with the file.  Usually it decides that
> the only branch without a copyfrom is /trunk.  Searching just for trunks
> with this weak heuristic makes it much easier to hand-verify the result.
>
> In the second stage, the script looks through the history again, tracing
> the copies of known branches in a slightly less clever way than
> described in my previous e-mail.  There's no need for marker files this
> time round, as we just assume any `svn cp /trunk
> /directory/not/within/a/branch` is a new branch.  In my experiments this
> has been a pretty solid way of detecting branches without too much human
> input - I might be missing something (or have mis-explained something),
> but I'd be interested to hear examples of where this would go wrong.

I think what you're describing would work perfectly for my weird svn
repo.  I have branches named like this:

branches/developer/hordp/foo
branches/developer/hordp/bar
etc.

Since these were created with 'svn cp' originally, they would be
properly considered branches by your algorithm, right?    If so,
sweet!

> Having said that, here's a dodgy example I'd like to pre-emptively defend:
>
>        svn add tronk
>        svn ci -m "Created trunk" # r1
>        svn cp tronk trunk
>        svn ci -m "D'oh" # r2
>        svn rm tronk
>        svn add trunk/markerFile.txt
>        svn ci -m "Double d'oh!" # r3
>
> You could argue that the correct branch history description for the
> above would be:
>
>        In r3, create branch "trunk"
>
> In other words, ignore everything that happened before the marker file
> was created.  However, I would argue the following representation is
> more correct:
>
>        In r1, create branch "tronk"
>        In r2, create branch "trunk" from "tronk" r1
>        In r3, delete branch "tronk"
>

I prefer your interpretation. It doesn't look dodgy at all.

Phil


* Re: Approaches to SVN to Git conversion
  2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
  2012-03-07 15:38               ` Sam Vilain
  2012-03-07 22:33               ` Phil Hord
@ 2012-03-07 23:08               ` Nathan Gray
  2012-03-07 23:32                 ` Andrew Sayers
  2 siblings, 1 reply; 22+ messages in thread
From: Nathan Gray @ 2012-03-07 23:08 UTC (permalink / raw)
  To: Andrew Sayers
  Cc: Stephen Bash, Jonathan Nieder, Jeff King, git, Sverre Rabbelier,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, David Barr

On Tue, Mar 6, 2012 at 2:34 PM, Andrew Sayers
<andrew-git@pileofstuff.org> wrote:
[snip]
> On 06/03/12 19:29, Nathan Gray wrote:
> <snip>
>>
>> The problem of specifying and detecting branches is a major problem in
>> my upcoming conversion.  We've got toplevel trunk/branches/tags
>> directories but underneath "branches" it's a free-for-all:
>>
>> /branches/codenameA/{projectA,projectB,projectC}
>> /branches/codenameB   (actually a branch of projectA)
>> /branches/developers/joe/frobnicator-experiment (also a branch of projectA)
>>
>> Clearly there's no simple regex that's going to capture this, so I'm
>> reduced to listing every branch of projectA, which is tedious and
>> error-prone.  However, what *would* work fabulously well for me is
>> "marker file" detection.  Every copy of projectA has a certain file at
>> its root.  Let's call it "markerFile.txt".  What I'd really love is a
>> way to say:
>
> This is quite close to the implementation I've got.  The SVN exporter
> runs in two stages:
>
> In the first stage, the script treats any non-blacklisted file as a
> marker file, but only looks for trunk branches.  It looks all through
> the history, traces back through the copyfroms, and tries to find the
> original directory associated with the file.  Usually it decides that
> the only branch without a copyfrom is /trunk.  Searching just for trunks
> with this weak heuristic makes it much easier to hand-verify the result.

I'm not sure I understand.  So if I have /trunk/projectA and
/trunk/projectB then do I have to blacklist /trunk/projectB to extract
only projectA's history?  Assuming it's always lived there will your
code detect /trunk/projectA as the "trunk?"  Would it be possible to
specify /trunk/projectA directly instead of blacklisting everything
else?

> In the second stage, the script looks through the history again, tracing
> the copies of known branches in a slightly less clever way than
> described in my previous e-mail.  There's no need for marker files this
> time round, as we just assume any `svn cp /trunk
> /directory/not/within/a/branch` is a new branch.  In my experiments this
> has been a pretty solid way of detecting branches without too much human
> input - I might be missing something (or have mis-explained something),
> but I'd be interested to hear examples of where this would go wrong.

That sounds pretty good, but it should probably also be transitive,
i.e. `svn cp /any/known/branch/root /some/new/path` is also a new
branch.  Sometimes we'll spin off hotfix branches from release
branches, for example.
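
To spell out what I mean by transitive, here is a rough Python sketch.
It is not taken from Andrew's script; the function names and the
(revision, source, destination) tuple format for copy records are
assumptions made for illustration.

```python
def is_within(path, roots):
    """True if path equals, or lies underneath, any of the given roots."""
    return any(path == root or path.startswith(root.rstrip("/") + "/")
               for root in roots)


def detect_branches(trunks, copies):
    """Track branches created by copying a known branch root.

    copies is an iterable of (revision, source_path, dest_path) tuples
    in revision order.  Returns a dict mapping each branch root to
    (parent_root, revision), or to None for the trunks themselves.
    """
    branches = {trunk: None for trunk in trunks}
    for rev, src, dst in copies:
        # A copy of any known branch root to a path outside every known
        # branch is a new branch, so branches of branches are found too.
        if src in branches and not is_within(dst, branches):
            branches[dst] = (src, rev)
    return branches
```

With this, a hotfix branch copied from a release branch is picked up
because the release branch is already a known root by the time the
hotfix copy is seen.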

I'll have to give your code a try and see how it works.

Cheers,
-n8

-- 
http://n8gray.org


* Re: [spf:guess,iffy] Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-07 22:06                 ` Andrew Sayers
@ 2012-03-07 23:15                   ` Sam Vilain
  2012-03-08 20:51                     ` Andrew Sayers
  0 siblings, 1 reply; 22+ messages in thread
From: Sam Vilain @ 2012-03-07 23:15 UTC (permalink / raw)
  To: Andrew Sayers
  Cc: Stephen Bash, Nathan Gray, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

On 3/7/12 2:06 PM, Andrew Sayers wrote:
> It sounds like we've approached two similar problems in similar ways, so
> I'm curious about the differences where they exist.  I've been reading
> this message of yours from 18 months ago alongside this thread:
> http://article.gmane.org/gmane.comp.version-control.git/150007
> Unfortunately these comprise everything I know about Perforce.

Right, I went into more detail back then than I did with my more recent 
message.

> I notice that git-p4raw stores all of its data in Postgres and provides
> a programmatic interface for querying it, whereas I've focussed on
> providing ASCII interfaces at relevant points.  I can see how a DB store
> would help manage the amount of data you'd need to process in a big
> repository, but were there any other issues that drove you down this
> route?  Did you consider a text-based interface?

I wrote it like this mostly because the source metadata was already in a 
tabular form.  It allowed me to load the data, and then convert 
deductions I could make of the data into unique and foreign key 
constraints.  It provided me with ACID semantics to make it so that if 
my program ran and failed the changes would not be applied.  Despite the 
popular opinion of "web-scale" technologists, databases do have large
advantages over unstructured hierarchical data :-).
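
A tiny illustration of the "failed run leaves nothing behind" point, in
Python.  This is not git-p4raw code: sqlite3 stands in for Postgres
purely so the sketch is self-contained, and the table and function names
are invented.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical decisions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE branch_decision ("
        " revision INTEGER PRIMARY KEY,"
        " branch TEXT NOT NULL)")
    return conn


def load_decisions(conn, rows):
    """Insert (revision, branch) rows as a single all-or-nothing step."""
    # Used as a context manager, the connection commits on success and
    # rolls back if an exception (e.g. a constraint violation) escapes.
    with conn:
        conn.executemany(
            "INSERT INTO branch_decision (revision, branch) VALUES (?, ?)",
            rows)
```

If any row in a batch violates the primary key, the whole batch is rolled
back, so a half-applied run never has to be cleaned up by hand.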

I didn't really intend to provide a programmatic interface, that was a 
set of user tools.  The SQL store is the programmatic interface :)

>> What I did for the Perl Perforce conversion is make this a multi-step
>> process; first, the heuristic goes through and detects branches and
>> merge parents.  Then you do the actual export.  If, however, the
>> heuristic gets it wrong, then you can manually override the branch
>> detection for a particular revision, which invalidates all of the
>> _automatic_ decisions made for later revisions the next time you run it.
>
> Could you give an example of overriding branch/merge detection?  It
> sounds like you're saying that if there's some problem detecting merge
> parents in an early revision, then all future merges are ignored by the
> script.

The wrong decision can make things much worse down the line.  With the 
Perl history, the repository was about 350MB of pack, until I got the 
merge history correct.  Afterwards, it packed down to about 70MB.  This 
is because there was a lot of criss-cross merging, and by marking them
correctly git's repack algorithm was more able to locate similar blobs 
and compress correctly.  The pack size was not the goal, but a good 
verification that I had brought the correct commits together in history.

The bigger problems range from thinking changes were merged into your
branch when they really weren't to, depending on how branch detection
etc. works, getting thrown off completely and emitting garbage branch
histories.  So, it does help to be able to "rewind" the heuristics, poke
information in and then resume again and see if things are improved.
The information could be inserted into a single file which configures
the entire import, and also serves as a set of notes as to the
amendments carried out.  I was happy with a database dump :-).

> <snip>
>> The manual input is extremely useful for bespoke conversions; there will
>> always be warts in the history and no heuristic is perfect (even if you
>> can supply your own set of expressions, a way to override it for just
>> one revision is handy).
>
> Again, would you mind providing a few examples?  It sounds like you have
> some edge cases that could be handled by extending the branch history
> format, but I'd like to pin it down a bit more before discussing solutions.

There are a few:

* a branch contains a subproject and is merged into a subtree
* someone puts a "README" or similar file in a funny place, which isn't 
inside a project root
* someone starts a project with no files in its root directory
* someone records a merge incorrectly (or using a young or middle-aged
SVN which didn't record merges).  You don't want your annotate to hit a 
merge commit which isn't recorded as a merge, and then have to go 
hunting around in history for the real origin of a line of code
* the piecemeal merge case you have seen yourself.

It's just very useful to be able to reparent during the data mining stage.

> <snip>
>>    3. skip bad sections of history, for instance squash merging merges
>> which happened over several commits (SVN and Perforce, of course,
>> support insane piecemeal merging prohibited by git)
>
> This is an excellent point I've stumbled past in my experiments without
> realising what I was seeing.  A simple SVN example might look like this:
>
> 	svn add trunk branches
> 	svn add trunk/foo trunk/bar
> 	svn ci -m "Initial revision" # r1
>
> 	svn cp trunk branches/my_branch
> 	svn ci -m "Created my_branch" # r2
>
> 	# edit files in my_branch
>
> 	svn merge branches/my_branch/foo trunk/foo
> 	svn ci -m "Merge my_branch ->  trunk (1/3)" # r11
>
> 	svn merge branches/my_branch/bar trunk/bar
> 	svn ci -m "Merge my_branch ->  trunk (2/3)" # r12
>
> 	svn cp branches/my_branch/new_file trunk/new_file
> 	svn ci -m "Merge my_branch ->  trunk (3/3)" # r13
>
> This strikes me as a sensibly cautious workflow in SVN, where merge
> conflicts are common and changes are hard to revert.  The best
> representation for this in the current branch history format would be
> something like this:
>
> 	In r1, create branch "trunk"
> 	In r2, create branch "branches/my_branch" from "trunk"
> 	In r13, merge "branches/my_branch" r13 into "trunk"
>
> In other words, pretend r11 and r12 are just normal commits, and that
> r13 is a full merge.  A more useful (and arguably more accurate)
> representation would be possible if we extended the format a bit:
>
> 	In r1, create branch "trunk"
> 	In r2, create branch "branches/my_branch" from "trunk"
> 	In r12, squash changes in "branches/my_branch"
> 	In r13, squash changes in "branches/my_branch"
> 	In r13, merge "branches/my_branch" r13 into "trunk"
>
> Adding "squash" and "fixup" commands would let us represent the whole
> messy business as a single commit, which is closer to what the user was
> trying to say even if it's further from what they actually had to say.

Right, you see the problem.

I think your text syntax is fine so long as it is precise enough, and
similar to what I mentioned earlier in this e-mail about having a single
file to drive a conversion run.  That really is the kind of input data
that I had, it's just that I set it up as a useful set of commands.

Sam.


* Re: Approaches to SVN to Git conversion
  2012-03-07 23:08               ` Nathan Gray
@ 2012-03-07 23:32                 ` Andrew Sayers
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Sayers @ 2012-03-07 23:32 UTC (permalink / raw)
  To: Nathan Gray
  Cc: Stephen Bash, Jonathan Nieder, Jeff King, git, Sverre Rabbelier,
	Dmitry Ivankov, Ramkumar Ramachandra, Sam Vilain, David Barr

On 07/03/12 23:08, Nathan Gray wrote:
<snip>
> 
> I'm not sure I understand.  So if I have /trunk/projectA and
> /trunk/projectB then do I have to blacklist /trunk/projectB to extract
> only projectA's history?  Assuming it's always lived there will your
> code detect /trunk/projectA as the "trunk?"  Would it be possible to
> specify /trunk/projectA directly instead of blacklisting everything
> else?

Please do try it, but the process should go something like this for you:

1. run the SVN export "configure" stage - this reads through your repo
   and suggests two trunks - "/trunk/projectA" and "/trunk/projectB".
   You can explicitly ignore "/trunk/projectB", but at present there's
   no way to ignore trunks by default.  No particular reason, I just
   hadn't thought to add it :)

2. run the SVN export "make" stage - this looks through whichever
   trunks you've specified, and tracks the branches coming from it.  I
   didn't explain this correctly in my previous e-mail, but yes this is
   transitive - branches from branches from branches from trunk are
   tracked in the appropriate way.

3. edit the file created in stage 2.  If you wanted to ignore a
   specific branch from (a branch from...) trunk/projectA, your best
   bet is to exercise your text-fu on this file

4. Import the history into git

I'll be interested to hear how you and Phil get on, as it sounds like
yes this approach should work for both of your repos.

	- Andrew


* Re: [spf:guess,iffy] Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
  2012-03-07 23:15                   ` [spf:guess,iffy] " Sam Vilain
@ 2012-03-08 20:51                     ` Andrew Sayers
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Sayers @ 2012-03-08 20:51 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Stephen Bash, Nathan Gray, Jonathan Nieder, Jeff King, git,
	Sverre Rabbelier, Dmitry Ivankov, Ramkumar Ramachandra,
	David Barr

Thanks - this has really helped my thoughts to crystallise.
Here's my plan at this point:

1. Create an "SVN History description" project

This will build on the ASCII format I've been proposing so far.  The
goal will be to produce a human- and machine-readable format that
describes SVN history in terms of an idealised version control system;
and to produce a set of tests that any SVN history exporter can use as a
testing framework.

2. Create an SVN history exporter

This will build on the svn-branch-export.pl script I previously made
available.  That script ran in exactly two passes ("configure" and
"make") which wrote pointlessly different file formats.  The new script
will accept an SVN history file as input and create another SVN history
file as output, allowing users to iteratively improve the file as Sam
described.

3. Create an SVN history importer for git

This will resemble the git-branch-import.pl script I previously made
available, but written in C based on the final SVN history format.  This
thread has convinced me this would be a nice little SoC project, and
I'll propose it in another thread if I've got project 1 to a reasonable
state before it's too late.  Failing that, and if nobody else wants to
take this project, I'll have a go myself some day when project 2 is
approaching completion.

My next step will be to write up the SVN history work thrown up by this
thread.  I'll come back to the list for advice when I've got something
presentable.

	- Andrew


* Re: [RFC] "Remote helper for Subversion" project
  2012-03-04  7:54   ` Jonathan Nieder
  2012-03-04 10:37     ` David Barr
@ 2012-03-27  3:58     ` Ramkumar Ramachandra
  1 sibling, 0 replies; 22+ messages in thread
From: Ramkumar Ramachandra @ 2012-03-27  3:58 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: David Barr, Jeff King, git, Sverre Rabbelier, Dmitry Ivankov,
	Sam Vilain, Stephen Bash

Hi,

Jonathan Nieder wrote:
> David Barr wrote:
>> On Sat, Mar 3, 2012 at 11:27 PM, David Barr <davidbarr@google.com> wrote:
>>> +
>>> +* Getting an Git-to-SVN converter merged.
>
> Probably could fill a summer in itself.  In previous starts I think
> there was some complexity creep. :/
>
>  http://thread.gmane.org/gmane.comp.version-control.git/170290
>  http://thread.gmane.org/gmane.comp.version-control.git/170551

I've been meaning to finish this off for some time now - probably as an
SoC project this summer?

>>> +
>>> +* Building the remote helper itself.
>>> +
>>> +Goal: Build a full-featured bi-directional `git-remote-svn` and get it
>>> +      merged into upstream Git.
>
> Sure would be neat. ;-)  Another nice piece to build would be branch
> tracking / follow_parent heuristics.

This doesn't sound awfully complicated; if the Git -> SVN converter
gets merged early on, I might get a chance to work on this as well.

    Ram


end of thread, other threads:[~2012-03-27  3:59 UTC | newest]

Thread overview: 22+ messages
2012-03-03 12:27 [RFC] "Remote helper for Subversion" project David Barr
2012-03-03 12:41 ` David Barr
2012-03-04  7:54   ` Jonathan Nieder
2012-03-04 10:37     ` David Barr
2012-03-04 13:36       ` Andrew Sayers
2012-03-05 15:27         ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Stephen Bash
2012-03-05 23:27           ` Approaches to SVN to Git conversion Andrew Sayers
2012-03-06 14:36             ` Stephen Bash
2012-03-06 19:29           ` Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project) Nathan Gray
2012-03-06 20:35             ` Stephen Bash
2012-03-06 23:59               ` [spf:guess] " Sam Vilain
2012-03-07 22:06                 ` Andrew Sayers
2012-03-07 23:15                   ` [spf:guess,iffy] " Sam Vilain
2012-03-08 20:51                     ` Andrew Sayers
2012-03-06 22:34             ` Approaches to SVN to Git conversion Andrew Sayers
2012-03-07 15:38               ` Sam Vilain
2012-03-07 20:28                 ` Andrew Sayers
2012-03-07 22:33               ` Phil Hord
2012-03-07 23:08               ` Nathan Gray
2012-03-07 23:32                 ` Andrew Sayers
2012-03-04 16:23       ` [RFC] "Remote helper for Subversion" project Jonathan Nieder
2012-03-27  3:58     ` Ramkumar Ramachandra
