git@vger.kernel.org mailing list mirror (one of many)
From: Sam Vilain <sam@vilain.net>
To: Andrew Sayers <andrew-git@pileofstuff.org>
Cc: Stephen Bash <bash@genarts.com>, Nathan Gray <n8gray@n8gray.org>,
	Jonathan Nieder <jrnieder@gmail.com>, Jeff King <peff@peff.net>,
	git@vger.kernel.org, Sverre Rabbelier <srabbelier@gmail.com>,
	Dmitry Ivankov <divanorama@gmail.com>,
	Ramkumar Ramachandra <artagnon@gmail.com>,
	David Barr <davidbarr@google.com>
Subject: Re: [spf:guess,iffy] Re: [spf:guess] Re: Approaches to SVN to Git conversion (was: Re: [RFC] "Remote helper for Subversion" project)
Date: Wed, 07 Mar 2012 15:15:16 -0800
Message-ID: <4F57EC04.8060705@vilain.net>
In-Reply-To: <4F57DBF0.4060101@pileofstuff.org>

On 3/7/12 2:06 PM, Andrew Sayers wrote:
> It sounds like we've approached two similar problems in similar ways, so
> I'm curious about the differences where they exist.  I've been reading
> this message of yours from 18 months ago alongside this thread:
> http://article.gmane.org/gmane.comp.version-control.git/150007
> Unfortunately these comprise everything I know about Perforce.

Right, I went into more detail back then than I did with my more recent 
message.

> I notice that git-p4raw stores all of its data in Postgres and provides
> a programmatic interface for querying it, whereas I've focussed on
> providing ASCII interfaces at relevant points.  I can see how a DB store
> would help manage the amount of data you'd need to process in a big
> repository, but were there any other issues that drove you down this
> route?  Did you consider a text-based interface?

I wrote it like this mostly because the source metadata was already in
tabular form.  It allowed me to load the data, and then turn the
deductions I could make about the data into unique and foreign key
constraints.  It also gave me ACID semantics, so that if my program ran
and failed, none of its changes would be applied.  Despite the popular
opinion of "web-scale" technologists, databases do have large
advantages over unstructured hierarchical data :-).
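
A minimal sketch of the constraint-plus-transaction idea described above.
It is not the git-p4raw schema: the table and column names are
hypothetical, and Python's built-in sqlite3 module stands in for Postgres
so that the example is self-contained.

```python
import sqlite3


def build_store():
    """Create an in-memory store whose constraints encode two deductions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript("""
        CREATE TABLE revision (
            revnum INTEGER PRIMARY KEY
        );
        CREATE TABLE branch_point (
            -- deduction 1: a branch point must refer to a known revision
            revnum INTEGER NOT NULL REFERENCES revision(revnum),
            branch TEXT NOT NULL,
            -- deduction 2: a branch is created at most once per revision
            UNIQUE (revnum, branch)
        );
    """)
    return conn


def record_branch_points(conn, rows):
    """Insert every (revnum, branch) row, or none of them if any row is invalid."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO branch_point (revnum, branch) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

If any row in a batch breaks a constraint, the whole batch is rolled back,
so a failed run leaves the store exactly as it was before the run started.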

I didn't really intend to provide a programmatic interface; that was a
set of user tools.  The SQL store is the programmatic interface :)

>> What I did for the Perl Perforce conversion is make this a multi-step
>> process; first, the heuristic goes through and detects branches and
>> merge parents.  Then you do the actual export.  If, however, the
>> heuristic gets it wrong, then you can manually override the branch
>> detection for a particular revision, which invalidates all of the
>> _automatic_ decisions made for later revisions the next time you run it.
>
> Could you give an example of overriding branch/merge detection?  It
> sounds like you're saying that if there's some problem detecting merge
> parents in an early revision, then all future merges are ignored by the
> script.

The wrong decision can make things much worse down the line.  With the
Perl history, the repository was about 350MB of pack, until I got the
merge history correct.  Afterwards, it packed down to about 70MB.  This
is because there was a lot of criss-cross merging, and once the merges
were marked correctly git's repack algorithm was better able to locate
similar blobs and compress them.  The pack size was not the goal, but a
good verification that I had brought the correct commits together in
history.

The bigger problems range from thinking changes are merged into your
branch which really weren't, to (depending on how branch detection etc.
works) getting thrown off completely and emitting garbage branch
histories.  So, it does help to be able to "rewind" the heuristics, poke
information in, and then resume and see if things have improved.
The information could be kept in a single file which configures the
entire import, and also serves as a set of notes on the amendments
carried out.  I was happy with a database dump :-).
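
The "rewind, poke, resume" cycle can be sketched roughly as follows.  This
is an illustration only, not the git-p4raw code: the Decisions class, its
method names and the guess callback are all hypothetical.  The point it
shows is that a manual override for one revision discards the automatic
decisions for that revision and every later one, so the next run redoes
them with the corrected information.

```python
from dataclasses import dataclass, field


@dataclass
class Decisions:
    manual: dict = field(default_factory=dict)     # revision -> parent, entered by hand
    automatic: dict = field(default_factory=dict)  # revision -> parent, guessed by heuristic

    def resolved(self):
        """All decisions made so far; manual entries take precedence."""
        return {**self.automatic, **self.manual}

    def override(self, rev, parent):
        """Record a manual decision and rewind automatic ones from rev onward."""
        self.manual[rev] = parent
        self.automatic = {r: p for r, p in self.automatic.items() if r < rev}

    def run(self, revisions, guess):
        """Fill in any missing decisions in revision order, then return them all."""
        for rev in sorted(revisions):
            if rev in self.manual or rev in self.automatic:
                continue  # already decided: resume after this point
            self.automatic[rev] = guess(rev, self.resolved())
        return self.resolved()
```

Here guess(rev, known) is whatever heuristic the conversion uses; it sees
the decisions already made for earlier revisions, which is why a single
early correction can change everything that follows.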

> <snip>
>> The manual input is extremely useful for bespoke conversions; there will
>> always be warts in the history and no heuristic is perfect (even if you
>> can supply your own set of expressions, a way to override it for just
>> one revision is handy).
>
> Again, would you mind providing a few examples?  It sounds like you have
> some edge cases that could be handled by extending the branch history
> format, but I'd like to pin it down a bit more before discussing solutions.

There are a few:

* a branch contains a subproject and is merged into a subtree
* someone puts a "README" or similar file in a funny place, which isn't 
inside a project root
* someone starts a project with no files in its root directory
* someone records a merge incorrectly (or using a young or middle-aged
SVN which didn't record merges).  You don't want your annotate to hit a 
merge commit which isn't recorded as a merge, and then have to go 
hunting around in history for the real origin of a line of code
* the piecemeal merge case you have seen yourself.

It's just very useful to be able to reparent during the data mining stage.

> <snip>
>>    3. skip bad sections of history, for instance squash merging merges
>> which happened over several commits (SVN and Perforce, of course,
>> support insane piecemeal merging prohibited by git)
>
> This is an excellent point I've stumbled past in my experiments without
> realising what I was seeing.  A simple SVN example might look like this:
>
> 	svn add trunk branches
> 	svn add trunk/foo trunk/bar
> 	svn ci -m "Initial revision" # r1
>
> 	svn cp trunk branches/my_branch
> 	svn ci -m "Created my_branch" # r2
>
> 	# edit files in my_branch
>
> 	svn merge branches/my_branch/foo trunk/foo
> 	svn ci -m "Merge my_branch ->  trunk (1/3)" # r11
>
> 	svn merge branches/my_branch/bar trunk/bar
> 	svn ci -m "Merge my_branch ->  trunk (2/3)" # r12
>
> 	svn cp branches/my_branch/new_file trunk/new_file
> 	svn ci -m "Merge my_branch ->  trunk (3/3)" # r13
>
> This strikes me as a sensibly cautious workflow in SVN, where merge
> conflicts are common and changes are hard to revert.  The best
> representation for this in the current branch history format would be
> something like this:
>
> 	In r1, create branch "trunk"
> 	In r2, create branch "branches/my_branch" from "trunk"
> 	In r13, merge "branches/my_branch" r13 into "trunk"
>
> In other words, pretend r11 and r12 are just normal commits, and that
> r13 is a full merge.  A more useful (and arguably more accurate)
> representation would be possible if we extended the format a bit:
>
> 	In r1, create branch "trunk"
> 	In r2, create branch "branches/my_branch" from "trunk"
> 	In r12, squash changes in "branches/my_branch"
> 	In r13, squash changes in "branches/my_branch"
> 	In r13, merge "branches/my_branch" r13 into "trunk"
>
> Adding "squash" and "fixup" commands would let us represent the whole
> messy business as a single commit, which is closer to what the user was
> trying to say even if it's further from what they actually had to say.

Right, you see the problem.

I think your text syntax is fine so long as it is precise enough, and
it is similar to what I mentioned earlier in this e-mail about having a
single file to drive a conversion run.  That really is the kind of input
data that I had; it's just that I set it up as a useful set of commands.
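
One way to keep such a text syntax "precise enough" is to parse it
strictly and reject anything unrecognised.  The sketch below handles only
the three command forms that appear in the quoted examples (create, squash
and merge).  The grammar is inferred from those examples rather than from
any specification, and the function and field names are hypothetical.

```python
import re

# One anchored pattern per command form seen in the quoted examples.
PATTERNS = [
    ("create", re.compile(
        r'^In r(?P<rev>\d+), create branch "(?P<branch>[^"]+)"'
        r'(?: from "(?P<source>[^"]+)")?$')),
    ("squash", re.compile(
        r'^In r(?P<rev>\d+), squash changes in "(?P<branch>[^"]+)"$')),
    ("merge", re.compile(
        r'^In r(?P<rev>\d+), merge "(?P<source>[^"]+)" '
        r'r(?P<source_rev>\d+) into "(?P<branch>[^"]+)"$')),
]


def parse_history(text):
    """Parse a branch history file into a list of command dictionaries."""
    commands = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        for action, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                # Drop optional groups that did not take part in the match.
                fields = {k: v for k, v in match.groupdict().items() if v is not None}
                for key in ("rev", "source_rev"):
                    if key in fields:
                        fields[key] = int(fields[key])
                commands.append({"action": action, **fields})
                break
        else:
            # Strictness is the point: a typo should stop the run, not be skipped.
            raise ValueError(f"line {lineno}: unrecognised command: {line!r}")
    return commands
```

A file parsed this way can serve both as the input that drives a
conversion run and as a record of the manual amendments made along the way.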

Sam.
