git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Passing revs to git-bundle-create via stdin
@ 2017-05-22 23:44 ch
  2017-05-23 16:46 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: ch @ 2017-05-22 23:44 UTC (permalink / raw)
  To: git

Hi,

I'm using git bundles to create (incremental) backups of my local repositories.
This works quite well but for certain repositories I'm getting unexpectedly big
incremental bundles. I did some testing and from what I can tell it seems
git-bundle-create has issues processing revs passed via stdin. To illustrate
the problem I have included a small bash script below.

I'm using Git for Windows 2.13.0.windows.1 (64-bit). Unfortunately I don't have
access to a non-Windows box to check whether it's a problem specific to the
Windows port.

----
add_file()
{
     echo "$1" > "$1"
     git add "$1"
     git commit -m "$1"
}

git init .

add_file "test-1"
add_file "test-2"
add_file "test-3"

git checkout -b feature
add_file "test-4"
add_file "test-5"
add_file "test-6"

git checkout master
add_file "test-7"
add_file "test-8"
add_file "test-9"

echo -e "\nCreating test.git..."
git bundle create test.git --all ^feature ^master^

echo -e "\nCreating test-stdin.git..."
echo -e "^feature\n^master^\n" | git bundle create test-stdin.git --all --stdin

echo -e "\nCreating test-2.git..."
git bundle create test-2.git --all ^feature^ ^master^

echo -e "\nCreating test-2-stdin.git..."
echo -e "^feature^\n^master^\n" | git bundle create test-2-stdin.git --all --stdin

echo -e "\nCreating test-3-stdin.git..."
echo -e "feature\nmaster\n" | git bundle create test-3-stdin.git --stdin

echo
git branch -D feature
git tag -am "Annotated tag" annotated-tag master~2

echo -e "\nCreating annotated.git..."
git bundle create annotated.git --all ^annotated-tag

echo -e "\nCreating annotated-stdin.git..."
echo -e "^annotated-tag\n" | git bundle create annotated-stdin.git --all --stdin

echo
git tag -d annotated-tag
git tag lightweight-tag master~2
echo -e "\nCreating lightweight-stdin.git..."
echo -e "^lightweight-tag\n" | git bundle create lightweight-stdin.git --all --stdin
----

I'd expect test.git and test-stdin.git to be identical. In fact the contained-
and required-refs lists of both bundles are equal but the pack in
test-stdin.git is notably larger compared to the one in test.git. Interestingly
test-2.git and test-2-stdin.git are identical.

git-bundle-create does not appear to handle includes properly either. In this
specific case it won't create test-3-stdin.git and dies with
'error: Refusing to create empty bundle.'.

Last but not least git-bundle-create includes annotated-tag in
annotated-stdin.git even though the tag is excluded via stdin. It works alright
if the tag is excluded via commandline like in case of annotated.git. The issue
also seems to be specific to annotated tags as lightweight-tag is properly
excluded from lightweight-stdin.git.

Any help would be appreciated.

Thanks in advance.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Passing revs to git-bundle-create via stdin
  2017-05-22 23:44 Passing revs to git-bundle-create via stdin ch
@ 2017-05-23 16:46 ` Jeff King
  2017-05-24  1:44   ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2017-05-23 16:46 UTC (permalink / raw)
  To: ch; +Cc: git

On Tue, May 23, 2017 at 01:44:55AM +0200, ch wrote:

> I'm using git bundles to create (incremental) backups of my local repositories.
> This works quite well but for certain repositories I'm getting unexpectedly big
> incremental bundles. I did some testing and from what I can tell it seems
> git-bundle-create has issues processing revs passed via stdin. To illustrate
> the problem I have included a small bash script below.
> 
> I'm using Git for Windows 2.13.0.windows.1 (64-bit). Unfortunately I don't have
> access to a non-Windows box to check whether it's a problem specific to the
> Windows port.

Thanks for an easy reproduction recipe. I see the problem on Linux, too.

I think what's happening is that git-bundle actually runs _two_
traversals using the command-line arguments. It kicks off an external
rev-list via compute_and_write_prerequisites(), and then feeds the
arguments again to setup_revisions(). The first one eats all of stdin,
and the second just sees an empty input.

You can see it working if you do:

  $ git bundle create from-terminal.git --all --stdin
  ^feature
  ^master^
  [press ^D, i.e., ctrl-d]
  ^feature
  ^master^
  [press ^D again]

Hitting ^D tells the terminal driver to send an EOF; the first one goes
to the child rev-list, and then we repeat the input to get read by the
second traversal. The result is identical to your command-line-only
output. I have no idea if the ^D thing works at all on Windows, but I
don't mean it even as a workaround. It was just a way of confirming my
guess about the double-read.

The real solutions I can think of are:

  1. Teach git-bundle not to require the double-read. I'm not sure why
     it's written the way it is, but presumably it would be tricky to
     undo it (or we would have just written it the other way in the
     first place!)

  2. Git-bundle could buffer stdin and feed it to the two traversals. I
     think this actually ends up a little tricky, because the second
     traversal is done in-process (so we'd have to actually re-feed the
     buffer to our stdin via a "struct async", which feels pretty
     hacky).

  3. git-bundle could natively support --stdin, reading each line and
     convert it into traversal arguments. This is the quickest way to
     make your example work, but I suspect there will be funky corner
     cases (because we'd have to replicate the same rules that
     revision.c uses to read its input).

None of those are incredibly appealing.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Passing revs to git-bundle-create via stdin
  2017-05-23 16:46 ` Jeff King
@ 2017-05-24  1:44   ` Junio C Hamano
  2017-05-25  2:55     ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2017-05-24  1:44 UTC (permalink / raw)
  To: Jeff King; +Cc: ch, git

Jeff King <peff@peff.net> writes:

> I think what's happening is that git-bundle actually runs _two_
> traversals using the command-line arguments. ...
> ... It was just a way of confirming my
> guess about the double-read.
>
> The real solutions I can think of are:
>
>   1. Teach git-bundle not to require the double-read. I'm not sure why
>      it's written the way it is, but presumably it would be tricky to
>      undo it (or we would have just written it the other way in the
>      first place!)

If I remember correctly, the reason why it does the double-read is
because it wants to cope with things like "--since".  There is no
explicit bottom of the DAG specified on the command line, and the
first one (without "--objects") is done to find "prerequisites" that
are written in the header.

Then the packdata is generated, which does another traversal (this
time with "--objects" option).

So perhaps the right way to fix it is to keep the first traversal
as-is, but update the second one (I think write-bundle-refs is the
culprit) so that it does not use the user-supplied command line
as-is; instead it should use the positive end of the history from
the command line with the negative end set to these "prerequisites"
commits.

I said "command line" in the above, but read that as "end user
input"; the list of rev-list command line arguments given from the
standard input is exactly the same deal.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Passing revs to git-bundle-create via stdin
  2017-05-24  1:44   ` Junio C Hamano
@ 2017-05-25  2:55     ` Junio C Hamano
  0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2017-05-25  2:55 UTC (permalink / raw)
  To: Jeff King; +Cc: ch, git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> I think what's happening is that git-bundle actually runs _two_
>> traversals using the command-line arguments. ...
>> ... It was just a way of confirming my
>> guess about the double-read.
>>
>> The real solutions I can think of are:
>>
>>   1. Teach git-bundle not to require the double-read. I'm not sure why
>>      it's written the way it is, but presumably it would be tricky to
>>      undo it (or we would have just written it the other way in the
>>      first place!)
>
> If I remember correctly, the reason why it does the double-read is
> because it wants to cope with things like "--since".  There is no
> explicit bottom of the DAG specified on the command line, and the
> first one (without "--objects") is done to find "prerequisites" that
> are written in the header.
>
> Then the packdata is generated, which does another traversal (this
> time with "--objects" option).
>
> So perhaps the right way to fix it is to keep the first traversal
> as-is, but update the second one (I think write-bundle-refs is the
> culprit) so that it does not use the user-supplied command line
> as-is; instead it should use the positive end of the history from
> the command line with the negative end set to these "prerequisites"
> commits.
>
> I said "command line" in the above, but read that as "end user
> input"; the list of rev-list command line arguments given from the
> standard input is exactly the same deal.

Actually, after thinking a bit more about this, I think the bundle
we currently generate may be a bit less efficient than ideal when
options like --since or --max-count are used.

Imagine a history of this shape (child grows on the right hand side):

 A---D-----E-----G---H
      \         /
       B---C---F

The labels on commits also denote their timestamps in relative
terms, i.e. A has the oldest timestamp, D, even though it is a
parent of B, has newer timestamp than B has, etc.

Now, imagine running "git log --since=$time H" with time set to the
timestamp commit D has.  We traversal from H, following parent chain,
and stop when we see a commit with timestamp older than $time.  So,
we'd enumerate H G F E D; C and A are "boundaries"---we looked at,
but we decided not to include in the result.

A bundle file format records "By using this bundle, you can advance
your history up to this commit", which can be seen by running "git
ls-remote" on the bundle file.  It also records "However, this
bundle does not record the entire history; you need to have the
complete history behind these commits".  These are called
"prerequisites", and can be checked with "git bundle verify".
And then of course it has an actual packfile (which is thin).

So putting all together,

	git bundle create mybundle --since=$time H

would record H as its head, and also C and A as prerequistes.

The "double reading of --stdin" we have been discussing is there
because we run two traversals; the first one is to find the
prerequisites (i.e. C and A in the above example).  The second one
uses the same rev-list arguments (i.e. "--since=$time H") to
generate pack, so it will include D.

As the recipient of a bundle is required to have complete history
behind both A and C, however, the packfile generated with the
current proceess is inefficient--it includes D but it does not need
to.

If we change the argument to rev-list used in the actual packfile
generation, and instead use the boundary we learned during the first
traversal (i.e. A and C in the above example) and the tip of the
history being recorded in the resulting bundle (i.e. H), then we'd
run "git log ^A ^C H", which would only walk "H G F E".

Which would be smaller (it no longer includes D), and the recipient
who has A and C can still apply.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-05-25  2:55 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-22 23:44 Passing revs to git-bundle-create via stdin ch
2017-05-23 16:46 ` Jeff King
2017-05-24  1:44   ` Junio C Hamano
2017-05-25  2:55     ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).