git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git bugs
@ 2008-06-10  8:41 Ben Lynn
  2008-06-10 16:58 ` Daniel Barkalow
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-10  8:41 UTC (permalink / raw)
  To: git

I've come across a couple of bugs. Most users will probably never
encounter them, but I think they ought to be fixed. Apologies if
they're well-known issues. I haven't read much of this list.

1. The import/export language poorly handles distinct initial commits
on the same branch, because given two commits with same branch name,
it assumes the latter is the descendant of the former (if there are no
"from" commands).

Normally this is what you want. But if your project, like git, ever
merges distinct initial commits, then all but the first will
unexpectedly gain parents, corrupting all their descendants' hashes.
For example:

$ git clone git://git.kernel.org/pub/scm/git/git.git
$ git checkout -b test 60ea0fdd7de001405fcc7591beb18a66a1f0dd09
$ git fast-export test > /tmp/x
$ cd /some/empty/dir
$ git init
$ git fast-import < /tmp/x
$ git checkout test

Importing a pristine export, we discover Linus did not in fact make
the first commit to the git project:

$ git log d154ebcc23bfcec2ed44e365af9e5c14c6e22015

As a workaround, I have a custom importer that knows that
git-fast-export omits the "from" command in initial commits. But there
should be a command to specify that the current commit is an initial
commit, allowing reliable export of projects such as git.

2. Kudos to everyone who figured out the nasty race condition and its
complex solution as described in Documentation/technical/racy-git.txt
and the comments of read-cache.c. It took me a while to get my head
around it.

Unfortunately, the solution isn't perfect. Try this:

$ echo xyzzy > file
$ git update-index --add file   # don't zero size since contents match
$ echo frotz > file             # all stats still match, contents don't
$ echo nitfol > other  # can be done much earlier
$ git update-index --add other  # now the cached size is zeroed
$ : > file                      # zero the file size muahahaha
$ # Type fast! The above must take place within the same second! ;-)
$ sleep 2
$ echo filfre > other
$ git update-index --add other  # stats of "file" match, hash is wrong

Essentially, we trigger two index writes that avoid operations on
"file": one immediately after "file" is first committed and identified
as racily clean, and the second some time later, after we have
sneakily zeroed the file behind git's back (after previously editing
its contents in place). We defeat the safeguards and end up with a bad
hash in the index that appears valid.

The"other" file merely causes index writes without reading the "file"
entry. It is also racily clean in the above, but this is irrelevant.

It's unlikely this sequence of operations would occur in real usage,
but I'd sleep better if this index corruption bug were eliminated. One
practical but unsatisfactory easy "fix" is to mark racily clean
entries with SIZE_MAX instead of 0. Who uses git to track to files of
this size?

A better solution would be to introduce a new per-entry flag. Let's
call it "dodgy". Then during a cache entry update, we set "dodgy" if
the mtime is bigger or equal to the index timestamp. And during cache
entry reads, we check "dodgy": if clear, then we trust the hatch,
otherwise we don't trust the hash and recompute it, again setting
"dodgy" if necessary (i.e. if the mtime matches the index timestamp
again).

Although this solution does require adding a flag per index entry, we
no longer have to scan through the index on every index write to
perform the size zero hack.

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10  8:41 Ben Lynn
@ 2008-06-10 16:58 ` Daniel Barkalow
  2008-06-10 17:44 ` Linus Torvalds
  2008-06-12  3:17 ` Shawn O. Pearce
  2 siblings, 0 replies; 37+ messages in thread
From: Daniel Barkalow @ 2008-06-10 16:58 UTC (permalink / raw)
  To: Ben Lynn; +Cc: git

On Tue, 10 Jun 2008, Ben Lynn wrote:

> 2. Kudos to everyone who figured out the nasty race condition and its
> complex solution as described in Documentation/technical/racy-git.txt
> and the comments of read-cache.c. It took me a while to get my head
> around it.
> 
> Unfortunately, the solution isn't perfect. Try this:
> 
> $ echo xyzzy > file
> $ git update-index --add file   # don't zero size since contents match
> $ echo frotz > file             # all stats still match, contents don't
> $ echo nitfol > other  # can be done much earlier
> $ git update-index --add other  # now the cached size is zeroed
> $ : > file                      # zero the file size muahahaha
> $ # Type fast! The above must take place within the same second! ;-)
> $ sleep 2
> $ echo filfre > other
> $ git update-index --add other  # stats of "file" match, hash is wrong
> 
> Essentially, we trigger two index writes that avoid operations on
> "file": one immediately after "file" is first committed and identified
> as racily clean, and the second some time later, after we have
> sneakily zeroed the file behind git's back (after previously editing
> its contents in place). We defeat the safeguards and end up with a bad
> hash in the index that appears valid.
> 
> The"other" file merely causes index writes without reading the "file"
> entry. It is also racily clean in the above, but this is irrelevant.
> 
> It's unlikely this sequence of operations would occur in real usage,
> but I'd sleep better if this index corruption bug were eliminated. One
> practical but unsatisfactory easy "fix" is to mark racily clean
> entries with SIZE_MAX instead of 0.

We could distinguish a "racily clean" entry from a 0-length file entry 
based on the hash. That is, say that a file isn't clean even though the 
size matches, if the size is 0 and the entry's hash isn't the same as the 
file's hash, which is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. Nice thing 
about 0-length files is that you can compute their hashes without reading 
them.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10  8:41 Ben Lynn
  2008-06-10 16:58 ` Daniel Barkalow
@ 2008-06-10 17:44 ` Linus Torvalds
  2008-06-10 18:45   ` Ben Lynn
                     ` (2 more replies)
  2008-06-12  3:17 ` Shawn O. Pearce
  2 siblings, 3 replies; 37+ messages in thread
From: Linus Torvalds @ 2008-06-10 17:44 UTC (permalink / raw)
  To: Ben Lynn; +Cc: git



On Tue, 10 Jun 2008, Ben Lynn wrote:
> 
> Unfortunately, the solution isn't perfect. Try this:

Heh.

That's just because our "smudge_racily_clean_entry()" uses 0 as the magic 
smudging size.

You can fix this multiple ways. One would be to pick another size that is 
simply less likely (eg ~0 instead), which leaves the theoretical race, and 
just makes it practically impossible to hit (not that I think it's very 
practical to hit already).

The other approach is to know that an empty blob always has a very 
specific SHA1. Here's an trial patch.

		Linus

---
 read-cache.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 8e5fbb6..f83de8c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -138,6 +138,16 @@ static int ce_modified_check_fs(struct cache_entry *ce, struct stat *st)
 	return 0;
 }
 
+static int is_empty_blob_sha1(const unsigned char *sha1)
+{
+	static const unsigned char empty_blob_sha1[20] = {
+		0xe6,0x9d,0xe2,0x9b,0xb2,0xd1,0xd6,0x43,0x4b,0x8b,
+		0x29,0xae,0x77,0x5a,0xd8,0xc2,0xe4,0x8c,0x53,0x91
+	};
+
+	return !hashcmp(sha1, empty_blob_sha1);
+}
+
 static int ce_match_stat_basic(struct cache_entry *ce, struct stat *st)
 {
 	unsigned int changed = 0;
@@ -193,6 +203,12 @@ static int ce_match_stat_basic(struct cache_entry *ce, struct stat *st)
 	if (ce->ce_size != (unsigned int) st->st_size)
 		changed |= DATA_CHANGED;
 
+	/* Racily smudged entry? */
+	if (!ce->ce_size) {
+		if (!is_empty_blob_sha1(ce->sha1))
+			changed |= DATA_CHANGED;
+	}
+
 	return changed;
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 17:44 ` Linus Torvalds
@ 2008-06-10 18:45   ` Ben Lynn
  2008-06-10 20:06     ` Linus Torvalds
  2008-06-12 20:06   ` Junio C Hamano
  2008-06-13 10:10   ` Jeff King
  2 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-10 18:45 UTC (permalink / raw)
  To: Linus Torvalds, Daniel Barkalow; +Cc: git

I hadn't thought of exploiting the fact that the SHA1 of an empty file
is fixed. Nice! I believe I can prove there are no races now.
Incidentally, this is how I first found the bug: I was trying to prove
what git did worked.

I still prefer a per-entry flag solution (I suspect it's faster, and
the proof is easier), but that's too much work.

-Ben

On Tue, Jun 10, 2008 at 10:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 10 Jun 2008, Ben Lynn wrote:
>>
>> Unfortunately, the solution isn't perfect. Try this:
>
> Heh.
>
> That's just because our "smudge_racily_clean_entry()" uses 0 as the magic
> smudging size.
>
> You can fix this multiple ways. One would be to pick another size that is
> simply less likely (eg ~0 instead), which leaves the theoretical race, and
> just makes it practically impossible to hit (not that I think it's very
> practical to hit already).
>
> The other approach is to know that an empty blob always has a very
> specific SHA1. Here's an trial patch.
>
>                Linus
>
> ---
>  read-cache.c |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/read-cache.c b/read-cache.c
> index 8e5fbb6..f83de8c 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -138,6 +138,16 @@ static int ce_modified_check_fs(struct cache_entry *ce, struct stat *st)
>        return 0;
>  }
>
> +static int is_empty_blob_sha1(const unsigned char *sha1)
> +{
> +       static const unsigned char empty_blob_sha1[20] = {
> +               0xe6,0x9d,0xe2,0x9b,0xb2,0xd1,0xd6,0x43,0x4b,0x8b,
> +               0x29,0xae,0x77,0x5a,0xd8,0xc2,0xe4,0x8c,0x53,0x91
> +       };
> +
> +       return !hashcmp(sha1, empty_blob_sha1);
> +}
> +
>  static int ce_match_stat_basic(struct cache_entry *ce, struct stat *st)
>  {
>        unsigned int changed = 0;
> @@ -193,6 +203,12 @@ static int ce_match_stat_basic(struct cache_entry *ce, struct stat *st)
>        if (ce->ce_size != (unsigned int) st->st_size)
>                changed |= DATA_CHANGED;
>
> +       /* Racily smudged entry? */
> +       if (!ce->ce_size) {
> +               if (!is_empty_blob_sha1(ce->sha1))
> +                       changed |= DATA_CHANGED;
> +       }
> +
>        return changed;
>  }
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 18:45   ` Ben Lynn
@ 2008-06-10 20:06     ` Linus Torvalds
  2008-06-10 23:09       ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-10 20:06 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Daniel Barkalow, git



On Tue, 10 Jun 2008, Ben Lynn wrote:
>
> Nice! I believe I can prove there are no races now.

It's worth pointing out that even in the absense of races, you can 
obviously screw things up if you really work at it, and *want* to. We 
cannot guarantee that we see all file changes from the stat() information, 
and we don't even save the whole stat info (ie we only save the low 32 
bits).

The ctime check is there to help make it harder for people to play games 
by setting the mtime back in time after having changed things, but we may 
at some point be forced to remove it because it triggers with things like 
beagle that use inode attributes to save indexing information (and thus 
change ctime).

And different systems have different approaches to what happens when a 
file gets modified through a writable mmap(). Exactly what is the mtime 
going to be? 

So I think git does a really good job at matching the stat() information, 
and the suggested patch makes it even stricter, but I think we should not 
even try to make it handle "malicious" changes. I bet you can work at 
making it miss some update if you really *really* try.

And I think there is one known race: the index mtime itself is not 
race-free. Remember: it may take more than a second to write the index 
file! So I can imagine that if you can set it up so that you change the 
file as the index is written out, and the index write is delayed 
sufficiently, the racy timestamp logic can fail just because the timestamp 
on the index file ends up being later.

This is more easily shown by doing a 'touch' on the index file afterwards, 
of course.

And yes, we should have written the timestamp to the file itself, instead 
of reading it from the filesystem.

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 20:06     ` Linus Torvalds
@ 2008-06-10 23:09       ` Ben Lynn
  2008-06-10 23:38         ` Junio C Hamano
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-10 23:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Barkalow, git

>> Nice! I believe I can prove there are no races now.
>
> It's worth pointing out that even in the absense of races, you can
> obviously screw things up if you really work at it, and *want* to. We
> cannot guarantee that we see all file changes from the stat() information,
> and we don't even save the whole stat info (ie we only save the low 32
> bits).

I agree completely. My proof only holds in an abstract setting. It
assumes things such as a strictly nondecreasing system clock is and
that ctime and mtime work in an ideal sense.

> And different systems have different approaches to what happens when a
> file gets modified through a writable mmap(). Exactly what is the mtime
> going to be?

Good point. I confess I've only learned about mmap very recently, from
browsing the git code. All this time, I've been using streams, file
descriptors, etc. mmap is so much better!

What's Linux do in this case? For indexing purposes, so long as the
mtime is updated after the last write before git gets to it, things
should be fine.

> So I think git does a really good job at matching the stat() information,
> and the suggested patch makes it even stricter, but I think we should not
> even try to make it handle "malicious" changes. I bet you can work at
> making it miss some update if you really *really* try.

Definitely. e.g. rig the mtime by 2^32 seconds, or add 2^32 bytes to a
file within a second.

> And I think there is one known race: the index mtime itself is not
> race-free. Remember: it may take more than a second to write the index
> file! So I can imagine that if you can set it up so that you change the
> file as the index is written out, and the index write is delayed
> sufficiently, the racy timestamp logic can fail just because the timestamp
> on the index file ends up being later.

I had thought about this. I hacked some code up where the index looks
at the current system time when updating a cache entry to determine if
the hash is racy. Is doing one time(NULL) call per file reasonable?
I'm guessing it must be cheaper that a stat call.

> This is more easily shown by doing a 'touch' on the index file afterwards,
> of course.

Agreed. Another assumption of my proof is that the index is
trustworthy. If you tamper with it, all bets are off. You can't stop
determined users from hurting themselves.

> And yes, we should have written the timestamp to the file itself, instead
> of reading it from the filesystem.

Interesting. I had hacked a version of the index that did this (before
changing it to use a different solution).

In general, is the format of the index file set in stone? Is that why
it's better to use the size zero trick for the race condition rather
than introduce a new flag for example? Or are these wrinkles too small
to justify a potentially painful upgrade?

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 23:09       ` Ben Lynn
@ 2008-06-10 23:38         ` Junio C Hamano
  2008-06-11  0:02           ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2008-06-10 23:38 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Linus Torvalds, Daniel Barkalow, git

"Ben Lynn" <benlynn@gmail.com> writes:

> I had thought about this. I hacked some code up where the index looks
> at the current system time when updating a cache entry to determine if
> the hash is racy. Is doing one time(NULL) call per file reasonable?
> I'm guessing it must be cheaper that a stat call.

Hmm, sorry, could you elaborate how you would plan to use the return value
from time(2) per file?

The "index file timestamp" trick assumes that once we start reading from
and writing to the filesystem (in order to hash the current contents,
check if there is any modification), nobody else touches the paths we are
interested in (e.g., after "read-tree -m -u" checks out the new contents,
grabs the stat information from the newly deposited file and stuffs that
in the index, you do not go in and edit it further until our process
returns the control to you).  We also assume that the files (both work
tree and the index) live in the same filesystem and the file timestamp,
which could be skewed compared to the system clock if the filesystem is
over the network, are consistent among them and monotonicly increasing.

        You have to have some assumption --- if you allow anybody to touch
        anything behind your back, or if you allow timestamps of some
        files come from different time sources than the one for some other
        files, I do not think any lstat(2) based change detection scheme
        would work.

We do our writeout first and then the index is updated after all our
writeout is done, so by definition (more precisely, "by that assumption"),
anything older than the timestamp of the index file are up to date, if
their filesystem timestamp match the timestamp recorded in the index, and
anything that is the same or newer than the index timestamp is suspect.

And that is the reason the current code gets by only with a single
timestamp.  I'd have to go back and study your breakage scenario a bit
better (I'm still at work today).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 23:38         ` Junio C Hamano
@ 2008-06-11  0:02           ` Ben Lynn
  2008-06-11  0:20             ` Junio C Hamano
  2008-06-11  1:36             ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  0:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Daniel Barkalow, git

>> I had thought about this. I hacked some code up where the index looks
>> at the current system time when updating a cache entry to determine if
>> the hash is racy. Is doing one time(NULL) call per file reasonable?
>> I'm guessing it must be cheaper that a stat call.
>
> Hmm, sorry, could you elaborate how you would plan to use the return value
> from time(2) per file?

My bad, I was extremely unclear. I meant I had thought about not
bothering to look at the index file timestamp (I'm not sure why, but I
instinctively trust the system clock more!). I'm making the same
assumptions, i.e. files are not touched while they're being indexed. I
suppose with the way I'm doing things, you can touch a file right up
to the point where the index wants to stat that particular file and
maybe look at its contents, but not afterwards. (And this is pretty
useless, because how are you going to know which file the index is up
to?)

I maintain a flag per file, and after computing its SHA1, I compare
the mtime with the current system time. If it matches, then that means
the hash can't be trusted (because of the race condition) and I set
the flag. This way, I avoid examining the index timestamp.

I call time(NULL) for each file out of laziness. I could cache the
value somewhere the first time, but that means I'd have to pass around
an extra argument to a whole bunch of functions, because of the way
I've written my code. But I don't think it's much of a drawback,
because I stat() each file anyway.

Actually, what Linus describes would be desirable in some sense. If
the index mtime timestamp were always later than the time you
performed the indexing (and accurate) because it always took a long
time, assuming no one messes with files while you're indexing, there
would be no race condition to worry about.

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  0:02           ` Ben Lynn
@ 2008-06-11  0:20             ` Junio C Hamano
  2008-06-11  0:24               ` Ben Lynn
  2008-06-11  1:36             ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2008-06-11  0:20 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Linus Torvalds, Daniel Barkalow, git

"Ben Lynn" <benlynn@gmail.com> writes:

> ... I'm making the same
> assumptions, i.e. files are not touched while they're being indexed.

Actually you are making another assumption that is somewhat worrysome.

The filesystem clock and time(2) clock can be skewed and comparing a
timestamp you obtain from lstat(2) and time(2) might not give you any
meaningful information.  That is why I made this codepath without using
time(2) when I did the racy-git work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  0:20             ` Junio C Hamano
@ 2008-06-11  0:24               ` Ben Lynn
  2008-06-11  0:53                 ` Ben Lynn
  2008-06-11 12:46                 ` Stephen R. van den Berg
  0 siblings, 2 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  0:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Daniel Barkalow, git

That is problematic. How do I figure out what the filesystem thinks is
the current time? I could touch some file and read its mtime, but I
want a shortcut.

Are there any guarantees of any kind? e.g. is the filesystem current
time at least never ahead of the system time?

-Ben

On Tue, Jun 10, 2008 at 5:20 PM, Junio C Hamano <gitster@pobox.com> wrote:
> "Ben Lynn" <benlynn@gmail.com> writes:
>
>> ... I'm making the same
>> assumptions, i.e. files are not touched while they're being indexed.
>
> Actually you are making another assumption that is somewhat worrysome.
>
> The filesystem clock and time(2) clock can be skewed and comparing a
> timestamp you obtain from lstat(2) and time(2) might not give you any
> meaningful information.  That is why I made this codepath without using
> time(2) when I did the racy-git work.
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  0:24               ` Ben Lynn
@ 2008-06-11  0:53                 ` Ben Lynn
  2008-06-11 12:46                 ` Stephen R. van den Berg
  1 sibling, 0 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  0:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Daniel Barkalow, git

Also, without using the time(2) call, how would we write the timestamp
of the index to the file as Linus suggests? I'm guessing we'd write
it, then read its mtime, and the write it again (just the part storing
the timestamp), but that seems inelegant. In addition, the second
write may happen in the next second and you'll get an mtime different
to that in the file, though that shouldn't matter.

-Ben

On Tue, Jun 10, 2008 at 5:24 PM, Ben Lynn <benlynn@gmail.com> wrote:
> That is problematic. How do I figure out what the filesystem thinks is
> the current time? I could touch some file and read its mtime, but I
> want a shortcut.
>
> Are there any guarantees of any kind? e.g. is the filesystem current
> time at least never ahead of the system time?
>
> -Ben
>
> On Tue, Jun 10, 2008 at 5:20 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> "Ben Lynn" <benlynn@gmail.com> writes:
>>
>>> ... I'm making the same
>>> assumptions, i.e. files are not touched while they're being indexed.
>>
>> Actually you are making another assumption that is somewhat worrysome.
>>
>> The filesystem clock and time(2) clock can be skewed and comparing a
>> timestamp you obtain from lstat(2) and time(2) might not give you any
>> meaningful information.  That is why I made this codepath without using
>> time(2) when I did the racy-git work.
>>
>>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  0:02           ` Ben Lynn
  2008-06-11  0:20             ` Junio C Hamano
@ 2008-06-11  1:36             ` Linus Torvalds
  2008-06-11  2:04               ` Ben Lynn
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11  1:36 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, git



On Tue, 10 Jun 2008, Ben Lynn wrote:
> 
> I call time(NULL) for each file out of laziness. I could cache the
> value somewhere the first time, but that means I'd have to pass around
> an extra argument to a whole bunch of functions, because of the way
> I've written my code. But I don't think it's much of a drawback,
> because I stat() each file anyway.

No, that would be horrible. There is no guarantee that time() has 
anything to do with the stat timestamps anyway. 

The right way to do things would be to just do a stat() on the index file 
as it is created, and then save the mtime of that stat into the file. That 
way, you have the mtime of the index file not for the *last* write, but 
for the *first* one.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  1:36             ` Linus Torvalds
@ 2008-06-11  2:04               ` Ben Lynn
  2008-06-11  2:12                 ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  2:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, git

> The right way to do things would be to just do a stat() on the index file
> as it is created, and then save the mtime of that stat into the file. That
> way, you have the mtime of the index file not for the *last* write, but
> for the *first* one.

Sorry, but if we're assuming no one is touching the files while we're
updating the index (including writing it to disk), why does it matter
whether we use the time of first or last write? In fact, if a index
write takes a long time, using the last write time as the mtime would
be beneficial for the race condition stuff.

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  2:04               ` Ben Lynn
@ 2008-06-11  2:12                 ` Linus Torvalds
  2008-06-11  2:31                   ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11  2:12 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, git



On Tue, 10 Jun 2008, Ben Lynn wrote:
> 
> Sorry, but if we're assuming no one is touching the files while we're
> updating the index (including writing it to disk), why does it matter
> whether we use the time of first or last write? In fact, if a index
> write takes a long time, using the last write time as the mtime would
> be beneficial for the race condition stuff.

Oh, if you assume nobody is touching the files as the index is created, 
none of this matters. 

So if *that* was your only race worry, then git should already be 
perfectly fine.

The issue with the timestamp of the index only happens if somebody ends up 
modifying the files that are being indexed _as_ they are indexed. Quite 
frankly, git will notice that too in just about all possible cases, but if 
the size stays the same and the modification time ends up still being 
smaller than the final index mtime (because writing the index took so 
long!), then you might miss some modifications that would otherwise be 
noticed.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  2:12                 ` Linus Torvalds
@ 2008-06-11  2:31                   ` Ben Lynn
  2008-06-11  2:39                     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  2:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, git

Ah, I hadn't seen that. Yes, it is better to use the first write as
the timestamp. Would this catch everything? If the filesystem clock is
monotonically increasing and consistent then with this setup, you can
touch files even as they are being indexed? (Disregarding nonsense
like changing sizes by 2^32.)

Just realized I should have paid more attention to Junio's first reply
to me. My eye had skipped over the paragraph about clock skew, and
this caused some confusion. Sorry about all my extra emails!

-Ben

On Tue, Jun 10, 2008 at 7:12 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 10 Jun 2008, Ben Lynn wrote:
>>
>> Sorry, but if we're assuming no one is touching the files while we're
>> updating the index (including writing it to disk), why does it matter
>> whether we use the time of first or last write? In fact, if a index
>> write takes a long time, using the last write time as the mtime would
>> be beneficial for the race condition stuff.
>
> Oh, if you assume nobody is touching the files as the index is created,
> none of this matters.
>
> So if *that* was your only race worry, then git should already be
> perfectly fine.
>
> The issue with the timestamp of the index only happens if somebody ends up
> modifying the files that are being indexed _as_ they are indexed. Quite
> frankly, git will notice that too in just about all possible cases, but if
> the size stays the same and the modification time ends up still being
> smaller than the final index mtime (because writing the index took so
> long!), then you might miss some modifications that would otherwise be
> noticed.
>
>                Linus
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  2:31                   ` Ben Lynn
@ 2008-06-11  2:39                     ` Linus Torvalds
  2008-06-11  5:58                       ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11  2:39 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List



On Tue, 10 Jun 2008, Ben Lynn wrote:
>
> Ah, I hadn't seen that. Yes, it is better to use the first write as
> the timestamp. Would this catch everything? If the filesystem clock is
> monotonically increasing and consistent then with this setup, you can
> touch files even as they are being indexed? (Disregarding nonsense
> like changing sizes by 2^32.)

Yes, I think that at that point it would protect against arbitrary 
modifications even concurrently to index file creation.

That said, I don't think you even need a new index file format. We could 
just do a stat() on starting the index file creation, and then do a 
futimes() system call at the end to re-set the mtime to the beginning 
before we rename it back over the old index file.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  2:39                     ` Linus Torvalds
@ 2008-06-11  5:58                       ` Ben Lynn
  2008-06-11  6:18                         ` Ben Lynn
  2008-06-11 14:52                         ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  5:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

Am I going crazy? All of a sudden I think I can get away without a
size zero hack. How about this smudging routine:

if (!ce_match_stat_basic(ce, &st)) {
  recompute_sha1_and_update_index();  // no other checks required
}

That should be sufficient. I think what happened was the following.
Once upon a time, the race fix was "if (stats_match) cached_size = 0",
which is nice because you don't have to examine file contents. Later,
because of the

 $ echo xyzzy >frotz ; git-update-index --add frotz ; : >frotz
 $ sleep 3
 $ echo filfre >nitfol ; git-update-index --add nitfol

issue, the ce_modified_check_fs was added.

But then if we're going to be examining file contents anyway, we may
as well drop the whole size zero trick and simply update the hash. The
bug I brought up also goes away.

-Ben

P.S: I could go through the history to see, but I bet there was a
stage after the race condition was discovered but before it was
realized that
  $ git update-index 'foo'
  : modify 'foo' in-place without changing its size
  : wait for enough time
  $ git update-index 'bar'
was a problem.

On Tue, Jun 10, 2008 at 7:39 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 10 Jun 2008, Ben Lynn wrote:
>>
>> Ah, I hadn't seen that. Yes, it is better to use the first write as
>> the timestamp. Would this catch everything? If the filesystem clock is
>> monotonically increasing and consistent then with this setup, you can
>> touch files even as they are being indexed? (Disregarding nonsense
>> like changing sizes by 2^32.)
>
> Yes, I think that at that point it would protect against arbitrary
> modifications even concurrently to index file creation.
>
> That said, I don't think you even need a new index file format. We could
> just do a stat() on starting the index file creation, and then do a
> futimes() system call at the end to re-set the mtime to the beginning
> before we rename it back over the old index file.
>
>                Linus
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  5:58                       ` Ben Lynn
@ 2008-06-11  6:18                         ` Ben Lynn
  2008-06-11 14:54                           ` Linus Torvalds
  2008-06-11 14:52                         ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-11  6:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

And I'm sure you've also realized we could use the old race fix with
SIZE_MAX instead of zero, i.e:

 if (!ce_match_stat_basic(ce, &st)) {
   ce->ce_size = ~0;
 }
 return; // don't bother with ce_modified_check_fs

This is dissatisfying in that we're abusing the size variable. I'd
much prefer adding a flag per entry. And now there are other file
sizes, albeit ridiculously large ones, that will trick git. But it is
much faster than examining file contents. Luckily the decision of
which fix to use is not up to me.

-Ben

On Wed, Jun 11, 2008 at 5:58 AM, Ben Lynn <benlynn@gmail.com> wrote:
> Am I going crazy? All of a sudden I think I can get away without a
> size zero hack. How about this smudging routine:
>
> if (!ce_match_stat_basic(ce, &st)) {
>  recompute_sha1_and_update_index();  // no other checks required
> }
>
> That should be sufficient. I think what happened was the following.
> Once upon a time, the race fix was "if (stats_match) cached_size = 0",
> which is nice because you don't have to examine file contents. Later,
> because of the
>
>  $ echo xyzzy >frotz ; git-update-index --add frotz ; : >frotz
>  $ sleep 3
>  $ echo filfre >nitfol ; git-update-index --add nitfol
>
> issue, the ce_modified_check_fs was added.
>
> But then if we're going to be examining file contents anyway, we may
> as well drop the whole size zero trick and simply update the hash. The
> bug I brought up also goes away.
>
> -Ben
>
> P.S: I could go through the history to see, but I bet there was a
> stage after the race condition was discovered but before it was
> realized that
>  $ git update-index 'foo'
>  : modify 'foo' in-place without changing its size
>  : wait for enough time
>  $ git update-index 'bar'
> was a problem.
>
> On Tue, Jun 10, 2008 at 7:39 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>>
>> On Tue, 10 Jun 2008, Ben Lynn wrote:
>>>
>>> Ah, I hadn't seen that. Yes, it is better to use the first write as
>>> the timestamp. Would this catch everything? If the filesystem clock is
>>> monotonically increasing and consistent then with this setup, you can
>>> touch files even as they are being indexed? (Disregarding nonsense
>>> like changing sizes by 2^32.)
>>
>> Yes, I think that at that point it would protect against arbitrary
>> modifications even concurrently to index file creation.
>>
>> That said, I don't think you even need a new index file format. We could
>> just do a stat() on starting the index file creation, and then do a
>> futimes() system call at the end to re-set the mtime to the beginning
>> before we rename it back over the old index file.
>>
>>                Linus
>>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  0:24               ` Ben Lynn
  2008-06-11  0:53                 ` Ben Lynn
@ 2008-06-11 12:46                 ` Stephen R. van den Berg
  2008-06-12  6:51                   ` Ben Lynn
  1 sibling, 1 reply; 37+ messages in thread
From: Stephen R. van den Berg @ 2008-06-11 12:46 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Linus Torvalds, Daniel Barkalow, git

Ben Lynn wrote:
>That is problematic. How do I figure out what the filesystem thinks is
>the current time? I could touch some file and read its mtime, but I
>want a shortcut.

That basically is as short as it gets, except perhaps statting a freshly
modified file (not necessarily by oneself).

>Are there any guarantees of any kind? e.g. is the filesystem current
>time at least never ahead of the system time?

Practically no guarantees, just that it is rather likely for the time at
the fileserver to progress at about the same pace as yours.
-- 
Sincerely,
           Stephen R. van den Berg.

Differentiation is an integral part of calculus.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  5:58                       ` Ben Lynn
  2008-06-11  6:18                         ` Ben Lynn
@ 2008-06-11 14:52                         ` Linus Torvalds
  1 sibling, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11 14:52 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List



On Tue, 10 Jun 2008, Ben Lynn wrote:
>
> Am I going crazy? All of a sudden I think I can get away without a
> size zero hack. How about this smudging routine:
> 
> if (!ce_match_stat_basic(ce, &st)) {
>   recompute_sha1_and_update_index();  // no other checks required
> }

No.

You can't recompute the sha1. That SHA1 is what the user asked us to 
remember - whether smudged or not. So you mustn't change it, until the 
user does a "git add" or "git update-index" (or anything that implicitly 
does the same).

You can change the _stat_ information, since that is only used for 
matching the tree (positively or negatively), but the SHA1 itself is not 
just a cache, it's the "pending state" in the index.

So smudging must change some piece of data that is only about the stat 
cache - the st_size, the mtime, anything like that.

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11  6:18                         ` Ben Lynn
@ 2008-06-11 14:54                           ` Linus Torvalds
  2008-06-11 17:52                             ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11 14:54 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List



On Wed, 11 Jun 2008, Ben Lynn wrote:
>
> But it is much faster than examining file contents.

I'm not sure why you think my patch that just did the zero-sized blob 
thing was slow? It's a 20-byte memcmp(). It takes no time at all. 

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 14:54                           ` Linus Torvalds
@ 2008-06-11 17:52                             ` Ben Lynn
  2008-06-11 18:10                               ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-11 17:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

> I'm not sure why you think my patch that just did the zero-sized blob
> thing was slow? It's a 20-byte memcmp(). It takes no time at all.

I don't think the memcmp is slow. I think the ce_modified_check_fs in:

smudge() {
   ...
  if (ce_match_stat_basic(ce, &st))
                return;
  if (ce_modified_check_fs(ce, &st))
                ce->ce_size = 0;
}

is potentially slow, and I'm saying you could replace it with

smudge() {
   ...
  if (ce_match_stat_basic(ce, &st))
                return;
  ce->ce_size = ~0;
}

to avoid the ce_modified_check_fs call. But it is an unclean solution,
which is why I champion having an extra flag per file.

Also, I think we could set ce->ce_size to ~0 when we first realize
timestamp = mtime, and we'd no longer have to do index-wide smudging
on writes.

Thanks for the explanation by the way. I get why you can't modify the
SHA1. It is indeed what we asked git to record, right or wrong. I got
confused because I misread the code and thought ce_modified_check_fs()
would write the new SHA1 to disk.

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 17:52                             ` Ben Lynn
@ 2008-06-11 18:10                               ` Linus Torvalds
  2008-06-11 18:48                                 ` Ben Lynn
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11 18:10 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List



On Wed, 11 Jun 2008, Ben Lynn wrote:
> 
> I don't think the memcmp is slow. I think the ce_modified_check_fs in:
> 
> smudge() {
>    ...
>   if (ce_match_stat_basic(ce, &st))
>                 return;
>   if (ce_modified_check_fs(ce, &st))
>                 ce->ce_size = 0;
> }
> 
> is potentially slow, and I'm saying you could replace it with
> 
> smudge() {
>    ...
>   if (ce_match_stat_basic(ce, &st))
>                 return;
>   ce->ce_size = ~0;
> }
> 
> to avoid the ce_modified_check_fs call. But it is an unclean solution,
> which is why I champion having an extra flag per file.

Well, you have to think about what you want to optimize here.

The thing we want to optimize is not the writing of the index file, but 
the subsequent _use_ of it!

As such, the last thing we actually want to do is to smudge the index file 
entry. If it actually has a possibility of being not smudged, we're much 
better off saying "hey, now the index file is newer, and the file it 
refers to is all good!"

That way, we won't have to do the ce_modified_check_fs() call later - we 
do it just once, and the file is up-to-date.

So we don't want to smudge it, but if the stat information says it migth 
match even though it doesn't, we have to. But if the stat information says 
it matches, and the data actually _does_ match, then we shouldn't smudge 
it, we should be happy - and all subsequent users of the index will then 
know that they don't even need to look at the file contents.

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 18:10                               ` Linus Torvalds
@ 2008-06-11 18:48                                 ` Ben Lynn
  2008-06-11 18:53                                   ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Lynn @ 2008-06-11 18:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

> So we don't want to smudge it, but if the stat information says it migth
> match even though it doesn't, we have to. But if the stat information says
> it matches, and the data actually _does_ match, then we shouldn't smudge
> it, we should be happy - and all subsequent users of the index will then
> know that they don't even need to look at the file contents.

I understand. In that case, what about an unsmudging routine so we can
have the best of both worlds? We unconditionally smudge the file as
soon as timestamp = mtime is detected. We never do index-wide smudging
on writes, but rather, on index read of particular file we do this:

  if (stats_differ()) {
    if (hash_matches()) {
      // Aha! An unconditionally smudged file that we might be able to unsmudge,
      // so future reads can avoid this check.
      if (mtime < timestamp) {
        fix_stats();  // Involves writing to index.
      }
      // D'oh! This file could still be racy, leave it smudged.
    } else {
      // The stats were right, the hash does differ.
      ...
    }
  }

We minimize the amount of ce_check_modified_fs() calls for any given
sequence of index operations. Instead of doing index-wide checks on
every write, we only check when we have no choice, i.e. the first time
a particular file is being looked up in the index. We fix its stats if
possible so future reads can avoid the painful check. The only
drawback is that I'm not sure how acceptable it is to write to the
index on a read operation. Is this a big deal? (If it were a separate
flag, I'm sure no one would mind!)

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 18:48                                 ` Ben Lynn
@ 2008-06-11 18:53                                   ` Linus Torvalds
  2008-06-11 20:57                                     ` Ben Lynn
  2008-06-11 21:50                                     ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2008-06-11 18:53 UTC (permalink / raw)
  To: Ben Lynn; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List



On Wed, 11 Jun 2008, Ben Lynn wrote:
>
> The only drawback is that I'm not sure how acceptable it is to write to 
> the index on a read operation. Is this a big deal?

Historically, we *never* did it. In fact, it was a big deal. These days we 
do it opportunistically for "git diff" if we can, but making sure that it 
all still works for a read-only access (think gitweb etc - the reader is 
*not* necessarily the owner of the archive at all!)

So I really prefer not to. But you can take it up with Junio if you think 
it can be a big deal.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 18:53                                   ` Linus Torvalds
@ 2008-06-11 20:57                                     ` Ben Lynn
  2008-06-11 21:50                                     ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-11 20:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

> So I really prefer not to. But you can take it up with Junio if you think
> it can be a big deal.

I began writing a patch but I abandoned it. I don't have much
experience with the code base and I'd almost certainly miss a bunch of
corner cases. Your patch seems to be the easiest way to go. I'm not
motivated to push pre-emptive smudging, even if it might be more
efficient. After all, how often does this race condition come up?

But I have a per-file flagged version of the index for personal use,
and it's good to know that it's a valid strategy.

> Historically, we *never* did it. In fact, it was a big deal. These days we
> do it opportunistically for "git diff" if we can, but making sure that it
> all still works for a read-only access (think gitweb etc - the reader is
> *not* necessarily the owner of the archive at all!)

I hadn't thought about this case. This is intriguing: why does gitweb
need access to the index? I thought the index was only to make
operations fast for users intending to make changes to the repo. Why
can't servers run stripped-down versions of git that don't bother with
an index?

-Ben

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 18:53                                   ` Linus Torvalds
  2008-06-11 20:57                                     ` Ben Lynn
@ 2008-06-11 21:50                                     ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2008-06-11 21:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben Lynn, Daniel Barkalow, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 11 Jun 2008, Ben Lynn wrote:
>>
>> The only drawback is that I'm not sure how acceptable it is to write to 
>> the index on a read operation. Is this a big deal?
>
> Historically, we *never* did it. In fact, it was a big deal. These days we 
> do it opportunistically for "git diff" if we can, but making sure that it 
> all still works for a read-only access (think gitweb etc - the reader is 

Thanks for giving a good summary (I've been away), and I agree with your
analysis of the situation.  I do not see much need to _fix_ anything
around this area, except your "smudging is not zero timestamp" update.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10  8:41 Ben Lynn
  2008-06-10 16:58 ` Daniel Barkalow
  2008-06-10 17:44 ` Linus Torvalds
@ 2008-06-12  3:17 ` Shawn O. Pearce
  2008-06-12  6:46   ` Ben Lynn
  2008-06-12  7:12   ` Johannes Schindelin
  2 siblings, 2 replies; 37+ messages in thread
From: Shawn O. Pearce @ 2008-06-12  3:17 UTC (permalink / raw)
  To: Ben Lynn; +Cc: git

Ben Lynn <benlynn@gmail.com> wrote:
> 1. The import/export language poorly handles distinct initial commits
> on the same branch, because given two commits with same branch name,
> it assumes the latter is the descendant of the former (if there are no
> "from" commands).
> 
> Normally this is what you want. But if your project, like git, ever
> merges distinct initial commits, then all but the first will
> unexpectedly gain parents, corrupting all their descendants' hashes.
...
> As a workaround, I have a custom importer that knows that
> git-fast-export omits the "from" command in initial commits. But there
> should be a command to specify that the current commit is an initial
> commit, allowing reliable export of projects such as git.

fast-export is wrong, and is using the language wrong.  fast-import
is correct.  Because I said so.  :-)

No, seriously, fast-import came along first and can describe what
you are referring to as the many initial root commits in git.git.
The issue is fast-export is not generating commands to say as much.

Its quite easily fixable.

When we output a commit in handle_commit() we just need to reset
the branch if we have no parents.  That simple.  This is totally
untested, but I think it fixes it.


diff --git a/builtin-fast-export.c b/builtin-fast-export.c
index 1dfc01e..d0a462f 100644
--- a/builtin-fast-export.c
+++ b/builtin-fast-export.c
@@ -188,6 +188,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	mark_object(&commit->object);
 	if (!is_encoding_utf8(encoding))
 		reencoded = reencode_string(message, "UTF-8", encoding);
+	if (!commit->parents)
+		printf("reset %s\n", (const char*)commit->util);
 	printf("commit %s\nmark :%d\n%.*s\n%.*s\ndata %u\n%s",
 	       (const char *)commit->util, last_idnum,
 	       (int)(author_end - author), author,

-- 
Shawn.

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-12  3:17 ` Shawn O. Pearce
@ 2008-06-12  6:46   ` Ben Lynn
  2008-06-12  7:12   ` Johannes Schindelin
  1 sibling, 0 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-12  6:46 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

Thanks! I just tested it on the example I quoted before, and it works.
I misunderstood the reset command. I thought it took you to some
previous commit. But now I see it takes you back to before the initial
commit if you don't use the "from" command.

-Ben

On Wed, Jun 11, 2008 at 8:17 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Ben Lynn <benlynn@gmail.com> wrote:
>> 1. The import/export language poorly handles distinct initial commits
>> on the same branch, because given two commits with same branch name,
>> it assumes the latter is the descendant of the former (if there are no
>> "from" commands).
>>
>> Normally this is what you want. But if your project, like git, ever
>> merges distinct initial commits, then all but the first will
>> unexpectedly gain parents, corrupting all their descendants' hashes.
> ...
>> As a workaround, I have a custom importer that knows that
>> git-fast-export omits the "from" command in initial commits. But there
>> should be a command to specify that the current commit is an initial
>> commit, allowing reliable export of projects such as git.
>
> fast-export is wrong, and is using the language wrong.  fast-import
> is correct.  Because I said so.  :-)
>
> No, seriously, fast-import came along first and can describe what
> you are referring to as the many initial root commits in git.git.
> The issue is fast-export is not generating commands to say as much.
>
> Its quite easily fixable.
>
> When we output a commit in handle_commit() we just need to reset
> the branch if we have no parents.  That simple.  This is totally
> untested, but I think it fixes it.
>
>
> diff --git a/builtin-fast-export.c b/builtin-fast-export.c
> index 1dfc01e..d0a462f 100644
> --- a/builtin-fast-export.c
> +++ b/builtin-fast-export.c
> @@ -188,6 +188,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
>        mark_object(&commit->object);
>        if (!is_encoding_utf8(encoding))
>                reencoded = reencode_string(message, "UTF-8", encoding);
> +       if (!commit->parents)
> +               printf("reset %s\n", (const char*)commit->util);
>        printf("commit %s\nmark :%d\n%.*s\n%.*s\ndata %u\n%s",
>               (const char *)commit->util, last_idnum,
>               (int)(author_end - author), author,
>
> --
> Shawn.
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-11 12:46                 ` Stephen R. van den Berg
@ 2008-06-12  6:51                   ` Ben Lynn
  0 siblings, 0 replies; 37+ messages in thread
From: Ben Lynn @ 2008-06-12  6:51 UTC (permalink / raw)
  To: Stephen R. van den Berg
  Cc: Junio C Hamano, Linus Torvalds, Daniel Barkalow, git

Thanks everyone who replied. You know, I was a bit apprehensive about
posting originally, because I had heard the git ML was inflammable.
But all these responses have been prompt, well-thought-out and
informative.

I plan to take all these suggestions and have a go at hacking a
race-free index which updates successfully despite concurrent file
operations.

Thanks again,
-Ben

On Wed, Jun 11, 2008 at 5:46 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> Ben Lynn wrote:
>>That is problematic. How do I figure out what the filesystem thinks is
>>the current time? I could touch some file and read its mtime, but I
>>want a shortcut.
>
> That basically is as short as it gets, except perhaps statting a freshly
> modified file (not necessarily by oneself).
>
>>Are there any guarantees of any kind? e.g. is the filesystem current
>>time at least never ahead of the system time?
>
> Practically no guarantees, just that it is rather likely for the time at
> the fileserver to progress at about the same pace as yours.
> --
> Sincerely,
>           Stephen R. van den Berg.
>
> Differentiation is an integral part of calculus.
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-12  3:17 ` Shawn O. Pearce
  2008-06-12  6:46   ` Ben Lynn
@ 2008-06-12  7:12   ` Johannes Schindelin
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Schindelin @ 2008-06-12  7:12 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Ben Lynn, git

Hi,

On Wed, 11 Jun 2008, Shawn O. Pearce wrote:

> diff --git a/builtin-fast-export.c b/builtin-fast-export.c
> index 1dfc01e..d0a462f 100644
> --- a/builtin-fast-export.c
> +++ b/builtin-fast-export.c
> @@ -188,6 +188,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
>  	mark_object(&commit->object);
>  	if (!is_encoding_utf8(encoding))
>  		reencoded = reencode_string(message, "UTF-8", encoding);
> +	if (!commit->parents)
> +		printf("reset %s\n", (const char*)commit->util);
>  	printf("commit %s\nmark :%d\n%.*s\n%.*s\ndata %u\n%s",
>  	       (const char *)commit->util, last_idnum,
>  	       (int)(author_end - author), author,

Deemed-obviously-correct-by: Johannes Schindelin <johannes.schindelin@gmx.de>

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 17:44 ` Linus Torvalds
  2008-06-10 18:45   ` Ben Lynn
@ 2008-06-12 20:06   ` Junio C Hamano
  2008-06-13 10:10   ` Jeff King
  2 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2008-06-12 20:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben Lynn, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> The other approach is to know that an empty blob always has a very 
> specific SHA1. Here's an trial patch.

>  read-cache.c |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/read-cache.c b/read-cache.c
> index 8e5fbb6..f83de8c 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> ...
> @@ -193,6 +203,12 @@ static int ce_match_stat_basic(struct cache_entry *ce, struct stat *st)
>  	if (ce->ce_size != (unsigned int) st->st_size)
>  		changed |= DATA_CHANGED;
>  
> +	/* Racily smudged entry? */
> +	if (!ce->ce_size) {
> +		if (!is_empty_blob_sha1(ce->sha1))
> +			changed |= DATA_CHANGED;
> +	}
> +
>  	return changed;
>  }

Thanks.  This would be a good fix to the issue.

The only theoretical worry I can think of is if there is an insane
convert_to_worktree() filter that turns a non-empty blob into an empty
work tree file.

An "In blobs, always store everything as UTF16 with BOM" filter, when
badly implemented, might turn an empty work tree file into a blob with BOM
and nothing else in it, but we can safely declare that such use case is
simply insane and broken ;-).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-10 17:44 ` Linus Torvalds
  2008-06-10 18:45   ` Ben Lynn
  2008-06-12 20:06   ` Junio C Hamano
@ 2008-06-13 10:10   ` Jeff King
  2008-06-13 23:09     ` Junio C Hamano
  2 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-06-13 10:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben Lynn, git

On Tue, Jun 10, 2008 at 10:44:43AM -0700, Linus Torvalds wrote:

> You can fix this multiple ways. One would be to pick another size that is 
> simply less likely (eg ~0 instead), which leaves the theoretical race, and 
> just makes it practically impossible to hit (not that I think it's very 
> practical to hit already).

Hmm. I may have just hit it in the test suite.

Try this:

  cd git/t
  for i in `seq 1 1000`; do
    ./t4126-apply-empty.sh -v -i || break
  done
  echo made it to $i

Most of the time it works, but somewhere in that thousand (generally
within a few hundred), I end up with a failed test. The failure looks
like:

   * expecting success:
           git reset --hard &&
           >empty &&
           rm -f missing &&
           git apply --index patch0 &&
           test_cmp expect empty &&
           git diff --exit-code

   HEAD is now at e3f79fd initial
   error: empty: does not match index
   * FAIL 3: apply --index empty

> The other approach is to know that an empty blob always has a very 
> specific SHA1. Here's an trial patch.

However, I can still trigger the failure with your patch, so I wonder if
it is some other race entirely...

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-13 10:10   ` Jeff King
@ 2008-06-13 23:09     ` Junio C Hamano
  2008-06-14  6:25       ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2008-06-13 23:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Ben Lynn, git

Jeff King <peff@peff.net> writes:

> On Tue, Jun 10, 2008 at 10:44:43AM -0700, Linus Torvalds wrote:
>
>> You can fix this multiple ways. One would be to pick another size that is 
>> simply less likely (eg ~0 instead), which leaves the theoretical race, and 
>> just makes it practically impossible to hit (not that I think it's very 
>> practical to hit already).
>
> Hmm. I may have just hit it in the test suite.
>
> Try this:
>
>   cd git/t
>   for i in `seq 1 1000`; do
>     ./t4126-apply-empty.sh -v -i || break
>   done
>   echo made it to $i
>
> Most of the time it works, but somewhere in that thousand (generally
> within a few hundred), I end up with a failed test.
> ...
> However, I can still trigger the failure with your patch, so I wonder if
> it is some other race entirely...

This is not a race but the test itself is simply buggy.

When telling apply to affect index and work tree at the same time (that is
what --index means), the caller must make sure that the diff-files won't
report stale stat information.

If "git reset --hard" and ">empty" crossed the second boundary, the reset
might have written out an empty "empty" with timestamp T and then the
independent ">empty" would have given a new timestamp T+1.

I think "reset --hard" without ">empty" is enough as the head commit
records an empty blob there.

---

 t/t4126-apply-empty.sh |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/t/t4126-apply-empty.sh b/t/t4126-apply-empty.sh
index 0cfd47c..ceb6a79 100755
--- a/t/t4126-apply-empty.sh
+++ b/t/t4126-apply-empty.sh
@@ -26,7 +26,6 @@ test_expect_success setup '
 
 test_expect_success 'apply empty' '
 	git reset --hard &&
-	>empty &&
 	rm -f missing &&
 	git apply patch0 &&
 	test_cmp expect empty
@@ -34,7 +33,6 @@ test_expect_success 'apply empty' '
 
 test_expect_success 'apply --index empty' '
 	git reset --hard &&
-	>empty &&
 	rm -f missing &&
 	git apply --index patch0 &&
 	test_cmp expect empty &&
@@ -43,7 +41,6 @@ test_expect_success 'apply --index empty' '
 
 test_expect_success 'apply create' '
 	git reset --hard &&
-	>empty &&
 	rm -f missing &&
 	git apply patch1 &&
 	test_cmp expect missing
@@ -51,7 +48,6 @@ test_expect_success 'apply create' '
 
 test_expect_success 'apply --index create' '
 	git reset --hard &&
-	>empty &&
 	rm -f missing &&
 	git apply --index patch1 &&
 	test_cmp expect missing &&

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: git bugs
  2008-06-13 23:09     ` Junio C Hamano
@ 2008-06-14  6:25       ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-06-14  6:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Ben Lynn, git

On Fri, Jun 13, 2008 at 04:09:05PM -0700, Junio C Hamano wrote:

> When telling apply to affect index and work tree at the same time (that is
> what --index means), the caller must make sure that the diff-files won't
> report stale stat information.

Ah, I had assumed that git-apply would notice the stale stat information
and fall back to checking the content, which should be accurate.

> I think "reset --hard" without ">empty" is enough as the head commit
> records an empty blob there.

Yes, it is redundant. With your patch I was not able to reproduce the
problem over a few thousand trials.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* git bugs
@ 2017-02-23 20:27 Sean Hunt
  2017-02-24 16:52 ` Johannes Schindelin
  0 siblings, 1 reply; 37+ messages in thread
From: Sean Hunt @ 2017-02-23 20:27 UTC (permalink / raw)
  To: git

There are a few bugs I git I noticed when using mingw, mingw64,
cygwin, and cygwin64. These bugs are the following:

if I do git ``rebase -i --root`` and tell it to edit every commit to
gpg sign all my commits it bugs out and merges all of the commits into
1 commit instead of only appending the ``-S`` to each and every commit
and keeping all of the commits. It is as if I told it to squash the
commits but yet I did not. There is also another bug where if I clone
a repo on Windows and not on github desktop and that I placed commits
to the repo on github web and then when I rebase to squash the commits
to 1 commit (some repos are doing it as a requirement for 1 commit
PR's) that all of my commits on the remote (fork in this case) that is
linked to an open pull request are discarded and then the pull request
is somehow and oddly closed. It is super annoying.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git bugs
  2017-02-23 20:27 git bugs Sean Hunt
@ 2017-02-24 16:52 ` Johannes Schindelin
  0 siblings, 0 replies; 37+ messages in thread
From: Johannes Schindelin @ 2017-02-24 16:52 UTC (permalink / raw)
  To: Sean Hunt; +Cc: git

Hi,


On Thu, 23 Feb 2017, Sean Hunt wrote:

> There are a few bugs I git I noticed when using mingw, mingw64,
> cygwin, and cygwin64. These bugs are the following:
> 
> if I do git ``rebase -i --root`` and tell it to edit every commit to
> gpg sign all my commits it bugs out and merges all of the commits into
> 1 commit instead of only appending the ``-S`` to each and every commit
> and keeping all of the commits. It is as if I told it to squash the
> commits but yet I did not. There is also another bug where if I clone
> a repo on Windows and not on github desktop and that I placed commits
> to the repo on github web and then when I rebase to squash the commits
> to 1 commit (some repos are doing it as a requirement for 1 commit
> PR's) that all of my commits on the remote (fork in this case) that is
> linked to an open pull request are discarded and then the pull request
> is somehow and oddly closed. It is super annoying.

It appears this was reported as
https://github.com/git-for-windows/git/issues/1072, too.

I asked for clarification there, as I am quite confused by the
description.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-02-24 16:59 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-23 20:27 git bugs Sean Hunt
2017-02-24 16:52 ` Johannes Schindelin
  -- strict thread matches above, loose matches on Subject: below --
2008-06-10  8:41 Ben Lynn
2008-06-10 16:58 ` Daniel Barkalow
2008-06-10 17:44 ` Linus Torvalds
2008-06-10 18:45   ` Ben Lynn
2008-06-10 20:06     ` Linus Torvalds
2008-06-10 23:09       ` Ben Lynn
2008-06-10 23:38         ` Junio C Hamano
2008-06-11  0:02           ` Ben Lynn
2008-06-11  0:20             ` Junio C Hamano
2008-06-11  0:24               ` Ben Lynn
2008-06-11  0:53                 ` Ben Lynn
2008-06-11 12:46                 ` Stephen R. van den Berg
2008-06-12  6:51                   ` Ben Lynn
2008-06-11  1:36             ` Linus Torvalds
2008-06-11  2:04               ` Ben Lynn
2008-06-11  2:12                 ` Linus Torvalds
2008-06-11  2:31                   ` Ben Lynn
2008-06-11  2:39                     ` Linus Torvalds
2008-06-11  5:58                       ` Ben Lynn
2008-06-11  6:18                         ` Ben Lynn
2008-06-11 14:54                           ` Linus Torvalds
2008-06-11 17:52                             ` Ben Lynn
2008-06-11 18:10                               ` Linus Torvalds
2008-06-11 18:48                                 ` Ben Lynn
2008-06-11 18:53                                   ` Linus Torvalds
2008-06-11 20:57                                     ` Ben Lynn
2008-06-11 21:50                                     ` Junio C Hamano
2008-06-11 14:52                         ` Linus Torvalds
2008-06-12 20:06   ` Junio C Hamano
2008-06-13 10:10   ` Jeff King
2008-06-13 23:09     ` Junio C Hamano
2008-06-14  6:25       ` Jeff King
2008-06-12  3:17 ` Shawn O. Pearce
2008-06-12  6:46   ` Ben Lynn
2008-06-12  7:12   ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).