git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Anyone know what is creating commits with bogus dates?
@ 2020-05-21 17:49 Elijah Newren
  2020-05-21 18:12 ` Eric Sunshine
  2020-05-21 18:57 ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Elijah Newren @ 2020-05-21 17:49 UTC (permalink / raw)
  To: Git Mailing List

Hi,

I wanted to report that we seem to have a number of repos in the wild
with bogus (as in "won't even parse") dates.

I first discovered such a repository in the wild a while ago with
rails.git.  It has a commit with a recorded timezone of "+051800" for
both author and committer.  Everything else about the commit looks
fine.  See https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734.

Some google searches at the time turned up a few other examples, all
with the same "+051800" issue.  I put a special workaround for it into
filter-repo because I figured it was slightly prominent but probably
limited to that special timezone.  The fact that it was six digits but
the last two were zeros made it seem not quite as bad as it could be.


Recently, I've gotten a few reports of commits existing in live repos
with other forms of breakage.  Dates that correspond to roughly a
hundred years in the future, and with timezone specifiers that are
either 7 or 8 digits long and no leading or trailing zeros.  See
https://github.com/newren/git-filter-repo/issues/88.

Is there rogue software somewhere creating some bad commit objects?
Does anyone have any idea what might be causing these?  Any
suggestions on the right course(s) of action?

Right now, I'm leaning towards saying fast-import is right to bail on
these, and maybe I'll investigate having filter-repo waste some cycles
to try to detect these and report more helpful errors (though I really
want to avoid full sanity checking as that's expensive).  If I go that
route, I'll probably make it link to the example command I provided in
that bug report showing users how they can use filter-repo to fix
those problems.  Does that sound sane?  Anything I'm missing?


Thanks,
Elijah

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Anyone know what is creating commits with bogus dates?
  2020-05-21 17:49 Anyone know what is creating commits with bogus dates? Elijah Newren
@ 2020-05-21 18:12 ` Eric Sunshine
  2020-05-21 18:57 ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Sunshine @ 2020-05-21 18:12 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On Thu, May 21, 2020 at 1:49 PM Elijah Newren <newren@gmail.com> wrote:
> I first discovered such a repository in the wild a while ago with
> rails.git.  It has a commit with a recorded timezone of "+051800" for
> both author and committer.  Everything else about the commit looks
> fine.  See https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734.
>
> Some google searches at the time turned up a few other examples, all
> with the same "+051800" issue.  I put a special workaround for it into
> filter-repo because I figured it was slightly prominent but probably
> limited to that special timezone.  The fact that it was six digits but
> the last two were zeros made it seem not quite as bad as it could be.

In git.git itself, there were a couple bugs fixed [1,2] in which the
author header lacked a NUL-terminator after the timezone and in which
an error condition wasn't checked which led to the last digit of the
timezone being duplicated, however, if I recall correctly, those only
cropped up with "git rebase -i --root" (and only the author date-stamp
was affected), which doesn't seem to correspond to the cases you're
fixing.

Perhaps those commits were created by tools other than git.git?

[1]: ca3e1826a0 (sequencer: fix "rebase -i --root" corrupting author
header, 2018-07-31)
[2]: 0f16c09aae (sequencer: fix "rebase -i --root" corrupting author
header timezone, 2018-07-31)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Anyone know what is creating commits with bogus dates?
  2020-05-21 17:49 Anyone know what is creating commits with bogus dates? Elijah Newren
  2020-05-21 18:12 ` Eric Sunshine
@ 2020-05-21 18:57 ` Jeff King
  2020-05-21 19:31   ` Elijah Newren
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2020-05-21 18:57 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On Thu, May 21, 2020 at 10:49:17AM -0700, Elijah Newren wrote:

> I wanted to report that we seem to have a number of repos in the wild
> with bogus (as in "won't even parse") dates.
> 
> I first discovered such a repository in the wild a while ago with
> rails.git.  It has a commit with a recorded timezone of "+051800" for
> both author and committer.  Everything else about the commit looks
> fine.  See https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734.
> 
> Some google searches at the time turned up a few other examples, all
> with the same "+051800" issue.  I put a special workaround for it into
> filter-repo because I figured it was slightly prominent but probably
> limited to that special timezone.  The fact that it was six digits but
> the last two were zeros made it seem not quite as bad as it could be.

I can't remember the source of the bug, but we've had a workaround in
GitHub's incoming fsck checks to allow 6-digit zones like this since
August 2011. I'm almost certain that it came up because of that
rails/rails commit, but I don't remember the culprit implementation. I'm
sure we would have dug it up and fixed it at the time.

Sadly our commit message for the fsck tweak gives no further details,
nor can I dig up anything out of issues/etc.

I _think_ it wasn't GitHub/grit which did this (the 0-prefixed tree
modes you might come across are, though). I couldn't find any mention of
the fix there, at least. I'd suspect perhaps libgit2, but I also
couldn't find any fix.

But I think it would be safe to assume the bug is long-since fixed, and
it's nice if you can be a bit more lenient on the parsing for historical
issues like this. Arguably fast-export ought to be normalizing it to
something syntactically correct (just like we probably do with other
unparsable dates), though I guess you could argue that a filter might
want to see the broken form in order to fix it in a custom way.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Anyone know what is creating commits with bogus dates?
  2020-05-21 18:57 ` Jeff King
@ 2020-05-21 19:31   ` Elijah Newren
  2020-05-21 19:55     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Elijah Newren @ 2020-05-21 19:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Thu, May 21, 2020 at 11:57 AM Jeff King <peff@peff.net> wrote:
>
> On Thu, May 21, 2020 at 10:49:17AM -0700, Elijah Newren wrote:
>
> > I wanted to report that we seem to have a number of repos in the wild
> > with bogus (as in "won't even parse") dates.
> >
> > I first discovered such a repository in the wild a while ago with
> > rails.git.  It has a commit with a recorded timezone of "+051800" for
> > both author and committer.  Everything else about the commit looks
> > fine.  See https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734.
> >
> > Some google searches at the time turned up a few other examples, all
> > with the same "+051800" issue.  I put a special workaround for it into
> > filter-repo because I figured it was slightly prominent but probably
> > limited to that special timezone.  The fact that it was six digits but
> > the last two were zeros made it seem not quite as bad as it could be.
>
> I can't remember the source of the bug, but we've had a workaround in
> GitHub's incoming fsck checks to allow 6-digit zones like this since
> August 2011. I'm almost certain that it came up because of that
> rails/rails commit, but I don't remember the culprit implementation. I'm
> sure we would have dug it up and fixed it at the time.

What about 7- and 8- digit timezones (like the ones in the linked
filter-repo issue report)?  Do you currently prevent users from
pushing those to GitHub, or do you allow those too?
I'm curious about whether there is anything else out there that might
help flag these commits or if it's just filter-repo.

> Sadly our commit message for the fsck tweak gives no further details,
> nor can I dig up anything out of issues/etc.
>
> I _think_ it wasn't GitHub/grit which did this (the 0-prefixed tree
> modes you might come across are, though). I couldn't find any mention of
> the fix there, at least. I'd suspect perhaps libgit2, but I also
> couldn't find any fix.
>
> But I think it would be safe to assume the bug is long-since fixed, and
> it's nice if you can be a bit more lenient on the parsing for historical
> issues like this. Arguably fast-export ought to be normalizing it to
> something syntactically correct (just like we probably do with other
> unparsable dates), though I guess you could argue that a filter might
> want to see the broken form in order to fix it in a custom way.

If we're going to be more lenient on the parsing, does that suggest
fast-import shouldn't die on these?  Currently, fast-import is the
thing dying, not fast-export or filter-repo (though filter-repo of
course halts when it notices that fast-import has died under it).

I put in special-case code in filter-repo to munge the +051800
timezone case to keep fast-import from dying, but these new cases seem
to suggest it's not just one bad timezone that I can check for and
correct, but rather that they are completely random 7- or 8- (or who
knows how many) digit timezones coupled with bogus
(century-into-the-future) unix epochs.  I'm a little less comfortable
working around all of these than the very specific +051800 issue.  On
the filter-repo side, I think the most I would want to do here is
provide cleaner warning or error messages than "fast-import died,
here's a traceback."  But I'm unusre if there are other steps we
should take as well, such as making the fast-import parser more
lenient.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Anyone know what is creating commits with bogus dates?
  2020-05-21 19:31   ` Elijah Newren
@ 2020-05-21 19:55     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2020-05-21 19:55 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On Thu, May 21, 2020 at 12:31:36PM -0700, Elijah Newren wrote:

> > I can't remember the source of the bug, but we've had a workaround in
> > GitHub's incoming fsck checks to allow 6-digit zones like this since
> > August 2011. I'm almost certain that it came up because of that
> > rails/rails commit, but I don't remember the culprit implementation. I'm
> > sure we would have dug it up and fixed it at the time.
> 
> What about 7- and 8- digit timezones (like the ones in the linked
> filter-repo issue report)?  Do you currently prevent users from
> pushing those to GitHub, or do you allow those too?
> I'm curious about whether there is anything else out there that might
> help flag these commits or if it's just filter-repo.

Our loosening allows any size:

--- a/fsck.c
+++ b/fsck.c
[...]
@@ -772,14 +778,16 @@ static int fsck_ident(const char **ident,
        if ((end == p || *end != ' '))
                return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
        p = end + 1;
-       if ((*p != '+' && *p != '-') ||
-           !isdigit(p[1]) ||
-           !isdigit(p[2]) ||
-           !isdigit(p[3]) ||
-           !isdigit(p[4]) ||
-           (p[5] != '\n'))
+       if (*p != '+' && *p != '-')
                return report(options, oid, type, FSCK_MSG_BAD_TIMEZONE, "invalid author/committer line - bad time zone");
-       p += 6;
+       p++;
+
+       do {
+               if (!isdigit(*p))
+                       return report(options, oid, type, FSCK_MSG_BAD_TIMEZONE, "invalid author/committer line - bad time zone");
+               p++;
+       } while (*p != '\n');
+

I don't remember the nature of the bug well enough to know if the longer
ones are likely to have the same cause.

> > But I think it would be safe to assume the bug is long-since fixed, and
> > it's nice if you can be a bit more lenient on the parsing for historical
> > issues like this. Arguably fast-export ought to be normalizing it to
> > something syntactically correct (just like we probably do with other
> > unparsable dates), though I guess you could argue that a filter might
> > want to see the broken form in order to fix it in a custom way.
> 
> If we're going to be more lenient on the parsing, does that suggest
> fast-import shouldn't die on these?  Currently, fast-import is the
> thing dying, not fast-export or filter-repo (though filter-repo of
> course halts when it notices that fast-import has died under it).

Ah, I thought filter-repo was noticing. I think it would be nice for
either fast-export or fast-import to normalize syntactically invalid
values to something sane (like just resetting a bogus timezone to
+0000). I could see arguments for putting it in either spot (putting it
in fast-import has the downside that we wouldn't catch invalid output
generated by a script; putting it in fast-export has the downside that
you can't notice and fix it up yourself if you choose to).

Probably it should be in fast-export, but with an option to turn it off
for people who want more control (or leave it off by default, and let
people who run into problems turn it on). We already have similar
options for handling un-exportable cases like signed tags.

> I put in special-case code in filter-repo to munge the +051800
> timezone case to keep fast-import from dying, but these new cases seem
> to suggest it's not just one bad timezone that I can check for and
> correct, but rather that they are completely random 7- or 8- (or who
> knows how many) digit timezones coupled with bogus
> (century-into-the-future) unix epochs.  I'm a little less comfortable
> working around all of these than the very specific +051800 issue.  On
> the filter-repo side, I think the most I would want to do here is
> provide cleaner warning or error messages than "fast-import died,
> here's a traceback."  But I'm unusre if there are other steps we
> should take as well, such as making the fast-import parser more
> lenient.

I think if filter-repo does anything, it would probably be to read any
syntactically invalid timezone and normalize it. But again, I think it
would be fine to push that into fast-export.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-05-21 19:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-21 17:49 Anyone know what is creating commits with bogus dates? Elijah Newren
2020-05-21 18:12 ` Eric Sunshine
2020-05-21 18:57 ` Jeff King
2020-05-21 19:31   ` Elijah Newren
2020-05-21 19:55     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).