git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] fast-import: accept invalid timezones so we can import existing repos
@ 2020-05-28 19:15 Elijah Newren via GitGitGadget
  2020-05-28 19:26 ` Jonathan Nieder
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
  0 siblings, 2 replies; 12+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-05-28 19:15 UTC (permalink / raw)
  To: git; +Cc: peff, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

There are multiple repositories in the wild with random, invalid
timezones.  Most notable is a commit from rails.git with a timezone of
"+051800"[1].  However, a few searches found other repos with that same
invalid timezone.  Further, Peff reports that GitHub relaxed their fsck
checks in August 2011 to accept any timezone value[2], and there have
been multiple reports to filter-repo about fast-import crashing while
trying to import their existing repositories since they had timezone
values such as "-7349423" and "-43455309"[3].

It is not clear what created these invalid timezones, but since git has
permitted their use and worked with these repositories for years at this
point, it seems pointless to make fast-import be the only thing that
disallows them.  Relax the parsing to allow these timezones when using
raw import format; when using --date-format=rfc2822 (which is not the
default), we can continue being more strict about timezones.

[1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
[2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
[3] https://github.com/newren/git-filter-repo/issues/88

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fast-import: accept invalid timezones so we can import existing repos
    
    Note: this is not a fix for a regression, so you can ignore it for 2.27;
    it can sit in pu.
    
    Peff leaned towards normalizing these timezones in fast-export[1], but
    (A) it's not clear what "valid" timezone we should randomly pick and
    regardless of what we pick, it seems it'll be wrong for most cases, (B)
    it would provide yet another way that "git fast-export --all | git
    fast-import" would not preserve the original history, as users sometimes
    expect[2], and (C) that'd prevent users from passing a special callback
    to filter-repo to fix up these values[3].
    
    Since I'm not a fan of picking a random value to reassign these to (in
    either fast-export or fast-import), I went the route of relaxing the
    requirements in fast-import, similar to what Peff reports GitHub did
    about 9 years ago in their incoming fsck checks.
    
    [1] 
    https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
    [2] 
    https://lore.kernel.org/git/CABPp-BFLJ48BZ97Y9mr4i3q7HMqjq18cXMgSYdxqD1cMzH8Spg@mail.gmail.com/
    [3] Example: 
    https://github.com/newren/git-filter-repo/issues/88#issuecomment-629706776

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-795%2Fnewren%2Floosen-fast-import-timezone-parsing-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-795/newren/loosen-fast-import-timezone-parsing-v1
Pull-Request: https://github.com/git/git/pull/795

 fast-import.c          |  7 +++----
 t/t9300-fast-import.sh | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index c98970274c4..4a3c193007d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1915,11 +1915,10 @@ static int validate_raw_date(const char *src, struct strbuf *result)
 {
 	const char *orig_src = src;
 	char *endp;
-	unsigned long num;
 
 	errno = 0;
 
-	num = strtoul(src, &endp, 10);
+	strtoul(src, &endp, 10);
 	/* NEEDSWORK: perhaps check for reasonable values? */
 	if (errno || endp == src || *endp != ' ')
 		return -1;
@@ -1928,8 +1927,8 @@ static int validate_raw_date(const char *src, struct strbuf *result)
 	if (*src != '-' && *src != '+')
 		return -1;
 
-	num = strtoul(src + 1, &endp, 10);
-	if (errno || endp == src + 1 || *endp || 1400 < num)
+	strtoul(src + 1, &endp, 10);
+	if (errno || endp == src + 1 || *endp)
 		return -1;
 
 	strbuf_addstr(result, orig_src);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 768257b29e0..0e798e68476 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -410,6 +410,23 @@ test_expect_success 'B: accept empty committer' '
 	test -z "$out"
 '
 
+test_expect_success 'B: accept invalid timezone' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone
+		git gc
+		git prune" &&
+	git fast-import <input &&
+	git cat-file -p invalid-timezone >out &&
+	grep "1234567890 [+]051800" out
+'
+
 test_expect_success 'B: accept and fixup committer with no name' '
 	cat >input <<-INPUT_END &&
 	commit refs/heads/empty-committer-2

base-commit: 2d5e9f31ac46017895ce6a183467037d29ceb9d3
-- 
gitgitgadget

* Re: [PATCH] fast-import: accept invalid timezones so we can import existing repos
  2020-05-28 19:15 [PATCH] fast-import: accept invalid timezones so we can import existing repos Elijah Newren via GitGitGadget
@ 2020-05-28 19:26 ` Jonathan Nieder
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
  1 sibling, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2020-05-28 19:26 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, peff, Elijah Newren

Hi,

Elijah Newren wrote:

>                  Relax the parsing to allow these timezones when using
> raw import format; when using --date-format=rfc2822 (which is not the
> default), we can continue being more strict about timezones.

There are two different use cases here that we may want to distinguish.

The original motivation for fast-import was for importing a repository
from some non-Git storage system (another VCS, a collection of patches
and tarballs, or whatever).  Such an importer might use
--date-format=raw just because that's simple, to avoid overhead.  In
that case, the strict parsing seems useful for catching bugs in the
importer.

"git filter-repo" is for taking an existing repository and modifying
it.  In this case, it would be *possible* to take an invalid timezone
and normalize it to "whatever 'git log' would show", but that's
overreaching relative to the caller's intent: the caller has a specific
set of history modifications they meant to make, and fixing the time
zone wasn't necessarily one of them.

To that end, would it make sense for this to be a new date-format
(e.g., --date-format=raw-permissive) to avoid regressing behavior in
the original case?

Thanks and hope that helps,
Jonathan

* [PATCH v2] fast-import: add new --date-format=raw-permissive format
  2020-05-28 19:15 [PATCH] fast-import: accept invalid timezones so we can import existing repos Elijah Newren via GitGitGadget
  2020-05-28 19:26 ` Jonathan Nieder
@ 2020-05-28 20:40 ` Elijah Newren via GitGitGadget
  2020-05-28 23:08   ` Junio C Hamano
                     ` (3 more replies)
  1 sibling, 4 replies; 12+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-05-28 20:40 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

There are multiple repositories in the wild with random, invalid
timezones.  Most notable is a commit from rails.git with a timezone of
"+051800"[1].  A few searches will find other repos with that same
invalid timezone as well.  Further, Peff reports that GitHub relaxed
their fsck checks in August 2011 to accept any timezone value[2], and
there have been multiple reports to filter-repo about fast-import
crashing while trying to import their existing repositories since they
had timezone values such as "-7349423" and "-43455309"[3].

The existing check on timezone values inside fast-import may prove
useful for people who are crafting fast-import input by hand or with a
new script.  For them, the check may help them avoid accidentally
recording invalid dates.  (Note that this check is rather simplistic and
there are still several forms of invalid dates that fast-import does not
check for: dates in the future, timezone values with minutes that are
not divisible by 15, and timezone values with minutes that are 60 or
greater.)  While this simple check may have some value for those users,
other users or tools will want to import existing repositories as-is.
Provide a --date-format=raw-permissive format that will not error out on
these otherwise invalid timezones so that such existing repositories can
be imported.
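
To make the limits of the existing check concrete, here is a minimal sketch of
the timezone test in isolation. It is not the code in fast-import.c:
tz_accepted() and tz_plausible() are hypothetical names, and tz_plausible() is
only an illustration of one tighter rule (minutes below 60) that the strict
mode does not apply.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: mirrors only the timezone test discussed above.
 * Expects a sign followed by decimal digits; in strict mode the numeric
 * value must not exceed 1400.  Returns 0 if accepted, -1 if rejected.
 */
static int tz_accepted(const char *tz, int strict)
{
	char *endp;
	unsigned long num;

	if (*tz != '-' && *tz != '+')
		return -1;

	errno = 0;
	num = strtoul(tz + 1, &endp, 10);
	if (errno || endp == tz + 1 || *endp)
		return -1;
	if (strict && 1400 < num)
		return -1;
	return 0;
}

/*
 * Hypothetical tighter rule, not part of git: on top of the strict
 * check, read the value as hhmm and require minutes below 60.
 */
static int tz_plausible(const char *tz)
{
	unsigned long num;

	if (tz_accepted(tz, 1) < 0)
		return -1;
	num = strtoul(tz + 1, NULL, 10);
	if (num % 100 >= 60)
		return -1;
	return 0;
}
```

Under these assumptions "+051800" is rejected in strict mode and accepted in
permissive mode, while "+1399" passes the strict check even though its minutes
field is 99; only the hypothetical tz_plausible() rejects it.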

[1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
[2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
[3] https://github.com/newren/git-filter-repo/issues/88

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fast-import: accept invalid timezones so we can import existing repos
    
    Changes since v1:
    
     * Instead of just allowing such timezones outright, did it behind a
       --date-format=raw-permissive as suggested by Jonathan

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-795%2Fnewren%2Floosen-fast-import-timezone-parsing-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-795/newren/loosen-fast-import-timezone-parsing-v2
Pull-Request: https://github.com/git/git/pull/795

Range-diff vs v1:

 1:  d3a7dbc3892 ! 1:  9580aacdb21 fast-import: accept invalid timezones so we can import existing repos
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    fast-import: accept invalid timezones so we can import existing repos
     +    fast-import: add new --date-format=raw-permissive format
      
          There are multiple repositories in the wild with random, invalid
          timezones.  Most notable is a commit from rails.git with a timezone of
     -    "+051800"[1].  However, a few searches found other repos with that same
     -    invalid timezone.  Further, Peff reports that GitHub relaxed their fsck
     -    checks in August 2011 to accept any timezone value[2], and there have
     -    been multiple reports to filter-repo about fast-import crashing while
     -    trying to import their existing repositories since they had timezone
     -    values such as "-7349423" and "-43455309"[3].
     +    "+051800"[1].  A few searches will find other repos with that same
     +    invalid timezone as well.  Further, Peff reports that GitHub relaxed
     +    their fsck checks in August 2011 to accept any timezone value[2], and
     +    there have been multiple reports to filter-repo about fast-import
     +    crashing while trying to import their existing repositories since they
     +    had timezone values such as "-7349423" and "-43455309"[3].
      
     -    It is not clear what created these invalid timezones, but since git has
     -    permitted their use and worked with these repositories for years at this
     -    point, it seems pointless to make fast-import be the only thing that
     -    disallows them.  Relax the parsing to allow these timezones when using
     -    raw import format; when using --date-format=rfc2822 (which is not the
     -    default), we can continue being more strict about timezones.
     +    The existing check on timezone values inside fast-import may prove
     +    useful for people who are crafting fast-import input by hand or with a
     +    new script.  For them, the check may help them avoid accidentally
     +    recording invalid dates.  (Note that this check is rather simplistic and
     +    there are still several forms of invalid dates that fast-import does not
     +    check for: dates in the future, timezone values with minutes that are
     +    not divisible by 15, and timezone values with minutes that are 60 or
     +    greater.)  While this simple check may have some value for those users,
     +    other users or tools will want to import existing repositories as-is.
     +    Provide a --date-format=raw-permissive format that will not error out on
     +    these otherwise invalid timezones so that such existing repositories can
     +    be imported.
      
          [1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
          [2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
     @@ Commit message
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## fast-import.c ##
     -@@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
     +@@ fast-import.c: struct hash_list {
     + 
     + typedef enum {
     + 	WHENSPEC_RAW = 1,
     ++	WHENSPEC_RAW_PERMISSIVE,
     + 	WHENSPEC_RFC2822,
     + 	WHENSPEC_NOW
     + } whenspec_type;
     +@@ fast-import.c: static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
     + 	return 1;
     + }
     + 
     +-static int validate_raw_date(const char *src, struct strbuf *result)
     ++static int validate_raw_date(const char *src, struct strbuf *result, int strict)
       {
       	const char *orig_src = src;
       	char *endp;
     --	unsigned long num;
     + 	unsigned long num;
     ++	int out_of_range_timezone;
       
       	errno = 0;
       
     --	num = strtoul(src, &endp, 10);
     -+	strtoul(src, &endp, 10);
     - 	/* NEEDSWORK: perhaps check for reasonable values? */
     - 	if (errno || endp == src || *endp != ' ')
     - 		return -1;
      @@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
     - 	if (*src != '-' && *src != '+')
       		return -1;
       
     --	num = strtoul(src + 1, &endp, 10);
     + 	num = strtoul(src + 1, &endp, 10);
      -	if (errno || endp == src + 1 || *endp || 1400 < num)
     -+	strtoul(src + 1, &endp, 10);
     -+	if (errno || endp == src + 1 || *endp)
     ++	out_of_range_timezone = strict && (1400 < num);
     ++	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
       		return -1;
       
       	strbuf_addstr(result, orig_src);
     +@@ fast-import.c: static char *parse_ident(const char *buf)
     + 
     + 	switch (whenspec) {
     + 	case WHENSPEC_RAW:
     +-		if (validate_raw_date(ltgt, &ident) < 0)
     ++		if (validate_raw_date(ltgt, &ident, 1) < 0)
     ++			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
     ++		break;
     ++	case WHENSPEC_RAW_PERMISSIVE:
     ++		if (validate_raw_date(ltgt, &ident, 0) < 0)
     + 			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
     + 		break;
     + 	case WHENSPEC_RFC2822:
     +@@ fast-import.c: static void option_date_format(const char *fmt)
     + {
     + 	if (!strcmp(fmt, "raw"))
     + 		whenspec = WHENSPEC_RAW;
     ++	else if (!strcmp(fmt, "raw-permissive"))
     ++		whenspec = WHENSPEC_RAW_PERMISSIVE;
     + 	else if (!strcmp(fmt, "rfc2822"))
     + 		whenspec = WHENSPEC_RFC2822;
     + 	else if (!strcmp(fmt, "now"))
      
       ## t/t9300-fast-import.sh ##
      @@ t/t9300-fast-import.sh: test_expect_success 'B: accept empty committer' '
       	test -z "$out"
       '
       
     -+test_expect_success 'B: accept invalid timezone' '
     ++test_expect_success 'B: reject invalid timezone' '
     ++	cat >input <<-INPUT_END &&
     ++	commit refs/heads/invalid-timezone
     ++	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
     ++	data <<COMMIT
     ++	empty commit
     ++	COMMIT
     ++	INPUT_END
     ++
     ++	test_when_finished "git update-ref -d refs/heads/invalid-timezone" &&
     ++	test_must_fail git fast-import <input
     ++'
     ++
     ++test_expect_success 'B: accept invalid timezone with raw-permissive' '
      +	cat >input <<-INPUT_END &&
      +	commit refs/heads/invalid-timezone
      +	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
     @@ t/t9300-fast-import.sh: test_expect_success 'B: accept empty committer' '
      +	test_when_finished "git update-ref -d refs/heads/invalid-timezone
      +		git gc
      +		git prune" &&
     -+	git fast-import <input &&
     ++	git fast-import --date-format=raw-permissive <input &&
      +	git cat-file -p invalid-timezone >out &&
      +	grep "1234567890 [+]051800" out
      +'


 fast-import.c          | 15 ++++++++++++---
 t/t9300-fast-import.sh | 30 ++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index c98970274c4..74d7298bc5a 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -139,6 +139,7 @@ struct hash_list {
 
 typedef enum {
 	WHENSPEC_RAW = 1,
+	WHENSPEC_RAW_PERMISSIVE,
 	WHENSPEC_RFC2822,
 	WHENSPEC_NOW
 } whenspec_type;
@@ -1911,11 +1912,12 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 	return 1;
 }
 
-static int validate_raw_date(const char *src, struct strbuf *result)
+static int validate_raw_date(const char *src, struct strbuf *result, int strict)
 {
 	const char *orig_src = src;
 	char *endp;
 	unsigned long num;
+	int out_of_range_timezone;
 
 	errno = 0;
 
@@ -1929,7 +1931,8 @@ static int validate_raw_date(const char *src, struct strbuf *result)
 		return -1;
 
 	num = strtoul(src + 1, &endp, 10);
-	if (errno || endp == src + 1 || *endp || 1400 < num)
+	out_of_range_timezone = strict && (1400 < num);
+	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
 		return -1;
 
 	strbuf_addstr(result, orig_src);
@@ -1963,7 +1966,11 @@ static char *parse_ident(const char *buf)
 
 	switch (whenspec) {
 	case WHENSPEC_RAW:
-		if (validate_raw_date(ltgt, &ident) < 0)
+		if (validate_raw_date(ltgt, &ident, 1) < 0)
+			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
+		break;
+	case WHENSPEC_RAW_PERMISSIVE:
+		if (validate_raw_date(ltgt, &ident, 0) < 0)
 			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
 		break;
 	case WHENSPEC_RFC2822:
@@ -3258,6 +3265,8 @@ static void option_date_format(const char *fmt)
 {
 	if (!strcmp(fmt, "raw"))
 		whenspec = WHENSPEC_RAW;
+	else if (!strcmp(fmt, "raw-permissive"))
+		whenspec = WHENSPEC_RAW_PERMISSIVE;
 	else if (!strcmp(fmt, "rfc2822"))
 		whenspec = WHENSPEC_RFC2822;
 	else if (!strcmp(fmt, "now"))
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 768257b29e0..14e9baa22db 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -410,6 +410,36 @@ test_expect_success 'B: accept empty committer' '
 	test -z "$out"
 '
 
+test_expect_success 'B: reject invalid timezone' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone" &&
+	test_must_fail git fast-import <input
+'
+
+test_expect_success 'B: accept invalid timezone with raw-permissive' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone
+		git gc
+		git prune" &&
+	git fast-import --date-format=raw-permissive <input &&
+	git cat-file -p invalid-timezone >out &&
+	grep "1234567890 [+]051800" out
+'
+
 test_expect_success 'B: accept and fixup committer with no name' '
 	cat >input <<-INPUT_END &&
 	commit refs/heads/empty-committer-2

base-commit: 2d5e9f31ac46017895ce6a183467037d29ceb9d3
-- 
gitgitgadget

* Re: [PATCH v2] fast-import: add new --date-format=raw-permissive format
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
@ 2020-05-28 23:08   ` Junio C Hamano
  2020-05-29  0:20   ` Jonathan Nieder
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2020-05-28 23:08 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, peff, jrnieder, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Provide a --date-format=raw-permissive format that will not error out on
> these otherwise invalid timezones so that such existing repositories can
> be imported.

So the idea is to just propagate whatever raw timestamp plus
timezone we read from the input stream, which most likely has been
copied from a (broken) raw commit object, to a rewritten commit?

Makes sense.  Will queue.

Thanks.

* Re: [PATCH v2] fast-import: add new --date-format=raw-permissive format
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
  2020-05-28 23:08   ` Junio C Hamano
@ 2020-05-29  0:20   ` Jonathan Nieder
  2020-05-29  6:13   ` Jeff King
  2020-05-30 20:25   ` [PATCH v3] " Elijah Newren via GitGitGadget
  3 siblings, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2020-05-29  0:20 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, peff, Elijah Newren

Elijah Newren wrote:

> From: Elijah Newren <newren@gmail.com>
>
> There are multiple repositories in the wild with random, invalid
> timezones.  Most notable is a commit from rails.git with a timezone of
> "+051800"[1].  A few searches will find other repos with that same
[...]
> Provide a --date-format=raw-permissive format that will not error out on
> these otherwise invalid timezones so that such existing repositories can
> be imported.
>
> [1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
> [2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
> [3] https://github.com/newren/git-filter-repo/issues/88

This could potentially go in a Bug: footer.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  fast-import.c          | 15 ++++++++++++---
>  t/t9300-fast-import.sh | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 3 deletions(-)

Makes sense.

Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>

Thanks.

* Re: [PATCH v2] fast-import: add new --date-format=raw-permissive format
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
  2020-05-28 23:08   ` Junio C Hamano
  2020-05-29  0:20   ` Jonathan Nieder
@ 2020-05-29  6:13   ` Jeff King
  2020-05-29 17:19     ` Junio C Hamano
  2020-05-30 20:25   ` [PATCH v3] " Elijah Newren via GitGitGadget
  3 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2020-05-29  6:13 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, jrnieder, Elijah Newren

On Thu, May 28, 2020 at 08:40:37PM +0000, Elijah Newren via GitGitGadget wrote:

>      * Instead of just allowing such timezones outright, did it behind a
>        --date-format=raw-permissive as suggested by Jonathan

Thanks, I like the safety this gives us against importer bugs. It does
mean that people doing "export | filter | import" may have to manually
loosen it, but it should be rare enough that this isn't a big burden
(and if they're rewriting anyway, maybe it gives them a chance to decide
how to fix it up).

>  fast-import.c          | 15 ++++++++++++---
>  t/t9300-fast-import.sh | 30 ++++++++++++++++++++++++++++++

Would we need a documentation update for "raw-permissive", too?

> @@ -1929,7 +1931,8 @@ static int validate_raw_date(const char *src, struct strbuf *result)
>  		return -1;
>  
>  	num = strtoul(src + 1, &endp, 10);
> -	if (errno || endp == src + 1 || *endp || 1400 < num)
> +	out_of_range_timezone = strict && (1400 < num);
> +	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
>  		return -1;

It's a little funny to do computations on the result of a function
before we've checked whether it produced an error. But since the result
is just an integer, I don't think we'd do anything unexpected (we might
just throw away the value, though I imagine an optimizing compiler might
evaluate it lazily anyway).

> @@ -1963,7 +1966,11 @@ static char *parse_ident(const char *buf)
>  
>  	switch (whenspec) {
>  	case WHENSPEC_RAW:
> -		if (validate_raw_date(ltgt, &ident) < 0)
> +		if (validate_raw_date(ltgt, &ident, 1) < 0)
> +			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
> +		break;
> +	case WHENSPEC_RAW_PERMISSIVE:
> +		if (validate_raw_date(ltgt, &ident, 0) < 0)
>  			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
>  		break;

This looks simple enough. We have the bogus date in a buffer at that
point, and we stuff that string literally into the commit (or tag)
object. If we ever get more clever about storing the timestamp
internally as an integer, we may need to get more clever. But your new
test should alert us if that becomes the case.
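
As a rough illustration of why the literal copy matters, the sketch below
parses a timezone into an integer and prints it back with a four-digit,
zero-padded field. reformat_tz() is a hypothetical helper, not anything in
fast-import.c, and the "%c%04lu" output format is an assumption about how such
a round trip might be written.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical round trip: parse "+hhmm"-style text into an integer and
 * format it again.  Returns 0 on success, -1 on a parse failure or if
 * the output buffer is too small.
 */
static int reformat_tz(const char *tz, char *out, size_t out_len)
{
	char *endp;
	unsigned long num;
	int n;

	if (*tz != '-' && *tz != '+')
		return -1;
	num = strtoul(tz + 1, &endp, 10);
	if (endp == tz + 1 || *endp)
		return -1;

	/* sign, then the value zero-padded to at least four digits */
	n = snprintf(out, out_len, "%c%04lu", *tz, num);
	if (n < 0 || (size_t)n >= out_len)
		return -1;
	return 0;
}
```

With this sketch "+0530" survives unchanged, but "+051800" comes back as
"+51800": the leading zero is lost, so the rewritten commit would no longer
match the original bytes.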

> +test_expect_success 'B: accept invalid timezone with raw-permissive' '
> +	cat >input <<-INPUT_END &&
> +	commit refs/heads/invalid-timezone
> +	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
> +	data <<COMMIT
> +	empty commit
> +	COMMIT
> +	INPUT_END
> +
> +	test_when_finished "git update-ref -d refs/heads/invalid-timezone
> +		git gc
> +		git prune" &&

We check the exit code of when_finished commands, so this should be
&&-chained as usual. And possibly indented like:

  test_when_finished "
    git update-ref -d refs/heads/invalid-timezone &&
    git gc &&
    git prune
  " &&

but I see this all is copying another nearby test. So I'm OK with
keeping it consistent for now and cleaning up the whole thing later.
Though if you want to do that step now, I have no objection. :)

I also suspect this "gc" is not useful these days. For a small
input like this, fast-import will write loose objects, since d9545c7f46
(fast-import: implement unpack limit, 2016-04-25).

TBH, I think it would be easier to understand as:

  ...
  git init invalid-timezone &&
  git -C invalid-timezone fast-import <input &&
  git -C invalid-timezone cat-file -p master >out &&
  ...

and don't bother with the when_finished at all. Then you don't have to
wonder whether the cleanup was sufficient, and it's fewer processes
to boot (we'll leave the sub-repo cruft sitting around, but a single "rm
-rf" at the end of the test script will wipe them all out).

-Peff

* Re: [PATCH v2] fast-import: add new --date-format=raw-permissive format
  2020-05-29  6:13   ` Jeff King
@ 2020-05-29 17:19     ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2020-05-29 17:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren via GitGitGadget, git, jrnieder, Elijah Newren

Jeff King <peff@peff.net> writes:

> On Thu, May 28, 2020 at 08:40:37PM +0000, Elijah Newren via GitGitGadget wrote:
>
>>      * Instead of just allowing such timezones outright, did it behind a
>>        --date-format=raw-permissive as suggested by Jonathan
>
> Thanks, I like the safety this gives us against importer bugs. It does
> mean that people doing "export | filter | import" may have to manually
> loosen it, but it should be rare enough that this isn't a big burden
> (and if they're rewriting anyway, maybe it gives them a chance to decide
> how to fix it up).
>
>>  fast-import.c          | 15 ++++++++++++---
>>  t/t9300-fast-import.sh | 30 ++++++++++++++++++++++++++++++
>
> Would we need a documentation update for "raw-permissive", too?

Good eyes.  I somehow thought this was an internal option but it
does need to be known by end-users who use the fast-import tool.

>> @@ -1929,7 +1931,8 @@ static int validate_raw_date(const char *src, struct strbuf *result)
>>  		return -1;
>>  
>>  	num = strtoul(src + 1, &endp, 10);
>> -	if (errno || endp == src + 1 || *endp || 1400 < num)
>> +	out_of_range_timezone = strict && (1400 < num);
>> +	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
>>  		return -1;
>
> It's a little funny to do computations on the result of a function
> before we've checked whether it produced an error. But since the result
> is just an integer, I don't think we'd do anything unexpected (we might
> just throw away the value, though I imagine an optimizing compiler might
> evaluate it lazily anyway).

True, but if it is easy to make it kosher, we should.

	if (errno || endp == src + 1 || *endp || /* did not parse */
	    (strict && (1400 < num)) /* parsed a broken timezone */
	   )

perhaps?
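
For reference, a self-contained sketch of how the whole function might read
with that condition folded in. This is not the code that was queued: the name
validate_raw_date_sketch() is hypothetical, a plain char buffer stands in for
git's struct strbuf, and the "src = endp + 1" step between the two hunks is
assumed rather than shown in the diff.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for validate_raw_date(): checks "<seconds> <tz>"
 * and copies the original text into result on success.
 * Returns 0 if the date is accepted, -1 otherwise.
 */
static int validate_raw_date_sketch(const char *src, char *result,
				    size_t result_len, int strict)
{
	const char *orig_src = src;
	char *endp;
	unsigned long num;

	errno = 0;

	/* seconds since the epoch, followed by a single space */
	(void)strtoul(src, &endp, 10);
	if (errno || endp == src || *endp != ' ')
		return -1;

	/* assumed step: move past the space to the timezone sign */
	src = endp + 1;
	if (*src != '-' && *src != '+')
		return -1;

	num = strtoul(src + 1, &endp, 10);
	if (errno || endp == src + 1 || *endp || /* did not parse */
	    (strict && (1400 < num)))            /* parsed a broken timezone */
		return -1;

	/* keep the caller's text byte for byte */
	if (strlen(orig_src) >= result_len)
		return -1;
	strcpy(result, orig_src);
	return 0;
}
```

Because || short-circuits, num is only compared against 1400 once the parse
checks have passed, which is the ordering being asked for here.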

>> +test_expect_success 'B: accept invalid timezone with raw-permissive' '
>> +	cat >input <<-INPUT_END &&
>> +	commit refs/heads/invalid-timezone
>> +	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
>> +	data <<COMMIT
>> +	empty commit
>> +	COMMIT
>> +	INPUT_END
>> +
>> +	test_when_finished "git update-ref -d refs/heads/invalid-timezone
>> +		git gc
>> +		git prune" &&
>
> We check the exit code of when_finished commands, so this should be
> &&-chained as usual. And possibly indented like:
>
>   test_when_finished "
>     git update-ref -d refs/heads/invalid-timezone &&
>     git gc &&
>     git prune
>   " &&
>
> but I see this all is copying another nearby test. So I'm OK with
> keeping it consistent for now and cleaning up the whole thing later.
> Though if you want to do that step now, I have no objection. :)

Sure.  Another alternative is to make it right/modern for only this
added test---that makes the inconsistency stand out and gives
incentive to others for cleaning up after the dust settles when the
patch gets merged to the 'master' branch.

> I also suspect this "gc" is not useful these days. For a small
> input like this, fast-import will write loose objects, since d9545c7f46
> (fast-import: implement unpack limit, 2016-04-25).
>
> TBH, I think it would be easier to understand as:
>
>   ...
>   git init invalid-timezone &&
>   git -C invalid-timezone fast-import <input &&
>   git -C invalid-timezone cat-file -p master >out &&
>   ...
>
> and don't bother with the when_finished at all. Then you don't have to
> wonder whether the cleanup was sufficient, and it's fewer processes
> to boot (we'll leave the sub-repo cruft sitting around, but a single "rm
> -rf" at the end of the test script will wipe them all out).

Nice.


* [PATCH v3] fast-import: add new --date-format=raw-permissive format
  2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-05-29  6:13   ` Jeff King
@ 2020-05-30 20:25   ` Elijah Newren via GitGitGadget
  2020-05-30 23:13     ` Jeff King
  2021-02-03 11:57     ` Why does fast-import need to check the validity of idents? + Other ident adventures Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 12+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-05-30 20:25 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

There are multiple repositories in the wild with random, invalid
timezones.  Most notable is a commit from rails.git with a timezone of
"+051800"[1].  A few searches will find other repos with that same
invalid timezone as well.  Further, Peff reports that GitHub relaxed
their fsck checks in August 2011 to accept any timezone value[2], and
there have been multiple reports to filter-repo about fast-import
crashing while trying to import their existing repositories since they
had timezone values such as "-7349423" and "-43455309"[3].

The existing check on timezone values inside fast-import may prove
useful for people who are crafting fast-import input by hand or with a
new script.  For them, the check may help them avoid accidentally
recording invalid dates.  (Note that this check is rather simplistic and
there are still several forms of invalid dates that fast-import does not
check for: dates in the future, timezone values with minutes that are
not divisible by 15, and timezone values with minutes that are 60 or
greater.)  While this simple check may have some value for those users,
other users or tools will want to import existing repositories as-is.
Provide a --date-format=raw-permissive format that will not error out on
these otherwise invalid timezones so that such existing repositories can
be imported.

[1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
[2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
[3] https://github.com/newren/git-filter-repo/issues/88

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fast-import: accept invalid timezones so we can import existing repos
    
    Changes since v2:
    
     * Add documentation
     * Note the fact that the "strict" method really isn't all that strict
       with some NEEDSWORK comments
     * Check for parsed as unsigned before checking that value range makes
       sense
     * Simplify the testcase as suggested by Peff, leaving it to stick out a
       bit like a sore thumb from the rest of the tests in the same file
       (#leftoverbits)

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-795%2Fnewren%2Floosen-fast-import-timezone-parsing-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-795/newren/loosen-fast-import-timezone-parsing-v3
Pull-Request: https://github.com/git/git/pull/795

Range-diff vs v2:

 1:  9580aacdb21 ! 1:  48326d16dbd fast-import: add new --date-format=raw-permissive format
     @@ Commit message
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     + ## Documentation/git-fast-import.txt ##
     +@@ Documentation/git-fast-import.txt: by users who are located in the same location and time zone.  In this
     + case a reasonable offset from UTC could be assumed.
     + +
     + Unlike the `rfc2822` format, this format is very strict.  Any
     +-variation in formatting will cause fast-import to reject the value.
     ++variation in formatting will cause fast-import to reject the value,
     ++and some sanity checks on the numeric values may also be performed.
     ++
     ++`raw-permissive`::
     ++	This is the same as `raw` except that no sanity checks on
     ++	the numeric epoch and local offset are performed.  This can
     ++	be useful when trying to filter or import an existing history
     ++	with e.g. bogus timezone values.
     + 
     + `rfc2822`::
     + 	This is the standard email format as described by RFC 2822.
     +
       ## fast-import.c ##
      @@ fast-import.c: struct hash_list {
       
     @@ fast-import.c: static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax
       {
       	const char *orig_src = src;
       	char *endp;
     - 	unsigned long num;
     -+	int out_of_range_timezone;
     - 
     +@@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
       	errno = 0;
       
     + 	num = strtoul(src, &endp, 10);
     +-	/* NEEDSWORK: perhaps check for reasonable values? */
     ++	/*
     ++	 * NEEDSWORK: perhaps check for reasonable values? For example, we
     ++	 *            could error on values representing times more than a
     ++	 *            day in the future.
     ++	 */
     + 	if (errno || endp == src || *endp != ' ')
     + 		return -1;
     + 
      @@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
       		return -1;
       
       	num = strtoul(src + 1, &endp, 10);
      -	if (errno || endp == src + 1 || *endp || 1400 < num)
     -+	out_of_range_timezone = strict && (1400 < num);
     -+	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
     ++	/*
     ++	 * NEEDSWORK: check for brokenness other than num > 1400, such as
     ++	 *            (num % 100) >= 60, or ((num % 100) % 15) != 0 ?
     ++	 */
     ++	if (errno || endp == src + 1 || *endp || /* did not parse */
     ++	    (strict && (1400 < num))             /* parsed a broken timezone */
     ++	   )
       		return -1;
       
       	strbuf_addstr(result, orig_src);
     @@ t/t9300-fast-import.sh: test_expect_success 'B: accept empty committer' '
      +	COMMIT
      +	INPUT_END
      +
     -+	test_when_finished "git update-ref -d refs/heads/invalid-timezone
     -+		git gc
     -+		git prune" &&
     -+	git fast-import --date-format=raw-permissive <input &&
     -+	git cat-file -p invalid-timezone >out &&
     ++	git init invalid-timezone &&
     ++	git -C invalid-timezone fast-import --date-format=raw-permissive <input &&
     ++	git -C invalid-timezone cat-file -p invalid-timezone >out &&
      +	grep "1234567890 [+]051800" out
      +'
      +


 Documentation/git-fast-import.txt |  9 ++++++++-
 fast-import.c                     | 25 +++++++++++++++++++++----
 t/t9300-fast-import.sh            | 28 ++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 77c6b3d0019..7d9aad2a7e1 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -293,7 +293,14 @@ by users who are located in the same location and time zone.  In this
 case a reasonable offset from UTC could be assumed.
 +
 Unlike the `rfc2822` format, this format is very strict.  Any
-variation in formatting will cause fast-import to reject the value.
+variation in formatting will cause fast-import to reject the value,
+and some sanity checks on the numeric values may also be performed.
+
+`raw-permissive`::
+	This is the same as `raw` except that no sanity checks on
+	the numeric epoch and local offset are performed.  This can
+	be useful when trying to filter or import an existing history
+	with e.g. bogus timezone values.
 
 `rfc2822`::
 	This is the standard email format as described by RFC 2822.
diff --git a/fast-import.c b/fast-import.c
index c98970274c4..0dfa14dc8c3 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -139,6 +139,7 @@ struct hash_list {
 
 typedef enum {
 	WHENSPEC_RAW = 1,
+	WHENSPEC_RAW_PERMISSIVE,
 	WHENSPEC_RFC2822,
 	WHENSPEC_NOW
 } whenspec_type;
@@ -1911,7 +1912,7 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 	return 1;
 }
 
-static int validate_raw_date(const char *src, struct strbuf *result)
+static int validate_raw_date(const char *src, struct strbuf *result, int strict)
 {
 	const char *orig_src = src;
 	char *endp;
@@ -1920,7 +1921,11 @@ static int validate_raw_date(const char *src, struct strbuf *result)
 	errno = 0;
 
 	num = strtoul(src, &endp, 10);
-	/* NEEDSWORK: perhaps check for reasonable values? */
+	/*
+	 * NEEDSWORK: perhaps check for reasonable values? For example, we
+	 *            could error on values representing times more than a
+	 *            day in the future.
+	 */
 	if (errno || endp == src || *endp != ' ')
 		return -1;
 
@@ -1929,7 +1934,13 @@ static int validate_raw_date(const char *src, struct strbuf *result)
 		return -1;
 
 	num = strtoul(src + 1, &endp, 10);
-	if (errno || endp == src + 1 || *endp || 1400 < num)
+	/*
+	 * NEEDSWORK: check for brokenness other than num > 1400, such as
+	 *            (num % 100) >= 60, or ((num % 100) % 15) != 0 ?
+	 */
+	if (errno || endp == src + 1 || *endp || /* did not parse */
+	    (strict && (1400 < num))             /* parsed a broken timezone */
+	   )
 		return -1;
 
 	strbuf_addstr(result, orig_src);
@@ -1963,7 +1974,11 @@ static char *parse_ident(const char *buf)
 
 	switch (whenspec) {
 	case WHENSPEC_RAW:
-		if (validate_raw_date(ltgt, &ident) < 0)
+		if (validate_raw_date(ltgt, &ident, 1) < 0)
+			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
+		break;
+	case WHENSPEC_RAW_PERMISSIVE:
+		if (validate_raw_date(ltgt, &ident, 0) < 0)
 			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
 		break;
 	case WHENSPEC_RFC2822:
@@ -3258,6 +3273,8 @@ static void option_date_format(const char *fmt)
 {
 	if (!strcmp(fmt, "raw"))
 		whenspec = WHENSPEC_RAW;
+	else if (!strcmp(fmt, "raw-permissive"))
+		whenspec = WHENSPEC_RAW_PERMISSIVE;
 	else if (!strcmp(fmt, "rfc2822"))
 		whenspec = WHENSPEC_RFC2822;
 	else if (!strcmp(fmt, "now"))
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 768257b29e0..e151df81c06 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -410,6 +410,34 @@ test_expect_success 'B: accept empty committer' '
 	test -z "$out"
 '
 
+test_expect_success 'B: reject invalid timezone' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone" &&
+	test_must_fail git fast-import <input
+'
+
+test_expect_success 'B: accept invalid timezone with raw-permissive' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	git init invalid-timezone &&
+	git -C invalid-timezone fast-import --date-format=raw-permissive <input &&
+	git -C invalid-timezone cat-file -p invalid-timezone >out &&
+	grep "1234567890 [+]051800" out
+'
+
 test_expect_success 'B: accept and fixup committer with no name' '
 	cat >input <<-INPUT_END &&
 	commit refs/heads/empty-committer-2

base-commit: 2d5e9f31ac46017895ce6a183467037d29ceb9d3
-- 
gitgitgadget


* Re: [PATCH v3] fast-import: add new --date-format=raw-permissive format
  2020-05-30 20:25   ` [PATCH v3] " Elijah Newren via GitGitGadget
@ 2020-05-30 23:13     ` Jeff King
  2021-02-03 11:57     ` Why does fast-import need to check the validity of idents? + Other ident adventures Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2020-05-30 23:13 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, jrnieder, Elijah Newren

On Sat, May 30, 2020 at 08:25:57PM +0000, Elijah Newren via GitGitGadget wrote:

>     fast-import: accept invalid timezones so we can import existing repos
>     
>     Changes since v2:

Thanks, this version looks good to me.

-Peff


* Why does fast-import need to check the validity of idents? + Other ident adventures
  2020-05-30 20:25   ` [PATCH v3] " Elijah Newren via GitGitGadget
  2020-05-30 23:13     ` Jeff King
@ 2021-02-03 11:57     ` Ævar Arnfjörð Bjarmason
  2021-02-03 19:20       ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-03 11:57 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, peff, jrnieder, Elijah Newren, Junio C Hamano


[Originally sent 5 days ago, but seems to have been a victim of the
vger.kernel.org problems at the time, re-sending]

On Sat, May 30 2020, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>

The full e-mail (snipped here) is in the archive:
https://lore.kernel.org/git/pull.795.v3.git.git.1590870357549.gitgitgadget@gmail.com/

> There are multiple repositories in the wild with random, invalid
> timezones.  Most notably is a commit from rails.git with a timezone of
> "+051800"[1].  A few searches will find other repos with that same
> invalid timezone as well.  Further, Peff reports that GitHub relaxed
> their fsck checks in August 2011 to accept any timezone value[2], and
> there have been multiple reports to filter-repo about fast-import
> crashing while trying to import their existing repositories since they
> had timezone values such as "-7349423" and "-43455309"[3].

I've been looking at some of our duplicate logic here after my mktag
series where we now use fsck validation. It had a hardcoded "1400"
offset value, which I see fast-import.c still has.
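
For reference, that hardcoded limit can be looked at in isolation. The
following is only a minimal standalone sketch of the offset check as it
appears in the patch, not the actual fast-import.c code: the helper name
check_tz_offset() is hypothetical, and it handles only the sign-plus-digits
part of a raw date.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical standalone helper (not git's validate_raw_date()):
 * check the "+HHMM"-style offset of a raw date.
 * Returns 0 if the offset is accepted, -1 if it is rejected.
 */
static int check_tz_offset(const char *tz, int strict)
{
	char *endp;
	unsigned long num;

	/* The offset must start with an explicit sign. */
	if (*tz != '+' && *tz != '-')
		return -1;

	errno = 0;
	num = strtoul(tz + 1, &endp, 10);
	if (errno || endp == tz + 1 || *endp)
		return -1;	/* did not parse */
	if (strict && 1400 < num)
		return -1;	/* parsed, but larger than the hardcoded limit */
	return 0;
}
```

With strict set, "+051800" is rejected because 51800 exceeds 1400; with
strict cleared, the same value is accepted, which is the behaviour the
raw-permissive format is meant to provide.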

Then in mailmap.c we have parse_name_and_email(), there's
split_ident_line() in ident.c, and of course fsck_ident().  There's also
record_person_from_buf() in fmt-merge-msg.c, and copy_name() and
copy_email() in ref-filter.c. Maybe handle_from() in mailinfo.c also
counts. Anyway, aside from the last, these are all parsers for
"author/committer" lines in commits one way or another.

But I was wondering about fast-import.c in particular. I think Elijah's
patch here is obviously a good incremental improvement. But stepping
back a bit: who cares about sort-of-fsck validation in fast-import.c
anyway?

Shouldn't it just pretty much be importing data as-is, and then we could
document "if you don't trust it, run fsck afterwards"?

Or, if it's a use-case people actually care about, then I might see
about unifying some of these parser functions as part of a series I'm
preparing.


* Re: Why does fast-import need to check the validity of idents? + Other ident adventures
  2021-02-03 11:57     ` Why does fast-import need to check the validity of idents? + Other ident adventures Ævar Arnfjörð Bjarmason
@ 2021-02-03 19:20       ` Junio C Hamano
  2021-02-05 15:25         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2021-02-03 19:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, git, peff, jrnieder,
	Elijah Newren

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But I was wondering about fast-import.c in particular. I think Elijah's
> patch here is obviously a good incremental improvement. But stepping
> back a bit: who cares about sort-of-fsck validation in fast-import.c
> anyway?

Those who want to notice and verify the procedure they used to
produce the import data from the original before it is too late?

I.e. data gets imported to Git, victory is declared, and then the old
SCM gets turned off---and only then the resulting imported repository
is found not to pass fsck.

> Shouldn't it just pretty much be importing data as-is, and then we could
> document "if you don't trust it, run fsck afterwards"?

If it is a small import, the distinction does not matter, but for a
huge import, the procedure to produce the data is likely to be
mechanical, so even after processing just a very small portion of
the early part of the datastream, systematic errors would be noticed
before fast-import wastes time importing too much garbage that needs
to be discarded after running such fsck.  So in that sense, I suspect
that there is value in the early validation.
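
As one illustration of what such early validation could look like, the
NEEDSWORK comment in the patch lists checks beyond num > 1400 that
fast-import does not perform today.  The following is a minimal sketch
of those checks only, under stated assumptions: the helper name
tz_offset_plausible() is hypothetical, it is not part of git, and it
takes the already-parsed HHMM number rather than the raw string.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of git): decide whether an
 * already-parsed HHMM offset value looks plausible.  The two
 * minute-based conditions are the ones suggested in the patch's
 * NEEDSWORK comment; they are not existing fast-import behaviour.
 */
static bool tz_offset_plausible(unsigned long num)
{
	unsigned long minutes = num % 100;

	if (num > 1400)
		return false;	/* the existing hardcoded limit */
	if (minutes >= 60)
		return false;	/* e.g. 170 would mean 70 minutes */
	if (minutes % 15)
		return false;	/* minutes not divisible by 15 */
	return true;
}
```

Under these rules a value such as 530 or 545 passes, while 51800, 170,
and 510 are all rejected, so a systematic error in a generated stream
would surface on the first affected ident.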

> Or, if it's a use-case people actually care about, then I might see
> about unifying some of these parser functions as part of a series I'm
> preparing.

I think allowing people to loosen particular checks for fast-import
(or elsewhere for that matter) is a good idea, and you can do so
more easily once the existing checking is migrated to your new
scheme that shares code with the fsck machinery.


* Re: Why does fast-import need to check the validity of idents? + Other ident adventures
  2021-02-03 19:20       ` Junio C Hamano
@ 2021-02-05 15:25         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-05 15:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, git, peff, jrnieder,
	Elijah Newren


On Wed, Feb 03 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> But I was wondering about fast-import.c in particular. I think Elijah's
>> patch here is obviously a good incremental improvement. But stepping
>> back a bit: who cares about sort-of-fsck validation in fast-import.c
>> anyway?
>
> Those who want to notice and verify the procedure they used to
> produce the import data from the original before it is too late?
>
> I.e. data gets imported to Git, victory is declared, and then the old
> SCM gets turned off---and only then the resulting imported repository
> is found not to pass fsck.
>
>> Shouldn't it just pretty much be importing data as-is, and then we could
>> document "if you don't trust it, run fsck afterwards"?
>
> If it is a small import, the distinction does not matter, but for a
> huge import, the procedure to produce the data is likely to be
> mechanical, so even after processing just a very small portion of
> the early part of the datastream, systematic errors would be noticed
> before fast-import wastes time importing too much garbage that needs
> to be discarded after running such fsck.  So in that sense, I suspect
> that there is value in the early validation.

What I was fishing for here is that perhaps, since fast-import was
originally written, this use-case of in-place conversion of primary data
on a server might have become too obscure to care about, i.e. as opposed
to doing a conversion locally and then "git push"-ing it to something
that has transfer.fsckObjects enabled.

>> Or, if it's a use-case people actually care about, then I might see
>> about unifying some of these parser functions as part of a series I'm
>> preparing.
>
> I think allowing people to loosen particular checks for fast-import
> (or elsewhere for that matter) is a good idea, and you can do so
> more easily once the existing checking is migrated to your new
> scheme that shares code with the fsck machinery.

...all right, depending on how much of a hassle that is, I might just
add tests for the differences and leave this particular problem to
someone else :)


Thread overview: 12+ messages
2020-05-28 19:15 [PATCH] fast-import: accept invalid timezones so we can import existing repos Elijah Newren via GitGitGadget
2020-05-28 19:26 ` Jonathan Nieder
2020-05-28 20:40 ` [PATCH v2] fast-import: add new --date-format=raw-permissive format Elijah Newren via GitGitGadget
2020-05-28 23:08   ` Junio C Hamano
2020-05-29  0:20   ` Jonathan Nieder
2020-05-29  6:13   ` Jeff King
2020-05-29 17:19     ` Junio C Hamano
2020-05-30 20:25   ` [PATCH v3] " Elijah Newren via GitGitGadget
2020-05-30 23:13     ` Jeff King
2021-02-03 11:57     ` Why does fast-import need to check the validity of idents? + Other ident adventures Ævar Arnfjörð Bjarmason
2021-02-03 19:20       ` Junio C Hamano
2021-02-05 15:25         ` Ævar Arnfjörð Bjarmason
