git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git archive generates tar with malformed pax extended attribute
@ 2019-05-24  6:45 Keegan Carruthers-Smith
  2019-05-24  7:06 ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Keegan Carruthers-Smith @ 2019-05-24  6:45 UTC (permalink / raw)
  To: git

Hello,

git archive can generate a malformed tar archive. bsdtar reports the
error "tar: Ignoring malformed pax extended attribute" when reading
the archive. Go's "tar/archive" package also reports the error
"archive/tar: invalid tar header". However, BusyBox's tar does not
report the error (unsure if it just has less error logging).

I can reproduce this when generating the tar on linux and mac. I
tested this with "git version 2.21.0" and a build of next I did today:
"git version 2.22.0.rc1.257.g3120a18244"

Reproduction:

  $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git
  $ cd NNDD
  $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137
| tar tf - > /dev/null
  tar: Ignoring malformed pax extended attribute
  tar: Error exit delayed from previous errors.

Kind Regards,
Keegan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  6:45 git archive generates tar with malformed pax extended attribute Keegan Carruthers-Smith
@ 2019-05-24  7:06 ` Jeff King
  2019-05-24  7:35   ` Keegan Carruthers-Smith
  2019-05-25 20:46   ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2019-05-24  7:06 UTC (permalink / raw)
  To: Keegan Carruthers-Smith; +Cc: git

On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote:

> git archive can generate a malformed tar archive. bsdtar reports the
> error "tar: Ignoring malformed pax extended attribute" when reading
> the archive. Go's "tar/archive" package also reports the error
> "archive/tar: invalid tar header". However, BusyBox's tar does not
> report the error (unsure if it just has less error logging).
> 
> I can reproduce this when generating the tar on linux and mac. I
> tested this with "git version 2.21.0" and a build of next I did today:
> "git version 2.22.0.rc1.257.g3120a18244"
> 
> Reproduction:
> 
>   $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git
>   $ cd NNDD
>   $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137
> | tar tf - > /dev/null
>   tar: Ignoring malformed pax extended attribute
>   tar: Error exit delayed from previous errors.

I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
(from Debian's bsdtar package). What does your "tar --version" say?

Git does write a pax header with the commit id in it as a comment.
Presumably that's what it's complaining about (but it is not malformed
according to any tar I've tried). If you feed git-archive a tree rather
than a commit, that is omitted. What does:

  git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null

say? If it doesn't complain, then we know it's indeed the pax comment
field.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  7:06 ` Jeff King
@ 2019-05-24  7:35   ` Keegan Carruthers-Smith
  2019-05-24  8:13     ` Jeff King
  2019-05-25 20:46   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 21+ messages in thread
From: Keegan Carruthers-Smith @ 2019-05-24  7:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, 24 May 2019 at 09:06, Jeff King <peff@peff.net> wrote:
>
> On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote:
>
> > git archive can generate a malformed tar archive. bsdtar reports the
> > error "tar: Ignoring malformed pax extended attribute" when reading
> > the archive. Go's "tar/archive" package also reports the error
> > "archive/tar: invalid tar header". However, BusyBox's tar does not
> > report the error (unsure if it just has less error logging).
> >
> > I can reproduce this when generating the tar on linux and mac. I
> > tested this with "git version 2.21.0" and a build of next I did today:
> > "git version 2.22.0.rc1.257.g3120a18244"
> >
> > Reproduction:
> >
> >   $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git
> >   $ cd NNDD
> >   $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137
> > | tar tf - > /dev/null
> >   tar: Ignoring malformed pax extended attribute
> >   tar: Error exit delayed from previous errors.
>
> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
> (from Debian's bsdtar package). What does your "tar --version" say?

bsdtar 2.8.3 - libarchive 2.8.3

> Git does write a pax header with the commit id in it as a comment.
> Presumably that's what it's complaining about (but it is not malformed
> according to any tar I've tried). If you feed git-archive a tree rather
> than a commit, that is omitted. What does:
>
>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>
> say? If it doesn't complain, then we know it's indeed the pax comment
> field.

It also complains

  $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
  tar: Ignoring malformed pax extended attribute
  tar: Error exit delayed from previous errors.

Some more context: I work at Sourcegraph.com We mirror a lot of repos
from github.com. We usually interact with a working copy by running
git archive on it in our infrastructure. This is the first repository
that I have noticed which produces this error. An interesting thing to
note is the commit metadata contains a lot of non-ascii text which was
my guess at what my be tripping up the tar creation.

Keegan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  7:35   ` Keegan Carruthers-Smith
@ 2019-05-24  8:13     ` Jeff King
  2019-05-25 13:26       ` René Scharfe
  2019-05-27  5:11       ` Keegan Carruthers-Smith
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2019-05-24  8:13 UTC (permalink / raw)
  To: Keegan Carruthers-Smith; +Cc: git

On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:

> > I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
> > (from Debian's bsdtar package). What does your "tar --version" say?
> 
> bsdtar 2.8.3 - libarchive 2.8.3

Interesting. I wonder if there was a libarchive bug that was fixed
between 2.8.3 and 3.3.3.

> > Git does write a pax header with the commit id in it as a comment.
> > Presumably that's what it's complaining about (but it is not malformed
> > according to any tar I've tried). If you feed git-archive a tree rather
> > than a commit, that is omitted. What does:
> >
> >   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
> >
> > say? If it doesn't complain, then we know it's indeed the pax comment
> > field.
> 
> It also complains
> 
>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>   tar: Ignoring malformed pax extended attribute
>   tar: Error exit delayed from previous errors.

Ah, OK. So it's not the comment field at all, but some other entry.

> Some more context: I work at Sourcegraph.com We mirror a lot of repos
> from github.com. We usually interact with a working copy by running
> git archive on it in our infrastructure. This is the first repository
> that I have noticed which produces this error. An interesting thing to
> note is the commit metadata contains a lot of non-ascii text which was
> my guess at what my be tripping up the tar creation.

Yeah, though the only thing that makes it into the tarfile is the actual
tree entries. I'd imagine the file content is not likely to be a source
of problems, as it's common to see binary gunk there. Most of the
filenames are pretty mundane, but this symlink destination is a little
funny:

  $ git archive ... | tar tvf - | grep nicovideo4as.swc
  lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024

That's not the full story, though. It is indeed a symlink in the
tree:

  $ git ls-tree -r HEAD libs/nicovideo4as.swc
  120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc

But the contents of that blob, which should be the destination filename,
are definitely not:

  $ git cat-file blob ec3137b5f | wc -c
  57804
  $ git cat-file blob ec3137b5f | xxd | head -1
  00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..

There's quite a bit more data there. And what tar showed us goes up to
the first NUL, which does not seem surprising.

It's possible Git is doing the wrong thing on the writing side, but
given that newer versions of bsdtar handle it fine, I'd guess that the
old one simply had problems consuming poorly formed symlink filenames.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  8:13     ` Jeff King
@ 2019-05-25 13:26       ` René Scharfe
  2019-05-25 13:46         ` Andreas Schwab
                           ` (2 more replies)
  2019-05-27  5:11       ` Keegan Carruthers-Smith
  1 sibling, 3 replies; 21+ messages in thread
From: René Scharfe @ 2019-05-25 13:26 UTC (permalink / raw)
  To: Jeff King, Keegan Carruthers-Smith; +Cc: git

Am 24.05.19 um 10:13 schrieb Jeff King:
> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>
>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>
>> bsdtar 2.8.3 - libarchive 2.8.3
>
> Interesting. I wonder if there was a libarchive bug that was fixed
> between 2.8.3 and 3.3.3.
>
>>> Git does write a pax header with the commit id in it as a comment.
>>> Presumably that's what it's complaining about (but it is not malformed
>>> according to any tar I've tried). If you feed git-archive a tree rather
>>> than a commit, that is omitted. What does:
>>>
>>>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>
>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>> field.
>>
>> It also complains
>>
>>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>   tar: Ignoring malformed pax extended attribute
>>   tar: Error exit delayed from previous errors.
>
> Ah, OK. So it's not the comment field at all, but some other entry.
>
>> Some more context: I work at Sourcegraph.com We mirror a lot of repos
>> from github.com. We usually interact with a working copy by running
>> git archive on it in our infrastructure. This is the first repository
>> that I have noticed which produces this error. An interesting thing to
>> note is the commit metadata contains a lot of non-ascii text which was
>> my guess at what my be tripping up the tar creation.
>
> Yeah, though the only thing that makes it into the tarfile is the actual
> tree entries. I'd imagine the file content is not likely to be a source
> of problems, as it's common to see binary gunk there. Most of the
> filenames are pretty mundane, but this symlink destination is a little
> funny:
>
>   $ git archive ... | tar tvf - | grep nicovideo4as.swc
>   lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>
> That's not the full story, though. It is indeed a symlink in the
> tree:
>
>   $ git ls-tree -r HEAD libs/nicovideo4as.swc
>   120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc
>
> But the contents of that blob, which should be the destination filename,
> are definitely not:
>
>   $ git cat-file blob ec3137b5f | wc -c
>   57804
>   $ git cat-file blob ec3137b5f | xxd | head -1
>   00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..
>
> There's quite a bit more data there. And what tar showed us goes up to
> the first NUL, which does not seem surprising.

That (the symlink target) is a ZIP file with the following contents:

 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
   54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
--------          -------  ---                            -------
   93864            57554  39%                            2 files

And link targets longer than 100 characters are encoded in an extended
Pax header.

(Usually symlink targets are paths, not file contents.)

> It's possible Git is doing the wrong thing on the writing side, but
> given that newer versions of bsdtar handle it fine, I'd guess that the
> old one simply had problems consuming poorly formed symlink filenames.

Git preserves symlink targets with embedded NULs in the repository and
in generated tar files.  Not sure if GNU tar and bsdtar truncating them
at the first NUL is a bug.  I'm also not sure if there is a platform
that would allow creating such a symlink in the file system, or how one
is supposed to use it.

We could truncate symlink targets at the first NUL as well in git
archive -- but that would be a bit sad, as the archive formats allow
storing the "real" target from the repo, with NUL and all.  We could
make git fsck report such symlinks.

Can Unicode symlink targets contain NULs?  We wouldn't want to damage
them even if we decide to truncate.

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-25 13:26       ` René Scharfe
@ 2019-05-25 13:46         ` Andreas Schwab
  2019-05-25 21:07         ` Ævar Arnfjörð Bjarmason
  2019-05-28  5:58         ` Jeff King
  2 siblings, 0 replies; 21+ messages in thread
From: Andreas Schwab @ 2019-05-25 13:46 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jeff King, Keegan Carruthers-Smith, git

On Mai 25 2019, René Scharfe <l.s.r@web.de> wrote:

> Can Unicode symlink targets contain NULs?  We wouldn't want to damage
> them even if we decide to truncate.

The POSIX interface doesn't allow creating such a symlink, since the
argument to symlink is a C string.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  7:06 ` Jeff King
  2019-05-24  7:35   ` Keegan Carruthers-Smith
@ 2019-05-25 20:46   ` Ævar Arnfjörð Bjarmason
  2019-05-25 21:19     ` brian m. carlson
  1 sibling, 1 reply; 21+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-25 20:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Keegan Carruthers-Smith, git


On Fri, May 24 2019, Jeff King wrote:

> On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote:
>
>> git archive can generate a malformed tar archive. bsdtar reports the
>> error "tar: Ignoring malformed pax extended attribute" when reading
>> the archive. Go's "tar/archive" package also reports the error
>> "archive/tar: invalid tar header". However, BusyBox's tar does not
>> report the error (unsure if it just has less error logging).
>>
>> I can reproduce this when generating the tar on linux and mac. I
>> tested this with "git version 2.21.0" and a build of next I did today:
>> "git version 2.22.0.rc1.257.g3120a18244"
>>
>> Reproduction:
>>
>>   $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git
>>   $ cd NNDD
>>   $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137
>> | tar tf - > /dev/null
>>   tar: Ignoring malformed pax extended attribute
>>   tar: Error exit delayed from previous errors.
>
> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
> (from Debian's bsdtar package). What does your "tar --version" say?
>
> Git does write a pax header with the commit id in it as a comment.
> Presumably that's what it's complaining about (but it is not malformed
> according to any tar I've tried). If you feed git-archive a tree rather
> than a commit, that is omitted. What does:
>
>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>
> say? If it doesn't complain, then we know it's indeed the pax comment
> field.

Solaris tar also complains about this. I've seen that for ages, but
never thought to report it, I figured it was well-known.

When you "tar xf" an archive git-archive it complains:

    tar: pax_global_header: typeflag 'g' not recognized, converting to regular file

It will then extract the "pax_global_header" as if it were a file at the
root of the archive. That file will look like this:

    $ wc -c x/pax_global_header
      52 x/pax_global_header
    $ cat x/pax_global_header
    52 comment=$40_CHAR_SHA_1

Where $40_CHAR_SHA_1 is whatever commit this archive was produced from.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-25 13:26       ` René Scharfe
  2019-05-25 13:46         ` Andreas Schwab
@ 2019-05-25 21:07         ` Ævar Arnfjörð Bjarmason
  2019-05-26 21:33           ` René Scharfe
  2019-05-28  5:58         ` Jeff King
  2 siblings, 1 reply; 21+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-25 21:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jeff King, Keegan Carruthers-Smith, git


On Sat, May 25 2019, René Scharfe wrote:

> Am 24.05.19 um 10:13 schrieb Jeff King:
>> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>>
>>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
>>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>>
>>> bsdtar 2.8.3 - libarchive 2.8.3
>>
>> Interesting. I wonder if there was a libarchive bug that was fixed
>> between 2.8.3 and 3.3.3.
>>
>>>> Git does write a pax header with the commit id in it as a comment.
>>>> Presumably that's what it's complaining about (but it is not malformed
>>>> according to any tar I've tried). If you feed git-archive a tree rather
>>>> than a commit, that is omitted. What does:
>>>>
>>>>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>>
>>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>>> field.
>>>
>>> It also complains
>>>
>>>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>   tar: Ignoring malformed pax extended attribute
>>>   tar: Error exit delayed from previous errors.
>>
>> Ah, OK. So it's not the comment field at all, but some other entry.
>>
>>> Some more context: I work at Sourcegraph.com We mirror a lot of repos
>>> from github.com. We usually interact with a working copy by running
>>> git archive on it in our infrastructure. This is the first repository
>>> that I have noticed which produces this error. An interesting thing to
>>> note is the commit metadata contains a lot of non-ascii text which was
>>> my guess at what my be tripping up the tar creation.
>>
>> Yeah, though the only thing that makes it into the tarfile is the actual
>> tree entries. I'd imagine the file content is not likely to be a source
>> of problems, as it's common to see binary gunk there. Most of the
>> filenames are pretty mundane, but this symlink destination is a little
>> funny:
>>
>>   $ git archive ... | tar tvf - | grep nicovideo4as.swc
>>   lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>>
>> That's not the full story, though. It is indeed a symlink in the
>> tree:
>>
>>   $ git ls-tree -r HEAD libs/nicovideo4as.swc
>>   120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc
>>
>> But the contents of that blob, which should be the destination filename,
>> are definitely not:
>>
>>   $ git cat-file blob ec3137b5f | wc -c
>>   57804
>>   $ git cat-file blob ec3137b5f | xxd | head -1
>>   00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..
>>
>> There's quite a bit more data there. And what tar showed us goes up to
>> the first NUL, which does not seem surprising.
>
> That (the symlink target) is a ZIP file with the following contents:
>
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
>    54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
> --------          -------  ---                            -------
>    93864            57554  39%                            2 files
>
> And link targets longer than 100 characters are encoded in an extended
> Pax header.
>
> (Usually symlink targets are paths, not file contents.)
>
>> It's possible Git is doing the wrong thing on the writing side, but
>> given that newer versions of bsdtar handle it fine, I'd guess that the
>> old one simply had problems consuming poorly formed symlink filenames.
>
> Git preserves symlink targets with embedded NULs in the repository and
> in generated tar files.  Not sure if GNU tar and bsdtar truncating them
> at the first NUL is a bug.  I'm also not sure if there is a platform
> that would allow creating such a symlink in the file system, or how one
> is supposed to use it.
>
> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all.  We could
> make git fsck report such symlinks.
>
> Can Unicode symlink targets contain NULs?  We wouldn't want to damage
> them even if we decide to truncate.

I don't see a practical use for this case, and maybe we should even fsck
check for the blob representing the symlink target having a \0 in it as
suggested upthread.

But that being said, this assumption that data in a tar archive will get
written to a FS of some sort isn't true. There's plenty of consumers of
the format that read it in-memory and stream its contents out to
something else entirely, e.g. taking "git archive --remote" output,
parsing it with e.g. [1] and throwing some/all of the content into a
database.

1. https://metacpan.org/pod/Archive::Tar

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-25 20:46   ` Ævar Arnfjörð Bjarmason
@ 2019-05-25 21:19     ` brian m. carlson
  0 siblings, 0 replies; 21+ messages in thread
From: brian m. carlson @ 2019-05-25 21:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Keegan Carruthers-Smith, git

[-- Attachment #1: Type: text/plain, Size: 1473 bytes --]

On 2019-05-25 at 20:46:16, Ævar Arnfjörð Bjarmason wrote:
> Solaris tar also complains about this. I've seen that for ages, but
> never thought to report it, I figured it was well-known.
> 
> When you "tar xf" an archive git-archive it complains:
> 
>     tar: pax_global_header: typeflag 'g' not recognized, converting to regular file
> 
> It will then extract the "pax_global_header" as if it were a file at the
> root of the archive. That file will look like this:
> 
>     $ wc -c x/pax_global_header
>       52 x/pax_global_header
>     $ cat x/pax_global_header
>     52 comment=$40_CHAR_SHA_1
> 
> Where $40_CHAR_SHA_1 is whatever commit this archive was produced from.

Ironically, for a long time the pax(1) utility shipped with Debian was
incapable of reading pax headers. There are various other utilities
(including 7-Zip) that don't understand them, and they invariably get
converted to regular files (since that's what the spec says to do). This
problem is well known to me, but I don't know if it's well known to
everyone.

In this case, I suspect Solaris has a pax(1) utility that works fine,
but nobody bothered to port that code to tar(1).

Usually the lack of support is because people prefer GNU tar archives
over actual POSIX tar and pax archives, so real-world archivers don't
see POSIX archives and don't handle the extensions properly.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-25 21:07         ` Ævar Arnfjörð Bjarmason
@ 2019-05-26 21:33           ` René Scharfe
  2019-05-28  5:44             ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2019-05-26 21:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Keegan Carruthers-Smith, git

Am 25.05.19 um 23:07 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, May 25 2019, René Scharfe wrote:
>
>> We could truncate symlink targets at the first NUL as well in git
>> archive -- but that would be a bit sad, as the archive formats allow
>> storing the "real" target from the repo, with NUL and all.

> But that being said, this assumption that data in a tar archive will get
> written to a FS of some sort isn't true. There's plenty of consumers of
> the format that read it in-memory and stream its contents out to
> something else entirely, e.g. taking "git archive --remote" output,
> parsing it with e.g. [1] and throwing some/all of the content into a
> database.
>
> 1. https://metacpan.org/pod/Archive::Tar

Git archive writes link targets that are 100 characters long or less
into the appropriate field in the plain tar header.  It copies
everything, including NULs, but unlike a PAX extended header that field
lacks a length indicator, so extractors basically have to treat it as
NUL-terminated.

If we want to preserve NUL in short link targets as well, we'd have to
put such names into an PAX extended header..

---
 archive-tar.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..e8f55578d1 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -291,7 +291,8 @@ static int write_tar_entry(struct archiver_args *args,
 	}

 	if (S_ISLNK(mode)) {
-		if (size > sizeof(header.linkname)) {
+		if (size > sizeof(header.linkname) ||
+		    memchr(buffer, '\0', size)) {
 			xsnprintf(header.linkname, sizeof(header.linkname),
 				  "see %s.paxheader", oid_to_hex(oid));
 			strbuf_append_ext_header(&ext_header, "linkpath",
--
2.21.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-24  8:13     ` Jeff King
  2019-05-25 13:26       ` René Scharfe
@ 2019-05-27  5:11       ` Keegan Carruthers-Smith
  1 sibling, 0 replies; 21+ messages in thread
From: Keegan Carruthers-Smith @ 2019-05-27  5:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, 24 May 2019 at 10:13, Jeff King <peff@peff.net> wrote:

> It's possible Git is doing the wrong thing on the writing side, but
> given that newer versions of bsdtar handle it fine, I'd guess that the
> old one simply had problems consuming poorly formed symlink filenames.

I agree that the reader should be able to handle it best-effort.
Especially given both bsdtar and gnutar seem to not choke on it. I
mentioned go's archive/tar chokes on it, so reported that issue
https://github.com/golang/go/issues/32264

-Keegan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-26 21:33           ` René Scharfe
@ 2019-05-28  5:44             ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2019-05-28  5:44 UTC (permalink / raw)
  To: René Scharfe
  Cc: Ævar Arnfjörð Bjarmason, Keegan Carruthers-Smith,
	git

On Sun, May 26, 2019 at 11:33:20PM +0200, René Scharfe wrote:

> Git archive writes link targets that are 100 characters long or less
> into the appropriate field in the plain tar header.  It copies
> everything, including NULs, but unlike a PAX extended header that field
> lacks a length indicator, so extractors basically have to treat it as
> NUL-terminated.
> 
> If we want to preserve NUL in short link targets as well, we'd have to
> put such names into an PAX extended header..

I can't say that I care much either way, since putting NULs in your
filenames is pretty crazy. But that would at least make things
consistent regardless of the length.

I also wondered if you could play any security tricks here (e.g., by
having one tool view a filename as "foo" and another as "foo\0bar"). I
doubt it, though.  Filenames themselves in Git are always
NUL-terminated, since that's dictated the tree format; so this is really
just about link destinations.

And even then, it's hard to imagine a case. I was wondering if you could
have an entry ".git\0foo" that ends up written to the filesystem as
".git". But these are tar archives we're talking about, so it's not like
you couldn't just put ".git" in a tar file in the first place.

But maybe somebody else can brainstorm something more evil. :)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-25 13:26       ` René Scharfe
  2019-05-25 13:46         ` Andreas Schwab
  2019-05-25 21:07         ` Ævar Arnfjörð Bjarmason
@ 2019-05-28  5:58         ` Jeff King
  2019-05-28 18:01           ` René Scharfe
  2 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-05-28  5:58 UTC (permalink / raw)
  To: René Scharfe; +Cc: Keegan Carruthers-Smith, git

On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote:

> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all.  We could
> make git fsck report such symlinks.

This is a little tricky, because fsck generally looks at individual
objects, and the bad pattern is a combination of a tree and a blob
together. I think you could make it work by reusing some of the code and
patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into
maint, 2018-05-22).

> Can Unicode symlink targets contain NULs?  We wouldn't want to damage
> them even if we decide to truncate.

On Windows, I suppose, where pathnames can be UTF-16? I don't know how
any of that works with Git. I guess we'd always have to assume the
filenames in Git are UTF-8 or at least some ASCII-superset, since we
cannot encode NULs; and presumably that would extend to link
destinations, too. So I doubt it's a problem in practice. Personally,
I'd wait until somebody with such a system cares enough to suggest a new
behavior, rather than trying to guess. :)

Likewise, I think at this point with Keegan's original report that Git
is doing something reasonable with a lousy input. Unless something
interesting comes out of the golang/go bug report discussion (thank you
for opening that!), it's probably not worth chasing hypotheticals.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-28  5:58         ` Jeff King
@ 2019-05-28 18:01           ` René Scharfe
  2019-05-28 19:08             ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2019-05-28 18:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Keegan Carruthers-Smith, git

Am 28.05.19 um 07:58 schrieb Jeff King:
> On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote:
>
>> We could
>> make git fsck report such symlinks.
>
> This is a little tricky, because fsck generally looks at individual
> objects, and the bad pattern is a combination of a tree and a blob
> together. I think you could make it work by reusing some of the code and
> patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into
> maint, 2018-05-22).

Actually it's super easy, barely an inconvenience (SCNR, watched a lot
of those rants recently)..  Did I miss something?

-- >8 --
Subject: [PATCH] fsck: warn about symlink targets containing NUL

Symlink targets are kept as blob objects in Git repositories and can
thus contain arbitrary characters.  They are truncated silently at the
first NUL by symlink(2) during checkout, though.  Make git fsck warn
about such tree entries.

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 fsck.c          | 32 ++++++++++++++++++++++++++++++++
 t/t1450-fsck.sh | 13 +++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/fsck.c b/fsck.c
index 4703f55561..d434f5d75c 100644
--- a/fsck.c
+++ b/fsck.c
@@ -49,6 +49,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(MISSING_PARENT, ERROR) \
 	FUNC(MISSING_SPACE_BEFORE_DATE, ERROR) \
 	FUNC(MISSING_SPACE_BEFORE_EMAIL, ERROR) \
+	FUNC(MISSING_SYMLINK_OBJECT, ERROR) \
 	FUNC(MISSING_TAG, ERROR) \
 	FUNC(MISSING_TAG_ENTRY, ERROR) \
 	FUNC(MISSING_TAG_OBJECT, ERROR) \
@@ -58,6 +59,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(MISSING_TYPE_ENTRY, ERROR) \
 	FUNC(MULTIPLE_AUTHORS, ERROR) \
 	FUNC(TAG_OBJECT_NOT_TAG, ERROR) \
+	FUNC(SYMLINK_OBJECT_NOT_BLOB, ERROR) \
 	FUNC(TREE_NOT_SORTED, ERROR) \
 	FUNC(UNKNOWN_TYPE, ERROR) \
 	FUNC(ZERO_PADDED_DATE, ERROR) \
@@ -78,6 +80,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(NULL_SHA1, WARN) \
 	FUNC(ZERO_PADDED_FILEMODE, WARN) \
 	FUNC(NUL_IN_COMMIT, WARN) \
+	FUNC(NUL_IN_SYMLINK_TARGET, WARN) \
 	/* infos (reported as warnings, but ignored by default) */ \
 	FUNC(GITMODULES_PARSE, INFO) \
 	FUNC(BAD_TAG_NAME, INFO) \
@@ -578,6 +581,33 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

+static int fsck_symlink(struct tree *tree, const char *name,
+			const struct object_id *oid,
+			struct fsck_options *options)
+{
+	int ret = 0;
+	enum object_type type;
+	unsigned long size;
+	void *buffer = read_object_file(oid, &type, &size);
+
+	if (!buffer)
+		ret = report(options, &tree->object,
+			     FSCK_MSG_MISSING_SYMLINK_OBJECT,
+			     "cannot read blob object for symlink %s", name);
+	else if (type != OBJ_BLOB)
+		ret = report(options, &tree->object,
+			     FSCK_MSG_SYMLINK_OBJECT_NOT_BLOB,
+			     "expected blob got %s for symlink %s",
+			     type_name(type), name);
+	else if (memchr(buffer, '\0', size))
+		ret = report(options, &tree->object,
+			     FSCK_MSG_NUL_IN_SYMLINK_TARGET,
+			     "NUL in target of symlink %s", name);
+
+	free(buffer);
+	return ret;
+}
+
 static int fsck_tree(struct tree *item, struct fsck_options *options)
 {
 	int retval = 0;
@@ -626,6 +656,8 @@ static int fsck_tree(struct tree *item, struct fsck_options *options)
 						 FSCK_MSG_GITMODULES_SYMLINK,
 						 ".gitmodules is a symbolic link");
 		}
+		if (S_ISLNK(mode))
+			retval += fsck_symlink(item, name, oid, options);

 		if (update_tree_entry_gently(&desc)) {
 			retval += report(options, &item->object, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree");
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 0f268a3664..ce9501d063 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -822,4 +822,17 @@ test_expect_success 'detect corrupt index file in fsck' '
 	test_i18ngrep "bad index file" errors
 '

+test_expect_success 'detect NUL in symlink target' '
+	test_when_finished "git update-ref -d refs/heads/nul_in_symlink" &&
+	test_when_finished "remove_object \$commit" &&
+	test_when_finished "remove_object \$tree" &&
+	test_when_finished "remove_object \$blob" &&
+	blob=$(echo fooQbar | q_to_nul | git hash-object -w --stdin) &&
+	tree=$(echo "120000 blob $blob	symlink" | git mktree) &&
+	commit=$(git commit-tree $tree) &&
+	git update-ref refs/heads/nul_in_symlink $commit &&
+	git fsck 2>out &&
+	test_i18ngrep "NUL in target of symlink" out
+'
+
 test_done
--
2.21.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-28 18:01           ` René Scharfe
@ 2019-05-28 19:08             ` Jeff King
  2019-05-28 23:34               ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-05-28 19:08 UTC (permalink / raw)
  To: René Scharfe; +Cc: Keegan Carruthers-Smith, git

On Tue, May 28, 2019 at 08:01:43PM +0200, René Scharfe wrote:

> Am 28.05.19 um 07:58 schrieb Jeff King:
> > On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote:
> >
> >> We could
> >> make git fsck report such symlinks.
> >
> > This is a little tricky, because fsck generally looks at individual
> > objects, and the bad pattern is a combination of a tree and a blob
> > together. I think you could make it work by reusing some of the code and
> > patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into
> > maint, 2018-05-22).
> 
> Actually it's super easy, barely an inconvenience (SCNR, watched a lot
> of those rants recently)..  Did I miss something?

Yes. You cannot rely on calling read_object_file() in real-time when the
fsck is being done by index-pack. The blob in question may be in the
pack you are indexing.

See fsck_finish() for how we do this for .gitmodules checks.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-28 19:08             ` Jeff King
@ 2019-05-28 23:34               ` René Scharfe
  2019-05-29  1:17                 ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2019-05-28 23:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Keegan Carruthers-Smith, git

Am 28.05.19 um 21:08 schrieb Jeff King:
> On Tue, May 28, 2019 at 08:01:43PM +0200, René Scharfe wrote:
>
>> Am 28.05.19 um 07:58 schrieb Jeff King:
>>> On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote:
>>>
>>>> We could
>>>> make git fsck report such symlinks.
>>>
>>> This is a little tricky, because fsck generally looks at individual
>>> objects, and the bad pattern is a combination of a tree and a blob
>>> together. I think you could make it work by reusing some of the code and
>>> patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into
>>> maint, 2018-05-22).
>>
>> Actually it's super easy, barely an inconvenience (SCNR, watched a lot
>> of those rants recently)..  Did I miss something?
>
> Yes. You cannot rely on calling read_object_file() in real-time when the
> fsck is being done by index-pack. The blob in question may be in the
> pack you are indexing.

It figures.

So something like the patch below?

Parsing trees with symlinks twice is not ideal, but keeps the set
structure simple -- a standard oidset suffices.

The global variables are ugly.  Moving them into struct fsck_option
would be possible, but not much better, as they aren't really
options.

FSCK_MSG_MISSING_TREE_OBJECT has never been used before, it seems.

---
 fsck.c          | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 t/t1450-fsck.sh | 13 +++++++++++
 2 files changed, 71 insertions(+)

diff --git a/fsck.c b/fsck.c
index 4703f55561..a6e7d0b03f 100644
--- a/fsck.c
+++ b/fsck.c
@@ -19,6 +19,7 @@

 static struct oidset gitmodules_found = OIDSET_INIT;
 static struct oidset gitmodules_done = OIDSET_INIT;
+static struct oidset trees_with_symlinks = OIDSET_INIT;

 #define FSCK_FATAL -1
 #define FSCK_INFO -2
@@ -49,6 +50,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(MISSING_PARENT, ERROR) \
 	FUNC(MISSING_SPACE_BEFORE_DATE, ERROR) \
 	FUNC(MISSING_SPACE_BEFORE_EMAIL, ERROR) \
+	FUNC(MISSING_SYMLINK_OBJECT, ERROR) \
 	FUNC(MISSING_TAG, ERROR) \
 	FUNC(MISSING_TAG_ENTRY, ERROR) \
 	FUNC(MISSING_TAG_OBJECT, ERROR) \
@@ -58,6 +60,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(MISSING_TYPE_ENTRY, ERROR) \
 	FUNC(MULTIPLE_AUTHORS, ERROR) \
 	FUNC(TAG_OBJECT_NOT_TAG, ERROR) \
+	FUNC(SYMLINK_OBJECT_NOT_BLOB, ERROR) \
 	FUNC(TREE_NOT_SORTED, ERROR) \
 	FUNC(UNKNOWN_TYPE, ERROR) \
 	FUNC(ZERO_PADDED_DATE, ERROR) \
@@ -78,6 +81,7 @@ static struct oidset gitmodules_done = OIDSET_INIT;
 	FUNC(NULL_SHA1, WARN) \
 	FUNC(ZERO_PADDED_FILEMODE, WARN) \
 	FUNC(NUL_IN_COMMIT, WARN) \
+	FUNC(NUL_IN_SYMLINK_TARGET, WARN) \
 	/* infos (reported as warnings, but ignored by default) */ \
 	FUNC(GITMODULES_PARSE, INFO) \
 	FUNC(BAD_TAG_NAME, INFO) \
@@ -578,6 +582,33 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

+static int fsck_symlink(struct tree *tree, const char *name,
+			const struct object_id *oid,
+			struct fsck_options *options)
+{
+	int ret = 0;
+	enum object_type type;
+	unsigned long size;
+	void *buffer = read_object_file(oid, &type, &size);
+
+	if (!buffer)
+		ret = report(options, &tree->object,
+			     FSCK_MSG_MISSING_SYMLINK_OBJECT,
+			     "cannot read blob object for symlink %s", name);
+	else if (type != OBJ_BLOB)
+		ret = report(options, &tree->object,
+			     FSCK_MSG_SYMLINK_OBJECT_NOT_BLOB,
+			     "expected blob got %s for symlink %s",
+			     type_name(type), name);
+	else if (memchr(buffer, '\0', size))
+		ret = report(options, &tree->object,
+			     FSCK_MSG_NUL_IN_SYMLINK_TARGET,
+			     "NUL in target of symlink %s", name);
+
+	free(buffer);
+	return ret;
+}
+
 static int fsck_tree(struct tree *item, struct fsck_options *options)
 {
 	int retval = 0;
@@ -626,6 +657,8 @@ static int fsck_tree(struct tree *item, struct fsck_options *options)
 						 FSCK_MSG_GITMODULES_SYMLINK,
 						 ".gitmodules is a symbolic link");
 		}
+		if (S_ISLNK(mode))
+			oidset_insert(&trees_with_symlinks, &item->object.oid);

 		if (update_tree_entry_gently(&desc)) {
 			retval += report(options, &item->object, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree");
@@ -1118,8 +1151,33 @@ int fsck_finish(struct fsck_options *options)
 		free(buf);
 	}

+	oidset_iter_init(&trees_with_symlinks, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree;
+		struct tree_desc desc;
+		struct name_entry entry;
+
+		tree = lookup_tree(the_repository, oid);
+		if (!tree) {
+			struct object *obj = lookup_unknown_object(oid->hash);
+			ret |= report(options, obj,
+				      FSCK_MSG_MISSING_TREE_OBJECT,
+				      "tree %s not found", oid_to_hex(oid));
+			continue;
+		}
+		if (parse_tree(tree))
+			continue;
+		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
+			continue;
+		while (tree_entry_gently(&desc, &entry)) {
+			if (S_ISLNK(entry.mode))
+				ret |= fsck_symlink(tree, entry.path,
+						    &entry.oid, options);
+		}
+	}

 	oidset_clear(&gitmodules_found);
 	oidset_clear(&gitmodules_done);
+	oidset_clear(&trees_with_symlinks);
 	return ret;
 }
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 0f268a3664..ce9501d063 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -822,4 +822,17 @@ test_expect_success 'detect corrupt index file in fsck' '
 	test_i18ngrep "bad index file" errors
 '

+test_expect_success 'detect NUL in symlink target' '
+	test_when_finished "git update-ref -d refs/heads/nul_in_symlink" &&
+	test_when_finished "remove_object \$commit" &&
+	test_when_finished "remove_object \$tree" &&
+	test_when_finished "remove_object \$blob" &&
+	blob=$(echo fooQbar | q_to_nul | git hash-object -w --stdin) &&
+	tree=$(echo "120000 blob $blob	symlink" | git mktree) &&
+	commit=$(git commit-tree $tree) &&
+	git update-ref refs/heads/nul_in_symlink $commit &&
+	git fsck 2>out &&
+	test_i18ngrep "NUL in target of symlink" out
+'
+
 test_done
--
2.21.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-28 23:34               ` René Scharfe
@ 2019-05-29  1:17                 ` Jeff King
  2019-05-29 17:54                   ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-05-29  1:17 UTC (permalink / raw)
  To: René Scharfe; +Cc: Keegan Carruthers-Smith, git

On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote:

> It figures.
> 
> So something like the patch below?
> 
> Parsing trees with symlinks twice is not ideal, but keeps the set
> structure simple -- a standard oidset suffices.

If blobs comes after trees (and they usually do in a pack), you can do
it in a single pass by marking the blob as a symlink target, and then
when we actually see that blob's contents, marking it as either OK or
problematic. And then the finish() step just correlates those with the
tree.

It does require O(n) storage in the number of symlinked blobs, but also
O(n) in the number of symlinked tree entries (number of trees with
symlinks times the number of entries in each such tree, _even if they're
the same entry/blob as another tree).

That makes it a lot worse than the existing gitmodules check. There we
only care about finding the .gitmodules blobs. So even though you have a
ton of trees that mention .gitmodules (basically every root tree), the
the .gitmodules file itself doesn't change much. So we only end up with
a small oidset (and a small worst case for looking at objects twice).

But here the problem is in the tree, not the blob. So we're not finding
suspect blobs, but rather re-checking each tree. And no matter what we
do (whether it's visiting the object again, or creating a set or mapping
with the object names) is going to be linear there. And a repository
with a symlink in the root tree is going to revisit or put in our
mapping every single root tree.

TBH, I'm not sure this fsck check was worth it even without the
implementation complexity.

> The global variables are ugly.  Moving them into struct fsck_option
> would be possible, but not much better, as they aren't really
> options.

Yeah, we use the name "context" elsewhere for this, which is a bit
clearer. In a real object-oriented design, I guess people would make an
"Fscker" object, and it would carry the options and context forward.
That's basically what our "context" objects are.

> FSCK_MSG_MISSING_TREE_OBJECT has never been used before, it seems.

Yeah, this is leftover cruft from my gitmodules series. I double-checked
the early iterations to be sure, and it is definitely the case. I didn't
dig down to find out the reason it went away, but I think it is probably
that the original version tried harder to find .gitmodules only in root
trees, and later we decided to just complain about it in any tree.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-29  1:17                 ` Jeff King
@ 2019-05-29 17:54                   ` René Scharfe
  2019-05-30 11:55                     ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2019-05-29 17:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Keegan Carruthers-Smith, git

Am 29.05.19 um 03:17 schrieb Jeff King:
> On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote:
>> Parsing trees with symlinks twice is not ideal, but keeps the set
>> structure simple -- a standard oidset suffices.
>
> If blobs comes after trees (and they usually do in a pack), you can do
> it in a single pass by marking the blob as a symlink target, and then
> when we actually see that blob's contents, marking it as either OK or
> problematic. And then the finish() step just correlates those with the
> tree.

Good idea.  Is that ordering guaranteed?  (Stumbling about the "usually"
in your first sentence.)

An ordering where dependent objects (like trees) follow the objects they
reference would be better suited for these kinds of checks..

> It does require O(n) storage in the number of symlinked blobs, but also
> O(n) in the number of symlinked tree entries (number of trees with
> symlinks times the number of entries in each such tree, _even if they're
> the same entry/blob as another tree).
>
> That makes it a lot worse than the existing gitmodules check. There we
> only care about finding the .gitmodules blobs. So even though you have a
> ton of trees that mention .gitmodules (basically every root tree), the
> the .gitmodules file itself doesn't change much. So we only end up with
> a small oidset (and a small worst case for looking at objects twice).
>
> But here the problem is in the tree, not the blob. So we're not finding
> suspect blobs, but rather re-checking each tree. And no matter what we
> do (whether it's visiting the object again, or creating a set or mapping
> with the object names) is going to be linear there. And a repository
> with a symlink in the root tree is going to revisit or put in our
> mapping every single root tree.

That's true, potentially it needs remember and/or reprocess all trees,
meaning this check may double the run time of fsck in the worst case.
Example from the wild: The kernel repo currently has 36 symlinks and
6+ million objects are checked in total, and the symlink check processes
18943 trees_with_symlinks entries there.

> TBH, I'm not sure this fsck check was worth it even without the
> implementation complexity.

Hmm.  git status reports such truncated symlinks as changed, so the
issue *is* already detectable.

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-29 17:54                   ` René Scharfe
@ 2019-05-30 11:55                     ` Jeff King
  2019-06-02 16:58                       ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-05-30 11:55 UTC (permalink / raw)
  To: René Scharfe; +Cc: Keegan Carruthers-Smith, git

On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote:

> Am 29.05.19 um 03:17 schrieb Jeff King:
> > On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote:
> >> Parsing trees with symlinks twice is not ideal, but keeps the set
> >> structure simple -- a standard oidset suffices.
> >
> > If blobs comes after trees (and they usually do in a pack), you can do
> > it in a single pass by marking the blob as a symlink target, and then
> > when we actually see that blob's contents, marking it as either OK or
> > problematic. And then the finish() step just correlates those with the
> > tree.
> 
> Good idea.  Is that ordering guaranteed?  (Stumbling about the "usually"
> in your first sentence.)

It's not guaranteed. Our implementation of pack-objects does order blobs
after trees, but I suspect this could be violated in rare cases with
some of the delta-island pack layering stuff.

I think it makes sense to be sure that the receiver is correct no matter
what, but optimize for this expected case (that's what I tried to do
with the .gitmodules checks).

> An ordering where dependent objects (like trees) follow the objects they
> reference would be better suited for these kinds of checks..

But worse for others (e.g., like .gitmodules where it's cheap to
identify a candidate blob, but the blob check is involved; there it's
much more optimal to see the tree first).

> > But here the problem is in the tree, not the blob. So we're not finding
> > suspect blobs, but rather re-checking each tree. And no matter what we
> > do (whether it's visiting the object again, or creating a set or mapping
> > with the object names) is going to be linear there. And a repository
> > with a symlink in the root tree is going to revisit or put in our
> > mapping every single root tree.
> 
> That's true, potentially it needs remember and/or reprocess all trees,
> meaning this check may double the run time of fsck in the worst case.
> Example from the wild: The kernel repo currently has 36 symlinks and
> 6+ million objects are checked in total, and the symlink check processes
> 18943 trees_with_symlinks entries there.

That sounds about right. It's basically every version of every tree that
has a symlink. Did it make a noticeable difference in timing? Indexing
the whole kernel history is already a horribly slow process. :)

> > TBH, I'm not sure this fsck check was worth it even without the
> > implementation complexity.
> 
> Hmm.  git status reports such truncated symlinks as changed, so the
> issue *is* already detectable.

Hmm, yeah. That makes sense, since the filesystem (well, really the
syscall API layer) cannot represent the data we are feeding it.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-05-30 11:55                     ` Jeff King
@ 2019-06-02 16:58                       ` René Scharfe
  2019-06-04 20:53                         ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2019-06-02 16:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Keegan Carruthers-Smith, git

Am 30.05.19 um 13:55 schrieb Jeff King:
> On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote:
>
>> Am 29.05.19 um 03:17 schrieb Jeff King:
>>> But here the problem is in the tree, not the blob. So we're not finding
>>> suspect blobs, but rather re-checking each tree. And no matter what we
>>> do (whether it's visiting the object again, or creating a set or mapping
>>> with the object names) is going to be linear there. And a repository
>>> with a symlink in the root tree is going to revisit or put in our
>>> mapping every single root tree.
>>
>> That's true, potentially it needs remember and/or reprocess all trees,
>> meaning this check may double the run time of fsck in the worst case.
>> Example from the wild: The kernel repo currently has 36 symlinks and
>> 6+ million objects are checked in total, and the symlink check processes
>> 18943 trees_with_symlinks entries there.
>
> That sounds about right. It's basically every version of every tree that
> has a symlink. Did it make a noticeable difference in timing? Indexing
> the whole kernel history is already a horribly slow process. :)

Right, I didn't notice a difference -- no patience for watching that
thing to the end.  But here are some numbers for v2.21.0 vs. master with
the patch:

Benchmark #1: git fsck
  Time (mean ± σ):     307.775 s ±  9.054 s    [User: 307.173 s, System: 0.448 s]
  Range (min … max):   294.052 s … 322.931 s    10 runs

Benchmark #2: ~/src/git/git fsck
  Time (mean ± σ):     319.754 s ±  2.255 s    [User: 318.927 s, System: 0.671 s]
  Range (min … max):   316.376 s … 323.747 s    10 runs

Summary
  'git fsck' ran
    1.04 ± 0.03 times faster than '~/src/git/git fsck'

Seeing only a single CPU core being stressed for that long is a bit sad
to see.  Checking individual objects should be relatively easy to
parallelize, shouldn't it?

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive generates tar with malformed pax extended attribute
  2019-06-02 16:58                       ` René Scharfe
@ 2019-06-04 20:53                         ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2019-06-04 20:53 UTC (permalink / raw)
  To: René Scharfe; +Cc: Keegan Carruthers-Smith, git

On Sun, Jun 02, 2019 at 06:58:48PM +0200, René Scharfe wrote:

> > That sounds about right. It's basically every version of every tree that
> > has a symlink. Did it make a noticeable difference in timing? Indexing
> > the whole kernel history is already a horribly slow process. :)
> 
> Right, I didn't notice a difference -- no patience for watching that
> thing to the end.  But here are some numbers for v2.21.0 vs. master with
> the patch:
> 
> Benchmark #1: git fsck
>   Time (mean ± σ):     307.775 s ±  9.054 s    [User: 307.173 s, System: 0.448 s]
>   Range (min … max):   294.052 s … 322.931 s    10 runs
> 
> Benchmark #2: ~/src/git/git fsck
>   Time (mean ± σ):     319.754 s ±  2.255 s    [User: 318.927 s, System: 0.671 s]
>   Range (min … max):   316.376 s … 323.747 s    10 runs
> 
> Summary
>   'git fsck' ran
>     1.04 ± 0.03 times faster than '~/src/git/git fsck'

I guess that's about what I'd expect. The bulk of the time in most repos
will go to fscking the actual blobs, I'd think. But hitting each tree
twice really is noticeable.

> Seeing only a single CPU core being stressed for that long is a bit sad
> to see.  Checking individual objects should be relatively easy to
> parallelize, shouldn't it?

Yes. The fsck code is pretty old, and uses a very simple way of walking
over all of the packs. index-pack (which backs verify-pack these days)
is much smarter, and runs in parallel. It still takes a lock when doing
the actual fsck checks, but most of the time goes to the zlib inflation
and delta reconstruction.

There's some discussion in:

  https://public-inbox.org/git/20180816210657.GA9291@sigill.intra.peff.net/

and even some patches elsewhere in the thread here:

  https://public-inbox.org/git/20180902075528.GC18787@sigill.intra.peff.net/

and here:

  https://public-inbox.org/git/20180902085503.GA25391@sigill.intra.peff.net/

I think the big show-stopper there is how ugly it is to run the pack
verification in a separate process (and I suspect it is not just ugly
from a code point of view, but actively breaks index-pack because it
then relies on the set of objects seen during the first phase to do its
connectivity check).

So there would probably need to be some lib-ification work on index-pack
first, so that we could call it (at least in verification mode) multiple
times from inside fsck.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-06-04 20:53 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-24  6:45 git archive generates tar with malformed pax extended attribute Keegan Carruthers-Smith
2019-05-24  7:06 ` Jeff King
2019-05-24  7:35   ` Keegan Carruthers-Smith
2019-05-24  8:13     ` Jeff King
2019-05-25 13:26       ` René Scharfe
2019-05-25 13:46         ` Andreas Schwab
2019-05-25 21:07         ` Ævar Arnfjörð Bjarmason
2019-05-26 21:33           ` René Scharfe
2019-05-28  5:44             ` Jeff King
2019-05-28  5:58         ` Jeff King
2019-05-28 18:01           ` René Scharfe
2019-05-28 19:08             ` Jeff King
2019-05-28 23:34               ` René Scharfe
2019-05-29  1:17                 ` Jeff King
2019-05-29 17:54                   ` René Scharfe
2019-05-30 11:55                     ` Jeff King
2019-06-02 16:58                       ` René Scharfe
2019-06-04 20:53                         ` Jeff King
2019-05-27  5:11       ` Keegan Carruthers-Smith
2019-05-25 20:46   ` Ævar Arnfjörð Bjarmason
2019-05-25 21:19     ` brian m. carlson

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).