* git archive generates tar with malformed pax extended attribute @ 2019-05-24 6:45 Keegan Carruthers-Smith 2019-05-24 7:06 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: Keegan Carruthers-Smith @ 2019-05-24 6:45 UTC (permalink / raw) To: git Hello, git archive can generate a malformed tar archive. bsdtar reports the error "tar: Ignoring malformed pax extended attribute" when reading the archive. Go's "tar/archive" package also reports the error "archive/tar: invalid tar header". However, BusyBox's tar does not report the error (unsure if it just has less error logging). I can reproduce this when generating the tar on linux and mac. I tested this with "git version 2.21.0" and a build of next I did today: "git version 2.22.0.rc1.257.g3120a18244" Reproduction: $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git $ cd NNDD $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137 | tar tf - > /dev/null tar: Ignoring malformed pax extended attribute tar: Error exit delayed from previous errors. Kind Regards, Keegan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 6:45 git archive generates tar with malformed pax extended attribute Keegan Carruthers-Smith @ 2019-05-24 7:06 ` Jeff King 2019-05-24 7:35 ` Keegan Carruthers-Smith 2019-05-25 20:46 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 21+ messages in thread From: Jeff King @ 2019-05-24 7:06 UTC (permalink / raw) To: Keegan Carruthers-Smith; +Cc: git On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote: > git archive can generate a malformed tar archive. bsdtar reports the > error "tar: Ignoring malformed pax extended attribute" when reading > the archive. Go's "tar/archive" package also reports the error > "archive/tar: invalid tar header". However, BusyBox's tar does not > report the error (unsure if it just has less error logging). > > I can reproduce this when generating the tar on linux and mac. I > tested this with "git version 2.21.0" and a build of next I did today: > "git version 2.22.0.rc1.257.g3120a18244" > > Reproduction: > > $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git > $ cd NNDD > $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137 > | tar tf - > /dev/null > tar: Ignoring malformed pax extended attribute > tar: Error exit delayed from previous errors. I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 (from Debian's bsdtar package). What does your "tar --version" say? Git does write a pax header with the commit id in it as a comment. Presumably that's what it's complaining about (but it is not malformed according to any tar I've tried). If you feed git-archive a tree rather than a commit, that is omitted. What does: git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null say? If it doesn't complain, then we know it's indeed the pax comment field. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 7:06 ` Jeff King @ 2019-05-24 7:35 ` Keegan Carruthers-Smith 2019-05-24 8:13 ` Jeff King 2019-05-25 20:46 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 21+ messages in thread From: Keegan Carruthers-Smith @ 2019-05-24 7:35 UTC (permalink / raw) To: Jeff King; +Cc: git On Fri, 24 May 2019 at 09:06, Jeff King <peff@peff.net> wrote: > > On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote: > > > git archive can generate a malformed tar archive. bsdtar reports the > > error "tar: Ignoring malformed pax extended attribute" when reading > > the archive. Go's "tar/archive" package also reports the error > > "archive/tar: invalid tar header". However, BusyBox's tar does not > > report the error (unsure if it just has less error logging). > > > > I can reproduce this when generating the tar on linux and mac. I > > tested this with "git version 2.21.0" and a build of next I did today: > > "git version 2.22.0.rc1.257.g3120a18244" > > > > Reproduction: > > > > $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git > > $ cd NNDD > > $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137 > > | tar tf - > /dev/null > > tar: Ignoring malformed pax extended attribute > > tar: Error exit delayed from previous errors. > > I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 > (from Debian's bsdtar package). What does your "tar --version" say? bsdtar 2.8.3 - libarchive 2.8.3 > Git does write a pax header with the commit id in it as a comment. > Presumably that's what it's complaining about (but it is not malformed > according to any tar I've tried). If you feed git-archive a tree rather > than a commit, that is omitted. What does: > > git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null > > say? If it doesn't complain, then we know it's indeed the pax comment > field. It also complains $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null tar: Ignoring malformed pax extended attribute tar: Error exit delayed from previous errors. Some more context: I work at Sourcegraph.com We mirror a lot of repos from github.com. We usually interact with a working copy by running git archive on it in our infrastructure. This is the first repository that I have noticed which produces this error. An interesting thing to note is the commit metadata contains a lot of non-ascii text which was my guess at what my be tripping up the tar creation. Keegan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 7:35 ` Keegan Carruthers-Smith @ 2019-05-24 8:13 ` Jeff King 2019-05-25 13:26 ` René Scharfe 2019-05-27 5:11 ` Keegan Carruthers-Smith 0 siblings, 2 replies; 21+ messages in thread From: Jeff King @ 2019-05-24 8:13 UTC (permalink / raw) To: Keegan Carruthers-Smith; +Cc: git On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote: > > I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 > > (from Debian's bsdtar package). What does your "tar --version" say? > > bsdtar 2.8.3 - libarchive 2.8.3 Interesting. I wonder if there was a libarchive bug that was fixed between 2.8.3 and 3.3.3. > > Git does write a pax header with the commit id in it as a comment. > > Presumably that's what it's complaining about (but it is not malformed > > according to any tar I've tried). If you feed git-archive a tree rather > > than a commit, that is omitted. What does: > > > > git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null > > > > say? If it doesn't complain, then we know it's indeed the pax comment > > field. > > It also complains > > $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null > tar: Ignoring malformed pax extended attribute > tar: Error exit delayed from previous errors. Ah, OK. So it's not the comment field at all, but some other entry. > Some more context: I work at Sourcegraph.com We mirror a lot of repos > from github.com. We usually interact with a working copy by running > git archive on it in our infrastructure. This is the first repository > that I have noticed which produces this error. An interesting thing to > note is the commit metadata contains a lot of non-ascii text which was > my guess at what my be tripping up the tar creation. Yeah, though the only thing that makes it into the tarfile is the actual tree entries. I'd imagine the file content is not likely to be a source of problems, as it's common to see binary gunk there. Most of the filenames are pretty mundane, but this symlink destination is a little funny: $ git archive ... | tar tvf - | grep nicovideo4as.swc lrwxrwxrwx root/root 0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024 That's not the full story, though. It is indeed a symlink in the tree: $ git ls-tree -r HEAD libs/nicovideo4as.swc 120000 blob ec3137b5fcaeae25cf67927068af116517683806 libs/nicovideo4as.swc But the contents of that blob, which should be the destination filename, are definitely not: $ git cat-file blob ec3137b5f | wc -c 57804 $ git cat-file blob ec3137b5f | xxd | head -1 00000000: 504b 0304 1400 0800 0800 5069 694e 0000 PK........PiiN.. There's quite a bit more data there. And what tar showed us goes up to the first NUL, which does not seem surprising. It's possible Git is doing the wrong thing on the writing side, but given that newer versions of bsdtar handle it fine, I'd guess that the old one simply had problems consuming poorly formed symlink filenames. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 8:13 ` Jeff King @ 2019-05-25 13:26 ` René Scharfe 2019-05-25 13:46 ` Andreas Schwab ` (2 more replies) 2019-05-27 5:11 ` Keegan Carruthers-Smith 1 sibling, 3 replies; 21+ messages in thread From: René Scharfe @ 2019-05-25 13:26 UTC (permalink / raw) To: Jeff King, Keegan Carruthers-Smith; +Cc: git Am 24.05.19 um 10:13 schrieb Jeff King: > On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote: > >>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 >>> (from Debian's bsdtar package). What does your "tar --version" say? >> >> bsdtar 2.8.3 - libarchive 2.8.3 > > Interesting. I wonder if there was a libarchive bug that was fixed > between 2.8.3 and 3.3.3. > >>> Git does write a pax header with the commit id in it as a comment. >>> Presumably that's what it's complaining about (but it is not malformed >>> according to any tar I've tried). If you feed git-archive a tree rather >>> than a commit, that is omitted. What does: >>> >>> git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null >>> >>> say? If it doesn't complain, then we know it's indeed the pax comment >>> field. >> >> It also complains >> >> $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null >> tar: Ignoring malformed pax extended attribute >> tar: Error exit delayed from previous errors. > > Ah, OK. So it's not the comment field at all, but some other entry. > >> Some more context: I work at Sourcegraph.com We mirror a lot of repos >> from github.com. We usually interact with a working copy by running >> git archive on it in our infrastructure. This is the first repository >> that I have noticed which produces this error. An interesting thing to >> note is the commit metadata contains a lot of non-ascii text which was >> my guess at what my be tripping up the tar creation. > > Yeah, though the only thing that makes it into the tarfile is the actual > tree entries. I'd imagine the file content is not likely to be a source > of problems, as it's common to see binary gunk there. Most of the > filenames are pretty mundane, but this symlink destination is a little > funny: > > $ git archive ... | tar tvf - | grep nicovideo4as.swc > lrwxrwxrwx root/root 0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024 > > That's not the full story, though. It is indeed a symlink in the > tree: > > $ git ls-tree -r HEAD libs/nicovideo4as.swc > 120000 blob ec3137b5fcaeae25cf67927068af116517683806 libs/nicovideo4as.swc > > But the contents of that blob, which should be the destination filename, > are definitely not: > > $ git cat-file blob ec3137b5f | wc -c > 57804 > $ git cat-file blob ec3137b5f | xxd | head -1 > 00000000: 504b 0304 1400 0800 0800 5069 694e 0000 PK........PiiN.. > > There's quite a bit more data there. And what tar showed us goes up to > the first NUL, which does not seem surprising. That (the symlink target) is a ZIP file with the following contents: Length Method Size Cmpr Date Time CRC-32 Name -------- ------ ------- ---- ---------- ----- -------- ---- 39733 Defl:N 3403 91% 2019-03-09 13:10 489e1be1 catalog.xml 54131 Defl:N 54151 0% 2019-03-09 13:10 32f57322 library.swf -------- ------- --- ------- 93864 57554 39% 2 files And link targets longer than 100 characters are encoded in an extended Pax header. (Usually symlink targets are paths, not file contents.) > It's possible Git is doing the wrong thing on the writing side, but > given that newer versions of bsdtar handle it fine, I'd guess that the > old one simply had problems consuming poorly formed symlink filenames. Git preserves symlink targets with embedded NULs in the repository and in generated tar files. Not sure if GNU tar and bsdtar truncating them at the first NUL is a bug. I'm also not sure if there is a platform that would allow creating such a symlink in the file system, or how one is supposed to use it. We could truncate symlink targets at the first NUL as well in git archive -- but that would be a bit sad, as the archive formats allow storing the "real" target from the repo, with NUL and all. We could make git fsck report such symlinks. Can Unicode symlink targets contain NULs? We wouldn't want to damage them even if we decide to truncate. René ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-25 13:26 ` René Scharfe @ 2019-05-25 13:46 ` Andreas Schwab 2019-05-25 21:07 ` Ævar Arnfjörð Bjarmason 2019-05-28 5:58 ` Jeff King 2 siblings, 0 replies; 21+ messages in thread From: Andreas Schwab @ 2019-05-25 13:46 UTC (permalink / raw) To: René Scharfe; +Cc: Jeff King, Keegan Carruthers-Smith, git On Mai 25 2019, René Scharfe <l.s.r@web.de> wrote: > Can Unicode symlink targets contain NULs? We wouldn't want to damage > them even if we decide to truncate. The POSIX interface doesn't allow creating such a symlink, since the argument to symlink is a C string. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-25 13:26 ` René Scharfe 2019-05-25 13:46 ` Andreas Schwab @ 2019-05-25 21:07 ` Ævar Arnfjörð Bjarmason 2019-05-26 21:33 ` René Scharfe 2019-05-28 5:58 ` Jeff King 2 siblings, 1 reply; 21+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-25 21:07 UTC (permalink / raw) To: René Scharfe; +Cc: Jeff King, Keegan Carruthers-Smith, git On Sat, May 25 2019, René Scharfe wrote: > Am 24.05.19 um 10:13 schrieb Jeff King: >> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote: >> >>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 >>>> (from Debian's bsdtar package). What does your "tar --version" say? >>> >>> bsdtar 2.8.3 - libarchive 2.8.3 >> >> Interesting. I wonder if there was a libarchive bug that was fixed >> between 2.8.3 and 3.3.3. >> >>>> Git does write a pax header with the commit id in it as a comment. >>>> Presumably that's what it's complaining about (but it is not malformed >>>> according to any tar I've tried). If you feed git-archive a tree rather >>>> than a commit, that is omitted. What does: >>>> >>>> git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null >>>> >>>> say? If it doesn't complain, then we know it's indeed the pax comment >>>> field. >>> >>> It also complains >>> >>> $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null >>> tar: Ignoring malformed pax extended attribute >>> tar: Error exit delayed from previous errors. >> >> Ah, OK. So it's not the comment field at all, but some other entry. >> >>> Some more context: I work at Sourcegraph.com We mirror a lot of repos >>> from github.com. We usually interact with a working copy by running >>> git archive on it in our infrastructure. This is the first repository >>> that I have noticed which produces this error. An interesting thing to >>> note is the commit metadata contains a lot of non-ascii text which was >>> my guess at what my be tripping up the tar creation. >> >> Yeah, though the only thing that makes it into the tarfile is the actual >> tree entries. I'd imagine the file content is not likely to be a source >> of problems, as it's common to see binary gunk there. Most of the >> filenames are pretty mundane, but this symlink destination is a little >> funny: >> >> $ git archive ... | tar tvf - | grep nicovideo4as.swc >> lrwxrwxrwx root/root 0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024 >> >> That's not the full story, though. It is indeed a symlink in the >> tree: >> >> $ git ls-tree -r HEAD libs/nicovideo4as.swc >> 120000 blob ec3137b5fcaeae25cf67927068af116517683806 libs/nicovideo4as.swc >> >> But the contents of that blob, which should be the destination filename, >> are definitely not: >> >> $ git cat-file blob ec3137b5f | wc -c >> 57804 >> $ git cat-file blob ec3137b5f | xxd | head -1 >> 00000000: 504b 0304 1400 0800 0800 5069 694e 0000 PK........PiiN.. >> >> There's quite a bit more data there. And what tar showed us goes up to >> the first NUL, which does not seem surprising. > > That (the symlink target) is a ZIP file with the following contents: > > Length Method Size Cmpr Date Time CRC-32 Name > -------- ------ ------- ---- ---------- ----- -------- ---- > 39733 Defl:N 3403 91% 2019-03-09 13:10 489e1be1 catalog.xml > 54131 Defl:N 54151 0% 2019-03-09 13:10 32f57322 library.swf > -------- ------- --- ------- > 93864 57554 39% 2 files > > And link targets longer than 100 characters are encoded in an extended > Pax header. > > (Usually symlink targets are paths, not file contents.) > >> It's possible Git is doing the wrong thing on the writing side, but >> given that newer versions of bsdtar handle it fine, I'd guess that the >> old one simply had problems consuming poorly formed symlink filenames. > > Git preserves symlink targets with embedded NULs in the repository and > in generated tar files. Not sure if GNU tar and bsdtar truncating them > at the first NUL is a bug. I'm also not sure if there is a platform > that would allow creating such a symlink in the file system, or how one > is supposed to use it. > > We could truncate symlink targets at the first NUL as well in git > archive -- but that would be a bit sad, as the archive formats allow > storing the "real" target from the repo, with NUL and all. We could > make git fsck report such symlinks. > > Can Unicode symlink targets contain NULs? We wouldn't want to damage > them even if we decide to truncate. I don't see a practical use for this case, and maybe we should even fsck check for the blob representing the symlink target having a \0 in it as suggested upthread. But that being said, this assumption that data in a tar archive will get written to a FS of some sort isn't true. There's plenty of consumers of the format that read it in-memory and stream its contents out to something else entirely, e.g. taking "git archive --remote" output, parsing it with e.g. [1] and throwing some/all of the content into a database. 1. https://metacpan.org/pod/Archive::Tar ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-25 21:07 ` Ævar Arnfjörð Bjarmason @ 2019-05-26 21:33 ` René Scharfe 2019-05-28 5:44 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: René Scharfe @ 2019-05-26 21:33 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Keegan Carruthers-Smith, git Am 25.05.19 um 23:07 schrieb Ævar Arnfjörð Bjarmason: > > On Sat, May 25 2019, René Scharfe wrote: > >> We could truncate symlink targets at the first NUL as well in git >> archive -- but that would be a bit sad, as the archive formats allow >> storing the "real" target from the repo, with NUL and all. > But that being said, this assumption that data in a tar archive will get > written to a FS of some sort isn't true. There's plenty of consumers of > the format that read it in-memory and stream its contents out to > something else entirely, e.g. taking "git archive --remote" output, > parsing it with e.g. [1] and throwing some/all of the content into a > database. > > 1. https://metacpan.org/pod/Archive::Tar Git archive writes link targets that are 100 characters long or less into the appropriate field in the plain tar header. It copies everything, including NULs, but unlike a PAX extended header that field lacks a length indicator, so extractors basically have to treat it as NUL-terminated. If we want to preserve NUL in short link targets as well, we'd have to put such names into an PAX extended header.. --- archive-tar.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/archive-tar.c b/archive-tar.c index 3e53aac1e6..e8f55578d1 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -291,7 +291,8 @@ static int write_tar_entry(struct archiver_args *args, } if (S_ISLNK(mode)) { - if (size > sizeof(header.linkname)) { + if (size > sizeof(header.linkname) || + memchr(buffer, '\0', size)) { xsnprintf(header.linkname, sizeof(header.linkname), "see %s.paxheader", oid_to_hex(oid)); strbuf_append_ext_header(&ext_header, "linkpath", -- 2.21.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-26 21:33 ` René Scharfe @ 2019-05-28 5:44 ` Jeff King 0 siblings, 0 replies; 21+ messages in thread From: Jeff King @ 2019-05-28 5:44 UTC (permalink / raw) To: René Scharfe Cc: Ævar Arnfjörð Bjarmason, Keegan Carruthers-Smith, git On Sun, May 26, 2019 at 11:33:20PM +0200, René Scharfe wrote: > Git archive writes link targets that are 100 characters long or less > into the appropriate field in the plain tar header. It copies > everything, including NULs, but unlike a PAX extended header that field > lacks a length indicator, so extractors basically have to treat it as > NUL-terminated. > > If we want to preserve NUL in short link targets as well, we'd have to > put such names into an PAX extended header.. I can't say that I care much either way, since putting NULs in your filenames is pretty crazy. But that would at least make things consistent regardless of the length. I also wondered if you could play any security tricks here (e.g., by having one tool view a filename as "foo" and another as "foo\0bar"). I doubt it, though. Filenames themselves in Git are always NUL-terminated, since that's dictated the tree format; so this is really just about link destinations. And even then, it's hard to imagine a case. I was wondering if you could have an entry ".git\0foo" that ends up written to the filesystem as ".git". But these are tar archives we're talking about, so it's not like you couldn't just put ".git" in a tar file in the first place. But maybe somebody else can brainstorm something more evil. :) -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-25 13:26 ` René Scharfe 2019-05-25 13:46 ` Andreas Schwab 2019-05-25 21:07 ` Ævar Arnfjörð Bjarmason @ 2019-05-28 5:58 ` Jeff King 2019-05-28 18:01 ` René Scharfe 2 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2019-05-28 5:58 UTC (permalink / raw) To: René Scharfe; +Cc: Keegan Carruthers-Smith, git On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote: > We could truncate symlink targets at the first NUL as well in git > archive -- but that would be a bit sad, as the archive formats allow > storing the "real" target from the repo, with NUL and all. We could > make git fsck report such symlinks. This is a little tricky, because fsck generally looks at individual objects, and the bad pattern is a combination of a tree and a blob together. I think you could make it work by reusing some of the code and patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into maint, 2018-05-22). > Can Unicode symlink targets contain NULs? We wouldn't want to damage > them even if we decide to truncate. On Windows, I suppose, where pathnames can be UTF-16? I don't know how any of that works with Git. I guess we'd always have to assume the filenames in Git are UTF-8 or at least some ASCII-superset, since we cannot encode NULs; and presumably that would extend to link destinations, too. So I doubt it's a problem in practice. Personally, I'd wait until somebody with such a system cares enough to suggest a new behavior, rather than trying to guess. :) Likewise, I think at this point with Keegan's original report that Git is doing something reasonable with a lousy input. Unless something interesting comes out of the golang/go bug report discussion (thank you for opening that!), it's probably not worth chasing hypotheticals. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-28 5:58 ` Jeff King @ 2019-05-28 18:01 ` René Scharfe 2019-05-28 19:08 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: René Scharfe @ 2019-05-28 18:01 UTC (permalink / raw) To: Jeff King; +Cc: Keegan Carruthers-Smith, git Am 28.05.19 um 07:58 schrieb Jeff King: > On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote: > >> We could >> make git fsck report such symlinks. > > This is a little tricky, because fsck generally looks at individual > objects, and the bad pattern is a combination of a tree and a blob > together. I think you could make it work by reusing some of the code and > patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into > maint, 2018-05-22). Actually it's super easy, barely an inconvenience (SCNR, watched a lot of those rants recently).. Did I miss something? -- >8 -- Subject: [PATCH] fsck: warn about symlink targets containing NUL Symlink targets are kept as blob objects in Git repositories and can thus contain arbitrary characters. They are truncated silently at the first NUL by symlink(2) during checkout, though. Make git fsck warn about such tree entries. Signed-off-by: Rene Scharfe <l.s.r@web.de> --- fsck.c | 32 ++++++++++++++++++++++++++++++++ t/t1450-fsck.sh | 13 +++++++++++++ 2 files changed, 45 insertions(+) diff --git a/fsck.c b/fsck.c index 4703f55561..d434f5d75c 100644 --- a/fsck.c +++ b/fsck.c @@ -49,6 +49,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(MISSING_PARENT, ERROR) \ FUNC(MISSING_SPACE_BEFORE_DATE, ERROR) \ FUNC(MISSING_SPACE_BEFORE_EMAIL, ERROR) \ + FUNC(MISSING_SYMLINK_OBJECT, ERROR) \ FUNC(MISSING_TAG, ERROR) \ FUNC(MISSING_TAG_ENTRY, ERROR) \ FUNC(MISSING_TAG_OBJECT, ERROR) \ @@ -58,6 +59,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(MISSING_TYPE_ENTRY, ERROR) \ FUNC(MULTIPLE_AUTHORS, ERROR) \ FUNC(TAG_OBJECT_NOT_TAG, ERROR) \ + FUNC(SYMLINK_OBJECT_NOT_BLOB, ERROR) \ FUNC(TREE_NOT_SORTED, ERROR) \ FUNC(UNKNOWN_TYPE, ERROR) \ FUNC(ZERO_PADDED_DATE, ERROR) \ @@ -78,6 +80,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(NULL_SHA1, WARN) \ FUNC(ZERO_PADDED_FILEMODE, WARN) \ FUNC(NUL_IN_COMMIT, WARN) \ + FUNC(NUL_IN_SYMLINK_TARGET, WARN) \ /* infos (reported as warnings, but ignored by default) */ \ FUNC(GITMODULES_PARSE, INFO) \ FUNC(BAD_TAG_NAME, INFO) \ @@ -578,6 +581,33 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con return c1 < c2 ? 0 : TREE_UNORDERED; } +static int fsck_symlink(struct tree *tree, const char *name, + const struct object_id *oid, + struct fsck_options *options) +{ + int ret = 0; + enum object_type type; + unsigned long size; + void *buffer = read_object_file(oid, &type, &size); + + if (!buffer) + ret = report(options, &tree->object, + FSCK_MSG_MISSING_SYMLINK_OBJECT, + "cannot read blob object for symlink %s", name); + else if (type != OBJ_BLOB) + ret = report(options, &tree->object, + FSCK_MSG_SYMLINK_OBJECT_NOT_BLOB, + "expected blob got %s for symlink %s", + type_name(type), name); + else if (memchr(buffer, '\0', size)) + ret = report(options, &tree->object, + FSCK_MSG_NUL_IN_SYMLINK_TARGET, + "NUL in target of symlink %s", name); + + free(buffer); + return ret; +} + static int fsck_tree(struct tree *item, struct fsck_options *options) { int retval = 0; @@ -626,6 +656,8 @@ static int fsck_tree(struct tree *item, struct fsck_options *options) FSCK_MSG_GITMODULES_SYMLINK, ".gitmodules is a symbolic link"); } + if (S_ISLNK(mode)) + retval += fsck_symlink(item, name, oid, options); if (update_tree_entry_gently(&desc)) { retval += report(options, &item->object, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree"); diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 0f268a3664..ce9501d063 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -822,4 +822,17 @@ test_expect_success 'detect corrupt index file in fsck' ' test_i18ngrep "bad index file" errors ' +test_expect_success 'detect NUL in symlink target' ' + test_when_finished "git update-ref -d refs/heads/nul_in_symlink" && + test_when_finished "remove_object \$commit" && + test_when_finished "remove_object \$tree" && + test_when_finished "remove_object \$blob" && + blob=$(echo fooQbar | q_to_nul | git hash-object -w --stdin) && + tree=$(echo "120000 blob $blob symlink" | git mktree) && + commit=$(git commit-tree $tree) && + git update-ref refs/heads/nul_in_symlink $commit && + git fsck 2>out && + test_i18ngrep "NUL in target of symlink" out +' + test_done -- 2.21.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-28 18:01 ` René Scharfe @ 2019-05-28 19:08 ` Jeff King 2019-05-28 23:34 ` René Scharfe 0 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2019-05-28 19:08 UTC (permalink / raw) To: René Scharfe; +Cc: Keegan Carruthers-Smith, git On Tue, May 28, 2019 at 08:01:43PM +0200, René Scharfe wrote: > Am 28.05.19 um 07:58 schrieb Jeff King: > > On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote: > > > >> We could > >> make git fsck report such symlinks. > > > > This is a little tricky, because fsck generally looks at individual > > objects, and the bad pattern is a combination of a tree and a blob > > together. I think you could make it work by reusing some of the code and > > patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into > > maint, 2018-05-22). > > Actually it's super easy, barely an inconvenience (SCNR, watched a lot > of those rants recently).. Did I miss something? Yes. You cannot rely on calling read_object_file() in real-time when the fsck is being done by index-pack. The blob in question may be in the pack you are indexing. See fsck_finish() for how we do this for .gitmodules checks. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-28 19:08 ` Jeff King @ 2019-05-28 23:34 ` René Scharfe 2019-05-29 1:17 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: René Scharfe @ 2019-05-28 23:34 UTC (permalink / raw) To: Jeff King; +Cc: Keegan Carruthers-Smith, git Am 28.05.19 um 21:08 schrieb Jeff King: > On Tue, May 28, 2019 at 08:01:43PM +0200, René Scharfe wrote: > >> Am 28.05.19 um 07:58 schrieb Jeff King: >>> On Sat, May 25, 2019 at 03:26:53PM +0200, René Scharfe wrote: >>> >>>> We could >>>> make git fsck report such symlinks. >>> >>> This is a little tricky, because fsck generally looks at individual >>> objects, and the bad pattern is a combination of a tree and a blob >>> together. I think you could make it work by reusing some of the code and >>> patterns from 9e84a6d758 (Merge branch 'jk/submodule-fsck-loose' into >>> maint, 2018-05-22). >> >> Actually it's super easy, barely an inconvenience (SCNR, watched a lot >> of those rants recently).. Did I miss something? > > Yes. You cannot rely on calling read_object_file() in real-time when the > fsck is being done by index-pack. The blob in question may be in the > pack you are indexing. It figures. So something like the patch below? Parsing trees with symlinks twice is not ideal, but keeps the set structure simple -- a standard oidset suffices. The global variables are ugly. Moving them into struct fsck_option would be possible, but not much better, as they aren't really options. FSCK_MSG_MISSING_TREE_OBJECT has never been used before, it seems. --- fsck.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++ t/t1450-fsck.sh | 13 +++++++++++ 2 files changed, 71 insertions(+) diff --git a/fsck.c b/fsck.c index 4703f55561..a6e7d0b03f 100644 --- a/fsck.c +++ b/fsck.c @@ -19,6 +19,7 @@ static struct oidset gitmodules_found = OIDSET_INIT; static struct oidset gitmodules_done = OIDSET_INIT; +static struct oidset trees_with_symlinks = OIDSET_INIT; #define FSCK_FATAL -1 #define FSCK_INFO -2 @@ -49,6 +50,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(MISSING_PARENT, ERROR) \ FUNC(MISSING_SPACE_BEFORE_DATE, ERROR) \ FUNC(MISSING_SPACE_BEFORE_EMAIL, ERROR) \ + FUNC(MISSING_SYMLINK_OBJECT, ERROR) \ FUNC(MISSING_TAG, ERROR) \ FUNC(MISSING_TAG_ENTRY, ERROR) \ FUNC(MISSING_TAG_OBJECT, ERROR) \ @@ -58,6 +60,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(MISSING_TYPE_ENTRY, ERROR) \ FUNC(MULTIPLE_AUTHORS, ERROR) \ FUNC(TAG_OBJECT_NOT_TAG, ERROR) \ + FUNC(SYMLINK_OBJECT_NOT_BLOB, ERROR) \ FUNC(TREE_NOT_SORTED, ERROR) \ FUNC(UNKNOWN_TYPE, ERROR) \ FUNC(ZERO_PADDED_DATE, ERROR) \ @@ -78,6 +81,7 @@ static struct oidset gitmodules_done = OIDSET_INIT; FUNC(NULL_SHA1, WARN) \ FUNC(ZERO_PADDED_FILEMODE, WARN) \ FUNC(NUL_IN_COMMIT, WARN) \ + FUNC(NUL_IN_SYMLINK_TARGET, WARN) \ /* infos (reported as warnings, but ignored by default) */ \ FUNC(GITMODULES_PARSE, INFO) \ FUNC(BAD_TAG_NAME, INFO) \ @@ -578,6 +582,33 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con return c1 < c2 ? 0 : TREE_UNORDERED; } +static int fsck_symlink(struct tree *tree, const char *name, + const struct object_id *oid, + struct fsck_options *options) +{ + int ret = 0; + enum object_type type; + unsigned long size; + void *buffer = read_object_file(oid, &type, &size); + + if (!buffer) + ret = report(options, &tree->object, + FSCK_MSG_MISSING_SYMLINK_OBJECT, + "cannot read blob object for symlink %s", name); + else if (type != OBJ_BLOB) + ret = report(options, &tree->object, + FSCK_MSG_SYMLINK_OBJECT_NOT_BLOB, + "expected blob got %s for symlink %s", + type_name(type), name); + else if (memchr(buffer, '\0', size)) + ret = report(options, &tree->object, + FSCK_MSG_NUL_IN_SYMLINK_TARGET, + "NUL in target of symlink %s", name); + + free(buffer); + return ret; +} + static int fsck_tree(struct tree *item, struct fsck_options *options) { int retval = 0; @@ -626,6 +657,8 @@ static int fsck_tree(struct tree *item, struct fsck_options *options) FSCK_MSG_GITMODULES_SYMLINK, ".gitmodules is a symbolic link"); } + if (S_ISLNK(mode)) + oidset_insert(&trees_with_symlinks, &item->object.oid); if (update_tree_entry_gently(&desc)) { retval += report(options, &item->object, FSCK_MSG_BAD_TREE, "cannot be parsed as a tree"); @@ -1118,8 +1151,33 @@ int fsck_finish(struct fsck_options *options) free(buf); } + oidset_iter_init(&trees_with_symlinks, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree; + struct tree_desc desc; + struct name_entry entry; + + tree = lookup_tree(the_repository, oid); + if (!tree) { + struct object *obj = lookup_unknown_object(oid->hash); + ret |= report(options, obj, + FSCK_MSG_MISSING_TREE_OBJECT, + "tree %s not found", oid_to_hex(oid)); + continue; + } + if (parse_tree(tree)) + continue; + if (init_tree_desc_gently(&desc, tree->buffer, tree->size)) + continue; + while (tree_entry_gently(&desc, &entry)) { + if (S_ISLNK(entry.mode)) + ret |= fsck_symlink(tree, entry.path, + &entry.oid, options); + } + } oidset_clear(&gitmodules_found); oidset_clear(&gitmodules_done); + oidset_clear(&trees_with_symlinks); return ret; } diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 0f268a3664..ce9501d063 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -822,4 +822,17 @@ test_expect_success 'detect corrupt index file in fsck' ' test_i18ngrep "bad index file" errors ' +test_expect_success 'detect NUL in symlink target' ' + test_when_finished "git update-ref -d refs/heads/nul_in_symlink" && + test_when_finished "remove_object \$commit" && + test_when_finished "remove_object \$tree" && + test_when_finished "remove_object \$blob" && + blob=$(echo fooQbar | q_to_nul | git hash-object -w --stdin) && + tree=$(echo "120000 blob $blob symlink" | git mktree) && + commit=$(git commit-tree $tree) && + git update-ref refs/heads/nul_in_symlink $commit && + git fsck 2>out && + test_i18ngrep "NUL in target of symlink" out +' + test_done -- 2.21.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-28 23:34 ` René Scharfe @ 2019-05-29 1:17 ` Jeff King 2019-05-29 17:54 ` René Scharfe 0 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2019-05-29 1:17 UTC (permalink / raw) To: René Scharfe; +Cc: Keegan Carruthers-Smith, git On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote: > It figures. > > So something like the patch below? > > Parsing trees with symlinks twice is not ideal, but keeps the set > structure simple -- a standard oidset suffices. If blobs comes after trees (and they usually do in a pack), you can do it in a single pass by marking the blob as a symlink target, and then when we actually see that blob's contents, marking it as either OK or problematic. And then the finish() step just correlates those with the tree. It does require O(n) storage in the number of symlinked blobs, but also O(n) in the number of symlinked tree entries (number of trees with symlinks times the number of entries in each such tree, _even if they're the same entry/blob as another tree). That makes it a lot worse than the existing gitmodules check. There we only care about finding the .gitmodules blobs. So even though you have a ton of trees that mention .gitmodules (basically every root tree), the the .gitmodules file itself doesn't change much. So we only end up with a small oidset (and a small worst case for looking at objects twice). But here the problem is in the tree, not the blob. So we're not finding suspect blobs, but rather re-checking each tree. And no matter what we do (whether it's visiting the object again, or creating a set or mapping with the object names) is going to be linear there. And a repository with a symlink in the root tree is going to revisit or put in our mapping every single root tree. TBH, I'm not sure this fsck check was worth it even without the implementation complexity. > The global variables are ugly. Moving them into struct fsck_option > would be possible, but not much better, as they aren't really > options. Yeah, we use the name "context" elsewhere for this, which is a bit clearer. In a real object-oriented design, I guess people would make an "Fscker" object, and it would carry the options and context forward. That's basically what our "context" objects are. > FSCK_MSG_MISSING_TREE_OBJECT has never been used before, it seems. Yeah, this is leftover cruft from my gitmodules series. I double-checked the early iterations to be sure, and it is definitely the case. I didn't dig down to find out the reason it went away, but I think it is probably that the original version tried harder to find .gitmodules only in root trees, and later we decided to just complain about it in any tree. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-29 1:17 ` Jeff King @ 2019-05-29 17:54 ` René Scharfe 2019-05-30 11:55 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: René Scharfe @ 2019-05-29 17:54 UTC (permalink / raw) To: Jeff King; +Cc: Keegan Carruthers-Smith, git Am 29.05.19 um 03:17 schrieb Jeff King: > On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote: >> Parsing trees with symlinks twice is not ideal, but keeps the set >> structure simple -- a standard oidset suffices. > > If blobs comes after trees (and they usually do in a pack), you can do > it in a single pass by marking the blob as a symlink target, and then > when we actually see that blob's contents, marking it as either OK or > problematic. And then the finish() step just correlates those with the > tree. Good idea. Is that ordering guaranteed? (Stumbling about the "usually" in your first sentence.) An ordering where dependent objects (like trees) follow the objects they reference would be better suited for these kinds of checks.. > It does require O(n) storage in the number of symlinked blobs, but also > O(n) in the number of symlinked tree entries (number of trees with > symlinks times the number of entries in each such tree, _even if they're > the same entry/blob as another tree). > > That makes it a lot worse than the existing gitmodules check. There we > only care about finding the .gitmodules blobs. So even though you have a > ton of trees that mention .gitmodules (basically every root tree), the > the .gitmodules file itself doesn't change much. So we only end up with > a small oidset (and a small worst case for looking at objects twice). > > But here the problem is in the tree, not the blob. So we're not finding > suspect blobs, but rather re-checking each tree. And no matter what we > do (whether it's visiting the object again, or creating a set or mapping > with the object names) is going to be linear there. And a repository > with a symlink in the root tree is going to revisit or put in our > mapping every single root tree. That's true, potentially it needs remember and/or reprocess all trees, meaning this check may double the run time of fsck in the worst case. Example from the wild: The kernel repo currently has 36 symlinks and 6+ million objects are checked in total, and the symlink check processes 18943 trees_with_symlinks entries there. > TBH, I'm not sure this fsck check was worth it even without the > implementation complexity. Hmm. git status reports such truncated symlinks as changed, so the issue *is* already detectable. René ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-29 17:54 ` René Scharfe @ 2019-05-30 11:55 ` Jeff King 2019-06-02 16:58 ` René Scharfe 0 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2019-05-30 11:55 UTC (permalink / raw) To: René Scharfe; +Cc: Keegan Carruthers-Smith, git On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote: > Am 29.05.19 um 03:17 schrieb Jeff King: > > On Wed, May 29, 2019 at 01:34:32AM +0200, René Scharfe wrote: > >> Parsing trees with symlinks twice is not ideal, but keeps the set > >> structure simple -- a standard oidset suffices. > > > > If blobs comes after trees (and they usually do in a pack), you can do > > it in a single pass by marking the blob as a symlink target, and then > > when we actually see that blob's contents, marking it as either OK or > > problematic. And then the finish() step just correlates those with the > > tree. > > Good idea. Is that ordering guaranteed? (Stumbling about the "usually" > in your first sentence.) It's not guaranteed. Our implementation of pack-objects does order blobs after trees, but I suspect this could be violated in rare cases with some of the delta-island pack layering stuff. I think it makes sense to be sure that the receiver is correct no matter what, but optimize for this expected case (that's what I tried to do with the .gitmodules checks). > An ordering where dependent objects (like trees) follow the objects they > reference would be better suited for these kinds of checks.. But worse for others (e.g., like .gitmodules where it's cheap to identify a candidate blob, but the blob check is involved; there it's much more optimal to see the tree first). > > But here the problem is in the tree, not the blob. So we're not finding > > suspect blobs, but rather re-checking each tree. And no matter what we > > do (whether it's visiting the object again, or creating a set or mapping > > with the object names) is going to be linear there. And a repository > > with a symlink in the root tree is going to revisit or put in our > > mapping every single root tree. > > That's true, potentially it needs remember and/or reprocess all trees, > meaning this check may double the run time of fsck in the worst case. > Example from the wild: The kernel repo currently has 36 symlinks and > 6+ million objects are checked in total, and the symlink check processes > 18943 trees_with_symlinks entries there. That sounds about right. It's basically every version of every tree that has a symlink. Did it make a noticeable difference in timing? Indexing the whole kernel history is already a horribly slow process. :) > > TBH, I'm not sure this fsck check was worth it even without the > > implementation complexity. > > Hmm. git status reports such truncated symlinks as changed, so the > issue *is* already detectable. Hmm, yeah. That makes sense, since the filesystem (well, really the syscall API layer) cannot represent the data we are feeding it. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-30 11:55 ` Jeff King @ 2019-06-02 16:58 ` René Scharfe 2019-06-04 20:53 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: René Scharfe @ 2019-06-02 16:58 UTC (permalink / raw) To: Jeff King; +Cc: Keegan Carruthers-Smith, git Am 30.05.19 um 13:55 schrieb Jeff King: > On Wed, May 29, 2019 at 07:54:44PM +0200, René Scharfe wrote: > >> Am 29.05.19 um 03:17 schrieb Jeff King: >>> But here the problem is in the tree, not the blob. So we're not finding >>> suspect blobs, but rather re-checking each tree. And no matter what we >>> do (whether it's visiting the object again, or creating a set or mapping >>> with the object names) is going to be linear there. And a repository >>> with a symlink in the root tree is going to revisit or put in our >>> mapping every single root tree. >> >> That's true, potentially it needs remember and/or reprocess all trees, >> meaning this check may double the run time of fsck in the worst case. >> Example from the wild: The kernel repo currently has 36 symlinks and >> 6+ million objects are checked in total, and the symlink check processes >> 18943 trees_with_symlinks entries there. > > That sounds about right. It's basically every version of every tree that > has a symlink. Did it make a noticeable difference in timing? Indexing > the whole kernel history is already a horribly slow process. :) Right, I didn't notice a difference -- no patience for watching that thing to the end. But here are some numbers for v2.21.0 vs. master with the patch: Benchmark #1: git fsck Time (mean ± σ): 307.775 s ± 9.054 s [User: 307.173 s, System: 0.448 s] Range (min … max): 294.052 s … 322.931 s 10 runs Benchmark #2: ~/src/git/git fsck Time (mean ± σ): 319.754 s ± 2.255 s [User: 318.927 s, System: 0.671 s] Range (min … max): 316.376 s … 323.747 s 10 runs Summary 'git fsck' ran 1.04 ± 0.03 times faster than '~/src/git/git fsck' Seeing only a single CPU core being stressed for that long is a bit sad to see. Checking individual objects should be relatively easy to parallelize, shouldn't it? René ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-06-02 16:58 ` René Scharfe @ 2019-06-04 20:53 ` Jeff King 0 siblings, 0 replies; 21+ messages in thread From: Jeff King @ 2019-06-04 20:53 UTC (permalink / raw) To: René Scharfe; +Cc: Keegan Carruthers-Smith, git On Sun, Jun 02, 2019 at 06:58:48PM +0200, René Scharfe wrote: > > That sounds about right. It's basically every version of every tree that > > has a symlink. Did it make a noticeable difference in timing? Indexing > > the whole kernel history is already a horribly slow process. :) > > Right, I didn't notice a difference -- no patience for watching that > thing to the end. But here are some numbers for v2.21.0 vs. master with > the patch: > > Benchmark #1: git fsck > Time (mean ± σ): 307.775 s ± 9.054 s [User: 307.173 s, System: 0.448 s] > Range (min … max): 294.052 s … 322.931 s 10 runs > > Benchmark #2: ~/src/git/git fsck > Time (mean ± σ): 319.754 s ± 2.255 s [User: 318.927 s, System: 0.671 s] > Range (min … max): 316.376 s … 323.747 s 10 runs > > Summary > 'git fsck' ran > 1.04 ± 0.03 times faster than '~/src/git/git fsck' I guess that's about what I'd expect. The bulk of the time in most repos will go to fscking the actual blobs, I'd think. But hitting each tree twice really is noticeable. > Seeing only a single CPU core being stressed for that long is a bit sad > to see. Checking individual objects should be relatively easy to > parallelize, shouldn't it? Yes. The fsck code is pretty old, and uses a very simple way of walking over all of the packs. index-pack (which backs verify-pack these days) is much smarter, and runs in parallel. It still takes a lock when doing the actual fsck checks, but most of the time goes to the zlib inflation and delta reconstruction. There's some discussion in: https://public-inbox.org/git/20180816210657.GA9291@sigill.intra.peff.net/ and even some patches elsewhere in the thread here: https://public-inbox.org/git/20180902075528.GC18787@sigill.intra.peff.net/ and here: https://public-inbox.org/git/20180902085503.GA25391@sigill.intra.peff.net/ I think the big show-stopper there is how ugly it is to run the pack verification in a separate process (and I suspect it is not just ugly from a code point of view, but actively breaks index-pack because it then relies on the set of objects seen during the first phase to do its connectivity check). So there would probably need to be some lib-ification work on index-pack first, so that we could call it (at least in verification mode) multiple times from inside fsck. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 8:13 ` Jeff King 2019-05-25 13:26 ` René Scharfe @ 2019-05-27 5:11 ` Keegan Carruthers-Smith 1 sibling, 0 replies; 21+ messages in thread From: Keegan Carruthers-Smith @ 2019-05-27 5:11 UTC (permalink / raw) To: Jeff King; +Cc: git On Fri, 24 May 2019 at 10:13, Jeff King <peff@peff.net> wrote: > It's possible Git is doing the wrong thing on the writing side, but > given that newer versions of bsdtar handle it fine, I'd guess that the > old one simply had problems consuming poorly formed symlink filenames. I agree that the reader should be able to handle it best-effort. Especially given both bsdtar and gnutar seem to not choke on it. I mentioned go's archive/tar chokes on it, so reported that issue https://github.com/golang/go/issues/32264 -Keegan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-24 7:06 ` Jeff King 2019-05-24 7:35 ` Keegan Carruthers-Smith @ 2019-05-25 20:46 ` Ævar Arnfjörð Bjarmason 2019-05-25 21:19 ` brian m. carlson 1 sibling, 1 reply; 21+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-25 20:46 UTC (permalink / raw) To: Jeff King; +Cc: Keegan Carruthers-Smith, git On Fri, May 24 2019, Jeff King wrote: > On Fri, May 24, 2019 at 08:45:23AM +0200, Keegan Carruthers-Smith wrote: > >> git archive can generate a malformed tar archive. bsdtar reports the >> error "tar: Ignoring malformed pax extended attribute" when reading >> the archive. Go's "tar/archive" package also reports the error >> "archive/tar: invalid tar header". However, BusyBox's tar does not >> report the error (unsure if it just has less error logging). >> >> I can reproduce this when generating the tar on linux and mac. I >> tested this with "git version 2.21.0" and a build of next I did today: >> "git version 2.22.0.rc1.257.g3120a18244" >> >> Reproduction: >> >> $ git clone https://github.com/SSW-SCIENTIFIC/NNDD.git >> $ cd NNDD >> $ git archive --format tar c21b98da2ca7f007230e696b2eda5da6589fe137 >> | tar tf - > /dev/null >> tar: Ignoring malformed pax extended attribute >> tar: Error exit delayed from previous errors. > > I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3 > (from Debian's bsdtar package). What does your "tar --version" say? > > Git does write a pax header with the commit id in it as a comment. > Presumably that's what it's complaining about (but it is not malformed > according to any tar I've tried). If you feed git-archive a tree rather > than a commit, that is omitted. What does: > > git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null > > say? If it doesn't complain, then we know it's indeed the pax comment > field. Solaris tar also complains about this. I've seen that for ages, but never thought to report it, I figured it was well-known. When you "tar xf" an archive git-archive it complains: tar: pax_global_header: typeflag 'g' not recognized, converting to regular file It will then extract the "pax_global_header" as if it were a file at the root of the archive. That file will look like this: $ wc -c x/pax_global_header 52 x/pax_global_header $ cat x/pax_global_header 52 comment=$40_CHAR_SHA_1 Where $40_CHAR_SHA_1 is whatever commit this archive was produced from. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive generates tar with malformed pax extended attribute 2019-05-25 20:46 ` Ævar Arnfjörð Bjarmason @ 2019-05-25 21:19 ` brian m. carlson 0 siblings, 0 replies; 21+ messages in thread From: brian m. carlson @ 2019-05-25 21:19 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Keegan Carruthers-Smith, git [-- Attachment #1: Type: text/plain, Size: 1473 bytes --] On 2019-05-25 at 20:46:16, Ævar Arnfjörð Bjarmason wrote: > Solaris tar also complains about this. I've seen that for ages, but > never thought to report it, I figured it was well-known. > > When you "tar xf" an archive git-archive it complains: > > tar: pax_global_header: typeflag 'g' not recognized, converting to regular file > > It will then extract the "pax_global_header" as if it were a file at the > root of the archive. That file will look like this: > > $ wc -c x/pax_global_header > 52 x/pax_global_header > $ cat x/pax_global_header > 52 comment=$40_CHAR_SHA_1 > > Where $40_CHAR_SHA_1 is whatever commit this archive was produced from. Ironically, for a long time the pax(1) utility shipped with Debian was incapable of reading pax headers. There are various other utilities (including 7-Zip) that don't understand them, and they invariably get converted to regular files (since that's what the spec says to do). This problem is well known to me, but I don't know if it's well known to everyone. In this case, I suspect Solaris has a pax(1) utility that works fine, but nobody bothered to port that code to tar(1). Usually the lack of support is because people prefer GNU tar archives over actual POSIX tar and pax archives, so real-world archivers don't see POSIX archives and don't handle the extensions properly. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 868 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2019-06-04 20:53 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-05-24 6:45 git archive generates tar with malformed pax extended attribute Keegan Carruthers-Smith 2019-05-24 7:06 ` Jeff King 2019-05-24 7:35 ` Keegan Carruthers-Smith 2019-05-24 8:13 ` Jeff King 2019-05-25 13:26 ` René Scharfe 2019-05-25 13:46 ` Andreas Schwab 2019-05-25 21:07 ` Ævar Arnfjörð Bjarmason 2019-05-26 21:33 ` René Scharfe 2019-05-28 5:44 ` Jeff King 2019-05-28 5:58 ` Jeff King 2019-05-28 18:01 ` René Scharfe 2019-05-28 19:08 ` Jeff King 2019-05-28 23:34 ` René Scharfe 2019-05-29 1:17 ` Jeff King 2019-05-29 17:54 ` René Scharfe 2019-05-30 11:55 ` Jeff King 2019-06-02 16:58 ` René Scharfe 2019-06-04 20:53 ` Jeff King 2019-05-27 5:11 ` Keegan Carruthers-Smith 2019-05-25 20:46 ` Ævar Arnfjörð Bjarmason 2019-05-25 21:19 ` brian m. carlson
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).