From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Jeff King <peff@peff.net>,
Keegan Carruthers-Smith <keegan.csmith@gmail.com>,
git@vger.kernel.org
Subject: Re: git archive generates tar with malformed pax extended attribute
Date: Sat, 25 May 2019 23:07:14 +0200
Message-ID: <877eaefdkt.fsf@evledraar.gmail.com>
In-Reply-To: <b4aaff4b-eaf7-9eaf-063f-42c073078060@web.de>
On Sat, May 25 2019, René Scharfe wrote:
> On 24.05.19 at 10:13, Jeff King wrote:
>> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>>
>>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
>>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>>
>>> bsdtar 2.8.3 - libarchive 2.8.3
>>
>> Interesting. I wonder if there was a libarchive bug that was fixed
>> between 2.8.3 and 3.3.3.
>>
>>>> Git does write a pax header with the commit id in it as a comment.
>>>> Presumably that's what it's complaining about (but it is not malformed
>>>> according to any tar I've tried). If you feed git-archive a tree rather
>>>> than a commit, that is omitted. What does:
>>>>
>>>> git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>>
>>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>>> field.
>>>
>>> It also complains
>>>
>>> $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>> tar: Ignoring malformed pax extended attribute
>>> tar: Error exit delayed from previous errors.
>>
>> Ah, OK. So it's not the comment field at all, but some other entry.
>>
>>> Some more context: I work at Sourcegraph.com. We mirror a lot of repos
>>> from github.com. We usually interact with a working copy by running
>>> git archive on it in our infrastructure. This is the first repository
>>> that I have noticed which produces this error. An interesting thing to
>>> note is the commit metadata contains a lot of non-ascii text which was
>>> my guess at what may be tripping up the tar creation.
>>
>> Yeah, though the only thing that makes it into the tarfile is the actual
>> tree entries. I'd imagine the file content is not likely to be a source
>> of problems, as it's common to see binary gunk there. Most of the
>> filenames are pretty mundane, but this symlink destination is a little
>> funny:
>>
>> $ git archive ... | tar tvf - | grep nicovideo4as.swc
>> lrwxrwxrwx root/root 0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>>
>> That's not the full story, though. It is indeed a symlink in the
>> tree:
>>
>> $ git ls-tree -r HEAD libs/nicovideo4as.swc
>> 120000 blob ec3137b5fcaeae25cf67927068af116517683806 libs/nicovideo4as.swc
>>
>> But the contents of that blob, which should be the destination filename,
>> are definitely not:
>>
>> $ git cat-file blob ec3137b5f | wc -c
>> 57804
>> $ git cat-file blob ec3137b5f | xxd | head -1
>> 00000000: 504b 0304 1400 0800 0800 5069 694e 0000 PK........PiiN..
>>
>> There's quite a bit more data there. And what tar showed us goes up to
>> the first NUL, which does not seem surprising.
>
> That (the symlink target) is a ZIP file with the following contents:
>
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
>    54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
> --------          -------  ---                            -------
>    93864            57554  39%                            2 files
>
> And link targets longer than 100 characters are encoded in an extended
> Pax header.
>
> (Usually symlink targets are paths, not file contents.)
>
>> It's possible Git is doing the wrong thing on the writing side, but
>> given that newer versions of bsdtar handle it fine, I'd guess that the
>> old one simply had problems consuming poorly formed symlink filenames.
>
> Git preserves symlink targets with embedded NULs in the repository and
> in generated tar files. Not sure if GNU tar and bsdtar truncating them
> at the first NUL is a bug. I'm also not sure if there is a platform
> that would allow creating such a symlink in the file system, or how one
> is supposed to use it.
>
> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all. We could
> make git fsck report such symlinks.
>
> Can Unicode symlink targets contain NULs? We wouldn't want to damage
> them even if we decide to truncate.
I don't see a practical use for this case, and maybe we should even have
fsck check whether the blob representing a symlink target contains a \0,
as suggested upthread.

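To make the fsck idea concrete, here's a rough sketch in Python. It is
not a patch against fsck and not how fsck is structured; it only shows
the check itself. It assumes the caller has already collected
(mode, path, blob contents) triples, e.g. from "git ls-tree -r" plus
"git cat-file blob". The function name and the input shape are made up
for illustration.

```python
from typing import Iterable, List, Tuple

SYMLINK_MODE = "120000"  # tree-entry mode git uses for symlinks


def symlinks_with_nul(entries: Iterable[Tuple[str, str, bytes]]) -> List[str]:
    """Return paths of symlink entries whose target blob contains a NUL byte."""
    return [
        path
        for mode, path, blob in entries
        if mode == SYMLINK_MODE and b"\0" in blob
    ]


# Hypothetical sample data; only the last entry mirrors the report above.
entries = [
    ("100644", "README", b"binary\0gunk is fine in a regular file"),
    ("120000", "docs/latest", b"v2/index.html"),
    # First bytes of the blob from the report; the rest is left out here.
    ("120000", "libs/nicovideo4as.swc", b"PK\x03\x04\x14\x00\x08\x00"),
]
print(symlinks_with_nul(entries))
```

Regular files are skipped on purpose, since NULs in file content are
perfectly normal; only the symlink case is suspicious.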
That said, the assumption that data in a tar archive will end up on a
file system of some sort doesn't hold. Plenty of consumers of the format
read it in memory and stream its contents to something else entirely,
e.g. taking "git archive --remote" output, parsing it with something
like Archive::Tar [1], and throwing some or all of the content into a
database.

1. https://metacpan.org/pod/Archive::Tar
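As a sketch of what such a consumer looks like, here is one using
Python's standard tarfile module rather than Archive::Tar. It builds a
small pax-format archive in memory (a stand-in for "git archive" output,
not real output from it) and reads the symlink target back as a stream,
never touching a file system. The helper names and the sample paths are
invented for the example.

```python
import io
import tarfile
from typing import BinaryIO, Dict


def make_archive(link_path: str, target: str) -> bytes:
    """Build a pax-format tar holding a single symlink, entirely in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        info = tarfile.TarInfo(name=link_path)
        info.type = tarfile.SYMTYPE
        info.linkname = target
        tf.addfile(info)
    return buf.getvalue()


def symlink_targets(stream: BinaryIO) -> Dict[str, str]:
    """Map each symlink's path to its target, reading the tar sequentially."""
    targets = {}
    # "r|" reads the archive as a non-seekable stream, as from a pipe.
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        for member in tf:
            if member.issym():
                targets[member.name] = member.linkname
    return targets


# 128 characters, so it does not fit the 100-byte link field of the
# plain header and has to go into a pax extended header.
long_target = "some/deep/directory/" * 6 + "file.txt"
data = make_archive("libs/link", long_target)
found = symlink_targets(io.BytesIO(data))
print(found["libs/link"] == long_target)
```

A consumer like this sees whatever target string its tar library hands
it, so what happens to an embedded NUL is up to that library, not to
any file system.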