git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Jeff King <peff@peff.net>,
	Keegan Carruthers-Smith <keegan.csmith@gmail.com>,
	git@vger.kernel.org
Subject: Re: git archive generates tar with malformed pax extended attribute
Date: Sat, 25 May 2019 23:07:14 +0200
Message-ID: <877eaefdkt.fsf@evledraar.gmail.com>
In-Reply-To: <b4aaff4b-eaf7-9eaf-063f-42c073078060@web.de>


On Sat, May 25 2019, René Scharfe wrote:

> On 24.05.19 at 10:13, Jeff King wrote:
>> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>>
>>>> I can't reproduce on Linux with either GNU tar (1.30) or bsdtar 3.3.3
>>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>>
>>> bsdtar 2.8.3 - libarchive 2.8.3
>>
>> Interesting. I wonder if there was a libarchive bug that was fixed
>> between 2.8.3 and 3.3.3.
>>
>>>> Git does write a pax header with the commit id in it as a comment.
>>>> Presumably that's what it's complaining about (but it is not malformed
>>>> according to any tar I've tried). If you feed git-archive a tree rather
>>>> than a commit, that is omitted. What does:
>>>>
>>>>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>>
>>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>>> field.
>>>
>>> It also complains
>>>
>>>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>   tar: Ignoring malformed pax extended attribute
>>>   tar: Error exit delayed from previous errors.
>>
>> Ah, OK. So it's not the comment field at all, but some other entry.
>>
>>> Some more context: I work at Sourcegraph.com. We mirror a lot of repos
>>> from github.com. We usually interact with a working copy by running
>>> git archive on it in our infrastructure. This is the first repository
>>> that I have noticed which produces this error. An interesting thing to
>>> note is the commit metadata contains a lot of non-ascii text which was
>>> my guess at what may be tripping up the tar creation.
>>
>> Yeah, though the only thing that makes it into the tarfile is the actual
>> tree entries. I'd imagine the file content is not likely to be a source
>> of problems, as it's common to see binary gunk there. Most of the
>> filenames are pretty mundane, but this symlink destination is a little
>> funny:
>>
>>   $ git archive ... | tar tvf - | grep nicovideo4as.swc
>>   lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>>
>> That's not the full story, though. It is indeed a symlink in the
>> tree:
>>
>>   $ git ls-tree -r HEAD libs/nicovideo4as.swc
>>   120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc
>>
>> But the contents of that blob, which should be the destination filename,
>> are definitely not:
>>
>>   $ git cat-file blob ec3137b5f | wc -c
>>   57804
>>   $ git cat-file blob ec3137b5f | xxd | head -1
>>   00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..
>>
>> There's quite a bit more data there. And what tar showed us goes up to
>> the first NUL, which does not seem surprising.
>
> That (the symlink target) is a ZIP file with the following contents:
>
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
>    54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
> --------          -------  ---                            -------
>    93864            57554  39%                            2 files
>
> And link targets longer than 100 characters are encoded in an extended
> Pax header.
>
> (Usually symlink targets are paths, not file contents.)
>
>> It's possible Git is doing the wrong thing on the writing side, but
>> given that newer versions of bsdtar handle it fine, I'd guess that the
>> old one simply had problems consuming poorly formed symlink filenames.
>
> Git preserves symlink targets with embedded NULs in the repository and
> in generated tar files.  Not sure if GNU tar and bsdtar truncating them
> at the first NUL is a bug.  I'm also not sure if there is a platform
> that would allow creating such a symlink in the file system, or how one
> is supposed to use it.
>
> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all.  We could
> make git fsck report such symlinks.
>
> Can Unicode symlink targets contain NULs?  We wouldn't want to damage
> them even if we decide to truncate.

I don't see a practical use for this case, and maybe we should even have
fsck check whether the blob representing a symlink target contains a \0,
as suggested upthread.
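
To make the idea concrete, here is a rough sketch of such a check done
outside of git. It is not how fsck is implemented; it only parses
"git ls-tree -r -z" output and looks for a NUL byte in the blobs of
symlink entries (mode 120000). The function names are made up for
illustration.

```python
import subprocess

SYMLINK_MODE = "120000"  # tree entry mode git uses for symlinks


def parse_ls_tree_z(raw):
    """Parse `git ls-tree -r -z` output into (mode, type, oid, path) tuples."""
    entries = []
    for record in raw.split(b"\0"):
        if not record:
            continue
        # Each record is "<mode> SP <type> SP <oid> TAB <path>".
        meta, path = record.split(b"\t", 1)
        mode, objtype, oid = meta.split(b" ")
        entries.append((mode.decode(), objtype.decode(), oid.decode(), path))
    return entries


def git_read_blob(oid):
    """Return the raw blob contents by shelling out to `git cat-file blob`."""
    result = subprocess.run(
        ["git", "cat-file", "blob", oid], check=True, capture_output=True
    )
    return result.stdout


def symlinks_with_nul(entries, read_blob=git_read_blob):
    """Yield paths of symlink entries whose target blob contains a NUL byte."""
    for mode, objtype, oid, path in entries:
        if mode == SYMLINK_MODE and objtype == "blob" and b"\0" in read_blob(oid):
            yield path
```

In a repository one would feed it the stdout of "git ls-tree -r -z HEAD";
the read_blob parameter is only there so the logic can be exercised
without a repository.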

That being said, the assumption that data in a tar archive will get
written to a file system of some sort doesn't always hold. There are
plenty of consumers of the format that read it in memory and stream its
contents out to something else entirely, e.g. taking "git archive
--remote" output, parsing it with something like [1] and throwing
some/all of the content into a database.

1. https://metacpan.org/pod/Archive::Tar
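
As an illustration of that kind of consumer, here is a minimal sketch
using Python's tarfile module in place of Archive::Tar. The archive is
built in memory as a stand-in for "git archive" output, and a plain dict
stands in for the database; both are assumptions made for the example.
Nothing is ever written to a file system, so the symlink target is just
another string to the consumer.

```python
import io
import tarfile


def build_sample_archive():
    """Build a small pax-format tar in memory (stand-in for `git archive` output)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        data = b"hello\n"
        regular = tarfile.TarInfo("README")
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        # A symlink whose target is longer than the 100-byte ustar field.
        link = tarfile.TarInfo("libs/long-link")
        link.type = tarfile.SYMTYPE
        link.linkname = "x" * 150
        tar.addfile(link)
    return buf.getvalue()


def load_into_store(archive):
    """Read a tar archive from bytes and collect its entries into a dict."""
    store = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        for member in tar:
            if member.issym():
                # The target is recorded as data; no symlink is created.
                store[member.name] = ("symlink", member.linkname)
            elif member.isfile():
                store[member.name] = ("file", tar.extractfile(member).read())
    return store
```

Calling load_into_store(build_sample_archive()) returns the README
contents and the full 150-character link target without touching disk.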
