* git archive --format zip utf-8 issues
@ 2012-08-10 21:58 Sven Strickroth
2012-08-10 22:47 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-10 21:58 UTC (permalink / raw
To: git
Hi,
when I create a git repository, add a file containing utf-8 characters
or umlauts (like öäü.txt), commit and then export the HEAD revision to a
zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
contains incorrect filenames:
$ unzip -l 1.zip
Archive: 1.zip
4490a6dab1df5404f91ab3eb871f133154bff0bf
Length Date Time Name
--------- ---------- ----- ----
6 2012-08-10 23:41 +?+?++.txt
--------- -------
6 1 file
--
Best regards,
Sven Strickroth
PGP key id F5A9D4C4 @ any key-server
* Re: git archive --format zip utf-8 issues
2012-08-10 21:58 git archive --format zip utf-8 issues Sven Strickroth
@ 2012-08-10 22:47 ` Junio C Hamano
2012-08-10 23:53 ` Sven Strickroth
2012-08-11 20:53 ` René Scharfe
0 siblings, 2 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-10 22:47 UTC (permalink / raw
To: Sven Strickroth; +Cc: git, René Scharfe
Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:
> when I create a git repository, add a file containing utf-8 characters
> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
> contains incorrect filenames:
My reading of archive-zip.c seems to suggest that we write out
whatever pathname you have in the tree, so a pathname encoded in
UTF-8 will be literally written out in the resulting zip archive.
Do you know in what encoding the pathnames are _expected_ to be
stored in zip archives? Random documentation seems to suggest that
there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
says:
There is no official file name encoding for ZIP files. If you
have unicode file names, you must convert them to byte strings
in your desired encoding before passing them to write(). WinZip
interprets all file names as encoded in CP437, also known as DOS
Latin.
which may explain it.
It may not be a bad idea for "git archive --format=zip" to
(1) check if pathname is a correct UTF-8; and
(2) check if it can be reencoded to latin-1
and if (and only if) both are true, automatically re-encode the path
to latin-1.
Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
would be the most generic way to do this.
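The two checks above could be combined in a single pass. The following is only a sketch under that assumption; utf8_to_latin1() is a hypothetical helper, not an existing git function, and it rejects anything that is either not valid UTF-8 or outside the Latin-1 range:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): convert a NUL-terminated UTF-8
 * string to Latin-1.  Returns the number of bytes written, not counting
 * the terminating NUL, or -1 if the input is not valid UTF-8, contains a
 * code point above U+00FF, or does not fit into the output buffer.
 */
static int utf8_to_latin1(const char *in, char *out, size_t outlen)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t n = 0;

	if (outlen == 0)
		return -1;
	while (*p) {
		unsigned int cp;

		if (*p < 0x80) {
			cp = *p++;
		} else if ((*p == 0xc2 || *p == 0xc3) &&
			   (p[1] & 0xc0) == 0x80) {
			/* U+0080..U+00FF are the only sequences Latin-1 can hold */
			cp = ((p[0] & 0x1fu) << 6) | (p[1] & 0x3fu);
			p += 2;
		} else {
			return -1; /* invalid UTF-8 or not representable */
		}
		if (n + 1 >= outlen)
			return -1; /* no room for this byte plus the NUL */
		out[n++] = (char)cp;
	}
	out[n] = '\0';
	return (int)n;
}
```

A caller would fall back to writing the path unchanged whenever the helper returns -1, which is the "if (and only if) both are true" part.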
* Re: git archive --format zip utf-8 issues
2012-08-10 22:47 ` Junio C Hamano
@ 2012-08-10 23:53 ` Sven Strickroth
2012-08-11 20:53 ` René Scharfe
2012-08-11 20:53 ` René Scharfe
1 sibling, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-10 23:53 UTC (permalink / raw
To: git; +Cc: Junio C Hamano, René Scharfe
On 11.08.2012 00:47, Junio C Hamano wrote:
> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?
Re-encoding to Latin-1 does not always work and may completely break
double-byte scripts (e.g. Chinese or Japanese).
PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> A. Local file header:
> general purpose bit flag: (2 bytes)
> Bit 11: Language encoding flag (EFS). If this bit is
> set, the filename and comment fields for this file
> must be encoded using UTF-8. (see APPENDIX D)
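To see whether an archiver actually sets this bit, one can look at the first bytes of an entry. A minimal sketch, assuming the usual local file header layout (4-byte signature, 2-byte "version needed", then the 2-byte general purpose flag, all little-endian); the function and macro names are made up for illustration:

```c
#include <stddef.h>

#define ZIP_LOCAL_MAGIC 0x04034b50UL
#define ZIP_FLAG_UTF8   (1u << 11)

static unsigned int get_le16(const unsigned char *p)
{
	return (unsigned int)p[0] | ((unsigned int)p[1] << 8);
}

static unsigned long get_le32(const unsigned char *p)
{
	return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
	       ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
}

/*
 * Returns 1 if the language encoding flag (bit 11) is set, 0 if it is
 * clear, and -1 if the buffer does not start with a local file header.
 */
static int local_header_has_utf8_flag(const unsigned char *hdr, size_t len)
{
	if (len < 8 || get_le32(hdr) != ZIP_LOCAL_MAGIC)
		return -1;
	/* the general purpose bit flag lives at offset 6 */
	return (get_le16(hdr + 6) & ZIP_FLAG_UTF8) ? 1 : 0;
}
```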
--
Best regards,
Sven Strickroth
PGP key id F5A9D4C4 @ any key-server
* Re: git archive --format zip utf-8 issues
2012-08-10 22:47 ` Junio C Hamano
2012-08-10 23:53 ` Sven Strickroth
@ 2012-08-11 20:53 ` René Scharfe
2012-08-11 21:37 ` Sven Strickroth
2012-08-12 4:27 ` git archive --format zip utf-8 issues Junio C Hamano
1 sibling, 2 replies; 21+ messages in thread
From: René Scharfe @ 2012-08-11 20:53 UTC (permalink / raw
To: Junio C Hamano; +Cc: Sven Strickroth, git
On 11.08.2012 00:47, Junio C Hamano wrote:
> Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:
>
>> when I create a git repository, add a file containing utf-8 characters
>> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
>> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
>> contains incorrect filenames:
>
> My reading of archive-zip.c seems to suggest that we write out
> whatever pathname you have in the tree, so a pathname encoded in
> UTF-8 will be literally written out in the resulting zip archive.
Sorry for my imperialistic attitude of "ASCII filenames should be enough
for everybody". Laziness..
> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives? Random documentation seems to suggest that
> there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
> says:
>
> There is no official file name encoding for ZIP files. If you
> have unicode file names, you must convert them to byte strings
> in your desired encoding before passing them to write(). WinZip
> interprets all file names as encoded in CP437, also known as DOS
> Latin.
>
> which may explain it.
http://www.pkware.com/documents/casestudies/APPNOTE.TXT is the standard
document, as Sven noted, and it says that filenames are encoded in code
page 437, or optionally UTF-8 (a later addition). Discussions like
http://stackoverflow.com/questions/106367/ seem to indicate that at
least some archivers use the local code page as well.
> It may not be a bad idea for "git archive --format=zip" to
>
> (1) check if pathname is a correct UTF-8; and
> (2) check if it can be reencoded to latin-1
>
> and if (and only if) both are true, automatically re-encode the path
> to latin-1.
The standard says we need to convert to CP437, or to UTF-8, or provide
both versions. A more interesting question is: What's supported by which
programs?
The ZIP functionality built into Windows 7 doesn't seem to work with
UTF-8 encoded filenames (except for those that only use the ASCII
subset), and to ignore the UTF-8 part if both are given. Handling
umlauts should be possible anyway, because they are on code page 437,
but for other characters we'd have to aim for compatibility with other
programs like Info-ZIP and 7-Zip.
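For illustration, this is roughly what a CP437 mapping for the German umlauts would look like. It is a deliberately partial sketch; cp437_from_codepoint() is a hypothetical name and a real implementation would need the full code page table:

```c
/*
 * Hypothetical, partial mapping from Unicode code points to code page
 * 437.  Only ASCII and the German umlauts plus sharp s are covered;
 * everything else yields -1.
 */
static int cp437_from_codepoint(unsigned int cp)
{
	if (cp < 0x80)
		return (int)cp; /* ASCII is identical in CP437 */
	switch (cp) {
	case 0x00e4: return 0x84; /* a with diaeresis */
	case 0x00f6: return 0x94; /* o with diaeresis */
	case 0x00fc: return 0x81; /* u with diaeresis */
	case 0x00c4: return 0x8e; /* A with diaeresis */
	case 0x00d6: return 0x99; /* O with diaeresis */
	case 0x00dc: return 0x9a; /* U with diaeresis */
	case 0x00df: return 0xe1; /* sharp s */
	default:     return -1;   /* not covered by this sketch */
	}
}
```

Characters such as Cyrillic letters have no CP437 equivalent at all, which is why this route cannot be the whole answer.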
How do we know which encoding was used for a filename?
> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
> would be the most generic way to do this.
I really hope we can make do without additional options.
René
* Re: git archive --format zip utf-8 issues
2012-08-10 23:53 ` Sven Strickroth
@ 2012-08-11 20:53 ` René Scharfe
2012-08-12 4:08 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-08-11 20:53 UTC (permalink / raw
To: Sven Strickroth; +Cc: git, Junio C Hamano
On 11.08.2012 01:53, Sven Strickroth wrote:
> On 11.08.2012 00:47, Junio C Hamano wrote:
>> Do you know in what encoding the pathnames are _expected_ to be
>> stored in zip archives?
>
> Re-encoding to Latin-1 does not always work and may completely break
> double-byte scripts (e.g. Chinese or Japanese).
>
> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>> A. Local file header:
>> general purpose bit flag: (2 bytes)
>> Bit 11: Language encoding flag (EFS). If this bit is
>> set, the filename and comment fields for this file
>> must be encoded using UTF-8. (see APPENDIX D)
Yes, that's one of the two methods for supporting UTF-8 filenames
described there.
The other method involves writing extra ZIP header fields and was
invented by Info-ZIP. They don't use it consistently anymore, though
(from zip -h2):
"Zip now stores UTF-8 in entry path and comment fields on systems
where UTF-8 char set is default, such as most modern Unix, and
and on other systems in new extra fields with escaped versions in
entry path and comment fields for backward compatibility."
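As far as I understand the APPNOTE description of that extra field (the "Info-ZIP Unicode Path Extra Field", header ID 0x7075), it consists of a version byte, the CRC-32 of the name stored in the regular header, and the UTF-8 name. A sketch of building it, with made-up function names and a slow bitwise CRC so that it has no dependencies:

```c
#include <stddef.h>
#include <string.h>

/* Plain bitwise CRC-32 (the polynomial used by ZIP), for illustration. */
static unsigned long crc32_bytes(const unsigned char *buf, size_t len)
{
	unsigned long crc = 0xffffffffUL;
	size_t i;
	int k;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (k = 0; k < 8; k++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xedb88320UL : crc >> 1;
	}
	return (crc ^ 0xffffffffUL) & 0xffffffffUL;
}

static void put_le16(unsigned char *p, unsigned int v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
}

static void put_le32(unsigned char *p, unsigned long v)
{
	put_le16(p, (unsigned int)(v & 0xffff));
	put_le16(p + 2, (unsigned int)((v >> 16) & 0xffff));
}

/*
 * Hypothetical helper: write a Unicode Path extra field for an entry
 * whose regular header carries legacy_name.  Returns the number of
 * bytes written, or 0 if the field does not fit.
 */
static size_t build_unicode_path_field(const char *legacy_name,
				       const char *utf8_name,
				       unsigned char *out, size_t outlen)
{
	size_t namelen = strlen(utf8_name);
	size_t tsize = 1 + 4 + namelen; /* version + CRC + name */

	if (tsize > 0xffff || outlen < 4 + tsize)
		return 0;
	put_le16(out, 0x7075);               /* header ID */
	put_le16(out + 2, (unsigned int)tsize);
	out[4] = 1;                          /* field version */
	put_le32(out + 5, crc32_bytes((const unsigned char *)legacy_name,
				      strlen(legacy_name)));
	memcpy(out + 9, utf8_name, namelen); /* not NUL-terminated */
	return 4 + tsize;
}
```

The CRC lets an extractor notice that the regular name was changed by a tool that did not know about the extra field.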
René
* Re: git archive --format zip utf-8 issues
2012-08-11 20:53 ` René Scharfe
@ 2012-08-11 21:37 ` Sven Strickroth
2012-08-30 22:26 ` Jeff King
2012-08-12 4:27 ` git archive --format zip utf-8 issues Junio C Hamano
1 sibling, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-11 21:37 UTC (permalink / raw
To: git; +Cc: René Scharfe, Junio C Hamano
On 11.08.2012 22:53, René Scharfe wrote:
> The standard says we need to convert to CP437, or to UTF-8, or provide
> both versions. A more interesting question is: What's supported by which
> programs?
>
> The ZIP functionality built into Windows 7 doesn't seem to work with
> UTF-8 encoded filenames (except for those that only use the ASCII
> subset), and to ignore the UTF-8 part if both are given.
I played a bit with the git source code and found out that
diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..e0ccb4f 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
 	copy_le16(dirent.creator_version,
 		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
 	copy_le16(dirent.version, 10);
-	copy_le16(dirent.flags, flags);
+	copy_le16(dirent.flags, flags+2048);
 	copy_le16(dirent.compression_method, method);
 	copy_le16(dirent.mtime, zip_time);
 	copy_le16(dirent.mdate, zip_date);
--
works with 7-Zip, but not with the Windows 7 built-in zip.
If I create a zip file with 7-Zip which contains umlauts and other
Unicode characters (like 國立1-кккк.txt), the Windows 7 built-in zip
displays them correctly, too.
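A side note on the experiment above: 2048 is 1 << 11, i.e. exactly the language encoding flag. Adding it works as long as the bit is not already set; OR-ing it in is the safer way to express the same thing. A small sketch with illustrative names:

```c
#define ZIP_STREAM (1u << 3)  /* sizes follow in a data descriptor */
#define ZIP_UTF8   (1u << 11) /* language encoding flag, == 2048 */

/*
 * Illustrative helper: mark an entry's general purpose flags as
 * "name is UTF-8".  Using |= is idempotent, unlike flags + 2048,
 * which would carry into bit 12 if the flag were already set.
 */
static unsigned int zip_flags_with_utf8(unsigned int flags)
{
	return flags | ZIP_UTF8;
}
```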
--
Best regards,
Sven Strickroth
PGP key id F5A9D4C4 @ any key-server
* Re: git archive --format zip utf-8 issues
2012-08-11 20:53 ` René Scharfe
@ 2012-08-12 4:08 ` Junio C Hamano
0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-12 4:08 UTC (permalink / raw
To: René Scharfe; +Cc: Sven Strickroth, git
René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
>> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>>> A. Local file header:
>>> general purpose bit flag: (2 bytes)
>>> Bit 11: Language encoding flag (EFS). If this bit is
>>> set, the filename and comment fields for this file
>>> must be encoded using UTF-8. (see APPENDIX D)
>
> Yes, that's one of the two methods for supporting UTF-8 filenames
> described there.
>
> The other method involves writing extra ZIP header fields and was
> invented by Info-ZIP. They don't use it consistently anymore, though
> (from zip -h2):
>
> "Zip now stores UTF-8 in entry path and comment fields on systems
> where UTF-8 char set is default, such as most modern Unix, and
> and on other systems in new extra fields with escaped versions in
> entry path and comment fields for backward compatibility."
Thanks; so if we adopt one of these methods, the readers that matter
will be happy? And if so, which one? Or both?
* Re: git archive --format zip utf-8 issues
2012-08-11 20:53 ` René Scharfe
2012-08-11 21:37 ` Sven Strickroth
@ 2012-08-12 4:27 ` Junio C Hamano
1 sibling, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-12 4:27 UTC (permalink / raw
To: René Scharfe; +Cc: Sven Strickroth, git
René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> ... A more interesting question is: What's supported by
> which programs?
Yes, that is the most interesting question.
>> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
>> would be the most generic way to do this.
>
> I really hope we can make do without additional options.
We need to at least know the path encoding used in the tree objects,
and I'd be OK with a solution that assumes a single encoding is used
for the entire tree.
We would eventually need to also know the encoding used on the local
working tree (i.e. in what encoding paths are returned from
readdir() and the pathspec the user gives us from the command line),
and iconv it to the tree objects encoding for the project when
creating a cache_entry object to be fed to add_to_index(), and iconv
it back from the tree objects encoding to the working tree encoding
in write_entry(), but that is a longer term direction. For now, in
order to address the immediate issue, we only need the tree object
encoding, which should default to UTF-8 for interoperability.
So "git archive --format=zip --in-object-path-encoding=big5" for a
project whose tree object pathnames are in that encoding (and we
always record paths in UTF-8 when writing zipfiles) should be the
minimal that we need for now.
Optionally, with a configuration variable i18n.inObjectPathEncoding
(as opposed to the eventual i18n.worktreePathEncoding) set to big5,
users of such a project can say "git archive --format=zip" without
the "--in-object-path-encoding" option.
Considering that zip is a format meant for exchange, I'd think we
would be fine to always write in UTF-8 and leaving the readers
responsible for converting the pathname while extracting. If a
major zip extractor is incapable of handling UTF-8 (or even if
capable it is cumbersome, for that matter), we may end up having to
add "--in-archive-path-encoding=UTF-8" option to "git archive", with
associated "zip.archivePathEncoding" variable, though.
* Re: git archive --format zip utf-8 issues
2012-08-11 21:37 ` Sven Strickroth
@ 2012-08-30 22:26 ` Jeff King
2012-09-04 20:23 ` René Scharfe
0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2012-08-30 22:26 UTC (permalink / raw
To: Sven Strickroth; +Cc: git, René Scharfe, Junio C Hamano
On Sat, Aug 11, 2012 at 11:37:05PM +0200, Sven Strickroth wrote:
> On 11.08.2012 22:53, René Scharfe wrote:
> > The standard says we need to convert to CP437, or to UTF-8, or provide
> > both versions. A more interesting question is: What's supported by which
> > programs?
> >
> > The ZIP functionality built into Windows 7 doesn't seem to work with
> > UTF-8 encoded filenames (except for those that only use the ASCII
> > subset), and to ignore the UTF-8 part if both are given.
>
> I played a bit with the git source code and found out that
>
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..e0ccb4f 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	copy_le16(dirent.creator_version,
>  		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
>  	copy_le16(dirent.version, 10);
> -	copy_le16(dirent.flags, flags);
> +	copy_le16(dirent.flags, flags+2048);
>  	copy_le16(dirent.compression_method, method);
>  	copy_le16(dirent.mtime, zip_time);
>  	copy_le16(dirent.mdate, zip_date);
> --
> works with 7-Zip, but not with the Windows 7 built-in zip.
>
> If I create a zip file with 7-Zip which contains umlauts and other
> Unicode characters (like 國立1-кккк.txt), the Windows 7 built-in zip
> displays them correctly, too.
Ping on this stalled discussion.
It seems like there are two separate issues here:
1. Knowing the encoding of pathnames in the repository.
2. Setting the right flags in zip output.
A full solution would handle both parts, but let's ignore (1) for a
moment, and assume we have utf-8 (or can massage into utf-8 from an
encoding specified by the user).
It seems like just setting the magic utf-8 flag would be the only thing
we need to do, according to the standard. But according to discussions
referenced elsewhere in this thread, that flag was invented only in
2007, so we may be dealing with older implementations (I have no idea
how common they would be; that may be the problem with Windows 7's zip
you are seeing). We could re-encode to cp437, which the standard
specifies, but apparently some implementations do not respect that
(and use a local code page instead). And it cannot represent all utf-8
characters, anyway.
It sounds like 7-zip has figured out a more portable solution. Can you
show us a sample of 7-zip's output with utf-8 characters to compare to
what git generates? I wonder if it is using a combination of methods.
-Peff
* Re: git archive --format zip utf-8 issues
2012-08-30 22:26 ` Jeff King
@ 2012-09-04 20:23 ` René Scharfe
2012-09-04 21:03 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-04 20:23 UTC (permalink / raw
To: Jeff King; +Cc: Sven Strickroth, git, Junio C Hamano
On 31.08.2012 00:26, Jeff King wrote:
> Ping on this stalled discussion.
Sorry, I got distracted by other stuff again. I did some experiments,
though, and here's a preliminary result.
> It seems like there are two separate issues here:
>
> 1. Knowing the encoding of pathnames in the repository.
>
> 2. Setting the right flags in zip output.
>
> A full solution would handle both parts, but let's ignore (1) for a
> moment, and assume we have utf-8 (or can massage into utf-8 from an
> encoding specified by the user).
Yes, good thinking. Re-encoding may be beneficial for tar files as well,
but we can ignore that point for the moment.
> It seems like just setting the magic utf-8 flag would be the only thing
> we need to do, according to the standard. But according to discussions
> referenced elsewhere in this thread, that flag was invented only in
> 2007, so we may be dealing with older implementations (I have no idea
> how common they would be; that may be the problem with Windows 7's zip
> you are seeing). We could re-encode to cp437, which the standard
> specifies, but apparently some implementations do not respect that
> (and use a local code page instead). And it cannot represent all utf-8
> characters, anyway.
Yes, we could do that, plus adding an extra field with a UTF-8 version of
the path. That's the legacy method invented by Info-ZIP. They switched
to using the new flag on Linux at least, though.
> It sounds like 7-zip has figured out a more portable solution. Can you
> show us a sample of 7-zip's output with utf-8 characters to compare to
> what git generates? I wonder if it is using a combination of methods.
I'm not so sure they produce more portable files. I created an archive
with files named jaя.txt, smørrebrød.txt, süd.txt and €uro.txt with 7-Zip
on Windows 7, and while unzip on Ubuntu 12.04 managed to recreate the
Cyrillic character and the Euro symbol, it mangled the slashed o and the
umlaut.
With the following patch I could create archives with git on Linux and
msysgit and extract them flawlessly on Windows with 7-Zip and with
Info-ZIP unzip on Linux, but not with unzip on Windows, where it mangled
all non-ASCII characters.
This gets confusing; it would help to have a compatibility matrix for all
interesting extractors and character classes -- for each proposed solution
or archiver we'd like to imitate.
But now for the patch, which is a bit confusing as well. I'm curious to
hear about results for more platforms, extractors and character classes.
Based on that we can see if we need to generate the extra fields instead
of relying on the new flag.
-- >8 --
Subject: [PATCH] archive-zip: support UTF-8 paths
Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters. We assume that all paths are given as UTF-8; no
conversion is done.
The flag seems to be ignored by unzip unless we also mark the archive
entry as coming from a Unix system. This is done by setting the field
creator_version ("version made by" in the standard[1]) to 0x03NN.
The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths. We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.
But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.
[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
archive-zip.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..928da1d 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,8 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "commit.h"
+#include "utf8.h"
 
 static int zip_date;
 static int zip_time;
@@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;
 
 #define ZIP_DIRECTORY_MIN_SIZE (1024 * 1024)
-#define ZIP_STREAM (8)
+#define ZIP_STREAM (1 << 3)
+#define ZIP_UTF8 (1 << 11)
 
 struct zip_local_header {
 	unsigned char magic[4];
@@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
 
 	crc = crc32(0, NULL, 0);
 
+	if (has_non_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
@@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);
 
 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;
 
 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}
 
 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
1.7.11.msysgit.1.1.gbf71771
* Re: git archive --format zip utf-8 issues
2012-09-04 20:23 ` René Scharfe
@ 2012-09-04 21:03 ` Junio C Hamano
2012-09-05 19:36 ` René Scharfe
0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2012-09-04 21:03 UTC (permalink / raw
To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git
René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> But now for the patch, which is a bit confusing as well. I'm curious to
> hear about results for more platforms, extractors and character classes.
> Based on that we can see if we need to generate the extra fields instead
> of relying on the new flag.
Thanks for keeping the ball rolling.
> Subject: [PATCH] archive-zip: support UTF-8 paths
>
> Set general purpose flag 11 if we encounter a path that contains
> non-ASCII characters. We assume that all paths are given as UTF-8; no
> conversion is done.
>
> The flag seems to be ignored by unzip unless we also mark the archive
> entry as coming from a Unix system. This is done by setting the field
> creator_version ("version made by" in the standard[1]) to 0x03NN.
>
> The NN part represents the version of the standard supported by us, and
> this patch sets it to 3f (for version 6.3) for Unix paths. We keep
> creator_version set to 0 (FAT filesystem, standard version 0) in the
> non-special cases, as before.
>
> But when we declare a file to have a Unix path, then we have to set the
> file mode as well, or unzip will extract the files with the permission
> set 0000, i.e. inaccessible by all.
>
> [1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>
> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
> ---
> archive-zip.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..928da1d 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -4,6 +4,8 @@
> #include "cache.h"
> #include "archive.h"
> #include "streaming.h"
> +#include "commit.h"
> +#include "utf8.h"
>
> static int zip_date;
> static int zip_time;
> @@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
> static unsigned int zip_dir_entries;
>
> #define ZIP_DIRECTORY_MIN_SIZE (1024 * 1024)
> -#define ZIP_STREAM (8)
> +#define ZIP_STREAM (1 << 3)
> +#define ZIP_UTF8 (1 << 11)
>
> struct zip_local_header {
> unsigned char magic[4];
> @@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
> {
> struct zip_local_header header;
> struct zip_dir_header dirent;
> - unsigned long attr2;
> + unsigned int creator_version = 0;
> + unsigned long attr2 = 0;
> unsigned long compressed_size;
> unsigned long crc;
> unsigned long direntsize;
> @@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
>
> crc = crc32(0, NULL, 0);
>
> + if (has_non_ascii(path)) {
Do we want to treat \033 as "ascii" in this codepath? The function
primarily is used by the log formatter to see if we need 8-bit CTE
when writing out in the e-mail format.
> + if (is_utf8(path))
> + flags |= ZIP_UTF8;
> + else
> + warning("Path is not valid UTF-8: %s", path);
> + }
> +
> if (pathlen > 0xffff) {
> return error("path too long (%d chars, SHA1: %s): %s",
> (int)pathlen, sha1_to_hex(sha1), path);
> @@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
> enum object_type type = sha1_object_info(sha1, &size);
>
> method = 0;
> - attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
> - (mode & 0111) ? ((mode) << 16) : 0;
> if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
> method = 8;
> + if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
> + creator_version = 0x033f;
> + attr2 = mode;
> + if (S_ISLNK(mode))
> + attr2 |= 0777;
> + attr2 <<= 16;
> + }
> compressed_size = size;
>
> if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
> @@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
> }
>
> copy_le32(dirent.magic, 0x02014b50);
> - copy_le16(dirent.creator_version,
> - S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
> + copy_le16(dirent.creator_version, creator_version);
> copy_le16(dirent.version, 10);
> copy_le16(dirent.flags, flags);
> copy_le16(dirent.compression_method, method);
* Re: git archive --format zip utf-8 issues
2012-09-04 21:03 ` Junio C Hamano
@ 2012-09-05 19:36 ` René Scharfe
2012-09-18 19:40 ` René Scharfe
0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-05 19:36 UTC (permalink / raw
To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git
On 04.09.2012 23:03, Junio C Hamano wrote:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>> + if (has_non_ascii(path)) {
>
> Do we want to treat \033 as "ascii" in this codepath? The function
> primarily is used by the log formatter to see if we need 8-bit CTE
> when writing out in the e-mail format.
Argh, yes, I'd think so. The function name misled me. This won't
matter for compatibility testing, but should be corrected before inclusion.
Just checked: unzip strips the ESC character when extracting (whether
the UTF-8 flag is set or not) and 7-Zip replaces it with an underscore.
The built-in ZIP extractor of Windows 7 skips such a file; it doesn't
even show up in archive directory listings.
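What the ZIP code path needs is arguably just "does any byte have the high bit set", without the special treatment of ESC mentioned above. A minimal sketch of such a check; path_has_high_bit() is a hypothetical name, not an existing git function:

```c
/*
 * Hypothetical helper: return 1 if the path contains any byte outside
 * the 7-bit ASCII range, 0 otherwise.  Control characters such as ESC
 * (\033) count as ASCII here.
 */
static int path_has_high_bit(const char *path)
{
	const unsigned char *p = (const unsigned char *)path;

	for (; *p; p++)
		if (*p & 0x80)
			return 1;
	return 0;
}
```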
René
* Re: git archive --format zip utf-8 issues
2012-09-05 19:36 ` René Scharfe
@ 2012-09-18 19:40 ` René Scharfe
2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
` (4 more replies)
0 siblings, 5 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:40 UTC (permalink / raw
To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git
[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]
Hello again,
so two weeks have passed, and I've moved at a glacial pace towards a
method for measuring the compatibility of our generated ZIP files. Sorry,
I just keep getting distracted.
Anyway, the idea is to have a bunch of files with names using different
scripts, zip them with several packers (including git archive), unzip
them and compare the result with the original files.
As test corpus I used files named like the pangrams on this UTF-8
sampler page, the exact commands are attached:
http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox
The numbers below are how many lines the output of diff -ru contains for
this pair of packer and unpacker. There are 37 files, so the worst
result is 74 lines of difference ("Only in [...]" for both sides), while
0 indicates a perfect score.
Hmm, come to think of it, an empty directory would show up as 37, so
this metric is not ideal. A better one would be to simply give one
point for each correctly unpacked file.
                                             Windows  Info-ZIP unzip
                              7-Zip  PeaZip  builtin  Linux  msysgit  Windows
7-Zip 9.20                        0       0       46     26       43       43
PeaZip 4.7.1 win64                0       0       46     26       42       42
Info-ZIP zip 3.0 Linux            0       0       72      0       43       43
Info-ZIP zip 3.0 Windows         45      45      n/a      0       43       43
git-master                       72      72       72     60       72       72
git-master-patch1                 0       0       72     60       72       72
git-master-patch2                 0       0       72      0       72       72
git-v1.7.11.msysgit.1            72      72       72     60       72       72
git-v1.7.11.msysgit.1-patch1      0       0       72     60       72       72
git-v1.7.11.msysgit.1-patch2      0       0       72      0       72       72
Info-ZIP's programs don't work too well on Windows. The built-in
unzipper of Windows 7 even refuses to open the file created by the
Windows version of zip. Speaking of which, this is the worst of the
unpackers.
With the two patches applied, we can say "use 7-Zip or PeaZip on Windows
and unzip on Linux" and filenames with all tested characters will be
preserved. I was surprised to see this working fine with msysgit like
that, even though no reencoding is introduced by the patches.
I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score
with the Windows-internal unzipper. Umlauts, Nordic characters and
accents are preserved by that combination. It seems that unzip on Linux
fails to unpack exactly these names, so perhaps they employ a dirty
trick like using the local encoding in the ZIP file, which makes it
unportable.
I'll reply with the two patches, which contain basically the same code
as the previous patch, only split up. The second one declares that
filenames with UTF-8 encoding came from Unix (instead of FAT), which
makes unzip happy. This, however, implies that we contain Unix
permissions for these entries, which is a bit ugly.
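To make the "came from Unix" part concrete: the "version made by" field carries the host system in its upper byte (3 for Unix) and the version of the standard in its lower byte (63 for 6.3), and the Unix mode goes into the upper 16 bits of the external attributes. A sketch with illustrative helper names; only the values are taken from the patch:

```c
#define ZIP_HOST_UNIX    3u  /* upper byte of "version made by" */
#define ZIP_SPEC_VERSION 63u /* lower byte: version 6.3 of the standard */

/* Illustrative helper: "version made by" for an entry declared as Unix. */
static unsigned int zip_creator_version_unix(void)
{
	return (ZIP_HOST_UNIX << 8) | ZIP_SPEC_VERSION;
}

/*
 * Illustrative helper: external file attributes for a Unix entry.  The
 * mode (file type and permission bits) is stored in the upper 16 bits.
 */
static unsigned long zip_unix_external_attr(unsigned int mode)
{
	return (unsigned long)mode << 16;
}
```

For a regular file with mode 0100644 this gives external attributes of 0x81a40000; leaving them at 0 is what makes unzip extract files with permissions 0000.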
René
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2536 bytes --]
#!/bin/sh
(
mkdir pangrams
cd pangrams
echo English >"The quick brown fox jumps over the lazy dog"
echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
echo Irish 4 >"Éava agus Áḋaiṁ"
echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
echo German 1 >"Falsches Üben von Xylophonmusik quält"
echo German 2 >"jeden größeren Zwerg"
echo Norwegian >"Blåbærsyltetøy"
echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
echo Finnish >"Törkylempijävongahdus"
echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
echo Bulgarian 1 >"Жълтата дюля беше щастлива"
echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
echo Hungarian >"Árvíztűrő tükörfúrógép"
echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
echo Portuguese 2 >"põe freqüentemente o único médico"
echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
echo French 3 >"d'œufs abîmés"
echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
echo Hiragana 1 >"いろはにほへど ちりぬるを"
echo Hiragana 2 >"わがよたれぞ つねならむ"
echo Hiragana 3 >"うゐのおくやま けふこえて"
echo Hiragana 4 >"あさきゆめみじ ゑひもせず"
)
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/2] archive-zip: support UTF-8 paths
2012-09-18 19:40 ` René Scharfe
@ 2012-09-18 19:46 ` René Scharfe
2012-09-18 19:53 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
` (3 subsequent siblings)
4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:46 UTC (permalink / raw
Cc: Junio C Hamano, Jeff King, Sven Strickroth, git
Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters. We assume that all paths are given as UTF-8; no
conversion is done.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
Changes from previous version: Stop using has_non_ascii(), which does
slightly too much for our purposes, and split off creator version
change into a separate patch.
archive-zip.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..0f763e8 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,7 @@
#include "cache.h"
#include "archive.h"
#include "streaming.h"
+#include "utf8.h"
static int zip_date;
static int zip_time;
@@ -16,7 +17,8 @@ static unsigned int zip_dir_offset;
static unsigned int zip_dir_entries;
#define ZIP_DIRECTORY_MIN_SIZE (1024 * 1024)
-#define ZIP_STREAM (8)
+#define ZIP_STREAM (1 << 3)
+#define ZIP_UTF8 (1 << 11)
struct zip_local_header {
unsigned char magic[4];
@@ -164,6 +166,17 @@ static void set_zip_header_data_desc(struct zip_local_header *header,
copy_le32(header->size, size);
}
+static int has_only_ascii(const char *s)
+{
+ for (;;) {
+ int c = *s++;
+ if (c == '\0')
+ return 1;
+ if (!isascii(c))
+ return 0;
+ }
+}
+
#define STREAM_BUFFER_SIZE (1024 * 16)
static int write_zip_entry(struct archiver_args *args,
@@ -187,6 +200,13 @@ static int write_zip_entry(struct archiver_args *args,
crc = crc32(0, NULL, 0);
+ if (!has_only_ascii(path)) {
+ if (is_utf8(path))
+ flags |= ZIP_UTF8;
+ else
+ warning("Path is not valid UTF-8: %s", path);
+ }
+
if (pathlen > 0xffff) {
return error("path too long (%d chars, SHA1: %s): %s",
(int)pathlen, sha1_to_hex(sha1), path);
--
1.7.12
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/2] archive-zip: declare creator to be Unix for UTF-8 paths
2012-09-18 19:40 ` René Scharfe
2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
@ 2012-09-18 19:53 ` René Scharfe
2012-09-18 20:24 ` git archive --format zip utf-8 issues René Scharfe
` (2 subsequent siblings)
4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:53 UTC (permalink / raw
Cc: Junio C Hamano, Jeff King, Sven Strickroth, git
The UTF-8 flag seems to be ignored by unzip unless we also mark the
archive entry as coming from a Unix system. This is done by setting the
field creator_version ("version made by" in the standard[1]) to 0x03NN.
The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths. We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.
But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.
[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
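To illustrate the 0x03NN layout described above, here is a minimal sketch
(not part of the patch). The helper names make_creator_version() and
make_unix_external_attr() and the ZIP_HOST_* constants are made up for this
example; only the layout follows the description above: host system in the
upper byte, specification version times ten in the lower byte, and the Unix
mode in the upper 16 bits of the external attributes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the "host system" byte of "version made by". */
#define ZIP_HOST_FAT  0
#define ZIP_HOST_UNIX 3

/* Upper byte: host system. Lower byte: spec version as major * 10 + minor. */
static uint16_t make_creator_version(unsigned host, unsigned major, unsigned minor)
{
	assert(host <= 0xff && minor < 10 && major * 10 + minor <= 0xff);
	return (uint16_t)((host << 8) | (major * 10 + minor));
}

/* Unix permission and file type bits go into the upper 16 bits. */
static uint32_t make_unix_external_attr(uint32_t mode)
{
	assert(mode <= 0xffff);
	return mode << 16;
}
```

With these helpers, Unix plus version 6.3 gives 0x033f, and Unix plus
version 2.3 gives 0x0317, the value that was used before this patch.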
---
No sign-off for this, yet. Perhaps there is a better way to convince
unzip to respect the flag? And if not, do we need to offer umask
settings for ZIP as well as we have for tar? And perhaps declare all
files as being from a Unix filesystem, for consistency?
archive-zip.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..e9b3dc9 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -186,7 +186,8 @@ static int write_zip_entry(struct archiver_args *args,
{
struct zip_local_header header;
struct zip_dir_header dirent;
- unsigned long attr2;
+ unsigned int creator_version = 0;
+ unsigned long attr2 = 0;
unsigned long compressed_size;
unsigned long crc;
unsigned long direntsize;
@@ -224,10 +225,15 @@ static int write_zip_entry(struct archiver_args *args,
enum object_type type = sha1_object_info(sha1, &size);
method = 0;
- attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
- (mode & 0111) ? ((mode) << 16) : 0;
if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
method = 8;
+ if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+ creator_version = 0x033f;
+ attr2 = mode;
+ if (S_ISLNK(mode))
+ attr2 |= 0777;
+ attr2 <<= 16;
+ }
compressed_size = size;
if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -274,8 +280,7 @@ static int write_zip_entry(struct archiver_args *args,
}
copy_le32(dirent.magic, 0x02014b50);
- copy_le16(dirent.creator_version,
- S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+ copy_le16(dirent.creator_version, creator_version);
copy_le16(dirent.version, 10);
copy_le16(dirent.flags, flags);
copy_le16(dirent.compression_method, method);
--
1.7.12
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: git archive --format zip utf-8 issues
2012-09-18 19:40 ` René Scharfe
2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
2012-09-18 19:53 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
@ 2012-09-18 20:24 ` René Scharfe
2012-09-18 21:12 ` Junio C Hamano
2012-09-24 15:56 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 20:24 UTC (permalink / raw
Cc: Junio C Hamano, Jeff King, Sven Strickroth, git
[-- Attachment #1: Type: text/plain, Size: 576 bytes --]
Am 18.09.2012 21:40, schrieb René Scharfe:
>                     7-Zip  PeaZip  Windows  Info-ZIP unzip
>                                    builtin  Linux  msysgit  Windows
> git-master-patch1       0       0       72     60       72       72
> git-master-patch2       0       0       72      0       72       72
Oh, and when I wrote Windows, I meant a German Windows 7 Home Premium
x64, and Linux is Ubuntu 12.04 x86.
I've also attached generated ZIP files for the two archivers above, for
easier testing. Let's see if they make it through.
René
[-- Attachment #2: pangrams-git-master-patch2.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]
[-- Attachment #3: pangrams-git-master-patch1.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive --format zip utf-8 issues
2012-09-18 19:40 ` René Scharfe
` (2 preceding siblings ...)
2012-09-18 20:24 ` git archive --format zip utf-8 issues René Scharfe
@ 2012-09-18 21:12 ` Junio C Hamano
2012-09-20 22:00 ` René Scharfe
2012-09-24 15:56 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
4 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2012-09-18 21:12 UTC (permalink / raw
To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git
René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>                           7-Zip  PeaZip  Windows  Info-ZIP unzip
>                                          builtin  Linux  msysgit  Windows
> 7-Zip 9.20                    0       0       46     26       43       43
> PeaZip 4.7.1 win64            0       0       46     26       42       42
> Info-ZIP zip 3.0 Linux        0       0       72      0       43       43
> Info-ZIP zip 3.0 Windows     45      45      n/a      0       43       43
> ...
> I wonder what 7-Zip and PeaZip do that gives them a slightly nicer
> score with the Windows-internal unzipper. Umlauts, Nordic characters
> and accents are preserved by that combination. It seems that unzip on
> Linux fails to unpack exactly these names, so perhaps they employ a
> dirty trick like using the local encoding in the ZIP file, which makes
> it unportable.
> ...
Thanks for this work.
It is kind of surprising that "Windows builtin" has a very poor score
extracting from the output of Zip tools running on Windows (I am
looking at 46, 46 and n/a over there). If you tell it to create an
archive from its disk and then extract from it, I wonder what would
happen.
Does this result mean that practically nobody uses Zip archives with
exotic letters in paths on that platform? I am not talking about
developers and savvy people who know where to download third-party
Zip archivers and how to install them. I am imagining a grandma who
received an archive full of photos of her grandchild in her Outlook
Express or GMail inbox, clicked the attachment to download it, and
is trying to view the photo inside.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive --format zip utf-8 issues
2012-09-18 21:12 ` Junio C Hamano
@ 2012-09-20 22:00 ` René Scharfe
2012-09-24 15:56 ` René Scharfe
0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-20 22:00 UTC (permalink / raw
To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git
Am 18.09.2012 23:12, schrieb Junio C Hamano:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>
>>                           7-Zip  PeaZip  Windows  Info-ZIP unzip
>>                                          builtin  Linux  msysgit  Windows
>> 7-Zip 9.20                    0       0       46     26       43       43
>> PeaZip 4.7.1 win64            0       0       46     26       42       42
>> Info-ZIP zip 3.0 Linux        0       0       72      0       43       43
>> Info-ZIP zip 3.0 Windows     45      45      n/a      0       43       43
> It is kind of surprising that "Windows builtin" has a very poor score
> extracting from the output of Zip tools running on Windows (I am
> looking at 46, 46 and n/a over there). If you tell it to create an
> archive from its disk and then extract from it, I wonder what would
> happen.
I didn't include it as a packer because it refused to archive the
pangrams directory due to illegal characters in one of the filenames.
When I just tried a bit harder, I had to delete all but 14 files with
Latin script, accents etc. before I could zip the directory. I'll
include these results in the next round.
It uses codepage 850 on my system (MSDOS Latin 1). I don't expect this
to be portable.
> Does this result mean that practically nobody uses Zip archives with
> exotic letters in paths on that platform? I am not talking about
> developers and savvy people who know where to download third-party
> Zip archivers and how to install them. I am imagining a grandma who
> received an archive full of photos of her grandchild in her Outlook
> Express or GMail inbox, clicked the attachment to download it, and
> is trying to view the photo inside.
Not necessarily. Photos often have names like img_0123.jpg etc., which
are handled just fine. And all family members probably use the same
codepage on their computers, so they're less likely to run into this
problem.
By the way, I found this bug asking for codepage support in unzip:
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961
Multiple codepages seem to be used for ZIP files in the wild; none of
them is supported by unzip on Linux, which only accepts ASCII or UTF-8.
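To make the mismatch concrete, here is a minimal sketch. It assumes the
name öäü.txt from the original report, encoded once as UTF-8 and once in
code page 850. looks_like_utf8() is a hypothetical structural check written
for this example; it is not git's is_utf8() and does not reject overlong
forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Checks lead and continuation byte patterns only. */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i++];
		size_t follow;	/* continuation bytes expected after c */

		if (c < 0x80)
			continue;	/* plain ASCII */
		else if ((c & 0xe0) == 0xc0)
			follow = 1;
		else if ((c & 0xf0) == 0xe0)
			follow = 2;
		else if ((c & 0xf8) == 0xf0)
			follow = 3;
		else
			return 0;	/* stray continuation or invalid lead byte */
		while (follow--) {
			if (i >= len || (s[i++] & 0xc0) != 0x80)
				return 0;
		}
	}
	return 1;
}

/* "öäü.txt" in UTF-8: two bytes per umlaut. */
static const unsigned char name_utf8[] = {
	0xc3, 0xb6, 0xc3, 0xa4, 0xc3, 0xbc, '.', 't', 'x', 't'
};

/* "öäü.txt" in code page 850: one byte per umlaut. */
static const unsigned char name_cp850[] = {
	0x94, 0x84, 0x81, '.', 't', 'x', 't'
};
```

The code page 850 bytes fail the check because 0x94, 0x84 and 0x81 all have
the bit pattern of UTF-8 continuation bytes without a preceding lead byte.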
René
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/2] archive-zip: write extended timestamp
2012-09-18 19:40 ` René Scharfe
` (3 preceding siblings ...)
2012-09-18 21:12 ` Junio C Hamano
@ 2012-09-24 15:56 ` René Scharfe
4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw
Cc: Junio C Hamano, Jeff King, Sven Strickroth, git
File modification times in ZIP files are encoded in DOS format: local
time with a granularity of two seconds. Add an extra field to all
archive entries to also record the mtime in Unix fashion, as UTC with
a granularity of one second.
This has the desirable side-effect of convincing Info-ZIP unzip 6.00
to respect general purpose flag 11, which is used to indicate that a
file name is encoded in UTF-8. Any extra field would do, actually,
but the extended timestamp is a reasonably small one (22 bytes per
entry). Archives created by Info-ZIP zip 3.0 contain it, too (but
with ctime and atime as well).
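The two-second granularity follows from the DOS layout, which has only five
bits for the seconds. A minimal sketch of the packing; dos_time() and
dos_date() here are illustrative helpers that take already broken-down local
time, not necessarily the functions used in archive-zip.c:

```c
#include <assert.h>
#include <stdint.h>

/* DOS time word: bits 15-11 hour, bits 10-5 minute, bits 4-0 seconds / 2. */
static uint16_t dos_time(unsigned hour, unsigned min, unsigned sec)
{
	assert(hour < 24 && min < 60 && sec < 60);
	return (uint16_t)((hour << 11) | (min << 5) | (sec / 2));
}

/* DOS date word: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day. */
static uint16_t dos_date(unsigned year, unsigned month, unsigned day)
{
	assert(year >= 1980 && year <= 2107);
	assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
	return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}
```

Since the seconds are halved, 15:56:06 and 15:56:07 map to the same time
word, which is the information the extra field preserves.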
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
archive-zip.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..55f66b4 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -76,6 +76,14 @@ struct zip_dir_trailer {
unsigned char _end[1];
};
+struct zip_extra_mtime {
+ unsigned char magic[2];
+ unsigned char extra_size[2];
+ unsigned char flags[1];
+ unsigned char mtime[4];
+ unsigned char _end[1];
+};
+
/*
* On ARM, padding is added at the end of the struct, so a simple
* sizeof(struct ...) reports two bytes more than the payload size
@@ -85,6 +93,9 @@ struct zip_dir_trailer {
#define ZIP_DATA_DESC_SIZE offsetof(struct zip_data_desc, _end)
#define ZIP_DIR_HEADER_SIZE offsetof(struct zip_dir_header, _end)
#define ZIP_DIR_TRAILER_SIZE offsetof(struct zip_dir_trailer, _end)
+#define ZIP_EXTRA_MTIME_SIZE offsetof(struct zip_extra_mtime, _end)
+#define ZIP_EXTRA_MTIME_PAYLOAD_SIZE \
+ (ZIP_EXTRA_MTIME_SIZE - offsetof(struct zip_extra_mtime, flags))
static void copy_le16(unsigned char *dest, unsigned int n)
{
@@ -186,6 +197,7 @@ static int write_zip_entry(struct archiver_args *args,
{
struct zip_local_header header;
struct zip_dir_header dirent;
+ struct zip_extra_mtime extra;
unsigned long attr2;
unsigned long compressed_size;
unsigned long crc;
@@ -266,8 +278,13 @@ static int write_zip_entry(struct archiver_args *args,
}
}
+ copy_le16(extra.magic, 0x5455);
+ copy_le16(extra.extra_size, ZIP_EXTRA_MTIME_PAYLOAD_SIZE);
+ extra.flags[0] = 1; /* just mtime */
+ copy_le32(extra.mtime, args->time);
+
/* make sure we have enough free space in the dictionary */
- direntsize = ZIP_DIR_HEADER_SIZE + pathlen;
+ direntsize = ZIP_DIR_HEADER_SIZE + pathlen + ZIP_EXTRA_MTIME_SIZE;
while (zip_dir_size < zip_dir_offset + direntsize) {
zip_dir_size += ZIP_DIRECTORY_MIN_SIZE;
zip_dir = xrealloc(zip_dir, zip_dir_size);
@@ -283,7 +300,7 @@ static int write_zip_entry(struct archiver_args *args,
copy_le16(dirent.mdate, zip_date);
set_zip_dir_data_desc(&dirent, size, compressed_size, crc);
copy_le16(dirent.filename_length, pathlen);
- copy_le16(dirent.extra_length, 0);
+ copy_le16(dirent.extra_length, ZIP_EXTRA_MTIME_SIZE);
copy_le16(dirent.comment_length, 0);
copy_le16(dirent.disk, 0);
copy_le16(dirent.attr1, 0);
@@ -301,11 +318,13 @@ static int write_zip_entry(struct archiver_args *args,
else
set_zip_header_data_desc(&header, size, compressed_size, crc);
copy_le16(header.filename_length, pathlen);
- copy_le16(header.extra_length, 0);
+ copy_le16(header.extra_length, ZIP_EXTRA_MTIME_SIZE);
write_or_die(1, &header, ZIP_LOCAL_HEADER_SIZE);
zip_offset += ZIP_LOCAL_HEADER_SIZE;
write_or_die(1, path, pathlen);
zip_offset += pathlen;
+ write_or_die(1, &extra, ZIP_EXTRA_MTIME_SIZE);
+ zip_offset += ZIP_EXTRA_MTIME_SIZE;
if (stream && method == 0) {
unsigned char buf[STREAM_BUFFER_SIZE];
ssize_t readlen;
@@ -402,6 +421,8 @@ static int write_zip_entry(struct archiver_args *args,
zip_dir_offset += ZIP_DIR_HEADER_SIZE;
memcpy(zip_dir + zip_dir_offset, path, pathlen);
zip_dir_offset += pathlen;
+ memcpy(zip_dir + zip_dir_offset, &extra, ZIP_EXTRA_MTIME_SIZE);
+ zip_dir_offset += ZIP_EXTRA_MTIME_SIZE;
zip_dir_entries++;
return 0;
--
1.7.12
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: git archive --format zip utf-8 issues
2012-09-20 22:00 ` René Scharfe
@ 2012-09-24 15:56 ` René Scharfe
2012-09-24 18:13 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw
Cc: Junio C Hamano, Jeff King, Sven Strickroth, git
[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]
Hi,
I found a way to make unzip respect the UTF-8 flag in ZIP files:
Apparently (from looking at the source) an extended field needs to be
present in order for it to even look at general purpose flag 11. I sent
a patch to add an extended timestamp field that fits the bill.
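A minimal sketch of such a field, following the layout in the patch: header
ID 0x5455 ("UT"), a payload size of 5, a flags byte with bit 0 set for
mtime, and the mtime as a 32-bit little-endian Unix time. The names
put_le16(), put_le32(), write_ext_timestamp() and example_field are invented
for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXTRA_ID_EXT_TIMESTAMP 0x5455	/* "UT" when stored little-endian */
#define EXT_TIMESTAMP_MTIME    0x01	/* flags bit 0: mtime is present */
#define EXT_TIMESTAMP_SIZE     9	/* 2 id + 2 size + 1 flags + 4 mtime */

static void put_le16(unsigned char *dest, unsigned int n)
{
	dest[0] = n & 0xff;
	dest[1] = (n >> 8) & 0xff;
}

static void put_le32(unsigned char *dest, uint32_t n)
{
	dest[0] = n & 0xff;
	dest[1] = (n >> 8) & 0xff;
	dest[2] = (n >> 16) & 0xff;
	dest[3] = (n >> 24) & 0xff;
}

/* buf must hold EXT_TIMESTAMP_SIZE bytes; returns the bytes written. */
static size_t write_ext_timestamp(unsigned char *buf, uint32_t mtime)
{
	assert(buf != NULL);
	put_le16(buf, EXTRA_ID_EXT_TIMESTAMP);
	put_le16(buf + 2, EXT_TIMESTAMP_SIZE - 4);	/* payload: flags + mtime */
	buf[4] = EXT_TIMESTAMP_MTIME;
	put_le32(buf + 5, mtime);
	return EXT_TIMESTAMP_SIZE;
}

/* Scratch buffer for trying the helper out. */
static unsigned char example_field[EXT_TIMESTAMP_SIZE];
```

The same nine bytes would be appended after the path in both the local
header and the central directory entry, with the respective extra_length
fields set to match.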
Here are new numbers on ZIP international filename compatibility:
                    7-Zip   PeaZip  builtin    unzip    unzip    unzip       7z
                  Windows  Windows  Windows    Linux    mingw  Windows    Linux
git Linux               1        1        1        7        1        1        1
git 1 Linux            37       37        1        7        1        1       37
git 2 Linux            37       37        1       37        1        1       37
git 3 Linux            37       37        1       37       15       15       37
git mingw               1        1        1        7        1        1        1
git 1 mingw            37       37        1        7        1        1       37
git 2 mingw            37       37        1       37        1        1       37
git 3 mingw            37       37        1       37       15       15       37
7-Zip Windows          37       37       14       24       15       15       24
PeaZip Windows         37       37       14       24       15       15       24
zip Linux              37       37        1       37       15       15       37
zip Windows            14       14        0       37       15       15        1
builtin Windows        14       14       14        1       14       14        1
The test corpus still consists of 37 files based on the pangrams on the
following web page:
http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox
The files can be created using the attached script. It also provides a
check command to count the files with correct names, and the results of
that for different ZIP extractors are given in the table. The built-in
ZIP functionality on Windows was only able to pack 14 of the 37 files,
which explains the low score across the board for this packer.
"git 1" is git with the patch "archive-zip: support UTF-8 paths" added,
which lets archive-zip make use of the UTF-8 flag. "git 2" is "git 1" plus
the patch "archive-zip: declare creator to be Unix for UTF-8 paths".
Both have been posted before. "git 3" is "git 1" plus the new patch
"archive-zip: write extended timestamp".
Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag) and
3 (mtime field) to make archive-zip record non-ASCII filenames in a
portable way. It's not perfect, but I don't know how to do any better
given that Windows' built-in ZIP functionality expects filenames in the
local code page and with an international audience for projects
distributing ZIP files.
René
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2367 bytes --]
#!/bin/sh
files() {
cat <<EOF
pangrams/わがよたれぞ つねならむ
pangrams/うゐのおくやま けふこえて
pangrams/いろはにほへど ちりぬるを
pangrams/あさきゆめみじ ゑひもせず
pangrams/An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall
pangrams/Árvíztűrő tükörfúrógép
pangrams/Blåbærsyltetøy
pangrams/D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór
pangrams/d'œufs abîmés
pangrams/Éava agus Áḋaiṁ
pangrams/Eĥoŝanĝo ĉiuĵaŭde
pangrams/El pingüino Wenceslao hizo kilómetros bajo exhaustiva
pangrams/Falsches Üben von Xylophonmusik quält
pangrams/Flygande bäckasiner söka strax hwila på mjuka tuvor
pangrams/Høj bly gom vandt fræk sexquiz på wc
pangrams/jeden größeren Zwerg
pangrams/lena ṗóg éada ó ṡlí do leasa ṫú
pangrams/Les naïfs ægithales hâtifs pondant à Noël où il gèle
pangrams/lluvia y frío añoraba a su querido cachorro
pangrams/na stĺpe sa ďateľ učí kvákať novú ódu o živote
pangrams/O próximo vôo à noite sobre o Atlântico
pangrams/Pa's wijze lynx bezag vroom het fikse aquaduct
pangrams/Pchnąć w tę łódź jeża lub osiem skrzyń fig
pangrams/põe freqüentemente o único médico
pangrams/Příliš žluťoučký kůň úpěl ďábelské kódy
pangrams/Sævör grét áðan því úlpan var ónýt
pangrams/sont sûrs d'être déçus en voyant leurs drôles
pangrams/Starý kôň na hŕbe kníh žuje tíško povädnuté ruže
pangrams/The quick brown fox jumps over the lazy dog
pangrams/Törkylempijävongahdus
pangrams/Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža
pangrams/זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן
pangrams/ξεσκεπάζω την ψυχοφθόρα βδελυγμία
pangrams/ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
pangrams/Жълтата дюля беше щастлива
pangrams/Съешь же ещё этих мягких французских булок да выпей чаю
pangrams/че пухът, който цъфна, замръзна като гьон
EOF
}
case "$1" in
create)
mkdir -p pangrams
files | while read file
do
touch "$file"
done
;;
check)
files | while read file
do
test -f "$file" && echo "$file"
done | wc -l
;;
*)
echo "Usage: $0 create | check" >&2
exit 1
;;
esac
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git archive --format zip utf-8 issues
2012-09-24 15:56 ` René Scharfe
@ 2012-09-24 18:13 ` Junio C Hamano
0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-09-24 18:13 UTC (permalink / raw
To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git
René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> "git 1" is the patch "archive-zip: support UTF-8 paths" added, which
> lets archive-zip make use of the UTF-8 flag. "git 2" is "git 1" plus
> the patch "archive-zip: declare creator to be Unix for UTF-8
> paths". Both have been posted before. "git 3" is "git 1" plus the new
> patch "archive-zip: write extended timestamp".
>
> Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag)
> and 3 (mtime field) to make archive-zip record non-ASCII filenames in
> a portable way. It's not perfect, but I don't know how to do any
> better given that Windows' built-in ZIP functionality expects
> filenames in the local code page and with an international audience
> for projects distributing ZIP files.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2012-09-24 18:13 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-10 21:58 git archive --format zip utf-8 issues Sven Strickroth
2012-08-10 22:47 ` Junio C Hamano
2012-08-10 23:53 ` Sven Strickroth
2012-08-11 20:53 ` René Scharfe
2012-08-12 4:08 ` Junio C Hamano
2012-08-11 20:53 ` René Scharfe
2012-08-11 21:37 ` Sven Strickroth
2012-08-30 22:26 ` Jeff King
2012-09-04 20:23 ` René Scharfe
2012-09-04 21:03 ` Junio C Hamano
2012-09-05 19:36 ` René Scharfe
2012-09-18 19:40 ` René Scharfe
2012-09-18 19:46 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
2012-09-18 19:53 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
2012-09-18 20:24 ` git archive --format zip utf-8 issues René Scharfe
2012-09-18 21:12 ` Junio C Hamano
2012-09-20 22:00 ` René Scharfe
2012-09-24 15:56 ` René Scharfe
2012-09-24 18:13 ` Junio C Hamano
2012-09-24 15:56 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
2012-08-12 4:27 ` git archive --format zip utf-8 issues Junio C Hamano