git archive --format zip utf-8 issues

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git archive --format zip utf-8 issues
@ 2012-08-10 21:58 Sven Strickroth
  2012-08-10 22:47 ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-10 21:58 UTC (permalink / raw
  To: git

Hi,

when I create a git repository, add a file containing utf-8 characters
or umlauts (like öäü.txt), commit and then export the HEAD revision to a
zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
contains incorrect filenames:

$ unzip -l 1.zip
Archive:  1.zip
4490a6dab1df5404f91ab3eb871f133154bff0bf
  Length      Date    Time    Name
---------  ---------- -----   ----
        6  2012-08-10 23:41   +?+?++.txt
---------                     -------
        6                     1 file
-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-10 21:58 git archive --format zip utf-8 issues Sven Strickroth
@ 2012-08-10 22:47 ` Junio C Hamano
  2012-08-10 23:53   ` Sven Strickroth
  2012-08-11 20:53   ` René Scharfe
  0 siblings, 2 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-10 22:47 UTC (permalink / raw
  To: Sven Strickroth; +Cc: git, René Scharfe

Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:

> when I create a git repository, add a file containing utf-8 characters
> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
> contains incorrect filenames:

My reading of archive-zip.c seems to suggest that we write out
whatever pathname you have in the tree, so a pathname encoded in
UTF-8 will be literally written out in the resulting zip archive.

Do you know in what encoding the pathnames are _expected_ to be
stored in zip archives?  Random documentation seems to suggest that
there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
says:

    There is no official file name encoding for ZIP files. If you
    have unicode file names, you must convert them to byte strings
    in your desired encoding before passing them to write(). WinZip
    interprets all file names as encoded in CP437, also known as DOS
    Latin.

which may explain it.

It may not be a bad idea for "git archive --format=zip" to

 (1) check if pathname is a correct UTF-8; and
 (2) check if it can be reencoded to latin-1

and if (and only if) both are true, automatically re-encode the path
to latin-1.

Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
would be the most generic way to do this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-10 22:47 ` Junio C Hamano
@ 2012-08-10 23:53   ` Sven Strickroth
  2012-08-11 20:53     ` René Scharfe
  2012-08-11 20:53   ` René Scharfe
  1 sibling, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-10 23:53 UTC (permalink / raw
  To: git; +Cc: Junio C Hamano, René Scharfe

Am 11.08.2012 00:47 schrieb Junio C Hamano:
> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?

re-encoding to latin1 does not always work and may break double byte
totally (e.g. chinese or japanese).

PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> A.  Local file header:
> general purpose bit flag: (2 bytes)
> Bit 11: Language encoding flag (EFS).  If this bit is
> set, the filename and comment fields for this file
> must be encoded using UTF-8. (see APPENDIX D)

-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-10 22:47 ` Junio C Hamano
  2012-08-10 23:53   ` Sven Strickroth
@ 2012-08-11 20:53   ` René Scharfe
  2012-08-11 21:37     ` Sven Strickroth
  2012-08-12  4:27     ` git archive --format zip utf-8 issues Junio C Hamano
  1 sibling, 2 replies; 21+ messages in thread
From: René Scharfe @ 2012-08-11 20:53 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Sven Strickroth, git

Am 11.08.2012 00:47, schrieb Junio C Hamano:
> Sven Strickroth <sven.strickroth@tu-clausthal.de> writes:
>
>> when I create a git repository, add a file containing utf-8 characters
>> or umlauts (like öäü.txt), commit and then export the HEAD revision to a
>> zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
>> contains incorrect filenames:
>
> My reading of archive-zip.c seems to suggest that we write out
> whatever pathname you have in the tree, so a pathname encoded in
> UTF-8 will be literally written out in the resulting zip archive.

Sorry for my imperialistic attitude of "ASCII filenames should be enough 
for everybody".  Laziness..

> Do you know in what encoding the pathnames are _expected_ to be
> stored in zip archives?  Random documentation seems to suggest that
> there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
> says:
>
>      There is no official file name encoding for ZIP files. If you
>      have unicode file names, you must convert them to byte strings
>      in your desired encoding before passing them to write(). WinZip
>      interprets all file names as encoded in CP437, also known as DOS
>      Latin.
>
> which may explain it.

http://www.pkware.com/documents/casestudies/APPNOTE.TXT is the standard 
document, as Sven noted, and it says that filenames are encoded in code 
page 437, or optionally UTF-8 (a later addition).  Discussions like 
http://stackoverflow.com/questions/106367/ seem to indicate that at 
least some archivers use the local code page as well.

> It may not be a bad idea for "git archive --format=zip" to
>
>   (1) check if pathname is a correct UTF-8; and
>   (2) check if it can be reencoded to latin-1
>
> and if (and only if) both are true, automatically re-encode the path
> to latin-1.

The standard says we need to convert to CP437, or to UTF-8, or provide 
both versions. A more interesting question is: What's supported by which 
programs?

The ZIP functionality built into Windows 7 doesn't seem to work with 
UTF-8 encoded filenames (except for those that only use the ASCII 
subset), and to ignore the UTF-8 part if both are given.  Handling 
umlauts should be possible anyway, because they are on code page 437, 
but for other characters we'd have to aim for compatibility with other 
programs like Info-ZIP and 7-Zip.

How do we know which encoding was used for a filename?

> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
> would be the most generic way to do this.

I really hope we can make do without additional options.

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-10 23:53   ` Sven Strickroth
@ 2012-08-11 20:53     ` René Scharfe
  2012-08-12  4:08       ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-08-11 20:53 UTC (permalink / raw
  To: Sven Strickroth; +Cc: git, Junio C Hamano

Am 11.08.2012 01:53, schrieb Sven Strickroth:
> Am 11.08.2012 00:47 schrieb Junio C Hamano:
>> Do you know in what encoding the pathnames are _expected_ to be
>> stored in zip archives?
>
> re-encoding to latin1 does not always work and may break double byte
> totally (e.g. chinese or japanese).
>
> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>> A.  Local file header:
>> general purpose bit flag: (2 bytes)
>> Bit 11: Language encoding flag (EFS).  If this bit is
>> set, the filename and comment fields for this file
>> must be encoded using UTF-8. (see APPENDIX D)

Yes, that's one of the two methods for supporting UTF-8 filenames 
described there.

The other method involves writing extra ZIP header fields and was 
invented by Info-ZIP. They don't use it consistently anymore, though 
(from zip -h2):

  "Zip now stores UTF-8 in entry path and comment fields on systems
   where UTF-8 char set is default, such as most modern Unix, and
   and on other systems in new extra fields with escaped versions in
   entry path and comment fields for backward compatibility."

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-11 20:53   ` René Scharfe
@ 2012-08-11 21:37     ` Sven Strickroth
  2012-08-30 22:26       ` Jeff King
  2012-08-12  4:27     ` git archive --format zip utf-8 issues Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Sven Strickroth @ 2012-08-11 21:37 UTC (permalink / raw
  To: git; +Cc: René Scharfe, Junio C Hamano

Am 11.08.2012 22:53 schrieb René Scharfe:
> The standard says we need to convert to CP437, or to UTF-8, or provide 
> both versions. A more interesting question is: What's supported by which 
> programs?
> 
> The ZIP functionality built into Windows 7 doesn't seem to work with 
> UTF-8 encoded filenames (except for those that only use the ASCII 
> subset), and to ignore the UTF-8 part if both are given.

I played a bit with the git source code and found out, that

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..e0ccb4f 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
 	copy_le16(dirent.creator_version,
 		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
 	copy_le16(dirent.version, 10);
-	copy_le16(dirent.flags, flags);
+	copy_le16(dirent.flags, flags+2048);
 	copy_le16(dirent.compression_method, method);
 	copy_le16(dirent.mtime, zip_time);
 	copy_le16(dirent.mdate, zip_date);
--
works with 7-zip, however, not with Windows 7 build-in zip.

If I create a zip file with 7-zip which contains umlauts and other
unicode chars like (國立1-кккк.txt) the Windows 7 build-in zip displays
them correctly, too.

-- 
Best regards,
 Sven Strickroth
 PGP key id F5A9D4C4 @ any key-server

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-11 20:53     ` René Scharfe
@ 2012-08-12  4:08       ` Junio C Hamano
  0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-12  4:08 UTC (permalink / raw
  To: René Scharfe; +Cc: Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

>> PKZIP APPNOTE seems to be the zip standard and it specifies a utf-8
>> flag: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>>> A.  Local file header:
>>> general purpose bit flag: (2 bytes)
>>> Bit 11: Language encoding flag (EFS).  If this bit is
>>> set, the filename and comment fields for this file
>>> must be encoded using UTF-8. (see APPENDIX D)
>
> Yes, that's one of the two methods for supporting UTF-8 filenames
> described there.
>
> The other method involves writing extra ZIP header fields and was
> invented by Info-ZIP. They don't use it consistently anymore, though
> (from zip -h2):
>
>  "Zip now stores UTF-8 in entry path and comment fields on systems
>   where UTF-8 char set is default, such as most modern Unix, and
>   and on other systems in new extra fields with escaped versions in
>   entry path and comment fields for backward compatibility."

Thanks; so if we adopt one of these methods, the readers that matter
will be happy?  And if so, which one?  Or both?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-11 20:53   ` René Scharfe
  2012-08-11 21:37     ` Sven Strickroth
@ 2012-08-12  4:27     ` Junio C Hamano
  1 sibling, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-08-12  4:27 UTC (permalink / raw
  To: René Scharfe; +Cc: Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> ... A more interesting question is: What's supported by
> which programs?

Yes, that is the most interesting question.

>> Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
>> would be the most generic way to do this.
>
> I really hope we can make do without additional options.

We need to at least know the path encoding used in the tree objects,
and I'd be OK with a solution that assumes a single encoding is used
for the entire tree.

We would eventually need to also know the encoding used on the local
working tree (i.e. in what encoding paths are returned from
readdir() and the pathspec the user gives us from the command line),
and iconv it to the tree objects encoding for the project when
creating a cache_entry object to be fed to add_to_index(), and iconv
it back from the tree objects encoding to the working tree encoding
in write_entry(), but that is a longer term direction.  For now, in
order to address the immediate issue, we only need the tree object
encoding, which should default to UTF-8 for interoperability.

So "git archive --format=zip --in-object-path-encoding=big5" for a
project whose tree object pathnames are in that encoding (and we
always record paths in UTF-8 when writing zipfiles) should be the
minimal that we need for now.

Optionally, with a configuration variable i18n.inObjectPathEncoding
(as opposed to the eventual i18n.worktreePathEncoding) set to big5,
users of such a project can say "git archive --format=zip" without
the "--in-object-path-encoding" option.

Considering that zip is a format meant for exchange, I'd think we
would be fine to always write in UTF-8 and leaving the readers
responsible for converting the pathname while extracting.  If a
major zip extractor is incapable of handling UTF-8 (or even if
capable it is cumbersome, for that matter), we may end up having to
add "--in-archive-path-encoding=UTF-8" option to "git archive", with
associated "zip.archivePathEncoding" variable, though.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-11 21:37     ` Sven Strickroth
@ 2012-08-30 22:26       ` Jeff King
  2012-09-04 20:23         ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2012-08-30 22:26 UTC (permalink / raw
  To: Sven Strickroth; +Cc: git, René Scharfe, Junio C Hamano

On Sat, Aug 11, 2012 at 11:37:05PM +0200, Sven Strickroth wrote:

> Am 11.08.2012 22:53 schrieb René Scharfe:
> > The standard says we need to convert to CP437, or to UTF-8, or provide 
> > both versions. A more interesting question is: What's supported by which 
> > programs?
> > 
> > The ZIP functionality built into Windows 7 doesn't seem to work with 
> > UTF-8 encoded filenames (except for those that only use the ASCII 
> > subset), and to ignore the UTF-8 part if both are given.
> 
> I played a bit with the git source code and found out, that
> 
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..e0ccb4f 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	copy_le16(dirent.creator_version,
>  		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
>  	copy_le16(dirent.version, 10);
> -	copy_le16(dirent.flags, flags);
> +	copy_le16(dirent.flags, flags+2048);
>  	copy_le16(dirent.compression_method, method);
>  	copy_le16(dirent.mtime, zip_time);
>  	copy_le16(dirent.mdate, zip_date);
> --
> works with 7-zip, however, not with Windows 7 build-in zip.
> 
> If I create a zip file with 7-zip which contains umlauts and other
> unicode chars like (國立1-кккк.txt) the Windows 7 build-in zip displays
> them correctly, too.

Ping on this stalled discussion.

It seems like there are two separate issues here:

  1. Knowing the encoding of pathnames in the repository.

  2. Setting the right flags in zip output.

A full solution would handle both parts, but let's ignore (1) for a
moment, and assume we have utf-8 (or can massage into utf-8 from an
encoding specified by the user).

It seems like just setting the magic utf-8 flag would be the only thing
we need to do, according to the standard. But according to discussions
referenced elsewhere in this thread, that flag was invented only in
2007, so we may be dealing with older implementations (I have no idea
how common they would be; that may be the problem with Windows 7's zip
you are seeing). We could re-encode to cp437, which the standard
specifies, but apparently some implementations do not respect that
(and use a local code page instead). And it cannot represent all utf-8
characters, anyway.

It sounds like 7-zip has figured out a more portable solution. Can you
show us a sample of 7-zip's output with utf-8 characters to compare to
what git generates? I wonder if it is using a combination of methods.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-08-30 22:26       ` Jeff King
@ 2012-09-04 20:23         ` René Scharfe
  2012-09-04 21:03           ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-04 20:23 UTC (permalink / raw
  To: Jeff King; +Cc: Sven Strickroth, git, Junio C Hamano

Am 31.08.2012 00:26, schrieb Jeff King:
> Ping on this stalled discussion.

Sorry, I got distracted by other stuff again.  I did some experiments,
though, and here's a preliminary result.

> It seems like there are two separate issues here:
> 
>    1. Knowing the encoding of pathnames in the repository.
> 
>    2. Setting the right flags in zip output.
> 
> A full solution would handle both parts, but let's ignore (1) for a
> moment, and assume we have utf-8 (or can massage into utf-8 from an
> encoding specified by the user).

Yes, good thinking.  Re-encoding may be beneficial for tar files as well,
but we can ignore that point for the moment.

> It seems like just setting the magic utf-8 flag would be the only thing
> we need to do, according to the standard. But according to discussions
> referenced elsewhere in this thread, that flag was invented only in
> 2007, so we may be dealing with older implementations (I have no idea
> how common they would be; that may be the problem with Windows 7's zip
> you are seeing). We could re-encode to cp437, which the standard
> specifies, but apparently some implementations do not respect that
> (and use a local code page instead). And it cannot represent all utf-8
> characters, anyway.

Yes, we could do that, plus adding an extra field with a UTF-8 version of
the path.  That's the legacy method invented by Info-ZIP.  They switched
to using the new flag on Linux at least, though.

> It sounds like 7-zip has figured out a more portable solution. Can you
> show us a sample of 7-zip's output with utf-8 characters to compare to
> what git generates? I wonder if it is using a combination of methods.

I'm not so sure they produce more portable files.  I created an archive
with files named jaя.txt, smørrebrød.txt, süd.txt and €uro.txt with 7-Zip
on Windows 7 and while unzip on Ubuntu 12.04 managed to recreate the
cyrillic character and the Euro symbol, it mangled the slashed o and the
umlaut.

With the following patch I could create archives with git on Linux and
msysgit and extract them flawlessly on Windows with 7-Zip and with
Info-ZIP unzip on Linux, but not with unzip on Windows, where it mangled
all non-ASCII characters.

This gets confusing; it would help to have a compatibility matrix for all
intersting extractors and character classes -- for each proposed solution
or archiver we'd like to imitate.

But now for the patch, which is a bit confusing as well.  I'm curious to
hear about results for more platforms, extractors and character classes.
Based on that we can see if we need to generate the extra fields instead
of relying on the new flag.

-- >8 --
Subject: [PATCH] archive-zip: support UTF-8 paths

Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters.  We assume that all paths are given as UTF-8; no
conversion is done.

The flag seems to be ignored by unzip unless we also mark the archive
entry as coming from a Unix system.  This is done by setting the field
creator_version ("version made by" in the standard[1]) to 0x03NN.

The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.

But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 archive-zip.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..928da1d 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,8 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "commit.h"
+#include "utf8.h"

 static int zip_date;
 static int zip_time;
@@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;

 #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
-#define ZIP_STREAM (8)
+#define ZIP_STREAM	(1 <<  3)
+#define ZIP_UTF8	(1 << 11)

 struct zip_local_header {
 	unsigned char magic[4];
@@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,

 	crc = crc32(0, NULL, 0);

+	if (has_non_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
@@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);

 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;

 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}

 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
-- 
1.7.11.msysgit.1.1.gbf71771

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-04 20:23         ` René Scharfe
@ 2012-09-04 21:03           ` Junio C Hamano
  2012-09-05 19:36             ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2012-09-04 21:03 UTC (permalink / raw
  To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> But now for the patch, which is a bit confusing as well.  I'm curious to
> hear about results for more platforms, extractors and character classes.
> Based on that we can see if we need to generate the extra fields instead
> of relying on the new flag.

Thanks for keeping the ball rolling.

> Subject: [PATCH] archive-zip: support UTF-8 paths
>
> Set general purpose flag 11 if we encounter a path that contains
> non-ASCII characters.  We assume that all paths are given as UTF-8; no
> conversion is done.
>
> The flag seems to be ignored by unzip unless we also mark the archive
> entry as coming from a Unix system.  This is done by setting the field
> creator_version ("version made by" in the standard[1]) to 0x03NN.
>
> The NN part represents the version of the standard supported by us, and
> this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
> creator_version set to 0 (FAT filesystem, standard version 0) in the
> non-special cases, as before.
>
> But when we declare a file to have a Unix path, then we have to set the
> file mode as well, or unzip will extract the files with the permission
> set 0000, i.e. inaccessible by all.
>
> [1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>
> Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
> ---
>  archive-zip.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..928da1d 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -4,6 +4,8 @@
>  #include "cache.h"
>  #include "archive.h"
>  #include "streaming.h"
> +#include "commit.h"
> +#include "utf8.h"
>  
>  static int zip_date;
>  static int zip_time;
> @@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
>  static unsigned int zip_dir_entries;
>  
>  #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
> -#define ZIP_STREAM (8)
> +#define ZIP_STREAM	(1 <<  3)
> +#define ZIP_UTF8	(1 << 11)
>  
>  struct zip_local_header {
>  	unsigned char magic[4];
> @@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
>  {
>  	struct zip_local_header header;
>  	struct zip_dir_header dirent;
> -	unsigned long attr2;
> +	unsigned int creator_version = 0;
> +	unsigned long attr2 = 0;
>  	unsigned long compressed_size;
>  	unsigned long crc;
>  	unsigned long direntsize;
> @@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
>  
>  	crc = crc32(0, NULL, 0);
>  
> +	if (has_non_ascii(path)) {

Do we want to treat \033 as "ascii" in this codepath?  The function
primarily is used by the log formatter to see if we need 8-bit CTE
when writing out in the e-mail format.

> +		if (is_utf8(path))
> +			flags |= ZIP_UTF8;
> +		else
> +			warning("Path is not valid UTF-8: %s", path);
> +	}
> +
>  	if (pathlen > 0xffff) {
>  		return error("path too long (%d chars, SHA1: %s): %s",
>  				(int)pathlen, sha1_to_hex(sha1), path);
> @@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
>  		enum object_type type = sha1_object_info(sha1, &size);
>  
>  		method = 0;
> -		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
> -			(mode & 0111) ? ((mode) << 16) : 0;
>  		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
>  			method = 8;
> +		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
> +			creator_version = 0x033f;
> +			attr2 = mode;
> +			if (S_ISLNK(mode))
> +				attr2 |= 0777;
> +			attr2 <<= 16;
> +		}
>  		compressed_size = size;
>  
>  		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
> @@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	}
>  
>  	copy_le32(dirent.magic, 0x02014b50);
> -	copy_le16(dirent.creator_version,
> -		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
> +	copy_le16(dirent.creator_version, creator_version);
>  	copy_le16(dirent.version, 10);
>  	copy_le16(dirent.flags, flags);
>  	copy_le16(dirent.compression_method, method);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-04 21:03           ` Junio C Hamano
@ 2012-09-05 19:36             ` René Scharfe
  2012-09-18 19:40               ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-05 19:36 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

Am 04.09.2012 23:03, schrieb Junio C Hamano:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>> +	if (has_non_ascii(path)) {
>
> Do we want to treat \033 as "ascii" in this codepath?  The function
> primarily is used by the log formatter to see if we need 8-bit CTE
> when writing out in the e-mail format.

Argh, yes, I'd think so. The function name mislead me.  This won't 
matter for compatibility testing, but should be corrected before inclusion.

Just checked: unzip strips the ESC character when extracting (whether 
the UTF-8 flag is set or not) and 7-Zip replaces it with an underscore. 
The built-in ZIP extractor of Windows 7 skips such a file; it doesn't 
even show up in archive directory listings.

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-05 19:36             ` René Scharfe
@ 2012-09-18 19:40               ` René Scharfe
  2012-09-18 19:46                 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
                                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:40 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]

Hello again,

so two weeks have passed, and I've moved at a glacial pace towards a 
method how to measure compatibility of our generated ZIP files.  Sorry, 
I just keep getting distracted.

Anyway, the idea is to have a bunch of files with names using different 
scripts, zip them with several packers (including git archive), unzip 
them and compare the result with the original files.

As test corpus I used files named like the pangrams on this UTF-8 
sampler page, the exact commands are attached:

    http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The numbers below are how many lines the output of diff -ru contains for 
this pair of packer and unpacker.  There are 37 files, so the worst 
result is 74 lines of difference ("Only in [...]" for both sides), while 
0 indicates a perfect score.

Hmm, come to think of it, an empty directory would show up as 37, so 
this metric is not ideal.  A better one would be to simply give one 
point for each correctly unpacked file.

                                          Windows    Info-ZIP unzip
                             7-Zip PeaZip builtin Linux msysgit Windows
7-Zip 9.20                      0      0      46    26      43      43
PeaZip 4.7.1 win64              0      0      46    26      42      42
Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43
git-master                     72     72      72    60      72      72
git-master-patch1               0      0      72    60      72      72
git-master-patch2               0      0      72     0      72      72
git-v1.7.11.msysgit.1          72     72      72    60      72      72
git-v1.7.11.msysgit.1-patch1    0      0      72    60      72      72
git-v1.7.11.msysgit.1-patch2    0      0      72     0      72      72

Info-ZIP's programs don't work too well on Windows.  The built-in 
unzipper of Windows 7 even refuses to open the file created by the 
Windows version of zip.  Speaking of which, this is the worst of the 
unpackers.

With the two patches applied, we can say "use 7-Zip or PeaZip on Windows 
and unzip on Linux" and filenames with all tested characters will be 
preserved.  I was surprised to see this working fine with msysgit like 
that, even though no reencoding is introduced by the patches.

I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score 
with the Windows-internal unzipper.  Umlauts, Nordic characters and 
accents are preserved by that combination.  It seems that unzip on Linux 
fails to unpack exactly these names, so perhaps they employ a dirty 
trick like using the local encoding in the ZIP file, which makes it 
unportable.

I'll reply with the two patches, which contain basically the same code 
as the previous patch, only split up.  The second one declares that 
filenames with UTF-8 encoding came from Unix (instead of FAT), which 
makes unzip happy.  This, however, implies that we contain Unix 
permissions for these entries, which is a bit ugly.

René

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2536 bytes --]

#!/bin/sh
(
	mkdir pangrams
	cd pangrams

	echo English >"The quick brown fox jumps over the lazy dog"
	echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
	echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
	echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
	echo Irish 4 >"Éava agus Áḋaiṁ"
	echo Dutch >"Pa's wĳze lynx bezag vroom het fikse aquaduct"
	echo German 1 >"Falsches Üben von Xylophonmusik quält"
	echo German 2 >"jeden größeren Zwerg"
	echo Norwegian >"Blåbærsyltetøy"
	echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
	echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
	echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
	echo Finnish >"Törkylempijävongahdus"
	echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
	echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
	echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
	echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
	echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
	echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
	echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
	echo Bulgarian 1 >"Жълтата дюля беше щастлива"
	echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
	echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
	echo Hungarian >"Árvíztűrő tükörfúrógép"
	echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
	echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
	echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
	echo Portuguese 2 >"põe freqüentemente o único médico"
	echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
	echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
	echo French 3 >"d'œufs abîmés"
	echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
	echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
	echo Hiragana 1 >"いろはにほへど　ちりぬるを"
	echo Hiragana 2 >"わがよたれぞ　つねならむ"
	echo Hiragana 3 >"うゐのおくやま　けふこえて"
	echo Hiragana 4 >"あさきゆめみじ　ゑひもせず"
)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] archive-zip: support UTF-8 paths
  2012-09-18 19:40               ` René Scharfe
@ 2012-09-18 19:46                 ` René Scharfe
  2012-09-18 19:53                 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:46 UTC (permalink / raw
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters.  We assume that all paths are given as UTF-8; no
conversion is done.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
Changes from previous version: Stop using has_non_ascii(), which does
slightly too much for our purposes, and split off creator version
change into a separate patch.

 archive-zip.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..0f763e8 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "utf8.h"
 
 static int zip_date;
 static int zip_time;
@@ -16,7 +17,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;
 
 #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
-#define ZIP_STREAM (8)
+#define ZIP_STREAM	(1 <<  3)
+#define ZIP_UTF8	(1 << 11)
 
 struct zip_local_header {
 	unsigned char magic[4];
@@ -164,6 +166,17 @@ static void set_zip_header_data_desc(struct zip_local_header *header,
 	copy_le32(header->size, size);
 }
 
+static int has_only_ascii(const char *s)
+{
+	for (;;) {
+		int c = *s++;
+		if (c == '\0')
+			return 1;
+		if (!isascii(c))
+			return 0;
+	}
+}
+
 #define STREAM_BUFFER_SIZE (1024 * 16)
 
 static int write_zip_entry(struct archiver_args *args,
@@ -187,6 +200,13 @@ static int write_zip_entry(struct archiver_args *args,
 
 	crc = crc32(0, NULL, 0);
 
+	if (!has_only_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] archive-zip: declare creator to be Unix for UTF-8 paths
  2012-09-18 19:40               ` René Scharfe
  2012-09-18 19:46                 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
@ 2012-09-18 19:53                 ` René Scharfe
  2012-09-18 20:24                 ` git archive --format zip utf-8 issues René Scharfe
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 19:53 UTC (permalink / raw
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

The UTF-8 flag seems to be ignored by unzip unless we also mark the
archive entry as coming from a Unix system.  This is done by setting the
field creator_version ("version made by" in the standard[1]) to 0x03NN.

The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.

But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
---
No sign-off for this, yet.  Perhaps there is a better way to convince
unzip to respect the flag?  And if not, do we need to offer umask
settings for ZIP as well as we have for tar?  And perhaps declare all
files as being from a Unix filesystem, for consistency?

 archive-zip.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..e9b3dc9 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -186,7 +186,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -224,10 +225,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);
 
 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;
 
 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -274,8 +280,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}
 
 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-18 19:40               ` René Scharfe
  2012-09-18 19:46                 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
  2012-09-18 19:53                 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
@ 2012-09-18 20:24                 ` René Scharfe
  2012-09-18 21:12                 ` Junio C Hamano
  2012-09-24 15:56                 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-18 20:24 UTC (permalink / raw
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

Am 18.09.2012 21:40, schrieb René Scharfe:
>                                            Windows    Info-ZIP unzip
>                               7-Zip PeaZip builtin Linux msysgit Windows
> git-master-patch1               0      0      72    60      72      72
> git-master-patch2               0      0      72     0      72      72

Oh, and when I wrote Windows, I meant a German Windows 7 Home Premium 
x64, and Linux is Ubuntu 12.04 x86.

I've also attached generated ZIP files for the two archivers above, for 
easier testing.  Let's see if they make it through.

René


[-- Attachment #2: pangrams-git-master-patch2.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]

[-- Attachment #3: pangrams-git-master-patch1.zip --]
[-- Type: application/x-zip-compressed, Size: 7432 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-18 19:40               ` René Scharfe
                                   ` (2 preceding siblings ...)
  2012-09-18 20:24                 ` git archive --format zip utf-8 issues René Scharfe
@ 2012-09-18 21:12                 ` Junio C Hamano
  2012-09-20 22:00                   ` René Scharfe
  2012-09-24 15:56                 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
  4 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2012-09-18 21:12 UTC (permalink / raw
  To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

>                                          Windows    Info-ZIP unzip
>                             7-Zip PeaZip builtin Linux msysgit Windows
> 7-Zip 9.20                      0      0      46    26      43      43
> PeaZip 4.7.1 win64              0      0      46    26      42      42
> Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
> Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43
> ...
> I wonder what 7-Zip and PeaZip do that gives them a slightly nicer
> score with the Windows-internal unzipper.  Umlauts, Nordic characters
> and accents are preserved by that combination.  It seems that unzip on
> Linux fails to unpack exactly these names, so perhaps they employ a
> dirty trick like using the local encoding in the ZIP file, which makes
> it unportable.
> ...

Thanks for this work.

It is kind of surprising that "Windows builtin" has very poor score
extracting from the output of Zip tools running on Windows (I am
looking at 46, 46 and n/a over there).  If you tell it to create an
archive from its disk and then extract from it, I wonder what would
happen.

Does this result mean that practically nobody uses Zip archive with
exotic letters in paths on that platform?  I am not talking about
developers and savvy people who know where to download third-party
Zip archivers and how to install them.  I am imagining a grandma who
received an archive full of photos of her grandchild in her Outlook
Express or GMail inbox, clicked the attachment to download it, and
is trying to view the photo inside.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-18 21:12                 ` Junio C Hamano
@ 2012-09-20 22:00                   ` René Scharfe
  2012-09-24 15:56                     ` René Scharfe
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-20 22:00 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Jeff King, Sven Strickroth, git

Am 18.09.2012 23:12, schrieb Junio C Hamano:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>
>>                                           Windows    Info-ZIP unzip
>>                              7-Zip PeaZip builtin Linux msysgit Windows
>> 7-Zip 9.20                      0      0      46    26      43      43
>> PeaZip 4.7.1 win64              0      0      46    26      42      42
>> Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
>> Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43

> It is kind of surprising that "Windows builtin" has very poor score
> extracting from the output of Zip tools running on Windows (I am
> looking at 46, 46 and n/a over there).  If you tell it to create an
> archive from its disk and then extract from it, I wonder what would
> happen.

I didn't include it as a packer because it refused to archive the 
pangrams directory due to illegal characters in one of the filenames. 
When I just tried a bit harder, I had to delete all but 14 files with 
Latin script, accents etc. before I could zip the directory.  I'll 
include these results in the next round.

It uses codepage 850 on my system (MSDOS Latin 1).  I don't expect this 
to be portable.

> Does this result mean that practically nobody uses Zip archive with
> exotic letters in paths on that platform?  I am not talking about
> developers and savvy people who know where to download third-party
> Zip archivers and how to install them.  I am imagining a grandma who
> received an archive full of photos of her grandchild in her Outlook
> Express or GMail inbox, clicked the attachment to download it, and
> is trying to view the photo inside.

Not necessarily.  Photos often have names like img_0123.jpg etc., which 
are handled just fine.  And all family members probably use the same 
codepage on their computers, so they're less likely to run into this 
problem.

By the way, I found this bug asking for codepage support in unzip:

   https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961

Multiple codepages seem to be used for ZIP files in the wild, none of 
them are supported by unzip on Linux, which only accepts ASCII or UTF-8.

René

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/2] archive-zip: write extended timestamp
  2012-09-18 19:40               ` René Scharfe
                                   ` (3 preceding siblings ...)
  2012-09-18 21:12                 ` Junio C Hamano
@ 2012-09-24 15:56                 ` René Scharfe
  4 siblings, 0 replies; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

File modification times in ZIP files are encoded in DOS format: local
time with a granularity of two seconds.  Add an extra field to all
archive entries to also record the mtime in Unix' fashion, as UTC with
a granularity of one second.

This has the desirable side-effect of convincing Info-ZIP unzip 6.00
to respect general purpose flag 11, which is used to indicate that a
file name is encoded in UTF-8.  Any extra field would do, actually,
but the extended timestamp is a reasonably small one (22 bytes per
entry).  Archives created by Info-ZIP zip 3.0 contain it, too (but
with ctime and atime as well).

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 archive-zip.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index 0f763e8..55f66b4 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -76,6 +76,14 @@ struct zip_dir_trailer {
 	unsigned char _end[1];
 };
 
+struct zip_extra_mtime {
+	unsigned char magic[2];
+	unsigned char extra_size[2];
+	unsigned char flags[1];
+	unsigned char mtime[4];
+	unsigned char _end[1];
+};
+
 /*
  * On ARM, padding is added at the end of the struct, so a simple
  * sizeof(struct ...) reports two bytes more than the payload size
@@ -85,6 +93,9 @@ struct zip_dir_trailer {
 #define ZIP_DATA_DESC_SIZE	offsetof(struct zip_data_desc, _end)
 #define ZIP_DIR_HEADER_SIZE	offsetof(struct zip_dir_header, _end)
 #define ZIP_DIR_TRAILER_SIZE	offsetof(struct zip_dir_trailer, _end)
+#define ZIP_EXTRA_MTIME_SIZE	offsetof(struct zip_extra_mtime, _end)
+#define ZIP_EXTRA_MTIME_PAYLOAD_SIZE \
+	(ZIP_EXTRA_MTIME_SIZE - offsetof(struct zip_extra_mtime, flags))
 
 static void copy_le16(unsigned char *dest, unsigned int n)
 {
@@ -186,6 +197,7 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
+	struct zip_extra_mtime extra;
 	unsigned long attr2;
 	unsigned long compressed_size;
 	unsigned long crc;
@@ -266,8 +278,13 @@ static int write_zip_entry(struct archiver_args *args,
 		}
 	}
 
+	copy_le16(extra.magic, 0x5455);
+	copy_le16(extra.extra_size, ZIP_EXTRA_MTIME_PAYLOAD_SIZE);
+	extra.flags[0] = 1;	/* just mtime */
+	copy_le32(extra.mtime, args->time);
+
 	/* make sure we have enough free space in the dictionary */
-	direntsize = ZIP_DIR_HEADER_SIZE + pathlen;
+	direntsize = ZIP_DIR_HEADER_SIZE + pathlen + ZIP_EXTRA_MTIME_SIZE;
 	while (zip_dir_size < zip_dir_offset + direntsize) {
 		zip_dir_size += ZIP_DIRECTORY_MIN_SIZE;
 		zip_dir = xrealloc(zip_dir, zip_dir_size);
@@ -283,7 +300,7 @@ static int write_zip_entry(struct archiver_args *args,
 	copy_le16(dirent.mdate, zip_date);
 	set_zip_dir_data_desc(&dirent, size, compressed_size, crc);
 	copy_le16(dirent.filename_length, pathlen);
-	copy_le16(dirent.extra_length, 0);
+	copy_le16(dirent.extra_length, ZIP_EXTRA_MTIME_SIZE);
 	copy_le16(dirent.comment_length, 0);
 	copy_le16(dirent.disk, 0);
 	copy_le16(dirent.attr1, 0);
@@ -301,11 +318,13 @@ static int write_zip_entry(struct archiver_args *args,
 	else
 		set_zip_header_data_desc(&header, size, compressed_size, crc);
 	copy_le16(header.filename_length, pathlen);
-	copy_le16(header.extra_length, 0);
+	copy_le16(header.extra_length, ZIP_EXTRA_MTIME_SIZE);
 	write_or_die(1, &header, ZIP_LOCAL_HEADER_SIZE);
 	zip_offset += ZIP_LOCAL_HEADER_SIZE;
 	write_or_die(1, path, pathlen);
 	zip_offset += pathlen;
+	write_or_die(1, &extra, ZIP_EXTRA_MTIME_SIZE);
+	zip_offset += ZIP_EXTRA_MTIME_SIZE;
 	if (stream && method == 0) {
 		unsigned char buf[STREAM_BUFFER_SIZE];
 		ssize_t readlen;
@@ -402,6 +421,8 @@ static int write_zip_entry(struct archiver_args *args,
 	zip_dir_offset += ZIP_DIR_HEADER_SIZE;
 	memcpy(zip_dir + zip_dir_offset, path, pathlen);
 	zip_dir_offset += pathlen;
+	memcpy(zip_dir + zip_dir_offset, &extra, ZIP_EXTRA_MTIME_SIZE);
+	zip_dir_offset += ZIP_EXTRA_MTIME_SIZE;
 	zip_dir_entries++;
 
 	return 0;
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-20 22:00                   ` René Scharfe
@ 2012-09-24 15:56                     ` René Scharfe
  2012-09-24 18:13                       ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: René Scharfe @ 2012-09-24 15:56 UTC (permalink / raw
  Cc: Junio C Hamano, Jeff King, Sven Strickroth, git

[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]

Hi,

I found a way to make unzip respect the UTF-8 flag in ZIP files: 
Apparently (from looking at the source) an extended field needs to be 
present in order for it to even look at general purpose flag 11.  I sent 
a patch to add an extended timestamp field that fits the bill.

Here are new numbers on ZIP international filename compatibility:

		7-Zip	PeaZip	builtin	unzip	unzip	unzip	7z
		Windows	Windows	Windows	Linux	mingw	Windows	Linux
git	Linux	1	1	1	7	1	1	1
git 1	Linux	37	37	1	7	1	1	37
git 2	Linux	37	37	1	37	1	1	37
git 3	Linux	37	37	1	37	15	15	37
git	mingw	1	1	1	7	1	1	1
git 1	mingw	37	37	1	7	1	1	37
git 2	mingw	37	37	1	37	1	1	37
git 3	mingw	37	37	1	37	15	15	37
7-Zip	Windows	37	37	14	24	15	15	24
PeaZip	Windows	37	37	14	24	15	15	24
zip	Linux	37	37	1	37	15	15	37
zip	Windows	14	14	0	37	15	15	1
builtin	Windows	14	14	14	1	14	14	1

The test corpus still consists of 37 files based on the pangrams on the 
following web page:

	http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The files can be created using the attached script.  It also provides a 
check command to count the files with correct names, and the results of 
that for different ZIP extractors are give in the table.  The built-in 
ZIP functionality on Windows was only able to pack 14 of the 37 files, 
which explains the low score across the board for this packer.

"git 1" is the patch "archive-zip: support UTF-8 paths" added, which 
let's archive-zip make use of the UTF-8 flag.  "git 2" is "git 1" plus 
the patch "archive-zip: declare creator to be Unix for UTF-8 paths". 
Both have been posted before.  "git 3" is "git 1" plus the new patch 
"archive-zip: write extended timestamp".

Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag) and 
3 (mtime field) to make archive-zip record non-ASCII filenames in a 
portable way.  It's not perfect, but I don't know how to do any better 
given that Windows' built-in ZIP functionality expects filenames in the 
local code page and with an international audience for projects 
distributing ZIP files.

René


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: pangrams.sh --]
[-- Type: text/plain; charset=windows-1252; name="pangrams.sh", Size: 2367 bytes --]

#!/bin/sh

files() {
cat <<EOF
pangrams/わがよたれぞ　つねならむ
pangrams/うゐのおくやま　けふこえて
pangrams/いろはにほへど　ちりぬるを
pangrams/あさきゆめみじ　ゑひもせず
pangrams/An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall
pangrams/Árvíztűrő tükörfúrógép
pangrams/Blåbærsyltetøy
pangrams/D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór
pangrams/d'œufs abîmés
pangrams/Éava agus Áḋaiṁ
pangrams/Eĥoŝanĝo ĉiuĵaŭde
pangrams/El pingüino Wenceslao hizo kilómetros bajo exhaustiva
pangrams/Falsches Üben von Xylophonmusik quält
pangrams/Flygande bäckasiner söka strax hwila på mjuka tuvor
pangrams/Høj bly gom vandt fræk sexquiz på wc
pangrams/jeden größeren Zwerg
pangrams/lena ṗóg éada ó ṡlí do leasa ṫú
pangrams/Les naïfs ægithales hâtifs pondant à Noël où il gèle
pangrams/lluvia y frío añoraba a su querido cachorro
pangrams/na stĺpe sa ďateľ učí kvákať novú ódu o živote
pangrams/O próximo vôo à noite sobre o Atlântico
pangrams/Pa's wĳze lynx bezag vroom het fikse aquaduct
pangrams/Pchnąć w tę łódź jeża lub osiem skrzyń fig
pangrams/põe freqüentemente o único médico
pangrams/Příliš žluťoučký kůň úpěl ďábelské kódy
pangrams/Sævör grét áðan því úlpan var ónýt
pangrams/sont sûrs d'être déçus en voyant leurs drôles
pangrams/Starý kôň na hŕbe kníh žuje tíško povädnuté ruže
pangrams/The quick brown fox jumps over the lazy dog
pangrams/Törkylempijävongahdus
pangrams/Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža
pangrams/זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן
pangrams/ξεσκεπάζω την ψυχοφθόρα βδελυγμία
pangrams/ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
pangrams/Жълтата дюля беше щастлива
pangrams/Съешь же ещё этих мягких французских булок да выпей чаю
pangrams/че пухът, който цъфна, замръзна като гьон
EOF
}

case "$1" in
create)
	mkdir -p pangrams
	files | while read file
	do
		touch "$file"
	done
	;;
check)
	files | while read file
	do
		test -f "$file" && echo "$file"
	done | wc -l
	;;
*)
	echo "Usage: $0 create | check" >&2
	exit 1
	;;
esac

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git archive --format zip utf-8 issues
  2012-09-24 15:56                     ` René Scharfe
@ 2012-09-24 18:13                       ` Junio C Hamano
  0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2012-09-24 18:13 UTC (permalink / raw
  To: René Scharfe; +Cc: Jeff King, Sven Strickroth, git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> "git 1" is the patch "archive-zip: support UTF-8 paths" added, which
> let's archive-zip make use of the UTF-8 flag.  "git 2" is "git 1" plus
> the patch "archive-zip: declare creator to be Unix for UTF-8
> paths". Both have been posted before.  "git 3" is "git 1" plus the new
> patch "archive-zip: write extended timestamp".
>
> Let's drop patch 2 (Unix as creator) and keep patches 1 (UTF-8 flag)
> and 3 (mtime field) to make archive-zip record non-ASCII filenames in
> a portable way.  It's not perfect, but I don't know how to do any
> better given that Windows' built-in ZIP functionality expects
> filenames in the local code page and with an international audience
> for projects distributing ZIP files.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-09-24 18:13 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-10 21:58 git archive --format zip utf-8 issues Sven Strickroth
2012-08-10 22:47 ` Junio C Hamano
2012-08-10 23:53   ` Sven Strickroth
2012-08-11 20:53     ` René Scharfe
2012-08-12  4:08       ` Junio C Hamano
2012-08-11 20:53   ` René Scharfe
2012-08-11 21:37     ` Sven Strickroth
2012-08-30 22:26       ` Jeff King
2012-09-04 20:23         ` René Scharfe
2012-09-04 21:03           ` Junio C Hamano
2012-09-05 19:36             ` René Scharfe
2012-09-18 19:40               ` René Scharfe
2012-09-18 19:46                 ` [PATCH 1/2] archive-zip: support UTF-8 paths René Scharfe
2012-09-18 19:53                 ` [PATCH 2/2] archive-zip: declare creator to be Unix for " René Scharfe
2012-09-18 20:24                 ` git archive --format zip utf-8 issues René Scharfe
2012-09-18 21:12                 ` Junio C Hamano
2012-09-20 22:00                   ` René Scharfe
2012-09-24 15:56                     ` René Scharfe
2012-09-24 18:13                       ` Junio C Hamano
2012-09-24 15:56                 ` [PATCH 3/2] archive-zip: write extended timestamp René Scharfe
2012-08-12  4:27     ` git archive --format zip utf-8 issues Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).