From: Pete Wyckoff <pw@padd.com>
To: John Keeping <john@keeping.me.uk>
Cc: Junio C Hamano <gitster@pobox.com>,
	Michael Haggerty <mhagger@alum.mit.edu>,
	git@vger.kernel.org, "Eric S. Raymond" <esr@thyrsus.com>,
	Felipe Contreras <felipe.contreras@gmail.com>,
	Sverre Rabbelier <srabbelier@gmail.com>
Subject: Re: [RFC/PATCH 2/8 v3] git_remote_helpers: fix input when running under Python 3
Date: Tue, 15 Jan 2013 19:03:16 -0500
Message-ID: <20130116000316.GA26999@padd.com>
In-Reply-To: <20130115224049.GZ4574@serenity.lan>

john@keeping.me.uk wrote on Tue, 15 Jan 2013 22:40 +0000:
> This is what keeping the refs as byte strings looks like.

As John knows, it is not possible to interpret text from a byte
string without talking about the character encoding.

Git is (largely) a C program and uses the character set defined
in the C standard, which is a subset of ASCII.  But git does
"math" on strings, like this snippet that takes something from
argv[] and prepends "refs/heads/":

    strcpy(refname, "refs/heads/");
    strcpy(refname + strlen("refs/heads/"), ret->name);

The result doesn't talk about what character set it is using,
but because it combines a prefix from ASCII with its input,
git makes the assumption that the input is ASCII-compatible.

If you feed a UTF-16 string in argv, e.g.

    $ echo master | iconv -f ascii -t utf16 | xargs git branch
    xargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
    fatal: Not a valid object name: ''.

you get an error about NUL, and not the branch you hoped for.
Git assumes that the input character set contains roughly ASCII
in byte positions 0..127.
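
To make the byte-level picture concrete, here is a small Python
sketch (not code from git).  The helper make_refname is hypothetical
and only mimics the strcpy snippet above.  I use "utf-16-le" so that
there is no byte-order mark and the result does not depend on the
host; what iconv emits for "utf16" may differ in those details.

```python
# Hypothetical illustration only; git itself does this in C.
PREFIX = b"refs/heads/"


def make_refname(name: bytes) -> bytes:
    """Prepend the ASCII prefix to raw input bytes, like the strcpy pair."""
    return PREFIX + name


ascii_name = "master".encode("ascii")
utf16_name = "master".encode("utf-16-le")

# ASCII input joins cleanly with the ASCII prefix.
assert make_refname(ascii_name) == b"refs/heads/master"

# UTF-16 puts a NUL byte after every ASCII character, so a C string
# built from it would end right after the first byte.
assert utf16_name == b"m\x00a\x00s\x00t\x00e\x00r\x00"
assert make_refname(utf16_name).split(b"\x00")[0] == b"refs/heads/m"
```

Nothing in the joined result records which encoding the input used;
it only works out if the input agrees with ASCII byte for byte.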

That's one small reason why the useful character encodings put
ASCII in the 0..127 range, including UTF-8, Big5 and Shift-JIS.
ASCII is indeed special due to its legacy, and both C and Python
recognize this.
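
A quick way to check this for one sample ref name, assuming the codec
names from Python's standard library ("big5", "shift_jis").  The
function encodes_like_ascii is a made-up helper, and it only says
something about the characters in the string it is given, not about
every code point below 128.

```python
def encodes_like_ascii(text: str, codec: str) -> bool:
    """Return True if `text` has the same bytes in `codec` as in ASCII."""
    return text.encode(codec) == text.encode("ascii")


sample = "refs/heads/master"
for codec in ("utf-8", "big5", "shift_jis", "utf-16-le"):
    print(codec, encodes_like_ascii(sample, codec))
```

The first three codecs agree with ASCII on this sample; UTF-16 does
not, which is the failure shown with xargs above.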

> diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py
> @@ -18,13 +18,16 @@ class GitImporter(object):
>  
>      def get_refs(self, gitdir):
>          """Returns a dictionary with refs.
> +
> +        Note that the keys in the returned dictionary are byte strings as
> +        read from git.
>          """
>          args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
> -        lines = check_output(args).strip().split('\n')
> +        lines = check_output(args).strip().split('\n'.encode('utf-8'))
>          refs = {}
>          for line in lines:
> -            value, name = line.split(' ')
> -            name = name.strip('commit\t')
> +            value, name = line.split(' '.encode('utf-8'))
> +            name = name.strip('commit\t'.encode('utf-8'))
>              refs[name] = value
>          return refs

I'd suggest for this Python conundrum using byte-string literals, e.g.:

        lines = check_output(args).strip().split(b'\n')
        value, name = line.split(b' ')
        name = name.strip(b'commit\t')

Essentially identical to what you have, but avoids naming "utf-8" as
the encoding.  It instead relies on Python's interpretation of
ASCII characters in byte-string literals, which is exactly what C
does with its string literals.
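
For what it's worth, a self-contained sketch of the parsing with
byte-string literals.  The function parse_refs and the sample output
are invented for illustration (the object names are placeholders, not
real commits), and it assumes the default for-each-ref line layout
"<objectname> <objecttype>\t<refname>".  One deliberate difference
from the patch: it splits on the tab rather than calling
strip(b'commit\t'), because strip() removes a set of characters from
both ends, not a prefix.

```python
def parse_refs(output: bytes) -> dict:
    """Map ref names to object names, keeping both as byte strings."""
    refs = {}
    for line in output.strip().split(b"\n"):
        value, rest = line.split(b" ", 1)
        # rest looks like b"commit\trefs/heads/master"
        _objtype, _tab, name = rest.partition(b"\t")
        refs[name] = value
    return refs


# Placeholder data standing in for check_output(args).
sample_output = (
    b"a" * 40 + b" commit\trefs/heads/master\n"
    + b"b" * 40 + b" commit\trefs/heads/topic\n"
)

refs = parse_refs(sample_output)
assert refs[b"refs/heads/master"] == b"a" * 40
assert refs[b"refs/heads/topic"] == b"b" * 40

# The byte literal and the explicit encode produce the same object.
assert "\n".encode("utf-8") == b"\n"

# Why strip() is avoided here: it eats trailing 'c' and 'i' as well.
assert b"refs/heads/topic".strip(b"commit\t") == b"refs/heads/top"
```

The keys stay bytes throughout, so no encoding is ever named.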

		-- Pete

