* Re: Cross-Platform Version Control
@ 2009-05-12 15:06 Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 15:06 UTC (permalink / raw)
To: git
A good start for making Git cross-platform, would be storing the text
encoding of every file name and commit message together with the
commit. Currently, because Git is oblivious to the encodings and just
considers them as a series of bytes, there is no way to make them
cross-platform. It's as http://www.joelonsoftware.com/articles/Unicode.html
says, "It does not make sense to have a string without knowing what
encoding it uses." Without explicit encoding information, making a
system that works even on the three main platforms, let alone in all
countries and languages, is simply not possible.
On the other hand, if the encoding is explicitly stated in the
repository, then it is possible for platform and locale aware Git
clients to handle the file names and commit messages in whatever way
makes most sense for the platform (for example convert the file names
to the platform's encoding, if it differs from the committer's
platform encoding). Then it would also be possible to create a Mac
version of Git, which compensates for Mac OS X's file system's file
name encoding peculiarities. Also the system could then warn (on "git
add") if the data does not look like it has been encoded with the said
encoding.
If the platform's and the repository's encoding happen to be the same
(which in reality might be possible only inside a small company where
everybody is forced to use the same OS and is configured by a single
sysadmin), then no conversions need to be done. Also Git purists, who
think that the byte sequence representing a file name are more
important than the human readable version of the file name, may use
some configuration switch that disables all conversions - but even
then the current encoding should be stored together with the commit.
Are there any plans on storing the encoding information of file names
and commit messages in the Git repository? How much time would
implementing it take? Any ideas on how to maintain backwards
compatibility (for old commits that do not have the encoding
information)?
- Esko
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
@ 2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 16:13 ` Johannes Schindelin
2009-05-12 16:16 ` Jeff King
2009-05-12 18:28 ` Dmitry Potapov
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
2 siblings, 2 replies; 59+ messages in thread
From: Shawn O. Pearce @ 2009-05-12 15:14 UTC (permalink / raw)
To: Esko Luontola; +Cc: git
Esko Luontola <esko.luontola@gmail.com> wrote:
> Are there any plans on storing the encoding information of file names
> and commit messages in the Git repository?
Commit messages already store their encoding in an optional
"encoding" header if the message isn't stored in UTF-8, or
US-ASCII, which is a strict subset of UTF-8.
As for file names, no plans, its a sequence of bytes, but I think a
lot of people wind up using some subset of US-ASCII for their file
names, especially if their project is going to be cross platform.
--
Shawn.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 15:14 ` Shawn O. Pearce
@ 2009-05-12 16:13 ` Johannes Schindelin
2009-05-12 17:56 ` Esko Luontola
2009-05-12 16:16 ` Jeff King
1 sibling, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 16:13 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Esko Luontola, git
Hi,
On Tue, 12 May 2009, Shawn O. Pearce wrote:
> Esko Luontola <esko.luontola@gmail.com> wrote:
> > Are there any plans on storing the encoding information of file names
> > and commit messages in the Git repository?
>
> Commit messages already store their encoding in an optional "encoding"
> header if the message isn't stored in UTF-8, or US-ASCII, which is a
> strict subset of UTF-8.
>
> As for file names, no plans, its a sequence of bytes, but I think a
> lot of people wind up using some subset of US-ASCII for their file
> names, especially if their project is going to be cross platform.
Some context: this issue cropped up in msysGit, of course.
As to storing all file names in UTF-8, my point about Unicode being not
necessarily appropriate for everyone still stands.
UTF-8 _might_ be the de-facto standard for Linux filesystems, but
IMHO we should not take away the freedom for everybody to decide what they
want their file names to be encoded as.
However, I see that there might be a need to be able to encode the file
names differently, such as on Windows. IMHO the best solution would be
a config variable controlling the reencoding of file names.
For some time, it looked as if two people were interested in implementing
something like that (Peter and Robin IIRC), but efforts have stalled.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 16:13 ` Johannes Schindelin
@ 2009-05-12 16:16 ` Jeff King
2009-05-12 16:57 ` Johannes Schindelin
2009-05-13 16:26 ` Linus Torvalds
1 sibling, 2 replies; 59+ messages in thread
From: Jeff King @ 2009-05-12 16:16 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Esko Luontola, git
On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote:
> As for file names, no plans, its a sequence of bytes, but I think a
> lot of people wind up using some subset of US-ASCII for their file
> names, especially if their project is going to be cross platform.
Or they use a single encoding like utf8 so that there are no surprises.
You can still run into normalization problems with filenames on some
filesystems, though. Linus's name_hash code sets up the framework to
handle "these two names are actually equivalent", but right now I think
there is just code for handling case-sensitivity, not utf8 normalization
(but I just skimmed the code, so I might be wrong).
-Peff
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 16:16 ` Jeff King
@ 2009-05-12 16:57 ` Johannes Schindelin
2009-05-13 16:26 ` Linus Torvalds
1 sibling, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 16:57 UTC (permalink / raw)
To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git
Hi,
On Tue, 12 May 2009, Jeff King wrote:
> On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote:
>
> > As for file names, no plans, its a sequence of bytes, but I think a
> > lot of people wind up using some subset of US-ASCII for their file
> > names, especially if their project is going to be cross platform.
>
> Or they use a single encoding like utf8 so that there are no surprises.
> You can still run into normalization problems with filenames on some
> filesystems, though. Linus's name_hash code sets up the framework to
> handle "these two names are actually equivalent", but right now I think
> there is just code for handling case-sensitivity, not utf8 normalization
> (but I just skimmed the code, so I might be wrong).
Back then I actually started on a patch to make Git capable of determining
UTF-8 equivalence, but at the same time somebody started such an annoying
mail thread that I stopped working on the issue completely.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 16:13 ` Johannes Schindelin
@ 2009-05-12 17:56 ` Esko Luontola
2009-05-12 20:38 ` Johannes Schindelin
0 siblings, 1 reply; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 17:56 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Shawn O. Pearce, git
On 12.5.2009, at 19:13, Johannes Schindelin wrote:
> As to storing all file names in UTF-8, my point about Unicode being
> not
> necessarily appropriate for everyone still stands.
>
> UTF-8 _might_ be the de-facto standard for Linux filesystems, but
> IMHO we should not take away the freedom for everybody to decide
> what they
> want their file names to be encoded as.
>
> However, I see that there might be a need to be able to encode the
> file
> names differently, such as on Windows. IMHO the best solution would
> be
> a config variable controlling the reencoding of file names.
Exactly. The system should not force the use of a specific encoding.
It should only offer a recommendation, but be also fully compatible if
the user uses some other encoding.
That's why it's best to always store the information about what
encoding was used. It shouldn't matter, whether the data is encoded
with ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as
long as it is explicitly said that what the encoding is. Then the
reader of the data can best decide, how to show that data on the
current platform.
A config variable for defining, that what encoding should be used when
committing the file names, would make sense. Git should also try to
autodetect, that what encoding is used in its current environment. In
the case of UTF-8, you should also be able to specify which
normalization form is used (http://www.unicode.org/unicode/reports/
tr15/), or whether it is normalized at all.
For example, it should be possible to configure Git so, that when a
file is checked out on Mac, its file name is converted to the current
file system's encoding (UTF-8 NFD, I think), and when the file is
committed on Mac, the file name is normalized back to the same UTF-8
form as is used on Linux (UTF-8 NFC).
It would be nice to have config variables for saying, that all file
names in this repository must use UTF-8 NFC, and all commit messages
must use UTF-8 NFC (with Unix newlines). Then the Git client would
autodetect the current environment's encoding, and convert the text,
if necessary, to match the repository's encoding.
- Esko
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
@ 2009-05-12 18:28 ` Dmitry Potapov
2009-05-12 18:40 ` Martin Langhoff
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
2 siblings, 1 reply; 59+ messages in thread
From: Dmitry Potapov @ 2009-05-12 18:28 UTC (permalink / raw)
To: Esko Luontola; +Cc: git
On Tue, May 12, 2009 at 06:06:05PM +0300, Esko Luontola wrote:
> A good start for making Git cross-platform, would be storing the text
> encoding of every file name and commit message together with the commit.
> Currently, because Git is oblivious to the encodings and just considers
> them as a series of bytes, there is no way to make them cross-platform.
1. Git already stores the endcoding for all commit messages that are not
in UTF-8.
2. If you really want to be cross-platform portable, you should not use
any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
Filename Character Set)
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276
Dmitry
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 18:28 ` Dmitry Potapov
@ 2009-05-12 18:40 ` Martin Langhoff
2009-05-12 18:55 ` Jakub Narebski
0 siblings, 1 reply; 59+ messages in thread
From: Martin Langhoff @ 2009-05-12 18:40 UTC (permalink / raw)
To: Dmitry Potapov; +Cc: Esko Luontola, git
On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> 2. If you really want to be cross-platform portable, you should not use
> any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
> Filename Character Set)
> http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276
Would it make sense to have warnings at 'git add' time about
- filenames outside of that charset (as the strictest mode, perhaps
even default)
- filenames that have a potential conflict wrt case-sensitivity
- filenames that have potential conflict in the same tree due to
utf-8 encoding vagaries
MHO is that a strict "start your project portable from day one" mode
is best as a default. But I'd be happy with any default, actually ;-)
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 18:40 ` Martin Langhoff
@ 2009-05-12 18:55 ` Jakub Narebski
2009-05-12 21:43 ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
0 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-12 18:55 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Dmitry Potapov, Esko Luontola, git
Martin Langhoff <martin.langhoff@gmail.com> writes:
> On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > 2. If you really want to be cross-platform portable, you should not use
> > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
> > Filename Character Set)
> > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276
>
> Would it make sense to have warnings at 'git add' time about
>
> - filenames outside of that charset (as the strictest mode, perhaps
> even default)
> - filenames that have a potential conflict wrt case-sensitivity
> - filenames that have potential conflict in the same tree due to
> utf-8 encoding vagaries
>
> MHO is that a strict "start your project portable from day one" mode
> is best as a default. But I'd be happy with any default, actually ;-)
Somebody asked for a pre-add hook in the past; it would be good place
to put such check. But in meantime you can do it using pre-commit
hook instead, isn't it?
--
Jakub Narebski
Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 17:56 ` Esko Luontola
@ 2009-05-12 20:38 ` Johannes Schindelin
2009-05-12 21:16 ` Esko Luontola
0 siblings, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 20:38 UTC (permalink / raw)
To: Esko Luontola; +Cc: Shawn O. Pearce, git
Hi,
On Tue, 12 May 2009, Esko Luontola wrote:
> On 12.5.2009, at 19:13, Johannes Schindelin wrote:
> >As to storing all file names in UTF-8, my point about Unicode being not
> >necessarily appropriate for everyone still stands.
> >
> >UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO
> >we should not take away the freedom for everybody to decide what they
> >want their file names to be encoded as.
> >
> >However, I see that there might be a need to be able to encode the file
> >names differently, such as on Windows. IMHO the best solution would be
> >a config variable controlling the reencoding of file names.
>
> Exactly. The system should not force the use of a specific encoding. It
> should only offer a recommendation, but be also fully compatible if the
> user uses some other encoding.
>
> That's why it's best to always store the information about what encoding
> was used. It shouldn't matter, whether the data is encoded with
> ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it
> is explicitly said that what the encoding is. Then the reader of the
> data can best decide, how to show that data on the current platform.
>
> A config variable for defining, that what encoding should be used when
> committing the file names, would make sense. Git should also try to
> autodetect, that what encoding is used in its current environment. In
> the case of UTF-8, you should also be able to specify which
> normalization form is used
> (http://www.unicode.org/unicode/reports/tr15/), or whether it is
> normalized at all.
>
> For example, it should be possible to configure Git so, that when a file
> is checked out on Mac, its file name is converted to the current file
> system's encoding (UTF-8 NFD, I think), and when the file is committed
> on Mac, the file name is normalized back to the same UTF-8 form as is
> used on Linux (UTF-8 NFC).
>
> It would be nice to have config variables for saying, that all file
> names in this repository must use UTF-8 NFC, and all commit messages
> must use UTF-8 NFC (with Unix newlines). Then the Git client would
> autodetect the current environment's encoding, and convert the text, if
> necessary, to match the repository's encoding.
That is a nice analysis. How about implementing it?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 20:38 ` Johannes Schindelin
@ 2009-05-12 21:16 ` Esko Luontola
2009-05-13 0:23 ` Johannes Schindelin
0 siblings, 1 reply; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 21:16 UTC (permalink / raw)
To: git; +Cc: Johannes Schindelin, Shawn O. Pearce
Johannes Schindelin wrote on 12.5.2009 23:38:
> That is a nice analysis. How about implementing it?
>
Do we have here somebody, who knows Git's code well and is motivated to
implement this?
I don't think that I would be capable, because of not having used C
much, being new to Git's codebase and having too little time. But I can
help with the requirements specification, interaction design and system
testing.
--
Esko Luontola
www.orfjackal.net
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames
2009-05-12 18:55 ` Jakub Narebski
@ 2009-05-12 21:43 ` Heiko Voigt
2009-05-12 21:55 ` Jakub Narebski
0 siblings, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-12 21:43 UTC (permalink / raw)
To: Jakub Narebski
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
At the moment non-ascii encodings of file/usernames are not very well
supported by git. This will most likely change in the future but to
allow repositories to be portable among different file/operating systems
this check is enabled by default.
Signed-off-by: Heiko Voigt <heiko.voigt@mahr.de>
---
On Tue, May 12, 2009 at 11:55:39AM -0700, Jakub Narebski wrote:
> Somebody asked for a pre-add hook in the past; it would be good place
> to put such check. But in meantime you can do it using pre-commit
> hook instead, isn't it?
I actually had this in my queue to be submitted...
templates/hooks--pre-commit.sample | 33 +++++++++++++++++++++++++++++++++
1 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..83ff873 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,39 @@
#
# To enable this hook, rename this file to "pre-commit".
+# If you want to allow non-ascii filenames or usernames set
+# this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+function is_ascii () {
+ test -z "$(cat | sed -e "s/[\ -~]*//g")"
+ return $?
+}
+
+if [ "$allownonascii" != "true" ]
+then
+ # until git can handle non-ascii filenames gracefully
+ # prevent them to be added into the repository
+ if ! git diff --cached --name-only --diff-filter=A -z \
+ | tr "\0" "\n" | is_ascii; then
+ echo "Non-ascii filenames are not allowed !"
+ echo "Please rename the file ..."
+ exit 1
+ fi
+
+ # non-ascii username issue a warning in git gui so tell the
+ # user to change it
+ if ! git config user.name | is_ascii; then
+ echo "Please only use ascii characters in your username!"
+ exit 1
+ fi
+
+ if ! git config user.email | is_ascii; then
+ echo "Please only use ascii characters in your email!"
+ exit 1
+ fi
+fi
+
if git-rev-parse --verify HEAD 2>/dev/null
then
against=HEAD
--
1.6.3
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames
2009-05-12 21:43 ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
@ 2009-05-12 21:55 ` Jakub Narebski
2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
0 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-12 21:55 UTC (permalink / raw)
To: Heiko Voigt
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
On Tue, 12 May 2009, Heiko Voigt wrote:
> At the moment non-ascii encodings of file/usernames are not very well
> supported by git. This will most likely change in the future but to
> allow repositories to be portable among different file/operating systems
> this check is enabled by default.
> + # non-ascii username issue a warning in git gui so tell the
> + # user to change it
> + if ! git config user.name | is_ascii; then
> + echo "Please only use ascii characters in your username!"
> + exit 1
> + fi
> +
> + if ! git config user.email | is_ascii; then
> + echo "Please only use ascii characters in your email!"
> + exit 1
> + fi
Actually 1.) there is no easy way to avoid non-ASCII names at least
in user.name (I think they are not allowed in email), but 2.) there
is no trouble with non-ASCII encoding of commits, as they have
'encoding' header if it is not uft-8 (see *encoding* config variables).
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 21:16 ` Esko Luontola
@ 2009-05-13 0:23 ` Johannes Schindelin
2009-05-13 5:34 ` Esko Luontola
0 siblings, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13 0:23 UTC (permalink / raw)
To: Esko Luontola; +Cc: git, Shawn O. Pearce
Hi,
On Wed, 13 May 2009, Esko Luontola wrote:
> Johannes Schindelin wrote on 12.5.2009 23:38:
> > That is a nice analysis. How about implementing it?
> >
>
> Do we have here somebody, who knows Git's code well and is motivated to
> implement this?
>
> I don't think that I would be capable, because of not having used C
> much, being new to Git's codebase and having too little time.
Well, that rather settles things, no?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 0:23 ` Johannes Schindelin
@ 2009-05-13 5:34 ` Esko Luontola
2009-05-13 6:49 ` Alex Riesen
2009-05-13 10:15 ` Johannes Schindelin
0 siblings, 2 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-13 5:34 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: git, Shawn O. Pearce
Johannes Schindelin wrote on 13.5.2009 3:23:
> Well, that rather settles things, no?
>
There is need for the feature, but it's unfortunate that the Git
developers do not see its value. There are many users for whom using
non-ASCII names is necessary (for example all of Asia and most of
Europe), but now it seems that Bazaar is the only DVCS that handles
encodings correctly:
http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames
Let's see if I have time later this or next year to work on it. At least
it would be good practise in getting acquainted with a new codebase and
learning C. But it would be better for someone else do it, to get it
done within a reasonable amount of time.
I see that there are some tests in the /t directory. Which command will
run all of them, how good coverage do the tests have, how reproducable
and isolated they are, how many seconds does it take to run all the
tests? Is there some high-level documentation for new developers?
--
Esko Luontola
www.orfjackal.net
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 5:34 ` Esko Luontola
@ 2009-05-13 6:49 ` Alex Riesen
2009-05-13 10:15 ` Johannes Schindelin
1 sibling, 0 replies; 59+ messages in thread
From: Alex Riesen @ 2009-05-13 6:49 UTC (permalink / raw)
To: Esko Luontola; +Cc: Johannes Schindelin, git, Shawn O. Pearce
2009/5/13 Esko Luontola <esko.luontola@gmail.com>:
> Johannes Schindelin wrote on 13.5.2009 3:23:
>>
>> Well, that rather settles things, no?
>>
>
> There is need for the feature, but it's unfortunate that the Git developers
> do not see its value. There are many users for whom using non-ASCII names is
> necessary (for example all of Asia and most of Europe), but now it seems
> that Bazaar is the only DVCS that handles encodings correctly:
> http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames
Many Git developers just use systems which don't care about the file names
encoding at all and just keep the names as they were. So interoperability
problem does not exist for them. So, they either don't need the feature,
or can trivially avoid or workaround any problems.
> I see that there are some tests in the /t directory. Which command will run
> all of them, how good coverage do the tests have, how reproducable and
> isolated they are, how many seconds does it take to run all the tests? Is
> there some high-level documentation for new developers?
make test. See also t/README. We like them. I always run test suite before
deployment and sometimes run it just for fun (unless I have to run it
on Windows).
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 5:34 ` Esko Luontola
2009-05-13 6:49 ` Alex Riesen
@ 2009-05-13 10:15 ` Johannes Schindelin
[not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
1 sibling, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13 10:15 UTC (permalink / raw)
To: Esko Luontola; +Cc: git, Shawn O. Pearce
Hi,
On Wed, 13 May 2009, Esko Luontola wrote:
> Johannes Schindelin wrote on 13.5.2009 3:23:
> > Well, that rather settles things, no?
>
> There is need for the feature, but it's unfortunate that the Git
> developers do not see its value.
I see a value. But it is not my itch. And since it is your itch and you
said that you will not do anything about it (I don't count writing emails
here ;-), I concluded that it settles the issue.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Cross-Platform Version Control
[not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
@ 2009-05-13 10:41 ` John Tapsell
2009-05-13 13:42 ` Jay Soffian
0 siblings, 1 reply; 59+ messages in thread
From: John Tapsell @ 2009-05-13 10:41 UTC (permalink / raw)
To: git
2009/5/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Wed, 13 May 2009, Esko Luontola wrote:
>
>> Johannes Schindelin wrote on 13.5.2009 3:23:
>> > Well, that rather settles things, no?
>>
>> There is need for the feature, but it's unfortunate that the Git
>> developers do not see its value.
>
> I see a value. But it is not my itch. And since it is your itch and you
> said that you will not do anything about it (I don't count writing emails
> here ;-), I concluded that it settles the issue.
I don't know why the git developers are being so hostile/dismisisve,
but I also hope that somebody volunteers to fix this.
Esko, you have my moral support :-)
John
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 10:41 ` John Tapsell
@ 2009-05-13 13:42 ` Jay Soffian
2009-05-13 13:44 ` Alex Riesen
0 siblings, 1 reply; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 13:42 UTC (permalink / raw)
To: John Tapsell; +Cc: git
On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
> I don't know why the git developers are being so hostile/dismisisve,
Are you serious?
j.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:42 ` Jay Soffian
@ 2009-05-13 13:44 ` Alex Riesen
2009-05-13 13:50 ` Jay Soffian
0 siblings, 1 reply; 59+ messages in thread
From: Alex Riesen @ 2009-05-13 13:44 UTC (permalink / raw)
To: Jay Soffian; +Cc: John Tapsell, git
2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>> I don't know why the git developers are being so hostile/dismisisve,
>
> Are you serious?
>
...because we'll kill you if aren't >:-E
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:44 ` Alex Riesen
@ 2009-05-13 13:50 ` Jay Soffian
2009-05-13 13:57 ` John Tapsell
0 siblings, 1 reply; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 13:50 UTC (permalink / raw)
To: Alex Riesen; +Cc: John Tapsell, git
On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>> I don't know why the git developers are being so hostile/dismisisve,
>>
>> Are you serious?
>>
>
> ...because we'll kill you if aren't >:-E
I'm just flabbergasted by some people's expectations. Perhaps John
doesn't realize the git developers are all volunteers, and that it is
never appropriate to criticize a volunteer. A "thank you for all your
hard work on git" would have done nicely.
j.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:50 ` Jay Soffian
@ 2009-05-13 13:57 ` John Tapsell
2009-05-13 15:27 ` Nicolas Pitre
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: John Tapsell @ 2009-05-13 13:57 UTC (permalink / raw)
To: Jay Soffian; +Cc: Alex Riesen, git
2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>>> I don't know why the git developers are being so hostile/dismisisve,
>>>
>>> Are you serious?
>>>
>>
>> ...because we'll kill you if aren't >:-E
>
> I'm just flabbergasted by some people's expectations. Perhaps John
> doesn't realize the git developers are all volunteers, and that it is
> never appropriate to criticize a volunteer. A "thank you for all your
> hard work on git" would have done nicely.
I'm as much of an open source developer as anyone else here. I spend
a huge amount of my time programming for KDE. But I've never told a
user "well that settles it" because they won't code it themselves :-/
I certaintly get a huge number of bug/wishes that I can't/won't code
myself, but I try to be a bit more diplomatic about it.
But then the kernel mailing lists tend to be a lot more.. direct..
than the kde mailing lists, so I guess it comes from that. Requiring
people to have a thick skin and all that.
John
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:57 ` John Tapsell
@ 2009-05-13 15:27 ` Nicolas Pitre
2009-05-13 16:22 ` Johannes Schindelin
2009-05-13 17:24 ` Andreas Ericsson
2009-05-14 1:49 ` Miles Bader
2 siblings, 1 reply; 59+ messages in thread
From: Nicolas Pitre @ 2009-05-13 15:27 UTC (permalink / raw)
To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git
On Wed, 13 May 2009, John Tapsell wrote:
> I'm as much of an open source developer as anyone else here. I spend
> a huge amount of my time programming for KDE. But I've never told a
> user "well that settles it" because they won't code it themselves :-/
> I certaintly get a huge number of bug/wishes that I can't/won't code
> myself, but I try to be a bit more diplomatic about it.
> But then the kernel mailing lists tend to be a lot more.. direct..
> than the kde mailing lists, so I guess it comes from that. Requiring
> people to have a thick skin and all that.
This is not the kernel mailing list. In fact this list is quite
friendlier and accommodating that the kernel list.
The remark alluded above comes from _one_ of the git developers. And
Dscho is apparently in a rather sad mood these days. While the substance
of Dscho's remark is entirely pertinent, it would be wrong to use its
form and style as a characterization of git developers in general.
Nicolas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 15:27 ` Nicolas Pitre
@ 2009-05-13 16:22 ` Johannes Schindelin
0 siblings, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13 16:22 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: John Tapsell, Jay Soffian, Alex Riesen, git
Hi,
On Wed, 13 May 2009, Nicolas Pitre wrote:
> On Wed, 13 May 2009, John Tapsell wrote:
>
> > I'm as much of an open source developer as anyone else here. I spend
> > a huge amount of my time programming for KDE. But I've never told a
> > user "well that settles it" because they won't code it themselves :-/
> > I certaintly get a huge number of bug/wishes that I can't/won't code
> > myself, but I try to be a bit more diplomatic about it.
> >
> > But then the kernel mailing lists tend to be a lot more.. direct..
> > than the kde mailing lists, so I guess it comes from that. Requiring
> > people to have a thick skin and all that.
>
> This is not the kernel mailing list. In fact this list is quite
> friendlier and accommodating that the kernel list.
>
> The remark alluded above comes from _one_ of the git developers. And
> Dscho is apparently in a rather sad mood these days. While the substance
> of Dscho's remark is entirely pertinent, it would be wrong to use its
> form and style as a characterization of git developers in general.
Even if I were in a better mood, the whole thread has a back story on an
msysGit issue, and this led me to try to stop what I feared would become a
rather long mail thread without much of an outcome, such as that infamous
thread about MacOSX UTF-8 filename handling.
Alas, it seems that Robin is willing to work on the issues, so my fears
have been totally and completely unfounded.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 16:16 ` Jeff King
2009-05-12 16:57 ` Johannes Schindelin
@ 2009-05-13 16:26 ` Linus Torvalds
2009-05-13 17:12 ` Linus Torvalds
1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 16:26 UTC (permalink / raw)
To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git
On Tue, 12 May 2009, Jeff King wrote:
>
> Or they use a single encoding like utf8 so that there are no surprises.
> You can still run into normalization problems with filenames on some
> filesystems, though. Linus's name_hash code sets up the framework to
> handle "these two names are actually equivalent", but right now I think
> there is just code for handling case-sensitivity, not utf8 normalization
> (but I just skimmed the code, so I might be wrong).
utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But
quite frankly, the index is only part of it, and probably not the worst
part.
The real pain of filename handling is all the "read tree recursively with
readdir()" issues. Along with just an absolute sh*t-load of issues about
what to do when people ended up using different versions of the "same"
name in different branches.
There's also the issue that "cross-platform" really can be a pretty damn
big pain. What do you do for platforms that simply are pure shit? I
realize that OS X people have a hard time accepting it, but OS X
filesystems are generally total and utter crap - even more so than
Windows.
Yes, yes, you can tell OS X that case matters, but that's not the normal
case - and what do you do with projects that simply _do_ care about case.
The kernel is one such project.
Sure, you can "encode" the filenames on such broken filesystems in a way
that they'd be different - but that won't really help the project, since
makefiles etc won't work anyway.
So one reason I didn't bother with utf-8 is that the much more fundamental
issues are simply in plain old 7-bit US-ASCII.
That said, if the only issue is that you want to encode regular utf-8 in a
coherent way (and ignore the case issues), then we could probably do that
part fairly easily with a "convert_to_internal()" and
"convert_to_filename()" thing that acts very much like the CRLF conversion
(except on filenames, not data).
And yes, it's probably worth doing, since we'd need that for fuller case
support anyway.
It's just a fair amount of churn - not fundamentally _hard_, but not
trivial either. And it needs a _lot_ of care, and a fair amount of
testing that is probably hard to do on sane filesystems (ie the case where
the filesystem actually _changes_ the name is going to be hard to test on
anything sane).
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 16:26 ` Linus Torvalds
@ 2009-05-13 17:12 ` Linus Torvalds
2009-05-13 17:31 ` Andreas Ericsson
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 17:12 UTC (permalink / raw)
To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Linus Torvalds wrote:
>
> utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But
> quite frankly, the index is only part of it, and probably not the worst
> part.
>
> The real pain of filename handling is all the "read tree recursively with
> readdir()" issues. Along with just an absolute sh*t-load of issues about
> what to do when people ended up using different versions of the "same"
> name in different branches.
Btw, if people care mainly just about OS X, and don't worry so much about
case, but about the idiotic and insane OS X behavior of turning UTF-8
filenames into that crazy NFD format, here's a simple patch that may be
useful for that.
There _will_ certainly be other places, but this handles the one big case
of "read_directory_recursive()", and can turn NFD into the sane NFC
format.
Since OS X will then accept NFC (and internally turn it back to NFD) when
you pass them as filenames, that means that converting the other way is
not necessary.
NOTE NOTE NOTE! This really just handles one case, and is not enough for
any kind of general case. For example, it does NOT handle the case where
you do
git add filename_with_åäö
explicitly, because if the "filename_with_åäö" is done using NFD
(tab-completion etc), now git won't _match_ it with the filename it reads
using readdir() any more (which got converted to NFC), so at a minimum
we'd need to do that crazy NFD->NFC conversion in all the pathspecs too.
See "get_pathspec()" in setup.c for that latter case.
But with that, and this crazy thing, OS X users might be already a lot
better off. Totally untested, of course.
Oh, and somebody needs to fill in that
convert_name_from_nfd_to_nfc()
implementation. It's designed so that if it notices that the string is
just plain US-ASCII, it can return 0 and no extra work is done. That, in
turn, can easily be done by some simple and efficient pre-processign that
checks that there are no high bits set (on a 64-bit platform, do it 8
characters at a time with a "& 0x8080808080808080"), so that the common
case doesn't need to have barely any overhead at all.
Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do
the actual normalization if you find characters with the high bit set. And
since I know that the OS X filesystems are so buggy as to not even do that
whole NFD thing right, there is probably some OS-X specific "use this for
filesystem names" conversion function.
Hmm. Anybody want to take this on? It really shouldn't be too complex to
get it working for the common case on just OS X. It's really the case
sensitivity that is the biggest problem, if you ignore that for now, the
problem space is _much_ smaller.
In other words, I think we can reasonably easily support a subset of
_common_ issues with some trivial patches like this. But getting it right
in _all_ the cases is going to be much more work (there are lots of other
uses of "readdir()" too, this one just happens to be one of the more
central ones).
Of course, it probably makes sense to have a whole "git_readdir()" that
does this thing in general. That "create_full_path()" thing makes sense
regardless, though, in that it also simplifies a lot of "baselen+len"
usage in just "len".
Linus
---
dir.c | 40 ++++++++++++++++++++++++++++++++--------
1 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
}
/*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+ char temp[256], nlen;
+ nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+ if (nlen) {
+ len = nlen;
+ d_name = temp;
+ }
+#endif
+ memcpy(fullname + baselen, d_name, len + 1);
+ return baselen + len;
+}
+
+/*
* Read a directory tree. We currently ignore anything but
* directories, regular files and symlinks. That's because git
* doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
/* Ignore overly long pathnames! */
if (len + baselen + 8 > sizeof(fullname))
continue;
- memcpy(fullname + baselen, de->d_name, len+1);
- if (simplify_away(fullname, baselen + len, simplify))
+ len = create_full_path(fullname, baselen, de->d_name, len);
+ if (simplify_away(fullname, len, simplify))
continue;
dtype = DTYPE(de);
exclude = excluded(dir, fullname, &dtype);
if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
- && in_pathspec(fullname, baselen + len, simplify))
- dir_add_ignored(dir, fullname, baselen + len);
+ && in_pathspec(fullname, len, simplify))
+ dir_add_ignored(dir, fullname, len);
/*
* Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
default:
continue;
case DT_DIR:
- memcpy(fullname + baselen + len, "/", 2);
+ memcpy(fullname + len, "/", 2);
len++;
- switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+ switch (treat_directory(dir, fullname, len, simplify)) {
case show_directory:
if (exclude != !!(dir->flags
& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
break;
case recurse_into_directory:
contents += read_directory_recursive(dir,
- fullname, fullname, baselen + len, 0, simplify);
+ fullname, fullname, len, 0, simplify);
continue;
case ignore_directory:
continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
if (check_only)
goto exit_early;
else
- dir_add_name(dir, fullname, baselen + len);
+ dir_add_name(dir, fullname, len);
}
exit_early:
closedir(fdir);
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:57 ` John Tapsell
2009-05-13 15:27 ` Nicolas Pitre
@ 2009-05-13 17:24 ` Andreas Ericsson
2009-05-14 1:49 ` Miles Bader
2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-13 17:24 UTC (permalink / raw)
To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git
John Tapsell wrote:
> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
>>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>>>> I don't know why the git developers are being so hostile/dismisisve,
>>>> Are you serious?
>>>>
>>> ...because we'll kill you if aren't >:-E
>> I'm just flabbergasted by some people's expectations. Perhaps John
>> doesn't realize the git developers are all volunteers, and that it is
>> never appropriate to criticize a volunteer. A "thank you for all your
>> hard work on git" would have done nicely.
>
> I'm as much of an open source developer as anyone else here. I spend
> a huge amount of my time programming for KDE. But I've never told a
> user "well that settles it" because they won't code it themselves :-/
> I certaintly get a huge number of bug/wishes that I can't/won't code
> myself, but I try to be a bit more diplomatic about it.
> But then the kernel mailing lists tend to be a lot more.. direct..
> than the kde mailing lists, so I guess it comes from that. Requiring
> people to have a thick skin and all that.
>
I think much of the perceived malignancy stems from the fact that the
git list has a high ratio of developer-to-luser mailings on it, being
by nature a developer tool most of the time. When the unaware user
appears on the list with demands rather than polite requests, they're
treated that much harder. Especially by the developer who happens to
be, as it were, the butt of the request.
Personally, I've only ever found Dscho being anything but friendly on
this list, and even then, I really didn't find it offensive. If viewed
in a happy mood, it matches quite nicely with a swedish sketch whose
theme is "men ja ente bitter". It's often quite funny, really :-)
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
http://nordicmeetonnagios.op5.org/
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 17:12 ` Linus Torvalds
@ 2009-05-13 17:31 ` Andreas Ericsson
2009-05-13 17:46 ` Linus Torvalds
2009-05-13 20:57 ` Matthias Andree
2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-13 17:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git
Linus Torvalds wrote:
>
> Of course, it probably makes sense to have a whole "git_readdir()" that
> does this thing in general. That "create_full_path()" thing makes sense
> regardless, though, in that it also simplifies a lot of "baselen+len"
> usage in just "len".
>
In a flash of premonitory insight, libgit2 has
gitfo_foreach_dirent(path, callback)
which would probably be well suited for this kind of thing.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
http://nordicmeetonnagios.op5.org/
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 17:12 ` Linus Torvalds
2009-05-13 17:31 ` Andreas Ericsson
@ 2009-05-13 17:46 ` Linus Torvalds
2009-05-13 18:26 ` Martin Langhoff
2009-05-13 20:57 ` Matthias Andree
2 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 17:46 UTC (permalink / raw)
To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Linus Torvalds wrote:
>
> Of course, it probably makes sense to have a whole "git_readdir()" that
> does this thing in general.
Actually, the more I think about that, the less true I think it is.
It _sounds_ like a nice simplification ("just do it once in readdir, and
forget about it everywhere else"), but it's in fact a stupid thing to do.
Why?
If we _ever_ want to fix this in the general case, then the code that does
the readdir() will actually have to remember both the "raw filesystem"
form _and_ the "cleaned-up utf-8 form".
Why? Because when we do readdir(), we'll also do 'lstat()' on the end
result to check the types, and opendir() in case it's a directory and we
then want to do things recursively etc. And that happens to work on OS X
(because we can use our "fixed" filename for lstat too), but it does not
work in the general case.
And you can say "well, just do the stat inside the wrapped readdir()", but
that doesn't work _either_, since
- we don't want to do the lstat() if it's unnecessary. Even if we don't
have "de->d_type" information, we can often avoid the need for it, if
we can tell that the name isn't interestign (due to being ignored).
Avoiding the lstat is a huge performance issue for cold-cache cases.
It's basically a seek.
So we really want to do the lstat() later, which implies that the
caller needs to know _both_ the original "real" filesystem name _and_
the converted one.
- it doesn't handle the opendir() case anyway - so the end result is that
a real implementation will _always_ need to carry around both the
"filesystem view" filename _and_ the "what we've converted it into".
Now, the point of the patch I sent out was that for the specific case of
OS X, which does UTF-8 conversions (wrong) but also is happy to get our
properly normalized name, we don't care. So my patch is "correct" for that
special case - and so would a plain readdir() wrapper be.
But my patch is _also_ correct for the case where a readdir() wrapper
would do the wrong thing. My patch doesn't _handle_ it (since it doesn't
change the code to pass both "filesystem view" and "cleaned-up view"
pathnames), but the patch I sent out also doesn't make it any harder to do
right.
In contrast, doing a readdir() wrapper makes it much harder to do right
later, because it's just doing the conversion at the wrong level (you
could make that "wrapper" return both the original and the fixed
filename, but at that point the wrapper doesn't really help - you might
as well just have the "convert" function, and it would be a hell of a lot
more obvious what is really going on).
So I take it back. A readdir() wrapper is not a good idea. It gets us a
tiny bit of the way, but it would actually take us a step back from the
"real" solution.
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 17:46 ` Linus Torvalds
@ 2009-05-13 18:26 ` Martin Langhoff
2009-05-13 18:37 ` Linus Torvalds
0 siblings, 1 reply; 59+ messages in thread
From: Martin Langhoff @ 2009-05-13 18:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, May 13, 2009 at 7:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I take it back. A readdir() wrapper is not a good idea. It gets us a
> tiny bit of the way, but it would actually take us a step back from the
> "real" solution.
Do we need to take the real solution to the core of git?
What I am wondering is whether we can keep this simple in git
internals and catch problem filenames at git-add time. This would
allow git to keep treating filenames as a bag of bytes, and it does a
better thing for users.
In cross platform projects, most users don't even know that there are
problems, and even if they do, they don't know what the problems are.
If git add can be told to warn & refuse to add a path with portability
problems, then we educate our users, prevent them from committing
filenames that will later cause trouble to others in their projects,
etc.
from-the-keep-it-simple-and-informative-dept,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 18:26 ` Martin Langhoff
@ 2009-05-13 18:37 ` Linus Torvalds
2009-05-13 21:04 ` Theodore Tso
2009-05-13 21:08 ` Daniel Barkalow
0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 18:37 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Martin Langhoff wrote:
>
> Do we need to take the real solution to the core of git?
Well, I suspect that if we really want to support it, then we'd better.
> What I am wondering is whether we can keep this simple in git
> internals and catch problem filenames at git-add time.
I can almost guarantee that it will just cause more problems than it
solves, and generate some nasty cases that just aren't solvable.
Because it really isn't just "git add". It's every single thing that does
a lstat() on a filename inside of git.
Now, the simple OS X case is not a huge problem, since the lstat will
succeed with the fixed-up filename too. But as mentioned, the OS X case is
the thing that doesn't need a lot of infrastructure _anyway_ - I can
almost guarantee that my posted patch (with the added setup.c stuff for
get_pathspec()) is going to be _fewer_ lines than some wrapper logic.
Note: in all of the above, I assume that people care more about just plain
UTF characters (and the insane NFD form OS X uses) than about worrying
about the _really_ subtle issues of case-independence. Those are a major
pain, but they will need even more "internal" support, because there
simply isn't any sane wrapping method.
(You could wrap everything to force lower-casing of all filesystem ops or
something, but that would not be acceptable to any sane environment. So in
reality you need to accept mixed-case things, and then there is no way to
know from the "outside" whether one external mixed-case thing matches some
internal index mixed-case thing).
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 17:12 ` Linus Torvalds
2009-05-13 17:31 ` Andreas Ericsson
2009-05-13 17:46 ` Linus Torvalds
@ 2009-05-13 20:57 ` Matthias Andree
2009-05-13 21:10 ` Linus Torvalds
2 siblings, 1 reply; 59+ messages in thread
From: Matthias Andree @ 2009-05-13 20:57 UTC (permalink / raw)
To: Linus Torvalds, Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git
Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds
<torvalds@linux-foundation.org>:
> Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to
> do the actual normalization if you find characters with the high bit
> set. And since I know that the OS X filesystems are so buggy as to not
> even do that whole NFD thing right, there is probably some OS-X specific
> "use this for
> filesystem names" conversion function.
Sorry for interrupting, but NF_K_C? You don't want that (K for
compatibility, rather than canonical, normalization) for anything except
normalizing temporary variables inside strcasecmp(3) or similar. Probably
not even that. The normalizations done are often irreversible and also
surprising. You don't want to turn 2³.c into 23.c, do you?
--
Matthias Andree
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 18:37 ` Linus Torvalds
@ 2009-05-13 21:04 ` Theodore Tso
2009-05-13 21:20 ` Linus Torvalds
2009-05-13 21:08 ` Daniel Barkalow
1 sibling, 1 reply; 59+ messages in thread
From: Theodore Tso @ 2009-05-13 21:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, May 13, 2009 at 11:37:28AM -0700, Linus Torvalds wrote:
> Note: in all of the above, I assume that people care more about just plain
> UTF characters (and the insane NFD form OS X uses) than about worrying
> about the _really_ subtle issues of case-independence. Those are a major
> pain, but they will need even more "internal" support, because there
> simply isn't any sane wrapping method.
Stupid question --- if we get something that works for Windows and
MacOS X, is there any reason why we need to solve the general problem
of case-insentive filesystems? It's really backwards compatibility
with Legacy OS's that most important, right? Are there any other
systems other than Windows and Mac OS X which (a) perpetrate case
insensitivity on application programmers, and (b) which current or
future git users are likely to care about?
- Ted
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 18:37 ` Linus Torvalds
2009-05-13 21:04 ` Theodore Tso
@ 2009-05-13 21:08 ` Daniel Barkalow
2009-05-13 21:29 ` Linus Torvalds
1 sibling, 1 reply; 59+ messages in thread
From: Daniel Barkalow @ 2009-05-13 21:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Linus Torvalds wrote:
> On Wed, 13 May 2009, Martin Langhoff wrote:
> >
> > Do we need to take the real solution to the core of git?
>
> Well, I suspect that if we really want to support it, then we'd better.
>
> > What I am wondering is whether we can keep this simple in git
> > internals and catch problem filenames at git-add time.
>
> I can almost guarantee that it will just cause more problems than it
> solves, and generate some nasty cases that just aren't solvable.
>
> Because it really isn't just "git add". It's every single thing that does
> a lstat() on a filename inside of git.
>
> Now, the simple OS X case is not a huge problem, since the lstat will
> succeed with the fixed-up filename too.
I'm not seeing what the general case is, and how it could possibly behave.
There's the "insensitive" behavior: if you create "foo" and look for
"FOO", it's there, but readdir() reports "foo".
There's the "converting" behavior: if you create "foo", readdir() reports
"FOO", but lstat("foo") returns it.
The obvious general case is: if you create "foo", readdir() reports "FOO",
and lstat("foo") doesn't find a match. But if you create "foo" again... it
doesn't find "foo", so it creates a new file, which it also calls "FOO",
and the filesystem now has two files with identical names?
It seems to me that the limits of minimally functional, non-inode-losing
filesystems are: lstat() might take a filename and return the data for a
non-byte-identical filename; open(name, O_CREAT|O_EXCL) might replace the
given name with a non-byte-identical filename. But surely open(name) and
lstat(name) (with the same name) must find the same file, even if
readdir() would report it with a different name.
And I assume that a filesystem that rejected any non-NFD filenames or any
non-NFC filenames would be totally unusable, in that users will manage to
get unnormalized filenames into programs and find that the filesystem just
doesn't work.
-Daniel
*This .sig left intentionally blank*
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 20:57 ` Matthias Andree
@ 2009-05-13 21:10 ` Linus Torvalds
2009-05-13 21:30 ` Jay Soffian
2009-05-13 21:47 ` Matthias Andree
0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:10 UTC (permalink / raw)
To: Matthias Andree; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Matthias Andree wrote:
> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds
> <torvalds@linux-foundation.org>:
>
> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do
> > the actual normalization if you find characters with the high bit set. And
> > since I know that the OS X filesystems are so buggy as to not even do that
> > whole NFD thing right, there is probably some OS-X specific "use this for
> > filesystem names" conversion function.
>
> Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility,
> rather than canonical, normalization) for anything except normalizing
> temporary variables inside strcasecmp(3) or similar. Probably not even that.
> The normalizations done are often irreversible and also surprising. You don't
> want to turn 2³.c into 23.c, do you?
No, you're right. We want just plain NFC. I just googled for how some
other projects handled this, and found the stringprep thing in a post
about rsync, and didn't look any closer.
But yes, you're absolutely right, stringprep is total crap, and nfkc is
horrible.
I have no idea of what library to use, though. For perl, there's
Unicode::Normalize, but that's likely still subtly incorrect for the OS-X
case due to the filesystem not using _strict_ NFD.
I have this dim memory of somebody actually pointing to the documentation
of exactly which characters OS X ends up decomposing. Maybe we could just
do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE
UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the
OS X case is the only one we need to care about?
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 21:04 ` Theodore Tso
@ 2009-05-13 21:20 ` Linus Torvalds
0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:20 UTC (permalink / raw)
To: Theodore Tso
Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Theodore Tso wrote:
>
> Stupid question --- if we get something that works for Windows and
> MacOS X, is there any reason why we need to solve the general problem
> of case-insentive filesystems?
Qutie frankly, I don't think we're even very close to getting anything
that works for Windows of OS X.
Case-insensitivity is _hard_.
The "easy" case is to just handle the OS X craxy pseudo-NFD format, and at
least turn that into NFC (and perhaps add a config option to do latin1 and
EUC-JP to utf-8 too) and. At that point, we at least handle regular utf-8
the same way.
Doing the latin1/EUC-JP thing would actually to some degree be more
interesting than the OS X NFD case, because that really does require
two-way conversion, and we can "test" that even on sane filesystems (ie
play at having a Latin1 filesystem).
That said, I suspect there aren't that many people who care about latin1
filesystems. I dunno about EUC-JP (and variants - for all I know,
shift-JIS and other cases may be the more common ones).
Of course, if we do everything right, maybe the windows people would
actually like us to keep the filesystem-native representation in UTF-16LE
or whatever the crazy format is that Windows really uses deep down.
My point being that all of these things happen even without the added
worry about case. And in many ways, not worrying about case should
probably be the first step. We do have some support for worrying about
case, but trying to solve both things at the same time isn't going to be
workable, I suspect.
Case insensitivity should never ever involve a _conversion_ (if it does,
you get all kinds of crazy behavior), it's just purely a _comparison_
issue, so the two really are fundamentally different.
Of course, the reason OS-X seems to be so messed up is exactly that the
morons at Apple didn't understand the difference between conversion and
comparison, and mixed them up.
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 21:08 ` Daniel Barkalow
@ 2009-05-13 21:29 ` Linus Torvalds
0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:29 UTC (permalink / raw)
To: Daniel Barkalow
Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, 13 May 2009, Daniel Barkalow wrote:
> >
> > Now, the simple OS X case is not a huge problem, since the lstat will
> > succeed with the fixed-up filename too.
>
> I'm not seeing what the general case is, and how it could possibly behave.
Here's a simple example.
Let's say that your company uses Latin1 internally for your filesystems,
because your tools really aren't utf-8 ready.
This is NOT AT ALL unnatural - it's how lots of people used to work with
Linux over the years, and it's largely how people still use FAT, I suspect
(except it's not latin1, it's some windows-specific 8-bits-per-character
mapping).
IOW, if you have a file called 'åäö', it literally is encoded as
'\xe5\xe4\xf6' (if you wonder why I picked those three letters, it's
because they are the regular extra letters in Swedish - Swedish has 29
letters in its alphabet, and those three letters really are letters in
their own right, they are NOT 'a' and 'o' with some dots/rings on top).
IOW, if you open such a file, you need to use those three bytes.
Now, even if you happen to have an OS and use Latin1 on disk, you may
realize that you'd like to interact with others that use UTF-8, and would
want to have your git archive that you export use nice portable UTF-8.
But you absolutely MUST NOT just do a conversion at "readdir()" time. If
you do that, then your three-byte filename turns into a six-byte utf-8
sequence of '\xc3\xa5\xc3\xa4\xc3\xb6' and the thing is, now "lstat()"
won't work on that sequence.
So obviously you could always turn things _back_ for lstat(), but quite
frankly, that's (a) insane (b) incompetent and (c) not even always
well-defined.
> There's the "insensitive" behavior: if you create "foo" and look for
> "FOO", it's there, but readdir() reports "foo".
>
> There's the "converting" behavior: if you create "foo", readdir() reports
> "FOO", but lstat("foo") returns it.
Then there's the behaviour above: you want your git repository to have
utf-8, but your filesystem doesn't convert anything at all, and all your
regular tools (think editors etc) are all Latin1.
Latin1 is going away, I hope, but I bet EUC-JP etc still exist.
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 21:10 ` Linus Torvalds
@ 2009-05-13 21:30 ` Jay Soffian
2009-05-13 21:47 ` Matthias Andree
1 sibling, 0 replies; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 21:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthias Andree, Jeff King, Shawn O. Pearce, Esko Luontola, git
On Wed, May 13, 2009 at 5:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I have this dim memory of somebody actually pointing to the documentation
> of exactly which characters OS X ends up decomposing.
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
http://developer.apple.com/technotes/tn/tn1150table.html
j.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 21:10 ` Linus Torvalds
2009-05-13 21:30 ` Jay Soffian
@ 2009-05-13 21:47 ` Matthias Andree
1 sibling, 0 replies; 59+ messages in thread
From: Matthias Andree @ 2009-05-13 21:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git
Am 13.05.2009, 23:10 Uhr, schrieb Linus Torvalds
<torvalds@linux-foundation.org>:
>
>
> On Wed, 13 May 2009, Matthias Andree wrote:
>
>> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds
>> <torvalds@linux-foundation.org>:
>>
>> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something
>> to do
>> > the actual normalization if you find characters with the high bit
>> set. And
>> > since I know that the OS X filesystems are so buggy as to not even do
>> that
>> > whole NFD thing right, there is probably some OS-X specific "use this
>> for
>> > filesystem names" conversion function.
>>
>> Sorry for interrupting, but NF_K_C? You don't want that (K for
>> compatibility,
>> rather than canonical, normalization) for anything except normalizing
>> temporary variables inside strcasecmp(3) or similar. Probably not even
>> that.
>> The normalizations done are often irreversible and also surprising. You
>> don't
>> want to turn 2³.c into 23.c, do you?
>
> No, you're right. We want just plain NFC. I just googled for how some
> other projects handled this, and found the stringprep thing in a post
> about rsync, and didn't look any closer.
>
> But yes, you're absolutely right, stringprep is total crap, and nfkc is
> horrible.
Crap? It's just besides the purpose and some limited form of fuzzy match.
Anyways...
> I have no idea of what library to use, though. For perl, there's
> Unicode::Normalize, but that's likely still subtly incorrect for the OS-X
> case due to the filesystem not using _strict_ NFD.
Perhaps ICU (ICU4C), from http://site.icu-project.org/
--
Matthias Andree
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-13 13:57 ` John Tapsell
2009-05-13 15:27 ` Nicolas Pitre
2009-05-13 17:24 ` Andreas Ericsson
@ 2009-05-14 1:49 ` Miles Bader
2 siblings, 0 replies; 59+ messages in thread
From: Miles Bader @ 2009-05-14 1:49 UTC (permalink / raw)
To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git
John Tapsell <johnflux@gmail.com> writes:
> I'm as much of an open source developer as anyone else here. I spend
> a huge amount of my time programming for KDE. But I've never told a
> user "well that settles it" because they won't code it themselves :-/
FWIW, Johannes' use of "Well, that rather settles things, no?" in this
thread this didn't strike me as being rude or truly dismissive (even
though it's literally so).
It seemed more just a timely and to the point reminder that however fun
it is to talk about random feature X, someone's gotta do the work if
it's going to actually be implemented, and that the direction of git
development very much follows the whims of those doing the actual
hacking (perhaps more so than other projects).
[and I don't even have particularly thick skin, I think -- I'm often
very annoyed by brusqueness one sees on many developer mailing lists...]
-Miles
--
Acquaintance, n. A person whom we know well enough to borrow from, but not
well enough to lend to.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 18:28 ` Dmitry Potapov
@ 2009-05-14 13:48 ` Peter Krefting
2009-05-14 19:58 ` Esko Luontola
2 siblings, 1 reply; 59+ messages in thread
From: Peter Krefting @ 2009-05-14 13:48 UTC (permalink / raw)
To: Esko Luontola; +Cc: git
Esko Luontola:
> A good start for making Git cross-platform, would be storing the text
> encoding of every file name and commit message together with the commit.
Is it really necessary to store the encoding for every single file name,
should it not be enough to just store encoding information for all file
names at once (i.e., for the object that contains the list of file names and
their associated blobs)?
I did publish, as a request for comments, the beginnings of a patch that
would change the Windows version of Git to expect file names to be UTF-8
encoded. There were some comments about it, especially that I could not just
assume that UTF-8 was the right thing to assume.
Perhaps if we added some meta-data, maybe using the same fall-back mechanism
as for commit messages (i.e., assume UTF-8 unless otherwise specified), it
would be easier to do.
On Windows, the file APIs allow you to use Unicode (UTF-16) to specify file
names, and the file systems will handle any necessary conversion to whatever
byte sequences are used to store the file names. UTF-16 and UTF-8 are
trivial to convert between, and Windows does contain APIs to convert between
other character encodings and UTF-16.
On Mac OS X, I believe the file system APIs assume you use some kind of
normalized UTF-8. That should also be possible to create, possibly
converting back and forth between different normalization forms, if necessary.
On Linux and other Unixes we could just use iconv() to convert from the
repository file name encoding to whatever the current locale has set up. The
trick here is to handle file names outside the current encoding. Some kind
of escaping mechanism will probably need to be introduced.
The best way would be to define this in the Git core once and for all, and
add support to it for all the platforms in the same go, instead of trying to
hack around the issue whenever it pops up on the various platforms.
My main use-case for Git on Windows has disappeared as my $dayjob went
bankrupt, but I am happy to assist with whatever insight I may be able to
bring.
--
\\// Peter - http://www.softwolves.pp.se/
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-12 21:55 ` Jakub Narebski
@ 2009-05-14 17:59 ` Heiko Voigt
2009-05-15 10:52 ` Martin Langhoff
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-14 17:59 UTC (permalink / raw)
To: Jakub Narebski
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
At the moment non-ascii encodings of filenames are not portably converted
between different filesystems by git. This will most likely change in the
future but to allow repositories to be portable among different file/operating
systems this check is enabled by default.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
On Tue, May 12, 2009 at 11:55:59PM +0200, Jakub Narebski wrote:
> On Tue, 12 May 2009, Heiko Voigt wrote:
>
> > At the moment non-ascii encodings of file/usernames are not very well
> > supported by git. This will most likely change in the future but to
> > allow repositories to be portable among different file/operating systems
> > this check is enabled by default.
>
> > + # non-ascii username issue a warning in git gui so tell the
> > + # user to change it
> > + if ! git config user.name | is_ascii; then
> > + echo "Please only use ascii characters in your username!"
> > + exit 1
> > + fi
> > +
> > + if ! git config user.email | is_ascii; then
> > + echo "Please only use ascii characters in your email!"
> > + exit 1
> > + fi
>
> Actually 1.) there is no easy way to avoid non-ASCII names at least
> in user.name (I think they are not allowed in email), but 2.) there
> is no trouble with non-ASCII encoding of commits, as they have
> 'encoding' header if it is not uft-8 (see *encoding* config variables).
I tried it and indeed it seems to work now. This hook originated from a
windows installation were having non-ascii characters resulted in a
strange warning from git gui each time you commit. So here is the
corrected patch.
templates/hooks--pre-commit.sample | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..3083735 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,26 @@
#
# To enable this hook, rename this file to "pre-commit".
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+function is_ascii () {
+ test -z "$(cat | sed -e "s/[\ -~]*//g")"
+ return $?
+}
+
+if [ "$allownonascii" != "true" ]
+then
+ # until git can handle non-ascii filenames gracefully
+ # prevent them to be added into the repository
+ if ! git diff --cached --name-only --diff-filter=A -z \
+ | tr "\0" "\n" | is_ascii; then
+ echo "Non-ascii filenames are not allowed !"
+ echo "Please rename the file ..."
+ exit 1
+ fi
+fi
+
if git-rev-parse --verify HEAD 2>/dev/null
then
against=HEAD
--
1.6.3
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
@ 2009-05-14 19:58 ` Esko Luontola
2009-05-14 20:21 ` Andreas Ericsson
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-14 19:58 UTC (permalink / raw)
To: Peter Krefting; +Cc: git
Peter Krefting wrote on 14.5.2009 16:48:
> Is it really necessary to store the encoding for every single file name,
> should it not be enough to just store encoding information for all file
> names at once (i.e., for the object that contains the list of file names
> and their associated blobs)?
What about if some disorganized project has people committing with many
different encodings? Should we allow it, that a directory has the names
of some files using one encoding, and the names of other files using
another encoding? Or should we force the whole repository to use the
same encoding?
> The best way would be to define this in the Git core once and for all,
> and add support to it for all the platforms in the same go, instead of
> trying to hack around the issue whenever it pops up on the various
> platforms.
+1
--
Esko Luontola
www.orfjackal.net
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-14 19:58 ` Esko Luontola
@ 2009-05-14 20:21 ` Andreas Ericsson
2009-05-14 22:25 ` Johannes Schindelin
2009-05-15 11:18 ` Dmitry Potapov
2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-14 20:21 UTC (permalink / raw)
To: Esko Luontola; +Cc: Peter Krefting, git
Esko Luontola wrote:
> Peter Krefting wrote on 14.5.2009 16:48:
>> Is it really necessary to store the encoding for every single file
>> name, should it not be enough to just store encoding information for
>> all file names at once (i.e., for the object that contains the list of
>> file names and their associated blobs)?
>
> What about if some disorganized project has people committing with many
> different encodings? Should we allow it, that a directory has the names
> of some files using one encoding, and the names of other files using
> another encoding? Or should we force the whole repository to use the
> same encoding?
>
If encodings are on a per-tree basis, we could add a special mode-flag for
it without breaking backwards incompatibility (I think, anyways). Older
gits just won't know how to handle it and will treat it as a byte-stream.
>> The best way would be to define this in the Git core once and for all,
>> and add support to it for all the platforms in the same go, instead of
>> trying to hack around the issue whenever it pops up on the various
>> platforms.
>
> +1
>
There's still the problem that noone's stepped forward to do all that
work yet, so apparently this isn't important enough for people to put
their patches where their mouths are. Often when issues generate long
discussions and no code, it's of high academic interest and of little
real-world value.
I believe the "little real-world value" here comes from the fact that
cross-platform projects often enforce 7-bit ascii compatible filenames
from the start, because they *know* they may run into problems on other
filesystems otherwise. Remember it's not only git that has to get
things right. It's also build-systems and compilers that have to locate
the correct files (the Makefile and the filesystem may use different
encodings), so in the real world, people really do stay away from
filenames with åäö or other non-ascii chars in them.
It's fun to discuss, but I won't spend any time on it. Good luck to
those who do though. I'd quite like to see if someone could pull it
off without breaking backwards compatibility or impacting performance
too much.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
http://nordicmeetonnagios.op5.org/
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-14 19:58 ` Esko Luontola
2009-05-14 20:21 ` Andreas Ericsson
@ 2009-05-14 22:25 ` Johannes Schindelin
2009-05-15 11:18 ` Dmitry Potapov
2 siblings, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-14 22:25 UTC (permalink / raw)
To: Esko Luontola; +Cc: Peter Krefting, git
Hi,
On Thu, 14 May 2009, Esko Luontola wrote:
> Peter Krefting wrote on 14.5.2009 16:48:
>
> > The best way would be to define this in the Git core once and for all,
> > and add support to it for all the platforms in the same go, instead of
> > trying to hack around the issue whenever it pops up on the various
> > platforms.
>
> +1
You might be enthusiastic about this cunning idea. However, if it costs
me performance on Linux, and all the benefits go to Windows users, then I
will remove this "solution" from my personal Git tree _right away_, and
I'd expect a lot of other people, too.
I repeat this just once more: if you add complexity, you'll have to have a
compelling reason to do so. If there is no benefit for Linux users, why
should they bear the cost?
But as Andreas remarked, I sincerely think that there has been enough talk
about the issue. It's time to see some patches, or to stop the
discussion.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
@ 2009-05-15 10:52 ` Martin Langhoff
2009-05-18 9:37 ` Heiko Voigt
2009-06-20 12:14 ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
2009-05-15 18:11 ` [PATCH v2] " Junio C Hamano
2 siblings, 2 replies; 59+ messages in thread
From: Martin Langhoff @ 2009-05-15 10:52 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> At the moment non-ascii encodings of filenames are not portably converted
> between different filesystems by git. This will most likely change in the
> future but to allow repositories to be portable among different file/operating
> systems this check is enabled by default.
Nice!
- It'd be a good idea to add to the mix a check for filenames that
are equivalent in case-insensitive FSs.
- Should all of this be a general "portablefilenames" setting?
cheers,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
2009-05-14 19:58 ` Esko Luontola
2009-05-14 20:21 ` Andreas Ericsson
2009-05-14 22:25 ` Johannes Schindelin
@ 2009-05-15 11:18 ` Dmitry Potapov
2 siblings, 0 replies; 59+ messages in thread
From: Dmitry Potapov @ 2009-05-15 11:18 UTC (permalink / raw)
To: Esko Luontola; +Cc: Peter Krefting, git
On Thu, May 14, 2009 at 10:58:17PM +0300, Esko Luontola wrote:
>
> What about if some disorganized project has people committing with many
> different encodings? Should we allow it, that a directory has the names
> of some files using one encoding, and the names of other files using
> another encoding? Or should we force the whole repository to use the
> same encoding?
The whole repository should have the same encoding internally. Anything
else will be too complex and too slow... Have you seen any file system
where file names would be stored in different encodings? And Git does
far more operation on file names than a file system does. So, it is
clearly to me that the whole repository should have a single encoding.
Now, I don't think that you will find many open source projects that use
non-ASCII in file names. Moreover, most Linux users are either use UTF-8
already or switch to it in the near future. Mac OS X uses UTF-8 (though
there is a problem with decomposed characters, but Linus posted a
possible solution). So, the only platform were non-ASCII characters may
be interesting to Git users and that does not support UTF-8 is Windows.
AFAIK, Cygwin 1.7 has UTF-8 support. So, it is mostly a problem for
msysGit... Though adding support for legacy encodings can help to some
degree, it means that every system call involving a file name will go
through UTF-8 <-> LEGACY_ENC <-> UTF-16LE conversion. IMHO, having a
legacy encoding involved is far from the best possible solution; but
to avoid that, you need to change MSYS to be able to work with UTF-8.
(I have never looked at MSYS myself, but I suspect it may be not easy).
Dmitry
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
2009-05-15 10:52 ` Martin Langhoff
@ 2009-05-15 14:57 ` Jakub Narebski
2009-05-18 9:50 ` [PATCH] " Heiko Voigt
2009-05-15 18:11 ` [PATCH v2] " Junio C Hamano
2 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-15 14:57 UTC (permalink / raw)
To: Heiko Voigt
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
<Insert standard Dscho disclaimer here...> ;-)
In short: good idea, don't be discouraged by criticism...
On Thu, 14 May 2009, Heiko Voigt wrote:
> At the moment non-ascii encodings of filenames are not portably converted
> between different filesystems by git. This will most likely change in the
> future but to allow repositories to be portable among different file/operating
> systems this check is enabled by default.
By the way, you might consider choosing shorter line length for commits,
something around 70-76 characters per line; otherwise it is harder to
reply to without linewrapping. 80 characters that you used is, IMHO,
absolute maximum, and it is good that you kept to it.
>
> Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
> ---
> +# If you want to allow non-ascii filenames set this variable to true.
> +allownonascii=$(git config hooks.allownonascii)
> +
> +function is_ascii () {
> + test -z "$(cat | sed -e "s/[\ -~]*//g")"
> + return $?
> +}
>From CodingGuidelines for shell scripts:
- We do not write the noiseword "function" in front of shell
functions.
(in short: do not use bash-specific features... unless, of course,
you are modifying bash-completion script).
Second, it would be nice to have comment about how to use this
function (as it does not check file given by its argument, but
rather its standard input). And perhaps also a comment that it
works because ASCII printable characters begin with ' ' space
(does it have to be escaped?) and end with '~' tilde[2].
Third, isn't it useless use of 'cat'[3]? And wouldn't it be better
to use 'tr' to either delete printable characters and check for
anything left (as you do; BTW. wouldn't "return test ..." be simpler?),
or use 'tr' to count non portable characters?
[1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
[3] http://partmaps.org/era/unix/award.html#cat
> +
> +if [ "$allownonascii" != "true" ]
> +then
> + # until git can handle non-ascii filenames gracefully
> + # prevent them to be added into the repository
> + if ! git diff --cached --name-only --diff-filter=A -z \
> + | tr "\0" "\n" | is_ascii; then
> + echo "Non-ascii filenames are not allowed !"
> + echo "Please rename the file ..."
> + exit 1
> + fi
> +fi
> +
> if git-rev-parse --verify HEAD 2>/dev/null
> then
> against=HEAD
> --
> 1.6.3
>
>
>
>
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
2009-05-15 10:52 ` Martin Langhoff
2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
@ 2009-05-15 18:11 ` Junio C Hamano
2 siblings, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2009-05-15 18:11 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola,
git, Junio C Hamano
Heiko Voigt <hvoigt@hvoigt.net> writes:
> diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
> index 0e49279..3083735 100755
> --- a/templates/hooks--pre-commit.sample
> +++ b/templates/hooks--pre-commit.sample
> @@ -7,6 +7,26 @@
> #
> # To enable this hook, rename this file to "pre-commit".
>
> +# If you want to allow non-ascii filenames set this variable to true.
> +allownonascii=$(git config hooks.allownonascii)
> +
> +function is_ascii () {
We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say
"function " here.
> + test -z "$(cat | sed -e "s/[\ -~]*//g")"
Do you need "cat | "?
Does this script run under LC_ALL=C? Can an i18n'ized sed interfere with
what you are trying to do?
> + return $?
Do you need this, or does the function return the result of the last
statment anyway?
> + echo "Non-ascii filenames are not allowed !"
> + echo "Please rename the file ..."
Can we make this sound more like a _sample_ project policy? It's not like
we enforce that policy to other people's projects.
> + exit 1
> + fi
> +fi
> +
> if git-rev-parse --verify HEAD 2>/dev/null
> then
> against=HEAD
> --
> 1.6.3
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-15 10:52 ` Martin Langhoff
@ 2009-05-18 9:37 ` Heiko Voigt
2009-05-18 22:26 ` Jakub Narebski
2009-06-20 12:14 ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
1 sibling, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18 9:37 UTC (permalink / raw)
To: Martin Langhoff
Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:
> On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
>
> Nice!
>
> - It'd be a good idea to add to the mix a check for filenames that
> are equivalent in case-insensitive FSs.
I agree, but that will be an extension in another patch. BTW, if anyone
has a good idea how to efficiently do that kind of check in a hook I'd
cook up a patch on top of this.
> - Should all of this be a general "portablefilenames" setting?
Well, if you can specify what general portable filenames would have as
properties.
Questions like:
* What is the portable maximum path length?
* How long may a filename be (DOS 8.3 ?)
* Are windows keywords (PRN, ...) allowed?
* ...
So I think this should be on a per property basis providing sensible
defaults to support the most standard case.
cheers Heiko
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
@ 2009-05-18 9:50 ` Heiko Voigt
2009-05-18 10:40 ` Johannes Sixt
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18 9:50 UTC (permalink / raw)
To: Jakub Narebski, Junio C Hamano
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git
At the moment non-ascii encodings of filenames are not portably converted
between different filesystems by git. This will most likely change in the
future but to allow repositories to be portable among different file/operating
systems this check is enabled by default.
Signed-off-by: Heiko <hvoigt@hvoigt.net>
---
so here is a third version ...
On Fri, May 15, 2009 at 04:57:45PM +0200, Jakub Narebski wrote:
> On Thu, 14 May 2009, Heiko Voigt wrote:
>
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
>
> By the way, you might consider choosing shorter line length for commits,
> something around 70-76 characters per line; otherwise it is harder to
> reply to without linewrapping. 80 characters that you used is, IMHO,
> absolute maximum, and it is good that you kept to it.
Yeah, I admit they were a little bit long.
> > +function is_ascii () {
> > + test -z "$(cat | sed -e "s/[\ -~]*//g")"
> > + return $?
> > +}
>
> From CodingGuidelines for shell scripts:
> - We do not write the noiseword "function" in front of shell
> functions.
>
> (in short: do not use bash-specific features... unless, of course,
> you are modifying bash-completion script).
Addressed.
> Second, it would be nice to have comment about how to use this
> function (as it does not check file given by its argument, but
> rather its standard input). And perhaps also a comment that it
> works because ASCII printable characters begin with ' ' space
> (does it have to be escaped?) and end with '~' tilde[2].
Done
>
> Third, isn't it useless use of 'cat'[3]? And wouldn't it be better
> to use 'tr' to either delete printable characters and check for
> anything left (as you do; BTW. wouldn't "return test ..." be simpler?),
> or use 'tr' to count non portable characters?
Yes indeed it was useless. I also switched from sed to tr.
>
> [1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
> [2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
> [3] http://partmaps.org/era/unix/award.html#cat
On Fri, May 15, 2009 at 11:11:12AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> > +function is_ascii () {
>
> We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say
> "function " here.
See above.
> > + test -z "$(cat | sed -e "s/[\ -~]*//g")"
>
> Do you need "cat | "?
Also above.
> Does this script run under LC_ALL=C? Can an i18n'ized sed interfere with
> what you are trying to do?
I now explicitely set LC_ALL=C for the tr call which should now be robust
against such cases.
>
> > + return $?
>
> Do you need this, or does the function return the result of the last
> statment anyway?
I wasn't aware of that. Removed the return.
> > + echo "Non-ascii filenames are not allowed !"
> > + echo "Please rename the file ..."
>
> Can we make this sound more like a _sample_ project policy? It's not like
> we enforce that policy to other people's projects.
I've polished this so we are now more user friendly as well.
templates/hooks--pre-commit.sample | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..91ab563 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,38 @@
#
# To enable this hook, rename this file to "pre-commit".
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+# is_ascii() Tests the string given given on standard input for
+# printable ascii conformance. We exploit the fact that the printable
+# range starts at the space character and ends with tilde.
+is_ascii() {
+ test -z "$(LC_ALL=C tr -d \ -~)"
+}
+
+if [ "$allownonascii" != "true" ]
+then
+ # until git can handle non-ascii filenames gracefully
+ # prevent them to be added into the repository
+ if ! git diff --cached --name-only --diff-filter=A -z \
+ | tr "\0" "\n" | is_ascii; then
+ echo "Error: Preventing to add a non-ascii filename."
+ echo
+ echo "This can cause problems if you want to work together"
+ echo "with people on other platforms than you."
+ echo
+ echo "To be portable it is adviseable to rename the file ..."
+ echo
+ echo "If you know what you are doing you can disable this"
+ echo "check using:"
+ echo
+ echo " git config hooks.allownonascii true"
+ echo
+ exit 1
+ fi
+fi
+
if git-rev-parse --verify HEAD 2>/dev/null
then
against=HEAD
--
1.6.3
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 9:50 ` [PATCH] " Heiko Voigt
@ 2009-05-18 10:40 ` Johannes Sixt
2009-05-18 11:50 ` Heiko Voigt
2009-05-19 20:01 ` [PATCH v4] " Heiko Voigt
2009-05-18 14:42 ` [PATCH] " Junio C Hamano
2009-05-18 20:35 ` Julian Phillips
2 siblings, 2 replies; 59+ messages in thread
From: Johannes Sixt @ 2009-05-18 10:40 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
Esko Luontola, git
Heiko Voigt schrieb:
> +# is_ascii() Tests the string given given on standard input for
> +# printable ascii conformance. We exploit the fact that the printable
> +# range starts at the space character and ends with tilde.
> +is_ascii() {
> + test -z "$(LC_ALL=C tr -d \ -~)"
> +}
> +
> +if [ "$allownonascii" != "true" ]
> +then
> + # until git can handle non-ascii filenames gracefully
> + # prevent them to be added into the repository
> + if ! git diff --cached --name-only --diff-filter=A -z \
> + | tr "\0" "\n" | is_ascii; then
Will this not fail to add more than one file with allowed names? The \n is
not removed in is_ascii(), and so the resulting string will not be empty.
BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
for example. Otherwise, I would have suggested to convert the NUL to some
allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
'\n' (single-quotes) to guarantee that the shell does not ignore the
backslash.
-- Hannes
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 10:40 ` Johannes Sixt
@ 2009-05-18 11:50 ` Heiko Voigt
2009-05-18 12:04 ` Johannes Sixt
2009-05-19 20:01 ` [PATCH v4] " Heiko Voigt
1 sibling, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18 11:50 UTC (permalink / raw)
To: Johannes Sixt
Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
Esko Luontola, git
On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > +# is_ascii() Tests the string given given on standard input for
> > +# printable ascii conformance. We exploit the fact that the printable
> > +# range starts at the space character and ends with tilde.
> > +is_ascii() {
> > + test -z "$(LC_ALL=C tr -d \ -~)"
> > +}
> > +
> > +if [ "$allownonascii" != "true" ]
> > +then
> > + # until git can handle non-ascii filenames gracefully
> > + # prevent them to be added into the repository
> > + if ! git diff --cached --name-only --diff-filter=A -z \
> > + | tr "\0" "\n" | is_ascii; then
>
> Will this not fail to add more than one file with allowed names? The \n is
> not removed in is_ascii(), and so the resulting string will not be empty.
No currently it does not. At least on my system, but good point.
> BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
> for example. Otherwise, I would have suggested to convert the NUL to some
> allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
> '\n' (single-quotes) to guarantee that the shell does not ignore the
> backslash.
Are there any problems with '\0' and tr other than swallowing of it. In
case not I would just change
tr "\0" "\n"
to
tr -d '\0'
That way there are no '\n's left over and it doesn't matter if tr
swallows the '\0'.
Waiting for further comments before sending the cleanup.
cheers Heiko
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 11:50 ` Heiko Voigt
@ 2009-05-18 12:04 ` Johannes Sixt
0 siblings, 0 replies; 59+ messages in thread
From: Johannes Sixt @ 2009-05-18 12:04 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
Esko Luontola, git
Heiko Voigt schrieb:
> Are there any problems with '\0' and tr other than swallowing of it.
I can't tell. But the commits ae90e16..aab0abf are interesting to study in
w.r.t. portability.
> In
> case not I would just change
>
> tr "\0" "\n"
> to
> tr -d '\0'
In which case I'd suggest that you call tr only once, in isascii():
tr -d '[ -~]\0'
-- Hannes
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 9:50 ` [PATCH] " Heiko Voigt
2009-05-18 10:40 ` Johannes Sixt
@ 2009-05-18 14:42 ` Junio C Hamano
2009-05-18 20:35 ` Julian Phillips
2 siblings, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2009-05-18 14:42 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
Esko Luontola, git
Heiko Voigt <hvoigt@hvoigt.net> writes:
> +if [ "$allownonascii" != "true" ]
> +then
> + # until git can handle non-ascii filenames gracefully
> + # prevent them to be added into the repository
I think you can inline your is_ascii shell function in the pipeline below.
You made it a separate function and I agree that it has a very good
documentation value, but the mention of "non-ascii filenames" in this
comment here is enough clue to let anybody know what is going on.
Side note: I am not sure "Until ... can ... gracefully" is a good
description of the general problem. It probably is more neutral
to say "Cross platform projects tend to avoid non-ascii filenames;
prevent them from being added to the repository."
> + if ! git diff --cached --name-only --diff-filter=A -z \
> + | tr "\0" "\n" | is_ascii; then
A standard trick while writing a long pipeline in shell is to change line
after a pipe, like:
cmd1 |
cmd2 |
cmd3
which allows you to lose the BS-before-LF sequence.
I think comments from J6t and others are valuable but clear enough that I
wouldn't have to repeat them.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 9:50 ` [PATCH] " Heiko Voigt
2009-05-18 10:40 ` Johannes Sixt
2009-05-18 14:42 ` [PATCH] " Junio C Hamano
@ 2009-05-18 20:35 ` Julian Phillips
2 siblings, 0 replies; 59+ messages in thread
From: Julian Phillips @ 2009-05-18 20:35 UTC (permalink / raw)
To: Heiko Voigt
Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
Esko Luontola, git
On Mon, 18 May 2009, Heiko Voigt wrote:
> +if [ "$allownonascii" != "true" ]
> +then
> + # until git can handle non-ascii filenames gracefully
> + # prevent them to be added into the repository
> + if ! git diff --cached --name-only --diff-filter=A -z \
> + | tr "\0" "\n" | is_ascii; then
> + echo "Error: Preventing to add a non-ascii filename."
This would read better as:
+ echo "Error: Attempt to add a non-ascii filename."
(after all the prevention itself is a result of the error, not the cause
of it)
If you want to keep the preventing, then you need to at least correct the
english:
> + echo "Error: Preventing addition of a non-ascii filename."
--
Julian
---
QOTD:
Money isn't everything, but at least it keeps the kids in touch.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 9:37 ` Heiko Voigt
@ 2009-05-18 22:26 ` Jakub Narebski
0 siblings, 0 replies; 59+ messages in thread
From: Jakub Narebski @ 2009-05-18 22:26 UTC (permalink / raw)
To: Heiko Voigt
Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
On Mon, 18 May 2009, Heiko Voigt wrote:
> On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:
> > - Should all of this be a general "portablefilenames" setting?
>
> Well, if you can specify what general portable filenames would have as
> properties.
"Fixing Unix/Linux/POSIX Filenames: Control Characters (such as
Newline), Leading Dashes, and Other Problems" by David A. Wheeler
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v4] Extend sample pre-commit hook to check for non ascii filenames
2009-05-18 10:40 ` Johannes Sixt
2009-05-18 11:50 ` Heiko Voigt
@ 2009-05-19 20:01 ` Heiko Voigt
1 sibling, 0 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-19 20:01 UTC (permalink / raw)
To: Johannes Sixt, Junio C Hamano, Julian Phillips
Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola,
git
At the moment non-ascii encodings of filenames are not portably
converted between different filesystems by git. This will most likely
change in the future but to allow repositories to be portable among
different file/operating systems this check is enabled by default.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
Thanks for all comments. I now hopefully have a satisfying patch.
On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > + if ! git diff --cached --name-only --diff-filter=A -z \
> > + | tr "\0" "\n" | is_ascii; then
>
> Will this not fail to add more than one file with allowed names? The \n is
> not removed in is_ascii(), and so the resulting string will not be empty.
>
> BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
> for example. Otherwise, I would have suggested to convert the NUL to some
> allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
> '\n' (single-quotes) to guarantee that the shell does not ignore the
> backslash.
I removed all \0 characters and hopefully use the correct platform
independent syntax as described in the commits you send.
On Mon, May 18, 2009 at 02:04:08PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > Are there any problems with '\0' and tr other than swallowing of it.
>
> I can't tell. But the commits ae90e16..aab0abf are interesting to study in
> w.r.t. portability.
>
> > In
> > case not I would just change
> >
> > tr "\0" "\n"
> > to
> > tr -d '\0'
>
> In which case I'd suggest that you call tr only once, in isascii():
>
> tr -d '[ -~]\0'
After reading a little about the portability things. This seems to be
the right way and is now included.
On Mon, May 18, 2009 at 07:42:31AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
>
> > +if [ "$allownonascii" != "true" ]
> > +then
> > + # until git can handle non-ascii filenames gracefully
> > + # prevent them to be added into the repository
>
> I think you can inline your is_ascii shell function in the pipeline below.
> You made it a separate function and I agree that it has a very good
> documentation value, but the mention of "non-ascii filenames" in this
> comment here is enough clue to let anybody know what is going on.
I agree. I thought it would probably be useful in other places but we
just need it once so its inlined now.
>
> Side note: I am not sure "Until ... can ... gracefully" is a good
> description of the general problem. It probably is more neutral
> to say "Cross platform projects tend to avoid non-ascii filenames;
> prevent them from being added to the repository."
Changed that.
>
> > + if ! git diff --cached --name-only --diff-filter=A -z \
> > + | tr "\0" "\n" | is_ascii; then
>
> A standard trick while writing a long pipeline in shell is to change line
> after a pipe, like:
>
> cmd1 |
> cmd2 |
> cmd3
>
> which allows you to lose the BS-before-LF sequence.
Wasn't aware of that. Changed it accordingly.
On Mon, May 18, 2009 at 09:35:19PM +0100, Julian Phillips wrote:
> On Mon, 18 May 2009, Heiko Voigt wrote:
>> + echo "Error: Preventing to add a non-ascii filename."
>
> This would read better as:
>
> + echo "Error: Attempt to add a non-ascii filename."
>
> (after all the prevention itself is a result of the error, not the cause
> of it)
That really sounds better. Thanks.
templates/hooks--pre-commit.sample | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..ad892a2 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,31 @@
#
# To enable this hook, rename this file to "pre-commit".
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+# Cross platform projects tend to avoid non-ascii filenames; prevent
+# them from being added to the repository. We exploit the fact that the
+# printable range starts at the space character and ends with tilde.
+if [ "$allownonascii" != "true" ] &&
+ test "$(git diff --cached --name-only --diff-filter=A -z |
+ LC_ALL=C tr -d '[ -~]\0')"
+then
+ echo "Error: Attempt to add a non-ascii filename."
+ echo
+ echo "This can cause problems if you want to work together"
+ echo "with people on other platforms than you."
+ echo
+ echo "To be portable it is adviseable to rename the file ..."
+ echo
+ echo "If you know what you are doing you can disable this"
+ echo "check using:"
+ echo
+ echo " git config hooks.allownonascii true"
+ echo
+ exit 1
+fi
+
if git-rev-parse --verify HEAD 2>/dev/null
then
against=HEAD
--
1.6.3
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook
2009-05-15 10:52 ` Martin Langhoff
2009-05-18 9:37 ` Heiko Voigt
@ 2009-06-20 12:14 ` Heiko Voigt
1 sibling, 0 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-06-20 12:14 UTC (permalink / raw)
To: Martin Langhoff
Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git,
Junio C Hamano
This helps cross-platform projects on the case-sensitive filename side
of operating systems to use filenames that are nice for the
case-insensitive side
---
On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:
> On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
> - It'd be a good idea to add to the mix a check for filenames that
> are equivalent in case-insensitive FSs.
Totally untested. Just to get feedback if someone has ideas how this can
be solved more efficiently. I suspect that processing all files will
yield an unbearable performance degradation on large projects.
Let me know what you think. The wording of the error message is not yet
final.
templates/hooks--pre-commit.sample | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index b11ad6a..32d1809 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -9,6 +9,10 @@
# If you want to allow non-ascii filenames set this variable to true.
allownonascii=$(git config hooks.allownonascii)
+# If you want to allow filenames that only differ in case set this
+# variable to true. NOTE: This can degrade performance on project with
+# lots of files
+allowcaseonly=$(git config hooks.allowcaseonly)
# Cross platform projects tend to avoid non-ascii filenames; prevent
# them from being added to the repository. We exploit the fact that the
@@ -32,6 +36,23 @@ then
exit 1
fi
+# check for names that already exist but only differ in case
+# which can be problematic on non-casesensitive filesystems
+if [ "$allowcaseonly" != "true" ] &&
+ test -z "$(git ls-files | LC_ALL=C tr -s [A-Z] [a-z] | uniq -d)"
+then
+ echo "Error: Attempt to add file which already exists in different case"
+ echo
+ echo "If you know what you are doing you can disable this"
+ echo "check using:"
+ echo
+ echo " git config hooks.allowcaseonly true"
+ echo
+ exit 1
+fi
+
if git-rev-parse --verify HEAD >/dev/null 2>&1
then
against=HEAD
--
1.6.3.2.203.g9a122
^ permalink raw reply related [flat|nested] 59+ messages in thread
end of thread, other threads:[~2009-06-20 12:14 UTC | newest]
Thread overview: 59+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 16:13 ` Johannes Schindelin
2009-05-12 17:56 ` Esko Luontola
2009-05-12 20:38 ` Johannes Schindelin
2009-05-12 21:16 ` Esko Luontola
2009-05-13 0:23 ` Johannes Schindelin
2009-05-13 5:34 ` Esko Luontola
2009-05-13 6:49 ` Alex Riesen
2009-05-13 10:15 ` Johannes Schindelin
[not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
2009-05-13 10:41 ` John Tapsell
2009-05-13 13:42 ` Jay Soffian
2009-05-13 13:44 ` Alex Riesen
2009-05-13 13:50 ` Jay Soffian
2009-05-13 13:57 ` John Tapsell
2009-05-13 15:27 ` Nicolas Pitre
2009-05-13 16:22 ` Johannes Schindelin
2009-05-13 17:24 ` Andreas Ericsson
2009-05-14 1:49 ` Miles Bader
2009-05-12 16:16 ` Jeff King
2009-05-12 16:57 ` Johannes Schindelin
2009-05-13 16:26 ` Linus Torvalds
2009-05-13 17:12 ` Linus Torvalds
2009-05-13 17:31 ` Andreas Ericsson
2009-05-13 17:46 ` Linus Torvalds
2009-05-13 18:26 ` Martin Langhoff
2009-05-13 18:37 ` Linus Torvalds
2009-05-13 21:04 ` Theodore Tso
2009-05-13 21:20 ` Linus Torvalds
2009-05-13 21:08 ` Daniel Barkalow
2009-05-13 21:29 ` Linus Torvalds
2009-05-13 20:57 ` Matthias Andree
2009-05-13 21:10 ` Linus Torvalds
2009-05-13 21:30 ` Jay Soffian
2009-05-13 21:47 ` Matthias Andree
2009-05-12 18:28 ` Dmitry Potapov
2009-05-12 18:40 ` Martin Langhoff
2009-05-12 18:55 ` Jakub Narebski
2009-05-12 21:43 ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
2009-05-12 21:55 ` Jakub Narebski
2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
2009-05-15 10:52 ` Martin Langhoff
2009-05-18 9:37 ` Heiko Voigt
2009-05-18 22:26 ` Jakub Narebski
2009-06-20 12:14 ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
2009-05-18 9:50 ` [PATCH] " Heiko Voigt
2009-05-18 10:40 ` Johannes Sixt
2009-05-18 11:50 ` Heiko Voigt
2009-05-18 12:04 ` Johannes Sixt
2009-05-19 20:01 ` [PATCH v4] " Heiko Voigt
2009-05-18 14:42 ` [PATCH] " Junio C Hamano
2009-05-18 20:35 ` Julian Phillips
2009-05-15 18:11 ` [PATCH v2] " Junio C Hamano
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
2009-05-14 19:58 ` Esko Luontola
2009-05-14 20:21 ` Andreas Ericsson
2009-05-14 22:25 ` Johannes Schindelin
2009-05-15 11:18 ` Dmitry Potapov
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).