bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
* [PATCH] canonicalize-lgpl: Canonicalize casing too for MinGW.
@ 2021-12-09 18:01 Jan (janneke) Nieuwenhuizen
  2021-12-09 22:06 ` Bruno Haible
  0 siblings, 1 reply; 7+ messages in thread
From: Jan (janneke) Nieuwenhuizen @ 2021-12-09 18:01 UTC (permalink / raw
  To: bug-gnulib

* lib/canonicalize-lgpl.c (filesystem_name)[__MINGW32__]: New static
function.
(realpath_stk)[__MINGW32__]: Use it to return correct canonicalized
casing.
* tests/test-canonicalize-lgpl.c (main)[__MINGW32__]: Test it.
---
 lib/canonicalize-lgpl.c        | 37 ++++++++++++++++++++++++++++++++++
 tests/test-canonicalize-lgpl.c | 12 +++++++++++
 2 files changed, 49 insertions(+)

diff --git a/lib/canonicalize-lgpl.c b/lib/canonicalize-lgpl.c
index 92e9639720..baabcbdc25 100644
--- a/lib/canonicalize-lgpl.c
+++ b/lib/canonicalize-lgpl.c
@@ -41,6 +41,11 @@
 #include <intprops.h>
 #include <scratch_buffer.h>
 
+#if __MINGW32__
+#include <dirent.h>
+#include <libgen.h>
+#endif
+
 #ifdef _LIBC
 # include <shlib-compat.h>
 # define GCC_LINT 1
@@ -180,6 +185,33 @@ get_path_max (void)
   return path_max < 0 ? 1024 : path_max <= IDX_MAX ? path_max : IDX_MAX;
 }
 
+#if __MINGW32__
+/* Return the basename of NAME as found on the filesystem, which may
+   or may not canonicalize the casing, or NULL if not found.  */
+static char *
+filesystem_name (char const *name)
+{
+  char base_buf[PATH_MAX];
+  strcpy (base_buf, name);
+  char *base = basename (base_buf);
+
+  int select_base (struct dirent const* entry)
+  {
+    return strcasecmp (entry->d_name, base) == 0;
+  }
+
+  char dir_buf[PATH_MAX];
+  strcpy (dir_buf, name);
+  char *dir = dirname (dir_buf);
+
+  struct dirent **name_list;
+  int i = scandir (dir, &name_list, select_base, NULL);
+  if (i == 1)
+    return name_list[0]->d_name;
+  return NULL;
+}
+#endif
+
 /* Act like __realpath (see below), with an additional argument
    rname_buf that can be used as temporary storage.
 
@@ -322,6 +354,11 @@ realpath_stk (const char *name, char *resolved,
             {
               buf = link_buffer.data;
               idx_t bufsize = link_buffer.length;
+#if __MINGW32__
+              char *fname = filesystem_name (rname);
+              if (fname)
+                strcpy (rname + strlen (rname) - strlen (fname), fname);
+#endif
               n = __readlink (rname, buf, bufsize - 1);
               if (n < bufsize - 1)
                 break;
diff --git a/tests/test-canonicalize-lgpl.c b/tests/test-canonicalize-lgpl.c
index c0a5a55150..cf41a2a628 100644
--- a/tests/test-canonicalize-lgpl.c
+++ b/tests/test-canonicalize-lgpl.c
@@ -279,6 +279,18 @@ main (void)
     free (result2);
   }
 
+#if __MINGW32__
+  /* Check that \\ are changed into / and casing is canonicalized. */
+  {
+    int fd = creat (BASE "/MinGW", 0600);
+    ASSERT (0 <= fd);
+    ASSERT (close (fd) == 0);
+
+    char *result = canonicalize_file_name (BASE "\\mingw");
+    ASSERT (strcmp (result, BASE "/MinGW");
+    free (result);
+  }
+#endif
 
   /* Cleanup.  */
   ASSERT (remove (BASE "/droot") == 0);


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-09 18:01 [PATCH] canonicalize-lgpl: Canonicalize casing too for MinGW Jan (janneke) Nieuwenhuizen
@ 2021-12-09 22:06 ` Bruno Haible
  2021-12-10  6:59   ` [PATCH v2] " Jan Nieuwenhuizen
  0 siblings, 1 reply; 7+ messages in thread
From: Bruno Haible @ 2021-12-09 22:06 UTC (permalink / raw
  To: bug-gnulib; +Cc: Jan Nieuwenhuizen

Hi Jan,

> * lib/canonicalize-lgpl.c (filesystem_name)[__MINGW32__]: New static
> function.
> (realpath_stk)[__MINGW32__]: Use it to return correct canonicalized
> casing.

I don't think this is desirable, because

1) The 'realpath' function that canonicalize-lgpl.c implements is
   specified to return "an absolute pathname that resolves to the same
   directory entry, whose resolution does not involve '.', '..', or symbolic
   links." [1] There is no guarantee in the spec that it prefers lowercase,
   uppercase, or the case of the existing directory entry.

2) If we wanted to make this function consistent on all platforms, we would
   also need to handle
     - Linux with mounted VFAT file systems,
     - macOS with case-insensitive HFS+,
     - different locales on Windows (e.g. to recognize that 'ä' and 'Ä' are
       equivalent in Windows installations with Western locales).
   And, on macOS with HFS+, also the Unicode canonicalization (NFC vs. NFD).

3) By doing this, the function would be slowed down significantly. The
   scandir() call that you added reads all directory entries of a certain
   directory.

What exactly do you want to do? If you want to look at the file name of
an existing directory entry, let your program use scandir().

Additionally,

> +  int select_base (struct dirent const* entry)
> +  {
> +    return strcasecmp (entry->d_name, base) == 0;
> +  }

I would refrain from adding code that requires GCC and does not work with MSVC.

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/realpath.html





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-09 22:06 ` Bruno Haible
@ 2021-12-10  6:59   ` Jan Nieuwenhuizen
  2021-12-10 22:20     ` Paul Eggert
  2021-12-11 13:41     ` Bruno Haible
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Nieuwenhuizen @ 2021-12-10  6:59 UTC (permalink / raw
  To: Bruno Haible; +Cc: bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 3668 bytes --]

Bruno Haible writes:

Hello Bruno,

Thanks for your quick and thougtful response.

TL;DR: Okay, well too bad.  I guess we and everyone else has to patch
their client code, then?  For the archives: a version 2 attached that
fixes two bugs.

>> * lib/canonicalize-lgpl.c (filesystem_name)[__MINGW32__]: New static
>> function.
>> (realpath_stk)[__MINGW32__]: Use it to return correct canonicalized
>> casing.
>
> I don't think this is desirable, because

Oh, too bad!  :-)

> 1) The 'realpath' function that canonicalize-lgpl.c implements is
>    specified to return "an absolute pathname that resolves to the same
>    directory entry, whose resolution does not involve '.', '..', or symbolic
>    links." [1] There is no guarantee in the spec that it prefers lowercase,
>    uppercase, or the case of the existing directory entry.

Yeah...but wasn't that documentation written long before someone could
have thought of the enormity of a "file system" that doesn't respect
casing?  That's what the name of the function suggests to me.

> 2) If we wanted to make this function consistent on all platforms, we would
>    also need to handle
>      - Linux with mounted VFAT file systems,
>      - macOS with case-insensitive HFS+,
>      - different locales on Windows (e.g. to recognize that 'ä' and 'Ä' are
>        equivalent in Windows installations with Western locales).
>    And, on macOS with HFS+, also the Unicode canonicalization (NFC vs. NFD).

If that is what we would really like, someone could work on fixing those
too, no need for this patch to fix all bugs at once?

> 3) By doing this, the function would be slowed down significantly. The
>    scandir() call that you added reads all directory entries of a certain
>    directory.

Yes, this is probably the biggest problem with this patch, although I
thought it might just have been alright: If you care a lot about speed,
you aren't using Windows anyway and I understood that STAT is already
pretty slow there.  On the other hand, in favor of this patch: If
correctness isn't all that important, I can think of a patch decreases
correctness and will dramatically speed up canonicalize_file_name ;-)

> What exactly do you want to do?

Maybe I patched the wrong function, I didn't know about realpath.  We
are using canonicalize-path in Guile, which uses canonicalize_file_name.

I think of a canonical form of something as a unique, authoritative
representation.  As long as canonicalize_file_name can return both
"/foo" and "/FOO", the name of the function is sloppy/wrong, or its
implementation (IMHO).  I guessed the latter, hence this "fix".

We are using the canonical form as an automatic include guard, to not
include the same file twice.  A user reported a bug: In a large project
they used an import of a file with alternative casing.  I was ready to
say: "Well, don't do that", but...yeah.

Instead of patching our client code and having everyone else who targets
Windows find this out and patch their client code, I wanted to help
and fix it in the Right Place(TM).

> If you want to look at the file name of
> an existing directory entry, let your program use scandir().
>
> Additionally,
>
>> +  int select_base (struct dirent const* entry)
>> +  {
>> +    return strcasecmp (entry->d_name, base) == 0;
>> +  }
>
> I would refrain from adding code that requires GCC and does not work with MSVC.

Right, I will try to remember that.  That's pretty terrible here, we
would need a global (or thread local storage) I guess..  Didn't fix this
in v2, as this isn't going in anyway.

Greetings,
Janneke


[-- Attachment #2: v2-0001-canonicalize-lgpl-Canonicalize-casing-too-for-Min.patch --]
[-- Type: text/x-patch, Size: 4080 bytes --]

From e5e7e10927d5c20bf4ba0a20c7343b389bd192a2 Mon Sep 17 00:00:00 2001
From: "Jan (janneke) Nieuwenhuizen" <janneke@gnu.org>
Date: Thu, 9 Dec 2021 18:39:33 +0100
Subject: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

* lib/canonicalize-lgpl.c (filesystem_name)[__MINGW32__]: New static
function.
(realpath_stk)[__MINGW32__]: Use it to return correct canonicalized
casing.  Also for relative file names, replace any '\\' with '/'.
* tests/test-canonicalize-lgpl.c (main)[__MINGW32__]: Test it.
---
 lib/canonicalize-lgpl.c        | 51 ++++++++++++++++++++++++++++++++++
 tests/test-canonicalize-lgpl.c | 12 ++++++++
 2 files changed, 63 insertions(+)

diff --git a/lib/canonicalize-lgpl.c b/lib/canonicalize-lgpl.c
index 92e9639720..895c335b68 100644
--- a/lib/canonicalize-lgpl.c
+++ b/lib/canonicalize-lgpl.c
@@ -41,6 +41,12 @@
 #include <intprops.h>
 #include <scratch_buffer.h>
 
+#if __MINGW32__
+#include <ctype.h>
+#include <dirent.h>
+#include <libgen.h>
+#endif
+
 #ifdef _LIBC
 # include <shlib-compat.h>
 # define GCC_LINT 1
@@ -180,6 +186,33 @@ get_path_max (void)
   return path_max < 0 ? 1024 : path_max <= IDX_MAX ? path_max : IDX_MAX;
 }
 
+#if __MINGW32__
+/* Return the basename of NAME as found on the filesystem, which may
+   or may not canonicalize the casing, or NULL if not found.  */
+static char *
+filesystem_name (char const *name)
+{
+  char base_buf[PATH_MAX];
+  strcpy (base_buf, name);
+  char *base = basename (base_buf);
+
+  int select_base (struct dirent const* entry)
+  {
+    return strcasecmp (entry->d_name, base) == 0;
+  }
+
+  char dir_buf[PATH_MAX];
+  strcpy (dir_buf, name);
+  char *dir = dirname (dir_buf);
+
+  struct dirent **name_list;
+  int i = scandir (dir, &name_list, select_base, NULL);
+  if (i == 1)
+    return name_list[0]->d_name;
+  return NULL;
+}
+#endif
+
 /* Act like __realpath (see below), with an additional argument
    rname_buf that can be used as temporary storage.
 
@@ -251,12 +284,25 @@ realpath_stk (const char *name, char *resolved,
             goto error_nomem;
           rname = rname_buf->data;
         }
+#if __MINGW32__
+      char *p = rname;
+      while (*p)
+        {
+          if (ISSLASH (*p))
+            *p = '/';
+          p++;
+        }
+#endif
       dest = __rawmemchr (rname, '\0');
       start = name;
       prefix_len = FILE_SYSTEM_PREFIX_LEN (rname);
     }
   else
     {
+#if __MINGW32__
+      if (HAS_DEVICE (rname))
+        rname[0] = toupper (rname[0]);
+#endif
       dest = __mempcpy (rname, name, prefix_len);
       *dest++ = '/';
       if (DOUBLE_SLASH_IS_DISTINCT_ROOT)
@@ -322,6 +368,11 @@ realpath_stk (const char *name, char *resolved,
             {
               buf = link_buffer.data;
               idx_t bufsize = link_buffer.length;
+#if __MINGW32__
+              char *fname = filesystem_name (rname);
+              if (fname)
+                strcpy (rname + strlen (rname) - strlen (fname), fname);
+#endif
               n = __readlink (rname, buf, bufsize - 1);
               if (n < bufsize - 1)
                 break;
diff --git a/tests/test-canonicalize-lgpl.c b/tests/test-canonicalize-lgpl.c
index c0a5a55150..afc637b073 100644
--- a/tests/test-canonicalize-lgpl.c
+++ b/tests/test-canonicalize-lgpl.c
@@ -279,6 +279,18 @@ main (void)
     free (result2);
   }
 
+#if __MINGW32__
+  /* Check that \\ are changed into / and casing is canonicalized. */
+  {
+    int fd = creat (BASE "/MinGW", 0600);
+    ASSERT (0 <= fd);
+    ASSERT (close (fd) == 0);
+
+    char *result = canonicalize_file_name (BASE "\\mingw");
+    ASSERT (strcmp (result, BASE "/MinGW"));
+    free (result);
+  }
+#endif
 
   /* Cleanup.  */
   ASSERT (remove (BASE "/droot") == 0);
-- 
Jan Nieuwenhuizen <janneke@gnu.org> | GNU LilyPond http://lilypond.org
Freelance IT http://JoyofSource.com | Avatar® http://AvatarAcademy.com


[-- Attachment #3: Type: text/plain, Size: 152 bytes --]


-- 
Jan Nieuwenhuizen <janneke@gnu.org> | GNU LilyPond http://lilypond.org
Freelance IT http://JoyofSource.com | Avatar® http://AvatarAcademy.com

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-10  6:59   ` [PATCH v2] " Jan Nieuwenhuizen
@ 2021-12-10 22:20     ` Paul Eggert
  2021-12-11 11:35       ` Jan Nieuwenhuizen
  2021-12-11 13:41     ` Bruno Haible
  1 sibling, 1 reply; 7+ messages in thread
From: Paul Eggert @ 2021-12-10 22:20 UTC (permalink / raw
  To: Jan Nieuwenhuizen; +Cc: Gnulib bugs

On 12/9/21 22:59, Jan Nieuwenhuizen wrote:
> We are using the canonical form as an automatic include guard, to not
> include the same file twice.

Gnulib's same-inode module is often a better way to attack that problem.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-10 22:20     ` Paul Eggert
@ 2021-12-11 11:35       ` Jan Nieuwenhuizen
  2021-12-12  2:39         ` Paul Eggert
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Nieuwenhuizen @ 2021-12-11 11:35 UTC (permalink / raw
  To: Paul Eggert; +Cc: Gnulib bugs

Paul Eggert writes:

> On 12/9/21 22:59, Jan Nieuwenhuizen wrote:
>> We are using the canonical form as an automatic include guard, to not
>> include the same file twice.
>
> Gnulib's same-inode module is often a better way to attack that problem.

That's an interesting suggestion.  It looks like the inode returned by
stat is always 0 on MinGW (at least it is under wine), so also when
choosing this approach we would need to special case our client code for
MinGW?

Greetings,
Janneke

-- 
Jan Nieuwenhuizen <janneke@gnu.org> | GNU LilyPond http://lilypond.org
Freelance IT http://JoyofSource.com | Avatar® http://AvatarAcademy.com


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-10  6:59   ` [PATCH v2] " Jan Nieuwenhuizen
  2021-12-10 22:20     ` Paul Eggert
@ 2021-12-11 13:41     ` Bruno Haible
  1 sibling, 0 replies; 7+ messages in thread
From: Bruno Haible @ 2021-12-11 13:41 UTC (permalink / raw
  To: Jan Nieuwenhuizen; +Cc: bug-gnulib

Hello Jan,

> > 2) If we wanted to make this function consistent on all platforms, we would
> >    also need to handle
> >      - Linux with mounted VFAT file systems,
> >      - macOS with case-insensitive HFS+,
> >      - different locales on Windows (e.g. to recognize that 'ä' and 'Ä' are
> >        equivalent in Windows installations with Western locales).
> >    And, on macOS with HFS+, also the Unicode canonicalization (NFC vs. NFD).
> 
> If that is what we would really like, someone could work on fixing those
> too, no need for this patch to fix all bugs at once?

The problem with that is that it is very expensive. The implementation
would have to determine the mount point first (via statfs()), then query
the properties of the mounted file system, and additionally on Windows,
use the locale. And for remote file systems, I don't know whether it
will actually be possible to implement: When I have a CIFS or NFS file
system mounted from a macOS machine, how do I know whether on the macOS
side it is case-sensitive or case-insensitive HFS+?

> We
> are using canonicalize-path in Guile, which uses canonicalize_file_name.
> 
> I think of a canonical form of something as a unique, authoritative
> representation.

Yes, the name is misleading. Only certain properties of the file name
are canonicalized by this function.

> We are using the canonical form as an automatic include guard, to not
> include the same file twice.  A user reported a bug: In a large project
> they used an import of a file with alternative casing.

I would suggest to look at two Gnulib modules:
  - 'same'
  - 'same-inode' (like Paul suggested).

They deal with the problem without undue costs.

Bruno





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] canonicalize-lgpl: Canonicalize casing too for MinGW.
  2021-12-11 11:35       ` Jan Nieuwenhuizen
@ 2021-12-12  2:39         ` Paul Eggert
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Eggert @ 2021-12-12  2:39 UTC (permalink / raw
  To: Jan Nieuwenhuizen; +Cc: Gnulib bugs

On 12/11/21 03:35, Jan Nieuwenhuizen wrote:
> It looks like the inode returned by
> stat is always 0 on MinGW (at least it is under wine), so also when
> choosing this approach we would need to special case our client code for
> MinGW?

Yes, you'll have problems on such platforms. For ideas about this, 
please see the last several lines of:

https://www.gnu.org/software/gnulib/manual/html_node/sys_002fstat_002eh.html

(in particular the "partial workaround").

Since you're doing all this merely as an optimization, you can avoid 
doing the optimization on nonstandard platforms where stat is 
unreliable; that way, your code should still work even on nonstandard 
platforms, while having the optimized performance on standard platforms.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-12-12  2:39 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-12-09 18:01 [PATCH] canonicalize-lgpl: Canonicalize casing too for MinGW Jan (janneke) Nieuwenhuizen
2021-12-09 22:06 ` Bruno Haible
2021-12-10  6:59   ` [PATCH v2] " Jan Nieuwenhuizen
2021-12-10 22:20     ` Paul Eggert
2021-12-11 11:35       ` Jan Nieuwenhuizen
2021-12-12  2:39         ` Paul Eggert
2021-12-11 13:41     ` Bruno Haible

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).