hard-locale: make multithread-safe

bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed

* hard-locale: make multithread-safe
@ 2019-12-17 13:45 Bruno Haible
  2019-12-17 14:08 ` Tim Rühsen
  2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
  0 siblings, 2 replies; 23+ messages in thread
From: Bruno Haible @ 2019-12-17 13:45 UTC (permalink / raw)
  To: bug-gnulib

Hi Paul,

Here is a proposed patch to make the hard_locale() function multithread-safe.
This is needed because our mbrtowc() override relies on hard_locale, and
mbrtowc obviously must be multi-thread safe (that's one of its main features,
compared to mbtowc).

The previous hard_locale code tries to guess whether a locale is in fact
a "C"/"POSIX" locale, although it is not apparent from its name. This was
a case to worry about between 1995 and 2000, when many systems did not
have working locales. This has changed: Nowadays nearly all platforms
honour the locale names with some localized behaviour, except OpenBSD,
Minix, and Android. It's not my priority to optimize for these three
systems. But if you want to keep optimizations for these platforms,
we could add #ifs for these platforms.


2019-12-17  Bruno Haible  <bruno@clisp.org>

	hard-locale: Make multithread-safe.
	* lib/hard-locale.h (hard_locale): Move documentation to here.
	* lib/hard-locale.c: Don't include <stdlib.h>.
	(GLIBC_VERSION): Remove macro.
	(hard_locale): Assume that all systems name the "C" and "POSIX" locales
	"C" or "POSIX". Invoke setlocale_null instead of setlocale.
	* modules/hard-locale (Depends-on): Remove strdup. Add setlocale-null.

diff --git a/lib/hard-locale.h b/lib/hard-locale.h
index 8f1da96..160674b 100644
--- a/lib/hard-locale.h
+++ b/lib/hard-locale.h
@@ -20,6 +20,9 @@
 
 # include <stdbool.h>
 
-bool hard_locale (int);
+/* Return true if the specified CATEGORY of the current locale is hard, i.e.
+   different from the C or POSIX locale that has a fixed behavior.
+   CATEGORY must be one of the LC_* values, but not LC_ALL.  */
+extern bool hard_locale (int category);
 
 #endif /* HARD_LOCALE_H_ */diff --git a/lib/hard-locale.c b/lib/hard-locale.c
index dcfcad6..4a2adab 100644
--- a/lib/hard-locale.c
+++ b/lib/hard-locale.c
@@ -21,52 +21,15 @@
 #include "hard-locale.h"
 
 #include <locale.h>
-#include <stdlib.h>
 #include <string.h>
 
-#ifdef __GLIBC__
-# define GLIBC_VERSION __GLIBC__
-#elif defined __UCLIBC__
-# define GLIBC_VERSION 2
-#else
-# define GLIBC_VERSION 0
-#endif
-
-/* Return true if the current CATEGORY locale is hard, i.e. if you
-   can't get away with assuming traditional C or POSIX behavior.  */
 bool
 hard_locale (int category)
 {
-  bool hard = true;
-  char const *p = setlocale (category, NULL);
-
-  if (p)
-    {
-      if (2 <= GLIBC_VERSION)
-        {
-          if (strcmp (p, "C") == 0 || strcmp (p, "POSIX") == 0)
-            hard = false;
-        }
-      else
-        {
-          char *locale = strdup (p);
-          if (locale)
-            {
-              /* Temporarily set the locale to the "C" and "POSIX" locales
-                 to find their names, so that we can determine whether one
-                 or the other is the caller's locale.  */
-              if (((p = setlocale (category, "C"))
-                   && strcmp (p, locale) == 0)
-                  || ((p = setlocale (category, "POSIX"))
-                      && strcmp (p, locale) == 0))
-                hard = false;
+  char locale[SETLOCALE_NULL_MAX];
 
-              /* Restore the caller's locale.  */
-              setlocale (category, locale);
-              free (locale);
-            }
-        }
-    }
+  if (setlocale_null (category, locale, sizeof (locale)))
+    return false;
 
-  return hard;
+  return !(strcmp (locale, "C") == 0 || strcmp (locale, "POSIX") == 0);
 }

diff --git a/modules/hard-locale b/modules/hard-locale
index d4463b7..02be90a 100644
--- a/modules/hard-locale
+++ b/modules/hard-locale
@@ -7,7 +7,7 @@ lib/hard-locale.c
 
 Depends-on:
 stdbool
-strdup
+setlocale-null
 
 configure.ac:
 



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-17 13:45 hard-locale: make multithread-safe Bruno Haible
@ 2019-12-17 14:08 ` Tim Rühsen
       [not found]   ` <20191226221225.GA800@HATZ>
  2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
  1 sibling, 1 reply; 23+ messages in thread
From: Tim Rühsen @ 2019-12-17 14:08 UTC (permalink / raw)
  To: Bruno Haible, bug-gnulib


[-- Attachment #1.1: Type: text/plain, Size: 1364 bytes --]

Hi Bruno, hi gnulib developers,

it's a joy to follow the posts on this list - you (all) surprise,
impress and inspire me with your code but even more with your detailed
explanations / documentations.

Thank you so much for your ongoing work !!!

[E.g. this post made me check my code for mbtowc/mbrtowc (and also to
read the man pages again). In fact I found a call to mbtowc in MT code
(protected by a mutex) without resetting the internal state.]

Regards, Tim

On 12/17/19 2:45 PM, Bruno Haible wrote:
> Hi Paul,
> 
> Here is a proposed patch to make the hard_locale() function multithread-safe.
> This is needed because our mbrtowc() override relies on hard_locale, and
> mbrtowc obviously must be multi-thread safe (that's one of its main features,
> compared to mbtowc).
> 
> The previous hard_locale code tries to guess whether a locale is in fact
> a "C"/"POSIX" locale, although it is not apparent from its name. This was
> a case to worry about between 1995 and 2000, when many systems did not
> have working locales. This has changed: Nowadays nearly all platforms
> honour the locale names with some localized behaviour, except OpenBSD,
> Minix, and Android. It's not my priority to optimize for these three
> systems. But if you want to keep optimizations for these platforms,
> we could add #ifs for these platforms.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-17 13:45 hard-locale: make multithread-safe Bruno Haible
  2019-12-17 14:08 ` Tim Rühsen
@ 2019-12-18  1:45 ` Paul Eggert
  2019-12-18  8:51   ` Bruno Haible
                     ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Paul Eggert @ 2019-12-18  1:45 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib

Thanks, this change looks fine to me.

I do have a qualm in that coreutils (and I assume others) interpret !hard_locale
(LC_COLLATE) as meaning that the locale is unibyte and uses native byte
comparison. As I recall on some platforms (macOS maybe?), the C locale uses
UTF-8 so this interpretation isn't correct.

This qualm is present both before and after the change, though. Perhaps the
problem should be documented, but that could be the subject of a later patch.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
@ 2019-12-18  8:51   ` Bruno Haible
  2019-12-21  6:33     ` Bruno Haible
  2019-12-18 10:29   ` LC_COLLATE in the C locale Bruno Haible
  2019-12-18 10:46   ` hard-locale: make multithread-safe Bruno Haible
  2 siblings, 1 reply; 23+ messages in thread
From: Bruno Haible @ 2019-12-18  8:51 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 534 bytes --]

> Thanks, this change looks fine to me.

Thanks. But before I do the rewrite, let me add a unit test first.

It's a bit tricky only on musl libc, NetBSD, and OpenBSD. Other
than that, there's a test failure on Android 4.3, where
  - setlocale (category, NULL) always returns NULL.
  - hard_locale of the initial locale returns true, but should be false.


2019-12-18  Bruno Haible  <bruno@clisp.org>

	hard-locale: Add test.
	* tests/test-hard-locale.c: New file.
	* tests/locale.c: New file.
	* modules/hard-locale-tests: New file.


[-- Attachment #2: 0001-hard-locale-Add-test.patch --]
[-- Type: text/x-patch, Size: 8202 bytes --]

From 2d1b7501b00dce53481b58b3808313f264c720ea Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Wed, 18 Dec 2019 09:41:31 +0100
Subject: [PATCH] hard-locale: Add test.

* tests/test-hard-locale.c: New file.
* tests/locale.c: New file.
* modules/hard-locale-tests: New file.
---
 ChangeLog                 |   7 +++
 modules/hard-locale-tests |  18 ++++++++
 tests/locale.c            |  81 ++++++++++++++++++++++++++++++++++
 tests/test-hard-locale.c  | 109 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 215 insertions(+)
 create mode 100644 modules/hard-locale-tests
 create mode 100644 tests/locale.c
 create mode 100644 tests/test-hard-locale.c

diff --git a/ChangeLog b/ChangeLog
index 7988bec..2fffaf2 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2019-12-18  Bruno Haible  <bruno@clisp.org>
+
+	hard-locale: Add test.
+	* tests/test-hard-locale.c: New file.
+	* tests/locale.c: New file.
+	* modules/hard-locale-tests: New file.
+
 2019-12-17  Paul Eggert  <eggert@cs.ucla.edu>
 
 	dfa: do not match invalid UTF-8
diff --git a/modules/hard-locale-tests b/modules/hard-locale-tests
new file mode 100644
index 0000000..cdc084e
--- /dev/null
+++ b/modules/hard-locale-tests
@@ -0,0 +1,18 @@
+Files:
+tests/test-hard-locale.c
+tests/locale.c
+
+Depends-on:
+
+configure.ac:
+AC_REQUIRE([AC_CANONICAL_HOST])
+case "$host_os" in
+  *-musl*) AC_DEFINE([MUSL_LIBC], [1], [Define to 1 on musl libc.]) ;;
+esac
+dnl Distinguish OpenBSD >= 6.2 from OpenBSD < 6.2.
+AC_CHECK_FUNCS_ONCE([duplocale])
+
+Makefile.am:
+TESTS += test-hard-locale
+check_PROGRAMS += test-hard-locale
+noinst_PROGRAMS += locale
diff --git a/tests/locale.c b/tests/locale.c
new file mode 100644
index 0000000..ac587ad
--- /dev/null
+++ b/tests/locale.c
@@ -0,0 +1,81 @@
+/* Program that prints the names of the categories of the current locale.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <bruno@clisp.org>, 2019.  */
+
+#include <config.h>
+
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+/* Specification:
+   <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/locale.html>
+   Here we implement only the invocation without any command-line options.  */
+
+static const char *
+defaulted_getenv (const char *variable)
+{
+  const char *value = getenv (variable);
+  return (value != NULL ? value : "");
+}
+
+static void
+print_category (int category, const char *variable)
+{
+  const char *value = defaulted_getenv (variable);
+  if (value[0] != '\0' && defaulted_getenv ("LC_ALL")[0] == '\0')
+    /* The variable is set in the environment and not overridden by LC_ALL.  */
+    printf ("%s=%s\n", variable, value);
+  else
+    printf ("%s=\"%s\"\n", variable, setlocale (category, NULL));
+}
+
+int
+main ()
+{
+  setlocale (LC_ALL, "");
+
+  printf ("LANG=%s\n", defaulted_getenv ("LANG"));
+  print_category (LC_CTYPE, "LC_CTYPE");
+  print_category (LC_NUMERIC, "LC_NUMERIC");
+  print_category (LC_TIME, "LC_TIME");
+  print_category (LC_COLLATE, "LC_COLLATE");
+  print_category (LC_MONETARY, "LC_MONETARY");
+  print_category (LC_MESSAGES, "LC_MESSAGES");
+#ifdef LC_PAPER
+  print_category (LC_PAPER, "LC_PAPER");
+#endif
+#ifdef LC_NAME
+  print_category (LC_NAME, "LC_NAME");
+#endif
+#ifdef LC_ADDRESS
+  print_category (LC_ADDRESS, "LC_ADDRESS");
+#endif
+#ifdef LC_TELEPHONE
+  print_category (LC_TELEPHONE, "LC_TELEPHONE");
+#endif
+#ifdef LC_MEASUREMENT
+  print_category (LC_MEASUREMENT, "LC_MEASUREMENT");
+#endif
+#ifdef LC_IDENTIFICATION
+  print_category (LC_IDENTIFICATION, "LC_IDENTIFICATION");
+#endif
+
+  printf ("LC_ALL=%s\n", defaulted_getenv ("LC_ALL"));
+
+  return 0;
+}
diff --git a/tests/test-hard-locale.c b/tests/test-hard-locale.c
new file mode 100644
index 0000000..f83dc27
--- /dev/null
+++ b/tests/test-hard-locale.c
@@ -0,0 +1,109 @@
+/* Test of determination whether a locale is different from the "C" locale.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <bruno@clisp.org>, 2019.  */
+
+#include <config.h>
+
+#include "hard-locale.h"
+
+#include <locale.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+
+/* True if all locale names are accepted and all locales are trivial.
+   This is the case e.g. on OpenBSD 3.8.  */
+static bool all_trivial;
+
+static int
+test_one (const char *name, int failure_bitmask)
+{
+  if (setlocale (LC_ALL, name) != NULL)
+    {
+      bool expected;
+
+      /* musl libc has special code for the C.UTF-8 locale; other than that,
+         all locale names are accepted and all locales are trivial.
+         OpenBSD returns the locale name that was set, but we don't know how it
+         behaves under the hood.  */
+#if defined MUSL_LIBC || defined __OpenBSD__
+      expected = true;
+#else
+      expected = !all_trivial;
+#endif
+      if (hard_locale (LC_CTYPE) != expected)
+        {
+          if (expected)
+            fprintf (stderr, "Unexpected: The category LC_CTYPE of the locale '%s' is not equivalent to C or POSIX.\n",
+                     name);
+          else
+            fprintf (stderr, "Unexpected: The category LC_CTYPE of the locale '%s' is equivalent to C or POSIX.\n",
+                     name);
+          return failure_bitmask;
+        }
+
+      /* On NetBSD 7.0, some locales such as de_DE.ISO8859-1 and de_DE.UTF-8
+         have the LC_COLLATE category set to "C".
+         Similarly, on musl libc, with the C.UTF-8 locale.  */
+#if defined __NetBSD__
+      expected = false;
+#elif defined MUSL_LIBC
+      expected = strcmp (name, "C.UTF-8") != 0;
+#elif defined __OpenBSD__ && HAVE_DUPLOCALE /* OpenBSD >= 6.2 */
+      expected = true;
+#else
+      expected = !all_trivial;
+#endif
+      if (hard_locale (LC_COLLATE) != expected)
+        {
+          if (expected)
+            fprintf (stderr, "Unexpected: The category LC_COLLATE of the locale '%s' is not equivalent to C or POSIX.\n",
+                     name);
+          else
+            fprintf (stderr, "Unexpected: The category LC_COLLATE of the locale '%s' is equivalent to C or POSIX.\n",
+                     name);
+          return failure_bitmask;
+        }
+    }
+  return 0;
+}
+
+int
+main ()
+{
+  int fail = 0;
+
+  /* The initial locale is the "C" or "POSIX" locale.  */
+  if (hard_locale (LC_CTYPE) || hard_locale (LC_COLLATE))
+    {
+      fprintf (stderr, "The initial locale should not be hard!\n");
+      fail |= 1;
+    }
+
+  all_trivial = (setlocale (LC_ALL, "foobar") != NULL);
+
+  fail |= test_one ("de", 2);
+  fail |= test_one ("de_DE", 4);
+  fail |= test_one ("de_DE.ISO8859-1", 8);
+  fail |= test_one ("de_DE.iso88591", 8);
+  fail |= test_one ("de_DE.UTF-8", 16);
+  fail |= test_one ("de_DE.utf8", 16);
+  fail |= test_one ("german", 32);
+  fail |= test_one ("C.UTF-8", 64);
+
+  return fail;
+}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: LC_COLLATE in the C locale
  2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
  2019-12-18  8:51   ` Bruno Haible
@ 2019-12-18 10:29   ` Bruno Haible
  2019-12-18 16:27     ` Paul Eggert
  2019-12-18 10:46   ` hard-locale: make multithread-safe Bruno Haible
  2 siblings, 1 reply; 23+ messages in thread
From: Bruno Haible @ 2019-12-18 10:29 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

Hi Paul,

> I do have a qualm in that coreutils (and I assume others) interpret !hard_locale
> (LC_COLLATE) as meaning that the locale is unibyte and uses native byte
> comparison.

Isn't this warranted by section "LC_COLLATE Category in the POSIX Locale" in
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html> ?

> As I recall on some platforms (macOS maybe?), the C locale uses
> UTF-8 so this interpretation isn't correct.

UTF-8 has the nice property that byte-per-byte comparison and codepoint-per-
codepoint comparison are equivalent. If the encoding was not UTF-8, but
e.g. GB18030, I would agree that there is a problem. But there is no C
locale with GB18030 encoding on any platform.

Bruno

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
  2019-12-18  8:51   ` Bruno Haible
  2019-12-18 10:29   ` LC_COLLATE in the C locale Bruno Haible
@ 2019-12-18 10:46   ` Bruno Haible
  2019-12-24 23:36     ` Bruno Haible
  2 siblings, 1 reply; 23+ messages in thread
From: Bruno Haible @ 2019-12-18 10:46 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

Paul Eggert wrote:
> Thanks, this change looks fine to me.

I'm pushing it. Tested: on all platforms, the unit test still passes.
Additionally, the test failure on Android is gone.

It adds a new link requirement for the users of this module. It's
necessary because on AIX, -lpthread is necessary for an MT-safe
setlocale_null and therefore also for 'hard-locale'.


2019-12-18  Bruno Haible  <bruno@clisp.org>

	hard-locale: Make multithread-safe.
	* lib/hard-locale.h (hard_locale): Move documentation to here.
	* lib/hard-locale.c: Don't include <stdlib.h>.
	(GLIBC_VERSION): Remove macro.
	(hard_locale): Assume that all systems name the "C" and "POSIX" locales
	"C" or "POSIX". Invoke setlocale_null instead of setlocale.
	* modules/hard-locale (Depends-on): Remove strdup. Add setlocale-null.
	(configure.ac): Require gl_FUNC_SETLOCALE_NULL. Set LIB_HARD_LOCALE.
	(Link): New section.
	* modules/hard-locale-tests (Makefile.am): Link test-hard-locale against
	$(LIB_HARD_LOCALE).

diff --git a/lib/hard-locale.h b/lib/hard-locale.h
index 8f1da96..160674b 100644
--- a/lib/hard-locale.h
+++ b/lib/hard-locale.h
@@ -20,6 +20,9 @@
 
 # include <stdbool.h>
 
-bool hard_locale (int);
+/* Return true if the specified CATEGORY of the current locale is hard, i.e.
+   different from the C or POSIX locale that has a fixed behavior.
+   CATEGORY must be one of the LC_* values, but not LC_ALL.  */
+extern bool hard_locale (int category);
 
 #endif /* HARD_LOCALE_H_ */
diff --git a/lib/hard-locale.c b/lib/hard-locale.c
index dcfcad6..4a2adab 100644
--- a/lib/hard-locale.c
+++ b/lib/hard-locale.c
@@ -21,52 +21,15 @@
 #include "hard-locale.h"
 
 #include <locale.h>
-#include <stdlib.h>
 #include <string.h>
 
-#ifdef __GLIBC__
-# define GLIBC_VERSION __GLIBC__
-#elif defined __UCLIBC__
-# define GLIBC_VERSION 2
-#else
-# define GLIBC_VERSION 0
-#endif
-
-/* Return true if the current CATEGORY locale is hard, i.e. if you
-   can't get away with assuming traditional C or POSIX behavior.  */
 bool
 hard_locale (int category)
 {
-  bool hard = true;
-  char const *p = setlocale (category, NULL);
-
-  if (p)
-    {
-      if (2 <= GLIBC_VERSION)
-        {
-          if (strcmp (p, "C") == 0 || strcmp (p, "POSIX") == 0)
-            hard = false;
-        }
-      else
-        {
-          char *locale = strdup (p);
-          if (locale)
-            {
-              /* Temporarily set the locale to the "C" and "POSIX" locales
-                 to find their names, so that we can determine whether one
-                 or the other is the caller's locale.  */
-              if (((p = setlocale (category, "C"))
-                   && strcmp (p, locale) == 0)
-                  || ((p = setlocale (category, "POSIX"))
-                      && strcmp (p, locale) == 0))
-                hard = false;
+  char locale[SETLOCALE_NULL_MAX];
 
-              /* Restore the caller's locale.  */
-              setlocale (category, locale);
-              free (locale);
-            }
-        }
-    }
+  if (setlocale_null (category, locale, sizeof (locale)))
+    return false;
 
-  return hard;
+  return !(strcmp (locale, "C") == 0 || strcmp (locale, "POSIX") == 0);
 }
diff --git a/modules/hard-locale b/modules/hard-locale
index d4463b7..df07d4a 100644
--- a/modules/hard-locale
+++ b/modules/hard-locale
@@ -7,9 +7,12 @@ lib/hard-locale.c
 
 Depends-on:
 stdbool
-strdup
+setlocale-null
 
 configure.ac:
+AC_REQUIRE([gl_FUNC_SETLOCALE_NULL])
+LIB_HARD_LOCALE="$LIB_SETLOCALE_NULL"
+AC_SUBST([LIB_HARD_LOCALE])
 
 Makefile.am:
 lib_SOURCES += hard-locale.c
@@ -17,6 +20,9 @@ lib_SOURCES += hard-locale.c
 Include:
 "hard-locale.h"
 
+Link:
+$(LIB_HARD_LOCALE)
+
 License:
 LGPLv2+
 
diff --git a/modules/hard-locale-tests b/modules/hard-locale-tests
index cdc084e..a501781 100644
--- a/modules/hard-locale-tests
+++ b/modules/hard-locale-tests
@@ -16,3 +16,4 @@ Makefile.am:
 TESTS += test-hard-locale
 check_PROGRAMS += test-hard-locale
 noinst_PROGRAMS += locale
+test_hard_locale_LDADD = $(LDADD) @LIB_HARD_LOCALE@



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: LC_COLLATE in the C locale
  2019-12-18 10:29   ` LC_COLLATE in the C locale Bruno Haible
@ 2019-12-18 16:27     ` Paul Eggert
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Eggert @ 2019-12-18 16:27 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib

On 12/18/19 2:29 AM, Bruno Haible wrote:
> Hi Paul,
> 
>> I do have a qualm in that coreutils (and I assume others) interpret !hard_locale
>> (LC_COLLATE) as meaning that the locale is unibyte and uses native byte
>> comparison.
> Isn't this warranted by section "LC_COLLATE Category in the POSIX Locale" in
> <https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html> ?

I don't see where that section requires unibyte.

>> As I recall on some platforms (macOS maybe?), the C locale uses
>> UTF-8 so this interpretation isn't correct.
> UTF-8 has the nice property that byte-per-byte comparison and codepoint-per-
> codepoint comparison are equivalent.

True, so the code that assumes strcmp == strcoll should work. But I think some
code specifically assumes unibyte. Presumably that code should also check
MB_CUR_MAX, which should be enough in practice (even though it doesn't suffice
in theory).


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-18  8:51   ` Bruno Haible
@ 2019-12-21  6:33     ` Bruno Haible
  0 siblings, 0 replies; 23+ messages in thread
From: Bruno Haible @ 2019-12-21  6:33 UTC (permalink / raw)
  To: bug-gnulib

> 2019-12-18  Bruno Haible  <bruno@clisp.org>
> 
> 	hard-locale: Add test.
> 	* tests/test-hard-locale.c: New file.
> 	* tests/locale.c: New file.
> 	* modules/hard-locale-tests: New file.

The weekly continuous integration build [1] detected that this causes a
testdir build failure. Namely, the program name 'locale' conflicts with
the directory name 'locale' used by the quotearg tests.

[1] https://gitlab.com/gnulib/gnulib-ci/pipelines


2019-12-21  Bruno Haible  <bruno@clisp.org>

	quotearg tests: Fix conflict with hard-locale tests.
	* tests/testlocale: Renamed from tests/locale.
	* modules/quotearg-tests (Files): Update.
	* tests/test-quotearg.sh (LOCALEDIR): Likewise.

diff --git a/modules/quotearg-tests b/modules/quotearg-tests
index 486956a..77fde80 100644
--- a/modules/quotearg-tests
+++ b/modules/quotearg-tests
@@ -6,8 +6,8 @@ tests/test-quotearg.sh
 tests/test-quotearg.c
 tests/test-quotearg.h
 tests/macros.h
-tests/locale/fr/LC_MESSAGES/test-quotearg.po
-tests/locale/fr/LC_MESSAGES/test-quotearg.mo
+tests/testlocale/fr/LC_MESSAGES/test-quotearg.po
+tests/testlocale/fr/LC_MESSAGES/test-quotearg.mo
 m4/locale-fr.m4
 m4/codeset.m4
 
diff --git a/tests/test-quotearg.sh b/tests/test-quotearg.sh
index 2862f87..ae33153 100755
--- a/tests/test-quotearg.sh
+++ b/tests/test-quotearg.sh
@@ -22,5 +22,5 @@ if test $locale = French_France.1252; then
   locale=fr_FR.CP1252
 fi
 
-LOCALE=$locale LOCALEDIR="$srcdir/locale" \
+LOCALE=$locale LOCALEDIR="$srcdir/testlocale" \
 ${CHECKER} ./test-quotearg${EXEEXT}



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: hard-locale: make multithread-safe
  2019-12-18 10:46   ` hard-locale: make multithread-safe Bruno Haible
@ 2019-12-24 23:36     ` Bruno Haible
  0 siblings, 0 replies; 23+ messages in thread
From: Bruno Haible @ 2019-12-24 23:36 UTC (permalink / raw)
  To: bug-gnulib

> 2019-12-18  Bruno Haible  <bruno@clisp.org>
> 	* modules/hard-locale (Depends-on): Remove strdup. Add setlocale-null.

This change has an effect on gettext's libgettextpo: Through this new
dependency chain
   hard-locale -> setlocale-null -> locale
there now is a gnulib-generated locale.h file in libgettextpo, and it
makes a '#define setlocale rpl_locale' definition visible, which in turn
cause the linking of libgettextpo to fail on mingw.

Ultimately this is caused by having two gnulib-tool invocations in the
scope of the same configure file; this is more fragile than having
them under different configure files. But anyway, there is an easy fix
in gnulib:

2019-12-24  Bruno Haible  <bruno@clisp.org>

	localcharset: Avoid referencing rpl_setlocale on native Windows.
	* lib/localcharset.c (setlocale): Undefine.

diff --git a/lib/localcharset.c b/lib/localcharset.c
index ffc0d8f..5dd4429 100644
--- a/lib/localcharset.c
+++ b/lib/localcharset.c
@@ -58,6 +58,9 @@
 #elif defined WINDOWS_NATIVE
 # define WIN32_LEAN_AND_MEAN
 # include <windows.h>
+  /* For the use of setlocale() below, the Gnulib override in setlocale.c is
+     not needed; see the platform lists in setlocale_null.m4.  */
+# undef setlocale
 #endif
 #if defined OS2
 # define INCL_DOS

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: string types
       [not found]   ` <20191226221225.GA800@HATZ>
@ 2019-12-27 10:51     ` Bruno Haible
  2019-12-28 13:14       ` ag
  2019-12-29 21:24       ` Tim Rühsen
  0 siblings, 2 replies; 23+ messages in thread
From: Bruno Haible @ 2019-12-27 10:51 UTC (permalink / raw)
  To: ag; +Cc: Tim Rühsen, bug-gnulib

Aga wrote:
> I do not know if
> you can (or if it is possible, how it can be done), extract with a way a specific
> a functionality from gnulib, with the absolute necessary code and only that.

gnulib-tool does this. With its --avoid option, the developer can even customize
their notion of "absolutely necessary".

> In a myriad of codebases a string type is implemented at least as:
>   size_t mem_size;
>   size_t num_bytes;
>   char *bytes;

This is actually a string-buffer type. A string type does not need two size_t
members. Long-term experience has shown that using different types for string
and string-buffer is a win, because
  - a string can be put in a read-only virtual memory area, thus enforcing
    immutability (-> reducing multithread problems),
  - providing primitives for string allocation reduces the amount of buffer
    overflow bugs that otherwise occur in this area. [1]

Unfortunately, the common string type in C is 'char *' with NUL termination,
and a different type is hard to establish
  - because developers already know how to use 'char *',
  - because existing functions like printf consume 'char *' strings.
  - Few programs have had the need to correctly handles strings with embedded
    NULs.

> An extended ustring (unicode|utf8) type can include information for its bytes with
> character semantics, like:
>  (utf8 typedef'ed as signed int)
>   utf8 code;   // the integer representation
>   int len;     // the number of the needed bytes
>   int width;   // the number of the occupied cells
>   char buf[5]; // and probably the character representation

Such a type would have a niche use, IMO, because
  - 99% of the processing would not need to access the width (screen columns) - so
    why spend CPU time and RAM to store it and keep it up-to-date?
  - 80% of the processing does not care about the Unicode code points either,
    and libraries like libunistring can do the Unicode-aware processing.

> But the programmer mind would be probably best
> if could concentrate to how to express the thought (with whatever meaning of what we
> are calling "thought") and follow this flow, or if could concentrate the energy to
> understand the intentions (while reading) of the code (instead of wasting self with
> the "details" of the code) and finally to the actual algorithm (usually conditions
> that can or can't be met).

That is the idea behind the container types (list, map) in gnulib. However, I don't
see how to reasonably transpose this principle to string types.

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-27 10:51     ` string types Bruno Haible
@ 2019-12-28 13:14       ` ag
  2019-12-28 18:28         ` Paul Eggert
  2019-12-29  9:19         ` Bruno Haible
  2019-12-29 21:24       ` Tim Rühsen
  1 sibling, 2 replies; 23+ messages in thread
From: ag @ 2019-12-28 13:14 UTC (permalink / raw)
  To: bug-gnulib, Bruno Haible, Paul Eggert; +Cc: Tim Rühsen

Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
>  - providing primitives for string allocation reduces the amount of buffer
>    overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, i remember this thread, though at the time i couldn't understand some bits.

>> ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).

> That is the idea behind the container types (list, map) in gnulib. However, I don't
> see how to reasonably transpose this principle to string types.

Ok, let us try, so allow me to summarize with some of (my unqualified) assumptions
(please correct):

  - glibc malloc can request at most PTRDIFF_MAX

  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX in 32bit)

  - SIZE_MAX as (size_t) (-1)

  - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals to
    SIZE_MAX?

  - the returned value of the *printf family of functions dictates their
    limits/range, as they return an int, this can be as INT_MAX mostly

Some concerns:

  - truncation errors should be caught

  - memory checkers should catch overflows

  - as since there is a "risk"¹ that someone has to take at some point (either the
    programmer or the underlying library code (as strdup() does)), the designed
    interface should lower those risks

There is a proposal from Eric Sanchis to Austin group at 9 Jun 2016, for a String
copy/concatenation interface, that his functions have both the allocated size and
the number of bytes to be written as arguments (some i will inline them here, since
i was unable to find his mail in the Posix mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited for C), to
implement my own str_cp, which goes like this:

size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
  size_t len = (NULL is src ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}

size_t byte_cp (char *dest, const char *src, size_t nelem) {
  const char *sp = src;
  size_t len = 0;

  while (len < nelem and *sp) {
    dest[len] = *sp++;
    len++;
  }

  return len;
}

Of course it can be done better, but here we have a low level function (byte_cp),
that does only the required checks and which returns the actual bytes written to
`dest', while str_cp checks if `src' is NULL and if `nelem' is bigger than `dest_len'
(if it is then copies at least `dest_len' - 1). It returns 0 or the actual written
bytes.

Since this returns the actual bytes written, it is up to the programmer to check
if truncation happened, but there is no possibility to copy more than `dest_len' - 1.

Based on the above assumptions this can be extended. First instead of size_t to
return ssize_t, so functions can return -1 and set errno accordingly.

Eric Sanchis in his proposal does it a bit different because in his functions adds
an extra argument as size_t, that uses this to control the behavior of the function
(what it will do in the case that destination length is less than source len).

He uses an int as a returned value which either is 0/1 on succesful operation, the
following:
#define   OKNOTRUNC  0		/* copy/concatenation performed without truncation */
#define   OKTRUNC    1		/* copy/concatenation performed with truncation */

And below is the extra information passed as fifth argument:
#define   TRUNC      0		/* truncation allowed */
#define   NOTRUNC    1		/* truncation not allowed */

In the case of an error, returns > 0 which is either:
#define   EDSTPAR   -1		/* Error : bad dst parameters */
#define   ESRCPAR   -2		/* Error : bad src parameters */
#define   EMODPAR   -3		/* Error : bad mode parameter */
#define   ETRUNC    -4		/* Error : not enough space to copy/concatenate
							   and truncation not allowed */

Now combining all this and if the assumptions are correct, gnulib can return
ssize_t and uses this to make it's functions to work up to SIZE_MAX and uses
either Eric's interface or to set errno accordingly.

But to me a function call like:
  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)
is quite common C's way to do things, plus we have a way to catch truncation and
not to go out of bounds at the same time.

Of course such operations are tied with malloc().
I've read the gnulib document yesteday and i saw that gnulib wraps malloc() with a
function that (quite logically) aborts execution and even allows to set a callback
function.

In my humble opinion there is also the choise to choose reallocarray() from OpenBSD,
which always checks for integer overflows with the following way:

#define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
#define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
 (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
  (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Now, you also said to the abovementioned thread:

>> So, what we would need is are functions

    char * substring (const char *string, size_t length);
    char * concatenated_string2 (const char *string1, size_t length1,
                                 const char *string2, size_t length2);
    char * concatenated_string3 (const char *string1, size_t length1,
                                 const char *string2, size_t length2,
                                 const char *string3, size_t length3);
    ...

>> where the length arguments are set to SIZE_MAX to designate the entire
 string.

But exactly this why a string_buffer is preffered in many occations like these,
plus also it has in constant time access to the byte length.

> > An extended ustring (unicode|utf8) type can include information for its bytes with
> > character semantics, like:
> >  (utf8 typedef'ed as signed int)
> >   utf8 code;   // the integer representation
> >   int len;     // the number of the needed bytes
> >   int width;   // the number of the occupied cells
> >   char buf[5]; // and probably the character representation
>
> Such a type would have a niche use, IMO, because
>   - 99% of the processing would not need to access the width (screen columns) - so
>     why spend CPU time and RAM to store it and keep it up-to-date?
>   - 80% of the processing does not care about the Unicode code points either,
>     and libraries like libunistring can do the Unicode-aware processing.

Of course is specialized but it's not uncommon those functions/operations as many
need this information. And i forgot also to include utf8 validation.
In that case as unfortunately there is not a way in C to exclude or include fields
in a structure and since i'm talking here mostly for the functionality, rather a
specific type and since you mentioned libunistring, perhaps would be wise to offer
this functionality in gnulib (like you do for iconv and readline).

But really the level of abstraction that maybe will help C developers is mostly
something (very simplified) like this:

inline long fget_size (FILE *);

implemented (probably) as:

long cur_p = ftell (fp);
fseek (fp, 0, SEEK_END);
size = ftell (fp);
fseek (fp, cur_p, SEEK_SET);
return size;

There is no penalty here, it just will be a common way and expected way to do things.
Maybe then writting and reading code in C will be much more enjoyable and C can be
considered as an expressional language.

But all this needs a standard. Perhaps gnulib can lead those.

> Bruno

Best,
  Αγαθοκλής

¹. https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00004.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-28 13:14       ` ag
@ 2019-12-28 18:28         ` Paul Eggert
  2019-12-28 20:44           ` ag
  2019-12-29  9:19         ` Bruno Haible
  1 sibling, 1 reply; 23+ messages in thread
From: Paul Eggert @ 2019-12-28 18:28 UTC (permalink / raw)
  To: ag, bug-gnulib, Bruno Haible; +Cc: Tim Rühsen

On 12/28/19 5:14 AM, ag wrote:

>   - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
>     (PTRDIFF_MAX is INT_MAX in 32bit)

PTRDIFF_MAX can exceed SIZE_MAX, in the sense that POSIX and C allows it and it
could be useful on 32-bit platforms for size_t to be 32 bits and ptrdiff_t to be
64 bits. Although I don't know of any platforms doing things that way, I prefer
not to assume that PTRDIFF_MAX <= SIZE_MAX so as to allow for the possibility.

>   - SIZE_MAX as (size_t) (-1)
> 
>   - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals to
>     SIZE_MAX?

ssize_t can be either narrower or wider than size_t, according to POSIX.
Historically ssize_t was 32 bits and size_t 64 bits on some platforms, and
though I don't know of any current platforms doing that it's easy to not make
assumptions here.

> Based on the above assumptions this can be extended. First instead of size_t to
> return ssize_t, so functions can return -1 and set errno accordingly.

It's better to use ptrdiff_t for this sort of thing, since it's hardwired into
the C language (you can't do any better than ptrdiff_t anyway, if you use
pointer subtraction), whereas ssize_t is merely in POSIX and is narrower than
ptrdiff_t on some (obsolete?) platforms.

> In my humble opinion there is also the choise to choose reallocarray() from OpenBSD,
> which always checks for integer overflows with the following way:
> 
> #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> #define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
>  (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
>   (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Ouch. That code is not good. An unsigned division at runtime to do memory
allocation? Gnulib does better than that already. Also, Glibc has some code in
this area that we could migrate into Gnulib, that could be better yet.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-28 18:28         ` Paul Eggert
@ 2019-12-28 20:44           ` ag
  2019-12-28 22:40             ` Paul Eggert
  0 siblings, 1 reply; 23+ messages in thread
From: ag @ 2019-12-28 20:44 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Bruno Haible, Tim Rühsen, bug-gnulib

Hi Paul,

On Sat, Dec 28, at 10:28 Paul Eggert wrote:
> > Based on the above assumptions this can be extended. First instead of size_t to
> > return ssize_t, so functions can return -1 and set errno accordingly.
> 
> It's better to use ptrdiff_t for this sort of thing, since it's hardwired into
> the C language (you can't do any better than ptrdiff_t anyway, if you use
> pointer subtraction), whereas ssize_t is merely in POSIX and is narrower than
> ptrdiff_t on some (obsolete?) platforms.

So, let's say we designed this thing without obligating to the past and thinking for
the next hundred years (of course with the current knowledge and to lessons from the
past), and wanted to make it work with malloc and string type functions, as best it
can be done and without worries for overflows and unsigned divisions and all this
kind of confusing things that hunts us altogether after so many years that things
should have been settled by now... is your opininion that this is adequate?

typedef ptrdiff_t msize_t (m for memory here)

> > #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
> > #define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
> >  (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
> >   (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))
> 
> Ouch. That code is not good. An unsigned division at runtime to do memory
> allocation? Gnulib does better than that already. Also, Glibc has some code in
> this area that we could migrate into Gnulib, that could be better yet.

Sorry, i don't have time to do it right now - as i just escaped from a snow-storm -
but i will check this for atleast not to spread misleading information (is quite
possible my fault here), so thanks for your comment.

By the way Paul and since i'm self taught by practical experience kind of human
being and joking with zoi here said that at least my teacher is a hall of famer
in the computing history. Isn't this life great!
So true this is also a school for free afterall.

My Honor,
 Αγαθοκλής

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-28 20:44           ` ag
@ 2019-12-28 22:40             ` Paul Eggert
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Eggert @ 2019-12-28 22:40 UTC (permalink / raw)
  To: ag; +Cc: Bruno Haible, Tim Rühsen, bug-gnulib

On 12/28/19 12:44 PM, ag wrote:
> is your opininion that this is adequate?
> 
> typedef ptrdiff_t msize_t (m for memory here)

Yes, something like that. dfa.c calls this type 'idx_t', which is a couple of
characters shorter.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-28 13:14       ` ag
  2019-12-28 18:28         ` Paul Eggert
@ 2019-12-29  9:19         ` Bruno Haible
  2019-12-29 17:13           ` ag
  2019-12-29 20:02           ` ag
  1 sibling, 2 replies; 23+ messages in thread
From: Bruno Haible @ 2019-12-29  9:19 UTC (permalink / raw)
  To: ag; +Cc: Tim Rühsen, Paul Eggert, bug-gnulib

Aga wrote:
>   - the returned value of the *printf family of functions dictates their
>     limits/range, as they return an int, this can be as INT_MAX mostly

Yes, we need new implementations of the *asprintf functions that are not
limited to returning strings of maximum length INT_MAX.

>   - as since there is a "risk"¹ that someone has to take at some point (either the
>     programmer or the underlying library code (as strdup() does)), the designed
>     interface should lower those risks

I agree with the goal. How to do it precisely, is an art however.

> In the case of an error, returns > 0 which is either:
> #define   EDSTPAR   -1		/* Error : bad dst parameters */
> #define   ESRCPAR   -2		/* Error : bad src parameters */
> #define   EMODPAR   -3		/* Error : bad mode parameter */
> #define   ETRUNC    -4		/* Error : not enough space to copy/concatenate
> 							   and truncation not allowed */

I don't think an interface for string concatenation with that many error
cases will be successful. Programmers are lazy, therefore
  - some will not check the errors at all,
  - some will only check for the fourth one (because "I'm not passing invalid
    arguments, after all"),
  - among those few that implement all 4 checks, half will get it wrong
    (that's my experience with similarly complex functions like mbrtowc() or
    iconv()).

For an interface to be successful, it needs to be simpler than that.

Bruno



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-29  9:19         ` Bruno Haible
@ 2019-12-29 17:13           ` ag
  2019-12-29 20:02           ` ag
  1 sibling, 0 replies; 23+ messages in thread
From: ag @ 2019-12-29 17:13 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Tim Rühsen, Paul Eggert, bug-gnulib

On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> I agree with the goal. How to do it precisely, is an art however.

Ok, let's see what do we have until now.

First the Base: (easy) that is malloc maximum requested size from the kernel,
and that is PTRDIFF_MAX. We also here we have to forget SIZE_MAX as it is not
guaranteed that PTRDIFF_MAX equals to SIZE_MAX.

Second the (function returned value) Requirenment: (easy) a signed type.
There is an agreement that introduced functions should return on error -1,
else the interface will be complicated and we do not want complication.
So ptrdiff_t is adequate, since ptrdiff_t is in standard C and include'd
with stddef.h.

The rest:

Catching out of bounds conditions: (rather easy and already implemented in
snprintf) after the destination argument will follow an argument with the
allocated destination size (from the stack or from the heap). Now, snprintf
uses size_t here, but (question) isn't this a contradiction with the above
or not? Not probably but it's better ask to de-confuse things (as clarity
is a requirenment (semantics should be able to understood by mere humans)).

Another concern. What if destination is NULL. Should the internal functions
procceed by allocating a buffer with the requested size? What they will do
if the requested size <= 0?
There are preceding's here, like realpath() which allocates a buffer and
it's up to the user to free it.

Also. Declared as static internal variables considered harmfull. But sometimes
is desirable to have some data in a private place protected or handy to work
without side effects. This is solved however with the new im(muttable) module.

Catching truncation (first priority maybe): There is a choice to complicate
a bit the interface to return more values than -1, but this rejected by the
perfect legal assumption that humans are lazy, probably because they have been
exposed to try/catch (not bad if you ask but innapropriated for C).
The other thing it could be done is to return -1 and set errno accordingly with
the error. But such an error doesn't exists or exists? So ETRUNC should be
introduced. Few programmers will take the risk to make their program dependable
in something that is not standard, but perhaps they will (doubtfull though at
this stage).

The other thing that left is to check the returned value. Now. In snprintf(3)
there are notes about this and a method to calculate truncation (misty though).

       The functions snprintf() and vsnprintf() do not write more than size
       bytes (including the terminating null byte ('\0')).  If the output was
       truncated due to this limit, then the return value is the number of
       characters (excluding the terminating null byte) which would have been
       written to the final string if enough space had been available.  Thus,
       a return value of size or more means that the output was truncated.
       (See also below under NOTES.)

"which would have been written?" why not always the bytes that had been written?

Ok i got it after a break; still difficult to parse though and for what? We
have to admit that this a programmer error. [Sh|H]e should know her strings.
But we still want to help here. How? Three choises comes to mind.

1.
Use a bit map flag argument to control the function behavior. But this adds
verbosity but at the same time allows extensibility. Which conditions could
be covered with that? Perhaps to return an error if destination is NULL and
the function directed with the flag to return in this condition. Same with
the source. Very convenient but still verbose as you have to learn another
set of FLAGS.

2.
Introduce wrappers. Actually wrappers maybe will be used either way.
Or introduce a complete set of same functions, post-fixed with _un (to
denote unsafety, if _s (not sure) means safe).

3. The programmer knows best. Based on that, either continue with the
implementation like it is, or (where is appropriate) use a fourh argument
for the requested bytes to be written. And sleep in full conscience, that
you did your best you could. He should do the same.

Now. What concerns me most is the userspace and all these functions that
takes a variable number of arguments and a format string. I was fighting
in my code to know with a reliable way the actual bytes produced by the
sum of those arguments (as this can be really difficult to catch some of
those described conditions above). You also said at one point that noone
that does system programming will use (because of the overhead this set
of functions). We could go further and say. Noone sane (sorry) would want
to format big strings. Such functions are very prone to errors, but are
easy to work with them. So what should do with them? There is a method
to calculate the size beforehand (means before the declaration) and is
given in the printf(3) Linux man page.

  va_start(ap, fmt);
  size = vsnprintf(p, size, fmt, ap);
  va_end(ap);

So it parses twice varargs. Plus a compiler version (not 9*), gave warning
with -fsanitize=undefined). Could we (users) do better? Can we rely on
something else? No we can't. It's the only way. C strings are like this.

So (just speaking loudly here), is it possible to introduce such a function
that will handle this? Something like a growing buffer? But no. Usually
such usage is with stack allocated strings in function scope, but maybe
with some kind of recursivety (if such a word) when such a hypothetical
function sees at some point that the actual bytes exceeds the allocated
size. Sorry as i said it's about user convenience and safety (at the same
time), but as it proved with the immutable string, perhaps there is a way
with mmap (do not really know).

Lastly since we were talking about assumptions and such. It's better to
thing them like warrantees. And if we really want to go ahead, perhaps
with a way, that even there will be no providence for obsolete systems
or to care in this interface only for systems that should also provide
these warrantees (perhaps systems that were developed in this decade)
then we can wrap all this interface with a big fat:

#if WE_WANT_TO_MOVE_ON
 ...
#else
continue with this you have, but i cannot help you as i want and i can
#endif

Bruno,
Starting from zero always gives a breath of energy. So if we really want
to move on, then the best that it can be done, is to do it like you want
to do it, without any obligations to no[o]ne.
It's always us and (for) us at the end. The art here is that through us,
will benifit the outside of us at exact the same time. This is called dada
i believe.

> Bruno

Thanks,
 Αγαθοκλής
(you know the funny thing: Iggy, Nushrat-Fateh, the unknown to you
(but our great) Manolis, and you are my beloved idols. What a life!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-29  9:19         ` Bruno Haible
  2019-12-29 17:13           ` ag
@ 2019-12-29 20:02           ` ag
  1 sibling, 0 replies; 23+ messages in thread
From: ag @ 2019-12-29 20:02 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Tim Rühsen, Paul Eggert, bug-gnulib

On Sun, Dec 29, at 10:19 Bruno Haible wrote:
> Aga wrote:
> >   - the returned value of the *printf family of functions dictates their
> >     limits/range, as they return an int, this can be as INT_MAX mostly
> 
> Yes, we need new implementations of the *asprintf functions that are not
> limited to returning strings of maximum length INT_MAX.

There is also the question how current functions behave with buffers over INT_MAX.
And what to do with such large buffers if stdio can not handle them reliably.
And what POSIX says about this if says at all?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-27 10:51     ` string types Bruno Haible
  2019-12-28 13:14       ` ag
@ 2019-12-29 21:24       ` Tim Rühsen
  2019-12-31  9:53         ` Bruno Haible
  1 sibling, 1 reply; 23+ messages in thread
From: Tim Rühsen @ 2019-12-29 21:24 UTC (permalink / raw)
  To: Bruno Haible, ag; +Cc: bug-gnulib


[-- Attachment #1.1: Type: text/plain, Size: 2457 bytes --]

On 27.12.19 11:51, Bruno Haible wrote:
> Aga wrote:
>> I do not know if
>> you can (or if it is possible, how it can be done), extract with a way a specific
>> a functionality from gnulib, with the absolute necessary code and only that.
> 
> gnulib-tool does this. With its --avoid option, the developer can even customize
> their notion of "absolutely necessary".
> 
>> In a myriad of codebases a string type is implemented at least as:
>>   size_t mem_size;
>>   size_t num_bytes;
>>   char *bytes;
> 
> This is actually a string-buffer type. A string type does not need two size_t
> members. Long-term experience has shown that using different types for string
> and string-buffer is a win, because
>   - a string can be put in a read-only virtual memory area, thus enforcing
>     immutability (-> reducing multithread problems),
>   - providing primitives for string allocation reduces the amount of buffer
>     overflow bugs that otherwise occur in this area. [1]
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>

Just FYI,

here is a string concatenation function without ellipsis, analogue to
writev() and struct iovec - just a suggestion. Instead of 'struct
strvec' a new string_t type would be handy.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct strvec {
  char *strv_base;
  size_t strv_len;
};

__attribute__ ((nonnull (1)))
char *concat_stringv(const struct strvec *strv)
{
  const struct strvec *str;
  size_t len = 0;
  char *buf;

  for (str = strv; str->strv_base; str++)
    len += str->strv_len;

  if (!(buf = malloc(len + 1)))
    return buf;

  len = 0;
  for (str = strv; str->strv_base; len += str->strv_len, str++)
    memcpy(buf + len, str->strv_base, str->strv_len);

  buf[len] = 0;

  return buf;
}

void main(void)
{
  char *s = concat_stringv((struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

  puts(s);

  free(s);
}


In GNU Wget2 we already have type similar to string_t. Just used in
cases where we need pointer + len of URLs inside const HTML/XML/CSS data.

typedef struct {
        const char
                *p; //!< pointer to memory region
        size_t
                len; //!< length of memory region
} wget_string;


So maybe we need a string_t and a const_string_t type !? (to avoid
casting from const char *)

Regards, Tim


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-29 21:24       ` Tim Rühsen
@ 2019-12-31  9:53         ` Bruno Haible
  2020-01-06 10:34           ` Tim Rühsen
  0 siblings, 1 reply; 23+ messages in thread
From: Bruno Haible @ 2019-12-31  9:53 UTC (permalink / raw)
  To: Tim Rühsen; +Cc: bug-gnulib, ag

Hi Tim,

> >   - providing primitives for string allocation reduces the amount of buffer
> >     overflow bugs that otherwise occur in this area. [1]
> > [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

> here is a string concatenation function without ellipsis, analogue to
> writev() and struct iovec - just a suggestion. Instead of 'struct
> strvec' a new string_t type would be handy.
> 
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> struct strvec {
>   char *strv_base;
>   size_t strv_len;
> };
> 
> __attribute__ ((nonnull (1)))
> char *concat_stringv(const struct strvec *strv)
> {
>   const struct strvec *str;
>   size_t len = 0;
>   char *buf;
> 
>   for (str = strv; str->strv_base; str++)
>     len += str->strv_len;
> 
>   if (!(buf = malloc(len + 1)))
>     return buf;
> 
>   len = 0;
>   for (str = strv; str->strv_base; len += str->strv_len, str++)
>     memcpy(buf + len, str->strv_base, str->strv_len);
> 
>   buf[len] = 0;
> 
>   return buf;
> }
> 
> void main(void)
> {
>   char *s = concat_stringv((struct strvec []) {
>     { "a", 1 },
>     { "b", 1 },
>     { NULL }
>   });

This looks good. It brings us one step closer to the stated goal [1].

Would you like to contribute such a 'string-alloc' module that, together with
'strdup' and 'asprintf', removes most needs to create a string's contents
"by hand"?

Regarding the type name: There can't be a 'string_t' in C, I would say,
because you will always have the NUL-terminated strings on one side and what
you call a 'wget_string' on the other side, and there can't be a clear winner
between both.

Bruno



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2019-12-31  9:53         ` Bruno Haible
@ 2020-01-06 10:34           ` Tim Rühsen
  2020-01-06 12:46             ` Bruno Haible
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Rühsen @ 2020-01-06 10:34 UTC (permalink / raw)
  To: Bruno Haible; +Cc: ag, bug-gnulib


[-- Attachment #1.1: Type: text/plain, Size: 4019 bytes --]



On 12/31/19 10:53 AM, Bruno Haible wrote:
> Hi Tim,
> 
>>>   - providing primitives for string allocation reduces the amount of buffer
>>>     overflow bugs that otherwise occur in this area. [1]
>>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
> 
>> here is a string concatenation function without ellipsis, analogue to
>> writev() and struct iovec - just a suggestion. Instead of 'struct
>> strvec' a new string_t type would be handy.
>>
>> #include <stddef.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> struct strvec {
>>   char *strv_base;
>>   size_t strv_len;
>> };
>>
>> __attribute__ ((nonnull (1)))
>> char *concat_stringv(const struct strvec *strv)
>> {
>>   const struct strvec *str;
>>   size_t len = 0;
>>   char *buf;
>>
>>   for (str = strv; str->strv_base; str++)
>>     len += str->strv_len;
>>
>>   if (!(buf = malloc(len + 1)))
>>     return buf;
>>
>>   len = 0;
>>   for (str = strv; str->strv_base; len += str->strv_len, str++)
>>     memcpy(buf + len, str->strv_base, str->strv_len);
>>
>>   buf[len] = 0;
>>
>>   return buf;
>> }
>>
>> void main(void)
>> {
>>   char *s = concat_stringv((struct strvec []) {
>>     { "a", 1 },
>>     { "b", 1 },
>>     { NULL }
>>   });
> 
> This looks good. It brings us one step closer to the stated goal [1].
> 
> Would you like to contribute such a 'string-alloc' module that, together with
> 'strdup' and 'asprintf', removes most needs to create a string's contents
> "by hand"?

When time allows, I would like to make up a module.

Though IMO the design of the function doesn't allow to reuse an existing
buffer (e.g. a scratch buffer on the stack). Since malloc() etc are
pretty costly, you often want to avoid it as much as possible.

Like e.g.

/* Use given stack buffer, fallback to malloc() if too short */
char sbuf[256];
char *s = concat_stringv_stack(sbuf, sizeof (sbuf), (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

... do things with s ...

if (s != sbuf)
  free (s);

Sometimes you want to reuse an existing malloc'ed buffer:

/* Use existing heap buffer, use realloc() if too short */
char *buf = malloc(N);
char *buf = concat_stringv_reuse(buf, N, (struct strvec []) {
    { "a", 1 },
    { "b", 1 },
    { NULL }
  });

... do things with s ...

free (buf);

You might also be interested in the size of the created string to avoid
a superfluous strlen(). So the need for more specialized functions makes
it all more and more complex.

During the development of Libwget/Wget2 we needed all of the above (and
more) and finally came up with a good compromise (well, good for us).

We created a 'catch them all' string/buffer type plus API. It is a good
compromise for all kinds of situations, works like a memory buffer but
is guaranteed 0-terminated, allows custom stack buffers with fallback to
heap if to small.

$ cloc buffer.c
Language             files          blank        comment           code
-----------------------------------------------------------------------
C                        1             49            327            195

https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c


There also is a sprintf functionality (glibc compatible) using these
buffers - and the operation is well faster than glibc's sprintf-like
functions for all format strings tested (tested back a few years). The
code is also much smaller (380 C code lines), the return values are
size_t. It doesn't support float/double.

$ cloc buffer_printf.c
Language             files          blank        comment           code
-----------------------------------------------------------------------
C                        1             74            120            380

https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c

If there is serious interest, I could prepare modules for gnulib.


Regards, Tim


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2020-01-06 10:34           ` Tim Rühsen
@ 2020-01-06 12:46             ` Bruno Haible
  2020-01-06 16:08               ` Tim Rühsen
  0 siblings, 1 reply; 23+ messages in thread
From: Bruno Haible @ 2020-01-06 12:46 UTC (permalink / raw)
  To: Tim Rühsen; +Cc: ag, bug-gnulib

Hi Tim,

> >>>   - providing primitives for string allocation reduces the amount of buffer
> >>>     overflow bugs that otherwise occur in this area. [1]
> >>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
> > ...
> We created a 'catch them all' string/buffer type plus API. It is a good
> compromise for all kinds of situations, works like a memory buffer but
> is guaranteed 0-terminated, allows custom stack buffers with fallback to
> heap if to small.
> 
> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
> 
> 
> There also is a sprintf functionality (glibc compatible) using these
> buffers - and the operation is well faster than glibc's sprintf-like
> functions for all format strings tested (tested back a few years). The
> code is also much smaller (380 C code lines), the return values are
> size_t. It doesn't support float/double.
> 
> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
> 
> If there is serious interest, I could prepare modules for gnulib.

It is interesting that your solution does not only cover the simple cases
(string concatenation, etc.), but also the more complex one, possibly
with if()s in the generation logic, and all this without blatant potential
for buffer overflow bugs.

So, the solution would consists of the following parts:
  (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
      the stack.
  (B) A set of functions for appending to such a growable buffer.
  (C) A function for creating a heap-allocated 'char *' from a growable
      buffer.
  (D) Short-hand functions for the simple cases (like string concatenation).

It would be good to have all these well integrated (in terms of function
names and calling conventions). So far, in gnulib, we have only pieces of
it:
  - Module 'scratch_buffer' is (A) without (B), (C), (D).
  - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).

Before you start writing the code, it's worth looking at the following
questions:
  * Should the module 'scratch_buffer' be reused for (A)? Or is this
    not possible? If not, can it still have a memory-leak prevention
    like described in lib/malloc/scratch_buffer.h?
  * What about reusing the complete vasnprintf.c for (B), rather than
    adding another, limited printf-like implementation?
  * Is it best to implement (D) based on (A), (B), (C), or directly
    from scratch?

Bruno



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2020-01-06 12:46             ` Bruno Haible
@ 2020-01-06 16:08               ` Tim Rühsen
  2020-01-06 16:49                 ` Tim Rühsen
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Rühsen @ 2020-01-06 16:08 UTC (permalink / raw)
  To: Bruno Haible; +Cc: ag, bug-gnulib

[-- Attachment #1.1: Type: text/plain, Size: 6629 bytes --]

Hi Bruno,

On 1/6/20 1:46 PM, Bruno Haible wrote:
>>>>>   - providing primitives for string allocation reduces the amount of buffer
>>>>>     overflow bugs that otherwise occur in this area. [1]
>>>>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>>> ...
>> We created a 'catch them all' string/buffer type plus API. It is a good
>> compromise for all kinds of situations, works like a memory buffer but
>> is guaranteed 0-terminated, allows custom stack buffers with fallback to
>> heap if to small.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
>>
>>
>> There also is a sprintf functionality (glibc compatible) using these
>> buffers - and the operation is well faster than glibc's sprintf-like
>> functions for all format strings tested (tested back a few years). The
>> code is also much smaller (380 C code lines), the return values are
>> size_t. It doesn't support float/double.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
>>
>> If there is serious interest, I could prepare modules for gnulib.
> 
> It is interesting that your solution does not only cover the simple cases
> (string concatenation, etc.), but also the more complex one, possibly
> with if()s in the generation logic, and all this without blatant potential
> for buffer overflow bugs.
> 
> So, the solution would consists of the following parts:
>   (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
>       the stack.

Preferable, the initial size and if starting with heap or stack buffer
should be (runtime) configurable.
- initial size because it allows fine-tuning to better avoid reallocations
- initial stack if used as local / temporary buffer
- initial heap when you already know that the resulting string has to
persist the function return

Currently there is are two init functions (I leave away the wget namespace):
int buffer_init(buffer *buf, char *data, size_t size);
buffer *buffer_alloc(size_t size);

buffer_alloc creates a buffer instance on the heap and initializes it
with a heap buffer of size.

buffer_init(buf, data, date_size) initializes 'buf' with the given data
and data_size. data will not be free'd, so stack data can be used here.

buffer_init(buf, NULL, date_size) initializes 'buf' with freshly
allocated heap data of size 'data_size'.

buffer_init(buf, NULL, 0) initializes 'buf' with freshly allocated heap
data of size 128. We could leave this out - it's a currently unused
special case to avoid error handling.

Then there is
int buffer_ensure_capacity(buffer *buf, size_t size);

>   (B) A set of functions for appending to such a growable buffer.

To copy a number of bytes to the beginning (effectively dropping the
previous content):
size_t buffer_memcpy(buffer *buf, const void *data, size_t length);

To append a number of bytes:
size_t buffer_memcat(buffer *buf, const void *data, size_t length);

To copy a string to the beginning (effectively dropping the previous
content):
size_t buffer_strcpy(buffer *buf, const char *s);

To append a string:
size_t buffer_strcat(buffer *buf, const char *s);

To set a number of bytes at the beginning (effectively dropping the
previous content):
size_t buffer_memset(buffer *buf, char c, size_t length);

To append a number of the same bytes:
size_t buffer_memset_append(buffer *buf, char c, size_t length);

>   (C) A function for creating a heap-allocated 'char *' from a growable
>       buffer.

Currently we do:
buffer buf;
buffer_init(&buf, NULL, date_size); // allocate buf.data on heap
... add stuff to buf ...
mydata = buf.data; buf.data = NULL;
buffer_deinit(&buf);

We could make up a (static inline) for this, named
void *buffer_deinit_transfer(buffer *buf);

This function could also call realloc() to shrink 'data' to it's
occupied length.

>   (D) Short-hand functions for the simple cases (like string concatenation).

See above, e.g.
buffer_strcpy(buf, scheme);
buffer_strcat(buf, "://");
buffer_strcat(buf, domain);
buffer_memcat(buf, ":", 1);
buffer_strcat(buf, port_s);
buffer_memcat(buf, "/", 1);
buffer_strcat(buf, path);

But I prefer the slightly slower but better readable form
buffer_printf("%s://%s:%d/%s", scheme, domain, port, path);

Since our printf-like functions directly write into a buffer, there is
no overhead for copying data.

> It would be good to have all these well integrated (in terms of function
> names and calling conventions). So far, in gnulib, we have only pieces of
> it:
>   - Module 'scratch_buffer' is (A) without (B), (C), (D).
>   - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).
> 
> Before you start writing the code, it's worth looking at the following
> questions:
>   * Should the module 'scratch_buffer' be reused for (A)? Or is this
>     not possible? If not, can it still have a memory-leak prevention
>     like described in lib/malloc/scratch_buffer.h?

I don't see the advantage of the described memory-leak prevention. On
memory error scratch_buffer_grow() releases all memory, so you can
simply return ? Most functions have a 'cleanup' part anyways, which
needs to be jumped to before return. Why not add a scratch_buffer_free()
there ?

Also, scratch_buffer has IMO these flaws.
(1) The fixed initial size of 1024 on the stack. This should be tunable.
(2) If you need a scratch_buffer itself on the heap, you can't free
those initial 1024 bytes.
(3) The 'expected usage' as described in scratch_buffer.h expects the
developer to explicitly design his functions in a certain way.

>   * What about reusing the complete vasnprintf.c for (B), rather than
>     adding another, limited printf-like implementation?

Yes, would be nice. At least for appending we need an extra
malloc/malloc/memcpy/free.

vasnprintf reallocates RESULTBUF that means we can't use a stack
allocated buffer - thus we lose the performance advantage. Or should we
try snprintf first and fallback to vasnprintf in case of truncation ?

We want another module e.g. buffer-printf to not pull in vasnprintf when
not needing printf-like buffer functions.

Once the buffer module is done, we could think of amending vasnprintf to
better play with the buffer type.

>   * Is it best to implement (D) based on (A), (B), (C), or directly
>     from scratch?

Better from scratch, but the code already exists (with FSF copyright)
and has been used a lot. So it's not really from scratch, though we can
amend/optimize a few things.

Regards, Tim

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: string types
  2020-01-06 16:08               ` Tim Rühsen
@ 2020-01-06 16:49                 ` Tim Rühsen
  0 siblings, 0 replies; 23+ messages in thread
From: Tim Rühsen @ 2020-01-06 16:49 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib, ag


[-- Attachment #1.1: Type: text/plain, Size: 1800 bytes --]

On 1/6/20 5:08 PM, Tim Rühsen wrote:
>>   * What about reusing the complete vasnprintf.c for (B), rather than
>>     adding another, limited printf-like implementation?
> 
> Yes, would be nice. At least for appending we need an extra
> malloc/malloc/memcpy/free.
> 
> vasnprintf reallocates RESULTBUF that means we can't use a stack
> allocated buffer - thus we lose the performance advantage. Or should we
> try snprintf first and fallback to vasnprintf in case of truncation ?
> 
> We want another module e.g. buffer-printf to not pull in vasnprintf when
> not needing printf-like buffer functions.
> 
> Once the buffer module is done, we could think of amending vasnprintf to
> better play with the buffer type.

Just made a speed comparison between vasnprintf and wget_buffer_printf,
10m times executed, within a stack-only szenario (no reallocations), gcc
9.2.1, -O1 -g.

asnprintf(sbuf, &size, "%d", i);
takes 0m2.727s

wget_buffer_printf(&buf, "%d", i);
takes 0m0.226s

char s[]="123";

asnprintf(sbuf, &size, "%s", s);
takes 0m2.282s

wget_buffer_printf(&buf, "%s", s);
takes 0m0.212s

It tells me that vasnprintf has a huge startup overhead. Perhaps we can
tweak that a little bit.


the vasnprintf program that I run for %d:

#include <vasnprintf.h>

void main(void)
{
  char sbuf[256];
  size_t size = sizeof(sbuf);

  for (int i=0;i<1000000;i++) {
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
    asnprintf(sbuf, &size, "%d", i);
  }
}


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-01-06 16:50 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-17 13:45 hard-locale: make multithread-safe Bruno Haible
2019-12-17 14:08 ` Tim Rühsen
     [not found]   ` <20191226221225.GA800@HATZ>
2019-12-27 10:51     ` string types Bruno Haible
2019-12-28 13:14       ` ag
2019-12-28 18:28         ` Paul Eggert
2019-12-28 20:44           ` ag
2019-12-28 22:40             ` Paul Eggert
2019-12-29  9:19         ` Bruno Haible
2019-12-29 17:13           ` ag
2019-12-29 20:02           ` ag
2019-12-29 21:24       ` Tim Rühsen
2019-12-31  9:53         ` Bruno Haible
2020-01-06 10:34           ` Tim Rühsen
2020-01-06 12:46             ` Bruno Haible
2020-01-06 16:08               ` Tim Rühsen
2020-01-06 16:49                 ` Tim Rühsen
2019-12-18  1:45 ` hard-locale: make multithread-safe Paul Eggert
2019-12-18  8:51   ` Bruno Haible
2019-12-21  6:33     ` Bruno Haible
2019-12-18 10:29   ` LC_COLLATE in the C locale Bruno Haible
2019-12-18 16:27     ` Paul Eggert
2019-12-18 10:46   ` hard-locale: make multithread-safe Bruno Haible
2019-12-24 23:36     ` Bruno Haible

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).