Re: [PATCH] hamt: New module. - Marc Nieper-Wißkirchen

From: "Marc Nieper-Wißkirchen" <marc.nieper+gnu@gmail.com>
To: Bruno Haible <bruno@clisp.org>
Cc: "Marc Nieper-Wißkirchen" <marc.nieper+gnu@gmail.com>, bug-gnulib@gnu.org
Subject: Re: [PATCH] hamt: New module.
Date: Sat, 3 Apr 2021 22:32:22 +0200	[thread overview]
Message-ID: <CAEYrNrRPKTvktjYrQ5UbqR5Y8JABp1QXxbDhTTPukWDzxY6oRA@mail.gmail.com> (raw)
In-Reply-To: <1796868.jvDaOvkGpR@omega>

[-- Attachment #1: Type: text/plain, Size: 8786 bytes --]

Hi Bruno,

thank you very much for all the helpful feedback. I have incorporated your
suggestions, and you can find an updated diff below.

Thanks,

Marc

diff --git a/doc/containers.texi b/doc/containers.texi
index 15c915b93..9bc1ae43d 100644
--- a/doc/containers.texi
+++ b/doc/containers.texi
@@ -35,6 +35,9 @@ log
 Gnulib provides several generic container data types.  They can be used
 to organize collections of application-defined objects.

+@node Ordinary containers
+@subsection Ordinary container data types
+
 @multitable @columnfractions .15 .5 .1 .1 .15
 @headitem Data type
 @tab Details
@@ -599,6 +602,58 @@ For C++, Gnulib provides a C++ template class for each
of these container data t
 @tab @code{"gl_omap.hh"}
 @end multitable

+@node Specialized containers
+@subsection Specialized container data types
+
+The @code{hamt} module implements the hash array mapped trie (HAMT) data
+structure.  This is a data structure that contains (key, value) pairs.
+Lookup of a (key, value) pair given the key is an O(1) operation,
+assuming a good hash function for the keys is employed.
+
+The HAMT data structure is useful when you want modifications (additions
+of pairs, removal, value changes) to be visible only to some part of
+your program, whereas other parts of the program continue to use the
+unmodified HAMT.  The HAMT makes this possible in a space-efficient
+manner: the modified and the unmodified HAMT share most of their
+allocated memory.  It is also time-efficient: Every such modification
+is O(1) on average, again assuming a good hash function for the keys.
+
+A HAMT can be used whenever an ordinary hash table would be used.  It
+does however, provide non-destructive updating operations without the
+need to copy the whole container  On the other hand, a hash table is
+simpler so that its performance may be better when persistence is not
+needed.
+
+For example, a HAMT can be used to model the dynamic environment in a
+LISP interpreter.  Updating a value in the dynamic environment of one
+continuation frame would not modify values in earlier frames.
+
+To use the module, include @code{hamt.h} in your code.  The public
+interface is documented in that header file.  You have to provide a hash
+function and an equivalence relation, which defines key equality.  The
+module includes a test file @code{test-hamt.c}, which demonstrates how
+the API can be used.
+
+In the current implementation, each inner node of the HAMT can store up
+to @math{32 = 2^5} entries and subtries.  Whenever a collision between
+the initial bits of the hash values of two entries would happen, the
+next @math{5} bits of the hash values are examined and the two entries
+pushed down one level in the trie.
+
+HAMTs have the same average access times as hash tables but grow and
+shrink dynamically, so they use memory more economically and do not have
+to be periodically resized.
+
+They were described and analyzed in @cite{Phil Bagwell (2000). Ideal
+Hash Trees (Report). Infoscience Department, École Polytechnique
+Fédérale de Lausanne.}
+
+HAMTs are well-suited to a persistent data structure, which means that
+each updating operation (like inserting, replacing, or removing an
+element) returns a new HAMT while leaving the original one intact.  This
+is achieved through structure sharing, which is even safe in the
+presence of multiple threads when the used C compiler supports atomics.
+
 @ifnottex
 @unmacro log
 @end ifnottex
diff --git a/lib/hamt.h b/lib/hamt.h
index 55ee964dd..25a0ad9f9 100644
--- a/lib/hamt.h
+++ b/lib/hamt.h
@@ -78,10 +78,13 @@ _GL_INLINE_HEADER_BEGIN
 /************/

 /* A hamt stores pointers to elements.  Each element has to be a
-   struct whose initial member is of the type Hamt_entry.  An element
-   is conceptually owned by a hamt as soon as it is inserted.  It will
-   be automatically freed as soon as the last hamt containing it is
-   freed.  */
+   struct whose initial member is of the type Hamt_entry.  You need to
+   define this struct yourself.  It will typically contain an
+   Hamt_entry, a key, and, optionally, a value.
+
+   An element is conceptually owned by a hamt as soon as it is
+   inserted.  It will be automatically freed as soon as the last hamt
+   containing it is freed.  */
 typedef struct
 {
 #if GL_HAMT_THREAD_SAFE

Am Sa., 3. Apr. 2021 um 21:55 Uhr schrieb Bruno Haible <bruno@clisp.org>:

> Hi Marc,
>
> > what do you think of the attempt attached below?
>
> I believe that technical documentation of a feature should be written
> to answer the questions in this order:
>   1. What is it? (Short summary)
>   2. When would I use it? (Use cases)
>   3. How do I use it?
>   4. Further details.
>
> The idea is to get the reader understand rapidly whether they want to
> use the feature or not. That is the purpose of parts 1 and 2.
>
> For part 1:
>
> > +The @code{hamt} module provides a persistant version of persistent hash
> > +array mapped tries (HAMTs).
>
> "A persistant version of persistent ..." ?
>
> The term "persistent" is only explained much later.
>
> > A HAMT is an array mapped trie in which
> > +elements are stored according to the initial bits of their hash values.
>
> This is a technical detail, to be mentioned later.
>
> I would start in a different way. How about this?
>
>   The @code{hamt} module implements the hash array mapped trie (HAMT)
>   data structure.  This is a data structure that contains (key, value)
>   pairs (or just plain keys, if no value is needed).  Lookup of a
>   (key, value) pair given the key is an O(1) operation, assuming a
>   good hash function for the keys is employed.
>
>   The HAMT data structure is useful when you want modifications
>   (additions of pairs, removal, value changes) to be visible only to
>   some part of your program, whereas other parts of the program
>   continue to use the unmodified HAMT.  The HAMT makes this possible
>   in a space-efficient manner: the modified and the unmodified HAMT
>   share most of their allocated memory.  It is also time-efficient:
>   Every such modification is O(1) on average, again assuming a good
>   hash function for the keys
>
> For part 2:
>
> > +A HAMT can be used whenever an ordinary hash table would be used.  It
> > +does however, provide non-destructive updating operations without the
> > +need to copy the whole container  On the other hand, a hash table is
> > +simpler so that its performance may be better when persistence is not
> > +needed.
> > +
> > +For example, a HAMT can be used to model the dynamic environment in a
> > +LISP interpreter.  Updating a value in the dynamic environment of one
> > +continuation frame would not modify values in earlier frames.
>
> This is good; this should come right after part 1.
>
> Now part 3, right after part 2:
>
> > +To use the module, include @code{hamt.h} in your code.  The public
> > +interface is documented in that header file.
>
> OK.
>
> And the rest are details that can come after part 3:
>
> > +In the current implementation, each inner node of the HAMT can store up
> > +to @math{32 = 2^5} elements and subtries.  Whenever a collision between
> > +the initial bits of the hash values of two elements happens, the next
> > +@math{5} bits of the hash values are examined and the two elements
> > +pushed down one level in the trie.
> > +
> > +HAMTs have the same average access times as hash tables but grow and
> > +shrink dynamically, so they use memory more economically and do not have
> > +to be periodically resized.
> > +
> > +They were described and analyzed in @cite{Phil Bagwell (2000). Ideal
> > +   Hash Trees (Report). Infoscience Department, École Polytechnique
> > +   Fédérale de Lausanne.}
> > +
> > +HAMTs are well-suited to a persistent data structure, which means that
> > +each updating operation (like inserting, replacing, or removing an
> > +element) returns a new HAMT while leaving the original one intact.  This
> > +is achieved through structure sharing, which is even safe in the
> > +presence of multiple threads when the used C compiler supports atomics.
> > +
>
> By the way, I just starred at the Hamt_entry struct and wondered how to
> make use of it. The test class shows it. How about extending the comment
> a bit:
>
>    Each element has to be a
>    struct whose initial member is of the type Hamt_entry.
> ->
>    Each element has to be a
>    struct whose initial member is of the type Hamt_entry.  You need to
>    define this struct yourself.  It will typically contain a
>    Hamt_entry, a key, and optionally a value.
>
> Bruno
>
>

[-- Attachment #2: Type: text/html, Size: 10279 bytes --]