git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Antoine Pelisse <apelisse@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: [PATCH 01/10] list_lookup: create case and length search
Date: Mon, 07 Jan 2013 14:37:39 -0800	[thread overview]
Message-ID: <7vpq1ghb64.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <7vmwwnp84p.fsf@alter.siamese.dyndns.org> (Junio C. Hamano's message of "Sat, 05 Jan 2013 14:39:18 -0800")

Junio C Hamano <gitster@pobox.com> writes:

> Antoine Pelisse <apelisse@gmail.com> writes:
>
>> Create a new function to look-up a string in a string_list, but:
>>  - add a new parameter to ignore case differences
>>  - add a length parameter to search for a substring
>>
>> The idea is to avoid several copies (lowering a string before searching
>> it when we just want to ignore case), or copying a substring of a bigger
>> string to search it in the string_list
>>
>> Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
>> ---
>
> I did not read beyond the log message and the function signature of
> the updated get_entry_index(), but it strikes me insane to build a
> sorted list using case sensitive full string comarison and then to
> look for an element in that list using a different comparison
> criteria (e.g. case insensitive comparison) and to expect the
> bisection search to still work.  Shouldn't the codepath that builds
> the string-list be sorting the list in case insensitive way from the
> beginning if this were to work correctly?
>
> It seems to suggest to me that this "are the keys case sensitive?"
> bit belongs to the entire struct string_list structure as its
> property (similar to the way "do the keys need to be strdup'ed?"
> bit), not something you would flip per invocation basis of the
> lookup function.
>
> Also isn't size_t an unsigned type?  What does it even mean to pass
> "-1" to it, and propagate it down to strncmp()?
>
> If you have a list sorted by a key, and if you want to query it with
> a partial prefix of a possibly valid key, I think you shouldn't have
> to add the "length search" at all. The existing look up function
> would return the location in the list that short key would have been
> inserted, which would come before the first entry that your short
> key is a prefix of, so the caller can iterate the list from there to
> find all entries.  In other words, if existing list has "aaa",
> "bba", and "bbc", and you want to grab entries that begin with "bb",
> you can ask for "bb" to the loop up function, which would say "the
> key does not exist in the list, but it would be inserted before this
> 'bba' entry".  Then you can discover that "bba" and "bbc" both
> matches the shortened key you have, no?

In the above, I mixed up which string is a prefix of which, but the
overall comment still stands, I think.

Your mailmap string list has "Ee@eee" (email part only) stored, while
the record from author/committer line has "NNNN <EE@EEE> TTTTTT sZZZZ",
to hold name, timestamp and zone.  You have a pointer email pointing
at the first "E" in that record, with maillen set to 6, and see if
the mailmap contains a string that match "EE@EEE".  So I mixed up
the prefix-ness of these strings.

Let's address the case sensitivety first.  The string_list_lookup()
and your string_list_lookup_extended() functions both work on a
string list that is built by add_mapping() function that asks
string_list_find_insert_index() to find an insertion point in the
sorted string list.  If this is not done using the same case
insensitive manner for a string list that is meant to be queried
case insensitively, the resulting list is still sorted case
sensitively, and look-up functions that work by binary search won't
find what you are looking for when they try to find the string using
case-insensitive comparison.  The change to implement it would
probably look like the attached patch.

Assuming that we can leave the case sensitivity behind us that way,
we still would need a new helper function

    string_list_lookup_prefix(struct string_list *, const char*, size_t)

so that you can find an existing "Ee@eee" entry in the string list
by email that points at "EE@EEE> TTTTTT sZZZZ" with maillen of 6.

I am wondering if we can do that without adding the length-limited
sort in the low-level get_entry_index() to do so, though.

You could first ask string_list_find_insert_index() to locate the
insertion point for "EE@EEE> TTTTTT sZZZZ".  That has to come after
"Ee@eee" if it exists (or it could be "De@eee" in which case you
know there is no match).

 1. Ask string_list_find_insert_index() to locate "EE@EEE> TTTTTT..."

    a. If it says there is an exact match, then that did not match
       "EE@EEE" list (you found "EE@EEE> TTTTTT sZZZZ").  You might
       still find "EE@EEE" before that record, though.  Also your
       key may have been "EE@EEE" exactly, in which case you already
       found what you were looking for.

    b. If it says there is not an exact match, the returned value is
       pointing at a record that comes after "EE@EEE> TTTTTT sZZZZ" if
       you were to insert it, which definitely is after "EE@EEE".

    So in either case, you know what you are looking for must come
    before string_list_find_insert_index() gave you.  Let's call
    that position "i".

    You still need to be careful that there could be an existing
    entry whose key is "EE@EEE" followed by a byte that sorts before
    ">" or " " (we do not want to limit this new generic string-list
    function to operate only on e-mail addresses), though.  So the
    above "must come before" is not "must come immediately before".

 2. Then compare map->items[i].string and email using strncasecmp()
    for maillen bytes, and make sure map->items[].string is exactly
    maillen bytes long.  If it matches, you found what you were
    looking for.  Otherwise, decrement "i" (e.g. the overlong key
    string you had was "EE@EEE> TTTTTT..." and the record at one
    before the insertion point was "EE@EEE\t" which sorts before it,
    but the string-list may have "EE@EEE" before that entry), see if
    the first maillen bytes still match, and continue.  Once you
    decrement i enough, the first maillen bytes won't match
    (e.g. you hit "De@EEE"), or "i" goes negative, and you can stop.

or something like that.  I'll cook up a patch and see how ugly it
ends up with.

-- >8 --
Subject: [PATCH] string-list: allow case-insensitive string list

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * At the very beginning of mailmap.c::read_mailmap(), you would say
   something like:

	int read_mailmap(struct string_list *map, char **repo_ident)
        {
        	map->strdup_strings = 1;
       +        map->cmp = strcasecmp;
                /* each failure returns 1, so >1 means both calls ...

   to make the mailmap string list a case insensitive one.

 string-list.c | 17 +++++++++++++----
 string-list.h |  4 ++++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/string-list.c b/string-list.c
index 397e6cf..6f3d8cf 100644
--- a/string-list.c
+++ b/string-list.c
@@ -7,10 +7,11 @@ static int get_entry_index(const struct string_list *list, const char *string,
 		int *exact_match)
 {
 	int left = -1, right = list->nr;
+	compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
 
 	while (left + 1 < right) {
 		int middle = (left + right) / 2;
-		int compare = strcmp(string, list->items[middle].string);
+		int compare = cmp(string, list->items[middle].string);
 		if (compare < 0)
 			right = middle;
 		else if (compare > 0)
@@ -96,8 +97,9 @@ void string_list_remove_duplicates(struct string_list *list, int free_util)
 {
 	if (list->nr > 1) {
 		int src, dst;
+		compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
 		for (src = dst = 1; src < list->nr; src++) {
-			if (!strcmp(list->items[dst - 1].string, list->items[src].string)) {
+			if (!cmp(list->items[dst - 1].string, list->items[src].string)) {
 				if (list->strdup_strings)
 					free(list->items[src].string);
 				if (free_util)
@@ -230,15 +232,20 @@ struct string_list_item *string_list_append(struct string_list *list,
 			list->strdup_strings ? xstrdup(string) : (char *)string);
 }
 
+/* Yuck */
+static compare_strings_fn compare_for_qsort;
+
+/* Only call this from inside sort_string_list! */
 static int cmp_items(const void *a, const void *b)
 {
 	const struct string_list_item *one = a;
 	const struct string_list_item *two = b;
-	return strcmp(one->string, two->string);
+	return compare_for_qsort(one->string, two->string);
 }
 
 void sort_string_list(struct string_list *list)
 {
+	compare_for_qsort = list->cmp ? list->cmp : strcmp;
 	qsort(list->items, list->nr, sizeof(*list->items), cmp_items);
 }
 
@@ -246,8 +253,10 @@ struct string_list_item *unsorted_string_list_lookup(struct string_list *list,
 						     const char *string)
 {
 	int i;
+	compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
+
 	for (i = 0; i < list->nr; i++)
-		if (!strcmp(string, list->items[i].string))
+		if (!cmp(string, list->items[i].string))
 			return list->items + i;
 	return NULL;
 }
diff --git a/string-list.h b/string-list.h
index c50b0d0..446e79e 100644
--- a/string-list.h
+++ b/string-list.h
@@ -5,10 +5,14 @@ struct string_list_item {
 	char *string;
 	void *util;
 };
+
+typedef int (*compare_strings_fn)(const char *, const char *);
+
 struct string_list {
 	struct string_list_item *items;
 	unsigned int nr, alloc;
 	unsigned int strdup_strings:1;
+	compare_strings_fn cmp; /* NULL uses strcmp() */
 };
 
 #define STRING_LIST_INIT_NODUP { NULL, 0, 0, 0 }
-- 
1.8.1.304.gf036638

  reply	other threads:[~2013-01-07 22:38 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-05 21:26 [PATCH 00/10] Log mailmap optimization Antoine Pelisse
2013-01-05 21:26 ` [PATCH 01/10] list_lookup: create case and length search Antoine Pelisse
2013-01-05 22:39   ` Junio C Hamano
2013-01-07 22:37     ` Junio C Hamano [this message]
2013-01-05 21:26 ` [PATCH 02/10] Use split_ident_line to parse author and committer Antoine Pelisse
2013-01-05 21:26 ` [PATCH 03/10] mailmap: remove email copy and length limitation Antoine Pelisse
2013-01-05 21:26 ` [PATCH 04/10] mailmap: simplify map_user() interface Antoine Pelisse
2013-01-06 22:56   ` Junio C Hamano
2013-01-05 21:26 ` [PATCH 05/10] mailmap: add mailmap structure to rev_info and pp Antoine Pelisse
2013-01-05 21:26 ` [PATCH 06/10] pretty: use mailmap to display username and email Antoine Pelisse
2013-01-05 21:26 ` [PATCH 07/10] log: add --use-mailmap option Antoine Pelisse
2013-01-05 21:26 ` [PATCH 08/10] test: add test for " Antoine Pelisse
2013-01-05 21:26 ` [PATCH 09/10] log: grep author/committer using mailmap Antoine Pelisse
2013-01-05 21:26 ` [PATCH 10/10] log: add log.mailmap configuration option Antoine Pelisse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vpq1ghb64.fsf@alter.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=apelisse@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).