From: Junio C Hamano <gitster@pobox.com>
To: Antoine Pelisse <apelisse@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: [PATCH 01/10] list_lookup: create case and length search
Date: Mon, 07 Jan 2013 14:37:39 -0800 [thread overview]
Message-ID: <7vpq1ghb64.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <7vmwwnp84p.fsf@alter.siamese.dyndns.org> (Junio C. Hamano's message of "Sat, 05 Jan 2013 14:39:18 -0800")
Junio C Hamano <gitster@pobox.com> writes:
> Antoine Pelisse <apelisse@gmail.com> writes:
>
>> Create a new function to look-up a string in a string_list, but:
>> - add a new parameter to ignore case differences
>> - add a length parameter to search for a substring
>>
>> The idea is to avoid several copies (lowering a string before searching
>> it when we just want to ignore case), or copying a substring of a bigger
>> string to search it in the string_list
>>
>> Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
>> ---
>
> I did not read beyond the log message and the function signature of
> the updated get_entry_index(), but it strikes me insane to build a
> sorted list using case sensitive full string comarison and then to
> look for an element in that list using a different comparison
> criteria (e.g. case insensitive comparison) and to expect the
> bisection search to still work. Shouldn't the codepath that builds
> the string-list be sorting the list in case insensitive way from the
> beginning if this were to work correctly?
>
> It seems to suggest to me that this "are the keys case sensitive?"
> bit belongs to the entire struct string_list structure as its
> property (similar to the way "do the keys need to be strdup'ed?"
> bit), not something you would flip per invocation basis of the
> lookup function.
>
> Also isn't size_t an unsigned type? What does it even mean to pass
> "-1" to it, and propagate it down to strncmp()?
>
> If you have a list sorted by a key, and if you want to query it with
> a partial prefix of a possibly valid key, I think you shouldn't have
> to add the "length search" at all. The existing look up function
> would return the location in the list that short key would have been
> inserted, which would come before the first entry that your short
> key is a prefix of, so the caller can iterate the list from there to
> find all entries. In other words, if existing list has "aaa",
> "bba", and "bbc", and you want to grab entries that begin with "bb",
> you can ask for "bb" to the loop up function, which would say "the
> key does not exist in the list, but it would be inserted before this
> 'bba' entry". Then you can discover that "bba" and "bbc" both
> matches the shortened key you have, no?
In the above, I mixed up which string is a prefix of which, but the
overall comment still stands, I think.
Your mailmap string list has "Ee@eee" (email part only) stored, while
the record from author/committer line has "NNNN <EE@EEE> TTTTTT sZZZZ",
to hold name, timestamp and zone. You have a pointer email pointing
at the first "E" in that record, with maillen set to 6, and see if
the mailmap contains a string that match "EE@EEE". So I mixed up
the prefix-ness of these strings.
Let's address the case sensitivety first. The string_list_lookup()
and your string_list_lookup_extended() functions both work on a
string list that is built by add_mapping() function that asks
string_list_find_insert_index() to find an insertion point in the
sorted string list. If this is not done using the same case
insensitive manner for a string list that is meant to be queried
case insensitively, the resulting list is still sorted case
sensitively, and look-up functions that work by binary search won't
find what you are looking for when they try to find the string using
case-insensitive comparison. The change to implement it would
probably look like the attached patch.
Assuming that we can leave the case sensitivity behind us that way,
we still would need a new helper function
string_list_lookup_prefix(struct string_list *, const char*, size_t)
so that you can find an existing "Ee@eee" entry in the string list
by email that points at "EE@EEE> TTTTTT sZZZZ" with maillen of 6.
I am wondering if we can do that without adding the length-limited
sort in the low-level get_entry_index() to do so, though.
You could first ask string_list_find_insert_index() to locate the
insertion point for "EE@EEE> TTTTTT sZZZZ". That has to come after
"Ee@eee" if it exists (or it could be "De@eee" in which case you
know there is no match).
1. Ask string_list_find_insert_index() to locate "EE@EEE> TTTTTT..."
a. If it says there is an exact match, then that did not match
"EE@EEE" list (you found "EE@EEE> TTTTTT sZZZZ"). You might
still find "EE@EEE" before that record, though. Also your
key may have been "EE@EEE" exactly, in which case you already
found what you were looking for.
b. If it says there is not an exact match, the returned value is
pointing at a record that comes after "EE@EEE> TTTTTT sZZZZ" if
you were to insert it, which definitely is after "EE@EEE".
So in either case, you know what you are looking for must come
before string_list_find_insert_index() gave you. Let's call
that position "i".
You still need to be careful that there could be an existing
entry whose key is "EE@EEE" followed by a byte that sorts before
">" or " " (we do not want to limit this new generic string-list
function to operate only on e-mail addresses), though. So the
above "must come before" is not "must come immediately before".
2. Then compare map->items[i].string and email using strncasecmp()
for maillen bytes, and make sure map->items[].string is exactly
maillen bytes long. If it matches, you found what you were
looking for. Otherwise, decrement "i" (e.g. the overlong key
string you had was "EE@EEE> TTTTTT..." and the record at one
before the insertion point was "EE@EEE\t" which sorts before it,
but the string-list may have "EE@EEE" before that entry), see if
the first maillen bytes still match, and continue. Once you
decrement i enough, the first maillen bytes won't match
(e.g. you hit "De@EEE"), or "i" goes negative, and you can stop.
or something like that. I'll cook up a patch and see how ugly it
ends up with.
-- >8 --
Subject: [PATCH] string-list: allow case-insensitive string list
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* At the very beginning of mailmap.c::read_mailmap(), you would say
something like:
int read_mailmap(struct string_list *map, char **repo_ident)
{
map->strdup_strings = 1;
+ map->cmp = strcasecmp;
/* each failure returns 1, so >1 means both calls ...
to make the mailmap string list a case insensitive one.
string-list.c | 17 +++++++++++++----
string-list.h | 4 ++++
2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/string-list.c b/string-list.c
index 397e6cf..6f3d8cf 100644
--- a/string-list.c
+++ b/string-list.c
@@ -7,10 +7,11 @@ static int get_entry_index(const struct string_list *list, const char *string,
int *exact_match)
{
int left = -1, right = list->nr;
+ compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
while (left + 1 < right) {
int middle = (left + right) / 2;
- int compare = strcmp(string, list->items[middle].string);
+ int compare = cmp(string, list->items[middle].string);
if (compare < 0)
right = middle;
else if (compare > 0)
@@ -96,8 +97,9 @@ void string_list_remove_duplicates(struct string_list *list, int free_util)
{
if (list->nr > 1) {
int src, dst;
+ compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
for (src = dst = 1; src < list->nr; src++) {
- if (!strcmp(list->items[dst - 1].string, list->items[src].string)) {
+ if (!cmp(list->items[dst - 1].string, list->items[src].string)) {
if (list->strdup_strings)
free(list->items[src].string);
if (free_util)
@@ -230,15 +232,20 @@ struct string_list_item *string_list_append(struct string_list *list,
list->strdup_strings ? xstrdup(string) : (char *)string);
}
+/* Yuck */
+static compare_strings_fn compare_for_qsort;
+
+/* Only call this from inside sort_string_list! */
static int cmp_items(const void *a, const void *b)
{
const struct string_list_item *one = a;
const struct string_list_item *two = b;
- return strcmp(one->string, two->string);
+ return compare_for_qsort(one->string, two->string);
}
void sort_string_list(struct string_list *list)
{
+ compare_for_qsort = list->cmp ? list->cmp : strcmp;
qsort(list->items, list->nr, sizeof(*list->items), cmp_items);
}
@@ -246,8 +253,10 @@ struct string_list_item *unsorted_string_list_lookup(struct string_list *list,
const char *string)
{
int i;
+ compare_strings_fn cmp = list->cmp ? list->cmp : strcmp;
+
for (i = 0; i < list->nr; i++)
- if (!strcmp(string, list->items[i].string))
+ if (!cmp(string, list->items[i].string))
return list->items + i;
return NULL;
}
diff --git a/string-list.h b/string-list.h
index c50b0d0..446e79e 100644
--- a/string-list.h
+++ b/string-list.h
@@ -5,10 +5,14 @@ struct string_list_item {
char *string;
void *util;
};
+
+typedef int (*compare_strings_fn)(const char *, const char *);
+
struct string_list {
struct string_list_item *items;
unsigned int nr, alloc;
unsigned int strdup_strings:1;
+ compare_strings_fn cmp; /* NULL uses strcmp() */
};
#define STRING_LIST_INIT_NODUP { NULL, 0, 0, 0 }
--
1.8.1.304.gf036638
next prev parent reply other threads:[~2013-01-07 22:38 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-05 21:26 [PATCH 00/10] Log mailmap optimization Antoine Pelisse
2013-01-05 21:26 ` [PATCH 01/10] list_lookup: create case and length search Antoine Pelisse
2013-01-05 22:39 ` Junio C Hamano
2013-01-07 22:37 ` Junio C Hamano [this message]
2013-01-05 21:26 ` [PATCH 02/10] Use split_ident_line to parse author and committer Antoine Pelisse
2013-01-05 21:26 ` [PATCH 03/10] mailmap: remove email copy and length limitation Antoine Pelisse
2013-01-05 21:26 ` [PATCH 04/10] mailmap: simplify map_user() interface Antoine Pelisse
2013-01-06 22:56 ` Junio C Hamano
2013-01-05 21:26 ` [PATCH 05/10] mailmap: add mailmap structure to rev_info and pp Antoine Pelisse
2013-01-05 21:26 ` [PATCH 06/10] pretty: use mailmap to display username and email Antoine Pelisse
2013-01-05 21:26 ` [PATCH 07/10] log: add --use-mailmap option Antoine Pelisse
2013-01-05 21:26 ` [PATCH 08/10] test: add test for " Antoine Pelisse
2013-01-05 21:26 ` [PATCH 09/10] log: grep author/committer using mailmap Antoine Pelisse
2013-01-05 21:26 ` [PATCH 10/10] log: add log.mailmap configuration option Antoine Pelisse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7vpq1ghb64.fsf@alter.siamese.dyndns.org \
--to=gitster@pobox.com \
--cc=apelisse@gmail.com \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).