git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: Antoine Pelisse <apelisse@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: [PATCH 1/2] log: grep author/committer using mailmap
Date: Wed, 26 Dec 2012 13:37:28 -0800
Message-ID: <7vy5gkmr53.fsf@alter.siamese.dyndns.org>
In-Reply-To: <CALWbr2xW6r5ysJ8KQZa1eGYehG8ZbEp6K+s5JkG2goK9ef7rcA@mail.gmail.com> (Antoine Pelisse's message of "Wed, 26 Dec 2012 22:12:16 +0100")

Antoine Pelisse <apelisse@gmail.com> writes:

>>>
>>> +static int commit_rewrite_authors(struct strbuf *buf, const char *what, struct string_list *mailmap)
>>> +{
>>> +     char *author, *endp;
>>> +     size_t len;
>>> +     struct strbuf name = STRBUF_INIT;
>>> +     struct strbuf mail = STRBUF_INIT;
>>> +     struct ident_split ident;
>>> +
>>> +     author = strstr(buf->buf, what);
>>> +     if (!author)
>>> +             goto error;
>>
>> This does not stop at the end of the header part and would match a
>> random line in the log message that happens to begin with "author ";
>> is this something we would worry about, or would we leave it to "fsck"?
>
> The only worrying case would be:
> ...

Yeah, that pretty much matches what I had in mind (the short answer:
leave it to "git fsck").

>> We usually signal error by returning a negative integer.  It does
>> not matter too much in this case as no callers seem to check the
>> return value from this function, though.
>
> Fixed, or would you rather see it `void` ?

Just as this function can use the return value of map_user(), which
tells the caller whether it did anything, to skip unnecessary work,
in the longer run it may help (future) callers of this function if
it reported "I did something" vs "I left it intact".  In the
particular case of this function, the "error" cases fall into the
latter (an error merely explains why it left the buffer intact, and
there is no sensible error recovery the caller _could_ do in any
case), so I think it is not necessary to differentiate between
"Returned as-is because there is no mapping" and "Returned as-is
because I couldn't parse the commit".

So "return 0 when it didn't do anything, return 1 when it rewrote"
feels good enough, at least to me.

>>> +     }
>>> +
>>> +     strbuf_add(&name, ident.name_begin, ident.name_end - ident.name_begin);
>>> +     strbuf_add(&mail, ident.mail_begin, ident.mail_end - ident.mail_begin);
>>> +
>>> +     map_user(mailmap, &mail, &name);
>>> +
>>> +     strbuf_addf(&name, " <%s>", mail.buf);
>>> +
>>> +     strbuf_splice(buf, ident.name_begin - buf->buf,
>>> +                   ident.mail_end - ident.name_begin + 1,
>>> +                   name.buf, name.len);
>>
>> Would it give us better performance if we splice only when
>> map_user() tells us that we actually rewrote the ident?
>
> My intuition was that the cost of splice comes from the memmove() needed
> when the size is different. Yet, fixed, as it removes two copies.

Thanks.

I wonder if we can further restructure the code so that it first
inspects the existing buffer to see if it even needs to copy the
original commit buffer into a "strbuf only for grepping".  If that
can be easily done, then we will save even more copying, I think.
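
A rough sketch of such a pre-check, assuming the raw commit buffer has
the usual "header lines, blank line, message" layout.  The function
name commit_needs_mapping() and the flat array of mapped e-mail
addresses are hypothetical simplifications for illustration; the real
mailmap lookup is richer than this.

```c
#include <string.h>

/* Return 1 if the mail (not NUL-terminated, `len` bytes) is in `mapped`. */
static int mail_is_mapped(const char *mail, size_t len,
			  const char *const *mapped, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (strlen(mapped[i]) == len && !strncmp(mapped[i], mail, len))
			return 1;
	return 0;
}

/*
 * Scan only the header part of a raw commit buffer (everything before
 * the first blank line) and report whether any "author " or
 * "committer " line carries an e-mail address that would be mapped.
 * Returns 1 if a copy-and-rewrite is needed, 0 if the original buffer
 * can be used as-is.
 */
static int commit_needs_mapping(const char *buf,
				const char *const *mapped, size_t nr)
{
	while (*buf && *buf != '\n') {
		const char *eol = strchr(buf, '\n');
		size_t len = eol ? (size_t)(eol - buf) : strlen(buf);

		if ((len > 7 && !strncmp(buf, "author ", 7)) ||
		    (len > 10 && !strncmp(buf, "committer ", 10))) {
			const char *lt = memchr(buf, '<', len);
			const char *gt = lt ?
				memchr(lt, '>', len - (size_t)(lt - buf)) : NULL;

			if (gt && mail_is_mapped(lt + 1, (size_t)(gt - lt - 1),
						 mapped, nr))
				return 1;
		}
		if (!eol)
			break;
		buf = eol + 1;
	}
	return 0;
}
```

Because the loop stops at the first blank line, a log message line that
happens to begin with "author " is never looked at, and the copy is
made only for commits whose idents actually need rewriting.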

The reason I alluded to revamping the grep API to get rid of the use
of "header grep" mode in this codepath was exactly that.  We could:

 - change the command line parser for --author= and --committer= so
   that these do not become part of the main "grep" expression.
   Instead we keep them as separate grep expressions (one "author"
   expression that OR'es the --author= options together, the other
   for the --committer= options);

 - in this codepath, inspect the "author" and "committer" in the
   commit object buffer, map them if necessary via the mailmap
   mechanism into temporary buffers (that is different from the
   "buf" in the commit_match() function), then run grep_buffer()
   with the author and committer grep expressions we separated in
   the previous step. Then we combine the results from "author" and
   "committer" grep and the main grep_buffer() result ourselves in
   this function.
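
The two steps above could be sketched roughly as follows.  This is only
an illustration of the shape, not a proposal for the actual grep API:
"struct pattern_set", match_any() and commit_matches() are hypothetical
names, plain substring search stands in for grep_buffer(), and combining
the three results with AND is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical expression: matches when any one pattern is found (OR). */
struct pattern_set {
	const char *const *patterns;
	size_t nr;
};

/* An empty set places no restriction, so it matches everything. */
static int match_any(const struct pattern_set *set, const char *text)
{
	size_t i;

	if (!set->nr)
		return 1;
	for (i = 0; i < set->nr; i++)
		if (strstr(text, set->patterns[i]))
			return 1;
	return 0;
}

/*
 * Run the "author", "committer" and main expressions separately and
 * combine the results here.  The author and committer strings are
 * assumed to have already been passed through the mailmap into
 * temporary buffers; the commit message itself is never copied.
 */
static int commit_matches(const struct pattern_set *author_pats,
			  const struct pattern_set *committer_pats,
			  const struct pattern_set *main_pats,
			  const char *mapped_author,
			  const char *mapped_committer,
			  const char *message)
{
	return match_any(author_pats, mapped_author) &&
	       match_any(committer_pats, mapped_committer) &&
	       match_any(main_pats, message);
}
```

Only the short author and committer strings go through temporary
buffers here; the main expression runs on the original message.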

That may essentially amount to going in the totally opposite
direction from what 2d10c55 (git log: Unify header_filter and
message_filter into one., 2006-09-20) attempted to do.  We used to
have two grep expressions (one for the header, the other for the
body); commit_match() ran grep_buffer() on each and combined the
results.  2d10c55 merged them into one grep expression by introducing
a term that matches only header elements.  If we go the above route,
we would instead split the "header" expression into "author" and
"committer" expressions (making it three from one).

That would eliminate the need to copy and rewrite the contents of
the commit object in this codepath, which may be a big win when
names and emails that need to be rewritten are minority cases.

But I suspect that is a much larger change.  If we can reduce the
amount of copies necessary without changing the code structure, that
may be enough to reduce the performance hit from this change.

Thanks.

Thread overview: 15+ messages
2012-12-22 16:58 [PATCH 0/2] Mailmap in log improvements Antoine Pelisse
2012-12-22 16:58 ` [PATCH 1/2] log: grep author/committer using mailmap Antoine Pelisse
2012-12-26 19:27   ` Junio C Hamano
2012-12-26 21:12     ` Antoine Pelisse
2012-12-26 21:37       ` Junio C Hamano [this message]
2012-12-27 15:31         ` [PATCH v2] " Antoine Pelisse
2012-12-27 18:45           ` Junio C Hamano
2012-12-27 18:48             ` Junio C Hamano
2012-12-28 18:00               ` Antoine Pelisse
2012-12-28 18:43                 ` Junio C Hamano
2012-12-28 20:37                   ` Antoine Pelisse
2012-12-22 16:58 ` [PATCH 2/2] log: add log.mailmap configuration option Antoine Pelisse
2012-12-23  4:26   ` Junio C Hamano
2012-12-26 16:14     ` Junio C Hamano
2012-12-26 16:42       ` Antoine Pelisse
