git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Michael Giuffrida" <michaelpg@chromium.org>,
	git@vger.kernel.org, "SZEDER Gábor" <szeder.dev@gmail.com>
Subject: Re: [BUG] add_again() off-by-one error in custom format
Date: Thu, 15 Jun 2017 01:56:54 -0400	[thread overview]
Message-ID: <20170615055654.efvsouhr3leszz3i@sigill.intra.peff.net> (raw)
In-Reply-To: <dae96f72-761c-3ed1-4567-0933acc7618a@web.de>

On Wed, Jun 14, 2017 at 08:24:25PM +0200, René Scharfe wrote:

> > I think the real question is how likely people use more than one
> > occurrence of the same thing in their custom format, and how deeply
> > they care that --format='%h %h' costs more than --format='%h'.  The
> > cost won't of course be double (because the main traversal costs
> > without any output), but it would be rather unreasonable to expect
> > that --format='%h %h %h %h %h' to cost the same as --format='%h';
> > after all, Git is doing more for them ;-)
> 
> The answer to the first half is obviously "very likely" -- otherwise
> this bug wouldn't have been found, right? :)
> 
> Regarding the question of how bad a 50% slowdown for a second %h
> would be: No idea.  If ran interactively it may not even be noticeable
> because the user can read the first few lines in less while the rest
> is prepared in the background.  We don't have a perf test for formats
> with duplicate short hashes, so we don't promise anything, right? :)

One interesting thing is that the cost of finding short hashes very much
depends on your loose object setup. I timed:

  git log --format=%H >/dev/null

versus

  git log --format=%h >/dev/null

on git.git. It went from about 400ms to about 800ms. But then I noticed
I had a lot of loose object directories, and ran "git gc --prune=now".
Afterwards, my timings were more like 380ms and 460ms.

The difference is that in the "before" case, we actually opened each
directory and ran getdents(). But after gc, the directories are gone
totally and open() fails. We also have to do a linear walk through the
objects in each directory, since the contents are sorted.

So I wonder if it is worth trying to optimize the short-sha1 computation
in the first place. Double-%h aside, that would make _everything_
faster, including --oneline.

I'm not really sure how, though, short of caching the directory
contents. That opens up questions of whether and when to invalidate the
cache. If the cache were _just_ about short hashes, it might be OK to
just assume that it remains valid through the length of the program (so
worst case, a simultaneous write might mean that we generate a sha1
which just became ambiguous, but that's generally going to be racy
anyway).

The other downside of course is that we'd spend RAM on it. We could
bound the size of the cache, I suppose.

-Peff

  reply	other threads:[~2017-06-15  5:56 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-12  3:13 [BUG] add_again() off-by-one error in custom format Michael Giuffrida
2017-06-12 22:49 ` Junio C Hamano
2017-06-13 18:09   ` René Scharfe
2017-06-13 18:29     ` Junio C Hamano
2017-06-13 20:29       ` René Scharfe
2017-06-13 21:20         ` Junio C Hamano
2017-06-14 18:24           ` René Scharfe
2017-06-15  5:56             ` Jeff King [this message]
2017-06-15 11:33               ` René Scharfe
2017-06-15 13:25                 ` Jeff King
2017-06-18 10:58                   ` René Scharfe
2017-06-18 11:49                     ` Jeff King
2017-06-18 12:59                       ` René Scharfe
2017-06-18 13:56                         ` Jeff King
2017-06-22 18:19                           ` René Scharfe
2017-06-22 23:15                             ` Jeff King
2017-06-18 10:58                   ` René Scharfe
2017-06-18 11:50                     ` Jeff King
2017-06-19  4:46                       ` Junio C Hamano
2017-06-22 18:19                         ` [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename() René Scharfe
2017-06-22 23:10                           ` Jeff King
2017-06-24 12:12                             ` René Scharfe
2017-06-24 12:14                               ` Jeff King
2017-06-24 12:12                             ` René Scharfe
2017-06-24 12:20                               ` Jeff King
2017-06-24 14:09                                 ` René Scharfe
2017-06-24 14:12                                   ` Jeff King
2017-06-15 18:37             ` [BUG] add_again() off-by-one error in custom format Junio C Hamano
2017-06-13 22:24         ` SZEDER Gábor
2017-06-14 17:34           ` René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170615055654.efvsouhr3leszz3i@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=michaelpg@chromium.org \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).