git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johan Herland <johan@herland.net>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	trast@student.ethz.ch, tavestbo@trolltech.com,
	git@drmicha.warpmail.net, chriscool@tuxfamily.org,
	spearce@spearce.org
Subject: Re: [PATCHv5 00/14] git notes
Date: Tue, 8 Sep 2009 14:36:30 +0200	[thread overview]
Message-ID: <200909081436.30761.johan@herland.net> (raw)
In-Reply-To: <alpine.DEB.1.00.0909081100020.4330@intel-tinevez-2-302>

Hi,

On Tuesday 08 September 2009, Johannes Schindelin wrote:
> On Tue, 8 Sep 2009, Johan Herland wrote:
> > Algorithm / Notes tree   git log -n10 (x100)   git log --all
> > ------------------------------------------------------------
> > next / no-notes                4.77s              63.84s
> >
> > before / no-notes              4.78s              63.90s
> > before / no-fanout            56.85s              65.69s
> >
> > 16tree / no-notes              4.77s              64.18s
> > 16tree / no-fanout            30.35s              65.39s
> > 16tree / 2_38                  5.57s              65.42s
> > 16tree / 2_2_36                5.19s              65.76s
> >
> > flexible / no-notes            4.78s              63.91s
> > flexible / no-fanout          30.34s              65.57s
> > flexible / 2_38                5.57s              65.46s
> > flexible / 2_2_36              5.18s              65.72s
> > flexible / ym                  5.13s              65.66s
> > flexible / ym_2_38             5.08s              65.63s
> > flexible / ymd                 5.30s              65.45s
> > flexible / ymd_2_38            5.29s              65.90s
> > flexible / y_m                 5.11s              65.72s
> > flexible / y_m_2_38            5.08s              65.67s
> > flexible / y_m_d               5.06s              65.50s
> > flexible / y_m_d_2_38          5.07s              65.79s
>
> It's good to see that the no-notes behaves roughly like baseline.

Agreed.

> I can see that some people may think that date-based fan-out is the
> cat's ass, but I have to warn that we have no idea how notes will be
> used,

I don't agree. Although we will certainly see many more use cases for 
notes, I believe that the vast majority of them can be placed in one of 
two categories:

1. Looking up a few (1 - 10) notes on an individual basis. In this case, 
performance will always be "good enough", and I don't believe it's 
worth spending much time optimizing for this case.

2. Looking up many notes in a sequence based on the chronology of the 
objects (commits) they annotate. This is what 'git log' does, and I 
believe this is the case we should optimize for.

Once you work from this assumption, it is clear that date-based fanout 
beats pure SHA1-based fanout, because it changes the notes lookup 
process from random-access to near-streaming. This is clearly reflected 
in the memory usage graphs I posted.

Also note that the "flexible" code to some degree resolves the whole 
date-based fanout vs. SHA1-based fanout discussion: We are now free to 
choose -- at runtime -- which notes tree structure is more optimal for 
a given collection of notes.

> and the date-based fan-out is rather limiting in certain respects:
>
> - for the typical nightly-build-generated notes, this fan-out is
> pretty inefficient memory-wise.

Yes and no. A y/m/d/40 structure with 1 note for every y/m/d is indeed 
wasteful, but using a y/40 structure instead creates a much better 
situation with a "healthy" ~365 notes per year level. And the y/40 
still preserves some of the 'streaming' aspects mentioned above.

> - I find the restriction to commits rather limiting.

I see your point, but I don't agree until I see a compelling case for 
annotating a non-commit.

In any case, the flexible lookup code still allow us to organize notes 
purely by SHA1-based fanout, so annotated trees/blobs could still be 
supported by the same code (modulo a small s/struct commit/struct 
object/ patch) provided that they are stored in a notes tree that 
simply does not use date-based fanout.

> - most of the performance difference between the date-based and the
> SHA-1 based fan-out looks to me as if the issue was the top-level
> tree. Basically, this tree has to be read _every_ time _anybody_
> wants to read a note.

Not sure what you're trying to say here. The top-level notes tree is 
read (as in fill_tree_descriptor()) exactly _once_. After that, it is 
cached by the internal data structure (until free_commit_notes() or 
end-of-process).

In the SHA1-based case, each notes lookup does indeed start at the root 
of the data structure (corresponding to the top-level tree), but in the 
date-based case, we keep a pointer (cur_node) to the innermost 
date-based tree that was most recently used. Thus, if the next note is 
within the same date-based tree (which I assume is the common case), 
then we don't need to look at the root of the data structure.

>   Maybe a finer-grained fan-out (finer than 16-bits) could help. 
> After all, if you have 16 different notes, chances are that they have
> 16 different first letters, but all have the same commit year. 
> That's where the top-level notes with a fan-out perform incredibly
> bad.

Not really, the first lookup would start at the root node, and navigate 
into the year node, but all subsequent lookups would start directly at 
the year node (and only backtrack if the commit year doesn't match the 
year node).

BTW, when you mention "finer than 16-bits", do you mean moving from a 
16-tree to a, say, 32-tree or 64-tree structure? That will complicate 
the tree navigation somewhat, and increase memory waste. (Remember, I 
started out with a 256-tree idea, but scrapped it because of memory 
waste).

>   But I think that having a dynamic fan-out that can even put blobs
> into the top-level tree (nothing prevents us from doing that, right?)

Well, the "flexible" code does add the new requirement that all entries 
in a notes (sub)tree object must follow the same scheme, i.e. you 
cannot have:

  /12/34567890123456789012345678901234567890
  /2345/678901234567890123456789012345678901

but you can have

  /12/34567890123456789012345678901234567890
  /23/45/678901234567890123456789012345678901

> would _outperform_ the date-based one, at least with less than 1
> note/commit (and maybe even then, because the year-based fan-out
> results in pretty varying entropies per fan-out depth).
>
>   The real question for me, therefore, is: what is the optimal way to
>   strike the balance between size of the tree objects (which we want
> to be small, so that unpacking them is fast)  and depth of the
> fan-out (which we want to be shallow to avoid reading worst-case 39
> tree objects to get at one single note).

s/39/19/ (each fanout must use at least 2 chars of the 40-char SHA1)

Yes, the challenge is indeed striking the correct balance. I believe 
that the notes code should be taught to write (and automatically 
re-organize) the notes tree so that it is optimized for the current 
collection of notes.

> - related to the previous point is my gut feeling that the date-based
>   fan-out has nothing to do with any theoretical optimum.  I am
> pretty certain that the optimal fan-out strategy depends heavily on
> the SHA-1s of the annotated objects (if you have 10,000 notes in
> 2009, but only 1 in 2008, the year-based fan-out _must_ be
> suboptimal)  and maybe is something like a sibling to the Fibonacci
> heap.

Yes, it is trivial to create scenarios where any rigid date-based fanout 
is suboptimal. However, it is equally trivial to create scenarios where 
any SHA1-based fanout will perform worse than a carefully chosen 
date-based fanout. I believe the best way forward is to design for 
flexibility in the notes tree structure, and then teach the notes 
_writing_ code to choose a notes tree structure that is good/optimal 
for the given set of notes.

> - I'd love to see performance numbers for less than 157118 notes. 
> Don't get me wrong, it is good to see the worst-case scenario in
> terms of notes/commits ratio.  But it will hardly be the common case,
> and I very much would like to optimize for the common case.
>
>   So, I'd appreciate if you could do the tests with something like
> 500 notes, randomly spread over the commits (rationale: my original
> understanding was that the notes could amend commit messages, and
> that is much more likely to be done with relatively old commits that
> you cannot change anymore).

Ok. I will try to test that.

> Please understand that I might not have the time to participate in
> this thread as much as I would like to.  The next 4 days will be
> especially hard.

Thanks for the feedback! I appreciate any time you're able to spend on 
this. And I don't mind waiting for a few days for more feedback.


Have fun! :)

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

  reply	other threads:[~2009-09-08 12:38 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-08  2:26 [PATCHv5 00/14] git notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 01/14] Introduce commit notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 02/14] Add a script to edit/inspect notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 03/14] Speed up git notes lookup Johan Herland
2009-09-08  2:26 ` [PATCHv5 04/14] Add an expensive test for git-notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 05/14] Teach "-m <msg>" and "-F <file>" to "git notes edit" Johan Herland
2009-09-08  2:26 ` [PATCHv5 06/14] fast-import: Add support for importing commit notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 07/14] t3302-notes-index-expensive: Speed up create_repo() Johan Herland
2009-09-08  2:26 ` [PATCHv5 08/14] Add flags to get_commit_notes() to control the format of the note string Johan Herland
2009-09-08  2:26 ` [PATCHv5 09/14] Add '%N'-format for pretty-printing commit notes Johan Herland
2009-09-08  2:26 ` [PATCHv5 10/14] Teach notes code to free its internal data structures on request Johan Herland
2009-09-08  2:26 ` [PATCHv5 11/14] Teach the notes lookup code to parse notes trees with various fanout schemes Johan Herland
2009-09-08  2:27 ` [PATCHv5 12/14] Selftests verifying semantics when loading notes trees with various fanouts Johan Herland
2009-09-08  2:27 ` [PATCHv5 13/14] Allow flexible organization of notes trees, using both commit date and SHA1 Johan Herland
2009-09-08  2:27 ` [PATCHv5 14/14] Add test cases for date-based fanouts Johan Herland
2009-09-08  3:12 ` [PATCHv5 00/14] git notes Johan Herland
2009-09-08  4:16   ` Junio C Hamano
2009-09-08  8:54     ` Johan Herland
2009-09-08  9:32       ` Johannes Schindelin
2009-09-08 12:36         ` Johan Herland [this message]
2009-09-08 15:53           ` Johannes Schindelin
2009-09-08 22:46             ` Johan Herland
2009-09-10  6:23               ` Stephen R. van den Berg
2009-09-10  9:25           ` Johan Herland
2009-09-08 20:31         ` Junio C Hamano
2009-09-08 21:10           ` Shawn O. Pearce
2009-09-08 21:36             ` Sverre Rabbelier
2009-09-08 21:39               ` Shawn O. Pearce
2009-09-08 21:57                 ` Sverre Rabbelier
2009-09-08 21:40           ` Johan Herland
2009-09-12 15:50   ` Johan Herland
2009-09-12 18:11     ` Shawn O. Pearce
2009-09-12 18:35       ` Johan Herland
2009-09-10 14:00 ` Geert Bosch
2009-09-10 14:09   ` Michael J Gruber
2009-09-10 14:12     ` Geert Bosch
2009-09-12  0:11 ` Junio C Hamano
2009-09-12 15:52   ` Johan Herland
2009-09-12 16:08     ` [PATCHv6 " Johan Herland
2009-09-12 16:08     ` [PATCHv6 01/14] Introduce commit notes Johan Herland
2009-09-12 16:08     ` [PATCHv6 02/14] Add a script to edit/inspect notes Johan Herland
2009-09-12 16:08     ` [PATCHv6 03/14] Speed up git notes lookup Johan Herland
2009-09-12 16:08     ` [PATCHv6 04/14] Add an expensive test for git-notes Johan Herland
2009-09-12 16:08     ` [PATCHv6 05/14] Teach "-m <msg>" and "-F <file>" to "git notes edit" Johan Herland
2009-09-12 16:08     ` [PATCHv6 06/14] fast-import: Add support for importing commit notes Johan Herland
2009-09-12 16:08     ` [PATCHv6 07/14] t3302-notes-index-expensive: Speed up create_repo() Johan Herland
2009-09-12 16:08     ` [PATCHv6 08/14] Add flags to get_commit_notes() to control the format of the note string Johan Herland
2009-09-12 16:08     ` [PATCHv6 09/14] Add '%N'-format for pretty-printing commit notes Johan Herland
2009-09-12 16:08     ` [PATCHv6 10/14] Teach notes code to free its internal data structures on request Johan Herland
2009-09-12 18:40       ` Junio C Hamano
2009-09-12 22:21         ` Johan Herland
2009-09-12 16:08     ` [PATCHv6 11/14] Teach the notes lookup code to parse notes trees with various fanout schemes Johan Herland
2009-09-12 16:08     ` [PATCHv6 12/14] Selftests verifying semantics when loading notes trees with various fanouts Johan Herland
2009-09-12 16:08     ` [PATCHv6 13/14] Allow flexible organization of notes trees, using both commit date and SHA1 Johan Herland
2009-09-12 18:41       ` Junio C Hamano
2009-09-12 22:33         ` Johan Herland
2009-09-12 23:37           ` Junio C Hamano
2009-09-12 16:08     ` [PATCHv6 14/14] Add test cases for various date-based fanouts Johan Herland

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200909081436.30761.johan@herland.net \
    --to=johan@herland.net \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=chriscool@tuxfamily.org \
    --cc=git@drmicha.warpmail.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=spearce@spearce.org \
    --cc=tavestbo@trolltech.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).