[PATCH] blame.c: don't drop origin blobs as eagerly

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] blame.c: don't drop origin blobs as eagerly
@ 2016-05-27 13:35 David Kastrup
  2016-05-27 15:00 ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: David Kastrup @ 2016-05-27 13:35 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable
optimization.  It's conceivable that this may incur additional memory
pressure particularly when the history contains lots of merges from
long-diverged branches.  In practice, this optimization appears to
behave quite benignly, and a viable strategy for limiting the total
amount of cached blobs in a useful manner seems rather hard to
implement.  In addition, calling git-blame with -C leads to similar
memory retention patterns.
---
 builtin/blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 21f42b0..2596fbc 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1556,7 +1556,8 @@ finish:
 	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
-			drop_origin_blob(sg_origin[i]);
+			if (!sg_origin[i]->suspects)
+				drop_origin_blob(sg_origin[i]);
 			origin_decref(sg_origin[i]);
 		}
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-27 13:35 David Kastrup
@ 2016-05-27 15:00 ` Johannes Schindelin
  2016-05-27 15:41   ` David Kastrup
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2016-05-27 15:00 UTC (permalink / raw)
  To: David Kastrup; +Cc: git, gitster

Hi David,

it is good practice to Cc: the original author of the code in question, in
this case Junio. I guess he sees this anyway, but that is really just an
assumption.

On Fri, 27 May 2016, David Kastrup wrote:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.

It is obvious from your commit message that you have studied the code
quite deeply. To make it easier for the reader (which might be your future
self), it is advisable to give at least a little bit of introduction, e.g.
what the "parent blob" is.

I would *guess* that it is the blob corresponding to the same path in the
parent of the current revision, but that should be spelled out explicitly.

> Keeping such parent blobs in memory seems like a reasonable
> optimization.  It's conceivable that this may incur additional memory

This sentence would be easier to read if "It's conceivable that" was
simply deleted.

> pressure particularly when the history contains lots of merges from
> long-diverged branches.  In practice, this optimization appears to
> behave quite benignly,

Why not just stop here? I say that because...

> and a viable strategy for limiting the total amount of cached blobs in a
> useful manner seems rather hard to implement.

... this sounds awfully handwaving. Since we already have reference
counting, it sounds fishy to claim that simply piggybacking a global
counter on top of it would be "rather hard".

> In addition, calling git-blame with -C leads to similar memory retention
> patterns.

This is a red herring. Just delete it. I, for one, being a heavy user of
`git blame`, could count the number of times I used blame's -C option
without any remaining hands. Zero times.

Besides, -C is *supposed* to look harder. By that argument, you could read
all blobs in rev-list even when the user did not specify --objects
"because --objects leads to similar memory retention patterns". So: let's
just forget about that statement.

The commit message is missing your sign-off.

Also: is there an easy way to reproduce your claims of better I/O
characteristics? Something like a command-line, ideally with a file in
git.git's own history, that demonstrates the I/O before and after the
patch, would be an excellent addition to the commit message.

Further: I would have at least expected some rudimentary discussion why
this patch -- which seems to at least partially contradict 7c3c796 (blame:
drop blob data after passing blame to the parent, 2007-12-11) -- is not
regressing on the intent of said commit.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index 21f42b0..2596fbc 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -1556,7 +1556,8 @@ finish:
>  	}
>  	for (i = 0; i < num_sg; i++) {
>  		if (sg_origin[i]) {
> -			drop_origin_blob(sg_origin[i]);
> +			if (!sg_origin[i]->suspects)
> +				drop_origin_blob(sg_origin[i]);
>  			origin_decref(sg_origin[i]);
>  		}

It would be good to mention in the commit message that this patch does not
change anything for blobs with only one remaining reference (the current
one) because origin_decref() would do the same job as drop_origin_blob
when decrementing the reference counter to 0.

In fact, I suspect that simply removing the drop_origin_blob() call might
result in the exact same I/O pattern.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-27 15:00 ` Johannes Schindelin
@ 2016-05-27 15:41   ` David Kastrup
  2016-05-28  6:37     ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: David Kastrup @ 2016-05-27 15:41 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> pressure particularly when the history contains lots of merges from
>> long-diverged branches.  In practice, this optimization appears to
>> behave quite benignly,
>
> Why not just stop here?

Because there is a caveat.

> I say that because...
>
>> and a viable strategy for limiting the total amount of cached blobs in a
>> useful manner seems rather hard to implement.
>
> ... this sounds awfully handwaving.

Because it is.

> Since we already have reference counting, it sounds fishy to claim
> that simply piggybacking a global counter on top of it would be
> "rather hard".

You'll see that the patch is from 2014.  When I actively worked on it,
I found no convincing/feasible way to enforce a reasonable hard limit.
I am not picking up work on this again but am merely flushing my queue
so that the patch going to waste is not on my conscience.

>> In addition, calling git-blame with -C leads to similar memory retention
>> patterns.
>
> This is a red herring. Just delete it. I, for one, being a heavy user of
> `git blame`, could count the number of times I used blame's -C option
> without any remaining hands. Zero times.
>
> Besides, -C is *supposed* to look harder.

We are not talking about "looking harder" but "taking more memory than
the set limit".

> Also: is there an easy way to reproduce your claims of better I/O
> characteristics? Something like a command-line, ideally with a file in
> git.git's own history, that demonstrates the I/O before and after the
> patch, would be an excellent addition to the commit message.

I've used it on the wortliste repository and system time goes down from
about 70 seconds to 50 seconds (this is a flash drive).  User time from
about 4:20 to 4:00.  It is a rather degenerate repository (predominantly
small changes in one humongous text file) so savings for more typical
cases might end up less than that.  But then it is degenerate
repositories that are most costly to blame.

> Further: I would have at least expected some rudimentary discussion
> why this patch -- which seems to at least partially contradict 7c3c796
> (blame: drop blob data after passing blame to the parent, 2007-12-11)
> -- is not regressing on the intent of said commit.

It is regressing on the intent of said commit by _retaining_ blob data
that it is _sure_ to look at again because of already having this data
asked for again in the priority queue.  That's the point.  It only drops
the blob data for which it has no request queued yet.  But there is no
handle on when the request is going to pop up until it actually leaves
the priority queue: the priority queue is a heap IIRC and thus only
provides partial orderings.

>> diff --git a/builtin/blame.c b/builtin/blame.c
>> index 21f42b0..2596fbc 100644
>> --- a/builtin/blame.c
>> +++ b/builtin/blame.c
>> @@ -1556,7 +1556,8 @@ finish:
>>  	}
>>  	for (i = 0; i < num_sg; i++) {
>>  		if (sg_origin[i]) {
>> -			drop_origin_blob(sg_origin[i]);
>> +			if (!sg_origin[i]->suspects)
>> +				drop_origin_blob(sg_origin[i]);
>>  			origin_decref(sg_origin[i]);
>>  		}
>
> It would be good to mention in the commit message that this patch does not
> change anything for blobs with only one remaining reference (the current
> one) because origin_decref() would do the same job as drop_origin_blob
> when decrementing the reference counter to 0.

A sizable number of references but not blobs are retained and needed for
producing the information in the final printed output (at least when
using the default sequential output instead of --incremental or
--porcelaine or similar).

> In fact, I suspect that simply removing the drop_origin_blob() call
> might result in the exact same I/O pattern.

It's been years since I actually worked on the code but I'm still pretty
sure you are wrong.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-27 15:41   ` David Kastrup
@ 2016-05-28  6:37     ` Johannes Schindelin
  2016-05-28  8:29       ` David Kastrup
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2016-05-28  6:37 UTC (permalink / raw)
  To: David Kastrup; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 5585 bytes --]

Hi David,

On Fri, 27 May 2016, David Kastrup wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Fri, 27 May 2016, David Kastrup wrote:
> >
> >> pressure particularly when the history contains lots of merges from
> >> long-diverged branches.  In practice, this optimization appears to
> >> behave quite benignly,
> >
> > Why not just stop here?
> 
> Because there is a caveat.
> 
> > I say that because...
> >
> >> and a viable strategy for limiting the total amount of cached blobs in a
> >> useful manner seems rather hard to implement.
> >
> > ... this sounds awfully handwaving.
> 
> Because it is.

Hrm. Are you really happy with this, on public record?

> > Since we already have reference counting, it sounds fishy to claim
> > that simply piggybacking a global counter on top of it would be
> > "rather hard".
> 
> You'll see that the patch is from 2014.

No I do not. There was no indication of that.

> When I actively worked on it, I found no convincing/feasible way to
> enforce a reasonable hard limit.  I am not picking up work on this again
> but am merely flushing my queue so that the patch going to waste is not
> on my conscience.

Hmm. Speaking from my point of view as a maintainer, this raises a yellow
alert. Sure, I do not maintain git.git, but if I were, I would want my
confidence in the quality of this patch, and in the patch author being
around when things go south with it, strengthened. This paragraph achieves
the opposite.

> >> In addition, calling git-blame with -C leads to similar memory retention
> >> patterns.
> >
> > This is a red herring. Just delete it. I, for one, being a heavy user of
> > `git blame`, could count the number of times I used blame's -C option
> > without any remaining hands. Zero times.
> >
> > Besides, -C is *supposed* to look harder.
> 
> We are not talking about "looking harder" but "taking more memory than
> the set limit".

I meant to say: *of course* it uses more memory, it has a much harder job.

> > Also: is there an easy way to reproduce your claims of better I/O
> > characteristics? Something like a command-line, ideally with a file in
> > git.git's own history, that demonstrates the I/O before and after the
> > patch, would be an excellent addition to the commit message.
> 
> I've used it on the wortliste repository and system time goes down from
> about 70 seconds to 50 seconds (this is a flash drive).  User time from
> about 4:20 to 4:00.  It is a rather degenerate repository (predominantly
> small changes in one humongous text file) so savings for more typical
> cases might end up less than that.  But then it is degenerate
> repositories that are most costly to blame.

Well, obviously I did not mean to measure time, but I/O. Something like an
strace call that pipes into some grep that in turn pipes into wc.

> > Further: I would have at least expected some rudimentary discussion
> > why this patch -- which seems to at least partially contradict 7c3c796
> > (blame: drop blob data after passing blame to the parent, 2007-12-11)
> > -- is not regressing on the intent of said commit.
> 
> It is regressing on the intent of said commit by _retaining_ blob data
> that it is _sure_ to look at again because of already having this data
> asked for again in the priority queue.  That's the point.  It only drops
> the blob data for which it has no request queued yet.  But there is no
> handle on when the request is going to pop up until it actually leaves
> the priority queue: the priority queue is a heap IIRC and thus only
> provides partial orderings.

Again, this lowers my confidence in the patch. Mind you, the patch could
be totally sound! Yet its commit message, and unfortunately even more so
the discussion we are having right now, offer no adequate reason to assume
that. If you, as the patch author, state that you are not sure what this
thing does exactly, we must conservatively assume that it contains flaws.

> >> diff --git a/builtin/blame.c b/builtin/blame.c
> >> index 21f42b0..2596fbc 100644
> >> --- a/builtin/blame.c
> >> +++ b/builtin/blame.c
> >> @@ -1556,7 +1556,8 @@ finish:
> >>  	}
> >>  	for (i = 0; i < num_sg; i++) {
> >>  		if (sg_origin[i]) {
> >> -			drop_origin_blob(sg_origin[i]);
> >> +			if (!sg_origin[i]->suspects)
> >> +				drop_origin_blob(sg_origin[i]);
> >>  			origin_decref(sg_origin[i]);
> >>  		}
> >
> > It would be good to mention in the commit message that this patch does not
> > change anything for blobs with only one remaining reference (the current
> > one) because origin_decref() would do the same job as drop_origin_blob
> > when decrementing the reference counter to 0.
> 
> A sizable number of references but not blobs are retained and needed for
> producing the information in the final printed output (at least when
> using the default sequential output instead of --incremental or
> --porcelaine or similar).

Sorry, please help me understand how that response relates to my
suggestion to improve the commit message.

> > In fact, I suspect that simply removing the drop_origin_blob() call
> > might result in the exact same I/O pattern.
> 
> It's been years since I actually worked on the code but I'm still pretty
> sure you are wrong.

The short version of your answer is that you will leave this patch in its
current form and address none of my concerns because you moved on,
correct? If so, that's totally okay, it just needs to be spelled out.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-28  6:37     ` Johannes Schindelin
@ 2016-05-28  8:29       ` David Kastrup
  2016-05-28 12:34         ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: David Kastrup @ 2016-05-28  8:29 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> 
>> > On Fri, 27 May 2016, David Kastrup wrote:
>> >
>> >> pressure particularly when the history contains lots of merges from
>> >> long-diverged branches.  In practice, this optimization appears to
>> >> behave quite benignly,
>> >
>> > Why not just stop here?
>> 
>> Because there is a caveat.
>> 
>> > I say that because...
>> >
>> >> and a viable strategy for limiting the total amount of cached blobs in a
>> >> useful manner seems rather hard to implement.
>> >
>> > ... this sounds awfully handwaving.
>> 
>> Because it is.
>
> Hrm. Are you really happy with this, on public record?

Shrug.  git-blame has limits for its memory usage which it generally
heeds, throwing away potentially useful data to do so.  git-blame -C has
a store of data retained outside of those limits because nobody bothered
reining them in, and this patch similarly retains data outside of those
limits, though strictly limited to the pending priority queue size which
controls the descent in reverse commit time order.

In actual measurements on actual heavy-duty repositories I could not
discern a difference in memory usage after the patch.  The difference is
likely only noticeable when you have lots of old long branches which is
not a typical development model.

The current data structures don't make harder constraints on memory
pressure feasible, and enforcing such constraints would incur a lot of
swapping (not by the operating system which does it more efficiently
particularly since it does not need to decompress every time, but by the
application constantly rereading and recompressing data).

git-blame -C cares jack-shit about that (and it's used by git-gui's
second pass if I remember correctly) and nobody raised concerns about
its memory usage ever.

I've taken out a lot of hand-waving of the "eh, good enough" kind from
git-blame.  If one wanted to have the memory limits enforced more
stringently than they currently are, one would need a whole new layer of
FIFO and accounting.  This is not done in the current code base for
existing functionality.

And nobody bothered implementing it in decades or complaining about it
in spite of it affecting git-blame -C (and consequently git-gui blame).

Yes, this patch is another hand waving beside an already waving crowd.
I am not going to lie about it but I am also not going to invest the
time to fix yet another part of git-blame that has in decades not shown
to be a problem anybody considered worth reporting let alone fixing.

The limits for git-blame are set quite conservative: on actual developer
systems, exceeding them by a factor of 10 will not exhaust the available
memory.  And I could not even measure a difference in memory usage in a
small sample set of large examples.

That's good enough for me (and better than the shape most of git-blame
was in before I started on it and also after I finished), but it's still
hand-waving.  And it's not like it doesn't have a lot of company.

>> > Since we already have reference counting, it sounds fishy to claim
>> > that simply piggybacking a global counter on top of it would be
>> > "rather hard".
>> 
>> You'll see that the patch is from 2014.
>
> No I do not. There was no indication of that.

I thought that git-send-email/git-am retained most patch metadata as
opposed to git-patch, but you are entirely correct that the author date
is nowhere to be found.  Sorry for the presumption.

>> When I actively worked on it, I found no convincing/feasible way to
>> enforce a reasonable hard limit.  I am not picking up work on this
>> again but am merely flushing my queue so that the patch going to
>> waste is not on my conscience.
>
> Hmm. Speaking from my point of view as a maintainer, this raises a
> yellow alert. Sure, I do not maintain git.git, but if I were, I would
> want my confidence in the quality of this patch, and in the patch
> author being around when things go south with it, strengthened. This
> paragraph achieves the opposite.

It's completely reasonable to apply the same standards here as with an
author having terminal cancer.  I am not going to be around when things
go South with git-blame but I should be very much surprised if this
patch were significantly involved.

>> > Besides, -C is *supposed* to look harder.
>> 
>> We are not talking about "looking harder" but "taking more memory
>> than the set limit".
>
> I meant to say: *of course* it uses more memory, it has a much harder
> job.

Still a non-sequitur.  If you apply memory limits harder, it will still
achieve the same job but finish later due to (decompressing) swapping.

>> > Also: is there an easy way to reproduce your claims of better I/O
>> > characteristics? Something like a command-line, ideally with a file
>> > in git.git's own history, that demonstrates the I/O before and
>> > after the patch, would be an excellent addition to the commit
>> > message.
>> 
>> I've used it on the wortliste repository and system time goes down
>> from about 70 seconds to 50 seconds (this is a flash drive).  User
>> time from about 4:20 to 4:00.  It is a rather degenerate repository
>> (predominantly small changes in one humongous text file) so savings
>> for more typical cases might end up less than that.  But then it is
>> degenerate repositories that are most costly to blame.
>
> Well, obviously I did not mean to measure time, but I/O.

System time on a SSD is highly correlated with I/O for this task.
Actually, so is system time on a hard drive, it just takes more real
time (waiting for I/O to complete) then as well.

>> > Further: I would have at least expected some rudimentary discussion
>> > why this patch -- which seems to at least partially contradict
>> > 7c3c796 (blame: drop blob data after passing blame to the parent,
>> > 2007-12-11) -- is not regressing on the intent of said commit.
>> 
>> It is regressing on the intent of said commit by _retaining_ blob
>> data that it is _sure_ to look at again because of already having
>> this data asked for again in the priority queue.  That's the point.
>> It only drops the blob data for which it has no request queued yet.
>> But there is no handle on when the request is going to pop up until
>> it actually leaves the priority queue: the priority queue is a heap
>> IIRC and thus only provides partial orderings.
>
> Again, this lowers my confidence in the patch. Mind you, the patch
> could be totally sound! Yet its commit message, and unfortunately even
> more so the discussion we are having right now, offer no adequate
> reason to assume that. If you, as the patch author, state that you are
> not sure what this thing does exactly,

Uh, I very definitely know what this thing does exactly.  I never stated
otherwise.

> we must conservatively assume that it contains flaws.

It has consequences without a theoretical hard limit on memory usage but
where every additional retained blob will provably be needed.

Basically, you don't want to apply this patch because I know what I'm
doing and documenting it and the underlying rationale without
sugar-coating.

That's perfectly fine with me.  It matches my expectations and
experience.

>> >> diff --git a/builtin/blame.c b/builtin/blame.c
>> >> index 21f42b0..2596fbc 100644
>> >> --- a/builtin/blame.c
>> >> +++ b/builtin/blame.c
>> >> @@ -1556,7 +1556,8 @@ finish:
>> >>  	}
>> >>  	for (i = 0; i < num_sg; i++) {
>> >>  		if (sg_origin[i]) {
>> >> -			drop_origin_blob(sg_origin[i]);
>> >> +			if (!sg_origin[i]->suspects)
>> >> +				drop_origin_blob(sg_origin[i]);
>> >>  			origin_decref(sg_origin[i]);
>> >>  		}
>> >
>> > It would be good to mention in the commit message that this patch does not
>> > change anything for blobs with only one remaining reference (the current
>> > one) because origin_decref() would do the same job as drop_origin_blob
>> > when decrementing the reference counter to 0.
>> 
>> A sizable number of references but not blobs are retained and needed for
>> producing the information in the final printed output (at least when
>> using the default sequential output instead of --incremental or
>> --porcelaine or similar).
>
> Sorry, please help me understand how that response relates to my
> suggestion to improve the commit message.

Blobs and origins don't have separate reference counts.  This patch has
the same effect as maintaining such separate counts explicitly.  Since
the majority of commits lead to some lines in the current source (apart
from commits being weeded out by later refactoring/reimplementation),
this means that the majority of origins never reach a reference count of
0 before the end of the run.

>> > In fact, I suspect that simply removing the drop_origin_blob() call
>> > might result in the exact same I/O pattern.
>> 
>> It's been years since I actually worked on the code but I'm still
>> pretty sure you are wrong.
>
> The short version of your answer is that you will leave this patch in
> its current form and address none of my concerns because you moved on,
> correct? If so, that's totally okay, it just needs to be spelled out.

Yes, that's it.  You'll notice that the code change itself is both
minuscule as well purely functional, so it contains nothing
copyrightable.  I'm perfectly fine with anybody else writing whatever
commit message he wants based on whatever additional testing he wants
and claiming whole credit for what indeed amounts to being the actually
hard work, given the Git maintainership which is dependent on judging
quality by how a patch author advertises his code when the author is not
a frequent contributor.  I've not authored more than a dozen or two
patches spread out over more of a decade.  If you do git-blame on them,
I doubt you'll find any that ever caused problems or that called for
refactoring/replacement.

Nevertheless, I clearly am not a regular and different standards apply.
I'm fine with that.  It's off my lawn, my responsibility, and my
conscience.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-28  8:29       ` David Kastrup
@ 2016-05-28 12:34         ` Johannes Schindelin
  2016-05-28 14:00           ` David Kastrup
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2016-05-28 12:34 UTC (permalink / raw)
  To: David Kastrup; +Cc: git, gitster

Hi David,

On Sat, 28 May 2016, David Kastrup wrote:

> > The short version of your answer is that you will leave this patch in
> > its current form and address none of my concerns because you moved on,
> > correct? If so, that's totally okay, it just needs to be spelled out.
> 
> Yes, that's it.  You'll notice that the code change itself is both
> minuscule as well purely functional, so it contains nothing
> copyrightable.

That is unfortunately an interpretation of the law that would need to come
from a professional lawyer. As it is, the patch was clearly authored by
you, and anybody else who would claim authorship would most likely be
breaking the law.

So I won't touch it.

Sorry,
Johannes

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2016-05-28 12:34         ` Johannes Schindelin
@ 2016-05-28 14:00           ` David Kastrup
  0 siblings, 0 replies; 15+ messages in thread
From: David Kastrup @ 2016-05-28 14:00 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 1738 bytes --]

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi David,
>
> On Sat, 28 May 2016, David Kastrup wrote:
>
>> > The short version of your answer is that you will leave this patch in
>> > its current form and address none of my concerns because you moved on,
>> > correct? If so, that's totally okay, it just needs to be spelled out.
>> 
>> Yes, that's it.  You'll notice that the code change itself is both
>> minuscule as well purely functional, so it contains nothing
>> copyrightable.
>
> That is unfortunately an interpretation of the law that would need to
> come from a professional lawyer.

A professional lawyer would laugh at "Signed-off-by:" being of more
legal significance than a written record of intent but of course you
know that.  This is mere bluster.

> As it is, the patch was clearly authored by you, and anybody else who
> would claim authorship would most likely be breaking the law.

The _diff_ is not "clearly authored" by me but just the simplest
expression of the intent.  The commit message is clearly authored by me
but is not acceptable anyway.  Whoever gets to write an acceptable
commit message is up for all copyrightable credit in my book.  Feel free
to keep the authorship if you really want to, but when replacing the
commit message it is not a particularly accurate attribution.

> So I won't touch it.

Signed-off-by: David Kastrup <dak@gnu.org>

You don't get more than that for other patches either and it's a few
bytes compared to a mail conversation.  Here is a PGP signature on top.

As I said: I am not going to put more work into it anyway and if it's an
occasion for theatralics, it has at least accomplished something.

-- 
David Kastrup

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 180 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] blame.c: don't drop origin blobs as eagerly
@ 2019-04-02 11:56 David Kastrup
  2019-04-03  7:45 ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: David Kastrup @ 2019-04-02 11:56 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable optimization
that should incur additional memory pressure mostly when processing the
merges from old branches.

Signed-off-by: David Kastrup <dak@gnu.org>
---
 blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/blame.c b/blame.c
index 5c07dec190..c11c516921 100644
--- a/blame.c
+++ b/blame.c
@@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
-			drop_origin_blob(sg_origin[i]);
+			if (!sg_origin[i]->suspects)
+				drop_origin_blob(sg_origin[i]);
 			blame_origin_decref(sg_origin[i]);
 		}
 	}
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-02 11:56 [PATCH] blame.c: don't drop origin blobs as eagerly David Kastrup
@ 2019-04-03  7:45 ` Junio C Hamano
  2019-04-03  9:32   ` Duy Nguyen
  2019-04-03 11:08   ` David Kastrup
  0 siblings, 2 replies; 15+ messages in thread
From: Junio C Hamano @ 2019-04-03  7:45 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.
>
> Keeping such parent blobs in memory seems like a reasonable optimization
> that should incur additional memory pressure mostly when processing the
> merges from old branches.

Thanks for finding an age-old one that dates back to 7c3c7962
("blame: drop blob data after passing blame to the parent",
2007-12-11).

Interestingly, the said commit claims:

    When passing blame from a parent to its parent (i.e. the
    grandparent), the blob data for the parent may need to be read
    again, but it should be relatively cheap, thanks to delta-base
    cache.
            
but perhaps you found a case where the delta-base cache is not all
that effective in the benchmark?

Will queue.  Thanks.




>
> Signed-off-by: David Kastrup <dak@gnu.org>
> ---
>  blame.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/blame.c b/blame.c
> index 5c07dec190..c11c516921 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
>  	}
>  	for (i = 0; i < num_sg; i++) {
>  		if (sg_origin[i]) {
> -			drop_origin_blob(sg_origin[i]);
> +			if (!sg_origin[i]->suspects)
> +				drop_origin_blob(sg_origin[i]);
>  			blame_origin_decref(sg_origin[i]);
>  		}
>  	}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03  7:45 ` Junio C Hamano
@ 2019-04-03  9:32   ` Duy Nguyen
  2019-04-03 11:36     ` Jeff King
  2019-04-03 11:08   ` David Kastrup
  1 sibling, 1 reply; 15+ messages in thread
From: Duy Nguyen @ 2019-04-03  9:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Kastrup, Git Mailing List

On Wed, Apr 3, 2019 at 2:45 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> David Kastrup <dak@gnu.org> writes:
>
> > When a parent blob already has chunks queued up for blaming, dropping
> > the blob at the end of one blame step will cause it to get reloaded
> > right away, doubling the amount of I/O and unpacking when processing a
> > linear history.
> >
> > Keeping such parent blobs in memory seems like a reasonable optimization
> > that should incur additional memory pressure mostly when processing the
> > merges from old branches.
>
> Thanks for finding an age-old one that dates back to 7c3c7962
> ("blame: drop blob data after passing blame to the parent",
> 2007-12-11).
>
> Interestingly, the said commit claims:
>
>     When passing blame from a parent to its parent (i.e. the
>     grandparent), the blob data for the parent may need to be read
>     again, but it should be relatively cheap, thanks to delta-base
>     cache.
>
> but perhaps you found a case where the delta-base cache is not all
> that effective in the benchmark?

Interesting. For some reason I keep remembering the delta-base cache
is for caching base objects, not all packed objects.

That might explain why I could not see significant gain when blaming
linux.git's MAINTAINERS file (0.5s was shaved out of 13s) even though
the number of objects read was cut by half (8424 vs 15083).

I just tried again. The number of actual pack reading is slightly
reduced with the patch (498260 vs 502140). Not by a large margin. But
I imagine if the cache is under pressure (MAINTAINERS file is quite
small, 426k), we may get more eviction and misses from delta-base
cache.

It might still help when we need to read loose objects though. But I
guess this matters even less.

And I don't know how lazy clones behave in this case. If we get new
objects and store as loose, then it helps a bit more.

> Will queue.  Thanks.
>
>
>
>
> >
> > Signed-off-by: David Kastrup <dak@gnu.org>
> > ---
> >  blame.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/blame.c b/blame.c
> > index 5c07dec190..c11c516921 100644
> > --- a/blame.c
> > +++ b/blame.c
> > @@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
> >       }
> >       for (i = 0; i < num_sg; i++) {
> >               if (sg_origin[i]) {
> > -                     drop_origin_blob(sg_origin[i]);
> > +                     if (!sg_origin[i]->suspects)
> > +                             drop_origin_blob(sg_origin[i]);
> >                       blame_origin_decref(sg_origin[i]);
> >               }
> >       }



-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03  7:45 ` Junio C Hamano
  2019-04-03  9:32   ` Duy Nguyen
@ 2019-04-03 11:08   ` David Kastrup
  1 sibling, 0 replies; 15+ messages in thread
From: David Kastrup @ 2019-04-03 11:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> When a parent blob already has chunks queued up for blaming, dropping
>> the blob at the end of one blame step will cause it to get reloaded
>> right away, doubling the amount of I/O and unpacking when processing a
>> linear history.
>>
>> Keeping such parent blobs in memory seems like a reasonable optimization
>> that should incur additional memory pressure mostly when processing the
>> merges from old branches.
>
> Thanks for finding an age-old one that dates back to 7c3c7962
> ("blame: drop blob data after passing blame to the parent",
> 2007-12-11).
>
> Interestingly, the said commit claims:
>
>     When passing blame from a parent to its parent (i.e. the
>     grandparent), the blob data for the parent may need to be read
>     again, but it should be relatively cheap, thanks to delta-base
>     cache.
>             
> but perhaps you found a case where the delta-base cache is not all
> that effective in the benchmark?

The most relevant contribution is in a linear history where the diff
between commit and parent is followed by the diff between parent and
grandparent.  It seems wasteful to recreate the blobs in this case.  Of
course this is also the case where any close cache layers are more
likely to still be warm, so the savings may be less apparent.  They are
likely more for deep delta chains in long histories where the
delta-chain cache is more thoroughly exercised.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03  9:32   ` Duy Nguyen
@ 2019-04-03 11:36     ` Jeff King
  2019-04-03 12:06       ` Duy Nguyen
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-04-03 11:36 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 03, 2019 at 04:32:30PM +0700, Duy Nguyen wrote:

> That might explain why I could not see significant gain when blaming
> linux.git's MAINTAINERS file (0.5s was shaved out of 13s) even though
> the number of objects read was cut by half (8424 vs 15083).

I did a few timings, too, and managed to come up with similar
improvements (only a small fraction, and only for large files). I think
the main thing is simply that loading the blob from the object database
is a fraction of the total work done. We still have to actually diff the
blobs, which is at least as expensive as loading them from disk.

We also have to load commits and trees from disk as we traverse.
Enabling the commit-graph would shrink that portion (and make
improvements in the blob loading proportionally more impressive).

All that said, this seems like an easy and obvious win, and worth doing.
0.5s is still something.

I suspect we could do even better by storing and reusing not just the
original blob between diffs, but the intermediate diff state (i.e., the
hashes produced by xdl_prepare(), which should be usable between
multiple diffs). That's quite a bit more complex, though, and I imagine
would require some surgery to xdiff.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03 11:36     ` Jeff King
@ 2019-04-03 12:06       ` Duy Nguyen
  2019-04-03 12:19         ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Duy Nguyen @ 2019-04-03 12:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
> I suspect we could do even better by storing and reusing not just the
> original blob between diffs, but the intermediate diff state (i.e., the
> hashes produced by xdl_prepare(), which should be usable between
> multiple diffs). That's quite a bit more complex, though, and I imagine
> would require some surgery to xdiff.

Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
xdl_prepare_ctx) account for 36% according to 'perf report'. Please
tell me you just did not get this on your first guess.

I tracked and dumped all the hashes that are sent to xdl_prepare() and
it looks like the amount of duplicates is quite high. There are only
about 1000 one-time hashes out of 7000 (didn't really draw a histogram
to examine closer). So yeah this looks really promising, assuming
somebody is going to do something about it.
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03 12:06       ` Duy Nguyen
@ 2019-04-03 12:19         ` Jeff King
  2019-04-03 12:32           ` David Kastrup
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-04-03 12:19 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 03, 2019 at 07:06:02PM +0700, Duy Nguyen wrote:

> On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
> > I suspect we could do even better by storing and reusing not just the
> > original blob between diffs, but the intermediate diff state (i.e., the
> > hashes produced by xdl_prepare(), which should be usable between
> > multiple diffs). That's quite a bit more complex, though, and I imagine
> > would require some surgery to xdiff.
> 
> Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
> xdl_prepare_ctx) account for 36% according to 'perf report'. Please
> tell me you just did not get this on your first guess.

Sorry, it was a guess. ;)

But an educated one, based on previous experiments with speeding up "log
-p". Remember XDL_FAST_HASH, which produced speedups there (but
unfortunately had some pathological slowdowns, because it produced too
many collisions). I've also played around with using other hashes like
murmur or siphash, but was never able to get anything remarkably faster
than what we have now.

> I tracked and dumped all the hashes that are sent to xdl_prepare() and
> it looks like the amount of duplicates is quite high. There are only
> about 1000 one-time hashes out of 7000 (didn't really draw a histogram
> to examine closer). So yeah this looks really promising, assuming
> somebody is going to do something about it.

I don't think counting the unique hash outputs tells you much about what
can be sped up. After all, two related blobs are likely to overlap quite
a bit in their hashes (i.e., all of their non-unique lines). The trick
is finding in each blob those ones that _are_ unique. :)

But if we spend 36% of our time in hashing the blobs, then that implies
that we could gain back 18% by caching and reusing the work from a
previous diff (as David notes, a simple keep-the-last-parent cache only
yields 100% cache hits in a linear history, but it's probably good
enough for our purposes).

This should likewise make "git log -p -- file" faster, though with more
files you'd need a bigger cache.

So I do think it's a promising lead. I don't have immediate plans to
work on it, though. Maybe it would be a good GSoC project. ;)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
  2019-04-03 12:19         ` Jeff King
@ 2019-04-03 12:32           ` David Kastrup
  0 siblings, 0 replies; 15+ messages in thread
From: David Kastrup @ 2019-04-03 12:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

Jeff King <peff@peff.net> writes:

> On Wed, Apr 03, 2019 at 07:06:02PM +0700, Duy Nguyen wrote:
>
>> On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
>> > I suspect we could do even better by storing and reusing not just the
>> > original blob between diffs, but the intermediate diff state (i.e., the
>> > hashes produced by xdl_prepare(), which should be usable between
>> > multiple diffs). That's quite a bit more complex, though, and I imagine
>> > would require some surgery to xdiff.
>> 
>> Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
>> xdl_prepare_ctx) account for 36% according to 'perf report'. Please
>> tell me you just did not get this on your first guess.
>
> Sorry, it was a guess. ;)
>
> But an educated one, based on previous experiments with speeding up "log
> -p". Remember XDL_FAST_HASH, which produced speedups there (but
> unfortunately had some pathological slowdowns, because it produced too
> many collisions). I've also played around with using other hashes like
> murmur or siphash, but was never able to get anything remarkably faster
> than what we have now.
>
>> I tracked and dumped all the hashes that are sent to xdl_prepare() and
>> it looks like the amount of duplicates is quite high. There are only
>> about 1000 one-time hashes out of 7000 (didn't really draw a histogram
>> to examine closer). So yeah this looks really promising, assuming
>> somebody is going to do something about it.
>
> I don't think counting the unique hash outputs tells you much about what
> can be sped up. After all, two related blobs are likely to overlap quite
> a bit in their hashes (i.e., all of their non-unique lines). The trick
> is finding in each blob those ones that _are_ unique. :)
>
> But if we spend 36% of our time in hashing the blobs, then that implies
> that we could gain back 18% by caching and reusing the work from a
> previous diff (as David notes, a simple keep-the-last-parent cache only
> yields 100% cache hits in a linear history, but it's probably good
> enough for our purposes).

Of course, if you really want to get tricky, you'll not even compare
stuff that is expanded from the same delta-chain location.  Basically,
there are a number of separate layers that are doing rather similar work
with rather similar data, but intermingling the layers' work is not
going to be good for maintainability.  Caching at the various layers can
keep their separation while still reducing some of the redundancy costs.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-04-03 12:32 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-02 11:56 [PATCH] blame.c: don't drop origin blobs as eagerly David Kastrup
2019-04-03  7:45 ` Junio C Hamano
2019-04-03  9:32   ` Duy Nguyen
2019-04-03 11:36     ` Jeff King
2019-04-03 12:06       ` Duy Nguyen
2019-04-03 12:19         ` Jeff King
2019-04-03 12:32           ` David Kastrup
2019-04-03 11:08   ` David Kastrup
  -- strict thread matches above, loose matches on Subject: below --
2016-05-27 13:35 David Kastrup
2016-05-27 15:00 ` Johannes Schindelin
2016-05-27 15:41   ` David Kastrup
2016-05-28  6:37     ` Johannes Schindelin
2016-05-28  8:29       ` David Kastrup
2016-05-28 12:34         ` Johannes Schindelin
2016-05-28 14:00           ` David Kastrup

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).