git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [GSOC] Blog about week 6
@ 2020-07-14  6:32 Abhishek Kumar
  2020-07-17 22:49 ` Jakub Narębski
  0 siblings, 1 reply; 3+ messages in thread
From: Abhishek Kumar @ 2020-07-14  6:32 UTC (permalink / raw)
  To: git; +Cc: jnareb, stolee

Hello everyone!

Over the last week, I started implementing generation number v2. Based
on the performance numbers and subsequent discussion, we have decided to
use generation data chunk approach, storing topological levels into CDAT
and corrected commit date offsets into GDAT chunk. Do note that we are
maintaining backward compatibility using topological level, not with
monotonic offsets.

I also found a performance regression with `commit-graph write
--reachable --changed_path` that sneaked in along with the patch on
moving generation and graph positions into commit-slab.

https://abhishekkumar2718.github.io/gsoc/2020/07/12/gsoc-week-6.html

https://github.com/gitgitgadget/git/pull/676

> Yes, I pushed without signing off or a cover letter. It's still a work
> in progress.

Thanks
Abhishek

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSOC] Blog about week 6
  2020-07-14  6:32 [GSOC] Blog about week 6 Abhishek Kumar
@ 2020-07-17 22:49 ` Jakub Narębski
  2020-07-21 13:27   ` Abhishek Kumar
  0 siblings, 1 reply; 3+ messages in thread
From: Jakub Narębski @ 2020-07-17 22:49 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: git, Derrick Stolee

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

> Hello everyone!
>
> Over the last week, I started implementing generation number v2. Based
> on the performance numbers and subsequent discussion, we have decided to
> use generation data chunk approach, storing topological levels into CDAT
> and corrected commit date offsets into GDAT chunk. Do note that we are
> maintaining backward compatibility using topological level, not with
> monotonic offsets.
>
> I also found a performance regression with `commit-graph write
> --reachable --changed_path` that sneaked in along with the patch on
> moving generation and graph positions into commit-slab.
>
> https://github.com/gitgitgadget/git/pull/676
>
> Yes, I pushed without signing off or a cover letter. It's still a work
> in progress.

All right.  Did you send this series of patches to git mailing list for
a review already, or is it only a working version?

>
> https://abhishekkumar2718.github.io/gsoc/2020/07/12/gsoc-week-6.html

Reply to the blog contents follows:

----------------------------------------------------------------------

> GSoC - Weeks 6
> ==============

Very minor nitpick / correction - I think you meant "Week" not "Weeks"
here in the title of this blog.

> Over the last week, I worked on implementing corrected commit date
> with generation data chunk and boy, was it painful!
>
> While working on the prototype, I used the space allocated for graph
> position to store topological level before writing it into CDAT chunk.

First, I assume that if we choose to not compute topological levels and
simply store GENERATION_NUMBER_MAX in the 30-bits in CDAT set for
generation number v1 we would not have this problem, isn't it?

Second, isn't `graph_pos` nowadays on the commit slab? Why reuse it,
which can lead to problems if other parts of code assume things about
its syntax, rather than making use of _new_ uint32_t `topo_level` commit
slab?  We have many different ad-hoc entries stored on the slab, what's
one more...

> However, while computing bloom filters, we call `get_bloom_filter()`.

Ah, I guess this happens when `git commit-graph write` is called with
the `--changed-paths` option, so we compute both generation numbers v1
and v2, and Bloom filters for changed paths.

> This function checks if the commit belongs to the commit-graph. If it
> does, load the bloom filters (avoiding a lot of computation).
> Otherwise, compute the filters and return it. Since I am storing
> topological levels in graph position, Git mistakenly thinks that we
> are reading a graph and tries to load bloom filter, segfaulting right
> away.

So don't do that; don't store topological level in graph position.

> Faced with the problem, I could think of three ideas:
>

> 1. Extend generation to be 64-bits (which we were planning on anyway)
>    and store topological levels within higher 32 bits and corrected date
>    offsets within lower 32 bits. Code for calculating corrected date
>    offsets, topological levels would be littered with bit shifts and
>    ANDs, but code for parsing and using generation numbers would be
>    clean.

If I remember it correctly, extending `generation` to 64-bits in
`struct commit_graph_data` (stored on slab) is there to store and
operate on the corrected commit(ter) date, instead of storing and
fiddling with offsets?

>
> 2. Use two 32-bits variables, level and odate, and use the contiguous
>    64 bits to store generation (while reading commit-graph): Code for
>    writing commit-graph is clean, but reading generation is much trickier
>    as we try to coerce value stored in (graph_data + 4) into a timestamp.

I don't quite get this idea.  For backward compatibility we have to
store topological levels in appropriate 30-bit wide field in CDAT chunk.
Storing topological levels in higher 32-bits of 64-bit GDAT chunk would
not provide it.

> 3. Split get_bloom_filters() into two functions - First checks if the
>    commit is from the graph and tries to load whereas the other computes
>    bloom filter. Then we could directly call the second function when
>    writing a commit-graph. I was not sure if this would have worked, but
>    I wanted not to change things unless required.

4. Store topological level data in a separate commit slab entry.
   This way any code that examines `commit_graph_data.position`,
   like `get_bloom_filter()`, wouldn't be mislead.

About the 4th possibility: does Git use both graph position and
generation number in the same process?  Because if it is not, then
perhaps putting graph_position and graph_generation_number in a
composite type `struct commit_graph_data` on slab might have been a
mistake.  Due to how modern CPUs work (with data prefetching and
pipelining) it might be more advantageous wrt performance to have
subsequent data for graph position or for generation number to be packed
and not strides, as it is now -- at least when accessing commits in the
commit.index order.

> I felt the first two approaches were too unreadable.
>

I think the 3rd approach is worth doing on its own, as it makes the code
bit clearer... but I have not actually examined the commits in the pull
request.

> In the end, I compromised by using a 64-bit generation and a 32-bit
> level in the initial commit and will restrict the ugly conversion to
> reuse 64-bits in a focused patch later in the series.

I would have to examine the patch in question in more detail, but I am
waiting for the series to be posted to Git mailing list (though I can be
persuaded to do pre-review on GitHub pull request, as pre-release step).

> Regression when computing bloom filters
> =======================================
>
> While wrestling with the problem above, I noticed that
> `commit_gen_cmp()` lies in “commit-graph write” path but uses the
> helpers. But the helper would always return `GENERATION_NUMBER_MAX`
> as we found while moving generation number into a slab.
>
> Digging through git logs, I found [the patch][1], which introduced
> `commit_gen_cmp()`. Sorting commits make the computing bloom filters
> much, much faster. The helper returns the same value every time, which
> makes sorting pointless and nullifies the performance boost. Fixing
> this was easy, bypass the helper and access generation number.
>
> [1]: https://github.com/git/git/commit/3d11275505694ce4e5256516de1c5dd90e749303

Good catch!

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSOC] Blog about week 6
  2020-07-17 22:49 ` Jakub Narębski
@ 2020-07-21 13:27   ` Abhishek Kumar
  0 siblings, 0 replies; 3+ messages in thread
From: Abhishek Kumar @ 2020-07-21 13:27 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git, stolee

On Sat, Jul 18, 2020 at 12:49:11AM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> > Hello everyone!
> >
> > Over the last week, I started implementing generation number v2. Based
> > on the performance numbers and subsequent discussion, we have decided to
> > use generation data chunk approach, storing topological levels into CDAT
> > and corrected commit date offsets into GDAT chunk. Do note that we are
> > maintaining backward compatibility using topological level, not with
> > monotonic offsets.
> >
> > I also found a performance regression with `commit-graph write
> > --reachable --changed_path` that sneaked in along with the patch on
> > moving generation and graph positions into commit-slab.
> >
> > https://github.com/gitgitgadget/git/pull/676
> >
> > Yes, I pushed without signing off or a cover letter. It's still a work
> > in progress.
> 
> All right.  Did you send this series of patches to git mailing list for
> a review already, or is it only a working version?
> 

It's a working version so far.

> >
> > https://abhishekkumar2718.github.io/gsoc/2020/07/12/gsoc-week-6.html
> 
> Reply to the blog contents follows:
> 
> ----------------------------------------------------------------------
> 
> > GSoC - Weeks 6
> > ==============
> 
> Very minor nitpick / correction - I think you meant "Week" not "Weeks"
> here in the title of this blog.
> 

Thanks! Will fix it.

> > Over the last week, I worked on implementing corrected commit date
> > with generation data chunk and boy, was it painful!
> >
> > While working on the prototype, I used the space allocated for graph
> > position to store topological level before writing it into CDAT chunk.
> 
> First, I assume that if we choose to not compute topological levels and
> simply store GENERATION_NUMBER_MAX in the 30-bits in CDAT set for
> generation number v1 we would not have this problem, isn't it?
> 
> Second, isn't `graph_pos` nowadays on the commit slab? Why reuse it,
> which can lead to problems if other parts of code assume things about
> its syntax, rather than making use of _new_ uint32_t `topo_level` commit
> slab?  We have many different ad-hoc entries stored on the slab, what's
> one more...
> 
> > However, while computing bloom filters, we call `get_bloom_filter()`.
> 
> Ah, I guess this happens when `git commit-graph write` is called with
> the `--changed-paths` option, so we compute both generation numbers v1
> and v2, and Bloom filters for changed paths.
> 
> > This function checks if the commit belongs to the commit-graph. If it
> > does, load the bloom filters (avoiding a lot of computation).
> > Otherwise, compute the filters and return it. Since I am storing
> > topological levels in graph position, Git mistakenly thinks that we
> > are reading a graph and tries to load bloom filter, segfaulting right
> > away.
> 
> So don't do that; don't store topological level in graph position.
> 
> > Faced with the problem, I could think of three ideas:
> >
> 
> > 1. Extend generation to be 64-bits (which we were planning on anyway)
> >    and store topological levels within higher 32 bits and corrected date
> >    offsets within lower 32 bits. Code for calculating corrected date
> >    offsets, topological levels would be littered with bit shifts and
> >    ANDs, but code for parsing and using generation numbers would be
> >    clean.
> 
> If I remember it correctly, extending `generation` to 64-bits in
> `struct commit_graph_data` (stored on slab) is there to store and
> operate on the corrected commit(ter) date, instead of storing and
> fiddling with offsets?
> 

Well, yes. We could ask for a new slab for topo_levels while writing the
graph. But since we are writing offsets (to minimize space used by the
graph), I store offsets through out `compute_generation_number`. Since
the offsets fit into 32-bits, I thought of reusing the higher 32-bits to
store topological levels.

> >
> > 2. Use two 32-bits variables, level and odate, and use the contiguous
> >    64 bits to store generation (while reading commit-graph): Code for
> >    writing commit-graph is clean, but reading generation is much trickier
> >    as we try to coerce value stored in (graph_data + 4) into a timestamp.
> 
> I don't quite get this idea.  For backward compatibility we have to
> store topological levels in appropriate 30-bit wide field in CDAT chunk.
> Storing topological levels in higher 32-bits of 64-bit GDAT chunk would
> not provide it.

Here, I am talking not about the how data is stored on chunks but rather
how it's stored in the commit-slab while writing commit-graph.

> 
> > 3. Split get_bloom_filters() into two functions - First checks if the
> >    commit is from the graph and tries to load whereas the other computes
> >    bloom filter. Then we could directly call the second function when
> >    writing a commit-graph. I was not sure if this would have worked, but
> >    I wanted not to change things unless required.
> 
> 4. Store topological level data in a separate commit slab entry.
>    This way any code that examines `commit_graph_data.position`,
>    like `get_bloom_filter()`, wouldn't be mislead.
> 
> About the 4th possibility: does Git use both graph position and
> generation number in the same process?  Because if it is not, then
> perhaps putting graph_position and graph_generation_number in a
> composite type `struct commit_graph_data` on slab might have been a
> mistake.  Due to how modern CPUs work (with data prefetching and
> pipelining) it might be more advantageous wrt performance to have
> subsequent data for graph position or for generation number to be packed
> and not strides, as it is now -- at least when accessing commits in the
> commit.index order.
>

I suppose so. Graph positions would be used to efficiently look up
generation number in the commit-graph, so Git uses both graph position
and generation number in the same process.

Although, since we store generation number in the anyway, we have no
further use for the graph position.

But I think Dr. Stolee would be the right person to answer this.

> > I felt the first two approaches were too unreadable.
> >
> 
> I think the 3rd approach is worth doing on its own, as it makes the code
> bit clearer... but I have not actually examined the commits in the pull
> request.
> 

Alright, will add this as well.

> > In the end, I compromised by using a 64-bit generation and a 32-bit
> > level in the initial commit and will restrict the ugly conversion to
> > reuse 64-bits in a focused patch later in the series.
> 
> I would have to examine the patch in question in more detail, but I am
> waiting for the series to be posted to Git mailing list (though I can be
> persuaded to do pre-review on GitHub pull request, as pre-release step).
> 
> > Regression when computing bloom filters
> > =======================================
> >
> > While wrestling with the problem above, I noticed that
> > `commit_gen_cmp()` lies in “commit-graph write” path but uses the
> > helpers. But the helper would always return `GENERATION_NUMBER_MAX`
> > as we found while moving generation number into a slab.
> >
> > Digging through git logs, I found [the patch][1], which introduced
> > `commit_gen_cmp()`. Sorting commits make the computing bloom filters
> > much, much faster. The helper returns the same value every time, which
> > makes sorting pointless and nullifies the performance boost. Fixing
> > this was easy, bypass the helper and access generation number.
> >
> > [1]: https://github.com/git/git/commit/3d11275505694ce4e5256516de1c5dd90e749303
> 
> Good catch!
> 
> Best,
> --
> Jakub Narębski

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-07-21 13:29 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14  6:32 [GSOC] Blog about week 6 Abhishek Kumar
2020-07-17 22:49 ` Jakub Narębski
2020-07-21 13:27   ` Abhishek Kumar

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).