git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [GSoC Patch v3 0/4] Move generation, graph_pos to a slab
@ 2020-06-12 18:40 Abhishek Kumar
  2020-06-12 18:40 ` [PATCH v3 1/4] alloc: introduce parsed_commits_count Abhishek Kumar
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Abhishek Kumar @ 2020-06-12 18:40 UTC (permalink / raw)
  To: git; +Cc: stolee, jnareb

The struct commit is used in many contexts. However, members
`generation` and `graph_pos` are only used for commit graph related
operations and otherwise waste memory.

This wastage would have been more pronounced as we transition to
generation number v2, which uses 64-bit generation number instead of
current 32-bits.

While the overall test suite runs slightly faster than master
(series: 27m10s, master: 27ms34s, faster by 2.35%), certain commands
like `git merge-base --is-ancestory` are slowed by nearly 40% as
discovered by SDEZER Gabor [1].

Derrick Stolee believes the slow down is attributable to the underlying
algorithm rather than the slowness of commit-slab access [2] and we will
follow-up on that in a later series.

I did not mention maximum RSS in the commit messages as they were nearly
identical (series: 68104kb, master: 68040kb, fewer by <0.1%). This leads
me to conclude that either the test using maximum memory involves commit
graph or did not involve the struct commit at all. The move to
commit-slab reduces memory footprint for the cases where struct commit
is used but members generation and graph position are not. Average RSS
would have been a good and more representative measure, but 
unfortunately time(1) could not measure it on my system.

With this, I feel the patch will require minor fixes, if any. I am
moving ahead with working the next step of "Implement Generation Number
v2" that is proper handling of commit-graph format change.

Based on the discussions, I feel we should compute both generation
number v1 and the date offset value with storing date offsets in a new
chunk as the cost is mostly from walking the commits.

Abhishek Kumar (4):
  alloc: introduce parsed_commits_count
  commit-graph: introduce commit_graph_data_slab
  commit: move members graph_pos, generation to a slab
  commit-graph: minimize commit_graph_data_slab access

 alloc.c                         |   6 +-
 blame.c                         |   2 +-
 bloom.c                         |   7 +-
 commit-graph.c                  | 122 ++++++++++++++++++++++++--------
 commit-graph.h                  |  10 +++
 commit-reach.c                  |  69 +++++++++++-------
 commit.c                        |   8 ++-
 contrib/coccinelle/commit.cocci |  18 +++++
 revision.c                      |  20 +++---
 9 files changed, 188 insertions(+), 74 deletions(-)

-- 
2.27.0

Changes in v3:
- Introduce alloc commit to fix the failing diff-submodule test.
- Elaborate on performance and slow down noticed in the commit message.

Changes in v2:
- Introduce struct commit_graph_data.
- Merge `graph_pos`, `generation` slabs into a single,
  `commit_graph_data` slab.
- Use graph position for an intermediate check for generation, saving
  the cost of initializing generation numbers.
- Add an follow-up patch caching results of slab access in local
  variables.
- Move coccinelle transformation to commit.coccinelle instead of
  creating new scripts.
- Elaborate on removing default values from init_commit_node().
- Revert moving macro constants (e.g. COMMIT_NOT_FROM_GRAPH,
  GENERATION_NUMBER_ZERO) from commit.h to commit-graph.h

CI Build: https://travis-ci.com/github/abhishekkumar2718/git  

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/4] alloc: introduce parsed_commits_count
@ 2020-06-14  6:09 Abhishek Kumar
  0 siblings, 0 replies; 15+ messages in thread
From: Abhishek Kumar @ 2020-06-14  6:09 UTC (permalink / raw)
  To: jnareb; +Cc: stolee, git

On 13.06.2020, Jakub Narębski wrote:

> On 12.06.2020, Abhishek Kumar wrote:
> 
> > Commit slab relies on uniqueness of commit->index to access data. As
> > submodules are repositories on their own, alloc_commit_index() (which
> > depends on repository->parsed_objects->commit_count) no longer
> > returns unique values.
> > 
> > This would break tests once we move `generation` and `graph_pos` into
> > a
> > commit slab, as commits of supermodule and submodule can have the same
> > index but must have different graph positions.
> 
> First, commits of supermodule and of submodule are in different graphs,
> so I don't see why they have to be in the same serialized commit-graph
> file.
> 

I should have been clearer in the commit message:

commit->index = ith commit allocated for the repository.

As both supermodule and submodule are repositories, they can have
commits with the same index.

This means that were both the ith commit allocated for supermodule and
submodule respectively.

However, the commit-slab code treats them identically (because of the
same index) and they (incorrectly) have same generation number and graph
position.

Switching to a global counter instead of per-repo counter avoids having
two commits with the same index in the first place.

> Second, Git stores many different types of information on slab already.
> How comes that we have not had any problems till now?  
> 
> There is contains_cache, commit_seen, indegree_slab, author_date_slab,
> commit_base, commit_pos, bloom_filter_slab, buffer_slab,
> commit_rev_name,
> commit_names, commit_name_slab, saved_parents, blame_suspects,
> commit_todo_item.
> 

That's a fair point - It seems rather unlikely that we had not faced
this problem yet.

> > 
> > Let's introduce a counter variable, `parsed_commits_count` to keep
> > track
> > of parsed commits so far.
> 
> All right, thought it might be worth mentioning that it is a global
> variable, or rather a static variable in a function.
> 

Alright, noted.

> > 
> > 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > 
> > CI Build for the failing tests:
> > https://travis-ci.com/github/abhishekkumar2718/git/jobs/345413840
> 
> The failed tests are, from what I see:
> - t4060-diff-submodule-option-diff-format.sh
> - t4041-diff-submodule-option.sh
> - t4059-diff-submodule-not-initialized.sh
> 
> 
> > 
> >   alloc.c | 4 +++-
> >   1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/alloc.c b/alloc.c
> > index 1c64c4dd16..29f0e3aa80 100644
> > --- a/alloc.c
> > +++ b/alloc.c
> > @@ -101,7 +101,9 @@ void *alloc_object_node(struct repository *r)
> >   
> >   static unsigned int alloc_commit_index(struct repository *r)
> >   {
> > -     return r->parsed_objects->commit_count++;
> > +     static unsigned int parsed_commits_count = 0;
> > +     r->parsed_objects->commit_count++;
> 
> Do we use r->parsed_objects->commit_count anywhere?

We don't use it anywhere else. 14ba97f8 (alloc: allow arbitary
repositories for alloc functions, 2019-05-15) made the count per
repository but it's only been used to assign commit index.

Junio suggested that we could remove commit_count from the struct and
clean up function calls a bit.

> 
> > +     return parsed_commits_count++;
> 
> Does it matter that it is not thread safe, because it is not atomic?
> Shouldn't it be
> 
>   +     static _Atomic unsigned int parsed_commits_count = 0;
> 
>   or
> 
>     +   static _Atomic unsigned int parsed_commits_count =
>     ATOMIC_VAR_INIT(0);
> 
>     (If it is allowed).
> 
>     >   }
>     >   
>     >   void init_commit_node(struct repository *r, struct commit *c)
>     > 

Thanks
Abhishek

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-06-17  9:20 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-12 18:40 [GSoC Patch v3 0/4] Move generation, graph_pos to a slab Abhishek Kumar
2020-06-12 18:40 ` [PATCH v3 1/4] alloc: introduce parsed_commits_count Abhishek Kumar
2020-06-12 20:08   ` Junio C Hamano
2020-06-12 22:00     ` Junio C Hamano
2020-06-12 22:20       ` Junio C Hamano
2020-06-12 22:52   ` Junio C Hamano
2020-06-13 18:57     ` Abhishek Kumar
2020-06-12 23:16   ` Jakub Narębski
2020-06-12 18:40 ` [PATCH v3 2/4] commit-graph: introduce commit_graph_data_slab Abhishek Kumar
2020-06-13  6:53   ` SZEDER Gábor
2020-06-17  9:18     ` Abhishek Kumar
2020-06-12 18:40 ` [PATCH v3 3/4] commit: move members graph_pos, generation to a slab Abhishek Kumar
2020-06-12 18:40 ` [PATCH v3 4/4] commit-graph: minimize commit_graph_data_slab access Abhishek Kumar
2020-06-12 21:26 ` [GSoC Patch v3 0/4] Move generation, graph_pos to a slab Jakub Narębski
  -- strict thread matches above, loose matches on Subject: below --
2020-06-14  6:09 [PATCH v3 1/4] alloc: introduce parsed_commits_count Abhishek Kumar

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).