git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [PATCH v2 00/14] Serialized Git Commit Graph
@ 2018-01-30 21:39 Derrick Stolee
  2018-01-30 21:47 ` Stefan Beller
  0 siblings, 1 reply; 3+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Thanks to everyone who gave comments on v1. I tried my best to respond to
all of the feedback, but may have missed some while I was doing several
renames, including:

* builtin/graph.c -> builtin/commit-graph.c
* packed-graph.[c|h] -> commit-graph.[c|h]
* t/t5319-graph.sh -> t/t5318-commit-graph.sh

Because of these renames (and several type/function renames) the diff
is too large to conveniently share here.

Some issues that came up and are addressed:

* Use <hash> instead of <oid> when referring to the graph-<hash>.graph
  filenames and the contents of graph-head.
* 32-bit timestamps will not cause undefined behavior.
* timestamp_t is unsigned, so they are never negative.
* The config setting "core.commitgraph" now only controls consuming the
  graph during normal operations and will not block the commit-graph
  plumbing command.
* The --stdin-commits is better about sanitizing the input for strings
  that do not parse to OIDs or are OIDs for non-commit objects.

One unresolved comment that I would like consensus on is the use of
globals to store the config setting and the graph state. I'm currently
using the pattern from packed_git instead of putting these values in
the_repository. However, we want to eventually remove globals like
packed_git. Should I deviate from the pattern _now_ in order to keep
the problem from growing, or should I keep to the known pattern?

Finally, I tried to clean up my incorrect style as I was recreating
these commits. Feel free to be merciless in style feedback now that the
architecture is more stable.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitgraph true
  git show-ref -s | git graph --write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisions by toggling the 'core.commitgraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a2ce@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement construct_commit_graph()
  commit-graph: implement git-commit-graph --write
  commit-graph: implement git-commit-graph --read
  commit-graph: implement git-commit-graph --update-head
  commit-graph: implement git-commit-graph --clear
  commit-graph: teach git-commit-graph --delete-expired
  commit-graph: add core.commitgraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: close under reachability
  commit-graph: build graph from starting commits

 .gitignore                                      |   1 +
 Documentation/config.txt                        |   3 +
 Documentation/git-commit-graph.txt              | 100 +++
 Documentation/technical/commit-graph-format.txt |  89 +++
 Documentation/technical/commit-graph.txt        | 189 ++++++
 Makefile                                        |   2 +
 alloc.c                                         |   1 +
 builtin.h                                       |   1 +
 builtin/commit-graph.c                          | 229 +++++++
 cache.h                                         |   1 +
 command-list.txt                                |   1 +
 commit-graph.c                                  | 841 ++++++++++++++++++++++++
 commit-graph.h                                  |  69 ++
 commit.c                                        |  10 +-
 commit.h                                        |   4 +
 config.c                                        |   5 +
 environment.c                                   |   1 +
 git.c                                           |   1 +
 log-tree.c                                      |   3 +-
 packfile.c                                      |   4 +-
 packfile.h                                      |   2 +
 t/t5318-commit-graph.sh                         | 272 ++++++++
 22 files changed, 1824 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh

-- 
2.16.0


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2 00/14] Serialized Git Commit Graph
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
@ 2018-01-30 21:47 ` Stefan Beller
  2018-02-01  2:34   ` Stefan Beller
  0 siblings, 1 reply; 3+ messages in thread
From: Stefan Beller @ 2018-01-30 21:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

On Tue, Jan 30, 2018 at 1:39 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Thanks to everyone who gave comments on v1. I tried my best to respond to
> all of the feedback, but may have missed some while I was doing several
> renames, including:
>
> * builtin/graph.c -> builtin/commit-graph.c
> * packed-graph.[c|h] -> commit-graph.[c|h]
> * t/t5319-graph.sh -> t/t5318-commit-graph.sh
>
> Because of these renames (and several type/function renames) the diff
> is too large to conveniently share here.
>
> Some issues that came up and are addressed:
>
> * Use <hash> instead of <oid> when referring to the graph-<hash>.graph
>   filenames and the contents of graph-head.
> * 32-bit timestamps will not cause undefined behavior.
> * timestamp_t is unsigned, so they are never negative.
> * The config setting "core.commitgraph" now only controls consuming the
>   graph during normal operations and will not block the commit-graph
>   plumbing command.
> * The --stdin-commits is better about sanitizing the input for strings
>   that do not parse to OIDs or are OIDs for non-commit objects.
>
> One unresolved comment that I would like consensus on is the use of
> globals to store the config setting and the graph state. I'm currently
> using the pattern from packed_git instead of putting these values in
> the_repository. However, we want to eventually remove globals like
> packed_git. Should I deviate from the pattern _now_ in order to keep
> the problem from growing, or should I keep to the known pattern?

I have a series doing the conversion in
https://github.com/stefanbeller/git/tree/object-store
that is based on 2.16.

While the commits are structured for easy review (to not miss any of
the globals that that series is based upon), I did not come up with a
good strategy how to take care of series in flight that add more globals.

So I think for now you'd want to keep it as global vars, such that
it is consistent with the code base and then we'll figure out how to
do the conversion one step at a time.

Please do not feel stopped or hindered by my slow pace of working
through that series, maybe I'll have to come up with another approach
that is better for upstream (rebasing that series is a pain, as upstream
moves rather quickly. Maybe I'll have to send that series in smaller chunks).

> Finally, I tried to clean up my incorrect style as I was recreating
> these commits. Feel free to be merciless in style feedback now that the
> architecture is more stable.

ok, will do.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2 00/14] Serialized Git Commit Graph
  2018-01-30 21:47 ` Stefan Beller
@ 2018-02-01  2:34   ` Stefan Beller
  0 siblings, 0 replies; 3+ messages in thread
From: Stefan Beller @ 2018-02-01  2:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

>> Finally, I tried to clean up my incorrect style as I was recreating
>> these commits. Feel free to be merciless in style feedback now that the
>> architecture is more stable.
>
> ok, will do.

I reviewed all patches and found no nits worth mentioning.
The whole series has been reviewed by me.

>
> Thanks,
> Stefan

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, back to index

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
2018-01-30 21:47 ` Stefan Beller
2018-02-01  2:34   ` Stefan Beller

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox