git@vger.kernel.org mailing list mirror (one of many)
* [PATCH v5 00/13] Serialized Git Commit Graph
From: Derrick Stolee @ 2018-02-27  2:32 UTC
  To: git
  Cc: gitster, peff, git, jonathantanmy, sbeller, szeder.dev, ramsay,
	Derrick Stolee

This patch series is another big difference from version 4, but I do 
think we are converging on a stable design.

This series depends on a few things in flight:

* jt/binsearch-with-fanout for bsearch_graph()

* 'master' includes the sha1file -> hashfile rename in (98a3beab).

* [PATCH] commit: drop uses of get_cached_commit_buffer(). [1] I
  couldn't find a ds/* branch for this one, but it is necessary or
  else the commit graph test script should fail.

Here are some of the inter-patch changes:

* The single commit graph file is stored in the fixed filename
  .git/objects/info/commit-graph

* Because of this change, I struggled with the right way to pair the
  lockfile API with the hashfile API. Perhaps they were not meant to
  interact like this. I include a new patch step that adds a flag for
  hashclose() to keep the file descriptor open so commit_lock_file()
  can succeed. Please let me know if this is the wrong approach.

* A side-benefit of this change is that the "--set-latest" and
  "--delete-expired" arguments are no longer useful.

* I re-ran the performance tests since I rebased onto master. I had
  moved my "master" branch on my copy of Linux from another perf test,
  which changed the data shape a bit.

* There was some confusion between v3 and v4 about whether commits in
  an existing commit-graph file are automatically added to the new
  file during a write. I think I cleared up all of the documentation
  that referenced this to the new behavior: we only include commits
  reachable from the starting commits (depending on --stdin-commits,
  --stdin-packs, or neither) unless the new "--additive" argument
  is specified.
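
As a rough illustration of the write semantics in the last bullet, the
sketch below computes which commits would land in the new file. It is
not the code from this series: commits_to_write(), the 'parents'
dictionary and the 'existing' set are hypothetical names, and the
history is a toy in-memory graph rather than a real object database.

```python
def commits_to_write(starts, parents, existing=(), additive=False):
    """Return the set of commit ids to store in the new graph file.

    starts   -- ids the write begins from (a stand-in for the commits
                found via --stdin-commits, --stdin-packs, or neither)
    parents  -- dict mapping each commit id to a list of its parent ids
    existing -- ids already present in the current commit-graph file
    additive -- if True, treat the existing ids as extra starting points
    """
    stack = list(starts)
    if additive:
        stack.extend(existing)
    seen = set()
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        # Follow parent links so the result is closed under reachability.
        stack.extend(parents.get(oid, []))
    return seen


# Toy history: A <- B <- C on one branch, A <- D on another.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"]}
old_file = {"A", "D"}

print(sorted(commits_to_write(["C"], parents, old_file)))
# ['A', 'B', 'C']
print(sorted(commits_to_write(["C"], parents, old_file, additive=True)))
# ['A', 'B', 'C', 'D']
```

Without the flag, D is dropped because nothing reaches it from C; with
additive=True it is carried over from the old file.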

Thanks,
-Stolee

[1] https://public-inbox.org/git/1519240631-221761-1-git-send-email-dstolee@microsoft.com/

-- >8 --

This patch contains a way to serialize the commit graph.

The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.
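
To make the point concrete that these walks only need parent
relationships, here is a deliberately naive ahead/behind count. It is a
sketch under assumptions: reachable() and ahead_behind() are
hypothetical helpers, the history is a toy dictionary, and git's real
walk stops at the boundary instead of visiting every ancestor.

```python
def reachable(tip, parents):
    """Collect every commit reachable from tip by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(parents[oid])
    return seen


def ahead_behind(local, upstream, parents):
    """Count commits only on local (ahead) and only on upstream (behind)."""
    from_local = reachable(local, parents)
    from_upstream = reachable(upstream, parents)
    return len(from_local - from_upstream), len(from_upstream - from_local)


# Toy history: root <- base, then base <- l1 (local)
# and base <- u1 <- u2 (upstream).
parents = {
    "root": [],
    "base": ["root"],
    "l1": ["base"],
    "u1": ["base"],
    "u2": ["u1"],
}

print(ahead_behind("l1", "u2", parents))
# (1, 2)
```

No commit message, author, or tree content is consulted anywhere in the
walk, which is why a compact table of parents is enough to answer it.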

The current patch speeds up these calculations by adding a check to
parse_commit_gently(): if a graph file exists, it supplies the required
metadata for the struct commit.
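
The lookup order described above can be sketched as follows. This is
only an illustration in Python, not the C code in the series:
CommitInfo, parse_raw_commit() and load_commit() are hypothetical
names, and 'graph' and 'odb' are plain dictionaries standing in for
the graph file and the object database.

```python
from dataclasses import dataclass


@dataclass
class CommitInfo:
    parents: list
    commit_date: int
    root_tree: str


def parse_raw_commit(raw):
    """Parse the header of a raw commit object (the slow path)."""
    header = raw.split("\n\n", 1)[0]
    parents, tree, date = [], None, 0
    for line in header.split("\n"):
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
        elif key == "committer":
            # The line ends with "<unix timestamp> <timezone>".
            date = int(value.split()[-2])
    return CommitInfo(parents, date, tree)


def load_commit(oid, graph, odb):
    """Prefer the graph entry; fall back to parsing the raw object."""
    info = graph.get(oid) if graph is not None else None
    if info is not None:
        return info
    return parse_raw_commit(odb[oid])


odb = {
    "c2": "tree t2\nparent c1\n"
          "author A <a@example.com> 1519700000 -0500\n"
          "committer A <a@example.com> 1519700000 -0500\n"
          "\nmessage\n",
}
graph = {"c1": CommitInfo([], 1519600000, "t1")}

print(load_commit("c1", graph, odb))  # served from the graph
print(load_commit("c2", graph, odb))  # parsed from the raw object
```

Passing graph=None models a repository without a graph file, in which
case every commit takes the slow path.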

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but are not implemented here, so
that the current patch can focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of them one by one.
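
For readers unfamiliar with the idea, the sketch below uses the usual
definition (a commit's generation is one more than the largest
generation among its parents, with root commits at 1) and shows how a
reachability walk could use it to stop early. Nothing here is
implemented by this series; generation_numbers() and can_reach() are
hypothetical names operating on a toy dictionary.

```python
def generation_numbers(parents):
    """Map each commit id to 1 + the largest generation of its parents."""
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            oid = stack[-1]
            if oid in gen:
                stack.pop()
                continue
            missing = [p for p in parents[oid] if p not in gen]
            if missing:
                # Compute the parents first, then revisit this commit.
                stack.extend(missing)
            else:
                gen[oid] = 1 + max((gen[p] for p in parents[oid]), default=0)
                stack.pop()
    return gen


def can_reach(src, dst, parents, gen):
    """Return True if dst is src itself or one of its ancestors."""
    stack, seen = [src], set()
    while stack:
        oid = stack.pop()
        if oid == dst:
            return True
        # An ancestor always has a strictly smaller generation, so any
        # other commit at or below dst's generation is a dead end.
        if oid in seen or gen[oid] <= gen[dst]:
            continue
        seen.add(oid)
        stack.extend(parents[oid])
    return False


# Toy history: A is the root, B and C both build on A, M merges B and C.
parents = {"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"]}
gen = generation_numbers(parents)

print(gen["A"], gen["B"], gen["C"], gen["M"])
print(can_reach("M", "A", parents, gen))
print(can_reach("B", "C", parents, gen))
```

In the last call the walk gives up at B without visiting A, because B
and C share the same generation and so neither can be an ancestor of
the other.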

Here are some performance results for a copy of the Linux repository
where 'master' has 664,185 reachable commits and is behind 'origin/master'
by 60,191 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  6.56s |  0.66s | -89%  |
| branch -vv                       |  1.35s |  0.32s | -76%  |
| rev-list --all                   |  6.7s  |  0.83s | -87%  |
| rev-list --all --objects         | 33.0s  | 27.5s  | -16%  |
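
The Rel % column is consistent with (After - Before) / Before,
truncated toward zero. The truncation is an inference from the numbers,
not something stated above, and rel_percent() is a hypothetical helper
written only to check the arithmetic.

```python
def rel_percent(before, after):
    """Relative change in percent, truncated toward zero."""
    return int((after - before) / before * 100)


# (Before, After) timings in seconds, copied from the table.
timings = {
    "log --oneline --topo-order -1000": (6.56, 0.66),
    "branch -vv": (1.35, 0.32),
    "rev-list --all": (6.7, 0.83),
    "rev-list --all --objects": (33.0, 27.5),
}

for command, (before, after) in timings.items():
    print(f"{command}: {rel_percent(before, after)}%")
```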

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. From then on, all git commands that walk
commits check the graph before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitGraph' setting.
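
One way to toggle the setting per run, without editing the repository
config, is git's general '-c name=value' option. The helper below is a
minimal sketch of such a comparison; git_command() and time_git() are
hypothetical names, and a single wall-clock measurement like this is
noisy, so repeat it a few times before drawing conclusions.

```python
import subprocess
import time


def git_command(args, use_graph):
    """Build a git invocation that overrides core.commitGraph for one run."""
    value = "true" if use_graph else "false"
    return ["git", "-c", f"core.commitGraph={value}", *args]


def time_git(args, use_graph, repo="."):
    """Run the command once in repo and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        git_command(args, use_graph),
        cwd=repo,
        check=True,
        stdout=subprocess.DEVNULL,  # discard output; only the timing matters
    )
    return time.perf_counter() - start
```

Inside a repository with a graph file written, calling
time_git(["rev-list", "--all"], use_graph=False) and then the same with
use_graph=True gives a before/after pair comparable to the table above.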

[1] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (13):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  csum-file: add CSUM_KEEP_OPEN flag
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement git commit-graph read
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits
  commit-graph: implement "--additive" option

 .gitignore                                    |   1 +
 Documentation/config.txt                      |   3 +
 Documentation/git-commit-graph.txt            |  93 +++
 .../technical/commit-graph-format.txt         |  98 +++
 Documentation/technical/commit-graph.txt      | 164 ++++
 Makefile                                      |   2 +
 alloc.c                                       |   1 +
 builtin.h                                     |   1 +
 builtin/commit-graph.c                        | 172 +++++
 cache.h                                       |   1 +
 command-list.txt                              |   1 +
 commit-graph.c                                | 720 ++++++++++++++++++
 commit-graph.h                                |  47 ++
 commit.c                                      |   3 +
 commit.h                                      |   3 +
 config.c                                      |   5 +
 contrib/completion/git-completion.bash        |   2 +
 csum-file.c                                   |  10 +-
 csum-file.h                                   |   1 +
 environment.c                                 |   1 +
 git.c                                         |   1 +
 packfile.c                                    |   4 +-
 packfile.h                                    |   2 +
 t/t5318-commit-graph.sh                       | 225 ++++++
 24 files changed, 1556 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh

-- 
2.16.2.282.g5029fe8.dirty



* Re: [PATCH v5 00/13] Serialized Git Commit Graph
From: Stefan Beller @ 2018-02-27 18:50 UTC
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Jonathan Tan,
	SZEDER Gábor, Ramsay Jones, Derrick Stolee

On Mon, Feb 26, 2018 at 6:32 PM, Derrick Stolee <stolee@gmail.com> wrote:
> This patch series is another big difference from version 4, but I do
> think we are converging on a stable design.
>
> This series depends on a few things in flight:
>
> * jt/binsearch-with-fanout for bsearch_graph()
>
> * 'master' includes the sha1file -> hashfile rename in (98a3beab).
>
> * [PATCH] commit: drop uses of get_cached_commit_buffer(). [1] I
>   couldn't find a ds/* branch for this one, but it is necessary or
>   else the commit graph test script should fail.

It is 'jk/cached-commit-buffer'; presumably 'jk' because the first
commit in that series is by Jeff King?

I found this commit by searching for its verbatim title in
'git log --oneline origin/pu' and then using
https://github.com/mhagger/git-when-merged
to find 51ff16f5f3a (Merge branch 'jk/cached-commit-buffer'
into jch, 2018-02-23)

>
> Here are some of the inter-patch changes:
>
> * The single commit graph file is stored in the fixed filename
>   .git/objects/info/commit-graph
>
> * Because of this change, I struggled with the right way to pair the
>   lockfile API with the hashfile API. Perhaps they were not meant to
>   interact like this. I include a new patch step that adds a flag for
>   hashclose() to keep the file descriptor open so commit_lock_file()
>   can succeed. Please let me know if this is the wrong approach.

This sounds like an interesting thing to review.

>
> * A side-benefit of this change is that the "--set-latest" and
>   "--delete-expired" arguments are no longer useful.
>
> * I re-ran the performance tests since I rebased onto master. I had
>   moved my "master" branch on my copy of Linux from another perf test,
>   which changed the data shape a bit.
>
> * There was some confusion between v3 and v4 about whether commits in
>   an existing commit-graph file are automatically added to the new
>   file during a write. I think I cleared up all of the documentation
>   that referenced this to the new behavior: we only include commits
>   reachable from the starting commits (depending on --stdin-commits,
>   --stdin-packs, or neither) unless the new "--additive" argument
>   is specified.
>

Thanks,
Stefan
