[RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
@ 2018-05-04 19:40 Jakub Narebski
  2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Jakub Narebski @ 2018-05-04 19:40 UTC (permalink / raw)
  To: git, Derrick Stolee, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

Hello,

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.

Commit-graph format
===================

A quick reminder: in its current incarnation the commit graph file
includes the following data [1]:

1. Some commit data that is just enough for many Git operations:
   - commit parents
   - commit tree (root tree OID)
   - committer date

2. The generation number of the commit. Commits with no parents have
   generation number 1; commits with parents have generation number one
   more than the maximum generation number of its parents.

The commit-graph file is split into chunks, which theoretically allows
extending the format wihout need for a version bump... though there is
currently no specified policy about unknown chunks (and understandably
so, as currently there are not any).

I think it would be good idea to talk about what more could be added to
be stored in the commit-graph file.

[1]: https://github.com/git/git/blob/pu/Documentation/technical/commit-graph-format.txt

Requirements and recommendations about possible new chunks
==========================================================

Because the main goal of commit-graph feature is better performance in
large repositories, any proposed new chunks (or, at higher level, every
proposed piece of new data) needs to have the following properties.

1. First, it needs to have bounded and sensible size.  This means that
the size of new proposed chunk should be constant, proportional to the
number of commits, or at worst proportional to the number of edges.

From the existing chunks, OIDF (OID Fanout) has constant size, both OIDL
(OID Lookup) and CGET/CDAT (Commit Data) has size proportional to the
number of commits, while not always present EDGE (Large Edge List) has
size proportional to the number of "excess" edges in "huge vertices"
(octopus merges).

2. Second, we want fast access, in most cases fast random access.  This
means using fixed-size records.  All currently existing chunks have
records (columns) of fixed and aligned size.

This restriction means that idexes where we have variable amount of data
per vertex (per commit) are discouraged.

3. Third, it needs to be reasonably fast to create, and fast to update.
This means time to create the chunk to be proportional to number of
commits, or sum of number of commits and edges (which for commit graph
and other sparse graphs is proprtional to the number of nodes / commits
anyway).  In my opinion time proportional to n*lug(n), where 'n' is the
number of commits, is also acceptable.  Times proportional to n^2 or n^3
are not acceptable.

It is also strongly preferred that time to update the chunk is
proportional to the number of new commits, so that incremental
(append-only) update is possible.  The generation numbers index has this
property.

Generic new chunks
==================

There are a few ideas for new chunks, new pieces of data to be added to
the commit-graph file, that are not connected with some labeling scheme
for directed acyclic graphs like GRAIL, FERRARI, FELINE or TF-Label.
Let's list them here.

If you have an idea of another bit of information that could be added to
the commit-graph file, please tell us.

Bloom filter for changed paths
------------------------------

The goal of this chunk is to speed up checking if the file or directory
was changed in given commit, for queries such as "git log -- <file>" or
"git blame <file>".  This is something that according to "Git Merge
contributor summit notes" [2] is already present in VSTS (Visual Studio
Team Services - the server counterpart of GVFS: Git Virtual File System)
at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

I think it was what Derrick Stolee was talking about at the end of his
part of "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]
[3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/

A quick reminder: a Bloom filter is a space-efficient probabilistic data
structure that is used to test whether an element is a member of a set.
False negatives are not possible: if a Bloom filter says that the
element is not in set, then it certainly isn't.  False positives matches
are possible: if a Bloom filter says that element is in set, then maybe
it is -- it is possible with some probability that it really is not.
The probability depends on number of elements in set, number of bits
Bloom filter uses (with fewer than 10 bits per element are required for
a 1% false positive probability), and number of [independent] hash
functions it uses.

This Bloom filter can be used to store in a compact and fixed-size way a
set of files (and possibly also directories) changed by given commit.
Then when "git log -- <file>" is called, Git can quickly check if the
commit in question touches given file.  If Bloom filter say that it most
probably is, only then Git would need to perform a diff to check if it
actually is (and usually get data to show).

There are a few problems that needs to be solved, though.

* First, and most important is the performance considerations of
  actually creating Bloom filter.  It is very expensive to do the full
  diff for each commit to create Bloom filter.

  As Derrick Stolee said in [3]:

  DS> My guess is that only server-side operations will need the added
  DS> response time, and can bear the cost of computing them when
  DS> writing the commit-graph file. Clients are less likely to be
  DS> patient waiting for a lot of diff calculations.
  DS>
  DS> If we add commit-graph file downloads to the protocol, then the
  DS> server could do this computation and send the data to all
  DS> clients. But that would be "secondary" information that maybe
  DS> clients want to verify, which is as difficult as computing it
  DS> themselves.

* What to do about merge commits, and octopus merges in particular?
  Should Bloom filter be stored for each of the parents?  How to ensure
  fast access then (fixed-width records) - use large edge list?

* Then there is problem of rename and copying detection - I think we can
  simply ignore it: unless someone has an idea about how to handle it?

  Though this means that "git log --follow <file>" wouldn't get any
  speedup, and neither the half of "git gui blame" that runs "git blame
  --incremental -C -C -w" -- the one that allows code copying and
  movement detection.

One possible way of solving the issue of large amount of work required
to compute Bloom filter could be updating it as a side effect of
operation that calculates full diff anyway (like "git log -p" or "git
show") if the commit is in commit-graph file.  We would need then some
special value that marks Bloom filter as not filled in (in addition to
the one saying that there were too many changes, and the one saying that
there were no changes at all - as rare as the latter might been).

Because not all queries that require calculating diff do it for full
hierarchy, maybe additionally there could be a additional field denoting
the depth to which diff and changed files were calculated.  For example
depth of 1 means that only top-level files and top-level directories
were added to Bloom filter, so you can only verify the first component
of the path to check if it could have been changed.  Zero depth may mean
unlimited.

A final note: Git would use this Bloom filter for changed files in quite
different way than is this structure typical use case: instead of
checking large number of elements against given Bloom filter, Git would
be checking given element against large amount of Bloom filters.  This
means that the optimization people came up with, like fast-forward Bllom
filters (https://github.com/efficient/ffbf) wouldn't necessary help.

Commit-graph validity
---------------------

One problem with commit-graph feature that needs to be solved is that it
stores one particular view of the project history, and there exist
features that change it.  Beween writing commit-graph file and using it,
commits may have been replaced by commits with different parents,
shallow clone may have been shortened or deepened, or grafts file may
have been edited.

One possible solution would be to automatically turn off commit-graph if
any of those features is present.  For shallow clone it may even be
acceptable solution - the hstory is shortened, so the queries are fast
anyway, and commit-graph is less needed.

If we join the working repository with the historical one (via grafts or
replacements) to examine the full history of a project, last thing we
would want is to make operations slower.

A possible way of solving this would be to add a chunk which says for
what combination of graft file content, shallow file content, and the
state of replace refs given commit-graph file is valid for.  If they do
not match, Git would not use commit-graph feature and fall back on
ordinary slower mechanism of answering queries.

Moreover, with this solution, when shallow clone changes its depth
(e.g. with 'git fetch --deepen=<depth>' or 'git fetch --unshallow), or
when a replacement is added with 'git replace', Git can either update
commit-graph file, or update the validity - saying that commit-graph
file is valid for the new state.  The latter is more important for
git-replace, which does not need to affect history at all.  There isn't
anything like that for grafts, but the mechanism is deprecated anyway,
and I am not opposed to just turning off commit-graph feature if there
is grafts file present: this could help convince people to upgrade to
git-replace, which should be easy now thanks to the new
--convert-graft-file option.

The verification could be as simple as verification hash / check over
contents of grafts file, over contents of shallow file, and over
replacement refs represented as fragment of packed-refs file.

Topographical ordering of commits
---------------------------------

The idea is simple: store topological order of commits, so that 'git log
--topo-order' and 'git log --graph' are faster.

This could be done with two chunks: one storing the position of given
commit in topological order (commits sorted by their OID, like for
metadata), and list of commits in topological order.  The first chunk
can also be thought of as the index into later.

This way it would be possible to list commits in topological order
starting at given commit, and ending at given commit or after given
number of commits.

The problem is with updating this data.  It should be possible to update
a topological order, but I am not sure if it would be possible to update
topological order based only on changed files and existing order to
still fulfill the criteria on --topo-order:

  --topo-order ::
      Show no parents before all of its children are shown, and avoid
      showing commits on multiple lines of history intermixed.

The second requirement may (or may not be) hard to fulfill when updating.

Bitmap indices indicator
------------------------

Bitmap index is supported since 2015 for Git itself [4], and since at
least 2013 for JGit [5].  The main goal of this feature was to improve
the "counting objects" phase of clone and fetch performance.

A quick reminder: bitmap index file is computed per packfile, and it
refers to objects by their position in packfile.  Bitmap index can be
currenly generated only for a packfile with full closure (i.e. where
every single object in the packfile can find its parent links inside the
same packfile).

A single bitmap in a bitmap index file is a bit vector, where i-th
position (i-th bit of bitmap) corresponds to the i-th object in
packfile.  There are 4 bitmaps that act as type indexes, for example
commit type index has "1" on i-th position if i-th object in packfile is
a commit object.  There are also N entries for commits, where i-th bit
in bitmap for given commit denotes whether i-th object is reachable from
the commit this bitmap is for.

I think that bitmap indexes does not see much use when Git needs to
answer reachability query; correct me if I am wrong here.  Maybe some
kind of indicator that given commit has reachability bitmap in some
bitmap file could help?

As soon as we arrive at commit for which reachability bitmap exist,
answering reachability query is easy: just find out the position of the
commit we want to check if it is reachable in same packfile, and if i-th
bit in bitmap is "1" then it is reachable.  Finding path is slightly
more involved: just use reachability bitmap to steer walk, always
choosing only reachable parents.

I don't know if we can find merge-base, also known as lowest common
ancestor (or sometimes as greatest common ancestor) with reachability
bitmaps.  You can easily find *all* ancestors (with bitwise AND
operation on bitmaps).  Maybe starting from the ones that have highest
generation numbers we can remove redundant ancestors (with AND NOT --
but not all commits have reachability bitmaps).

I wonder if it would be better to store indicators that commit has
reachability bitmap (and in which packfile, if it is possible with
mentioned restriction to have more than one bitmap file) in commit-file,
or if it would be easier when reading commit-graph information into
memory to simply read bitmap file for this information and add it to
in-memory representation.

[4]: http://githubengineering.com/counting-objects/
[5]: https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

As this post is getting long, I'll post other ideas, about commit
labeling for faster reachability queries in a separate email.

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
@ 2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
  2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-04 20:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Derrick Stolee, Derrick Stolee, Jeff King


On Fri, May 04 2018, Jakub Narebski wrote:

> With early parts of commit-graph feature (ds/commit-graph and
> ds/lazy-load-trees) close to being merged into "master", see
> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
> I think it would be good idea to think what other data could be added
> there to make Git even faster.

Thanks for the write-up.

> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means time to create the chunk to be proportional to number of
> commits, or sum of number of commits and edges (which for commit graph
> and other sparse graphs is proprtional to the number of nodes / commits
> anyway).  In my opinion time proportional to n*lug(n), where 'n' is the
> number of commits, is also acceptable.  Times proportional to n^2 or n^3
> are not acceptable.

I don't think this requirement makes sense...

>   DS> If we add commit-graph file downloads to the protocol, then the
>   DS> server could do this computation and send the data to all
>   DS> clients. But that would be "secondary" information that maybe
>   DS> clients want to verify, which is as difficult as computing it
>   DS> themselves.

... when combined with this. If hypothetically there was some data that
significantly sped up Git and the algorithm to generate it was
ridiculously expensive, that would be fine when combined with the
ability to fetch it pre-generated from the server. There could always be
an option where you trust the server and optionally don't verify the
data, or where it's much cheaper to verify than compute.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
  2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
@ 2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
  2018-05-05 13:28   ` Jakub Narebski
  2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
  2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
  3 siblings, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-04 20:36 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Derrick Stolee, Derrick Stolee, Jeff King

On Fri, May 04 2018, Jakub Narebski wrote:

(Just off-the cuff here and I'm surely about to be corrected by
Derrick...)

> * What to do about merge commits, and octopus merges in particular?
>   Should Bloom filter be stored for each of the parents?  How to ensure
>   fast access then (fixed-width records) - use large edge list?

You could still store it fixed with, you'd just say that if you
encounter a merge with N parents the filter wouldn't store files changed
in that commit, but rather whether any of the N (including the merge)
had changes to files as of the their common merge-base.

Then if they did you'd need to walk all sides of the merge where each
commit would also have the filter to figure out where the change(s)
was/were, but if they didn't you could skip straight to the merge base
and keep walking.

Which, on the topic of what else a commit graph could store: A mapping
from merge commits of N parents to the merge-base of those commits.

You could also store nothing for merges (or only files the merge itself
changed v.s. its parents). Derrick talked about how the bloom filter
implementation has a value that's "Didn't compute (for whatever reason),
look at it manually".

> * Then there is problem of rename and copying detection - I think we can
>   simply ignore it: unless someone has an idea about how to handle it?
>
>   Though this means that "git log --follow <file>" wouldn't get any
>   speedup, and neither the half of "git gui blame" that runs "git blame
>   --incremental -C -C -w" -- the one that allows code copying and
>   movement detection.

Couldn't the bloom filter also speed up --follow if you did two passes
through the history? The first to figure out all files that ever changed
names, and then say you did `--follow sha1-name.c` on git.git. The
filter would have had all the bits for both sha1_name.c and sha1-name.c
set on all commits that touched either for all of the history.

Of course this would only work with a given default value of -M<n>, but
on the assumption that most users left it at the default, and
furthermore that renames weren't so common as to make the filter useless
with too many false-positives as a result, it might be worth it. If you

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
@ 2018-05-05 13:28   ` Jakub Narebski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Narebski @ 2018-05-05 13:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Derrick Stolee, Derrick Stolee, Jeff King

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> On Fri, May 04 2018, Jakub Narebski wrote:
>
> (Just off-the cuff here and I'm surely about to be corrected by
> Derrick...)
>
>> * What to do about merge commits, and octopus merges in particular?
>>   Should Bloom filter be stored for each of the parents?  How to ensure
>>   fast access then (fixed-width records) - use large edge list?
>
> You could still store it fixed with, you'd just say that if you
> encounter a merge with N parents the filter wouldn't store files changed
> in that commit, but rather whether any of the N (including the merge)
> had changes to files as of the their common merge-base.

Well, that is one solution: to store union of changes (sum of changes)
from all parents of a merge commit in a Bloom filter for a merge.

But this wouldn't help us to skip merge entirely, because which parent
we would walk then?  But perhaps I am mistaken, and that does not
matter.

> Then if they did you'd need to walk all sides of the merge where each
> commit would also have the filter to figure out where the change(s)
> was/were, but if they didn't you could skip straight to the merge base
> and keep walking.

Another solution that I thought of is to use the same mechanism that
commit-graph file uses for storing merges: store Bloom filters for first
two parents, and if there are more parents (octopus merge), store Bloom
filters for the remaining parents in large edge extension for Bloom
filters.

This meant accepting some padding waste for CDAT chunk, to have faster
access.  We could do the same for Bloom filters, but it may mean quite a
bit of waste, depending on how many bits Bloom filter would use... but
there is another solution: for merge commits store Bloom filters for
first two parents that are half the size - this means of course more
false positives, but it may be acceptable solution.

> Which, on the topic of what else a commit graph could store: A mapping
> from merge commits of N parents to the merge-base of those commits.

The problem is that those N parents may have more than one merge-base,
and if so then those merge-bases may have also multiple merge-bases,
recursively (what 'recursive' merge strategy handles).  Though this
could be solved with 'large merge-base list' extension, just like
existing EDGE chunk - I think we can assume that most merge parents have
only single merge-base (but I have not checked this).

> You could also store nothing for merges (or only files the merge itself
> changed v.s. its parents). Derrick talked about how the bloom filter
> implementation has a value that's "Didn't compute (for whatever reason),
> look at it manually".

Right, another solution could be to store nothing for merges, or store
Bloom filter for changes against only first parent.  The goal of Bloom
filter is to avoid calculating diff if we don't need to.

Derrick, could you tell us what solution VSTS uses for Bloom filters on
merge commits?  Thanks in advance.

>> * Then there is problem of rename and copying detection - I think we can
>>   simply ignore it: unless someone has an idea about how to handle it?
>>
>>   Though this means that "git log --follow <file>" wouldn't get any
>>   speedup, and neither the half of "git gui blame" that runs "git blame
>>   --incremental -C -C -w" -- the one that allows code copying and
>>   movement detection.
>
> Couldn't the bloom filter also speed up --follow if you did two passes
> through the history? The first to figure out all files that ever changed
> names, and then say you did `--follow sha1-name.c` on git.git. The
> filter would have had all the bits for both sha1_name.c and sha1-name.c
> set on all commits that touched either for all of the history.
>
> Of course this would only work with a given default value of -M<n>, but
> on the assumption that most users left it at the default, and
> furthermore that renames weren't so common as to make the filter useless
> with too many false-positives as a result, it might be worth it. If you

I think it would be much simpler to just ensure that we store in Bloom
filter as changed files also pure renames, and leave doing rename
detection to the walk.  This way we do not fix old rename detecion
algorithm in stone.

The walk would simply change the name of file it would ask Bloom filters
about.

Thank you for your comments,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC] Other chunks for commit-graph, part 2 - reachability indexes
  2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
  2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
  2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
@ 2018-05-06 23:55 ` Jakub Narebski
  2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
  3 siblings, 0 replies; 10+ messages in thread
From: Jakub Narebski @ 2018-05-06 23:55 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

Hello,

In previous email I wrote:

JN> As this post is getting long, I'll post other ideas, about commit
JN> labeling for faster reachability queries in a separate email.

This is this email.

Here is second part of series dedicated to discussing what other data,
like various reachability indexes / labeling could be added to the
commit-graph file (as new chunks) to make Git even faster.

By reachability I mean answering the question whether commit A can reach
commit B, or in other words if commit B is an ancestor of commit A.
This type of query is used in many Git operations.

The commit-graph has one such reachability index built-in already, in
the form of generation numbers; this index is called level or
topological level of node / vertex in the literature / articles about
graph algorithms.

> 2. The generation number of the commit. Commits with no parents have
>    generation number 1; commits with parents have generation number one
>    more than the maximum generation number of its parents.

I have played a bit with various reachability indexes, starting with the
ones described in "Reachability Queries in Very Large Graphs: A Fast
Refined Online Search Approach" (FELINE index) paper by Renê R. Veloso,
Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014), available in the
PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf

I have started working on Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using those indexes for Git
large-ish commit graphs (which turns out to be quite specific type of
graph, more about which later), namely git.git and Linux kernel ones.

The Jupyter Notebook (which runs on Google cloud, but can be also run
locally) uses Python kernel, NetworkX librabry for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

At current size it started to be a bit unwieldy, especially recomputing
it from the start, so I am thinking about splitting it up and moving it
to GitHub; then I can e.g. put code and data in separate files, so they
do not have to be recalculated (cloning full Linux repository takes
quite a bit of time).  I have started the process: first step which is
copy of Colaboratory notebook is now available as the following repo:
https://github.com/jnareb/git-commit-graph-ext
https://github.com/jnareb/git-commit-graph-ext/blob/master/Graphs_FELINE_index.ipynb

Let's start with the reminder from part 1:

> Requirements and recommendations about possible new chunks
> ==========================================================
>
> Because the main goal of commit-graph feature is better performance in
> large repositories, any proposed new chunks (or, at higher level, every
> proposed piece of new data) needs to have the following properties.
>
>
> 1. First, it needs to have bounded and sensible size.  This means that
> the size of new proposed chunk should be constant, proportional to the
> number of commits, or at worst proportional to the number of edges.
>
> From the existing chunks, OIDF (OID Fanout) has constant size, both OIDL
> (OID Lookup) and CGET/CDAT (Commit Data) has size proportional to the
> number of commits, while not always present EDGE (Large Edge List) has
> size proportional to the number of "excess" edges in "huge vertices"
> (octopus merges).
>
>
> 2. Second, we want fast access, in most cases fast random access.  This
> means using fixed-size records.  All currently existing chunks have
> records (columns) of fixed and aligned size.
>
> This restriction means that idexes where we have variable amount of data
> per vertex (per commit) are discouraged.
>
>
> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means time to create the chunk to be proportional to number of
> commits, or sum of number of commits and edges (which for commit graph
> and other sparse graphs is proprtional to the number of nodes / commits
> anyway).  In my opinion time proportional to n*lug(n), where 'n' is the
> number of commits, is also acceptable.  Times proportional to n^2 or n^3
> are not acceptable.
>
> It is also strongly preferred that time to update the chunk is
> proportional to the number of new commits, so that incremental
> (append-only) update is possible.  The generation numbers index has this
> property.

Though as Ævar said in response, which I agree with:

ÆB> If hypothetically there was some data that significantly sped up Git
ÆB> and the algorithm to generate it was ridiculously expensive, that
ÆB> would be fine when combined with the ability to fetch it
ÆB> pre-generated from the server. There could always be an option where
ÆB> you trust the server and optionally don't verify the data, or where
ÆB> it's much cheaper to verify than compute.

https://public-inbox.org/git/86h8nm2pv8.fsf@gmail.com/T/#mdbc6a4ef052ae95136faf6d243d0c29ddfac58a8

Types of algorithms to speed up reachability queries
====================================================

The [research] issue for answering reachability queries is to find a
tradeoff between the online processing cost and the offline processing
cost.  All different approaches aim to find a trade-off among three main
costs: (a) query time which is the time to answer a reachability query
online [which for Git would be time to perform operation such as `git
branch --contains`], (b) construction time which is the time to
construct an index offline [which for Git would be time for client or
server to create commit-graph file data], and (c) the index size which
is the space needed to maintain the index [which for Git would be the
size of commit-graph file].

We can classify existing works / existing algorithms into two
categories.  The two main approaches are Label-Only and Label+G.  The
Label-Only approach is to answer reachability queries using the labels
computed only (this includes transitive closure compression and hop
labeling / transitive closure factorization methods).  The Label+G
approach is to make use of the labels computed as much as possible, but
it may need to traverse the graph G (it also called refined online
search).  The main idea behind Label+G is to reduce the time to
construct an effective index, since the computing cost to create
effective Label-Only index can be very high.

Note: this categorization is taken from the IP paper.

Beside those two groups there also exists so called boosting frameworks.
This algorithms try to reduce answering reachability query on very large
graph to answering reachability queries on smaller graph (reachability
backbone in case of SCARAB framework, reduced/simplified graph in case
of DAG Reduction method).

Let's table boosting frameworks for later (for example to examine
whether SCARAB could be used to choose which commits should have
reachability bitmaps).  If we follow "Requirements and recommendations
about possible new chunks", then we can limit examinations to Label+G
approaches.  Besides, commit-graphs can get large enough so that
Label-Only approaches starts to get unfeasible: index creation time and
index size gets very large.

Commit graph characteristics
============================

But first lets try to address the elephant in the room: would those
algorithms work well for the graph of revisions?

These methods are usually evaluated on a variety of real graph datasets,
both small and large, as well as large synthetic ones.  The real graphs
datasets include XML documents, genome databases, citation networks,
semantic knowledge databases and ontology terms, taxonomies, social
networks and web graphs.

It turns out that graph of revisions has size (number of nodes) that for
large repositories number of commits/nodes (50k for git.git, 750k for
Linux kernel, 1.7M for MS Windows) falls somewhere in the middle between
size of small graphs with 6k to 40k nodes and size of large graphs, most
of which has number of nodes staring from 1.5M and bigger.

Many of those example graphs (that are used for evaluation of
reachability algorithms) are approximately scale-free, with relative
commonness of vertices with a degree that greatly exceeds the average --
which is not the case for graph of revisions.  Some approaches include
special handling for those vertices to speed up queries; this probably
is not needed for DAG of commits, and I will skip these parts for now.

Commit graphs are sparse graphs, with average number of parents (the
ratio of edges to nodes in graph) below 2.0; even closer to 1.

One major difference is that in those graphs the percentage of
reachability queries that are answered negatively over all possible
reachability queries is over 90%.  On the other hand, the ratio of
reachability queries that are answered positively over all possible
reachability queries is very small; this ratio is called
reachability-ratio (or simply R-ratio), usually denoted as 'r'.  For
those real graph datasets, the R-ratio is between 2.1e-1 to 1.3e-7.  But
for graph of revision (based on sample of one) it looks like this
R-ratio is around 4.5e-1 (with, if I am not mistaken, maximum R-ratio
possible in DAGs being 0.5 or 5.0e-1, that is 50%).  This may affect the
evaluation of method for commit graphs; on the other hand as positive
queries are more interesting, some works do evaluations separately in
positive and negative queries, or on equal mixure of positive and
negative queries.

Another issue is that many of those "evaluation" graphs usually have the
large number of sources (nodes with only outgoing edges, "roots" in
graph terminology, "tips" or "heads" in Git terminology -- childless
nodes) or/and the large number of sinks (nodes with no outgoing edges,
"leafs" in graph terminology, "roots" in Git terminology -- parentless
nodes).  This seems to matter for FELINE index, which as my experiments
in mentioned Jupyter notebook shows is much less efective on commit
graphs than indicated in original work.  Note that while live
repositories would have much larger amount of branches than public
repositories (that I have used for tests) where there are only a few
long-lived public branches, even for Windows repo it is small percentage
of total number of nodes / commits (the Windows repo according to the
Git Merge 2018 presentation by Derrick has 90k branches and 1700k
commits; it is not as where 1/4 of nodes were source nodes).

Negative and positive-cut filters, false negatives and false positives
======================================================================

In Label+Graph approaches, indexing (node/edge labels) is enough to
answer some queries in constant time.

If the indexing can answer positive queries in constant time we call it
_positive-cut filter_.  This means that querying the index can state
that node B is reachable from node A (positive result); reaachability is
guaranteed.  However if we don't get such answer from the index, we
don't know if B is reachable from A, or if B is not reachable from A.
The case where B is reachable from A, but the index doesn't say it, we
can call _false negatives_: we get negative answer from the index, while
the actual ressult of reachability query is positive.

If the indexing can answer negative queries in constant time we call it
_negative-cut filter_.  This means that querying the index can state
that node B is definitevly not reachable from node A (negative result).
Otherwise we don't know if B is or is not reachable from A.  The case
where there is no path from A to B, but the index doesn't say it, we can
call _false positives_ or _falsely-implied path_.

Generation numbers (or topological levels) are an example of negative
cut filter.  If gen(A) < gen(B) then A cannot reach B.

Note: The terminology used here is taken from the FELINE paper (which
does not necesarily mean that it originated here; it appears in earlier
GRAIL paper too).

When one wants just to find if there exist path from A to B, both
positive-cut and negative-cut filters can be used.

  Data: (u, v) ∈ V^2, two vertices of the DAG
  Result: r(u, v), whether v is reachable from u
  Note: simplified DFS algorithm, no visited filter
  begin
      if positive-cut(u, v) then
          return true;
      if u == v then
          return true;
      if not negative-cut(u, v) then
          forall the w ∈ parents(u) do
              if Reachable(w, v) then
                  return true;

      return false;

This is the case for '--contains' and '--merged' queries in Git.

When one wants to find a single path from A to B if it exists, then both
negative-cut and most positive-cut filters can be used.

When one wants to find out all paths from A to B (e.g. for `git log A ^B
--strict-ancestor`) then in most cases only negative-cut filters are
really useful.

Online search algorithms
========================

When the index or indexes used cannot answer reachability query by
itself, one needs to perform a search from the source node to the target
node, applying positive-cut and negative-cut filters to reduce search
space.

One possible solution, often used for rechability testing in various
works, is to perform depth-first search (DFS).  This can be done simply
by using recursion (like in the above pseudo-code), but for large graphs
one need to use iterative approach by employing stack.  DFS is needed to
compute some of the reachability indexes.

However DFS it is not always the best strategy. Breadth-first search
(BFS) guarantees to find the shortest path with the cost of using more
memory.  In average, BFS is expected to be faster considering the cases
where the target node is close to the source node that has a large
reachability set.  This is especially true for graphs of revisions, that
are deep but narrow (their width is much smaller than their depth;
maximum number of commits with the same generation number is much
smaller than maximum generation number).  In BFS we store nodes to be
walked in a queue.

Besides the standard graph search strategies of DFS and BFS, sometimes
there is also considered bidirectional breadth-first search (BBFS),
which has a few advantages over BFS. In bidirectional search two BFS are
started simultaneously utilizing separate queues: one starting from
source node and going forward, and one starting from target node and
going backward (walking the reverse of DAG).  The first advantage of
bidirectional search is that it can answer positive queries faster when
the average degree of the graph is high. In negative queries, the worst
case scenario is that bidirectional search takes twice as much time as a
BFS would take; however there are substantial amount of cases where
bidirectional search is much faster than BFS.

Some indexes are geared towards making BBFS faster; this needs to be
taken into account when evaluating them, as we would not want to have to
deal with reversed commit graph.

Git actually uses a variant of BFS in paint_down_to_commin() in
commit.c, where it uses priority queue, sorted by generation number then
by date.  We can say that it is closest-first search variant of BFS.

Reachability indexes
====================

The Label+Graph approaches to include the following algorithms to speed
up reachability queries:

- Tree+SPPI (2005)
  (Surrogate & Surplus Predecessor Index)
- GRIPP (2007)
  (GRaph Indexing based on Pre- and Postorder numbering)
+ GRAIL (2010 & 2012)
  (Graph Reachability indexing via rAndomized Interval Labeling)
* PReaCH (2014)
  (Pruned Reachability Contraction Hierarchies)
+ FERRARI (2013)
  (Flexible and Efficient Reachability Range Assignment for gRaph Indexing)
+ FELINE (2014)
  (Fast rEfined onLINE search)
+ IP (2014)
  (Independent Permutation labeling approach)
+ BFL (2016)
  (Bloom Filter Labeling)

Algorithms prefixed with '-' are those older ones usually compared
against in different works that I won't cover here while algorithms
prefixed with '+' are those that I will cover somewhat.  Algorithms
prefixed with '*' are less known works that I think might be
interesting.

Levels / generation numbers
---------------------------

One of the simplest negative-cut filters, used by almost all works as a
supplementary filter... and also by Git's commit-graph feature.  The
generation number, or [topological] level is defined as length of
shortest path from a sink node ("root" in Git terminology).  You can also
define "reverse" or "backward" levels starting from source nodes, and
some works use that.

More useful in actually computing generation number for a commit, or
level of a node is recursive definition given at the beginning of this
email.  Note that usually nodes without outgoing edges ("root" nodes)
are assigned level zero, but giving it value of one like commit-graph
feature does is also valid convention.

The recursive definition shows that it is possible to calculate
generation numbers incrementally.  In particular the cost of updating
generation numbers is proportional to the number of new commits.  This
is not the case for "reverse" level numbers -- we won't consider it
then.

Generation numbers / topological levels are denoted using gen(B) symbol
in commit-graph documentation and commit messages, and l_u or L(u) in
works about graph reachability algorithms.  Let's use gen(A) here.

The following holds true:

  If A can reach B, and A != B, then gen(A) > gen(B)

This means that we get the following negative cut condition (this
condition is equivalent to the one above):

  If gen(A) <= gen(B) and A != B, then A cannot reach B

Sometimes it is easier to make use of the weaker negative-cut condition:

  If gen(A) < gen(B), then A cannot reach B

Topological ordering
--------------------

A topological sort or topological ordering of a directed acyclic graph
is a linear ordering of its vertices such that for every directed edge
(u,v) from vertex u to vertex v, u comes before v in the ordering.  Any
DAG has at least one topological ordering, and algorithms are known for
constructing a topological ordering of any DAG in linear time.

A topological ordering should be able to be updated quite easily, while
the reverse topological ordering should be able to be updated in time
proportional to the number of added commits (though there are no
guarantees about properties of such topological order).

The following holds true:

  If A can reach B, and A != B, then topo(A) > topo(B)
  If A can reach B, then topo(A) >= topo(B)

Because this is ordering, we also have that if topo(A) <= topo(B) and
topo(A) >= topo(B), then A = B; if topo(A) = topo(B) then A = B.

This means that we have the following negative-cut condition:

  If topo(A) < topo(B), then A cannot reach B

The condition is reversed for the reverse topological ordering.

Note: The FELINE index can be thought of as composed of two topological
ordering, where the second ordering is constructed using heuristic
algorithm in such way that it tries to minimize number of false
positives (falsely-implied paths) for the combination of those two
topological sort negative-cut filters.

Post-order in spanning tree
---------------------------

Directed rooted tree is a directed graph in which, for a vertex u called
the [tree] root and any other vertex v, there is exactly one directed
path from u to v.  A spannig tree of a directed acyclic graph G is a
subgraph that is a tree or set of trees (directed rooted tree for DAG)
which includes all of the vertices of graph G.  Source nodes of G are
roots of spanning trees of G.

The reachability problem on trees can be solved effectively by interval
labeling, which takes linear time and space for constructing the index,
and provides constant time querying.  Given a tree/forest T, interval
labeling assigns each node a label I(u) = [s_u, e_u] which represents
interval starting at s_u and ending at e_u.  A desired labeling
suitable for answering reachability queries in trees must satisfy the
condition that I(u) subsumes I(v), or in other words interval I(v) is a
subset of interval I(v) if and only if u reaches v.

One such labeling is _min-post labeling_ method for trees.  In this
labeling e_u = post(u) is the post-order value of the node u, defined as
the rank of the node u in a post order traversal of the tree.  The
starting value of the interval s_u is defined as the lowest post-order
rank for any node x in the subtree rooted at u (i.e., including u).
Note that because of how this interval was defined the check
I(v) ⊆ I(u) can be replaced by simpler post(v) ∈ I(u).

Pre order traversal of tree gives _pre-max labeling_ method for trees,
exactly equivalent to the min-post labeling described above.

This method used for a spanning tree gives positive-cut filter: if it
gives positive results for u ---> v query, then node v is reachable from
u via spanning tree.  If the answer from spanning tree labeling is
negative, node v can still be reachable from u, but via non-tree edge.
This is the idea behind positive-cut filter in GRAIL method, and this is
also positive-cut filter used in FELINE work/paper.  GRAIL calls this
positive-cut interval an interior interval.

The post-order or pre-order labeling of spanning tree is also labeling
of nodes in the original DAG.  Hence for negative-cut filter we can
assign to each node u the interval starting at minimum of labels of
nodes reachable in the DAG from u, and ending at maximum of labels of
nodes reachable in the DAG from u.  Then if node v has label outside
this interval I'(u), then u cannot reach v.  GRAIL calls this
negative-cut interval an exterior interval.

Now if the traversal of tree was done using DFS (depth-first search),
then the ending of min-post interval I(u) for a spanning tree is also
end of interval I'(u) for DAG.  Moreover the minimum of post(v) for
nodes reachable from u, that is the starting point for the DAG interval,
can be calculated iteratively during post-order traversal of the tree.
(It is possible that DFS is not needed; PReaCH work proves it for DFS,
but I have not examined proof in GRAIL work in detail).  GRAIL approach
uses those two intervals together: the interval label for node u is
given as I(u) = [s_u, m_u, e_u ], where EI(u) = [s_u, e_u] and
II(u) = [m_u, e_u] are the exterior (negative-cut) and interior
(positive-cut) intervals respectively.

The motivating idea in GRAIL is to use interval labeling multiple times,
for multiple random spaanning trees to improve searching phase
(answering reachability queries).  The GRAIL work proposes several
heuristics how to choose best set of spanning trees.  The recommended
number of those trees for large graphs is k = 5.

Using multiple spanning trees is not the only solution.  Alternative is
trying to get more out of single DFS numbering.  PReaCH work
(arXiv:1404.4465) adds one more possible and possibly large positive-cut
interval (called "full" interval there), and possibly one more
negative-cut interval (called "empty" interval).  Note that the original
work uses pre-order numbering and pre-max intervals, but the result can
be easily translated to post-numbering and min-post intervals (I have
done this in mentioned Jupyter notebook).

The additional positive-cut interval for node u, if it exists, is the
positive-cut interval for a node w = p_tree(u) that is reachable from u
and not in positive-cut interval of u, and has maximum size of said
positive-cut interval.  This can be computed iteratively, while
computing post-number ordering on u.  This seems promising for graph of
revisions.

One can also find additional negative-cut interval, finding f_gap so
that [f_gap, min] is "empty" interval, and the negative cut is split
from [f_min, post] to two intervals: [f_min, f_gap] and [min, post].
This also can be computed iteratively.

The FERRARI approach is more generic.  It starts with the fact that the
reachable set R(v) of a vertex v in the DAG is only partly represented
by the spanning tree interval I_T(v), as the tree interval only accounts
for reachability relationships that are preserved in tree T.  The
reachable set R(v) can be represented as set of intervals II(v).  The
resulting index (collection of reachable interval sets) can be regarded
as a materialization of the transitive closure of the graph, which gets
too large and is thus infeasible for large graphs.

Instead of storing full set of intervals II(v) representing reachability
set R(v), FERRARI index assigns a mixture of exact and approximate
reachability intervals, given a user-specified size constraint on the
index.  In FERRARI-L (local version) it is replaced by at most k
intervals, some of which come from II(v) and some are merges of
intervals from II(v).  If node w is in one of exact intervals for node
v, then v can reach w; if node w is outside all of intervals for node v,
then v cannot reach w.  Intervals are chosen in such way as to maximize
the sum of sizes of exact intervals, maximizing positive-cut
effectiveness.

In FERRARI-G variant the user-specified size constraint is made global;
some nodes may use more intervals, and some less, keeping the average
constant.  This allows to use index space more effectively, at the cost
of variable-length index and slower access.

Contraction hierarchies
-----------------------

In general, a Reachability Contraction Hierarchy (RCH) can be defined
based on any numbering order. We successively contract nodes with
increasing order(v). Contracting node v means removing it from the graph
and possibly inserting new edges such that the reachability relation in
the new graph is the same as in the previous graph, The query algorithm
does not use the contracted versions of the graph but only the ordering
and the shortcuts. This algorithm is based on the observation that in an
RCH it suffices to look at "up-down" paths, with contraction order first
increasing, then decreasing; this works best with bidirectional breadth
first search (BBFS).

PReaCH approach exploits the fact that any DAG contains source nodes
(with in-degree zero) and sink nodes (with out-degree zero). Contracting
such a node v never requires the insertion of shortcuts.

The index in this case is just one number per node: the contraction
order.  I'm not sure how useful that would be for BFS search, without
accessing reverse graph.

Graph dominance drawing
-----------------------

The fundamental idea behind FELINE is to associate every vertex u in V
with a unique ordered pair of natural integers i(u), so that if vertex u
can reach v, then i(u) ≦ u(v), that is first element is less or equal
first element in pair, and second element is less or equal second
element in pair.  If this pair of integers (x, y) associated with a
vertex is understood as coordinates in the Cartesian plane, then the
implication means that paths always point right and up, and it is enough
to check uper-right quadrant.

The method is inspired by weak dominance drawing, which guarantees the
implication for edges.  FELINE work shows proof that this leads to the
implication for paths.

Such pair of numbers are negative-cut filter.  The algorithm for
creating such index in FELINE work uses two topological orderings, where
the second is choosen to minimize number of false positives for the
index.

It turns out however from the experiments in mentoned Jupyter notebook
that this method doesn't work that well with commit graphs.  When there
are only a few sources and only a few sinks, then node positions gather
around 'x = y' diagonal line, thus FELINE index check almost reduces to
the topological ordering filter mentioned earlier.

Approximate set inclusion
-------------------------

Both IP and BFL takes reachability testing from a vertex u to a vertex v
as set-containment testing. Intuitively, if a vertex u can reach a
vertex v, then u must be able to reach all the vertices that v can
reach. Let Out(x) denote the set of all vertices that a vertex x can
reach in a graph. If u can reach v, then Out(v) ⊆ Out(u).  This is done
by checking by checking Out(v) ⊈ Out(u) instead (that Out(v) is not
subset or equal to Out(u), which is to check if there is at least one
element in Out(v) that is not in Out(u).

Both instead of getting an exact answer for testing, they get an answer
with given probability guarantee in a negative-cut fashion.  The IP
label uses k-min-wise independent permutations, while BFL uses Bloom
filters.

Those approaches work well with graphs that have not large reachability
sets (short depth), and where ratio of positive queries to all queries
is very small.  This means that they would most probably won't work well
for commit graphs, where neither holds true.

Summary
=======

The min-post labeling (the base of GRAIL approach) looks promising; it
needs to be checked whether PReaCH approach or FERRARI approach be
better for commit graphs.

I hope that this somewhat long post was informative,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
                   ` (2 preceding siblings ...)
  2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
@ 2018-05-07 14:26 ` Derrick Stolee
  2018-05-12 14:00   ` Jakub Narebski
  3 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2018-05-07 14:26 UTC (permalink / raw)
  To: Jakub Narebski, git, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

On 5/4/2018 3:40 PM, Jakub Narebski wrote:
> Hello,
>
> With early parts of commit-graph feature (ds/commit-graph and
> ds/lazy-load-trees) close to being merged into "master", see
> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
> I think it would be good idea to think what other data could be added
> there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think 
instead we need to finish taking advantage of the data that is already 
there. This means landing the generation number patch [1] (I think this 
is close, so I'll send a v6 this week if there is no new feedback.) and 
the auto-compute patch [2] (this could use more feedback, but I'll send 
a v1 based on the RFC feedback if no one chimes in).

[1] 
https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
     [PATCH v5 00/11] Compute and consume generation numbers

[2] 
https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git 
log --graph`. The `tag` scenario is probably easier: this can be done by 
replacing the revision-walk underlying the call to use 
paint_down_to_common() instead. Requires adding an external method to 
commit.c, but not too much code.

The tougher challenge is `git log --graph`. The revision walk machinery 
currently uses two precompute phases before iterating results to the 
pager: limit_list() and sort_in_topological_order(); these correspond to 
two phases of Kahn's algorithm for topo-sort (compute in-degrees, then 
walk by peeling commits with in-degree zero). This requires O(N) time, 
where N is the number of reachable commits. Instead, we could make this 
be O(W) time to output one page of results, where W is (roughly) the 
number of reachable commits with generation number above the last 
reported result.

In order to take advantage of this approach, the two phases of Kahn's 
algorithm need to be done in-line with reporting results to the pager. 
This means keeping two queues: one is a priority queue by generation 
number that computes in-degrees, the other is a priority queue (by 
commit-date or a visit-order value to do the --topo-order priority) that 
peels the in-degree-zero commits (and decrements the in-degree of their 
parents). I have not begun this refactoring effort because appears 
complicated to me, and it will be hard to tease out the logic without 
affecting other consumers of the revision-walk machinery.

I would love it if someone picked up the `git log --graph` task, since 
it will be a few weeks before I have the time to focus on it.

Without completing the benefits we get from generation numbers, these 
investigations into other reachability indexes will be incomplete as 
they are comparing benefits without all consumers taking advantage of a 
reachability index.

[...]
> Bloom filter for changed paths
> ------------------------------
>
> The goal of this chunk is to speed up checking if the file or directory
> was changed in given commit, for queries such as "git log -- <file>" or
> "git blame <file>".  This is something that according to "Git Merge
> contributor summit notes" [2] is already present in VSTS (Visual Studio
> Team Services - the server counterpart of GVFS: Git Virtual File System)
> at Microsoft:
>
> AV> - VSTS adds bloom filters to know which paths have changed on the commit
> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
> AV> - might be useful in the client as well, since limited-traversal is common
> AV> - if the file history is _very_ sparse, then bloom filter is useful
> AV> - but needs pre-compute, so useful to do once
> AV> - first make the client do it, then think about how to serve it centrally
>
> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>
> I think it was what Derrick Stolee was talking about at the end of his
> part of "Making Git for Windows" presentation at Git Merge 2018:
> https://youtu.be/oOMzi983Qmw?t=1835
>
> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
> trees when reading commit-graph", starting from [3]
> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/

Again, the benefits of Bloom filters should only be measured after 
already taking advantage of a reachability index during `git log`. 
However, you could get performance benefits from Bloom filters in a 
normal `git log` (no topo-order).

The tricky part about this feature is that the decisions we made in our 
C# implementation for the VSTS Git server may be very different than the 
needs for the C implementation of the Git client. Questions like "how do 
we handle merge commits?" may have different answers, which can only be 
discovered by implementing the feature.

(The answer for VSTS is that we only store Bloom filters containing the 
list of changed paths against the first parent. The second parent 
frequently has too many different paths, and if we are computing 
file-history simplification we have already determined the first parent 
is _not_ TREESAME, which requires verifying the difference by parsing 
trees against the first parent.)

I'm happy to provide more information on how we built this feature if 
someone is writing a patch. Otherwise, I plan to implement it after 
finishing the parts I think are higher priority.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
@ 2018-05-12 14:00   ` Jakub Narebski
  2018-05-14 13:20     ` Derrick Stolee
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2018-05-12 14:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>
>> With early parts of commit-graph feature (ds/commit-graph and
>> ds/lazy-load-trees) close to being merged into "master", see
>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
>> I think it would be good idea to think what other data could be added
>> there to make Git even faster.
>
> Before thinking about adding more data to the commit-graph, I think
> instead we need to finish taking advantage of the data that is already
> there. This means landing the generation number patch [1] (I think
> this is close, so I'll send a v6 this week if there is no new
> feedback.) and the auto-compute patch [2] (this could use more
> feedback, but I'll send a v1 based on the RFC feedback if no one
> chimes in).
>
> [1]
> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
>     [PATCH v5 00/11] Compute and consume generation numbers
>
> [2]
> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
>     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.

DIGRESSION: it is commendable that you are sending patches in small,
easy digestible chunks / patch series.  It is much easier to review 10+
series than 80+ behemoth (though I understand it is not always possible
to subdivide patch series into smaller self-contained sub-series).

On the other hand, it would be nice to have some roadmap about series
and features to be sent in the future, if possible.  Something like what
was done when 'git rebase --interactive' was getting builtinified: moved
(in parts) to C.  It would be great to have such roadmap with milestones
achieved and milestones to be done in the cover letter for series.

> The big wins remaining from this data are `git tag --merged` and `git
> log --graph`. The `tag` scenario is probably easier: this can be done
> by replacing the revision-walk underlying the call to use
> paint_down_to_common() instead. Requires adding an external method to
> commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, beside historical reasons.  Maybe performance is
better with current code?

Utilizing paint_down_to_common() there, beside reducing amount of code
you would have to modify, would also unify code (and possibly reduce
amount of code).  That's very nice.

>
> The tougher challenge is `git log --graph`. The revision walk
> machinery currently uses two precompute phases before iterating
> results to the pager: limit_list() and sort_in_topological_order();
> these correspond to two phases of Kahn's algorithm for topo-sort
> (compute in-degrees, then walk by peeling commits with in-degree
> zero). This requires O(N) time, where N is the number of reachable
> commits. Instead, we could make this be O(W) time to output one page
> of results, where W is (roughly) the number of reachable commits with
> generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [3])
looks like this:

  L ← Empty list that will contain the sorted elements
  S ← Collection of all nodes with no incoming edge
  while S is non-empty do
      remove a node 'n' from S
      add 'n' to tail of L
      for each parent 'm' of 'n' do
          decrease the in-degree of 'm'
          if 'm' has in-degree of 0 then
              insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/

> In order to take advantage of this approach, the two phases of Kahn's
> algorithm need to be done in-line with reporting results to the
> pager. This means keeping two queues: one is a priority queue by
> generation number that computes in-degrees,

This simple solition of using priority queue by generation number won't
work, I think.  In-degree is computed from heads down, generation number
is computed from roots up.

For example in the graph below

   *<---*<---*<---*<---*<---*<---*<---*<---*<---A
         \
          \--*<---B

both A and B have in-degree of 0, but gen(B) << gen(A).

But I might be wrong.

>                                            the other is a priority
> queue (by commit-date or a visit-order value to do the --topo-order
> priority) that peels the in-degree-zero commits (and decrements the
> in-degree of their parents). I have not begun this refactoring effort
> because appears complicated to me, and it will be hard to tease out
> the logic without affecting other consumers of the revision-walk
> machinery.
>
> I would love it if someone picked up the `git log --graph` task, since
> it will be a few weeks before I have the time to focus on it.

Let's assume that we have extra data (indexes such as generation number)
that can be used for positive-cut (we know that A can reach B) and
negative-cut (we know that A cannot reach B) filters.  Generation number
aka. topological level can be used in negative-cut filter.

NOTE: I have not looked at current Git code that does topological
sorting, as to not be suggested by the existing implementation.

How the indexes-amplified incremental Kahn's algorithm could look like:

First we need to find at least one node / vertex / commit with an
in-degree of zero.  We are given a list of commits to start from, but
they may not all have in-degree of zero - they may be dependent, or in
other words some of them may be reachable from others and be
irrelevant.  Here negative-cut and positive-cut filters can help.

If the order of commits on command line does not matter, we can start
from any distinct commit with highest generation number - we know that
it cannot be reached from other heads / refs and thus has in-degree of
zero.

  L ← Empty list that will contain the sorted elements
  S ← Collection of all nodes with no incoming edge
  R ← Collection of starting points (refs)
  while S is non-empty do
      remove a node 'n' from S
      add 'n' to tail of L
      for each parent 'm' of 'n' do
          if 'm' cannot be reached from R then
              # it has in-degree of 0
              insert 'm' into S
          else if 'm' can be reached from R then
              # it has in-degree greater than 0
              insert 'm' into R
          else
              walk from each 'r of R,
              until we know if 'm' is reachable from R
              then insert it into S or R

              perhaps computing in-degrees,
              or marking commits as reachable...

Does it looks correct?  I can try to make this pseudocode into actual
algorithm.

>
> Without completing the benefits we get from generation numbers, these
> investigations into other reachability indexes will be incomplete as
> they are comparing benefits without all consumers taking advantage of
> a reachability index.

On one hand side, you are right: if we try to investigate of some
reachability index is worth it by checking if it *improves performance
of actual git operations* without all consumers taking advantage of a
reachability index would be incomplete.

On the other hand we can still perform synthetic tests: how much less
commits we walk when checking that A can reach B on real commit graphs
(like I did in mentioned Google Colaboratory notebook [3]).  This
assumes that the cost of accessing commit data (and possibly also
indexes data) dominates, and the cost of using reachability indexes is
negligible.

[3]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg

On the gripping hand the construction of algorithms in those future
steps would be, I think, affected by what types of indexes would we
have: negative-cut filters, positive-cut filters, reachability bitmaps,
Bloom filters for changed files, eic.

> [...]
>> Bloom filter for changed paths
>> ------------------------------
>>
>> The goal of this chunk is to speed up checking if the file or directory
>> was changed in given commit, for queries such as "git log -- <file>" or
>> "git blame <file>".  This is something that according to "Git Merge
>> contributor summit notes" [2] is already present in VSTS (Visual Studio
>> Team Services - the server counterpart of GVFS: Git Virtual File System)
>> at Microsoft:
>>
>> AV> - VSTS adds bloom filters to know which paths have changed on the commit
>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
>> AV> - might be useful in the client as well, since limited-traversal is common
>> AV> - if the file history is _very_ sparse, then bloom filter is useful
>> AV> - but needs pre-compute, so useful to do once
>> AV> - first make the client do it, then think about how to serve it centrally
>>
>> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>>
>> I think it was what Derrick Stolee was talking about at the end of his
>> part of "Making Git for Windows" presentation at Git Merge 2018:
>> https://youtu.be/oOMzi983Qmw?t=1835
>>
>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
>> trees when reading commit-graph", starting from [3]
>> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/
>
> Again, the benefits of Bloom filters should only be measured after
> already taking advantage of a reachability index during `git
> log`. However, you could get performance benefits from Bloom filters
> in a normal `git log` (no topo-order).

I wonder how much they improve performance of "git blame"...

> The tricky part about this feature is that the decisions we made in
> our C# implementation for the VSTS Git server may be very different
> than the needs for the C implementation of the Git client. Questions
> like "how do we handle merge commits?" may have different answers,
> which can only be discovered by implementing the feature.
>
> (The answer for VSTS is that we only store Bloom filters containing
> the list of changed paths against the first parent. The second parent
> frequently has too many different paths, and if we are computing
> file-history simplification we have already determined the first
> parent is _not_ TREESAME, which requires verifying the difference by
> parsing trees against the first parent.)

Thanks for the information.  I think for now it is sufficient level of
the detail.

> I'm happy to provide more information on how we built this feature if
> someone is writing a patch. Otherwise, I plan to implement it after
> finishing the parts I think are higher priority.

All right, I understand that.  Time is a scarse resource.

I think that, beside writing patches for Git, exploring how various
pieces of data and various indexes affect walking commit graphs is also
important.  My explorations shown that, for example, that FELINE index
is not good fit for relatively "narrow" graphs of revision history.
Exploring this in Python / Jupyter is easier than trying to write a
exploratory patch for Git, in my opinion.  Just IMVHO.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-12 14:00   ` Jakub Narebski
@ 2018-05-14 13:20     ` Derrick Stolee
  2018-05-14 20:58       ` Jakub Narebski
  0 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2018-05-14 13:20 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

On 5/12/2018 10:00 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>> With early parts of commit-graph feature (ds/commit-graph and
>>> ds/lazy-load-trees) close to being merged into "master", see
>>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
>>> I think it would be good idea to think what other data could be added
>>> there to make Git even faster.
>> Before thinking about adding more data to the commit-graph, I think
>> instead we need to finish taking advantage of the data that is already
>> there. This means landing the generation number patch [1] (I think
>> this is close, so I'll send a v6 this week if there is no new
>> feedback.) and the auto-compute patch [2] (this could use more
>> feedback, but I'll send a v1 based on the RFC feedback if no one
>> chimes in).
>>
>> [1]
>> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
>>      [PATCH v5 00/11] Compute and consume generation numbers
>>
>> [2]
>> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
> Right, so the RFC might be a bit premature; I wanted the discussion to
> be out there to think about when adding new uses of existing features.
>
>
> DIGRESSION: it is commendable that you are sending patches in small,
> easy digestible chunks / patch series.  It is much easier to review 10+
> series than 80+ behemoth (though I understand it is not always possible
> to subdivide patch series into smaller self-contained sub-series).
>
> On the other hand, it would be nice to have some roadmap about series
> and features to be sent in the future, if possible.  Something like what
> was done when 'git rebase --interactive' was getting builtinified: moved
> (in parts) to C.  It would be great to have such roadmap with milestones
> achieved and milestones to be done in the cover letter for series.

I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 
anticipated.

I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:

Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths

>
>> The big wins remaining from this data are `git tag --merged` and `git
>> log --graph`. The `tag` scenario is probably easier: this can be done
>> by replacing the revision-walk underlying the call to use
>> paint_down_to_common() instead. Requires adding an external method to
>> commit.c, but not too much code.
> I wonder if there is some significant reason behind `git tag --merged`
> using its own codepath, beside historical reasons.  Maybe performance is
> better with current code?
>
> Utilizing paint_down_to_common() there, beside reducing amount of code
> you would have to modify, would also unify code (and possibly reduce
> amount of code).  That's very nice.
>
>> The tougher challenge is `git log --graph`. The revision walk
>> machinery currently uses two precompute phases before iterating
>> results to the pager: limit_list() and sort_in_topological_order();
>> these correspond to two phases of Kahn's algorithm for topo-sort
>> (compute in-degrees, then walk by peeling commits with in-degree
>> zero). This requires O(N) time, where N is the number of reachable
>> commits. Instead, we could make this be O(W) time to output one page
>> of results, where W is (roughly) the number of reachable commits with
>> generation number above the last reported result.
> A reminder: Kahn's algorithm (described for example in [1] and [3])
> looks like this:
>
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            decrease the in-degree of 'm'
>            if 'm' has in-degree of 0 then
>                insert 'm' into S
>
> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
>
>> In order to take advantage of this approach, the two phases of Kahn's
>> algorithm need to be done in-line with reporting results to the
>> pager. This means keeping two queues: one is a priority queue by
>> generation number that computes in-degrees,
> This simple solition of using priority queue by generation number won't
> work, I think.  In-degree is computed from heads down, generation number
> is computed from roots up.
>
> For example in the graph below
>
>     *<---*<---*<---*<---*<---*<---*<---*<---*<---A
>           \
>            \--*<---B
>
> both A and B have in-degree of 0, but gen(B) << gen(A).
>
> But I might be wrong.
>
>>                                             the other is a priority
>> queue (by commit-date or a visit-order value to do the --topo-order
>> priority) that peels the in-degree-zero commits (and decrements the
>> in-degree of their parents). I have not begun this refactoring effort
>> because appears complicated to me, and it will be hard to tease out
>> the logic without affecting other consumers of the revision-walk
>> machinery.
>>
>> I would love it if someone picked up the `git log --graph` task, since
>> it will be a few weeks before I have the time to focus on it.
> Let's assume that we have extra data (indexes such as generation number)
> that can be used for positive-cut (we know that A can reach B) and
> negative-cut (we know that A cannot reach B) filters.  Generation number
> aka. topological level can be used in negative-cut filter.
>
> NOTE: I have not looked at current Git code that does topological
> sorting, as to not be suggested by the existing implementation.
>
>
> How the indexes-amplified incremental Kahn's algorithm could look like:
>
> First we need to find at least one node / vertex / commit with an
> in-degree of zero.  We are given a list of commits to start from, but
> they may not all have in-degree of zero - they may be dependent, or in
> other words some of them may be reachable from others and be
> irrelevant.  Here negative-cut and positive-cut filters can help.
>
> If the order of commits on command line does not matter, we can start
> from any distinct commit with highest generation number - we know that
> it cannot be reached from other heads / refs and thus has in-degree of
> zero.
>
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    R ← Collection of starting points (refs)
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            if 'm' cannot be reached from R then
>                # it has in-degree of 0
>                insert 'm' into S
>            else if 'm' can be reached from R then
>                # it has in-degree greater than 0
>                insert 'm' into R
>            else
>                walk from each 'r of R,
>                until we know if 'm' is reachable from R
>                then insert it into S or R
>
>                perhaps computing in-degrees,
>                or marking commits as reachable...
>
>
> Does it looks correct?  I can try to make this pseudocode into actual
> algorithm.

I'm not following your pseudocode very well, so instead I'll provide a 
more concrete description of what I mentioned before:

Here are some data structures.

IDV : A dictionary storing the currently-computed in-degree value of a 
commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.
IDQ : A queue, for storing the "in-degree walk". It is a priority queue 
ordered by generation number.
TOQ : A queue, for storing the "topo-order walk". It is a priority queue 
ordered by "visit order" (see algorithm)

Here are some methods.

AdvanceInDegreeWalk(IDQ, IDV, gen):
     while !IDX.Empty && IDQ.Max >= gen:
         c = IDX.Dequeue

         for each parent p of c:
             IDV[p]++;
             IDQ.Enqueue(p, generation(p))

InitializeTopoOrder(R):
     Initialize IDQ, IDV, TOQ

     min_generation = GENERATION_NUMBER_INFINITY
     visit_order = 0

     for each commit c in R:
         min_generation = min(min_generation, generation(c))
         IDQ.Enqueue(c, generation(c))

     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

     for each commit c in R:
         if IDV[c] == 0:
             TOQ.Enqueue(c, visit_order++)

     return (IDQ, IDV, TOQ, visit_order)

GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
     if TOQ.Empty:
         return null

     c = TOQ.Dequeue()

     for each parent p of c:
         IDV[p]--
         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
         if IDV[p] == 0:
             TOQ.Enqueue(p, visit_order++)

     return c

(I mention "ref visit_order" to be sure we are passing-by-reference. In 
a full implementation, the walk details would be grouped into a struct.)

This algorithm is relatively simple, but the hard part is teasing the 
revision walk machinery to initialize the data by calling 
InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it 
needs a new element of the walk. That is, it is hard to be sure we are 
not causing side-effects as we make that transformation.

>
>> Without completing the benefits we get from generation numbers, these
>> investigations into other reachability indexes will be incomplete as
>> they are comparing benefits without all consumers taking advantage of
>> a reachability index.
> On one hand side, you are right: if we try to investigate of some
> reachability index is worth it by checking if it *improves performance
> of actual git operations* without all consumers taking advantage of a
> reachability index would be incomplete.
>
> On the other hand we can still perform synthetic tests: how much less
> commits we walk when checking that A can reach B on real commit graphs
> (like I did in mentioned Google Colaboratory notebook [3]).  This
> assumes that the cost of accessing commit data (and possibly also
> indexes data) dominates, and the cost of using reachability indexes is
> negligible.
>
> [3]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
> On the gripping hand the construction of algorithms in those future
> steps would be, I think, affected by what types of indexes would we
> have: negative-cut filters, positive-cut filters, reachability bitmaps,
> Bloom filters for changed files, eic.
>
>> [...]
>>> Bloom filter for changed paths
>>> ------------------------------
>>>
>>> The goal of this chunk is to speed up checking if the file or directory
>>> was changed in given commit, for queries such as "git log -- <file>" or
>>> "git blame <file>".  This is something that according to "Git Merge
>>> contributor summit notes" [2] is already present in VSTS (Visual Studio
>>> Team Services - the server counterpart of GVFS: Git Virtual File System)
>>> at Microsoft:
>>>
>>> AV> - VSTS adds bloom filters to know which paths have changed on the commit
>>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> AV> - might be useful in the client as well, since limited-traversal is common
>>> AV> - if the file history is _very_ sparse, then bloom filter is useful
>>> AV> - but needs pre-compute, so useful to do once
>>> AV> - first make the client do it, then think about how to serve it centrally
>>>
>>> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>>>
>>> I think it was what Derrick Stolee was talking about at the end of his
>>> part of "Making Git for Windows" presentation at Git Merge 2018:
>>> https://youtu.be/oOMzi983Qmw?t=1835
>>>
>>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
>>> trees when reading commit-graph", starting from [3]
>>> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/
>> Again, the benefits of Bloom filters should only be measured after
>> already taking advantage of a reachability index during `git
>> log`. However, you could get performance benefits from Bloom filters
>> in a normal `git log` (no topo-order).
> I wonder how much they improve performance of "git blame"...
>
>> The tricky part about this feature is that the decisions we made in
>> our C# implementation for the VSTS Git server may be very different
>> than the needs for the C implementation of the Git client. Questions
>> like "how do we handle merge commits?" may have different answers,
>> which can only be discovered by implementing the feature.
>>
>> (The answer for VSTS is that we only store Bloom filters containing
>> the list of changed paths against the first parent. The second parent
>> frequently has too many different paths, and if we are computing
>> file-history simplification we have already determined the first
>> parent is _not_ TREESAME, which requires verifying the difference by
>> parsing trees against the first parent.)
> Thanks for the information.  I think for now it is sufficient level of
> the detail.
>
>> I'm happy to provide more information on how we built this feature if
>> someone is writing a patch. Otherwise, I plan to implement it after
>> finishing the parts I think are higher priority.
> All right, I understand that.  Time is a scarse resource.
>
>
> I think that, beside writing patches for Git, exploring how various
> pieces of data and various indexes affect walking commit graphs is also
> important.  My explorations shown that, for example, that FELINE index
> is not good fit for relatively "narrow" graphs of revision history.
> Exploring this in Python / Jupyter is easier than trying to write a
> exploratory patch for Git, in my opinion.  Just IMVHO.

You are right. Ruling out possibilities is the best outcome these 
prototypes can have. Your work has saved someone a lot of time in 
investigating that direction.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-14 13:20     ` Derrick Stolee
@ 2018-05-14 20:58       ` Jakub Narebski
  2018-05-15 10:01         ` Johannes Schindelin
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2018-05-14 20:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 5/12/2018 10:00 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>>>
>>>> With early parts of commit-graph feature (ds/commit-graph and
>>>> ds/lazy-load-trees) close to being merged into "master", see
>>>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
>>>> I think it would be good idea to think what other data could be added
>>>> there to make Git even faster.
>>> Before thinking about adding more data to the commit-graph, I think
>>> instead we need to finish taking advantage of the data that is already
>>> there. This means landing the generation number patch [1] (I think
>>> this is close, so I'll send a v6 this week if there is no new
>>> feedback.) and the auto-compute patch [2] (this could use more
>>> feedback, but I'll send a v1 based on the RFC feedback if no one
>>> chimes in).
>>>
>>> [1]
>>> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
>>>      [PATCH v5 00/11] Compute and consume generation numbers
>>>
>>> [2]
>>> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
>>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
>>
>> Right, so the RFC might be a bit premature; I wanted the discussion to
>> be out there to think about when adding new uses of existing features.
>>
>>
>> DIGRESSION: it is commendable that you are sending patches in small,
>> easy digestible chunks / patch series.  It is much easier to review 10+
>> series than 80+ behemoth (though I understand it is not always possible
>> to subdivide patch series into smaller self-contained sub-series).
>>
>> On the other hand, it would be nice to have some roadmap about series
>> and features to be sent in the future, if possible.  Something like what
>> was done when 'git rebase --interactive' was getting builtinified: moved
>> (in parts) to C.  It would be great to have such roadmap with milestones
>> achieved and milestones to be done in the cover letter for series.
>
> I suppose that is what I intended in the "Future Work" section of
> Documentation/technical/commit-graph.txt. It gives a set of things
> that need to be done in order to make this a default feature, not just
> an expert-level feature. When I wrote the section, I was focused
> entirely on "commit-graph version 1.0" and I didn't know how long that
> would take. The series have been getting interest and review (in great
> part to your interest, Jakub) so they have been moving to 'next'
> faster than I anticipated.
>
> I'll plan on writing a more detailed list of future directions, but
> for now I'll list the things I know about and how I see them in
> priority order:
>
> Commit-graph v1.0:
> * ds/generation-numbers
> * 'verify' and fsck/gc integration
> * correct behavior with shallow clones, replace-objects, and grafts

So the goal of current v1.0 phase is to introduce generation numbers.
use them for better performance ("low hanging fruit"), ensure that it is
automatic and safe -- thus useable for an ordinary user.

>
> Commit-graph v1.1:
> * Place commit-graph storage in the_repository
> * 'git tag --merged' use generation numbers
> * 'git log --graph' use generation numbers
>
> Commit-graph v1.X:
> * Protocol v2 verb for sending commit-graph
> * Bloom filters for changed paths

Thanks, that is what I was missing.  The "Future Work" section, while
very nice to have (because it does not require to follow git
development; it is here in technical documentation), lacked
prioritization and rough milestones map.

[...]
>>> The tougher challenge is `git log --graph`. The revision walk
>>> machinery currently uses two precompute phases before iterating
>>> results to the pager: limit_list() and sort_in_topological_order();
>>> these correspond to two phases of Kahn's algorithm for topo-sort
>>> (compute in-degrees, then walk by peeling commits with in-degree
>>> zero). This requires O(N) time, where N is the number of reachable
>>> commits. Instead, we could make this be O(W) time to output one page
>>> of results, where W is (roughly) the number of reachable commits with
>>> generation number above the last reported result.
>>
>> A reminder: Kahn's algorithm (described for example in [1] and [3])
>> looks like this:
>>
>>    L ← Empty list that will contain the sorted elements
>>    S ← Collection of all nodes with no incoming edge
>>    while S is non-empty do
>>        remove a node 'n' from S
>>        add 'n' to tail of L
>>        for each parent 'm' of 'n' do
>>            decrease the in-degree of 'm'
>>            if 'm' has in-degree of 0 then
>>                insert 'm' into S
>>
>> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
>> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
[...]

> I'm not following your pseudocode very well,

Not surprising, I've lost myself in the middle of writing it...

>                                              so instead I'll provide a
> more concrete description of what I mentioned before:

Before we start with helper data structures, we should start with the
inputs.  For topological sort it is set of starting commits that define
the part of commit graph we are interested in; they may be some that are
redundant, but those that do not are assumed to have in-degree of 0.

> Here are some data structures.
>
  IDX : ???

> IDV : A dictionary storing the currently-computed in-degree value of a
> commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.

Nitpick: I guess that it is really "with assumed in-degree of 0".

This would be stored using commit-slab, isn't it?

> IDQ : A queue, for storing the "in-degree walk". It is a priority
> queue ordered by generation number.

All right, here we use the fact that if we walk up to and including
generation number of commit, then its in-degree will be calculated
correctly, as any commit with generation number smaller than generation
number of given commit cannot reach given commit and change its
in-degree.

> TOQ : A queue, for storing the "topo-order walk". It is a priority
> queue ordered by "visit order" (see algorithm)

Note: max priority queue ordered by "visit order" is practically a
stack, though I guess we prefer priority queue here to use the same code
for '--date-order'.

>
> Here are some methods.
>
> AdvanceInDegreeWalk(IDQ, IDV, gen):

If I understand it correctly this subroutine walks those commits that do
not have correct in-degree calculated, and walks till all commits with
generation number greater or equal to 'gen' cutoff parameter are walked
and have correct in-degree.

Assumes that all commits in graph but not in IDX have their in-degree
calculated.

>     while !IDX.Empty && IDQ.Max >= gen:
>         c = IDX.Dequeue
>
>         for each parent p of c:

Here the ordering of parents walked does not matter (much), as
generation number is used as cutoff only.

>             IDV[p]++;
>             IDQ.Enqueue(p, generation(p))

All right, looks correct; with gen = 0 it probably reduces to the
current code that does in-degree calculation upfront.

>
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ
>
>     min_generation = GENERATION_NUMBER_INFINITY
>     visit_order = 0
>
>     for each commit c in R:
>         min_generation = min(min_generation, generation(c))
>         IDQ.Enqueue(c, generation(c))
>
>     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

This means that we walk BFS-like until all starting points, that is all
commits in R, have their in-degree calculated.  I think we can avoid
that, but it may change the exact behavior of '--topo-order'.

Currently if one commit in R is from orphan branch (like 'todo' in
git.git), or has otherwise very low generation number, it would mean
that AdvanceInDegreeWalk would walk almost all commits.

>     for each commit c in R:
>         if IDV[c] == 0:
>             TOQ.Enqueue(c, visit_order++)

I think we may want to either start from commits in the order given on
command line (given by R), and not reverse order, or maybe even start
from commits with maximum generation number.

>
>     return (IDQ, IDV, TOQ, visit_order)

All right.

Here is my [allegedly] improved version, which assumes that we always
want to start from commit with maximum generation number (there may be
more than one such commit).

Let's add one more data structure:

  RRQ : A queue, for storing remaining commits from R (starting point).
  It is a priority queue ordered by generation number.


InitializeTopoOrder(R):
    Initialize IDQ, IDV, TOQ, RRQ

    max_generation = 0
    visit_order = 0

    for each commit c in R:
        max_generation = max(max_generation, generation(c))
        unless IsRedundantFilter(R / c, c): # optional
            RRQ.Enqueue(c, generation(c))

    while RRQ.Max == max_generation:
        c = RRQ.Dequeue()
        IDV[c] = 0  # commits with max gen have in-degree of 0
        IDQ.Enqueue(c, generation(c))

    # this would be almost no-op
    AdvanceInDegreeWalk(IDQ, IDV, max_generation)

    for each commit c in reversed R:
        if generation(c) == max_generation: # then IDV[c] == 0
            TOQ.Enqueue(c, visit_order++)

    return (IDQ, IDV, TOQ, RRQ, visit_order)

>
> GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
>     if TOQ.Empty:
>         return null
>
>     c = TOQ.Dequeue()
>
>     for each parent p of c:
>         IDV[p]--
>         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
>         if IDV[p] == 0:
>             TOQ.Enqueue(p, visit_order++)
>
>     return c

With the modification to InitializeTopoOrder it would be

GetNextInTopoOrder(IDQ, IDV, TOQ, RRQ, ref visit_order):
    if TOQ.Empty:
        return null

    c = TOQ.Dequeue()

    for each parent p of c, sorted by generation number:
        # some of maybe in-degree zero are now surely in-degree zero
        while RRQ.Max > generation(p):
           r = RRQ.Dequeue()
           IDQ.Enqueue(r, generation(r))

        AdvanceInDegreeWalk(IDQ, IDV, generation(p))
        IDV[p]--
        if IDV[p] == 0:
            TOQ.Enqueue(p, visit_order++)

    return c


> (I mention "ref visit_order" to be sure we are
> passing-by-reference. In a full implementation, the walk details would
> be grouped into a struct.)
>
> This algorithm is relatively simple, but the hard part is teasing the
> revision walk machinery to initialize the data by calling
> InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it
> needs a new element of the walk. That is, it is hard to be sure we are
> not causing side-effects as we make that transformation.

Right.  I'll take a look at existing code to find out how involved that
would be.

It is a pity that we cannot simply turn in-degree calculation into
producer that calculates in-degree in generation number order, and spits
it to FIFO pipe / channel / supply / blocking queue, while topo-order
calculation reads from in-degree pipe, and spits output to pager...

I wonder if we could use Scientist tool to make sure that new and old
code give the same results.

[...]
>> I think that, beside writing patches for Git, exploring how various
>> pieces of data and various indexes affect walking commit graphs is also
>> important.  My explorations shown that, for example, that FELINE index
>> is not good fit for relatively "narrow" graphs of revision history.
>> Exploring this in Python / Jupyter is easier than trying to write a
>> exploratory patch for Git, in my opinion.  Just IMVHO.

Actually I should have said that FELINE index for commit graphs (that I
have checked) does not show improvement over simply using some
topological ordering as negative-cut filter.

>
> You are right. Ruling out possibilities is the best outcome these
> prototypes can have. Your work has saved someone a lot of time in
> investigating that direction.

Regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-14 20:58       ` Jakub Narebski
@ 2018-05-15 10:01         ` Johannes Schindelin
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Schindelin @ 2018-05-15 10:01 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

[-- Attachment #1: Type: text/plain, Size: 2230 bytes --]

Hi Kuba,

On Mon, 14 May 2018, Jakub Narebski wrote:

> [... lots and lots of discussions...]
>
> All right.
> 
> Here is my [allegedly] improved version, which assumes that we always
> want to start from commit with maximum generation number (there may be
> more than one such commit).
> 
> Let's add one more data structure:
> 
>   RRQ : A queue, for storing remaining commits from R (starting point).
>   It is a priority queue ordered by generation number.
> 
> 
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ, RRQ
> 
>     max_generation = 0
>     visit_order = 0
> 
>     for each commit c in R:
>         max_generation = max(max_generation, generation(c))
>         unless IsRedundantFilter(R / c, c): # optional
>             RRQ.Enqueue(c, generation(c))
> 
>     while RRQ.Max == max_generation:
>         c = RRQ.Dequeue()
>         IDV[c] = 0  # commits with max gen have in-degree of 0
>         IDQ.Enqueue(c, generation(c))
> 
>     # this would be almost no-op
>     AdvanceInDegreeWalk(IDQ, IDV, max_generation)
> 
>     for each commit c in reversed R:
>         if generation(c) == max_generation: # then IDV[c] == 0
>             TOQ.Enqueue(c, visit_order++)
> 
>     return (IDQ, IDV, TOQ, RRQ, visit_order)

Isn't it premature to throw out code at this stage? My impression from the
side lines is that Stolee is trying to get concrete code implemented, code
that can be tested against concrete repositories, so that the question of
singularly overall importance can be asked: is it worth it, is it really
faster? And by how much?

It may not be the best idea to distract Stolee from said implementation
(i.e. step 1) by getting ahead ourselves with steps 17, 18 or 19. The next
steps would seem to be step 2 and 3, and I am sure that Stolee would
appreciate your help in implementing them.

So at this point it would maybe make a lot more sense to help Stolee e.g.
by refactoring the rev-list code that is a *mess* around the topo order,
so that all kinds of cute algorithms can be implemented *directly* in Git,
and put to the ultimate test.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-05-15 10:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
2018-05-05 13:28   ` Jakub Narebski
2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
2018-05-12 14:00   ` Jakub Narebski
2018-05-14 13:20     ` Derrick Stolee
2018-05-14 20:58       ` Jakub Narebski
2018-05-15 10:01         ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).