* [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc. @ 2018-05-04 19:40 Jakub Narebski 2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason ` (3 more replies) 0 siblings, 4 replies; 10+ messages in thread
From: Jakub Narebski @ 2018-05-04 19:40 UTC (permalink / raw)
To: git, Derrick Stolee, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason

Hello,

With early parts of the commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
I think it would be a good idea to think about what other data could be
added there to make Git even faster.


Commit-graph format
===================

A quick reminder: in its current incarnation the commit-graph file
includes the following data [1]:

1. Some commit data that is just enough for many Git operations:
   - commit parents
   - commit tree (root tree OID)
   - committer date

2. The generation number of the commit.  Commits with no parents have
   generation number 1; commits with parents have generation number one
   more than the maximum generation number of their parents.

The commit-graph file is split into chunks, which theoretically allows
extending the format without the need for a version bump... though there
is currently no specified policy about unknown chunks (and
understandably so, as currently there aren't any).

I think it would be a good idea to talk about what more could be stored
in the commit-graph file.

[1]: https://github.com/git/git/blob/pu/Documentation/technical/commit-graph-format.txt


Requirements and recommendations about possible new chunks
==========================================================

Because the main goal of the commit-graph feature is better performance
in large repositories, any proposed new chunk (or, at a higher level,
every proposed piece of new data) needs to have the following
properties.


1. First, it needs to have bounded and sensible size.
This means that the size of the proposed new chunk should be constant,
proportional to the number of commits, or at worst proportional to the
number of edges.

Of the existing chunks, OIDF (OID Fanout) has constant size, both OIDL
(OID Lookup) and CGET/CDAT (Commit Data) have size proportional to the
number of commits, while the not-always-present EDGE (Large Edge List)
chunk has size proportional to the number of "excess" edges in "huge
vertices" (octopus merges).


2. Second, we want fast access, in most cases fast random access.  This
means using fixed-size records.  All currently existing chunks have
records (columns) of fixed and aligned size.

This restriction means that indexes where we have a variable amount of
data per vertex (per commit) are discouraged.


3. Third, it needs to be reasonably fast to create, and fast to update.
This means the time to create the chunk should be proportional to the
number of commits, or to the sum of the numbers of commits and edges
(which for the commit graph and other sparse graphs is proportional to
the number of nodes / commits anyway).  In my opinion time proportional
to n*log(n), where 'n' is the number of commits, is also acceptable.
Times proportional to n^2 or n^3 are not acceptable.

It is also strongly preferred that the time to update the chunk is
proportional to the number of new commits, so that incremental
(append-only) update is possible.  The generation numbers index has
this property.


Generic new chunks
==================

There are a few ideas for new chunks, new pieces of data to be added to
the commit-graph file, that are not connected with some labeling scheme
for directed acyclic graphs like GRAIL, FERRARI, FELINE or TF-Label.
Let's list them here.  If you have an idea of another bit of
information that could be added to the commit-graph file, please tell
us.
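To illustrate the append-only property mentioned in point 3, here is a
minimal sketch (plain Python, not Git's implementation; all names are
made up) of updating generation numbers so that the work done is
proportional to the number of new commits only:

```python
def update_generations(gen, parents, new_commits):
    """Append-only update of generation numbers (illustrative sketch).

    `gen` maps already-indexed commits to their generation number;
    `parents` maps every commit to its list of parent ids;
    `new_commits` lists commits not yet in `gen`.  Only the new
    commits are visited; old entries are reused as-is.
    """
    def visit(c):
        if c not in gen:
            ps = parents[c]
            gen[c] = 1 if not ps else 1 + max(visit(p) for p in ps)
        return gen[c]

    for c in new_commits:
        visit(c)
    return gen

gen = {"A": 1, "B": 2}   # commits already in the commit-graph file
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B", "C"]}
print(update_generations(gen, parents, ["C", "D"]))
# {'A': 1, 'B': 2, 'C': 3, 'D': 4}
```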
Bloom filter for changed paths
------------------------------

The goal of this chunk is to speed up checking whether a file or
directory was changed in a given commit, for queries such as
"git log -- <file>" or "git blame <file>".

This is something that according to "Git Merge contributor summit
notes" [2] is already present in VSTS (Visual Studio Team Services -
the server counterpart of GVFS: Git Virtual File System) at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

I think this is what Derrick Stolee was talking about at the end of his
part of the "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in a subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]

[3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/

A quick reminder: a Bloom filter is a space-efficient probabilistic
data structure that is used to test whether an element is a member of a
set.  False negatives are not possible: if a Bloom filter says that the
element is not in the set, then it certainly isn't.  False positives
are possible: if a Bloom filter says that the element is in the set,
then maybe it is -- with some probability it really is not.  The
probability depends on the number of elements in the set, the number of
bits the Bloom filter uses (fewer than 10 bits per element are required
for a 1% false positive probability), and the number of [independent]
hash functions it uses.
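A minimal Bloom filter sketch in Python (illustrative only; the hashing
scheme, sizes, and names are made up and are not what VSTS or Git would
use):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from
    SHA-1 of the element prefixed with a per-function seed."""

    def __init__(self, m=256, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, elem):
        # Derive k bit positions for this element.
        for seed in range(self.k):
            h = hashlib.sha1(b"%d:%s" % (seed, elem.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, elem):
        for pos in self._positions(elem):
            self.bits |= 1 << pos

    def may_contain(self, elem):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(elem))

# Pretend these are the paths touched by one commit:
f = BloomFilter()
for path in ("Makefile", "commit-graph.c", "commit-graph.h"):
    f.add(path)

assert f.may_contain("commit-graph.c")   # added paths always answer True
print(f.may_contain("bloom.c"))          # usually False; rarely a false positive
```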
This Bloom filter can be used to store, in a compact and fixed-size
way, the set of files (and possibly also directories) changed by a
given commit.  Then when "git log -- <file>" is called, Git can quickly
check whether the commit in question touches the given file.  Only if
the Bloom filter says that it most probably does would Git need to
perform a diff to check if it actually does (and usually get the data
to show).

There are a few problems that need to be solved, though.

* First, and most important, are the performance considerations of
  actually creating the Bloom filter.  It is very expensive to do the
  full diff for each commit to create the Bloom filter.  As Derrick
  Stolee said in [3]:

DS> My guess is that only server-side operations will need the added
DS> response time, and can bear the cost of computing them when
DS> writing the commit-graph file. Clients are less likely to be
DS> patient waiting for a lot of diff calculations.
DS>
DS> If we add commit-graph file downloads to the protocol, then the
DS> server could do this computation and send the data to all
DS> clients. But that would be "secondary" information that maybe
DS> clients want to verify, which is as difficult as computing it
DS> themselves.

* What to do about merge commits, and octopus merges in particular?
  Should a Bloom filter be stored for each of the parents?  How to
  ensure fast access then (fixed-width records) - use a large edge
  list?

* Then there is the problem of rename and copy detection - I think we
  can simply ignore it: unless someone has an idea about how to handle
  it?

  Though this means that "git log --follow <file>" wouldn't get any
  speedup, and neither would the half of "git gui blame" that runs
  "git blame --incremental -C -C -w" -- the one that allows code
  copying and movement detection.
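The filter-assisted "git log -- <file>" walk described above can be
sketched as follows.  This toy model stands in a set for each commit's
filter (a real Bloom filter can also answer "maybe" for paths it never
saw, which is why the confirming diff is still needed); the counter
shows how many diffs the filter saves.  All names are illustrative:

```python
diffs_computed = 0

def changed_in(commit, path, filters, real_changes):
    """Filter-assisted check: diff only when the filter says 'maybe'."""
    global diffs_computed
    if path not in filters[commit]:      # filter says "definitely not"
        return False
    diffs_computed += 1                  # filter says "maybe": do the diff
    return path in real_changes[commit]

history = ["C3", "C2", "C1"]             # newest first
real = {"C1": {"Makefile"}, "C2": {"bloom.c"}, "C3": {"Makefile"}}
filters = {c: set(paths) for c, paths in real.items()}  # no false positives here

log = [c for c in history if changed_in(c, "Makefile", filters, real)]
print(log, diffs_computed)               # ['C3', 'C1'] 2
```

Without the filter, all three commits would have needed a diff; here
only the two "maybe" answers do.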
One possible way of solving the issue of the large amount of work
required to compute the Bloom filter could be updating it as a side
effect of operations that calculate the full diff anyway (like
"git log -p" or "git show") if the commit is in the commit-graph file.
We would then need some special value that marks the Bloom filter as
not filled in (in addition to the one saying that there were too many
changes, and the one saying that there were no changes at all - as rare
as the latter might be).

Because not all queries that require calculating a diff do it for the
full hierarchy, maybe there could additionally be a field denoting the
depth to which the diff and changed files were calculated.  For example
a depth of 1 means that only top-level files and top-level directories
were added to the Bloom filter, so you can only verify the first
component of a path to check if it could have been changed.  Zero depth
may mean unlimited.

A final note: Git would use this Bloom filter for changed files in
quite a different way than this structure's typical use case: instead
of checking a large number of elements against a given Bloom filter,
Git would be checking a given element against a large number of Bloom
filters.  This means that the optimizations people came up with, like
fast-forward Bloom filters (https://github.com/efficient/ffbf),
wouldn't necessarily help.


Commit-graph validity
---------------------

One problem with the commit-graph feature that needs to be solved is
that it stores one particular view of the project history, and there
exist features that change it.  Between writing the commit-graph file
and using it, commits may have been replaced by commits with different
parents, a shallow clone may have been shortened or deepened, or the
grafts file may have been edited.  One possible solution would be to
automatically turn off the commit-graph if any of those features is
present.
For shallow clones this may even be an acceptable solution - the
history is shortened, so the queries are fast anyway, and the
commit-graph is less needed.  But if we join the working repository
with a historical one (via grafts or replacements) to examine the full
history of a project, the last thing we would want is to make
operations slower.

A possible way of solving this would be to add a chunk which says for
what combination of graft file contents, shallow file contents, and
state of replace refs the given commit-graph file is valid.  If they do
not match, Git would not use the commit-graph feature and would fall
back on the ordinary, slower mechanism of answering queries.

Moreover, with this solution, when a shallow clone changes its depth
(e.g. with 'git fetch --deepen=<depth>' or 'git fetch --unshallow'), or
when a replacement is added with 'git replace', Git can either update
the commit-graph file, or update the validity - saying that the
commit-graph file is valid for the new state.  The latter is more
important for git-replace, which does not need to affect history at
all.  There isn't anything like that for grafts, but the mechanism is
deprecated anyway, and I am not opposed to just turning off the
commit-graph feature if a grafts file is present: this could help
convince people to upgrade to git-replace, which should be easy now
thanks to the new --convert-graft-file option.

The verification could be as simple as a verification hash / check over
the contents of the grafts file, over the contents of the shallow file,
and over the replacement refs represented as a fragment of a
packed-refs file.


Topological ordering of commits
-------------------------------

The idea is simple: store the topological order of commits, so that
'git log --topo-order' and 'git log --graph' are faster.  This could be
done with two chunks: one storing the position of a given commit in the
topological order (commits sorted by their OID, like for the metadata),
and a list of commits in topological order.
The first chunk can also be thought of as an index into the latter.
This way it would be possible to list commits in topological order
starting at a given commit, and ending at a given commit or after a
given number of commits.

The problem is with updating this data.  It should be possible to
update a topological order, but I am not sure if it would be possible
to update the topological order based only on the new commits and the
existing order and still fulfill the criteria of --topo-order:

--topo-order::
	Show no parents before all of its children are shown, and
	avoid showing commits on multiple lines of history
	intermixed.

The second requirement may (or may not) be hard to fulfill when
updating.


Bitmap indices indicator
------------------------

Bitmap indices have been supported since 2015 by Git itself [4], and
since at least 2013 by JGit [5].  The main goal of this feature was to
improve the "counting objects" phase of clone and fetch performance.

A quick reminder: a bitmap index file is computed per packfile, and it
refers to objects by their position in the packfile.  A bitmap index
can currently be generated only for a packfile with full closure (i.e.
where every single object in the packfile can find its parent links
inside the same packfile).

A single bitmap in a bitmap index file is a bit vector, where the i-th
position (i-th bit of the bitmap) corresponds to the i-th object in the
packfile.  There are 4 bitmaps that act as type indexes; for example
the commit type index has "1" in the i-th position if the i-th object
in the packfile is a commit object.  There are also N entries for
commits, where the i-th bit in the bitmap for a given commit denotes
whether the i-th object is reachable from the commit this bitmap is
for.

I think that bitmap indexes do not see much use when Git needs to
answer a reachability query; correct me if I am wrong here.  Maybe some
kind of indicator that a given commit has a reachability bitmap in some
bitmap file could help?
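The bitmap operations just described can be sketched like this
(bitmaps as Python ints; the object positions and the tiny history are
made up for illustration and this is not the on-disk pack bitmap
format):

```python
# Objects in one packfile, by position.
pos = {"A": 0, "B": 1, "C": 2, "D": 3}

def bitmap(*objects):
    """Build a bit vector with a 1 for each listed object."""
    bits = 0
    for o in objects:
        bits |= 1 << pos[o]
    return bits

# History A <- B, A <- C, {B, C} <- D; each commit's bitmap
# covers everything reachable from it, including itself.
reach = {
    "D": bitmap("D", "B", "C", "A"),
    "B": bitmap("B", "A"),
    "C": bitmap("C", "A"),
    "A": bitmap("A"),
}

def reachable(frm, to):
    """O(1) reachability check given frm's bitmap."""
    return bool(reach[frm] >> pos[to] & 1)

def common_ancestors(x, y):
    """All common ancestors: bitwise AND of the two bitmaps."""
    both = reach[x] & reach[y]
    return {o for o, i in pos.items() if both >> i & 1}

print(reachable("D", "A"))          # True
print(reachable("B", "C"))          # False
print(common_ancestors("B", "C"))   # {'A'}
```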
As soon as we arrive at a commit for which a reachability bitmap
exists, answering a reachability query is easy: just find the position,
in the same packfile, of the commit whose reachability we want to
check, and if the i-th bit in the bitmap is "1" then it is reachable.
Finding a path is slightly more involved: use the reachability bitmap
to steer the walk, always choosing only reachable parents.

I don't know if we can find the merge base, also known as the lowest
common ancestor (or sometimes as the greatest common ancestor), with
reachability bitmaps.  You can easily find *all* common ancestors (with
a bitwise AND operation on bitmaps).  Maybe, starting from the ones
that have the highest generation numbers, we can remove redundant
ancestors (with AND NOT -- but not all commits have reachability
bitmaps).

I wonder if it would be better to store indicators that a commit has a
reachability bitmap (and in which packfile, if it is possible with the
mentioned restriction to have more than one bitmap file) in the
commit-graph file, or if it would be easier, when reading commit-graph
information into memory, to simply read the bitmap file for this
information and add it to the in-memory representation.

[4]: http://githubengineering.com/counting-objects/
[5]: https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

As this post is getting long, I'll post other ideas, about commit
labeling for faster reachability queries, in a separate email.

Regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
From: Ævar Arnfjörð Bjarmason @ 2018-05-04 20:07 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Derrick Stolee, Derrick Stolee, Jeff King

On Fri, May 04 2018, Jakub Narebski wrote:

> With early parts of commit-graph feature (ds/commit-graph and
> ds/lazy-load-trees) close to being merged into "master", see
> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
> I think it would be good idea to think what other data could be added
> there to make Git even faster.

Thanks for the write-up.

> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means time to create the chunk to be proportional to number of
> commits, or sum of number of commits and edges (which for commit graph
> and other sparse graphs is proportional to the number of nodes / commits
> anyway). In my opinion time proportional to n*log(n), where 'n' is the
> number of commits, is also acceptable. Times proportional to n^2 or n^3
> are not acceptable.

I don't think this requirement makes sense...

> DS> If we add commit-graph file downloads to the protocol, then the
> DS> server could do this computation and send the data to all
> DS> clients. But that would be "secondary" information that maybe
> DS> clients want to verify, which is as difficult as computing it
> DS> themselves.

... when combined with this.  If hypothetically there was some data
that significantly sped up Git and the algorithm to generate it was
ridiculously expensive, that would be fine when combined with the
ability to fetch it pre-generated from the server.
There could always be an option where you trust the server and
optionally don't verify the data, or where it's much cheaper to verify
than to compute.
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
From: Ævar Arnfjörð Bjarmason @ 2018-05-04 20:36 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, Derrick Stolee, Derrick Stolee, Jeff King

On Fri, May 04 2018, Jakub Narebski wrote:

(Just off-the-cuff here, and I'm surely about to be corrected by
Derrick...)

> * What to do about merge commits, and octopus merges in particular?
> Should Bloom filter be stored for each of the parents? How to ensure
> fast access then (fixed-width records) - use large edge list?

You could still store it fixed-width: you'd just say that if you
encounter a merge with N parents, the filter wouldn't store files
changed in that commit, but rather whether any of the N (including the
merge) had changes to files as of their common merge-base.

Then if they did, you'd need to walk all sides of the merge, where each
commit would also have the filter, to figure out where the change(s)
was/were; but if they didn't, you could skip straight to the merge base
and keep walking.

Which, on the topic of what else a commit graph could store: a mapping
from merge commits of N parents to the merge-base of those commits.

You could also store nothing for merges (or only files the merge itself
changed vs. its parents).  Derrick talked about how the bloom filter
implementation has a value that's "Didn't compute (for whatever
reason), look at it manually".
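If the Bloom filters are fixed-size bit vectors, the "whether any of
the N had changes since the merge-base" filter sketched above is just
the bitwise OR (set union) of the per-commit filters.  A toy sketch,
with ints standing in for the bit vectors (the bit patterns are made
up, not any proposed wire format):

```python
def union_filter(per_commit_filters):
    """Union of same-size Bloom filters: bitwise OR of their bits."""
    bits = 0
    for f in per_commit_filters:
        bits |= f
    return bits

# Pretend each int is a Bloom filter over paths changed by one commit:
side1 = [0b0010, 0b0100]   # commits on one side of the merge
side2 = [0b1000]           # commits on the other side
merge = union_filter(side1 + side2)

def may_have_changed(filt, path_bits):
    """Bloom-style query: all of the path's bits must be set."""
    return filt & path_bits == path_bits

print(bin(merge))                       # 0b1110
print(may_have_changed(merge, 0b0100))  # True: walk into the merge
print(may_have_changed(merge, 0b0001))  # False: skip to the merge-base
```

A union of many filters saturates faster (more false positives), which
is the usual trade-off of OR-ing Bloom filters of the same size.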
> * Then there is problem of rename and copying detection - I think we can
> simply ignore it: unless someone has an idea about how to handle it?
>
> Though this means that "git log --follow <file>" wouldn't get any
> speedup, and neither the half of "git gui blame" that runs "git blame
> --incremental -C -C -w" -- the one that allows code copying and
> movement detection.

Couldn't the bloom filter also speed up --follow if you did two passes
through the history?  The first to figure out all files that ever
changed names; then say you did `--follow sha1-name.c` on git.git: the
filter would have had all the bits for both sha1_name.c and sha1-name.c
set on all commits that touched either, for all of the history.

Of course this would only work with a given default value of -M<n>, but
on the assumption that most users left it at the default, and
furthermore that renames weren't so common as to make the filter
useless with too many false positives as a result, it might be worth
it.  If you
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
From: Jakub Narebski @ 2018-05-05 13:28 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: git, Derrick Stolee, Derrick Stolee, Jeff King

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Fri, May 04 2018, Jakub Narebski wrote:
>
> (Just off-the-cuff here, and I'm surely about to be corrected by
> Derrick...)
>
>> * What to do about merge commits, and octopus merges in particular?
>> Should Bloom filter be stored for each of the parents? How to ensure
>> fast access then (fixed-width records) - use large edge list?
>
> You could still store it fixed-width: you'd just say that if you
> encounter a merge with N parents, the filter wouldn't store files
> changed in that commit, but rather whether any of the N (including the
> merge) had changes to files as of their common merge-base.

Well, that is one solution: to store the union of changes (sum of
changes) from all parents of a merge commit in the Bloom filter for the
merge.  But this wouldn't help us skip the merge entirely, because
which parent would we walk then?  But perhaps I am mistaken, and that
does not matter.

> Then if they did, you'd need to walk all sides of the merge, where each
> commit would also have the filter, to figure out where the change(s)
> was/were; but if they didn't, you could skip straight to the merge base
> and keep walking.

Another solution that I thought of is to use the same mechanism that
the commit-graph file uses for storing merges: store Bloom filters for
the first two parents, and if there are more parents (octopus merge),
store Bloom filters for the remaining parents in a large edge extension
for Bloom filters.  This meant accepting some padding waste in the CDAT
chunk, to have faster access.
We could do the same for Bloom filters, but it may mean quite a bit of
waste, depending on how many bits the Bloom filter would use... but
there is another solution: for merge commits, store Bloom filters for
the first two parents that are half the size - this of course means
more false positives, but it may be an acceptable solution.

> Which, on the topic of what else a commit graph could store: a mapping
> from merge commits of N parents to the merge-base of those commits.

The problem is that those N parents may have more than one merge-base,
and if so, those merge-bases may themselves have multiple merge-bases,
recursively (which is what the 'recursive' merge strategy handles).
Though this could be solved with a 'large merge-base list' extension,
just like the existing EDGE chunk - I think we can assume that most
merge parents have only a single merge-base (but I have not checked
this).

> You could also store nothing for merges (or only files the merge itself
> changed vs. its parents).  Derrick talked about how the bloom filter
> implementation has a value that's "Didn't compute (for whatever
> reason), look at it manually".

Right, another solution could be to store nothing for merges, or to
store a Bloom filter for changes against only the first parent.  The
goal of the Bloom filter is to avoid calculating diffs if we don't need
to.

Derrick, could you tell us what solution VSTS uses for Bloom filters on
merge commits?  Thanks in advance.

>> * Then there is problem of rename and copying detection - I think we can
>> simply ignore it: unless someone has an idea about how to handle it?
>>
>> Though this means that "git log --follow <file>" wouldn't get any
>> speedup, and neither the half of "git gui blame" that runs "git blame
>> --incremental -C -C -w" -- the one that allows code copying and
>> movement detection.
>
> Couldn't the bloom filter also speed up --follow if you did two passes
> through the history?
> The first to figure out all files that ever changed
> names, and then say you did `--follow sha1-name.c` on git.git. The
> filter would have had all the bits for both sha1_name.c and sha1-name.c
> set on all commits that touched either for all of the history.
>
> Of course this would only work with a given default value of -M<n>, but
> on the assumption that most users left it at the default, and
> furthermore that renames weren't so common as to make the filter useless
> with too many false-positives as a result, it might be worth it. If you

I think it would be much simpler to just ensure that we also store pure
renames in the Bloom filter as changed files, and leave rename
detection to the walk.  This way we do not fix the old rename detection
algorithm in stone.  The walk would simply change the name of the file
it asks the Bloom filters about.

Thank you for your comments,
--
Jakub Narębski
* [RFC] Other chunks for commit-graph, part 2 - reachability indexes
From: Jakub Narebski @ 2018-05-06 23:55 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason

Hello,

In the previous email I wrote:

JN> As this post is getting long, I'll post other ideas, about commit
JN> labeling for faster reachability queries in a separate email.

This is that email.  Here is the second part of the series dedicated to
discussing what other data, like various reachability indexes /
labelings, could be added to the commit-graph file (as new chunks) to
make Git even faster.

By reachability I mean answering the question whether commit A can
reach commit B, or in other words whether commit B is an ancestor of
commit A.  This type of query is used in many Git operations.

The commit-graph has one such reachability index built in already, in
the form of generation numbers; this index is called the level, or
topological level, of a node / vertex in the literature on graph
algorithms.

> 2. The generation number of the commit. Commits with no parents have
> generation number 1; commits with parents have generation number one
> more than the maximum generation number of its parents.

I have played a bit with various reachability indexes, starting with
the ones described in the "Reachability Queries in Very Large Graphs: A
Fast Refined Online Search Approach" (FELINE index) paper by Renê R.
Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed J.
Zaki (2014), available in PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

I have started working on a Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using those indexes for Git
large-ish commit graphs (which turn out to be quite a specific type of
graph, more about which later), namely the git.git and Linux kernel
ones.  The Jupyter Notebook (which runs in the Google cloud, but can
also be run locally) uses a Python kernel, the NetworkX library for
graph manipulation, and matplotlib (via NetworkX) for display.
Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

At its current size it has started to be a bit unwieldy, especially
when recomputing it from the start, so I am thinking about splitting it
up and moving it to GitHub; then I can e.g. put code and data in
separate files, so they do not have to be recalculated (cloning the
full Linux repository takes quite a bit of time).  I have started the
process: the first step, which is a copy of the Colaboratory notebook,
is now available in the following repo:
https://github.com/jnareb/git-commit-graph-ext
https://github.com/jnareb/git-commit-graph-ext/blob/master/Graphs_FELINE_index.ipynb

Let's start with a reminder from part 1:

> Requirements and recommendations about possible new chunks
> ==========================================================
>
> Because the main goal of commit-graph feature is better performance in
> large repositories, any proposed new chunks (or, at higher level, every
> proposed piece of new data) needs to have the following properties.
>
>
> 1. First, it needs to have bounded and sensible size. This means that
> the size of new proposed chunk should be constant, proportional to the
> number of commits, or at worst proportional to the number of edges.
>
> From the existing chunks, OIDF (OID Fanout) has constant size, both OIDL
> (OID Lookup) and CGET/CDAT (Commit Data) have size proportional to the
> number of commits, while not always present EDGE (Large Edge List) has
> size proportional to the number of "excess" edges in "huge vertices"
> (octopus merges).
>
>
> 2. Second, we want fast access, in most cases fast random access. This
> means using fixed-size records. All currently existing chunks have
> records (columns) of fixed and aligned size.
>
> This restriction means that indexes where we have variable amount of data
> per vertex (per commit) are discouraged.
>
>
> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means time to create the chunk to be proportional to number of
> commits, or sum of number of commits and edges (which for commit graph
> and other sparse graphs is proportional to the number of nodes / commits
> anyway). In my opinion time proportional to n*log(n), where 'n' is the
> number of commits, is also acceptable. Times proportional to n^2 or n^3
> are not acceptable.
>
> It is also strongly preferred that time to update the chunk is
> proportional to the number of new commits, so that incremental
> (append-only) update is possible. The generation numbers index has this
> property.

Though as Ævar said in response, and I agree:

ÆB> If hypothetically there was some data that significantly sped up Git
ÆB> and the algorithm to generate it was ridiculously expensive, that
ÆB> would be fine when combined with the ability to fetch it
ÆB> pre-generated from the server. There could always be an option where
ÆB> you trust the server and optionally don't verify the data, or where
ÆB> it's much cheaper to verify than compute.
https://public-inbox.org/git/86h8nm2pv8.fsf@gmail.com/T/#mdbc6a4ef052ae95136faf6d243d0c29ddfac58a8


Types of algorithms to speed up reachability queries
====================================================

The [research] issue for answering reachability queries is to find a
tradeoff between the online processing cost and the offline processing
cost.  All the different approaches aim to find a trade-off among three
main costs:

(a) the query time, which is the time to answer a reachability query
    online [which for Git would be the time to perform an operation
    such as `git branch --contains`],

(b) the construction time, which is the time to construct an index
    offline [which for Git would be the time for a client or server to
    create the commit-graph file data], and

(c) the index size, which is the space needed to maintain the index
    [which for Git would be the size of the commit-graph file].

We can classify existing works / existing algorithms into two
categories.  The two main approaches are Label-Only and Label+G.  The
Label-Only approach is to answer reachability queries using only the
computed labels (this includes transitive closure compression and hop
labeling / transitive closure factorization methods).  The Label+G
approach is to make use of the computed labels as much as possible, but
it may need to traverse the graph G (it is also called refined online
search).  The main idea behind Label+G is to reduce the time to
construct an effective index, since the computing cost to create an
effective Label-Only index can be very high.

Note: this categorization is taken from the IP paper.

Beside those two groups there also exist so-called boosting frameworks.
These algorithms try to reduce answering a reachability query on a very
large graph to answering reachability queries on a smaller graph (the
reachability backbone in case of the SCARAB framework, the
reduced/simplified graph in case of the DAG Reduction method).
Let's table boosting frameworks for later (for example, to examine
whether SCARAB could be used to choose which commits should have
reachability bitmaps).

If we follow "Requirements and recommendations about possible new
chunks", then we can limit our examination to Label+G approaches.
Besides, commit graphs can get large enough that Label-Only approaches
start to become unfeasible: index creation time and index size get very
large.


Commit graph characteristics
============================

But first let's try to address the elephant in the room: would those
algorithms work well for the graph of revisions?

These methods are usually evaluated on a variety of real graph
datasets, both small and large, as well as large synthetic ones.  The
real graph datasets include XML documents, genome databases, citation
networks, semantic knowledge databases and ontology terms, taxonomies,
social networks and web graphs.

It turns out that the number of commits/nodes in large repositories
(50k for git.git, 750k for the Linux kernel, 1.7M for MS Windows) falls
somewhere in the middle between the sizes of the small evaluation
graphs, with 6k to 40k nodes, and the sizes of the large ones, most of
which have 1.5M nodes or more.

Many of those example graphs (that are used for evaluation of
reachability algorithms) are approximately scale-free, with a relative
commonness of vertices whose degree greatly exceeds the average --
which is not the case for the graph of revisions.  Some approaches
include special handling for those vertices to speed up queries; this
probably is not needed for the DAG of commits, and I will skip those
parts for now.

Commit graphs are sparse graphs, with the average number of parents
(the ratio of edges to nodes in the graph) below 2.0, even closer to 1.

One major difference is that in those graphs the percentage of
reachability queries that are answered negatively, over all possible
reachability queries, is over 90%.
On the other hand, the ratio of reachability queries that are answered
positively over all possible reachability queries is very small; this
ratio is called the reachability ratio (or simply R-ratio), usually
denoted 'r'. For those real graph datasets, the R-ratio is between
2.1e-1 and 1.3e-7. But for the graph of revisions (based on a sample
of one) it looks like this R-ratio is around 4.5e-1 (with, if I am not
mistaken, the maximum R-ratio possible in DAGs being 0.5, or 5.0e-1,
that is 50%). This may affect the evaluation of methods for commit
graphs; on the other hand, as positive queries are more interesting,
some works do evaluations separately on positive and negative queries,
or on an equal mixture of positive and negative queries.

Another issue is that many of those "evaluation" graphs usually have a
large number of sources (nodes with only outgoing edges, "roots" in
graph terminology, "tips" or "heads" in Git terminology -- childless
nodes) and/or a large number of sinks (nodes with no outgoing edges,
"leaves" in graph terminology, "roots" in Git terminology --
parentless nodes). This seems to matter for the FELINE index, which,
as my experiments in the mentioned Jupyter notebook show, is much less
effective on commit graphs than indicated in the original work.

Note that while live repositories would have a much larger number of
branches than the public repositories (that I have used for tests),
where there are only a few long-lived public branches, even for the
Windows repo it is a small percentage of the total number of nodes /
commits (the Windows repo, according to the Git Merge 2018
presentation by Derrick, has 90k branches and 1700k commits; it is not
as if 1/4 of the nodes were source nodes).

Negative and positive-cut filters, false negatives and false positives
======================================================================

In Label+Graph approaches, indexing (node/edge labels) is enough to
answer some queries in constant time.
If the indexing can answer positive queries in constant time, we call
it a _positive-cut filter_. This means that querying the index can
state that node B is reachable from node A (positive result);
reachability is guaranteed. However, if we don't get such an answer
from the index, we don't know whether B is reachable from A or not.
The case where B is reachable from A, but the index doesn't say so, we
can call a _false negative_: we get a negative answer from the index,
while the actual result of the reachability query is positive.

If the indexing can answer negative queries in constant time, we call
it a _negative-cut filter_. This means that querying the index can
state that node B is definitely not reachable from node A (negative
result). Otherwise we don't know whether B is or is not reachable from
A. The case where there is no path from A to B, but the index doesn't
say so, we can call a _false positive_ or _falsely-implied path_.
Generation numbers (or topological levels) are an example of a
negative-cut filter: if gen(A) < gen(B), then A cannot reach B.

Note: The terminology used here is taken from the FELINE paper (which
does not necessarily mean that it originated there; it appears in the
earlier GRAIL paper too).

When one wants just to find out whether there exists a path from A to
B, both positive-cut and negative-cut filters can be used:

  Data: (u, v) ∈ V^2, two vertices of the DAG
  Result: r(u, v), whether v is reachable from u
  Note: simplified DFS algorithm, no visited filter

  begin
      if positive-cut(u, v) then return true;
      if u == v then return true;
      if not negative-cut(u, v) then
          forall the w ∈ parents(u) do
              if Reachable(w, v) then return true;
      return false;

This is the case for '--contains' and '--merged' queries in Git.

When one wants to find a single path from A to B if it exists, then
both negative-cut and most positive-cut filters can be used. When one
wants to find out all paths from A to B (e.g.
for `git log A ^B --strict-ancestor`) then in most cases only
negative-cut filters are really useful.

Online search algorithms
========================

When the index or indexes used cannot answer a reachability query by
themselves, one needs to perform a search from the source node to the
target node, applying positive-cut and negative-cut filters to reduce
the search space.

One possible solution, often used for reachability testing in various
works, is to perform a depth-first search (DFS). This can be done
simply by using recursion (like in the above pseudo-code), but for
large graphs one needs to use an iterative approach employing a stack.
DFS is needed to compute some of the reachability indexes.

However, DFS is not always the best strategy. Breadth-first search
(BFS) is guaranteed to find the shortest path, at the cost of using
more memory. On average, BFS is expected to be faster in the cases
where the target node is close to a source node that has a large
reachability set. This is especially true for graphs of revisions,
which are deep but narrow (their width is much smaller than their
depth; the maximum number of commits with the same generation number
is much smaller than the maximum generation number). In BFS we store
the nodes to be walked in a queue.

Besides the standard graph search strategies of DFS and BFS, sometimes
bidirectional breadth-first search (BBFS) is also considered, which
has a few advantages over BFS. In bidirectional search two BFS are
started simultaneously, utilizing separate queues: one starting from
the source node and going forward, and one starting from the target
node and going backward (walking the reverse of the DAG). The first
advantage of bidirectional search is that it can answer positive
queries faster when the average degree of the graph is high.
In negative queries, the worst-case scenario is that bidirectional
search takes twice as much time as a BFS would; however, there are a
substantial number of cases where bidirectional search is much faster
than BFS. Some indexes are geared towards making BBFS faster; this
needs to be taken into account when evaluating them, as we would not
want to have to deal with a reversed commit graph.

Git actually uses a variant of BFS in paint_down_to_common() in
commit.c, where it uses a priority queue, sorted by generation number
and then by date. We can say that it is a closest-first search variant
of BFS.

Reachability indexes
====================

The Label+Graph approaches include the following algorithms to speed
up reachability queries:

- Tree+SPPI (2005) (Surrogate & Surplus Predecessor Index)
- GRIPP (2007) (GRaph Indexing based on Pre- and Postorder numbering)
+ GRAIL (2010 & 2012) (Graph Reachability indexing via rAndomized
  Interval Labeling)
* PReaCH (2014) (Pruned Reachability Contraction Hierarchies)
+ FERRARI (2013) (Flexible and Efficient Reachability Range Assignment
  for gRaph Indexing)
+ FELINE (2014) (Fast rEfined onLINE search)
+ IP (2014) (Independent Permutation labeling approach)
+ BFL (2016) (Bloom Filter Labeling)

Algorithms prefixed with '-' are the older ones that are usually
compared against in different works, which I won't cover here, while
algorithms prefixed with '+' are those that I will cover somewhat.
Algorithms prefixed with '*' are less known works that I think might
be interesting.

Levels / generation numbers
---------------------------

One of the simplest negative-cut filters, used by almost all works as
a supplementary filter... and also by Git's commit-graph feature. The
generation number, or [topological] level, is defined as the length of
the longest path to a sink node ("root" in Git terminology). You can
also define "reverse" or "backward" levels starting from source nodes,
and some works use that.
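As an illustration of how such a negative-cut filter plugs into a
search, here is a minimal Python sketch of a BFS reachability check
pruned by generation numbers (the toy `parents` graph and its `gen`
values are made up for the example):

```python
from collections import deque

def can_reach(parents, gen, a, b):
    """BFS from commit a towards its ancestors, asking whether a can
    reach b; prunes with the negative cut 'if gen(x) < gen(b), then
    x cannot reach b'."""
    if a == b:
        return True
    queue, seen = deque([a]), {a}
    while queue:
        u = queue.popleft()
        for p in parents[u]:
            if p in seen or gen[p] < gen[b]:
                continue              # pruned by the negative-cut filter
            if p == b:
                return True
            seen.add(p)
            queue.append(p)
    return False

# Toy history: A <- B, with C and D both children of B
parents = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['B']}
gen = {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```

This uses the weaker condition "If gen(A) < gen(B), then A cannot
reach B"; a plain FIFO queue is used here for simplicity, where Git's
paint_down_to_common() walks with a priority queue instead.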
More useful for actually computing the generation number of a commit,
or the level of a node, is the recursive definition given at the
beginning of this email. Note that usually nodes without outgoing
edges ("root" nodes) are assigned level zero, but giving them a value
of one, like the commit-graph feature does, is also a valid
convention.

The recursive definition shows that it is possible to calculate
generation numbers incrementally. In particular, the cost of updating
generation numbers is proportional to the number of new commits. This
is not the case for "reverse" level numbers -- so we won't consider
them further.

Generation numbers / topological levels are denoted gen(B) in the
commit-graph documentation and commit messages, and l_u or L(u) in
works about graph reachability algorithms. Let's use gen(A) here.

The following holds true:

  If A can reach B, and A != B, then gen(A) > gen(B)

This means that we get the following negative-cut condition (this
condition is equivalent to the one above):

  If gen(A) <= gen(B) and A != B, then A cannot reach B

Sometimes it is easier to make use of the weaker negative-cut
condition:

  If gen(A) < gen(B), then A cannot reach B

Topological ordering
--------------------

A topological sort or topological ordering of a directed acyclic graph
is a linear ordering of its vertices such that for every directed edge
(u,v) from vertex u to vertex v, u comes before v in the ordering. Any
DAG has at least one topological ordering, and algorithms are known
for constructing a topological ordering of any DAG in linear time. A
topological ordering should be fairly easy to update, while the
reverse topological ordering should be updatable in time proportional
to the number of added commits (though there are no guarantees about
the properties of such a topological order).
The following holds true:

  If A can reach B, and A != B, then topo(A) > topo(B)
  If A can reach B, then topo(A) >= topo(B)

Because this is an ordering, we also have that if topo(A) <= topo(B)
and topo(A) >= topo(B), then A = B; that is, if topo(A) = topo(B) then
A = B.

This means that we have the following negative-cut condition:

  If topo(A) < topo(B), then A cannot reach B

The condition is reversed for the reverse topological ordering.

Note: The FELINE index can be thought of as being composed of two
topological orderings, where the second ordering is constructed using
a heuristic algorithm in such a way that it tries to minimize the
number of false positives (falsely-implied paths) for the combination
of those two topological-sort negative-cut filters.

Post-order in spanning tree
---------------------------

A directed rooted tree is a directed graph in which, for a vertex u
called the [tree] root and any other vertex v, there is exactly one
directed path from u to v. A spanning tree of a directed acyclic graph
G is a subgraph that is a tree or a set of trees (directed rooted
trees, for a DAG) which includes all of the vertices of graph G.
Source nodes of G are roots of spanning trees of G.

The reachability problem on trees can be solved effectively by
interval labeling, which takes linear time and space for constructing
the index, and provides constant-time querying. Given a tree/forest T,
interval labeling assigns each node a label I(u) = [s_u, e_u] which
represents an interval starting at s_u and ending at e_u. A labeling
suitable for answering reachability queries in trees must satisfy the
condition that I(u) subsumes I(v), or in other words interval I(v) is
a subset of interval I(u), if and only if u reaches v.

One such labeling is the _min-post labeling_ method for trees. In this
labeling e_u = post(u) is the post-order value of the node u, defined
as the rank of the node u in a post-order traversal of the tree.
The starting value of the interval, s_u, is defined as the lowest
post-order rank for any node x in the subtree rooted at u (i.e.,
including u). Note that because of how this interval was defined, the
check I(v) ⊆ I(u) can be replaced by the simpler post(v) ∈ I(u).
Pre-order traversal of the tree gives the _pre-max labeling_ method
for trees, exactly equivalent to the min-post labeling described
above.

This method, used for a spanning tree, gives a positive-cut filter: if
it gives a positive result for a u ---> v query, then node v is
reachable from u via the spanning tree. If the answer from the
spanning tree labeling is negative, node v can still be reachable from
u, but via a non-tree edge. This is the idea behind the positive-cut
filter in the GRAIL method, and this is also the positive-cut filter
used in the FELINE work/paper. GRAIL calls this positive-cut interval
an interior interval.

The post-order or pre-order labeling of a spanning tree is also a
labeling of nodes in the original DAG. Hence for a negative-cut filter
we can assign to each node u the interval starting at the minimum of
labels of nodes reachable in the DAG from u, and ending at the maximum
of labels of nodes reachable in the DAG from u. Then if node v has a
label outside this interval I'(u), then u cannot reach v. GRAIL calls
this negative-cut interval an exterior interval.

Now if the traversal of the tree was done using DFS (depth-first
search), then the end of the min-post interval I(u) for a spanning
tree is also the end of the interval I'(u) for the DAG. Moreover, the
minimum of post(v) for nodes reachable from u, that is the starting
point of the DAG interval, can be calculated iteratively during the
post-order traversal of the tree. (It is possible that DFS is not
needed; the PReaCH work proves it for DFS, but I have not examined the
proof in the GRAIL work in detail.)
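The single-DFS construction described above can be sketched in Python
as follows (a toy sketch; the names and the diamond-shaped example
graph are mine, and the spanning tree is simply the one implied by
first-visit edges of the DFS):

```python
import sys

def min_post_label(succ, sources):
    """One DFS over a DAG (succ maps node -> list of successors; for a
    commit graph, the successors of a commit are its parents). Computes:
      post[u]     - post-order rank e_u
      tree_min[u] - lowest post-order rank in u's spanning-tree subtree,
                    so [tree_min[u], post[u]] is the interior
                    (positive-cut) interval
      dag_min[u]  - lowest post-order rank among ALL nodes reachable
                    from u in the DAG, so [dag_min[u], post[u]] is the
                    exterior (negative-cut) interval"""
    post, tree_min, dag_min = {}, {}, {}
    counter = [0]

    def dfs(u):
        lo_tree = lo_dag = sys.maxsize
        for v in succ[u]:
            # In a DAG an edge can never point to a node still on the
            # DFS stack (that would be a cycle), so 'v in post' reliably
            # means "already finished".
            if v not in post:        # tree edge: v joins u's subtree
                dfs(v)
                lo_tree = min(lo_tree, tree_min[v])
            # every edge, tree or non-tree, feeds the DAG interval
            lo_dag = min(lo_dag, dag_min[v])
        counter[0] += 1
        post[u] = counter[0]
        tree_min[u] = min(lo_tree, post[u])
        dag_min[u] = min(lo_dag, post[u])

    for s in sources:
        if s not in post:
            dfs(s)
    return post, tree_min, dag_min

def tree_reaches(u, v, post, tree_min):
    # positive cut: post(v) in u's interior interval => u reaches v
    return tree_min[u] <= post[v] <= post[u]

def cannot_reach(u, v, post, dag_min):
    # negative cut: post(v) outside u's exterior interval => no path
    return not (dag_min[u] <= post[v] <= post[u])

# A diamond: a -> {b, c} -> d; 'a' is the only source
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
post, tree_min, dag_min = min_post_label(succ, sources=['a'])
```

In the diamond example one of the edges into 'd' is a non-tree edge,
so the positive cut misses that path (a false negative), while the
exterior interval still correctly refuses to rule it out.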
The GRAIL approach uses those two intervals together: the interval
label for node u is given as I(u) = [s_u, m_u, e_u], where EI(u) =
[s_u, e_u] and II(u) = [m_u, e_u] are the exterior (negative-cut) and
interior (positive-cut) intervals, respectively. The motivating idea
in GRAIL is to use interval labeling multiple times, for multiple
random spanning trees, to improve the searching phase (answering
reachability queries). The GRAIL work proposes several heuristics for
choosing the best set of spanning trees. The recommended number of
trees for large graphs is k = 5.

Using multiple spanning trees is not the only solution. An alternative
is trying to get more out of a single DFS numbering. The PReaCH work
(arXiv:1404.4465) adds one more possible, and possibly large,
positive-cut interval (called the "full" interval there), and possibly
one more negative-cut interval (called the "empty" interval). Note
that the original work uses pre-order numbering and pre-max intervals,
but the result can be easily translated to post-order numbering and
min-post intervals (I have done this in the mentioned Jupyter
notebook).

The additional positive-cut interval for node u, if it exists, is the
positive-cut interval of a node w = p_tree(u) that is reachable from u
and not in the positive-cut interval of u, and has the maximum size of
said positive-cut interval. This can be computed iteratively, while
computing the post-order numbering of u. This seems promising for the
graph of revisions.

One can also find an additional negative-cut interval, finding f_gap
so that [f_gap, min] is an "empty" interval, and the negative cut is
split from [f_min, post] into two intervals: [f_min, f_gap] and [min,
post]. This also can be computed iteratively.

The FERRARI approach is more generic. It starts with the fact that the
reachable set R(v) of a vertex v in the DAG is only partly represented
by the spanning tree interval I_T(v), as the tree interval only
accounts for reachability relationships that are preserved in tree T.
The reachable set R(v) can be represented as a set of intervals II(v).
The resulting index (a collection of reachable interval sets) can be
regarded as a materialization of the transitive closure of the graph,
which gets too large and is thus infeasible for large graphs.

Instead of storing the full set of intervals II(v) representing the
reachability set R(v), the FERRARI index assigns a mixture of exact
and approximate reachability intervals, given a user-specified size
constraint on the index. In FERRARI-L (the local version) it is
replaced by at most k intervals, some of which come from II(v) and
some of which are merges of intervals from II(v). If node w is in one
of the exact intervals for node v, then v can reach w; if node w is
outside all of the intervals for node v, then v cannot reach w.
Intervals are chosen in such a way as to maximize the sum of the sizes
of the exact intervals, maximizing positive-cut effectiveness.

In the FERRARI-G variant the user-specified size constraint is made
global; some nodes may use more intervals, and some less, keeping the
average constant. This allows using index space more effectively, at
the cost of a variable-length index and slower access.

Contraction hierarchies
-----------------------

In general, a Reachability Contraction Hierarchy (RCH) can be defined
based on any numbering order. We successively contract nodes with
increasing order(v). Contracting node v means removing it from the
graph and possibly inserting new edges such that the reachability
relation in the new graph is the same as in the previous graph. The
query algorithm does not use the contracted versions of the graph, but
only the ordering and the shortcuts. This algorithm is based on the
observation that in an RCH it suffices to look at "up-down" paths,
with contraction order first increasing, then decreasing; this works
best with bidirectional breadth-first search (BBFS).

The PReaCH approach exploits the fact that any DAG contains source
nodes (with in-degree zero) and sink nodes (with out-degree zero).
Contracting such a node v never requires the insertion of shortcuts.
The index in this case is just one number per node: the contraction
order. I'm not sure how useful that would be for BFS search, without
accessing the reverse graph.

Graph dominance drawing
-----------------------

The fundamental idea behind FELINE is to associate every vertex u in V
with a unique ordered pair of natural integers i(u), so that if vertex
u can reach v, then i(u) ≦ i(v), that is, the first element of the
pair is less than or equal to the first element of the other pair, and
the second element is less than or equal to the second element of the
other pair. If this pair of integers (x, y) associated with a vertex
is understood as coordinates in the Cartesian plane, then the
implication means that paths always point right and up, and it is
enough to check the upper-right quadrant.

The method is inspired by weak dominance drawing, which guarantees the
implication for edges. The FELINE work proves that this leads to the
implication for paths. Such a pair of numbers is a negative-cut
filter. The algorithm for creating such an index in the FELINE work
uses two topological orderings, where the second is chosen to minimize
the number of false positives for the index.

It turns out, however, from the experiments in the mentioned Jupyter
notebook, that this method doesn't work that well with commit graphs.
When there are only a few sources and only a few sinks, the node
positions gather around the 'x = y' diagonal line, and thus the FELINE
index check almost reduces to the topological ordering filter
mentioned earlier.

Approximate set inclusion
-------------------------

Both IP and BFL treat reachability testing from a vertex u to a vertex
v as set-containment testing. Intuitively, if a vertex u can reach a
vertex v, then u must be able to reach all the vertices that v can
reach. Let Out(x) denote the set of all vertices that a vertex x can
reach in a graph. If u can reach v, then Out(v) ⊆ Out(u).
This is tested by checking Out(v) ⊈ Out(u) instead (that Out(v) is not
a subset of or equal to Out(u)), which is to check whether there is at
least one element in Out(v) that is not in Out(u). Instead of getting
an exact answer, both get an answer with a given probability
guarantee, in a negative-cut fashion. The IP label uses k-min-wise
independent permutations, while BFL uses Bloom filters.

Those approaches work well with graphs that do not have large
reachability sets (short depth), and where the ratio of positive
queries to all queries is very small. This means that they most
probably won't work well for commit graphs, where neither holds true.

Summary
=======

The min-post labeling (the base of the GRAIL approach) looks
promising; it needs to be checked whether the PReaCH approach or the
FERRARI approach would be better for commit graphs.

I hope that this somewhat long post was informative,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
                   ` (2 preceding siblings ...)
  2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
@ 2018-05-07 14:26 ` Derrick Stolee
  2018-05-12 14:00   ` Jakub Narebski
  3 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2018-05-07 14:26 UTC (permalink / raw)
  To: Jakub Narebski, git, Derrick Stolee, Jeff King,
      Ævar Arnfjörð Bjarmason

On 5/4/2018 3:40 PM, Jakub Narebski wrote:
> Hello,
>
> With early parts of commit-graph feature (ds/commit-graph and
> ds/lazy-load-trees) close to being merged into "master", see
> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
> I think it would be good idea to think what other data could be added
> there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think
instead we need to finish taking advantage of the data that is already
there. This means landing the generation number patch [1] (I think
this is close, so I'll send a v6 this week if there is no new
feedback.) and the auto-compute patch [2] (this could use more
feedback, but I'll send a v1 based on the RFC feedback if no one
chimes in).

[1] https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
    [PATCH v5 00/11] Compute and consume generation numbers

[2] https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git
log --graph`. The `tag` scenario is probably easier: this can be done
by replacing the revision-walk underlying the call to use
paint_down_to_common() instead. Requires adding an external method to
commit.c, but not too much code.

The tougher challenge is `git log --graph`.
The revision walk
machinery currently uses two precompute phases before iterating
results to the pager: limit_list() and sort_in_topological_order();
these correspond to the two phases of Kahn's algorithm for topo-sort
(compute in-degrees, then walk by peeling commits with in-degree
zero). This requires O(N) time, where N is the number of reachable
commits. Instead, we could make this be O(W) time to output one page
of results, where W is (roughly) the number of reachable commits with
generation number above the last reported result.

In order to take advantage of this approach, the two phases of Kahn's
algorithm need to be done in-line with reporting results to the
pager. This means keeping two queues: one is a priority queue by
generation number that computes in-degrees, the other is a priority
queue (by commit-date or a visit-order value to do the --topo-order
priority) that peels the in-degree-zero commits (and decrements the
in-degree of their parents). I have not begun this refactoring effort
because it appears complicated to me, and it will be hard to tease out
the logic without affecting other consumers of the revision-walk
machinery.

I would love it if someone picked up the `git log --graph` task, since
it will be a few weeks before I have the time to focus on it.

Without completing the benefits we get from generation numbers, these
investigations into other reachability indexes will be incomplete, as
they are comparing benefits without all consumers taking advantage of
a reachability index.

[...]

> Bloom filter for changed paths
> ------------------------------
>
> The goal of this chunk is to speed up checking if the file or directory
> was changed in given commit, for queries such as "git log -- <file>" or
> "git blame <file>".
This is something that according to "Git Merge
> contributor summit notes" [2] is already present in VSTS (Visual Studio
> Team Services - the server counterpart of GVFS: Git Virtual File System)
> at Microsoft:
>
> AV> - VSTS adds bloom filters to know which paths have changed on the commit
> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
> AV> - might be useful in the client as well, since limited-traversal is common
> AV> - if the file history is _very_ sparse, then bloom filter is useful
> AV> - but needs pre-compute, so useful to do once
> AV> - first make the client do it, then think about how to serve it centrally
>
> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>
> I think it was what Derrick Stolee was talking about at the end of his
> part of "Making Git for Windows" presentation at Git Merge 2018:
> https://youtu.be/oOMzi983Qmw?t=1835
>
> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
> trees when reading commit-graph", starting from [3]
> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/

Again, the benefits of Bloom filters should only be measured after
already taking advantage of a reachability index during `git log`.
However, you could get performance benefits from Bloom filters in a
normal `git log` (no topo-order).

The tricky part about this feature is that the decisions we made in
our C# implementation for the VSTS Git server may be very different
than the needs for the C implementation of the Git client. Questions
like "how do we handle merge commits?" may have different answers,
which can only be discovered by implementing the feature.

(The answer for VSTS is that we only store Bloom filters containing
the list of changed paths against the first parent.
The second parent
frequently has too many different paths, and if we are computing
file-history simplification we have already determined the first
parent is _not_ TREESAME, which requires verifying the difference by
parsing trees against the first parent.)

I'm happy to provide more information on how we built this feature if
someone is writing a patch. Otherwise, I plan to implement it after
finishing the parts I think are higher priority.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
@ 2018-05-12 14:00 ` Jakub Narebski
  2018-05-14 13:20   ` Derrick Stolee
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2018-05-12 14:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>
>> With early parts of commit-graph feature (ds/commit-graph and
>> ds/lazy-load-trees) close to being merged into "master", see
>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
>> I think it would be good idea to think what other data could be added
>> there to make Git even faster.
>
> Before thinking about adding more data to the commit-graph, I think
> instead we need to finish taking advantage of the data that is already
> there. This means landing the generation number patch [1] (I think
> this is close, so I'll send a v6 this week if there is no new
> feedback.) and the auto-compute patch [2] (this could use more
> feedback, but I'll send a v1 based on the RFC feedback if no one
> chimes in).
>
> [1] https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
>     [PATCH v5 00/11] Compute and consume generation numbers
>
> [2] https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
>     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.

DIGRESSION: it is commendable that you are sending patches in small,
easily digestible chunks / patch series. It is much easier to review a
10+ patch series than an 80+ patch behemoth (though I understand it is
not always possible to subdivide patch series into smaller
self-contained sub-series).
On the other hand, it would be nice to have some roadmap about the
series and features to be sent in the future, if possible. Something
like what was done when 'git rebase --interactive' was getting
builtinified: moved (in parts) to C. It would be great to have such a
roadmap, with milestones achieved and milestones to be done, in the
cover letter for a series.

> The big wins remaining from this data are `git tag --merged` and `git
> log --graph`. The `tag` scenario is probably easier: this can be done
> by replacing the revision-walk underlying the call to use
> paint_down_to_common() instead. Requires adding an external method to
> commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, besides historical reasons. Maybe performance
is better with the current code?

Utilizing paint_down_to_common() there, besides reducing the amount of
code you would have to modify, would also unify the code (and possibly
reduce its amount). That's very nice.

>
> The tougher challenge is `git log --graph`. The revision walk
> machinery currently uses two precompute phases before iterating
> results to the pager: limit_list() and sort_in_topological_order();
> these correspond to two phases of Kahn's algorithm for topo-sort
> (compute in-degrees, then walk by peeling commits with in-degree
> zero). This requires O(N) time, where N is the number of reachable
> commits. Instead, we could make this be O(W) time to output one page
> of results, where W is (roughly) the number of reachable commits with
> generation number above the last reported result.
A reminder: Kahn's algorithm (described for example in [1] and [2])
looks like this:

  L ← Empty list that will contain the sorted elements
  S ← Collection of all nodes with no incoming edge
  while S is non-empty do
      remove a node 'n' from S
      add 'n' to tail of L
      for each parent 'm' of 'n' do
          decrease the in-degree of 'm'
          if 'm' has in-degree of 0 then
              insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/

> In order to take advantage of this approach, the two phases of Kahn's
> algorithm need to be done in-line with reporting results to the
> pager. This means keeping two queues: one is a priority queue by
> generation number that computes in-degrees,

This simple solution of using a priority queue by generation number
won't work, I think. In-degree is computed from heads down; generation
number is computed from roots up.

For example, in the graph below

  *<---*<---*<---*<---*<---*<---*<---*<---*<---A
   \
    \--*<---B

both A and B have in-degree of 0, but gen(B) << gen(A).

But I might be wrong.

> the other is a priority
> queue (by commit-date or a visit-order value to do the --topo-order
> priority) that peels the in-degree-zero commits (and decrements the
> in-degree of their parents). I have not begun this refactoring effort
> because it appears complicated to me, and it will be hard to tease out
> the logic without affecting other consumers of the revision-walk
> machinery.
>
> I would love it if someone picked up the `git log --graph` task, since
> it will be a few weeks before I have the time to focus on it.

Let's assume that we have extra data (indexes such as generation
number) that can be used for positive-cut (we know that A can reach B)
and negative-cut (we know that A cannot reach B) filters. Generation
number aka. topological level can be used in a negative-cut filter.
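For reference, the textbook Kahn's algorithm quoted above as a small
runnable Python sketch (the toy `parents` map is mine; edges point
from commit to parent, so tips have in-degree zero):

```python
from collections import deque

def kahn_topo_sort(parents):
    """Kahn's algorithm over a commit graph given as node -> list of
    parents: repeatedly peel nodes whose in-degree (number of
    not-yet-emitted children) has dropped to zero."""
    indeg = {u: 0 for u in parents}   # in-degree = number of children
    for u in parents:
        for p in parents[u]:
            indeg[p] += 1
    S = deque(u for u in parents if indeg[u] == 0)  # tips / heads
    L = []                            # will hold the sorted commits
    while S:
        n = S.popleft()
        L.append(n)
        for m in parents[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                S.append(m)
    return L

# Toy history: A <- B, with C and D both children of B
parents = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['B']}
order = kahn_topo_sort(parents)
```

Note that both phases here touch every reachable commit up front,
which is exactly the O(N) cost the quoted text wants to avoid by
interleaving the two phases with output.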
NOTE: I have not looked at the current Git code that does topological
sorting, so as not to be influenced by the existing implementation.

Here is how the index-amplified incremental Kahn's algorithm could
look. First we need to find at least one node / vertex / commit with
an in-degree of zero. We are given a list of commits to start from,
but they may not all have in-degree of zero - some of them may be
reachable from others, and thus be irrelevant. Here negative-cut and
positive-cut filters can help. If the order of commits on the command
line does not matter, we can start from any distinct commit with the
highest generation number - we know that it cannot be reached from
other heads / refs and thus has in-degree of zero.

  L ← Empty list that will contain the sorted elements
  S ← Collection of all nodes with no incoming edge
  R ← Collection of starting points (refs)
  while S is non-empty do
      remove a node 'n' from S
      add 'n' to tail of L
      for each parent 'm' of 'n' do
          if 'm' cannot be reached from R then   # it has in-degree of 0
              insert 'm' into S
          else if 'm' can be reached from R then # it has in-degree greater than 0
              insert 'm' into R
          else
              walk from each 'r' of R, until we know whether 'm' is
              reachable from R, then insert it into S or R,
              perhaps computing in-degrees, or marking commits as
              reachable...

Does it look correct? I can try to turn this pseudocode into an actual
algorithm.

> Without completing the benefits we get from generation numbers, these
> investigations into other reachability indexes will be incomplete as
> they are comparing benefits without all consumers taking advantage of
> a reachability index.

On the one hand, you are right: trying to investigate whether some
reachability index is worth it by checking whether it *improves
performance of actual git operations* without all consumers taking
advantage of a reachability index would be incomplete.
On the other hand, we can still perform synthetic tests: how many fewer
commits we walk when checking that A can reach B on real commit graphs
(like I did in the mentioned Google Colaboratory notebook [3]). This
assumes that the cost of accessing commit data (and possibly also index
data) dominates, and the cost of using reachability indexes is
negligible.

[3]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg

On the gripping hand, the construction of algorithms in those future
steps would be, I think, affected by what types of indexes we would
have: negative-cut filters, positive-cut filters, reachability bitmaps,
Bloom filters for changed files, etc.

> [...]
>> Bloom filter for changed paths
>> ------------------------------
>>
>> The goal of this chunk is to speed up checking if the file or directory
>> was changed in given commit, for queries such as "git log -- <file>" or
>> "git blame <file>". This is something that according to "Git Merge
>> contributor summit notes" [2] is already present in VSTS (Visual Studio
>> Team Services - the server counterpart of GVFS: Git Virtual File System)
>> at Microsoft:
>>
>> AV> - VSTS adds bloom filters to know which paths have changed on the commit
>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
>> AV> - might be useful in the client as well, since limited-traversal is common
>> AV> - if the file history is _very_ sparse, then bloom filter is useful
>> AV> - but needs pre-compute, so useful to do once
>> AV> - first make the client do it, then think about how to serve it centrally
>>
>> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>>
>> I think it was what Derrick Stolee was talking about at the end of his
>> part of "Making Git for Windows" presentation at Git Merge 2018:
>> https://youtu.be/oOMzi983Qmw?t=1835
>>
>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
>> trees when reading commit-graph",
>> starting from [3]
>> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/
>
> Again, the benefits of Bloom filters should only be measured after
> already taking advantage of a reachability index during `git
> log`. However, you could get performance benefits from Bloom filters
> in a normal `git log` (no topo-order).

I wonder how much they improve performance of "git blame"...

> The tricky part about this feature is that the decisions we made in
> our C# implementation for the VSTS Git server may be very different
> than the needs for the C implementation of the Git client. Questions
> like "how do we handle merge commits?" may have different answers,
> which can only be discovered by implementing the feature.
>
> (The answer for VSTS is that we only store Bloom filters containing
> the list of changed paths against the first parent. The second parent
> frequently has too many different paths, and if we are computing
> file-history simplification we have already determined the first
> parent is _not_ TREESAME, which requires verifying the difference by
> parsing trees against the first parent.)

Thanks for the information. I think for now it is a sufficient level of
detail.

> I'm happy to provide more information on how we built this feature if
> someone is writing a patch. Otherwise, I plan to implement it after
> finishing the parts I think are higher priority.

All right, I understand that. Time is a scarce resource.


I think that, beside writing patches for Git, exploring how various
pieces of data and various indexes affect walking commit graphs is also
important. My explorations have shown, for example, that the FELINE index
is not a good fit for relatively "narrow" graphs of revision history.
Exploring this in Python / Jupyter is easier than trying to write an
exploratory patch for Git, in my opinion. Just IMVHO.

Best,
--
Jakub Narębski

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc. 2018-05-12 14:00 ` Jakub Narebski @ 2018-05-14 13:20 ` Derrick Stolee 2018-05-14 20:58 ` Jakub Narebski 0 siblings, 1 reply; 10+ messages in thread From: Derrick Stolee @ 2018-05-14 13:20 UTC (permalink / raw) To: Jakub Narebski Cc: git, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason On 5/12/2018 10:00 AM, Jakub Narebski wrote: > Derrick Stolee <stolee@gmail.com> writes: >> On 5/4/2018 3:40 PM, Jakub Narebski wrote: >>> With early parts of commit-graph feature (ds/commit-graph and >>> ds/lazy-load-trees) close to being merged into "master", see >>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/ >>> I think it would be good idea to think what other data could be added >>> there to make Git even faster. >> Before thinking about adding more data to the commit-graph, I think >> instead we need to finish taking advantage of the data that is already >> there. This means landing the generation number patch [1] (I think >> this is close, so I'll send a v6 this week if there is no new >> feedback.) and the auto-compute patch [2] (this could use more >> feedback, but I'll send a v1 based on the RFC feedback if no one >> chimes in). >> >> [1] >> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/ >> [PATCH v5 00/11] Compute and consume generation numbers >> >> [2] >> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/ >> [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' > Right, so the RFC might be a bit premature; I wanted the discussion to > be out there to think about when adding new uses of existing features. > > > DIGRESSION: it is commendable that you are sending patches in small, > easy digestible chunks / patch series. It is much easier to review 10+ > series than 80+ behemoth (though I understand it is not always possible > to subdivide patch series into smaller self-contained sub-series). 
>
> On the other hand, it would be nice to have some roadmap about series
> and features to be sent in the future, if possible. Something like what
> was done when 'git rebase --interactive' was getting builtinified: moved
> (in parts) to C. It would be great to have such roadmap with milestones
> achieved and milestones to be done in the cover letter for series.

I suppose that is what I intended in the "Future Work" section of
Documentation/technical/commit-graph.txt. It gives a set of things that
need to be done in order to make this a default feature, not just an
expert-level feature. When I wrote the section, I was focused entirely
on "commit-graph version 1.0" and I didn't know how long that would
take. The series have been getting interest and review (in great part
due to your interest, Jakub) so they have been moving to 'next' faster
than I anticipated.

I'll plan on writing a more detailed list of future directions, but for
now I'll list the things I know about and how I see them in priority
order:

Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths

>
>> The big wins remaining from this data are `git tag --merged` and `git
>> log --graph`. The `tag` scenario is probably easier: this can be done
>> by replacing the revision-walk underlying the call to use
>> paint_down_to_common() instead. Requires adding an external method to
>> commit.c, but not too much code.
>
> I wonder if there is some significant reason behind `git tag --merged`
> using its own codepath, beside historical reasons. Maybe performance is
> better with current code?
>
> Utilizing paint_down_to_common() there, beside reducing the amount of code
> you would have to modify, would also unify code (and possibly reduce
> the amount of code). That's very nice.
>
>> The tougher challenge is `git log --graph`. The revision walk
>> machinery currently uses two precompute phases before iterating
>> results to the pager: limit_list() and sort_in_topological_order();
>> these correspond to two phases of Kahn's algorithm for topo-sort
>> (compute in-degrees, then walk by peeling commits with in-degree
>> zero). This requires O(N) time, where N is the number of reachable
>> commits. Instead, we could make this be O(W) time to output one page
>> of results, where W is (roughly) the number of reachable commits with
>> generation number above the last reported result.
>
> A reminder: Kahn's algorithm (described for example in [1] and [2])
> looks like this:
>
> L ← Empty list that will contain the sorted elements
> S ← Collection of all nodes with no incoming edge
> while S is non-empty do
>     remove a node 'n' from S
>     add 'n' to tail of L
>     for each parent 'm' of 'n' do
>         decrease the in-degree of 'm'
>         if 'm' has in-degree of 0 then
>             insert 'm' into S
>
> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
>
>> In order to take advantage of this approach, the two phases of Kahn's
>> algorithm need to be done in-line with reporting results to the
>> pager. This means keeping two queues: one is a priority queue by
>> generation number that computes in-degrees,
>
> This simple solution of using a priority queue by generation number won't
> work, I think. In-degree is computed from heads down, generation number
> is computed from roots up.
>
> For example in the graph below
>
>   *<---*<---*<---*<---*<---*<---*<---*<---*<---A
>    \
>     \--*<---B
>
> both A and B have in-degree of 0, but gen(B) << gen(A).
>
> But I might be wrong.
>
>> the other is a priority
>> queue (by commit-date or a visit-order value to do the --topo-order
>> priority) that peels the in-degree-zero commits (and decrements the
>> in-degree of their parents). I have not begun this refactoring effort
>> because it appears complicated to me, and it will be hard to tease out
>> the logic without affecting other consumers of the revision-walk
>> machinery.
>>
>> I would love it if someone picked up the `git log --graph` task, since
>> it will be a few weeks before I have the time to focus on it.
>
> Let's assume that we have extra data (indexes such as generation number)
> that can be used for positive-cut (we know that A can reach B) and
> negative-cut (we know that A cannot reach B) filters. Generation number,
> aka. topological level, can be used in a negative-cut filter.
>
> NOTE: I have not looked at the current Git code that does topological
> sorting, so as not to be influenced by the existing implementation.
>
>
> Here is how the indexes-amplified incremental Kahn's algorithm could look:
>
> First we need to find at least one node / vertex / commit with an
> in-degree of zero. We are given a list of commits to start from, but
> they may not all have in-degree of zero - they may be dependent, or in
> other words some of them may be reachable from others and thus be
> irrelevant. Here negative-cut and positive-cut filters can help.
>
> If the order of commits on the command line does not matter, we can start
> from any distinct commit with the highest generation number - we know that
> it cannot be reached from other heads / refs and thus has in-degree of
> zero.
>
> L ← Empty list that will contain the sorted elements
> S ← Collection of all nodes with no incoming edge
> R ← Collection of starting points (refs)
> while S is non-empty do
>     remove a node 'n' from S
>     add 'n' to tail of L
>     for each parent 'm' of 'n' do
>         if 'm' cannot be reached from R then
>             # it has in-degree of 0
>             insert 'm' into S
>         else if 'm' can be reached from R then
>             # it has in-degree greater than 0
>             insert 'm' into R
>         else
>             walk from each 'r' of R,
>             until we know if 'm' is reachable from R
>             then insert it into S or R
>
>             perhaps computing in-degrees,
>             or marking commits as reachable...
>
>
> Does it look correct? I can try to turn this pseudocode into an actual
> algorithm.

I'm not following your pseudocode very well, so instead I'll provide a
more concrete description of what I mentioned before:

Here are some data structures.

IDV : A dictionary storing the currently-computed in-degree value of a
      commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.

IDQ : A queue, for storing the "in-degree walk". It is a priority
      queue ordered by generation number.

TOQ : A queue, for storing the "topo-order walk". It is a priority
      queue ordered by "visit order" (see algorithm)

Here are some methods.
AdvanceInDegreeWalk(IDQ, IDV, gen):
    while !IDX.Empty && IDQ.Max >= gen:
        c = IDX.Dequeue

        for each parent p of c:
            IDV[p]++;
            IDQ.Enqueue(p, generation(p))

InitializeTopoOrder(R):
    Initialize IDQ, IDV, TOQ

    min_generation = GENERATION_NUMBER_INFINITY
    visit_order = 0

    for each commit c in R:
        min_generation = min(min_generation, generation(c))
        IDQ.Enqueue(c, generation(c))

    AdvanceInDegreeWalk(IDQ, IDV, min_generation)

    for each commit c in R:
        if IDV[c] == 0:
            TOQ.Enqueue(c, visit_order++)

    return (IDQ, IDV, TOQ, visit_order)

GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
    if TOQ.Empty:
        return null

    c = TOQ.Dequeue()

    for each parent p of c:
        IDV[p]--
        AdvanceInDegreeWalk(IDQ, IDV, generation(p))
        if IDV[p] == 0:
            TOQ.Enqueue(p, visit_order++)

    return c

(I mention "ref visit_order" to be sure we are passing-by-reference. In
a full implementation, the walk details would be grouped into a struct.)

This algorithm is relatively simple, but the hard part is teasing the
revision walk machinery to initialize the data by calling
InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it
needs a new element of the walk. That is, it is hard to be sure we are
not causing side-effects as we make that transformation.

>
>> Without completing the benefits we get from generation numbers, these
>> investigations into other reachability indexes will be incomplete as
>> they are comparing benefits without all consumers taking advantage of
>> a reachability index.
>
> On the one hand, you are right: trying to investigate whether some
> reachability index is worth it by checking if it *improves performance
> of actual git operations* would be incomplete without all consumers
> taking advantage of a reachability index.
>
> On the other hand, we can still perform synthetic tests: how many fewer
> commits we walk when checking that A can reach B on real commit graphs
> (like I did in the mentioned Google Colaboratory notebook [3]).
This > assumes that the cost of accessing commit data (and possibly also > indexes data) dominates, and the cost of using reachability indexes is > negligible. > > [3]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg > > On the gripping hand the construction of algorithms in those future > steps would be, I think, affected by what types of indexes would we > have: negative-cut filters, positive-cut filters, reachability bitmaps, > Bloom filters for changed files, eic. > >> [...] >>> Bloom filter for changed paths >>> ------------------------------ >>> >>> The goal of this chunk is to speed up checking if the file or directory >>> was changed in given commit, for queries such as "git log -- <file>" or >>> "git blame <file>". This is something that according to "Git Merge >>> contributor summit notes" [2] is already present in VSTS (Visual Studio >>> Team Services - the server counterpart of GVFS: Git Virtual File System) >>> at Microsoft: >>> >>> AV> - VSTS adds bloom filters to know which paths have changed on the commit >>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks >>> AV> - might be useful in the client as well, since limited-traversal is common >>> AV> - if the file history is _very_ sparse, then bloom filter is useful >>> AV> - but needs pre-compute, so useful to do once >>> AV> - first make the client do it, then think about how to serve it centrally >>> >>> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ >>> >>> I think it was what Derrick Stolee was talking about at the end of his >>> part of "Making Git for Windows" presentation at Git Merge 2018: >>> https://youtu.be/oOMzi983Qmw?t=1835 >>> >>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load >>> trees when reading commit-graph", starting from [3] >>> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/ >> Again, the benefits of Bloom filters should only be measured after >> 
already taking advantage of a reachability index during `git >> log`. However, you could get performance benefits from Bloom filters >> in a normal `git log` (no topo-order). > I wonder how much they improve performance of "git blame"... > >> The tricky part about this feature is that the decisions we made in >> our C# implementation for the VSTS Git server may be very different >> than the needs for the C implementation of the Git client. Questions >> like "how do we handle merge commits?" may have different answers, >> which can only be discovered by implementing the feature. >> >> (The answer for VSTS is that we only store Bloom filters containing >> the list of changed paths against the first parent. The second parent >> frequently has too many different paths, and if we are computing >> file-history simplification we have already determined the first >> parent is _not_ TREESAME, which requires verifying the difference by >> parsing trees against the first parent.) > Thanks for the information. I think for now it is sufficient level of > the detail. > >> I'm happy to provide more information on how we built this feature if >> someone is writing a patch. Otherwise, I plan to implement it after >> finishing the parts I think are higher priority. > All right, I understand that. Time is a scarse resource. > > > I think that, beside writing patches for Git, exploring how various > pieces of data and various indexes affect walking commit graphs is also > important. My explorations shown that, for example, that FELINE index > is not good fit for relatively "narrow" graphs of revision history. > Exploring this in Python / Jupyter is easier than trying to write a > exploratory patch for Git, in my opinion. Just IMVHO. You are right. Ruling out possibilities is the best outcome these prototypes can have. Your work has saved someone a lot of time in investigating that direction. Thanks, -Stolee ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc. 2018-05-14 13:20 ` Derrick Stolee @ 2018-05-14 20:58 ` Jakub Narebski 2018-05-15 10:01 ` Johannes Schindelin 0 siblings, 1 reply; 10+ messages in thread From: Jakub Narebski @ 2018-05-14 20:58 UTC (permalink / raw) To: Derrick Stolee Cc: git, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 5/12/2018 10:00 AM, Jakub Narebski wrote: >> Derrick Stolee <stolee@gmail.com> writes: >>> On 5/4/2018 3:40 PM, Jakub Narebski wrote: >>>> >>>> With early parts of commit-graph feature (ds/commit-graph and >>>> ds/lazy-load-trees) close to being merged into "master", see >>>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/ >>>> I think it would be good idea to think what other data could be added >>>> there to make Git even faster. >>> Before thinking about adding more data to the commit-graph, I think >>> instead we need to finish taking advantage of the data that is already >>> there. This means landing the generation number patch [1] (I think >>> this is close, so I'll send a v6 this week if there is no new >>> feedback.) and the auto-compute patch [2] (this could use more >>> feedback, but I'll send a v1 based on the RFC feedback if no one >>> chimes in). >>> >>> [1] >>> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/ >>> [PATCH v5 00/11] Compute and consume generation numbers >>> >>> [2] >>> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/ >>> [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' >> >> Right, so the RFC might be a bit premature; I wanted the discussion to >> be out there to think about when adding new uses of existing features. >> >> >> DIGRESSION: it is commendable that you are sending patches in small, >> easy digestible chunks / patch series. 
It is much easier to review 10+
>> series than an 80+ behemoth (though I understand it is not always possible
>> to subdivide patch series into smaller self-contained sub-series).
>>
>> On the other hand, it would be nice to have some roadmap about series
>> and features to be sent in the future, if possible. Something like what
>> was done when 'git rebase --interactive' was getting builtinified: moved
>> (in parts) to C. It would be great to have such roadmap with milestones
>> achieved and milestones to be done in the cover letter for series.
>
> I suppose that is what I intended in the "Future Work" section of
> Documentation/technical/commit-graph.txt. It gives a set of things
> that need to be done in order to make this a default feature, not just
> an expert-level feature. When I wrote the section, I was focused
> entirely on "commit-graph version 1.0" and I didn't know how long that
> would take. The series have been getting interest and review (in great
> part due to your interest, Jakub) so they have been moving to 'next'
> faster than I anticipated.
>
> I'll plan on writing a more detailed list of future directions, but
> for now I'll list the things I know about and how I see them in
> priority order:
>
> Commit-graph v1.0:
> * ds/generation-numbers
> * 'verify' and fsck/gc integration
> * correct behavior with shallow clones, replace-objects, and grafts

So the goal of the current v1.0 phase is to introduce generation
numbers, use them for better performance ("low hanging fruit"), and
ensure that it is automatic and safe -- thus usable for an ordinary
user.

>
> Commit-graph v1.1:
> * Place commit-graph storage in the_repository
> * 'git tag --merged' use generation numbers
> * 'git log --graph' use generation numbers
>
> Commit-graph v1.X:
> * Protocol v2 verb for sending commit-graph
> * Bloom filters for changed paths

Thanks, that is what I was missing.
The "Future Work" section, while very nice to have (because it does not
require following git development; it is here in the technical
documentation), lacked prioritization and a rough milestone map.

[...]
>>> The tougher challenge is `git log --graph`. The revision walk
>>> machinery currently uses two precompute phases before iterating
>>> results to the pager: limit_list() and sort_in_topological_order();
>>> these correspond to two phases of Kahn's algorithm for topo-sort
>>> (compute in-degrees, then walk by peeling commits with in-degree
>>> zero). This requires O(N) time, where N is the number of reachable
>>> commits. Instead, we could make this be O(W) time to output one page
>>> of results, where W is (roughly) the number of reachable commits with
>>> generation number above the last reported result.
>>
>> A reminder: Kahn's algorithm (described for example in [1] and [2])
>> looks like this:
>>
>> L ← Empty list that will contain the sorted elements
>> S ← Collection of all nodes with no incoming edge
>> while S is non-empty do
>>     remove a node 'n' from S
>>     add 'n' to tail of L
>>     for each parent 'm' of 'n' do
>>         decrease the in-degree of 'm'
>>         if 'm' has in-degree of 0 then
>>             insert 'm' into S
>>
>> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
>> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
[...]
> I'm not following your pseudocode very well,

Not surprising, I've lost myself in the middle of writing it...

> so instead I'll provide a
> more concrete description of what I mentioned before:

Before we start with helper data structures, we should start with the
inputs. For topological sort it is the set of starting commits that
define the part of the commit graph we are interested in; there may be
some that are redundant, but those that are not are assumed to have
in-degree of 0.

> Here are some data structures.
> IDX : ???
> IDV : A dictionary storing the currently-computed in-degree value of a
>       commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.

Nitpick: I guess that it is really "with assumed in-degree of 0". This
would be stored using commit-slab, wouldn't it?

> IDQ : A queue, for storing the "in-degree walk". It is a priority
>       queue ordered by generation number.

All right, here we use the fact that if we walk up to and including the
generation number of a commit, then its in-degree will be calculated
correctly, as any commit with a generation number smaller than that of
the given commit cannot reach the given commit and change its in-degree.

> TOQ : A queue, for storing the "topo-order walk". It is a priority
>       queue ordered by "visit order" (see algorithm)

Note: a max priority queue ordered by "visit order" is practically a
stack, though I guess we prefer a priority queue here to use the same
code for '--date-order'.

>
> Here are some methods.
>
> AdvanceInDegreeWalk(IDQ, IDV, gen):

If I understand it correctly, this subroutine walks those commits that
do not yet have a correct in-degree calculated, and walks until all
commits with generation number greater than or equal to the 'gen' cutoff
parameter are walked and have a correct in-degree. It assumes that all
commits in the graph but not in IDX have their in-degree calculated.

>     while !IDX.Empty && IDQ.Max >= gen:
>         c = IDX.Dequeue
>
>         for each parent p of c:

Here the ordering of parents walked does not matter (much), as the
generation number is used as a cutoff only.

>             IDV[p]++;
>             IDQ.Enqueue(p, generation(p))

All right, looks correct; with gen = 0 it probably reduces to the
current code that does in-degree calculation upfront.
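In Python prototype terms, the walk above could be sketched like so (a
max-priority queue by generation emulated by negating heapq keys; the
stray IDX in the quoted pseudocode is read as IDQ, and - unlike in the
quoted version - each commit is enqueued only on first discovery, so it
is dequeued exactly once):

```python
import heapq

def advance_in_degree_walk(idq, idv, parents, gen, cutoff):
    # Pop commits in decreasing generation order until every commit
    # with generation >= cutoff has been processed. Any commit not yet
    # visited then has a strictly lower generation, so it can no
    # longer change the in-degree of the processed commits.
    while idq and -idq[0][0] >= cutoff:
        _, c = heapq.heappop(idq)
        for p in parents[c]:
            seen = p in idv
            idv[p] = idv.get(p, 0) + 1      # IDV[p]++
            if not seen:                    # enqueue only once
                heapq.heappush(idq, (-gen(p), p))
```

Seeding idq with the heads and calling this with cutoff = 0 should then
reduce to the upfront in-degree calculation mentioned above.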
>
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ
>
>     min_generation = GENERATION_NUMBER_INFINITY
>     visit_order = 0
>
>     for each commit c in R:
>         min_generation = min(min_generation, generation(c))
>         IDQ.Enqueue(c, generation(c))
>
>     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

This means that we walk BFS-like until all starting points, that is all
commits in R, have their in-degree calculated. I think we can avoid
that, but it may change the exact behavior of '--topo-order'. Currently
if one commit in R is from an orphan branch (like 'todo' in git.git), or
has an otherwise very low generation number, it would mean that
AdvanceInDegreeWalk would walk almost all commits.

>     for each commit c in R:
>         if IDV[c] == 0:
>             TOQ.Enqueue(c, visit_order++)

I think we may want to start either from commits in the order given on
the command line (given by R), and not in reverse order, or maybe even
from commits with the maximum generation number.

>
>     return (IDQ, IDV, TOQ, visit_order)

All right.

Here is my [allegedly] improved version, which assumes that we always
want to start from commits with the maximum generation number (there may
be more than one such commit).

Let's add one more data structure:

RRQ : A queue, for storing remaining commits from R (starting point).
      It is a priority queue ordered by generation number.
InitializeTopoOrder(R):
    Initialize IDQ, IDV, TOQ, RRQ

    max_generation = 0
    visit_order = 0

    for each commit c in R:
        max_generation = max(max_generation, generation(c))
        unless IsRedundantFilter(R / c, c):  # optional
            RRQ.Enqueue(c, generation(c))

    while RRQ.Max == max_generation:
        c = RRQ.Dequeue()
        IDV[c] = 0  # commits with max gen have in-degree of 0
        IDQ.Enqueue(c, generation(c))

    # this would be almost no-op
    AdvanceInDegreeWalk(IDQ, IDV, max_generation)

    for each commit c in reversed R:
        if generation(c) == max_generation:  # then IDV[c] == 0
            TOQ.Enqueue(c, visit_order++)

    return (IDQ, IDV, TOQ, RRQ, visit_order)

>
> GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
>     if TOQ.Empty:
>         return null
>
>     c = TOQ.Dequeue()
>
>     for each parent p of c:
>         IDV[p]--
>         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
>         if IDV[p] == 0:
>             TOQ.Enqueue(p, visit_order++)
>
>     return c

With the modification to InitializeTopoOrder it would be

GetNextInTopoOrder(IDQ, IDV, TOQ, RRQ, ref visit_order):
    if TOQ.Empty:
        return null

    c = TOQ.Dequeue()

    for each parent p of c, sorted by generation number:
        # some of maybe in-degree zero are now surely in-degree zero
        while RRQ.Max > generation(p):
            r = RRQ.Dequeue()
            IDQ.Enqueue(r, generation(r))
        AdvanceInDegreeWalk(IDQ, IDV, generation(p))

        IDV[p]--
        if IDV[p] == 0:
            TOQ.Enqueue(p, visit_order++)

    return c

> (I mention "ref visit_order" to be sure we are
> passing-by-reference. In a full implementation, the walk details would
> be grouped into a struct.)
>
> This algorithm is relatively simple, but the hard part is teasing the
> revision walk machinery to initialize the data by calling
> InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it
> needs a new element of the walk. That is, it is hard to be sure we are
> not causing side-effects as we make that transformation.

Right. I'll take a look at existing code to find out how involved that
would be.
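For experimenting, the whole scheme (in its simpler variant, without the
RRQ refinement, and with "visit order" treated as a max priority, i.e. a
stack) fits in a small lazy Python generator over a toy
{commit: [parents]} graph - an illustrative prototype only, nothing like
the actual revision-walk machinery:

```python
import heapq
from itertools import count

def topo_order(parents, gen, heads):
    idv = {}       # currently-known in-degrees (IDV)
    idq = []       # in-degree walk: max-heap by generation (IDQ)
    toq = []       # topo-order walk: max-heap by visit order (TOQ)
    visit = count()

    def advance(cutoff):
        # AdvanceInDegreeWalk: finalize in-degrees of all commits
        # with generation >= cutoff; each commit is enqueued once.
        while idq and -idq[0][0] >= cutoff:
            _, c = heapq.heappop(idq)
            for p in parents[c]:
                seen = p in idv
                idv[p] = idv.get(p, 0) + 1
                if not seen:
                    heapq.heappush(idq, (-gen(p), p))

    # InitializeTopoOrder
    for h in heads:
        idv.setdefault(h, 0)
        heapq.heappush(idq, (-gen(h), h))
    advance(min(gen(h) for h in heads))
    for h in heads:
        if idv[h] == 0:
            heapq.heappush(toq, (-next(visit), h))

    # GetNextInTopoOrder, repeated: emit one commit at a time
    while toq:
        _, c = heapq.heappop(toq)
        for p in parents[c]:
            advance(gen(p))   # make IDV[p] final before testing it
            idv[p] -= 1
            if idv[p] == 0:
                heapq.heappush(toq, (-next(visit), p))
        yield c
```

The consumer can stop pulling from the generator after one page of
output, having paid only for the part of the graph at or above the
generation number of the last reported commit.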
It is a pity that we cannot simply turn the in-degree calculation into a
producer that calculates in-degrees in generation number order, and
spits them to a FIFO pipe / channel / supply / blocking queue, while the
topo-order calculation reads from the in-degree pipe, and spits output
to the pager...

I wonder if we could use the Scientist tool to make sure that new and
old code give the same results.

[...]
>> I think that, beside writing patches for Git, exploring how various
>> pieces of data and various indexes affect walking commit graphs is also
>> important. My explorations have shown, for example, that the FELINE
>> index is not a good fit for relatively "narrow" graphs of revision
>> history. Exploring this in Python / Jupyter is easier than trying to
>> write an exploratory patch for Git, in my opinion. Just IMVHO.

Actually I should have said that the FELINE index for the commit graphs
that I have checked does not show improvement over simply using some
topological ordering as a negative-cut filter.

>
> You are right. Ruling out possibilities is the best outcome these
> prototypes can have. Your work has saved someone a lot of time in
> investigating that direction.

Regards,
--
Jakub Narębski

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
  2018-05-14 20:58 ` Jakub Narebski
@ 2018-05-15 10:01 ` Johannes Schindelin
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Schindelin @ 2018-05-15 10:01 UTC (permalink / raw)
To: Jakub Narebski
Cc: Derrick Stolee, git, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason

[-- Attachment #1: Type: text/plain, Size: 2230 bytes --]

Hi Kuba,

On Mon, 14 May 2018, Jakub Narebski wrote:

> [... lots and lots of discussions...]
>
> All right.
>
> Here is my [allegedly] improved version, which assumes that we always
> want to start from commit with maximum generation number (there may be
> more than one such commit).
>
> Let's add one more data structure:
>
> RRQ : A queue, for storing remaining commits from R (starting point).
>       It is a priority queue ordered by generation number.
>
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ, RRQ
>
>     max_generation = 0
>     visit_order = 0
>
>     for each commit c in R:
>         max_generation = max(max_generation, generation(c))
>         unless IsRedundantFilter(R / c, c):  # optional
>             RRQ.Enqueue(c, generation(c))
>
>     while RRQ.Max == max_generation:
>         c = RRQ.Dequeue()
>         IDV[c] = 0  # commits with max gen have in-degree of 0
>         IDQ.Enqueue(c, generation(c))
>
>     # this would be almost no-op
>     AdvanceInDegreeWalk(IDQ, IDV, max_generation)
>
>     for each commit c in reversed R:
>         if generation(c) == max_generation:  # then IDV[c] == 0
>             TOQ.Enqueue(c, visit_order++)
>
>     return (IDQ, IDV, TOQ, RRQ, visit_order)

Isn't it premature to throw out code at this stage? My impression from
the sidelines is that Stolee is trying to get concrete code implemented,
code that can be tested against concrete repositories, so that the
question of singular overall importance can be asked: is it worth it, is
it really faster? And by how much?

It may not be the best idea to distract Stolee from said implementation
(i.e. step 1) by getting ahead ourselves with steps 17, 18 or 19.
The next steps would seem to be steps 2 and 3, and I am sure that Stolee
would appreciate your help in implementing them.

So at this point it would maybe make a lot more sense to help Stolee
e.g. by refactoring the rev-list code that is a *mess* around the topo
order, so that all kinds of cute algorithms can be implemented
*directly* in Git, and put to the ultimate test.

Ciao,
Dscho

^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-05-15 10:01 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski 2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason 2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason 2018-05-05 13:28 ` Jakub Narebski 2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski 2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee 2018-05-12 14:00 ` Jakub Narebski 2018-05-14 13:20 ` Derrick Stolee 2018-05-14 20:58 ` Jakub Narebski 2018-05-15 10:01 ` Johannes Schindelin
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).