* [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
2009-06-04 14:05 ` [PATCH 4/7] rev-cache: allow multiple 'start' objects per index Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
` (4 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
As well as storing the sorted list of objects, store a hash table for
faster lookup.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index cc18535..8349cfe 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -14,6 +14,9 @@ A revision cache contains;
* object ID
+ - A hash table from an (abbreviated) object ID to a position into
+ the above list
+
Start Object
------------
@@ -35,6 +38,18 @@ objects are sorted as if they were commit objects with a single
parent, the object they tag.
+Included object hash table
+--------------------------
+
+This table is used to quickly determine whether an object exists in
+the cache without scanning the entire topological list.
+
+Entries in the object hash table can be shortened, e.g. to 3 or 4
+bytes; they need only be long enough to avoid collisions among the
+objects which exist in the list. Any match must be confirmed by
+checking the full SHA-1 in the topological list.
+
+
Use Cases
---------
In this section, the key functions and operations that this index is
@@ -131,7 +146,8 @@ passed, or any 'uninteresting' objects were passed.
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
-discovered to the topological object list.
+discovered to the topological object list. It must also build a hash
+table of object IDs and emit it at the end.
receive-pack/pack-objects
@@ -186,9 +202,9 @@ emitted, the delta from the packfile is re-used. If a loop is
detected or the delta base is not in the returned set of objects, then
the delta is first resolved.
-This implies that the list of objects is first loaded into a hash
-table prior to returning any objects; however this is probably
-acceptable as the entire list is in one stream and will load quickly.
+This implies that each delta base must be looked up in the on-disk
+hash table as they are written, which is both low impact and memory
+efficient.
For later fetches, the revision cache is not appropriate as they will
have 'uninteresting' objects set.
--
debian.1.5.6.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
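
The abbreviated-ID lookup described in this patch (short keys in the
hash table, with every match confirmed against the full SHA-1 held in
the topological list) can be sketched as below. This is a toy model,
not git code; the structure names, the 4-byte abbreviation, and the
fixed table size are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_SZ 20     /* SHA-1 */
#define ABBREV 4       /* abbreviated key length, e.g. 4 bytes */
#define TABLE_SZ 1024  /* toy table size; a real format would size to fit */

struct topo_entry {
    unsigned char sha1[HASH_SZ]; /* full ID, kept in the topological list */
};

struct slot {
    unsigned char key[ABBREV];   /* abbreviated object ID */
    int pos;                     /* position into the list, -1 = empty */
};

static unsigned bucket(const unsigned char *sha1)
{
    /* the leading bytes of a SHA-1 are already uniformly distributed */
    uint32_t h;
    memcpy(&h, sha1, sizeof(h));
    return h % TABLE_SZ;
}

/* insert with linear probing */
static void table_add(struct slot *table, const unsigned char *sha1, int pos)
{
    unsigned i = bucket(sha1);
    while (table[i].pos != -1)
        i = (i + 1) % TABLE_SZ;
    memcpy(table[i].key, sha1, ABBREV);
    table[i].pos = pos;
}

/* returns position into the topological list, or -1; every abbreviated
 * match is confirmed against the full SHA-1 stored in the list */
static int table_lookup(const struct slot *table,
                        const struct topo_entry *list,
                        const unsigned char *sha1)
{
    unsigned i = bucket(sha1);
    while (table[i].pos != -1) {
        if (!memcmp(table[i].key, sha1, ABBREV) &&
            !memcmp(list[table[i].pos].sha1, sha1, HASH_SZ))
            return table[i].pos;
        i = (i + 1) % TABLE_SZ;
    }
    return -1;
}
```

With 3- or 4-byte keys the table stays compact, and collisions are
harmless because the full ID in the list is always checked before a
hit is returned.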
* [PATCH 5/7] revision cache: maps of 'new' objects
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 4/7] rev-cache: allow multiple 'start' objects per index Sam Vilain
` (6 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
When making a pack, it is useful to know if an object is already
reachable from any of the objects that the receiving party is known to
have; it allows the object to be excluded from the pack. Allow
optional bitmaps for 'start' objects which permit storage of this
information.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 24 +++++++++++++++++++++++-
1 files changed, 23 insertions(+), 1 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 198c33a..e0adb26 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -28,6 +28,11 @@ A revision cache contains;
space. A count of how many bits are '1' is included. A separate
bitmap is included for the 'start' objects.
+ - Optionally, for any 'end' object, a 'newness' bitmap indicating
+ which of the objects in the object list are reachable from that
+ 'end' object, but not reachable from the 'start' objects reachable
+ from that 'end' object.
+
Start Objects and End Objects
-----------------------------
@@ -120,6 +125,9 @@ generating a newer one which covers objects created after it. This
approach will be able to answer many questions, but not topological
ordering between objects which appeared in different caches.
+Stacking revision caches is essential for being able to generate the
+'newness' bitmaps efficiently.
+
Revision Walker
~~~~~~~~~~~~~~~
@@ -144,7 +152,7 @@ required. Some API exists to do this, and the return value from
'rev_cache_ok' is true if suitable caches were found for the
information required, and false if it would require a graph traversal:
- rev_cache_options( ordered? )
+ rev_cache_options( ordered?, new? )
rev_cache_ok( ) : Bool
The rev_cache_ok() function must open all available revision caches,
@@ -163,6 +171,11 @@ If any 'uninteresting' objects were passed, then the return value is
true if the suitability function passes for all of the revision caches
which are used.
+If the 'new' flag is set to true, then the return value is only true
+if the 'uninteresting' objects also have suitable revision caches
+leading back to the start of history (or last 'shallow' point), or the
+'newness' bitmaps are present for all of the 'start' references used.
+
Returning objects
^^^^^^^^^^^^^^^^^
@@ -208,6 +221,10 @@ Then, it must repeat the topological walk for each of the 'start'
objects, looking up each object in the contents hash for a sequence
number, set a bit in the bitmap for that 'start' object, and finally
RLE compress it.
+If there are suitable other revision caches available, then the
+'newness' bitmaps are built by checking with the stacked revision
+caches whether the object exists and is reachable from the 'start'
+objects which connect to those stacked caches.
receive-pack/pack-objects
@@ -250,6 +267,11 @@ If multiple 'start' objects were used, then the count can be returned
by decompressing and OR'ing the bitmaps together, counting the total
number of '1's in the bitmap.
+If the 'newness' bitmaps are available, then these can be used to
+avoid sending objects which the other end will already have.
+When applicable, the 'newness' bitmaps are masked against the
+combined reachability bitmaps to derive a shorter list of objects.
+
If multiple revision caches were used, but not in single file, then
the number of objects can be obtained by building a hash table of all
the objects in all of the caches, and returning the number of hash
--
debian.1.5.6.1
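
The masking described in this patch, where the 'newness' bitmaps are
combined with the reachability bitmaps to shorten the list of objects
to send, reduces to a bitwise OR followed by an AND and a popcount. A
minimal sketch with uncompressed toy bitmaps (the on-disk bitmaps
would be RLE-compressed, and the function name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 4   /* toy bitmap length: 4 * 64 = 256 objects */

/* mask the combined reachability bitmaps of two 'start' objects
 * against the 'newness' bitmap, returning how many objects remain
 * to be sent */
static int objects_to_send(const uint64_t *reach_a,
                           const uint64_t *reach_b,
                           const uint64_t *newness,
                           uint64_t *out)
{
    int count = 0;
    int i;
    for (i = 0; i < WORDS; i++) {
        out[i] = (reach_a[i] | reach_b[i]) & newness[i];
        count += __builtin_popcountll(out[i]);
    }
    return count;
}
```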
* [PATCH 4/7] rev-cache: allow multiple 'start' objects per index
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup Sam Vilain
` (5 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
We can re-use the index for multiple 'start' objects sharing the same
'end' objects, using one list and a reachability bitmap for each
'start' object.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 106 +++++++++++++++++++---------
1 files changed, 73 insertions(+), 33 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 759d78d..198c33a 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -6,12 +6,13 @@ reachability operations to be answered quickly.
A revision cache contains;
- - A 'start' object (ie, 'interesting' object ID)
+ - A list of 'start' objects (ie, 'interesting' object IDs)
- A list of 'end' objects, which may be empty (ie, 'uninteresting'
object IDs)
- - A list of objects which are referred to by the 'start' object
+ - A list of objects which are reachable from any 'start' object
+ without crossing 'end' objects, including:
* position when sorted in --date-order
@@ -20,13 +21,25 @@ A revision cache contains;
- A hash table from an (abbreviated) object ID to a position into
the above list
+ - For each 'end' object, a bit sequence indicating which of the
+ objects in the object list are reachable from that 'end' object.
+ These bitmaps are stored in the same order as the object list and
+ then RLE-compressed, so in many cases will not take much on-disk
+ space. A count of how many bits are '1' is included. A separate
+ bitmap is included for the 'start' objects.
-Start Objects and End Object
-----------------------------
+
+Start Objects and End Objects
+-----------------------------
The 'start' object, and 'end' objects are the identifying key of the
revision cache.
+Revision caches for multiple 'start' objects can re-use the same
+object list, conserving space and adding flexibility to the index. As
+such, a single revision cache is logically considered to be multiple
+indexes, unless it is of benefit to consider it a single index.
+
The 'end' objects must be reachable from the 'start' objects, and none
of the 'end' objects may be reachable from other 'end' objects.
@@ -71,27 +84,36 @@ Determining Cache Suitability - rev_cache_suitable()
This is an auxiliary function used by the other use cases.
A revision cache is suitable whenever the walker encounters the single
-object which is the 'end' object of the revision cache, and none of
-the 'uninteresting' revisions to the walker are in the revision cache.
+object which is any of the 'end' objects of the revision cache, and
+none of the 'uninteresting' revisions to the walker are in the
+revision cache.
The function is:
rev_cache_suitable( rev_cache, interesting, uninteresting [] )
The check is simple and fast - it first compares the 'interesting'
-object to the 'start' object in the revision cache, then the
+object to the 'start' object list in the revision cache, then the
'uninteresting' objects that the walker is using are looked up in the
contents hash table.
-Only if none of them are found is the revision cache suitable.
+If they are found, then the reachability bitmaps for the matching
+'start' objects are consulted to determine whether the object was
+actually reachable from those 'start' objects.
+Only if none of them are reachable is the revision cache suitable.
Revision cache stacking
^^^^^^^^^^^^^^^^^^^^^^^
-If there are 'end' objects in the revision cache used, which do not
+If there are 'end' objects in the revision cache used, which are
+reachable from the 'start' objects in use, and do not
match the 'uninteresting' objects in the walker, the function may
recurse, looking for other revision caches which have the unwanted
-'end' object as their 'start' object.
+'end' object in their 'start' object list.
+
+Supporting multiple 'start' objects allows for more instances of
+'single file' stacking, where at no point is the stack wider than one
+revision cache.
Stacking revision caches is a way to re-use an older index by just
generating a newer one which covers objects created after it. This
@@ -126,18 +148,20 @@ information required, and false if it would require a graph traversal:
rev_cache_ok( ) : Bool
The rev_cache_ok() function must open all available revision caches,
-and see if their 'interesting' object matches the single 'interesting'
-object passed to rev_cache_add(). If it matches, it returns true. If
-multiple 'interesting' objects were specified and 'ordered' is true,
-then the function returns false.
+and see if all of the 'interesting' objects passed to rev_cache_add()
+can be found as 'start' objects in available revision caches.
+If they cannot, the function returns false.
+If multiple 'interesting' objects were specified and 'ordered' is
+true, then the function returns false, unless all of the 'interesting'
+objects were found in a single revision cache and no stacking is
+required.
-If the ordering flag was set to false, then all of the 'interesting'
-objects must be found in separate revision caches for the function to
-return true.
+If the ordering flag was set to false, then this restriction is
+relaxed. Objects may be found in separate revision caches.
If any 'uninteresting' objects were passed, then the return value is
-true if the suitability function passes for the revision caches which
-are used.
+true if the suitability function passes for all of the revision caches
+which are used.
Returning objects
@@ -176,13 +200,14 @@ an 'interesting' object added, then call:
rev_cache_create( )
-This function will not work, if multiple 'interesting' objects are
-passed.
-
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
-discovered to the topological object list. It must also build a hash
-table of object IDs and emit it at the end.
+discovered to the topological object list.
+It must also build a hash table of object IDs and emit it at the end.
+Then, it must repeat the topological walk for each of the 'start'
+objects, looking up each object in the contents hash for a sequence
+number, set a bit in the bitmap for that 'start' object, and finally
+RLE compress it.
receive-pack/pack-objects
@@ -215,13 +240,23 @@ Returning object counts
If the revision cache was suitable, eg for a complete fetch, then the
number of objects can be obtained by counting the size of the object
list.
-
-If multiple revision caches were used for multiple 'interesting'
-objects, then the number of objects can be obtained by building a hash
-table of all the objects in all of the caches, and returning the
-number of hash table entries. In the single-file stacked revision
-cache case, the total can be found by adding up the total lengths of
-the object lists.
+If the revision cache was entirely suitable, with all 'start' objects
+matching all 'interesting' objects in the walker, then the total
+number of objects in the cache is the number of objects to be
+returned.
+If a single 'start' object was used, then the stored count of '1's in
+its bitmap can be returned directly.
+If multiple 'start' objects were used, then the count can be returned
+by decompressing and OR'ing the bitmaps together, counting the total
+number of '1's in the bitmap.
+
+If multiple revision caches were used, but not in single file, then
+the number of objects can be obtained by building a hash table of all
+the objects in all of the caches, and returning the number of hash
+table entries.
+In the single-file stacked revision cache case, the total can be found
+by adding up the total lengths of the object lists that apply to each
+single-file section.
If 'uninteresting' objects were passed, the caches used must be
suitable according to the cache suitability function.
@@ -241,8 +276,13 @@ detected or the delta base is not in the returned set of objects, then
the delta is first resolved.
This implies that each delta base must be looked up in the on-disk
-hash table as they are written, which is both low impact and memory
-efficient.
+hash table as they are written to the network, which is both low
+impact and memory efficient.
+Where not all of the 'start' objects are used, the bitmaps for the
+'start' objects in use will need to be kept in memory to supplement
+the results of this hash lookup.
+This is a small extra indirection likely to have only a minimal
+performance penalty.
For later fetches, the revision cache will only be appropriate if the
'have' objects sent by the remote are not found in the revision
--
debian.1.5.6.1
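
The reachability bitmaps this patch introduces are stored
RLE-compressed, together with a count of how many bits are '1'. The
exact encoding is not specified here, so the following is only a toy
byte-oriented scheme (runs alternate starting with a '0' run, one
length byte per run) to illustrate the idea:

```c
#include <assert.h>
#include <stddef.h>

/* encode a bitmap (one byte per bit, 0 or 1) into alternating run
 * lengths, starting with a '0' run; also count the '1' bits, as the
 * on-disk format stores that count alongside the bitmap */
static size_t rle_encode(const unsigned char *bits, size_t n,
                         unsigned char *out, int *ones)
{
    size_t used = 0;
    size_t i = 0;
    unsigned cur = 0;           /* first run counts '0' bits */
    *ones = 0;
    while (i < n) {
        unsigned char run = 0;
        while (i < n && bits[i] == cur && run < 255) {
            if (cur)
                (*ones)++;
            run++;
            i++;
        }
        out[used++] = run;      /* a run may be length 0 after a cap */
        cur ^= 1;
    }
    return used;
}

/* expand the run lengths back into the original bitmap */
static void rle_decode(const unsigned char *rle, size_t used,
                       unsigned char *bits)
{
    unsigned cur = 0;
    size_t pos = 0;
    size_t r;
    unsigned char k;
    for (r = 0; r < used; r++) {
        for (k = 0; k < rle[r]; k++)
            bits[pos++] = cur;
        cur ^= 1;
    }
}
```

Long runs of '0' bits (objects reachable from none of the 'end'
objects in question) collapse to single bytes, which is why such
bitmaps will often not take much on-disk space.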
* [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (2 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-05 19:28 ` Nicolas Pitre
2009-06-04 14:05 ` [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups Sam Vilain
` (3 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
The very first thing that we want out of the revision cache is to be
able to go from a commit to all of its referred objects. Define a
revision cache that includes just that.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 260 ++++++++++++++++++++++++++++
1 files changed, 260 insertions(+), 0 deletions(-)
create mode 100644 Documentation/technical/revision-cache.txt
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
new file mode 100644
index 0000000..cc18535
--- /dev/null
+++ b/Documentation/technical/revision-cache.txt
@@ -0,0 +1,260 @@
+Revision Cache Format
+=====================
+
+The revision cache is an on-disk format which allows for certain graph
+reachability operations to be answered quickly.
+
+A revision cache contains;
+
+ - A 'start' object (ie, 'interesting' object ID)
+
+ - A list of objects which are referred to by the 'start' object
+
+ * position when sorted in --date-order
+
+ * object ID
+
+
+Start Object
+------------
+
+The 'start' object is the identifying key of the revision cache.
+
+
+Topological contents list
+-------------------------
+
+This list has fixed-length records, so the topological position into
+the list does not need to be stored in each record - it is implicit
+from the offset.
+
+--date-order is used as it is the strictest sort order available, but
+this still only specifies an ordering for commit objects. Other
+objects will appear after the object which first refers to them. Tag
+objects are sorted as if they were commit objects with a single
+parent, the object they tag.
+
+
+Use Cases
+---------
+In this section, the key functions and operations that this index is
+designed to answer are explored. For each, their efficiency is
+considered in terms of what must be carried out to calculate the
+answer.
+
+
+Determining Cache Suitability - rev_cache_suitable()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is an auxiliary function used by the other use cases, when
+
+The function is:
+
+ rev_cache_suitable( rev_cache, object )
+
+The check is simple and fast - it just compares the object to the
+'start' object in the revision cache.
+
+
+Revision Walker
+~~~~~~~~~~~~~~~
+
+The revision walker is the main user of this cache; there is the
+generic function of revision walking, as well as commands that want
+specific information which they normally derive from the revision
+walker output.
+
+
+Setting up the walker
+^^^^^^^^^^^^^^^^^^^^^
+
+The functions for this are (intentionally resembling the current
+revision walker API):
+
+ rev_cache_init()
+ rev_cache_add( interesting?, oid )
+
+As well as this setup, it is necessary to specify which options are
+required. Some API exists to do this, and the return value from
+'rev_cache_ok' is true if suitable caches were found for the
+information required, and false if it would require a graph traversal:
+
+ rev_cache_options( ordered? )
+ rev_cache_ok( ) : Bool
+
+The rev_cache_ok() function must open all available revision caches,
+and see if their 'interesting' object matches the single 'interesting'
+object passed to rev_cache_add(). If it matches, it returns true. If
+multiple 'interesting' objects were specified and 'ordered' is true,
+then the function returns false.
+
+If the ordering flag was set to false, then all of the 'interesting'
+objects must be found in separate revision caches.
+
+If any 'uninteresting' objects were passed, then the return value is
+always false.
+
+
+Returning objects
+^^^^^^^^^^^^^^^^^
+
+The 'rev_cache_fetch()' iterator returns entries from the topological
+list.
+
+ rev_cache_fetch() : oid
+
+If returning objects from a single revision cache, it opens the
+
+
+Accelerating in-progress walking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is possible that during a revision iteration operation, a revision
+is discovered that may mean the rest of the revision walking can be
+achieved faster.
+
+In practice, this is likely to be implemented by making in-core cache
+entries for objects with revision caches prior to walking; then when
+encountered the special action can be taken.
+
+
+Creating Revision Caches
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the revision cache is setup via rev_cache_init() and
+an 'interesting' object added, then call:
+
+ rev_cache_create( )
+
+This function will not work, if multiple 'interesting' objects are
+passed, or any 'uninteresting' objects were passed.
+
+This function must revision walk the commit graph, sorting in
+--date-order along the way, and may emit revisions as they are
+discovered to the topological object list.
+
+
+receive-pack/pack-objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pack-objects, when called to serve a clone or fetch, is a special
+user of the revision cache; it has the extra requirements of wanting
+to know;
+
+* how many objects are between 'interesting' and 'uninteresting'
+ objects, at the beginning of the run, to emit the pack header
+
+* for checking re-usability of deltas, whether the delta base object
+ in the pack is in the received set of objects.
+
+* for 'thin' pack generation, whether the delta base object is in the
+ received set of objects, -or- reachable from any 'uninteresting'
+ objects
+
+* for 'shallow' clone, whether the delta base object is reachable
+ without passing any of the 'uninteresting' objects
+
+The aim is for 'pack-objects' to be able to start returning objects
+immediately in the case where a suitable revision cache is returned,
+without waiting for revision counting or repacking.
+
+
+Returning object counts
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If the revision cache was suitable, eg for a complete fetch, then the
+number of objects can be obtained by counting the size of the object
+list.
+
+If multiple revision caches were used for multiple 'interesting'
+objects, then the number of objects can be obtained by building a hash
+table of all the objects in all of the caches, and returning the
+number of hash table entries.
+
+If 'uninteresting' objects were passed, no cache can be suitable.
+
+
+Re-using deltas
+^^^^^^^^^^^^^^^
+
+This applies to the 'Compressing Objects' phase, previously known as
+the 'Deltafying Objects' phase. Instead of searching for deltas, if
+we can be sure that the existing delta can be resolved on the remote,
+then we can re-use it.
+
+For the initial clone, this operation is simple - as objects are
+emitted, the delta from the packfile is re-used. If a loop is
+detected or the delta base is not in the returned set of objects, then
+the delta is first resolved.
+
+This implies that the list of objects is first loaded into a hash
+table prior to returning any objects; however this is probably
+acceptable as the entire list is in one stream and will load quickly.
+
+For later fetches, the revision cache is not appropriate as they will
+have 'uninteresting' objects set.
+
+
+Re-using deltas - 'thin' pack
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This case is much like the normal re-using deltas case, except there
+is the complete set of objects that the remote has claimed to have to
+consider.
+
+The cache is not yet sophisticated enough to assist with this use
+case.
+
+
+'shallow' clone considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For enumerating objects, the list of objects between the
+'uninteresting' and the 'shallow' commits must first be enumerated,
+and then subtracted from the list of objects between the
+'interesting' and 'uninteresting' commits.
+
+For re-using deltas for 'thin' packs, the list of objects between
+'uninteresting' and 'shallow' commits are enumerated and marked as
+valid for delta bases.
+
+The cache is not yet sophisticated enough to assist with this case.
+
+
+creating bundles
+~~~~~~~~~~~~~~~~
+
+So long as a bundle has no 'uninteresting' commits, then the revision
+cache is completely appropriate; with the length of the object list it
+can write a header and continue with the 'pack-objects' use case.
+
+
+slicing bundles (mirror-sync)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For 'slicing' bundles, a deterministic method is required for chopping
+up the list of objects, which can be calculated by any node which has
+the 'interesting' commits.
+
+To determine the objects which fall within a given slice, the object
+list must be enumerated and then divided evenly. As the compressed
+size on one node cannot be reproduced on another node, the
+uncompressed size of the object is used instead, with the hope that
+slices will generally end up of roughly even size once compressed.
+
+To calculate the object boundaries, the list of objects in
+--date-order must first be tie-broken, e.g. for commits with the same
+commit date and topological order. If slices are allowed to be
+sub-commit level, then objects between commits (ie blobs and trees)
+are also sorted. Once this deterministic list has been built, then
+all of the objects must be accessed to determine their length. The
+objects which start within a given range are the ones in the slice.
+
+For re-using deltas in sliced bundles, the delta base is looked up in
+the deterministic list. If it has an earlier sequence, then the delta
+can be safely re-used. If it has a later sequence, then the delta
+must be resolved and then the object re-deltified/compressed, using
+only objects with an earlier sequence (or in the returned pack) as a
+base.
+
+This requirement is so that a collection of 'sliced' bundles can
+successfully re-assemble, while still allowing them to be 'thin'.
--
debian.1.5.6.1
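
The slicing rule at the end of this document (divide the deterministic
object list by cumulative uncompressed size; an object belongs to the
slice its starting offset falls into) can be sketched as follows. The
function name and the fixed slice width are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* returns the 0-based slice index for object `idx`, given the
 * per-object uncompressed sizes of the deterministic list and a
 * fixed slice width in bytes; the object belongs to the slice in
 * which its starting offset falls */
static int slice_of(const size_t *sizes, size_t idx, size_t slice_bytes)
{
    size_t start = 0;
    size_t i;
    for (i = 0; i < idx; i++)
        start += sizes[i];      /* offset at which object idx starts */
    return (int)(start / slice_bytes);
}
```

Because the uncompressed sizes are the same on every node, any node
holding the 'interesting' commits computes identical boundaries, even
though compressed sizes would differ between nodes.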
* [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (3 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 7/7] revision cache: be even stricter with sort order Sam Vilain
` (2 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
If we want to be able to accelerate lookups which contain
'uninteresting' revisions, we must be able to specify what those
revisions are.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 85 +++++++++++++++++++++-------
1 files changed, 65 insertions(+), 20 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 8349cfe..759d78d 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -8,6 +8,9 @@ A revision cache contains;
- A 'start' object (ie, 'interesting' object ID)
+ - A list of 'end' objects, which may be empty (ie, 'uninteresting'
+ object IDs)
+
- A list of objects which are referred to by the 'start' object
* position when sorted in --date-order
@@ -18,10 +21,14 @@ A revision cache contains;
the above list
-Start Object
-------------
+Start Objects and End Object
+----------------------------
+
+The 'start' object, and 'end' objects are the identifying key of the
+revision cache.
-The 'start' object is the identifying key of the revision cache.
+The 'end' objects must be reachable from the 'start' objects, and none
+of the 'end' objects may be reachable from other 'end' objects.
Topological contents list
@@ -61,14 +68,35 @@ answer.
Determining Cache Suitability - rev_cache_suitable()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This is an auxiliary function used by the other use cases, when
+This is an auxiliary function used by the other use cases.
+
+A revision cache is suitable whenever the walker encounters the single
+object which is the 'end' object of the revision cache, and none of
+the 'uninteresting' revisions to the walker are in the revision cache.
The function is:
- rev_cache_suitable( rev_cache, object )
+ rev_cache_suitable( rev_cache, interesting, uninteresting [] )
+
+The check is simple and fast - it first compares the 'interesting'
+object to the 'start' object in the revision cache, then the
+'uninteresting' objects that the walker is using are looked up in the
+contents hash table.
+Only if none of them are found is the revision cache suitable.
+
+
+Revision cache stacking
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If there are 'end' objects in the revision cache used, which do not
+match the 'uninteresting' objects in the walker, the function may
+recurse, looking for other revision caches which have the unwanted
+'end' object as their 'start' object.
-The check is simple and fast - it just compares the object to the
-'start' object in the revision cache.
+Stacking revision caches is a way to re-use an older index by just
+generating a newer one which covers objects created after it. This
+approach will be able to answer many questions, but not topological
+ordering between objects which appeared in different caches.
Revision Walker
@@ -104,10 +132,12 @@ multiple 'interesting' objects were specified and 'ordered' is true,
then the function returns false.
If the ordering flag was set to false, then all of the 'interesting'
-objects must be found in separate revision caches.
+objects must be found in separate revision caches for the function to
+return true.
If any 'uninteresting' objects were passed, then the return value is
-always false.
+true if the suitability function passes for the revision caches which
+are used.
Returning objects
@@ -118,7 +148,12 @@ list.
rev_cache_fetch() : oid
-If returning objects from a single revision cache, it opens the
+If returning objects from a single or stacked set of revision caches,
+it opens the caches and returns objects from them. If combining
+results from multiple caches (where topological ordering of returned
+objects is not important) then an in-memory object hash table must be
+built, or the on-disk tables for all of the caches consulted along the
+way.
Accelerating in-progress walking
@@ -142,7 +177,7 @@ an 'interesting' object added, then call:
rev_cache_create( )
This function will not work, if multiple 'interesting' objects are
-passed, or any 'uninteresting' objects were passed.
+passed.
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
@@ -184,9 +219,12 @@ list.
If multiple revision caches were used for multiple 'interesting'
objects, then the number of objects can be obtained by building a hash
table of all the objects in all of the caches, and returning the
-number of hash table entries.
+number of hash table entries. In the single-file stacked revision
+cache case, the total can be found by adding up the total lengths of
+the object lists.
-If 'uninteresting' objects were passed, no cache can be suitable.
+If 'uninteresting' objects were passed, the caches used must be
+suitable according to the cache suitability function.
Re-using deltas
@@ -206,8 +244,9 @@ This implies that each delta base must be looked up in the on-disk
hash table as they are written, which is both low impact and memory
efficient.
-For later fetches, the revision cache is not appropriate as they will
-have 'uninteresting' objects set.
+For later fetches, the revision cache will only be appropriate if the
+'have' objects sent by the remote are not found in the revision
+caches' object hash table.
Re-using deltas - 'thin' pack
@@ -217,8 +256,13 @@ This case is much like the normal re-using deltas case, except there
is the complete set of objects that the remote has claimed to have to
consider.
-The cache is not yet sophisticated enough to assist with this use
-case.
+The revision cache can assist with this case only if the 'have'
+objects sent by the remote match the 'uninteresting' objects in the
+revision cache (or are not found in the object list), and another
+revision cache exists which uses that 'uninteresting' object as its
+'start' object. In this case, the lower stacked revision cache serves
+as a handy lookup table as to whether the object is known to the
+remote.
'shallow' clone considerations
@@ -239,9 +283,10 @@ The cache is not yet sophisticated enough to assist with this case.
creating bundles
~~~~~~~~~~~~~~~~
-So long as a bundle has no 'uninteresting' commits, then the revision
-cache is completely appropriate; with the length of the object list it
-can write a header and continue with the 'pack-objects' use case.
+So long as a bundle's 'uninteresting' commits match that of the
+revision cache, then the revision cache is completely appropriate;
+with the length of the object list it can write a header and continue
+with the 'pack-objects' use case.
slicing bundles (mirror-sync)
--
debian.1.5.6.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 7/7] revision cache: be even stricter with sort order
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (4 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 6/7] revision cache: allow foreign 'start' commits Sam Vilain
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
For sliced bundles, there must be no ambiguity at all about the sort
order. Define how ties are broken, and describe why this affects
mirror-sync.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 79 +++++++++++++++++++---------
1 files changed, 54 insertions(+), 25 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 0cd7b08..7aaab38 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -14,7 +14,7 @@ A revision cache contains;
- A list of objects which are referred to by any 'start' object, but
not by crossing 'end' objects, including:
- * position when sorted in --date-order
+ * position when sorted in --date-order, with ties broken by SHA1
* object ID
@@ -33,9 +33,15 @@ A revision cache contains;
'end' object, but not reachable from the 'start' objects reachable
from that 'end' object.
- - a list of 'foreign end' objects, for which not all reachable
- objects are in the object list, but can have reachability or a
- 'newness' bitmap.
+ - a list of 'foreign start' objects, for which not all reachable
+ objects are in the object list, but can have reachability and
+ 'newness' bitmaps.
+
+ - a list of object types for each object, in the order they appear
+ in the object list.
+
+ - a list of object lengths for each object, in the order they appear
+ in the object list.
Start Objects and End Objects
@@ -64,7 +70,9 @@ from the offset.
this still only specifies an ordering for commit objects. Other
objects will appear after the object which first refers to them. Tag
objects are sorted as if they were commit objects with a single
-parent, the object they tag.
+parent, the object they tag. No object is allowed to appear before
+another object which refers to it, unless it is a 'start' object.
+Ties in the order are broken by SHA1.
Included object hash table
@@ -231,6 +239,23 @@ caches whether the object exists and is reachable from the 'start'
objects which connect to those stacked caches.
+Extending Revision Caches
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ rev_cache_extend( rev_cache, quasi_interesting[] )
+
+The function works on an existing revision cache, and calculates just
+the 'reachability' and 'newness' bitmaps.
+
+These are calculated by marking the quasi_interesting[] commits as
+interesting, the 'end' objects in the revision cache as
+'uninteresting', and walking.
+Objects which are encountered which exist in the revision cache are
+converted to bits in the emitted bitmap, and the 'newness' bitmap is
+built as with the normal case once the reachable 'end' objects are
+known.
+
+
receive-pack/pack-objects
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -360,28 +385,32 @@ slicing bundles (mirror-sync)
For 'slicing' bundles, a deterministic method is required for chopping
up the list of objects, which can be calculated by any node which has
-the 'interesting' commits.
-
-To determine the objects which fall within a given slice, the object
-list must be enumerated and then divided evenly. As the compressed
-size on one node cannot be reproduced on another node, the
-uncompressed size of the object is used instead, with the hope that
-slices will generally end up of roughly even size once compressed.
-
-To calculate the object boundaries, the list of objects in
---date-order must first be tie-broken, eg for commits with the same
-commitdate and topological order. If slices are allowed to be
-sub-commit level, then objects between commits (ie blobs and trees)
-are also sorted. Once this deterministic list has been built, then
-all of the objects must be accessed to determine their length. The
-objects which start within a given range are the ones in the slice.
+the 'interesting' commits. As topological order is important, bundles
+must only consist of objects from a single revision cache to be sliced
+in this manner.
+
+To determine the objects which fall within a given slice, the new
+object list in the bundle must be enumerated (such as by decompressing
+and masking bitmaps) and then divided evenly, using the (masked)
+object lengths list. As the compressed size on one node cannot be
+reproduced on another node, the uncompressed size of the object is
+used instead, with the hope that slices will generally end up of
+roughly even size once compressed.
+
+The objects which start within a given range in this list of objects
+are the ones in the slice. If slices are not allowed at the
+sub-commit level, then the requirement to tie-break ordering of
+non-commit/tag objects is relaxed, however the list of per-object
+types is required to know where the boundaries are.
For re-using deltas in sliced bundles, the delta base is looked up in
-the deterministic list. If it has an earlier sequence, then the delta
-can be safely re-used. If it has a later sequence, then the delta
-must be resolved and then the object re-deltified/compressed, using
-only objects with an earlier sequence (or in the returned pack) as a
-base.
+the deterministic list of objects. If it has an earlier sequence,
+then the delta can be safely re-used. If it has a later sequence,
+then the delta must be resolved and then the object
+re-deltified/compressed, using only objects with an earlier sequence
+(or in the returned pack) as a base.
This requirement is so that a collection of 'sliced' bundles can
successfully re-assemble, while still allowing them to be 'thin'.
+Without thin packs, download spreading from multiple mirrors will
+result in a much larger download.
--
debian.1.5.6.1
* [PATCH 6/7] revision cache: allow foreign 'start' commits
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (5 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 7/7] revision cache: be even stricter with sort order Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
Usually the 'start' commits are 'interesting' commits in a revision
query; however, with the 'newness' bitmaps we can also use them to
mask out objects, ie 'uninteresting' commits. An index which was
useless, due to an unknown object existing in the 'uninteresting'
commit list, can be converted to a useful one by building the bitmaps
of 'reachable' and 'new' objects for that commit - without having to
insert new objects in the middle of the list, which would require
rebuilding the entire index. So there is a use case for storing
information on 'start' objects which aren't actually in the object
list. Permit this case.
This is currently a work in progress; the description of how this
affects the use cases is not complete yet.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index e0adb26..0cd7b08 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -33,6 +33,10 @@ A revision cache contains;
'end' object, but not reachable from the 'start' objects reachable
from that 'end' object.
+ - a list of 'foreign end' objects, for which not all reachable
+ objects are in the object list, but can have reachability or a
+ 'newness' bitmap.
+
Start Objects and End Objects
-----------------------------
--
debian.1.5.6.1
* [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
@ 2009-06-04 14:18 Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
` (7 more replies)
0 siblings, 8 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:18 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
This patch series describes the structure of the object list cache
on-disk format. It is built successively from a very simple design -
just an object list - to a version that allows for as many rev-list
operations to be accelerated as possible, and potentially immediate
startup of full clone operations in the common case; ie skipping the
"Counting Objects" and "Compressing Objects" phases once a matching
index is found.
The plan is to implement each step incrementally, with a test-*.c
file along the way which exercises the API provided by the revision
cache. While the revision cache format will change along the way, this
will not require an index format deprecation cycle, as integration with
the rest of git will not happen until the format is settled.
The plan is to complete one of these API milestones per week.
When complete, each commit will contain tests for the level of cache
that it delivers. Later milestones include joining the dots -
integrating with the 'rev-list' machinery and most importantly,
'pack-objects'.
Errata: the 'object list' and 'contents hash' will probably be
re-worked to keep a separate SHA-1 and topological index list, to
re-use existing fan-out code. This will be incorporated into the next
version.
Sam Vilain (7):
revision-cache: define revision cache as simple list of revisions
rev-cache: add on-disk format for fast reachability lookup
rev-cache: add 'end' objects for caching 'uninteresting' lookups
rev-cache: allow multiple 'start' objects per index
revision cache: maps of 'new' objects
revision cache: allow foreign 'start' commits
revision cache: be even stricter with sort order
Documentation/technical/revision-cache.txt | 416 ++++++++++++++++++++++++++++
1 files changed, 416 insertions(+), 0 deletions(-)
create mode 100644 Documentation/technical/revision-cache.txt
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (6 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 6/7] revision cache: allow foreign 'start' commits Sam Vilain
@ 2009-06-05 16:56 ` Jakub Narebski
2009-06-05 20:58 ` Nicolas Pitre
2009-06-07 4:01 ` Sam Vilain
7 siblings, 2 replies; 14+ messages in thread
From: Jakub Narebski @ 2009-06-05 16:56 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
Sam Vilain <sam@vilain.net> writes:
> This patch series describes the structure of the object list cache
> on-disk format. It is built successively from a very simple design -
> just an object list - to a version that allows for as many rev-list
> operations to be accelerated as possible, and potentially immediate
> startup of full clone operations in the common case; ie skipping the
> "Counting Objects" and "Compressing Objects" phases once a matching
> index is found.
>
> The plan is to implement each step incrementally, with a test-*.c
> file along the way which exercises the API provided by the revision
> cache. While the revision cache format will change along the way, this
> will not require an index format deprecation cycle, as integration with
> the rest of git will not happen until the format is settled.
>
> The plan is to complete one of these API milestones per week.
> When complete, each commit will contain tests for the level of cache
> that it delivers. Later milestones include joining the dots -
> integrating with the 'rev-list' machinery and most importantly,
> 'pack-objects'.
I like this sharing of not only completed code, but also plans,
designs (and status reports) with the Git development community
(i.e. the git mailing list). I like this very much.
I'd like to ask if there are any results of profiling git server
(git-daemon) code: how much time is spent on the object enumeration
this GSoC project tries to make faster by means of caching?
Are there prepared benchmarks and tests to check if the code gives
correct results, and to measure improvements brought by caching?
Would it be possible to get some real-life statistics of git-daemon
usage, so that you optimize against real scenarios?
I wish you good work on git-daemon caching...
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
@ 2009-06-05 19:28 ` Nicolas Pitre
2009-06-07 3:49 ` Sam Vilain
0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-05 19:28 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 5 Jun 2009, Sam Vilain wrote:
> The very first thing that we want out of the revision cache is to be
> able to go from a commit to all of its referred objects. Define a
> revision cache that includes just that.
>
> Signed-off-by: Sam Vilain <sam@vilain.net>
Comments below.
> ---
> Documentation/technical/revision-cache.txt | 260 ++++++++++++++++++++++++++++
> 1 files changed, 260 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/technical/revision-cache.txt
>
> diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
> new file mode 100644
> index 0000000..cc18535
> --- /dev/null
> +++ b/Documentation/technical/revision-cache.txt
> @@ -0,0 +1,260 @@
> +Revision Cache Format
> +=====================
> +
> +The revision cache is an on-disk format which allows for certain graph
> +reachability operations to be answered quickly.
> +
> +A revision cache contains;
> +
> + - A 'start' object (ie, 'interesting' object ID)
> +
> + - A list of objects which are referred to by the 'start' object
> +
> + * position when sorted in --date-order
> +
> + * object ID
Does the object ID contain the object type or just the SHA1? Having the
object type quickly retrievable would be a significant gain as well.
> +Start Object
> +------------
> +
> +The 'start' object is the identifying key of the revision cache.
> +
> +
> +Topological contents list
> +-------------------------
> +
> +This list has fixed-length records, so the topological position into
> +the list does not need to be stored in each record - it is implicit
> +from the offset.
> +
> +--date-order is used as it is the strictest sort order available, but
> +this still only specifies an ordering for commit objects. Other
> +objects will appear after the object which first refers to them. Tag
> +objects are sorted as if they were commit objects with a single
> +parent, the object they tag.
> +
> +
> +Use Cases
> +---------
> +In this section, the key functions and operations that this index is
> +designed to answer are explored. For each, their efficiency is
> +considered in terms of what must be carried out to calculate the
> +answer.
> +
> +
> +Determining Cache Suitability - rev_cache_suitable()
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +This is an auxiliary function used by the other use cases, when
when what?
> +The function is:
> +
> + rev_cache_suitable( rev_cache, object )
> +
> +The check is simple and fast - it just compares the object to the
> +'start' object in the revision cache.
> +
> +
> +Revision Walker
> +~~~~~~~~~~~~~~~
> +
> +The revision walker is the main user of this cache; there is the
> +generic function of revision walking, as well as commands that want
> +specific information which they normally derive from the revision
> +walker output.
> +
> +
> +Setting up the walker
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +The functions for this are (intentionally resembling the current
> +revision walker API):
> +
> + rev_cache_init()
> + rev_cache_add( interesting?, oid )
> +
> +As well as this setup, it is necessary to specify which options are
> +required. Some API exists to do this, and the return value from
> +'rev_cache_ok' is true if suitable caches were found for the
> +information required, and false if it would require a graph traversal:
> +
> + rev_cache_options( ordered? )
> + rev_cache_ok( ) : Bool
> +
> +The rev_cache_ok() function must open all available revision caches,
> +and see if their 'interesting' object matches the single 'interesting'
> +object passed to rev_cache_add(). If it matches, it returns true. If
> +multiple 'interesting' objects were specified and 'ordered' is true,
> +then the function returns false.
> +
> +If the ordering flag was set to false, then all of the 'interesting'
> +objects must be found in separate revision caches.
> +
> +If any 'uninteresting' objects were passed, then the return value is
> +always false.
> +
> +
> +Returning objects
> +^^^^^^^^^^^^^^^^^
> +
> +The 'rev_cache_fetch()' iterator returns entries from the topological
> +list.
> +
> + rev_cache_fetch() : oid
> +
> +If returning objects from a single revision cache, it opens the
opens what?
> +Accelerating in-progress walking
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +It is possible that during a revision iteration operation, a revision
> +is discovered that may mean the rest of the revision walking can be
> +achieved faster.
> +
> +In practice, this is likely to be implemented by making in-core cache
> +entries for objects with revision caches prior to walking; then when
> +encountered the special action can be taken.
> +
> +
> +Creating Revision Caches
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Once the revision cache is setup via rev_cache_init() and
> +an 'interesting' object added, then call:
> +
> + rev_cache_create( )
> +
> +This function will not work, if multiple 'interesting' objects are
> +passed, or any 'uninteresting' objects were passed.
> +
> +This function must revision walk the commit graph, sorting in
> +--date-order along the way, and may emit revisions as they are
> +discovered to the topological object list.
> +
> +
> +receive-pack/pack-objects
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +pack-objects when called from pack-objects is a special user of the
^
Are you missing some punctuation here?
> +revision cache; it has the extra requirements of wanting to know;
> +
> +* how many objects are between 'interesting' and 'uninteresting'
> + objects, at the beginning of the run, to emit the pack header
> +
> +* for checking re-usability of deltas, whether the delta base object
> + in the pack is in the received set of objects.
> +
> +* for 'thin' pack generation, whether the delta base object is in the
> + received set of objects, -or- reachable from any 'uninteresting'
> + objects
> +
> +* for 'shallow' clone, whether the delta base object is reachable
> + without passing any of the 'uninteresting' objects
> +
> +The aim is for 'pack-objects' to be able to start returning objects
> +immediately in the case where a suitable revision cache is returned,
> +without waiting for revision counting or repacking.
Everything that was said so far made lots of sense to me, up to this
section. I think that you should really limit the scope of the problem
to object enumeration only and not dive so deep into pack-objects'
operation. Having an efficient object enumeration cache that is 1)
fast, 2) doesn't use up too much disk space and 3) keeps itself up to
date automatically and transparently is already quite a challenge.
Throwing the pack-objects issues listed above into the mix is, I
think, a huge mistake and possibly a misunderstanding of the packing
process. More comments below.
> +Returning object counts
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +If the revision cache was suitable, eg for a complete fetch, then the
> +number of objects can be obtained by counting the size of the object
> +list.
> +
> +If multiple revision caches were used for multiple 'interesting'
> +objects, then the number of objects can be obtained by building a hash
> +table of all the objects in all of the caches, and returning the
> +number of hash table entries.
> +
> +If 'uninteresting' objects were passed, no cache can be suitable.
Knowing the number of objects early is not that useful. The only way
you can know that number is by counting all needed objects, and the
problem to solve here is really to provide that list of objects _fast_.
The pack-objects code already deals with building a hash of objects and
that takes no time compared to actually obtaining that list of objects
in the first place.
> +Re-using deltas
> +^^^^^^^^^^^^^^^
> +
> +This applies to the 'Compressing Objects' phase, previously known as
> +the 'Deltafying Objects' phase. Instead of searching for deltas, if
> +we can be sure that the existing delta can be resolved on the remote,
> +then we can re-use it.
I don't see how the reachability cache could have anything to do with
this. Sure the criteria for reusing some delta data is for the base to
be part of the object set, or in the thin pack case, when the base is
known to be available on the remote side. But the delta and base has to
already exist locally in the _same_ pack, otherwise it simply doesn't
exist and has to be computed. It is a common situation for a repository
to have multiple packs for which deltas can be produced when crossing
source pack boundaries, and that fact simply cannot fit within an object
enumeration cache.
> +For the initial clone, this operation is simple - as objects are
> +emitted, the delta from the packfile is re-used. If a loop is
> +detected or the delta base is not in the returned set of objects, then
> +the delta is first resolved.
How do you know you have a loop? A lot of thoughts went into the
current code to never get into a situation where loops could be
possible. And that can only be inferred from the actual content of used
object packs. Again the enumeration cache cannot and should not be
concerned by such considerations as they will change on every repack.
> +This implies that the list of objects is first loaded into a hash
> +table prior to returning any objects; however this is probably
> +acceptable as the entire list is in one stream and will load quickly.
Again this is duplication of already existing code.
> +For later fetches, the revision cache is not appropriate as they will
> +have 'uninteresting' objects set.
> +
> +
> +Re-using deltas - 'thin' pack
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This case is much like the normal re-using deltas case, except there
> +is the complete set of objects that the remote has claimed to have to
> +consider.
> +
> +The cache is not yet sophisticated enough to assist with this use
> +case.
And it should not attempt anything in that direction. This is simply
not the right level for such considerations.
Please don't mess up the layers of abstractions. Creating a list of
objects is currently the biggest bottleneck for big clones. It is not
the object compression (delta) phase, it is not the pack creation
either. The current code is smart enough not to delta objects it knows
can be found in source packs already, and it knows how to cull the list
of deltas that needs to be computed to the strict minimum. And that
requires almost no time. What is really time consuming is to figure out
the damn object list in the first place. No need to reinvent what
already works well.
In other words, you should really concentrate on making 'git rev-list
--objects' and 'git rev-list --objects-edge' instantaneous without
having to parse any commit or tree objects (at least for the part
already covered by the cache). And as new objects are added through
'git add' and 'git commit' or 'git fetch' then the traditional object
parsing would have to take place only up to the point where the cache
could take over, and update the cache for the non-covered objects
while at it. I think this is already quite a difficult problem.
What pack-objects would greatly benefit from is a facility that could
provide a list of objects with their SHA1, type, size and the 32-bit
object path hash given a list of "have" and "want" kind of refs. And so
in a way that doesn't suck, which could mean in less than, say, 5
seconds per million objects. That's it, but that's hard. All the rest
is already done and quite well optimized.
Nicolas
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
@ 2009-06-05 20:58 ` Nicolas Pitre
2009-06-07 4:01 ` Sam Vilain
1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-05 20:58 UTC (permalink / raw)
To: Jakub Narebski
Cc: Sam Vilain, git, Nick Edelen, Shawn O. Pearce,
Johannes Schindelin, Andreas Ericsson, Christian Couder,
Jeff King
On Fri, 5 Jun 2009, Jakub Narebski wrote:
> I like this sharing of not only completed code, but also plans,
> designs (and status reports) with the Git development community
> (i.e. the git mailing list). I like this very much.
>
>
> I'd like to ask if there are any results of profiling git server
> (git-daemon) code: how much time is spent on the object enumeration
> this GSoC project tries to make faster by means of caching?
The git daemon only forks and execs other processes. It is hardly using
any measurable CPU itself.
If you want to profile or simply have a good feel for what happens
during a clone and clearly see what phase is problematic for the server
then do this:
1) cd to a local repository of your choice. The bigger the better,
meaning that git itself is probably too small. Try the linux
kernel, or a gcc mirror, or even better yet the gentoo repository.
2) run this:
git pack-objects --all-progress --revs --all --stdout \
< /dev/null > /dev/null
3) Sit and watch. And for extra fun you may even measure the time
spent in each of the "counting objects", "compressing objects" and
"writing objects" phases and compare them.
Nicolas
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-05 19:28 ` Nicolas Pitre
@ 2009-06-07 3:49 ` Sam Vilain
2009-06-07 5:43 ` Nicolas Pitre
0 siblings, 1 reply; 14+ messages in thread
From: Sam Vilain @ 2009-06-07 3:49 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > + - A list of objects which are referred to by the 'start' object
> > +
> > + * position when sorted in --date-order
> > +
> > + * object ID
>
> Does the object ID contain the object type or just the SHA1? Having the
> object type quickly retrievable would be a significant gain as well.
Well, one of the things I'm trying to avoid is adding information which
isn't clearly supported by a use case. But now that you mention it I
guess the type is quite useful and also quite small. What I could do is
have 4 bitmaps, one for each object type. That should be quite small
and re-use the bitmap infrastructure used later.
[...]
> > +Use Cases
> > +---------
> > +In this section, the key functions and operations that this index is
> > +designed to answer are explored. For each, their efficiency is
> > +considered in terms of what must be carried out to calculate the
> > +answer.
> > +
> > +
> > +Determining Cache Suitability - rev_cache_suitable()
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +This is an auxiliary function used by the other use cases, when
>
> when what?
Whoops. When deciding whether or not a revision cache is going to be
useful to help the current operation. This needs to be quick because we
need to identify whether to use a cache or keep going.
> > +The function is:
> > +
> > + rev_cache_suitable( rev_cache, object )
> > +
> > +The check is simple and fast - it just compares the object to the
> > +'start' object in the revision cache.
> > +
> > +
[...]
> > +Returning objects
> > +^^^^^^^^^^^^^^^^^
> > +
> > +The 'rev_cache_fetch()' iterator returns entries from the topological
> > +list.
> > +
> > + rev_cache_fetch() : oid
> > +
> > +If returning objects from a single revision cache, it opens the
>
> opens what?
Again sorry this got lost when I was re-arranging the document. It
opens the revision cache and returns objects in the order they appear in
the cache's object list.
[...]
> > +receive-pack/pack-objects
> > +~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +pack-objects when called from pack-objects is a special user of the
> ^
> Are you missing some punctuation here?
It should probably say "when called from receive-pack".
> > +revision cache; it has the extra requirements of wanting to know;
> > +
> > +* how many objects are between 'interesting' and 'uninteresting'
> > + objects, at the beginning of the run, to emit the pack header
> > +
> > +* for checking re-usability of deltas, whether the delta base object
> > + in the pack is in the received set of objects.
> > +
> > +* for 'thin' pack generation, whether the delta base object is in the
> > + received set of objects, -or- reachable from any 'uninteresting'
> > + objects
> > +
> > +* for 'shallow' clone, whether the delta base object is reachable
> > + without passing any of the 'uninteresting' objects
> > +
> > +The aim is for 'pack-objects' to be able to start returning objects
> > +immediately in the case where a suitable revision cache is returned,
> > +without waiting for revision counting or repacking.
>
> Everything that was said so far made lots of sense to me, up to this
> section. I think that you should really limit the scope of the problem
> to object enumeration only and not dive so deep into pack-objects'
> operation.
> Having an efficient object enumeration cache that is 1)
> fast, 2) doesn't use up too much disk space and 3) keeps itself up to
> date automatically and transparently [...]
Like the rest of git - the object graph, packs, etc - this design is
based around the concept of "immutable data". That is, the revision
cache is defined by the objects it was computed from. Once the later
stages of it are reached, the "revision cache stacking" comes into play
to achieve automatic and transparent 'updating'. I considered designs
which were not immutable data but decided that the update performance
would be too poor or suffer from locking issues.
> > +Re-using deltas
> > +^^^^^^^^^^^^^^^
> > +
> > +This applies to the 'Compressing Objects' phase, previously known as
> > +the 'Deltafying Objects' phase. Instead of searching for deltas, if
> > +we can be sure that the existing delta can be resolved on the remote,
> > +then we can re-use it.
>
> I don't see how the reachability cache could have anything to do with
> this. Sure the criteria for reusing some delta data is for the base to
> be part of the object set, or in the thin pack case, when the base is
> known to be available on the remote side. But the delta and base has to
> already exist locally in the _same_ pack, otherwise it simply doesn't
> exist and has to be computed. It is a common situation for a repository
> to have multiple packs for which deltas can be produced when crossing
> source pack boundaries, and that fact simply cannot fit within an object
> enumeration cache.
>
> > +For the initial clone, this operation is simple - as objects are
> > +emitted, the delta from the packfile is re-used. If a loop is
> > +detected or the delta base is not in the returned set of objects, then
> > +the delta is first resolved.
>
> How do you know you have a loop? A lot of thoughts went into the
> current code to never get into a situation where loops could be
> possible. And that can only be inferred from the actual content of used
> object packs. Again the enumeration cache cannot and should not be
> concerned by such considerations as they will change on every repack.
>
> > +This implies that the list of objects is first loaded into a hash
> > +table prior to returning any objects; however this is probably
> > +acceptable as the entire list is in one stream and will load quickly.
>
> Again this is duplication of already existing code.
There's no code to duplicate yet. A lot of the above text is trying
to summarise what happens in pack-objects, for the purposes of
considering whether the revision cache can help with it. Delta loop
detection I'd considered, but decided not to discuss in the already
quite long document.
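For what it's worth, the "load the list into a hash table" step,
combined with the abbreviated-entry scheme from the patch, might look
like this sketch (abbreviation length and names are illustrative):

```python
def build_lookup(object_list, abbrev_hex=8):
    # Map an abbreviated object ID to the positions it occupies in the
    # topological list. Abbreviated entries may collide, so a match is
    # only confirmed by comparing the full SHA-1 at that position.
    table = {}
    for pos, oid in enumerate(object_list):
        table.setdefault(oid[:abbrev_hex], []).append(pos)
    return table

def in_cache(object_list, table, oid, abbrev_hex=8):
    # Any abbreviated hit must be confirmed against the full ID.
    return any(object_list[pos] == oid
               for pos in table.get(oid[:abbrev_hex], []))
```

Since the whole list arrives as one sequential stream, building this
table is a single fast pass before any objects are returned.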
> Please don't mess up the layers of abstractions. Creating a list of
> objects is currently the biggest bottleneck for big clones. It is not
> the object compression (delta) phase, it is not the pack creation
> either.
[...]
> In other words, you should really concentrate on making 'git rev-list
> --objects' and 'git rev-list --objects-edge' instantaneous without
> having to parse any commit nor tree objects (at least for the part
> already covered by the cache). And as new objects are added through
> 'git add' and 'git commit' or 'git fetch' then the traditional object
> parsing would have to take place only up to the point where the cache
> could take over, and update the cache for the non covered objects while
> at it. I think this is quite a difficult problem already.
It's a good thing, then, that this is exactly what the first milestones
in the project are covering.
What I don't understand though is that during initial clone the object
compression phase does take a lot of time, and that there have been
reports that large initial clones use a lot of VM for the packfile to be
sent. So what I'm doing is trying to answer the question about why we
can't just start streaming the pack as soon as we know how many objects
will be in it. All those other stages can happen in parallel - for
instance I would wager that it's far more efficient to detect loops as
we're streaming, as the packfile is accessed, rather than having to
seek and read object headers in the packfile up front.
In summary, I'm trying to make sure that the revision cache contains
whatever information might help with making it start as soon as
possible, where putting the information in is trivial.
If pack-objects integration is as small a win as you say - and I have no
hard facts to dispute this - then when we come to that part and
investigate I'm sure with the facts in hand we will agree on what
approach to take.
> What pack-objects would greatly benefit from is a facility that could
> provide a list of objects with their SHA1, type, size
Which size? compressed, uncompressed? What does that win us? Again
I'm trying to support all information with a use case.
> and the 32-bit
> object path hash
I'm not sure what this means, can you refer me to some docs or relevant
source?
> given a list of "have" and "want" kind of refs. And so
> in a way that doesn't suck, which could mean in less than, say, 5
> seconds per million objects. That's it, but that's hard. All the rest
> is already done and quite well optimized.
Thanks for your detailed commentary,
Sam.
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
2009-06-05 20:58 ` Nicolas Pitre
@ 2009-06-07 4:01 ` Sam Vilain
1 sibling, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-07 4:01 UTC (permalink / raw)
To: Jakub Narebski
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 2009-06-05 at 09:56 -0700, Jakub Narebski wrote:
> > The plan is to aim for one of these API milestones completed per week.
> > When complete, each commit will contain tests for the level of cache
> > that it delivers. Later milestones include joining the dots -
> > integrating with the 'rev-list' machinery and most importantly,
> > 'pack-objects'.
>
> I like this sharing not only completed code, but plans, designs (and
> status reports) with Git Development Community (i.e. git mailing
> list). I like this very much.
>
>
> I'd like to ask if there any results of profiling git server
> (git-daemon) code: how much is spend on object enumeration this GSoC
> project tries to make faster by the means of caching?
No, but you're not the first to ask - I'll research this and include
profiling of it (obviously needing to descend into stages taken by
pack-objects/receive-pack, as Nicolas points out) in the next update I
send out.
> Are there prepared benchmarks and tests to check if the code gives
> correct results, and to measure improvements brought by caching?
> Would it be possible to get some real-life statistics of git-daemon
> usage, so that you optimize against real scenarios?
In terms of correct results - that will come down to the test cases
which are written for it, and possibly extending the existing test cases
e.g. t5500-fetch-pack.sh.
> I wish you good work on git-daemon caching...
Thanks!
Sam.
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-07 3:49 ` Sam Vilain
@ 2009-06-07 5:43 ` Nicolas Pitre
0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-07 5:43 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Sun, 7 Jun 2009, Sam Vilain wrote:
> On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > Please don't mess up the layers of abstractions. Creating a list of
> > objects is currently the biggest bottleneck for big clones. It is not
> > the object compression (delta) phase, it is not the pack creation
> > either.
> [...]
> > In other words, you should really concentrate on making 'git rev-list
> > --objects' and 'git rev-list --objects-edge' instantaneous without
> > having to parse any commit nor tree objects (at least for the part
> > already covered by the cache). And as new objects are added through
> > 'git add' and 'git commit' or 'git fetch' then the traditional object
> > parsing would have to take place only up to the point where the cache
> > could take over, and update the cache for the non covered objects while
> > at it. I think this is quite a difficult problem already.
>
> It's a good thing, then, that this is exactly what the first milestones
> in the project are covering.
>
> What I don't understand though is that during initial clone the object
> compression phase does take a lot of time, and that there have been
> reports that large initial clones use a lot of VM for the packfile to be
> sent.
Did you duplicate those results yourself?
Did you also try out the simple bench test I suggested while responding
to Jakub?
For example, let's take the Linux kernel repository as I have it handy
here. For the various phases I get:
Counting objects = 25 seconds
Compressing objects = 2 seconds
Writing objects = 3 seconds
A long compression phase is a sign of a really badly packed repository.
The solution to that is simply to repack. After that subsequent clones
as well as further repacks will have much shorter compression phase as
the work that was performed in the first repack will be directly
reusable.
> So what I'm doing is trying to answer the question about why we
> can't just start streaming the pack as soon as we know how many objects
> will be in it. All those other stages can happen in parallel - for
> instance I would wager that it's far more efficient to detect loops as
> we're streaming, as the packfile is accessed, rather than having to
> seek and read object headers in the packfile up front.
My stance on this is that it is fundamentally impossible to do
efficiently, if at all. What we currently do is not loop detection but
rather loop avoidance. And because your pack set is not a static thing
(new packs are created or fetched, and sometimes they're repacked into a
single pack for a while) then you simply can't have a stable cache of
object topology based on their storage into packs. This is simply not
some immutable data. Only object names and contents are immutable,
nothing more.
What I'm telling you is: in order to know how many objects the pack
which is about to be created will contain, you need to count them.
This is what takes 25 seconds currently in the above example. Then a 2
second compression phase which could be even less if your repository is
better packed than mine currently is. At which point the writing phase
starts and the pack header is streamed. So trying to stream the pack as
soon as you know how many objects it contains will save a big whoopee 2
seconds. This is nothing to go crazy about, really.
Yet, to be able to stream a pack right away, you still would need to
know _where_ to pick the pack data from. And in most cases this is not
from a nice single pack but rather multiple ones. And some of those
packs have objects which are duplicated into other packs. And yet by
mixing pack data from multiple source packs, you gain new opportunities
for delta compression because you're now putting together objects which
were separated before and therefore couldn't create delta between them
previously (that's the 2 second phase above). You still must be careful
during that phase not to create delta loops, and to do that you must
take into account all the other deltas from existing source packs that
you are not going to recompute but just copy straight into the new pack.
And of course there is the delta depth limit to enforce. This means
that some rearrangements of deltas force some objects which were deltas
previously to become undeltified, or to become new deltas against a
shallower base object. And the only way to know what form each object
will take in
this context is again by going through the full compression phase which
works on a list of objects whose ordering is totally different from the
one that is used to actually store objects in the pack. In other words,
you cannot work out delta issues as you stream the pack because you
don't want to write objects in the pack with the same order used for
delta processing.
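To make the loop-avoidance and depth points concrete, here is a toy
sketch (Python, hypothetical names; git's real checks live in
builtin-pack-objects.c, and the default --depth is 50):

```python
def delta_chain_ok(oid, delta_base, max_depth=50):
    # delta_base maps an object ID to the ID of its delta base; objects
    # stored whole are absent from the map. A reused chain is rejected
    # if it revisits an object (a loop) or exceeds the depth limit.
    seen = set()
    depth = 0
    while oid in delta_base:
        if oid in seen:
            return False          # delta loop
        seen.add(oid)
        oid = delta_base[oid]
        depth += 1
        if depth > max_depth:
            return False          # chain too deep
    return True
```

The catch, as described above, is that delta_base is only fully known
once the compression phase has decided which existing deltas to copy
and which objects to re-delta or store whole.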
Why a different object ordering for storing in the pack? Because we want
to create the best data layout possible in the new pack so future
runtime access to that pack will be optimal. This often means that
objects are picked from existing packs in a non linear fashion but
rather in a seemingly random way. This is because we want the most
recent commits to get the best IO patterns in a pack. However, when the
repository is updated through fetches or pushes, the newly obtained
packs usually carry even more recent commits for which the bulk of
referenced objects are still to be found in older packs (this is why a
fetch is so quick). So the more fetches or pushes are performed, the
less efficient your repository becomes with regard to the most recent
commits. This is why a subsequent repack will reshuffle all objects so
that the most recent commit gets a linear IO pattern again by picking
objects from the most recent packs as well as from older packs and
putting them all contiguously in the new pack. But again you cannot
know if those objects will be in delta form or not, or against which
base object before the compression phase is over.
I hope you have a better idea now to answer the question about why we
can't just start streaming the pack as soon as we know how many objects
will be in it, and why all those other stages may not happen in
parallel.
> In summary, I'm trying to make sure that the revision cache contains
> whatever information might help with making it start as soon as
> possible, where putting the information in is trivial.
>
> If pack-objects integration is as small a win as you say - and I have no
> hard facts to dispute this - then when we come to that part and
> investigate I'm sure with the facts in hand we will agree on what
> approach to take.
Good. I'm glad you see things that way. Because, to me, even just the
notion of a good revision cache is not that simple.
> > What pack-objects would greatly benefit from is a facility that could
> > provide a list of objects with their SHA1, type, size
>
> Which size? compressed, uncompressed? What does that win us? Again
> I'm trying to support all information with a use case.
Uncompressed. That information is used by the delta heuristics, and the
code currently does its best not to seek all over through delta chains
to fetch that tiny bit of information if that can be avoided (have a
look at check_object() in builtin-pack-objects.c).
> > and the 32-bit
> > object path hash
>
> I'm not sure what this means, can you refer me to some docs or relevant
> source?
If you don't know what that is, then I'm afraid you might be lacking
some background on the packing and delta strategy. I'd suggest you read
Documentation/technical/pack-heuristics.txt first, and then find out in
the commit logs for builtin-pack-objects.c (or even within the file
itself) what has changed since then.
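For reference, the heuristic in question boils down to something like
the following (a Python transcription of name_hash() as I read it in
builtin-pack-objects.c; treat it as illustrative, not authoritative):

```python
def name_hash(path):
    # Effectively a sortable 32-bit number derived from roughly the
    # last sixteen non-whitespace characters of the path. Trailing
    # characters weigh most, so objects whose paths end alike
    # (e.g. ".c") sort near each other in the delta-candidate window.
    h = 0
    for ch in path or "":
        if ch.isspace():
            continue
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h
```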
Nicolas