* [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
2009-06-04 14:05 ` [PATCH 4/7] rev-cache: allow multiple 'start' objects per index Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
` (4 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
As well as storing the sorted list of objects, store a hash table for
faster lookup.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index cc18535..8349cfe 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -14,6 +14,9 @@ A revision cache contains;
* object ID
+ - A hash table from an (abbreviated) object ID to a position into
+ the above list
+
Start Object
------------
@@ -35,6 +38,18 @@ objects are sorted as if they were commit objects with a single
parent, the object they tag.
+Included object hash table
+--------------------------
+
+This table is used to quickly determine whether an object exists in
+the cache without scanning the entire topological list.
+
+Entries in the object hash table can be shortened, e.g. to 3 or 4
+bytes; they need only be long enough to avoid collisions among the
+objects which exist in the list. Any match must be confirmed by
+checking the full SHA-1 in the topological list.
+
+
Use Cases
---------
In this section, the key functions and operations that this index is
@@ -131,7 +146,8 @@ passed, or any 'uninteresting' objects were passed.
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
-discovered to the topological object list.
+discovered to the topological object list. It must also build a hash
+table of object IDs and emit it at the end.
receive-pack/pack-objects
@@ -186,9 +202,9 @@ emitted, the delta from the packfile is re-used. If a loop is
detected or the delta base is not in the returned set of objects, then
the delta is first resolved.
-This implies that the list of objects is first loaded into a hash
-table prior to returning any objects; however this is probably
-acceptable as the entire list is in one stream and will load quickly.
+This implies that each delta base must be looked up in the on-disk
+hash table as they are written, which is both low impact and memory
+efficient.
For later fetches, the revision cache is not appropriate as they will
have 'uninteresting' objects set.
--
debian.1.5.6.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
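
The abbreviated-ID lookup described in this patch (short keys in the
hash table, with every match confirmed against the full SHA-1 held in
the topological list) can be sketched as below. This is a toy model,
not git code; the structure names, the 4-byte abbreviation, and the
fixed table size are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_SZ 20     /* SHA-1 */
#define ABBREV 4       /* abbreviated key length, e.g. 4 bytes */
#define TABLE_SZ 1024  /* toy table size; a real format would size to fit */

struct topo_entry {
    unsigned char sha1[HASH_SZ]; /* full ID, kept in the topological list */
};

struct slot {
    unsigned char key[ABBREV];   /* abbreviated object ID */
    int pos;                     /* position into the list, -1 = empty */
};

static unsigned bucket(const unsigned char *sha1)
{
    /* the leading bytes of a SHA-1 are already uniformly distributed */
    uint32_t h;
    memcpy(&h, sha1, sizeof(h));
    return h % TABLE_SZ;
}

/* insert with linear probing */
static void table_add(struct slot *table, const unsigned char *sha1, int pos)
{
    unsigned i = bucket(sha1);
    while (table[i].pos != -1)
        i = (i + 1) % TABLE_SZ;
    memcpy(table[i].key, sha1, ABBREV);
    table[i].pos = pos;
}

/* returns position into the topological list, or -1; every abbreviated
 * match is confirmed against the full SHA-1 stored in the list */
static int table_lookup(const struct slot *table,
                        const struct topo_entry *list,
                        const unsigned char *sha1)
{
    unsigned i = bucket(sha1);
    while (table[i].pos != -1) {
        if (!memcmp(table[i].key, sha1, ABBREV) &&
            !memcmp(list[table[i].pos].sha1, sha1, HASH_SZ))
            return table[i].pos;
        i = (i + 1) % TABLE_SZ;
    }
    return -1;
}
```

With 3- or 4-byte keys the table stays compact, and collisions are
harmless because the full ID in the list is always checked before a
hit is returned.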
* [PATCH 5/7] revision cache: maps of 'new' objects
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 4/7] rev-cache: allow multiple 'start' objects per index Sam Vilain
` (6 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
When making a pack, it is useful to know if an object is already
reachable from any of the objects that the receiving party is known to
have; it allows the object to be excluded from the pack. Allow
optional bitmaps for 'start' objects which permit storage of this
information.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 24 +++++++++++++++++++++++-
1 files changed, 23 insertions(+), 1 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 198c33a..e0adb26 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -28,6 +28,11 @@ A revision cache contains;
space. A count of how many bits are '1' is included. A separate
bitmap is included for the 'start' objects.
+ - Optionally, for any 'end' object, a 'newness' bitmap indicating
+ which of the objects in the object list are reachable from that
+ 'end' object, but not reachable from the 'start' objects reachable
+ from that 'end' object.
+
Start Objects and End Objects
-----------------------------
@@ -120,6 +125,9 @@ generating a newer one which covers objects created after it. This
approach will be able to answer many questions, but not topological
ordering between objects which appeared in different caches.
+Stacking revision caches is essential for being able to generate the
+'newness' bitmaps efficiently.
+
Revision Walker
~~~~~~~~~~~~~~~
@@ -144,7 +152,7 @@ required. Some API exists to do this, and the return value from
'rev_cache_ok' is true if suitable caches were found for the
information required, and false if it would require a graph traversal:
- rev_cache_options( ordered? )
+ rev_cache_options( ordered?, new? )
rev_cache_ok( ) : Bool
The rev_cache_ok() function must open all available revision caches,
@@ -163,6 +171,11 @@ If any 'uninteresting' objects were passed, then the return value is
true if the suitability function passes for all of the revision caches
which are used.
+If the 'new' flag is set to true, then the return value is only true
+if the 'uninteresting' objects also have suitable revision caches
+leading back to the start of history (or last 'shallow' point), or the
+'newness' bitmaps are present for all of the 'start' references used.
+
Returning objects
^^^^^^^^^^^^^^^^^
@@ -208,6 +221,10 @@ Then, it must repeat the topological walk for each of the 'start'
objects, looking up each object in the contents hash for a sequence
number, set a bit in the bitmap for that 'start' object, and finally
RLE compress it.
+If there are suitable other revision caches available, then the
+'newness' bitmaps are built by checking with the stacked revision
+caches whether the object exists and is reachable from the 'start'
+objects which connect to those stacked caches.
receive-pack/pack-objects
@@ -250,6 +267,11 @@ If multiple 'start' objects were used, then the count can be returned
by decompressing and OR'ing the bitmaps together, counting the total
number of '1's in the bitmap.
+If the 'newness' bitmaps are available, then these can be used to
+avoid sending objects which the other end will already have.
+When applicable, the 'newness' bitmaps are masked against the
+combined reachability bitmaps to derive a shorter list of objects.
+
If multiple revision caches were used, but not in single file, then
the number of objects can be obtained by building a hash table of all
the objects in all of the caches, and returning the number of hash
--
debian.1.5.6.1
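
The masking described in this patch, where the 'newness' bitmaps are
combined with the reachability bitmaps to shorten the list of objects
to send, reduces to a bitwise OR followed by an AND and a popcount. A
minimal sketch with uncompressed toy bitmaps (the on-disk bitmaps
would be RLE-compressed, and the function name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 4   /* toy bitmap length: 4 * 64 = 256 objects */

/* mask the combined reachability bitmaps of two 'start' objects
 * against the 'newness' bitmap, returning how many objects remain
 * to be sent */
static int objects_to_send(const uint64_t *reach_a,
                           const uint64_t *reach_b,
                           const uint64_t *newness,
                           uint64_t *out)
{
    int count = 0;
    int i;
    for (i = 0; i < WORDS; i++) {
        out[i] = (reach_a[i] | reach_b[i]) & newness[i];
        count += __builtin_popcountll(out[i]);
    }
    return count;
}
```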
* [PATCH 4/7] rev-cache: allow multiple 'start' objects per index
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup Sam Vilain
` (5 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
We can re-use the index for multiple 'start' objects sharing the same
'end' objects, using one list and a reachability bitmap for each
'start' object.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 106 +++++++++++++++++++---------
1 files changed, 73 insertions(+), 33 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 759d78d..198c33a 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -6,12 +6,13 @@ reachability operations to be answered quickly.
A revision cache contains;
- - A 'start' object (ie, 'interesting' object ID)
+ - A list of 'start' objects (ie, 'interesting' object IDs)
- A list of 'end' objects, which may be empty (ie, 'uninteresting'
object IDs)
- - A list of objects which are referred to by the 'start' object
+ - A list of objects which are reachable from any 'start' object
+ without crossing 'end' objects, including:
* position when sorted in --date-order
@@ -20,13 +21,25 @@ A revision cache contains;
- A hash table from an (abbreviated) object ID to a position into
the above list
+ - For each 'end' object, a bit sequence indicating which of the
+ objects in the object list are reachable from that 'end' object.
+ These bitmaps are stored in the same order as the object list and
+ then RLE-compressed, so in many cases will not take much on-disk
+ space. A count of how many bits are '1' is included. A separate
+ bitmap is included for the 'start' objects.
-Start Objects and End Object
-----------------------------
+
+Start Objects and End Objects
+-----------------------------
The 'start' object, and 'end' objects are the identifying key of the
revision cache.
+Revision caches for multiple 'start' objects can re-use the same
+object list, conserving space and adding flexibility to the index. As
+such, a single revision cache is logically considered to be multiple
+indexes, unless it is of benefit to consider it a single index.
+
The 'end' objects must be reachable from the 'start' objects, and none
of the 'end' objects may be reachable from other 'end' objects.
@@ -71,27 +84,36 @@ Determining Cache Suitability - rev_cache_suitable()
This is an auxiliary function used by the other use cases.
A revision cache is suitable whenever the walker encounters the single
-object which is the 'end' object of the revision cache, and none of
-the 'uninteresting' revisions to the walker are in the revision cache.
+object which is any of the 'end' objects of the revision cache, and
+none of the 'uninteresting' revisions to the walker are in the
+revision cache.
The function is:
rev_cache_suitable( rev_cache, interesting, uninteresting [] )
The check is simple and fast - it first compares the 'interesting'
-object to the 'start' object in the revision cache, then the
+object to the 'start' object list in the revision cache, then the
'uninteresting' objects that the walker is using are looked up in the
contents hash table.
-Only if none of them are found is the revision cache suitable.
+If they are found, then the reachability bitmaps for the matching
+'start' objects are consulted to determine whether the object was
+actually reachable from those 'start' objects.
+Only if none of them are reachable is the revision cache suitable.
Revision cache stacking
^^^^^^^^^^^^^^^^^^^^^^^
-If there are 'end' objects in the revision cache used, which do not
+If there are 'end' objects in the revision cache used, which are
+reachable from the 'start' objects in use, and do not
match the 'uninteresting' objects in the walker, the function may
recurse, looking for other revision caches which have the unwanted
-'end' object as their 'start' object.
+'end' object in their 'start' object list.
+
+Supporting multiple 'start' objects allows for more instances of
+'single file' stacking, where at no point is the stack wider than one
+revision cache.
Stacking revision caches is a way to re-use an older index by just
generating a newer one which covers objects created after it. This
@@ -126,18 +148,20 @@ information required, and false if it would require a graph traversal:
rev_cache_ok( ) : Bool
The rev_cache_ok() function must open all available revision caches,
-and see if their 'interesting' object matches the single 'interesting'
-object passed to rev_cache_add(). If it matches, it returns true. If
-multiple 'interesting' objects were specified and 'ordered' is true,
-then the function returns false.
+and see if all of the 'interesting' objects passed to rev_cache_add()
+can be found as 'start' objects in available revision caches.
+If they cannot, the function returns false.
+If multiple 'interesting' objects were specified and 'ordered' is
+true, then the function returns false, unless all of the 'interesting'
+objects were found in a single revision cache and no stacking is
+required.
-If the ordering flag was set to false, then all of the 'interesting'
-objects must be found in separate revision caches for the function to
-return true.
+If the ordering flag was set to false, then this restriction is
+relaxed. Objects may be found in separate revision caches.
If any 'uninteresting' objects were passed, then the return value is
-true if the suitability function passes for the revision caches which
-are used.
+true if the suitability function passes for all of the revision caches
+which are used.
Returning objects
@@ -176,13 +200,14 @@ an 'interesting' object added, then call:
rev_cache_create( )
-This function will not work, if multiple 'interesting' objects are
-passed.
-
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
-discovered to the topological object list. It must also build a hash
-table of object IDs and emit it at the end.
+discovered to the topological object list.
+It must also build a hash table of object IDs and emit it at the end.
+Then, it must repeat the topological walk for each of the 'start'
+objects, looking up each object in the contents hash for a sequence
+number, set a bit in the bitmap for that 'start' object, and finally
+RLE compress it.
receive-pack/pack-objects
@@ -215,13 +240,23 @@ Returning object counts
If the revision cache was suitable, eg for a complete fetch, then the
number of objects can be obtained by counting the size of the object
list.
-
-If multiple revision caches were used for multiple 'interesting'
-objects, then the number of objects can be obtained by building a hash
-table of all the objects in all of the caches, and returning the
-number of hash table entries. In the single-file stacked revision
-cache case, the total can be found by adding up the total lengths of
-the object lists.
+If the revision cache was entirely suitable, with all 'start' objects
+matching all 'interesting' objects in the walker, then the total
+number of objects in the cache is the number of objects to be
+returned.
+If a single 'start' object was used, then the stored count of '1's in
+its bitmap can be returned directly.
+If multiple 'start' objects were used, then the count can be returned
+by decompressing and OR'ing the bitmaps together, counting the total
+number of '1's in the bitmap.
+
+If multiple revision caches were used, but not in single file, then
+the number of objects can be obtained by building a hash table of all
+the objects in all of the caches, and returning the number of hash
+table entries.
+In the single-file stacked revision cache case, the total can be found
+by adding up the total lengths of the object lists that apply to each
+single-file section.
If 'uninteresting' objects were passed, the caches used must be
suitable according to the cache suitability function.
@@ -241,8 +276,13 @@ detected or the delta base is not in the returned set of objects, then
the delta is first resolved.
This implies that each delta base must be looked up in the on-disk
-hash table as they are written, which is both low impact and memory
-efficient.
+hash table as they are written to the network, which is both low
+impact and memory efficient.
+Where not all of the 'start' objects are used, the bitmaps for the
+'start' objects in use will need to be kept in memory to supplement
+the results of this hash lookup.
+This is a small extra indirection likely to have only a minimal
+performance penalty.
For later fetches, the revision cache will only be appropriate if the
'have' objects sent by the remote are not found in the revision
--
debian.1.5.6.1
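
The reachability bitmaps this patch introduces are stored
RLE-compressed, together with a count of how many bits are '1'. The
exact encoding is not specified here, so the following is only a toy
byte-oriented scheme (runs alternate starting with a '0' run, one
length byte per run) to illustrate the idea:

```c
#include <assert.h>
#include <stddef.h>

/* encode a bitmap (one byte per bit, 0 or 1) into alternating run
 * lengths, starting with a '0' run; also count the '1' bits, as the
 * on-disk format stores that count alongside the bitmap */
static size_t rle_encode(const unsigned char *bits, size_t n,
                         unsigned char *out, int *ones)
{
    size_t used = 0;
    size_t i = 0;
    unsigned cur = 0;           /* first run counts '0' bits */
    *ones = 0;
    while (i < n) {
        unsigned char run = 0;
        while (i < n && bits[i] == cur && run < 255) {
            if (cur)
                (*ones)++;
            run++;
            i++;
        }
        out[used++] = run;      /* a run may be length 0 after a cap */
        cur ^= 1;
    }
    return used;
}

/* expand the run lengths back into the original bitmap */
static void rle_decode(const unsigned char *rle, size_t used,
                       unsigned char *bits)
{
    unsigned cur = 0;
    size_t pos = 0;
    size_t r;
    unsigned char k;
    for (r = 0; r < used; r++) {
        for (k = 0; k < rle[r]; k++)
            bits[pos++] = cur;
        cur ^= 1;
    }
}
```

Long runs of '0' bits (objects reachable from none of the 'end'
objects in question) collapse to single bytes, which is why such
bitmaps will often not take much on-disk space.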
* [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (2 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 2/7] rev-cache: add on-disk format for fast reachability lookup Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-05 19:28 ` Nicolas Pitre
2009-06-04 14:05 ` [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups Sam Vilain
` (3 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
The very first thing that we want out of the revision cache is to be
able to go from a commit to all of its referred objects. Define a
revision cache that includes just that.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 260 ++++++++++++++++++++++++++++
1 files changed, 260 insertions(+), 0 deletions(-)
create mode 100644 Documentation/technical/revision-cache.txt
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
new file mode 100644
index 0000000..cc18535
--- /dev/null
+++ b/Documentation/technical/revision-cache.txt
@@ -0,0 +1,260 @@
+Revision Cache Format
+=====================
+
+The revision cache is an on-disk format which allows for certain graph
+reachability operations to be answered quickly.
+
+A revision cache contains;
+
+ - A 'start' object (ie, 'interesting' object ID)
+
+ - A list of objects which are referred to by the 'start' object
+
+ * position when sorted in --date-order
+
+ * object ID
+
+
+Start Object
+------------
+
+The 'start' object is the identifying key of the revision cache.
+
+
+Topological contents list
+-------------------------
+
+This list has fixed-length records, so the topological position into
+the list does not need to be stored in each record - it is implicit
+from the offset.
+
+--date-order is used as it is the strictest sort order available, but
+this still only specifies an ordering for commit objects. Other
+objects will appear after the object which first refers to them. Tag
+objects are sorted as if they were commit objects with a single
+parent, the object they tag.
+
+
+Use Cases
+---------
+In this section, the key functions and operations that this index is
+designed to answer are explored. For each, their efficiency is
+considered in terms of what must be carried out to calculate the
+answer.
+
+
+Determining Cache Suitability - rev_cache_suitable()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is an auxiliary function used by the other use cases, when
+
+The function is:
+
+ rev_cache_suitable( rev_cache, object )
+
+The check is simple and fast - it just compares the object to the
+'start' object in the revision cache.
+
+
+Revision Walker
+~~~~~~~~~~~~~~~
+
+The revision walker is the main user of this cache; there is the
+generic function of revision walking, as well as commands that want
+specific information which they normally derive from the revision
+walker output.
+
+
+Setting up the walker
+^^^^^^^^^^^^^^^^^^^^^
+
+The functions for this are (intentionally resembling the current
+revision walker API):
+
+ rev_cache_init()
+ rev_cache_add( interesting?, oid )
+
+As well as this setup, it is necessary to specify which options are
+required. Some API exists to do this, and the return value from
+'rev_cache_ok' is true if suitable caches were found for the
+information required, and false if it would require a graph traversal:
+
+ rev_cache_options( ordered? )
+ rev_cache_ok( ) : Bool
+
+The rev_cache_ok() function must open all available revision caches,
+and see if their 'interesting' object matches the single 'interesting'
+object passed to rev_cache_add(). If it matches, it returns true. If
+multiple 'interesting' objects were specified and 'ordered' is true,
+then the function returns false.
+
+If the ordering flag was set to false, then all of the 'interesting'
+objects must be found in separate revision caches.
+
+If any 'uninteresting' objects were passed, then the return value is
+always false.
+
+
+Returning objects
+^^^^^^^^^^^^^^^^^
+
+The 'rev_cache_fetch()' iterator returns entries from the topological
+list.
+
+ rev_cache_fetch() : oid
+
+If returning objects from a single revision cache, it opens the
+
+
+Accelerating in-progress walking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is possible that during a revision iteration operation, a revision
+is discovered that may mean the rest of the revision walking can be
+achieved faster.
+
+In practice, this is likely to be implemented by making in-core cache
+entries for objects with revision caches prior to walking; then when
+encountered the special action can be taken.
+
+
+Creating Revision Caches
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the revision cache is setup via rev_cache_init() and
+an 'interesting' object added, then call:
+
+ rev_cache_create( )
+
+This function will not work, if multiple 'interesting' objects are
+passed, or any 'uninteresting' objects were passed.
+
+This function must revision walk the commit graph, sorting in
+--date-order along the way, and may emit revisions as they are
+discovered to the topological object list.
+
+
+receive-pack/pack-objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pack-objects, when called to serve a clone or fetch, is a special
+user of the revision cache; it has the extra requirements of wanting
+to know;
+
+* how many objects are between 'interesting' and 'uninteresting'
+ objects, at the beginning of the run, to emit the pack header
+
+* for checking re-usability of deltas, whether the delta base object
+ in the pack is in the received set of objects.
+
+* for 'thin' pack generation, whether the delta base object is in the
+ received set of objects, -or- reachable from any 'uninteresting'
+ objects
+
+* for 'shallow' clone, whether the delta base object is reachable
+ without passing any of the 'uninteresting' objects
+
+The aim is for 'pack-objects' to be able to start returning objects
+immediately in the case where a suitable revision cache is returned,
+without waiting for revision counting or repacking.
+
+
+Returning object counts
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If the revision cache was suitable, eg for a complete fetch, then the
+number of objects can be obtained by counting the size of the object
+list.
+
+If multiple revision caches were used for multiple 'interesting'
+objects, then the number of objects can be obtained by building a hash
+table of all the objects in all of the caches, and returning the
+number of hash table entries.
+
+If 'uninteresting' objects were passed, no cache can be suitable.
+
+
+Re-using deltas
+^^^^^^^^^^^^^^^
+
+This applies to the 'Compressing Objects' phase, previously known as
+the 'Deltafying Objects' phase. Instead of searching for deltas, if
+we can be sure that the existing delta can be resolved on the remote,
+then we can re-use it.
+
+For the initial clone, this operation is simple - as objects are
+emitted, the delta from the packfile is re-used. If a loop is
+detected or the delta base is not in the returned set of objects, then
+the delta is first resolved.
+
+This implies that the list of objects is first loaded into a hash
+table prior to returning any objects; however this is probably
+acceptable as the entire list is in one stream and will load quickly.
+
+For later fetches, the revision cache is not appropriate as they will
+have 'uninteresting' objects set.
+
+
+Re-using deltas - 'thin' pack
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This case is much like the normal re-using deltas case, except there
+is the complete set of objects that the remote has claimed to have to
+consider.
+
+The cache is not yet sophisticated enough to assist with this use
+case.
+
+
+'shallow' clone considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For enumerating objects, the list of objects between the
+'uninteresting' and the 'shallow' commits must first be enumerated,
+and then subtracted from the list of objects between the
+'interesting' and 'uninteresting' commits.
+
+For re-using deltas for 'thin' packs, the list of objects between
+'uninteresting' and 'shallow' commits are enumerated and marked as
+valid for delta bases.
+
+The cache is not yet sophisticated enough to assist with this case.
+
+
+creating bundles
+~~~~~~~~~~~~~~~~
+
+So long as a bundle has no 'uninteresting' commits, then the revision
+cache is completely appropriate; with the length of the object list it
+can write a header and continue with the 'pack-objects' use case.
+
+
+slicing bundles (mirror-sync)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For 'slicing' bundles, a deterministic method is required for chopping
+up the list of objects, which can be calculated by any node which has
+the 'interesting' commits.
+
+To determine the objects which fall within a given slice, the object
+list must be enumerated and then divided evenly. As the compressed
+size on one node cannot be reproduced on another node, the
+uncompressed size of the object is used instead, with the hope that
+slices will generally end up of roughly even size once compressed.
+
+To calculate the object boundaries, the list of objects in
+--date-order must first be tie-broken, e.g. for commits with the same
+commit date and topological order. If slices are allowed to be
+sub-commit level, then objects between commits (ie blobs and trees)
+are also sorted. Once this deterministic list has been built, then
+all of the objects must be accessed to determine their length. The
+objects which start within a given range are the ones in the slice.
+
+For re-using deltas in sliced bundles, the delta base is looked up in
+the deterministic list. If it has an earlier sequence, then the delta
+can be safely re-used. If it has a later sequence, then the delta
+must be resolved and then the object re-deltified/compressed, using
+only objects with an earlier sequence (or in the returned pack) as a
+base.
+
+This requirement is so that a collection of 'sliced' bundles can
+successfully re-assemble, while still allowing them to be 'thin'.
--
debian.1.5.6.1
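
The slicing rule at the end of this document (divide the deterministic
object list by cumulative uncompressed size; an object belongs to the
slice its starting offset falls into) can be sketched as follows. The
function name and the fixed slice width are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* returns the 0-based slice index for object `idx`, given the
 * per-object uncompressed sizes of the deterministic list and a
 * fixed slice width in bytes; the object belongs to the slice in
 * which its starting offset falls */
static int slice_of(const size_t *sizes, size_t idx, size_t slice_bytes)
{
    size_t start = 0;
    size_t i;
    for (i = 0; i < idx; i++)
        start += sizes[i];      /* offset at which object idx starts */
    return (int)(start / slice_bytes);
}
```

Because the uncompressed sizes are the same on every node, any node
holding the 'interesting' commits computes identical boundaries, even
though compressed sizes would differ between nodes.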
* [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (3 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 7/7] revision cache: be even stricter with sort order Sam Vilain
` (2 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
If we want to be able to accelerate lookups which contain
'uninteresting' revisions, we must be able to specify what those
revisions are.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 85 +++++++++++++++++++++-------
1 files changed, 65 insertions(+), 20 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 8349cfe..759d78d 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -8,6 +8,9 @@ A revision cache contains;
- A 'start' object (ie, 'interesting' object ID)
+ - A list of 'end' objects, which may be empty (ie, 'uninteresting'
+ object IDs)
+
- A list of objects which are referred to by the 'start' object
* position when sorted in --date-order
@@ -18,10 +21,14 @@ A revision cache contains;
the above list
-Start Object
-------------
+Start Objects and End Object
+----------------------------
+
+The 'start' object, and 'end' objects are the identifying key of the
+revision cache.
-The 'start' object is the identifying key of the revision cache.
+The 'end' objects must be reachable from the 'start' objects, and none
+of the 'end' objects may be reachable from other 'end' objects.
Topological contents list
@@ -61,14 +68,35 @@ answer.
Determining Cache Suitability - rev_cache_suitable()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This is an auxiliary function used by the other use cases, when
+This is an auxiliary function used by the other use cases.
+
+A revision cache is suitable whenever the walker encounters the single
+object which is the 'end' object of the revision cache, and none of
+the 'uninteresting' revisions to the walker are in the revision cache.
The function is:
- rev_cache_suitable( rev_cache, object )
+ rev_cache_suitable( rev_cache, interesting, uninteresting [] )
+
+The check is simple and fast - it first compares the 'interesting'
+object to the 'start' object in the revision cache, then the
+'uninteresting' objects that the walker is using are looked up in the
+contents hash table.
+Only if none of them are found is the revision cache suitable.
+
+
+Revision cache stacking
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If there are 'end' objects in the revision cache used, which do not
+match the 'uninteresting' objects in the walker, the function may
+recurse, looking for other revision caches which have the unwanted
+'end' object as their 'start' object.
-The check is simple and fast - it just compares the object to the
-'start' object in the revision cache.
+Stacking revision caches is a way to re-use an older index by just
+generating a newer one which covers objects created after it. This
+approach will be able to answer many questions, but not topological
+ordering between objects which appeared in different caches.
Revision Walker
@@ -104,10 +132,12 @@ multiple 'interesting' objects were specified and 'ordered' is true,
then the function returns false.
If the ordering flag was set to false, then all of the 'interesting'
-objects must be found in separate revision caches.
+objects must be found in separate revision caches for the function to
+return true.
If any 'uninteresting' objects were passed, then the return value is
-always false.
+true if the suitability function passes for the revision caches which
+are used.
Returning objects
@@ -118,7 +148,12 @@ list.
rev_cache_fetch() : oid
-If returning objects from a single revision cache, it opens the
+If returning objects from a single or stacked set of revision caches,
+it opens the caches and returns objects from them. If combining
+results from multiple caches (where topological ordering of returned
+objects is not important) then an in-memory object hash table must be
+built, or the on-disk tables for all of the caches consulted along the
+way.
Accelerating in-progress walking
@@ -142,7 +177,7 @@ an 'interesting' object added, then call:
rev_cache_create( )
This function will not work, if multiple 'interesting' objects are
-passed, or any 'uninteresting' objects were passed.
+passed.
This function must revision walk the commit graph, sorting in
--date-order along the way, and may emit revisions as they are
@@ -184,9 +219,12 @@ list.
If multiple revision caches were used for multiple 'interesting'
objects, then the number of objects can be obtained by building a hash
table of all the objects in all of the caches, and returning the
-number of hash table entries.
+number of hash table entries. In the single-file stacked revision
+cache case, the total can be found by adding up the total lengths of
+the object lists.
-If 'uninteresting' objects were passed, no cache can be suitable.
+If 'uninteresting' objects were passed, the caches used must be
+suitable according to the cache suitability function.
Re-using deltas
@@ -206,8 +244,9 @@ This implies that each delta base must be looked up in the on-disk
hash table as they are written, which is both low impact and memory
efficient.
-For later fetches, the revision cache is not appropriate as they will
-have 'uninteresting' objects set.
+For later fetches, the revision cache will only be appropriate if the
+'have' objects sent by the remote are not found in the revision
+caches' object hash table.
Re-using deltas - 'thin' pack
@@ -217,8 +256,13 @@ This case is much like the normal re-using deltas case, except there
is the complete set of objects that the remote has claimed to have to
consider.
-The cache is not yet sophisticated enough to assist with this use
-case.
+The revision cache can assist with this case only if the 'have'
+objects sent by the remote match the 'uninteresting' objects in the
+revision cache (or are not found in the object list), and another
+revision cache exists which uses that 'uninteresting' object as its
+'start' object. In this case, the lower stacked revision cache serves
+as a handy lookup table as to whether the object is known to the
+remote.
'shallow' clone considerations
@@ -239,9 +283,10 @@ The cache is not yet sophisticated enough to assist with this case.
creating bundles
~~~~~~~~~~~~~~~~
-So long as a bundle has no 'uninteresting' commits, then the revision
-cache is completely appropriate; with the length of the object list it
-can write a header and continue with the 'pack-objects' use case.
+So long as a bundle's 'uninteresting' commits match that of the
+revision cache, then the revision cache is completely appropriate;
+with the length of the object list it can write a header and continue
+with the 'pack-objects' use case.
slicing bundles (mirror-sync)
--
debian.1.5.6.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 7/7] revision cache: be even stricter with sort order
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (4 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 3/7] rev-cache: add 'end' objects for caching 'uninteresting' lookups Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-04 14:05 ` [PATCH 6/7] revision cache: allow foreign 'start' commits Sam Vilain
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
For sliced bundles, there must be no ambiguity at all about the sort
order. Define how ties are broken, and describe why this affects
mirror-sync.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 79 +++++++++++++++++++---------
1 files changed, 54 insertions(+), 25 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index 0cd7b08..7aaab38 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -14,7 +14,7 @@ A revision cache contains;
- A list of objects which are referred to by any 'start' object, but
not by crossing 'end' objects, including:
- * position when sorted in --date-order
+ * position when sorted in --date-order, with ties broken by SHA1
* object ID
@@ -33,9 +33,15 @@ A revision cache contains;
'end' object, but not reachable from the 'start' objects reachable
from that 'end' object.
- - a list of 'foreign end' objects, for which not all reachable
- objects are in the object list, but can have reachability or a
- 'newness' bitmap.
+ - a list of 'foreign start' objects, for which not all reachable
+ objects are in the object list, but can have reachability and
+ 'newness' bitmaps.
+
+ - a list of object types for each object, in the order they appear
+ in the object list.
+
+ - a list of object lengths for each object, in the order they appear
+ in the object list.
Start Objects and End Objects
@@ -64,7 +70,9 @@ from the offset.
this still only specifies an ordering for commit objects. Other
objects will appear after the object which first refers to them. Tag
objects are sorted as if they were commit objects with a single
-parent, the object they tag.
+parent, the object they tag. No object is allowed to appear before
+another object which refers to it, unless it is a 'start' object.
+Ties in the order are broken by SHA1.
Included object hash table
@@ -231,6 +239,23 @@ caches whether the object exists and is reachable from the 'start'
objects which connect to those stacked caches.
+Extending Revision Caches
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ rev_cache_extend( rev_cache, quasi_interesting[] )
+
+The function works on an existing revision cache, and calculates just
+the 'reachability' and 'newness' bitmaps.
+
+These are calculated by marking the quasi_interesting[] commits as
+interesting, the 'end' objects in the revision cache as
+'uninteresting', and walking.
+Objects which are encountered which exist in the revision cache are
+converted to bits in the emitted bitmap, and the 'newness' bitmap is
+built as with the normal case once the reachable 'end' objects are
+known.
+
+
receive-pack/pack-objects
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -360,28 +385,32 @@ slicing bundles (mirror-sync)
For 'slicing' bundles, a deterministic method is required for chopping
up the list of objects, which can be calculated by any node which has
-the 'interesting' commits.
-
-To determine the objects which fall within a given slice, the object
-list must be enumerated and then divided evenly. As the compressed
-size on one node cannot be reproduced on another node, the
-uncompressed size of the object is used instead, with the hope that
-slices will generally end up of roughly even size once compressed.
-
-To calculate the object boundaries, the list of objects in
---date-order must first be tie-broken, eg for commits with the same
-commitdate and topological order. If slices are allowed to be
-sub-commit level, then objects between commits (ie blobs and trees)
-are also sorted. Once this deterministic list has been built, then
-all of the objects must be accessed to determine their length. The
-objects which start within a given range are the ones in the slice.
+the 'interesting' commits. As topological order is important, bundles
+must only consist of objects from a single revision cache to be sliced
+in this manner.
+
+To determine the objects which fall within a given slice, the new
+object list in the bundle must be enumerated (such as by decompressing
+and masking bitmaps) and then divided evenly, using the (masked)
+object lengths list. As the compressed size on one node cannot be
+reproduced on another node, the uncompressed size of the object is
+used instead, with the hope that slices will generally end up of
+roughly even size once compressed.
+
+The objects which start within a given range in this list of objects
+are the ones in the slice. If slices are not allowed at the
+sub-commit level, then the requirement to tie-break ordering of
+non-commit/tag objects is relaxed, however the list of per-object
+types is required to know where the boundaries are.
For re-using deltas in sliced bundles, the delta base is looked up in
-the deterministic list. If it has an earlier sequence, then the delta
-can be safely re-used. If it has a later sequence, then the delta
-must be resolved and then the object re-deltified/compressed, using
-only objects with an earlier sequence (or in the returned pack) as a
-base.
+the deterministic list of objects. If it has an earlier sequence,
+then the delta can be safely re-used. If it has a later sequence,
+then the delta must be resolved and then the object
+re-deltified/compressed, using only objects with an earlier sequence
+(or in the returned pack) as a base.
This requirement is so that a collection of 'sliced' bundles can
successfully re-assemble, while still allowing them to be 'thin'.
+Without thin packs, download spreading from multiple mirrors will
+result in a much larger download.
--
debian.1.5.6.1
* [PATCH 6/7] revision cache: allow foreign 'start' commits
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (5 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 7/7] revision cache: be even stricter with sort order Sam Vilain
@ 2009-06-04 14:05 ` Sam Vilain
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
7 siblings, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:05 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
Usually the 'start' commits are 'interesting' commits in a revision
query; however, with the 'newness' bitmaps we can also use them to
mask out objects, ie 'uninteresting' commits. An index which was
useless, due to an unknown object existing in the 'uninteresting'
commit list, can be converted to a useful one by building the bitmaps
of 'reachable' and 'new' objects for that commit - without having to
insert new objects in the middle of the list, which would require
rebuilding the entire index. So there is a use case for storing
information on 'start' objects which aren't actually in the object
list. Permit this case.
This is currently a work in progress; the description of how this
affects the use cases is not complete yet.
Signed-off-by: Sam Vilain <sam@vilain.net>
---
Documentation/technical/revision-cache.txt | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
index e0adb26..0cd7b08 100644
--- a/Documentation/technical/revision-cache.txt
+++ b/Documentation/technical/revision-cache.txt
@@ -33,6 +33,10 @@ A revision cache contains;
'end' object, but not reachable from the 'start' objects reachable
from that 'end' object.
+ - a list of 'foreign end' objects, for which not all reachable
+ objects are in the object list, but can have reachability or a
+ 'newness' bitmap.
+
Start Objects and End Objects
-----------------------------
--
debian.1.5.6.1
* [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
@ 2009-06-04 14:18 Sam Vilain
2009-06-04 14:05 ` [PATCH 5/7] revision cache: maps of 'new' objects Sam Vilain
` (7 more replies)
0 siblings, 8 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-04 14:18 UTC (permalink / raw)
To: git
Cc: Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
This patch series describes the structure of the object list cache
on-disk format. It is built successively from a very simple design -
just an object list - to a version that allows for as many rev-list
operations to be accelerated as possible, and potentially immediate
startup of full clone operations in the common case; ie skipping the
"Counting Objects" and "Compressing Objects" phases once a matching
index is found.
The plan is to implement each step incrementally, with a test-*.c
file along the way which exercises the API provided by the revision
cache. While the revision cache format will change along the way, this
will not require an index format deprecation cycle, as integration with
the rest of git will not happen until the format is settled.
The plan is to complete one of these API milestones per week.
When complete, each commit will contain tests for the level of cache
that it delivers. Later milestones include joining the dots -
integrating with the 'rev-list' machinery and most importantly,
'pack-objects'.
Errata: the 'object list' and 'contents hash' will probably be
re-worked to keep a separate SHA-1 and topological index list, to
re-use existing fan-out code. This will be incorporated into the next
version.
Sam Vilain (7):
revision-cache: define revision cache as simple list of revisions
rev-cache: add on-disk format for fast reachability lookup
rev-cache: add 'end' objects for caching 'uninteresting' lookups
rev-cache: allow multiple 'start' objects per index
revision cache: maps of 'new' objects
revision cache: allow foreign 'start' commits
revision cache: be even stricter with sort order
Documentation/technical/revision-cache.txt | 416 ++++++++++++++++++++++++++++
1 files changed, 416 insertions(+), 0 deletions(-)
create mode 100644 Documentation/technical/revision-cache.txt
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-04 14:18 [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Sam Vilain
` (6 preceding siblings ...)
2009-06-04 14:05 ` [PATCH 6/7] revision cache: allow foreign 'start' commits Sam Vilain
@ 2009-06-05 16:56 ` Jakub Narebski
2009-06-05 20:58 ` Nicolas Pitre
2009-06-07 4:01 ` Sam Vilain
7 siblings, 2 replies; 14+ messages in thread
From: Jakub Narebski @ 2009-06-05 16:56 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
Sam Vilain <sam@vilain.net> writes:
> This patch series describes the structure of the object list cache
> on-disk format. It is built successively from a very simple design -
> just an object list - to a version that allows for as many rev-list
> operations to be accelerated as possible, and potentially immediate
> startup of full clone operations in the common case; ie skipping the
> "Counting Objects" and "Compressing Objects" phases once a matching
> index is found.
>
> The plan is to implement each step incrementally, with a test-*.c
> file along the way which exercises the API provided by the revision
> cache. While the revision cache format will change along the way, this
> will not require an index format deprecation cycle, as integration with
> the rest of git will not happen until the format is settled.
>
> The plan is to complete one of these API milestones per week.
> When complete, each commit will contain tests for the level of cache
> that it delivers. Later milestones include joining the dots -
> integrating with the 'rev-list' machinery and most importantly,
> 'pack-objects'.
I like this sharing of not only completed code, but also plans,
designs (and status reports) with the Git development community
(i.e. the git mailing list). I like this very much.
I'd like to ask if there are any results of profiling git server
(git-daemon) code: how much time is spent on the object enumeration
this GSoC project tries to make faster by means of caching?
Are there prepared benchmarks and tests to check if the code gives
correct results, and to measure improvements brought by caching?
Would it be possible to get some real-life statistics of git-daemon
usage, so that you optimize against real scenarios?
I wish you good work on git-daemon caching...
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-04 14:05 ` [PATCH 1/7] revision-cache: define revision cache as simple list of revisions Sam Vilain
@ 2009-06-05 19:28 ` Nicolas Pitre
2009-06-07 3:49 ` Sam Vilain
0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-05 19:28 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 5 Jun 2009, Sam Vilain wrote:
> The very first thing that we want out of the revision cache is to be
> able to go from a commit to all of its referred objects. Define a
> revision cache that includes just that.
>
> Signed-off-by: Sam Vilain <sam@vilain.net>
Comments below.
> ---
> Documentation/technical/revision-cache.txt | 260 ++++++++++++++++++++++++++++
> 1 files changed, 260 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/technical/revision-cache.txt
>
> diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
> new file mode 100644
> index 0000000..cc18535
> --- /dev/null
> +++ b/Documentation/technical/revision-cache.txt
> @@ -0,0 +1,260 @@
> +Revision Cache Format
> +=====================
> +
> +The revision cache is an on-disk format which allows for certain graph
> +reachability operations to be answered quickly.
> +
> +A revision cache contains;
> +
> + - A 'start' object (ie, 'interesting' object ID)
> +
> + - A list of objects which are referred to by the 'start' object
> +
> + * position when sorted in --date-order
> +
> + * object ID
Does the object ID contain the object type or just the SHA1? Having the
object type quickly retrievable would be a significant gain as well.
> +Start Object
> +------------
> +
> +The 'start' object is the identifying key of the revision cache.
> +
> +
> +Topological contents list
> +-------------------------
> +
> +This list has fixed-length records, so the topological position into
> +the list does not need to be stored in each record - it is implicit
> +from the offset.
> +
> +--date-order is used as it is the strictest sort order available, but
> +this still only specifies an ordering for commit objects. Other
> +objects will appear after the object which first refers to them. Tag
> +objects are sorted as if they were commit objects with a single
> +parent, the object they tag.
> +
> +
> +Use Cases
> +---------
> +In this section, the key functions and operations that this index is
> +designed to answer are explored. For each, their efficiency is
> +considered in terms of what must be carried out to calculate the
> +answer.
> +
> +
> +Determining Cache Suitability - rev_cache_suitable()
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +This is an auxiliary function used by the other use cases, when
when what?
> +The function is:
> +
> + rev_cache_suitable( rev_cache, object )
> +
> +The check is simple and fast - it just compares the object to the
> +'start' object in the revision cache.
> +
> +
> +Revision Walker
> +~~~~~~~~~~~~~~~
> +
> +The revision walker is the main user of this cache; there is the
> +generic function of revision walking, as well as commands that want
> +specific information which they normally derive from the revision
> +walker output.
> +
> +
> +Setting up the walker
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +The functions for this are (intentionally resembling the current
> +revision walker API):
> +
> + rev_cache_init()
> + rev_cache_add( interesting?, oid )
> +
> +As well as this setup, it is necessary to specify which options are
> +required. Some API exists to do this, and the return value from
> +'rev_cache_ok' is true if suitable caches were found for the
> +information required, and false if it would require a graph traversal:
> +
> + rev_cache_options( ordered? )
> + rev_cache_ok( ) : Bool
> +
> +The rev_cache_ok() function must open all available revision caches,
> +and see if their 'interesting' object matches the single 'interesting'
> +object passed to rev_cache_add(). If it matches, it returns true. If
> +multiple 'interesting' objects were specified and 'ordered' is true,
> +then the function returns false.
> +
> +If the ordering flag was set to false, then all of the 'interesting'
> +objects must be found in separate revision caches.
> +
> +If any 'uninteresting' objects were passed, then the return value is
> +always false.
> +
> +
> +Returning objects
> +^^^^^^^^^^^^^^^^^
> +
> +The 'rev_cache_fetch()' iterator returns entries from the topological
> +list.
> +
> + rev_cache_fetch() : oid
> +
> +If returning objects from a single revision cache, it opens the
opens what?
> +Accelerating in-progress walking
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +It is possible that during a revision iteration operation, a revision
> +is discovered that may mean the rest of the revision walking can be
> +achieved faster.
> +
> +In practice, this is likely to be implemented by making in-core cache
> +entries for objects with revision caches prior to walking; then when
> +encountered the special action can be taken.
> +
> +
> +Creating Revision Caches
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Once the revision cache is setup via rev_cache_init() and
> +an 'interesting' object added, then call:
> +
> + rev_cache_create( )
> +
> +This function will not work, if multiple 'interesting' objects are
> +passed, or any 'uninteresting' objects were passed.
> +
> +This function must revision walk the commit graph, sorting in
> +--date-order along the way, and may emit revisions as they are
> +discovered to the topological object list.
> +
> +
> +receive-pack/pack-objects
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +pack-objects when called from pack-objects is a special user of the
^
Are you missing some punctuation here?
> +revision cache; it has the extra requirements of wanting to know;
> +
> +* how many objects are between 'interesting' and 'uninteresting'
> + objects, at the beginning of the run, to emit the pack header
> +
> +* for checking re-usability of deltas, whether the delta base object
> + in the pack is in the received set of objects.
> +
> +* for 'thin' pack generation, whether the delta base object is in the
> + received set of objects, -or- reachable from any 'uninteresting'
> + objects
> +
> +* for 'shallow' clone, whether the delta base object is reachable
> + without passing any of the 'uninteresting' objects
> +
> +The aim is for 'pack-objects' to be able to start returning objects
> +immediately in the case where a suitable revision cache is returned,
> +without waiting for revision counting or repacking.
Everything that was said so far made lots of sense to me, up to this
section. I think that you should really limit the scope of the problem
to object enumeration only and not dive so deep into pack-objects'
operation. Having an efficient object enumeration cache that is 1)
fast, 2) doesn't use up too much disk space and 3) keeps itself up to
date automatically and transparently is already quite a challenge.
Throwing the pack-objects issues listed above into the mix is, I
think, a huge mistake and possibly a misunderstanding of the packing
process. More comments below.
> +Returning object counts
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +If the revision cache was suitable, eg for a complete fetch, then the
> +number of objects can be obtained by counting the size of the object
> +list.
> +
> +If multiple revision caches were used for multiple 'interesting'
> +objects, then the number of objects can be obtained by building a hash
> +table of all the objects in all of the caches, and returning the
> +number of hash table entries.
> +
> +If 'uninteresting' objects were passed, no cache can be suitable.
Knowing the number of objects early is not that useful. The only way
you can know that number is by counting all needed objects, and the
problem to solve here is really to provide that list of objects _fast_.
The pack-objects code already deals with building a hash of objects and
that takes no time compared to actually obtaining that list of objects
in the first place.
> +Re-using deltas
> +^^^^^^^^^^^^^^^
> +
> +This applies to the 'Compressing Objects' phase, previously known as
> +the 'Deltafying Objects' phase. Instead of searching for deltas, if
> +we can be sure that the existing delta can be resolved on the remote,
> +then we can re-use it.
I don't see how the reachability cache could have anything to do with
this. Sure the criteria for reusing some delta data is for the base to
be part of the object set, or in the thin pack case, when the base is
known to be available on the remote side. But the delta and base has to
already exist locally in the _same_ pack, otherwise it simply doesn't
exist and has to be computed. It is a common situation for a repository
to have multiple packs for which deltas can be produced when crossing
source pack boundaries, and that fact simply cannot fit within an object
enumeration cache.
> +For the initial clone, this operation is simple - as objects are
> +emitted, the delta from the packfile is re-used. If a loop is
> +detected or the delta base is not in the returned set of objects, then
> +the delta is first resolved.
How do you know you have a loop? A lot of thoughts went into the
current code to never get into a situation where loops could be
possible. And that can only be inferred from the actual content of used
object packs. Again the enumeration cache cannot and should not be
concerned by such considerations as they will change on every repack.
> +This implies that the list of objects is first loaded into a hash
> +table prior to returning any objects; however this is probably
> +acceptable as the entire list is in one stream and will load quickly.
Again this is duplication of already existing code.
> +For later fetches, the revision cache is not appropriate as they will
> +have 'uninteresting' objects set.
> +
> +
> +Re-using deltas - 'thin' pack
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This case is much like the normal re-using deltas case, except there
> +is the complete set of objects that the remote has claimed to have to
> +consider.
> +
> +The cache is not yet sophisticated enough to assist with this use
> +case.
And it should not attempt anything in that direction. This is simply
not the right level for such considerations.
Please don't mess up the layers of abstractions. Creating a list of
objects is currently the biggest bottleneck for big clones. It is not
the object compression (delta) phase, it is not the pack creation
either. The current code is smart enough not to delta objects it knows
can be found in source packs already, and it knows how to cull the list
of deltas that needs to be computed to the strict minimum. And that
requires almost no time. What is really time consuming is to figure out
the damn object list in the first place. No need to reinvent what
already works well.
In other words, you should really concentrate on making 'git rev-list
--objects' and 'git rev-list --objects-edge' instantaneous without
having to parse any commit or tree objects (at least for the part
already covered by the cache). And as new objects are added through
'git add' and 'git commit' or 'git fetch' then the traditional object
parsing would have to take place only up to the point where the cache
could take over, and update the cache for the non-covered objects
while at it. I think this is already quite a difficult problem.
What pack-objects would greatly benefit from is a facility that could
provide a list of objects with their SHA1, type, size and the 32-bit
object path hash given a list of "have" and "want" kind of refs. And so
in a way that doesn't suck, which could mean in less than, say, 5
seconds per million objects. That's it, but that's hard. All the rest
is already done and quite well optimized.
Nicolas
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
@ 2009-06-05 20:58 ` Nicolas Pitre
2009-06-07 4:01 ` Sam Vilain
1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-05 20:58 UTC (permalink / raw)
To: Jakub Narebski
Cc: Sam Vilain, git, Nick Edelen, Shawn O. Pearce,
Johannes Schindelin, Andreas Ericsson, Christian Couder,
Jeff King
On Fri, 5 Jun 2009, Jakub Narebski wrote:
> I like this sharing of not only completed code, but also plans,
> designs (and status reports) with the Git development community
> (i.e. the git mailing list). I like this very much.
>
>
> I'd like to ask if there are any results of profiling git server
> (git-daemon) code: how much time is spent on the object enumeration
> this GSoC project tries to make faster by means of caching?
The git daemon only forks and execs other processes. It is hardly using
any measurable CPU itself.
If you want to profile or simply have a good feel for what happens
during a clone and clearly see what phase is problematic for the server
then do this:
1) cd to a local repository of your choice. The bigger the better,
meaning that git itself is probably too small. Try the linux
kernel, or a gcc mirror, or even better yet the gentoo repository.
2) run this:
git pack-objects --all-progress --revs --all --stdout \
< /dev/null > /dev/null
3) Sit and watch. And for extra fun you may even measure the time
spent in each of the "counting objects", "compressing objects" and
"writing objects" phases and compare them.
Nicolas
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-05 19:28 ` Nicolas Pitre
@ 2009-06-07 3:49 ` Sam Vilain
2009-06-07 5:43 ` Nicolas Pitre
0 siblings, 1 reply; 14+ messages in thread
From: Sam Vilain @ 2009-06-07 3:49 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > + - A list of objects which are referred to by the 'start' object
> > +
> > + * position when sorted in --date-order
> > +
> > + * object ID
>
> Does the object ID contain the object type or just the SHA1? Having the
> object type quickly retrievable would be a significant gain as well.
Well, one of the things I'm trying to avoid is adding information which
isn't clearly supported by a use case. But now that you mention it I
guess the type is quite useful and also quite small. What I could do is
have 4 bitmaps, one for each object type. That should be quite small
and re-use the bitmap infrastructure used later.
[...]
> > +Use Cases
> > +---------
> > +In this section, the key functions and operations that this index is
> > +designed to answer are explored. For each, their efficiency is
> > +considered in terms of what must be carried out to calculate the
> > +answer.
> > +
> > +
> > +Determining Cache Suitability - rev_cache_suitable()
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +This is an auxiliary function used by the other use cases, when
>
> when what?
Whoops. When deciding whether or not a revision cache is going to be
useful to help the current operation. This needs to be quick because we
need to identify whether to use a cache or keep going.
> > +The function is:
> > +
> > + rev_cache_suitable( rev_cache, object )
> > +
> > +The check is simple and fast - it just compares the object to the
> > +'start' object in the revision cache.
> > +
> > +
[...]
> > +Returning objects
> > +^^^^^^^^^^^^^^^^^
> > +
> > +The 'rev_cache_fetch()' iterator returns entries from the topological
> > +list.
> > +
> > + rev_cache_fetch() : oid
> > +
> > +If returning objects from a single revision cache, it opens the
>
> opens what?
Again sorry this got lost when I was re-arranging the document. It
opens the revision cache and returns objects in the order they appear in
the cache's object list.
[...]
> > +receive-pack/pack-objects
> > +~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +pack-objects when called from pack-objects is a special user of the
> ^
> Are you missing some punctuation here?
It should probably say "when called from receive-pack".
> > +revision cache; it has the extra requirements of wanting to know;
> > +
> > +* how many objects are between 'interesting' and 'uninteresting'
> > + objects, at the beginning of the run, to emit the pack header
> > +
> > +* for checking re-usability of deltas, whether the delta base object
> > + in the pack is in the received set of objects.
> > +
> > +* for 'thin' pack generation, whether the delta base object is in the
> > + received set of objects, -or- reachable from any 'uninteresting'
> > + objects
> > +
> > +* for 'shallow' clone, whether the delta base object is reachable
> > + without passing any of the 'uninteresting' objects
> > +
> > +The aim is for 'pack-objects' to be able to start returning objects
> > +immediately in the case where a suitable revision cache is returned,
> > +without waiting for revision counting or repacking.
>
> Everything that was said so far made lots of sense to me, up to this
> section. I think that you should really limit the scope of the problem
> to object enumeration only and not dive so deep into pack-objects'
> operation.
> Having an efficient object enumeration cache that is 1)
> fast, 2) doesn't use up too much disk space and 3) keeps itself up to
> date automatically and transparently [...]
Like the rest of git - the object graph, packs, etc - this design is
based around the concept of "immutable data". That is, the revision
cache is defined by the objects it was computed from. Once the later
stages of it are reached, the "revision cache stacking" comes into play
to achieve automatic and transparent 'updating'. I considered designs
which were not immutable data but decided that the update performance
would be too poor or suffer from locking issues.
> > +Re-using deltas
> > +^^^^^^^^^^^^^^^
> > +
> > +This applies to the 'Compressing Objects' phase, previously known as
> > +the 'Deltafying Objects' phase. Instead of searching for deltas, if
> > +we can be sure that the existing delta can be resolved on the remote,
> > +then we can re-use it.
>
> I don't see how the reachability cache could have anything to do with
> this. Sure the criteria for reusing some delta data is for the base to
> be part of the object set, or in the thin pack case, when the base is
> known to be available on the remote side. But the delta and base has to
> already exist locally in the _same_ pack, otherwise it simply doesn't
> exist and has to be computed. It is a common situation for a repository
> to have multiple packs for which deltas can be produced when crossing
> source pack boundaries, and that fact simply cannot fit within an object
> enumeration cache.
>
> > +For the initial clone, this operation is simple - as objects are
> > +emitted, the delta from the packfile is re-used. If a loop is
> > +detected or the delta base is not in the returned set of objects, then
> > +the delta is first resolved.
>
> How do you know you have a loop? A lot of thoughts went into the
> current code to never get into a situation where loops could be
> possible. And that can only be inferred from the actual content of used
> object packs. Again the enumeration cache cannot and should not be
> concerned by such considerations as they will change on every repack.
>
> > +This implies that the list of objects is first loaded into a hash
> > +table prior to returning any objects; however this is probably
> > +acceptable as the entire list is in one stream and will load quickly.
>
> Again this is duplication of already existing code.
There's no code to duplicate yet. A lot of the above text is trying
to summarise what happens in pack-objects, for the purposes of
considering whether the revision cache can help with it. Delta loop
detection I'd considered, but decided not to discuss in the already
quite long document.
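For what it's worth, the "load the list into a hash table" step,
combined with the abbreviated-entry scheme from the patch, might look
like this sketch (abbreviation length and names are illustrative):

```python
def build_lookup(object_list, abbrev_hex=8):
    # Map an abbreviated object ID to the positions it occupies in the
    # topological list. Abbreviated entries may collide, so a match is
    # only confirmed by comparing the full SHA-1 at that position.
    table = {}
    for pos, oid in enumerate(object_list):
        table.setdefault(oid[:abbrev_hex], []).append(pos)
    return table

def in_cache(object_list, table, oid, abbrev_hex=8):
    # Any abbreviated hit must be confirmed against the full ID.
    return any(object_list[pos] == oid
               for pos in table.get(oid[:abbrev_hex], []))
```

Since the whole list arrives as one sequential stream, building this
table is a single fast pass before any objects are returned.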
> Please don't mess up the layers of abstractions. Creating a list of
> objects is currently the biggest bottleneck for big clones. It is not
> the object compression (delta) phase, it is not the pack creation
> either.
[...]
> In other words, you should really concentrate on making 'git rev-list
> --objects' and 'git rev-list --objects-edge' instantaneous without
> having to parse any commit nor tree objects (at least for the part
> already covered by the cache). And as new objects are added through
> 'git add' and 'git commit' or 'git fetch' then the traditional object
> parsing would have to take place only up to the point where the cache
> could take over, and update the cache for the non covered objects while
> at it. I think this is quite a difficult problem already.
It's a good thing, then, that this is exactly what the first milestones
in the project are covering.
What I don't understand though is that during initial clone the object
compression phase does take a lot of time, and that there have been
reports that large initial clones use a lot of VM for the packfile to be
sent. So what I'm doing is trying to answer the question about why we
can't just start streaming the pack as soon as we know how many objects
will be in it. All those other stages can happen in parallel - for
instance I would wager that it's far more efficient to detect loops as
we're streaming, as the packfile is accessed, rather than having to
seek and read object headers in the packfile up front.
In summary, I'm trying to make sure that the revision cache contains
whatever information might help with making it start as soon as
possible, where putting the information in is trivial.
If pack-objects integration is as small a win as you say - and I have no
hard facts to dispute this - then when we come to that part and
investigate I'm sure with the facts in hand we will agree on what
approach to take.
> What pack-objects would greatly benefit from is a facility that could
> provide a list of objects with their SHA1, type, size
Which size? compressed, uncompressed? What does that win us? Again
I'm trying to support all information with a use case.
> and the 32-bit
> object path hash
I'm not sure what this means, can you refer me to some docs or relevant
source?
> given a list of "have" and "want" kind of refs. And so
> in a way that doesn't suck, which could mean in less than, say, 5
> seconds per million objects. That's it, but that's hard. All the rest
> is already done and quite well optimized.
Thanks for your detailed commentary,
Sam.
* Re: [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan
2009-06-05 16:56 ` [PATCH 0/7] [GSoC2009] Revision cache / git-daemon caching plan Jakub Narebski
2009-06-05 20:58 ` Nicolas Pitre
@ 2009-06-07 4:01 ` Sam Vilain
1 sibling, 0 replies; 14+ messages in thread
From: Sam Vilain @ 2009-06-07 4:01 UTC (permalink / raw)
To: Jakub Narebski
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Fri, 2009-06-05 at 09:56 -0700, Jakub Narebski wrote:
> > The plan is to aim for one of these API milestones completed per week.
> > When complete, each commit will contain tests for the level of cache
> > that it delivers. Later milestones include joining the dots -
> > integrating with the 'rev-list' machinery and most importantly,
> > 'pack-objects'.
>
> I like this sharing not only completed code, but plans, designs (and
> status reports) with Git Development Community (i.e. git mailing
> list). I like this very much.
>
>
> I'd like to ask if there any results of profiling git server
> (git-daemon) code: how much is spend on object enumeration this GSoC
> project tries to make faster by the means of caching?
No, but you're not the first to ask - I'll research this and include
profiling of it (obviously needing to descend into stages taken by
pack-objects/receive-pack, as Nicolas points out) in the next update I
send out.
> Are there prepared benchmarks and tests to check if the code gives
> correct results, and to measure improvements brought by caching?
> Would it be possible to get some real-life statistics of git-daemon
> usage, so that you optimize against real scenarios?
In terms of correct results - that will come down to the test cases
which are written for it, and possibly extending the existing test cases
e.g. t5500-fetch-pack.sh.
> I wish you good work on git-daemon caching...
Thanks!
Sam.
* Re: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
2009-06-07 3:49 ` Sam Vilain
@ 2009-06-07 5:43 ` Nicolas Pitre
0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-06-07 5:43 UTC (permalink / raw)
To: Sam Vilain
Cc: git, Nick Edelen, Shawn O. Pearce, Johannes Schindelin,
Andreas Ericsson, Christian Couder, Jeff King
On Sun, 7 Jun 2009, Sam Vilain wrote:
> On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > Please don't mess up the layers of abstractions. Creating a list of
> > objects is currently the biggest bottleneck for big clones. It is not
> > the object compression (delta) phase, it is not the pack creation
> > either.
> [...]
> > In other words, you should really concentrate on making 'git rev-list
> > --objects' and 'git rev-list --objects-edge' instantaneous without
> > having to parse any commit nor tree objects (at least for the part
> > already covered by the cache). And as new objects are added through
> > 'git add' and 'git commit' or 'git fetch' then the traditional object
> > parsing would have to take place only up to the point where the cache
> > could take over, and update the cache for the non covered objects while
> > at it. I think this is quite a difficult problem already.
>
> It's a good thing, then, that this is exactly what the first milestones
> in the project are covering.
>
> What I don't understand though is that during initial clone the object
> compression phase does take a lot of time, and that there have been
> reports that large initial clones use a lot of VM for the packfile to be
> sent.
Did you duplicate those results yourself?
Did you also try out the simple bench test I suggested while responding
to Jakub?
For example, let's take the Linux kernel repository as I have it handy
here. For the various phases I get:
Counting objects = 25 seconds
Compressing objects = 2 seconds
Writing objects = 3 seconds
A long compression phase is a sign of a really badly packed repository.
The solution to that is simply to repack. After that subsequent clones
as well as further repacks will have much shorter compression phase as
the work that was performed in the first repack will be directly
reusable.
> So what I'm doing is trying to answer the question about why we
> can't just start streaming the pack as soon as we know how many objects
> will be in it. All those other stages can happen in parallel - for
> instance I would wager that it's far more efficient to detect loops as
> we're streaming, as the packfile is accessed, rather than having to
> seek and read object headers in the packfile up front.
My stance on this is that it is fundamentally impossible to do
efficiently, if at all. What we currently do is not loop detection but
rather loop avoidance. And because your pack set is not a static thing
(new packs are created or fetched, and sometimes they're repacked into a
single pack for a while) then you simply can't have a stable cache of
object topology based on their storage into packs. This is simply not
some immutable data. Only object names and contents are immutable,
nothing more.
What I'm telling you is: in order to know how many objects the pack
which is about to be created will contain, you need to count them.
This is what takes 25 seconds currently in the above example. Then a 2
second compression phase which could be even less if your repository is
better packed than mine currently is. At which point the writing phase
starts and the pack header is streamed. So trying to stream the pack as
soon as you know how many objects it contains will save a big whoopee 2
seconds. This is nothing to go crazy about, really.
Yet, to be able to stream a pack right away, you still would need to
know _where_ to pick the pack data from. And in most cases this is not
from a nice single pack but rather multiple ones. And some of those
packs have objects which are duplicated into other packs. And yet by
mixing pack data from multiple source packs, you gain new opportunities
for delta compression because you're now putting together objects which
were separated before and therefore couldn't create delta between them
previously (that's the 2 second phase above). You still must be careful
during that phase not to create delta loops, and to do that you must
take into account all the other deltas from existing source packs that
you are not going to recompute but just copy straight into the new pack.
And of course there is the delta depth limit to enforce. This means
that some rearrangements of deltas force some objects which were deltas
previously to become undeltified, or to become new deltas against a
shallower base object. And the only way to know what form each object
will take in
this context is again by going through the full compression phase which
works on a list of objects whose ordering is totally different from the
one that is used to actually store objects in the pack. In other words,
you cannot work out delta issues as you stream the pack because you
don't want to write objects in the pack with the same order used for
delta processing.
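To make the loop-avoidance and depth points concrete, here is a toy
sketch (Python, hypothetical names; git's real checks live in
builtin-pack-objects.c, and the default --depth is 50):

```python
def delta_chain_ok(oid, delta_base, max_depth=50):
    # delta_base maps an object ID to the ID of its delta base; objects
    # stored whole are absent from the map. A reused chain is rejected
    # if it revisits an object (a loop) or exceeds the depth limit.
    seen = set()
    depth = 0
    while oid in delta_base:
        if oid in seen:
            return False          # delta loop
        seen.add(oid)
        oid = delta_base[oid]
        depth += 1
        if depth > max_depth:
            return False          # chain too deep
    return True
```

The catch, as described above, is that delta_base is only fully known
once the compression phase has decided which existing deltas to copy
and which objects to re-delta or store whole.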
Why a different object ordering for storing in the pack? Because we want
to create the best data layout possible in the new pack so future
runtime access to that pack will be optimal. This often means that
objects are picked from existing packs in a non linear fashion but
rather in a seemingly random way. This is because we want the most
recent commits to get the best IO patterns in a pack. However, when the
repository is updated through fetches or pushes, the newly obtained
packs usually carry even more recent commits for which the bulk of
referenced objects are still to be found in older packs (this is why a
fetch is so quick). So the more fetches or pushes are performed, the
less efficient your repository becomes with regard to the most recent
commits. This is why a subsequent repack will reshuffle all objects so
that the most recent commit gets a linear IO pattern again by picking
objects from the most recent packs as well as from older packs and
putting them all contiguously in the new pack. But again you cannot
know if those objects will be in delta form or not, or against which
base object before the compression phase is over.
I hope you have a better idea now to answer the question about why we
can't just start streaming the pack as soon as we know how many objects
will be in it, and why all those other stages may not happen in
parallel.
> In summary, I'm trying to make sure that the revision cache contains
> whatever information might help with making it start as soon as
> possible, where putting the information in is trivial.
>
> If pack-objects integration is as small a win as you say - and I have no
> hard facts to dispute this - then when we come to that part and
> investigate I'm sure with the facts in hand we will agree on what
> approach to take.
Good. I'm glad you see things that way. Because, to me, even just the
notion of a good revision cache is not that simple.
> > What pack-objects would greatly benefit from is a facility that could
> > provide a list of objects with their SHA1, type, size
>
> Which size? compressed, uncompressed? What does that win us? Again
> I'm trying to support all information with a use case.
Uncompressed. That information is used by the delta heuristics, and the
code currently does its best not to seek all over through delta chains
to fetch that tiny bit of information if that can be avoided (have a
look at check_object() in builtin-pack-objects.c).
> > and the 32-bit
> > object path hash
>
> I'm not sure what this means, can you refer me to some docs or relevant
> source?
If you don't know what that is, then I'm afraid you might be lacking
some background on the packing and delta strategy. I'd suggest you read
Documentation/technical/pack-heuristics.txt first, and then find out in
the commit logs for builtin-pack-objects.c (or even within the file
itself) what has changed since then.
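For reference, the heuristic in question boils down to something like
the following (a Python transcription of name_hash() as I read it in
builtin-pack-objects.c; treat it as illustrative, not authoritative):

```python
def name_hash(path):
    # Effectively a sortable 32-bit number derived from roughly the
    # last sixteen non-whitespace characters of the path. Trailing
    # characters weigh most, so objects whose paths end alike
    # (e.g. ".c") sort near each other in the delta-candidate window.
    h = 0
    for ch in path or "":
        if ch.isspace():
            continue
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h
```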
Nicolas