From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sam Vilain
Subject: [PATCH 1/7] revision-cache: define revision cache as simple list of revisions
Date: Fri, 5 Jun 2009 02:05:12 +1200
Message-ID:
References:
Cc: Nick Edelen, "Shawn O. Pearce", Johannes Schindelin,
 Andreas Ericsson, Christian Couder, Jeff King
To: git@vger.kernel.org
In-Reply-To:
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org
Archived-At:

The very first thing that we want out of the revision cache is to be
able to go from a commit to all of the objects it refers to.  Define a
revision cache that includes just that.

Signed-off-by: Sam Vilain
---
 Documentation/technical/revision-cache.txt |  260 ++++++++++++++++++++++++++++
 1 files changed, 260 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/revision-cache.txt

diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
new file mode 100644
index 0000000..cc18535
--- /dev/null
+++ b/Documentation/technical/revision-cache.txt
@@ -0,0 +1,260 @@
+Revision Cache Format
+=====================
+
+The revision cache is an on-disk format which allows certain graph
+reachability operations to be answered quickly.
+
+A revision cache contains:
+
+ - A 'start' object (i.e. an 'interesting' object ID)
+
+ - A list of objects which are referred to by the 'start' object
+
+   * position when sorted in --date-order
+
+   * object ID
+
+
+Start Object
+------------
+
+The 'start' object is the identifying key of the revision cache.
+
+
+Topological contents list
+-------------------------
+
+This list has fixed-length records, so the topological position in the
+list does not need to be stored in each record - it is implicit from
+the offset.
+
+--date-order is used as it is the strictest sort order available, but
+this still only specifies an ordering for commit objects.  Other
+objects will appear after the object which first refers to them.  Tag
+objects are sorted as if they were commit objects with a single
+parent, the object they tag.
+
+
+Use Cases
+---------
+In this section, the key functions and operations that this index is
+designed to support are explored.  For each, the efficiency is
+considered in terms of what work must be carried out to calculate the
+answer.
+
+
+Determining Cache Suitability - rev_cache_suitable()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is an auxiliary function used by the other use cases when they
+need to decide whether a particular revision cache can answer the
+query at hand.
+
+The function is:
+
+  rev_cache_suitable( rev_cache, object )
+
+The check is simple and fast - it just compares the object to the
+'start' object in the revision cache.
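+
+To make this concrete, here is a minimal sketch in C.  It only
+illustrates the idea above - a 'start' object ID followed by
+fixed-length records whose position is implicit from their offset; the
+C structure names, the 20-byte SHA-1 field width and the exact
+signature of rev_cache_suitable() are assumptions for illustration,
+not a finalised on-disk format.
+
+  /* Illustrative sketch only; names and field widths are assumptions. */
+  #include <string.h>
+
+  #define OID_RAWSZ 20                /* SHA-1 */
+
+  struct rev_cache_entry {
+          unsigned char oid[OID_RAWSZ];   /* an object referred to */
+  };
+
+  struct rev_cache {
+          unsigned char start_oid[OID_RAWSZ];     /* the 'interesting' object */
+          const struct rev_cache_entry *entries;  /* fixed-length records, --date-order */
+          unsigned long nr_entries;
+  };
+
+  /* A cache is suitable only if its 'start' object is the one asked for. */
+  static int rev_cache_suitable(const struct rev_cache *cache,
+                                const unsigned char *oid)
+  {
+          return !memcmp(cache->start_oid, oid, OID_RAWSZ);
+  }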
+
+
+Revision Walker
+~~~~~~~~~~~~~~~
+
+The revision walker is the main user of this cache; there is the
+generic function of revision walking, as well as commands that want
+specific information which they normally derive from the revision
+walker output.
+
+
+Setting up the walker
+^^^^^^^^^^^^^^^^^^^^^
+
+The functions for this are (intentionally resembling the current
+revision walker API):
+
+  rev_cache_init()
+  rev_cache_add( interesting?, oid )
+
+As well as this setup, it is necessary to specify which options are
+required.  Some API exists to do this, and the return value from
+'rev_cache_ok' is true if suitable caches were found for the
+information required, and false if it would require a graph traversal:
+
+  rev_cache_options( ordered? )
+  rev_cache_ok( ) : Bool
+
+The rev_cache_ok() function must open all available revision caches,
+and see whether their 'interesting' object matches the single
+'interesting' object passed to rev_cache_add().  If it matches, it
+returns true.  If multiple 'interesting' objects were specified and
+'ordered' is true, then the function returns false.
+
+If the ordering flag was set to false, then all of the 'interesting'
+objects must be found in separate revision caches.
+
+If any 'uninteresting' objects were passed, then the return value is
+always false.
+
+
+Returning objects
+^^^^^^^^^^^^^^^^^
+
+The 'rev_cache_fetch()' iterator returns entries from the topological
+list.
+
+  rev_cache_fetch() : oid
+
+If returning objects from a single revision cache, it opens the
+cache's topological contents list and returns the object IDs from it
+in order.
+
+
+Accelerating in-progress walking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is possible that, during a revision walk, a revision is discovered
+which means that the rest of the walk can be completed faster.
+
+In practice, this is likely to be implemented by making in-core cache
+entries for objects which have revision caches prior to walking; then,
+when such an object is encountered, the special action can be taken.
+
+
+Creating Revision Caches
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the revision cache is set up via rev_cache_init() and
+an 'interesting' object added, then call:
+
+  rev_cache_create( )
+
+This function will not work if multiple 'interesting' objects are
+passed, or if any 'uninteresting' objects were passed.
+
+This function must revision walk the commit graph, sorting in
+--date-order along the way, and may emit revisions to the topological
+object list as they are discovered.
+
+
+receive-pack/pack-objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pack-objects is a special user of the revision cache; it has the extra
+requirements of wanting to know:
+
+* how many objects are between 'interesting' and 'uninteresting'
+  objects, at the beginning of the run, to emit the pack header
+
+* for checking re-usability of deltas, whether the delta base object
+  in the pack is in the received set of objects
+
+* for 'thin' pack generation, whether the delta base object is in the
+  received set of objects, -or- reachable from any 'uninteresting'
+  objects
+
+* for 'shallow' clone, whether the delta base object is reachable
+  without passing any of the 'uninteresting' objects
+
+The aim is for 'pack-objects' to be able to start returning objects
+immediately in the case where a suitable revision cache is found,
+without waiting for revision counting or repacking.
+
+
+Returning object counts
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If the revision cache was suitable, e.g. for a complete fetch, then
+the number of objects can be read directly from the length of the
+object list.
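+
+Because the records are fixed length, that count is simple arithmetic
+on the size of the cache.  A minimal sketch of the calculation
+follows; it assumes the same illustrative layout as the earlier sketch
+(a single 'start' object ID as the header, followed by 20-byte object
+IDs), which is an assumption rather than a defined format.
+
+  #include <sys/stat.h>
+
+  #define OID_RAWSZ 20                /* SHA-1 */
+
+  /*
+   * Number of objects listed in a revision cache file, or -1 on error.
+   * Assumes the illustrative layout: 'start' OID header, then
+   * fixed-length 20-byte records.
+   */
+  static long rev_cache_object_count(const char *path)
+  {
+          struct stat st;
+
+          if (stat(path, &st) < 0)
+                  return -1;
+          if (st.st_size < OID_RAWSZ ||
+              (st.st_size - OID_RAWSZ) % OID_RAWSZ)
+                  return -1;      /* truncated, or not a revision cache */
+          return (st.st_size - OID_RAWSZ) / OID_RAWSZ;
+  }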
+
+If multiple revision caches were used for multiple 'interesting'
+objects, then the number of objects can be obtained by building a hash
+table of all the objects in all of the caches, and returning the
+number of hash table entries.
+
+If 'uninteresting' objects were passed, no cache can be suitable.
+
+
+Re-using deltas
+^^^^^^^^^^^^^^^
+
+This applies to the 'Compressing Objects' phase, previously known as
+the 'Deltafying Objects' phase.  Instead of searching for deltas, if
+we can be sure that the existing delta can be resolved on the remote,
+then we can re-use it.
+
+For the initial clone, this operation is simple - as objects are
+emitted, the delta from the packfile is re-used.  If a loop is
+detected or the delta base is not in the returned set of objects, then
+the delta is first resolved.
+
+This implies that the list of objects is first loaded into a hash
+table prior to returning any objects; however, this is probably
+acceptable as the entire list is in one stream and will load quickly.
+
+For later fetches, the revision cache is not appropriate, as those
+fetches will have 'uninteresting' objects set.
+
+
+Re-using deltas - 'thin' pack
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This case is much like the normal re-using deltas case, except that
+the complete set of objects which the remote has claimed to have must
+also be considered.
+
+The cache is not yet sophisticated enough to assist with this use
+case.
+
+
+'shallow' clone considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For enumerating objects, the list of objects between the
+'uninteresting' and the 'shallow' commits must first be enumerated,
+and then subtracted from the objects between the 'interesting' and the
+'uninteresting' commits.
+
+For re-using deltas for 'thin' packs, the list of objects between the
+'uninteresting' and 'shallow' commits is enumerated and marked as
+valid for delta bases.
+
+The cache is not yet sophisticated enough to assist with this case.
+
+
+Creating bundles
+~~~~~~~~~~~~~~~~
+
+So long as a bundle has no 'uninteresting' commits, the revision cache
+is completely appropriate: with the length of the object list it can
+write a header and then continue as in the 'pack-objects' use case.
+
+
+Slicing bundles (mirror-sync)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For 'slicing' bundles, a deterministic method is required for chopping
+up the list of objects, one which can be calculated by any node which
+has the 'interesting' commits.
+
+To determine the objects which fall within a given slice, the object
+list must be enumerated and then divided evenly.  As the compressed
+size on one node cannot be reproduced on another node, the
+uncompressed size of the object is used instead, in the hope that
+slices will generally end up roughly even in size once compressed.
+
+To calculate the object boundaries, the list of objects in
+--date-order must first be tie-broken, e.g. for commits with the same
+commit date and topological order.  If slices are allowed to be
+sub-commit level, then objects between commits (i.e. blobs and trees)
+are also sorted.  Once this deterministic list has been built, all of
+the objects must be accessed to determine their lengths.  The objects
+which start within a given range are the ones in the slice.
+
+For re-using deltas in sliced bundles, the delta base is looked up in
+the deterministic list.  If it has an earlier sequence, then the delta
+can be safely re-used.  If it has a later sequence, then the delta
+must be resolved and the object re-deltified/compressed, using only
+objects with an earlier sequence (or in the returned pack) as a base.
+
+This requirement is so that a collection of 'sliced' bundles can be
+successfully re-assembled, while still allowing them to be 'thin'.
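+
+The following is a small sketch of the boundary calculation and of the
+sequence test for delta re-use described above.  The function names
+and signatures are illustrative assumptions; only the rules themselves
+- divide by cumulative uncompressed size, an object belongs to the
+slice its first byte falls in, and re-use a delta only against an
+earlier base - come from this description.
+
+  /*
+   * Illustrative only.  Assign each object (in the tie-broken,
+   * deterministic order) to one of nr_slices slices of roughly equal
+   * uncompressed size.  Any node holding the 'interesting' commits
+   * computes the same boundaries.
+   */
+  static void assign_slices(const unsigned long *size,
+                            unsigned long nr_objects, unsigned nr_slices,
+                            unsigned *slice_of)
+  {
+          unsigned long long total = 0, start = 0;
+          unsigned long i;
+
+          for (i = 0; i < nr_objects; i++)
+                  total += size[i];
+
+          for (i = 0; i < nr_objects; i++) {
+                  /* an object belongs to the slice its first byte falls in */
+                  unsigned slice = (unsigned)(start * nr_slices / (total ? total : 1));
+
+                  slice_of[i] = slice < nr_slices ? slice : nr_slices - 1;
+                  start += size[i];
+          }
+  }
+
+  /* A delta is reusable as-is only against a base earlier in the sequence. */
+  static int can_reuse_delta(unsigned long base_seq, unsigned long obj_seq)
+  {
+          return base_seq < obj_seq;
+  }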
-- 
debian.1.5.6.1