From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: Confused over packfile and index design Date: Fri, 8 Apr 2011 20:20:48 -0400 Message-ID: <20110409002047.GB7445@sigill.intra.peff.net> References: Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: git@vger.kernel.org To: "Steven E. Harris" X-From: git-owner@vger.kernel.org Sat Apr 09 02:21:11 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Q8LvK-0007kr-67 for gcvg-git-2@lo.gmane.org; Sat, 09 Apr 2011 02:21:10 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757986Ab1DIAUw convert rfc822-to-quoted-printable (ORCPT ); Fri, 8 Apr 2011 20:20:52 -0400 Received: from 99-108-226-0.lightspeed.iplsin.sbcglobal.net ([99.108.226.0]:33995 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757977Ab1DIAUv (ORCPT ); Fri, 8 Apr 2011 20:20:51 -0400 Received: (qmail 7391 invoked by uid 107); 9 Apr 2011 00:21:38 -0000 Received: from 70-36-146-44.dsl.dynamic.sonic.net (HELO sigill.intra.peff.net) (70.36.146.44) (smtp-auth username relayok, mechanism cram-md5) by peff.net (qpsmtpd/0.84) with ESMTPA; Fri, 08 Apr 2011 20:21:38 -0400 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Fri, 08 Apr 2011 20:20:48 -0400 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Fri, Apr 08, 2011 at 07:58:41PM -0400, Steven E. Harris wrote: > ,---- > | Importantly, packfile indexes are /not/ neccesary to extract object= s > | from a packfile, they are simply used to quickly retrieve individua= l > | objects from a pack. The packfile format is used in upload-pack and > | receieve-pack programs (push and fetch protocols) to transfer objec= ts > | and there is no index used then - it can be built after the fact by > | scanning the packfile. > `---- >=20 > That suggests that it's possible to read the packfile linearly and > deduce where the various objects start and end, without the index > available. Yes. For example, when we do a "git fetch", we get _just_ the packfile and create our own local index. > Later, in the section on the packfile format, we find this: >=20 > ,---- > | It is important to note that the size specified in the header data = is > | not the size of the data that actually follows, but the size of tha= t > | data /when expanded/. This is why the offsets in the packfile index= are > | so useful, otherwise you have to expand every object just to tell w= hen > | the next header starts. > `---- >=20 > Now that makes it sound like without the index, even if one knows whe= re > a packed object starts, reading its header tells its /inflated/ size, > /not/ the number of remaining payload bytes representing the object. = If > that's true, then how does one figure out where one object ends and t= he > next one begins /without the index/? The actual object data (whether it is the object itself or a delta) is all zlib-encoded, so it has its own size header and checksum there, I believe. The pack-format documentation is a bit vague, but a quick read of unpack_raw_entry and unpack_entry_data in builtin/index-pack.c seems to confirm that this is how it works. Take that response with a grain of salt, though. That is just from my quick read of the code, so I could be wrong. > Recall that the first paragraph quoted above says that the index can = be > built from the packfile, as opposed to it being essential to reading = the > packfile. Is one of these paragraphs incorrect? No, if I'm correct, it is just that there is an extra header that neither mentions. :) > The Git documentation on the pack format=C2=B2 mentions that the pack= ed > object headers represent the lengths as variable-sized integers >=20 > ,---- > | n-byte type and length (3-bit type, (n-1)*7+4-bit length) > `---- >=20 > but it doesn't say whether that's the number of (deflated) payload by= tes > or the inflated object size, as the Git Book asserts. That should be the inflated object size. > I imagine that if the format is meant to record the size of the defla= ted > payload, then it would be challenging to compress the data straight i= nto > the packfile, because one wouldn't know the final size until it was > written, which means that one wouldn't know how many bytes will be > necessary to write its length in the header, which means one wouldn't > know where to start writing the deflated payload. I believe zlib handles streaming it out for us. I'm not too familiar with zlib's format, but I assume it outputs in chunks with occasional headers. So finding the end of stream means while reading through the whole stream and skipping past each chunk. > Are there any other clarifying documents you can recommend to underst= and > the design? Not that I know of; what's in docs/technical is generally authoritative= , except for reading the code. -Peff