Revisiting large binary files issue.

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Revisiting large binary files issue.
@ 2006-07-10 23:01 Carl Baldwin
  2006-07-10 23:14 ` Junio C Hamano
  2006-07-10 23:28 ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: Carl Baldwin @ 2006-07-10 23:01 UTC (permalink / raw)
  To: git

Hi,

Some of this stuff has been discussed before but I thought I'd bring it
up again.

I am using git in a way for which, admittedly, it was not intended.  I
have repositories in which I'm storing some source code mixed with some
large binary files (from 20MB up to 1GB in the worst case).  The large
files easily dominate the space in the repository and the time that it
takes to perform network operations between repositories.

Just to refresh your memory a sample of these files led to work that
Nicolas did to improve the performance of packing large binary files.
Thank you Nicolas, I think your work improved the speed of git in the
face of largish files.

Attempting to delta compress these blobs is a frivolous effort.  The
nature of the blobs is such that if given two blobs:  'A', and the next
revision, 'B' it is just as good --- from a pack-size standpoint --- to
compress the entire contents of 'B' than try to find a delta from A ->
B.  It is also much faster than trying to find deltas.  In git, I can
accomplish this by setting the pack window (I think this is right) to 0.

When I set the window to 0 I one more issue.  Even though the blobs are
already compressed on disk I still seem to pay the penalty of inflating
them into memory and then deflating them into the pack.  When the window
size is 0 this is just wasted cycles.  With large binary files these
wasted cycles slow down the push/fetch operation considerably.  Couldn't
the compressed blobs be copied into the pack without first deflating
them in this 0 window case?

My 'porcelain' on top of git works around these issues by first rsyncing
the object directories and then running the git push/fetch command
afterward.  The push/fetch command sees that the remote is up-to-date
with all the necessary objects and skips packing.  This works and is
fast (much faster than even packing with window of 0 because of the time
to inflate/deflate the objects) but I'd like to remove this work-around
in the long-term.

Ideally, there are two things that I would like available with git to be
able to remove my work-around.

First, I would like to be able to set the packing window to 0 for all of
the git commands.  It would be nice if I could set this in a
per-repository config file so that any push/fetch operation would honor
this window.  Is there currently a way to do this?

Second, I would like to not pay the penalty to inflate and then deflate
the objects into the pack when I use a window of 0.  How hard would this
be?  I am a capable programmer and wouldn't mind getting my hands dirty
in the code to implement this if someone could point me in the right
direction.

Thanks for your time,
Carl Baldwin

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        RADCAD (R&D CAD)
 Hewlett Packard Company
 MS 88                               work: 970 898-1523
 3404 E. Harmony Rd.                 work: Carl.N.Baldwin@hp.com
 Fort Collins, CO 80525              home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Revisiting large binary files issue.
  2006-07-10 23:01 Revisiting large binary files issue Carl Baldwin
@ 2006-07-10 23:14 ` Junio C Hamano
  2006-07-11  6:20   ` Peter Baumann
  2006-07-10 23:28 ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-07-10 23:14 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: git

Carl Baldwin <cnb@fc.hp.com> writes:

> First, I would like to be able to set the packing window to 0 for all of
> the git commands.  It would be nice if I could set this in a
> per-repository config file so that any push/fetch operation would honor
> this window.  Is there currently a way to do this?

Should not be hard to add.

> Second, I would like to not pay the penalty to inflate and then deflate
> the objects into the pack when I use a window of 0.  How hard would this
> be?  I am a capable programmer and wouldn't mind getting my hands dirty
> in the code to implement this if someone could point me in the right
> direction.

The problem is that unpacked objects have the single line header
(type followed by its inflated size in decimal) which starts the
deflated stream, while in-pack representation of non-delta does
not.

There was an attempt to help doing this, but I haven't pursued it.

	http://article.gmane.org/gmane.comp.version-control.git/17368

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Revisiting large binary files issue.
  2006-07-10 23:14 ` Junio C Hamano
@ 2006-07-11  6:20   ` Peter Baumann
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Baumann @ 2006-07-11  6:20 UTC (permalink / raw)
  To: git

On 2006-07-10, Junio C Hamano <junkio@cox.net> wrote:
> Carl Baldwin <cnb@fc.hp.com> writes:
>
>> First, I would like to be able to set the packing window to 0 for all of
>> the git commands.  It would be nice if I could set this in a
>> per-repository config file so that any push/fetch operation would honor
>> this window.  Is there currently a way to do this?
>
> Should not be hard to add.
>
>> Second, I would like to not pay the penalty to inflate and then deflate
>> the objects into the pack when I use a window of 0.  How hard would this
>> be?  I am a capable programmer and wouldn't mind getting my hands dirty
>> in the code to implement this if someone could point me in the right
>> direction.
>
> The problem is that unpacked objects have the single line header
> (type followed by its inflated size in decimal) which starts the
> deflated stream, while in-pack representation of non-delta does
> not.
>
> There was an attempt to help doing this, but I haven't pursued it.
>
> 	http://article.gmane.org/gmane.comp.version-control.git/17368
>
>

Wouldn't it make more sense to convert the unpacked object reprasentation
to the one which is used in the pack files?

This would make it really easy to just copy the object in the non-delta
case to the pack and avoid the inflate/deflate calls.

Peter Baumann

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Revisiting large binary files issue.
  2006-07-10 23:01 Revisiting large binary files issue Carl Baldwin
  2006-07-10 23:14 ` Junio C Hamano
@ 2006-07-10 23:28 ` Linus Torvalds
  2006-07-11  9:40   ` [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.) sf
  2006-07-11 14:55   ` Revisiting large binary files issue Carl Baldwin
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-10 23:28 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: git



On Mon, 10 Jul 2006, Carl Baldwin wrote:
> 
> When I set the window to 0 I one more issue.  Even though the blobs are
> already compressed on disk I still seem to pay the penalty of inflating
> them into memory and then deflating them into the pack.  When the window
> size is 0 this is just wasted cycles.  With large binary files these
> wasted cycles slow down the push/fetch operation considerably.  Couldn't
> the compressed blobs be copied into the pack without first deflating
> them in this 0 window case?

The problem is that the individual object disk format isn't actually the 
same as the pack-file object format for one object. The header is 
different: a pack-file uses a very dense bit packing, while the individual 
object format is a bit less dense.

Sad, really, but it means that right now you can only re-use data that was 
already packed (when the format matches).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.)
  2006-07-10 23:28 ` Linus Torvalds
@ 2006-07-11  9:40   ` sf
  2006-07-11 18:00     ` Linus Torvalds
  2006-07-11 14:55   ` Revisiting large binary files issue Carl Baldwin
  1 sibling, 1 reply; 37+ messages in thread
From: sf @ 2006-07-11  9:40 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
...
> The problem is that the individual object disk format isn't actually the 
> same as the pack-file object format for one object. The header is 
> different: a pack-file uses a very dense bit packing, while the individual 
> object format is a bit less dense.

I just stumbled over the same fact and asked myself why there are still
two formats. Wouldn't it make more sense to use the pack-file object
format for individual objects as well?

As it happens individual objects all start with nibble 7 (deflated with
default _zlib_ window size of 32K) whereas in the pack-file object
format nibble 7 indicates delta entries which never occur as individual
files.

Roadmap for using pack-file format as individual object disk format:

Step 1. When reading individual objects from disk check the first nibble
and decode accordingly (see above).

Step 2. When writing individual objects to disk write them in pack-file
object format. Make that optional (config-file parameter, command line
option etc.)?

Step 3. Remove code for (old) individual object disk format.

Please comment.

Regards
	Stephan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.)
  2006-07-11  9:40   ` [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.) sf
@ 2006-07-11 18:00     ` Linus Torvalds
  2006-07-11 21:45       ` sf
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 18:00 UTC (permalink / raw)
  To: sf; +Cc: git

[ I read my personal mailbox first, so I didn't see this one until after I 
  had already written my version.. ]

On Tue, 11 Jul 2006, sf wrote:
> 
> I just stumbled over the same fact and asked myself why there are still
> two formats. Wouldn't it make more sense to use the pack-file object
> format for individual objects as well?

Yes, see the git list for a series of patches that try to do this.

> As it happens individual objects all start with nibble 7 (deflated with
> default _zlib_ window size of 32K) whereas in the pack-file object
> format nibble 7 indicates delta entries which never occur as individual
> files.

I didn't actually do it that way, but it would be better to make the 
"parse_ascii_sha1_header()" more strict, and only accept the old names. 

Right now my patch-series could in theory accept something that is _not_ 
an ASCII header (eg it would be a binary header that just happened to have 
the format "x n\0", where "n" was a valid number).

> Step 1. When reading individual objects from disk check the first nibble
> and decode accordingly (see above).

Check more than that, but yes, this should be tightened up in my 
series.

> Step 2. When writing individual objects to disk write them in pack-file
> object format. Make that optional (config-file parameter, command line
> option etc.)?

Done.

> Step 3. Remove code for (old) individual object disk format.

Well, I'm not sure how necessary that even is. We actually do have to 
generate the old header regardless, if for no other reason than the fact 
that we generate the SHA1 names based on it (even if we then write a 
new-style dense binary header to disk and discard the ASCII header).

Having it there means that you can always just get a new version of git, 
and never worry about how old the archive you're working with is.

(And then doing a "git repack -a -d" will make any archive also work with 
an old-style git, since the pack-file format didn't change, and a "git 
repack" thus ends up always creating something that is readable by 
anybody, including old clients).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC]: Pack-file object format for individual objects (Was:  Revisiting large binary files issue.)
  2006-07-11 18:00     ` Linus Torvalds
@ 2006-07-11 21:45       ` sf
  2006-07-11 22:17         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: sf @ 2006-07-11 21:45 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
...
> On Tue, 11 Jul 2006, sf wrote:
...
>> Step 1. When reading individual objects from disk check the first nibble
>> and decode accordingly (see above).
> 
> Check more than that, but yes, this should be tightened up in my 
> series.

Just look at the first byte of the object file _without doing any
decompression_. It is 0x78 _if and only if_ the object file is in the
traditional format.

>> Step 3. Remove code for (old) individual object disk format.
> 
> Well, I'm not sure how necessary that even is. We actually do have to 
> generate the old header regardless, if for no other reason than the fact 
> that we generate the SHA1 names based on it (even if we then write a 
> new-style dense binary header to disk and discard the ASCII header).
> 
> Having it there means that you can always just get a new version of git, 
> and never worry about how old the archive you're working with is.
> 
> (And then doing a "git repack -a -d" will make any archive also work with 
> an old-style git, since the pack-file format didn't change, and a "git 
> repack" thus ends up always creating something that is readable by 
> anybody, including old clients).

Agreed.

Regards
	Stephan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC]: Pack-file object format for individual objects (Was:   Revisiting large binary files issue.)
  2006-07-11 21:45       ` sf
@ 2006-07-11 22:17         ` Linus Torvalds
  2006-07-11 22:26           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 22:17 UTC (permalink / raw)
  To: sf; +Cc: git



On Tue, 11 Jul 2006, sf wrote:
> 
> Just look at the first byte of the object file _without doing any
> decompression_. It is 0x78 _if and only if_ the object file is in the
> traditional format.

0x78 isn't the only valid flag for a zlib stream, as far as I can tell.

It may be the only one _in_practice_, of course, but the zlib standard 
defines the first byte as

 - for low bits: CM (compression method):

        "This identifies the compression method used in the file. CM = 8
         denotes the "deflate" compression method with a window size up
         to 32K.  This is the method used by gzip and PNG (see
         references [1] and [2] in Chapter 3, below, for the reference
         documents).  CM = 15 is reserved.  It might be used in a future
         version of this specification to indicate the presence of an
         extra field before the compressed data."

 - four high bits are CINFO: 

        "For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
         size, minus eight (CINFO=7 indicates a 32K window size). Values
         of CINFO above 7 are not allowed in this version of the
         specification.  CINFO is not defined in this specification for
         CM not equal to 8."

so 0x78 means "deflate with 32kB window size", but I don't see anything 
guaranteeing that we might not see something else for an object that 
cannot be compressed, for example.

Anyway, the good news is that _if_ 0x78 is indeed the only possible value, 
then it is also an illegal value for an unpacked object in pack-file 
format (type=7 being OBJ_DELTA) and we wouldn't need any other flag for 
this.

I just don't know if it's the only possible one..

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC]: Pack-file object format for individual objects (Was:   Revisiting large binary files issue.)
  2006-07-11 22:17         ` Linus Torvalds
@ 2006-07-11 22:26           ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 22:26 UTC (permalink / raw)
  To: sf; +Cc: git



On Tue, 11 Jul 2006, Linus Torvalds wrote:
> 
>  - for low bits: CM (compression method):
> 
>         "This identifies the compression method used in the file. CM = 8
>          denotes the "deflate" compression method with a window size up
>          to 32K.  This is the method used by gzip and PNG (see
>          references [1] and [2] in Chapter 3, below, for the reference
>          documents).  CM = 15 is reserved.  It might be used in a future
>          version of this specification to indicate the presence of an
>          extra field before the compressed data."
> 
>  - four high bits are CINFO: 
> 
>         "For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
>          size, minus eight (CINFO=7 indicates a 32K window size). Values
>          of CINFO above 7 are not allowed in this version of the
>          specification.  CINFO is not defined in this specification for
>          CM not equal to 8."
> 
> so 0x78 means "deflate with 32kB window size", but I don't see anything 
> guaranteeing that we might not see something else for an object that 
> cannot be compressed, for example.

Ahh. Looking at the zlib sources, I see

    /* Write the zlib header */
    if (s->status == INIT_STATE) {

        uInt header = (Z_DEFLATED + ((s->w_bits-8)<<4)) << 8;
        uInt level_flags = (s->level-1) >> 1;
     
        if (level_flags > 3) level_flags = 3;
        header |= (level_flags << 6);
        if (s->strstart != 0) header |= PRESET_DICT;
        header += 31 - (header % 31);

        s->status = BUSY_STATE;
        putShortMSB(s, header);

(which is that first 16-bit word, MSB first). So we'll always have the 
Z-DEFLATED (8) there in the low four bits, but the high nybble will be 
"s->w_bits-8" where w_bits comes from windowBits, and I think we can 
depend on it beign 15:

    "The windowBits parameter is the base two logarithm of the window size
   (the size of the history buffer).  It should be in the range 8..15 for this
   version of the library. Larger values of this parameter result in better
   compression at the expense of memory usage. The default value is 15 if
   deflateInit is used instead."

so since we use deflateInit(), we know the window will be 15.

So I guess we _can_ depend on the first byte being 0x78 for our use.

Goodie.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Revisiting large binary files issue.
  2006-07-10 23:28 ` Linus Torvalds
  2006-07-11  9:40   ` [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.) sf
@ 2006-07-11 14:55   ` Carl Baldwin
  2006-07-11 17:09     ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Carl Baldwin @ 2006-07-11 14:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

I'd like to get my hands dirty and see for myself where the issue lies.
I hope to have some time next week to devote to this.  Is it reasonable
to hope for a solution that is at least a lot lighter weight than the
current status quo?

Carl

On Mon, Jul 10, 2006 at 04:28:24PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 10 Jul 2006, Carl Baldwin wrote:
> > 
> > When I set the window to 0 I one more issue.  Even though the blobs are
> > already compressed on disk I still seem to pay the penalty of inflating
> > them into memory and then deflating them into the pack.  When the window
> > size is 0 this is just wasted cycles.  With large binary files these
> > wasted cycles slow down the push/fetch operation considerably.  Couldn't
> > the compressed blobs be copied into the pack without first deflating
> > them in this 0 window case?
> 
> The problem is that the individual object disk format isn't actually the 
> same as the pack-file object format for one object. The header is 
> different: a pack-file uses a very dense bit packing, while the individual 
> object format is a bit less dense.
> 
> Sad, really, but it means that right now you can only re-use data that was 
> already packed (when the format matches).
> 
> 		Linus
> 

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        RADCAD (R&D CAD)
 Hewlett Packard Company
 MS 88                               work: 970 898-1523
 3404 E. Harmony Rd.                 work: Carl.N.Baldwin@hp.com
 Fort Collins, CO 80525              home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Revisiting large binary files issue.
  2006-07-11 14:55   ` Revisiting large binary files issue Carl Baldwin
@ 2006-07-11 17:09     ` Linus Torvalds
  2006-07-11 17:10       ` [PATCH 1/3] Make the unpacked object header functions static to sha1_file.c Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 17:09 UTC (permalink / raw)
  To: Carl Baldwin, Junio C Hamano; +Cc: Git Mailing List

On Tue, 11 Jul 2006, Carl Baldwin wrote:
>
> I'd like to get my hands dirty and see for myself where the issue lies.
> I hope to have some time next week to devote to this.  Is it reasonable
> to hope for a solution that is at least a lot lighter weight than the
> current status quo?

Ok, I decided to see how nasty an object database change would be.

It doesn't look too bad. The following three patches implement what looks 
like a workable model.

NOTE NOTE NOTE! It makes the new "binary headers" the default, and that 
will mean that unless you have applied the first two patches, any 
repository that has had objects added with the new git version WILL NOT BE 
READABLE BY AN OLDER GIT VERSION!

So I think that the first two patches can be added to the main git branch 
pretty much immediately (after people have tested this all a _bit_ more, 
of course), because the first two patches just add the capability to 
_read_ a mixed-format repository.

The third patch is the one that actually starts _writing_ new-format repo. 
It should be applied with extreme care, although it can basically be 
de-fanged by setting the default initial value of the "use_binary_headers" 
to 0, which will make it not write the binary headers by default.

I have _not_ verified that the actual object format is identical to the 
pack-file one, but it should be. It's simple enough.

The three patches will be sent as replies to this email.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/3] Make the unpacked object header functions static to sha1_file.c
  2006-07-11 17:09     ` Linus Torvalds
@ 2006-07-11 17:10       ` Linus Torvalds
  2006-07-11 17:12       ` [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format" Linus Torvalds
  2006-07-11 17:16       ` [PATCH 3/3] Enable the new binary header format for unpacked objects Linus Torvalds
  2 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 17:10 UTC (permalink / raw)
  To: Carl Baldwin, Junio C Hamano; +Cc: Git Mailing List


Nobody else uses them, and I'm going to start changing them.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
 cache.h     |    2 --
 sha1_file.c |    4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index b5e3f8f..d433d46 100644
--- a/cache.h
+++ b/cache.h
@@ -219,8 +219,6 @@ int safe_create_leading_directories(char
 char *enter_repo(char *path, int strict);
 
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
-extern int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size);
-extern int parse_sha1_header(char *hdr, char *type, unsigned long *sizep);
 extern int sha1_object_info(const unsigned char *, char *, unsigned long *);
 extern void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size);
 extern void * read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size);
diff --git a/sha1_file.c b/sha1_file.c
index 459430a..8734d50 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -684,7 +684,7 @@ static void *map_sha1_file_internal(cons
 	return map;
 }
 
-int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size)
+static int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size)
 {
 	/* Get the data stream */
 	memset(stream, 0, sizeof(*stream));
@@ -720,7 +720,7 @@ static void *unpack_sha1_rest(z_stream *
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
+static int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
 {
 	int i;
 	unsigned long size;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 17:09     ` Linus Torvalds
  2006-07-11 17:10       ` [PATCH 1/3] Make the unpacked object header functions static to sha1_file.c Linus Torvalds
@ 2006-07-11 17:12       ` Linus Torvalds
  2006-07-11 18:40         ` Johannes Schindelin
  2006-07-11 21:24         ` sf
  2006-07-11 17:16       ` [PATCH 3/3] Enable the new binary header format for unpacked objects Linus Torvalds
  2 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 17:12 UTC (permalink / raw)
  To: Carl Baldwin, Junio C Hamano; +Cc: Git Mailing List


The pack-file format is slightly different from the traditional git
object format, in that it has a much denser binary header encoding.

The traditional format uses an ASCII string with type and length
information, which is somewhat wasteful.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

This should probably be applied to the main tree asap if we think
this is at all a worthwhile exercise. But somebody should verify that I 
got the format right first!

 sha1_file.c |   66 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8734d50..ca5f0c0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -697,9 +697,9 @@ static int unpack_sha1_header(z_stream *
 	return inflate(stream, 0);
 }
 
-static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size)
+static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size, unsigned int hdrlen)
 {
-	int bytes = strlen(buffer) + 1;
+	int bytes = hdrlen;
 	unsigned char *buf = xmalloc(1+size);
 
 	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
@@ -720,9 +720,9 @@ static void *unpack_sha1_rest(z_stream *
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-static int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
+static int parse_ascii_sha1_header(char *hdr, char *type, unsigned long *sizep)
 {
-	int i;
+	int i, bytes = 0;
 	unsigned long size;
 
 	/*
@@ -733,6 +733,7 @@ static int parse_sha1_header(char *hdr, 
 	i = 10;
 	for (;;) {
 		char c = *hdr++;
+		bytes++;
 		if (c == ' ')
 			break;
 		if (!--i)
@@ -746,6 +747,7 @@ static int parse_sha1_header(char *hdr, 
 	 * decimal format (ie "010" is not valid).
 	 */
 	size = *hdr++ - '0';
+	bytes++;
 	if (size > 9)
 		return -1;
 	if (size) {
@@ -754,6 +756,7 @@ static int parse_sha1_header(char *hdr, 
 			if (c > 9)
 				break;
 			hdr++;
+			bytes++;
 			size = size * 10 + c;
 		}
 	}
@@ -762,20 +765,65 @@ static int parse_sha1_header(char *hdr, 
 	/*
 	 * The length must be followed by a zero byte
 	 */
-	return *hdr ? -1 : 0;
+	bytes++;
+	if (*hdr)
+		bytes = -1;
+	return bytes;
+}
+
+static int parse_binary_sha1_header(char *hdr, char *type, unsigned long *sizep)
+{
+	unsigned char c;
+	int bytes = 1;
+	unsigned long size;
+	unsigned object_type, bits;
+	static const char *typename[8] = {
+		NULL,	/* OBJ_EXT */
+		"commit", "tree", "blob", "tag",
+		NULL, NULL, NULL 
+	};	
+
+	c = *hdr++;
+	object_type = (c >> 4) & 7;
+	if (!typename[object_type])
+		return -1;
+	strcpy(type, typename[object_type]);
+	size = c & 15;
+	bits = 4;
+	while (!(c & 0x80)) {
+		if (bits >= 8*sizeof(unsigned long))
+			return -1;
+		c = *hdr++;
+		size += (unsigned long) (c & 0x7f) << bits;
+		bytes++;
+		bits += 7;
+	}
+	*sizep = size;
+	return bytes;
+}
+
+static int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
+{
+	int retval = parse_ascii_sha1_header(hdr, type, sizep);
+	if (retval < 0)
+		retval = parse_binary_sha1_header(hdr, type, sizep);
+	return retval;
 }
 
 void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size)
 {
-	int ret;
+	int ret, hdrlen;
 	z_stream stream;
 	char hdr[8192];
 
 	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || parse_sha1_header(hdr, type, size) < 0)
+	if (ret < Z_OK)
+		return NULL;
+	hdrlen = parse_sha1_header(hdr, type, size);
+	if (hdrlen < 0)
 		return NULL;
 
-	return unpack_sha1_rest(&stream, hdr, *size);
+	return unpack_sha1_rest(&stream, hdr, *size, hdrlen);
 }
 
 /* forward declaration for a mutually recursive function */
@@ -1192,7 +1240,7 @@ struct packed_git *find_sha1_pack(const 
 
 int sha1_object_info(const unsigned char *sha1, char *type, unsigned long *sizep)
 {
-	int status;
+	int status, hdrlen;
 	unsigned long mapsize, size;
 	void *map;
 	z_stream stream;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 17:12       ` [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format" Linus Torvalds
@ 2006-07-11 18:40         ` Johannes Schindelin
  2006-07-11 18:58           ` Linus Torvalds
  2006-07-11 21:24         ` sf
  1 sibling, 1 reply; 37+ messages in thread
From: Johannes Schindelin @ 2006-07-11 18:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carl Baldwin, Junio C Hamano, Git Mailing List

Hi,

On Tue, 11 Jul 2006, Linus Torvalds wrote:

> @@ -762,20 +765,65 @@ static int parse_sha1_header(char *hdr, 
>  	/*
>  	 * The length must be followed by a zero byte
>  	 */
> -	return *hdr ? -1 : 0;
> +	bytes++;
> +	if (*hdr)
> +		bytes = -1;
> +	return bytes;

Why not just say

	return *hdr ? -1 : bytes;

> +}
> +
> +static int parse_binary_sha1_header(char *hdr, char *type, unsigned long *sizep)
> +{
> +	unsigned char c;
> +	int bytes = 1;
> +	unsigned long size;
> +	unsigned object_type, bits;
> +	static const char *typename[8] = {
> +		NULL,	/* OBJ_EXT */
> +		"commit", "tree", "blob", "tag",
> +		NULL, NULL, NULL 
> +	};	
> +
> +	c = *hdr++;
> +	object_type = (c >> 4) & 7;
> +	if (!typename[object_type])
> +		return -1;

You might want to add a comment saying "since the type is lowercase in 
ascii format, object_type must be 6 or 7, which is an invalid object 
type." It took me a little to figure that out...

> +	strcpy(type, typename[object_type]);
> +	size = c & 15;

Just a nit here: I think 0xf is easier to read with boolean operations.

> +	bits = 4;
> +	while (!(c & 0x80)) {
> +		if (bits >= 8*sizeof(unsigned long))
> +			return -1;
> +		c = *hdr++;
> +		size += (unsigned long) (c & 0x7f) << bits;
> +		bytes++;
> +		bits += 7;
> +	}

Are you not losing the last byte by putting the "while" _before_ instead 
of _after_ the loop?

> @@ -1192,7 +1240,7 @@ struct packed_git *find_sha1_pack(const 
>  
>  int sha1_object_info(const unsigned char *sha1, char *type, unsigned long *sizep)
>  {
> -	int status;
> +	int status, hdrlen;
>  	unsigned long mapsize, size;
>  	void *map;
>  	z_stream stream;
> -

This hunk is unnecessary, right?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 18:40         ` Johannes Schindelin
@ 2006-07-11 18:58           ` Linus Torvalds
  2006-07-11 19:20             ` Johannes Schindelin
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 18:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Carl Baldwin, Junio C Hamano, Git Mailing List

On Tue, 11 Jul 2006, Johannes Schindelin wrote:
> 
> Why not just say
> 
> 	return *hdr ? -1 : bytes;

Hey, whatever works. I rewrote more, and edited some of my changes down 
again..

> You might want to add a comment saying "since the type is lowercase in 
> ascii format, object_type must be 6 or 7, which is an invalid object 
> type." It took me a little to figure that out...

It's not even correct in my version - I check the ASCII header _first_, so 
by the time it looks at the binary one, it already knows it's not ascii.

The problematic case is actually the other way around: my 
"parse_ascii_sha1_header()" isn't strict enough.

Or, more likely, the parse_sha1_header() function should just be changed 
to check the binary format first (and then add your comment about why that 
is safe).

> > +	bits = 4;
> > +	while (!(c & 0x80)) {
> > +		if (bits >= 8*sizeof(unsigned long))
> > +			return -1;
> > +		c = *hdr++;
> > +		size += (unsigned long) (c & 0x7f) << bits;
> > +		bytes++;
> > +		bits += 7;
> > +	}
> 
> Are you not losing the last byte by putting the "while" _before_ instead 
> of _after_ the loop?

No. The very first byte can have the 0x80 end marker, when the size was 
between 0..15.

> >  int sha1_object_info(const unsigned char *sha1, char *type, unsigned long *sizep)
> >  {
> > -	int status;
> > +	int status, hdrlen;
> >  	unsigned long mapsize, size;
> >  	void *map;
> >  	z_stream stream;
> 
> This hunk is unnecessary, right?

Yeah, never mind. That function didn't actually need the hdrlen, it only 
cared about the SHA1.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 18:58           ` Linus Torvalds
@ 2006-07-11 19:20             ` Johannes Schindelin
  2006-07-11 19:48               ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Johannes Schindelin @ 2006-07-11 19:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Carl Baldwin, Junio C Hamano, Git Mailing List

Hi,

On Tue, 11 Jul 2006, Linus Torvalds wrote:

> On Tue, 11 Jul 2006, Johannes Schindelin wrote:
> 
> > You might want to add a comment saying "since the type is lowercase in 
> > ascii format, object_type must be 6 or 7, which is an invalid object 
> > type." It took me a little to figure that out...
> 
> It's not even correct in my version - I check the ASCII header _first_, so 
> by the time it looks at the binary one, it already knows it's not ascii.

Just realized it all by myself...

> Or, more likely, the parse_sha1_header() function should just be changed 
> to check the binary format first (and then add your comment about why that 
> is safe).

Yes, exactly.

> > > +	bits = 4;
> > > +	while (!(c & 0x80)) {
> > > +		if (bits >= 8*sizeof(unsigned long))
> > > +			return -1;
> > > +		c = *hdr++;
> > > +		size += (unsigned long) (c & 0x7f) << bits;
> > > +		bytes++;
> > > +		bits += 7;
> > > +	}
> > 
> > Are you not losing the last byte by putting the "while" _before_ instead 
> > of _after_ the loop?
> 
> No. The very first byte can have the 0x80 end marker, when the size was 
> between 0..15.

Yes, I understand now. I was a little confused by the way it is written...

Thanks for the clarification,
Dscho

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 19:20             ` Johannes Schindelin
@ 2006-07-11 19:48               ` Linus Torvalds
  2006-07-11 21:25                 ` Johannes Schindelin
  2006-07-11 21:47                 ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 19:48 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Carl Baldwin, Junio C Hamano, Git Mailing List



On Tue, 11 Jul 2006, Johannes Schindelin wrote:
> 
> > Or, more likely, the parse_sha1_header() function should just be changed 
> > to check the binary format first (and then add your comment about why that 
> > is safe).
> 
> Yes, exactly.

Here's a newer verson of [2/3], with these issues fixed. It actually fixes 
things twice: (a) by parsing the binary version first (which makes sense 
for a totally independent reason - if that is going to be the "default" 
version in the long run, we should just test it first anyway) and (b) by 
making the ASCII version parser stricter too.

This did, btw, also fix the test failure, so the fact that the ASCII 
header parser wasn't careful enough was actually a problem in real life.

So please throw away the old version.

		Linus

---
From: Linus Torvalds <torvalds@osdl.org>
Subject: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"

The pack-file format is slightly different from the traditional git
object format, in that it has a much denser binary header encoding.

The traditional format uses an ASCII string with type and length
information, which is somewhat wasteful.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
 sha1_file.c |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8734d50..15ccf5e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -697,9 +697,9 @@ static int unpack_sha1_header(z_stream *
 	return inflate(stream, 0);
 }
 
-static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size)
+static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size, unsigned int hdrlen)
 {
-	int bytes = strlen(buffer) + 1;
+	int bytes = hdrlen;
 	unsigned char *buf = xmalloc(1+size);
 
 	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
@@ -720,25 +720,40 @@ static void *unpack_sha1_rest(z_stream *
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-static int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
+static int parse_ascii_sha1_header(char *hdr, char *type, unsigned long *sizep)
 {
-	int i;
+	int bytes = 0;
 	unsigned long size;
 
 	/*
 	 * The type can be at most ten bytes (including the 
 	 * terminating '\0' that we add), and is followed by
 	 * a space. 
+	 *
+	 * We want at least three characters, and they should
+	 * all be normal lower-case letters.
 	 */
-	i = 10;
 	for (;;) {
-		char c = *hdr++;
+		unsigned char c = *hdr++;
+		bytes++;
 		if (c == ' ')
 			break;
-		if (!--i)
+		if (bytes >= 10)
 			return -1;
 		*type++ = c;
+
+		/*
+		 * The high nybble must be 6 of 7, see
+		 * parse_binary_header(). This covers
+		 * all ASCII lowercase characters.
+		 */
+		if (c < 0x60 || c > 0x7f)
+			return -1;
 	}
+
+	/* Minimum three letters and the space */
+	if (bytes < 4)
+		return -1;
 	*type = 0;
 
 	/*
@@ -746,6 +761,7 @@ static int parse_sha1_header(char *hdr, 
 	 * decimal format (ie "010" is not valid).
 	 */
 	size = *hdr++ - '0';
+	bytes++;
 	if (size > 9)
 		return -1;
 	if (size) {
@@ -754,6 +770,7 @@ static int parse_sha1_header(char *hdr, 
 			if (c > 9)
 				break;
 			hdr++;
+			bytes++;
 			size = size * 10 + c;
 		}
 	}
@@ -762,20 +779,74 @@ static int parse_sha1_header(char *hdr, 
 	/*
 	 * The length must be followed by a zero byte
 	 */
-	return *hdr ? -1 : 0;
+	bytes++;
+	return *hdr ? -1 : bytes;
+}
+
+/*
+ * We never confuse a binary header with an old ASCII one,
+ * because the ASCII one will always start with a lower-case
+ * letter, meaning that the first byte will be of the form
+ * 0x6? or 0x7?.
+ *
+ * That in turn would be parsed as object type 6 or 7, neither
+ * of which is valid for a unpacked object (object type 7 is
+ * a delta, and can only exist in a pack-file, while object type
+ * 6 is invalid).
+ */
+static int parse_binary_sha1_header(char *hdr, char *type, unsigned long *sizep)
+{
+	unsigned char c;
+	int bytes = 1;
+	unsigned long size;
+	unsigned object_type, bits;
+	static const char *typename[8] = {
+		NULL,	/* OBJ_EXT */
+		"commit", "tree", "blob", "tag",
+		NULL, NULL, NULL
+	};
+
+	c = *hdr++;
+	object_type = (c >> 4) & 7;
+	if (!typename[object_type])
+		return -1;
+	strcpy(type, typename[object_type]);
+	size = c & 15;
+	bits = 4;
+	while (!(c & 0x80)) {
+		if (bits >= 8*sizeof(unsigned long))
+			return -1;
+		c = *hdr++;
+		size += (unsigned long) (c & 0x7f) << bits;
+		bytes++;
+		bits += 7;
+	}
+	*sizep = size;
+	return bytes;
+}
+
+static int parse_sha1_header(char *hdr, char *type, unsigned long *sizep)
+{
+	int retval = parse_binary_sha1_header(hdr, type, sizep);
+	if (retval < 0)
+		retval = parse_ascii_sha1_header(hdr, type, sizep);
+	return retval;
 }
 
 void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size)
 {
-	int ret;
+	int ret, hdrlen;
 	z_stream stream;
 	char hdr[8192];
 
 	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || parse_sha1_header(hdr, type, size) < 0)
+	if (ret < Z_OK)
+		return NULL;
+	hdrlen = parse_sha1_header(hdr, type, size);
+	if (hdrlen < 0)
 		return NULL;
 
-	return unpack_sha1_rest(&stream, hdr, *size);
+	return unpack_sha1_rest(&stream, hdr, *size, hdrlen);
 }
 
 /* forward declaration for a mutually recursive function */

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 19:48               ` Linus Torvalds
@ 2006-07-11 21:25                 ` Johannes Schindelin
  2006-07-11 21:47                 ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Johannes Schindelin @ 2006-07-11 21:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Hi,

On Tue, 11 Jul 2006, Linus Torvalds wrote:

> On Tue, 11 Jul 2006, Johannes Schindelin wrote:
> > 
> > > Or, more likely, the parse_sha1_header() function should just be changed 
> > > to check the binary format first (and then add your comment about why that 
> > > is safe).
> > 
> > Yes, exactly.
> 
> Here's a newer verson of [2/3], with these issues fixed. It actually fixes 
> things twice: (a) by parsing the binary version first (which makes sense 
> for a totally independent reason - if that is going to be the "default" 
> version in the long run, we should just test it first anyway) and (b) by 
> making the ASCII version parser stricter too.

Melikey.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 19:48               ` Linus Torvalds
  2006-07-11 21:25                 ` Johannes Schindelin
@ 2006-07-11 21:47                 ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2006-07-11 21:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Schindelin, Carl Baldwin, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> Here's a newer verson of [2/3], with these issues fixed. It actually fixes 
> things twice: (a) by parsing the binary version first (which makes sense 
> for a totally independent reason - if that is going to be the "default" 
> version in the long run, we should just test it first anyway) and (b) by 
> making the ASCII version parser stricter too.

Wait a minute.

 read-sha1-file maps sha1-file-internal (for unpacked one), and then
 calls unpack-sha1-file.

 unpack-sha1-file calls unpack-sha1-header to start inflation,
 lets parse-sha1-header to read the header in the inflated
 buffer, and calls unpack-sha1-rest to inflate the rest.

But in packs, we have binary header not deflated, followed by
deflated payload.  If we want to copy things from loose objects
into pack without changing the packfile format, this change
would not help, I suspect.

At least, your updated unpack_sha1_file() needs to check for
binary header first (starting from "map"), and if that starts
with binary header, start inflating after the header to extract
the payload.  Otherwise you would do the traditional.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 17:12       ` [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format" Linus Torvalds
  2006-07-11 18:40         ` Johannes Schindelin
@ 2006-07-11 21:24         ` sf
  2006-07-11 22:09           ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: sf @ 2006-07-11 21:24 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
> The pack-file format is slightly different from the traditional git
> object format, in that it has a much denser binary header encoding.
> 
> The traditional format uses an ASCII string with type and length
> information, which is somewhat wasteful.

And in the traditional format type and length are compressed whereas in
the pack-file format they are not.

> This should probably be applied to the main tree asap if we think
> this is at all a worthwhile exercise. But somebody should verify that I 
> got the format right first!

Sorry but see above.

Regards
	Stephan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 21:24         ` sf
@ 2006-07-11 22:09           ` Linus Torvalds
  2006-07-11 22:25             ` sf
  2006-07-11 23:03             ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 22:09 UTC (permalink / raw)
  To: sf; +Cc: git

On Tue, 11 Jul 2006, sf wrote:
> 
> And in the traditional format type and length are compressed whereas in
> the pack-file format they are not.

Ahh. Yes.

> > This should probably be applied to the main tree asap if we think
> > this is at all a worthwhile exercise. But somebody should verify that I 
> > got the format right first!
> 
> Sorry but see above.

Good catch, thanks indeed.

Doing that for unpacked objects would in fact make a lot of things much 
simpler, so it would be good to do. The _bad_ part is that this also makes 
it a lot harder to see the difference between a "binary header" and a 
"compressed ASCII header". The two are not "obviously different" any more.

The common byte sequence for a compressed stream is

	78 9c ...

where the first byte is the CMF byte (compression info and method).

But it's not the only possible such sequence according to the zlib format.

(The 16-bit hex number in MSB format, ie 0x789c above, is defined to have 
a built-in checksum, so that it must be a multiple of 31 according to the 
standard: 0x789c = 996 * 31).

So if we have a uncompressed header, we'd need to add a separate 2-byte 
fingerprint to it _before_ the real header that isn't divisible by 31, and 
use that as the thing to test.

Ho humm. I'll see what I can come up with.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 22:09           ` Linus Torvalds
@ 2006-07-11 22:25             ` sf
  2006-07-11 23:03             ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: sf @ 2006-07-11 22:25 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
...
> The common byte sequence for a compressed stream is
> 
> 	78 9c ...
> 
> where the first byte is the CMF byte (compression info and method).
> 
> But it's not the only possible such sequence according to the zlib format.

No, but 0x78 is the only first byte ever produced by git; the files are
always deflated (second nibble is 8) with window size 32K (first nibble
is 7).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 22:09           ` Linus Torvalds
  2006-07-11 22:25             ` sf
@ 2006-07-11 23:03             ` Junio C Hamano
  2006-07-12  0:03               ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-07-11 23:03 UTC (permalink / raw)
  To: git

Linus Torvalds <torvalds@osdl.org> writes:

> So if we have a uncompressed header, we'd need to add a separate 2-byte 
> fingerprint to it _before_ the real header that isn't divisible by 31, and 
> use that as the thing to test.
>
> Ho humm. I'll see what I can come up with.

I do not like to rely too heavily on what zlib compression's
beginning of stream looks like.

I think the new format can be deflated new header (fully
synched) followed by deflated payload.

So the sequence unpack-sha1-header followed by parse-sha1-header
would notice we are dealing with new format and reinitialize the
deflator at the point where the header deflator left off.

Wouldn't that work?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-11 23:03             ` Junio C Hamano
@ 2006-07-12  0:03               ` Linus Torvalds
  2006-07-12  0:39                 ` Johannes Schindelin
                                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12  0:03 UTC (permalink / raw)
  To: Junio C Hamano, sf, Johannes Schindelin; +Cc: Git Mailing List



On Tue, 11 Jul 2006, Junio C Hamano wrote:
> 
> I do not like to rely too heavily on what zlib compression's
> beginning of stream looks like.

Well, I normally would agree with you if it was a "oh, all our zlib 
objects seem to start with 0x78" thing, but after having dug into both the 
zlib standard (which is actually an RFC, not just some random thing), and 
looked at the sources, it's definitely the case that the "0x78" byte isn't 
just an implementation detail.

It's actually really part of the specs, and not just happenstance.

> I think the new format can be deflated new header (fully
> synched) followed by deflated payload.

That would work, but on the other hand, one of the advantages of doing the 
new format would be that the "check size and type" code wouldn't even need 
to call into the zlib code. 

Anyway, I think this following patch replaces the old 2/3 and 3/3 (it 
still depends on the original [1/3] cleanup.

(It also renames and reverses the meaning of the config file option: it's 
now "[core] LegacyHeaders = true" for using legacy headers.)

Not heavily tested, but seems ok.

sf? Dscho? Can you check this thing out?

		Linus
----
 Documentation/config.txt |    6 +++
 cache.h                  |    1 
 config.c                 |    5 ++
 environment.c            |    1 
 sha1_file.c              |  106 +++++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0b434c1..9780c89 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -97,6 +97,12 @@ core.compression::
 	compression, and 1..9 are various speed/size tradeoffs, 9 being
 	slowest.
 
+core.legacyheaders::
+	A boolean which enables the legacy object header format in case
+	you want to interoperate with old clients accessing the object
+	database directly (where the "http://" and "rsync://" protocols
+	count as direct access).
+
 alias.*::
 	Command aliases for the gitlink:git[1] command wrapper - e.g.
 	after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index d433d46..eee5fc9 100644
--- a/cache.h
+++ b/cache.h
@@ -176,6 +176,7 @@ extern int commit_lock_file(struct lock_
 extern void rollback_lock_file(struct lock_file *);
 
 /* Environment bits from configuration mechanism */
+extern int use_legacy_headers;
 extern int trust_executable_bit;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
diff --git a/config.c b/config.c
index 8445f7d..0ac6aeb 100644
--- a/config.c
+++ b/config.c
@@ -279,6 +279,11 @@ int git_default_config(const char *var, 
 		return 0;
 	}
 
+	if (!strcmp(var, "core.legacyheaders")) {
+		use_legacy_headers = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.compression")) {
 		int level = git_config_int(var, value);
 		if (level == -1)
diff --git a/environment.c b/environment.c
index 97d42b1..d80a39a 100644
--- a/environment.c
+++ b/environment.c
@@ -11,6 +11,7 @@ #include "cache.h"
 
 char git_default_email[MAX_GITNAME];
 char git_default_name[MAX_GITNAME];
+int use_legacy_headers = 0;
 int trust_executable_bit = 1;
 int assume_unchanged = 0;
 int prefer_symlink_refs = 0;
diff --git a/sha1_file.c b/sha1_file.c
index 8734d50..475b23d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -684,26 +684,74 @@ static void *map_sha1_file_internal(cons
 	return map;
 }
 
-static int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size)
+static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
 {
+	unsigned char c;
+	unsigned int word, bits;
+	unsigned long size;
+	static const char *typename[8] = {
+		NULL,	/* OBJ_EXT */
+		"commit", "tree", "blob", "tag",
+		NULL, NULL, NULL
+	};
+	const char *type;
+
 	/* Get the data stream */
 	memset(stream, 0, sizeof(*stream));
 	stream->next_in = map;
 	stream->avail_in = mapsize;
 	stream->next_out = buffer;
-	stream->avail_out = size;
+	stream->avail_out = bufsiz;
+
+	/*
+	 * Is it a zlib-compressed buffer? If so, the first byte
+	 * must be 0x78 (15-bit window size, deflated), and the
+	 * first 16-bit word is evenly divisible by 31
+	 */
+	word = (map[0] << 8) + map[1];
+	if (map[0] == 0x78 && !(word % 31)) {
+		inflateInit(stream);
+		return inflate(stream, 0);  
+	}
+
+	c = *map++;
+	mapsize--;
+	type = typename[(c >> 4) & 7];
+	if (!type)
+		return -1;
+	
+	bits = 4;
+	size = c & 0xf;
+	while (!(c & 0x80)) {
+		if (bits >= 8*sizeof(long))
+			return -1;
+		c = *map++;
+		size += (c & 0x7f) << bits;
+		bits += 7;
+		mapsize--;
+	}
 
+	/* Set up the stream for the rest.. */
+	stream->next_in = map;
+	stream->avail_in = mapsize;
 	inflateInit(stream);
-	return inflate(stream, 0);
+
+	/* And generate the fake traditional header */
+	stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu", type, size);
+	return 0;
 }
 
 static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size)
 {
 	int bytes = strlen(buffer) + 1;
 	unsigned char *buf = xmalloc(1+size);
+	unsigned long n;
 
-	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
-	bytes = stream->total_out - bytes;
+	n = stream->total_out - bytes;
+	if (n > size)
+		n = size;
+	memcpy(buf, (char *) buffer + bytes, n);
+	bytes = n;
 	if (bytes < size) {
 		stream->next_out = buf + bytes;
 		stream->avail_out = size - bytes;
@@ -1414,6 +1462,49 @@ static int write_buffer(int fd, const vo
 	return 0;
 }
 
+static int write_binary_header(unsigned char *hdr, enum object_type type, unsigned long len)
+{
+	int hdr_len;
+	unsigned char c;
+
+	c = (type << 4) | (len & 15);
+	len >>= 4;
+	hdr_len = 1;
+	while (len) {
+		*hdr++ = c;
+		hdr_len++;
+		c = (len & 0x7f);
+		len >>= 7;
+	}
+	*hdr = c | 0x80;
+	return hdr_len;
+}
+
+static void setup_object_header(z_stream *stream, const char *type, unsigned long len)
+{
+	int obj_type, hdr;
+
+	if (use_legacy_headers) {
+		while (deflate(stream, 0) == Z_OK)
+			/* nothing */;
+		return;
+	}
+	if (!strcmp(type, blob_type))
+		obj_type = OBJ_BLOB;
+	else if (!strcmp(type, tree_type))
+		obj_type = OBJ_TREE;
+	else if (!strcmp(type, commit_type))
+		obj_type = OBJ_COMMIT;
+	else if (!strcmp(type, tag_type))
+		obj_type = OBJ_TAG;
+	else
+		die("trying to generate bogus object of type '%s'", type);
+	hdr = write_binary_header(stream->next_out, obj_type, len);
+	stream->total_out = hdr;
+	stream->next_out += hdr;
+	stream->avail_out -= hdr;
+}
+
 int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
 {
 	int size;
@@ -1459,7 +1550,7 @@ int write_sha1_file(void *buf, unsigned 
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, zlib_compression_level);
-	size = deflateBound(&stream, len+hdrlen);
+	size = 8 + deflateBound(&stream, len+hdrlen);
 	compressed = xmalloc(size);
 
 	/* Compress it */
@@ -1469,8 +1560,7 @@ int write_sha1_file(void *buf, unsigned 
 	/* First header.. */
 	stream.next_in = hdr;
 	stream.avail_in = hdrlen;
-	while (deflate(&stream, 0) == Z_OK)
-		/* nothing */;
+	setup_object_header(&stream, type, len);
 
 	/* Then the data itself.. */
 	stream.next_in = buf;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  0:03               ` Linus Torvalds
@ 2006-07-12  0:39                 ` Johannes Schindelin
  2006-07-12  3:45                   ` Linus Torvalds
  2006-07-12  0:46                 ` Junio C Hamano
  2006-07-12  6:49                 ` Peter Baumann
  2 siblings, 1 reply; 37+ messages in thread
From: Johannes Schindelin @ 2006-07-12  0:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, sf, Git Mailing List

Hi,

On Tue, 11 Jul 2006, Linus Torvalds wrote:

> On Tue, 11 Jul 2006, Junio C Hamano wrote:
> > 
> > I do not like to rely too heavily on what zlib compression's
> > beginning of stream looks like.
> 
> Well, I normally would agree with you if it was a "oh, all our zlib 
> objects seem to start with 0x78" thing, but after having dug into both the 
> zlib standard (which is actually an RFC, not just some random thing), and 
> looked at the sources, it's definitely the case that the "0x78" byte isn't 
> just an implementation detail.
> 
> It's actually really part of the specs, and not just happenstance.

Good.

> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -684,26 +684,74 @@ static void *map_sha1_file_internal(cons
>  	return map;
>  }
>  
> -static int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size)
> +static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
>  {
> +	unsigned char c;
> +	unsigned int word, bits;
> +	unsigned long size;
> +	static const char *typename[8] = {
> +		NULL,	/* OBJ_EXT */
> +		"commit", "tree", "blob", "tag",
> +		NULL, NULL, NULL
> +	};

I completely forgot to mention that type_names[] is already declared in 
object.h. Obviously, it is not really important, but maybe it would be 
better to obey the DRY principle (think addition of "bind" object type).

> +	while (!(c & 0x80)) {
> +		if (bits >= 8*sizeof(long))

Another nit: while it is safe to assume that sizeof(long) == 
sizeof(unsigned long), it was nevertheless a little confusing to yours 
truly (especially since you changed it since your last patch).

>  static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size)
>  {
>  	int bytes = strlen(buffer) + 1;
>  	unsigned char *buf = xmalloc(1+size);
> +	unsigned long n;
>  
> -	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
> -	bytes = stream->total_out - bytes;
> +	n = stream->total_out - bytes;
> +	if (n > size)
> +		n = size;

Just out of curiosity: when can this happen? I mean, there is no error or 
something which could tell the caller that not the whole object was 
copied.

>  	memset(&stream, 0, sizeof(stream));
>  	deflateInit(&stream, zlib_compression_level);
> -	size = deflateBound(&stream, len+hdrlen);
> +	size = 8 + deflateBound(&stream, len+hdrlen);

Again, I had to think why this is correct. I think it should be something 
like 2 + sizeof(unsigned long) * 8 / 7, but I did not think all that hard.

It looks good to me, but I did not really test...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  0:39                 ` Johannes Schindelin
@ 2006-07-12  3:45                   ` Linus Torvalds
  2006-07-12  4:31                     ` Linus Torvalds
  2006-07-12  6:35                     ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12  3:45 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, sf, Git Mailing List



On Wed, 12 Jul 2006, Johannes Schindelin wrote:
>> 
> I completely forgot to mention that type_names[] is already declared in 
> object.h. Obviously, it is not really important, but maybe it would be 
> better to obey the DRY principle (think addition of "bind" object type).

Actually, the type_names[] array in object.h is _totally_ different. That 
one is indexed by TYPE_xxx, not OBJ_xxxx.

And yeah, it's stupid to have two different numberings for the same thing, 
and we really really shouldn't. But the OBJ_xxx numbering was originally a 
a totally internal implementation detail to just the pack-files. 

I shouldn't have introduced the new TYPE_xxx macros. I should just have 
used the same OBJ_xxx macros that we use in the pack-file.

Junio: this patch is totally independent from the other patches, and is on 
top of you current "master". It gets rid of TYPE_xxx in favor of the 
OBJ_xxx enums, which are moved from pack.h into object.h.

That way, we can't have people mixing up the two different kinds of type 
enumeration.

(Eventually, I want to get rid of passing around the "type strings" 
entirely, and this will help - no confusion about two different integer 
enumeration).

		Linus

----
diff --git a/blob.c b/blob.c
index 496f270..d1af2e6 100644
--- a/blob.c
+++ b/blob.c
@@ -10,12 +10,12 @@ struct blob *lookup_blob(const unsigned 
 	if (!obj) {
 		struct blob *ret = alloc_blob_node();
 		created_object(sha1, &ret->object);
-		ret->object.type = TYPE_BLOB;
+		ret->object.type = OBJ_BLOB;
 		return ret;
 	}
 	if (!obj->type)
-		obj->type = TYPE_BLOB;
-	if (obj->type != TYPE_BLOB) {
+		obj->type = OBJ_BLOB;
+	if (obj->type != OBJ_BLOB) {
 		error("Object %s is a %s, not a blob",
 		      sha1_to_hex(sha1), typename(obj->type));
 		return NULL;
diff --git a/builtin-diff.c b/builtin-diff.c
index ae901dd..cb38f44 100644
--- a/builtin-diff.c
+++ b/builtin-diff.c
@@ -285,9 +285,9 @@ int cmd_diff(int argc, const char **argv
 		obj = deref_tag(obj, NULL, 0);
 		if (!obj)
 			die("invalid object '%s' given.", name);
-		if (obj->type == TYPE_COMMIT)
+		if (obj->type == OBJ_COMMIT)
 			obj = &((struct commit *)obj)->tree->object;
-		if (obj->type == TYPE_TREE) {
+		if (obj->type == OBJ_TREE) {
 			if (ARRAY_SIZE(ent) <= ents)
 				die("more than %d trees given: '%s'",
 				    (int) ARRAY_SIZE(ent), name);
@@ -297,7 +297,7 @@ int cmd_diff(int argc, const char **argv
 			ents++;
 			continue;
 		}
-		if (obj->type == TYPE_BLOB) {
+		if (obj->type == OBJ_BLOB) {
 			if (2 <= blobs)
 				die("more than two blobs given: '%s'", name);
 			memcpy(blob[blobs].sha1, obj->sha1, 20);
diff --git a/builtin-fmt-merge-msg.c b/builtin-fmt-merge-msg.c
index 6527482..b3c4f98 100644
--- a/builtin-fmt-merge-msg.c
+++ b/builtin-fmt-merge-msg.c
@@ -177,7 +177,7 @@ static void shortlog(const char *name, u
 	int flags = UNINTERESTING | TREECHANGE | SEEN | SHOWN | ADDED;
 
 	branch = deref_tag(parse_object(sha1), sha1_to_hex(sha1), 40);
-	if (!branch || branch->type != TYPE_COMMIT)
+	if (!branch || branch->type != OBJ_COMMIT)
 		return;
 
 	setup_revisions(0, NULL, rev, NULL);
diff --git a/builtin-grep.c b/builtin-grep.c
index 4c2f7df..a79bac3 100644
--- a/builtin-grep.c
+++ b/builtin-grep.c
@@ -891,9 +891,9 @@ static int grep_tree(struct grep_opt *op
 static int grep_object(struct grep_opt *opt, const char **paths,
 		       struct object *obj, const char *name)
 {
-	if (obj->type == TYPE_BLOB)
+	if (obj->type == OBJ_BLOB)
 		return grep_sha1(opt, obj->sha1, name);
-	if (obj->type == TYPE_COMMIT || obj->type == TYPE_TREE) {
+	if (obj->type == OBJ_COMMIT || obj->type == OBJ_TREE) {
 		struct tree_desc tree;
 		void *data;
 		int hit;
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 63bad0e..8f32871 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -167,16 +167,16 @@ static void show_commit_list(struct rev_
 		const char *name = pending->name;
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
-		if (obj->type == TYPE_TAG) {
+		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
 			add_object_array(obj, name, &objects);
 			continue;
 		}
-		if (obj->type == TYPE_TREE) {
+		if (obj->type == OBJ_TREE) {
 			process_tree((struct tree *)obj, &objects, NULL, name);
 			continue;
 		}
-		if (obj->type == TYPE_BLOB) {
+		if (obj->type == OBJ_BLOB) {
 			process_blob((struct blob *)obj, &objects, NULL, name);
 			continue;
 		}
diff --git a/commit.c b/commit.c
index c6bf10d..46d5867 100644
--- a/commit.c
+++ b/commit.c
@@ -56,7 +56,7 @@ static struct commit *check_commit(struc
 				   const unsigned char *sha1,
 				   int quiet)
 {
-	if (obj->type != TYPE_COMMIT) {
+	if (obj->type != OBJ_COMMIT) {
 		if (!quiet)
 			error("Object %s is a %s, not a commit",
 			      sha1_to_hex(sha1), typename(obj->type));
@@ -86,11 +86,11 @@ struct commit *lookup_commit(const unsig
 	if (!obj) {
 		struct commit *ret = alloc_commit_node();
 		created_object(sha1, &ret->object);
-		ret->object.type = TYPE_COMMIT;
+		ret->object.type = OBJ_COMMIT;
 		return ret;
 	}
 	if (!obj->type)
-		obj->type = TYPE_COMMIT;
+		obj->type = OBJ_COMMIT;
 	return check_commit(obj, sha1, 0);
 }
 
diff --git a/describe.c b/describe.c
index 8e68d5d..324ca89 100644
--- a/describe.c
+++ b/describe.c
@@ -67,7 +67,7 @@ static int get_name(const char *path, co
 	 * Otherwise only annotated tags are used.
 	 */
 	if (!strncmp(path, "refs/tags/", 10)) {
-		if (object->type == TYPE_TAG)
+		if (object->type == OBJ_TAG)
 			prio = 2;
 		else
 			prio = 1;
diff --git a/fetch-pack.c b/fetch-pack.c
index f2c51eb..b7824db 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -46,7 +46,7 @@ static int rev_list_insert_ref(const cha
 {
 	struct object *o = deref_tag(parse_object(sha1), path, 0);
 
-	if (o && o->type == TYPE_COMMIT)
+	if (o && o->type == OBJ_COMMIT)
 		rev_list_push((struct commit *)o, SEEN);
 
 	return 0;
@@ -256,14 +256,14 @@ static int mark_complete(const char *pat
 {
 	struct object *o = parse_object(sha1);
 
-	while (o && o->type == TYPE_TAG) {
+	while (o && o->type == OBJ_TAG) {
 		struct tag *t = (struct tag *) o;
 		if (!t->tagged)
 			break; /* broken repository */
 		o->flags |= COMPLETE;
 		o = parse_object(t->tagged->sha1);
 	}
-	if (o && o->type == TYPE_COMMIT) {
+	if (o && o->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)o;
 		commit->object.flags |= COMPLETE;
 		insert_by_date(commit, &complete);
@@ -357,7 +357,7 @@ static int everything_local(struct ref *
 		 * in sync with the other side at some time after
 		 * that (it is OK if we guess wrong here).
 		 */
-		if (o->type == TYPE_COMMIT) {
+		if (o->type == OBJ_COMMIT) {
 			struct commit *commit = (struct commit *)o;
 			if (!cutoff || cutoff < commit->date)
 				cutoff = commit->date;
@@ -376,7 +376,7 @@ static int everything_local(struct ref *
 		struct object *o = deref_tag(lookup_object(ref->old_sha1),
 					     NULL, 0);
 
-		if (!o || o->type != TYPE_COMMIT || !(o->flags & COMPLETE))
+		if (!o || o->type != OBJ_COMMIT || !(o->flags & COMPLETE))
 			continue;
 
 		if (!(o->flags & SEEN)) {
diff --git a/fsck-objects.c b/fsck-objects.c
index ef54a8a..e167f41 100644
--- a/fsck-objects.c
+++ b/fsck-objects.c
@@ -297,13 +297,13 @@ static int fsck_sha1(unsigned char *sha1
 	if (obj->flags & SEEN)
 		return 0;
 	obj->flags |= SEEN;
-	if (obj->type == TYPE_BLOB)
+	if (obj->type == OBJ_BLOB)
 		return 0;
-	if (obj->type == TYPE_TREE)
+	if (obj->type == OBJ_TREE)
 		return fsck_tree((struct tree *) obj);
-	if (obj->type == TYPE_COMMIT)
+	if (obj->type == OBJ_COMMIT)
 		return fsck_commit((struct commit *) obj);
-	if (obj->type == TYPE_TAG)
+	if (obj->type == OBJ_TAG)
 		return fsck_tag((struct tag *) obj);
 	/* By now, parse_object() would've returned NULL instead. */
 	return objerror(obj, "unknown type '%d' (internal fsck error)", obj->type);
@@ -472,7 +472,7 @@ static int fsck_cache_tree(struct cache_
 		}
 		mark_reachable(obj, REACHABLE);
 		obj->used = 1;
-		if (obj->type != TYPE_TREE)
+		if (obj->type != OBJ_TREE)
 			err |= objerror(obj, "non-tree in cache-tree");
 	}
 	for (i = 0; i < it->subtree_nr; i++)
diff --git a/http-push.c b/http-push.c
index f761584..4768619 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1784,16 +1784,16 @@ static int get_delta(struct rev_info *re
 
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
-		if (obj->type == TYPE_TAG) {
+		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
 			p = add_one_object(obj, p);
 			continue;
 		}
-		if (obj->type == TYPE_TREE) {
+		if (obj->type == OBJ_TREE) {
 			p = process_tree((struct tree *)obj, p, NULL, name);
 			continue;
 		}
-		if (obj->type == TYPE_BLOB) {
+		if (obj->type == OBJ_BLOB) {
 			p = process_blob((struct blob *)obj, p, NULL, name);
 			continue;
 		}
@@ -1960,12 +1960,12 @@ static int ref_newer(const unsigned char
 	 * old.  Otherwise we require --force.
 	 */
 	o = deref_tag(parse_object(old_sha1), NULL, 0);
-	if (!o || o->type != TYPE_COMMIT)
+	if (!o || o->type != OBJ_COMMIT)
 		return 0;
 	old = (struct commit *) o;
 
 	o = deref_tag(parse_object(new_sha1), NULL, 0);
-	if (!o || o->type != TYPE_COMMIT)
+	if (!o || o->type != OBJ_COMMIT)
 		return 0;
 	new = (struct commit *) o;
 
@@ -2044,7 +2044,7 @@ static void add_remote_info_ref(struct r
 	fwrite_buffer(ref_info, 1, len, buf);
 	free(ref_info);
 
-	if (o->type == TYPE_TAG) {
+	if (o->type == OBJ_TAG) {
 		o = deref_tag(o, ls->dentry_name, 0);
 		if (o) {
 			len = strlen(ls->dentry_name) + 45;
diff --git a/name-rev.c b/name-rev.c
index 083d067..f92f14e 100644
--- a/name-rev.c
+++ b/name-rev.c
@@ -84,14 +84,14 @@ static int name_ref(const char *path, co
 	if (tags_only && strncmp(path, "refs/tags/", 10))
 		return 0;
 
-	while (o && o->type == TYPE_TAG) {
+	while (o && o->type == OBJ_TAG) {
 		struct tag *t = (struct tag *) o;
 		if (!t->tagged)
 			break; /* broken repository */
 		o = parse_object(t->tagged->sha1);
 		deref = 1;
 	}
-	if (o && o->type == TYPE_COMMIT) {
+	if (o && o->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)o;
 
 		if (!strncmp(path, "refs/heads/", 11))
@@ -111,7 +111,7 @@ static const char* get_rev_name(struct o
 	struct rev_name *n;
 	struct commit *c;
 
-	if (o->type != TYPE_COMMIT)
+	if (o->type != OBJ_COMMIT)
 		return "undefined";
 	c = (struct commit *) o;
 	n = c->util;
@@ -172,7 +172,7 @@ int main(int argc, char **argv)
 		}
 
 		o = deref_tag(parse_object(sha1), *argv, 0);
-		if (!o || o->type != TYPE_COMMIT) {
+		if (!o || o->type != OBJ_COMMIT) {
 			fprintf(stderr, "Could not get commit for %s. Skipping.\n",
 					*argv);
 			continue;
diff --git a/object.c b/object.c
index 37277f9..e7ca56e 100644
--- a/object.c
+++ b/object.c
@@ -19,7 +19,8 @@ struct object *get_indexed_object(unsign
 }
 
 const char *type_names[] = {
-	"none", "blob", "tree", "commit", "bad"
+	"none", "commit", "tree", "blob", "tag", 
+	"bad type 5", "bad type 6", "delta", "bad",
 };
 
 static unsigned int hash_obj(struct object *obj, unsigned int n)
@@ -88,7 +89,7 @@ void created_object(const unsigned char 
 {
 	obj->parsed = 0;
 	obj->used = 0;
-	obj->type = TYPE_NONE;
+	obj->type = OBJ_NONE;
 	obj->flags = 0;
 	memcpy(obj->sha1, sha1, 20);
 
@@ -131,7 +132,7 @@ struct object *lookup_unknown_object(con
 	if (!obj) {
 		union any_object *ret = xcalloc(1, sizeof(*ret));
 		created_object(sha1, &ret->object);
-		ret->object.type = TYPE_NONE;
+		ret->object.type = OBJ_NONE;
 		return &ret->object;
 	}
 	return obj;
diff --git a/object.h b/object.h
index e0125e1..7893e94 100644
--- a/object.h
+++ b/object.h
@@ -24,12 +24,19 @@ struct object_array {
 #define TYPE_BITS   3
 #define FLAG_BITS  27
 
-#define TYPE_NONE   0
-#define TYPE_BLOB   1
-#define TYPE_TREE   2
-#define TYPE_COMMIT 3
-#define TYPE_TAG    4
-#define TYPE_BAD    5
+/*
+ * The object type is stored in 3 bits.
+ */
+enum object_type {
+	OBJ_NONE = 0,
+	OBJ_COMMIT = 1,
+	OBJ_TREE = 2,
+	OBJ_BLOB = 3,
+	OBJ_TAG = 4,
+	/* 5/6 for future expansion */
+	OBJ_DELTA = 7,
+	OBJ_BAD,
+};
 
 struct object {
 	unsigned parsed : 1;
@@ -40,14 +47,14 @@ struct object {
 };
 
 extern int track_object_refs;
-extern const char *type_names[];
+extern const char *type_names[9];
 
 extern unsigned int get_max_object_index(void);
 extern struct object *get_indexed_object(unsigned int);
 
 static inline const char *typename(unsigned int type)
 {
-	return type_names[type > TYPE_TAG ? TYPE_BAD : type];
+	return type_names[type > OBJ_TAG ? OBJ_BAD : type];
 }
 
 extern struct object_refs *lookup_object_refs(struct object *);
diff --git a/pack-check.c b/pack-check.c
diff --git a/pack.h b/pack.h
index 694e0c5..eb07b03 100644
--- a/pack.h
+++ b/pack.h
@@ -1,20 +1,7 @@
 #ifndef PACK_H
 #define PACK_H
 
-/*
- * The packed object type is stored in 3 bits.
- * The type value 0 is a reserved prefix if ever there is more than 7
- * object types, or any future format extensions.
- */
-enum object_type {
-	OBJ_EXT = 0,
-	OBJ_COMMIT = 1,
-	OBJ_TREE = 2,
-	OBJ_BLOB = 3,
-	OBJ_TAG = 4,
-	/* 5/6 for future expansion */
-	OBJ_DELTA = 7,
-};
+#include "object.h"
 
 /*
  * Packed object header
diff --git a/revision.c b/revision.c
index 7df9089..874e349 100644
--- a/revision.c
+++ b/revision.c
@@ -135,7 +135,7 @@ static struct commit *handle_commit(stru
 	/*
 	 * Tag object? Look what it points to..
 	 */
-	while (object->type == TYPE_TAG) {
+	while (object->type == OBJ_TAG) {
 		struct tag *tag = (struct tag *) object;
 		if (revs->tag_objects && !(flags & UNINTERESTING))
 			add_pending_object(revs, object, tag->tag);
@@ -148,7 +148,7 @@ static struct commit *handle_commit(stru
 	 * Commit object? Just return it, we'll do all the complex
 	 * reachability crud.
 	 */
-	if (object->type == TYPE_COMMIT) {
+	if (object->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)object;
 		if (parse_commit(commit) < 0)
 			die("unable to parse commit %s", name);
@@ -164,7 +164,7 @@ static struct commit *handle_commit(stru
 	 * Tree object? Either mark it uniniteresting, or add it
 	 * to the list of objects to look at later..
 	 */
-	if (object->type == TYPE_TREE) {
+	if (object->type == OBJ_TREE) {
 		struct tree *tree = (struct tree *)object;
 		if (!revs->tree_objects)
 			return NULL;
@@ -179,7 +179,7 @@ static struct commit *handle_commit(stru
 	/*
 	 * Blob object? You know the drill by now..
 	 */
-	if (object->type == TYPE_BLOB) {
+	if (object->type == OBJ_BLOB) {
 		struct blob *blob = (struct blob *)object;
 		if (!revs->blob_objects)
 			return NULL;
@@ -494,11 +494,11 @@ static int add_parents_only(struct rev_i
 		return 0;
 	while (1) {
 		it = get_reference(revs, arg, sha1, 0);
-		if (it->type != TYPE_TAG)
+		if (it->type != OBJ_TAG)
 			break;
 		memcpy(sha1, ((struct tag*)it)->tagged->sha1, 20);
 	}
-	if (it->type != TYPE_COMMIT)
+	if (it->type != OBJ_COMMIT)
 		return 0;
 	commit = (struct commit *)it;
 	for (parents = commit->parents; parents; parents = parents->next) {
diff --git a/send-pack.c b/send-pack.c
index 4019a4b..10bc8bc 100644
--- a/send-pack.c
+++ b/send-pack.c
@@ -151,12 +151,12 @@ static int ref_newer(const unsigned char
 	 * old.  Otherwise we require --force.
 	 */
 	o = deref_tag(parse_object(old_sha1), NULL, 0);
-	if (!o || o->type != TYPE_COMMIT)
+	if (!o || o->type != OBJ_COMMIT)
 		return 0;
 	old = (struct commit *) o;
 
 	o = deref_tag(parse_object(new_sha1), NULL, 0);
-	if (!o || o->type != TYPE_COMMIT)
+	if (!o || o->type != OBJ_COMMIT)
 		return 0;
 	new = (struct commit *) o;
 
diff --git a/server-info.c b/server-info.c
index fdfe05a..7df628f 100644
--- a/server-info.c
+++ b/server-info.c
@@ -12,7 +12,7 @@ static int add_info_ref(const char *path
 	struct object *o = parse_object(sha1);
 
 	fprintf(info_ref_fp, "%s	%s\n", sha1_to_hex(sha1), path);
-	if (o->type == TYPE_TAG) {
+	if (o->type == OBJ_TAG) {
 		o = deref_tag(o, path, 0);
 		if (o)
 			fprintf(info_ref_fp, "%s	%s^{}\n",
diff --git a/sha1_name.c b/sha1_name.c
index f2cbafa..5fe8e5d 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -381,13 +381,13 @@ static int peel_onion(const char *name, 
 
 	sp++; /* beginning of type name, or closing brace for empty */
 	if (!strncmp(commit_type, sp, 6) && sp[6] == '}')
-		expected_type = TYPE_COMMIT;
+		expected_type = OBJ_COMMIT;
 	else if (!strncmp(tree_type, sp, 4) && sp[4] == '}')
-		expected_type = TYPE_TREE;
+		expected_type = OBJ_TREE;
 	else if (!strncmp(blob_type, sp, 4) && sp[4] == '}')
-		expected_type = TYPE_BLOB;
+		expected_type = OBJ_BLOB;
 	else if (sp[0] == '}')
-		expected_type = TYPE_NONE;
+		expected_type = OBJ_NONE;
 	else
 		return -1;
 
@@ -416,9 +416,9 @@ static int peel_onion(const char *name, 
 				memcpy(sha1, o->sha1, 20);
 				return 0;
 			}
-			if (o->type == TYPE_TAG)
+			if (o->type == OBJ_TAG)
 				o = ((struct tag*) o)->tagged;
-			else if (o->type == TYPE_COMMIT)
+			else if (o->type == OBJ_COMMIT)
 				o = &(((struct commit *) o)->tree->object);
 			else
 				return error("%.*s: expected %s type, but the object dereferences to %s type",
diff --git a/tag.c b/tag.c
index 74d0dab..864ac1b 100644
--- a/tag.c
+++ b/tag.c
@@ -5,7 +5,7 @@ const char *tag_type = "tag";
 
 struct object *deref_tag(struct object *o, const char *warn, int warnlen)
 {
-	while (o && o->type == TYPE_TAG)
+	while (o && o->type == OBJ_TAG)
 		o = parse_object(((struct tag *)o)->tagged->sha1);
 	if (!o && warn) {
 		if (!warnlen)
@@ -21,12 +21,12 @@ struct tag *lookup_tag(const unsigned ch
         if (!obj) {
                 struct tag *ret = alloc_tag_node();
                 created_object(sha1, &ret->object);
-                ret->object.type = TYPE_TAG;
+                ret->object.type = OBJ_TAG;
                 return ret;
         }
 	if (!obj->type)
-		obj->type = TYPE_TAG;
-        if (obj->type != TYPE_TAG) {
+		obj->type = OBJ_TAG;
+        if (obj->type != OBJ_TAG) {
                 error("Object %s is a %s, not a tree",
                       sha1_to_hex(sha1), typename(obj->type));
                 return NULL;
diff --git a/tree.c b/tree.c
index 1023655..a6032e3 100644
--- a/tree.c
+++ b/tree.c
@@ -131,12 +131,12 @@ struct tree *lookup_tree(const unsigned 
 	if (!obj) {
 		struct tree *ret = alloc_tree_node();
 		created_object(sha1, &ret->object);
-		ret->object.type = TYPE_TREE;
+		ret->object.type = OBJ_TREE;
 		return ret;
 	}
 	if (!obj->type)
-		obj->type = TYPE_TREE;
-	if (obj->type != TYPE_TREE) {
+		obj->type = OBJ_TREE;
+	if (obj->type != OBJ_TREE) {
 		error("Object %s is a %s, not a tree",
 		      sha1_to_hex(sha1), typename(obj->type));
 		return NULL;
@@ -216,11 +216,11 @@ struct tree *parse_tree_indirect(const u
 	do {
 		if (!obj)
 			return NULL;
-		if (obj->type == TYPE_TREE)
+		if (obj->type == OBJ_TREE)
 			return (struct tree *) obj;
-		else if (obj->type == TYPE_COMMIT)
+		else if (obj->type == OBJ_COMMIT)
 			obj = &(((struct commit *) obj)->tree->object);
-		else if (obj->type == TYPE_TAG)
+		else if (obj->type == OBJ_TAG)
 			obj = ((struct tag *) obj)->tagged;
 		else
 			return NULL;
diff --git a/upload-pack.c b/upload-pack.c
index b18eb9b..2e820c9 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -326,7 +326,7 @@ static int got_sha1(char *hex, unsigned 
 			o = parse_object(sha1);
 		if (!o)
 			die("oops (%s)", sha1_to_hex(sha1));
-		if (o->type == TYPE_COMMIT) {
+		if (o->type == OBJ_COMMIT) {
 			struct commit_list *parents;
 			if (o->flags & THEY_HAVE)
 				return 0;
@@ -457,7 +457,7 @@ static int send_ref(const char *refname,
 		o->flags |= OUR_REF;
 		nr_our_refs++;
 	}
-	if (o->type == TYPE_TAG) {
+	if (o->type == OBJ_TAG) {
 		o = deref_tag(o, refname, 0);
 		packet_write(1, "%s %s^{}\n", sha1_to_hex(o->sha1), refname);
 	}

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  3:45                   ` Linus Torvalds
@ 2006-07-12  4:31                     ` Linus Torvalds
  2006-07-12  6:35                     ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12  4:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, sf, Git Mailing List



On Tue, 11 Jul 2006, Linus Torvalds wrote:
> 
> Junio: this patch is totally independent from the other patches, and is on 
> top of you current "master". It gets rid of TYPE_xxx in favor of the 
> OBJ_xxx enums, which are moved from pack.h into object.h.
...
>  static inline const char *typename(unsigned int type)
>  {
> -	return type_names[type > TYPE_TAG ? TYPE_BAD : type];
> +	return type_names[type > OBJ_TAG ? OBJ_BAD : type];

That should be "[type > OBJ_BAD ? OBJ_BAD : type]"

Not that any users will care, because the current users of the typename() 
macro are the old TYPE_xxxx users which will never see the OBJ_DELTA case 
anyway. So it's not a big deal, but let's do it right for future users (ie 
if the pack-file things want to start using "typename(type)" and they 
actually _have_ delta descriptors).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  3:45                   ` Linus Torvalds
  2006-07-12  4:31                     ` Linus Torvalds
@ 2006-07-12  6:35                     ` Junio C Hamano
  2006-07-12 16:29                       ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-07-12  6:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> I shouldn't have introduced the new TYPE_xxx macros. I should just have 
> used the same OBJ_xxx macros that we use in the pack-file.
>
> Junio: this patch is totally independent from the other patches, and is on 
> top of you current "master". It gets rid of TYPE_xxx in favor of the 
> OBJ_xxx enums, which are moved from pack.h into object.h.

You seem to have forgotten one file (fetch-pack.c) but that was
trivial.  I'll apply and push it out shortly.

This might collide with a few topic branches but I have rerere
to help me cope with it ;-).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  6:35                     ` Junio C Hamano
@ 2006-07-12 16:29                       ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12 16:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, 11 Jul 2006, Junio C Hamano wrote:
> 
> You seem to have forgotten one file (fetch-pack.c) but that was
> trivial.  I'll apply and push it out shortly.

Actually, it was fetch.c, not fetch-pack.c, and I forgot that due to a 
very real (and totally separate) bug.

The makefile doesn't have any dependencies for "fetch.o" and "rsh.o", so 
when you change the headers, they never get rebuilt.

I think fetch.o and rsh.o should either get added to the library files, or 
we need something like this..

(I didn't check that I caught all the appropriate *.o files, but this 
should be better than what we have now).

Even better would be to make the dependancies automatic. The kernel does 
that really well with some GNU Makefile magic, but it also depends on 
magic gcc command line flags ("-Wp,-MD,$(depfile)")

		Linus

---
diff --git a/Makefile b/Makefile
index e75fb13..854e0af 100644
--- a/Makefile
+++ b/Makefile
@@ -236,6 +236,9 @@ BUILTIN_OBJS = \
 	builtin-cat-file.o builtin-mailsplit.o builtin-stripspace.o \
 	builtin-update-ref.o builtin-fmt-merge-msg.o

+MISC_OBJS = \
+	fetch.o rsh.o http-fetch.o http-push.o
+
 GITLIBS = $(LIB_FILE) $(XDIFF_LIB)
 LIBS = $(GITLIBS) -lz

@@ -615,7 +618,7 @@ git-http-push$X: revision.o http.o http-
 	$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \
 		$(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT)

-$(LIB_OBJS) $(BUILTIN_OBJS): $(LIB_H)
+$(LIB_OBJS) $(BUILTIN_OBJS) $(MISC_OBJS): $(LIB_H)
 $(patsubst git-%$X,%.o,$(PROGRAMS)): $(LIB_H) $(wildcard */*.h)
 $(DIFF_OBJS): diffcore.h

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  0:03               ` Linus Torvalds
  2006-07-12  0:39                 ` Johannes Schindelin
@ 2006-07-12  0:46                 ` Junio C Hamano
  2006-07-12  3:42                   ` Linus Torvalds
  2006-07-12  6:49                 ` Peter Baumann
  2 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-07-12  0:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: sf, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> It's actually really part of the specs, and not just happenstance.

> Well, I normally would agree with you if it was a "oh, all our zlib 
> objects seem to start with 0x78" thing, but after having dug into both the 
> zlib standard (which is actually an RFC, not just some random thing), and 
> looked at the sources, it's definitely the case that the "0x78" byte isn't 
> just an implementation detail.

Ok, I do not think we would worry about casting use of deflate +
32k windowsize in stone that much, and being able to check the
size and type without inflating certainly is attractive.
Validating FCHECK bits is surely a nice touch.  Thanks.

> Anyway, I think this following patch replaces the old 2/3 and 3/3 (it 
> still depends on the original [1/3] cleanup.
>
> (It also renames and reverses the meaning of the config file option: it's 
> now "[core] LegacyHeaders = true" for using legacy headers.)
>
> Not heavily tested, but seems ok.

I'd queue it in "pu" with reversed default and then move it to
"next" later.

>  static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size)
>  {
>  	int bytes = strlen(buffer) + 1;
>  	unsigned char *buf = xmalloc(1+size);
> +	unsigned long n;
>  
> -	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
> -	bytes = stream->total_out - bytes;
> +	n = stream->total_out - bytes;
> +	if (n > size)
> +		n = size;
> +	memcpy(buf, (char *) buffer + bytes, n);
> +	bytes = n;
>  	if (bytes < size) {
>  		stream->next_out = buf + bytes;
>  		stream->avail_out = size - bytes;

This one looks like an independent fix for a well spotted bug.

>  int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
>  {
>  	int size;
> @@ -1459,7 +1550,7 @@ int write_sha1_file(void *buf, unsigned 
>  	/* Set it up */
>  	memset(&stream, 0, sizeof(stream));
>  	deflateInit(&stream, zlib_compression_level);
> -	size = deflateBound(&stream, len+hdrlen);
> +	size = 8 + deflateBound(&stream, len+hdrlen);
>  	compressed = xmalloc(size);
>  
>  	/* Compress it */

I am wondring what this eight is.  You would pack 7 7-bit length
plus 4-bit totalling 49+4 = 53-bit length (plus 4-bit type).  Is
it an unwritten decision that the format would not deal with
objects larger than 2^53 (which is probably fine but looks
magic)?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  0:46                 ` Junio C Hamano
@ 2006-07-12  3:42                   ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12  3:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: sf, Johannes Schindelin, Git Mailing List



On Tue, 11 Jul 2006, Junio C Hamano wrote:
> >  	unsigned char *buf = xmalloc(1+size);
> > +	unsigned long n;
> >  
> > -	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
> > -	bytes = stream->total_out - bytes;
> > +	n = stream->total_out - bytes;
> > +	if (n > size)
> > +		n = size;
> > +	memcpy(buf, (char *) buffer + bytes, n);
> > +	bytes = n;
> >  	if (bytes < size) {
> >  		stream->next_out = buf + bytes;
> >  		stream->avail_out = size - bytes;
> 
> This one looks like an independent fix for a well spotted bug.

Yeah, well, the "bug" only happens if you screw something up (which I 
triggered both times I tried to rewrite this ;)

Or possibly it the object is corrupt.

But yes, it's independent.

> > -	size = deflateBound(&stream, len+hdrlen);
> > +	size = 8 + deflateBound(&stream, len+hdrlen);
> >  	compressed = xmalloc(size);
> >  
> >  	/* Compress it */
> 
> I am wondring what this eight is.  You would pack 7 7-bit length
> plus 4-bit totalling 49+4 = 53-bit length (plus 4-bit type).  Is
> it an unwritten decision that the format would not deal with
> objects larger than 2^53 (which is probably fine but looks
> magic)?

8 was just "obviously enough".

The "hdrlen" part should already give us _way_ more padding than we need 
(the old-fashioned header will deflate to something bigger than the new 
header, so deflateBound() even _without_ the extra space should be 
plenty).

But I decided to add a few bytes just because it won't hurt.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  0:03               ` Linus Torvalds
  2006-07-12  0:39                 ` Johannes Schindelin
  2006-07-12  0:46                 ` Junio C Hamano
@ 2006-07-12  6:49                 ` Peter Baumann
  2006-07-12  7:16                   ` Junio C Hamano
  2006-07-12 15:13                   ` Linus Torvalds
  2 siblings, 2 replies; 37+ messages in thread
From: Peter Baumann @ 2006-07-12  6:49 UTC (permalink / raw)
  To: git

On 2006-07-12, Linus Torvalds <torvalds@osdl.org> wrote:
[...]
> Anyway, I think this following patch replaces the old 2/3 and 3/3 (it 
> still depends on the original [1/3] cleanup.
>
> (It also renames and reverses the meaning of the config file option: it's 
> now "[core] LegacyHeaders = true" for using legacy headers.)
>
> Not heavily tested, but seems ok.
>
> sf? Dscho? Can you check this thing out?
>
> 		Linus
> ----
[...]
> diff --git a/sha1_file.c b/sha1_file.c
> index 8734d50..475b23d 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -684,26 +684,74 @@ static void *map_sha1_file_internal(cons
>  	return map;
>  }
>  
> -static int unpack_sha1_header(z_stream *stream, void *map, unsigned long mapsize, void *buffer, unsigned long size)
> +static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
>  {
> +	unsigned char c;
> +	unsigned int word, bits;
> +	unsigned long size;
> +	static const char *typename[8] = {
> +		NULL,	/* OBJ_EXT */
> +		"commit", "tree", "blob", "tag",
> +		NULL, NULL, NULL
> +	};
> +	const char *type;
> +
>  	/* Get the data stream */
>  	memset(stream, 0, sizeof(*stream));
>  	stream->next_in = map;
>  	stream->avail_in = mapsize;
>  	stream->next_out = buffer;
> -	stream->avail_out = size;
> +	stream->avail_out = bufsiz;
> +
> +	/*
> +	 * Is it a zlib-compressed buffer? If so, the first byte
> +	 * must be 0x78 (15-bit window size, deflated), and the
> +	 * first 16-bit word is evenly divisible by 31
> +	 */
> +	word = (map[0] << 8) + map[1];
> +	if (map[0] == 0x78 && !(word % 31)) {
> +		inflateInit(stream);
> +		return inflate(stream, 0);
> +	}
> +
> +	c = *map++;
> +	mapsize--;
> +	type = typename[(c >> 4) & 7];
> +	if (!type)
> +		return -1;
> +
> +	bits = 4;
> +	size = c & 0xf;
> +	while (!(c & 0x80)) {
> +		if (bits >= 8*sizeof(long))
> +			return -1;
> +		c = *map++;
> +		size += (c & 0x7f) << bits;
> +		bits += 7;
> +		mapsize--;
> +	}

This doesn't match the logic used in unpack_object_header, which is used
in the packs:

static unsigned long unpack_object_header(struct packed_git *p, unsigned long offset,
        enum object_type *type, unsigned long *sizep)
{
	unsigned shift;
	unsigned char *pack, c;
	unsigned long size;

	if (offset >= p->pack_size)
		die("object offset outside of pack file");

	pack =  (unsigned char *) p->pack_base + offset;
	c = *pack++;
	offset++;
	*type = (c >> 4) & 7;
	size = c & 15;
	shift = 4;
	while (c & 0x80) {			<==========
		if (offset >= p->pack_size)
	        	die("object offset outside of pack file");
		c = *pack++;
		offset++;
		size += (c & 0x7f) << shift;
		shift += 7;
	}
	*sizep = size;				<==========
	return offset;
}

> @@ -1414,6 +1462,49 @@ static int write_buffer(int fd, const vo
>  	return 0;
>  }
>  
> +static int write_binary_header(unsigned char *hdr, enum object_type type, unsigned long len)
> +{
> +	int hdr_len;
> +	unsigned char c;
> +
> +	c = (type << 4) | (len & 15);
> +	len >>= 4;
> +	hdr_len = 1;
> +	while (len) {
> +		*hdr++ = c;
> +		hdr_len++;
> +		c = (len & 0x7f);
> +		len >>= 7;
> +	}
> +	*hdr = c | 0x80;
> +	return hdr_len;
> +}
> +

Dito, but in this case see pack-objects.c

/*
 * The per-object header is a pretty dense thing, which is
 *  - first byte: low four bits are "size", then three bits of "type",
 *    and the high bit is "size continues".
 *  - each byte afterwards: low seven bits are size continuation,
 *    with the high bit being "size continues"
 */
static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
{
        int n = 1;
        unsigned char c;

        if (type < OBJ_COMMIT || type > OBJ_DELTA)
                die("bad type %d", type);

        c = (type << 4) | (size & 15);
        size >>= 4;
        while (size) {
                *hdr++ = c | 0x80;	<=======
                c = size & 0x7f;
                size >>= 7;
                n++;
        }
        *hdr = c;			<=======
        return n;
}



-Peter Baumann

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  6:49                 ` Peter Baumann
@ 2006-07-12  7:16                   ` Junio C Hamano
  2006-07-12  8:28                     ` Peter Baumann
  2006-07-12 15:13                   ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-07-12  7:16 UTC (permalink / raw)
  To: Peter Baumann; +Cc: git, Linus Torvalds

Peter Baumann <Peter.B.Baumann@stud.informatik.uni-erlangen.de>
writes:

>> +	bits = 4;
>> +	size = c & 0xf;
>> +	while (!(c & 0x80)) {
>> +		if (bits >= 8*sizeof(long))
>> +			return -1;
>> +		c = *map++;
>> +		size += (c & 0x7f) << bits;
>> +		bits += 7;
>> +		mapsize--;
>> +	}
>
> This doesn't match the logic used in unpack_object_header, which is used
> in the packs:
> ...
>> +	c = (type << 4) | (len & 15);
>> +	len >>= 4;
>> +	hdr_len = 1;
>> +	while (len) {
>> +		*hdr++ = c;
>> +		hdr_len++;
>> +		c = (len & 0x7f);
>> +		len >>= 7;
>> +	}
>> +	*hdr = c | 0x80;
>> +	return hdr_len;
>> +}
>> +
>
> Dito, but in this case see pack-objects.c

Well, while these are not strictly needed to match, there is no
good reason to make them inconsistent.  Very well spotted.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  7:16                   ` Junio C Hamano
@ 2006-07-12  8:28                     ` Peter Baumann
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Baumann @ 2006-07-12  8:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Peter Baumann, git, Linus Torvalds

On Wed, Jul 12, 2006 at 12:16:12AM -0700, Junio C Hamano wrote:
> >> +	bits = 4;
> >> +	size = c & 0xf;
> >> +	while (!(c & 0x80)) {
> >> +		if (bits >= 8*sizeof(long))
> >> +			return -1;
> >> +		c = *map++;
> >> +		size += (c & 0x7f) << bits;
> >> +		bits += 7;
> >> +		mapsize--;
> >> +	}
> >
> > This doesn't match the logic used in unpack_object_header, which is used
> > in the packs:
> > ...
> >> +	c = (type << 4) | (len & 15);
> >> +	len >>= 4;
> >> +	hdr_len = 1;
> >> +	while (len) {
> >> +		*hdr++ = c;
> >> +		hdr_len++;
> >> +		c = (len & 0x7f);
> >> +		len >>= 7;
> >> +	}
> >> +	*hdr = c | 0x80;
> >> +	return hdr_len;
> >> +}
> >> +
> >
> > Dito, but in this case see pack-objects.c
> 
> Well, while these are not strictly needed to match, there is no
> good reason to make them inconsistent.  Very well spotted.
> 

During packing one could simply copy the existing object into the generated
pack in the non delta case, so I think this _is_ necessary/usefull.

-Peter Baumann

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12  6:49                 ` Peter Baumann
  2006-07-12  7:16                   ` Junio C Hamano
@ 2006-07-12 15:13                   ` Linus Torvalds
  2006-07-12 15:27                     ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2006-07-12 15:13 UTC (permalink / raw)
  To: Peter Baumann; +Cc: Junio C Hamano, Git Mailing List

On Wed, 12 Jul 2006, Peter Baumann wrote:
> 
> This doesn't match the logic used in unpack_object_header, which is used
> in the packs:

Yeah, good point. I reversed the meaning of the high bit by mistake. In 
pack-files, the high bit is a "there is more to come" bit, and in my new 
code it was a "this is the last byte" bit.

It doesn't matter for the ultimate reason for this (being able to re-use 
the actual _bulk_ of the data), but having it be different would be stupid 
and confusing, and means that we can't later try to use the same routines 
for packing/unpacking the objects.

Junio - do you want me to send an updated patch, or do you want to reverse 
bit#7 yourself?

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format"
  2006-07-12 15:13                   ` Linus Torvalds
@ 2006-07-12 15:27                     ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2006-07-12 15:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Junio - do you want me to send an updated patch, or do you want to reverse 
> bit#7 yourself?

Pushed out in "pu" already, thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 3/3] Enable the new binary header format for unpacked objects
  2006-07-11 17:09     ` Linus Torvalds
  2006-07-11 17:10       ` [PATCH 1/3] Make the unpacked object header functions static to sha1_file.c Linus Torvalds
  2006-07-11 17:12       ` [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format" Linus Torvalds
@ 2006-07-11 17:16       ` Linus Torvalds
  2 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-07-11 17:16 UTC (permalink / raw)
  To: Carl Baldwin, Junio C Hamano; +Cc: Git Mailing List


Enable the new binary header format for unpacked objects

This makes unpacked objects use the same header encoding as the packed
objects do, which should eventually allow us to be able to re-use the
object data directly when creating pack-files (rather than having to
decode and re-encode the data when inserting it in a pack).

It's enabled by default, but can be disabled with

	[core]
		BinaryHeaders = false

in the config file.

We can read both formats, of course, so you can have a mixed archive.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

This should _not_ be applied to the main git sources, at least not
with the default being "use_binary_headers = 1".

But if you change that initial assignment to 0, this should be reasonably 
good.

Not extensively tested, of course. It fails t9102-git-svn-deep-rmdir.sh 
for me for some reason, I didn't really look at it yet, since this whole 
thing is more for Carl Baldwin to play with right now.

 Documentation/config.txt |    5 ++++
 cache.h                  |    1 +
 config.c                 |    5 ++++
 environment.c            |    1 +
 sha1_file.c              |   65 ++++++++++++++++++++++++++++++++++++++++------
 5 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0b434c1..bc95416 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -97,6 +97,11 @@ core.compression::
 	compression, and 1..9 are various speed/size tradeoffs, 9 being
 	slowest.
 
+core.binaryheaders::
+	A boolean, that if set to false, disables the use of the
+	new-style binary objects headers that share the same format with
+	the headers in a pack file.
+
 alias.*::
 	Command aliases for the gitlink:git[1] command wrapper - e.g.
 	after defining "alias.last = cat-file commit HEAD", the invocation
diff --git a/cache.h b/cache.h
index d433d46..756d89f 100644
--- a/cache.h
+++ b/cache.h
@@ -176,6 +176,7 @@ extern int commit_lock_file(struct lock_
 extern void rollback_lock_file(struct lock_file *);
 
 /* Environment bits from configuration mechanism */
+extern int use_binary_headers;
 extern int trust_executable_bit;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
diff --git a/config.c b/config.c
index 8445f7d..2497447 100644
--- a/config.c
+++ b/config.c
@@ -289,6 +289,11 @@ int git_default_config(const char *var, 
 		return 0;
 	}
 
+	if (!strcmp(var, "core.binaryheaders")) {
+		use_binary_headers = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "user.name")) {
 		strlcpy(git_default_name, value, sizeof(git_default_name));
 		return 0;
diff --git a/environment.c b/environment.c
index 43823ff..340214d 100644
--- a/environment.c
+++ b/environment.c
@@ -11,6 +11,7 @@ #include "cache.h"
 
 char git_default_email[MAX_GITNAME];
 char git_default_name[MAX_GITNAME];
+int use_binary_headers = 1;
 int trust_executable_bit = 1;
 int assume_unchanged = 0;
 int prefer_symlink_refs = 0;
diff --git a/sha1_file.c b/sha1_file.c
index ca5f0c0..700f455 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -699,11 +699,14 @@ static int unpack_sha1_header(z_stream *
 
 static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size, unsigned int hdrlen)
 {
-	int bytes = hdrlen;
+	unsigned long bytes = hdrlen, n;
 	unsigned char *buf = xmalloc(1+size);
 
-	memcpy(buf, (char *) buffer + bytes, stream->total_out - bytes);
-	bytes = stream->total_out - bytes;
+	n = stream->total_out - bytes;
+	if (n > size)
+		n = size;
+	memcpy(buf, (char *) buffer + bytes, n);
+	bytes = n;
 	if (bytes < size) {
 		stream->next_out = buf + bytes;
 		stream->avail_out = size - bytes;
@@ -1240,7 +1243,7 @@ struct packed_git *find_sha1_pack(const 
 
 int sha1_object_info(const unsigned char *sha1, char *type, unsigned long *sizep)
 {
-	int status, hdrlen;
+	int status;
 	unsigned long mapsize, size;
 	void *map;
 	z_stream stream;
@@ -1349,24 +1352,70 @@ void *read_object_with_reference(const u
 	}
 }
 
+static int write_binary_header(unsigned char *hdr, enum object_type type, unsigned long len)
+{
+	int hdr_len;
+	unsigned char c;
+
+	c = (type << 4) | (len & 15);
+	len >>= 4;
+	hdr_len = 1;
+	while (len) {
+		*hdr++ = c;
+		hdr_len++;
+		c = (len & 0x7f);
+		len >>= 7;
+	}
+	*hdr = c | 0x80;
+	return hdr_len;
+}
+
+static int generate_binary_header(unsigned char *hdr, const char *type, unsigned long len)
+{
+	int obj_type;
+
+	if (!strcmp(type, blob_type))
+		obj_type = OBJ_BLOB;
+	else if (!strcmp(type, tree_type))
+		obj_type = OBJ_TREE;
+	else if (!strcmp(type, commit_type))
+		obj_type = OBJ_COMMIT;
+	else if (!strcmp(type, tag_type))
+		obj_type = OBJ_TAG;
+	else
+		die("trying to generate bogus object of type '%s'", type);
+	return write_binary_header(hdr, obj_type, len);
+}
+
 char *write_sha1_file_prepare(void *buf,
 			      unsigned long len,
 			      const char *type,
 			      unsigned char *sha1,
 			      unsigned char *hdr,
-			      int *hdrlen)
+			      int *hdrlenp)
 {
 	SHA_CTX c;
+	int hdr_len;
 
-	/* Generate the header */
-	*hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1;
+	/*
+	 * Generate the header.
+	 *
+	 * NOTE! Regardless of whether we end up using the ASCII
+	 * or binary header, we always generate the SHA1 of the
+	 * file as if we had the ASCII header.
+	 */
+	hdr_len = sprintf((char *)hdr, "%s %lu", type, len)+1;
 
 	/* Sha1.. */
 	SHA1_Init(&c);
-	SHA1_Update(&c, hdr, *hdrlen);
+	SHA1_Update(&c, hdr, hdr_len);
 	SHA1_Update(&c, buf, len);
 	SHA1_Final(sha1, &c);
 
+	if (use_binary_headers)
+		hdr_len = generate_binary_header(hdr, type, len);
+	*hdrlenp = hdr_len;
+
 	return sha1_file_name(sha1);
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2006-07-12 16:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-10 23:01 Revisiting large binary files issue Carl Baldwin
2006-07-10 23:14 ` Junio C Hamano
2006-07-11  6:20   ` Peter Baumann
2006-07-10 23:28 ` Linus Torvalds
2006-07-11  9:40   ` [RFC]: Pack-file object format for individual objects (Was: Revisiting large binary files issue.) sf
2006-07-11 18:00     ` Linus Torvalds
2006-07-11 21:45       ` sf
2006-07-11 22:17         ` Linus Torvalds
2006-07-11 22:26           ` Linus Torvalds
2006-07-11 14:55   ` Revisiting large binary files issue Carl Baldwin
2006-07-11 17:09     ` Linus Torvalds
2006-07-11 17:10       ` [PATCH 1/3] Make the unpacked object header functions static to sha1_file.c Linus Torvalds
2006-07-11 17:12       ` [PATCH 2/3] sha1_file: add the ability to parse objects in "pack file format" Linus Torvalds
2006-07-11 18:40         ` Johannes Schindelin
2006-07-11 18:58           ` Linus Torvalds
2006-07-11 19:20             ` Johannes Schindelin
2006-07-11 19:48               ` Linus Torvalds
2006-07-11 21:25                 ` Johannes Schindelin
2006-07-11 21:47                 ` Junio C Hamano
2006-07-11 21:24         ` sf
2006-07-11 22:09           ` Linus Torvalds
2006-07-11 22:25             ` sf
2006-07-11 23:03             ` Junio C Hamano
2006-07-12  0:03               ` Linus Torvalds
2006-07-12  0:39                 ` Johannes Schindelin
2006-07-12  3:45                   ` Linus Torvalds
2006-07-12  4:31                     ` Linus Torvalds
2006-07-12  6:35                     ` Junio C Hamano
2006-07-12 16:29                       ` Linus Torvalds
2006-07-12  0:46                 ` Junio C Hamano
2006-07-12  3:42                   ` Linus Torvalds
2006-07-12  6:49                 ` Peter Baumann
2006-07-12  7:16                   ` Junio C Hamano
2006-07-12  8:28                     ` Peter Baumann
2006-07-12 15:13                   ` Linus Torvalds
2006-07-12 15:27                     ` Junio C Hamano
2006-07-11 17:16       ` [PATCH 3/3] Enable the new binary header format for unpacked objects Linus Torvalds

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).