two questions about the format of loose object

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* two questions about the format of loose object
@ 2008-12-01  8:00 Liu Yubao
  2008-12-01  8:25 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-01  8:00 UTC (permalink / raw
  To: git list

Hi,

In current implementation the loose objects are compressed:

     loose object = deflate(typename + <space> + size + '\0' + data)

In sha1_file.c:unpack_sha1_file():
	1) unpack_sha1_header() inflates first 8KB
        2) parse_sha1_header() gets object's size
        3) unpack_sha1_reset() allocates a (1+size) bytes buffer and
           copy the first 8KB without header to it.

* Question 1:

Why not use the format below for loose object?
    loose object = typename + <space> + size + '\0' + deflate(data)

So the size of loose object can be known before inflating it, in
step 3 above the 8KB memcpy isn't required.

In general, deflate() can decrease file size by 70% for text file, 
I checked the git source and linux-2.6 source and got the statistical
data below:

.------------------+--------------+--------.
|                  | <= (8/0.3)KB | <= 8KB |
|------------------+--------------+--------|
| git-1.6.03       |          97% |    84% |
| linux-2.6.27-rc6 |          90% |    66% |
`------------------+--------------+--------'


* Question 2:

Why not use uncompressed loose object? That's to say:
   loose object = typename + <space> + size + '\0' + data

I did a simple benchmark on my notebook and a server in my company,
writing a big file to disk is faster than compressing it first and
writing the result out. The former's performance for reading should
also be better because of file cache.

The current implementation caches objects in one process, the objects
can't be shared by many processes because they are uncompressed
to heap memory area of each process.

Uncompressed loose objects are better for sharing objects among
multiple git processes because they can be used directly after being
mmap-ed.

And I guess the most frequently used objects are loose objects
when you do some coding(git add, git diff, git diff --cached, git merge),
using uncompressed loose objects avoids uncompressing loose objects again
and again.


Below is the result of my simple benchmark:

########################################
# on my notebook
$ perl b.pl git-1.5.6/Makefile 1000
               Rate   compressed uncompressed
compressed    198/s           --         -92%
uncompressed 2463/s        1147%           --


$ perl b.pl git-1.5.6/parse-options.c 2000
               Rate   compressed uncompressed
compressed    341/s           --         -88%
uncompressed 2845/s         734%           --


$ find git-1.5.6/ -name "*.[ch]" -exec cat {} + > all.c
$ perl b.pl all.c 1000
               Rate   compressed uncompressed
compressed   3.39/s           --         -97%
uncompressed  111/s        3182%           --

#######################################
# on a server
$ perl b.pl Makefile 6000
            (warning: too few iterations for a reliable count)
                Rate   compressed uncompressed
compressed     447/s           --         -98%
uncompressed 18750/s        4094%           --

$ perl b.pl parse-options.c 8000
            (warning: too few iterations for a reliable count)
                Rate   compressed uncompressed
compressed    1130/s           --         -97%
uncompressed 33333/s        2850%           --

$ perl b.pl all.c 1000
               Rate   compressed uncompressed
compressed   5.48/s           --         -95%
uncompressed  115/s        1997%           

#####################################################
# b.pl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:hireswallclock cmpthese);
use File::Slurp;
use IO::Compress::Deflate qw(deflate $DeflateError);

my $text = read_file($ARGV[0], binmode => ':raw');

cmpthese($ARGV[1], {'compressed' => \&zip, 'uncompressed' => \&output});

sub zip {
    deflate \$text => 'all.c.z' || die "$!\n";
}

sub output {
    write_file("all2.c", {binmode => ':raw'}, $text);
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  8:00 two questions about the format of loose object Liu Yubao
@ 2008-12-01  8:25 ` Junio C Hamano
  2008-12-01  9:28   ` Liu Yubao
                     ` (7 more replies)
  2008-12-01 12:16 ` two questions about the format of " Nick Andrew
  2008-12-01 15:32 ` Shawn O. Pearce
  2 siblings, 8 replies; 27+ messages in thread
From: Junio C Hamano @ 2008-12-01  8:25 UTC (permalink / raw
  To: Liu Yubao; +Cc: git list

Liu Yubao <yubao.liu@gmail.com> writes:

> In current implementation the loose objects are compressed:
>
>      loose object = deflate(typename + <space> + size + '\0' + data)
>
> In sha1_file.c:unpack_sha1_file():
> 	1) unpack_sha1_header() inflates first 8KB
>         2) parse_sha1_header() gets object's size
>         3) unpack_sha1_reset() allocates a (1+size) bytes buffer and
>            copy the first 8KB without header to it.
>
> * Question 1:
> Why not ...
> * Question 2:
> Why not ...

A hint for understanding why loose objects are compressed is that
packfiles were invented much later in the history of git.

These are both good questions, and it might have made a difference if they
were posed in early April 2005.

At this point, the plain and clear answer to both of these "Why not"
questions is "because that is the way it is and it is costly to change
them now in thousands of repositories people use every day."

In other words, it is not interesting anymore to raise these questions
now, especially as a suggestion to change the system, unless they are
accompanied by arguments that convinces everybody that the cost of such a
change outweighs the benefits, and a clear transition plans how to upgrade
everybody's existing repositories without any pain.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  8:25 ` Junio C Hamano
@ 2008-12-01  9:28   ` Liu Yubao
  2008-12-01 11:32     ` Jakub Narebski
  2008-12-01 15:21     ` Shawn O. Pearce
  2008-12-02  1:48   ` [PATCH 0/5] support reading and writing uncompressed " Liu Yubao
                     ` (6 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-01  9:28 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list

Junio C Hamano wrote:
> Liu Yubao <yubao.liu@gmail.com> writes:
> 
> 
> A hint for understanding why loose objects are compressed is that
> packfiles were invented much later in the history of git.
> 
> These are both good questions, and it might have made a difference if they
> were posed in early April 2005.
> 
> At this point, the plain and clear answer to both of these "Why not"
> questions is "because that is the way it is and it is costly to change
> them now in thousands of repositories people use every day."
> 
> In other words, it is not interesting anymore to raise these questions
> now, especially as a suggestion to change the system, unless they are
> accompanied by arguments that convinces everybody that the cost of such a
> change outweighs the benefits, and a clear transition plans how to upgrade
> everybody's existing repositories without any pain.
> 

Thanks for your explanation, but I doubt if it's too costly to change the
format of loose object, after all this doesn't change the format of pack
file and affect git-pull/fetch of old git client. 

I ask the "why not" questions because I doubt if I miss some technical points
that the change isn't worth at all in fact.

If no severe technical problem will occur, I think it's worth breaking
*forward* compatibility for better performance and I'm willing to implement
it.

Some cons and pros.

cons:

* old git client can't read loose objects in new format
  (People degrade git rarely and old git can read pack files
   generated by new git, so it's not a big problem)

pros:

* avoid compressing and uncompressing loose objects that are likely
  frequently used when you are coding/merging
* share loose objects among multipe git processes
* the new code path is simpler although we will have more code paths for
  compatibility


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  9:28   ` Liu Yubao
@ 2008-12-01 11:32     ` Jakub Narebski
  2008-12-02  2:19       ` Liu Yubao
  2008-12-01 15:21     ` Shawn O. Pearce
  1 sibling, 1 reply; 27+ messages in thread
From: Jakub Narebski @ 2008-12-01 11:32 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> writes:

> Junio C Hamano wrote:
> > Liu Yubao <yubao.liu@gmail.com> writes:
> > 
> > A hint for understanding why loose objects are compressed is that
> > packfiles were invented much later in the history of git.
> > 
> > These are both good questions, and it might have made a difference if they
> > were posed in early April 2005.
> > 
> > At this point, the plain and clear answer to both of these "Why not"
> > questions is "because that is the way it is and it is costly to change
> > them now in thousands of repositories people use every day."
[...] 
> 
> Thanks for your explanation, but I doubt if it's too costly to change the
> format of loose object, after all this doesn't change the format of pack
> file and affect git-pull/fetch of old git client. 
[...]

> cons:
> 
> * old git client can't read loose objects in new format
>   (People degrade git rarely and old git can read pack files
>    generated by new git, so it's not a big problem)

You forgot about "dumb" protocols, namely HTTP and (deprecated) rsync
(and IIRC also FTP), which doesn't generate packfiles, and would get
loose object in format intelligible for old clients.

IIRC this was main reason why core.legacyHeaders = false was abandoned.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  8:00 two questions about the format of loose object Liu Yubao
  2008-12-01  8:25 ` Junio C Hamano
@ 2008-12-01 12:16 ` Nick Andrew
  2008-12-02  2:26   ` Liu Yubao
  2008-12-01 15:32 ` Shawn O. Pearce
  2 siblings, 1 reply; 27+ messages in thread
From: Nick Andrew @ 2008-12-01 12:16 UTC (permalink / raw
  To: Liu Yubao; +Cc: git list

On Mon, Dec 01, 2008 at 04:00:55PM +0800, Liu Yubao wrote:
> I did a simple benchmark on my notebook and a server in my company,
> writing a big file to disk is faster than compressing it first and
> writing the result out. The former's performance for reading should
> also be better because of file cache.

In a corporate environment (and not related to git) I found the
opposite. The disk was fairly slow (over NFS) and it was in fact
quicker to read and write compressed files.

Nick.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  9:28   ` Liu Yubao
  2008-12-01 11:32     ` Jakub Narebski
@ 2008-12-01 15:21     ` Shawn O. Pearce
  2008-12-02  2:43       ` Liu Yubao
  1 sibling, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-01 15:21 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> Thanks for your explanation, but I doubt if it's too costly to change the
> format of loose object, after all this doesn't change the format of pack
> file and affect git-pull/fetch of old git client. 

It is too costly; Jakub pointed out the dumb protocol clients
would have issues with the new format.  Anyone copying a repository
between machines using scp or a USB memory stick may also run into
a problem.  Etc.

> Some cons and pros.
> 
> cons:
> 
> * old git client can't read loose objects in new format
>   (People degrade git rarely and old git can read pack files
>    generated by new git, so it's not a big problem)

That's a pretty big con.  We can also add slower performance on NFS,
as has been reported already by others.

> pros:
> 
> * avoid compressing and uncompressing loose objects that are likely
>   frequently used when you are coding/merging

True, loose objects are among the more frequently accessed items.

> * share loose objects among multipe git processes

Probably not a huge issue.  How many concurrent git processes are
you running on the same object store at once?  During development?
Its probably not more than 1.  So sharing the objects doesn't make
a very compelling argument.

> * the new code path is simpler although we will have more code paths for
>   compatibility

The new code path is more complex, because although one branch is
very simple (mmap and use) the other code paths have to stay for
backwards compatibility.  Every time you add a branch point the
code gets more complex.  It works well enough now, and is at least
one branch point simpler than what you are proposing.  So I'm not
really interested in seeing the change made.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01  8:00 two questions about the format of loose object Liu Yubao
  2008-12-01  8:25 ` Junio C Hamano
  2008-12-01 12:16 ` two questions about the format of " Nick Andrew
@ 2008-12-01 15:32 ` Shawn O. Pearce
  2008-12-02  3:05   ` Liu Yubao
  2 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-01 15:32 UTC (permalink / raw
  To: Liu Yubao; +Cc: git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> 
> In current implementation the loose objects are compressed:
> 
>      loose object = deflate(typename + <space> + size + '\0' + data)
...
> * Question 1:
> 
> Why not use the format below for loose object?
>     loose object = typename + <space> + size + '\0' + deflate(data)

Historical accident.  We really should have used a format more
like what you are asking here, because it makes inflation easier.
The pack file format uses a header structure sort of like this,
for exactly that reason.  IOW we did learn our mistakes and fix them.

If you look up the new style loose object code you'll see that it
has a format like this (sort of), the header is actually the same
format that is used in the pack files, making it smaller than what
you propose but also easier to unpack as the code can be reused
with the pack reading code.

Unfortunately the new style loose object was phased out; it never
really took off and it made the code much more complex.  So it was
pulled in commit 726f852b0ed7e03e88c419a9996c3815911c9db1:

 Author: Nicolas Pitre <nico@cam.org>:
 >  deprecate the new loose object header format
 >
 >  Now that we encourage and actively preserve objects in a packed form
 >  more agressively than we did at the time the new loose object format and
 >  core.legacyheaders were introduced, that extra loose object format
 >  doesn't appear to be worth it anymore.
 >
 >  Because the packing of loose objects has to go through the delta match
 >  loop anyway, and since most of them should end up being deltified in
 >  most cases, there is really little advantage to have this parallel loose
 >  object format as the CPU savings it might provide is rather lost in the
 >  noise in the end.
 >
 >  This patch gets rid of core.legacyheaders, preserve the legacy format as
 >  the only writable loose object format and deprecate the other one to
 >  keep things simpler.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 0/5] support reading and writing uncompressed loose object
  2008-12-01  8:25 ` Junio C Hamano
  2008-12-01  9:28   ` Liu Yubao
@ 2008-12-02  1:48   ` Liu Yubao
  2008-12-02  1:51   ` [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound Liu Yubao
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  1:48 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list

Hi,

In original implementation, git stores loose object like this:
    loose object = deflate(typename + <space> + size + data)

The patches below add support to read and write uncompressed loose
object:
    loose object = typename + <space> + size + data

The cons and pros to use uncompressed loose object:

cons
    * old git can't read these uncompressed loose objects
      (I think it's not a big problem because old git can read
       pack files generated by new git)

    * uncompressed loose objects occupy more disk space
      (I also think it's not a big problem because loose objects
       aren't too many in general)

pros
    * avoid compressing and uncompressing loose objects that are likely
      frequently used when coding/merging with git add/diff/diff --cached/
      merge/rebase/log.

    * the code to read and write uncompressed loose objects is
      simpler, although there are now more code paths for compatibility.

    * better to share loose objects among multiple git processes because
      sha1 files can be used directly after mmapped. The original git
      uncompresses loose objects into heap memory area so that they
      can't be shared by other processes.
     (NOTICE: The patches below doesn't use mmapped sha1 files directly
      because I find parse_object() requires a buffer terminated with
      zero.)

    * easy to grep objects in .git/objects  (...stupid use case :-)


If these patches are worth being included into upstream branch,
I will add a new config variable core.uncompressedLooseObject.


Explanation to the patches:

1) avoid parse_sha1_header() accessing memory out of bound
  Just for more safety, no inflateInit() to detect errors for
  uncompressed loose objects.

2) don't die immediately when convert an invalid type name
  So we can fall back to compressed loose objects.

3) optimize parse_sha1_header() a little by detecting object type
  To quickly detect whether it seems an uncompressed loose object.

4) support reading uncompressed loose object
  The new feature.

5) support writing uncompressed loose object
  The new feature, need a git-config variable yet.


The patches are generated against git-1.6.1-rc, I have run the test cases
and it seems ok.


 object.c    |   14 +++++++++++++-
 object.h    |    1 +
 sha1_file.c |   58 +++++++++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 59 insertions(+), 14 deletions(-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound
  2008-12-01  8:25 ` Junio C Hamano
  2008-12-01  9:28   ` Liu Yubao
  2008-12-02  1:48   ` [PATCH 0/5] support reading and writing uncompressed " Liu Yubao
@ 2008-12-02  1:51   ` Liu Yubao
  2008-12-02 15:42     ` Shawn O. Pearce
  2008-12-02  1:53   ` [PATCH 2/5] don't die immediately when convert an invalid type name Liu Yubao
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  1:51 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list


Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
---
 sha1_file.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 6c0e251..efe6967 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1245,8 +1245,9 @@ static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-static int parse_sha1_header(const char *hdr, unsigned long *sizep)
+static int parse_sha1_header(const char *hdr, unsigned long length, unsigned long *sizep)
 {
+	const char *hdr_end = hdr + length;
 	char type[10];
 	int i;
 	unsigned long size;
@@ -1254,10 +1255,10 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	/*
 	 * The type can be at most ten bytes (including the
 	 * terminating '\0' that we add), and is followed by
-	 * a space.
+	 * a space, at least one byte for size, and a '\0'.
 	 */
 	i = 0;
-	for (;;) {
+	while (hdr < hdr_end - 2) {
 		char c = *hdr++;
 		if (c == ' ')
 			break;
@@ -1265,6 +1266,8 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
 		if (i >= sizeof(type))
 			return -1;
 	}
+	if (' ' != *(hdr - 1))
+		return -1;
 	type[i] = 0;
 
 	/*
@@ -1275,7 +1278,7 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	if (size > 9)
 		return -1;
 	if (size) {
-		for (;;) {
+		while (hdr < hdr_end - 1) {
 			unsigned long c = *hdr - '0';
 			if (c > 9)
 				break;
@@ -1298,7 +1301,7 @@ static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type
 	char hdr[8192];
 
 	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
+	if (ret < Z_OK || (*type = parse_sha1_header(hdr, stream.total_out, size)) < 0)
 		return NULL;
 
 	return unpack_sha1_rest(&stream, hdr, *size, sha1);
@@ -1982,7 +1985,7 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
 		status = error("unable to unpack %s header",
 			       sha1_to_hex(sha1));
-	else if ((status = parse_sha1_header(hdr, &size)) < 0)
+	else if ((status = parse_sha1_header(hdr, stream.total_out, &size)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
 	else if (sizep)
 		*sizep = size;
-- 
1.6.1.rc1.5.gde86c

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/5] don't die immediately when convert an invalid type name
  2008-12-01  8:25 ` Junio C Hamano
                     ` (2 preceding siblings ...)
  2008-12-02  1:51   ` [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound Liu Yubao
@ 2008-12-02  1:53   ` Liu Yubao
  2008-12-02  1:55   ` [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type Liu Yubao
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  1:53 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list



Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
---
 object.c    |   14 +++++++++++++-
 object.h    |    1 +
 sha1_file.c |    2 +-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/object.c b/object.c
index 50b6528..0a18db6 100644
--- a/object.c
+++ b/object.c
@@ -33,13 +33,25 @@ const char *typename(unsigned int type)
 	return object_type_strings[type];
 }
 
-int type_from_string(const char *str)
+int type_from_string_gently(const char *str)
 {
 	int i;
 
 	for (i = 1; i < ARRAY_SIZE(object_type_strings); i++)
 		if (!strcmp(str, object_type_strings[i]))
 			return i;
+
+	return -1;
+}
+
+int type_from_string(const char *str)
+{
+	int i;
+
+	i = type_from_string_gently(str);
+	if (i > 0)
+		return i;
+
 	die("invalid object type \"%s\"", str);
 }
 
diff --git a/object.h b/object.h
index d962ff1..88baf2b 100644
--- a/object.h
+++ b/object.h
@@ -36,6 +36,7 @@ struct object {
 };
 
 extern const char *typename(unsigned int type);
+extern int type_from_string_gently(const char *str);
 extern int type_from_string(const char *str);
 
 extern unsigned int get_max_object_index(void);
diff --git a/sha1_file.c b/sha1_file.c
index efe6967..dccc455 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1291,7 +1291,7 @@ static int parse_sha1_header(const char *hdr, unsigned long length, unsigned lon
 	/*
 	 * The length must be followed by a zero byte
 	 */
-	return *hdr ? -1 : type_from_string(type);
+	return *hdr ? -1 : type_from_string_gently(type);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
-- 
1.6.1.rc1.5.gde86c

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type
  2008-12-01  8:25 ` Junio C Hamano
                     ` (3 preceding siblings ...)
  2008-12-02  1:53   ` [PATCH 2/5] don't die immediately when convert an invalid type name Liu Yubao
@ 2008-12-02  1:55   ` Liu Yubao
  2008-12-02 15:53     ` Shawn O. Pearce
  2008-12-02  1:56   ` [PATCH 4/5] support reading uncompressed loose object Liu Yubao
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  1:55 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list



Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
---
 sha1_file.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index dccc455..79062f0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1099,7 +1099,8 @@ static void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
 
 		if (!fstat(fd, &st)) {
 			*size = xsize_t(st.st_size);
-			map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
+			if (*size > 0)
+				map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
 		}
 		close(fd);
 	}
@@ -1257,6 +1258,8 @@ static int parse_sha1_header(const char *hdr, unsigned long length, unsigned lon
 	 * terminating '\0' that we add), and is followed by
 	 * a space, at least one byte for size, and a '\0'.
 	 */
+	if ('b' != *hdr && 'c' != *hdr && 't' != *hdr)	/* blob/commit/tag/tree */
+		return -1;
 	i = 0;
 	while (hdr < hdr_end - 2) {
 		char c = *hdr++;
-- 
1.6.1.rc1.5.gde86c

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/5] support reading uncompressed loose object
  2008-12-01  8:25 ` Junio C Hamano
                     ` (4 preceding siblings ...)
  2008-12-02  1:55   ` [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type Liu Yubao
@ 2008-12-02  1:56   ` Liu Yubao
  2008-12-02 15:58     ` Shawn O. Pearce
  2008-12-02  2:03   ` [PATCH 5/5] support writing " Liu Yubao
  2008-12-02  3:11   ` [PATCH 0/5] support reading and " Liu Yubao
  7 siblings, 1 reply; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  1:56 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list



Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
---
 sha1_file.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 79062f0..05a9fa3 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1985,6 +1985,16 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return error("unable to find %s", sha1_to_hex(sha1));
+
+	/*
+	 * Is it an uncompressed loose objects?
+	 */
+	if ((status = parse_sha1_header(map, mapsize, &size)) >= 0) {
+		if (sizep)
+			*sizep = size;
+		goto L_leave;
+	}
+
 	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
 		status = error("unable to unpack %s header",
 			       sha1_to_hex(sha1));
@@ -1993,6 +2003,8 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	else if (sizep)
 		*sizep = size;
 	inflateEnd(&stream);
+
+L_leave:
 	munmap(map, mapsize);
 	return status;
 }
@@ -2124,7 +2136,13 @@ void *read_object(const unsigned char *sha1, enum object_type *type,
 		return buf;
 	map = map_sha1_file(sha1, &mapsize);
 	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
+		/*
+		 * Is it an uncompressed loose object?
+		 */
+		if ((*type = parse_sha1_header(map, mapsize, size)) >= 0)
+			buf = xmemdupz(map + strlen(map) + 1, *size);
+		else
+			buf = unpack_sha1_file(map, mapsize, type, size, sha1);
 		munmap(map, mapsize);
 		return buf;
 	}
-- 
1.6.1.rc1.5.gde86c

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/5] support writing uncompressed loose object
  2008-12-01  8:25 ` Junio C Hamano
                     ` (5 preceding siblings ...)
  2008-12-02  1:56   ` [PATCH 4/5] support reading uncompressed loose object Liu Yubao
@ 2008-12-02  2:03   ` Liu Yubao
  2008-12-02 16:07     ` Shawn O. Pearce
  2008-12-02  3:11   ` [PATCH 0/5] support reading and " Liu Yubao
  7 siblings, 1 reply; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  2:03 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git list



Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
---
 sha1_file.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 05a9fa3..053b564 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2328,7 +2328,7 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
 }
 
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
-			      void *buf, unsigned long len, time_t mtime)
+			      void *buf, unsigned long len, time_t mtime, int dont_deflate)
 {
 	int fd, size, ret;
 	unsigned char *compressed;
@@ -2345,6 +2345,12 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 			return error("unable to create temporary sha1 filename %s: %s\n", tmpfile, strerror(errno));
 	}
 
+	if (dont_deflate) {
+		if (write_buffer(fd, hdr, hdrlen) < 0 || write_buffer(fd, buf, len) < 0)
+			die("unable to write sha1 file");
+		goto L_close_file;
+	}
+
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, zlib_compression_level);
@@ -2376,9 +2382,11 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
-	close_sha1_file(fd);
 	free(compressed);
 
+L_close_file:
+	close_sha1_file(fd);
+
 	if (mtime) {
 		struct utimbuf utb;
 		utb.actime = mtime;
@@ -2405,7 +2413,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 		hashcpy(returnsha1, sha1);
 	if (has_sha1_file(sha1))
 		return 0;
-	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
+	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0, 1);
 }
 
 int force_object_loose(const unsigned char *sha1, time_t mtime)
@@ -2423,7 +2431,7 @@ int force_object_loose(const unsigned char *sha1, time_t mtime)
 	if (!buf)
 		return error("cannot read sha1_file for %s", sha1_to_hex(sha1));
 	hdrlen = sprintf(hdr, "%s %lu", typename(type), len) + 1;
-	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime);
+	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime, 1);
 	free(buf);
 
 	return ret;
-- 
1.6.1.rc1.5.gde86c

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01 11:32     ` Jakub Narebski
@ 2008-12-02  2:19       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  2:19 UTC (permalink / raw
  To: Jakub Narebski; +Cc: Junio C Hamano, git list

Jakub Narebski wrote:
> Liu Yubao <yubao.liu@gmail.com> writes:
> 
>> cons:
>>
>> * old git client can't read loose objects in new format
>>   (People degrade git rarely and old git can read pack files
>>    generated by new git, so it's not a big problem)
> 
> You forgot about "dumb" protocols, namely HTTP and (deprecated) rsync
> (and IIRC also FTP), which doesn't generate packfiles, and would get
> loose object in format intelligible for old clients.
> 
> IIRC this was main reason why core.legacyHeaders = false was abandoned.
> 

The server can keep an old git or set core.uncompressedLooseObject = false.

Even the pack file format can be changed so that forward compatability will
be broken, I think it's unavoidable.

I'm not clear about the story of core.legacyHeaders, I'll dig the mail
list archive.

Thanks for reminding me of the dumb protocols.


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01 12:16 ` two questions about the format of " Nick Andrew
@ 2008-12-02  2:26   ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  2:26 UTC (permalink / raw
  To: Nick Andrew; +Cc: git list

Nick Andrew wrote:
> On Mon, Dec 01, 2008 at 04:00:55PM +0800, Liu Yubao wrote:
>> I did a simple benchmark on my notebook and a server in my company,
>> writing a big file to disk is faster than compressing it first and
>> writing the result out. The former's performance for reading should
>> also be better because of file cache.
> 
> In a corporate environment (and not related to git) I found the
> opposite. The disk was fairly slow (over NFS) and it was in fact
> quicker to read and write compressed files.
> 
> Nick.
> 
Ok, there must be exceptional cases, for these cases you can turn
off core.uncompressedLooseObject (if there will be this config).


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01 15:21     ` Shawn O. Pearce
@ 2008-12-02  2:43       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  2:43 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> Thanks for your explanation, but I doubt if it's too costly to change the
>> format of loose object, after all this doesn't change the format of pack
>> file and affect git-pull/fetch of old git client. 
> 
> It is too costly; Jakub pointed out the dumb protocol clients
> would have issues with the new format.  Anyone copying a repository
> between machines using scp or a USB memory stick may also run into
> a problem.  Etc.
>  

Yes, exceptional case, is it acceptable that core.uncompressedLooseObject
is set to false by default especially for NFS file system?

>> Some cons and pros.
>>
>> cons:
>>
>> * old git client can't read loose objects in new format
>>   (People degrade git rarely and old git can read pack files
>>    generated by new git, so it's not a big problem)
> 
> That's a pretty big con.  We can also add slower performance on NFS,
> as has been reported already by others.
>  

I mean to add a format, not to replace the current format of loose object.

>> pros:
>>
>> * avoid compressing and uncompressing loose objects that are likely
>>   frequently used when you are coding/merging
> 
> True, loose objects are among the more frequently accessed items.
> 
>> * share loose objects among multipe git processes
> 
> Probably not a huge issue.  How many concurrent git processes are
> you running on the same object store at once?  During development?
> Its probably not more than 1.  So sharing the objects doesn't make
> a very compelling argument.
> 

In my company we have a central server to host source code repository
managed by git+ssh. Some collegues also work on the same machine (maybe
not a good practice) and set alternates to the central repository, so
there can be multiple git processes operating same git object database.

In fact we have a wrapper script of git to make git fit our development
process better because git's submodule support isn't good enough. One
command in the wrapper script can execute many git commands in a short
time. 

>> * the new code path is simpler although we will have more code paths for
>>   compatibility
> 
> The new code path is more complex, because although one branch is
> very simple (mmap and use) the other code paths have to stay for
> backwards compatibility.  Every time you add a branch point the
> code gets more complex.  It works well enough now, and is at least
> one branch point simpler than what you are proposing.  So I'm not
> really interested in seeing the change made.
> 

Could you review my patches sent just a moment ago? The key changes are
rather small.


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-01 15:32 ` Shawn O. Pearce
@ 2008-12-02  3:05   ` Liu Yubao
  2008-12-04  0:54     ` Nicolas Pitre
  0 siblings, 1 reply; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  3:05 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> In current implementation the loose objects are compressed:
>>
>>      loose object = deflate(typename + <space> + size + '\0' + data)
> ...
>> * Question 1:
>>
>> Why not use the format below for loose object?
>>     loose object = typename + <space> + size + '\0' + deflate(data)
> 
> Historical accident.  We really should have used a format more
> like what you are asking here, because it makes inflation easier.
> The pack file format uses a header structure sort of like this,
> for exactly that reason.  IOW we did learn our mistakes and fix them.
> 
> If you look up the new style loose object code you'll see that it
> has a format like this (sort of), the header is actually the same
> format that is used in the pack files, making it smaller than what
> you propose but also easier to unpack as the code can be reused
> with the pack reading code.
> 
> Unfortunately the new style loose object was phased out; it never
> really took off and it made the code much more complex.  So it was
> pulled in commit 726f852b0ed7e03e88c419a9996c3815911c9db1:
> 

In fact the format I proposed in my patches is uncompressed loose
object, not uncompressed loose object header, that's to say I
proposed format 2 in my question 2, I am just curious why the
loose object header is compressed in question 1.

I did a test to add all files of git-1.6.1-rc1 with git-add, the
time spent decreased by half. Other commands like git diff,
git diff --cached, git diff HEAD~ HEAD should be faster now
although the change may be not noticable for small and medium project.


>  Author: Nicolas Pitre <nico@cam.org>:
>  >  deprecate the new loose object header format
>  >
>  >  Now that we encourage and actively preserve objects in a packed form
>  >  more agressively than we did at the time the new loose object format and
>  >  core.legacyheaders were introduced, that extra loose object format
>  >  doesn't appear to be worth it anymore.
>  >
>  >  Because the packing of loose objects has to go through the delta match
>  >  loop anyway, and since most of them should end up being deltified in
>  >  most cases, there is really little advantage to have this parallel loose
>  >  object format as the CPU savings it might provide is rather lost in the
>  >  noise in the end.
>  >
>  >  This patch gets rid of core.legacyheaders, preserve the legacy format as
>  >  the only writable loose object format and deprecate the other one to
>  >  keep things simpler.
> 

Thank you for dig it out for me!


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5] support reading and writing uncompressed loose object
  2008-12-01  8:25 ` Junio C Hamano
                     ` (6 preceding siblings ...)
  2008-12-02  2:03   ` [PATCH 5/5] support writing " Liu Yubao
@ 2008-12-02  3:11   ` Liu Yubao
  7 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-02  3:11 UTC (permalink / raw
  To: git list; +Cc: Junio C Hamano

Hi all,

It seems my patch series are not in one mail thread, I'm very sorry
for that, I replied this thread and paste my patches into Thunderbird
with external editor plugin, don't know why they become separate topics.

Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound
  2008-12-02  1:51   ` [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound Liu Yubao
@ 2008-12-02 15:42     ` Shawn O. Pearce
  2008-12-03  3:49       ` Liu Yubao
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-02 15:42 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> diff --git a/sha1_file.c b/sha1_file.c
> index 6c0e251..efe6967 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1254,10 +1255,10 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	/*
>  	 * The type can be at most ten bytes (including the
>  	 * terminating '\0' that we add), and is followed by
> -	 * a space.
> +	 * a space, at least one byte for size, and a '\0'.
>  	 */
>  	i = 0;
> -	for (;;) {
> +	while (hdr < hdr_end - 2) {
>  		char c = *hdr++;
>  		if (c == ' ')
>  			break;
> @@ -1265,6 +1266,8 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  		if (i >= sizeof(type))
>  			return -1;

That first hunk I am citing is unnecessary, because of the lines
right above.  All of the callers of this function pass in a buffer
that is at least 32 bytes in size; this loop aborts if it does not
find a ' ' within the first 10 bytes of the buffer.  We'll never
access memory outside of the buffer during this loop.

So IMHO your first three hunks here aren't necessary.

> @@ -1275,7 +1278,7 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	if (size > 9)
>  		return -1;
>  	if (size) {
> -		for (;;) {
> +		while (hdr < hdr_end - 1) {
>  			unsigned long c = *hdr - '0';
>  			if (c > 9)
>  				break;

OK, there's no promise here that we don't roll off the buffer.

This can be fixed in the caller, ensuring we always have the '\0'
at some point in the initial header buffer we were asked to parse:

diff --git a/sha1_file.c b/sha1_file.c
index 6c0e251..26c6ffb 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1162,9 +1162,10 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	stream->next_in = map;
 	stream->avail_in = mapsize;
 	stream->next_out = buffer;
-	stream->avail_out = bufsiz;
 
 	if (legacy_loose_object(map)) {
+		stream->avail_out = bufsiz - 1;
+		buffer[bufsiz - 1] = '\0';
 		inflateInit(stream);
 		return inflate(stream, 0);
 	}
@@ -1186,6 +1187,7 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	/* Set up the stream for the rest.. */
 	stream->next_in = map;
 	stream->avail_in = mapsize;
+	stream->avail_out = bufsiz;
 	inflateInit(stream);
 
 	/* And generate the fake traditional header */

-- 
Shawn.

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type
  2008-12-02  1:55   ` [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type Liu Yubao
@ 2008-12-02 15:53     ` Shawn O. Pearce
  2008-12-03  4:06       ` Liu Yubao
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-02 15:53 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> diff --git a/sha1_file.c b/sha1_file.c
> index dccc455..79062f0 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1099,7 +1099,8 @@ static void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
>  
>  		if (!fstat(fd, &st)) {
>  			*size = xsize_t(st.st_size);
> -			map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
> +			if (*size > 0)
> +				map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
>  		}
>  		close(fd);
>  	}

This has nothing to do with this change description.  Why are we
returning NULL from map_sha1_file when the file length is 0 bytes?
No loose object should ever be an empty file, there must always be
some sort of type header present.  So it probably is an error to
have a 0 length file here.  But that bug is a different change.

> @@ -1257,6 +1258,8 @@ static int parse_sha1_header(const char *hdr, unsigned long length, unsigned lon
>  	 * terminating '\0' that we add), and is followed by
>  	 * a space, at least one byte for size, and a '\0'.
>  	 */
> +	if ('b' != *hdr && 'c' != *hdr && 't' != *hdr)	/* blob/commit/tag/tree */
> +		return -1;
>  	i = 0;
>  	while (hdr < hdr_end - 2) {
>  		char c = *hdr++;

Oh.  I wouldn't do that.  Its a cute trick and it works to quickly
determine if the header is an uncompressed header vs. a zlib header
vs. a new-style loose object header (which git cannot write anymore,
but it still can read).  But its just asking for trouble when/if a
new object type was ever added to the type table.

Given that we know that no type name can be more than 10 bytes and
if you use my patch from earlier today you can be certain hdr has a
'\0' terminator, so you could write a function to test for the type
against the hdr, stopping on either ' ' or '\0'.  Or find the first
' ' in the first 10 bytes (which is what this loop does anyway) and
then test that against the type name table.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] support reading uncompressed loose object
  2008-12-02  1:56   ` [PATCH 4/5] support reading uncompressed loose object Liu Yubao
@ 2008-12-02 15:58     ` Shawn O. Pearce
  2008-12-03  4:09       ` Liu Yubao
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-02 15:58 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> 
> Signed-off-by: Liu Yubao <yubao.liu@gmail.com>

I'd like to see a bit more of an explanation of the new loose
object format you are reading in the commit message.  We have a
long history of explaining *why* the code behaves the way it does
in our commits, so we can look at it in blame/log and understand
what the heck went on.
 
> ---
>  sha1_file.c |   20 +++++++++++++++++++-
>  1 files changed, 19 insertions(+), 1 deletions(-)
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index 79062f0..05a9fa3 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1985,6 +1985,16 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
>  	map = map_sha1_file(sha1, &mapsize);
>  	if (!map)
>  		return error("unable to find %s", sha1_to_hex(sha1));
> +
> +	/*
> +	 * Is it an uncompressed loose objects?
> +	 */
> +	if ((status = parse_sha1_header(map, mapsize, &size)) >= 0) {
> +		if (sizep)
> +			*sizep = size;
> +		goto L_leave;
> +	}
> +
>  	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
>  		status = error("unable to unpack %s header",
>  			       sha1_to_hex(sha1));
> @@ -1993,6 +2003,8 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
>  	else if (sizep)
>  		*sizep = size;
>  	inflateEnd(&stream);
> +
> +L_leave:
>  	munmap(map, mapsize);
>  	return status;
>  }
> @@ -2124,7 +2136,13 @@ void *read_object(const unsigned char *sha1, enum object_type *type,
>  		return buf;
>  	map = map_sha1_file(sha1, &mapsize);
>  	if (map) {
> -		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
> +		/*
> +		 * Is it an uncompressed loose object?
> +		 */
> +		if ((*type = parse_sha1_header(map, mapsize, size)) >= 0)
> +			buf = xmemdupz(map + strlen(map) + 1, *size);
> +		else
> +			buf = unpack_sha1_file(map, mapsize, type, size, sha1);
>  		munmap(map, mapsize);
>  		return buf;
>  	}

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] support writing uncompressed loose object
  2008-12-02  2:03   ` [PATCH 5/5] support writing " Liu Yubao
@ 2008-12-02 16:07     ` Shawn O. Pearce
  2008-12-03  4:22       ` Liu Yubao
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2008-12-02 16:07 UTC (permalink / raw
  To: Liu Yubao; +Cc: Junio C Hamano, git list

Liu Yubao <yubao.liu@gmail.com> wrote:
> 
> Signed-off-by: Liu Yubao <yubao.liu@gmail.com>

IMHO, this needs more description in the commit message.

> diff --git a/sha1_file.c b/sha1_file.c
> index 05a9fa3..053b564 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2328,7 +2328,7 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
>  }
>  
>  static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
> -			      void *buf, unsigned long len, time_t mtime)
> +			      void *buf, unsigned long len, time_t mtime, int dont_deflate)

Passing this as an argument is pointless.  It should be a repository
wide configuration option in core, so you can declare it a static and
allow git_config to populate it.  Defaulting to 1 (no compression)
like you do elsewhere in the patch isn't good.

I'm still against this file format change.  The series itself isn't
that bad, and the buffer overflow catch in parse_sha1_header()
may be something worthwhile fixing.  But I'm still not sold that
introducing a new loose object format is worth it.

I'd rather use a binary header encoding like the new-style/in-pack
format rather than the older style text headers.  Its faster to
parse for one thing.

Your changes in the reading code cause a copy of the buffer we
mmap()'d.  That sort of ruins your argument that this change is
worthwhile because concurrent processes on the same host can mmap the
same buffer and save memory.  We're still copying the buffer anyway.
I probably should have commented on that in patch 4/5, but I just
realized it, so I'm saying it here.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound
  2008-12-02 15:42     ` Shawn O. Pearce
@ 2008-12-03  3:49       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-03  3:49 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 6c0e251..efe6967 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1254,10 +1255,10 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>>  	/*
>>  	 * The type can be at most ten bytes (including the
>>  	 * terminating '\0' that we add), and is followed by
>> -	 * a space.
>> +	 * a space, at least one byte for size, and a '\0'.
>>  	 */
>>  	i = 0;
>> -	for (;;) {
>> +	while (hdr < hdr_end - 2) {
>>  		char c = *hdr++;
>>  		if (c == ' ')
>>  			break;
>> @@ -1265,6 +1266,8 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>>  		if (i >= sizeof(type))
>>  			return -1;
> 
> That first hunk I am citing is unnecessary, because of the lines
> right above.  All of the callers of this function pass in a buffer
> that is at least 32 bytes in size; this loop aborts if it does not
> find a ' ' within the first 10 bytes of the buffer.  We'll never
> access memory outside of the buffer during this loop.
> 
> So IMHO your first three hunks here aren't necessary.
> 

Seems you missed the cover letter sent as patch 0/5, all patches are explained
in the cover letter, sorry I sent them as separate topics by mistake.

This bound check is mainly for uncompressed loose object, a loose object
that just are uncompressed:

uncompressed loose object = inflate(loose object)
loose object = deflate(typename + <space> + size + '\0' + data)

I'm doing a defensive programming, for uncompressed loose object the mmapped
memory is passed to parse_sha1_header without being checked by inflateInit() first,
so there may be a SIGSEGV crash for a corrupted uncompressed loose object.


>> @@ -1275,7 +1278,7 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
>>  	if (size > 9)
>>  		return -1;
>>  	if (size) {
>> -		for (;;) {
>> +		while (hdr < hdr_end - 1) {
>>  			unsigned long c = *hdr - '0';
>>  			if (c > 9)
>>  				break;
> 
> OK, there's no promise here that we don't roll off the buffer.
> 
> This can be fixed in the caller, ensuring we always have the '\0'
> at some point in the initial header buffer we were asked to parse:
> 
Isn't it easier to solve this problem in one place and maintain it? Maybe someday
someone forgets parse_sha1_header requires a null terminated buffer, and a corrupted
uncompressed loose object even doesn't have to be null terminated (if there will be
this kind of loose object).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type
  2008-12-02 15:53     ` Shawn O. Pearce
@ 2008-12-03  4:06       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-03  4:06 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> diff --git a/sha1_file.c b/sha1_file.c
>> index dccc455..79062f0 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1099,7 +1099,8 @@ static void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
>>  
>>  		if (!fstat(fd, &st)) {
>>  			*size = xsize_t(st.st_size);
>> -			map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
>> +			if (*size > 0)
>> +				map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
>>  		}
>>  		close(fd);
>>  	}
> 
> This has nothing to do with this change description.  Why are we
> returning NULL from map_sha1_file when the file length is 0 bytes?
> No loose object should ever be an empty file, there must always be
> some sort of type header present.  So it probably is an error to
> have a 0 length file here.  But that bug is a different change.
> 

Also a defensive programming for uncompressed loose object because the mapped memory
will be passed to parse_sha1_header() directly without being checked by inflateInit().

In fact unpack_sha1_header() in current code calls legacy_loose_object() without checking
mapsize first. If it encounters (very very unlikely) a corrupted empty loose object, it
will crash.

>> @@ -1257,6 +1258,8 @@ static int parse_sha1_header(const char *hdr, unsigned long length, unsigned lon
>>  	 * terminating '\0' that we add), and is followed by
>>  	 * a space, at least one byte for size, and a '\0'.
>>  	 */
>> +	if ('b' != *hdr && 'c' != *hdr && 't' != *hdr)	/* blob/commit/tag/tree */
>> +		return -1;
>>  	i = 0;
>>  	while (hdr < hdr_end - 2) {
>>  		char c = *hdr++;
> 
> Oh.  I wouldn't do that.  Its a cute trick and it works to quickly
> determine if the header is an uncompressed header vs. a zlib header
> vs. a new-style loose object header (which git cannot write anymore,
> but it still can read).  But its just asking for trouble when/if a
> new object type was ever added to the type table.
> 
I can't agree any more, it's just a trick. I considered adding
a function seems_likely_uncompressed_loose_object(), but I didn't
because this patch series are just my first try, I don't know whether
the idea to support uncompressed loose object is attractive enough.

> Given that we know that no type name can be more than 10 bytes and
> if you use my patch from earlier today you can be certain hdr has a
> '\0' terminator, so you could write a function to test for the type
> against the hdr, stopping on either ' ' or '\0'.  Or find the first
> ' ' in the first 10 bytes (which is what this loop does anyway) and
> then test that against the type name table.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] support reading uncompressed loose object
  2008-12-02 15:58     ` Shawn O. Pearce
@ 2008-12-03  4:09       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-03  4:09 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
> 
> I'd like to see a bit more of an explanation of the new loose
> object format you are reading in the commit message.  We have a
> long history of explaining *why* the code behaves the way it does
> in our commits, so we can look at it in blame/log and understand
> what the heck went on.
>  
I mentioned in the cover letter sent as patch 0/5, indeed I should
have mentioned it in the commit message again.

An uncompressed loose object is just an uncompressed loose object:

loose object = deflate(typename + <space> + size + '\0' + data)
uncompressed object = inflate(loose object)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] support writing uncompressed loose object
  2008-12-02 16:07     ` Shawn O. Pearce
@ 2008-12-03  4:22       ` Liu Yubao
  0 siblings, 0 replies; 27+ messages in thread
From: Liu Yubao @ 2008-12-03  4:22 UTC (permalink / raw
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git list

Shawn O. Pearce wrote:
> Liu Yubao <yubao.liu@gmail.com> wrote:
>> Signed-off-by: Liu Yubao <yubao.liu@gmail.com>
> 
> IMHO, this needs more description in the commit message.
> 
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 05a9fa3..053b564 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -2328,7 +2328,7 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
>>  }
>>  
>>  static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
>> -			      void *buf, unsigned long len, time_t mtime)
>> +			      void *buf, unsigned long len, time_t mtime, int dont_deflate)
> 
> Passing this as an argument is pointless.  It should be a repository
> wide configuration option in core, so you can declare it a static and
> allow git_config to populate it.  Defaulting to 1 (no compression)
> like you do elsewhere in the patch isn't good.
> 
Aha, sorry again, I sent the patch series as separate topics by mistake.

I considered adding a configuration variable, the patch series are sent
just to see whether the idea is worth.

> I'm still against this file format change.  The series itself isn't
> that bad, and the buffer overflow catch in parse_sha1_header()
> may be something worthwhile fixing.  But I'm still not sold that
> introducing a new loose object format is worth it.
> 
> I'd rather use a binary header encoding like the new-style/in-pack
> format rather than the older style text headers.  Its faster to
> parse for one thing.
> 
The key point I suggest is to use *uncompressed* loose object, I didn't
change the format of uncompressed loose object because I don't want
to distract your attention and keep the patches small.

> Your changes in the reading code cause a copy of the buffer we
> mmap()'d.  That sort of ruins your argument that this change is
> worthwhile because concurrent processes on the same host can mmap the
> same buffer and save memory.  We're still copying the buffer anyway.
> I probably should have commented on that in patch 4/5, but I just
> realized it, so I'm saying it here.
> 
Yes, I mentioned it in the cover letter(sigh, sorry!)

I didn't use the mapped buffer directly because other functions required
a null terminated buffer to parse data part of loose object. It can be
fixed but I don't want to make the patches too big.

The two big pros of uncompressed loose object are:

*) avoid compressing and uncompressing loose objects    (I have implemented it)
*) use memory mapped loose object directly              (I havn't implemented it)


Thank you for reviewing my patches, seems the idea to use uncompressed loose
object isn't attractive enough, I will keep the patches locally.


Best regards,

Liu Yubao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: two questions about the format of loose object
  2008-12-02  3:05   ` Liu Yubao
@ 2008-12-04  0:54     ` Nicolas Pitre
  0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2008-12-04  0:54 UTC (permalink / raw
  To: Liu Yubao; +Cc: Shawn O. Pearce, git list

On Tue, 2 Dec 2008, Liu Yubao wrote:

> In fact the format I proposed in my patches is uncompressed loose
> object, not uncompressed loose object header, that's to say I
> proposed format 2 in my question 2, I am just curious why the
> loose object header is compressed in question 1.
> 
> I did a test to add all files of git-1.6.1-rc1 with git-add, the
> time spent decreased by half. Other commands like git diff,
> git diff --cached, git diff HEAD~ HEAD should be faster now
> although the change may be not noticable for small and medium project.

Please try this with an unmodified git version:

	git config --global core.loosecompression 0

and redo your tests please.

One thing that a purely uncompressed loose object format is missing is 
quick data integrity protection. With the above, you'll have all your 
loose objects uncompressed but they'll still have a CRC32 done over 
them.


Nicolas

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-12-04  0:56 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-01  8:00 two questions about the format of loose object Liu Yubao
2008-12-01  8:25 ` Junio C Hamano
2008-12-01  9:28   ` Liu Yubao
2008-12-01 11:32     ` Jakub Narebski
2008-12-02  2:19       ` Liu Yubao
2008-12-01 15:21     ` Shawn O. Pearce
2008-12-02  2:43       ` Liu Yubao
2008-12-02  1:48   ` [PATCH 0/5] support reading and writing uncompressed " Liu Yubao
2008-12-02  1:51   ` [PATCH 1/5] avoid parse_sha1_header() accessing memory out of bound Liu Yubao
2008-12-02 15:42     ` Shawn O. Pearce
2008-12-03  3:49       ` Liu Yubao
2008-12-02  1:53   ` [PATCH 2/5] don't die immediately when convert an invalid type name Liu Yubao
2008-12-02  1:55   ` [PATCH 3/5] optimize parse_sha1_header() a little by detecting object type Liu Yubao
2008-12-02 15:53     ` Shawn O. Pearce
2008-12-03  4:06       ` Liu Yubao
2008-12-02  1:56   ` [PATCH 4/5] support reading uncompressed loose object Liu Yubao
2008-12-02 15:58     ` Shawn O. Pearce
2008-12-03  4:09       ` Liu Yubao
2008-12-02  2:03   ` [PATCH 5/5] support writing " Liu Yubao
2008-12-02 16:07     ` Shawn O. Pearce
2008-12-03  4:22       ` Liu Yubao
2008-12-02  3:11   ` [PATCH 0/5] support reading and " Liu Yubao
2008-12-01 12:16 ` two questions about the format of " Nick Andrew
2008-12-02  2:26   ` Liu Yubao
2008-12-01 15:32 ` Shawn O. Pearce
2008-12-02  3:05   ` Liu Yubao
2008-12-04  0:54     ` Nicolas Pitre

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).