git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Recent unresolved issues
@ 2006-04-14  9:31 Junio C Hamano
  2006-04-14 16:02 ` Petr Baudis
                   ` (5 more replies)
  0 siblings, 6 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-14  9:31 UTC (permalink / raw
  To: git

Here is a list of topics in the recent git traffic that I feel
inadequately addressed.  I've commented on some of them to give
people a feel for what my priorities are.  Somebody might want
to rehash the ones low on my priority list to conclusion with a
concrete proposal if they cared about them enough.  The list is
*not* ordered in any way.

Also please add whatever I missed (or dismissed).  I am hoping
this will be a good basis for 1.4 to-do list.

* Message-ID: <Pine.LNX.4.64.0604121828370.14565@g5.osdl.org>
  Common option parsing (Linus Torvalds)

* Message-ID: <Pine.LNX.4.64.0604050855080.2550@localhost.localdomain>
  Binary diff output? (Nicolas Pitre)

  I do not think this is needed for our primary audience (the
  kernel project), but I am sure it would be helpful for some
  other projects if we allowed them to exchange patches that
  describe binary file changes via e-mail, so I am not
  dismissing this.  Needs to wait "option parsing".

* Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
  Colored diff? (Linus Torvalds)

  I am not opposed to it, but I'd like to do that internally if
  we go this route.  Needs to wait "option parsing".  Also
  Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
  slightly related to this.

* Message-ID: <7vek02ynif.fsf@assigned-by-dhcp.cox.net>
  diff --with-raw, --with-stat? (me)

  I think "git diff" can be internalized next, after "option
  parsing" unification.  When that is done, --with-stat would
  help internalize format-patch's process_one(), and it would be
  trivial to do "git log --pretty=format-patch master..next".

* #irc 2006-04-10
  Shallow clones (Carl Worth).

  The experiment last round did not work out very well, but as
  existing repositories get bigger, and more projects being
  migrated from foreign SCM systems, this would become a
  must-have from would-be-nice-to-have.

  I am beginning to think using "graft" to cauterize history
  for this, while it technically would work, would not be so
  helpful to users, so the design needs to be worked out again.

* Message-ID: <E1FMH3o-0001B5-Dw@jdl.com>
  git status does not distinguish contents changes and mode
  changes; it just says "modified" (Jon Loeliger).

  Unconditionally changing the status letter would break
  Porcelains so we would need an extra option to do this.
  An outline patch has been already prepared -- this perhaps has
  to wait until we sort out the "option parsing" one.

* Message-ID: <tnxmzf9sh7k.fsf@arm.com>
  git could use diff3 instead of merge which is a wrapper around
  diff3. (Catalin Marinas)

  If having "diff3" is a lot more common than having "merge", I
  do not have problem with this; "merge" being a wrapper to
  "diff3", people who have been happy with the current code
  would certainly have "diff3" installed so changing to "diff3"
  would not break them.

* Message-ID: <81b0412b0603020649u99a2035i3b8adde8ddce9410@mail.gmail.com>
  Windows problems summary (Alex Riesen)

  A good list to keep in mind.

* Message-ID: <Pine.LNX.4.64.0604030730040.3781@g5.osdl.org>
  Huge packfiles (Linus Torvalds)

  Because I do not think asking users to break up packs to
  manageable and mmap()able size is too much to ask, I would not
  be advocating for updating the pack idx to 64-bit offset and
  mmap()ing parts of a packfile, at least too strongly.

  However, we currently lack tool support or recepe for users
  with such a repository to easily break up packs.

* Message-ID: <1143856098.3555.48.camel@dv>
  Per branch property, esp. where to merge from (Pavel Roskin)

  This involves user-level "world model" design, which is more
  Porcelainish than Plumbing, and as people know I do not do
  Porcelain well; interested parties need to come up with what
  they want and how they want to use it.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-14  9:31 Recent unresolved issues Junio C Hamano
@ 2006-04-14 16:02 ` Petr Baudis
       [not found] ` <20060414151030.11c64730.seanlkml@sympatico.ca>
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 81+ messages in thread
From: Petr Baudis @ 2006-04-14 16:02 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

Dear diary, on Fri, Apr 14, 2006 at 11:31:36AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> Here is a list of topics in the recent git traffic that I feel
> inadequately addressed.  I've commented on some of them to give
> people a feel for what my priorities are.  Somebody might want
> to rehash the ones low on my priority list to conclusion with a
> concrete proposal if they cared about them enough.  The list is
> *not* ordered in any way.

Nice summary!

> * Message-ID: <tnxmzf9sh7k.fsf@arm.com>
>   git could use diff3 instead of merge which is a wrapper around
>   diff3. (Catalin Marinas)
> 
>   If having "diff3" is a lot more common than having "merge", I
>   do not have problem with this; "merge" being a wrapper to
>   "diff3", people who have been happy with the current code
>   would certainly have "diff3" installed so changing to "diff3"
>   would not break them.

I've decided to bite the bullet and made Cogito use diff3 instead of
merge as of now. Let's see if anybody complains...

> * Message-ID: <1143856098.3555.48.camel@dv>
>   Per branch property, esp. where to merge from (Pavel Roskin)
> 
>   This involves user-level "world model" design, which is more
>   Porcelainish than Plumbing, and as people know I do not do
>   Porcelain well; interested parties need to come up with what
>   they want and how they want to use it.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
       [not found] ` <20060414151030.11c64730.seanlkml@sympatico.ca>
@ 2006-04-14 19:10   ` sean
  0 siblings, 0 replies; 81+ messages in thread
From: sean @ 2006-04-14 19:10 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Fri, 14 Apr 2006 02:31:36 -0700
Junio C Hamano <junkio@cox.net> wrote:

> * Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
>   Colored diff? (Linus Torvalds)
> 
>   I am not opposed to it, but I'd like to do that internally if
>   we go this route.  Needs to wait "option parsing".  Also
>   Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
>   slightly related to this.

Moving it internal sounds like a good idea.  Would you be open to
including the GIT_DIFF_PAGER option now anyway?   It has utility
beyond just color diffs.

Sean

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-14  9:31 Recent unresolved issues Junio C Hamano
  2006-04-14 16:02 ` Petr Baudis
       [not found] ` <20060414151030.11c64730.seanlkml@sympatico.ca>
@ 2006-04-14 19:24 ` Petr Baudis
  2006-04-14 22:56 ` Recent unresolved issues: shallow clone Carl Worth
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 81+ messages in thread
From: Petr Baudis @ 2006-04-14 19:24 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

Dear diary, on Fri, Apr 14, 2006 at 11:31:36AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> * Message-ID: <Pine.LNX.4.64.0604111725590.14565@g5.osdl.org>
>   Colored diff? (Linus Torvalds)
> 
>   I am not opposed to it, but I'd like to do that internally if
>   we go this route.  Needs to wait "option parsing".  Also
>   Message-ID: <3536.10.10.10.24.1114117965.squirrel@linux1> is
>   slightly related to this.

It might be worthwhile to make Git and Cogito compatible if you offer
colors customization. Cogito lets the user customize the colors through
the $CG_COLORS variable (see cg-diff(1)).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues: shallow clone
  2006-04-14  9:31 Recent unresolved issues Junio C Hamano
                   ` (2 preceding siblings ...)
  2006-04-14 19:24 ` Petr Baudis
@ 2006-04-14 22:56 ` Carl Worth
  2006-04-15  0:17   ` Johannes Schindelin
  2006-04-15  0:25   ` Junio C Hamano
  2006-04-14 23:52 ` Recent unresolved issues Linus Torvalds
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
  5 siblings, 2 replies; 81+ messages in thread
From: Carl Worth @ 2006-04-14 22:56 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 3465 bytes --]

On Fri, 14 Apr 2006 02:31:36 -0700, Junio C Hamano wrote:
>   Shallow clones (Carl Worth).
> 
>   The experiment last round did not work out very well, but as
>   existing repositories get bigger, and more projects being
>   migrated from foreign SCM systems, this would become a
>   must-have from would-be-nice-to-have.
> 
>   I am beginning to think using "graft" to cauterize history
>   for this, while it technically would work, would not be so
>   helpful to users, so the design needs to be worked out again.

As context, here is some of what you mentioned in IRC:

>>	Suppose you have this:
>>
>>	A---B---C
>>	 \       \ 
>>	  D---E---F---G
>>	 
>>	and you made a shallow clone of C (because that is where the
>>	upstream master was when you made that clone).  Then the
>>	upstream updated the master branch tip to G.
>>
>>	The next update from upstream to your shallow clone would break.
>>	The upstream says: I have G at master.
>>	You say: I want G then.  By the way, I have C.
>>
>>	What it means to tell the other end "I have X" is to promise
>>	that you have X and _everything_ behind it.  So the upstream
>>	would send objects necessary to complete D, E, F and G for
>>	"somebody who already have A and B".  As a consequence, you
>>	would not see A nor B.
>>
>>	Even if the only thing you are interested in is to be in sync
>>	with the tip of the upstream, you can end up with an
>>	incomplete tree for G, if some of the blobs or trees contained
>>	in G already exist in A or B.  They are not sent -- because
>>	you told the upstream that you have everything necessary to
>>	get to C.

So that's an argument against using a cauterizing graft for the
shallow clone of C. It definitely confuses the existing protocol to
say "I have C" if I have only a cauterized C, (its tree only, but none
of the commits that should be backing C).

I also read over some of your discussion of extending the protocol
with a new "shallow" extension.

I'm wondering if the shallow clone support couldn't be achieved
through a simpler tweak to the protocol semantics, (and no change to
protocol syntax), that would avoid the problem above. Specifically,
for shallow stuff, could we just do the same "want" and "have"
conversation with tree objects rather than commit objects?

So, in the scenario above, the original shallow clone of C would be:

	Want C->tree, have nothing.

and the later shallow update to G would be:

	Want G->tree, have C->tree

A final step of a shallow clone would then require creating a new
parent-less commit object so that there's something to point refs/head
at, (or maybe rather than being parentless, they could be chained
together with each update?).

I admit that this would result in a rather atypical kind of
repository, but it would contain plenty of valid trees and blobs, so
it should conceptually be fairly easy to promote such a thing to a
full repository.

But, even without any tool support for promotion, the ability to do
shallow clone and shallow updates would still provide a useful
capability [*].

-Carl

[*] For reference, what I'm looking for here is a way to justify
providing git support for jhbuild, which is a tool used by testers of
GNOME and other software to efficiently track the latest development
of an arbitrarily large number of packages. It's currently primarily a
CVS-based thing. Switching to git would be a huge win for the
incremental updates, but would currently cause quite a hit for the
first clone.

[-- Attachment #2: Type: application/pgp-signature, Size: 191 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-14  9:31 Recent unresolved issues Junio C Hamano
                   ` (3 preceding siblings ...)
  2006-04-14 22:56 ` Recent unresolved issues: shallow clone Carl Worth
@ 2006-04-14 23:52 ` Linus Torvalds
  2006-04-15  0:19   ` Linus Torvalds
  2006-04-15  0:38   ` Junio C Hamano
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
  5 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-04-14 23:52 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Git Mailing List



On Fri, 14 Apr 2006, Junio C Hamano wrote:
> 
> * Message-ID: <Pine.LNX.4.64.0604121828370.14565@g5.osdl.org>
>   Common option parsing (Linus Torvalds)

Ok, here's a first cut at starting this.

This basically does a few things that are sadly somewhat interdependent, 
and nontrivial to split out

 - get rid of "struct log_tree_opt"

   The fields in "log_tree_opt" are moved into "struct rev_info", and all 
   users of log_tree_opt are changed to use the rev_info struct instead.

 - add the parsing for the log_tree_opt arguments to "setup_revision()"

 - make setup_revision set a flag (revs->diff) if the diff-related 
   arguments were used. This allows "git log" to decide whether it wants 
   to show diffs or not.

 - make setup_revision() also initialize the diffopt part of rev_info 
   (which we had from before, but we just didn't initialize it)

 - make setup_revision() do all the "finishing touches" on it all (it will 
   do the proper flag combination logic, and call "diff_setup_done()")

Now, that was the easy and straightforward part.

The slightly more involved part is that some of the programs that want to 
use the new-and-improved rev_info parsing don't actually want _commits_, 
they may want tree'ish arguments instead. That meant that I had to change 
setup_revision() to parse the arguments not into the "revs->commits" list, 
but into the "revs->pending_objects" list.

Then, when we do "prepare_revision_walk()", we walk that list, and create 
the sorted commit list from there. 

This actually cleaned some stuff up, but it's the less obvious part of the 
patch, and re-organized the "revision.c" logic somewhat. It actually paves 
the way for splitting argument parsing _entirely_ out of "revision.c", 
since now the argument parsing really is totally independent of the commit 
walking: that didn't use to be true, since there was lots of overlap with 
get_commit_reference() handling etc, now the _only_ overlap is the shared 
(and trivial) "add_pending_object()" thing.

However, I didn't do that file split, just because I wanted the diff 
itself to be smaller, and show the actual changes more clearly. If this 
gets accepted, I'll do further cleanups then - that includes the file 
split, but also using the new infrastructure to do a nicer "git diff" etc.

Even in this form, it actually ends up removing more lines than it adds.

It's nice to note how simple and straightforward this makes the built-in 
"git log" command, even though it continues to support all the diff flags 
too. It doesn't get much simpler that this.

I think this is worth merging soonish, because it does allow for future 
cleanup and even more sharing of code. However, it obviously touches 
"revision.c", which is subtle. I've tested that it passes all the tests we 
have, and it passes my "looks sane" detector, but somebody else should 
also give it a good look-over.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

 diff-tree.c |   91 ++++++++++++++++-------------------
 git.c       |   68 ++------------------------
 log-tree.c  |   60 ++---------------------
 log-tree.h  |   22 ++------
 revision.c  |  155 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 revision.h  |   18 +++++++
 6 files changed, 202 insertions(+), 212 deletions(-)

diff --git a/diff-tree.c b/diff-tree.c
index 2b79dd0..54157e4 100644
--- a/diff-tree.c
+++ b/diff-tree.c
@@ -3,7 +3,7 @@ #include "diff.h"
 #include "commit.h"
 #include "log-tree.h"
 
-static struct log_tree_opt log_tree_opt;
+static struct rev_info log_tree_opt;
 
 static int diff_tree_commit_sha1(const unsigned char *sha1)
 {
@@ -62,66 +62,55 @@ int main(int argc, const char **argv)
 {
 	int nr_sha1;
 	char line[1000];
-	unsigned char sha1[2][20];
-	const char *prefix = setup_git_directory();
-	static struct log_tree_opt *opt = &log_tree_opt;
+	struct object *tree1, *tree2;
+	static struct rev_info *opt = &log_tree_opt;
+	struct object_list *list;
 	int read_stdin = 0;
 
 	git_config(git_diff_config);
 	nr_sha1 = 0;
-	init_log_tree_opt(opt);
+	argc = setup_revisions(argc, argv, opt, NULL);
 
-	for (;;) {
-		int opt_cnt;
-		const char *arg;
+	while (--argc > 0) {
+		const char *arg = *++argv;
 
-		argv++;
-		argc--;
-		arg = *argv;
-		if (!arg)
-			break;
-
-		if (*arg != '-') {
-			if (nr_sha1 < 2 && !get_sha1(arg, sha1[nr_sha1])) {
-				nr_sha1++;
-				continue;
-			}
-			break;
-		}
-
-		opt_cnt = log_tree_opt_parse(opt, argv, argc);
-		if (opt_cnt < 0)
-			usage(diff_tree_usage);
-		else if (opt_cnt) {
-			argv += opt_cnt - 1;
-			argc -= opt_cnt - 1;
-			continue;
-		}
-
-		if (!strcmp(arg, "--")) {
-			argv++;
-			argc--;
-			break;
-		}
 		if (!strcmp(arg, "--stdin")) {
 			read_stdin = 1;
 			continue;
 		}
 		usage(diff_tree_usage);
 	}
-
-	if (opt->combine_merges)
-		opt->ignore_merges = 0;
-
-	/* We can only do dense combined merges with diff output */
-	if (opt->dense_combined_merges)
-		opt->diffopt.output_format = DIFF_FORMAT_PATCH;
-
-	if (opt->diffopt.output_format == DIFF_FORMAT_PATCH)
-		opt->diffopt.recursive = 1;
 
-	diff_tree_setup_paths(get_pathspec(prefix, argv), opt);
-	diff_setup_done(&opt->diffopt);
+	/*
+	 * NOTE! "setup_revisions()" will have inserted the revisions
+	 * it parsed in reverse order. So if you do
+	 *
+	 *	git-diff-tree a b
+	 *
+	 * the commit list will be "b" -> "a" -> NULL, so we reverse
+	 * the order of the objects if the first one is not marked
+	 * UNINTERESTING.
+	 */
+	nr_sha1 = 0;
+	list = opt->pending_objects;
+	if (list) {
+		nr_sha1++;
+		tree1 = list->item;
+		list = list->next;
+		if (list) {
+			nr_sha1++;
+			tree2 = tree1;
+			tree1 = list->item;
+			if (list->next)
+				usage(diff_tree_usage);
+			/* Switch them around if the second one was uninteresting.. */
+			if (tree2->flags & UNINTERESTING) {
+				struct object *tmp = tree2;
+				tree2 = tree1;
+				tree1 = tmp;
+			}
+		}
+	}
 
 	switch (nr_sha1) {
 	case 0:
@@ -129,10 +118,12 @@ int main(int argc, const char **argv)
 			usage(diff_tree_usage);
 		break;
 	case 1:
-		diff_tree_commit_sha1(sha1[0]);
+		diff_tree_commit_sha1(tree1->sha1);
 		break;
 	case 2:
-		diff_tree_sha1(sha1[0], sha1[1], "", &opt->diffopt);
+		diff_tree_sha1(tree1->sha1,
+			       tree2->sha1,
+			       "", &opt->diffopt);
 		log_tree_diff_flush(opt);
 		break;
 	}
diff --git a/git.c b/git.c
index 78ed403..e8d1fcc 100644
--- a/git.c
+++ b/git.c
@@ -287,74 +287,18 @@ static int cmd_log(int argc, const char 
 	int abbrev = DEFAULT_ABBREV;
 	int abbrev_commit = 0;
 	const char *commit_prefix = "commit ";
-	struct log_tree_opt opt;
 	int shown = 0;
-	int do_diff = 0;
-	int full_diff = 0;
 
-	init_log_tree_opt(&opt);
 	argc = setup_revisions(argc, argv, &rev, "HEAD");
-	while (1 < argc) {
-		const char *arg = argv[1];
-		if (!strncmp(arg, "--pretty", 8)) {
-			commit_format = get_commit_format(arg + 8);
-			if (commit_format == CMIT_FMT_ONELINE)
-				commit_prefix = "";
-		}
-		else if (!strcmp(arg, "--no-abbrev")) {
-			abbrev = 0;
-		}
-		else if (!strcmp(arg, "--abbrev")) {
-			abbrev = DEFAULT_ABBREV;
-		}
-		else if (!strcmp(arg, "--abbrev-commit")) {
-			abbrev_commit = 1;
-		}
-		else if (!strncmp(arg, "--abbrev=", 9)) {
-			abbrev = strtoul(arg + 9, NULL, 10);
-			if (abbrev && abbrev < MINIMUM_ABBREV)
-				abbrev = MINIMUM_ABBREV;
-			else if (40 < abbrev)
-				abbrev = 40;
-		}
-		else if (!strcmp(arg, "--full-diff")) {
-			do_diff = 1;
-			full_diff = 1;
-		}
-		else {
-			int cnt = log_tree_opt_parse(&opt, argv+1, argc-1);
-			if (0 < cnt) {
-				do_diff = 1;
-				argv += cnt;
-				argc -= cnt;
-				continue;
-			}
-			die("unrecognized argument: %s", arg);
-		}
+	if (argc > 1)
+		die("unrecognized argument: %s", argv[1]);
 
-		argc--; argv++;
-	}
-
-	if (do_diff) {
-		opt.diffopt.abbrev = abbrev;
-		opt.verbose_header = 0;
-		opt.always_show_header = 0;
-		opt.no_commit_id = 1;
-		if (opt.combine_merges)
-			opt.ignore_merges = 0;
-		if (opt.dense_combined_merges)
-			opt.diffopt.output_format = DIFF_FORMAT_PATCH;
-		if (opt.diffopt.output_format == DIFF_FORMAT_PATCH)
-			opt.diffopt.recursive = 1;
-		if (!full_diff && rev.prune_data)
-			diff_tree_setup_paths(rev.prune_data, &opt.diffopt);
-		diff_setup_done(&opt.diffopt);
-	}
+	rev.no_commit_id = 1;
 
 	prepare_revision_walk(&rev);
 	setup_pager();
 	while ((commit = get_revision(&rev)) != NULL) {
-		if (shown && do_diff && commit_format != CMIT_FMT_ONELINE)
+		if (shown && rev.diff && commit_format != CMIT_FMT_ONELINE)
 			putchar('\n');
 		fputs(commit_prefix, stdout);
 		if (abbrev_commit && abbrev)
@@ -388,8 +332,8 @@ static int cmd_log(int argc, const char 
 		pretty_print_commit(commit_format, commit, ~0, buf,
 				    LOGSIZE, abbrev);
 		printf("%s\n", buf);
-		if (do_diff)
-			log_tree_commit(&opt, commit);
+		if (rev.diff)
+			log_tree_commit(&rev, commit);
 		shown = 1;
 		free(commit->buffer);
 		commit->buffer = NULL;
diff --git a/log-tree.c b/log-tree.c
index 3d40482..04a68e0 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -3,58 +3,8 @@ #include "diff.h"
 #include "commit.h"
 #include "log-tree.h"
 
-void init_log_tree_opt(struct log_tree_opt *opt)
+int log_tree_diff_flush(struct rev_info *opt)
 {
-	memset(opt, 0, sizeof *opt);
-	opt->ignore_merges = 1;
-	opt->header_prefix = "";
-	opt->commit_format = CMIT_FMT_RAW;
-	diff_setup(&opt->diffopt);
-}
-
-int log_tree_opt_parse(struct log_tree_opt *opt, const char **av, int ac)
-{
-	const char *arg;
-	int cnt = diff_opt_parse(&opt->diffopt, av, ac);
-	if (0 < cnt)
-		return cnt;
-	arg = *av;
-	if (!strcmp(arg, "-r"))
-		opt->diffopt.recursive = 1;
-	else if (!strcmp(arg, "-t")) {
-		opt->diffopt.recursive = 1;
-		opt->diffopt.tree_in_recursive = 1;
-	}
-	else if (!strcmp(arg, "-m"))
-		opt->ignore_merges = 0;
-	else if (!strcmp(arg, "-c"))
-		opt->combine_merges = 1;
-	else if (!strcmp(arg, "--cc")) {
-		opt->dense_combined_merges = 1;
-		opt->combine_merges = 1;
-	}
-	else if (!strcmp(arg, "-v")) {
-		opt->verbose_header = 1;
-		opt->header_prefix = "diff-tree ";
-	}
-	else if (!strncmp(arg, "--pretty", 8)) {
-		opt->verbose_header = 1;
-		opt->header_prefix = "diff-tree ";
-		opt->commit_format = get_commit_format(arg+8);
-	}
-	else if (!strcmp(arg, "--root"))
-		opt->show_root_diff = 1;
-	else if (!strcmp(arg, "--no-commit-id"))
-		opt->no_commit_id = 1;
-	else if (!strcmp(arg, "--always"))
-		opt->always_show_header = 1;
-	else
-		return 0;
-	return 1;
-}
-
-int log_tree_diff_flush(struct log_tree_opt *opt)
-{
 	diffcore_std(&opt->diffopt);
 	if (diff_queue_is_empty()) {
 		int saved_fmt = opt->diffopt.output_format;
@@ -73,7 +23,7 @@ int log_tree_diff_flush(struct log_tree_
 	return 1;
 }
 
-static int diff_root_tree(struct log_tree_opt *opt,
+static int diff_root_tree(struct rev_info *opt,
 			  const unsigned char *new, const char *base)
 {
 	int retval;
@@ -93,7 +43,7 @@ static int diff_root_tree(struct log_tre
 	return retval;
 }
 
-static const char *generate_header(struct log_tree_opt *opt,
+static const char *generate_header(struct rev_info *opt,
 				   const unsigned char *commit_sha1,
 				   const unsigned char *parent_sha1,
 				   const struct commit *commit)
@@ -129,7 +79,7 @@ static const char *generate_header(struc
 	return this_header;
 }
 
-static int do_diff_combined(struct log_tree_opt *opt, struct commit *commit)
+static int do_diff_combined(struct rev_info *opt, struct commit *commit)
 {
 	unsigned const char *sha1 = commit->object.sha1;
 
@@ -142,7 +92,7 @@ static int do_diff_combined(struct log_t
 	return 0;
 }
 
-int log_tree_commit(struct log_tree_opt *opt, struct commit *commit)
+int log_tree_commit(struct rev_info *opt, struct commit *commit)
 {
 	struct commit_list *parents;
 	unsigned const char *sha1 = commit->object.sha1;
diff --git a/log-tree.h b/log-tree.h
index da166c6..91a909b 100644
--- a/log-tree.h
+++ b/log-tree.h
@@ -1,23 +1,11 @@
 #ifndef LOG_TREE_H
 #define LOG_TREE_H
 
-struct log_tree_opt {
-	struct diff_options diffopt;
-	int show_root_diff;
-	int no_commit_id;
-	int verbose_header;
-	int ignore_merges;
-	int combine_merges;
-	int dense_combined_merges;
-	int always_show_header;
-	const char *header_prefix;
-	const char *header;
-	enum cmit_fmt commit_format;
-};
+#include "revision.h"
 
-void init_log_tree_opt(struct log_tree_opt *);
-int log_tree_diff_flush(struct log_tree_opt *);
-int log_tree_commit(struct log_tree_opt *, struct commit *);
-int log_tree_opt_parse(struct log_tree_opt *, const char **, int);
+void init_log_tree_opt(struct rev_info *);
+int log_tree_diff_flush(struct rev_info *);
+int log_tree_commit(struct rev_info *, struct commit *);
+int log_tree_opt_parse(struct rev_info *, const char **, int);
 
 #endif
diff --git a/revision.c b/revision.c
index 0505f3f..99077af 100644
--- a/revision.c
+++ b/revision.c
@@ -116,21 +116,27 @@ static void add_pending_object(struct re
 	add_object(obj, &revs->pending_objects, NULL, name);
 }
 
-static struct commit *get_commit_reference(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags)
+static struct object *get_reference(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags)
 {
 	struct object *object;
 
 	object = parse_object(sha1);
 	if (!object)
 		die("bad object %s", name);
+	object->flags |= flags;
+	return object;
+}
+
+static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name)
+{
+	unsigned long flags = object->flags;
 
 	/*
 	 * Tag object? Look what it points to..
 	 */
 	while (object->type == tag_type) {
 		struct tag *tag = (struct tag *) object;
-		object->flags |= flags;
-		if (revs->tag_objects && !(object->flags & UNINTERESTING))
+		if (revs->tag_objects && !(flags & UNINTERESTING))
 			add_pending_object(revs, object, tag->tag);
 		object = parse_object(tag->tagged->sha1);
 		if (!object)
@@ -143,7 +149,6 @@ static struct commit *get_commit_referen
 	 */
 	if (object->type == commit_type) {
 		struct commit *commit = (struct commit *)object;
-		object->flags |= flags;
 		if (parse_commit(commit) < 0)
 			die("unable to parse commit %s", name);
 		if (flags & UNINTERESTING) {
@@ -449,14 +454,6 @@ static void limit_list(struct rev_info *
 		}
 	}
 	revs->commits = newlist;
-}
-
-static void add_one_commit(struct commit *commit, struct rev_info *revs)
-{
-	if (!commit || (commit->object.flags & SEEN))
-		return;
-	commit->object.flags |= SEEN;
-	commit_list_insert(commit, &revs->commits);
 }
 
 static int all_flags;
@@ -464,8 +461,8 @@ static struct rev_info *all_revs;
 
 static int handle_one_ref(const char *path, const unsigned char *sha1)
 {
-	struct commit *commit = get_commit_reference(all_revs, path, sha1, all_flags);
-	add_one_commit(commit, all_revs);
+	struct object *object = get_reference(all_revs, path, sha1, all_flags);
+	add_pending_object(all_revs, object, "");
 	return 0;
 }
 
@@ -494,6 +491,11 @@ void init_revisions(struct rev_info *rev
 
 	revs->topo_setter = topo_sort_default_setter;
 	revs->topo_getter = topo_sort_default_getter;
+
+	revs->header_prefix = "";
+	revs->commit_format = CMIT_FMT_RAW;
+
+	diff_setup(&revs->diffopt);
 }
 
 /*
@@ -526,13 +528,14 @@ int setup_revisions(int argc, const char
 
 	flags = 0;
 	for (i = 1; i < argc; i++) {
-		struct commit *commit;
+		struct object *object;
 		const char *arg = argv[i];
 		unsigned char sha1[20];
 		char *dotdot;
 		int local_flags;
 
 		if (*arg == '-') {
+			int opts;
 			if (!strncmp(arg, "--max-count=", 12)) {
 				revs->max_count = atoi(arg + 12);
 				continue;
@@ -638,6 +641,78 @@ int setup_revisions(int argc, const char
 			}
 			if (!strcmp(arg, "--unpacked")) {
 				revs->unpacked = 1;
+				continue;
+			}
+			if (!strcmp(arg, "-r")) {
+				revs->diff = 1;
+				revs->diffopt.recursive = 1;
+				continue;
+			}
+			if (!strcmp(arg, "-t")) {
+				revs->diff = 1;
+				revs->diffopt.recursive = 1;
+				revs->diffopt.tree_in_recursive = 1;
+				continue;
+			}
+			if (!strcmp(arg, "-m")) {
+				revs->ignore_merges = 0;
+				continue;
+			}
+			if (!strcmp(arg, "-c")) {
+				revs->diff = 1;
+				revs->combine_merges = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--cc")) {
+				revs->diff = 1;
+				revs->dense_combined_merges = 1;
+				revs->combine_merges = 1;
+				continue;
+			}
+			if (!strcmp(arg, "-v")) {
+				revs->verbose_header = 1;
+				revs->header_prefix = "diff-tree ";
+				continue;
+			}
+			if (!strncmp(arg, "--pretty", 8)) {
+				revs->verbose_header = 1;
+				revs->header_prefix = "diff-tree ";
+				revs->commit_format = get_commit_format(arg+8);
+				continue;
+			}
+			if (!strcmp(arg, "--root")) {
+				revs->show_root_diff = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--no-commit-id")) {
+				revs->no_commit_id = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--always")) {
+				revs->always_show_header = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--no-abbrev")) {
+				revs->abbrev = 0;
+				continue;
+			}
+			if (!strcmp(arg, "--abbrev")) {
+				revs->abbrev = DEFAULT_ABBREV;
+				continue;
+			}
+			if (!strcmp(arg, "--abbrev-commit")) {
+				revs->abbrev_commit = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--full-diff")) {
+				revs->diff = 1;
+				revs->full_diff = 1;
+				continue;
+			}
+			opts = diff_opt_parse(&revs->diffopt, argv+i, argc-i);
+			if (opts > 0) {
+				revs->diff = 1;
+				i += opts - 1;
 				continue;
 			}
 			*unrecognized++ = arg;
@@ -656,15 +731,15 @@ int setup_revisions(int argc, const char
 				this = "HEAD";
 			if (!get_sha1(this, from_sha1) &&
 			    !get_sha1(next, sha1)) {
-				struct commit *exclude;
-				struct commit *include;
+				struct object *exclude;
+				struct object *include;
 
-				exclude = get_commit_reference(revs, this, from_sha1, flags ^ UNINTERESTING);
-				include = get_commit_reference(revs, next, sha1, flags);
+				exclude = get_reference(revs, this, from_sha1, flags ^ UNINTERESTING);
+				include = get_reference(revs, next, sha1, flags);
 				if (!exclude || !include)
 					die("Invalid revision range %s..%s", arg, next);
-				add_one_commit(exclude, revs);
-				add_one_commit(include, revs);
+				add_pending_object(revs, exclude, this);
+				add_pending_object(revs, include, next);
 				continue;
 			}
 			*dotdot = '.';
@@ -689,16 +764,16 @@ int setup_revisions(int argc, const char
 			revs->prune_data = get_pathspec(revs->prefix, argv + i);
 			break;
 		}
-		commit = get_commit_reference(revs, arg, sha1, flags ^ local_flags);
-		add_one_commit(commit, revs);
+		object = get_reference(revs, arg, sha1, flags ^ local_flags);
+		add_pending_object(revs, object, arg);
 	}
-	if (def && !revs->commits) {
+	if (def && !revs->pending_objects) {
 		unsigned char sha1[20];
-		struct commit *commit;
+		struct object *object;
 		if (get_sha1(def, sha1) < 0)
 			die("bad default revision '%s'", def);
-		commit = get_commit_reference(revs, def, sha1, 0);
-		add_one_commit(commit, revs);
+		object = get_reference(revs, def, sha1, 0);
+		add_pending_object(revs, object, def);
 	}
 
 	if (revs->topo_order || revs->unpacked)
@@ -708,13 +783,37 @@ int setup_revisions(int argc, const char
 		diff_tree_setup_paths(revs->prune_data, &revs->diffopt);
 		revs->prune_fn = try_to_simplify_commit;
 	}
+	if (revs->combine_merges) {
+		revs->ignore_merges = 0;
+		if (revs->dense_combined_merges)
+			revs->diffopt.output_format = DIFF_FORMAT_PATCH;
+	}
+	if (revs->diffopt.output_format == DIFF_FORMAT_PATCH)
+		revs->diffopt.recursive = 1;
+	if (!revs->full_diff && revs->prune_data)
+		diff_tree_setup_paths(revs->prune_data, &revs->diffopt);
+	diff_setup_done(&revs->diffopt);
 
 	return left;
 }
 
 void prepare_revision_walk(struct rev_info *revs)
 {
-	sort_by_date(&revs->commits);
+	struct object_list *list;
+
+	list = revs->pending_objects;
+	revs->pending_objects = NULL;
+	while (list) {
+		struct commit *commit = handle_commit(revs, list->item, list->name);
+		if (commit) {
+			if (!(commit->object.flags & SEEN)) {
+				commit->object.flags |= SEEN;
+				insert_by_date(commit, &revs->commits);
+			}
+		}
+		list = list->next;
+	}
+
 	if (revs->limited)
 		limit_list(revs);
 	if (revs->topo_order)
diff --git a/revision.h b/revision.h
index 8970b57..9a45986 100644
--- a/revision.h
+++ b/revision.h
@@ -38,6 +38,24 @@ struct rev_info {
 			boundary:1,
 			parents:1;
 
+	/* Diff flags */
+	unsigned int	diff:1,
+			full_diff:1,
+			show_root_diff:1,
+			no_commit_id:1,
+			verbose_header:1,
+			ignore_merges:1,
+			combine_merges:1,
+			dense_combined_merges:1,
+			always_show_header:1;
+
+	/* Format info */
+	unsigned int	abbrev_commit:1;
+	unsigned int	abbrev;
+	enum cmit_fmt	commit_format;
+	const char	*header_prefix;
+	const char	*header;
+
 	/* special limits */
 	int max_count;
 	unsigned long max_age;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues: shallow clone
  2006-04-14 22:56 ` Recent unresolved issues: shallow clone Carl Worth
@ 2006-04-15  0:17   ` Johannes Schindelin
  2006-04-15  0:25   ` Junio C Hamano
  1 sibling, 0 replies; 81+ messages in thread
From: Johannes Schindelin @ 2006-04-15  0:17 UTC (permalink / raw
  To: Carl Worth; +Cc: Junio C Hamano, git

Hi,

On Fri, 14 Apr 2006, Carl Worth wrote:

> I also read over some of your discussion of extending the protocol
> with a new "shallow" extension.
> 
> I'm wondering if the shallow clone support couldn't be achieved
> through a simpler tweak to the protocol semantics, (and no change to
> protocol syntax), that would avoid the problem above. Specifically,
> for shallow stuff, could we just do the same "want" and "have"
> conversation with tree objects rather than commit objects?

It would not help your problem at all. "have commit" really means that you 
have the commit and all its ancestors and their combined tree objects and 
the combined tree objects' blob objects.

If you have a cauterized history, you know that you are lacking some of 
them. But you don't know which ones.

Now, issuing a pull could mean to get an object which was present in an 
old revision, which you unfortunately do not have (because you have a cut 
off history). Boom.

I know, this is probably unlikely, but not at all *impossible*, so you 
have to take care of that case. And you need a protocol extension for 
that.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-14 23:52 ` Recent unresolved issues Linus Torvalds
@ 2006-04-15  0:19   ` Linus Torvalds
  2006-04-15  0:39     ` Linus Torvalds
  2006-04-15  0:38   ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  0:19 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Git Mailing List



On Fri, 14 Apr 2006, Linus Torvalds wrote:
> 
> It's nice to note how simple and straightforward this makes the built-in 
> "git log" command, even though it continues to support all the diff flags 
> too. It doesn't get much simpler that this.

Gaah. Missed this important part, which causes the thing to ignore the 
"--pretty=xyzzy" argument, since it would always use its own default 
format that is no longer ever changed.

I even tested that it works for git-diff-tree, just not for git log. Duh.

(Found the hard way - after I had already used the broken git version for 
doing several merges, and the "--pretty=oneline" format didn't work and 
screwed up the merge message ;)

		Linus

----
diff --git a/git.c b/git.c
index e8d1fcc..d5a4a24 100644
--- a/git.c
+++ b/git.c
@@ -283,9 +283,6 @@ static int cmd_log(int argc, const char 
 	struct rev_info rev;
 	struct commit *commit;
 	char *buf = xmalloc(LOGSIZE);
-	static enum cmit_fmt commit_format = CMIT_FMT_DEFAULT;
-	int abbrev = DEFAULT_ABBREV;
-	int abbrev_commit = 0;
 	const char *commit_prefix = "commit ";
 	int shown = 0;
 
@@ -298,11 +295,11 @@ static int cmd_log(int argc, const char 
 	prepare_revision_walk(&rev);
 	setup_pager();
 	while ((commit = get_revision(&rev)) != NULL) {
-		if (shown && rev.diff && commit_format != CMIT_FMT_ONELINE)
+		if (shown && rev.diff && rev.commit_format != CMIT_FMT_ONELINE)
 			putchar('\n');
 		fputs(commit_prefix, stdout);
-		if (abbrev_commit && abbrev)
-			fputs(find_unique_abbrev(commit->object.sha1, abbrev),
+		if (rev.abbrev_commit && rev.abbrev)
+			fputs(find_unique_abbrev(commit->object.sha1, rev.abbrev),
 			      stdout);
 		else
 			fputs(sha1_to_hex(commit->object.sha1), stdout);
@@ -325,12 +322,12 @@ static int cmd_log(int argc, const char 
 			     parents = parents->next)
 				parents->item->object.flags &= ~TMP_MARK;
 		}
-		if (commit_format == CMIT_FMT_ONELINE)
+		if (rev.commit_format == CMIT_FMT_ONELINE)
 			putchar(' ');
 		else
 			putchar('\n');
-		pretty_print_commit(commit_format, commit, ~0, buf,
-				    LOGSIZE, abbrev);
+		pretty_print_commit(rev.commit_format, commit, ~0, buf,
+				    LOGSIZE, rev.abbrev);
 		printf("%s\n", buf);
 		if (rev.diff)
 			log_tree_commit(&rev, commit);

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues: shallow clone
  2006-04-14 22:56 ` Recent unresolved issues: shallow clone Carl Worth
  2006-04-15  0:17   ` Johannes Schindelin
@ 2006-04-15  0:25   ` Junio C Hamano
  2006-04-15  2:11     ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  0:25 UTC (permalink / raw
  To: Carl Worth; +Cc: git

Carl Worth <cworth@cworth.org> writes:

> On Fri, 14 Apr 2006 02:31:36 -0700, Junio C Hamano wrote:
>>   I am beginning to think using "graft" to cauterize history
>>   for this, while it technically would work, would not be so
>>   helpful to users, so the design needs to be worked out again.
>
> As context, here is some of what you mentioned in IRC:
>
>>>	Suppose you have this:
>>>
>>>	A---B---C
>>>	 \       \ 
>>>	  D---E---F---G
>>>	 
>>>	and you made a shallow clone of C (because that is where the
>>>	upstream master was when you made that clone).  Then the
>>>	upstream updated the master branch tip to G.
>>>
>>>	The next update from upstream to your shallow clone would break.
>>>	The upstream says: I have G at master.
>>>	You say: I want G then.  By the way, I have C.
>>>
>>>	What it means to tell the other end "I have X" is to promise
>>>	that you have X and _everything_ behind it.  So the upstream
>>>	would send objects necessary to complete D, E, F and G for
>>>	"somebody who already have A and B".  As a consequence, you
>>>	would not see A nor B.
>>>
>>>	Even if the only thing you are interested in is to be in sync
>>>	with the tip of the upstream, you can end up with an
>>>	incomplete tree for G, if some of the blobs or trees contained
>>>	in G already exist in A or B.  They are not sent -- because
>>>	you told the upstream that you have everything necessary to
>>>	get to C.
>
> So that's an argument against using a cauterizing graft for the
> shallow clone of C. It definitely confuses the existing protocol to
> say "I have C" if I have only a cauterized C, (its tree only, but none
> of the commits that should be backing C).

That's what I meant by "graft technically works but is
inconvenient". 

Maybe after the update to G happens (which means you now have C,
F, G but not A B D E commits), the client side could enumerate
commits on "rev-list ^C G" and cauterize the ones with missing
parents (in this case, F does not have one of its parents).
While doing this would help keeping the resulting commit
ancestry sane, it does not solve the problem of missing blobs
and trees.  See below.

> So, in the scenario above, the original shallow clone of C would be:
>
> 	Want C->tree, have nothing.
>
> and the later shallow update to G would be:
>
> 	Want G->tree, have C->tree

When you ask for G, you do not know what G^{tree} is, so that is
fantasy without a protocol extention.  To solve the missing
blobs/trees problem we would probably need a protocol extention
that says it wants to receive enough data to complete trees and
blobs associated with the commits being sent _without_ assuming
the recipient has any trees or blobs other than what are
contained in "have" commits.  Then after such a successful
transfer, missing parents of commits listed in "rev-list ^C G"
are the ones from the side branch, so the client can cauterize
them (F in the above example) appropriately without bothering
the server.

However, I think this "do not assume I have any trees behind the
commits I explicitly say I have" must be an option, because it
makes the resulting transfer unnecessarily more expensive for
normal uses.  A fetch of the Linux kernel once a day would
update about a couple of hundered commits, each of which touches
only 3 paths on average (so that would be 600 files out of
18,000 file tree.  When side-branch merges are involved, usually
many things in G (and F) are unchanged since either A or C, but
the extention we are discussing forbids reusing what are found
in A (it still allows reusing what are found in C).

> A final step of a shallow clone would then require creating a new
> parent-less commit object so that there's something to point refs/head
> at, (or maybe rather than being parentless, they could be chained
> together with each update?).

Rewriting commit objects transferred to the cloner is something
you would _not_ want to do (e.g. rewriting F commits to say it
has only one parent C).  The history based on that would diverge
from parents and would become unmergeable.  It is cleaner to
just make a new graft entry to say "As far as this repository is
concerned, F has one parent C".  Shallowness of the repository
and its slightly different view of history is a local matter.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-14 23:52 ` Recent unresolved issues Linus Torvalds
  2006-04-15  0:19   ` Linus Torvalds
@ 2006-04-15  0:38   ` Junio C Hamano
  2006-04-15  0:49     ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  0:38 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 14 Apr 2006, Junio C Hamano wrote:
>> 
>> * Message-ID: <Pine.LNX.4.64.0604121828370.14565@g5.osdl.org>
>>   Common option parsing (Linus Torvalds)
>
> Ok, here's a first cut at starting this.

Hmph.  You did it while I was still thinking about it ;-).

I was thinking long because I had an impression that anything
based on revision.c interface, if it wants to do a tree-diff on
the commit stream, would need two different diff options.  One
is used by revision.c internally so that it can use its own
add_remove/change for parent pruning, and another to control the
way diff is run by the user of revision.c.

I'll take a look at it tonight.  Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:19   ` Linus Torvalds
@ 2006-04-15  0:39     ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  0:39 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Git Mailing List



On Fri, 14 Apr 2006, Linus Torvalds wrote:
> 
> Gaah. Missed this important part, which causes the thing to ignore the 
> "--pretty=xyzzy" argument, since it would always use its own default 
> format that is no longer ever changed.

And here's one more fixup: get the default format right, and don't prefix 
the "oneline" format.

		Linus

----
diff --git a/git.c b/git.c
index d5a4a24..437e9b5 100644
--- a/git.c
+++ b/git.c
@@ -291,6 +291,8 @@ static int cmd_log(int argc, const char 
 		die("unrecognized argument: %s", argv[1]);
 
 	rev.no_commit_id = 1;
+	if (rev.commit_format == CMIT_FMT_ONELINE)
+		commit_prefix = "";
 
 	prepare_revision_walk(&rev);
 	setup_pager();
diff --git a/revision.c b/revision.c
index 99077af..0f98960 100644
--- a/revision.c
+++ b/revision.c
@@ -493,7 +493,7 @@ void init_revisions(struct rev_info *rev
 	revs->topo_getter = topo_sort_default_getter;
 
 	revs->header_prefix = "";
-	revs->commit_format = CMIT_FMT_RAW;
+	revs->commit_format = CMIT_FMT_DEFAULT;
 
 	diff_setup(&revs->diffopt);
 }

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:38   ` Junio C Hamano
@ 2006-04-15  0:49     ` Linus Torvalds
  2006-04-15  0:56       ` Linus Torvalds
  2006-04-15  1:35       ` Junio C Hamano
  0 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  0:49 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



On Fri, 14 Apr 2006, Junio C Hamano wrote:
> 
> I was thinking long because I had an impression that anything
> based on revision.c interface, if it wants to do a tree-diff on
> the commit stream, would need two different diff options.  One
> is used by revision.c internally so that it can use its own
> add_remove/change for parent pruning, and another to control the
> way diff is run by the user of revision.c.

I think you're right, and I've probably broken "--full-diff" (causing the 
revparse to also use the empty set of paths). Gaah.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:49     ` Linus Torvalds
@ 2006-04-15  0:56       ` Linus Torvalds
  2006-04-15  1:09         ` Linus Torvalds
  2006-04-15  6:18         ` Junio C Hamano
  2006-04-15  1:35       ` Junio C Hamano
  1 sibling, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  0:56 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



On Fri, 14 Apr 2006, Linus Torvalds wrote:
> 
> I think you're right, and I've probably broken "--full-diff" (causing the 
> revparse to also use the empty set of paths). Gaah.

In fact, it's broken path-limited revisions entirely. Duh. We should have 
a test for that, so I would have noticed.

I think we need two diffopt structures there - one for the actual diff, 
and one for the pruning.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:56       ` Linus Torvalds
@ 2006-04-15  1:09         ` Linus Torvalds
  2006-04-15  2:22           ` Junio C Hamano
  2006-04-15  6:18         ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  1:09 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



Ok, fourth time lucky?

		Linus

----
diff --git a/revision.c b/revision.c
index 0f98960..2061ca8 100644
--- a/revision.c
+++ b/revision.c
@@ -246,7 +246,7 @@ int rev_compare_tree(struct rev_info *re
 		return REV_TREE_DIFFERENT;
 	tree_difference = REV_TREE_SAME;
 	if (diff_tree_sha1(t1->object.sha1, t2->object.sha1, "",
-			   &revs->diffopt) < 0)
+			   &revs->pruning) < 0)
 		return REV_TREE_DIFFERENT;
 	return tree_difference;
 }
@@ -269,7 +269,7 @@ int rev_same_tree_as_empty(struct rev_in
 	empty.size = 0;
 
 	tree_difference = 0;
-	retval = diff_tree(&empty, &real, "", &revs->diffopt);
+	retval = diff_tree(&empty, &real, "", &revs->pruning);
 	free(tree);
 
 	return retval >= 0 && !tree_difference;
@@ -476,9 +476,9 @@ static void handle_all(struct rev_info *
 void init_revisions(struct rev_info *revs)
 {
 	memset(revs, 0, sizeof(*revs));
-	revs->diffopt.recursive = 1;
-	revs->diffopt.add_remove = file_add_remove;
-	revs->diffopt.change = file_change;
+	revs->pruning.recursive = 1;
+	revs->pruning.add_remove = file_add_remove;
+	revs->pruning.change = file_change;
 	revs->lifo = 1;
 	revs->dense = 1;
 	revs->prefix = setup_git_directory();
@@ -780,8 +780,10 @@ int setup_revisions(int argc, const char
 		revs->limited = 1;
 
 	if (revs->prune_data) {
-		diff_tree_setup_paths(revs->prune_data, &revs->diffopt);
+		diff_tree_setup_paths(revs->prune_data, &revs->pruning);
 		revs->prune_fn = try_to_simplify_commit;
+		if (!revs->full_diff)
+			diff_tree_setup_paths(revs->prune_data, &revs->diffopt);
 	}
 	if (revs->combine_merges) {
 		revs->ignore_merges = 0;
@@ -790,8 +792,6 @@ int setup_revisions(int argc, const char
 	}
 	if (revs->diffopt.output_format == DIFF_FORMAT_PATCH)
 		revs->diffopt.recursive = 1;
-	if (!revs->full_diff && revs->prune_data)
-		diff_tree_setup_paths(revs->prune_data, &revs->diffopt);
 	diff_setup_done(&revs->diffopt);
 
 	return left;
diff --git a/revision.h b/revision.h
index 9a45986..6eaa904 100644
--- a/revision.h
+++ b/revision.h
@@ -61,8 +61,9 @@ struct rev_info {
 	unsigned long max_age;
 	unsigned long min_age;
 
-	/* paths limiting */
+	/* diff info for patches and for paths limiting */
 	struct diff_options diffopt;
+	struct diff_options pruning;
 
 	topo_sort_set_fn_t topo_setter;
 	topo_sort_get_fn_t topo_getter;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:49     ` Linus Torvalds
  2006-04-15  0:56       ` Linus Torvalds
@ 2006-04-15  1:35       ` Junio C Hamano
  2006-04-15  4:09         ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  1:35 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 14 Apr 2006, Junio C Hamano wrote:
>> 
>> I was thinking long because I had an impression that anything
>> based on revision.c interface, if it wants to do a tree-diff on
>> the commit stream, would need two different diff options.  One
>> is used by revision.c internally so that it can use its own
>> add_remove/change for parent pruning, and another to control the
>> way diff is run by the user of revision.c.
>
> I think you're right, and I've probably broken "--full-diff" (causing the 
> revparse to also use the empty set of paths). Gaah.

Another thing is that some revision.c users are not interested
in taking diff options at all.

I was going to suggest a new structure that captures struct
rev_info, struct log_tree_opt and miscellaneous bits cmd_log
uses such as do_diff, full_diff, etc., and move the option
parser out of cmd_log() to a separate function, and have that
shared across cmd_log(), cmd_show(), cmd_whatchanged(), and
cmd_diff() without affecting any of the existing revision.c
users.  That way, "rev-list --cc HEAD" will remain nonsense.

One nice property your approach has is that it makes
"git diff-tree a..b" magically starts working, unlike what
I suggested above.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues: shallow clone
  2006-04-15  0:25   ` Junio C Hamano
@ 2006-04-15  2:11     ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  2:11 UTC (permalink / raw
  To: Carl Worth; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> Maybe after the update to G happens (which means you now have C,
> F, G but not A B D E commits), the client side could enumerate
> commits on "rev-list ^C G" and cauterize the ones with missing
> parents (in this case, F does not have one of its parents).
> While doing this would help keeping the resulting commit
> ancestry sane, it does not solve the problem of missing blobs
> and trees.  See below.

Actually, it is more involved than the above.

The sender would give you A B D E as well, so we would not be
able to cauterise at F; instead you would do so at A, making
your shallow clone a bit deeper.  When you look at the objects
the parent gave you by running "rev-list ^C G", you would notice
that you do not have any of real parents of A, and add a new
graft.  While you are at it, you would hopefully notice that the
real parent of commit C is something you now have -- so you
remove the graft entry for C.

However, depending on what you care about more, having the
sender not to send the side branch and cauterising the result at
F _might_ be a better thing to do.  This is probably quite
involved and I offhand would not know how to efficiently
compute.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  1:09         ` Linus Torvalds
@ 2006-04-15  2:22           ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  2:22 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Ok, fourth time lucky?
>
> 		Linus

With this, perhaps.

---
diff --git a/revision.c b/revision.c
index 2061ca8..1d26e0d 100644
--- a/revision.c
+++ b/revision.c
@@ -792,6 +792,7 @@ int setup_revisions(int argc, const char
 	}
 	if (revs->diffopt.output_format == DIFF_FORMAT_PATCH)
 		revs->diffopt.recursive = 1;
+	revs->diffopt.abbrev = revs->abbrev;
 	diff_setup_done(&revs->diffopt);
 
 	return left;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  1:35       ` Junio C Hamano
@ 2006-04-15  4:09         ` Linus Torvalds
  2006-04-15  5:06           ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15  4:09 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



On Fri, 14 Apr 2006, Junio C Hamano wrote:
> 
> Another thing is that some revision.c users are not interested
> in taking diff options at all.

Well, it's easy enough to do something like

	if (rev->diff)
		usage(no_diff_cmd_usage);

for something like that.

> I was going to suggest a new structure that captures struct
> rev_info, struct log_tree_opt and miscellaneous bits cmd_log
> uses such as do_diff, full_diff, etc., and move the option
> parser out of cmd_log() to a separate function, and have that
> shared across cmd_log(), cmd_show(), cmd_whatchanged(), and
> cmd_diff() without affecting any of the existing revision.c
> users.  That way, "rev-list --cc HEAD" will remain nonsense.

Well, I actually was going to make git-rev-list just take the diff 
options, and it ends up doing the same thing as "git log" with them. 
There's no real downside.

> One nice property your approach has is that it makes
> "git diff-tree a..b" magically starts working, unlike what
> I suggested above.

Yeah. It just fell out automatically from using the rev-list parsing.

Although, the thing is, once we have a built-in "git diff", there's 
actually little enough reason to ever use the old "git-diff-tree" vs 
"git-diff-index" vs "git-diff-files" at all. 

It might actually be nice to prune some of the tons of git commands. At 
some point, the fact that

	echo bin/git-* | wc -w

returns 122 just makes you go "Hmm..".

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  4:09         ` Linus Torvalds
@ 2006-04-15  5:06           ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  5:06 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Well, it's easy enough to do something like
>
> 	if (rev->diff)
> 		usage(no_diff_cmd_usage);
>
> for something like that.

You're right.  I've swallowed all four patches with a fixlet on
top; thanks.

> Although, the thing is, once we have a built-in "git diff", there's 
> actually little enough reason to ever use the old "git-diff-tree" vs 
> "git-diff-index" vs "git-diff-files" at all. 

True, unless you are writing a Porcelain, that is.

> It might actually be nice to prune some of the tons of git commands. At 
> some point, the fact that
>
> 	echo bin/git-* | wc -w
>
> returns 122 just makes you go "Hmm..".

Yes, but I thought the plan to deal with that was to set gitexecdir
somewhere other than $(prefix)/bin; removing git-diff-* siblings
would be unfriendly to Porcelains.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  0:56       ` Linus Torvalds
  2006-04-15  1:09         ` Linus Torvalds
@ 2006-04-15  6:18         ` Junio C Hamano
  2006-04-15  8:57           ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  6:18 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 14 Apr 2006, Linus Torvalds wrote:
>> 
>> I think you're right, and I've probably broken "--full-diff" (causing the 
>> revparse to also use the empty set of paths). Gaah.
>
> In fact, it's broken path-limited revisions entirely. Duh. We should have 
> a test for that, so I would have noticed.
>
> I think we need two diffopt structures there - one for the actual diff, 
> and one for the pruning.

Although I've already decided to merge it up, there are small
fallout from this.  I've fixed the ones I noticed, but there
probably remain some backward compatibility issues in commands
that I do not usually use.  We'll see.

Also I merged the commit prettyprinter change, but I was hoping
we could instead pass down the commit object and commit format
to places where log message needs to be output and write it out
to the standard output instead of formatting in core.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  6:18         ` Junio C Hamano
@ 2006-04-15  8:57           ` Junio C Hamano
  2006-04-15 11:46             ` Johannes Schindelin
  2006-04-15 16:59             ` Linus Torvalds
  0 siblings, 2 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-15  8:57 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> Although I've already decided to merge it up, there are small
> fallout from this.  I've fixed the ones I noticed, but there
> probably remain some backward compatibility issues in commands
> that I do not usually use.  We'll see.

I am very to sorry to say this, but...


				Pain


"git log" wants default abbrev (to show Merge: lines and
"whatchanged -r" output compactly) while "git diff-tree -r" by
default wants to show full SHA1 unless asked, which means
"memset(revs, 0, sizeof(*revs))" in revision.c::init_revisions()
needs to be defeated by the caller.

"git rev-list" wants to know if any --pretty was specified to
set verbose_header, but there is no way to tell if the user did
not say anything or said --pretty because revs->commit_format
will be CMIT_FMT_DEFAULT either way.  This is the worst breakage
I found so far -- "git rev-list --pretty" no longer works,
although "git rev-list --header" works so you probably did not
notice the breakage with gitk.

Honestly, the longer I look at it, the more I feel that this way
might break more things than it fixes.  I haven't even looked at
blame.c or http-push.c to see what's broken yet.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  8:57           ` Junio C Hamano
@ 2006-04-15 11:46             ` Johannes Schindelin
  2006-04-15 16:59             ` Linus Torvalds
  1 sibling, 0 replies; 81+ messages in thread
From: Johannes Schindelin @ 2006-04-15 11:46 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

Hi,

On Sat, 15 Apr 2006, Junio C Hamano wrote:

> Honestly, the longer I look at it, the more I feel that this way
> might break more things than it fixes.  I haven't even looked at
> blame.c or http-push.c to see what's broken yet.

I do not have time to look at this closely, but it sounds to me like you 
need a two-stage approach:

setup_diff_options(&options);
[... set defaults ...]
handle_cmdline_arguments(&options);
[... possibly check if the user overrode some defaults ...]

I think that the unified option parsing is the right approach.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15  8:57           ` Junio C Hamano
  2006-04-15 11:46             ` Johannes Schindelin
@ 2006-04-15 16:59             ` Linus Torvalds
  2006-04-15 17:17               ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15 16:59 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



On Sat, 15 Apr 2006, Junio C Hamano wrote:
> 
> 				Pain
> 
> "git log" wants default abbrev (to show Merge: lines and
> "whatchanged -r" output compactly) while "git diff-tree -r" by
> default wants to show full SHA1 unless asked, which means
> "memset(revs, 0, sizeof(*revs))" in revision.c::init_revisions()
> needs to be defeated by the caller.

I'd suggest just moving the call to "init_revisions()" out from 
"setup_revisions()" entirely.

So the calling sequence would be something like this:

	init_revisions(&rev);
	.. any localized setup ..
	setup_revisions(&rev);

which isn't really all that painful, and allows us maximal flexibility for 
different defaults etc.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15 16:59             ` Linus Torvalds
@ 2006-04-15 17:17               ` Linus Torvalds
  2006-04-16  8:14                 ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-04-15 17:17 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git



On Sat, 15 Apr 2006, Linus Torvalds wrote:
> 
> So the calling sequence would be something like this:
> 
> 	init_revisions(&rev);
> 	.. any localized setup ..
> 	setup_revisions(&rev);

Btw, I can certainly understand if you don't want to do this before 1.3.x. 
Since there's no actual user-visible advantage to it, it's probably worth 
dropping for now.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Recent unresolved issues
  2006-04-15 17:17               ` Linus Torvalds
@ 2006-04-16  8:14                 ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-04-16  8:14 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Btw, I can certainly understand if you don't want to do this before 1.3.x. 
> Since there's no actual user-visible advantage to it, it's probably worth 
> dropping for now.

Well, I bit the bullet, fixed-up the remaining issues I found in
rev-list in your grand unified version, reverted the revert, and
ported the changes for log/whatchanged/show.

For now this lives in the "next" branch and _will_ not graduate
before 1.3.0, of course.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Unresolved issues #2
  2006-04-14  9:31 Recent unresolved issues Junio C Hamano
                   ` (4 preceding siblings ...)
  2006-04-14 23:52 ` Recent unresolved issues Linus Torvalds
@ 2006-05-04  8:15 ` Junio C Hamano
  2006-05-04  8:32   ` Jakub Narebski
                     ` (4 more replies)
  5 siblings, 5 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-04  8:15 UTC (permalink / raw
  To: git

Here is a list of topics in the recent git traffic that I feel
inadequately addressed.  I've commented on some of them to give
people a feel for what my priorities are.  Somebody might want
to rehash the ones low on my priority list to conclusion with a
concrete proposal if they cared about them enough.

The list is *not* ordered in any way, except that the entries
kept from the previous issue of this message have been pushed
down to the bottom.  I will probably start dropping some entries
that did not get any reaction from the list in future issues of
this message, but for now I kept all of them from the first one.

* Message-ID: <4fb292fa0604290630r19edd7ejf88642e33b350d1d@mail.gmail.com>
  Content-type charset for send-email (Bertrand Jacquin)

  The output from format-patch by default is unmarked, which
  means the commit message part is UTF-8 (by strong convention),
  and the contents of the diff is whatever the contents of the
  file is encoded in.

  David Woodhouse did a patch to allow specifying charset on the
  command line (and default to UTF-8) which is a move in the
  right direction, but Bertrand's system seems to have trouble
  with it.

  I think if we were to do this we probably need to teach
  format-patch to optionally do multi-part.  We may not
  necessarily want to mark the payload to be in the same
  encoding as the commit message (not that git-apply cares -- to
  it, the payload is just 8-bit unencoded text, but we would
  want to protect it from getting mangled by e-mail transport).

* Message-ID: <Pine.LNX.4.64.0604291006270.3701@g5.osdl.org>
  Perhaps "note" field in commit objects are useful?

* Message-ID: <Pine.LNX.4.63.0604301524080.2646@wbgn013.biozentrum.uni-wuerzburg.de>

  An optional "git fetch --store newname URL refspecs..." to
  create an equivalent of remotes file so newname can then be
  used as a short-hand.  I still have somewhat negative reaction
  to it, but I am willing to apply it if there are enough people
  who want this.

-- carried over from the first issue of this list.

* Message-ID: <Pine.LNX.4.64.0604050855080.2550@localhost.localdomain>
  Binary diff output? (Nicolas Pitre)

  I do not think this is needed for our primary audience (the
  kernel project), but I am sure it would be helpful for some
  other projects if we allowed them to exchange patches that
  describe binary file changes via e-mail, so I am not
  dismissing this.

* #irc 2006-04-10
  Shallow clones (Carl Worth).

  The experiment last round did not work out very well, but as
  existing repositories get bigger, and more projects being
  migrated from foreign SCM systems, this would become a
  must-have from would-be-nice-to-have.

  I am beginning to think using "graft" to cauterize history
  for this, while it technically would work, would not be so
  helpful to users, so the design needs to be worked out again.

* Message-ID: <E1FMH3o-0001B5-Dw@jdl.com>
  git status does not distinguish contents changes and mode
  changes; it just says "modified" (Jon Loeliger).

  Unconditionally changing the status letter would break
  Porcelains so we would need an extra option to do this.
  An outline patch has been already prepared -- this perhaps has
  to wait until we sort out the "option parsing" one.

* Message-ID: <tnxmzf9sh7k.fsf@arm.com>
  git could use diff3 instead of merge which is a wrapper around
  diff3. (Catalin Marinas)

  If having "diff3" is a lot more common than having "merge", I
  do not have problem with this; "merge" being a wrapper to
  "diff3", people who have been happy with the current code
  would certainly have "diff3" installed so changing to "diff3"
  would not break them.

* Message-ID: <81b0412b0603020649u99a2035i3b8adde8ddce9410@mail.gmail.com>
  Windows problems summary (Alex Riesen)

  A good list to keep in mind.

* Message-ID: <Pine.LNX.4.64.0604030730040.3781@g5.osdl.org>
  Huge packfiles (Linus Torvalds)

  Because I do not think asking users to break up packs to
  manageable and mmap()able size is too much to ask, I would not
  be advocating for updating the pack idx to 64-bit offset and
  mmap()ing parts of a packfile, at least too strongly.

  However, we currently lack tool support or recepe for users
  with such a repository to easily break up packs.

* Message-ID: <1143856098.3555.48.camel@dv>
  Per branch property, esp. where to merge from (Pavel Roskin)

  This involves user-level "world model" design, which is more
  Porcelainish than Plumbing, and as people know I do not do
  Porcelain well; interested parties need to come up with what
  they want and how they want to use it.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
@ 2006-05-04  8:32   ` Jakub Narebski
  2006-05-04  9:14     ` Junio C Hamano
  2006-05-04  9:58   ` Petr Baudis
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Jakub Narebski @ 2006-05-04  8:32 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> * #irc 2006-04-10
>   Shallow clones (Carl Worth).
> 
>   The experiment last round did not work out very well, but as
>   existing repositories get bigger, and more projects being
>   migrated from foreign SCM systems, this would become a
>   must-have from would-be-nice-to-have.
> 
>   I am beginning to think using "graft" to cauterize history
>   for this, while it technically would work, would not be so
>   helpful to users, so the design needs to be worked out again.

Perhaps use comment for marking graft as cauterizing history?

There was also talk about proposed git-splithist, which would move some of
the history to other (historical, archive) repository.

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  8:32   ` Jakub Narebski
@ 2006-05-04  9:14     ` Junio C Hamano
  2006-05-04  9:26       ` Jakub Narebski
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-05-04  9:14 UTC (permalink / raw
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

>>   I am beginning to think using "graft" to cauterize history
>>   for this, while it technically would work, would not be so
>>   helpful to users, so the design needs to be worked out again.
>
> Perhaps use comment for marking graft as cauterizing history?

?

> There was also talk about proposed git-splithist, which would move some of
> the history to other (historical, archive) repository.

I stayed out from that discussion, but my impression was that
you could essentially do the same thing as what Linus did when
he started the recent kernel history since v2.6.12-rc2 without
any tool support.

The older kernel history from BKCVS was resurrected later by
independent parties and Linus's history can be grafted onto it,
but if you have an existing history stored in git, you could do:
(1) take a snapshot of the tip of your development with "git
tar-tree HEAD"; (2) extract it into an empty repository and
start a new history; (3) build on top of the truncated history;
and (4) graft that onto the history that stopped at (1), which
you tentatively abandoned, as needed.



	

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  9:14     ` Junio C Hamano
@ 2006-05-04  9:26       ` Jakub Narebski
  0 siblings, 0 replies; 81+ messages in thread
From: Jakub Narebski @ 2006-05-04  9:26 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> Jakub Narebski <jnareb@gmail.com> writes:
> 
>>>   I am beginning to think using "graft" to cauterize history
>>>   for this, while it technically would work, would not be so
>>>   helpful to users, so the design needs to be worked out again.
>>
>> Perhaps use comment for marking graft as cauterizing history?
> 
> ?

For example:

# begin shallow clone
<sha1 of commit 1> # no parents... - cut-off commit
<sha1 of commit 2>
...
<sha1 of commmit n>
# end shallow clone

I don't think it is very good idea, though...

>> There was also talk about proposed git-splithist, which would move some
>> of the history to other (historical, archive) repository.
> 
> I stayed out from that discussion, but my impression was that
> you could essentially do the same thing as what Linus did when
> he started the recent kernel history since v2.6.12-rc2 without
> any tool support.
> 
> The older kernel history from BKCVS was resurrected later by
> independent parties and Linus's history can be grafted onto it,
> but if you have an existing history stored in git, you could do:
> (1) take a snapshot of the tip of your development with "git
> tar-tree HEAD"; (2) extract it into an empty repository and
> start a new history; (3) build on top of the truncated history;
> and (4) graft that onto the history that stopped at (1), which
> you tentatively abandoned, as needed.

I have thought about splitting not at current tip(s), but for example at 1
year ago. Current repository would have history cautherized using grafts
(although it would be nice to have option to omit grafts and reach to
historic repository), and archive/history repository ending with commits up
to (but not including) the cut-off (cauterization) points.

IIRC the problem with 'shallow clone' was telling which commits the clone
has, and how to join commits and recauterize history.

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
  2006-05-04  8:32   ` Jakub Narebski
@ 2006-05-04  9:58   ` Petr Baudis
  2006-05-04 15:45     ` Pavel Roskin
  2006-05-04 17:01   ` Unresolved issues #2 (shallow clone again) Carl Worth
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Petr Baudis @ 2006-05-04  9:58 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, mj

Dear diary, on Thu, May 04, 2006 at 10:15:03AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> * Message-ID: <1143856098.3555.48.camel@dv>
>   Per branch property, esp. where to merge from (Pavel Roskin)
> 
>   This involves user-level "world model" design, which is more
>   Porcelainish than Plumbing, and as people know I do not do
>   Porcelain well; interested parties need to come up with what
>   they want and how they want to use it.

Oh, my holey memory. In Cogito, I have just implemented a solution
suggested by Martin Mares, which is pretty simple, non-obtrusive
and will work equally fine with remotes as well as remote branches:

	if [ $branch != master ] && [ -s .git/branches/$branch-origin ]
		origin=.git/branches/$branch-origin
	else
		origin=.git/branches/origin
	fi

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  9:58   ` Petr Baudis
@ 2006-05-04 15:45     ` Pavel Roskin
  0 siblings, 0 replies; 81+ messages in thread
From: Pavel Roskin @ 2006-05-04 15:45 UTC (permalink / raw
  To: Petr Baudis; +Cc: Junio C Hamano, git, mj

Hello, Petr!

On Thu, 2006-05-04 at 11:58 +0200, Petr Baudis wrote:

> Oh, my holey memory. In Cogito, I have just implemented a solution
> suggested by Martin Mares, which is pretty simple, non-obtrusive
> and will work equally fine with remotes as well as remote branches:
> 
> 	if [ $branch != master ] && [ -s .git/branches/$branch-origin ]
> 		origin=.git/branches/$branch-origin
> 	else
> 		origin=.git/branches/origin
> 	fi

Isn't ".git/branches" obsolete, at least in git?  I'm surprised it's
still referenced in git sources.

What is the future of ".git/branches"?  Is it becoming a Cogito specific
branch database?  Or is it now a database or branch dependencies?

I would prefer to have one single standard for branch origins that could
be used by git, StGIT and Cogito.  Using a location that is obsolete
outside Cogito is probably the worst possible approach.  I'd rather have
a separate directory, e.g. .git/origins or something.

Alternatively, we could reuse .git/refs by having files with "Pull:" but
without "URI:", e.g.

$ cat .git/refs/branch
Pull: branch-origin:branch

-- 
Regards,
Pavel Roskin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
  2006-05-04  8:32   ` Jakub Narebski
  2006-05-04  9:58   ` Petr Baudis
@ 2006-05-04 17:01   ` Carl Worth
  2006-05-05  0:25     ` Junio C Hamano
  2006-05-04 20:41   ` Unresolved issues #2 Daniel Barkalow
  2006-05-09 11:40   ` David Woodhouse
  4 siblings, 1 reply; 81+ messages in thread
From: Carl Worth @ 2006-05-04 17:01 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2913 bytes --]

On Thu, 04 May 2006 01:15:03 -0700, Junio C Hamano wrote:
> * #irc 2006-04-10
>   Shallow clones (Carl Worth).
> 
>   The experiment last round did not work out very well, but as
>   existing repositories get bigger, and more projects being
>   migrated from foreign SCM systems, this would become a
>   must-have from would-be-nice-to-have.
> 
>   I am beginning to think using "graft" to cauterize history
>   for this, while it technically would work, would not be so
>   helpful to users, so the design needs to be worked out again.

I've been meaning to follow up with some thoughts on this topic, so
thanks for the tickler.

For the one use case I had, (track latest tree), I had thrown out the
idea of using "faked", parent-less commit objects to point to the tree
of interest. Junio pointed out that there's no protocol to learn the
name of a remote commit's tree from the name of the commit. I worked
around that by simply making the parent-less commit object on the
server side, (branch name of "master-shallow", say).

That seemed to work just fine, and if someone really wanted to do
this, they could use a hook to maintain the master-shallow branch,
and no change to git itself would be needed. But there's a very
minimal amount of interesting functionality in this, and it's not
clear that it's much better than git-tar-tree. So I'm considering that
idea dead.

Meanwhile, a more general ability to use shallow clones would still be
very useful. I think what I'd like to be able to do is to pass
rev-list limiting options (--max-count, --max-age via --since,
etc.). That would limit the expansion of the WANT commits, and then
the existing logic to compute the necessary objects needed to satisfy
the list of desired commits should do the right thing.

Then, in order for this to actually be useful, when returning objects
from a limited fetch like this, the server should provide a list of
commits that should be noted as cauterized, (whether through the
existing grafts mechanism or otherwise).

Additionally, when doing a fetch into a tree that has any such
cauterized commits, the client must also provide its list of
cauterized commits. So the conversation changes from "I WANT
<fetch-heads> and I HAVE <heads>" to one of "I WANT <fetch-heads>, and
I HAVE <heads>, except that I'm MISSING <cauterized-commits>".

Finally, whenever a fetch receives an commit object that is in its
list of cauterized commits, it should remove that commit from the
list. This allows a shallow clone to be naturally migrated to
something unshallow. And the user can do this as incrementally as
desired based on the need to see more history:

get a bit:
	git fetch somewhere --since=2.weeks.ago

then a bit more:
	git fetch somewhere --since=1.year.ago

then get it all:
	git fetch somewhere

Maybe that's no different from Junio's original proposal. If not, what
do you see in the above that wouldn't work?

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
                     ` (2 preceding siblings ...)
  2006-05-04 17:01   ` Unresolved issues #2 (shallow clone again) Carl Worth
@ 2006-05-04 20:41   ` Daniel Barkalow
  2006-05-04 21:33     ` Linus Torvalds
  2006-05-09 11:40   ` David Woodhouse
  4 siblings, 1 reply; 81+ messages in thread
From: Daniel Barkalow @ 2006-05-04 20:41 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Thu, 4 May 2006, Junio C Hamano wrote:

> * Message-ID: <Pine.LNX.4.63.0604301524080.2646@wbgn013.biozentrum.uni-wuerzburg.de>
> 
>   An optional "git fetch --store newname URL refspecs..." to
>   create an equivalent of remotes file so newname can then be
>   used as a short-hand.  I still have somewhat negative reaction
>   to it, but I am willing to apply it if there are enough people
>   who want this.

I was just about to suggest something for this general use. It's currently 
kind of a pain to deal with the situation where you've got stuff on your 
workstation that you want to version control in a shared repository on a 
server.

I think it shouldn't be on fetch, though; I think a "git remote" command 
for describing, creating, and modifying remotes would be better, since you 
also sometimes want to add a "Push:" line.

Maybe:

 git remote <name>: Print info about <name>
 git remote add <name> <URL> [<direction> ...]: create a remote
 git remote <name> <direction> ...: modify a remote

 where <direction> is either:
  pull <remote> <local> or
  push <local> <remote>

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04 20:41   ` Unresolved issues #2 Daniel Barkalow
@ 2006-05-04 21:33     ` Linus Torvalds
  2006-05-06  5:58       ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-04 21:33 UTC (permalink / raw
  To: Daniel Barkalow; +Cc: Junio C Hamano, Git Mailing List



On Thu, 4 May 2006, Daniel Barkalow wrote:
> 
> I think it shouldn't be on fetch, though; I think a "git remote" command 
> for describing, creating, and modifying remotes would be better, since you 
> also sometimes want to add a "Push:" line.

I don't think this is wrong, but I think it's more important to try to 
decide on how we want to represent this information first, and stabilize 
that.

I realize that git has gotten a lot more porcelainish over time, but at 
the same time, now you're really starting to argue about syntax that 
really ends up being often a feature of the development environment. If 
you did development using an IDE that knows about git, I think the 
"remote" information ends up being not necessarily a git command at all, 
but really an interface in the IDE.

I'm actually growing pretty fond of the config file interfaces that Dscho 
is pushing. I really like the idea of "git pull" doing different things 
depending on which branch is active at the time, because different 
branches really can have different sources they come from.

Always pulling from the same default source seems wrong, and having to 
remember whose source some branch is associated with is just not all that 
user-friendly, but perhaps more importantly, it's also going to result in 
people making mistakes, pulling from the wrong branch (because they didn't 
think about where they were), and then having strange merges that they 
might not notice were wrong until it's too late and they pushed the result 
out.

So Johannes' patches seem to move into that direction, and having it all 
in the config file actually seems to be quite readable.

And that, in turn, may mean that a lot of porcelains really only care 
about that syntax, and then they may update the config file any way they 
please (whether by hand, or by using "git repo-config" or by using "git 
remote").

So I'd argue that (a) yes, we do want to have the "proto porcelain" that 
sets remote branch information without the user having to know the magic 
"git repo-config" incantation, or know which file in .git/remotes/ to 
edit, but that (b) it's even more important to try to decide on what the 
remote description format _is_.

I personally have just two preferences:

 - I'd like each branch I'm on to have a "default source" for pulling (and 
   _maybe_ for pushing too). I'd like to just say "git pull", and it would 
   automatically select the appropriate thing to pull from.

 - maybe the same per-branch thing for "push", but more importantly for 
   me, I like to push to multiple destinations, and I'd like the 
   description format to be sane. I think it may already be sane in the 
   form it is in now (supporting both config file _and_ .git/remotes/ 
   formats), I'd just like us to decide on exactly what the meaning is, 
   and hopefully get to the point where we can tell porcelain how to use 
   that meaning to their advantage (and not change it)

Others may disagree, or (equally importantly), may have additional 
preferences. We should try to find something that works for everybody, and 
that is easy to work with.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-04 17:01   ` Unresolved issues #2 (shallow clone again) Carl Worth
@ 2006-05-05  0:25     ` Junio C Hamano
  2006-05-05  5:17       ` Martin Langhoff
                         ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-05  0:25 UTC (permalink / raw
  To: Carl Worth; +Cc: git

Carl Worth <cworth@cworth.org> writes:

> ... So the conversation changes from "I WANT
> <fetch-heads> and I HAVE <heads>" to one of "I WANT <fetch-heads>, and
> I HAVE <heads>, except that I'm MISSING <cauterized-commits>".
>
> Finally, whenever a fetch receives an commit object that is in its
> list of cauterized commits, it should remove that commit from the
> list. This allows a shallow clone to be naturally migrated to
> something unshallow. And the user can do this as incrementally as
> desired based on the need to see more history:
>
> get a bit:
> 	git fetch somewhere --since=2.weeks.ago
>
> then a bit more:
> 	git fetch somewhere --since=1.year.ago
>
> then get it all:
> 	git fetch somewhere
>
> Maybe that's no different from Junio's original proposal. If not, what
> do you see in the above that wouldn't work?

Lack of actual code to do all that ;-)

Jokes aside, I think listing the updated conversation elements
like you did above is a good step forward.

The vocabulary we would want from the requestor side is probably
(at least):

	I WANT to have these
        I HAVE these
        I'm MISSING these
        Don't bother with these this time around (--since, ^v2.6.16, ...)

I am not sure how we would want to encode the last one and have
it used by rev-list on the upload-pack end safely and sanely.

And the responder side needs to be able to say, "Now you are
MISSING these, remember it and tell me you are missing them next
time you make a request".  That would be, in the simplest case,
a list of commit IDs to cauterize, but I am not sure what is the
right way to come up with that list.  Especially I do not know
if --boundary would/should work with --objects.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05  0:25     ` Junio C Hamano
@ 2006-05-05  5:17       ` Martin Langhoff
  2006-05-05  5:23         ` Carl Worth
  2006-05-05 15:10       ` Linus Torvalds
  2006-05-07 13:30       ` Jakub Narebski
  2 siblings, 1 reply; 81+ messages in thread
From: Martin Langhoff @ 2006-05-05  5:17 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Carl Worth, git

On 5/5/06, Junio C Hamano <junkio@cox.net> wrote:
> The vocabulary we would want from the requestor side is probably
> (at least):
>
>         I WANT to have these
>         I HAVE these
>         I'm MISSING these
>         Don't bother with these this time around (--since, ^v2.6.16, ...)

Thinking... does the MISSING part matter at all? It seems that what
really matters are the "ignore rules". The pull may bring in a new
merge of a long-running branch, whose mergebase falls out of the
ignore rules.

In that case, the server should apply the ignore rules. Except that
later merges in the local repo would perhaps have to deal with missing
part of the history. I suspect it should refuse to merge something we
don't have all the merging parts for.

cheers,


martin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05  5:17       ` Martin Langhoff
@ 2006-05-05  5:23         ` Carl Worth
  2006-05-05  5:48           ` Jakub Narebski
  0 siblings, 1 reply; 81+ messages in thread
From: Carl Worth @ 2006-05-05  5:23 UTC (permalink / raw
  To: Martin Langhoff; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 1119 bytes --]

On Fri, 5 May 2006 17:17:10 +1200, "Martin Langhoff" wrote:
> >         I WANT to have these
> >         I HAVE these
> >         I'm MISSING these
> >         Don't bother with these this time around (--since, ^v2.6.16, ...)
> 
> Thinking... does the MISSING part matter at all?

Yes.

Imagine doing a shallow clone and then fetching a tree that includes a
blob that existed before MISSING. If we say HAVE without MISSING then
the server will not send that blob and we'll be left with a broken
tree.

> In that case, the server should apply the ignore rules. Except that
> later merges in the local repo would perhaps have to deal with missing
> part of the history. I suspect it should refuse to merge something we
> don't have all the merging parts for.

Yeah, shallow clones can shake up the conventions a bit. It's
definitely common for a repository to only have a single parent-less
commit, such that there is always an identifiable merge base for any
pair of revisions. Shallow clones would make (effectively) parent-less
commits much more common.

Should be fun to see what things fall over with this...

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05  5:23         ` Carl Worth
@ 2006-05-05  5:48           ` Jakub Narebski
  0 siblings, 0 replies; 81+ messages in thread
From: Jakub Narebski @ 2006-05-05  5:48 UTC (permalink / raw
  To: git

Carl Worth wrote:

> On Fri, 5 May 2006 17:17:10 +1200, "Martin Langhoff" wrote:

>> In that case, the server should apply the ignore rules. Except that
>> later merges in the local repo would perhaps have to deal with missing
>> part of the history. I suspect it should refuse to merge something we
>> don't have all the merging parts for.
> 
> Yeah, shallow clones can shake up the conventions a bit. It's
> definitely common for a repository to only have a single parent-less
> commit, such that there is always an identifiable merge base for any
> pair of revisions. Shallow clones would make (effectively) parent-less
> commits much more common.

I wonder if it would be possible for git to:
a) as for a fetch which would bring all the commits up to the merge base
   (and merge base has to be calculated on the server side I think),
   i.e. give command to use (for fetch or for force baseless merge)
b) fetch the commits
c) do merge
d) optionally re-cauterize history again

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05  0:25     ` Junio C Hamano
  2006-05-05  5:17       ` Martin Langhoff
@ 2006-05-05 15:10       ` Linus Torvalds
  2006-05-05 15:18         ` Jakub Narebski
  2006-05-05 15:31         ` Carl Worth
  2006-05-07 13:30       ` Jakub Narebski
  2 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-05-05 15:10 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Carl Worth, git



On Thu, 4 May 2006, Junio C Hamano wrote:
> 
> Jokes aside, I think listing the updated conversation elements
> like you did above is a good step forward.
> 
> The vocabulary we would want from the requestor side is probably
> (at least):
> 
> 	I WANT to have these
>         I HAVE these
>         I'm MISSING these
>         Don't bother with these this time around (--since, ^v2.6.16, ...)

Actually, I think we can do something simpler that _most_ people might be 
happy with.

Namely just have a mode to "git-send-pack" that uses the "--no-walk" flag 
to generate the object list to send.

What that does is to never walk the object history: so it will just use 
the "I HAVE THESE" and "I WANT THESE" commit references to directly 
generate the list of commits, and then walks the trees to generate the 
list of trees/blobs that differ between the particular end-points.

We already have the "no_walk" flag internally, we just don't expose it.

So what you'd get is a _really_ cut down history that doesn't contain any 
commit history at all (just distinct "points in commit history time"), but 
that _does_ contain all the objects that the commits point to.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05 15:10       ` Linus Torvalds
@ 2006-05-05 15:18         ` Jakub Narebski
  2006-05-05 15:59           ` Linus Torvalds
  2006-05-05 15:31         ` Carl Worth
  1 sibling, 1 reply; 81+ messages in thread
From: Jakub Narebski @ 2006-05-05 15:18 UTC (permalink / raw
  To: git

Linus Torvalds wrote:

> So what you'd get is a _really_ cut down history that doesn't contain any
> commit history at all (just distinct "points in commit history time"), but
> that _does_ contain all the objects that the commits point to.

So we would get 'skin-deep clone' rather than 'shallow' one?

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05 15:10       ` Linus Torvalds
  2006-05-05 15:18         ` Jakub Narebski
@ 2006-05-05 15:31         ` Carl Worth
  1 sibling, 0 replies; 81+ messages in thread
From: Carl Worth @ 2006-05-05 15:31 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

On Fri, 5 May 2006 08:10:50 -0700 (PDT), Linus Torvalds wrote:
> 
> Namely just have a mode to "git-send-pack" that uses the "--no-walk" flag 
> to generate the object list to send.
> 
> What that does is to never walk the object history: so it will just use 
> the "I HAVE THESE" and "I WANT THESE" commit references to directly 
> generate the list of commits, and then walks the trees to generate the 
> list of trees/blobs that differ between the particular end-points.

Oh, I think that's a great idea. I had proposed cutting the WANT list
down to a single commit, but I wasn't clever enough to think to also
cut down the HAVE walking to solve the problems that would have been
introduced by shallow clones.

And I think the resulting behavior is quite reasonable. I think one
can argue a sort of zero-one-infinity rule here. With history, either
you don't care about any of it, or else you really should care about
all of it.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05 15:18         ` Jakub Narebski
@ 2006-05-05 15:59           ` Linus Torvalds
  2006-05-06  6:23             ` Martin Langhoff
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-05 15:59 UTC (permalink / raw
  To: Jakub Narebski; +Cc: git



On Fri, 5 May 2006, Jakub Narebski wrote:

> Linus Torvalds wrote:
> 
> > So what you'd get is a _really_ cut down history that doesn't contain any
> > commit history at all (just distinct "points in commit history time"), but
> > that _does_ contain all the objects that the commits point to.
> 
> So we would get 'skin-deep clone' rather than 'shallow' one?

Well, it's really shallow, but perhaps more importantly, I think it should 
be really easy, and have totally unambiguous semantics. Never any question 
of how far back to go, and I think we already really do have all the 
support logic for doing it.

Now, we don't actually expose the internal "no_walk" flag with a 
"--no-walk" command line argument parsing, but that's a one-liner.

There's another approach that might be a bit friendlier, which is again to 
walk only the objects of the WANT/HAVE things, but then _do_ walk the 
history for just commit objects. Something close to what I think the http 
fetch thing does if you pass it "-c -t". That too shouldn't require too 
much extra complexity, and it would mean that "git log" at least works.

Of course, that would require another slight difference to "rev-list.c", 
where we'd only recurse into trees of selected commit objects (ie we'd 
have to mark the HAVE/WANT commits specially, but it's not exactly 
complex either).

Of course, the complexity of _both_ of these approaches is really in the 
fsck stage, and all the crud you need to then do other things with these 
pared-down repos. For example, do you allow cloning? And do you just 
automatically notice that you're cloning a shallow repo, and only do a 
shallow clone. Etc etc..

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04 21:33     ` Linus Torvalds
@ 2006-05-06  5:58       ` Junio C Hamano
  2006-05-06 15:26         ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-05-06  5:58 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Daniel Barkalow, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> I'm actually growing pretty fond of the config file interfaces that Dscho 
> is pushing. I really like the idea of "git pull" doing different things 
> depending on which branch is active at the time, because different 
> branches really can have different sources they come from.

> Always pulling from the same default source seems wrong,...

> So Johannes' patches seem to move into that direction, and having it all 
> in the config file actually seems to be quite readable.

I share the same reasoning and that is why I am carrying the
series in "next".  I think per branch attributes are wonderful
things.

> So I'd argue that (a) yes, we do want to have the "proto porcelain" that 
> sets remote branch information without the user having to know the magic 
> "git repo-config" incantation, or know which file in .git/remotes/ to 
> edit, but that (b) it's even more important to try to decide on what the 
> remote description format _is_.

Is it format you care about or the semantics?

> I personally have just two preferences:
>
>  - I'd like each branch I'm on to have a "default source" for pulling (and 
>    _maybe_ for pushing too). I'd like to just say "git pull", and it would 
>    automatically select the appropriate thing to pull from.
>
>  - maybe the same per-branch thing for "push", but more importantly for 
>    me, I like to push to multiple destinations, and I'd like the 
>    description format to be sane. I think it may already be sane in the 
>    form it is in now (supporting both config file _and_ .git/remotes/ 
>    formats), I'd just like us to decide on exactly what the meaning is, 
>    and hopefully get to the point where we can tell porcelain how to use 
>    that meaning to their advantage (and not change it)
>
> Others may disagree, or (equally importantly), may have additional 
> preferences. We should try to find something that works for everybody, and 
> that is easy to work with.

In my day job, I maintain a base code for a generic application
in "master", various topics, mostly branched from "master" but
sometimes from another topic branch, and one branch each per
customer installation, which pulls from the master, topics and
contains specific customizations.  While on master or any one of
generic topic branch, I need to remember not to pull from
installation branches.  For that matter, the installation
branches should not be pulled into anything else.  So not just
"this branch usually merges from there", but "this branch should
not be merged into others" (mark "installation branches" as
such), and "this branch should never merge from that one" (mark
"master" with "installation branches") would prevent mistakes.

One thing I noticed in "What's in libata.git" Jeff did by
mimicking my "What's in git.git" was that the description for
each topic branch included where it branched from (iow, what
other branch it builds on).  This is sometimes derivable, but
having it as a property for a branch is very handy.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05 15:59           ` Linus Torvalds
@ 2006-05-06  6:23             ` Martin Langhoff
  2006-05-06  7:10               ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Martin Langhoff @ 2006-05-06  6:23 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Jakub Narebski, git

On 5/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Of course, that would require another slight difference to "rev-list.c",
> where we'd only recurse into trees of selected commit objects (ie we'd
> have to mark the HAVE/WANT commits specially, but it's not exactly
> complex either).

Would it make sense to make all the shallow clone clone machinery walk
everything and trim only blob objects? In that case, all the machinery
that walks commits/trees would remain intact -- we only have to deal
with the case of not having blob objects, which affects less
codepaths.

It means that for a merge or checkout involving stuff we "don't have",
it's trivial to know we are missing, and so we can  attempt a fetch of
the missing objects or tell the user how to request them them before
retrying.

And in any case commits and trees are lightweight and compress well...

> Of course, the complexity of _both_ of these approaches is really in the
> fsck stage, and all the crud you need to then do other things with these
> pared-down repos. For example, do you allow cloning? And do you just
> automatically notice that you're cloning a shallow repo, and only do a
> shallow clone. Etc etc..

Definitely.

cheers,


martin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-06  6:23             ` Martin Langhoff
@ 2006-05-06  7:10               ` Junio C Hamano
  2006-05-07  6:08                 ` Martin Langhoff
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-05-06  7:10 UTC (permalink / raw
  To: Martin Langhoff; +Cc: git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> On 5/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> Of course, that would require another slight difference to "rev-list.c",
>> where we'd only recurse into trees of selected commit objects (ie we'd
>> have to mark the HAVE/WANT commits specially, but it's not exactly
>> complex either).
>
> Would it make sense to make all the shallow clone clone machinery walk
> everything and trim only blob objects? In that case, all the machinery
> that walks commits/trees would remain intact -- we only have to deal
> with the case of not having blob objects, which affects less
> codepaths.
>
> It means that for a merge or checkout involving stuff we "don't have",
> it's trivial to know we are missing, and so we can  attempt a fetch of
> the missing objects or tell the user how to request them them before
> retrying.
>
> And in any case commits and trees are lightweight and compress well...

Commit maybe, but is this based on a hard fact?  

Earlier Linus said something about "git log" working on
commit-only copy, but obviously you would want at least trees
for the path limiting part to work, so having commits and trees
would be handy, but my impression was that at least for deep
project like the kernel trees tend to be nonnegligible (a commit
consists of 18K paths and 1200 trees or something like that).

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06  5:58       ` Junio C Hamano
@ 2006-05-06 15:26         ` Linus Torvalds
       [not found]           ` <20060506113549.48e553d1.seanlkml@sympatico.ca>
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-06 15:26 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Daniel Barkalow, Git Mailing List



On Fri, 5 May 2006, Junio C Hamano wrote:
>
> > So I'd argue that (a) yes, we do want to have the "proto porcelain" that 
> > sets remote branch information without the user having to know the magic 
> > "git repo-config" incantation, or know which file in .git/remotes/ to 
> > edit, but that (b) it's even more important to try to decide on what the 
> > remote description format _is_.
> 
> Is it format you care about or the semantics?

I _personally_ care about the semantics, but not very deeply - since I 
tend to actually have just one main branch, and a couple of throw-away 
ones if I ended up working on something.

But I think that for this thing to become useful, we want to care about 
the format - or at least the interface to the different users (with the 
acknowledgement that "users" should often be porcelain above us).

Right now we've basically had people hand-editing the remotes files, and I 
think cogito still uses the older branches format that came from cogito in 
the first place. I think we should just try to decide on a config file 
format, and make it easy for cogito etc to use it.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
       [not found]           ` <20060506113549.48e553d1.seanlkml@sympatico.ca>
@ 2006-05-06 15:35             ` sean
  2006-05-06 16:30               ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: sean @ 2006-05-06 15:35 UTC (permalink / raw
  To: Linus Torvalds; +Cc: junkio, barkalow, git

On Sat, 6 May 2006 08:26:36 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> I _personally_ care about the semantics, but not very deeply - since I 
> tend to actually have just one main branch, and a couple of throw-away 
> ones if I ended up working on something.
> 
> But I think that for this thing to become useful, we want to care about 
> the format - or at least the interface to the different users (with the 
> acknowledgement that "users" should often be porcelain above us).
> 
> Right now we've basically had people hand-editing the remotes files, and I 
> think cogito still uses the older branches format that came from cogito in 
> the first place. I think we should just try to decide on a config file 
> format, and make it easy for cogito etc to use it.

Linus,

Wondering why you feel so strongly that most "users" shouldn't be real people.
What is wrong with continuing to make git easier for developers to use without
needing any extra software?

Sean

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 15:35             ` sean
@ 2006-05-06 16:30               ` Linus Torvalds
       [not found]                 ` <20060506125323.544c35db.seanlkml@sympatico.ca>
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-06 16:30 UTC (permalink / raw
  To: sean; +Cc: junkio, barkalow, git



On Sat, 6 May 2006, sean wrote:
> 
> Wondering why you feel so strongly that most "users" shouldn't be real people.
> What is wrong with continuing to make git easier for developers to use without
> needing any extra software?

Basically, it boils down to the end result.

If you design things for "people", then things tend to become hard to 
automate, and it's hard to make wrappers around it. Maybe you've even made 
the interfaces interactive, and thus any wrappers around it are simply 
screwed, or need to do insane things.

On the other hand, if you design things for automation, doing a "people 
wrapper" that uses the automation should be trivial if the design is even 
remotely any good at all.

In other words: you should always design things for automation, and 
consider the "people interface" to be be just _one_ wrapper layer among 
many.

This has worked really well in git. The whole system was designed from the 
start to be all about scripting and automation, and the "people wrappers" 
tend to be trivial scripts around it.

This was even more obvious when we had a number of basically one-liner 
scripts like "git log", which just did some trivial wrapping around

	git-rev-list | git-diff-tree --stdin | $PAGER

(Now we still have that trivial wrapper, but you just need to look into C 
code to see it, so it's not _as_ obviously trivial).

Contrast this with going the other way: if you talk about the interfaces 
that _people_ want first, you immediately start doing pretty-printing, 
nice parsing, maybe interactive stuff that asks questions. Nice GUIs. And 
the end result is CRAP. Exactly because it lost its ability to be generic.

To some degree, this is the fundamental difference between the Windows and 
the UNIX mindset. At least it used to be.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
       [not found]                 ` <20060506125323.544c35db.seanlkml@sympatico.ca>
@ 2006-05-06 16:53                   ` sean
  2006-05-06 17:20                     ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: sean @ 2006-05-06 16:53 UTC (permalink / raw
  To: Linus Torvalds; +Cc: junkio, barkalow, git

On Sat, 6 May 2006 09:30:48 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> Basically, it boils down to the end result.
> 
> If you design things for "people", then things tend to become hard to 
> automate, and it's hard to make wrappers around it. Maybe you've even made 
> the interfaces interactive, and thus any wrappers around it are simply 
> screwed, or need to do insane things.

Okay, I mistook the scope of you comments to apply to all of git rather than
as a reminder that we can't forget about the toolkit design.  So I take it
you're not at all against git including higher level user commands; just so
long as they're built on top of lower level toolkit commands that other
porcelain can use as well.

In this particular case I see "git repo-config" as the low level command that
any porcelain can use to access the remotes information and the proposed
"git remotes" as a simple convenience wrapper on top of this.  Of course,
everyone has to agree on the config file format; but that is true whether
the human-friendly wrapper exists or not.

Sean

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 16:53                   ` sean
@ 2006-05-06 17:20                     ` Linus Torvalds
  2006-05-06 21:16                       ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-06 17:20 UTC (permalink / raw
  To: sean, Johannes Schindelin; +Cc: Junio C Hamano, barkalow, Git Mailing List



On Sat, 6 May 2006, sean wrote:
>
> Okay, I mistook the scope of you comments to apply to all of git rather than
> as a reminder that we can't forget about the toolkit design.  So I take it
> you're not at all against git including higher level user commands; just so
> long as they're built on top of lower level toolkit commands that other
> porcelain can use as well.

Correct. I think we've been able to handle that balance particularly well 
so far. Or maybe the porcelains don't complain enough.

> In this particular case I see "git repo-config" as the low level command that
> any porcelain can use to access the remotes information and the proposed
> "git remotes" as a simple convenience wrapper on top of this.  Of course,
> everyone has to agree on the config file format; but that is true whether
> the human-friendly wrapper exists or not.

I agree, but my point is that in order for a porcelain to _use_ 
"repo-config", the config file format needs to be defined somewhere, and 
we need to tell people that it's not changing. Are we there yet?

That was my argument for why we should concentrate not on what the user 
wrapper should be named, but why we should look at what the low-level 
meaning of these things are.

Finally, I think "git repo-config" is buggy. Try with this .config file:

	[user]
		name = Bozo the Clown
		email = bozo@circus.com

	[core]
		filemode = true

	[merge]
		summary = true

and then do

	git repo-config core.gitproxy 'dummy example'

and look where it ends up. For me, it ends up at the end, in the "[merge]" 
section, which is obviously bogus.

So we'd really be screwing with porcelain if we made them use this ;)

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 17:20                     ` Linus Torvalds
@ 2006-05-06 21:16                       ` Junio C Hamano
  2006-05-06 21:33                         ` Johannes Schindelin
  2006-05-07  0:41                         ` Jakub Narebski
  0 siblings, 2 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-06 21:16 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git, sean, Johannes Schindelin

Linus Torvalds <torvalds@osdl.org> writes:

> Finally, I think "git repo-config" is buggy. Try with this .config file:
> ...
> So we'd really be screwing with porcelain if we made them use this ;)

Thanks Linus and Sean for bringing this up and fixing it.

I have a vague feeling that this may not be the last breakage of
the repo-config command.  My first reaction to the repo-config
code was "eek".  It tries to reuse as much the existing material
as possible -- I understand it was done that way in order to
preserve the comments and blank lines from the original config
file intact, but it just felt very error prone (demonstrated by
cases like this and the other one Sean brought up) and generally
wrong.

It might make sense to rewrite it to parse and read the existing
configuration as a whole, do necessary manupulations on the
parsed internal representation in-core, and write the result out
from scratch.  That would fix another of my pet peeve: after an
invocation of repo-config to remove the last variable in a
section, it leaves an empty section header in.

        $ git repo-config foo.bar true
        $ cat .git/config 
        [core]
                repositoryformatversion = 0
                filemode = true
        [foo]
                bar = true
        $ git repo-config foo1.baz false
        $ git repo-config --unset foo.bar
        $ cat .git/config 
        [core]
                repositoryformatversion = 0
                filemode = true
        [foo]
        [foo1]
                baz = false

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 21:16                       ` Junio C Hamano
@ 2006-05-06 21:33                         ` Johannes Schindelin
  2006-05-06 21:51                           ` Linus Torvalds
  2006-05-07  9:39                           ` Junio C Hamano
  2006-05-07  0:41                         ` Jakub Narebski
  1 sibling, 2 replies; 81+ messages in thread
From: Johannes Schindelin @ 2006-05-06 21:33 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Linus Torvalds, git, sean

Hi,

On Sat, 6 May 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > Finally, I think "git repo-config" is buggy. Try with this .config file:
> > ...
> > So we'd really be screwing with porcelain if we made them use this ;)
> 
> Thanks Linus and Sean for bringing this up and fixing it.
> 
> I have a vague feeling that this may not be the last breakage of
> the repo-config command.  My first reaction to the repo-config
> code was "eek".  It tries to reuse as much the existing material
> as possible -- I understand it was done that way in order to
> preserve the comments and blank lines from the original config
> file intact, but it just felt very error prone (demonstrated by
> cases like this and the other one Sean brought up) and generally
> wrong.

It was done because the very syntax of the config suggests it be a 
user-editable file. I do not want to mess with the comments more than 
necessary.

> It might make sense to rewrite it to parse and read the existing
> configuration as a whole, do necessary manupulations on the
> parsed internal representation in-core, and write the result out
> from scratch.  That would fix another of my pet peeve: after an
> invocation of repo-config to remove the last variable in a
> section, it leaves an empty section header in.

Does it really hurt? I think not.

Anyway, I'll look into this.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 21:33                         ` Johannes Schindelin
@ 2006-05-06 21:51                           ` Linus Torvalds
  2006-05-07  9:39                           ` Junio C Hamano
  1 sibling, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-05-06 21:51 UTC (permalink / raw
  To: Johannes Schindelin; +Cc: Junio C Hamano, git, sean



On Sat, 6 May 2006, Johannes Schindelin wrote:
> 
> It was done because the very syntax of the config suggests it be a 
> user-editable file.

Yeah, I personally much prefer user-friendly config files. Any format that 
thinks that "easy parsing" is more important than "visually obvious" is 
bad. So I obviously think that XML is a horrid piece of cr*p (has anybody 
ever noticed I have strong opinions?) and totally unreadable.

I think "git repo-config" is doing a reasonable job of editing a file that 
is really designed to be user-friendly. That said, the code _is_ a bit 
scary.

It might be worthwhile to re-write config.c to read the config file into 
memory and work on it in-memory instead of doing the funky mixed usage 
(using fgetc/ftell to read it, but then switching over to mmap when 
rewriting it).

IOW, maybe that "static FILE *config_file" should be changed to something 
more like "static const char *config_buffer; unsigned int len;" instead, 
and at least make both the reading and writing use the same buffer rather 
than mixing stdio and mmap..

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 21:16                       ` Junio C Hamano
  2006-05-06 21:33                         ` Johannes Schindelin
@ 2006-05-07  0:41                         ` Jakub Narebski
  1 sibling, 0 replies; 81+ messages in thread
From: Jakub Narebski @ 2006-05-07  0:41 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> It might make sense to rewrite it to parse and read the existing
> configuration as a whole, do necessary manupulations on the
> parsed internal representation in-core, and write the result out
> from scratch.

Or perhaps do git repo-config read and change config file in two passes:
read and build some kind of index (beginning of section, end of
section/last variable in section, number of elements in section), then in
second pass add some information if needed.

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-06  7:10               ` Junio C Hamano
@ 2006-05-07  6:08                 ` Martin Langhoff
  2006-05-07  7:56                   ` Jeff King
  2006-05-07  8:01                   ` Sergey Vlasov
  0 siblings, 2 replies; 81+ messages in thread
From: Martin Langhoff @ 2006-05-07  6:08 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On 5/6/06, Junio C Hamano <junkio@cox.net> wrote:
> "Martin Langhoff" <martin.langhoff@gmail.com> writes:
> >
> > It means that for a merge or checkout involving stuff we "don't have",
> > it's trivial to know we are missing, and so we can  attempt a fetch of
> > the missing objects or tell the user how to request them them before
> > retrying.
> >
> > And in any case commits and trees are lightweight and compress well...
>
> Commit maybe, but is this based on a hard fact?

No hard facts here :( but I think it's reasonable to assume that the
trees delta/compress reasonably well, as a given commit will change
just a few entries in each tree.

I might try and hack a shallow local clone of the kernel and pack it
tightly to see what it yields.

cheers,



martin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07  6:08                 ` Martin Langhoff
@ 2006-05-07  7:56                   ` Jeff King
  2006-05-07 15:27                     ` Linus Torvalds
  2006-05-08  0:33                     ` Theodore Tso
  2006-05-07  8:01                   ` Sergey Vlasov
  1 sibling, 2 replies; 81+ messages in thread
From: Jeff King @ 2006-05-07  7:56 UTC (permalink / raw
  To: git

On Sun, May 07, 2006 at 06:08:03PM +1200, Martin Langhoff wrote:

> >> And in any case commits and trees are lightweight and compress well...
> >Commit maybe, but is this based on a hard fact?
> No hard facts here :( but I think it's reasonable to assume that the
> trees delta/compress reasonably well, as a given commit will change
> just a few entries in each tree.

A few hard facts (using Linus' linux-2.6 tree):
  - original packsize: 120996 kilobytes
  - unpacked: 233338 objects, 1417476 kilobytes
    This is an 11.7:1 compression ratio (of course, much of this is
    wasted space from the 4k block size in the filesystem)
  - There were 87915 total blob objects, of which 19321 were in the
    current tree. I removed all non-current blobs to produce a "shallow"
    tree.
  - The shallow tree unpacked: 164744 objects, 761960 kilobytes
    IOW, about half of the unpacked disk usage was old blobs.
  - Shallow commit/tree/tag objects packed (using 1.3.1
    git-pack-objects):
      Total 164744, written 164744 (delta 92322), reused 0 (delta 0)
      size: 108088
    The compression ratio here is only 7.0:1
  - Total savings by going shallow: 10.7%

So basically, trees and commits DON'T compress as well as historical
blobs (potentially because git-pack-objects isn't currently optimized
for this -- I haven't checked). As a result, we're saving only 10% by
going shallow instead of a potential 50%.

-Peff

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07  6:08                 ` Martin Langhoff
  2006-05-07  7:56                   ` Jeff King
@ 2006-05-07  8:01                   ` Sergey Vlasov
  2006-05-07 23:27                     ` Martin Langhoff
  1 sibling, 1 reply; 81+ messages in thread
From: Sergey Vlasov @ 2006-05-07  8:01 UTC (permalink / raw
  To: Martin Langhoff; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 1915 bytes --]

On Sun, 7 May 2006 18:08:03 +1200 Martin Langhoff wrote:

> On 5/6/06, Junio C Hamano <junkio@cox.net> wrote:
> > "Martin Langhoff" <martin.langhoff@gmail.com> writes:
> > >
> > > It means that for a merge or checkout involving stuff we "don't have",
> > > it's trivial to know we are missing, and so we can  attempt a fetch of
> > > the missing objects or tell the user how to request them them before
> > > retrying.
> > >
> > > And in any case commits and trees are lightweight and compress well...
> >
> > Commit maybe, but is this based on a hard fact?
> 
> No hard facts here :( but I think it's reasonable to assume that the
> trees delta/compress reasonably well, as a given commit will change
> just a few entries in each tree.
> 
> I might try and hack a shallow local clone of the kernel and pack it
> tightly to see what it yields.

For linux v2.6.16:

7,3M commits-b41b04a36afebdba3b70b74f419fc7d97249bd7f.pack
 24M commits_trees-8397f1c2a885527acd07e2caa8c95df626451493.pack
 97M full-c7b2747a674ff55cb4a59dabebe419f191e360df.pack

For comparizon, a single version in packed form:

 51M v2.6.12-rc2-4f3526b6815eb63da6c43ed85be1494bb776e2c5.pack

Made with

git-rev-list v2.6.16 | git-pack-objects commits
git-rev-list --objects --no-blobs v2.6.16 | git-pack-objects commits_trees
git-rev-list --objects v2.6.16 | git-pack-objects full

and this hack to git-rev-list:

diff --git a/revision.c b/revision.c
index f2a9f25..b5a929e 100644
--- a/revision.c
+++ b/revision.c
@@ -636,6 +636,10 @@ int setup_revisions(int argc, const char
 				revs->blob_objects = 1;
 				continue;
 			}
+			if (!strcmp(arg, "--no-blobs")) {
+				revs->blob_objects = 0;
+				continue;
+			}
 			if (!strcmp(arg, "--objects-edge")) {
 				revs->tag_objects = 1;
 				revs->tree_objects = 1;

So trees are definitely not lightweight, and commits are rather large
too.

[-- Attachment #2: Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-06 21:33                         ` Johannes Schindelin
  2006-05-06 21:51                           ` Linus Torvalds
@ 2006-05-07  9:39                           ` Junio C Hamano
  2006-05-07  9:42                             ` Junio C Hamano
                                               ` (2 more replies)
  1 sibling, 3 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-07  9:39 UTC (permalink / raw
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> It was done because the very syntax of the config suggests it be a 
> user-editable file. I do not want to mess with the comments more than 
> necessary.

I personally feel that is a lost cause _unless_ you come up with
a reasonable convention for where to put comments, stress that
rule to the user in the documentation, _and_ make repo-config to
follow that rule as well.

We _do_ want to treat config file as hand editable and cat
reviewable file, not an unreadable gunk like xml, so trying to
preserve user comments is important and I am not opposed to that
you did (at least some of) it.  But as the code currently
stands, what it does is at best half baked, at worst somewhat
confusing.

A demonstration.  What is wrong with this picture?

        $ cat .git/config
        [core]
                repositoryformatversion = 0
                ; are the mode bits trustworthy?
                filemode = true ; yes, on ext3 
                ; We want symlinked HEAD because we will bisect
                ; recent kernel history.
                prefersymlinkrefs = true
        $ git repo-config core.prefersymlinkrefs false
        $ git repo-config core.filemode false
        $ cat .git/config
        [core]
                repositoryformatversion = 0
                ; are the mode bits trustworthy?
                filemode = false
                ; We want symlinked HEAD because we will bisect
                ; recent kernel history.
                prefersymlinkrefs = false
	$ exit

The comment given to "filemode" is "reasonable" in that it
describes what the value that is set to the variable does, and
losing the original comment given to its "true" when we set it
to false is better than keeping it, so that part happens to be
doing the right thing -- only because I knew what repo-config
would do to the comments and arranged original comments in the
file that way.

But what about prefersymlinkrefs one?  When setting the variable
to such a non-standard value, it is unreasonable for people to
want to justify why with a comment like the above.  But after
resetting the value the comment becomes stale.

It gets worse:

        $ git repo-config --unset core.filemode
        $ cat .git/config
        [core]
                repositoryformatversion = 0
                ; please please use symlinks please
                prefersymlinkrefs = false
                ; are the mode bits trustworthy?
	$ exit

There now is a confusing trailing comment left that does not
comment anything.  Removing core.filemode is not so common, but
this can happen whenever you remove any variable, so we can use
any other variable as an example.

Now, enough being negative and pointing out problems.  Time to
become constructive.  Probably a reasonable convention would be
to define the config file format to be something like this:

        <comment that applies to the section>
        [section]
                <comment that applies to the variable stands on
		 its own before the variable>
                variable [= value] <comment that applies to the
        			    fact the variable is set to
                                    this particular value starts
				    on the same line as the
                                    "variable = value" thing>

 - when a variable is reset to another value, remove the
   "value comment";
 - when a variable disappears, remove "variable comment";
 - when a section disappears, remove "section comment";
 - otherwise leave the comment intact.

Then we could tell the user the rule is like above, and tell
them to structure the file with comments that way, if they ever
want to edit the file by hand.

Now if we wanted to do something like the above, I suspect that
it would be easier and less error prone to first scan the config
file, note what appears where, and do the processing in-core,
and then write the results out, perhaps using data structures
like these:

        struct config_section {
            char *pre_comment;
            char *name; /* e.g. "core" */
            struct config_section *next; /* next section */
            struct config_var *vars; /* pointer to the first one */
        };
        struct config_var {
            char *pre_comment;
            char *name;
            char *value; /* "existence" bool may have NULL,
                          * otherwise probably a string "= value"
                          */
            char *value_comment;
            struct config_var *next; /* pointer to the next one
                                      * in the section
                                      */
        };

Obviously, data structures like these would make it even easier
if we decide we do _not_ care about comments (we would just lose
x_comment fields, parse the thing and write the resulting list
out).

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-07  9:39                           ` Junio C Hamano
@ 2006-05-07  9:42                             ` Junio C Hamano
  2006-05-07 11:31                             ` Johannes Schindelin
  2006-05-07 11:38                             ` Jakub Narebski
  2 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-07  9:42 UTC (permalink / raw
  To: git

Junio C Hamano <junkio@cox.net> writes:

> But what about prefersymlinkrefs one?  When setting the variable
> to such a non-standard value, it is unreasonable for people to
> want to justify why with a comment like the above.

Obviously I was not reading what I was typing.  It is very
reasonable for people to want to do that.  Sorry.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-07  9:39                           ` Junio C Hamano
  2006-05-07  9:42                             ` Junio C Hamano
@ 2006-05-07 11:31                             ` Johannes Schindelin
  2006-05-07 11:38                             ` Jakub Narebski
  2 siblings, 0 replies; 81+ messages in thread
From: Johannes Schindelin @ 2006-05-07 11:31 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

Hi,

On Sun, 7 May 2006, Junio C Hamano wrote:

> [...] Probably a reasonable convention would be to define the config 
> file format to be something like this:
> 
>         <comment that applies to the section>
>         [section]
>                 <comment that applies to the variable stands on
> 		 its own before the variable>
>                 variable [= value] <comment that applies to the
>         			    fact the variable is set to
>                                     this particular value starts
> 				    on the same line as the
>                                     "variable = value" thing>
> 
>  - when a variable is reset to another value, remove the
>    "value comment";
>  - when a variable disappears, remove "variable comment";
>  - when a section disappears, remove "section comment";
>  - otherwise leave the comment intact.
> 
> Then we could tell the user the rule is like above, and tell
> them to structure the file with comments that way, if they ever
> want to edit the file by hand.
> 
> Now if we wanted to do something like the above, I suspect that
> it would be easier and less error prone to first scan the config
> file, note what appears where, and do the processing in-core,
> and then write the results out, perhaps using data structures
> like these:
> 
>         struct config_section {
>             char *pre_comment;
>             char *name; /* e.g. "core" */
>             struct config_section *next; /* next section */
>             struct config_var *vars; /* pointer to the first one */
>         };
>         struct config_var {
>             char *pre_comment;
>             char *name;
>             char *value; /* "existence" bool may have NULL,
>                           * otherwise probably a string "= value"
>                           */
>             char *value_comment;
>             struct config_var *next; /* pointer to the next one
>                                       * in the section
>                                       */
>         };
> 
> Obviously, data structures like these would make it even easier
> if we decide we do _not_ care about comments (we would just lose
> x_comment fields, parse the thing and write the resulting list
> out).

Sounds very reasonable.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-07  9:39                           ` Junio C Hamano
  2006-05-07  9:42                             ` Junio C Hamano
  2006-05-07 11:31                             ` Johannes Schindelin
@ 2006-05-07 11:38                             ` Jakub Narebski
  2006-05-08  2:51                               ` Junio C Hamano
  2 siblings, 1 reply; 81+ messages in thread
From: Jakub Narebski @ 2006-05-07 11:38 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

>             char *value; /* "existence" bool may have NULL,
>                           * otherwise probably a string "= value"
>                           */

Probably " = value" to preserve whitespace (e.g. justify on equal sign in
hand crafted config file).

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-05  0:25     ` Junio C Hamano
  2006-05-05  5:17       ` Martin Langhoff
  2006-05-05 15:10       ` Linus Torvalds
@ 2006-05-07 13:30       ` Jakub Narebski
  2006-05-08  2:54         ` Junio C Hamano
  2 siblings, 1 reply; 81+ messages in thread
From: Jakub Narebski @ 2006-05-07 13:30 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> The vocabulary we would want from the requestor side is probably
> (at least):
> 
>         I WANT to have these
>         I HAVE these
>         I'm MISSING these
>         Don't bother with these this time around (--since, ^v2.6.16, ...)

Wouldn't it be easier (sorry, no code yet) to have the following:

        I WANT to have these
        I HAVE these
        These are GRAFT PARENTLESS        

with the target side sending list of all parentless commits in the
info/grafts file. The source side will then do the grafting 'in memory' and
send the packs like normal, only with those cauterizing grafts in place.

Now I'm waiting for someone to say that it is too simple and cannot be done,
or that shallow clone/shallow fetch uses this method...

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07  7:56                   ` Jeff King
@ 2006-05-07 15:27                     ` Linus Torvalds
  2006-05-08  0:33                     ` Theodore Tso
  1 sibling, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-05-07 15:27 UTC (permalink / raw
  To: Jeff King; +Cc: git



On Sun, 7 May 2006, Jeff King wrote:
>
>   - Total savings by going shallow: 10.7%
> 
> So basically, trees and commits DON'T compress as well as historical
> blobs (potentially because git-pack-objects isn't currently optimized
> for this -- I haven't checked). As a result, we're saving only 10% by
> going shallow instead of a potential 50%.

The biggest size savers from packing is (in rough order of relevance, if 
I recall the rough statistics I did):

 - avoiding block boundaries. 
 - delta packing of blobs
 - delta packing of trees
 - regular compression

The block boundaries are huge, we have tons of small objects, and that was 
one of the primary reasons for packing. I'd suspect that this is a 3:1 
factor for a lot of things for many "common" filesystem setups. You 
probably didn't even account for the size of inodes in your "du" setup.

And blobs with history generally delta very well (_much_ better than 
regular compression).

Trees should _delta_ very well, but they basically don't compress, 
especially after deltaing. The SHA1's are totally incompressible (in a 
tree they aren't even ASCII), and as a deta, the names won't compress much 
either because they are short.

Commits are fairly small, shouldn't delta all that much, and they don't 
even compress _that_ well either (they're normal text and often have some 
redundancy with the committer and author being the same, but they are 
short and have some fairly incompressible elements, so..)

The thing with trees in particular is that they are very common for the 
kernel (and probably not so much for many other projects). A single commit 
ends up quite commonly being just one commit object, one blob (that deltas 
really well), and three or four trees. Merges often have no new blobs at 
all, just several new trees and the commit object.

So a huge amount of the wins from packing come from the file _history_, 
the part that a shallow clone (on purpose) leaves behind. 

The regular compression will pick up a fair amount of slack with the 
blobs, but it's a much smaller factor than the delta compression for 
something that has a long history.

It's somewhat interesting to note that over the year that we've used git, 
the kernel pack-size hasn't even increased all that much. I forget exactly 
what it was when we started packing, but it was on the order of ~75M. It 
is now 115M for me. And the old linux-history thing (full BK history over 
three years) is 177M - not much more than twice the size of just a few 
kernel versions - with some higher packing ratios..

Exactly because blobs delta so incredibly well.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07  8:01                   ` Sergey Vlasov
@ 2006-05-07 23:27                     ` Martin Langhoff
  2006-05-07 23:35                       ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Martin Langhoff @ 2006-05-07 23:27 UTC (permalink / raw
  To: Sergey Vlasov; +Cc: Junio C Hamano, git

On 5/7/06, Sergey Vlasov <vsu@altlinux.ru> wrote:
> For linux v2.6.16:
>
> 7,3M commits-b41b04a36afebdba3b70b74f419fc7d97249bd7f.pack
>  24M commits_trees-8397f1c2a885527acd07e2caa8c95df626451493.pack
>  97M full-c7b2747a674ff55cb4a59dabebe419f191e360df.pack

With this pack arrangement, do you get any noticeable difference in
walking commits? How about walking commits+trees with git-log <path> ?

I wonder whether segregating packs by object type would make things better...

cheers,



martin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07 23:27                     ` Martin Langhoff
@ 2006-05-07 23:35                       ` Junio C Hamano
  2006-05-07 23:44                         ` Martin Langhoff
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2006-05-07 23:35 UTC (permalink / raw
  To: Martin Langhoff; +Cc: git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> On 5/7/06, Sergey Vlasov <vsu@altlinux.ru> wrote:
>> For linux v2.6.16:
>>
>> 7,3M commits-b41b04a36afebdba3b70b74f419fc7d97249bd7f.pack
>>  24M commits_trees-8397f1c2a885527acd07e2caa8c95df626451493.pack
>>  97M full-c7b2747a674ff55cb4a59dabebe419f191e360df.pack
>
> With this pack arrangement, do you get any noticeable difference in
> walking commits? How about walking commits+trees with git-log <path> ?
>
> I wonder whether segregating packs by object type would make things better...

It shouldn't.  The existing packfile is designed to make "git
log" very efficient, by making it cheap to look only at the
commit message and ancestry information.

The objects are sorted first by type in the pack with the
existing code already, and commits come first.  Try this.

        git repack -a -d
        git show-index <.git/objects/pack/pack-*.idx |
        sort -n |
        while read offset objectname
        do
                git cat-file -t "$objectname"
        done

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07 23:35                       ` Junio C Hamano
@ 2006-05-07 23:44                         ` Martin Langhoff
  0 siblings, 0 replies; 81+ messages in thread
From: Martin Langhoff @ 2006-05-07 23:44 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On 5/8/06, Junio C Hamano <junkio@cox.net> wrote:
> It shouldn't.  The existing packfile is designed to make "git
> log" very efficient, by making it cheap to look only at the
> commit message and ancestry information.
>
> The objects are sorted first by type in the pack with the
> existing code already, and commits come first.  Try this.

Thanks for the explanation! My ignorance in the pre-coffee stage of
Monday is... exemplary. I should know better than assume that there
are trivial optimizations for git that I can come up with ;-)

I guess I should get more into the pack stuff and educate myself --
everyone's having fun with it... grumble, will have to brush up my C
to get to play.

cheers,


martin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07  7:56                   ` Jeff King
  2006-05-07 15:27                     ` Linus Torvalds
@ 2006-05-08  0:33                     ` Theodore Tso
  2006-05-08  0:50                       ` Linus Torvalds
  2006-05-08  4:24                       ` Jeff King
  1 sibling, 2 replies; 81+ messages in thread
From: Theodore Tso @ 2006-05-08  0:33 UTC (permalink / raw
  To: git

On Sun, May 07, 2006 at 03:56:31AM -0400, Jeff King wrote:
> On Sun, May 07, 2006 at 06:08:03PM +1200, Martin Langhoff wrote:
> 
> > >> And in any case commits and trees are lightweight and compress well...
> > >Commit maybe, but is this based on a hard fact?
> > No hard facts here :( but I think it's reasonable to assume that the
> > trees delta/compress reasonably well, as a given commit will change
> > just a few entries in each tree.
> 
> A few hard facts (using Linus' linux-2.6 tree):
>   - original packsize: 120996 kilobytes
>   - unpacked: 233338 objects, 1417476 kilobytes
>     This is an 11.7:1 compression ratio (of course, much of this is
>     wasted space from the 4k block size in the filesystem)

If there are 233338 objects, then the average wasted space due to
internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only
36% of the wasted space.  Most of the savings is probably coming from
the compression and delta packing.

						- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  0:33                     ` Theodore Tso
@ 2006-05-08  0:50                       ` Linus Torvalds
  2006-05-08  1:26                         ` Theodore Tso
  2006-05-08  4:24                       ` Jeff King
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-08  0:50 UTC (permalink / raw
  To: Theodore Tso; +Cc: git



On Sun, 7 May 2006, Theodore Tso wrote:
>> 
> If there are 233338 objects, then the average wasted space due to
> internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only
> 36% of the wasted space.

That's not necessarily true.

That assumes a randomly distributed filesize. File sizes are _not_ random, 
and in particular if you have the distribution leaning towards <2kB being 
common, you can actually get >50% fragmentation.

Btw, I hit this when some people argued that the page size should be made 
64kB. The above (incorrect) logic implies that you waste 32kB on average 
per file. That's not true, if a large fraction of your files are small, in 
which case you may actually be wastign closer to 60kB on average from 
using a big page-size, because about half of the kernel files are actually 
smaller than 4kB (or something. I forget the exact statistics, I did them 
with a script at some point).

Anyway, with inode overhead and a lot of objects being just a couple of 
hundred bytes, I think I estimated at some point that you actually lost 
closer to 3kB per object.

Many of the objects actually end up being smaller than the inode they end 
up allocating ;(

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  0:50                       ` Linus Torvalds
@ 2006-05-08  1:26                         ` Theodore Tso
  2006-05-08  2:04                           ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Theodore Tso @ 2006-05-08  1:26 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

On Sun, May 07, 2006 at 05:50:42PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 7 May 2006, Theodore Tso wrote:
> >> 
> > If there are 233338 objects, then the average wasted space due to
> > internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only
> > 36% of the wasted space.
> 
> That's not necessarily true.
> 
> That assumes a randomly distributed filesize. File sizes are _not_ random, 
> and in particular if you have the distribution leaning towards <2kB being 
> common, you can actually get >50% fragmentation.
> 
> Btw, I hit this when some people argued that the page size should be made 
> 64kB. The above (incorrect) logic implies that you waste 32kB on average 
> per file. That's not true, if a large fraction of your files are small, in 
> which case you may actually be wastign closer to 60kB on average from 
> using a big page-size, because about half of the kernel files are actually 
> smaller than 4kB (or something. I forget the exact statistics, I did them 
> with a script at some point).
> 
> Anyway, with inode overhead and a lot of objects being just a couple of 
> hundred bytes, I think I estimated at some point that you actually lost 
> closer to 3kB per object.

I just ran the numbers on filesizes of a kernel tree I had handy,
which happened to be 2.6.16.11.  With no object files, git files,
etc. the average loss was 2351 bytes --- not that far away from the
average of 2048 bytes.  Granted, it may be there is more different
versions of small objects causing a skewing of the distributions of
git objects in the 2.6 tree, but I'm not familiar enough with the git
porcelain to be able to make it disgorge the sizes of the repository
to do the math.

						- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  1:26                         ` Theodore Tso
@ 2006-05-08  2:04                           ` Linus Torvalds
  2006-05-08  2:24                             ` Theodore Tso
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2006-05-08  2:04 UTC (permalink / raw
  To: Theodore Tso; +Cc: git



On Sun, 7 May 2006, Theodore Tso wrote:
> 
> I just ran the numbers on filesizes of a kernel tree I had handy,
> which happened to be 2.6.16.11.  With no object files, git files,
> etc. the average loss was 2351 bytes --- not that far away from the
> average of 2048 bytes.

Is that without compression?

git objects are compressed, and common types (trees) tend to be smaller 
than your normal C file.

So git objects tend to be _smaller_ than the regular files. By about 30%. 
In addition, the non-blob git objects themselves tend to be smaller still.

So for example, right now I have just 58 unpacked objects (I repack pretty 
often). But of those 58 objects, exactly _fifty_ are smaller than 2kB, and 
38 are smaller than 1kB. The median size is 771 bytes.

On master.kernel.org, I've not repacked as recently, so I've got 2268 
unpacked objects. But the median size there is 773 bytes, so it looks like 
the numbers are statistically pretty stable.

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  2:04                           ` Linus Torvalds
@ 2006-05-08  2:24                             ` Theodore Tso
  2006-05-08  2:42                               ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Theodore Tso @ 2006-05-08  2:24 UTC (permalink / raw
  To: Linus Torvalds; +Cc: git

On Sun, May 07, 2006 at 07:04:48PM -0700, Linus Torvalds wrote:
> Is that without compression?

Yes, without compression.  So yes, that probably explains the
difference between your numbers and mine. 

That brings up an interesting question though --- why not skip
compressing files that are under 4k (or whatever the filesystem
blocksize happens to be) if they are unpacked?  It burns CPU time;
maybe not enough to be human-noticeable, but it's still not buying you
anything.

						- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  2:24                             ` Theodore Tso
@ 2006-05-08  2:42                               ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-05-08  2:42 UTC (permalink / raw
  To: Theodore Tso; +Cc: git



On Sun, 7 May 2006, Theodore Tso wrote:
> 
> That brings up an interesting question though --- why not skip
> compressing files that are under 4k (or whatever the filesystem
> blocksize happens to be) if they are unpacked?  It burns CPU time;
> maybe not enough to be human-noticeable, but it's still not buying you
> anything.

Well, other filesystems don't have 4kB issues. Reiser can do smaller 
things iirc, and you might obviously have a ext3 filesystem with a 1kB 
blocksize too. And with tails on FFS, you might have a filesystem with a 
8kB blocksize, but despite that it might lay out <1kB files well.

Anyway, packing makes all this basically a non-issue. There are no block 
boundaries in a pack-file, and you only use a single inode. And you'd 
obviously want to pack for other reasons anyway (ie the delta compression 
will makea huge difference over time).

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-07 11:38                             ` Jakub Narebski
@ 2006-05-08  2:51                               ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-08  2:51 UTC (permalink / raw
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> Junio C Hamano wrote:
>
>>             char *value; /* "existence" bool may have NULL,
>>                           * otherwise probably a string "= value"
>>                           */
>
> Probably " = value" to preserve whitespace (e.g. justify on equal sign in
> hand crafted config file).

Probably even better is to remove the separate *value_comment,
and make this thing point at the entire " = value ; this is the
comment for the value\n" thing.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-07 13:30       ` Jakub Narebski
@ 2006-05-08  2:54         ` Junio C Hamano
  2006-05-08  4:02           ` Jakub Narebski
  2006-05-08  4:24           ` Jakub Narebski
  0 siblings, 2 replies; 81+ messages in thread
From: Junio C Hamano @ 2006-05-08  2:54 UTC (permalink / raw
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> writes:

> Wouldn't it be easier (sorry, no code yet) to have the following:
>
>         I WANT to have these
>         I HAVE these
>         These are GRAFT PARENTLESS        
>
> with the target side sending list of all parentless commits in the
> ... The source side will then do the grafting 'in memory' and
> send the packs like normal, only with those cauterizing grafts in place.

I think that is essentially the outline of shallow clone
proposal, except that you have to be careful and take not just
"parentless" but other grafts (e.g. one that removes one parent
from a merge commit to pretend that a side branch did not exist)
into account as well.  I do not remember if I already coded it
or not -- I might have.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  2:54         ` Junio C Hamano
@ 2006-05-08  4:02           ` Jakub Narebski
  2006-05-08  4:24           ` Jakub Narebski
  1 sibling, 0 replies; 81+ messages in thread
From: Jakub Narebski @ 2006-05-08  4:02 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> Jakub Narebski <jnareb@gmail.com> writes:
> 
>> Wouldn't it be easier (sorry, no code yet) to have the following:
>>
>>         I WANT to have these
>>         I HAVE these
>>         These are GRAFT PARENTLESS
>>
>> with the target side sending list of all parentless commits in the
>> ... The source side will then do the grafting 'in memory' and
>> send the packs like normal, only with those cauterizing grafts in place.
> 
> I think that is essentially the outline of shallow clone
> proposal, except that you have to be careful and take not just
> "parentless" but other grafts (e.g. one that removes one parent
> from a merge commit to pretend that a side branch did not exist)
> into account as well.  I do not remember if I already coded it
> or not -- I might have.

Having grafts file being used for both joining history and cauterizing
history makes re-cauterizing (e.g. changing depth of clone) difficult at
best...

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  2:54         ` Junio C Hamano
  2006-05-08  4:02           ` Jakub Narebski
@ 2006-05-08  4:24           ` Jakub Narebski
  1 sibling, 0 replies; 81+ messages in thread
From: Jakub Narebski @ 2006-05-08  4:24 UTC (permalink / raw
  To: git

Junio C Hamano wrote:

> Jakub Narebski <jnareb@gmail.com> writes:
> 
>> Wouldn't it be easier (sorry, no code yet) to have the following:
>>
>>         I WANT to have these
>>         I HAVE these
>>         These are GRAFT PARENTLESS
>>
>> with the target side sending list of all parentless commits in the
>> ... The source side will then do the grafting 'in memory' and
>> send the packs like normal, only with those cauterizing grafts in place.
> 
> I think that is essentially the outline of shallow clone
> proposal, except that you have to be careful and take not just
> "parentless" but other grafts (e.g. one that removes one parent
> from a merge commit to pretend that a side branch did not exist)
> into account as well.  I do not remember if I already coded it
> or not -- I might have.

So, let it be all grafts removing some or all parents from commit.

And that proposal would work I think also for the fetch, not only clone.

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  0:33                     ` Theodore Tso
  2006-05-08  0:50                       ` Linus Torvalds
@ 2006-05-08  4:24                       ` Jeff King
  2006-05-08 15:32                         ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Jeff King @ 2006-05-08  4:24 UTC (permalink / raw
  To: git

On Sun, May 07, 2006 at 08:27:02AM -0700, Linus Torvalds wrote:

> factor for a lot of things for many "common" filesystem setups. You 
> probably didn't even account for the size of inodes in your "du" setup.

My numbers came from git-count-objects, which uses the st_blocks sum for
all objects. The actual du numbers showing space wasted by block
boundaries are:
  du -c ??: 1429216
  du -c --apparent-size ??: 792277
So it's about 45% wasted space.

On Sun, May 07, 2006 at 08:33:38PM -0400, Theodore Tso wrote:

> If there are 233338 objects, then the average wasted space due to
> internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only
> 36% of the wasted space.  Most of the savings is probably coming from
> the compression and delta packing.

As Linus indicated, that assumes a uniform distribution of file sizes
(and my numbers above show that it is, in fact, somewhat higher). FYI,
the mean and median of usage of the final 4K block in the linux-2.6
repository are 1309 and 912 bytes, respectively.

-Peff

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2 (shallow clone again)
  2006-05-08  4:24                       ` Jeff King
@ 2006-05-08 15:32                         ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2006-05-08 15:32 UTC (permalink / raw
  To: Jeff King; +Cc: git



On Mon, 8 May 2006, Jeff King wrote:
>
> On Sun, May 07, 2006 at 08:27:02AM -0700, Linus Torvalds wrote:
> 
> > factor for a lot of things for many "common" filesystem setups. You 
> > probably didn't even account for the size of inodes in your "du" setup.
> 
> My numbers came from git-count-objects, which uses the st_blocks sum for
> all objects. The actual du numbers showing space wasted by block
> boundaries are:
>   du -c ??: 1429216
>   du -c --apparent-size ??: 792277
> So it's about 45% wasted space.

And that's actually ignoring inode sizes and directory sizes (well, it 
doesn't "ignore" directory sizes - it counts them - but if you compare it 
to a straight packed format, it's still overhead).

Anyway, looks like it's about 2:1, not 3:1 like I claimed, but the point 
being that blocking factors tend to be at least on the same order of 
magnitude as just plain compression (which also tends to be in the 2:1 
area for normal, fairly easily compressible, stuff).

The delta-packing obviously is much bigger for any project with real 
history. In traditional setups (where you always delta-pack within one 
thing, ie at the level of individual SCCS/RCS files), the delta packing 
obviously _also_ avoids blocking issues, since it means that a thousand 
revisions of the same file will all share the same inode.

So because git uses a whole-file model, it obviously makes the blocking 
issues with its unpacked format _much_ higher than for any traditional 
medium - no conglomeration of different versions of the file in the same 
filesystem object. On the other hand, the packed format also tends to be 
even _more_ efficient than a traditional one, so the end result of it all 
is apparently a pretty big net win even in space consumption).

Side note: I realize that some people think the packs are ugly and 
strange. They aren't linear versions of a file, and instead appear as a 
fairly random "jumble". And they can't be incrementally re-packed, and you 
have to generate a whole new pack-file (which can be incremental in 
_content_, of course). So people think they are ugly.

I'd argue that they are beautiful. They are beautiful because they _don't_ 
contain history in themselves (the objects they contain encode the history 
of course, but the pack-file itself does not).

And they are beautiful because we can use the exact same format for 
streaming data over the network as for the database itself (that, of 
course, was just about _the_ design consideration). Show me another system 
that has exactly the same (not "similar", not "same concepts": _same_) 
network protocol as it internal database.

And they are beautiful exactly because their lack of any internal 
structure allows you to pack things by criteria _you_ care about, ie the 
whole "sort things by recency" thing, so that commonly accessed data can 
be packed at the head of the pack-file - exactly because the pack-file 
doesn't have any internal structure of its own that you need to worry 
about and that constrains your sorting.

The same thing is what allows you to delta any blob against any other 
blob - without worrying about history or other random pack-file rules. You 
can do packign purely by how well you want to pack, not by any secondary 
constraints.

And the "no incremental updates" may sound like a huge downside, but it's 
all the same basic git logic: objects and filesystem contents are 
immutable, and that allows us to avoid a lot of locking overhead. Locking 
is _hard_. Locking is _inefficient_. And locking really really screws you 
when you miss it.

So I'll happily say that pack-files are strange, and that you have to get 
a bit used to the notion that they should be repacked "asynchronously". 
But it's really a matter of "getting used to it", because once you do, 
you'll see that it's actually an absolutely huge deal, and you'll learn to 
love the bomb^H^H^H^Hpack-file.

			Linus "pack-files rule" Torvalds

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
                     ` (3 preceding siblings ...)
  2006-05-04 20:41   ` Unresolved issues #2 Daniel Barkalow
@ 2006-05-09 11:40   ` David Woodhouse
  2006-05-09 11:53     ` Bertrand Jacquin
  2006-05-09 13:09     ` Nicolas Pitre
  4 siblings, 2 replies; 81+ messages in thread
From: David Woodhouse @ 2006-05-09 11:40 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Thu, 2006-05-04 at 01:15 -0700, Junio C Hamano wrote:
> 
> * Message-ID:
> <4fb292fa0604290630r19edd7ejf88642e33b350d1d@mail.gmail.com>
>   Content-type charset for send-email (Bertrand Jacquin)
> 
>   The output from format-patch by default is unmarked, which
>   means the commit message part is UTF-8 (by strong convention),
>   and the contents of the diff is whatever the contents of the
>   file is encoded in.

Email without a Content-Type: header is supposed to be ASCII. If it
contains 8-bit characters, it's invalid. It'll be interpreted by
different systems in different ways -- not necessarily as UTF-8. Some
may even just reject it, on grounds of RFC non-compliance.

>   David Woodhouse did a patch to allow specifying charset on the
>   command line (and default to UTF-8) which is a move in the
>   right direction, but Bertrand's system seems to have trouble
>   with it.

I thought Bertrand then confirmed that he was having trouble _before_
applying my patch, too? His response when I asked it it appears without
my patch was "[it] appear without in 1.3.1 and I can't seed mail with
too. Also, 1.2.4 work fine here (without patch)."

>   I think if we were to do this we probably need to teach
>   format-patch to optionally do multi-part.  We may not
>   necessarily want to mark the payload to be in the same
>   encoding as the commit message (not that git-apply cares -- to
>   it, the payload is just 8-bit unencoded text, but we would
>   want to protect it from getting mangled by e-mail transport). 

I'm not sure about that. The payload is patches, isn't it? That's just
text, too -- we aren't going to deal with diffs of binary content very
well _anyway_, are we?

Obviously, there's nothing to stop people from storing binary blobs in
GIT, but unless you want to start sending actual _blobs_ as attachments
instead of sending patches, I think there's no need to play with MIME
multipart stuff.

I've no particular objection to it, but it's a separate issue to
Bertanrd's. That's a bug-fix, while multipart is an RFE without much
point, IMO.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-09 11:40   ` David Woodhouse
@ 2006-05-09 11:53     ` Bertrand Jacquin
  2006-05-09 13:09     ` Nicolas Pitre
  1 sibling, 0 replies; 81+ messages in thread
From: Bertrand Jacquin @ 2006-05-09 11:53 UTC (permalink / raw
  To: David Woodhouse; +Cc: Junio C Hamano, git

On 5/9/06, David Woodhouse <dwmw2@infradead.org> wrote:
> On Thu, 2006-05-04 at 01:15 -0700, Junio C Hamano wrote:
> >
> > * Message-ID:
> > <4fb292fa0604290630r19edd7ejf88642e33b350d1d@mail.gmail.com>
>
> >   David Woodhouse did a patch to allow specifying charset on the
> >   command line (and default to UTF-8) which is a move in the
> >   right direction, but Bertrand's system seems to have trouble
> >   with it.
>
> I thought Bertrand then confirmed that he was having trouble _before_
> applying my patch, too? His response when I asked it it appears without
> my patch was "[it] appear without in 1.3.1 and I can't seed mail with
> too. Also, 1.2.4 work fine here (without patch)."

Ok, to make short :
git-send-email 1.2.4 :
No EOF error on my smtp server.
git-send-email 1.3.1 :
EOF error on my smtp server.

I upgraded to 1.3.1 when I received patch from you, don't test 1.3.1
and then applied your patch, you test. And so test failed.

--
Beber
#e.fr@freenode

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unresolved issues #2
  2006-05-09 11:40   ` David Woodhouse
  2006-05-09 11:53     ` Bertrand Jacquin
@ 2006-05-09 13:09     ` Nicolas Pitre
  1 sibling, 0 replies; 81+ messages in thread
From: Nicolas Pitre @ 2006-05-09 13:09 UTC (permalink / raw
  To: David Woodhouse; +Cc: Junio C Hamano, git

On Tue, 9 May 2006, David Woodhouse wrote:

> I'm not sure about that. The payload is patches, isn't it? That's just
> text, too -- we aren't going to deal with diffs of binary content very
> well _anyway_, are we?

Yes we do.  GIT now has its own email friendly binary patch format.


Nicolas

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2006-05-09 13:09 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-14  9:31 Recent unresolved issues Junio C Hamano
2006-04-14 16:02 ` Petr Baudis
     [not found] ` <20060414151030.11c64730.seanlkml@sympatico.ca>
2006-04-14 19:10   ` sean
2006-04-14 19:24 ` Petr Baudis
2006-04-14 22:56 ` Recent unresolved issues: shallow clone Carl Worth
2006-04-15  0:17   ` Johannes Schindelin
2006-04-15  0:25   ` Junio C Hamano
2006-04-15  2:11     ` Junio C Hamano
2006-04-14 23:52 ` Recent unresolved issues Linus Torvalds
2006-04-15  0:19   ` Linus Torvalds
2006-04-15  0:39     ` Linus Torvalds
2006-04-15  0:38   ` Junio C Hamano
2006-04-15  0:49     ` Linus Torvalds
2006-04-15  0:56       ` Linus Torvalds
2006-04-15  1:09         ` Linus Torvalds
2006-04-15  2:22           ` Junio C Hamano
2006-04-15  6:18         ` Junio C Hamano
2006-04-15  8:57           ` Junio C Hamano
2006-04-15 11:46             ` Johannes Schindelin
2006-04-15 16:59             ` Linus Torvalds
2006-04-15 17:17               ` Linus Torvalds
2006-04-16  8:14                 ` Junio C Hamano
2006-04-15  1:35       ` Junio C Hamano
2006-04-15  4:09         ` Linus Torvalds
2006-04-15  5:06           ` Junio C Hamano
2006-05-04  8:15 ` Unresolved issues #2 Junio C Hamano
2006-05-04  8:32   ` Jakub Narebski
2006-05-04  9:14     ` Junio C Hamano
2006-05-04  9:26       ` Jakub Narebski
2006-05-04  9:58   ` Petr Baudis
2006-05-04 15:45     ` Pavel Roskin
2006-05-04 17:01   ` Unresolved issues #2 (shallow clone again) Carl Worth
2006-05-05  0:25     ` Junio C Hamano
2006-05-05  5:17       ` Martin Langhoff
2006-05-05  5:23         ` Carl Worth
2006-05-05  5:48           ` Jakub Narebski
2006-05-05 15:10       ` Linus Torvalds
2006-05-05 15:18         ` Jakub Narebski
2006-05-05 15:59           ` Linus Torvalds
2006-05-06  6:23             ` Martin Langhoff
2006-05-06  7:10               ` Junio C Hamano
2006-05-07  6:08                 ` Martin Langhoff
2006-05-07  7:56                   ` Jeff King
2006-05-07 15:27                     ` Linus Torvalds
2006-05-08  0:33                     ` Theodore Tso
2006-05-08  0:50                       ` Linus Torvalds
2006-05-08  1:26                         ` Theodore Tso
2006-05-08  2:04                           ` Linus Torvalds
2006-05-08  2:24                             ` Theodore Tso
2006-05-08  2:42                               ` Linus Torvalds
2006-05-08  4:24                       ` Jeff King
2006-05-08 15:32                         ` Linus Torvalds
2006-05-07  8:01                   ` Sergey Vlasov
2006-05-07 23:27                     ` Martin Langhoff
2006-05-07 23:35                       ` Junio C Hamano
2006-05-07 23:44                         ` Martin Langhoff
2006-05-05 15:31         ` Carl Worth
2006-05-07 13:30       ` Jakub Narebski
2006-05-08  2:54         ` Junio C Hamano
2006-05-08  4:02           ` Jakub Narebski
2006-05-08  4:24           ` Jakub Narebski
2006-05-04 20:41   ` Unresolved issues #2 Daniel Barkalow
2006-05-04 21:33     ` Linus Torvalds
2006-05-06  5:58       ` Junio C Hamano
2006-05-06 15:26         ` Linus Torvalds
     [not found]           ` <20060506113549.48e553d1.seanlkml@sympatico.ca>
2006-05-06 15:35             ` sean
2006-05-06 16:30               ` Linus Torvalds
     [not found]                 ` <20060506125323.544c35db.seanlkml@sympatico.ca>
2006-05-06 16:53                   ` sean
2006-05-06 17:20                     ` Linus Torvalds
2006-05-06 21:16                       ` Junio C Hamano
2006-05-06 21:33                         ` Johannes Schindelin
2006-05-06 21:51                           ` Linus Torvalds
2006-05-07  9:39                           ` Junio C Hamano
2006-05-07  9:42                             ` Junio C Hamano
2006-05-07 11:31                             ` Johannes Schindelin
2006-05-07 11:38                             ` Jakub Narebski
2006-05-08  2:51                               ` Junio C Hamano
2006-05-07  0:41                         ` Jakub Narebski
2006-05-09 11:40   ` David Woodhouse
2006-05-09 11:53     ` Bertrand Jacquin
2006-05-09 13:09     ` Nicolas Pitre

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).