What's cooking in git.git (topics)

* What's cooking in git.git (topics)
@ 2007-09-06  8:52 Junio C Hamano
       [not found] ` <7v1wd1d0le.fsf@gitster.siamese.dyndns.org>
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-09-06  8:52 UTC (permalink / raw)
  To: git

Here are the topics that have been cooking.  Commits prefixed
with '-' are only in 'pu' while commits prefixed with '+' are
in 'next'.  The topics list the commits in reverse chronological
order.

* js/tag (Mon Sep 3 17:51:43 2007 +0100) 1 commit
 + verify-tag: also grok CR/LFs in the tag signature

Looks correct.  Merge to 'master' this weekend.

* lh/svn-first-parent (Wed Sep 5 11:35:29 2007 +0200) 1 commit
 + git-svn: add support for --first-parent

Queued to 'next' with Eric's blessing.  Perhaps merge to
'master' by the end of next week unless there are issues.

* rs/archive (Mon Sep 3 20:08:01 2007 +0200) 3 commits
 + Remove unused function convert_sha1_file()
 + archive: specfile support (--pretty=format: in archive files)
 + Export format_commit_message()

Waiting for the "$Format: ...$" updates.

* js/remote (Sun Sep 2 21:10:14 2007 +0100) 1 commit
 + Teach "git remote" a mirror mode

Waiting for tests.  We should resurrect earlier "git remote rm"
and add tests for it as well.

* jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
 - rebase: allow starting from a dirty tree.
 - stash: implement "stash create"

A quick hack to allow starting "git rebase" in a dirty work tree
by automatically stashing the changes first, and unstashing them
after rebase is done.  Needs tests and documentation.

* np/delta (Thu Sep 6 02:13:11 2007 -0400) 4 commits
 - basic threaded delta search
 - rearrange delta search progress reporting
 - localize window memory usage accounting
 - straighten the list of objects to deltify

I do not know where Nico's "threaded pack generation" would lead
us to yet, so they are parked on 'pu' for now.  The first in the
series should be applicable to 'next', though.

* jc/pack (Sat Sep 1 23:53:47 2007 -0700) 1 commit
 + Keep last used delta base in the delta window

Would need to straighten out the implementation from the one
that is suited for the original FIFO usage to another that is
more appropriate for LRU.

* jc/autogc (Wed Sep 5 14:59:59 2007 -0700) 2 commits
 - Invoke "git gc --auto" from commit, merge, am and rebase.
 - Implement git gc --auto

This has been updated since the ones I sent to the list earlier
in the day.  It detects a situation where the repository has so
much cruft that too many loose objects are left unpruned, and
issues a warning.  Also, 'rebase' is covered because
"git gc --auto" is run from either merge or am.

* ph/strbuf (Wed Sep 5 21:18:43 2007 +0200) 7 commits
 - Use strbuf in cache-tree.c
 - Use strbuf in builtin-rerere.c
 - Use strbuf in apply, blame, commit-tree and diff
 - mktree: Simplify write_tree() using strbuf's.
 - fast-import: Use strbuf API, and simplify cmd_data()
 - Simplify strbuf uses in archive-tar.c using strbuf API
 - Rework strbuf API and semantics.

The idea is good, and removes more code than it adds, but I find
it not 'next' material yet.  I haven't checked every single line
yet, and this series needs that kind of vetting.

* jc/pathspec (Tue Sep 4 02:47:25 2007 -0700) 1 commit
 - tree-diff.c: split out a function to match a single pattern.

Just started and not even started to cause breakage yet ;-).
I'd want to fix pathspec semantics of "diff-tree", "log" and
"ls-tree" so that they understand globs in addition to leading
directory prefix, just like "ls-files", "diff-files",
"diff-index" and "grep" do.

* jc/diff (Mon Dec 25 01:08:50 2006 -0800) 2 commits
 - test-para: combined diff between HEAD, index and working tree.
 - para-walk: walk n trees, index and working tree in parallel

Have been on hold for a long time.  This is about traversing the
index, work tree and zero or more trees in parallel, which is
one way to rewrite the merge backend.  I may end up reusing
merge-tree.c implementation which would make this series
unnecessary.

* Re: What's cooking in git.git (topics)
       [not found] ` <7v1wd1d0le.fsf@gitster.siamese.dyndns.org>
@ 2007-09-14 18:30   ` Shawn O. Pearce
  2007-09-15  7:47     ` db/fetch-pack topic (was: What's cooking in git.git (topics)) Shawn O. Pearce
  2007-09-14 23:47   ` What's cooking in git.git (topics) Johannes Schindelin
  2007-09-26 20:05   ` Junio C Hamano
  2 siblings, 1 reply; 44+ messages in thread
From: Shawn O. Pearce @ 2007-09-14 18:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> wrote:
> * db/fetch-pack (Fri Sep 14 03:31:25 2007 -0400) 22 commits
...
> This is Daniel's fetch-pack in C plus fixups from Shawn.
> Unfortunately the fixups breaks t3200 ("*** glibc detected ***
> fetch: free(): invalid pointer xxx ***"), which I haven't looked
> into yet.

Doesn't crash out on my Mac OS X system but I am getting the
above failure on my amd64 Linux system.  I'm debugging it now.
I'll have to quit in about an hour and pick it up later, so don't
expect a patch immediately.  But I'll certainly send something soon.
Clearly I made a change in my fixups that I shouldn't have.  ;-)

-- 
Shawn.

* Re: What's cooking in git.git (topics)
       [not found] ` <7v1wd1d0le.fsf@gitster.siamese.dyndns.org>
  2007-09-14 18:30   ` Shawn O. Pearce
@ 2007-09-14 23:47   ` Johannes Schindelin
  2007-09-26 21:07     ` Carlos Rica
  2007-09-26 20:05   ` Junio C Hamano
  2 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-09-14 23:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi,

On Fri, 14 Sep 2007, Junio C Hamano wrote:

> * cr/reset (Fri Sep 14 01:19:30 2007 -0700) 5 commits
>  + Simplify cache API
>  + An additional test for "git-reset -- path"
>  + Make "git reset" a builtin.
>  + Move make_cache_entry() from merge-recursive.c into read-cache.c
>  + Add tests for documented features of "git reset".
> 
> I found "git reset commit paths..." had problem in this series,
> which was why jc/cachetree is merged into this topic to fix it.
> Hopefully we can put this in 'master' soon, after giving it
> another and final round of eyeballing.

I think this is my bug.  The initial reset-with-paths functionality in the
builtin reset came from my pen... Sorry!

Ciao,
Dscho

* Re: db/fetch-pack topic (was: What's cooking in git.git (topics))
  2007-09-14 18:30   ` Shawn O. Pearce
@ 2007-09-15  7:47     ` Shawn O. Pearce
  2007-09-16  4:03       ` Shawn O. Pearce
  0 siblings, 1 reply; 44+ messages in thread
From: Shawn O. Pearce @ 2007-09-15  7:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> Junio C Hamano <gitster@pobox.com> wrote:
> > * db/fetch-pack (Fri Sep 14 03:31:25 2007 -0400) 22 commits
> ...
> > This is Daniel's fetch-pack in C plus fixups from Shawn.
> > Unfortunately the fixups breaks t3200 ("*** glibc detected ***
> > fetch: free(): invalid pointer xxx ***"), which I haven't looked
> > into yet.
> 
> Doesn't crash out on my Mac OS X system but I am getting the
> above failure on my amd64 Linux system.

OK, so in addition to the above fixups Junio mentions I have
also sent two more series today:

  2 patch "fixup of the fixup" to resolve the t3200 crash
  5 patch "fixup + cleanup" of http support

and now I just discovered that pushing to yourself is probably also
broken by this series:

  $ git push . jc/maint:gfi-maint
  updating 'refs/heads/gfi-maint' using 'refs/remotes/jc/maint'
    from 05cc2ffc572f05e8aeec495a9ab9bc9609863491
    to   8419d2ee9ba8b375186a5c1019df8dfbce610aba
  Also local refs/heads/gfi-maint
  Generating pack...
  Done counting 0 objects.
  Writing 0 objects...
  Unpacking 0 objects...
  error: Ref refs/heads/gfi-maint is at 8419d2ee9ba8b375186a5c1019df8dfbce610aba but expected 05cc2ffc572f05e8aeec495a9ab9bc9609863491
  error: failed to lock refs/heads/gfi-maint
  Total 0 (delta 0), reused 0 (delta 0)
  ng refs/heads/gfi-maint failed to lock
  error: failed to push to '.'

What's really exciting is we actually updated the ref 'gfi-maint',
even though it was "ng" and we failed to push.  Yup.  More work
for me to look at tomorrow.  Right now I think I'm all memory
corruptioned out for *** brain detected *** Shawn(): nextword():
too tired, try sleep ***

;-)

-- 
Shawn.

* Re: db/fetch-pack topic (was: What's cooking in git.git (topics))
  2007-09-15  7:47     ` db/fetch-pack topic (was: What's cooking in git.git (topics)) Shawn O. Pearce
@ 2007-09-16  4:03       ` Shawn O. Pearce
  0 siblings, 0 replies; 44+ messages in thread
From: Shawn O. Pearce @ 2007-09-16  4:03 UTC (permalink / raw)
  To: git

> "Shawn O. Pearce" <spearce@spearce.org> wrote:
> > Junio C Hamano <gitster@pobox.com> wrote:
> > > * db/fetch-pack (Fri Sep 14 03:31:25 2007 -0400) 22 commits
> > ...
> > > This is Daniel's fetch-pack in C plus fixups from Shawn.
> > > Unfortunately the fixups breaks t3200 ("*** glibc detected ***
> > > fetch: free(): invalid pointer xxx ***"), which I haven't looked
> > > into yet.

With all of my fixups applied I took some performance measurements
today from two of my larger day-job repositories:

  A.git (356 branches):
    Current Fetch             New "Builtin" Fetch
    ------------------        --------------------
    real    3m19.921s         real    0m9.969s
    user    1m58.050s         user    0m1.449s
    sys     2m4.747s          sys     0m4.465s

  B.git (24 branches):
    Current Fetch             New "Builtin" Fetch
    ------------------        --------------------
    real    0m21.250s         real    0m4.735s
    user    0m10.981s         user    0m0.510s
    sys     0m12.387s         sys     0m1.481s

All runs were a no-op fetch over SSH on the LAN.  The system running
git-fetch was a Cygwin/Windows 2000 box and the server was Solaris 9.
Just starting an SSH connection (`ssh solsrv echo hi`) takes me on
average 1 second so a 4.7s no-op fetch on the smaller repository
is very respectable.

3m19s vs. 9s?  I'll take 9s, thankyouverymuch.  Even if I have
more testing and debugging to do.  Especially since the two repos
above are only a sampling of the actual set I have to deal with on
a daily basis.

-- 
Shawn.

* What's cooking in git.git (topics)
       [not found] ` <7v1wd1d0le.fsf@gitster.siamese.dyndns.org>
  2007-09-14 18:30   ` Shawn O. Pearce
  2007-09-14 23:47   ` What's cooking in git.git (topics) Johannes Schindelin
@ 2007-09-26 20:05   ` Junio C Hamano
  2007-09-26 21:44     ` Johannes Schindelin
                       ` (3 more replies)
  2 siblings, 4 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-09-26 20:05 UTC (permalink / raw)
  To: git

Here are the topics that have been cooking.  Commits prefixed
with '-' are only in 'pu' while commits prefixed with '+' are
in 'next'.  The topics list the commits in reverse chronological
order.

* jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
 + diffcore-rename: cache file deltas

Parked in 'next' for now but is 'master' material.

* mv/unknown (Tue Sep 25 16:38:46 2007 +0200) 1 commit
 + Don't use "<unknown>" for placeholders and suppress printing of
   empty user formats.

Parked in 'next'; I was already burned by it not passing one of
the test cases, and I am not absolutely certain what else this
subtly breaks.  Hopefully minor.

* jb/remote-rm (Sun Sep 23 22:29:12 2007 -0700) 3 commits
 + git-remote rm: add tests and minor fix-ups
 + remote: document the 'rm' subcommand
 + remote: add 'rm' subcommand

Should be Ok to push out to 'master'.

* ml/submodule (Sun Sep 23 22:19:42 2007 -0400) 1 commit
 + git-submodule - allow a relative path as the subproject url

Should be Ok to push out to 'master'.

* lh/merge (Mon Sep 24 00:51:45 2007 +0200) 6 commits
 + git-merge: add --ff and --no-ff options
 + git-merge: add support for --commit and --no-squash
 + git-merge: add support for branch.<name>.mergeoptions
 + git-merge: refactor option parsing
 + git-merge: fix faulty SQUASH_MSG
 + Add test-script for git-merge porcelain

Comments?  I personally never felt need for --no-ff but the
series is reasonably clean so I do not see strong objection
against this series either.

* sv/svn (Fri Sep 21 15:27:01 2007 +1200) 3 commits
 + git-svn: handle changed svn command-line syntax
 + git-svn: fix test for trunk svn (transaction out of date)
 + git-svn: fix test for trunk svn (commit message not needed)

Will merge to 'master' this weekend.

* js/rebase-i (Tue Sep 25 16:43:15 2007 +0100) 1 commit
 + rebase -i: work on a detached HEAD

Waiting for autogc change as this textually interacts with it,
and the additional convenience can wait.

* jc/autogc (Mon Sep 17 00:55:13 2007 -0700) 10 commits
 + git-gc --auto: run "repack -A -d -l" as necessary.
 + git-gc --auto: restructure the way "repack" command line is built.
 + git-gc --auto: protect ourselves from accumulated cruft
 + git-gc --auto: add documentation.
 + git-gc --auto: move threshold check to need_to_gc() function.
 + repack -A -d: use --keep-unreachable when repacking
 + pack-objects --keep-unreachable
 + Export matches_pack_name() and fix its return value
 + Invoke "git gc --auto" from commit, merge, am and rebase.
 + Implement git gc --auto

I think the only thing left to do for this topic is to prevent
more than one instance of it from running at the same time.
Any takers?

* ph/strbuf (Tue Sep 25 10:22:44 2007 +0200) 37 commits
 + Small cache_tree_write refactor.
 + Make builtin-rerere use of strbuf nicer and more efficient.
 + Add strbuf_cmp.
 + strbuf_setlen(): do not barf on setting length of an empty buffer
   to 0
 + sq_quote_argv and add_to_string rework with strbuf's.
 + Full rework of quote_c_style and write_name_quoted.
 + ...

I had to make a small fix-up to strbuf_setlen() last night to
this series; this should be ready for 'master'.

And it is better to push this out early, as the series touches
everywhere and conflicts with people's patches.

* db/fetch-pack (Tue Sep 25 00:13:25 2007 -0400) 45 commits
 + Prevent send-pack from segfaulting when a branch doesn't match
 + Cleanup unnecessary break in remote.c
 + Cleanup style nit of 'x == NULL' in remote.c
 + Fix memory leaks when disconnecting transport instances
 + Ensure builtin-fetch honors {fetch,transfer}.unpackLimit
 + ...

Two issues known to me are:

 - "rsync" transport is not supported yet;

 - regresses "git pull <name>" using .git/remotes/<name>; does
   not merge the first refspec when branch.<name>.merge is not
   set.

There may be others but some people apparently use this in
production (including me) and I do not expect major breakages in
the really essential part.

* ss/svnimport (Mon Sep 24 12:57:40 2007 +0200) 1 commit
 + Fix pool handling in git-svnimport to avoid memory leaks.

This is meant to eventually go to 'maint' as well, but with the
diminishing user base of svnimport it is getting harder to get
good "tested successfully, seen improvements" reports.

* jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
 + rebase: allow starting from a dirty tree.
 + stash: implement "stash create"

I think "stash create" is going in a good direction, but I do
not think rebase should unstash unconditionally on the resulting
work tree.  A good compromise might be not to unstash if the
user asked to switch branches first and to unstash if he didn't.

* kh/commit (Mon Sep 17 20:06:48 2007 -0400) 7 commits
 - Implement git commit as a builtin command.
 - Export rerere() and launch_editor().
 - Add strbuf_read_file().
 - Clean up stripspace a bit, use strbuf even more.
 - Introduce entry point for launching add--interactive.
 - Enable wt-status to run against non-standard index file.
 - Enable wt-status output to a given FILE pointer.

There were a few updates/replacements sent to the list that I missed.

* gr/smtp (Tue Sep 25 17:27:54 2007 -0700) 2 commits
 - [TO BE SQUASHED] Fix-up after review
 - Add ability to specify SMTP server port when using git-send-email.

Will be in 'next'.

* Re: What's cooking in git.git (topics)
  2007-09-14 23:47   ` What's cooking in git.git (topics) Johannes Schindelin
@ 2007-09-26 21:07     ` Carlos Rica
  0 siblings, 0 replies; 44+ messages in thread
From: Carlos Rica @ 2007-09-26 21:07 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

2007/9/15, Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> On Fri, 14 Sep 2007, Junio C Hamano wrote:
>
> > * cr/reset (Fri Sep 14 01:19:30 2007 -0700) 5 commits
> >  + Simplify cache API
> >  + An additional test for "git-reset -- path"
> >  + Make "git reset" a builtin.
> >  + Move make_cache_entry() from merge-recursive.c into read-cache.c
> >  + Add tests for documented features of "git reset".
> >
> > I found "git reset commit paths..." had problem in this series,
> > which was why jc/cachetree is merged into this topic to fix it.
> > Hopefully we can put this in 'master' soon, after giving it
> > another and final round of eyeballing.
>
> I think this is my bug.  The initial reset-with-paths functionality in the
> builtin reset came out of my feather... Sorry!

I'm sorry for not being able to review that code, but it was
a bit too hard for my current knowledge of git's internals.
These days I'm very busy here and I cannot continue with that work,
but I will be back on git soon to learn more. Please feel free to rework
the code of builtin-reset.c without waiting for me if anyone is
interested in getting it done ASAP.

--
Carlos

* Re: What's cooking in git.git (topics)
  2007-09-26 20:05   ` Junio C Hamano
@ 2007-09-26 21:44     ` Johannes Schindelin
  2007-09-26 21:53       ` Tom Clarke
  2007-09-27  2:36     ` Jeff King
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-09-26 21:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi,

On Wed, 26 Sep 2007, Junio C Hamano wrote:

> * mv/unknown (Tue Sep 25 16:38:46 2007 +0200) 1 commit
>  + Don't use "<unknown>" for placeholders and suppress printing of
>    empty user formats.
> 
> Parked in 'next'; I was already burned by it not passing one of the test 
> cases, and I am not absolutely certain what else this subtly breaks.  
> Hopefully minor.

I guess a few scripts could maybe rely on this behaviour.  We should 
advertise it as such.

> * lh/merge (Mon Sep 24 00:51:45 2007 +0200) 6 commits
>  + git-merge: add --ff and --no-ff options
>  + git-merge: add support for --commit and --no-squash
>  + git-merge: add support for branch.<name>.mergeoptions
>  + git-merge: refactor option parsing
>  + git-merge: fix faulty SQUASH_MSG
>  + Add test-script for git-merge porcelain
> 
> Comments?  I personally never felt need for --no-ff but the series is 
> reasonably clean so I do not see strong objection against this series 
> either.

Together with a resubmitted git-merge-rebase.sh (hint, hint), the 
mergeOptions would be quite useful for a workflow where you want to rebase 
on top of an upstream quite often.

> * js/rebase-i (Tue Sep 25 16:43:15 2007 +0100) 1 commit
>  + rebase -i: work on a detached HEAD
> 
> Waiting for autogc change as this textually interacts with it, and the 
> additional convenience can wait.

Sure.  I never used it anyway, but you specifically requested it ;-)  BTW
thanks for merging the rest; the progress meter especially has long been
a sore point for me.

> * jc/autogc (Mon Sep 17 00:55:13 2007 -0700) 10 commits
>  + git-gc --auto: run "repack -A -d -l" as necessary.
>  + git-gc --auto: restructure the way "repack" command line is built.
>  + git-gc --auto: protect ourselves from accumulated cruft
>  + git-gc --auto: add documentation.
>  + git-gc --auto: move threshold check to need_to_gc() function.
>  + repack -A -d: use --keep-unreachable when repacking
>  + pack-objects --keep-unreachable
>  + Export matches_pack_name() and fix its return value
>  + Invoke "git gc --auto" from commit, merge, am and rebase.
>  + Implement git gc --auto
> 
> I think the only remaining thing left with this thing is to prevent more 
> than one instances of it from running at the same time.  Any takers?

You mean, just creating a throw-away lock file?
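
A throw-away lock file could be sketched roughly as below.  This is
only an illustration under assumptions: try_lock() and release_lock()
are made-up names, the lock path is left to the caller, and this is
not git's own lockfile API.  It relies only on POSIX open() failing
with EEXIST when O_CREAT and O_EXCL are combined and the file is
already there.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not git's own lockfile API: try to become the
 * only running instance by creating `path` exclusively.
 * Returns 1 if we now hold the lock, 0 if somebody else holds it,
 * and -1 on any other error.
 */
static int try_lock(const char *path)
{
	int fd;

	assert(path != NULL);
	/* O_EXCL makes creation fail with EEXIST if the file already exists */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return errno == EEXIST ? 0 : -1;
	close(fd);
	return 1;
}

/* Release the lock by removing the file; returns 0 on success. */
static int release_lock(const char *path)
{
	return unlink(path);
}
```

A caller would skip the automatic gc when try_lock() returns 0.  One
caveat of this simple scheme is that a process that dies without
calling release_lock() leaves a stale lock file behind.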

> * ph/strbuf (Tue Sep 25 10:22:44 2007 +0200) 37 commits
>  + Small cache_tree_write refactor.
>  + Make builtin-rerere use of strbuf nicer and more efficient.
>  + Add strbuf_cmp.
>  + strbuf_setlen(): do not barf on setting length of an empty buffer
>    to 0
>  + sq_quote_argv and add_to_string rework with strbuf's.
>  + Full rework of quote_c_style and write_name_quoted.
>  + ...
> 
> I had to make a small fix-up to strbuf_setlen() last night to this 
> series; this should be ready for 'master'.
> 
> And it is better to push this out early, as the series touches 
> everywhere and conflicts with peoples' patches.

Hehe.  Indeed, I had to fix the notes series after rebasing it...

> * db/fetch-pack (Tue Sep 25 00:13:25 2007 -0400) 45 commits
>  + Prevent send-pack from segfaulting when a branch doesn't match
>  + Cleanup unnecessary break in remote.c
>  + Cleanup style nit of 'x == NULL' in remote.c
>  + Fix memory leaks when disconnecting transport instances
>  + Ensure builtin-fetch honors {fetch,transfer}.unpackLimit
>  + ...
> 
> Two issues known to me are:
> 
>  - "rsync" transport is not supported yet;

I promised to do this, and so I will today.

> * jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
>  + rebase: allow starting from a dirty tree.
>  + stash: implement "stash create"
> 
> I think "stash create" is going in a good direction, but I do not think 
> rebase should unstash unconditionally on the resulting work tree.  A 
> good compromise might be not to unstash if the user asked to switch 
> branches first and to unstash if he didn't.

Sounds like a sensible change to me; maybe a little warning after the 
rebase?

I have no idea whether I will get around to doing the same for
rebase--interactive any time soon, though.

Ciao,
Dscho

* Re: What's cooking in git.git (topics)
  2007-09-26 21:44     ` Johannes Schindelin
@ 2007-09-26 21:53       ` Tom Clarke
  0 siblings, 0 replies; 44+ messages in thread
From: Tom Clarke @ 2007-09-26 21:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

On 9/26/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Together with a resubmitted git-merge-rebase.sh (hint, hint), the
> mergeOptions would be quite useful for a workflow where you want to rebase
> on top of an upstream quite often.

I'll resubmit merge rebase tomorrow :-)

-Tom

* Re: What's cooking in git.git (topics)
  2007-09-26 20:05   ` Junio C Hamano
  2007-09-26 21:44     ` Johannes Schindelin
@ 2007-09-27  2:36     ` Jeff King
  2007-09-27  6:08       ` David Kastrup
  2007-10-02  4:16       ` Jeff King
  2007-09-28  3:24     ` Daniel Barkalow
  2007-10-02  5:53     ` Junio C Hamano
  3 siblings, 2 replies; 44+ messages in thread
From: Jeff King @ 2007-09-27  2:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Sep 26, 2007 at 01:05:59PM -0700, Junio C Hamano wrote:

> * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
>  + diffcore-rename: cache file deltas
> 
> Parked in 'next' for now but is 'master' material.

My tests after this patch show that spanhash_find is responsible for
a large portion of the processing time in large renames, so I am going
to look into speeding that up.

> * lh/merge (Mon Sep 24 00:51:45 2007 +0200) 6 commits
>  + git-merge: add --ff and --no-ff options
>  + git-merge: add support for --commit and --no-squash
>  + git-merge: add support for branch.<name>.mergeoptions
>  + git-merge: refactor option parsing
>  + git-merge: fix faulty SQUASH_MSG
>  + Add test-script for git-merge porcelain
> 
> Comments?  I personally never felt need for --no-ff but the
> series is reasonably clean so I do not see strong objection
> against this series either.

I like it. I know that --no-ff is frowned upon, but I think previous
discussions have mentioned workflows where it might be used
intelligently. Since the patch is unlikely to break anything for
traditional workflows, I think it is a nice way to let people experiment
with alternative workflows that use --no-ff. Maybe something interesting
will come of it.

-Peff

* Re: What's cooking in git.git (topics)
  2007-09-27  2:36     ` Jeff King
@ 2007-09-27  6:08       ` David Kastrup
  2007-09-27  6:43         ` David Kastrup
  2007-09-27 13:30         ` Jeff King
  2007-10-02  4:16       ` Jeff King
  1 sibling, 2 replies; 44+ messages in thread
From: David Kastrup @ 2007-09-27  6:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King <peff@peff.net> writes:

> On Wed, Sep 26, 2007 at 01:05:59PM -0700, Junio C Hamano wrote:
>
>> * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
>>  + diffcore-rename: cache file deltas
>> 
>> Parked in 'next' for now but is 'master' material.
>
> My tests after this patch show that spanhash_find is responsible for
> a large portion of the processing time in large renames, so I am going
> to look into speeding that up.

In itself, it does not look like there is all too much room for
optimization.  One can remove the temporary pointer "optimization" and
see whether this makes strength reduction possible for the compiler.
Making this an endless loop wrapped around a loop on bucket might also
help the compiler in that effect.

But there is really not all too much leeway, and it might be better
spent in the caller.  For example, the search will take something like
r/(1-r) iterations on average where r is the fill ratio of the hash
array.  So one would not want to, say, let r grow above 0.75 or
something like that.
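
To put numbers on that estimate, here is a tiny sketch.  The function
name probe_estimate() is made up for this illustration, and it simply
evaluates the r/(1-r) figure quoted above rather than measuring the
real hash code.

```c
#include <assert.h>

/*
 * Rough estimate from the text above: with fill ratio r (0 <= r < 1),
 * a search takes something like r / (1 - r) iterations on average.
 * The name is hypothetical; nothing like it exists in git.
 */
static double probe_estimate(double r)
{
	assert(r >= 0.0 && r < 1.0);
	return r / (1.0 - r);
}
```

This gives 1 iteration at r = 0.5, 3 at r = 0.75 and about 9 at
r = 0.9, which is why a cap somewhere around 0.75 looks reasonable.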

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: What's cooking in git.git (topics)
  2007-09-27  6:08       ` David Kastrup
@ 2007-09-27  6:43         ` David Kastrup
  2007-09-27 13:30         ` Jeff King
  1 sibling, 0 replies; 44+ messages in thread
From: David Kastrup @ 2007-09-27  6:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

David Kastrup <dak@gnu.org> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Wed, Sep 26, 2007 at 01:05:59PM -0700, Junio C Hamano wrote:
>>
>>> * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
>>>  + diffcore-rename: cache file deltas
>>> 
>>> Parked in 'next' for now but is 'master' material.
>>
>> My tests after this patch show that spanhash_find is responsible for
>> a large portion of the processing time in large renames, so I am going
>> to look into speeding that up.
>
> In itself, it does not look like there is all too much room for
> optimization.  One can remove the temporary pointer "optimization" and
> see whether this makes strength reduction possible for the compiler.
> Making this an endless loop wrapped around a loop on bucket might also
> help the compiler in that effect.
>
> But there is really not all too much leeway, and it might be better
> spent in the caller.  For example, the search will take something like
> r/(1-r) iterations on average where r is the fill ratio of the hash
> array.  So one would not want to, say, let r grow above 0.75 or
> something like that.

OK, here is a suggestion.

Here is the inner loop in question:

	for (i = 0; i < ssz; i++) {
		struct spanhash *s = &(src_count->data[i]);
		struct spanhash *d;
		unsigned dst_cnt, src_cnt;
		if (!s->cnt)
			continue;
		src_cnt = s->cnt;
		d = spanhash_find(dst_count, s->hashval);
		dst_cnt = d ? d->cnt : 0;
		if (src_cnt < dst_cnt) {
			la += dst_cnt - src_cnt;
			sc += src_cnt;
		}
		else
			sc += dst_cnt;
	}

Now here is how one could optimize the data structures: The hash
structures use linear probing, and we try to find any hash
matches from source to destination.  If we sort all hashes that map to
a given first hash bucket by their full hash value, then one could
basically use passes similar to list merges for figuring out the 1:1
relations.  That cuts down the O(l n) cost (where n is the number of
elements and l their average run length) to O(n).

Of course, making l close to 1 by keeping the hash utilization
reasonably low is much simpler.
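
A minimal sketch of such a merge-style pass, under simplifying
assumptions: both sides are plain arrays already sorted by ascending
hashval, each hashval appears at most once per array, and struct span
and count_changes_sorted() are made-up names modeled on the fields
used in the loop above.  It accumulates the same la and sc totals as
that loop, but walks each array only once instead of doing a lookup
per source entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the entries used in the loop above. */
struct span {
	unsigned hashval;
	unsigned cnt;
};

/*
 * Merge-style pass over two arrays sorted by ascending hashval.
 * Produces the same "literal added" (la) and "source copied" (sc)
 * totals as the lookup-based loop, visiting every entry once.
 */
static void count_changes_sorted(const struct span *src, size_t nsrc,
				 const struct span *dst, size_t ndst,
				 unsigned long *la, unsigned long *sc)
{
	size_t i, j = 0;

	assert(la != NULL && sc != NULL);
	*la = *sc = 0;
	for (i = 0; i < nsrc; i++) {
		unsigned src_cnt = src[i].cnt;
		unsigned dst_cnt = 0;

		if (!src_cnt)
			continue;
		/* skip destination entries that sort before this one */
		while (j < ndst && dst[j].hashval < src[i].hashval)
			j++;
		if (j < ndst && dst[j].hashval == src[i].hashval)
			dst_cnt = dst[j].cnt;
		if (src_cnt < dst_cnt) {
			*la += dst_cnt - src_cnt;
			*sc += src_cnt;
		} else {
			*sc += dst_cnt;
		}
	}
}
```

Because j never moves backwards, the total work is proportional to
nsrc + ndst, which is the O(n) behaviour described above.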

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: What's cooking in git.git (topics)
  2007-09-27  6:08       ` David Kastrup
  2007-09-27  6:43         ` David Kastrup
@ 2007-09-27 13:30         ` Jeff King
  2007-09-27 13:46           ` David Kastrup
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-09-27 13:30 UTC (permalink / raw)
  To: David Kastrup; +Cc: Junio C Hamano, git

On Thu, Sep 27, 2007 at 08:08:44AM +0200, David Kastrup wrote:

> In itself, it does not look like there is all too much room for
> optimization.  One can remove the temporary pointer "optimization" and
> see whether this makes strength reduction possible for the compiler.
> Making this an endless loop wrapped around a loop on bucket might also
> help the compiler in that effect.

I am considering reworking the data structure to be a hash table whose
buckets never overflow. However, Junio indicated that he tried something
similar at one point and was not successful. So we will see. I haven't
had time to play with it yet, but I will post numbers when I do.

-Peff

* Re: What's cooking in git.git (topics)
  2007-09-27 13:30         ` Jeff King
@ 2007-09-27 13:46           ` David Kastrup
  0 siblings, 0 replies; 44+ messages in thread
From: David Kastrup @ 2007-09-27 13:46 UTC (permalink / raw)
  To: git

Jeff King <peff@peff.net> writes:

> On Thu, Sep 27, 2007 at 08:08:44AM +0200, David Kastrup wrote:
>
>> In itself, it does not look like there is all too much room for
>> optimization.  One can remove the temporary pointer "optimization" and
>> see whether this makes strength reduction possible for the compiler.
>> Making this an endless loop wrapped around a loop on bucket might also
>> help the compiler in that effect.
>
> I am considering reworking the data structure to be a hash table
> whose buckets never overflow. However, Junio indicated that he tried
> something similar at one point and was not successful. So we will
> see. I haven't had time to play with it yet, but I will post numbers
> when I do.

Linear probing is pretty efficient with regard to keeping memory
access locality.  With a reasonable table filling ratio (not more than
something like 75%, for which it is necessary to know the maximum
number of hashable entries in advance), there is no gain to be
expected in either speed or even memory usage (the waste of 25% is
offset by not needing space for link pointers) with escape lists.
Linear probing hashes are quite hard to resize: if the maximum member
count is _not_ to be guessed in advance, things might look different.
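
For reference, a fixed-size linear probing table with a 75% fill cap
can be sketched as below.  All names here (struct table, table_find(),
table_add(), TABLE_SIZE, MAX_USED) are invented for the illustration
and this is not the spanhash code; it assumes the table starts out
zeroed and uses cnt == 0 to mark an empty slot.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 16			/* power of two, so we can mask */
#define MAX_USED (TABLE_SIZE * 3 / 4)	/* keep the fill ratio at or below 75% */

struct entry {
	unsigned hashval;
	unsigned cnt;			/* 0 marks an empty slot */
};

struct table {
	struct entry slot[TABLE_SIZE];
	unsigned used;
};

/* Walk forward from the home slot until we hit the key or an empty slot. */
static struct entry *table_find(struct table *t, unsigned hashval)
{
	unsigned i = hashval & (TABLE_SIZE - 1);

	while (t->slot[i].cnt) {
		if (t->slot[i].hashval == hashval)
			return &t->slot[i];
		i = (i + 1) & (TABLE_SIZE - 1);
	}
	return NULL;
}

/*
 * Count one more occurrence of hashval.  Returns 0 on success and -1
 * if adding a new entry would push the table past MAX_USED.
 */
static int table_add(struct table *t, unsigned hashval)
{
	unsigned i = hashval & (TABLE_SIZE - 1);

	assert(t->used <= MAX_USED);
	while (t->slot[i].cnt) {
		if (t->slot[i].hashval == hashval) {
			t->slot[i].cnt++;
			return 0;
		}
		i = (i + 1) & (TABLE_SIZE - 1);
	}
	if (t->used == MAX_USED)
		return -1;
	t->slot[i].hashval = hashval;
	t->slot[i].cnt = 1;
	t->used++;
	return 0;
}
```

The cap guarantees that at least one slot stays empty, so both loops
terminate.  It also shows the resizing problem: once table_add()
returns -1, the only way forward is to allocate a bigger table and
reinsert every entry.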

I don't have the time to look at the code right now, so I don't know
whether resizing or unknown maximum size is a relevant factor.

-- 
David Kastrup

* Re: What's cooking in git.git (topics)
  2007-09-26 20:05   ` Junio C Hamano
  2007-09-26 21:44     ` Johannes Schindelin
  2007-09-27  2:36     ` Jeff King
@ 2007-09-28  3:24     ` Daniel Barkalow
  2007-10-02  5:53     ` Junio C Hamano
  3 siblings, 0 replies; 44+ messages in thread
From: Daniel Barkalow @ 2007-09-28  3:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 26 Sep 2007, Junio C Hamano wrote:

> * db/fetch-pack (Tue Sep 25 00:13:25 2007 -0400) 45 commits
>  + Prevent send-pack from segfaulting when a branch doesn't match
>  + Cleanup unnecessary break in remote.c
>  + Cleanup style nit of 'x == NULL' in remote.c
>  + Fix memory leaks when disconnecting transport instances
>  + Ensure builtin-fetch honors {fetch,transfer}.unpackLimit
>  + ...
> 
> Two issues known to me are:
> 
>  - "rsync" transport is not supported yet;
> 
>  - regresses "git pull <name>" using .git/remotes/<name>; does
>    not merge the first refspec when branch.<name>.merge is not
>    set.

Actually, I think it's only that it doesn't merge anything if the first 
refspec doesn't match the first fetched ref from the server. I'm running 
tests on a patch that should fix it, but I'm not sure how to write a test 
for this particular case; I think the fetch tests try to test this, but 
pass by chance.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-09-27  2:36     ` Jeff King
  2007-09-27  6:08       ` David Kastrup
@ 2007-10-02  4:16       ` Jeff King
  2007-10-02  5:01         ` Junio C Hamano
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-02  4:16 UTC (permalink / raw)
  To: git

On Wed, Sep 26, 2007 at 10:36:33PM -0400, Jeff King wrote:

> > * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
> >  + diffcore-rename: cache file deltas
> > 
> > Parked in 'next' for now but is 'master' material.
> 
> My tests after this patch show that spanhash_find is responsible for
> a large portion of the processing time in large renames, so I am going
> to look into speeding that up.

Just to update, I tried using a non-colliding hash for this (at the
expense of much memory), and I wasn't able to get things much faster
(and certainly not worth the explosion in memory), short of reducing the
size of the hash (which is going to reduce the quality of the output).
So I am giving up for the time being, but if others are interested in
trying to speed things up, I would be happy to discuss ideas.

-Peff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  4:16       ` Jeff King
@ 2007-10-02  5:01         ` Junio C Hamano
  2007-10-02  5:08           ` Jeff King
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-10-02  5:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> Just to update, I tried using a non-colliding hash for this (at the
> expense of much memory), and I wasn't able to get things much faster
> (and certainly not worth the explosion in memory), short of reducing the
> size of the hash (which is going to reduce the quality of the output).
> So I am giving up for the time being, but if others are interested in
> trying to speed things up, I would be happy to discuss ideas.

Bummer.  You are giving up at the same place I gave up the last
time.  I was somehow hoping that other people are more clever
and determined than I was ;-).

Thanks for trying.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:01         ` Junio C Hamano
@ 2007-10-02  5:08           ` Jeff King
  2007-10-02  5:13             ` Jeff King
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-02  5:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Oct 01, 2007 at 10:01:16PM -0700, Junio C Hamano wrote:

> > Just to update, I tried using a non-colliding hash for this (at the
> > expense of much memory), and I wasn't able to get things much faster
> > (and certainly not worth the explosion in memory), short of reducing the
> > size of the hash (which is going to reduce the quality of the output).
> > So I am giving up for the time being, but if others are interested in
> > trying to speed things up, I would be happy to discuss ideas.
> 
> Bummer.  You are giving up at the same place I gave up the last
> time.  I was somehow hoping that other people are more clever
> and determined than I was ;-).
> 
> Thanks for trying.

What was so discouraging is that I literally simplified the process to

  for(i = 0; i < HASH_SIZE; i++)
      if(src[i] < dst[i])
        ...

and it spent all of the time on that one conditional.

One approach which I haven't tried but might be promising is to actually
keep each list sorted, and then do a "merge" of the two lists, comparing
as you go. We don't really need to do arbitrary lookups in the hash; we
just need to compare two hash tables at a time. My approach was to be
simple, but have O(HASH_SIZE) comparisons (where HASH_SIZE is on the
order of 2^17), and that's clearly just too big. But with a list merge,
it should be O(n), where n is the actual number of lines in the files
(or binary chunks for the binary case).

-Peff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:08           ` Jeff King
@ 2007-10-02  5:13             ` Jeff King
  2007-10-02  6:10               ` David Kastrup
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-02  5:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Oct 02, 2007 at 01:08:20AM -0400, Jeff King wrote:

> One approach which I haven't tried but might be promising is to actually
> keep each list sorted, and then do a "merge" of the two lists, comparing
> as you go. We don't really need to do arbitrary lookups in the hash; we
> just need to compare two hash tables at a time. My approach was to be
> simple, but have O(HASH_SIZE) comparisons (where HASH_SIZE is on the
> order of 2^17), and that's clearly just too big. But with a list merge,
> it should be O(n), where n is the actual number of lines in the files
> (or binary chunks for the binary case).

BTW, I don't want to steal credit for this idea...it comes from thinking
about what David Kastrup said earlier in the thread, though I think he
was proposing sorting just inside buckets.

-Peff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* What's cooking in git.git (topics)
  2007-09-26 20:05   ` Junio C Hamano
                       ` (2 preceding siblings ...)
  2007-09-28  3:24     ` Daniel Barkalow
@ 2007-10-02  5:53     ` Junio C Hamano
  2007-10-02  6:41       ` Steven Grimm
                         ` (3 more replies)
  3 siblings, 4 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-10-02  5:53 UTC (permalink / raw)
  To: git

Here are the topics that have been cooking.  Commits prefixed
with '-' are only in 'pu' while commits prefixed with '+' are
in 'next'.  The topics list the commits in reverse chronological
order.

----------------------------------------------------------------

* ph/strbuf (Thu Sep 27 13:33:19 2007 +0200) 44 commits
 + Make read_patch_file work on a strbuf.
 + strbuf_read_file enhancement, and use it.
 + strbuf change: be sure ->buf is never ever NULL.
 + double free in builtin-update-index.c
 + Clean up stripspace a bit, use strbuf even more.
 + Add strbuf_read_file().
 ...

Will be in 'master' soon.  We've seen nice linecount reduction
and the end result is rather pleasant to read.

* jc/am-quiet (Mon Oct 1 00:27:51 2007 -0700) 2 commits
 + git-am: fix typo in the previous one.
 + git-am: make the output quieter.

Response against recent "rebase being too chatty" complaints.
This should be 'master' material.

* ap/dateformat (Fri Sep 28 15:17:45 2007 +0100) 3 commits
 + Make for-each-ref's grab_date() support per-atom formatting
 + Make for-each-ref allow atom names like "<name>:<something>"
 + parse_date_format(): convert a format name to an enum date_mode

With some test suite additions, this could go to 'master' soon.
Yes, that's a HINT, people ;-).

* je/hooks (Wed Sep 26 15:31:01 2007 -0600) 1 commit
 + post-checkout hook, tests, and docs

Will be in 'master' soon.

* db/fetch-pack (Mon Oct 1 00:59:39 2007 +0100) 49 commits
 + fetch/push: readd rsync support
 + Introduce remove_dir_recursively()
 + bundle transport: fix an alloc_ref() call
 + Allow abbreviations in the first refspec to be merged
 + Prevent send-pack from segfaulting when a branch doesn't match
 + Cleanup unnecessary break in remote.c
 ...

Has been cooking for quite a long time.

There was a regression that made me quite unhappy about the
rewrite, but Daniel fixed it, so I should be happy.  There is
another usability regression: http transport is now totally
silent.  Even when you fetch daily, if the other end frequently
repacks everything into one big ball of wax like repo.or.cz
does, you will end up transferring quite a large pack every
time, and the total lack of progress report is unacceptably
unnerving.  At least we should reinstate "Fetching blah from URL
using http", and preferably "walk $object_name" lines.  The
latter could be replaced with just a series of CR + "walked N
commits..." if we do not like too much output from the current
"walk $object_name" lines scrolling the other information away.

I am not sure about the quality of "rsync" transport near the
tip, either, but at least the change should not affect other
transports.  Nobody should be using rsync transport these
days anyway.  Perhaps we should put a deprecation notice in the
release notes to 1.5.4, and remove it three months later.

* jc/autogc (Mon Sep 17 00:55:13 2007 -0700) 10 commits
 + git-gc --auto: run "repack -A -d -l" as necessary.
 + git-gc --auto: restructure the way "repack" command line is built.
 + git-gc --auto: protect ourselves from accumulated cruft
 + git-gc --auto: add documentation.
 + git-gc --auto: move threshold check to need_to_gc() function.
 + repack -A -d: use --keep-unreachable when repacking
 + pack-objects --keep-unreachable
 + Export matches_pack_name() and fix its return value
 + Invoke "git gc --auto" from commit, merge, am and rebase.
 + Implement git gc --auto

I think this one is reasonably sane, but I was the one who wrote
it so people should take that with a grain of salt.  What it is
and isn't:

 - "gc --auto" is a way to prevent you from keeping your
   repository _grossly_ inefficient.  Ideally, if you ever
   repacked your repository once, and do the regular repository
   maintenance ("'git gc' before you leave for lunch every other
   day"), it should never trigger.

 - "gc --auto" is not something you can background.  We do not
   want to lock the repository and worry about associated stale
   lock, expiry etc.  The complexity is not worth it, compared
   to the stated purpose above (I suspect it might already be
   safe to run multiple instances at the same time, but
   analyzing whether it really is safe is not worth the effort,
   given the stated purpose above.  Just let it run synchronously
   if it triggers, but it should not trigger for you).

* js/rebase-i (Tue Sep 25 16:43:15 2007 +0100) 1 commit
 + rebase -i: work on a detached HEAD

Will be in 'master', together with "gc --auto", soon.

* mv/unknown (Tue Sep 25 16:38:46 2007 +0200) 1 commit
 + Don't use "<unknown>" for placeholders and suppress printing of
   empty user formats.

Will be in 'master' soon.

* lh/merge (Mon Sep 24 00:51:45 2007 +0200) 6 commits
 + git-merge: add --ff and --no-ff options
 + git-merge: add support for --commit and --no-squash
 + git-merge: add support for branch.<name>.mergeoptions
 + git-merge: refactor option parsing
 + git-merge: fix faulty SQUASH_MSG
 + Add test-script for git-merge porcelain

Will be in 'master' soon.

* jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
 + rebase: allow starting from a dirty tree.
 + stash: implement "stash create"

Instead of refusing to rebase, telling you that your work tree
is dirty, this stashes your local changes, runs rebase and then
unstashes automatically.  That _sounds_ nicer and easier to use,
but I am not sure if it is a wise/sane thing to do.  We may want
to revert the "autostash" from rebase.  Opinions?

* kh/commit (Mon Sep 17 20:06:47 2007 -0400)
 + Export rerere() and launch_editor().
 + Introduce entry point add_interactive and add_files_to_cache
 + Clean up stripspace a bit, use strbuf even more.
 + Add strbuf_read_file().
 + rerere: Fix use of an empty strbuf.buf
 + Small cache_tree_write refactor.
 ...

Stalled.

* jc/pathspec (Thu Sep 13 13:38:19 2007 -0700) 3 commits
 - pathspec_can_match(): move it from builtin-ls-tree.c to tree.c
 - ls-tree.c: refactor show_recursive() and rename it.
 - tree-diff.c: split out a function to match a single pattern.

Stalled.  This is about my pet-peeve that log (diff-tree) family
has much more limited pathspec semantics.  It should learn to glob
like ls-files and grep do.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:13             ` Jeff King
@ 2007-10-02  6:10               ` David Kastrup
  2007-10-02 16:11                 ` Jeff King
  0 siblings, 1 reply; 44+ messages in thread
From: David Kastrup @ 2007-10-02  6:10 UTC (permalink / raw)
  To: git

Jeff King <peff@peff.net> writes:

> On Tue, Oct 02, 2007 at 01:08:20AM -0400, Jeff King wrote:
>
>> One approach which I haven't tried but might be promising is to actually
>> keep each list sorted, and then do a "merge" of the two lists, comparing
>> as you go. We don't really need to do arbitrary lookups in the hash; we
>> just need to compare two hash tables at a time. My approach was to be
>> simple, but have O(HASH_SIZE) comparisons (where HASH_SIZE is on the
>> order of 2^17), and that's clearly just too big. But with a list merge,
>> it should be O(n), where n is the actual number of lines in the files
>> (or binary chunks for the binary case).
>
> BTW, I don't want to steal credit for this idea...it comes from thinking
> about what David Kastrup said earlier in the thread, though I think he
> was proposing sorting just inside buckets.

Yes: my proposal was about a microoptimization: work with the
basically existing data structures and put the already contained
information to best use.

I have not actually looked at the actual task that the structures are
going to be used in, and whether "reusing" the information is likely
to be worth the trouble.

When we are talking about buzzword compliance, "keep sorted" with the
meaning of "maintain sorted across modifications" has an O(n^2) or at
least O(nm) ring to it.  However, if it is possible to sort it just
once, and then only merge with other lists...

I am actually quite a fan of merge sort and have even posted a small
and quite efficient version to this list once.  However, merge sorts
were really greatest at the time when cache memory was unusual to
have.  Nowadays, quicksort or similar could be faster due to better
locality of memory accesses.  I think the glibc qsort more or less
uses an array-based merge into a separate memory area (unless it runs
out of memory in which case it resorts to regular quicksort).

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:53     ` Junio C Hamano
@ 2007-10-02  6:41       ` Steven Grimm
  2007-10-02  6:44       ` Steffen Prohaska
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Steven Grimm @ 2007-10-02  6:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> * jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
>  + rebase: allow starting from a dirty tree.
>  + stash: implement "stash create"
>
> Instead of refusing to rebase, telling you that your work tree
> is dirty, this stashes your local changes, runs rebase and then
> unstashes automatically.  That _sounds_ nicer and easier to use,
> but I am not sure if it is a wise/sane thing to do.  We may want
> to revert the "autostash" from rebase.  Opinions?
>   

I can say that for people coming from svn (who are often using "git svn 
rebase" rather than directly running "git rebase") this is a nice 
workflow improvement. It eliminates one more "Why is this more of a pain 
to do in git than in svn?" complaint.

I don't see any circumstance in my use of git -- either in a git-svn 
context or not -- where this wouldn't be an improvement over the 
existing behavior. However, I don't claim to be using git in any 
particularly interesting way, so I suppose it's possible that this will 
break someone's workflow horribly.

-Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:53     ` Junio C Hamano
  2007-10-02  6:41       ` Steven Grimm
@ 2007-10-02  6:44       ` Steffen Prohaska
  2007-10-02  7:03         ` Matthieu Moy
  2007-10-02 12:52       ` What's cooking in git.git (topics) Johannes Schindelin
  2007-10-02 17:00       ` Daniel Barkalow
  3 siblings, 1 reply; 44+ messages in thread
From: Steffen Prohaska @ 2007-10-02  6:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Oct 2, 2007, at 7:53 AM, Junio C Hamano wrote:

>
> * jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
>  + rebase: allow starting from a dirty tree.
>  + stash: implement "stash create"
>
> Instead of refusing to rebase, telling you that your work tree
> is dirty, this stashes your local changes, runs rebase and then
> unstashes automatically.  That _sounds_ nicer and easier to use,
> but I am not sure if it is a wise/sane thing to do.  We may want
> to revert the "autostash" from rebase.  Opinions?

What would happen if 'git stash' fails to work? Could this bring
the repo in a state that is hard to recover from? Especially if
'stash' commands were run automatically for you. Maybe if you had
a choice you would not choose to use stash but would commit your
changes, or would bring your work tree in a clean state by other means.

I'm a bit concerned because 'git stash' still doesn't work for me
when the work tree is dirty because of a changed subproject (in
msysgit with git 1.5.3). After I run 'git stash' the work tree stays
dirty. How would "autostash" behave?

BTW, I run 'git submodule update' to bring the tree into a clean
state and later manually check out the previous head in the submodule.
Quite annoying, but not directly related to the discussion above.

	Steffen

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  6:44       ` Steffen Prohaska
@ 2007-10-02  7:03         ` Matthieu Moy
  2007-10-02  7:21           ` Junio C Hamano
  0 siblings, 1 reply; 44+ messages in thread
From: Matthieu Moy @ 2007-10-02  7:03 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Junio C Hamano, git

Steffen Prohaska <prohaska@zib.de> writes:

> On Oct 2, 2007, at 7:53 AM, Junio C Hamano wrote:
>
>>
>> * jc/stash-create (Mon Jul 9 00:51:23 2007 -0700) 2 commits
>>  + rebase: allow starting from a dirty tree.
>>  + stash: implement "stash create"
>>
>> Instead of refusing to rebase, telling you that your work tree
>> is dirty, this stashes your local changes, runs rebase and then
>> unstashes automatically.  That _sounds_ nicer and easier to use,
>> but I am not sure if it is a wise/sane thing to do.  We may want
>> to revert the "autostash" from rebase.  Opinions?
>
> What would happen if 'git stash' fails to work? Could this bring
> the repo in a state that is hard to recover from? Especially if
> 'stash' commands were run automatically for you. Maybe if you had
> a choice you would not choose to use stash but would commit your
> changes, or would bring your work tree in a clean state by other means.

I'm also concerned about the possibility of stash/unstash to fail:
that means the user has to deal with two kinds of conflicts (rebase
can conflict, unstash can also), which can be confusing.

But the current behavior can be greatly improved by just making the
error message better. Currently, I have this:

$ git rebase branch
foo.txt: needs update
$ _

I'd prefer something like

$ git rebase branch
cannot rebase: the working tree is not clean.
foo.txt: Changed but not updated
Commit your changes, or put them aside with "git stash" and retry.
$ _

-- 
Matthieu

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  7:03         ` Matthieu Moy
@ 2007-10-02  7:21           ` Junio C Hamano
  2007-10-02  8:01             ` David Kågedal
  2007-10-02  8:07             ` Matthieu Moy
  0 siblings, 2 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-10-02  7:21 UTC (permalink / raw)
  To: git; +Cc: Steffen Prohaska, Matthieu Moy

Matthieu Moy <Matthieu.Moy@imag.fr> writes:

> I'm also concerned about the possibility of stash/unstash to fail:
> that means the user has to deal with two kinds of conflicts (rebase
> can conflict, unstash can also), which can be confusing.

Unstash should be invoked automatically _only_ after rebase
completes, so I do not foresee such confusion.

But the trouble I have with the auto unstashing is more at the
conceptual and workflow level.  You start rebasing a branch, and
your work tree is dirty.  What branch should the local
modification belong to?

Logically, it is with the branch you were on when you typed "git
rebase" command.

But when "rebase" successfully concludes (either with or without
manual fix-ups), you can be either on your original branch (if
you said "git rebase base") or something totally unrelated (if
you said "git rebase base other").  Currently we auto-unstash in
both cases.  I _think_ the former case should auto-unstash, but
the latter shouldn't.

However, this auto-stash uses the new "git stash create" command
that does not update the reflog for "refs/stash", specifically
so that the auto-stashing does _not_ interfere with the normal
stash the end user uses.  Which means that it is a bit
cumbersome to give the autostashed state back to the user if we
do _not_ unstash upon completion of rebase.

Perhaps a good compromise would be to

 * not to do the autostash, but fail as we have always done,
   when "git rebase base other" form was used, first instructing
   rebase to switch to another branch;

 * autostash when "git rebase base" form was used, and auto
   unstash upon completion.

> But the current behavior can be greatly improved by just making the
> error message better. Currently, I have this:
>
> $ git rebase branch
> foo.txt: needs update
> $ _
>
> I'd prefer something like
>
> $ git rebase branch
> cannot rebase: the working tree is not clean.
> foo.txt: Changed but not updated
> Commit your changes, or put them aside with "git stash" and retry.
> $ _

You forgot 'needs merge' case, so that would not fly very well,
but something like this might be a good starting point.

---
 git-rebase.sh |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/git-rebase.sh b/git-rebase.sh
index 058fcac..4f8aeb9 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -252,7 +252,10 @@ else
 fi

 # The tree must be really really clean.
-git update-index --refresh || exit
+o=$(git update-index -q --refresh) || {
+	printf "cannot rebase: the work tree is not clean.\n%s\n" "$o"
+	exit 1
+}
 diff=$(git diff-index --cached --name-status -r HEAD)
 case "$diff" in
 ?*)	echo "cannot rebase: your index is not up-to-date"

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  7:21           ` Junio C Hamano
@ 2007-10-02  8:01             ` David Kågedal
  2007-10-02  8:07             ` Matthieu Moy
  1 sibling, 0 replies; 44+ messages in thread
From: David Kågedal @ 2007-10-02  8:01 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

>  * not to do the autostash, but fail as we have always done,
>    when "git rebase base other" form was used, first instructing
>    rebase to switch to another branch;

I don't like the idea of automatic stashing on a rebase.  It makes it
harder to understand what is happening, and to figure out where things
went if not everything was successful.

-- 
David Kågedal

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  7:21           ` Junio C Hamano
  2007-10-02  8:01             ` David Kågedal
@ 2007-10-02  8:07             ` Matthieu Moy
  2007-10-02 17:44               ` Junio C Hamano
  1 sibling, 1 reply; 44+ messages in thread
From: Matthieu Moy @ 2007-10-02  8:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Steffen Prohaska

Junio C Hamano <gitster@pobox.com> writes:

> Matthieu Moy <Matthieu.Moy@imag.fr> writes:
>
>> I'm also concerned about the possibility of stash/unstash to fail:
>> that means the user has to deal with two kinds of conflicts (rebase
>> can conflict, unstash can also), which can be confusing.
>
> Unstash should be invoked automatically _only_ after rebase
> completes, so I do not foresee such confusion.

Well, if rebase itself conflicts, it will stop and tell you about
conflicts, then you'll have to "rebase --continue". If unstash fails,
you'll have to resolve the conflicts, and probably do "commit", or
continue working afterwards. You don't have to deal with both at the
same time, but both do exist, and they have to be taken care of
differently.

For an advanced user with good understanding of the flow, that's OK,
but I'm still afraid of the confusion for not-so-advanced users.

But that's not a strong argument against auto-stash, just one
downside.

> But the trouble I have with the auto unstashing is more at the
> conceptual and workflow level.  You start rebasing a branch, and
> your work tree is dirty.  What branch should the local
> modification belong to?

You're in a better position than me to judge on that point.

>  # The tree must be really really clean.
> -git update-index --refresh || exit
> +o=$(git update-index -q --refresh) || {
> +	printf "cannot rebase: the work tree is not clean.\n%s\n" "$o"
> +	exit 1
> +}
>  diff=$(git diff-index --cached --name-status -r HEAD)
>  case "$diff" in
>  ?*)	echo "cannot rebase: your index is not up-to-date"

That alone would already be a real improvement.

I'd add this to be consistent with "git status". I find the "needs
update" really short, and especially confusing for centralized systems
users, for whom "needs update" would probably mean "new version
available, please run '$VCS update'".

diff --git a/read-cache.c b/read-cache.c
index 2e40a34..3745a48 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -869,7 +869,7 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
                        }
                        if (quiet)
                                continue;
-                       printf("%s: needs update\n", ce->name);
+                       printf("%s: Changed but not updated\n", ce->name);
                        has_errors = 1;
                        continue;
                }

-- 
Matthieu

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  5:53     ` Junio C Hamano
  2007-10-02  6:41       ` Steven Grimm
  2007-10-02  6:44       ` Steffen Prohaska
@ 2007-10-02 12:52       ` Johannes Schindelin
  2007-10-02 17:00       ` Daniel Barkalow
  3 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-10-02 12:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi,

On Mon, 1 Oct 2007, Junio C Hamano wrote:

> I am not sure about the quality of "rsync" transport near the tip, either,
> but at least the change should not affect other transports.  Nobody should
> be using rsync transport these days anyway.  Perhaps we should put a
> deprecation notice in the release notes to 1.5.4, and remove it three 
> months later.

Why not keep it?  It's not like it hurts somebody, and in some 
circumstances (lacking git on the remote side, where it was served via 
http) I found it really convenient.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02  6:10               ` David Kastrup
@ 2007-10-02 16:11                 ` Jeff King
  2007-10-02 16:31                   ` David Kastrup
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-02 16:11 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Tue, Oct 02, 2007 at 08:10:28AM +0200, David Kastrup wrote:

> I have not actually looked at the actual task that the structures are
> going to be used in, and whether "reusing" the information is likely
> to be worth the trouble.

The algorithm is something like this:  We have N files, and we want to
find "similar" candidates. So we go through each file and generate a
table of fingerprint hashes (diffcore-rename.c:hash_chars), and then
compare each file with every other file, using the hash tables to do the
comparison.

So the comparison step for two files is currently something like:

  for each hash in file1
    hash2 = look up hash in file2
    compare hash and hash2

and if they were sorted, perhaps we could do something merge-like:

  while hashes are left to compare
      compare file1.next, file2.next
      advance file1, file2, or both (depending on comparison)

> When we are talking about buzzword compliance, "keep sorted" with the
> meaning of "maintain sorted across modifications" has an O(n^2) or at
> least O(nm) ring to it.  However, if it is possible to sort it just
> once, and then only merge with other lists...

It would be sort once. I.e.,:

  for each file
     generate file.hashes
     sort file.hashes
  for each file1
    for each file2
      compare file1.hashes to file2.hashes

where that 'compare' step is taking most of the CPU time (for the
obvious reason that we call it in an O(n^2) loop).

I will try to implement this as time permits, but if you want to tinker
with it in the meantime, feel free.
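
As a rough illustration of the merge-like comparison sketched above,
assuming each file's fingerprints are kept as a plain array of 32-bit
hashes (that layout, and the names sort_hashes and count_common, are
assumptions for this example and not what diffcore currently does):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 32-bit hash values, ascending. */
static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/* Done once per file, before any pairwise comparison. */
static void sort_hashes(uint32_t *h, size_t n)
{
	qsort(h, n, sizeof(*h), cmp_u32);
}

/*
 * Walk two sorted hash arrays in step and count the entries they
 * have in common.  Each step advances at least one cursor, so the
 * cost is O(na + nb) regardless of how large the hash space is.
 */
static size_t count_common(const uint32_t *a, size_t na,
			   const uint32_t *b, size_t nb)
{
	size_t i = 0, j = 0, common = 0;

	assert(a || !na);
	assert(b || !nb);
	while (i < na && j < nb) {
		if (a[i] < b[j]) {
			i++;		/* only in file1 */
		} else if (a[i] > b[j]) {
			j++;		/* only in file2 */
		} else {
			common++;	/* in both: advance both */
			i++;
			j++;
		}
	}
	return common;
}
```

A real similarity score would weigh matches by the number of bytes each
hash stands for; the plain count here only shows the shape of the loop.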

-Peff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: What's cooking in git.git (topics)
  2007-10-02 16:11                 ` Jeff King
@ 2007-10-02 16:31                   ` David Kastrup
  2007-10-02 17:39                     ` Jeff King
  2007-10-03  2:28                     ` Linus Torvalds
  0 siblings, 2 replies; 44+ messages in thread
From: David Kastrup @ 2007-10-02 16:31 UTC (permalink / raw)
  To: git

Jeff King <peff@peff.net> writes:

> On Tue, Oct 02, 2007 at 08:10:28AM +0200, David Kastrup wrote:

[...]

> The algorithm is something like this: We have N files, and we want
> to find "similar" candidates. So we go through each file and
> generate a table of fingerprint hashes
> (diffcore-rename.c:hash_chars), and then compare each file with
> every other file, using the hash tables to do the comparison.
>
> So the comparison step for two files is currently something like:
>
>   for each hash in file1
>     hash2 = look up hash in file2
>     compare hash and hash2
>
> and if they were sorted, perhaps we could do something merge-like:
>
>   while hashes are left to compare
>       compare file1.next, file2.next
>       advance file1, file2, or both (depending on comparison)
>
>> When we are talking about buzzword compliance, "keep sorted" with
>> the meaning of "maintain sorted across modifications" has an O(n^2)
>> or at least O(nm) ring to it.  However, if it is possible to sort
>> it just once, and then only merge with other lists...
>
> It would be sort once. I.e.,:
>
>   for each file
>      generate file.hashes
>      sort file.hashes
>   for each file1
>     for each file2
>       compare file1.hashes to file2.hashes
>
> where that 'compare' step is taking most of the CPU time (for the
> obvious reason that we call it in an O(n^2) loop).
>
> I will try to implement this as time permits, but if you want to
> tinker with it in the meantime, feel free.

This does not actually require an actual merge _sort_ AFAICS: do the
"sort file.hashes" step using qsort.  The comparison step does not
actually need to produce merged output, but merely advances through
two hash arrays and generates statistics.

This should already beat the pants off the current implementation,
even when the hash array is sparse, simply because our inner loop then
has perfect hash coherence.
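
To make the comparison step concrete, here is a minimal sketch of such a
merge-like walk.  It assumes each file's hashes are already sorted by
value, with duplicates folded into a count; struct hash_count and
count_common() are hypothetical names, not taken from diffcore-delta.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one fingerprint hash and its count in a file. */
struct hash_count {
	unsigned int hashval;
	unsigned int cnt;
};

/*
 * Walk two arrays that are both sorted by ascending, unique hashval
 * and return the sum of min(a.cnt, b.cnt) over every hash present in
 * both.  Nothing is merged into a new list; only a statistic is
 * accumulated.
 */
static unsigned long count_common(const struct hash_count *a, size_t na,
				  const struct hash_count *b, size_t nb)
{
	size_t i = 0, j = 0;
	unsigned long common = 0;

	assert(a || !na);
	assert(b || !nb);
	while (i < na && j < nb) {
		if (a[i].hashval < b[j].hashval)
			i++;		/* hash occurs only in a */
		else if (a[i].hashval > b[j].hashval)
			j++;		/* hash occurs only in b */
		else {
			common += a[i].cnt < b[j].cnt ? a[i].cnt : b[j].cnt;
			i++;
			j++;
		}
	}
	return common;
}
```

Every iteration advances at least one cursor, so a single comparison
costs O(na + nb) and reads both arrays strictly in order.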

Getting rid of this outer O(n^2) remains an interesting challenge,
though.  One way would be the following: fill a _single_ array with
entries containing _both_ hash and file number.  Sort this, and then
gather the statistics of hash runs by making a single pass through.
That reduces the O(n^2) behavior to only those parts with actual hash
collisions.
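
A rough sketch of that single-array idea, under stated assumptions:
the names tagged_hash and count_shared() are invented for illustration,
and the dense nfiles x nfiles result matrix is only there to keep the
example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical entry: a hash plus the index of the file it came from. */
struct tagged_hash {
	unsigned int hashval;
	unsigned int file;
};

/* Order by hash first, then by file number within equal hashes. */
static int tagged_hash_cmp(const void *a_, const void *b_)
{
	const struct tagged_hash *a = a_, *b = b_;

	if (a->hashval != b->hashval)
		return a->hashval < b->hashval ? -1 : 1;
	if (a->file != b->file)
		return a->file < b->file ? -1 : 1;
	return 0;
}

/*
 * shared must point to nfiles * nfiles zero-initialised counters.
 * Afterwards shared[f * nfiles + g], for f < g, holds the number of
 * entry pairs with equal hash, one from file f and one from file g.
 */
static void count_shared(struct tagged_hash *e, size_t n,
			 unsigned long *shared, size_t nfiles)
{
	size_t start = 0;

	if (n)
		qsort(e, n, sizeof(*e), tagged_hash_cmp);
	while (start < n) {
		size_t end = start + 1, i, j;

		/* Find the end of the run of equal hash values. */
		while (end < n && e[end].hashval == e[start].hashval)
			end++;
		/* Pairwise work happens only inside a run. */
		for (i = start; i < end; i++) {
			assert(e[i].file < nfiles);
			for (j = i + 1; j < end; j++)
				if (e[i].file != e[j].file)
					shared[e[i].file * nfiles + e[j].file]++;
		}
		start = end;
	}
}
```

A hash that occurs in only one file forms a run of length one and costs
nothing beyond the sort; the quadratic work is confined to runs where
several files really do share a hash.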

-- 
David Kastrup

* Re: What's cooking in git.git (topics)
  2007-10-02  5:53     ` Junio C Hamano
                         ` (2 preceding siblings ...)
  2007-10-02 12:52       ` What's cooking in git.git (topics) Johannes Schindelin
@ 2007-10-02 17:00       ` Daniel Barkalow
  3 siblings, 0 replies; 44+ messages in thread
From: Daniel Barkalow @ 2007-10-02 17:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, 1 Oct 2007, Junio C Hamano wrote:

> * db/fetch-pack (Mon Oct 1 00:59:39 2007 +0100) 49 commits
>  + fetch/push: readd rsync support
>  + Introduce remove_dir_recursively()
>  + bundle transport: fix an alloc_ref() call
>  + Allow abbreviations in the first refspec to be merged
>  + Prevent send-pack from segfaulting when a branch doesn't match
>  + Cleanup unnecessary break in remote.c
>  ...
> 
> Has been cooking for quite long time.
> 
> There was a regression that made me quite unhappy about the
> rewrite, but Daniel fixed it, so I should be happy.  There is
> another usability regression: http transport is now totally
> silent.

I think this is due to passing through equal verbosity levels, when the 
non-verbose case for the native protocols was a lot less silent than the 
non-verbose case for http. "git fetch -v -v" does show everything. I think 
just replacing "transport->verbose" with "1" on line 347 of transport.c 
would give the old default behavior, but fetch probably needs a more quiet 
setting than the default, as well as the current more verbose than 
default. I'll put together a patch for this when I get a chance.

> Even when you fetch daily, if the other end frequently
> repacks everything into one big ball of wax like repo.or.cz
> does, you will end up transferring quite a large pack every
> time, and the total lack of progress report is unacceptably
> unnerving.  At least we should reinstate "Fetching blah from URL
> using http", and preferably "walk $object_name" lines.  The
> latter could be replaced with just a series of CR + "walked N
> commits..." if we do not like so much output from the current "walk
> $object_name" lines scrolling the other information away.

The right thing for now is probably to match the old git-fetch's behavior, 
once I can remember what it is. (I've been using my C version for my 
personal use long enough that I can't remember everything the shell 
version did. My new version should be sufficiently flexible to accommodate 
most things without too much trouble, but I've lost my ability to notice 
differences without prompting.)

After the implementation change is in, we can look at improving 
user-visible things. I think a display like the "counting objects" display 
(number that counts up in place), plus progress bars for big downloads, would 
be ideal.

> I am not sure about the quality of "rsync" transport near the tip,
> either, but at least the change should not affect other
> transports.  Nobody should be using the rsync transport these
> days anyway.  Perhaps we should put a deprecation notice in the
> release notes to 1.5.4, and remove it three months later.

I think that rsync should be kept until we've got sftp in place, which 
should cover the same cases and be better overall. 

	-Daniel
*This .sig left intentionally blank*

* Re: What's cooking in git.git (topics)
  2007-10-02 16:31                   ` David Kastrup
@ 2007-10-02 17:39                     ` Jeff King
  2007-10-02 18:44                       ` David Kastrup
  2007-10-03  2:28                     ` Linus Torvalds
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-02 17:39 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Tue, Oct 02, 2007 at 06:31:18PM +0200, David Kastrup wrote:

> This does not actually require an actual merge _sort_ AFAICS: do the
> "sort file.hashes" step using qsort.  The comparison step does not
> actually need to produce merged output, but merely advances through
> two hash arrays and generates statistics.

Right, that's why I used "merge" in quotes. The sort used in the O(n)
step is irrelevant, but we are doing a merge-sort-like behavior in the
second step (except instead of actually merging into a new list, we are
summarizing the comparisons in a numeric "difference" variable). But I
think we are on the same page.

> This should already beat the pants off the current implementation,
> even when the hash array is sparse, simply because our inner loop then
> has perfect hash coherence.

Yes, I hope so. We'll see. :)

> Getting rid of this outer O(n^2) remains an interesting challenge,
> though.  One way would be the following: fill a _single_ array with
> entries containing _both_ hash and file number.  Sort this, and then
> gather the statistics of hash runs by making a single pass through.
> That reduces the O(n^2) behavior to only those parts with actual hash
> collisions.

Interesting. Care to take a stab at implementing it?

-Peff

* Re: What's cooking in git.git (topics)
  2007-10-02  8:07             ` Matthieu Moy
@ 2007-10-02 17:44               ` Junio C Hamano
  2007-10-03  5:30                 ` [PATCH] rebase: make the warning more useful when the work tree is unclean Junio C Hamano
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-10-02 17:44 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git, Steffen Prohaska

Matthieu Moy <Matthieu.Moy@imag.fr> writes:

> I'd add this to be consistent with "git status". I find the "needs
> update" really short, and especially confusing for centralized systems
> users, for whom "needs update" would probably mean "new version
> available, please run '$VCS update'".

Yeah, and "needs merge" solicits 'SCM merge'.

> diff --git a/read-cache.c b/read-cache.c
> index 2e40a34..3745a48 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -869,7 +869,7 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
>                         }
>                         if (quiet)
>                                 continue;
> -                       printf("%s: needs update\n", ce->name);
> +                       printf("%s: Changed but not updated\n", ce->name);
>                         has_errors = 1;
>                         continue;
>                 }

I tried to stay away from touching that part on purpose.  Doing
this unconditionally may break people's existing scripts that
use update-index --refresh plumbing.

We could introduce a new option to "update-index --refresh" that
makes the output more machine readable by either NUL terminating
or c_quoting ce->name to protect the caller from potential
spaces and newlines in the name, and use that from the caller.
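
As a hedged illustration of the c_quoting half of that idea: the helper
below is hypothetical (quote_name() is not git's own quoting code), and
for brevity it always quotes instead of quoting only when needed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a double-quoted, C-style escaped copy of name into out.
 * Returns the length written (excluding the terminating NUL), or -1
 * if out is too small.
 */
static int quote_name(const char *name, char *out, size_t outsz)
{
	const unsigned char *p = (const unsigned char *)name;
	size_t o = 0;

	assert(name && out);
	if (outsz < 3)
		return -1;
	out[o++] = '"';
	for (; *p; p++) {
		char esc = 0;

		/* Worst case per byte: 4-byte escape, then '"' and NUL. */
		if (outsz - o < 6)
			return -1;
		switch (*p) {
		case '\n': esc = 'n'; break;
		case '\t': esc = 't'; break;
		case '"':  esc = '"'; break;
		case '\\': esc = '\\'; break;
		}
		if (esc) {
			out[o++] = '\\';
			out[o++] = esc;
		} else if (*p < 0x20 || *p >= 0x7f) {
			/* Other control and non-ASCII bytes become \ooo. */
			o += (size_t)sprintf(out + o, "\\%03o", (unsigned)*p);
		} else {
			out[o++] = (char)*p;
		}
	}
	out[o++] = '"';
	out[o] = '\0';
	return (int)o;
}
```

With output like this a caller can split on newlines safely, because a
newline inside a path shows up as the two characters backslash and n.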

But for this particular case, I think a much simpler alternative
would be to do it this way:

>  # The tree must be really really clean.
> -git update-index --refresh || exit
> +git update-index -q --refresh || {
> +	git status
> +	printf "cannot rebase: the work tree is not clean.\n"
> +	exit 1
> +}
>  diff=$(git diff-index --cached --name-status -r HEAD)
>  case "$diff" in
>  ?*)	echo "cannot rebase: your index is not up-to-date"

* Re: What's cooking in git.git (topics)
  2007-10-02 17:39                     ` Jeff King
@ 2007-10-02 18:44                       ` David Kastrup
  0 siblings, 0 replies; 44+ messages in thread
From: David Kastrup @ 2007-10-02 18:44 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Tue, Oct 02, 2007 at 06:31:18PM +0200, David Kastrup wrote:
>
>> This does not actually require an actual merge _sort_ AFAICS: do the
>> "sort file.hashes" step using qsort.  The comparison step does not
>> actually need to produce merged output, but merely advances through
>> two hash arrays and generates statistics.
>
> Right, that's why I used "merge" in quotes.
>
>> This should already beat the pants off the current implementation,
>> even when the hash array is sparse, simply because our inner loop
>> then has perfect hash coherence.
>
> Yes, I hope so. We'll see. :)
>
>> Getting rid of this outer O(n^2) remains an interesting challenge,
>> though.  One way would be the following: fill a _single_ array with
>> entries containing _both_ hash and file number.  Sort this, and
>> then gather the statistics of hash runs by making a single pass
>> through.  That reduces the O(n^2) behavior to only those parts with
>> actual hash collisions.
>
> Interesting. Care to take a stab at implementing it?

I actually have worked through the last night on the day job, have
urgent stuff piling up in my freelance work queue, and the next thing
I need to finish for git is some smart stuff for delta packing.

So it's unlikely I'll get to _that_ anytime soon.  However, I had a
hilarious idea on the way home that kept me rather amused (perhaps my
programmer's humour is affected by sleep deprivation).

I was annoyed at needing double the space because of having to keep
score of both hash and file number.  So I came up with a rather cute
manner to avoid this: first do all files in isolation with full
precision, but store the resulting list of hashes as differences from the
previous value.  When merging the data of 2^k and 2^k (or somewhat less)
files, we multiply the values by two (this will not carry except for
utterly improbable cases or very small data sets which we can do
differently) and add one bit of identification.  When we have just a
single sequence remaining, undeltafying will tell us about collisions
in the high bits, and the affected files in the low bits.

Of course, using a merge-like algorithm means that we temporarily need
double space anyway.  Which takes some of the fun.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: What's cooking in git.git (topics)
  2007-10-02 16:31                   ` David Kastrup
  2007-10-02 17:39                     ` Jeff King
@ 2007-10-03  2:28                     ` Linus Torvalds
  2007-10-03  6:54                       ` Jeff King
                                         ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-10-03  2:28 UTC (permalink / raw)
  To: David Kastrup, Jeff King; +Cc: Git Mailing List


[ This is the discussed stupid approach - just sort the dang hash array, 
  so that we can use a linear scan over the src/dst ]

On Tue, 2 Oct 2007, David Kastrup wrote:
> 
> This does not actually require an actual merge _sort_ AFAICS: do the
> "sort file.hashes" step using qsort.  The comparison step does not
> actually need to produce merged output, but merely advances through
> two hash arrays and generates statistics.
> 
> This should already beat the pants off the current implementation,
> even when the hash array is sparse, simply because our inner loop then
> has perfect hash coherence.

Sadly, that's not the case. It *does* seem to beat the current 
implementation, but it's not "beat the pants off". It looks like an 
improvement of about 15%, which is nothing to sneeze at, but it's not an 
order-of-magnitude improvement either.

Here's a test-patch. I don't guarantee anything, except that when I did 
the timings I also did a "wc" on the result, and they matched..

Before:
	[torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc
	   7104   28574  438020

	real    0m10.526s
	user    0m10.401s
	sys     0m0.136s

After:
	[torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. | wc
	   7104   28574  438020
	
	real    0m8.876s
	user    0m8.761s
	sys     0m0.128s

but the diff is fairly simple, so if somebody will go over it and say 
whether it's likely to be *correct* too, that 15% may well be worth it.

[ Side note, without rename detection, that diff takes just under three 
  seconds for me, so in that sense the improvement to the rename detection 
  itself is larger than the overall 15% - it brings the cost of just 
  rename detection from 7.5s to 5.9s, which would be on the order of just 
  over a 20% performance improvement. ]
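
The percentages can be checked with a few lines of arithmetic.  This is
only a sketch: improvement_pct() is a made-up helper, and the 7.5s and
5.9s figures assume roughly three seconds of non-rename work, as stated
in the side note.

```c
#include <assert.h>

/* Relative improvement, in percent, when a run time drops from before to after. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Fed the wall-clock times above, improvement_pct(10.526, 8.876) comes
out a little under 16, and improvement_pct(7.5, 5.9) a little over 21.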

Hmm. The patch depends on half-way subtle issues like the fact that the 
hashtables are guaranteed to not be full => we're guaranteed to have zero 
counts at the end => we don't need to do any steenking iterator count in 
the loop. A few comments might be in order.
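
For readers who want to see that invariant in isolation, here is a
self-contained sketch of the same scan on plain arrays.  The names
struct span and count_changes() are illustrative only, and the caller
is assumed to leave at least one unused (cnt == 0) slot in each table.

```c
#include <assert.h>
#include <stdlib.h>

struct span {
	unsigned int hashval;
	unsigned int cnt;	/* 0 marks an unused slot */
};

/* Order used slots by hashval; unused slots compare last. */
static int span_cmp(const void *a_, const void *b_)
{
	const struct span *a = a_, *b = b_;

	if (!a->cnt)
		return !b->cnt ? 0 : 1;
	if (!b->cnt)
		return -1;
	return a->hashval < b->hashval ? -1 :
		a->hashval > b->hashval ? 1 : 0;
}

static void count_changes(struct span *src, size_t nsrc,
			  struct span *dst, size_t ndst,
			  unsigned long *copied, unsigned long *added)
{
	struct span *s, *d;
	unsigned long sc = 0, la = 0;

	qsort(src, nsrc, sizeof(*src), span_cmp);
	qsort(dst, ndst, sizeof(*dst), span_cmp);
	/* The tables are never full, so each ends in a zero-count slot. */
	assert(nsrc && !src[nsrc - 1].cnt);
	assert(ndst && !dst[ndst - 1].cnt);

	/* No index bound is needed: the zero-count slot stops both loops. */
	for (s = src, d = dst; s->cnt; s++) {
		unsigned int dst_cnt;

		while (d->cnt && d->hashval < s->hashval)
			d++;
		dst_cnt = (d->cnt && d->hashval == s->hashval) ? d->cnt : 0;
		if (s->cnt < dst_cnt) {
			la += dst_cnt - s->cnt;
			sc += s->cnt;
		} else
			sc += dst_cnt;
	}
	*copied = sc;
	*added = la;
}
```

The two asserts spell out the assumption the loops rely on; if a table
could ever be completely full, both loops would run off the end.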

		Linus

---
 diffcore-delta.c |   54 ++++++++++++++++++++++++++++++------------------------
 1 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index d9729e5..6d65697 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -46,22 +46,6 @@ struct spanhash_top {
 	struct spanhash data[FLEX_ARRAY];
 };
 
-static struct spanhash *spanhash_find(struct spanhash_top *top,
-				      unsigned int hashval)
-{
-	int sz = 1 << top->alloc_log2;
-	int bucket = hashval & (sz - 1);
-	while (1) {
-		struct spanhash *h = &(top->data[bucket++]);
-		if (!h->cnt)
-			return NULL;
-		if (h->hashval == hashval)
-			return h;
-		if (sz <= bucket)
-			bucket = 0;
-	}
-}
-
 static struct spanhash_top *spanhash_rehash(struct spanhash_top *orig)
 {
 	struct spanhash_top *new;
@@ -122,6 +106,20 @@ static struct spanhash_top *add_spanhash(struct spanhash_top *top,
 	}
 }
 
+static int spanhash_cmp(const void *_a, const void *_b)
+{
+	const struct spanhash *a = _a;
+	const struct spanhash *b = _b;
+
+	/* A count of zero compares at the end.. */
+	if (!a->cnt)
+		return !b->cnt ? 0 : 1;
+	if (!b->cnt)
+		return -1;
+	return a->hashval < b->hashval ? -1 :
+		a->hashval > b->hashval ? 1 : 0;
+}
+
 static struct spanhash_top *hash_chars(struct diff_filespec *one)
 {
 	int i, n;
@@ -158,6 +156,10 @@ static struct spanhash_top *hash_chars(struct diff_filespec *one)
 		n = 0;
 		accum1 = accum2 = 0;
 	}
+	qsort(hash->data,
+		1ul << hash->alloc_log2,
+		sizeof(hash->data[0]),
+		spanhash_cmp);
 	return hash;
 }
 
@@ -169,7 +171,7 @@ int diffcore_count_changes(struct diff_filespec *src,
 			   unsigned long *src_copied,
 			   unsigned long *literal_added)
 {
-	int i, ssz;
+	struct spanhash *s, *d;
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
@@ -190,22 +192,26 @@ int diffcore_count_changes(struct diff_filespec *src,
 	}
 	sc = la = 0;
 
-	ssz = 1 << src_count->alloc_log2;
-	for (i = 0; i < ssz; i++) {
-		struct spanhash *s = &(src_count->data[i]);
-		struct spanhash *d;
+	s = src_count->data;
+	d = dst_count->data;
+	for (;;) {
 		unsigned dst_cnt, src_cnt;
 		if (!s->cnt)
-			continue;
+			break;
+		while (d->cnt) {
+			if (d->hashval >= s->hashval)
+				break;
+			d++;
+		}
 		src_cnt = s->cnt;
-		d = spanhash_find(dst_count, s->hashval);
-		dst_cnt = d ? d->cnt : 0;
+		dst_cnt = d->hashval == s->hashval ? d->cnt : 0;
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;
 		}
 		else
 			sc += dst_cnt;
+		s++;
 	}
 
 	if (!src_count_p)

* [PATCH] rebase: make the warning more useful when the work tree is unclean.
  2007-10-02 17:44               ` Junio C Hamano
@ 2007-10-03  5:30                 ` Junio C Hamano
  2007-10-03  9:02                   ` Matthieu Moy
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-10-03  5:30 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git, Steffen Prohaska

Instead of letting "update-index --refresh" report paths needing
updates and merges, use git-status to give more useful output.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * I won't be a good judge of the updated behaviour, as I never
   start rebase in an unclean tree.  Running git-status in a
   large tree may be too expensive to be worth changing the
   output.

 git-rebase.sh |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/git-rebase.sh b/git-rebase.sh
index 1583402..93e3b3c 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -254,7 +254,11 @@ else
 fi
 
 # The tree must be really really clean.
-git update-index --refresh || exit
+git update-index -q --refresh || {
+	git status
+	printf "cannot rebase: the work tree is not clean.\n"
+	exit 1
+}
 diff=$(git diff-index --cached --name-status -r HEAD)
 case "$diff" in
 ?*)	echo "cannot rebase: your index is not up-to-date"
-- 
1.5.3.3.1144.gf10f2

* Re: What's cooking in git.git (topics)
  2007-10-03  2:28                     ` Linus Torvalds
@ 2007-10-03  6:54                       ` Jeff King
  2007-10-03 16:13                         ` Linus Torvalds
  2007-10-03  8:20                       ` David Kastrup
  2007-10-04  7:10                       ` Junio C Hamano
  2 siblings, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-03  6:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Kastrup, Git Mailing List

On Tue, Oct 02, 2007 at 07:28:19PM -0700, Linus Torvalds wrote:

> Sadly, that's not the case. It *does* seem to beat the current 
> implementation, but it's not "beat the pants off". It looks like an 
> improvement of about 15%, which is nothing to sneeze at, but it's not an 
> order-of-magnitude improvement either.
> 
> Here's a test-patch. I don't guarantee anything, except that when I did 
> the timings I also did a "wc" on the result, and they matched..

I get slightly better speedups with my pathological case (around 30%):

Before:
  $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null
  105.38user 3.65system 2:14.90elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (15432major+542627minor)pagefaults 0swaps

After:
  $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null
  71.70user 3.47system 1:40.43elapsed 74%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (15065major+551778minor)pagefaults 0swaps

But yes, it's not the order of magnitude we were looking for.

> 	[torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc

I found less noise in the timing by using --raw, since the patch
computation takes an appreciable amount of time.

> but the diff is fairly simple, so if somebody will go over it and say 
> whether it's likely to be *correct* too, that 15% may well be worth it.

Patch looks correct, and it produces correct results on my (admittedly
limited) test data.

I think it's worth applying (though I agree that a comment on the
assumption of a zero "cnt" at the end is worth adding) unless some
drastically different solution comes along (e.g., David's idea to try
avoiding the outer O(n^2) loop). But I don't think there is much more to
be gained from a different approach to comparing the two hash tables.

-Peff

* Re: What's cooking in git.git (topics)
  2007-10-03  2:28                     ` Linus Torvalds
  2007-10-03  6:54                       ` Jeff King
@ 2007-10-03  8:20                       ` David Kastrup
  2007-10-03 16:59                         ` Jeff King
  2007-10-04  7:10                       ` Junio C Hamano
  2 siblings, 1 reply; 44+ messages in thread
From: David Kastrup @ 2007-10-03  8:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> [ This is the discussed stupid approach - just sort the dang hash array, 
>   so that we can use a linear scan over the src/dst ]
>
> On Tue, 2 Oct 2007, David Kastrup wrote:
>> 
>> This does not actually require an actual merge _sort_ AFAICS: do the
>> "sort file.hashes" step using qsort.  The comparison step does not
>> actually need to produce merged output, but merely advances through
>> two hash arrays and generates statistics.
>> 
>> This should already beat the pants off the current implementation,
>> even when the hash array is sparse, simply because our inner loop then
>> has perfect hash coherence.
>
> Sadly, that's not the case. It *does* seem to beat the current 
> implementation, but it's not "beat the pants off".

Part of the reason is that it is not actually what I had in mind.  Why
create the hash array as a hash array?  Filling the hash array in
basically random order, then sort+compressing it is what is causing
much of the costs.  My idea was to just fill the "hash array"
linearly.  It is quite pointless (and certainly very inefficient with
regard to cache poisoning) to do it in hash order when we are going to
sort it anyway.
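
A minimal sketch of that alternative, assuming the hashes are simply
appended to a flat array as they are computed.  The names hash_run and
sort_and_compress() are hypothetical, and the example counts
occurrences per hash as a simplification.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: one distinct hash and how often it occurred. */
struct hash_run {
	unsigned int hashval;
	unsigned int cnt;
};

static int uint_cmp(const void *a_, const void *b_)
{
	unsigned int a = *(const unsigned int *)a_;
	unsigned int b = *(const unsigned int *)b_;

	return a < b ? -1 : a > b ? 1 : 0;
}

/*
 * raw holds n hash values in the order they were produced.  Sort them
 * and collapse each run of equal values into one record in out, which
 * must have room for n records.  Returns the number of records written.
 */
static size_t sort_and_compress(unsigned int *raw, size_t n,
				struct hash_run *out)
{
	size_t i, nout = 0;

	if (!n)
		return 0;
	assert(raw && out);
	qsort(raw, n, sizeof(*raw), uint_cmp);
	for (i = 0; i < n; i++) {
		if (nout && out[nout - 1].hashval == raw[i]) {
			out[nout - 1].cnt++;	/* same hash as previous */
		} else {
			out[nout].hashval = raw[i];
			out[nout].cnt = 1;
			nout++;
		}
	}
	return nout;
}
```

The result is dense and sorted, so it needs neither empty slots nor a
zero-count terminator; an explicit length takes their place.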

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: [PATCH] rebase: make the warning more useful when the work tree is unclean.
  2007-10-03  5:30                 ` [PATCH] rebase: make the warning more useful when the work tree is unclean Junio C Hamano
@ 2007-10-03  9:02                   ` Matthieu Moy
  0 siblings, 0 replies; 44+ messages in thread
From: Matthieu Moy @ 2007-10-03  9:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Steffen Prohaska

Junio C Hamano <gitster@pobox.com> writes:

> Instead of letting "update-index --refresh" report paths needing
> updates and merges, use git-status to give more useful output.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>
>  * I won't be a good judge of the updated behaviour, as I never
>    start rebase in an unclean tree.  Running git-status in a
>    large tree may be too expensive to be worth changing the
>    output.

I see your patch as an improvement too. status is a bit expensive, but
you hit this portion of code only when trying "rebase" by mistake, so
it's acceptable to let git take a bit of time to explain your mistake.

That said, I think it's still worth making the messages of
"update-index" a little more verbose and consistent with "status" (my
previous patch), since I think there are other occurrences of
user-visible output of update-index in porcelain git.

Ideally, update-index would be a C function returning a struct
containing all the information about the status, with a function
is_clean(...) and another print_as_status(...), to allow the same
functionality with better performance, but I won't have time to do
this.

-- 
Matthieu

* Re: What's cooking in git.git (topics)
  2007-10-03  6:54                       ` Jeff King
@ 2007-10-03 16:13                         ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-10-03 16:13 UTC (permalink / raw)
  To: Jeff King; +Cc: David Kastrup, Git Mailing List

On Wed, 3 Oct 2007, Jeff King wrote:
> 
> I get slightly better speedups with my pathological case (around 30%):

Ok, 30% is definitely "worth doing". Even if your performance still sucks, 
and 71 seconds is just way out of line for anything like this (of course, 
these days you need that "-l0" to ever trigger that case, but it would be 
nice if we could speed things up so much that we no longer care).

		Linus

* Re: What's cooking in git.git (topics)
  2007-10-03  8:20                       ` David Kastrup
@ 2007-10-03 16:59                         ` Jeff King
  2007-10-03 17:53                           ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-10-03 16:59 UTC (permalink / raw)
  To: David Kastrup; +Cc: Linus Torvalds, Git Mailing List

On Wed, Oct 03, 2007 at 10:20:49AM +0200, David Kastrup wrote:

> Part of the reason is that it is not actually what I had in mind.  Why
> create the hash array as a hash array?  Filling the hash array in
> basically random order, then sort+compressing it is what is causing
> much of the costs.  My idea was to just fill the "hash array"
> linearly.  It is quite pointless (and certainly very inefficient with
> regard to cache poisoning) to do it in hash order when we are going to
> sort it anyway.

Try profiling the code, and you will see that the creation of the hashes
is totally dwarfed by the comparisons. So yes, you might be able to
speed up the creation code, but it's going to have a minimal impact on
the overall run time.

-Peff

* Re: What's cooking in git.git (topics)
  2007-10-03 16:59                         ` Jeff King
@ 2007-10-03 17:53                           ` Linus Torvalds
  2007-10-03 18:09                             ` David Kastrup
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2007-10-03 17:53 UTC (permalink / raw)
  To: Jeff King; +Cc: David Kastrup, Git Mailing List

On Wed, 3 Oct 2007, Jeff King wrote:
>
> Try profiling the code, and you will see that the creation of the hashes
> is totally dwarfed by the comparisons. So yes, you might be able to
> speed up the creation code, but it's going to have a minimal impact on
> the overall run time.

Yeah. Oprofile is your friend.

The biggest win would be to avoid calling diffcore_count_changes() in the 
first place, and we actually do that in the caller by looking at the size 
of the files. However, while that prunes out the *really* obvious cases, 
it's not nearly enough, since file sizes tend to fall in a fairly narrow 
range anyway.

What we could also do is to pass in the minimum similarity score, and use 
that to at least exit early. We currently pass in "delta_limit" which is 
close, but we never use it, and we really probably would be better off 
passing in the minimum score itself and perhaps be able to exit early from 
diffcore_count_changes.

However, while I did think about doing that, I came to the conclusion that 
we'd still always end up having to look at least at *half* the hash data 
(for the default 50% score), so while it would help, it again wouldn't be 
a matter of orders of magnitude, and it looked like the end result would 
be unnecessarily complex too.
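
A sketch of what such an early exit could look like, assuming sorted
arrays with explicit lengths rather than the real hash tables.  The
function reaches_score() and its parameters are hypothetical; "needed"
stands for the shared count that corresponds to the minimum score.

```c
#include <assert.h>
#include <stddef.h>

struct hash_count {
	unsigned int hashval;
	unsigned int cnt;
};

/*
 * Return 1 if the shared count between src and dst reaches needed,
 * 0 otherwise.  src_total is the sum of cnt over src.  The scan stops
 * as soon as the answer is known; *visited (if non-NULL) receives the
 * number of src entries that were examined.
 */
static int reaches_score(const struct hash_count *src, size_t ns,
			 unsigned long src_total,
			 const struct hash_count *dst, size_t nd,
			 unsigned long needed, size_t *visited)
{
	size_t i, j = 0;
	unsigned long shared = 0, remaining = src_total;

	for (i = 0; i < ns; i++) {
		/*
		 * Stop when the target is met, or when even a perfect
		 * match of everything left could not reach it.
		 */
		if (shared >= needed || shared + remaining < needed)
			break;
		while (j < nd && dst[j].hashval < src[i].hashval)
			j++;
		if (j < nd && dst[j].hashval == src[i].hashval)
			shared += src[i].cnt < dst[j].cnt ?
				  src[i].cnt : dst[j].cnt;
		assert(remaining >= src[i].cnt);
		remaining -= src[i].cnt;
	}
	if (visited)
		*visited = i;
	return shared >= needed;
}
```

A pair that shares nothing is rejected only once "remaining" drops
below "needed", which for a 50% threshold means about half of the
source data has been read, matching the estimate above.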

The best option, of course, would be to use a similarity hash to make the 
initial choice. I think we had one at one point, but it wasn't 
fine-grained enough. But it might be interesting to do that as a filter in 
*front* of the more expensive diffcore_count_changes() call.

We had some "similarity fingerprint" code at some point using a Rabin 
polynomial. It wasn't reliable enough to be used as a direct score, but 
maybe it can be used as a first-line "we know this isn't even worth doing 
the expensive stuff on".
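
To illustrate the "cheap filter in front" idea, here is a sketch that
uses a plain byte histogram.  This is a swapped-in technique chosen for
brevity, not the Rabin-polynomial fingerprint mentioned above, and the
function names are made up.  It bounds common content only and ignores
any effect of hash collisions in the real scoring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how often each byte value occurs in buf. */
static void byte_histogram(const unsigned char *buf, size_t len,
			   size_t hist[256])
{
	size_t i;

	assert(buf || !len);
	memset(hist, 0, 256 * sizeof(hist[0]));
	for (i = 0; i < len; i++)
		hist[buf[i]]++;
}

/*
 * Upper bound on the number of bytes two buffers can have in common:
 * no matching can reuse more occurrences of a byte value than the
 * rarer side contains.
 */
static size_t max_common_bytes(const size_t a[256], const size_t b[256])
{
	size_t i, sum = 0;

	for (i = 0; i < 256; i++)
		sum += a[i] < b[i] ? a[i] : b[i];
	return sum;
}

/* Return 1 if the pair might still reach min_common, 0 if it cannot. */
static int worth_comparing(const size_t a[256], const size_t b[256],
			   size_t min_common)
{
	return max_common_bytes(a, b) >= min_common;
}
```

Each histogram is built once per file, so the per-pair cost is a fixed
256-entry loop regardless of file size.  The filter can only rule
pairs out; a pair that passes still needs the expensive comparison.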

			Linus

* Re: What's cooking in git.git (topics)
  2007-10-03 17:53                           ` Linus Torvalds
@ 2007-10-03 18:09                             ` David Kastrup
  0 siblings, 0 replies; 44+ messages in thread
From: David Kastrup @ 2007-10-03 18:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 3 Oct 2007, Jeff King wrote:
>>
>> Try profiling the code, and you will see that the creation of the hashes
>> is totally dwarfed by the comparisons. So yes, you might be able to
>> speed up the creation code, but it's going to have a minimal impact on
>> the overall run time.
>
> Yeah. Oprofile is your friend.

Well, and if -Oprofile has no strong opinion, I'd let wc -l pitch in.

When we are not actually going to use the hash tables as hash tables,
why create them as such?  If the first thing that actually looks at
the values of the hashes (except possibly for the optimization of not
storing the same hash twice in succession) is the sort, then there is
no code that can go out of whack when confronted with degenerate data.

Maybe it's not much of an optimization, but it certainly should be a
cleanup.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: What's cooking in git.git (topics)
  2007-10-03  2:28                     ` Linus Torvalds
  2007-10-03  6:54                       ` Jeff King
  2007-10-03  8:20                       ` David Kastrup
@ 2007-10-04  7:10                       ` Junio C Hamano
  2 siblings, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-10-04  7:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Kastrup, Jeff King, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Hmm. The patch depends on half-way subtle issues like the fact that the 
> hashtables are guaranteed to not be full => we're guaranteed to have zero 
> counts at the end => we don't need to do any steenking iterator count in 
> the loop. A few comments might be in order.

The patch actually is quite readable.  That double-loop finding
the matching hashval in destination hash was simply silly to
begin with, so even if this is not "orders of magnitude"
improvement, I think your patch is worth doing.

Thread overview: 44+ messages
2007-09-06  8:52 What's cooking in git.git (topics) Junio C Hamano
     [not found] ` <7v1wd1d0le.fsf@gitster.siamese.dyndns.org>
2007-09-14 18:30   ` Shawn O. Pearce
2007-09-15  7:47     ` db/fetch-pack topic (was: What's cooking in git.git (topics)) Shawn O. Pearce
2007-09-16  4:03       ` Shawn O. Pearce
2007-09-14 23:47   ` What's cooking in git.git (topics) Johannes Schindelin
2007-09-26 21:07     ` Carlos Rica
2007-09-26 20:05   ` Junio C Hamano
2007-09-26 21:44     ` Johannes Schindelin
2007-09-26 21:53       ` Tom Clarke
2007-09-27  2:36     ` Jeff King
2007-09-27  6:08       ` David Kastrup
2007-09-27  6:43         ` David Kastrup
2007-09-27 13:30         ` Jeff King
2007-09-27 13:46           ` David Kastrup
2007-10-02  4:16       ` Jeff King
2007-10-02  5:01         ` Junio C Hamano
2007-10-02  5:08           ` Jeff King
2007-10-02  5:13             ` Jeff King
2007-10-02  6:10               ` David Kastrup
2007-10-02 16:11                 ` Jeff King
2007-10-02 16:31                   ` David Kastrup
2007-10-02 17:39                     ` Jeff King
2007-10-02 18:44                       ` David Kastrup
2007-10-03  2:28                     ` Linus Torvalds
2007-10-03  6:54                       ` Jeff King
2007-10-03 16:13                         ` Linus Torvalds
2007-10-03  8:20                       ` David Kastrup
2007-10-03 16:59                         ` Jeff King
2007-10-03 17:53                           ` Linus Torvalds
2007-10-03 18:09                             ` David Kastrup
2007-10-04  7:10                       ` Junio C Hamano
2007-09-28  3:24     ` Daniel Barkalow
2007-10-02  5:53     ` Junio C Hamano
2007-10-02  6:41       ` Steven Grimm
2007-10-02  6:44       ` Steffen Prohaska
2007-10-02  7:03         ` Matthieu Moy
2007-10-02  7:21           ` Junio C Hamano
2007-10-02  8:01             ` David Kågedal
2007-10-02  8:07             ` Matthieu Moy
2007-10-02 17:44               ` Junio C Hamano
2007-10-03  5:30                 ` [PATCH] rebase: make the warning more useful when the work tree is unclean Junio C Hamano
2007-10-03  9:02                   ` Matthieu Moy
2007-10-02 12:52       ` What's cooking in git.git (topics) Johannes Schindelin
2007-10-02 17:00       ` Daniel Barkalow