People unaware of the importance of "git gc"?

git@vger.kernel.org mailing list mirror (one of many)

* People unaware of the importance of "git gc"?
@ 2007-09-05  7:09 Linus Torvalds
  2007-09-05  7:21 ` Martin Langhoff
                   ` (6 more replies)
  0 siblings, 7 replies; 97+ messages in thread
From: Linus Torvalds @ 2007-09-05  7:09 UTC (permalink / raw)
  To: Git Mailing List

So we had a git bof at linux.conf.eu yesterday, and I learnt something
new: even people who have been using git for a long time apparently don't 
necessarily realize the importance of repacking.

James Bottomley (the Linux SCSI maintainer) is an old-time BK user, and 
very comfy using git. But when he was demonstrating things on his poor old 
laptop, simple things like "git branch" literally took a long time, and 
James didn't seem to realize that the fact that he had apparently never 
ever repacked his repository was a big deal.

The kernel archive is a 190MB pack for me fully repacked (I just checked - 
I had actually thought that it was somewhat larger than that), but because 
James hadn't repacked, his .git directory was over a gigabyte in size, and 
his laptop wasn't able to cache anything at all effectively as a result.

Repacking it took over an hour, simply because everything was *so* 
unpacked, and James' kernel repository had something like 92 thousand 
loose objects, and several hundred packfiles. Simple operations that 
really take much less than a second for me ("git branch" takes 0.022s on 
my laptop, which has the same 512M that James had on his) took many many 
seconds as a result, and James seemed to think that this was all normal.

And James didn't even want to repack, because it was so expensive (which 
he knew - he claims to have never ever repacked at all, but maybe he had 
started it and just control-C'd it when it was really slow at some point).

Now, it may be that James didn't realize how important the occasional 
garbage collect is exactly *because* he is an old-timer and used BK long 
before he used git, and just continued using git simply as a BK 
replacement, but it did make me wonder whether maybe this lack of 
repacking awareness is fairly common. 

I've been against automatic repacking, but that was really based on what 
appears to be potentially a very wrong assumption, namely that people 
would do the manual repack on their own. If it turns out that people don't 
do it, maybe the right thing for git to do really is to at least notify 
people when they have way too many pack-files and/or loose objects.

I personally repack everything way more often than is necessary, and I had 
kind of assumed that people did it that way, but I was apparently wrong. 
Comments?

		Linus
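
A minimal sketch of the kind of check being proposed here, assuming the on-disk layout of the time: each loose object is a file under .git/objects/XX/ whose name is the remaining 38 hex digits of its SHA-1. The function names, and passing the limit in as a parameter, are illustrative and not taken from git's source.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* A loose object file name is the last 38 hex digits of the SHA-1;
 * the first two digits are the name of the fan-out directory. */
int is_loose_object_name(const char *name)
{
	size_t i;

	if (strlen(name) != 38)
		return 0;
	for (i = 0; i < 38; i++)
		if (!isxdigit((unsigned char)name[i]))
			return 0;
	return 1;
}

/* Count loose objects under objects_dir (e.g. ".git/objects").
 * Fan-out directories that do not exist contribute zero. */
unsigned long count_loose_objects(const char *objects_dir)
{
	unsigned long count = 0;
	int i;

	assert(objects_dir != NULL);
	for (i = 0; i < 256; i++) {
		char path[4096];
		DIR *dir;
		struct dirent *de;

		snprintf(path, sizeof(path), "%s/%02x", objects_dir, i);
		dir = opendir(path);
		if (!dir)
			continue;
		while ((de = readdir(dir)) != NULL)
			if (is_loose_object_name(de->d_name))
				count++;
		closedir(dir);
	}
	return count;
}

/* Hypothetical policy: warn once the count exceeds the given limit. */
int needs_repack_warning(unsigned long loose, unsigned long limit)
{
	return loose > limit;
}
```

A caller might run count_loose_objects(".git/objects") at the end of a command and print a hint when needs_repack_warning() is true. Note that walking all 256 directories is itself not free on a badly unpacked repository, so a real check would probably want something cheaper.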

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
@ 2007-09-05  7:21 ` Martin Langhoff
  2007-09-05  7:37   ` Karl Hasselström
  2007-09-05  7:30 ` Junio C Hamano
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 97+ messages in thread
From: Martin Langhoff @ 2007-09-05  7:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 9/5/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I personally repack everything way more often than is necessary, and I had
> kind of assumed that people did it that way, but I was apparently wrong.
> Comments?

(resent with CC to git@)

I never followed up on one of your suggestions back in the day -- that
we printed an informational msg along the lines of "you have X loose
objects, it's about time to repack" after some operations (fetch,
merge, commit). These days it's all C, so I'll pass the buck to people
that actually know how to do printf() ;-)

Also -- early users got everything exploded during clone; James is
probably one of them. It is the worst-case scenario, really. Users of
a modern git will start off with large packs, and accumulate little
packs from pulls, so it's not as bad.

In fact, in James' case, it would have been way way way faster to
"steal" the packs from git.kernel.org via http (or your laptop) and
_then_ repack. He'd have been sorted in a minute.

cheers,

martin

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:30 ` Junio C Hamano
@ 2007-09-05  7:26   ` Tomash Brechko
  2007-09-05  8:13   ` Johan Herland
  2007-09-05  8:51   ` Wincent Colaiuta
  2 siblings, 0 replies; 97+ messages in thread
From: Tomash Brechko @ 2007-09-05  7:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

Hi!

On Wed, Sep 05, 2007 at 00:30:35 -0700, Junio C Hamano wrote:
> Perhaps _exiting_ "git-commit" and "git-fetch" before doing
> anything, when the repository has more than 5000 loose objects
> with a LOUD bang that instructs an immediate repack would be
> good?

This may break automation.  I run git-gc monthly via cron, but that
doesn't guarantee I won't get 5000 loose objects before that.  And I
agree that an automatic run is annoying.  Perhaps a simple BIG FAT
WARNING is best after all.


-- 
   Tomash Brechko

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
  2007-09-05  7:21 ` Martin Langhoff
@ 2007-09-05  7:30 ` Junio C Hamano
  2007-09-05  7:26   ` Tomash Brechko
                     ` (2 more replies)
  2007-09-05  7:42 ` Pierre Habouzit
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05  7:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I personally repack everything way more often than is necessary, and I had 
> kind of assumed that people did it that way, but I was apparently wrong. 
> Comments?

I am as much of an old-timer as you are, so I am not qualified to add
much variety to the discussion, but I agree that excessive cruft is
something we should warn the user about.

I personally was _extremely_ annoyed by git-cvsimport
occasionally deciding to repack whenever it finds more than
a certain number of loose objects, not because it is a big import,
but because I happened to start the command to start a very
small import after doing my own development for a while to
accumulate loose objects, and I really hate automatic repacking
for any operation (or tool that thinks it knows better than I do
in general).

Perhaps _exiting_ "git-commit" and "git-fetch" before doing
anything, when the repository has more than 5000 loose objects
with a LOUD bang that instructs an immediate repack would be
good?

I really do not like the idea of automatically running a repack
after first interrupting the original command and then resuming.
For one thing it would make a horribly difficult situation to
debug if anything goes wrong.  You cannot reproduce such a
situation easily.

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:21 ` Martin Langhoff
@ 2007-09-05  7:37   ` Karl Hasselström
  0 siblings, 0 replies; 97+ messages in thread
From: Karl Hasselström @ 2007-09-05  7:37 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Linus Torvalds, Git Mailing List

On 2007-09-05 19:21:29 +1200, Martin Langhoff wrote:

> I never followed up on one of your suggestions back in the day --
> that we printed an informational msg along the lines of "you have X
> loose objects, it's about time to repack" after some operations
> (fetch, merge, commit).

git-gui pops up a dialog that says precisely that, and gives you the
choice of repacking right then and there, or skipping it.

As for truly automatic repacking after commands such as fetch, it
could probably be a config option (defaulting to "on"). It'd be
important to have "press any key to abort repacking (with no ill
effects)" type functionality, though.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
  2007-09-05  7:21 ` Martin Langhoff
  2007-09-05  7:30 ` Junio C Hamano
@ 2007-09-05  7:42 ` Pierre Habouzit
  2007-09-05  8:16   ` Junio C Hamano
                     ` (2 more replies)
  2007-09-05  8:16 ` People unaware of the importance of "git gc"? David Kastrup
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 97+ messages in thread
From: Pierre Habouzit @ 2007-09-05  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Wed, Sep 05, 2007 at 07:09:27AM +0000, Linus Torvalds wrote:
> I've been against automatic repacking, but that was really based on what 
> appears to be potentially a very wrong assumption, namely that people 
> would do the manual repack on their own. If it turns out that people don't 
> do it, maybe the right thing for git to do really is to at least notify 
> people when they have way too many pack-files and/or loose objects.

  Well, independently of whether one can expect users to run gc on
their own, the big nasty problem with repacking is that it's really
slow. And I just can't imagine that git, which I use because commits are
blazingly fast, would then be unavailable for a very long time (repacks
on my projects -- which are not as big as the kernel, but still -- usually
take more than 10 to 20 seconds each).

> I personally repack everything way more often than is necessary, and I had 
> kind of assumed that people did it that way, but I was apparently wrong. 
> Comments?

  I do, when I'm bored and can't get things done. You know, it has
become one of my many twitches when I have an empty tty in front of me
and I'm doing nothing useful. When I'm in a hack attack, though, I don't
necessarily remember to repack. I'm in one of the (not so many?) very
lucky companies (yay start-ups) where I could show that git was far
superior, and we now use it as our sole SCM. So when I'm in a hack
attack, it's usually a busy week, and new patches, trees and objects
(sometimes with large binary things in them) flow like hell, and the
repository grows larger and larger. The way we chose to work around the
"I'm coding, don't bother me with administrivia" attitude is that our
users have a small cron job that runs git gc each day, and an aggressive
repack (with a window of 50 or 100, I don't remember) each weekend.
Because the best criterion for when to repack a repository is: when
there is no one at the computer.

  It has proven quite good, as we have never seen a repository explode
in a day, even after some funny mistakes where people rebased some big
parts of the tree many times, generating a very large number of loose
objects.

  I know I'm not really answering the question, but the point I'm trying
to make is that yeah, some kind of automated way to run the gc is great,
but I'm not sure that _git_ is the tool to automate that, because when
*I* use git, I expect it to be just plain fast, and I don't want it to
occasionally hang.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:30 ` Junio C Hamano
  2007-09-05  7:26   ` Tomash Brechko
@ 2007-09-05  8:13   ` Johan Herland
  2007-09-05  8:39     ` Matthieu Moy
  2007-09-05  8:51   ` Wincent Colaiuta
  2 siblings, 1 reply; 97+ messages in thread
From: Johan Herland @ 2007-09-05  8:13 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Linus Torvalds

On Wednesday 05 September 2007, Junio C Hamano wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > I personally repack everything way more often than is necessary, and I had 
> > kind of assumed that people did it that way, but I was apparently wrong. 
> > Comments?
> 
> I am as old timer as you are so I am not qualified to add much
> variety to the discussion, but I agree that excessive cruft is
> something we should warn the user about.
> 
> I personally was _extremely_ annoyed by git-cvsimport
> occassionary deciding to repack whenever it finds more than
> certain number of loose objects, not because it is a big import,
> but because I happened to start the command to start a very
> small import after doing my own development for a while to
> accumulate loose objects, and I really hate automatic repacking
> for any operation (or tool that thinks it knows better than I do
> in general).
> 
> Perhaps _exiting_ "git-commit" and "git-fetch" before doing
> anything, when the repository has more than 5000 loose objects
> with a LOUD bang that instructs an immediate repack would be
> good?
> 
> I really do not like the idea of automatically running a repack
> after first interrupting the original command and then resuming.
> For one thing it would make a horribly difficult situation to
> debug if anything goes wrong.  You cannot reproduce such a
> situation easily.

What about some sort of middle ground:

When git-fetch or git-commit has done its job and is about to exit, it checks 
the number of loose objects, and if it is too high tells the user something 
like "There are too many loose objects in the repo, do you want me to repack? 
(y/N)". If the user answers "n" or simply <Enter>, it exits immediately 
without doing anything, but if the user answers "y", or if there is no 
response, say, within a minute (i.e. the user went to lunch), the repack is 
initiated. (Of course, the user should be told that a Ctrl-C will abort the 
repack and not be harmful in any way.)

If the user answers "n" (or aborts the repack), the question will keep popping 
up on the next git-{commit,fetch} to remind/annoy the user until a repack is 
done.


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:42 ` Pierre Habouzit
@ 2007-09-05  8:16   ` Junio C Hamano
  2007-09-05  8:50   ` Steven Grimm
  2007-09-05 17:51   ` Nix
  2 siblings, 0 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05  8:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Pierre Habouzit <madcoder@debian.org> writes:

>   I do, when I'm bored and that I can't get things done. you know, it
> has become one of my many twitches when I have an empty tty in front of
> me and that I'm doing nothing useful.

Very well said ;-)

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
                   ` (2 preceding siblings ...)
  2007-09-05  7:42 ` Pierre Habouzit
@ 2007-09-05  8:16 ` David Kastrup
  2007-09-05 16:47 ` Govind Salinas
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05  8:16 UTC (permalink / raw)
  To: git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Now, it may be that James didn't realize how important the
> occasional garbage collect is exactly *because* he is an old-timer
> and used BK long before he used git, and just continued using git
> simply as a BK replacement, but it did make me wonder whether maybe
> this lack of repacking awareness is fairly common.
>
> I've been against automatic repacking, but that was really based on
> what appears to be potentially a very wrong assumption, namely that
> people would do the manual repack on their own. If it turns out that
> people don't do it, maybe the right thing for git to do really is to
> at least notify people when they have way too many pack-files and/or
> loose objects.
>
> I personally repack everything way more often than is necessary, and
> I had kind of assumed that people did it that way, but I was
> apparently wrong.  Comments?

Can it be that getting rid of unused objects is harder once they are
packed?  If that is the case, an automatic pack while mucking about
with temporary branches and/or confidential files would be quite a
nuisance.

Automatic packing maybe would be acceptable if packing was really
transparent to what you do with your repo (including janitoring work).
And it would be nice if automatic packing could be done in an
incremental manner, not bogging down normal work.

-- 
David Kastrup

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:13   ` Johan Herland
@ 2007-09-05  8:39     ` Matthieu Moy
  2007-09-05  8:41       ` Johan Herland
  2007-09-05  8:51       ` Pierre Habouzit
  0 siblings, 2 replies; 97+ messages in thread
From: Matthieu Moy @ 2007-09-05  8:39 UTC (permalink / raw)
  To: Johan Herland; +Cc: git, Junio C Hamano, Linus Torvalds

Johan Herland <johan@herland.net> writes:

> When git-fetch and git-commit has done its job and is about to exit, it checks 
> the number of loose object, and if too high tells the user something 
> like "There are too many loose objects in the repo, do you want me to repack? 
> (y/N)". If the user answers "n" or simply <Enter>,

I don't like commands to be interactive if they don't _need_ to be so.
It kills scripting, it makes it hard for a front-end (git gui or so)
to use the command, ...

-- 
Matthieu

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:39     ` Matthieu Moy
@ 2007-09-05  8:41       ` Johan Herland
  2007-09-05  8:47         ` David Kastrup
  2007-09-05  8:51       ` Pierre Habouzit
  1 sibling, 1 reply; 97+ messages in thread
From: Johan Herland @ 2007-09-05  8:41 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git, Junio C Hamano, Linus Torvalds

On Wednesday 05 September 2007, Matthieu Moy wrote:
> Johan Herland <johan@herland.net> writes:
> 
> > When git-fetch and git-commit has done its job and is about to exit, it checks 
> > the number of loose object, and if too high tells the user something 
> > like "There are too many loose objects in the repo, do you want me to repack? 
> > (y/N)". If the user answers "n" or simply <Enter>,
> 
> I don't like commands to be interactive if they don't _need_ to be so.
> It kills scripting, it makes it hard for a front-end (git gui or so)
> to use the command, ...

Ok, so add an option or config variable to turn on/off this behaviour.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:41       ` Johan Herland
@ 2007-09-05  8:47         ` David Kastrup
  0 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05  8:47 UTC (permalink / raw)
  To: git

Johan Herland <johan@herland.net> writes:

> On Wednesday 05 September 2007, Matthieu Moy wrote:
>> Johan Herland <johan@herland.net> writes:
>> 
>> > When git-fetch and git-commit has done its job and is about to exit, it checks 
>> > the number of loose object, and if too high tells the user something 
>> > like "There are too many loose objects in the repo, do you want me to repack? 
>> > (y/N)". If the user answers "n" or simply <Enter>,
>> 
>> I don't like commands to be interactive if they don't _need_ to be so.
>> It kills scripting, it makes it hard for a front-end (git gui or so)
>> to use the command, ...
>
> Ok, so add an option or config variable to turn on/off this behaviour.

A bad idea which one can optionally turn off remains a bad idea for
everyone who has not yet been bitten by it badly enough to actually
look up the problem and its remedy.

Make this a warning.

-- 
David Kastrup

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:42 ` Pierre Habouzit
  2007-09-05  8:16   ` Junio C Hamano
@ 2007-09-05  8:50   ` Steven Grimm
       [not found]     ` <86ps0xcwxo.fsf@lola.quinscape.zz>
                       ` (3 more replies)
  2007-09-05 17:51   ` Nix
  2 siblings, 4 replies; 97+ messages in thread
From: Steven Grimm @ 2007-09-05  8:50 UTC (permalink / raw)
  To: Linus Torvalds, Git Mailing List

Pierre Habouzit wrote:
>   Well independently from the fact that one could suppose that users
> should use gc on their own, the big nasty problem with repacking is that
> it's really slow. And I just can't imagine git that I use to commit
> blazingly fast, will then be unavailable for a very long time (repacks
> on my projects -- that are not as big as the kernel but still -- usually
> take more than 10 to 20 seconds each).
>   

What about kicking off a repack in the background at the ends of certain 
commands? With an option to disable, of course. It could run at a low 
priority and could even sleep a lot to avoid saturating the system's 
disks -- since it'd be running asynchronously there should be no problem 
if it takes longer to run.

Alternately, if it's possible to break the repack work up into chunks 
that can be executed a bit at a time, you could do a small amount of 
repacking very frequently (possibly still in the background) rather than 
the whole thing at once. I suspect the nature of a repack, where you 
presumably want everything loaded at once, would make that a challenge, 
but it might not be impossible.

On the more general question...

IMO expecting end users to regularly perform what are essentially 
database administration tasks (running git-gc is akin to rebuilding 
indexes or packing tables on a DBMS) is naive. Heck, even database 
administrators don't like to run database administration commands; 
PostgreSQL added the "autovacuum" feature precisely because manual 
periodic repacking (and the associated monitoring to figure out when to 
do it) was too annoying for developers and DBAs. But you don't have to 
look that far; anyone who has worked in IT can tell you horror stories 
of users, including developers, whose computers have slowed to a crawl 
because the users never bothered to defrag their hard disks. And that 
affects *everything* the users do, not just version control operations!

It'll get worse as better UIs and tool integration become available and 
git gains large numbers of users who are neither software developers nor 
system administrators, and wouldn't know a packfile from a hole in the 
ground. I'm talking web designers, graphic artists, mechanical 
engineers, even managers and secretaries -- all of those people are in 
git's ultimate target audience, even if it's not ready for them today. 
None of them is going to be interested in doing random housekeeping 
operations by hand, but they'll all appreciate a fast environment.

The fact that git sometimes stores your files individually in the .git 
directory and sometimes bundles them together into big archives should 
be an implementation detail that end-users don't have to worry about day 
to day; git should do the right thing to remain fast under typical usage 
scenarios, while leaving the plumbing exposed so people with atypical 
usage can get their stuff done too.

-Steve
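
The background repack suggested near the top of this message could be launched roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, only CPU priority is lowered (via nice()), and nothing here throttles disk I/O or sleeps the way the message proposes.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] as a low-priority child and return its pid without
 * waiting for it. Returns -1 if fork() fails. */
pid_t spawn_low_priority(char *const argv[])
{
	pid_t pid;

	assert(argv != NULL && argv[0] != NULL);
	pid = fork();
	if (pid == 0) {
		if (nice(19) == -1) {
			/* could not lower priority; carry on anyway */
		}
		execvp(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}
	return pid;
}

/* Non-blocking check: 1 if the child has finished, 0 otherwise. */
int child_finished(pid_t pid, int *status)
{
	return waitpid(pid, status, WNOHANG) == pid;
}
```

A caller would pass something like { "git", "gc", NULL } and carry on with its own work. A real implementation would also have to decide what happens to the child's output and what to do when the parent exits first.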

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:30 ` Junio C Hamano
  2007-09-05  7:26   ` Tomash Brechko
  2007-09-05  8:13   ` Johan Herland
@ 2007-09-05  8:51   ` Wincent Colaiuta
  2 siblings, 0 replies; 97+ messages in thread
From: Wincent Colaiuta @ 2007-09-05  8:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On 5/9/2007, at 9:30, Junio C Hamano wrote:

> Perhaps _exiting_ "git-commit" and "git-fetch" before doing
> anything, when the repository has more than 5000 loose objects
> with a LOUD bang that instructs an immediate repack would be
> good?
>
> I really do not like the idea of automatically running a repack
> after first interrupting the original command and then resuming.
> For one thing it would make a horribly difficult situation to
> debug if anything goes wrong.  You cannot reproduce such a
> situation easily.

I would strongly oppose any *automatic* repacking and strongly  
support any *advisory* recommendation to repack when the loose object  
count exceeds a certain threshold. I don't think *exiting* a command  
in such cases is a good idea; worse than automatic repacking this  
would be *forced* manual repacking, which isn't very user-friendly.

Cheers,
Wincent

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:39     ` Matthieu Moy
  2007-09-05  8:41       ` Johan Herland
@ 2007-09-05  8:51       ` Pierre Habouzit
  2007-09-05  9:02         ` David Kastrup
  2007-09-05  9:04         ` Matthieu Moy
  1 sibling, 2 replies; 97+ messages in thread
From: Pierre Habouzit @ 2007-09-05  8:51 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Johan Herland, git, Junio C Hamano, Linus Torvalds

On Wed, Sep 05, 2007 at 08:39:52AM +0000, Matthieu Moy wrote:
> Johan Herland <johan@herland.net> writes:
> 
> > When git-fetch and git-commit has done its job and is about to exit, it checks 
> > the number of loose object, and if too high tells the user something 
> > like "There are too many loose objects in the repo, do you want me to repack? 
> > (y/N)". If the user answers "n" or simply <Enter>,
> 
> I don't like commands to be interactive if they don't _need_ to be so.
> It kills scripting, it makes it hard for a front-end (git gui or so)
> to use the command, ...

  There is absolutely no problem here, as it can be avoided if the
output is not a tty. It's not _that_ hard to guess if you're currently
running in a script or in an interactive shell after all.

  Really, git commit/fetch/... whatever suggesting to repack/gc when it
believes it begins to be critical to performance is not a bad idea.
The risk is that the warning could be printed very often, but
that can be avoided trivially by just writing to a state file in the
.git directory that the warning was printed not so long ago, and
that git should STFU for some more commits/time.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
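
The state-file idea in the last paragraph might look something like the sketch below. It assumes a time-based quiet period only (not a commit count); the function names and the notion of a single timestamp in the file are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Read the time of the last warning from state_path; 0 if none recorded. */
time_t read_last_warning(const char *state_path)
{
	FILE *fp = fopen(state_path, "r");
	long long stamp = 0;

	if (!fp)
		return 0;
	if (fscanf(fp, "%lld", &stamp) != 1)
		stamp = 0;
	fclose(fp);
	return (time_t)stamp;
}

/* Record that a warning was printed at time `now`. Returns 0 on success. */
int write_last_warning(const char *state_path, time_t now)
{
	FILE *fp = fopen(state_path, "w");

	if (!fp)
		return -1;
	fprintf(fp, "%lld\n", (long long)now);
	return fclose(fp) == 0 ? 0 : -1;
}

/* Warn only if at least quiet_seconds have passed since the last warning. */
int should_warn(time_t last, time_t now, time_t quiet_seconds)
{
	assert(quiet_seconds >= 0);
	if (last == 0)	/* never warned before */
		return 1;
	return now - last >= quiet_seconds;
}
```

A command would call read_last_warning() on a file inside .git, print the hint only when should_warn() says so, and then call write_last_warning() with the current time.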

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:51       ` Pierre Habouzit
@ 2007-09-05  9:02         ` David Kastrup
  2007-09-05  9:04         ` Matthieu Moy
  1 sibling, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05  9:02 UTC (permalink / raw)
  To: git

Pierre Habouzit <madcoder@debian.org> writes:

> On Wed, Sep 05, 2007 at 08:39:52AM +0000, Matthieu Moy wrote:
>> Johan Herland <johan@herland.net> writes:
>> 
>> > When git-fetch and git-commit has done its job and is about to exit, it checks 
>> > the number of loose object, and if too high tells the user something 
>> > like "There are too many loose objects in the repo, do you want me to repack? 
>> > (y/N)". If the user answers "n" or simply <Enter>,
>> 
>> I don't like commands to be interactive if they don't _need_ to be so.
>> It kills scripting, it makes it hard for a front-end (git gui or so)
>> to use the command, ...
>
>   There is absolutely no problem here, as it can be avoided if the
> output is not a tty.

Which output?  stdout?  stderr?  Where is the question appearing?
What if the command has been started in the background?  What if stdin
(not stdout) is from a pipe, maybe for taking a commit message?  What
if stdin is from a pseudo-tty because the commit has been started with
an internal shell command inside of Emacs, and the command/message
will only get echoed once git-commit completes?

> It's not _that_ hard to guess if you're currently running in a
> script or in an interactive shell after all.

Oh, it is not hard to _guess_.  Just throw a die.  What is hard is to
_know_ 100% sure that one is doing the right thing and not breaking
any legitimate use.

-- 
David Kastrup
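
For reference, the usual guess looks like the sketch below; the function name is hypothetical. As the message above argues, it is only a heuristic: it says "no" when stdin is a pipe, but it would still say "yes" under a pseudo-tty even when no human is there to answer.

```c
#include <assert.h>
#include <unistd.h>

/* Heuristic only: true when both the descriptor the answer would be read
 * from and the descriptor the question would be printed to are terminals. */
int looks_interactive(int in_fd, int out_fd)
{
	assert(in_fd >= 0 && out_fd >= 0);
	return isatty(in_fd) && isatty(out_fd);
}
```

A prompt would typically be guarded with looks_interactive(STDIN_FILENO, STDERR_FILENO), which answers the "which output?" question by choosing stderr; that choice is itself an assumption.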

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:51       ` Pierre Habouzit
  2007-09-05  9:02         ` David Kastrup
@ 2007-09-05  9:04         ` Matthieu Moy
  1 sibling, 0 replies; 97+ messages in thread
From: Matthieu Moy @ 2007-09-05  9:04 UTC (permalink / raw)
  To: Johan Herland; +Cc: git, Junio C Hamano, Linus Torvalds

Pierre Habouzit <madcoder@debian.org> writes:

> On Wed, Sep 05, 2007 at 08:39:52AM +0000, Matthieu Moy wrote:
>> Johan Herland <johan@herland.net> writes:
>> 
>> > When git-fetch and git-commit has done its job and is about to exit, it checks 
>> > the number of loose object, and if too high tells the user something 
>> > like "There are too many loose objects in the repo, do you want me to repack? 
>> > (y/N)". If the user answers "n" or simply <Enter>,
>> 
>> I don't like commands to be interactive if they don't _need_ to be so.
>> It kills scripting, it makes it hard for a front-end (git gui or so)
>> to use the command, ...
>
>   There is absolutely no problem here, as it can be avoided if the
> output is not a tty. It's not _that_ hard to guess if you're currently
> running in a script or in an interactive shell after all.

I do find it hard to guess _reliably_ if you're running interactively
or not. For example, I've been bitten recently by "git log" running
inside a pager while I was launching it non-interactively inside Emacs
(as part of DVC). I don't know whether this was git's or Emacs's
fault, and the fix was not too hard (GIT_PAGER=cat), but it took some
of my time to get it working.

Adding more interactive stuff means adding more opportunities for this
kind of problem. None will be a huge problem, but each one will
take some time to fix (I'm pretty sure adding an interactive
prompt in git-commit will break DVC's commit functionality, and we'll
have to fix it).

>   Really, git commit/fetch/... whatever suggesting to repack/gc when it
> believes it begins to be critical to performance is not a bad idea.

_Suggesting_ is a good idea, definitely. Something like

if (number_of_unpacked > 1000 && number_of_unpacked < 10000) {
	printf("more than 1000 unpacked objects. Think of running git-gc\n");
} else if (number_of_unpacked >= 10000) {
	printf("HEY, WHAT THE HELL ARE YOU DOING WITH >10000 UNPACKED OBJECTS???\n"
	       "I TOLD YOU TO REPACK\n");
}

would be fine with me. The proposal to run git-gc in the background,
with low priority seems to be a good idea too.

But please, don't put an interactive prompt where it's not needed.

-- 
Matthieu

* Re: People unaware of the importance of "git gc"?
       [not found]     ` <86ps0xcwxo.fsf@lola.quinscape.zz>
@ 2007-09-05  9:07       ` Steven Grimm
  2007-09-05  9:13         ` David Kastrup
  0 siblings, 1 reply; 97+ messages in thread
From: Steven Grimm @ 2007-09-05  9:07 UTC (permalink / raw)
  To: David Kastrup; +Cc: Linus Torvalds, Git Mailing List

David Kastrup wrote:
> You'll potentially get accumulating unfinished files from
> aborted/killed repack processes.

Which can get cleaned up when the next repack starts. This is no 
different from unfinished files accumulating from aborted/killed manual 
repacks.

> If communication fails, you'll get a
> new repack session for every command you start.

Git handles this already:

$ git-gc
fatal: unable to create '.git/packed-refs.lock': File exists
error: failed to run pack-refs

Presumably in that case you would simply not fire up a new repack.
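
To illustrate the "first one wins" behaviour, here is a shell analogy of
lock-file exclusion. It is only a sketch: try_lock and run_exclusively are
hypothetical names, and this is not how git implements its own .lock files.

```shell
# Try to take a lock by creating a file that must not already exist.
# "set -C" (noclobber) makes the redirection fail if the file is present.
try_lock() {
    ( set -C; : > "$1" ) 2>/dev/null
}

# Hypothetical wrapper: run a command only if we got the lock, then
# release the lock again whatever the command returned.
run_exclusively() {
    lock=$1
    shift
    if try_lock "$lock"; then
        status=0
        "$@" || status=$?
        rm -f "$lock"
        return "$status"
    fi
    echo "another repack seems to be running; skipping" >&2
    return 0
}
```

A process killed between taking and releasing the lock leaves the file
behind, so a scheme like this still needs some way to recognise stale locks.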

>   If a repository is used by multiple people...
>   

Then the first one will kick off the repack, and subsequent ones won't.

> And so on.  The multiuser aspect makes it a bad idea to do any
> janitorial tasks automatically.  You don't really want every user to
> start a repack at the same time.
>   

Quite true, but that's already impossible, so not a problem.

One other thing: The heuristics for this can be such that users who are 
already regularly running git-gc by hand will see no change in behavior. 
Their repos will never get to a bad enough state that the automatic 
git-gc is invoked. Old-timers who run git-gc might, in theory, never 
even notice a change like this.

-Steve

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:50   ` Steven Grimm
       [not found]     ` <86ps0xcwxo.fsf@lola.quinscape.zz>
@ 2007-09-05  9:07     ` Junio C Hamano
  2007-09-05  9:27       ` Martin Langhoff
  2007-09-05  9:13     ` David Kastrup
  2007-09-05  9:14     ` Pierre Habouzit
  3 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05  9:07 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Linus Torvalds, Git Mailing List

Steven Grimm <koreth@midwinter.com> writes:

> The fact that git sometimes stores your files individually in the .git
> directory and sometimes bundles them together into big archives should
> be an implementation detail that end-users don't have to worry about
> day to day...

[alias]
        begin = gc
        leave = gc

That is, the user's manual says 'at the beginning of the day,
run "git begin" to start the day, and at the end of day, run
"git leave" to conclude your day', without saying why ;-)

* Re: People unaware of the importance of "git gc"?
  2007-09-05  9:07       ` Steven Grimm
@ 2007-09-05  9:13         ` David Kastrup
  0 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05  9:13 UTC (permalink / raw)
  To: git

Steven Grimm <koreth@midwinter.com> writes:

> David Kastrup wrote:
>> You'll potentially get accumulating unfinished files from
>> aborted/killed repack processes.
>
> Which can get cleaned up when the next repack starts. This is no
> different from unfinished files accumulating from aborted/killed
> manual repacks.
>
>> If communication fails, you'll get a
>> new repack session for every command you start.
>
> Git handles this already:
>
> $ git-gc
> fatal: unable to create '.git/packed-refs.lock': File exists
> error: failed to run pack-refs
>
> Presumably in that case you would simply not fire up a new repack.
>
>>   If a repository is used by multiple people...
>>   
>
> Then the first one will kick off the repack, and subsequent ones won't.

And the first one might get habitually killed by the user unwittingly
having started it (because he really only logs in for shorter amounts
of times than needed for git-gc to finish), wasting disk space and
time all the while.

-- 
David Kastrup

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:50   ` Steven Grimm
       [not found]     ` <86ps0xcwxo.fsf@lola.quinscape.zz>
  2007-09-05  9:07     ` Junio C Hamano
@ 2007-09-05  9:13     ` David Kastrup
  2007-09-05  9:14     ` Pierre Habouzit
  3 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05  9:13 UTC (permalink / raw)
  To: git

Steven Grimm <koreth@midwinter.com> writes:

> Pierre Habouzit wrote:
>>   Well independently from the fact that one could suppose that users
>> should use gc on their own, the big nasty problem with repacking is that
>> it's really slow. And I just can't imagine git that I use to commit
>> blazingly fast, will then be unavailable for a very long time (repacks
>> on my projects -- that are not as big as the kernel but still -- usually
>> take more than 10 to 20 seconds each).
>>   
>
> What about kicking off a repack in the background at the ends of
> certain commands? With an option to disable, of course. It could run
> at a low priority and could even sleep a lot to avoid saturating the
> system's disks -- since it'd be running asynchronously there should
> be no problem if it takes longer to run.

You'll potentially get accumulating unfinished files from
aborted/killed repack processes.  If communication fails, you'll get a
new repack session for every command you start.  If a repository is
used by multiple people...

And so on.  The multiuser aspect makes it a bad idea to do any
janitorial tasks automatically.  You don't really want every user to
start a repack at the same time.

-- 
David Kastrup

* Re: People unaware of the importance of "git gc"?
  2007-09-05  8:50   ` Steven Grimm
                       ` (2 preceding siblings ...)
  2007-09-05  9:13     ` David Kastrup
@ 2007-09-05  9:14     ` Pierre Habouzit
  3 siblings, 0 replies; 97+ messages in thread
From: Pierre Habouzit @ 2007-09-05  9:14 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Linus Torvalds, Git Mailing List

On Wed, Sep 05, 2007 at 08:50:04AM +0000, Steven Grimm wrote:
> Pierre Habouzit wrote:
> >  Well independently from the fact that one could suppose that users
> >should use gc on their own, the big nasty problem with repacking is that
> >it's really slow. And I just can't imagine git that I use to commit
> >blazingly fast, will then be unavailable for a very long time (repacks
> >on my projects -- that are not as big as the kernel but still -- usually
> >take more than 10 to 20 seconds each).
> >  
> 
> What about kicking off a repack in the background at the ends of certain 
> commands? With an option to disable, of course. It could run at a low 
> priority and could even sleep a lot to avoid saturating the system's 
> disks -- since it'd be running asynchronously there should be no problem 
> if it takes longer to run.

  There is an issue with that: repack is memory and CPU intensive. Of
course renicing the process deals with the CPU issue, but not with the
memory one. I've often seen repacks eat more than 300 to 400MB of memory
on not so big repositories: it seems (and experience tells me that, not
looking at the code) that if you have some big binary blobs (we have
.swf's and .fla's in our repository) it can consume quite a lot of RAM
to (presumably) compute efficient deltas.

  Sadly there is no way to "renice" the ram usage of a process. Once a
repack is launched, it will make your system swap, and put the whole
computer on its knees.

> IMO expecting end users to regularly perform what are essentially 
> database administration tasks (running git-gc is akin to rebuilding 
> indexes or packing tables on a DBMS) is naive. Heck, even database 
> administrators don't like to run database administration commands; 

  Well, that's what cron jobs are for. When you install a DBMS in a
reasonable enough distro, it comes with the optimizing scripts in cron,
launched at a reasonable period of the day (localtime). So the
comparison doesn't hold. And that's exactly the problem: it's quite hard
to ship git with an optimizing cron task, because we can't know where
the user will keep his repositories, and when he works, so you have
somehow to do it yourself.

  Or you can deal with that with a "rule". At work, we have our devel
trees under $HOME/dev/, so the cron we use is just a (roughly):

    find "$HOME/dev/" -maxdepth 4 -type d -name .git | while read -r repo
    do
        GIT_DIR="$repo" git gc
    done

  Since we work on NFS, for a new developer we can just set up the cron
job for him at a time when he's not supposed to be at work, and that's it.
I'm not sure there is a good solution at all.

  Or we could also provide a: git-coffee-break command that would tell
git: do whatever you want with this computer in the next 10 minutes,
there won't be anyone watching, but I assume tea-lovers will feel
excluded.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

* Re: People unaware of the importance of "git gc"?
  2007-09-05  9:07     ` Junio C Hamano
@ 2007-09-05  9:27       ` Martin Langhoff
  2007-09-05  9:33         ` Matthieu Moy
  0 siblings, 1 reply; 97+ messages in thread
From: Martin Langhoff @ 2007-09-05  9:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Steven Grimm, Linus Torvalds, Git Mailing List

On 9/5/07, Junio C Hamano <gitster@pobox.com> wrote:
> [alias]
>         begin = gc
>         leave = gc
>
> That is, the user's manual says 'at the beginning of the day,
> run "git begin" to start the day, and at the end of day, run
> "git leave" to conclude your day', without saying why ;-)

I actually like that one ;-)

Anyway - this is turning out to be a bit of a bikeshed-painting event.
You guys should google earlier discussions on this very same subject.
They have always ended in "automatic=bad", "warning=good", and
"careful or you might be called an idiot" before ;-)

cheers,


martin

* Re: People unaware of the importance of "git gc"?
  2007-09-05  9:27       ` Martin Langhoff
@ 2007-09-05  9:33         ` Matthieu Moy
  2007-09-05 14:17           ` Johan De Messemaeker
  0 siblings, 1 reply; 97+ messages in thread
From: Matthieu Moy @ 2007-09-05  9:33 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Junio C Hamano, Steven Grimm, Linus Torvalds, Git Mailing List

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> On 9/5/07, Junio C Hamano <gitster@pobox.com> wrote:
>> [alias]
>>         begin = gc
>>         leave = gc
>>
>> That is, the user's manual says 'at the beginning of the day,
>> run "git begin" to start the day, and at the end of day, run
>> "git leave" to conclude your day', without saying why ;-)
>
> I actually like that one ;-)

There's indeed a real idea behind that. The issue is that the alias
shouldn't be just "gc", but "find-all-repositories-and-do-gc-there".

Currently, AFAIK, that can only be done with a (trivial) script
external to git. I suppose this can easily be added to the core git
porcelain. Perhaps a "git gc --recursive" would do.

It doesn't solve the problem, but makes it easier to solve it (git gc
--recursive in cron for example).

-- 
Matthieu

* Re: People unaware of the importance of "git gc"?
  2007-09-05  9:33         ` Matthieu Moy
@ 2007-09-05 14:17           ` Johan De Messemaeker
  2007-09-05 17:31             ` Matthieu Moy
  0 siblings, 1 reply; 97+ messages in thread
From: Johan De Messemaeker @ 2007-09-05 14:17 UTC (permalink / raw)
  To: Matthieu Moy
  Cc: Martin Langhoff, Junio C Hamano, Steven Grimm, Linus Torvalds,
	Git Mailing List


On 05 Sep 2007, at 11:33, Matthieu Moy wrote:
>
> There's indeed a real idea behind that. The issue is that the alias
> shouldn't be just "gc", but "find-all-repositories-and-do-gc-there".
>
> Currently, AFAIK, that can only be done with a (trivial) script
> external to git. I suppose this can easily be added to the core git
> porcelain. Perhaps a "git gc --recursive" would do.
>
> It doesn't solve the problem, but makes it easier to solve it (git gc
> --recursive in cron for example).

I'm a git newb so I can be wrong here but ...

Why --recursive? Why not use the submodule-information ?

Johan

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
                   ` (3 preceding siblings ...)
  2007-09-05  8:16 ` People unaware of the importance of "git gc"? David Kastrup
@ 2007-09-05 16:47 ` Govind Salinas
  2007-09-05 17:19   ` Carl Worth
  2007-09-05 17:35   ` Steven Grimm
  2007-09-05 17:44 ` J. Bruce Fields
  2007-09-05 21:07 ` Alex Riesen
  6 siblings, 2 replies; 97+ messages in thread
From: Govind Salinas @ 2007-09-05 16:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Hi,

I am very new to git but I have thought about this a bit from a user's
perspective.  I have several thoughts on the matter.

First, I would like to point out that the hg folks like to compare
themselves to git a lot and they list the need for manual gc as a
reason to choose hg over git.  This may not be something that the git
community cares about but I thought I would point it out.

Second, it *is* a hassle.  When trying to figure out what I could
convince my co-workers to use, having to gc was something that I did
not think they would be conscious of or care enough about to do.  It
makes git more of a PITA than it could be.  Similarly, I have no idea
when it is a good time to do a gc.  After every commit?  Before push?
What if I never push a repo?  What if it is a remote repo only used to
sync up with my co-workers, do I have to go there and periodically gc?
 This is one reason why I really think that gc should be *plumbing*
and *not* porcelain.

The user should never have to trigger a gc, they should even be
discouraged from doing so.  That is how other gc systems are.  Can you
imagine if you had a Java app that had a button on it to do a gc?
When should I push it?  Should I wait till the system is getting slow
or just start spamming the button whenever I'm bored?  I know that
Java/c#/py GC are different than git gc, but they fulfill the same
basic purpose as git gc.  IE to clean up unused items and free up
resources.  Git additionally may do some re-optimization, but that is
not relevant to a user.

I know this goes against the general mood here (which seems to be
against auto-gc) but I thought I would give my $.02 as a user of git.

Thanks,
Govind.

On 9/5/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So we had a git bof at linux.conf.eu yesterday, and I learnt something
> new: even people who have been using git for a long time apparently don't
> necessarily realize the importance of repacking.
>
> James Bottomley (the Linux SCSI maintainer) is an old-time BK user, and
> very comfy using git. But when he was demonstrating things on his poor old
> laptop, simple things like "git branch" literally took a long time, and
> James didn't seem to realize that the fact that he had apparently never
> ever repacked his repository was a big deal.
>
> The kernel archive is a 190MB pack for me fully repacked (I just checked -
> I had actually thought that it was somewhat larger than that), but because
> James hadn't repacked, his .git directory was over a gigabyte in size, and
> his laptop wasn't able to cache anything at all effectively as a result.
>
> Repacking it took over an hour, simply because everything was *so*
> unpacked, and James' kernel repository had something like 92 thousand
> loose objects, and several hundred packfiles. Simple operations that
> really take much less than a second for me ("git branch" takes 0.022s on
> my laptop, which has the same 512M that James had on his) took many many
> seconds as a result, and James seemed to think that this was all normal.
>
> And James didn't even want to repack, because it was so expensive (which
> he knew - he claims to have never ever repacked at all, but maybe he had
> started it and just control-C'd it when it was really slow at some point).
>
> Now, it may be that James didn't realize how important the occasional
> garbage collect is exactly *because* he is an old-timer and used BK long
> before he used git, and just continued using git simply as a BK
> replacement, but it did make me wonder whether maybe this lack of
> repacking awareness is fairly common.
>
> I've been against automatic repacking, but that was really based on what
> appears to be potentially a very wrong assumption, namely that people
> would do the manual repack on their own. If it turns out that people don't
> do it, maybe the right thing for git to do really is to at least notify
> people when they have way too many pack-files and/or loose objects.
>
> I personally repack everything way more often than is necessary, and I had
> kind of assumed that people did it that way, but I was apparently wrong.
> Comments?
>
>                 Linus

* Re: People unaware of the importance of "git gc"?
  2007-09-05 16:47 ` Govind Salinas
@ 2007-09-05 17:19   ` Carl Worth
  2007-09-05 17:55     ` Jing Xue
  2007-09-05 17:35   ` Steven Grimm
  1 sibling, 1 reply; 97+ messages in thread
From: Carl Worth @ 2007-09-05 17:19 UTC (permalink / raw)
  To: Govind Salinas; +Cc: Linus Torvalds, Git Mailing List

On Wed, 5 Sep 2007 11:47:45 -0500, "Govind Salinas" wrote:
> I know this goes against the general mood here (which seems to be
> against auto-gc) but I thought I would give my $.02 as a user of git.

I'll throw my opinion in here as well. I think git should
automatically do repacking by default, (once loose objects exceed some
threshold). There have been several posts in this thread from people who
don't want auto-gc, but these same people should be able to avoid it,
and likely without changing habits. That's because:

  * They're already in the habit of manually repacking every once in a
    while, (or, like Linus, much more often than strictly necessary).

  * They've already got cron jobs setup to do the repacking.

And one could augment this with an option to disable the repacking of
course.

And if you're really concerned about people that don't want this
getting it anyway, just determine some useful threshold and then
double it or so before it triggers automatic repacking, (so the
automatic repacking hits only us idiots that completely neglect it).

[Pardon me for continuing to quote in the original top-posted order,
but I like the flow here.]

On 9/5/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > James didn't seem to realize that the fact that he had apparently never
> > ever repacked his repository was a big deal.

I know it was surprising to you, Linus, but I'm glad you noticed
it. I've seen the same thing from many users. And git actually
discourages users from learning about repacking. If the user starts
with a small (or new) project, then everything performs well, and
there's no performance problem whatsoever.

So then the problems creep up gradually, and the user has no idea that
he should be doing anything different than he's always done. Instead
the user is left to just conclude that git's performance isn't scaling
well as the project grows. That's a bad conclusion of course, and it's
bad that git sets things up so the user reaches that conclusion.
Instead, git should just fix things up itself in this case.

> > I've been against automatic repacking, but that was really based on what
> > appears to be potentially a very wrong assumption, namely that people
> > would do the manual repack on their own. If it turns out that people don't
> > do it, maybe the right thing for git to do really is to at least notify
> > people when they have way too many pack-files and/or loose
> > objects.

I don't think the warning message alone is a good fix. I think the
people who would understand the warning and appreciate that they could
then take care of repacking as convenient are the same people that
already understand the repacking concept, and are likely already
repacking occasionally, (so would likely never see the warning).

But the problematic case is the user who knows nothing of the
issue. And in that case, giving this warning isn't useful education,
it's just forcing the user to learn more and do more work. "If git
notices it has too many 'loose objects' and 'git gc' would fix the
problem, then why didn't it do that itself? And what the heck is a
'loose object' anyway?"

In general, git has always printed too many obscure messages that
don't actually help a new user get his work done, (and the work is
_not_ to learn more about git internals). From 1.4 to 1.5 much of that
was improved. But please let's not go backwards by adding more of
these.

So one vote from me for auto repacking, (but feel free to make the
threshold so high that anyone that actually _cares_ about loose
objects and repacking will never get the auto repack).

-Carl

* Re: People unaware of the importance of "git gc"?
  2007-09-05 14:17           ` Johan De Messemaeker
@ 2007-09-05 17:31             ` Matthieu Moy
  2007-09-05 23:56               ` Jeff King
  0 siblings, 1 reply; 97+ messages in thread
From: Matthieu Moy @ 2007-09-05 17:31 UTC (permalink / raw)
  To: Johan De Messemaeker
  Cc: Martin Langhoff, Junio C Hamano, Steven Grimm, Linus Torvalds,
	Git Mailing List

Johan De Messemaeker <johan.demessemaeker@wgaf.org> writes:

>> Currently, AFAIK, that can only be done with a (trivial) script
>> external to git. I suppose this can easily be added to the core git
>> porcelain. Perhaps a "git gc --recursive" would do.
>>
>> It doesn't solve the problem, but makes it easier to solve it (git gc
>> --recursive in cron for example).
>
> I'm a git newb so I can be wrong here but ...
>
> Why --recursive? Why not use the submodule-information ?

Not all projects are necessarily subprojects of each other.

I have ~/teaching/some-course/.git (well, almost) and ~/etc/.git which
are two unrelated projects, and to "git gc" both of them, I need
either a script, or two manual invocations.

(yes, I'm really talking about something trivial)

-- 
Matthieu

* Re: People unaware of the importance of "git gc"?
  2007-09-05 16:47 ` Govind Salinas
  2007-09-05 17:19   ` Carl Worth
@ 2007-09-05 17:35   ` Steven Grimm
  2007-09-05 18:28     ` Nix
  1 sibling, 1 reply; 97+ messages in thread
From: Steven Grimm @ 2007-09-05 17:35 UTC (permalink / raw)
  To: Govind Salinas; +Cc: Linus Torvalds, Git Mailing List

Govind Salinas wrote:
> This is one reason why I really think that gc should be *plumbing*
> and *not* porcelain.
>   

That's a good way to think of it IMO. It's a low-level operation (albeit 
one that encapsulates other, lower-level ones) that tells git to 
rearrange its internal data structures. It is not something that has any 
user-visible effect. Every other porcelain-level git command *does 
something* from the user's point of view. Running git-gc is basically a 
no-op, which from the user's point of view makes it a waste of 
keystrokes and an annoying distraction from focusing on the stuff 
they're using git to help them build.

> The user should never have to trigger a gc, they should even be
> discouraged from doing so.  That is how other gc systems are.  Can you
> imagine if you had a Java app that had a button on it to do a gc?
> When should I push it?  Should I wait till the system is getting slow
> or just start spamming the button whenever I'm bored?  I know that
> Java/c#/py GC are different than git gc, but they fulfill the same
> basic purpose as git gc.  IE to clean up unused items and free up
> resources.  Git additionally may do some re-optimization, but that is
> not relevant to a user.
>   

I'll play devil's advocate for a moment here, though, and say that, as 
others have suggested in this thread, git could be made to tell you when 
it's appropriate to run gc. So the "I don't know when to run it" 
argument isn't a hard one to address.

With that in mind, here's what the message should look like IMO:

---
Your repository can be optimized for better performance and lower disk usage.
Please run "git gc" to optimize it now, or run "git config gc.auto true" to tell
git to automatically optimize it in the future (this will launch processes in the
background.) For more information, "man git-gc".
---

And that "gc.auto" config option (just an arbitrary name, call it 
something else if that's no good) actually has four settings:

warn (the default) - prints the warning message, at most once every N
    minutes (we can determine a good value for N)
true - launches git-gc in the background as needed
false - suppresses the warning and the check that triggers the warning
foreground - launches git-gc in the foreground as needed (to make it
    easier to abort)
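
As a sketch of how small the dispatch on such a setting would be: the value
names come from the list above, while gc_auto_action and the action strings
it prints are invented for illustration and describe only this proposal.

```shell
# Map a proposed "gc.auto" value to the action a command would take when
# it notices the repository needs packing. An unset value means "warn".
gc_auto_action() {
    case "${1:-warn}" in
        warn)       echo "print-warning" ;;
        true)       echo "gc-in-background" ;;
        false)      echo "do-nothing" ;;
        foreground) echo "gc-in-foreground" ;;
        *)          echo "unknown gc.auto value: $1" >&2; return 1 ;;
    esac
}
```

Treating an unset value as "warn" keeps the default harmless: nothing is
launched until the user has opted in.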

I don't buy the "git gc takes too much memory to run in the background" 
argument as a reason automatic git-gc is a bad idea. Many of us (me 
included) work on machines with plenty of memory to launch a background 
git-gc without hampering our development work, and/or on repositories 
small enough that it doesn't eat that much memory in the first place. 
And if you make it an option that the user has to enable, people on 
low-memory machines can simply not enable it, end of problem.

One big problem with git-gc now is that it's not discoverable. Or 
rather, the need for it isn't discoverable. So at the very least we 
should print the warning, IMO -- and if we're already going to all the 
trouble to determine whether or not git-gc needs to be run, it will 
reduce the "why are you telling me to run something when you could just 
do it for me, you stupid machine?" factor if there's an easily 
discoverable way to just do it as needed.

-Steve

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
                   ` (4 preceding siblings ...)
  2007-09-05 16:47 ` Govind Salinas
@ 2007-09-05 17:44 ` J. Bruce Fields
  2007-09-05 18:46   ` Brandon Casey
  2007-09-05 21:07 ` Alex Riesen
  6 siblings, 1 reply; 97+ messages in thread
From: J. Bruce Fields @ 2007-09-05 17:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Wed, Sep 05, 2007 at 12:09:27AM -0700, Linus Torvalds wrote:
> I personally repack everything way more often than is necessary, and I had 
> kind of assumed that people did it that way, but I was apparently wrong. 
> Comments?

Well, this may just prove I'm an idiot, but one of the reasons I rarely
run it is that I have trouble remembering exactly what it does; in
particular,

	- does it prune anything that might be needed by a repo I
	  cloned with -s?
	- is there anything that's unsafe to do while the git-gc is
	  running?
	- what are the implications for http users if this is a public
	  repo?
	- is git-gc enough on its own or should I be running something
	  more aggressive occasionally too?

No doubt they all have simple answers, which probably amount to "just
don't worry about it", and which I could have found in less time than
it'd take to write this email.  But when I've got other work to do,
reading "man git-gc" is just enough effort for me to postpone the whole
thing to another day.

So, anyway, your message reminded me to run git-gc on my main working
repo.  At which point one of my personal scripts immediately started
failing--it was assuming it could find any ref under .git/refs/, and I
hadn't realized (or maybe I had once, and I'd forgotten) that git-gc
packs refs by default now.
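
For scripts in the same situation, a hedged sketch of a lookup that copes
with both layouts: a loose file under .git/refs/ and a line of the form
"<sha1> <refname>" in .git/packed-refs. resolve_ref is a made-up name, and
symbolic refs are not handled; asking git itself (for example with
"git rev-parse --verify <ref>") avoids depending on the layout at all.

```shell
# Print the object name a ref points to, checking the loose ref file
# first and falling back to the packed-refs file.
resolve_ref() {
    git_dir=$1
    ref=$2
    if [ -f "$git_dir/$ref" ]; then
        cat "$git_dir/$ref"
    elif [ -f "$git_dir/packed-refs" ]; then
        # Comment lines and "^<sha1>" peeled-tag lines never have the
        # ref name in the second field, so they are skipped naturally.
        awk -v ref="$ref" '
            $2 == ref { print $1; found = 1; exit }
            END { exit !found }
        ' "$git_dir/packed-refs"
    else
        return 1
    fi
}
```

The loose file is consulted first because a ref updated after packing is
written as a loose file again and takes precedence over the packed entry.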

Bah.  I don't know what the moral of that story is.

--b.

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:42 ` Pierre Habouzit
  2007-09-05  8:16   ` Junio C Hamano
  2007-09-05  8:50   ` Steven Grimm
@ 2007-09-05 17:51   ` Nix
  2007-09-05 18:14     ` Steven Grimm
  2 siblings, 1 reply; 97+ messages in thread
From: Nix @ 2007-09-05 17:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 5 Sep 2007, Pierre Habouzit said:
>   I know I don't really answer the question, but the point I try to make
> is that yeah, some kind of automated way to run the gc is great, but I'm
> not sure that _git_ is the tool to automate that, because when *I* use
> git, I expect it to be just plain fast, and I don't want it to
> occasionally hang.

Indeed. I repack all our git trees in the middle of the night, and our
incremental backup script drops .keep files corresponding to every
existing pack before running the backup.

This is probably a good job for cron :)

* Re: People unaware of the importance of "git gc"?
  2007-09-05 17:19   ` Carl Worth
@ 2007-09-05 17:55     ` Jing Xue
  0 siblings, 0 replies; 97+ messages in thread
From: Jing Xue @ 2007-09-05 17:55 UTC (permalink / raw)
  To: Carl Worth; +Cc: Govind Salinas, Linus Torvalds, Git Mailing List

Quoting Carl Worth <cworth@cworth.org>:

> I don't think the warning message alone is a good fix. I think the
> people who would understand the warning and appreciate that they could
> then take care of repacking as convenient are the same people that
> already understand the repacking concept, and are likely already
> repacking occasionally, (so would likely never see the warning).
>
> But the problematic case is the user who knows nothing of the
> issue. And in that case, giving this warning isn't useful education,
> it's just forcing the user to learn more and do more work. "If git
> notices it has too many 'loose objects' and 'git gc' would fix the
> problem, then why didn't it do that itself? And what the heck is a
> 'loose object' anyway?"

(my 2 cents as another ordinary new git user)
Hmm, not necessarily. That a system knows what the best action is  
doesn't mean that _right now_ is the best time to take that action.   
One subtle difference I think between git's gc and Java/python/etc.'s  
gc is that in the latter case it is, at least metaphorically, a life  
and death situation - if gc isn't run, the application will run out of  
memory, whereas in git, it's more of a performance degradation issue,  
which, sort of, can wait.

On the issue of implementation awareness, a warning message saying  
something along the lines of "your repository is getting slower. You  
might want to consider running 'git gc', and remember to do that from  
time to time." is not much different from "your file system is getting  
slower. You might want to consider running <whatever-defrag-tool>, and  
remember to do that from time to time."

Neither these messages nor the actions they propose _require_ users to  
learn what "repacking", "loose object", or "file fragments" are about  
before they can proceed.

Cheers.
-- 
Jing Xue

* Re: People unaware of the importance of "git gc"?
  2007-09-05 17:51   ` Nix
@ 2007-09-05 18:14     ` Steven Grimm
  2007-09-05 18:22       ` Nix
  0 siblings, 1 reply; 97+ messages in thread
From: Steven Grimm @ 2007-09-05 18:14 UTC (permalink / raw)
  To: Nix; +Cc: Linus Torvalds, Git Mailing List

Nix wrote:
> Indeed. I repack all our git trees in the middle of the night, and our
> incremental backup script drops .keep files corresponding to every
> existing pack before running the backup.
>
> This is probably a good job for cron :)
>   

If you are setting up cron jobs to repack multiple git trees, you are 
not the kind of novice or casual git user who this proposal would 
primarily be aimed at.

But in any event, since you are doing that, your repos will never 
accumulate a high enough percentage of loose objects (whatever the 
threshold is) to trigger the warning and/or automatic launch. So you can 
continue to operate as before, no difference in behavior, while people 
who don't know how / want to set up cron jobs will have their 
repositories cleaned too.

git-gc can leave behind a "last completed" timestamp and we can suppress 
the check for excess loose objects until some minimum amount of time has 
passed since last git-gc. If that amount is greater than the interval 
between your cron jobs, you won't even get any (measurable) overhead 
from the detection to see if the warning is needed.
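
(A minimal sketch of that throttle, assuming git-gc records its
completion time somewhere. The function name and its parameters are
made up for illustration and are not from any patch in this thread.)

```c
#include <time.h>

/*
 * Hypothetical helper: decide whether the loose-object check should
 * run at all.  last_gc is when git-gc last completed (0 if it never
 * left a timestamp), now is the current time, and min_interval is
 * the minimum number of seconds to wait between checks.
 */
int should_check_loose_objects(time_t last_gc, time_t now,
			       time_t min_interval)
{
	if (last_gc == 0)
		return 1;	/* no timestamp recorded: always check */
	if (now < last_gc)
		return 1;	/* clock went backwards: check anyway */
	return (now - last_gc) >= min_interval;
}
```

With min_interval larger than the cron period, a nightly cron job
keeps refreshing the timestamp and the check is skipped every time.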

-Steve

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 18:14     ` Steven Grimm
@ 2007-09-05 18:22       ` Nix
  2007-09-05 18:54         ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Nix @ 2007-09-05 18:22 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Linus Torvalds, Git Mailing List

On 5 Sep 2007, Steven Grimm stated:

> Nix wrote:
>> Indeed. I repack all our git trees in the middle of the night, and our
>> incremental backup script drops .keep files corresponding to every
>> existing pack before running the backup.
>>
>> This is probably a good job for cron :)
>
> If you are setting up cron jobs to repack multiple git trees, you are
> not the kind of novice or casual git user who this proposal would
> primarily be aimed at.

True enough: but the point is that it was only about three lines of code
(a locate and git-gc pipeline). We could just put that in the
documentation...

... which people then won't read. Oh well. Sorry for the mindless
optimism.

> git-gc can leave behind a "last completed" timestamp and we can
> suppress the check for excess loose objects until some minimum amount
> of time has passed since last git-gc. If that amount is greater than
> the interval between your cron jobs, you won't even get any
> (measurable) overhead from the detection to see if the warning is
> needed.

I personally wonder if git-gc shouldn't use a proportional scheme, so
that only some packs get repacked, maybe the smallest ones (and when
they grow to the same size as the next largest one, the two get repacked
into one). This has the singular advantage that you won't have to
carefully drop .keep files everywhere or have to worry about your git-gc
of 50K of loose objects suddenly deciding to repack 100Mb of packfiles
and taking ages.

It's probably not hard to implement, but I don't need it because I keep
everything packed anyway...
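
(One way the proportional scheme could be sketched, purely as an
illustration: packs_to_merge() and its arguments are hypothetical
names, and nothing like this exists in git-gc as discussed here.)

```c
#include <stddef.h>

/*
 * sizes[] holds the existing pack sizes in bytes, sorted ascending;
 * incoming is the total size of the loose objects about to be packed.
 * Returns how many of the smallest packs should be folded into the
 * new pack: a pack is absorbed only once the data accumulated so far
 * has grown to at least its size.  Overflow of the running total is
 * ignored in this sketch.
 */
size_t packs_to_merge(const unsigned long *sizes, size_t n,
		      unsigned long incoming)
{
	unsigned long merged = incoming;
	size_t i;

	for (i = 0; i < n; i++) {
		if (merged < sizes[i])
			break;		/* next pack is bigger: leave it alone */
		merged += sizes[i];
	}
	return i;
}
```

With a single 100MB pack and 50K of loose objects this returns 0, so
the big pack is never touched and no .keep file is needed for it.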

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 17:35   ` Steven Grimm
@ 2007-09-05 18:28     ` Nix
  0 siblings, 0 replies; 97+ messages in thread
From: Nix @ 2007-09-05 18:28 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Govind Salinas, Linus Torvalds, Git Mailing List

On 5 Sep 2007, Steven Grimm stated:

> Govind Salinas wrote:
>> This is one reason why I really think that gc should be *plumbing*
>> and *not* porcelain.
>
> That's a good way to think of it IMO. It's a low-level operation
> (albeit one that encapsulates other, lower-level ones) that tells git
> to rearrange its internal data structures. It is not something that
> has any user-visible effect.

It certainly has a sysadmin-visible effect. Repack a couple of big git
repositories and that's a backup tape gone if you do incremental
backups: and you can't *not* back up the pack files, even though a lot
of the state in them is recoverable from elsewhere on the net: the stuff
which is not recoverable is tangled up with the stuff which is.

(of course the solution here was .keep files. I cheered when they were
introduced and started rolling git out everywhere I could. There's just
one last vast repository maintained by a horrible shell script layered
atop SCCS which I have to find some way to convert...)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 17:44 ` J. Bruce Fields
@ 2007-09-05 18:46   ` Brandon Casey
  2007-09-05 19:09     ` David Kastrup
  0 siblings, 1 reply; 97+ messages in thread
From: Brandon Casey @ 2007-09-05 18:46 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linus Torvalds, Git Mailing List

On Wed, 5 Sep 2007, J. Bruce Fields wrote:

> Well, this may just prove I'm an idiot, but one of the reasons I rarely
> run it is that I have trouble remembering exactly what it does; in
> particular,
>
> 	- does it prune anything that might be needed by a repo I
> 	  cloned with -s?

     YES! yikes.

This is about the best argument put forth so far for not automatically
running git-gc. Personally, I think git-gc should not remove unreferenced
objects without --prune (but I haven't done anything about it). But even
if git-gc was modified in this way, an occasional git-gc --prune would
still be necessary to remove all of the unreferenced and dangling objects
safely with a human thinking about the shared repo implications (unless
shared repo handling is modified).

-brandon

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 18:22       ` Nix
@ 2007-09-05 18:54         ` Nicolas Pitre
  2007-09-05 20:01           ` Junio C Hamano
  2018-10-07 18:28           ` What's so special about objects/17/ ? Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 97+ messages in thread
From: Nicolas Pitre @ 2007-09-05 18:54 UTC (permalink / raw)
  To: Nix; +Cc: Steven Grimm, Linus Torvalds, Git Mailing List

On Wed, 5 Sep 2007, Nix wrote:

> I personally wonder if git-gc shouldn't use a proportional scheme, so
> that only some packs get repacked, maybe the smallest ones (and when
> they grow to the same size as the next largest one, the two get repacked
> into one). This has the singular advantage that you won't have to
> carefully drop .keep files everywhere or have to worry about your git-gc
> of 50K of loose objects suddenly deciding to repack 100Mb of packfiles
> and taking ages.

Not only that.  Currently the "Counting objects" phase when running 
git-gc on the Linux repo takes a significant amount of time, even if 
there is little to repack.

If any kind of automatic repack is implemented, it should be an 
incremental repacking only, not the full thing, i.e. git-repack without 
-a, or git-pack-objects with --unpacked.  The idea is to be as 
unintrusive as possible.  Also, object walking should be limited to 
objects linked to a commit object which is itself unpacked in order to 
cut down on the time required to fully enumerate all objects.

This way a semi-packed state will always be preserved and should be good 
enough.  The full repacking should probably be left to manual execution 
of git-gc.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 18:46   ` Brandon Casey
@ 2007-09-05 19:09     ` David Kastrup
  2007-09-05 19:13       ` J. Bruce Fields
  2007-09-05 19:20       ` Mike Hommey
  0 siblings, 2 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05 19:09 UTC (permalink / raw)
  To: Brandon Casey; +Cc: J. Bruce Fields, Linus Torvalds, Git Mailing List

Brandon Casey <casey@nrlssc.navy.mil> writes:

> On Wed, 5 Sep 2007, J. Bruce Fields wrote:
>
>> Well, this may just prove I'm an idiot, but one of the reasons I rarely
>> run it is that I have trouble remembering exactly what it does; in
>> particular,
>>
>> 	- does it prune anything that might be needed by a repo I
>> 	  cloned with -s?
>
>     YES! yikes.
>
> This is about the best argument put forth so far for not
> automatically running git-gc.

Well, it could also mean that if git finds a dead symbolic link when
looking up an object, it should check the corresponding link target
directory for a pack file with the respective object...  and if it
finds such a pack file, create a link to it and use it.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 19:09     ` David Kastrup
@ 2007-09-05 19:13       ` J. Bruce Fields
  2007-09-05 19:43         ` David Kastrup
  2007-09-05 19:20       ` Mike Hommey
  1 sibling, 1 reply; 97+ messages in thread
From: J. Bruce Fields @ 2007-09-05 19:13 UTC (permalink / raw)
  To: David Kastrup; +Cc: Brandon Casey, Linus Torvalds, Git Mailing List

On Wed, Sep 05, 2007 at 09:09:40PM +0200, David Kastrup wrote:
> Brandon Casey <casey@nrlssc.navy.mil> writes:
> 
> > On Wed, 5 Sep 2007, J. Bruce Fields wrote:
> >
> >> Well, this may just prove I'm an idiot, but one of the reasons I rarely
> >> run it is that I have trouble remembering exactly what it does; in
> >> particular,
> >>
> >> 	- does it prune anything that might be needed by a repo I
> >> 	  cloned with -s?
> >
> >     YES! yikes.
> >
> > This is about the best argument put forth so far for not
> > automatically running git-gc.
> 
> Well, it could also mean that if git finds a dead symbolic link when
> looking up an object, it should check the corresponding link target
> directory for a pack file with the respective object...  and if it
> finds such a pack file, create a link to it and use it.

One of the two of us is very confused about what "git-clone -s" does.
See the git-clone man page.  I don't think symlinks are involved.

--b.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 19:09     ` David Kastrup
  2007-09-05 19:13       ` J. Bruce Fields
@ 2007-09-05 19:20       ` Mike Hommey
  1 sibling, 0 replies; 97+ messages in thread
From: Mike Hommey @ 2007-09-05 19:20 UTC (permalink / raw)
  To: David Kastrup
  Cc: Brandon Casey, J. Bruce Fields, Linus Torvalds, Git Mailing List

On Wed, Sep 05, 2007 at 09:09:40PM +0200, David Kastrup <dak@gnu.org> wrote:
> Brandon Casey <casey@nrlssc.navy.mil> writes:
> 
> > On Wed, 5 Sep 2007, J. Bruce Fields wrote:
> >
> >> Well, this may just prove I'm an idiot, but one of the reasons I rarely
> >> run it is that I have trouble remembering exactly what it does; in
> >> particular,
> >>
> >> 	- does it prune anything that might be needed by a repo I
> >> 	  cloned with -s?
> >
> >     YES! yikes.
> >
> > This is about the best argument put forth so far for not
> > automatically running git-gc.
> 
> Well, it could also mean that if git finds a dead symbolic link when
> looking up an object, it should check the corresponding link target
> directory for a pack file with the respective object...  and if it
> finds such a pack file, create a link to it and use it.

The problem here is that the clone could have refs to objects from
the origin that no longer have refs there. git-gc might, at some point,
prune these objects, and the clone would have dangling refs. That could
easily happen, for example, if you rebase a branch in the origin, but
still have a clone with the original branch.

Mike

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 19:13       ` J. Bruce Fields
@ 2007-09-05 19:43         ` David Kastrup
  0 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-05 19:43 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Brandon Casey, Linus Torvalds, Git Mailing List

"J. Bruce Fields" <bfields@fieldses.org> writes:

> On Wed, Sep 05, 2007 at 09:09:40PM +0200, David Kastrup wrote:
>> Brandon Casey <casey@nrlssc.navy.mil> writes:
>> 
>> > On Wed, 5 Sep 2007, J. Bruce Fields wrote:
>> >
>> >> Well, this may just prove I'm an idiot, but one of the reasons I rarely
>> >> run it is that I have trouble remembering exactly what it does; in
>> >> particular,
>> >>
>> >> 	- does it prune anything that might be needed by a repo I
>> >> 	  cloned with -s?
>> >
>> >     YES! yikes.
>> >
>> > This is about the best argument put forth so far for not
>> > automatically running git-gc.
>> 
>> Well, it could also mean that if git finds a dead symbolic link when
>> looking up an object, it should check the corresponding link target
>> directory for a pack file with the respective object...  and if it
>> finds such a pack file, create a link to it and use it.
>
> One of the two of us is very confused about what "git-clone -s" does.
> See the git-clone man page.  I don't think symlinks are involved.

Guilty as charged.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 18:54         ` Nicolas Pitre
@ 2007-09-05 20:01           ` Junio C Hamano
  2007-09-05 20:35             ` Nicolas Pitre
                               ` (5 more replies)
  2018-10-07 18:28           ` What's so special about objects/17/ ? Ævar Arnfjörð Bjarmason
  1 sibling, 6 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 20:01 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> Not only that.  Currently the "Counting objects" phase when running 
> git-gc on the Linux repo takes a significant amount of time, even if 
> there is little to repack.
>
> If any kind of automatic repack is implemented, it should be an 
> incremental repacking only, not the full thing, i.e. git-repack without 
> -a, or git-pack-objects with --unpacked.  The idea is to be the least 
> intrusive as possible.  Also, object walking should be limited to 
> objects linked to a commit object which is itself unpacked in order to 
> cut on the time required to fully enumerate all objects.
>
> This way a semi-packed state will always be preserved and should be good 
> enough.  The full repacking should probably be left to manual execution 
> of git-gc.

Ok, how about doing something like this?

-- >8 -- snipsnap -- >8 -- clipcrap -- >8 --
Implement git gc --auto

This implements a new option "git gc --auto".  When gc.auto is
set to a positive value, and the object database has accumulated
roughly that many loose objects, this runs a
lightweight version of "git gc".  The primary difference from
the full "git gc" is that it does not pass "-a" option to "git
repack", which means we do not try to repack _everything_, but
only repack incrementally.  We still do "git prune-packed".  The
default threshold is arbitrarily set by yours truly to:

 - not trigger it for fully unpacked git v0.99 history;

 - do trigger it for fully unpacked git v1.0.0 history;

 - not trigger it for incremental update to git v1.0.0 starting
   from fully packed git v0.99 history.
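
(The estimate behind that threshold can be sketched in isolation.
The two helpers below are illustrative names only; the first mirrors
the filename test in need_to_gc() in the patch further down, the
second is the extrapolation the comment there describes.  A SHA-1
name is 40 hex digits, two of which form the directory name, which
leaves 38 for the file.)

```c
#include <string.h>

/* True if name is exactly 38 lowercase hex digits. */
int is_loose_object_name(const char *name)
{
	return strspn(name, "0123456789abcdef") == 38 && name[38] == '\0';
}

/*
 * Extrapolate from the count seen in one of the 256 fan-out
 * directories to the whole object store, assuming the names are
 * evenly distributed.
 */
int estimate_loose_objects(int seen_in_one_dir)
{
	return seen_in_one_dir * 256;
}
```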

This patch does not add invocation of the "auto repacking".  It
is left to key Porcelain commands that could produce tons of
loose objects to add a call to "git gc --auto" after they are
done with their work.  Obvious candidates are:

	git add
	git fetch
	git merge
	git rebase

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 builtin-gc.c |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 63 insertions(+), 1 deletions(-)

diff --git a/builtin-gc.c b/builtin-gc.c
index 9397482..093b3dd 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
 
 static int pack_refs = 1;
 static int aggressive_window = -1;
+static int gc_auto_threshold = 6700;
 
 #define MAX_ADD 10
 static const char *argv_pack_refs[] = {"pack-refs", "--all", "--prune", NULL};
@@ -28,6 +29,8 @@ static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
 static const char *argv_prune[] = {"prune", NULL};
 static const char *argv_rerere[] = {"rerere", "gc", NULL};
 
+static const char *argv_repack_auto[] = {"repack", "-d", "-l", NULL};
+
 static int gc_config(const char *var, const char *value)
 {
 	if (!strcmp(var, "gc.packrefs")) {
@@ -41,6 +44,10 @@ static int gc_config(const char *var, const char *value)
 		aggressive_window = git_config_int(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "gc.auto")) {
+		gc_auto_threshold = git_config_int(var, value);
+		return 0;
+	}
 	return git_default_config(var, value);
 }
 
@@ -57,10 +64,49 @@ static void append_option(const char **cmd, const char *opt, int max_length)
 	cmd[i] = NULL;
 }
 
+static int need_to_gc(void)
+{
+	/*
+	 * Quickly check if a "gc" is needed, by estimating how
+	 * many loose objects there are.  Because SHA-1 is evenly
+	 * distributed, we can check only one and get a reasonable
+	 * estimate.
+	 */
+	char path[PATH_MAX];
+	const char *objdir = get_object_directory();
+	DIR *dir;
+	struct dirent *ent;
+	int auto_threshold;
+	int num_loose = 0;
+	int needed = 0;
+
+	if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) {
+		warning("insanely long object directory %.*s", 50, objdir);
+		return 0;
+	}
+	dir = opendir(path);
+	if (!dir)
+		return 0;
+
+	auto_threshold = (gc_auto_threshold + 255) / 256;
+	while ((ent = readdir(dir)) != NULL) {
+		if (strspn(ent->d_name, "0123456789abcdef") != 38 ||
+		    ent->d_name[38] != '\0')
+			continue;
+		if (++num_loose > auto_threshold) {
+			needed = 1;
+			break;
+		}
+	}
+	closedir(dir);
+	return needed;
+}
+
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int i;
 	int prune = 0;
+	int auto_gc = 0;
 	char buf[80];
 
 	git_config(gc_config);
@@ -82,12 +128,28 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			}
 			continue;
 		}
-		/* perhaps other parameters later... */
+		if (!strcmp(arg, "--auto")) {
+			if (gc_auto_threshold <= 0)
+				return 0;
+			auto_gc = 1;
+			continue;
+		}
 		break;
 	}
 	if (i != argc)
 		usage(builtin_gc_usage);
 
+	if (auto_gc) {
+		/*
+		 * Auto-gc should be least intrusive as possible.
+		 */
+		prune = 0;
+		for (i = 0; i < ARRAY_SIZE(argv_repack_auto); i++)
+			argv_repack[i] = argv_repack_auto[i];
+		if (!need_to_gc())
+			return 0;
+	}
+
 	if (pack_refs && run_command_v_opt(argv_pack_refs, RUN_GIT_CMD))
 		return error(FAILED_RUN, argv_pack_refs[0]);
 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:01           ` Junio C Hamano
@ 2007-09-05 20:35             ` Nicolas Pitre
  2007-09-05 21:14               ` Nix
                                 ` (2 more replies)
  2007-09-05 20:37             ` [PATCH] Invoke "git gc --auto" from "git add" and "git fetch" Junio C Hamano
                               ` (4 subsequent siblings)
  5 siblings, 3 replies; 97+ messages in thread
From: Nicolas Pitre @ 2007-09-05 20:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

On Wed, 5 Sep 2007, Junio C Hamano wrote:

> Implement git gc --auto
> 
> This implements a new option "git gc --auto".  When gc.auto is
> set to a positive value, and the object database has accumulated
> roughly that many number of loose objects, this runs a
> lightweight version of "git gc".  The primary difference from
> the full "git gc" is that it does not pass "-a" option to "git
> repack", which means we do not try to repack _everything_, but
> only repack incrementally.  We still do "git prune-packed".  

A big part of the repack cost is the counting of objects. I don't know 
if --unpacked to git-pack-objects skips walking trees of a packed commit 
object.  If not, then it probably should, to gain a significant speed-up, 
or maybe a separate option should be created to actually imply this 
loosened semantic.

> This patch does not add invocation of the "auto repacking".  It
> is left to key Porcelain commands that could produce tons of
> loose objects to add a call to "git gc --auto" after they are
> done their work.  Obvious candidates are:
> 
> 	git add

Nope!  'git add' creates loose objects which are not yet reachable from 
anywhere.  They won't get repacked until a commit is made.

> 	git fetch

I think it would be a much better idea to simply decrease the 
fetch.unpackLimit default value.

>         git merge
>         git rebase        

and git commit.  Which boils it down to commit-creating operations.

Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* [PATCH] Invoke "git gc --auto" from "git add" and "git fetch"
  2007-09-05 20:01           ` Junio C Hamano
  2007-09-05 20:35             ` Nicolas Pitre
@ 2007-09-05 20:37             ` Junio C Hamano
       [not found]               ` <69b0c0350709051357ifa547aarfe3e0b36cf9be98f@mail.gmail.com>
  2007-09-06 12:02               ` Johannes Schindelin
  2007-09-05 21:18             ` People unaware of the importance of "git gc"? Alex Riesen
                               ` (3 subsequent siblings)
  5 siblings, 2 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 20:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

This makes the two commands call "git gc --auto" when they
are done.

I earlier said that obvious candidates also include merge and
rebase, but these are a lot less frequent operations compared to
add, and more importantly, in a normal workflow they would
almost always happen after "git fetch" is done.

In other words, if you are a downstream developer, the automatic
invocation in "git fetch" will take care of things for you, and
otherwise if you do not have an upstream, you would be doing
your own development, so "git add" to add your changes will take
care of the auto invocation for you.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 * This is obviously a follow-up to the previous one that allows
   you to say "git gc --auto".  I somewhat feel dirty about
   calling cmd_gc() bypassing fork & exec from "git add",
   though...

 builtin-add.c |    2 ++
 git-fetch.sh  |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/builtin-add.c b/builtin-add.c
index 105a9f0..8431c16 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -263,9 +263,11 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 
  finish:
 	if (active_cache_changed) {
+		const char *args[] = { "gc", "--auto", NULL };
 		if (write_cache(newfd, active_cache, active_nr) ||
 		    close(newfd) || commit_locked_index(&lock_file))
 			die("Unable to write new index file");
+		cmd_gc(2, args, NULL);
 	}
 
 	return 0;
diff --git a/git-fetch.sh b/git-fetch.sh
index c3a2001..86050eb 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -375,3 +375,4 @@ case "$orig_head" in
 	fi
 	;;
 esac
+git gc --auto
-- 
1.5.3.1.840.g0fedbc

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Fwd: [PATCH] Invoke "git gc --auto" from "git add" and "git fetch"
       [not found]               ` <69b0c0350709051357ifa547aarfe3e0b36cf9be98f@mail.gmail.com>
@ 2007-09-05 20:59                 ` Govind Salinas
  0 siblings, 0 replies; 97+ messages in thread
From: Govind Salinas @ 2007-09-05 20:59 UTC (permalink / raw)
  To: Git Mailing List

Forgot to cc the list.

---------- Forwarded message ----------
From: Govind Salinas <govindsalinas@gmail.com>
Date: Sep 5, 2007 3:57 PM
Subject: Re: [PATCH] Invoke "git gc --auto" from "git add" and "git fetch"
To: Junio C Hamano <gitster@pobox.com>


I have a completely uninformed question...

Can git-add/rm/etc create dangling object or objects that would
be cleaned up by git-gc --auto?  I would think (and I could be
completely off base here) that you would only want to call gc
after an operation that could create stuff that needs to be gc'ed,
since only then could the threshold be reached.

Anyways, just curious.  One day I should actually go in and read
some git code.

-Govind

On 9/5/07, Junio C Hamano <gitster@pobox.com> wrote:
> This makes the two commands to call "git gc --auto" when they
> are done.
>
> I earlier said that obvious candidates also include merge and
> rebase, but these are lot less frequent operations compared to
> add, and more importantly, in a normal workflow they would
> almost always happen after "git fetch" is done.
>
> In other words, if you are downstream developer, the automatic
> invocation in "git fetch" will take care of things for you, and
> otherwise if you do not have an upstream, you would be doing
> your own development, so "git add" to add your changes will take
> care of the auto invocation for you.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  * This is obviously a follow-up to the previous one that allows
>    you to say "git gc --auto".  I somewhat feel dirty about
>    calling cmd_gc() bypassing fork & exec from "git add",
>    though...
>
>  builtin-add.c |    2 ++
>  git-fetch.sh  |    1 +
>  2 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/builtin-add.c b/builtin-add.c
> index 105a9f0..8431c16 100644
> --- a/builtin-add.c
> +++ b/builtin-add.c
> @@ -263,9 +263,11 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>
>   finish:
>         if (active_cache_changed) {
> +               const char *args[] = { "gc", "--auto", NULL };
>                 if (write_cache(newfd, active_cache, active_nr) ||
>                     close(newfd) || commit_locked_index(&lock_file))
>                         die("Unable to write new index file");
> +               cmd_gc(2, args, NULL);
>         }
>
>         return 0;
> diff --git a/git-fetch.sh b/git-fetch.sh
> index c3a2001..86050eb 100755
> --- a/git-fetch.sh
> +++ b/git-fetch.sh
> @@ -375,3 +375,4 @@ case "$orig_head" in
>         fi
>         ;;
>  esac
> +git gc --auto
> --
> 1.5.3.1.840.g0fedbc
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
                   ` (5 preceding siblings ...)
  2007-09-05 17:44 ` J. Bruce Fields
@ 2007-09-05 21:07 ` Alex Riesen
  6 siblings, 0 replies; 97+ messages in thread
From: Alex Riesen @ 2007-09-05 21:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds, Wed, Sep 05, 2007 09:09:27 +0200:
> I personally repack everything way more often than is necessary, and I had 
> kind of assumed that people did it that way, but I was apparently wrong. 
> Comments?

I do it from time to time. Seldom in working repositories, because
they usually come and go before they have a chance to accumulate
enough loose objects. I do a partial repack (git repack -d) after
every import from p4 repo, because every snapshot of it is an ugly
mess changing files all over the tree. Sometimes, after I merged a big
chunk with the p4 repo and sent it over (the process involves rebase).

It is usually a conscious decision when to do a repack or gc. The repack
time is seldom a problem: it is fast enough even on windows (and I do
have big repos and binary objects). The gc causes my machines to swap,
though. Some of them heavily, so there my repos stay longer partially
packed. I do use .keep packs for this reason (and because windows or
cygwin or both have more problems with big files than they have with
small).

I used to clone repos with "-s", but quickly stopped after a few
broken histories.  This also taught me to think before running
"git gc" or "git repack -a -d".

On rare occasions I even use "git repack -a -d -l" and "git
pack-refs" separately.

This was all specific to my day-job. At home, on linux systems I just
run git-gc whenever I please, without even thinking why. It finishes
mostly in less than a minute (the kernel: ~40-50 sec on my P4 2.6GHz, 1Gb).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:35             ` Nicolas Pitre
@ 2007-09-05 21:14               ` Nix
  2007-09-05 21:46               ` Junio C Hamano
  2007-09-05 21:49               ` Junio C Hamano
  2 siblings, 0 replies; 97+ messages in thread
From: Nix @ 2007-09-05 21:14 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Steven Grimm, Linus Torvalds, Git Mailing List

On 5 Sep 2007, Nicolas Pitre stated:

> On Wed, 5 Sep 2007, Junio C Hamano wrote:
>> 	git fetch
>
> I think that would be a much better idea to simply decrease the 
> fetch.unpackLimit default value.

I think `git fetch' works reasonably well as is: unless you're fetching
every five minutes you often find you get packs anyway. There's no point
packing incrementally *too* often, or you replace a lots-of-objects
problem with a lots-of-packs problem, after which you're worse off than
when you started.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:01           ` Junio C Hamano
  2007-09-05 20:35             ` Nicolas Pitre
  2007-09-05 20:37             ` [PATCH] Invoke "git gc --auto" from "git add" and "git fetch" Junio C Hamano
@ 2007-09-05 21:18             ` Alex Riesen
  2007-09-06  2:44             ` Russ Dill
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 97+ messages in thread
From: Alex Riesen @ 2007-09-05 21:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Junio C Hamano, Wed, Sep 05, 2007 22:01:37 +0200:
> +	/*
> +	 * Quickly check if a "gc" is needed, by estimating how
> +	 * many loose objects there are.  Because SHA-1 is evenly
> +	 * distributed, we can check only one and get a reasonable
> +	 * estimate.
> +	 */

:))

> +	if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) {
> +		warning("insanely long object directory %.*s", 50, objdir);

or a non-POSIX snprintf returning "negative value" (Microsoft)
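
(A sketch of a check that keeps the two failure cases apart.  In the
comparison as posted, the int return value is converted to an unsigned
type on typical platforms, so a negative return would land in the same
branch as an overlong path and be reported as one.  The helper name
build_sample_dir() is made up for this example.)

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format "<objdir>/17" into path.  Returns 0 on success and -1 if
 * snprintf reported an error or the result did not fit in size bytes.
 */
int build_sample_dir(char *path, size_t size, const char *objdir)
{
	int n = snprintf(path, size, "%s/17", objdir);

	if (n < 0)
		return -1;	/* output error, or a non-conforming snprintf */
	if ((size_t)n >= size)
		return -1;	/* result would have been truncated */
	return 0;
}
```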

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:35             ` Nicolas Pitre
  2007-09-05 21:14               ` Nix
@ 2007-09-05 21:46               ` Junio C Hamano
  2007-09-05 23:04                 ` Nicolas Pitre
  2007-09-06  5:55                 ` David Kastrup
  2007-09-05 21:49               ` Junio C Hamano
  2 siblings, 2 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 21:46 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

>> This patch does not add invocation of the "auto repacking".  It
>> is left to key Porcelain commands that could produce tons of
>> loose objects to add a call to "git gc --auto" after they are
>> done their work.  Obvious candidates are:
>> 
>> 	git add
>
> Nope!  'git add' creates loose objects which are not yet reachable from 
> anywhere.  They won't get repacked until a commit is made.

Bzzt, I am relieved to see you are sometimes wrong ;-)

They are reachable from the index and are not subject to
pruning.

>> 	git fetch
>
> I think that would be a much better idea to simply decrease the 
> fetch.unpackLimit default value.

One thing that I find lacking in that auto patch is actually
that we should sometimes consolidate multiple small packs into a
single larger one.  Any behaviour change to encourage creation
of many tiny packs should be avoided until it materializes.

Probably we should introduce a built-in minimum value for a
positive gc.auto, somewhere around 1000 or so, for this reason.

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:35             ` Nicolas Pitre
  2007-09-05 21:14               ` Nix
  2007-09-05 21:46               ` Junio C Hamano
@ 2007-09-05 21:49               ` Junio C Hamano
  2007-09-05 21:59                 ` Invoke "git gc --auto" from commit, merge, am and rebase Junio C Hamano
  2 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 21:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> and git commit.  Which boils it down to commit-creating operations.

Good point.  I think that makes sense.

* Invoke "git gc --auto" from commit, merge, am and rebase.
  2007-09-05 21:49               ` Junio C Hamano
@ 2007-09-05 21:59                 ` Junio C Hamano
  2007-09-06  2:39                   ` Shawn O. Pearce
  0 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 21:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

The point of auto gc is to pack new objects created in loose
format, so a good rule of thumb is where we do update-ref after
creating a new commit.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
  Let's chuck the previous "git add/git fetch" one, and replace it
  with this.

  Also I realize I misread your earlier comment about "git add".
  You are still among the only few people on the list that I
  consider are always more right than I am ;-).

 git-am.sh                  |    2 ++
 git-commit.sh              |    1 +
 git-merge.sh               |    1 +
 git-rebase--interactive.sh |    2 ++
 4 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 6809aa0..4db4701 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -466,6 +466,8 @@ do
 		"$GIT_DIR"/hooks/post-applypatch
 	fi
 
+	git gc --auto
+
 	go_next
 done
 
diff --git a/git-commit.sh b/git-commit.sh
index 1d04f1f..d22d35e 100755
--- a/git-commit.sh
+++ b/git-commit.sh
@@ -652,6 +652,7 @@ git rerere
 
 if test "$ret" = 0
 then
+	git gc --auto
 	if test -x "$GIT_DIR"/hooks/post-commit
 	then
 		"$GIT_DIR"/hooks/post-commit
diff --git a/git-merge.sh b/git-merge.sh
index 3a01db0..697bec2 100755
--- a/git-merge.sh
+++ b/git-merge.sh
@@ -82,6 +82,7 @@ finish () {
 			;;
 		*)
 			git update-ref -m "$rlogm" HEAD "$1" "$head" || exit 1
+			git gc --auto
 			;;
 		esac
 		;;
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index abc2b1c..8258b7a 100755
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -307,6 +307,8 @@ do_next () {
 	rm -rf "$DOTEST" &&
 	warn "Successfully rebased and updated $HEADNAME."
 
+	git gc --auto
+
 	exit
 }
 

* Re: People unaware of the importance of "git gc"?
  2007-09-05 21:46               ` Junio C Hamano
@ 2007-09-05 23:04                 ` Nicolas Pitre
  2007-09-05 23:42                   ` Junio C Hamano
  2007-09-06  5:55                 ` David Kastrup
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2007-09-05 23:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

On Wed, 5 Sep 2007, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> >> This patch does not add invocation of the "auto repacking".  It
> >> is left to key Porcelain commands that could produce tons of
> >> loose objects to add a call to "git gc --auto" after they are
> >> done their work.  Obvious candidates are:
> >> 
> >> 	git add
> >
> > Nope!  'git add' creates loose objects which are not yet reachable from 
> > anywhere.  They won't get repacked until a commit is made.
> 
> Bzzt, I am relieved to see you are sometimes wrong ;-)
> 
> They are reachable from the index and are not subject to
> pruning.

The index?  What's that?  ;-)

> >> 	git fetch
> >
> > I think it would be a much better idea to simply decrease the 
> > fetch.unpackLimit default value.
> 
> One thing that I find lacking in that auto patch is actually
> that we should sometimes consolidate multiple small packs into a
> single larger one.  Any behaviour change to encourage creation
> of many tiny packs should be avoided until it materializes.
> 
> Probably we should introduce a built-in minimum value for a
> positive gc.auto, somewhere around 1000 or so, for this reason.

Why not just let the default value take care of it?  If someone really 
wants to set gc.auto to 50, why prevent it?

The more I think of it, the less I like automatic repack.  There is 
always a bad case for it somewhere.


Nicolas

* Re: People unaware of the importance of "git gc"?
  2007-09-05 23:04                 ` Nicolas Pitre
@ 2007-09-05 23:42                   ` Junio C Hamano
  2007-09-06  0:27                     ` Carlos Rica
  0 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2007-09-05 23:42 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Nix, Steven Grimm, Linus Torvalds, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 5 Sep 2007, Junio C Hamano wrote:
>
>> Nicolas Pitre <nico@cam.org> writes:
>> 
> The index?  What's that?  ;-)

Sorry, my mistake.  You are always more right than I am [tm] ;-)

> The more I think of it, the less I like automatic repack.  There is 
> always a bad case for it somewhere.

I tend to agree, but at the same time, I think the long term
goal should be not to have bad cases.

Old timers like ourselves learned to run "repack -a -d" when not
doing real work (i.e. beginning of the day while fetching
coffee, before leaving for lunch break, end of the day before
leaving) and we have been _trained_ not to feel that is a chore,
but I think that is wrong.  "Sync freezes I/O and causes my
real-time databasy job undue latency --- I would want to disable
swapper/bdflush/whatever machine-wide and prefer typing 'sync'
from the command line when it is convenient for me" is fine for
an experienced user working on a single user machine, but it
still feels wrong (we do not have "multi-user" issues in git
repository, so this analogy is not quite right, though).

* Re: People unaware of the importance of "git gc"?
  2007-09-05 17:31             ` Matthieu Moy
@ 2007-09-05 23:56               ` Jeff King
  0 siblings, 0 replies; 97+ messages in thread
From: Jeff King @ 2007-09-05 23:56 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Git Mailing List

On Wed, Sep 05, 2007 at 07:31:44PM +0200, Matthieu Moy wrote:

> I have ~/teaching/some-course/.git (well, almost) and ~/etc/.git which
> are two unrelated projects, and to "git gc" both of them, I need
> either a script, or two manual invocations.
>
> (yes, I'm really talking about something trivial)

I tend to have a lot of small projects, so I have on the order of 80 git
repositories on each machine I use, most of which have a 'mothership'
origin on a central, backed-up machine.

When I sit down to work, I want to see which repositories
have changes that need to be pulled. And when I get up to leave, I want
to see which repositories have changes that need to be pushed. Not to
mention files that need committing, loose objects that need packing, etc.

So I wrote the 'git-stale' script, included below. It's not especially
user-friendly, but you might find it useful, as it solves the exact
problem you are talking about (and much more).

It reads 'repository specifications' from ~/.gitstale, one per line,
which are either of the form:

  /path/to/repo

which specifies a repo to check, or:

  r:/path/to/many/repos

which specifies a hierarchy in which to recursively find repos.

My .gitstale looks something like this:

  /home/peff/compile/git
  /home/peff/compile/tig
  r:/home/peff/work

and I get output something like this (edited for brevity):

Checking (1/77) /home/peff/compile/git...
Checking (2/77) /home/peff/compile/tig...
[...]
Checking (77/77) /home/peff/work/foo...
MERGE:next /home/peff/compile/git
COMMIT: /home/peff/work/foo
PACK: /home/peff/work/foo
PUSH:master /home/peff/work/bar

which translates to:
  - the git repo has commits in 'origin/next' that are not in 'next'
    (and you might want to merge them in)
  - there are uncommitted files in 'foo'
  - 'foo' needs packing
  - in the 'bar' repo there are commits in master that are not in origin
    (and you might want to push)

Hopefully it will be useful to you, though I think it is probably too
specific to my workflow to be part of git.

-Peff

-- >8 --
#!/usr/bin/perl

use strict;
use Getopt::Long;

my $CONFIG_FILE = "$ENV{HOME}/.gitstale";

my $nofetch = $ENV{GITSTALE_NOFETCH};
Getopt::Long::Configure(qw(bundling));
GetOptions('nofetch|n!' => \$nofetch) or exit 100;

my @projects = process_spec(@ARGV ? @ARGV : cat($CONFIG_FILE));

my $n = 1;
my $total = @projects;
my %errors;
foreach my $p (@projects) {
  print "Checking ($n/$total) $p...\n";
  $errors{$p} = [check_git($p)];
  $n++;
}

my $errcount = 0;
foreach my $p (@projects) {
  foreach my $e (@{$errors{$p}}) {
    print "$e: $p\n";
    # count every reported item so the exit status reflects it
    $errcount++;
  }
}

exit $errcount ? 1 : 0;

sub cat {
  my $fn = shift;
  open(my $fh, '<', $fn)
    or die "unable to open $fn: $!\n";
  return map { chomp; length($_) ? $_ : () } <$fh>;
}

sub process_spec {
  my @dirs;
  my @roots;
  my @exclude;

  foreach (@_) {
    if(/^r:(.*)/) { push @roots, $1 }
    elsif(/^d:(.*)/) { push @dirs, $1 }
    elsif(/^-(.*)/) { push @exclude, qr#(^|/)$1($|/)# }
    else { push @dirs, $_ }
  }

  use File::Find;
  find({
      no_chdir => 1,
      preprocess => sub { sort @_ },
      wanted => sub {
        return unless -d $_ && $_ =~ m#/\.git$#;
        foreach my $e (@exclude) { return if $_ =~ $e }
        my $d = $_;
        $d =~ s#/\.git$##;
        push @dirs, $d;
      }
    }, @roots) if @roots;
  return @dirs;
}

sub count_zero {
  open(my $fh, '-|', @_) or die "unable to fork: $!\n";
  my $line = <$fh>;
  return length($line) == 0;
}

sub check_git {
  my $d = shift;

  chdir($d) or return 'CHDIR';

  my @r;
  count_zero(qw(
        git-ls-files -m -o -d --exclude-per-directory=.gitignore
        --directory --no-empty-directory
  )) or push @r, 'COMMIT';

  if(has_origin()) {
    push @r, 'FETCH' if !$nofetch && system('git-fetch');

    foreach my $p (branch_pairs()) {
      count_zero('git-rev-list', "$p->[0]..$p->[1]")
        or push @r, "MERGE:$p->[0]";
      count_zero('git-rev-list', "$p->[1]..$p->[0]")
        or push @r, "PUSH:$p->[0]";
    }
  }
  else {
    push @r, 'ORIGIN';
  }

  push @r, 'PACK' if unpacked_objects() > 1000;

  return @r;
}

sub unpacked_objects {
  my $objects = `git-count-objects`;
  $objects =~ /^(\d+)/;
  return $1;
}

sub branch_pairs {
  my %config;
  foreach my $line (`git-repo-config --get-regexp 'branch..*..*'`) {
    $line =~ m#^branch\.([^.]+)\.([^ ]+) (?:refs/heads/)?(.*)#
      or die "confusing git-repo-config output: $line\n";
    $config{$1}{$2} = $3;
  }

  return [qw(master origin)] if -e '.git/refs/heads/origin';

  return
    (-e '.git/refs/heads/origin' ? [qw(master origin)] : ()),
    map {
      $config{$_}{remote} && $config{$_}{merge} ?
        [$_, $config{$_}{remote} . '/' . $config{$_}{merge}] :
        ()
    } sort keys(%config);
}

sub has_origin {
  return
    -e '.git/branches/origin' ||
    -e '.git/remotes/origin' ||
    !count_zero(qw(git-repo-config --get remote.origin.url));
}
__END__

* Re: People unaware of the importance of "git gc"?
  2007-09-05 23:42                   ` Junio C Hamano
@ 2007-09-06  0:27                     ` Carlos Rica
  0 siblings, 0 replies; 97+ messages in thread
From: Carlos Rica @ 2007-09-06  0:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

2007/9/6, Junio C Hamano <gitster@pobox.com>:
> Nicolas Pitre <nico@cam.org> writes:
> > The more I think of it, the less I like automatic repack.  There is
> > always a bad case for it somewhere.
>
> I tend to agree, but at the same time, I think the long term
> goal should be not to have bad cases.

The best solution is to make "git gc" unnecessary.
In the long term, and without loss of efficiency.

* Re: Invoke "git gc --auto" from commit, merge, am and rebase.
  2007-09-05 21:59                 ` Invoke "git gc --auto" from commit, merge, am and rebase Junio C Hamano
@ 2007-09-06  2:39                   ` Shawn O. Pearce
  0 siblings, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Junio C Hamano <gitster@pobox.com> wrote:
> The point of auto gc is to pack new objects created in loose
> format, so a good rule of thumb is where we do update-ref after
> creating a new commit.
...
>  git-am.sh                  |    2 ++
>  git-commit.sh              |    1 +
>  git-merge.sh               |    1 +
>  git-rebase--interactive.sh |    2 ++
>  4 files changed, 6 insertions(+), 0 deletions(-)
...
> diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
> index abc2b1c..8258b7a 100755
> --- a/git-rebase--interactive.sh
> +++ b/git-rebase--interactive.sh
> @@ -307,6 +307,8 @@ do_next () {
>  	rm -rf "$DOTEST" &&
>  	warn "Successfully rebased and updated $HEADNAME."
>  
> +	git gc --auto
> +
>  	exit
>  }

Why bother with git-rebase--interactive.sh?  It calls two tools,
git-cherry-pick (which calls git-commit) and git-commit to do its
per-commit dirty work.  So on every step of `git rebase -i` we are
now running `git gc --auto`.  No need to also run it at the end.

Note this is also true of `git rebase -m` as that uses the wonderful
feature of `git commit -C $oldid` per commit to make the new commit.
 
-- 
Shawn.

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:01           ` Junio C Hamano
                               ` (2 preceding siblings ...)
  2007-09-05 21:18             ` People unaware of the importance of "git gc"? Alex Riesen
@ 2007-09-06  2:44             ` Russ Dill
  2007-09-06  2:52               ` Shawn O. Pearce
  2007-09-06  9:28               ` Andreas Ericsson
  2007-09-06  2:45             ` Shawn O. Pearce
  2007-09-06 15:54             ` Johannes Schindelin
  5 siblings, 2 replies; 97+ messages in thread
From: Russ Dill @ 2007-09-06  2:44 UTC (permalink / raw)
  To: git


> Ok, how about doing something like this?
> 

git add? merge? rebase? No, I have a sneakier place to invoke gc.

Whenever $EDITOR gets invoked. Heck, whenever git is waiting for any user input,
do some gc in the background, it'd just have to be incremental so that we could
pick up where we left off.

Similarly, you could mix it in with git pull/push so that while we are waiting
on the network, we can do some packing.

Course, this wouldn't work for all repositories.

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:01           ` Junio C Hamano
                               ` (3 preceding siblings ...)
  2007-09-06  2:44             ` Russ Dill
@ 2007-09-06  2:45             ` Shawn O. Pearce
  2007-09-06  2:49               ` Steven Grimm
  2007-09-06 15:54             ` Johannes Schindelin
  5 siblings, 1 reply; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Junio C Hamano <gitster@pobox.com> wrote:
> Implement git gc --auto
... 

Danger...  If the user sets `gc.auto` to a low enough value and
they are also unlucky enough to have a few truly unreachable (thus
prunable) objects in .git/objects/17/ then this is going to run
a bunch of gc work on every commit they make.

I'm actually running into this problem in git-gui.  On Windows
it suggests a repack if there is one object in .git/objects/42/.
Some users have been unlucky enough to stage a file, have it
hash into that directory, then restage a different version of it.
The prior one is never considered reachable (it was never committed),
but will now *always* cause git-gui to suggest a repack on every
startup.  For all time.

Yea, I need to fix that.

But this suffers from the same fate if the user sets gc.auto too
small and doesn't realize that the reason Git is always repacking
is because over the last 6 months they have been unlucky enough to
stage the magic number of unreachable blobs into the 17 directory
and they have *never* run `git gc --prune` because the auto thing
is working just fine for them and they don't realize they need to
prune every once in a blue moon.

-- 
Shawn.

* Re: People unaware of the importance of "git gc"?
  2007-09-06  2:45             ` Shawn O. Pearce
@ 2007-09-06  2:49               ` Steven Grimm
  2007-09-06  2:56                 ` Shawn O. Pearce
  0 siblings, 1 reply; 97+ messages in thread
From: Steven Grimm @ 2007-09-06  2:49 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Junio C Hamano, Nicolas Pitre, Nix, Linus Torvalds,
	Git Mailing List

Shawn O. Pearce wrote:
> But this suffers from the same fate if the user sets gc.auto too
> small and doesn't realize that the reason Git is always repacking
> is because over the last 6 months they have been unlucky enough to
> stage the magic number of unreachable blobs into the 17 directory
> and they have *never* run `git gc --prune` because the auto thing
> is working just fine for them and they don't realize they need to
> prune every once in a blue moon.
>   

Check the modification times on those files and don't count ones that 
are older than the last git-gc run, maybe? That'd take care of the problem.
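
A rough sketch of that check, assuming a hypothetical stamp file that
is touched at the end of each gc run (git keeps no such file, and the
function name count_new_loose is made up here):

```shell
#!/bin/sh
# Count loose objects in objects/17 that are newer than a stamp file.
# If the stamp file does not exist yet, every object counts.
# Both the stamp file and the function name are hypothetical.
count_new_loose () {
    objdir=$1 stamp=$2
    if test -f "$stamp"
    then
        find "$objdir/17" -type f -newer "$stamp" 2>/dev/null
    else
        find "$objdir/17" -type f 2>/dev/null
    fi | wc -l | tr -d ' '
}
```

Note that find has to stat every file in the directory to compare
modification times, so this is more work than a plain readdir().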

-Steve

* Re: People unaware of the importance of "git gc"?
  2007-09-06  2:44             ` Russ Dill
@ 2007-09-06  2:52               ` Shawn O. Pearce
  2007-09-06  9:28               ` Andreas Ericsson
  1 sibling, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:52 UTC (permalink / raw)
  To: Russ Dill; +Cc: git, Junio C Hamano, Nicolas Pitre

Russ Dill <Russ.Dill@gmail.com> wrote:
> > Ok, how about doing something like this?
> 
> git add? merge? rebase? No, I have a sneakier place to invoke gc.
> 
> Whenever $EDITOR gets invoked. Heck, whenever git is waiting for any user input,
> do some gc in the background, it'd just have to be incremental so that we could
> pick up where we left off.

Heh.  That is a really good idea.  I've been thinking about doing
some automatic generational style GC type repacking controls in
git-gui, and doing them when git-gui is sitting idle and has not
been used in the past couple of minutes.

This is along the same vein of thought.  I like it.  Often it
takes me a while to come up with a good commit message even if
I am using command line commit.

But git-rebase/git-am can cause a huge number of objects to be
created, especially if you are pushing a large stack of patches
around.  So it may still be a good idea to trigger `gc --auto`
at the end of those operations.

> Similarly, you could mix it in with git pull/push so that while we are waiting
> on the network, we can do some packing.

Here's a better thought:

If we are pushing somewhere, and the push size is "large-ish" and
we aren't pushing a thin pack (it's currently considered not nice
to the remote end so it doesn't happen by default) and the objects
we are packing are mostly all loose maybe we should also save a
copy of that packfile locally, then prune *only* those loose objects
back.

Not every git user pushes their work.  But many do.  And those
that push usually will do so in bursts, are already expecting to
wait for the network latency, and usually are pushing the majority
of the things that are loose.  Such users will probably never see
the `gc --auto` trip in places like commit/am/merge as they would
already be clearing their ODB with the push.

-- 
Shawn.

* Re: People unaware of the importance of "git gc"?
  2007-09-06  2:49               ` Steven Grimm
@ 2007-09-06  2:56                 ` Shawn O. Pearce
  0 siblings, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:56 UTC (permalink / raw)
  To: Steven Grimm
  Cc: Junio C Hamano, Nicolas Pitre, Nix, Linus Torvalds,
	Git Mailing List

Steven Grimm <koreth@midwinter.com> wrote:
> Shawn O. Pearce wrote:
> >But this suffers from the same fate if the user sets gc.auto too
> >small and doesn't realize that the reason Git is always repacking
> >is because over the last 6 months they have been unlucky enough to
> >stage the magic number of unreachable blobs into the 17 directory
> >and they have *never* run `git gc --prune` because the auto thing
> >is working just fine for them and they don't realize they need to
> >prune every once in a blue moon.
> 
> Check the modification times on those files and don't count ones that 
> are older than the last git-gc run, maybe? That'd take care of the problem.

Eh, that could mean a bunch of stat calls that it would be nice
to avoid.  The counter Junio (and git-gui) implements just does
a readdir().  Reasonably cheap.

Maybe just save a ".git/gc_last_auto" with the last object count
of .git/objects/17, after repacking.  If the count is over the
gc.auto limit *and* is still over the limit after subtracting the
".git/gc_last_auto" value then consider that auto is required.

This way the file is only consulted if we are really thinking
about running a repack, and it's only written to if we actually do
the repack.  So we only take the extra penalty if we are going to
be taking a *really* big extra penalty by repacking.
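
A minimal sketch of that bookkeeping, assuming the limit is expressed
as a per-directory count and that the baseline file holds a single
integer.  The function name auto_gc_required is hypothetical.

```shell
#!/bin/sh
# Decide whether auto gc is needed, discounting the objects that were
# still sitting in objects/17 right after the last automatic repack.
#   $1 = current number of files in objects/17
#   $2 = per-directory limit (an assumption of this sketch)
#   $3 = baseline file (".git/gc_last_auto" in the proposal above)
auto_gc_required () {
    count=$1 limit=$2 baseline_file=$3
    # Cheap check first: the file is only read when over the limit.
    test "$count" -gt "$limit" || return 1
    baseline=0
    test -f "$baseline_file" && baseline=$(cat "$baseline_file")
    test $((count - ${baseline:-0})) -gt "$limit"
}
```

After a repack, the caller would write the remaining count of
objects/17 into the baseline file.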

-- 
Shawn.

* Re: People unaware of the importance of "git gc"?
  2007-09-05 21:46               ` Junio C Hamano
  2007-09-05 23:04                 ` Nicolas Pitre
@ 2007-09-06  5:55                 ` David Kastrup
  1 sibling, 0 replies; 97+ messages in thread
From: David Kastrup @ 2007-09-06  5:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Nicolas Pitre <nico@cam.org> writes:
>
>>> This patch does not add invocation of the "auto repacking".  It
>>> is left to key Porcelain commands that could produce tons of
>>> loose objects to add a call to "git gc --auto" after they are
>>> done their work.  Obvious candidates are:
>>> 
>>> 	git add
>>
>> Nope!  'git add' creates loose objects which are not yet reachable from 
>> anywhere.  They won't get repacked until a commit is made.
>
> Bzzt, I am relieved to see you are sometimes wrong ;-)
>
> They are reachable from the index and are not subject to
> pruning.

Hm.  Isn't it possible to work with several index files at once?  I
seem to remember that even git-add does this itself.  So what is it
that protects objects in such a temporary index from being garbage
collected by a different git process running on the same repository?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: People unaware of the importance of "git gc"?
  2007-09-06  2:44             ` Russ Dill
  2007-09-06  2:52               ` Shawn O. Pearce
@ 2007-09-06  9:28               ` Andreas Ericsson
  1 sibling, 0 replies; 97+ messages in thread
From: Andreas Ericsson @ 2007-09-06  9:28 UTC (permalink / raw)
  To: Russ Dill; +Cc: git

Russ Dill wrote:
>> Ok, how about doing something like this?
>>
> 
> git add? merge? rebase? No, I have a sneakier place to invoke gc.
> 
> Whenever $EDITOR gets invoked. Heck, whenever git is waiting for any user input,
> do some gc in the background, it'd just have to be incremental so that we could
> pick up where we left off.
> 

I like it. Writing a commit-message takes anywhere from 30 seconds to 5 minutes
for me (sometimes having to look up bug IDs, or verify details in the code).
Sneaking in a repack here would be absolutely stellar :)

It's also nice in that it won't affect people who just follow a project's tip to
get the bleeding edge. For them it shouldn't matter much that they have multiple
small packs obtained while fetching, or if it's all bundled together in a big one.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

* Re: [PATCH] Invoke "git gc --auto" from "git add" and "git fetch"
  2007-09-05 20:37             ` [PATCH] Invoke "git gc --auto" from "git add" and "git fetch" Junio C Hamano
       [not found]               ` <69b0c0350709051357ifa547aarfe3e0b36cf9be98f@mail.gmail.com>
@ 2007-09-06 12:02               ` Johannes Schindelin
  1 sibling, 0 replies; 97+ messages in thread
From: Johannes Schindelin @ 2007-09-06 12:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Hi,

On Wed, 5 Sep 2007, Junio C Hamano wrote:

>  * This is obviously a follow-up to the previous one that allows
>    you to say "git gc --auto".  I somewhat feel dirty about
>    calling cmd_gc() bypassing fork & exec from "git add",
>    though...

Since all git-gc seems to do is to fork() and exec() other git programs, 
this should be fine (have not looked at cmd_gc() in a while, though).

Ciao,
Dscho

* Re: People unaware of the importance of "git gc"?
  2007-09-05 20:01           ` Junio C Hamano
                               ` (4 preceding siblings ...)
  2007-09-06  2:45             ` Shawn O. Pearce
@ 2007-09-06 15:54             ` Johannes Schindelin
  2007-09-06 17:49               ` Junio C Hamano
  5 siblings, 1 reply; 97+ messages in thread
From: Johannes Schindelin @ 2007-09-06 15:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Hi,

On Wed, 5 Sep 2007, Junio C Hamano wrote:

> @@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
>  
>  static int pack_refs = 1;
>  static int aggressive_window = -1;
> +static int gc_auto_threshold = 6700;

Please don't do that.

When you share objects with another git directory, git-gc --auto can get 
rid of the objects when some objects go away in the referenced repository.  

So we need to _at least_ check that gc.auto is not set in the repo when "git 
clone --share"ing it (and fail otherwise).

My preferred way would be to set it in "git init" so that existing setups 
are not affected, and put some big red message on top of the next release 
notes that people might want to set gc.auto in their existing setups.

Ciao,
Dscho

* Re: People unaware of the importance of "git gc"?
  2007-09-06 15:54             ` Johannes Schindelin
@ 2007-09-06 17:49               ` Junio C Hamano
  2007-09-06 18:15                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-06 17:49 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Wed, 5 Sep 2007, Junio C Hamano wrote:
>
>> @@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
>>  
>>  static int pack_refs = 1;
>>  static int aggressive_window = -1;
>> +static int gc_auto_threshold = 6700;
>
> Please don't do that.
>
> When you share objects with another git directory, git-gc --auto can get 
> rid of the objects when some objects go away in the referenced repository.  

I thought the whole point of "gc --auto" was to have something
that does not lose/prune any objects, even the ones that do not
seem to be referenced from anywhere.  That is why invocations of
"git gc --auto" do not say --prune as you saw the second patch,
and the repack command "gc --auto" runs is "repack -d -l"
instead of "repack -a -d -l", which means that it does run
git-prune-packed after repacking but not git-prune.

Maybe I am missing something...

* Re: People unaware of the importance of "git gc"?
  2007-09-06 17:49               ` Junio C Hamano
@ 2007-09-06 18:15                 ` Linus Torvalds
  2007-09-06 18:29                   ` Steven Grimm
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
  2007-09-07  4:48                 ` People unaware of the importance of "git gc"? Shawn O. Pearce
  2007-09-07 10:12                 ` Johannes Schindelin
  2 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2007-09-06 18:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Nicolas Pitre, Nix, Steven Grimm,
	Git Mailing List

On Thu, 6 Sep 2007, Junio C Hamano wrote:
> 
> I thought the whole point of "gc --auto" was to have something
> that does not lose/prune any objects, even the ones that do not
> seem to be referenced from anywhere.  That is why invocations of
> "git gc --auto" do not say --prune as you saw the second patch,
> and the repack command "gc --auto" runs is "repack -d -l"
> instead of "repack -a -d -l", which means that it does run
> git-prune-packed after repacking but not git-prune.

I think "repack -d -l" should be ok from a safety perspective, but I'd 
also like to say that always running it incrementally is going to largely 
suck after a time.

IOW, if you get lots of small incremental packs, after a while you really 
*do* need to do "git gc" to get the real pack generated.

In the case I saw, James really had hundreds of pack-files. That makes all 
our object lookups suck. Yes, not having loose objects at all is a big 
deal too, and yes, we try to start from the last pack-file we found (for 
the locality that we hope is there), but it's still pretty bad from a 
cache usage standpoint, and when we create a new object, we'll first 
search (in vain) in all the hundreds of pack-files.

So would "git gc --auto" have helped James? I'm sure it would have. But he 
already had lots of pack-files from doing "git fetch/pull", and while 
doing the "git gc --auto" will likely *delay* the point where you need to 
do a full repack, it doesn't make it go away.

We still need to tell people to do a full git gc at some point, or do it 
for them. And the longer you delay doing it, the more expensive it's going 
to get to do and/or the worse the final packing is going to be (especially 
if it ends up reusing non-optimal packing decisions from the smaller 
packs).

So I think the --auto stuff is still worth it, but it's really just 
pushing the pain somewhat further out.

(In the kernel community, if you fetch my tree daily, you really *are* 
going to have hundreds and hundreds of packfiles just from doing that).

So I'd really like us to also remind people to do a *real* and full "git 
gc", not just the incremental ones.
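
One way to do such a reminder is to count packfiles and print a hint
once there are too many.  A sketch, assuming a limit of 50 packs (an
arbitrary number picked for illustration) and hypothetical function
names; the argument is the path to the .git directory:

```shell
#!/bin/sh
# Count the packfiles in a repository's object store.
count_packs () {
    packdir=$1/objects/pack
    n=0
    for p in "$packdir"/*.pack
    do
        test -f "$p" && n=$((n + 1))
    done
    echo "$n"
}

# Print a reminder when the pack count exceeds a limit.
# The default of 50 is an assumption of this sketch.
remind_full_gc () {
    gitdir=$1 limit=${2:-50}
    n=$(count_packs "$gitdir")
    if test "$n" -gt "$limit"
    then
        echo "$n packfiles in $gitdir; consider running a full \"git gc\""
    fi
}
```

This is only a readdir() of one directory, so it is cheap enough to
run at the end of a fetch or commit.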

		Linus

* Re: People unaware of the importance of "git gc"?
  2007-09-06 18:15                 ` Linus Torvalds
@ 2007-09-06 18:29                   ` Steven Grimm
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
  1 sibling, 0 replies; 97+ messages in thread
From: Steven Grimm @ 2007-09-06 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Johannes Schindelin, Nicolas Pitre, Nix,
	Git Mailing List

Linus Torvalds wrote:
> IOW, if you get lots of small incremental packs, after a while you really
> *do* need to do "git gc" to get the real pack generated.
>   

I wonder if it makes sense to repack just the small incremental packs 
into a large (but still incremental) pack, rather than repacking the 
entire repository. Presumably that would be a lot faster than a full 
"git gc", while still giving you reasonably good packing (at least, if 
the threshold is set to a high enough number of small packs) and keeping
things fast. That could run as a second phase of "git gc --auto" -- it 
should be quick enough to not be too terribly annoying since we're not 
running it in the background.

Yeah, if you use the same repo for a long time, you'll accumulate a ton 
of medium-sized packs this way, but (a) that's much better than the 
situation we have today, and (b) it puts off the performance degradation 
for long enough that it becomes more reasonable to expect people to find 
out about running the full "git gc" in the meantime, or for git to 
further evolve to not need it.

-Steve

* Subject: [PATCH] git-merge-pack
  2007-09-06 18:15                 ` Linus Torvalds
  2007-09-06 18:29                   ` Steven Grimm
@ 2007-09-06 23:12                   ` Junio C Hamano
  2007-09-06 23:35                     ` Linus Torvalds
                                       ` (3 more replies)
  1 sibling, 4 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-06 23:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Nicolas Pitre, Nix, Steven Grimm,
	Git Mailing List

This is a beginning of "git-merge-pack" that combines smaller
packs into one.  Currently it does not actually create a new
pack, but pretends that it is a (dumb) "git-rev-list --objects"
that lists the objects in the affected packs.  You have to pipe
its output to "git-pack-objects".

The command reads names of pack-*.pack files from the standard
input, outputs the objects' names in the order they are stored
in the original packs (i.e. the offset order).  This sorting is
done in order to emulate the traversal order the original
"git-rev-list --objects" that was used to create the existing
pack listed the objects.

While this approach would give the resulting packfile very
similar locality of access as the original, it does not give the
"name" component you would see in "git-rev-list --objects"
output.  This information is used as the clustering cue while
computing delta, and the lack of it means you can get horrible
delta selection.  You do _not_ want to run the downstream
"git-pack-objects" without the optimization/heuristics to reuse
delta.  IOW, do not run it with --no-reuse-delta.

To consolidate all packs that are smaller than a megabyte into
one, you would use it in its current form like this:

    $ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
    $ new=$(echo "$old" | git merge-pack | git pack-objects pack)
    $ for p in $old; do rm -f $p ${p%.pack}.idx; done
    $ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done

Obvious next steps that can be done in parallel by interested
parties would be:

 (1) come up with a way to give "name" aka "clustering cue" (I
     think this is very hard);

 (2) run the above four command sequence internally without
     having to resort to shell wrapper (easy).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

  Linus Torvalds <torvalds@linux-foundation.org> writes:

  > IOW, if you get lots of small incremental packs, after a while you really
  > *do* need to do "git gc" to get the real pack generated.

  'auto' should do a lesser-impact repack than the usual one.
  Especially we do not want to lose objects that do not look like
  they are reachable from this repository, to help people with
  alternate object stores, aka "repo.or.cz style _forked_
  repositories".  However, a full repack with "-a -d" discards
  unreferenced objects that are only in packs.

  We need a middle ground between "pack and prune-pack only loose
  ones" and "full repack".

  Here is one.

 Makefile             |    1 +
 builtin-merge-pack.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 builtin.h            |    1 +
 git.c                |    1 +
 4 files changed, 90 insertions(+), 0 deletions(-)
 create mode 100644 builtin-merge-pack.c

diff --git a/Makefile b/Makefile
index dace211..cdff756 100644
--- a/Makefile
+++ b/Makefile
@@ -343,6 +343,7 @@ BUILTIN_OBJS = \
 	builtin-mailsplit.o \
 	builtin-merge-base.o \
 	builtin-merge-file.o \
+	builtin-merge-pack.o \
 	builtin-mv.o \
 	builtin-name-rev.o \
 	builtin-pack-objects.o \
diff --git a/builtin-merge-pack.c b/builtin-merge-pack.c
new file mode 100644
index 0000000..c98da80
--- /dev/null
+++ b/builtin-merge-pack.c
@@ -0,0 +1,87 @@
+#include "builtin.h"
+#include "cache.h"
+#include "pack.h"
+
+struct in_pack_object {
+	off_t offset;
+	const unsigned char *sha1;
+};
+
+static uint32_t get_packed_object_list(struct packed_git *p, struct in_pack_object *list, uint32_t loc)
+{
+	uint32_t n;
+
+	for (n = 0; n < p->num_objects; n++) {
+		list[loc].sha1 = nth_packed_object_sha1(p, n);
+		list[loc].offset = find_pack_entry_one(list[loc].sha1, p);
+		loc++;
+	}
+	return loc;
+}
+
+static int ofscmp(const void *a_, const void *b_)
+{
+	struct in_pack_object *a = (struct in_pack_object *)a_;
+	struct in_pack_object *b = (struct in_pack_object *)b_;
+	if (a->offset < b->offset)
+		return -1;
+	else if (a->offset > b->offset)
+		return 1;
+	else
+		return hashcmp(a->sha1, b->sha1);
+}
+
+int cmd_merge_pack(int ac, const char **av, const char *prefix)
+{
+	char filename[PATH_MAX];
+	struct packed_git **pack = NULL;
+	int pack_nr = 0;
+	int pack_alloc = 0;
+	uint32_t max_objs, cnt;
+	struct in_pack_object *objs;
+	int i;
+
+	while (fgets(filename, sizeof(filename), stdin) != NULL) {
+		int len = strlen(filename);
+		struct packed_git *p;
+
+		while (0 < len) {
+			if (filename[len-1] != '\n' &&
+			    filename[len-1] != '\r')
+				break;
+			filename[--len] = '\0';
+		}
+		if (len < 5 || strcmp(filename + len - 5, ".pack"))
+			goto error;
+
+		/* add-packed-git wants the name of .idx file */
+		strcpy(filename + len - 5, ".idx");
+		len--;
+		p = add_packed_git(filename, len, 1);
+		if (!p)
+			goto error;
+		if (open_pack_index(p))
+			goto error;
+
+		if (pack_alloc <= pack_nr) {
+			pack_alloc = alloc_nr(pack_nr);
+			pack = xrealloc(pack, pack_alloc * sizeof(*pack));
+		}
+		pack[pack_nr++] = p;
+		continue;
+	error:
+		die("Cannot add a pack .idx file: %s", filename);
+	}
+
+	max_objs = 0;
+	for (i = 0; i < pack_nr; i++)
+		max_objs += pack[i]->num_objects;
+	objs = xmalloc(sizeof(*objs) * max_objs);
+	cnt = 0;
+	for (i = 0; i < pack_nr; i++)
+		cnt = get_packed_object_list(pack[i], objs, cnt);
+	qsort(objs, cnt, sizeof(*objs), ofscmp);
+	for (cnt = 0; cnt < max_objs; cnt++)
+		printf("%s\n", sha1_to_hex(objs[cnt].sha1));
+	return 0;
+}
diff --git a/builtin.h b/builtin.h
index bb72000..aff28ca 100644
--- a/builtin.h
+++ b/builtin.h
@@ -49,6 +49,7 @@ extern int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 extern int cmd_mailsplit(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_base(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
+extern int cmd_merge_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
 extern int cmd_name_rev(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_objects(int argc, const char **argv, const char *prefix);
diff --git a/git.c b/git.c
index fd3d83c..69e86bc 100644
--- a/git.c
+++ b/git.c
@@ -353,6 +353,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "mailsplit", cmd_mailsplit },
 		{ "merge-base", cmd_merge_base, RUN_SETUP },
 		{ "merge-file", cmd_merge_file },
+		{ "merge-pack", cmd_merge_pack },
 		{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
 		{ "name-rev", cmd_name_rev, RUN_SETUP },
 		{ "pack-objects", cmd_pack_objects, RUN_SETUP },
-- 
1.5.3.1.860.g2cce2

* Re: Subject: [PATCH] git-merge-pack
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
@ 2007-09-06 23:35                     ` Linus Torvalds
  2007-09-07  0:51                     ` Nicolas Pitre
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2007-09-06 23:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Nicolas Pitre, Nix, Steven Grimm,
	Git Mailing List



On Thu, 6 Sep 2007, Junio C Hamano wrote:
>
> This is a beginning of "git-merge-pack" that combines smaller
> packs into one.  Currently it does not actually create a new
> pack, but pretends that it is a (dumb) "git-rev-list --objects"
> that lists the objects in the affected packs.  You have to pipe
> its output to "git-pack-objects".

Ok, so I had to double-check that builtin-pack-objects then deals properly 
with duplicate object names (which it does seem to do), so maybe it's 
worth adding a comment to that effect.

But ACK, this seems to be the right thing to do to generate a single 
bigger pack from many smaller ones.

		Linus

* Re: Subject: [PATCH] git-merge-pack
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
  2007-09-06 23:35                     ` Linus Torvalds
@ 2007-09-07  0:51                     ` Nicolas Pitre
  2007-09-07  1:58                       ` Junio C Hamano
                                         ` (2 more replies)
  2007-09-07  7:11                     ` Subject: [PATCH] git-merge-pack Johannes Sixt
  2007-09-07  7:24                     ` Andy Parkins
  3 siblings, 3 replies; 97+ messages in thread
From: Nicolas Pitre @ 2007-09-07  0:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

On Thu, 6 Sep 2007, Junio C Hamano wrote:

> This is a beginning of "git-merge-pack" that combines smaller
> packs into one.  Currently it does not actually create a new
> pack, but pretends that it is a (dumb) "git-rev-list --objects"
> that lists the objects in the affected packs.  You have to pipe
> its output to "git-pack-objects".
> 
> The command reads names of pack-*.pack files from the standard
> input, outputs the objects' names in the order they are stored
> in the original packs (i.e. the offset order).  This sorting is
> done in order to emulate the traversal order the original
> "git-rev-list --objects" that was used to create the existing
> pack listed the objects.
> 
> While this approach would give the resulting packfile very
> similar locality of access as the original, it does not give the
> "name" component you would see in "git-rev-list --objects"
> output.  This information is used as the clustering cue while
> computing delta, and the lack of it means you can get horrible
> delta selection.  You do _not_ want to run the downstream
> "git-pack-objects" without the optimization/heuristics to reuse
> delta.  IOW, do not run it with --no-reuse-delta.

I wonder if this is the best way to go.  In the context of a really fast 
repack happening automatically after (or during) user interactive 
operations, the above seems a bit heavyweight and slow to me.

I would have concatenated all packs provided on the command line into a 
single one, simply by reading data from existing packs and writing it 
back without any processing at all.  The offset for OBJ_OFS_DELTA is 
relative so a simple concatenation will just work.

Then the index for that pack can be created just as easily by reading 
existing pack index files and storing the data into an array of struct 
pack_idx_entry, adding the appropriate offset to object offsets, then 
call write_idx_file().

All data is read once and written once making it no more costly than a 
simple file copy.  On the flip side it wouldn't get rid of duplicated 
objects (I don't know if that matters i.e. if something might break with 
the same object twice in a pack).
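
A minimal sketch of the index-merging step described above, under stated 
assumptions: toy_idx_entry, toy_idx_append and toy_idx_finish are 
hypothetical names (not git's struct pack_idx_entry or write_idx_file()), 
and the caller is assumed to know by how much each source pack's offsets 
shift in the combined file. A real implementation would also have to deal 
with each pack's own header and trailing checksum when computing that 
shift, which this sketch leaves out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified index entry: object name plus pack offset. */
struct toy_idx_entry {
	unsigned char sha1[20];
	uint64_t offset;
};

/* Order entries by object name, the order a pack index stores them in. */
static int toy_idx_cmp(const void *a_, const void *b_)
{
	const struct toy_idx_entry *a = a_;
	const struct toy_idx_entry *b = b_;

	return memcmp(a->sha1, b->sha1, sizeof(a->sha1));
}

/*
 * Append the nr entries of one source index to dst, which already holds
 * dst_nr entries, adding "shift" to every offset.  The caller must have
 * sized dst for all entries.  Returns the new entry count.
 */
static size_t toy_idx_append(struct toy_idx_entry *dst, size_t dst_nr,
			     const struct toy_idx_entry *src, size_t nr,
			     uint64_t shift)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		dst[dst_nr] = src[i];
		dst[dst_nr].offset += shift;
		dst_nr++;
	}
	return dst_nr;
}

/* Sort the combined table by name once every source index is appended. */
static void toy_idx_finish(struct toy_idx_entry *dst, size_t nr)
{
	qsort(dst, nr, sizeof(*dst), toy_idx_cmp);
}
```

Each entry is touched once while appending plus one sort at the end, and 
no object data is inflated or re-deltified. Duplicate names are kept as-is.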

> To consolidate all packs that are smaller than a megabyte into
> one, you would use it in its current form like this:
> 
>     $ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
>     $ new=$(echo "$old" | git merge-pack | git pack-objects pack)
>     $ for p in $old; do rm -f $p ${p%.pack}.idx; done
>     $ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done

You might want to move the new pack before removing the old ones though.

> Obvious next steps that can be done in parallel by interested
> parties would be:
> 
>  (1) come up with a way to give "name" aka "clustering cue" (I
>      think this is very hard);

It is, and IMHO not worth it.  If you do it separately from the usual 
pack-objects process you'll perform extra IO and decompression when 
walking tree objects just to reconstruct those paths, becoming really 
slow by the context definition I provided above.

If you really want to do it then the best way might simply be to reverse
your find result above, in order to use pack-objects as if the larger 
packs, i.e. the ones that you don't want to merge, simply had an 
associated .keep file.

In fact, since we want to _also_ perform a repack of loose objects in 
the context of automatic repacking, I wonder why we wouldn't use that 
--unpacked= argument to also repack smallish packs at the same time in 
only one pack-objects pass.  Or maybe I'm missing something?

Nicolas

* Re: Subject: [PATCH] git-merge-pack
  2007-09-07  0:51                     ` Nicolas Pitre
@ 2007-09-07  1:58                       ` Junio C Hamano
  2007-09-07  2:32                         ` Nicolas Pitre
  2007-09-07  4:07                       ` Shawn O. Pearce
  2007-09-07  4:43                       ` Junio C Hamano
  2 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2007-09-07  1:58 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> I wonder if this is the best way to go.  In the context of a really fast 
> repack happening automatically after (or during) user interactive 
> operations, the above seems a bit heavyweight and slow to me.

Honestly, I do not believe in that mode of operation that much.

"While the user is waiting for the EDITOR"?

Because you do not know how much time you will be given before
you start, unless

 (1) your process can be snapshotted and you can restart at the
     next chance; or

 (2) it is so cheap and you can afford to abort and start over
     from scratch at the next chance; or

 (3) it is so quick that you can simply have the user wait until
     you are done without adding too much latency to be annoying,
     when you cannot finish before the EDITOR comes back;

I think that is a false sense of "ok, we will be able to do
something else in the background meantime", which is not so
useful in practice.

>> Obvious next steps that can be done in parallel by interested
>> parties would be:
>> 
>>  (1) come up with a way to give "name" aka "clustering cue" (I
>>      think this is very hard);
>
> It is, and IMHO not worth it.  If you do it separately from the usual 
> pack-objects process you'll perform extra IO and decompression when 
> walking tree objects just to reconstruct those paths, becoming really 
> slow by the context definition I provided above.

Well, I said "name" in quotes because you do _NOT_ have to give
the real name.  I was not thinking about doing the actual tree
traversal at all.  What you need to do is to come up with a
token that is the same for the objects in the same deltification
chain so that they cluster together, and that should be doable
by looking at the delta chain patterns inside a packfile.
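
One way such a token could be derived, purely as an illustration: use the 
root of each object's delta chain. The base[] array and the names 
TOY_NO_BASE and toy_cluster_token are assumptions made for this sketch and 
do not correspond to anything in the patch.

```c
#include <stddef.h>

#define TOY_NO_BASE ((size_t)-1)

/*
 * base[i] is the index of the object that object i is stored as a
 * delta against, or TOY_NO_BASE when object i is stored whole.
 * Follow the chain to its root; the root's index serves as the token.
 * Assumes chains are acyclic, as they must be in a valid pack.
 */
static size_t toy_cluster_token(const size_t *base, size_t i)
{
	while (base[i] != TOY_NO_BASE)
		i = base[i];
	return i;
}
```

Objects in the same chain get the same token and so would sort next to 
each other, without walking any trees to recover real path names.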

* Re: Subject: [PATCH] git-merge-pack
  2007-09-07  1:58                       ` Junio C Hamano
@ 2007-09-07  2:32                         ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2007-09-07  2:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

On Thu, 6 Sep 2007, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > I wonder if this is the best way to go.  In the context of a really fast 
> > repack happening automatically after (or during) user interactive 
> > operations, the above seems a bit heavyweight and slow to me.
> 
> Honestly, I do not believe in that mode of operation that much.
> 
> "While the user is waiting for the EDITOR"?
> 
> Because you do not know how much time you will be given before
> you start, unless
> 
>  (1) your process can be snapshotted and you can restart at the
>      next chance; or
> 
>  (2) it is so cheap and you can afford to abort and start over
>      from scratch at the next chance; or
> 
>  (3) it is so quick that you can simply have the user wait until
>      you are done without adding too much latency to be annoying,
>      when you cannot finish before the EDITOR comes back;

I think we have to aim for #3.  "Automatic" certainly doesn't imply "can 
be slow".  It should be reasonably instantaneous, otherwise it'll become 
annoying quickly enough.  If it can't be (almost) instantaneous in 99% 
of normal cases, then I think it simply should remain asynchronous, 
through a manual invocation of 'git gc', and we only need to teach/remind 
people about it more strongly.

> >> Obvious next steps that can be done in parallel by interested
> >> parties would be:
> >> 
> >>  (1) come up with a way to give "name" aka "clustering cue" (I
> >>      think this is very hard);
> >
> > It is, and IMHO not worth it.  If you do it separately from the usual 
> > pack-objects process you'll perform extra IO and decompression when 
> > walking tree objects just to reconstruct those paths, becoming really 
> > slow by the context definition I provided above.
> 
> Well, I said "name" in quotes because you do _NOT_ have to give
> the real name.  I was not thinking about doing the actual tree
> traversal at all.  What you need to do is to come up with a
> token that is the same for the objects in the same deltification
> chain so that they cluster together, and that should be doable
> by looking at the delta chain patterns inside a packfile.

Obviously!  Sorry for being slow.

But I still think that a single repack pass should already be able to 
pick loose objects and selected (small) packs, and produce a pack with 
them all.  No need for a separate merge-pack I'd say.


Nicolas

* Re: Subject: [PATCH] git-merge-pack
  2007-09-07  0:51                     ` Nicolas Pitre
  2007-09-07  1:58                       ` Junio C Hamano
@ 2007-09-07  4:07                       ` Shawn O. Pearce
  2007-09-07  4:43                       ` Junio C Hamano
  2 siblings, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-07  4:07 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Linus Torvalds, Johannes Schindelin, Nix,
	Steven Grimm, Git Mailing List

Nicolas Pitre <nico@cam.org> wrote:
> I would have concatenated all packs provided on the command line into a 
> single one, simply by reading data from existing packs and writing it 
> back without any processing at all.  The offset for OBJ_OFS_DELTA is 
> relative so a simple concatenation will just work.
> 
> Then the index for that pack can be created just as easily by reading 
> existing pack index files and storing the data into an array of struct 
> pack_idx_entry, adding the appropriate offset to object offsets, then 
> call write_idx_file().
> 
> All data is read once and written once making it no more costly than a 
> simple file copy.  On the flip side it wouldn't get rid of duplicated 
> objects (I don't know if that matters i.e. if something might break with 
> the same object twice in a pack).

Yea, that's a really quick repack.  :-)  Plus it's actually something
that can be easily halted in the middle and resumed later.  Just need
to save the list of packfiles you are concatenating so you can pick
up later when you get more time.

There shouldn't be a problem with having duplicates in the packfile.
You can do one of two things:

  a) Omit the duplicates from the .idx when you merge the .idx tables
     together to produce the new one.  Just take the object with the
     earliest offset.

  b) Leave the duplicates in the final .idx.  In this case the
     binary search may pick any of them, but it wouldn't matter
     which it finds.
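
Option (a) could look roughly like this. It is a sketch only: toy_entry, 
toy_entry_cmp and toy_drop_duplicates are hypothetical names, and the 
entries are assumed to already carry offsets into the combined pack.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified index entry: object name plus pack offset. */
struct toy_entry {
	unsigned char sha1[20];
	uint64_t offset;
};

/* Sort by name, then by offset so the earliest copy comes first. */
static int toy_entry_cmp(const void *a_, const void *b_)
{
	const struct toy_entry *a = a_;
	const struct toy_entry *b = b_;
	int cmp = memcmp(a->sha1, b->sha1, sizeof(a->sha1));

	if (cmp)
		return cmp;
	if (a->offset < b->offset)
		return -1;
	return a->offset > b->offset;
}

/*
 * Sort list in place and keep only the first (lowest-offset) entry for
 * each object name.  Returns the number of entries that remain.
 */
static size_t toy_drop_duplicates(struct toy_entry *list, size_t nr)
{
	size_t in, out = 0;

	qsort(list, nr, sizeof(*list), toy_entry_cmp);
	for (in = 0; in < nr; in++) {
		if (out && !memcmp(list[out - 1].sha1, list[in].sha1,
				   sizeof(list[in].sha1)))
			continue;	/* later copy of the same object */
		list[out++] = list[in];
	}
	return out;
}
```

The duplicate object data stays in the pack file; only the index stops 
pointing at the later copies.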

About the only process that might care about duplicates would be
index-pack.  I don't think it makes sense to run index-pack on a
packfile you already have a .idx for.  I don't think it would have
a problem with the duplicate SHA-1s either, but it wouldn't be hard
to make it do something reasonable when it finds them.
 
> > To consolidate all packs that are smaller than a megabyte into
> > one, you would use it in its current form like this:
> > 
> >     $ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
> >     $ new=$(echo "$old" | git merge-pack | git pack-objects pack)
> >     $ for p in $old; do rm -f $p ${p%.pack}.idx; done
> >     $ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done
> 
> You might want to move the new pack before removing the old ones though.

Not might, *must*.  If you delete the old ones before the new
ones are ready then readers can run into problems trying to access
the objects.  We've spent some effort trying to make these sorts
of operations safe.  No sense in destroying that by getting the
order wrong here.  :)

-- 
Shawn.

* Re: Subject: [PATCH] git-merge-pack
  2007-09-07  0:51                     ` Nicolas Pitre
  2007-09-07  1:58                       ` Junio C Hamano
  2007-09-07  4:07                       ` Shawn O. Pearce
@ 2007-09-07  4:43                       ` Junio C Hamano
  2007-09-08  9:50                         ` [PATCH] make sha1_file.c::matches_pack_name() available to others Junio C Hamano
  2007-09-08 10:01                         ` [PATCH] pack-objects --repack-unpacked Junio C Hamano
  2 siblings, 2 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-07  4:43 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> I would have concatenated all packs provided on the command line into a 
> single one, simply by reading data from existing packs and writing it 
> back without any processing at all.  The offset for OBJ_OFS_DELTA is 
> relative so a simple concatenation will just work.

As I was planning to do this outside of pack-objects, I did not
want to write something that intimately knows the details of
packfile format, but see below.

> All data is read once and written once making it no more costly than a 
> simple file copy.  On the flip side it wouldn't get rid of duplicated 
> objects (I don't know if that matters i.e. if something might break with 
> the same object twice in a pack).

I do not think duplicates create problems, as long as the pack
idx remains sane.  But a bigger issue is for people who fetch
over dumb protocols, from a repository that repacks with "-a -d"
every once in a while.  There, many duplicates are the norm.

> In fact, since we want to _also_ perform a repack of loose objects in 
> the context of automatic repacking, I wonder why we wouldn't use that 
> --unpacked= argument to also repack smallish packs at the same time in 
> only one pack-objects pass.  Or maybe I'm missing something?

I think this is a much better idea.  You obviously need some
twist to the pack-objects, and being lazy that was the reason I
did not want to do this that way.

When a new parameter, perhaps --lossless, is given, together
with the --unpacked= parameters, we can change pack-objects to
iterate over all objects in the --unpacked= packs, and add the
ones that are not marked for inclusion to the set of objects to
be packed, after doing the usual "objects to be packed"
discovery.

I am not sure --lossless is a good option name from marketing
point of view, though.

* Re: People unaware of the importance of "git gc"?
  2007-09-06 17:49               ` Junio C Hamano
  2007-09-06 18:15                 ` Linus Torvalds
@ 2007-09-07  4:48                 ` Shawn O. Pearce
  2007-09-07 10:12                 ` Johannes Schindelin
  2 siblings, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2007-09-07  4:48 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Nicolas Pitre, Nix, Steven Grimm,
	Linus Torvalds, Git Mailing List

Junio C Hamano <gitster@pobox.com> wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Wed, 5 Sep 2007, Junio C Hamano wrote:
> >>  static int aggressive_window = -1;
> >> +static int gc_auto_threshold = 6700;
> >
> > Please don't do that.
> >
> > When you share objects with another git directory, git-gc --auto can get 
> > rid of the objects when some objects go away in the referenced repository.  
> 
> I thought the whole point of "gc --auto" was to have something
> that does not lose/prune any objects, even the ones that do not
> seem to be referenced from anywhere.  That is why invocations of
> "git gc --auto" do not say --prune as you saw the second patch,
> and the repack command "gc --auto" runs is "repack -d -l"
> instead of "repack -a -d -l", which means that it does run
> git-prune-packed after repacking but not git-prune.
> 
> Maybe I am missing something...

No, you aren't Junio.  `gc --auto` as you defined it is safe.
It won't delete objects from the database.  So it won't impact shared
repositories, or readers that are actively running in parallel with
the gc.  Both of which are important.

-- 
Shawn.

* Re: Subject: [PATCH] git-merge-pack
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
  2007-09-06 23:35                     ` Linus Torvalds
  2007-09-07  0:51                     ` Nicolas Pitre
@ 2007-09-07  7:11                     ` Johannes Sixt
  2007-09-07  7:34                       ` Junio C Hamano
  2007-09-07  7:24                     ` Andy Parkins
  3 siblings, 1 reply; 97+ messages in thread
From: Johannes Sixt @ 2007-09-07  7:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Johannes Schindelin, Nicolas Pitre, Nix,
	Steven Grimm, Git Mailing List

Junio C Hamano schrieb:
> This is a beginning of "git-merge-pack" that combines smaller
> packs into one.

This gives a new meaning to the term "merge". IMHO, "git-combine-pack" would 
be a better name.

-- Hannes

* Re: Subject: [PATCH] git-merge-pack
  2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
                                       ` (2 preceding siblings ...)
  2007-09-07  7:11                     ` Subject: [PATCH] git-merge-pack Johannes Sixt
@ 2007-09-07  7:24                     ` Andy Parkins
  3 siblings, 0 replies; 97+ messages in thread
From: Andy Parkins @ 2007-09-07  7:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Linus Torvalds, Johannes Schindelin,
	Nicolas Pitre, Nix, Steven Grimm

On Friday 2007 September 07, Junio C Hamano wrote:

>  builtin-merge-pack.c |   87

Can I suggest not calling it git-merge-pack?  It makes it look like it's a new 
merge strategy called "pack"...

git-merge-base
git-merge-file
git-merge-index
git-merge-octopus
git-merge-one-file
git-merge-ours
git-merge-recur
git-merge-recursive
git-merge-resolve
git-merge-stupid
git-merge-subtree
git-merge-tree


  
Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

* Re: Subject: [PATCH] git-merge-pack
  2007-09-07  7:11                     ` Subject: [PATCH] git-merge-pack Johannes Sixt
@ 2007-09-07  7:34                       ` Junio C Hamano
  0 siblings, 0 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-07  7:34 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Linus Torvalds, Johannes Schindelin, Nicolas Pitre, Nix,
	Steven Grimm, Git Mailing List

Johannes Sixt <j.sixt@eudaptics.com> writes:

> Junio C Hamano schrieb:
>> This is a beginning of "git-merge-pack" that combines smaller
>> packs into one.
>
> This gives a new meaning to the term "merge". IMHO, "git-combine-pack"
> would be a better name.

Yeah, that makes sense, but I think this can and should be done
as part of pack-objects itself as Nico suggested.

So consider that patch scrapped for now.

* Re: People unaware of the importance of "git gc"?
  2007-09-06 17:49               ` Junio C Hamano
  2007-09-06 18:15                 ` Linus Torvalds
  2007-09-07  4:48                 ` People unaware of the importance of "git gc"? Shawn O. Pearce
@ 2007-09-07 10:12                 ` Johannes Schindelin
  2 siblings, 0 replies; 97+ messages in thread
From: Johannes Schindelin @ 2007-09-07 10:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Hi,

On Thu, 6 Sep 2007, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Wed, 5 Sep 2007, Junio C Hamano wrote:
> >
> >> @@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
> >>  
> >>  static int pack_refs = 1;
> >>  static int aggressive_window = -1;
> >> +static int gc_auto_threshold = 6700;
> >
> > Please don't do that.
> >
> > When you share objects with another git directory, git-gc --auto can 
> > get rid of the objects when some objects go away in the referenced 
> > repository.
> 
> I thought the whole point of "gc --auto" was to have something
> that does not lose/prune any objects, even the ones that do not
> seem to be referenced from anywhere.  That is why invocations of
> "git gc --auto" do not say --prune as you saw the second patch,
> and the repack command "gc --auto" runs is "repack -d -l"
> instead of "repack -a -d -l", which means that it does run
> git-prune-packed after repacking but not git-prune.
> 
> Maybe I am missing something...

No, _I_ missed the fact that no pack is rewritten...

Sorry for the line noise,
Dscho

* [PATCH] make sha1_file.c::matches_pack_name() available to others
  2007-09-07  4:43                       ` Junio C Hamano
@ 2007-09-08  9:50                         ` Junio C Hamano
  2007-09-08 10:01                         ` [PATCH] pack-objects --repack-unpacked Junio C Hamano
  1 sibling, 0 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-08  9:50 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

Even though our convention is "zero return means good", it goes a
bit too far for matches_pack_name() to return 0 when it found
the pack is what the name refers to.  This fixes that silly and
obvious interface bug.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 Junio C Hamano <gitster@pobox.com> writes:

 > Nicolas Pitre <nico@cam.org> writes:
 > ...
 >> In fact, since we want to _also_ perform a repack of loose objects in 
 >> the context of automatic repacking, I wonder why we wouldn't use that 
 >> --unpacked= argument to also repack smallish packs at the same time in 
 >> only one pack-objects pass.  Or maybe I'm missing something?
 >
 > I think this is a much better idea.  You obviously need some
 > twist to the pack-objects, and being lazy that was the reason I
 > did not want to do this that way.

 So what follows is a two-patch series, which still is a rough
 sketch, as I am feeling a bit too tired to do tests and
 documentation (help is always welcomed, hint hint).

 This message contains the first one, which is more or less
 independent, that exposes matches_pack_name() function from
 sha1_file.c, while fixing a silly and obvious interface bug.
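
 For readers who want to see the corrected convention in isolation,
 here is a minimal standalone sketch.  It is not the code from the
 patch below: struct pack_sketch and the *_sketch names are
 hypothetical stand-ins, and strrchr() replaces the hand-written scan
 for the last '/'.  The point is only that a match now returns 1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for git's struct packed_git; only the name matters here. */
struct pack_sketch {
	const char *pack_name;
};

/* Return 1 when "name" is the full path of the pack or its basename, else 0. */
static int matches_pack_name_sketch(const struct pack_sketch *p, const char *name)
{
	const char *base;

	assert(p && p->pack_name && name);
	if (!strcmp(p->pack_name, name))
		return 1;

	/* Compare against whatever follows the last '/' in the path. */
	base = strrchr(p->pack_name, '/');
	base = base ? base + 1 : p->pack_name;
	return !strcmp(base, name);
}
```

 With this shape a caller can write "if (matches_pack_name_sketch(p, ig))"
 and have it read the way it behaves.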

 cache.h     |    1 +
 sha1_file.c |   14 +++++++-------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 70abbd5..3fa5b8e 100644
--- a/cache.h
+++ b/cache.h
@@ -529,6 +529,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
 extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int matches_pack_name(struct packed_git *p, const char *name);
 
 /* Dumb servers support */
 extern int update_server_info(int);
diff --git a/sha1_file.c b/sha1_file.c
index 9978a58..5801c3e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1684,22 +1684,22 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	return 0;
 }
 
-static int matches_pack_name(struct packed_git *p, const char *ig)
+int matches_pack_name(struct packed_git *p, const char *name)
 {
 	const char *last_c, *c;
 
-	if (!strcmp(p->pack_name, ig))
-		return 0;
+	if (!strcmp(p->pack_name, name))
+		return 1;
 
 	for (c = p->pack_name, last_c = c; *c;)
 		if (*c == '/')
 			last_c = ++c;
 		else
 			++c;
-	if (!strcmp(last_c, ig))
-		return 0;
+	if (!strcmp(last_c, name))
+		return 1;
 
-	return 1;
+	return 0;
 }
 
 static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
@@ -1717,7 +1717,7 @@ static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, cons
 		if (ignore_packed) {
 			const char **ig;
 			for (ig = ignore_packed; *ig; ig++)
-				if (!matches_pack_name(p, *ig))
+				if (matches_pack_name(p, *ig))
 					break;
 			if (*ig)
 				goto next;

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH] pack-objects --repack-unpacked
  2007-09-07  4:43                       ` Junio C Hamano
  2007-09-08  9:50                         ` [PATCH] make sha1_file.c::matches_pack_name() available to others Junio C Hamano
@ 2007-09-08 10:01                         ` Junio C Hamano
  1 sibling, 0 replies; 97+ messages in thread
From: Junio C Hamano @ 2007-09-08 10:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Johannes Schindelin, Nix, Steven Grimm,
	Git Mailing List

The usual command line that uses "--unpacked=<existing>" option
looks like this:

	git pack-objects --non-empty --all --reflog \
		--unpacked --unpacked=<existing> \
		packname-prefix

This packs loose objects and objects in the named existing
packs that are reachable from any and all refs and reflog
entries.  It is typically used by "git repack -a -d", which
then removes the named existing packs from the repository, and
has an effect of getting rid of unreachable objects these packs
hold.

This adds "--repack-unpacked" option to pack-objects to help
combining small packs into one, without losing unreferenced
objects that are in the packs.  When this option is given in
addition to the above command line, we also make sure all the
objects in the named existing packs are included in the result.

This allows us to safely remove the packs that were named on the
command line after installing the resulting pack in the
repository.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 I am too tired to keep staring at this code now.  Fixes,
 improvements, replacements and enhancements, in the code,
 documentation and tests, are very much welcomed.
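
 The ordering step is probably the easiest part to look at on its
 own.  Below is a rough standalone sketch of it, assuming a
 simplified entry type (struct entry_sketch, with a plain long
 instead of off_t) and memcmp() in place of hashcmp(); none of these
 names come from the patch.  Entries are sorted by their offset in
 the pack, with the object name as a tie-breaker.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HASH_LEN 20	/* raw SHA-1 length in bytes */

/* Simplified stand-in for the patch's struct in_pack_object. */
struct entry_sketch {
	long offset;			/* position of the object inside the pack */
	const unsigned char *sha1;	/* points at SKETCH_HASH_LEN bytes */
};

/* Order by pack offset; fall back to the object name to break ties. */
static int ofscmp_sketch(const void *a_, const void *b_)
{
	const struct entry_sketch *a = a_;
	const struct entry_sketch *b = b_;

	if (a->offset < b->offset)
		return -1;
	if (a->offset > b->offset)
		return 1;
	return memcmp(a->sha1, b->sha1, SKETCH_HASH_LEN);
}

/* Sort "nr" entries in place so they come out in on-disk order. */
static void sort_by_offset_sketch(struct entry_sketch *array, size_t nr)
{
	assert(array || !nr);
	if (nr)
		qsort(array, nr, sizeof(array[0]), ofscmp_sketch);
}
```

 Walking the sorted array afterwards visits the leftover objects in
 the order they sit in the existing pack.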

 builtin-pack-objects.c |   95 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 12509fa..9bc2faa 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -21,7 +21,7 @@ git-pack-objects [{ -q | --progress | --all-progress }] \n\
 	[--window=N] [--window-memory=N] [--depth=N] \n\
 	[--no-reuse-delta] [--no-reuse-object] [--delta-base-offset] \n\
 	[--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
-	[--stdout | base-name] [<ref-list | <object-list]";
+	[--stdout | base-name] [--repack-unpacked] [<ref-list | <object-list]";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -57,7 +57,7 @@ static struct object_entry **written_list;
 static uint32_t nr_objects, nr_alloc, nr_result, nr_written;
 
 static int non_empty;
-static int no_reuse_delta, no_reuse_object;
+static int no_reuse_delta, no_reuse_object, repack_unpacked;
 static int local;
 static int incremental;
 static int allow_ofs_delta;
@@ -1625,15 +1625,21 @@ static void read_object_list_from_stdin(void)
 	}
 }
 
+#define OBJECT_ADDED (1u<<20)
+
 static void show_commit(struct commit *commit)
 {
 	add_object_entry(commit->object.sha1, OBJ_COMMIT, NULL, 0);
+	commit->object.flags |= OBJECT_ADDED;
 }
 
 static void show_object(struct object_array_entry *p)
 {
+	struct object *o = lookup_unknown_object(p->item->sha1);
+
 	add_preferred_base_object(p->name);
 	add_object_entry(p->item->sha1, p->item->type, p->name, 0);
+	o->flags |= OBJECT_ADDED;
 }
 
 static void show_edge(struct commit *commit)
@@ -1641,6 +1647,84 @@ static void show_edge(struct commit *commit)
 	add_preferred_base(commit->object.sha1);
 }
 
+struct in_pack_object {
+	off_t offset;
+	const unsigned char *sha1;
+};
+
+struct in_pack {
+	int alloc;
+	int nr;
+	struct in_pack_object *array;
+};
+
+static void mark_in_pack_object(const unsigned char *sha1, struct packed_git *p, struct in_pack *in_pack)
+{
+	in_pack->array[in_pack->nr].offset = find_pack_entry_one(sha1, p);
+	in_pack->array[in_pack->nr].sha1 = sha1;
+	in_pack->nr++;
+}
+
+/*
+ * Compare the objects in the offset order, in order to emulate the
+ * "git-rev-list --objects" output that produced the pack originally.
+ */
+static int ofscmp(const void *a_, const void *b_)
+{
+	struct in_pack_object *a = (struct in_pack_object *)a_;
+	struct in_pack_object *b = (struct in_pack_object *)b_;
+
+	if (a->offset < b->offset)
+		return -1;
+	else if (a->offset > b->offset)
+		return 1;
+	else
+		return hashcmp(a->sha1, b->sha1);
+}
+
+static void add_objects_in_unpacked_packs(struct rev_info *revs)
+{
+	struct packed_git *p;
+
+	for (p = packed_git; p; p = p->next) {
+		struct in_pack in_pack;
+		const unsigned char *sha1;
+		struct object *o;
+		uint32_t i;
+
+		for (i = 0; i < revs->num_ignore_packed; i++) {
+			if (matches_pack_name(p, revs->ignore_packed[i]))
+				break;
+		}
+		if (revs->num_ignore_packed <= i)
+			continue;
+		if (open_pack_index(p))
+			die("cannot open pack index");
+
+		in_pack.alloc = p->num_objects;
+		in_pack.nr = 0;
+		in_pack.array = xmalloc(sizeof(in_pack.array[0]) *
+					p->num_objects);
+		for (i = 0; i < p->num_objects; i++) {
+			sha1 = nth_packed_object_sha1(p, i);
+			o = lookup_unknown_object(sha1);
+			if (!(o->flags & OBJECT_ADDED))
+				mark_in_pack_object(sha1, p, &in_pack);
+			o->flags |= OBJECT_ADDED;
+		}
+		if (!in_pack.nr)
+			continue;
+		qsort(in_pack.array, in_pack.nr, sizeof(in_pack.array[0]),
+		      ofscmp);
+		for (i = 0; i < in_pack.nr; i++) {
+			sha1 = in_pack.array[i].sha1;
+			o = lookup_unknown_object(sha1);
+			add_object_entry(sha1, o->type, "", 0);
+		}
+		free(in_pack.array);
+	}
+}
+
 static void get_object_list(int ac, const char **av)
 {
 	struct rev_info revs;
@@ -1672,6 +1756,9 @@ static void get_object_list(int ac, const char **av)
 	prepare_revision_walk(&revs);
 	mark_edges_uninteresting(revs.commits, &revs, show_edge);
 	traverse_commit_list(&revs, show_commit, show_object);
+
+	if (repack_unpacked)
+		add_objects_in_unpacked_packs(&revs);
 }
 
 static int adjust_perm(const char *path, mode_t mode)
@@ -1789,6 +1876,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			use_internal_rev_list = 1;
 			continue;
 		}
+		if (!strcmp("--repack-unpacked", arg)) {
+			repack_unpacked = 1;
+			continue;
+		}
 		if (!strcmp("--unpacked", arg) ||
 		    !prefixcmp(arg, "--unpacked=") ||
 		    !strcmp("--reflog", arg) ||

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* What's so special about objects/17/ ?
  2007-09-05 18:54         ` Nicolas Pitre
  2007-09-05 20:01           ` Junio C Hamano
@ 2018-10-07 18:28           ` Ævar Arnfjörð Bjarmason
  2018-10-07 18:35             ` Johannes Sixt
  2018-10-07 19:46             ` Junio C Hamano
  1 sibling, 2 replies; 97+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-07 18:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

In 2007 Junio wrote
(https://public-inbox.org/git/7vr6lcj2zi.fsf@gitster.siamese.dyndns.org/):

    +static int need_to_gc(void)
    +{
    +	/*
    +	 * Quickly check if a "gc" is needed, by estimating how
    +	 * many loose objects there are.  Because SHA-1 is evenly
    +	 * distributed, we can check only one and get a reasonable
    +	 * estimate.
    +	 */
    +	char path[PATH_MAX];
    +	const char *objdir = get_object_directory();
    +	DIR *dir;
    +	struct dirent *ent;
    +	int auto_threshold;
    +	int num_loose = 0;
    +	int needed = 0;
    +
    +	if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) {
    +		warning("insanely long object directory %.*s", 50, objdir);
    +		return 0;
    +	}
    +	dir = opendir(path);
    +	if (!dir)
    +		return 0;
    +
    +	auto_threshold = (gc_auto_threshold + 255) / 256;
    +	while ((ent = readdir(dir)) != NULL) {
    +		if (strspn(ent->d_name, "0123456789abcdef") != 38 ||
    +		    ent->d_name[38] != '\0')
    +			continue;
    +		if (++num_loose > auto_threshold) {
    +			needed = 1;
    +			break;
    +		}
    +	}

A couple of questions about this patch, which is in git.git as
2c3c439947 ("Implement git gc --auto", 2007-09-05)

1. We still have this check of objects/17/ in builtin/gc.c today. Why
   objects/17/ and not e.g. objects/00/ to go with other 000* magic such
   as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
   it doesn't matter, but 17 seems like an odd thing to pick at random
   out of 00..ff, does it have any significance?

2. It seems overly paranoid to be checking that the files in
  .git/objects/17/ look like a SHA-1. If we have stuff not generated by
  git in .git/objects/??/ we probably have bigger problems than
  prematurely triggering auto gc, can this just be removed as
  redundant. Was this some check e.g. expecting that this would need to
  deal with tempfiles in these directories that we created at the time
  (but no longer do?)?
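
To make the two pieces under discussion concrete, here is a small
standalone sketch of the arithmetic and the filename check.  It is
not the builtin/gc.c code: the *_sketch names are made up for this
illustration, and the directory listing is passed in as an array
instead of being read with opendir()/readdir().

```c
#include <assert.h>
#include <string.h>

/* Loose objects allowed in one of the 256 fan-out directories, rounded up. */
static int per_dir_threshold_sketch(int gc_auto)
{
	assert(gc_auto > 0);
	return (gc_auto + 255) / 256;
}

/* A loose object file is named by the remaining 38 lowercase hex digits. */
static int looks_like_loose_object_sketch(const char *name)
{
	return strspn(name, "0123456789abcdef") == 38 && name[38] == '\0';
}

/* Count object-like names; report 1 once they exceed the per-directory limit. */
static int too_many_loose_sketch(const char *const *names, int nr, int gc_auto)
{
	int limit = per_dir_threshold_sketch(gc_auto);
	int i, num_loose = 0;

	for (i = 0; i < nr; i++)
		if (looks_like_loose_object_sketch(names[i]) &&
		    ++num_loose > limit)
			return 1;
	return 0;
}
```

With the default of 6700 the per-directory limit works out to 27,
and names such as "." or a stray temporary file are not counted.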

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 18:28           ` What's so special about objects/17/ ? Ævar Arnfjörð Bjarmason
@ 2018-10-07 18:35             ` Johannes Sixt
  2018-10-07 19:06               ` Ævar Arnfjörð Bjarmason
  2018-10-07 19:46             ` Junio C Hamano
  1 sibling, 1 reply; 97+ messages in thread
From: Johannes Sixt @ 2018-10-07 18:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason:
> In 2007 Junio wrote
> (https://public-inbox.org/git/7vr6lcj2zi.fsf@gitster.siamese.dyndns.org/):
> 
>      +static int need_to_gc(void)
>      +{
>      +	/*
>      +	 * Quickly check if a "gc" is needed, by estimating how
>      +	 * many loose objects there are.  Because SHA-1 is evenly
>      +	 * distributed, we can check only one and get a reasonable
>      +	 * estimate.
>      +	 */

> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>     objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>     as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
>     it doesn't matter, but 17 seems like an odd thing to pick at random
>     out of 00..ff, does it have any significance?

The reason is explained in the comment. And, BTW, you do know about this 
one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random Number")

> 2. It seems overly paranoid to be checking that the files in
>    .git/objects/17/ look like a SHA-1. If we have stuff not generated by
>    git in .git/objects/??/ we probably have bigger problems than
>    prematurely triggering auto gc, can this just be removed as
>    redundant. Was this some check e.g. expecting that this would need to
>    deal with tempfiles in these directories that we created at the time
>    (but no longer do?)?

It's not about that there are SHA-1s in there, it's about how many there 
are.

-- Hannes

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 18:35             ` Johannes Sixt
@ 2018-10-07 19:06               ` Ævar Arnfjörð Bjarmason
  2018-10-07 22:39                 ` Johannes Sixt
  0 siblings, 1 reply; 97+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-07 19:06 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Junio C Hamano, Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List


On Sun, Oct 07 2018, Johannes Sixt wrote:

> Am 07.10.18 um 20:28 schrieb Ævar Arnfjörð Bjarmason:
>> In 2007 Junio wrote
>> (https://public-inbox.org/git/7vr6lcj2zi.fsf@gitster.siamese.dyndns.org/):
>>
>>      +static int need_to_gc(void)
>>      +{
>>      +	/*
>>      +	 * Quickly check if a "gc" is needed, by estimating how
>>      +	 * many loose objects there are.  Because SHA-1 is evenly
>>      +	 * distributed, we can check only one and get a reasonable
>>      +	 * estimate.
>>      +	 */
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>>     objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>>     as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
>>     it doesn't matter, but 17 seems like an odd thing to pick at random
>>     out of 00..ff, does it have any significance?
>
> The reason is explained in the comment. And, BTW, you do know about
> this one: https://xkcd.com/221/ don't you? (TLDR: the title is "Random
> Number")

Picking any one number is explained in the comment. I'm asking why 17 in
particular not for correctness reasons but as a bit of historical lore,
and because my ulterior motive is to improve the GC docs.

The number in that comic is 4 (and no datestamp on when it was
published). Are you saying Junio's patch is somehow a reference to that
xkcd in particular, or that it's just a funny reference in this context?

>> 2. It seems overly paranoid to be checking that the files in
>>    .git/objects/17/ look like a SHA-1. If we have stuff not generated by
>>    git in .git/objects/??/ we probably have bigger problems than
>>    prematurely triggering auto gc, can this just be removed as
>>    redundant. Was this some check e.g. expecting that this would need to
>>    deal with tempfiles in these directories that we created at the time
>>    (but no longer do?)?
>
> It's not about that there are SHA-1s in there, it's about how many
> there are.

Right, I'm wondering if it couldn't be replaced by some general path.c
"number_of_files_in_dir" helper. I.e. why this code is being paranoid
about ignoring the likes of
.git/objects/17/{foo,bar,some-other-garbage}. A number_of_files_in_dir()
would obviously need to ignore "." and "..".

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 18:28           ` What's so special about objects/17/ ? Ævar Arnfjörð Bjarmason
  2018-10-07 18:35             ` Johannes Sixt
@ 2018-10-07 19:46             ` Junio C Hamano
  2018-10-07 20:07               ` Junio C Hamano
  2018-10-08 10:36               ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 97+ messages in thread
From: Junio C Hamano @ 2018-10-07 19:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>    objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>    as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
>    it doesn't matter, but 17 seems like an odd thing to pick at random
>    out of 00..ff, does it have any significance?

There is no "other 000* magic such as ...". There is only one 0{40}
magic and that one must be memorable and explainable.

The 1/256 sample can be any one among 256.  Just like the date
string on the first line of the output to be used as the /etc/magic
signature by format-patch, it was an arbitrary choice, rather than a
random choice, and unlike 0{40} this does not have to be memorable
by general public and I do not have to explain the choice to the
general public ;-)

> 2. It seems overly paranoid to be checking that the files in
>   .git/objects/17/ look like a SHA-1.

There is no other reason than futureproofing.  We were paying cost
to open and scan the directory anyway, and checking that we only
count the loose object files was (and still is) a sensible thing to
do to allow us not even worry about the other kind of things we
might end up creating there.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 19:46             ` Junio C Hamano
@ 2018-10-07 20:07               ` Junio C Hamano
  2018-10-08 19:17                 ` Stefan Beller
  2018-10-08 10:36               ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2018-10-07 20:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>>    objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>>    as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
>>    it doesn't matter, but 17 seems like an odd thing to pick at random
>>    out of 00..ff, does it have any significance?
>
> ...
> by general public and I do not have to explain the choice to the
> general public ;-)

One thing that is more important than "why not 00 but 17?" to answer
is why a hardcoded number rather than a runtime random.  It is for
repeatability.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 19:06               ` Ævar Arnfjörð Bjarmason
@ 2018-10-07 22:39                 ` Johannes Sixt
  2018-10-08  0:54                   ` Junio C Hamano
  0 siblings, 1 reply; 97+ messages in thread
From: Johannes Sixt @ 2018-10-07 22:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List

Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason:
> Picking any one number is explained in the comment. I'm asking why 17 in
> particular not for correctness reasons but as a bit of historical lore,
> and because my ulterior motive is to improve the GC docs.
> 
> The number in that comic is 4 (and no datestamp on when it was
> published). Are you saying Junio's patch is somehow a reference to that
> xkcd in particular, or that it's just a funny reference in this context?

No lore, AFAIR. It's just a random number, determined by a fair dice 
roll or something ;)

-- Hannes

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 22:39                 ` Johannes Sixt
@ 2018-10-08  0:54                   ` Junio C Hamano
  0 siblings, 0 replies; 97+ messages in thread
From: Junio C Hamano @ 2018-10-08  0:54 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Ævar Arnfjörð Bjarmason, Nix, Steven Grimm,
	Linus Torvalds, Git Mailing List

Johannes Sixt <j6t@kdbg.org> writes:

> Am 07.10.18 um 21:06 schrieb Ævar Arnfjörð Bjarmason:
>> Picking any one number is explained in the comment. I'm asking why 17 in
>> particular not for correctness reasons but as a bit of historical lore,
>> and because my ulterior motive is to improve the GC docs.
>>
>> The number in that comic is 4 (and no datestamp on when it was
>> published). Are you saying Junio's patch is somehow a reference to that
>> xkcd in particular, or that it's just a funny reference in this context?
>
> No lore, AFAIR. It's just a random number, determined by a fair dice
> roll or something ;)

As I already said, I did not pick the number randomly, but rather
arbitrarily, and it is not 00 because the chosen number (unlike the
0{40} magic we use elsewhere) does not have to be memorable, and the
choice does not have to be explainable.

So people will not get any further explanation as to the reason
behind that arbitrary choice, but it was not random.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 19:46             ` Junio C Hamano
  2018-10-07 20:07               ` Junio C Hamano
@ 2018-10-08 10:36               ` Ævar Arnfjörð Bjarmason
  2018-10-09  1:07                 ` Junio C Hamano
  1 sibling, 1 reply; 97+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-08 10:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List, Christian Couder

On Sun, Oct 07 2018, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> 1. We still have this check of objects/17/ in builtin/gc.c today. Why
>>    objects/17/ and not e.g. objects/00/ to go with other 000* magic such
>>    as the 0000000000000000000000000000000000000000 SHA-1?  Statistically
>>    it doesn't matter, but 17 seems like an odd thing to pick at random
>>    out of 00..ff, does it have any significance?
>
> There is no "other 000* magic such as ...". There is only one 0{40}
> magic and that one must be memorable and explainable.

Depending on how we're counting there's at least two. We also use
0000000000000000000000000000000000000000 as a placeholder for "couldn't
read a ref" or "this is a placeholder for an invalid ref", in addition
to how it's used to signify creation/deletion in the likes of the
pre-receive hook:

    $ echo hello > .git/refs/something
    $ git fsck
    [...]
    error: refs/something: invalid sha1 pointer 0000000000000000000000000000000000000000
    $ > .git/refs/something
    $ git fsck
    [...]
    error: refs/something: invalid sha1 pointer 0000000000000000000000000000000000000000

This is because the refs backend will memzero the oid struct, and if we
fail to read things it'll still be zero'd out.
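
As a rough illustration of that effect (not the actual refs backend
code; read_ref_sketch() and the other *_sketch names are hypothetical),
a reader that zeroes its output first and only fills it in after a
successful parse leaves 0{40} behind for both "hello" and an empty
file:

```c
#include <assert.h>
#include <string.h>

#define SKETCH_RAWSZ 20	/* raw SHA-1 size in bytes */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval_sketch(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical ref reader: zero the result first, then fill it in only
 * if "content" starts with 40 hex digits.  Returns 0 on success, -1 on
 * failure.
 */
static int read_ref_sketch(const char *content, unsigned char oid[SKETCH_RAWSZ])
{
	unsigned char tmp[SKETCH_RAWSZ];
	int i;

	assert(content && oid);
	memset(oid, 0, SKETCH_RAWSZ);
	for (i = 0; i < SKETCH_RAWSZ; i++) {
		int hi = hexval_sketch(content[2 * i]);
		int lo = hi < 0 ? -1 : hexval_sketch(content[2 * i + 1]);

		if (lo < 0)
			return -1;	/* oid is still all zeros here */
		tmp[i] = (unsigned char)(hi << 4 | lo);
	}
	memcpy(oid, tmp, SKETCH_RAWSZ);
	return 0;
}

/* True when every byte is zero, i.e. the value prints as 0{40}. */
static int is_null_sketch(const unsigned char oid[SKETCH_RAWSZ])
{
	int i;

	for (i = 0; i < SKETCH_RAWSZ; i++)
		if (oid[i])
			return 0;
	return 1;
}
```

A caller that ignores the return value and prints the buffer would
report the same all-zero name for every kind of unreadable ref.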

This manifests e.g. in this confusing fsck output, due to a bug where
GitLab will write empty refs/keep-around/* refs sometimes:
https://gitlab.com/gitlab-org/gitlab-ce/issues/44431

> The 1/256 sample can be any one among 256.  Just like the date
> string on the first line of the output to be used as the /etc/magic
> signature by format-patch, it was an arbitrary choice, rather than a
> random choice, and unlike 0{40} this does not have to be memorable
> by general public and I do not have to explain the choice to the
> general public ;-)

I wanted to elaborate on the explanation for "gc.auto" in
git-config. Now we just say "approximately 6700". Since this behavior
has been really stable for a long time we could say we sample 1/256 of
the .git/objects/?? dirs, and this explains any perceived discrepancies
between the 6700 number and $(find .git/objects/?? -type f | wc -l).

>> 2. It seems overly paranoid to be checking that the files in
>>   .git/objects/17/ look like a SHA-1.
>
> There is no other reason than futureproofing.  We were paying cost
> to open and scan the directory anyway, and checking that we only
> count the loose object files was (and still is) a sensible thing to
> do to allow us not even worry about the other kind of things we
> might end up creating there.

Makes sense. Just wanted to ask if it was that or some workaround for
historical files being there.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-07 20:07               ` Junio C Hamano
@ 2018-10-08 19:17                 ` Stefan Beller
  2018-10-09  1:03                   ` Junio C Hamano
  0 siblings, 1 reply; 97+ messages in thread
From: Stefan Beller @ 2018-10-08 19:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, nico, nix, koreth,
	Linus Torvalds, git

On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:

> > ...
> > by general public and I do not have to explain the choice to the
> > general public ;-)
>
> One thing that is more important than "why not 00 but 17?" to answer
> is why a hardcoded number rather than a runtime random.  It is for
> repeatability.

Let's talk about repeatability vs statistics for a second. ;-)

If I am a user and I were really into optimizing my loose object count
for some reason, so I would want to choose a low number of
gc.auto. Let's say I go with 128.

At the low end of loose objects the approximation is yielding
some high relative errors. This is because of the granularity, i.e.
gc would implicitly estimate the loose objects to be 0 or 256 or 512 (or more)
if there are 0, 1, 2 (or more) loose objects in objects/17.

As each object can be viewed as following an unfair coin flip
(with a chance of 1/256 it is in objects/17), the number of objects in
objects/17 (and hence any other objects/XX bin) follows the
binomial distribution.

If I do have say about 157 loose objects (and having gc.auto
configured anywhere in 1..255), then the probability to not
gc is 54% (as that is the probability to have 0 objects in /17,
following the probability mass function of the binomial distribution,
(i.e. Pr(0 objects) = (157 over 0) x (1/256)^0 x (255/256)^157))

As it is repeatable (by picking the same /17 every time), I can run
"gc --auto" multiple times and still have 157 loose objects, despite
wanting to have only 128 loose objects at a 54% chance.

If we'd roll the 256 dice every time to pick a different bin,
then we might hit another bin and gc in the second or third
gc, which would be more precise on average.

By having repeatability we allow for these numbers to be far off
more often when configuring small numbers.

I think that is the right choice, as we probably do not care about the
exactness of auto-gc for small numbers, as it is a performance
thing anyway. Although documenting it properly might be a challenge.

The current wording of gc.auto seems to suggest that we are right
about the number, as we compute it by implying the expected value
(i.e. we pick a bin and multiply the fullness of the bin by the number
of bins to estimate the whole fullness, see the mean=n p on [1]).
I think a user would be far more interested in giving an upper bound,
i.e. expressing something like "I will have at most $gc.auto objects
before gc kicks in" or "The likelihood to exceed the $gc.auto number
of loose objects by $this much is less than 5%", for which the math
would be more complicated, but easier to document with the words of
statistics.

[1]  https://en.wikipedia.org/wiki/Binomial_distribution

Stefan
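
The 54% figure can be checked with a few lines of C.  This is only a
sketch of the arithmetic (the function name is made up): with k = 0
the binomial term reduces to (255/256)^n, which is computed here by
repeated multiplication so that no math library is needed.

```c
#include <assert.h>

/* Probability that none of n loose objects lands in one given bin out of 256. */
static double prob_sample_dir_empty_sketch(int n)
{
	double p = 1.0;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		p *= 255.0 / 256.0;
	return p;
}
```

For n = 157 this comes out at roughly 0.54, matching the number above.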

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-08 19:17                 ` Stefan Beller
@ 2018-10-09  1:03                   ` Junio C Hamano
  2018-10-09 17:37                     ` Stefan Beller
  0 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2018-10-09  1:03 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Ævar Arnfjörð Bjarmason, nico, nix, koreth,
	Linus Torvalds, git

Stefan Beller <sbeller@google.com> writes:

> On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Junio C Hamano <gitster@pobox.com> writes:
>
>> > ...
>> > by general public and I do not have to explain the choice to the
>> > general public ;-)
>>
>> One thing that is more important than "why not 00 but 17?" to answer
>> is why a hardcoded number rather than a runtime random.  It is for
>> repeatability.
>
> Let's talk about repeatability vs statistics for a second. ;-)

Oh, I think I misled you by saying "more important".  

I didn't mean that it is more important to stick to the "use
hardcoded value" design decision than sticking to "use 17".  I've
made sure that everybody would understand choosing any arbitrary byte
value other than "17" does not make the resulting Git any better nor
worse.  But discussing the design decision to use hardcoded value is
"more important", as that affects the balance between the end-user
experience and debuggability, and I tried to help those who do not
know the history by giving the fact that choice was made for the
latter and not for other hidden reasons, that those who would
propose to change the system may have to keep in mind.

Sorry if you mistook it as if I were saying that it is important to
keep the design to use a hardcoded byte value.  That wasn't what the
message was about.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-08 10:36               ` Ævar Arnfjörð Bjarmason
@ 2018-10-09  1:07                 ` Junio C Hamano
  2018-10-09 17:40                   ` Stefan Beller
  0 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2018-10-09  1:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Nicolas Pitre, Nix, Steven Grimm, Linus Torvalds,
	Git Mailing List, Christian Couder

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Depending on how we're counting there's at least two.

I thought you were asking "why the special sentinel is not 0{40}?"
You counted the number of reasons why 0{40} is used to stand in for
a real value, but that was the number I didn't find interesting in
the scope of this discussion, i.e. "why the special sample is 17?"

I vaguely recall we also used 0{39}1 for something else long time
ago; I offhand do not recall if we still do, or we got rid of it.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: What's so special about objects/17/ ?
  2018-10-09  1:03                   ` Junio C Hamano
@ 2018-10-09 17:37                     ` Stefan Beller
  2018-10-10  1:10                       ` Junio C Hamano
  0 siblings, 1 reply; 97+ messages in thread
From: Stefan Beller @ 2018-10-09 17:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, nico, Nick Alcock, koreth,
	Linus Torvalds, git

On Mon, Oct 8, 2018 at 6:03 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Stefan Beller <sbeller@google.com> writes:
>
> > On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>
> >> Junio C Hamano <gitster@pobox.com> writes:
> >
> >> > ...
> >> > by general public and I do not have to explain the choice to the
> >> > general public ;-)
> >>
> >> One thing that is more important than "why not 00 but 17?" to answer
> >> is why a hardcoded number rather than a runtime random.  It is for
> >> repeatability.
> >
> > Let's talk about repeatability vs statistics for a second. ;-)
>
> Oh, I think I misled you by saying "more important".
>
> I didn't mean that it is more important to stick to the "use
> hardcoded value" design decision than sticking to "use 17".  I've
> made sure that everybody would understand choosing any arbitrary byte
> value other than "17" does not make the resulting Git any better nor
> worse.

Yes, I totally get that. We could have chosen 42 just because.


>  But discussing the design decision to use hardcoded value is
> "more important", as that affects the balance between the end-user
> experience and debuggability, and I tried to help those who do not
> know the history by giving the fact that choice was made for the
> latter and not for other hidden reasons, that those who would
> propose to change the system may have to keep in mind.

From an end user's point of view, the auto gc kicks in at random.
(Maybe it's just me, but I don't keep track of the loose object count ;-)

For debuggability, we could design a system that allows for debugging,
e.g. "When GIT_AUTO_GC_BIN is set, use the number as set, otherwise
take a random slot".
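
A minimal sketch of that idea, in Python rather than Git's C, might look
like the following. Everything here is an assumption made for
illustration: GIT_AUTO_GC_BIN is only the hypothetical variable named
above (Git does not read it), the helper names are made up, and the value
is assumed to be a hex directory name such as "17". The estimate samples
one fan-out directory and scales by 256, which mirrors the sampling idea
discussed in this thread and is not a copy of Git's code.

```python
import os
import random

HEX_DIGITS = set("0123456789abcdef")


def pick_sample_bin(env=None):
    """Return the two-hex-digit fan-out directory name to sample.

    GIT_AUTO_GC_BIN is a hypothetical override, not a real Git variable.
    When it is set, it is parsed as hex so that runs are repeatable;
    otherwise one of the 256 slots is picked at random.
    """
    env = os.environ if env is None else env
    override = env.get("GIT_AUTO_GC_BIN")
    if override is not None:
        value = int(override, 16)
        if not 0 <= value <= 0xFF:
            raise ValueError(f"bin out of range: {override!r}")
        return f"{value:02x}"
    return f"{random.randrange(256):02x}"


def estimate_loose_objects(objects_dir, sample_bin):
    """Estimate the loose object count from a single fan-out directory.

    Counts entries whose names are 38 hex characters (a SHA-1 name minus
    the two-character directory prefix) and multiplies by 256, assuming
    object names are spread evenly over the 256 directories.
    """
    path = os.path.join(objects_dir, sample_bin)
    try:
        names = os.listdir(path)
    except FileNotFoundError:
        return 0
    count = sum(
        1 for name in names
        if len(name) == 38 and set(name) <= HEX_DIGITS
    )
    return count * 256
```

With the override set, the same slot is inspected on every run, which
keeps the debugging property; without it, each run samples a different
slot.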

> Sorry if you mistook it as if I were saying that it is important to
> keep the design to use a hardcoded byte value.  That wasn't what the
> message was about.

I understood very well that the choice of value was arbitrary and you
do not have a convincing story as to why 17 (and not, say, 23), but such
a story is not required, as all slots are equal from a design perspective.

I do challenge the decision to take a hardcoded value, though, as a
random slot yields better properties for the end users IMHO, whereas
debugging this specific case does not seem to be important to me.

* Re: What's so special about objects/17/ ?
  2018-10-09  1:07                 ` Junio C Hamano
@ 2018-10-09 17:40                   ` Stefan Beller
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Beller @ 2018-10-09 17:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, nico, Nick Alcock, koreth,
	Linus Torvalds, git, Christian Couder

On Mon, Oct 8, 2018 at 6:07 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > Depending on how we're counting there's at least two.
>
> I thought you were asking "why the special sentinel is not 0{40}?"
> You counted the number of reasons why 0{40} is used to stand in for
> a real value, but that was the number I didn't find interesting in
> the scope of this discussion, i.e. "why the special sample is 17?"
>
> I vaguely recall we also used 0{39}1 for something else a long time
> ago; I offhand do not recall if we still do, or we got rid of it.

gitk still shows changes added to the index as 0{39}1, whereas
changes not added yet are marked as 0{40}.

* Re: What's so special about objects/17/ ?
  2018-10-09 17:37                     ` Stefan Beller
@ 2018-10-10  1:10                       ` Junio C Hamano
  2018-10-10 19:08                         ` Stefan Beller
  0 siblings, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2018-10-10  1:10 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Ævar Arnfjörð Bjarmason, nico, Nick Alcock, koreth,
	Linus Torvalds, git

Stefan Beller <sbeller@google.com> writes:
>> Oh, I think I misled you by saying "more important".
>> ...
> I do challenge the decision to take a hardcoded value, though, ...

I do not find any reason why you need to say "though" here.  If you
understood from the message you are responding to that the use of a
hardcoded value was chosen not to help the end-user experience, it
should have been clear that we are in agreement.

I also sometimes find certain people here are unnecessarily
combative in their discussion.  Is this just some language issue?

* Re: What's so special about objects/17/ ?
  2018-10-10  1:10                       ` Junio C Hamano
@ 2018-10-10 19:08                         ` Stefan Beller
  0 siblings, 0 replies; 97+ messages in thread
From: Stefan Beller @ 2018-10-10 19:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, nico, Nick Alcock, koreth,
	Linus Torvalds, git

On Tue, Oct 9, 2018 at 6:10 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Stefan Beller <sbeller@google.com> writes:
> >> Oh, I think I misled you by saying "more important".
> >> ...
> > I do challenge the decision to take a hardcoded value, though, ...
>
> I do not find any reason why you need to say "though" here.

I caught myself using lots of filler-words lately.

  Though, however, I think, I would guess, IMHO....
  fills a lot of space without saying much.

I'll reduce that.

>  If you
> understood from the message you are responding to that the use of a
> hardcoded value was chosen not to help the end-user experience, it
> should have been clear that we are in agreement.

We are, but for different reasons.

> I also sometimes find certain people here are unnecessarily
> combative in their discussion.  Is this just some language issue?

certain people? ;-)
I have issues with ambiguity in communication directed towards me,
which is why I sometimes try to be very direct and blunt.
Other times I thrive on ambiguity as well (mostly in my reviews).

end of thread, other threads:[~2018-10-10 19:08 UTC | newest]

Thread overview: 97+ messages
2007-09-05  7:09 People unaware of the importance of "git gc"? Linus Torvalds
2007-09-05  7:21 ` Martin Langhoff
2007-09-05  7:37   ` Karl Hasselström
2007-09-05  7:30 ` Junio C Hamano
2007-09-05  7:26   ` Tomash Brechko
2007-09-05  8:13   ` Johan Herland
2007-09-05  8:39     ` Matthieu Moy
2007-09-05  8:41       ` Johan Herland
2007-09-05  8:47         ` David Kastrup
2007-09-05  8:51       ` Pierre Habouzit
2007-09-05  9:02         ` David Kastrup
2007-09-05  9:04         ` Matthieu Moy
2007-09-05  8:51   ` Wincent Colaiuta
2007-09-05  7:42 ` Pierre Habouzit
2007-09-05  8:16   ` Junio C Hamano
2007-09-05  8:50   ` Steven Grimm
     [not found]     ` <86ps0xcwxo.fsf@lola.quinscape.zz>
2007-09-05  9:07       ` Steven Grimm
2007-09-05  9:13         ` David Kastrup
2007-09-05  9:07     ` Junio C Hamano
2007-09-05  9:27       ` Martin Langhoff
2007-09-05  9:33         ` Matthieu Moy
2007-09-05 14:17           ` Johan De Messemaeker
2007-09-05 17:31             ` Matthieu Moy
2007-09-05 23:56               ` Jeff King
2007-09-05  9:13     ` David Kastrup
2007-09-05  9:14     ` Pierre Habouzit
2007-09-05 17:51   ` Nix
2007-09-05 18:14     ` Steven Grimm
2007-09-05 18:22       ` Nix
2007-09-05 18:54         ` Nicolas Pitre
2007-09-05 20:01           ` Junio C Hamano
2007-09-05 20:35             ` Nicolas Pitre
2007-09-05 21:14               ` Nix
2007-09-05 21:46               ` Junio C Hamano
2007-09-05 23:04                 ` Nicolas Pitre
2007-09-05 23:42                   ` Junio C Hamano
2007-09-06  0:27                     ` Carlos Rica
2007-09-06  5:55                 ` David Kastrup
2007-09-05 21:49               ` Junio C Hamano
2007-09-05 21:59                 ` Invoke "git gc --auto" from commit, merge, am and rebase Junio C Hamano
2007-09-06  2:39                   ` Shawn O. Pearce
2007-09-05 20:37             ` [PATCH] Invoke "git gc --auto" from "git add" and "git fetch" Junio C Hamano
     [not found]               ` <69b0c0350709051357ifa547aarfe3e0b36cf9be98f@mail.gmail.com>
2007-09-05 20:59                 ` Fwd: " Govind Salinas
2007-09-06 12:02               ` Johannes Schindelin
2007-09-05 21:18             ` People unaware of the importance of "git gc"? Alex Riesen
2007-09-06  2:44             ` Russ Dill
2007-09-06  2:52               ` Shawn O. Pearce
2007-09-06  9:28               ` Andreas Ericsson
2007-09-06  2:45             ` Shawn O. Pearce
2007-09-06  2:49               ` Steven Grimm
2007-09-06  2:56                 ` Shawn O. Pearce
2007-09-06 15:54             ` Johannes Schindelin
2007-09-06 17:49               ` Junio C Hamano
2007-09-06 18:15                 ` Linus Torvalds
2007-09-06 18:29                   ` Steven Grimm
2007-09-06 23:12                   ` Subject: [PATCH] git-merge-pack Junio C Hamano
2007-09-06 23:35                     ` Linus Torvalds
2007-09-07  0:51                     ` Nicolas Pitre
2007-09-07  1:58                       ` Junio C Hamano
2007-09-07  2:32                         ` Nicolas Pitre
2007-09-07  4:07                       ` Shawn O. Pearce
2007-09-07  4:43                       ` Junio C Hamano
2007-09-08  9:50                         ` [PATCH] make sha1_file.c::matches_pack_name() available to others Junio C Hamano
2007-09-08 10:01                         ` [PATCH] pack-objects --repack-unpacked Junio C Hamano
2007-09-07  7:11                     ` Subject: [PATCH] git-merge-pack Johannes Sixt
2007-09-07  7:34                       ` Junio C Hamano
2007-09-07  7:24                     ` Andy Parkins
2007-09-07  4:48                 ` People unaware of the importance of "git gc"? Shawn O. Pearce
2007-09-07 10:12                 ` Johannes Schindelin
2018-10-07 18:28           ` What's so special about objects/17/ ? Ævar Arnfjörð Bjarmason
2018-10-07 18:35             ` Johannes Sixt
2018-10-07 19:06               ` Ævar Arnfjörð Bjarmason
2018-10-07 22:39                 ` Johannes Sixt
2018-10-08  0:54                   ` Junio C Hamano
2018-10-07 19:46             ` Junio C Hamano
2018-10-07 20:07               ` Junio C Hamano
2018-10-08 19:17                 ` Stefan Beller
2018-10-09  1:03                   ` Junio C Hamano
2018-10-09 17:37                     ` Stefan Beller
2018-10-10  1:10                       ` Junio C Hamano
2018-10-10 19:08                         ` Stefan Beller
2018-10-08 10:36               ` Ævar Arnfjörð Bjarmason
2018-10-09  1:07                 ` Junio C Hamano
2018-10-09 17:40                   ` Stefan Beller
2007-09-05  8:16 ` People unaware of the importance of "git gc"? David Kastrup
2007-09-05 16:47 ` Govind Salinas
2007-09-05 17:19   ` Carl Worth
2007-09-05 17:55     ` Jing Xue
2007-09-05 17:35   ` Steven Grimm
2007-09-05 18:28     ` Nix
2007-09-05 17:44 ` J. Bruce Fields
2007-09-05 18:46   ` Brandon Casey
2007-09-05 19:09     ` David Kastrup
2007-09-05 19:13       ` J. Bruce Fields
2007-09-05 19:43         ` David Kastrup
2007-09-05 19:20       ` Mike Hommey
2007-09-05 21:07 ` Alex Riesen
