Adding a new file as if it had existed

git@vger.kernel.org mailing list mirror (one of many)

* Adding a new file as if it had existed
@ 2006-12-12 10:05 Bahadir Balban
  2006-12-12 10:13 ` Junio C Hamano
  2006-12-12 12:36 ` Jakub Narebski
  0 siblings, 2 replies; 11+ messages in thread
From: Bahadir Balban @ 2006-12-12 10:05 UTC (permalink / raw)
  To: git

Hi,

When I initialise a git repository, I use a subset of files in the
project and leave out irrelevant files for performance reasons. Then
when I need to make changes to a file not yet in the repository, the
file is treated as new, and if I reset the change or change branches
the file is gone.

Is there a good way of adding new files to git as if they had existed
from the initial commit (or even better, since a particular commit)?
This way I would only track the new changes I made to an existing
file.

Thanks,


* Re: Adding a new file as if it had existed
  2006-12-12 10:05 Adding a new file as if it had existed Bahadir Balban
@ 2006-12-12 10:13 ` Junio C Hamano
  2006-12-12 11:32   ` Bahadir Balban
  2006-12-12 12:36 ` Jakub Narebski
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2006-12-12 10:13 UTC (permalink / raw)
  To: Bahadir Balban; +Cc: git

"Bahadir Balban" <bahadir.balban@gmail.com> writes:

> Is there a good way of adding new files to git as if they had existed
> from the initial commit (or even better, since a particular commit)?
> This way I would only track the new changes I made to an existing
> file.

No.

I do not understand why not adding all the files you care about
eventually anyway in the initial commit is needed for
"performance reasons", if you do not touch majority of them for
a long time.  Care to explain?



* Re: Adding a new file as if it had existed
  2006-12-12 10:13 ` Junio C Hamano
@ 2006-12-12 11:32   ` Bahadir Balban
  2006-12-12 12:07     ` Johannes Schindelin
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Bahadir Balban @ 2006-12-12 11:32 UTC (permalink / raw)
  To: git

On 12/12/06, Junio C Hamano <junkio@cox.net> wrote:
> No.
>
> I do not understand why not adding all the files you care about
> eventually anyway in the initial commit is needed for
> "performance reasons", if you do not touch majority of them for
> a long time.  Care to explain?

If I don't know which files I may be touching in the future for
implementing some feature, then I am obliged to add all the files even
if they are irrelevant. I said "performance reasons" assuming all the
file hashes need to be checked for every commit -a to see if they're
changed, but I just tried on a PIII and it seems not so slow.


* Re: Adding a new file as if it had existed
  2006-12-12 11:32   ` Bahadir Balban
@ 2006-12-12 12:07     ` Johannes Schindelin
  2006-12-12 12:26     ` Andy Parkins
  2006-12-12 18:31     ` Junio C Hamano
  2 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2006-12-12 12:07 UTC (permalink / raw)
  To: Bahadir Balban; +Cc: git

Hi,

On Tue, 12 Dec 2006, Bahadir Balban wrote:

> On 12/12/06, Junio C Hamano <junkio@cox.net> wrote:
> > No.
> > 
> > I do not understand why not adding all the files you care about
> > eventually anyway in the initial commit is needed for
> > "performance reasons", if you do not touch majority of them for
> > a long time.  Care to explain?
> 
> If I don't know which files I may be touching in the future for
> implementing some feature,

When I use an SCM, it is to track the revisions of a project. It seems you 
are content to have only parts of a revision? That does not make sense to 
me.

> I said "performance reasons" assuming all the file hashes need to be checked 
> for every commit -a to see if they're changed, but I just tried on a 
> PIII and it seems not so slow.

Bingo!

You just felt the consequences of the "index".

Ciao,
Dscho


* Re: Adding a new file as if it had existed
  2006-12-12 11:32   ` Bahadir Balban
  2006-12-12 12:07     ` Johannes Schindelin
@ 2006-12-12 12:26     ` Andy Parkins
  2006-12-12 13:20       ` Andreas Ericsson
  2006-12-12 18:31     ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: Andy Parkins @ 2006-12-12 12:26 UTC (permalink / raw)
  To: git

On Tuesday 2006 December 12 11:32, Bahadir Balban wrote:

> If I don't know which files I may be touching in the future for
> implementing some feature, then I am obliged to add all the files even
> if they are irrelevant. I said "performance reasons" assuming all the
> file hashes need to be checked for every commit -a to see if they're
> changed, but I just tried on a PIII and it seems not so slow.

Here's a handy rule of thumb I've learned in my use of git:

 "git is fast.  Really fast."

That'll stand you in good stead.  In my experience there is no operation in git 
that is slow.  I've got some trees that are for embedded work and hold the 
whole Linux kernel, often more than once.  Subversion, which I used 
previously, took literally hours to import the whole tree.  Git takes 
minutes.

As to your direct concern: git doesn't hash every file at every commit.  There 
is no need.  git has an "index" that is used to prepare a commit; at the time 
you do the actual commit, git already knows which files are being checked in.  
Linus uses git to manage the Linux kernel, and he has said before that he 
wanted a version control system that can do multiple commits /per second/.  
git can do that.
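
The reason a commit need not re-hash everything can be sketched in C. This
is only an illustration: the struct and function names are hypothetical and
far simpler than git's real index entry. It assumes that comparing cached
stat data (here just mtime and size) is enough to decide whether a file has
to be read and hashed again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified index entry; NOT git's real data structure. */
struct toy_index_entry {
	const char *path;
	int64_t mtime;     /* modification time recorded when the file was staged */
	int64_t size;      /* file size recorded at the same moment */
	uint32_t blob_id;  /* stand-in for the blob's object name */
};

/* Hypothetical snapshot of what stat() reports for the file right now. */
struct toy_stat {
	int64_t mtime;
	int64_t size;
};

/*
 * Return true when the cached metadata still matches the file on disk.
 * In that case the recorded blob_id can be reused without reading or
 * hashing the file's contents.
 */
bool toy_entry_unchanged(const struct toy_index_entry *e,
			 const struct toy_stat *st)
{
	assert(e && st);
	return e->mtime == st->mtime && e->size == st->size;
}

/* Count how many entries would actually have to be re-hashed. */
int toy_count_rehash(const struct toy_index_entry *entries,
		     const struct toy_stat *stats, int n)
{
	int need = 0;

	for (int i = 0; i < n; i++)
		if (!toy_entry_unchanged(&entries[i], &stats[i]))
			need++;
	return need;
}
```

With thousands of tracked files and a handful of edits, toy_count_rehash()
returns only the handful; the cost of the rest is one cheap comparison each.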

In short - don't worry about making life easy for git - it's a workhorse and 
does a grand job.

Andy
-- 
Dr Andy Parkins, M Eng (hons), MIEE


* Re: Adding a new file as if it had existed
  2006-12-12 10:05 Adding a new file as if it had existed Bahadir Balban
  2006-12-12 10:13 ` Junio C Hamano
@ 2006-12-12 12:36 ` Jakub Narebski
  1 sibling, 0 replies; 11+ messages in thread
From: Jakub Narebski @ 2006-12-12 12:36 UTC (permalink / raw)
  To: git

Bahadir Balban wrote:

> When I initialise a git repository, I use a subset of files in the
> project and leave out irrelevant files for performance reasons. Then
> when I need to make changes to a file not yet in the repository, the
> file is treated as new, and if I reset the change or change branches
> the file is gone.
> 
> Is there a good way of adding new files to git as if they had existed
> from the initial commit (or even better, since a particular commit)?
> This way I would only track the new changes I made to an existing
> file.

Generally, it is not possible without rewriting history. In git (in any
sane SCM) commits are atomic; there is no CVS-like bunch of per-file
histories. You can use cg-admin-rewritehist from Cogito (an alternate UI
for git)... but, as was said elsewhere in this thread, git is fast. And the
rule of thumb: check first, then optimize.
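
Why adding a file to an old commit means rewriting history can be shown
with a toy model in C. Everything here is an assumption made for
illustration: the 32-bit mixing function stands in for SHA-1, and a commit
is reduced to a parent id plus a tree id. The only property that matters is
that a commit's id depends on its parent's id, so changing an early tree
changes every later id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy commit id: a simple invertible mix of parent id and tree id.
 * This is NOT SHA-1 and NOT how git serializes commits.
 */
uint32_t toy_commit_id(uint32_t parent_id, uint32_t tree_id)
{
	uint32_t h = (parent_id ^ 0x9e3779b9u) * 2654435761u;

	h ^= tree_id;
	h *= 2246822519u;
	return h;
}

/* Compute ids for a linear history; ids[i] is the id of commit i. */
void toy_history_ids(const uint32_t *tree_ids, size_t n, uint32_t *ids)
{
	uint32_t parent = 0; /* the root commit has no parent */

	assert(tree_ids && ids);
	for (size_t i = 0; i < n; i++) {
		ids[i] = toy_commit_id(parent, tree_ids[i]);
		parent = ids[i];
	}
}
```

Calling toy_history_ids() on two histories that differ only in the first
tree yields different ids at every position, whereas a change to the last
tree leaves all earlier ids untouched.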

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: Adding a new file as if it had existed
  2006-12-12 12:26     ` Andy Parkins
@ 2006-12-12 13:20       ` Andreas Ericsson
  0 siblings, 0 replies; 11+ messages in thread
From: Andreas Ericsson @ 2006-12-12 13:20 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Andy Parkins wrote:
> On Tuesday 2006 December 12 11:32, Bahadir Balban wrote:
> 
>> If I don't know which files I may be touching in the future for
>> implementing some feature, then I am obliged to add all the files even
>> if they are irrelevant. I said "performance reasons" assuming all the
>> file hashes need to be checked for every commit -a to see if they're
>> changed, but I just tried on a PIII and it seems not so slow.
> 
> Here's a handy rule of thumb I've learned in my use of git:
> 
>  "git is fast.  Really fast."
> 

Almost alarmingly so. When I started using git (back in May/June last 
year, when git was 2 - 3 months old), I was worried at first because it 
didn't seem to actually *do* anything, but just returned me to the 
prompt immediately.

> 
> As to your direct concern: git doesn't hash every file at every commit.  There 
> is no need.  git has an "index" that is used to prepare a commit; at the time 
> you do the actual commit, git already knows which files are being checked in.  
> 
> In short - don't worry about making life easy for git - it's a workhorse and 
> does a grand job.
> 

Yup. Now I've gone the other way around and think other SCMs are broken 
when they chew disk for 10 seconds whenever I try to do anything with 
them. I usually end up importing the other repo into git and do my work 
there.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se


* Re: Adding a new file as if it had existed
  2006-12-12 11:32   ` Bahadir Balban
  2006-12-12 12:07     ` Johannes Schindelin
  2006-12-12 12:26     ` Andy Parkins
@ 2006-12-12 18:31     ` Junio C Hamano
  2006-12-13  9:40       ` Andreas Ericsson
  2 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2006-12-12 18:31 UTC (permalink / raw)
  To: Bahadir Balban; +Cc: git, Johannes Schindelin, Andy Parkins, Andreas Ericsson

"Bahadir Balban" <bahadir.balban@gmail.com> writes:

> ... I said "performance reasons" assuming all the
> file hashes need to be checked for every commit -a to see if they're
> changed, but I just tried on a PIII and it seems not so slow.

Ok.

Other people have already addressed the fear about the 'commit' case, so
I hope you are happier.

There is one thing we could further optimize, though.

Switching branches with 100k blobs in a commit, even when only a
handful of paths differ between the branches, would still need to
populate the index by reading two trees and collapsing them into a
single stage.  In theory, we should be able to do a lot better if
the two-tree case of read-tree took advantage of cache-tree
information.  If ce_match_stat() says Ok for all paths in a
subdirectory and the cached tree object name for that subdirectory
in the index matches what we are reading from the new tree, we
should be able to skip reading that subdirectory (and its
subdirectories) from the new tree object at all.
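
The skip condition described above might look roughly like this. It is a
sketch under stated assumptions, not read-tree code: struct toy_subdir and
both functions are hypothetical, the boolean all_paths_stat_clean stands in
for running ce_match_stat() over every path in the subdirectory, and object
names are shortened to 32-bit ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of one subdirectory as seen from the index. */
struct toy_subdir {
	bool all_paths_stat_clean; /* every path passed a ce_match_stat()-style check */
	bool cached_tree_valid;    /* a cached tree object name is recorded and valid */
	uint32_t cached_tree_id;   /* stand-in for that cached object name */
};

/*
 * A subdirectory can be skipped only when nothing in it is modified
 * and the tree being read is the very tree the index already holds.
 */
bool toy_can_skip_subdir(const struct toy_subdir *dir, uint32_t new_tree_id)
{
	assert(dir);
	return dir->all_paths_stat_clean &&
	       dir->cached_tree_valid &&
	       dir->cached_tree_id == new_tree_id;
}

/* Count the subdirectories that still have to be read from the new tree. */
int toy_count_subdirs_to_read(const struct toy_subdir *dirs,
			      const uint32_t *new_tree_ids, int n)
{
	int to_read = 0;

	for (int i = 0; i < n; i++)
		if (!toy_can_skip_subdir(&dirs[i], new_tree_ids[i]))
			to_read++;
	return to_read;
}
```

When two branches differ in a handful of paths, most subdirectories satisfy
all three conditions, so the work shrinks to the few that do not.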

Anybody interested in giving it a try?


* Re: Adding a new file as if it had existed
  2006-12-12 18:31     ` Junio C Hamano
@ 2006-12-13  9:40       ` Andreas Ericsson
  2006-12-13 15:46         ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Ericsson @ 2006-12-13  9:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Bahadir Balban, git, Johannes Schindelin, Andy Parkins

Junio C Hamano wrote:
> "Bahadir Balban" <bahadir.balban@gmail.com> writes:
> 
> There is one thing we could further optimize, though.
> 
> Switching branches with 100k blobs in a commit even when there
> are a handful paths different between the branches would still
> need to populate the index by reading two trees and collapsing
> them into a single stage.  In theory, we should be able to do a
> lot better if two-tree case of read-tree took advantage of
> cache-tree information.  If ce_match_stat() says Ok for all
> paths in a subdirectory and the cached tree object name for that
> subdirectory in the index match what we are reading from the new
> tree, we should be able to skip reading that subdirectory (and
> its subdirectories) from the new tree object at all.
> 
> Anybody interested to give it a try?
> 

I'm not well-versed enough in git internals to have high hopes of 
making something useful of it, but if you give me a pointer to where to 
start I'd be happy to try, and perhaps learn something in the process.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se


* Re: Adding a new file as if it had existed
  2006-12-13  9:40       ` Andreas Ericsson
@ 2006-12-13 15:46         ` Johannes Schindelin
  2006-12-13 15:52           ` Andreas Ericsson
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2006-12-13 15:46 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Junio C Hamano, Bahadir Balban, git, Andy Parkins

Hi,

On Wed, 13 Dec 2006, Andreas Ericsson wrote:

> Junio C Hamano wrote:
> > "Bahadir Balban" <bahadir.balban@gmail.com> writes:
> > 
> > There is one thing we could further optimize, though.
> > 
> > Switching branches with 100k blobs in a commit even when there
> > are a handful paths different between the branches would still
> > need to populate the index by reading two trees and collapsing
> > them into a single stage.  In theory, we should be able to do a
> > lot better if two-tree case of read-tree took advantage of
> > cache-tree information.  If ce_match_stat() says Ok for all
> > paths in a subdirectory and the cached tree object name for that
> > subdirectory in the index match what we are reading from the new
> > tree, we should be able to skip reading that subdirectory (and
> > its subdirectories) from the new tree object at all.
> > 
> > Anybody interested to give it a try?
> > 
> 
> I'm not well-versed enough in git internals to have my hopes high of 
> making something useful of it, but if you give me a pointer of where to 
> start I'd be happy to try, and perhaps learn something in the process.

Okay, I'll have a stab at explaining it.

For huge working directories, you usually have a huge number of trees. The 
idea of cache_tree is to remember not only the stat information of the 
blobs in the index, but to cache the hashes of the trees also (until they 
are invalidated, e.g. by an update-index). This avoids recalculation of 
the hashes when committing.
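
As a rough illustration of caching tree hashes until they are invalidated,
here is a toy model in C. None of it is git's cache-tree code: struct
toy_tree_cache and its functions are hypothetical, and hashes are shortened
to 32-bit ids. Propagating invalidation to parent directories is an
assumption of this sketch, based on the fact that a directory's tree hash
covers the hashes of its subdirectories.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node, one per directory; NOT git's struct cache_tree. */
struct toy_tree_cache {
	struct toy_tree_cache *parent; /* NULL for the top-level directory */
	bool valid;                    /* is tree_id still trustworthy? */
	uint32_t tree_id;              /* stand-in for the cached tree hash */
};

/* Record a freshly computed tree hash for this directory. */
void toy_cache_store(struct toy_tree_cache *node, uint32_t tree_id)
{
	assert(node);
	node->tree_id = tree_id;
	node->valid = true;
}

/* A path under this directory changed: it and all its ancestors go stale. */
void toy_cache_invalidate(struct toy_tree_cache *node)
{
	for (; node; node = node->parent)
		node->valid = false;
}

/* Return true and fill *out when the cached hash can be reused as-is. */
bool toy_cache_lookup(const struct toy_tree_cache *node, uint32_t *out)
{
	assert(node && out);
	if (!node->valid)
		return false;
	*out = node->tree_id;
	return true;
}
```

After an update touches one directory, only that directory and its
ancestors need their hashes recomputed at commit time; sibling directories
keep their cached values.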

This cache is accessible by the global variable active_cache_tree. It is 
best accessed by the function cache_tree_find(), which you call like this:

	struct cache_tree *ct = cache_tree_find(active_cache_tree, path);

where the variable "path" may contain slashes. The SHA1 of the 
corresponding tree is in ct->sha1, and you can check if the hash is still 
valid by asking

	if (cache_tree_fully_valid(ct))
		/* still valid */

AFAIU Junio would like to take the shortcut of doing nothing at all when 
(twoway) reading a tree whose hash is identical to the hash stored in the 
corresponding cache_tree _and_ when the cache is still fully valid.

Ciao,
Dscho


* Re: Adding a new file as if it had existed
  2006-12-13 15:46         ` Johannes Schindelin
@ 2006-12-13 15:52           ` Andreas Ericsson
  0 siblings, 0 replies; 11+ messages in thread
From: Andreas Ericsson @ 2006-12-13 15:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Bahadir Balban, git, Andy Parkins

Johannes Schindelin wrote:
> Hi,
> 
> On Wed, 13 Dec 2006, Andreas Ericsson wrote:
> 
>> Junio C Hamano wrote:
>>> "Bahadir Balban" <bahadir.balban@gmail.com> writes:
>>>
>>> There is one thing we could further optimize, though.
>>>
>>> Switching branches with 100k blobs in a commit even when there
>>> are a handful paths different between the branches would still
>>> need to populate the index by reading two trees and collapsing
>>> them into a single stage.  In theory, we should be able to do a
>>> lot better if two-tree case of read-tree took advantage of
>>> cache-tree information.  If ce_match_stat() says Ok for all
>>> paths in a subdirectory and the cached tree object name for that
>>> subdirectory in the index match what we are reading from the new
>>> tree, we should be able to skip reading that subdirectory (and
>>> its subdirectories) from the new tree object at all.
>>>
>>> Anybody interested to give it a try?
>>>
>> I'm not well-versed enough in git internals to have my hopes high of 
>> making something useful of it, but if you give me a pointer of where to 
>> start I'd be happy to try, and perhaps learn something in the process.
> 
> Okay, I'll have a stab at explaining it.
> 
> For huge working directories, you usually have a huge number of trees. The 
> idea of cache_tree is to remember not only the stat information of the 
> blobs in the index, but to cache the hashes of the trees also (until they 
> are invalidated, e.g. by an update-index). This avoids recalculation of 
> the hashes when committing.
> 
> This cache is accessible by the global variable active_cache_tree. It is 
> best accessed by the function cache_tree_find(), which you call like that:
> 
> 	struct cache_tree *ct = cache_tree_find(active_cache_tree, path);
> 
> where the variable "path" may contain slashes. The SHA1 of the 
> corresponding tree is in ct->sha1, and you can check if the hash is still 
> valid by asking
> 
> 	if (cache_tree_fully_valid(ct))
> 		/* still valid */
> 
> AFAIU Junio would like to take the shortcut of doing nothing at all when 
> (twoway) reading a tree whose hash is identical to the hash stored in the 
> corresponding cache_tree _and_ when the cache is still fully valid.
> 

Seems you wrote half the code for me already. :)

Thanks for the excellent explanation. I'll see if I can grok it further 
tonight.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se


