* CAREFUL! No more delta object support!
@ 2005-06-28  1:14 Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  1:14 UTC (permalink / raw)
  To: Git Mailing List

Some people may have noticed already (hopefully not the hard way) that the
current git code doesn't support delta objects lying around in the object
directory any more.

In other words, if you have delta objects, you need to un-deltify your
repository _before_ you upgrade your git binaries, or they won't be able
to read your objects any more.

The reason? The new git understands packed files natively, which ends up
being a much bigger win in many many ways.

You should be very careful about using packed files (since they are a
very recent addition), but what you can do to try them out is to do so in
a separate repository. Starting to use a packed repository is very simple
indeed, and here's what you need to do for git, for example:

In your regular "git" directory (once you have updated your git to a
recent version, in particular you need to have the "csum-file: fix missing
buf pointer update" commit), do:

	git-rev-list --objects HEAD | git-pack-objects --window=50 --depth=50 out

which will say something like "Packing 3741 objects" and result in two new
files a few seconds later:

	torvalds@ppc970:~/git> ls -lh out*
	-rw-r--r--  1 torvalds torvalds  89K Jun 27 17:59 out.idx
	-rw-r--r--  1 torvalds torvalds 1.3M Jun 27 17:59 out.pack

now, don't do anything with those files, but instead go and create a
directory somewhere else:

	cd ~
	mkdir packed-git-trial
	cd packed-git-trial
	git-init-db

you have now obviously created a totally empty repository. Now, let's
populate that empty repository with _just_ the pack files:

	mkdir .git/objects/pack
	mv ~/git/out.* .git/objects/pack

and then, move over your tags, in particular the HEAD pointer, with
something like

	cat ~/git/.git/HEAD > .git/HEAD

and voila, you're done.
Try "gitk", for example. Or "git log".

Now, what's even cooler is how you can just start using this packed tree:
feel free to do a test-commit or something, and notice how git starts
populating the empty .git/objects/xx/ subdirectories with new objects. But
it still relies on the pack-file for the old history.

Now, there's still a misfeature there, which is that when you create a new
object, it doesn't check whether that object already exists in the
pack-file, so you'll end up with a few recent objects that you really
don't need (notably tree objects), and we'll fix that eventually. But
notice how you started with a 17MB .git/objects/ directory in your
original tree, and you now have just a 1.3MB pack-file and a 90kB index
file that replaces all that?

There are some other issues too, like the fact that "git-fsck-cache"
doesn't know about the pack-files yet, so it will complain about missing
objects etc.

Also, please note that the pack-file _only_ packs the commits and the
things reachable from them: things like tags (and your references in your
.git/refs directory) need to be copied over separately.

So this is all very rough, still, but the basics do actually seem to work
(ie anything that doesn't look directly at the object files - which is
pretty much all of it except for fsck and the direct-filesystem-access
things like "rsync" and "git-local-pull").

Maybe you might not want to switch over yet, and as mentioned, rsync then
ends up not being a good way to sync (nor git-local-pull), but the
"git-http/ssh-pull" family should hopefully just work.

I've used a packed kernel tree too, so this has gotten _some_ testing even
on really quite big git trees.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
@ 2005-06-27 23:58 ` Christopher Li
  2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2 siblings, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-27 23:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Mon, Jun 27, 2005 at 06:14:40PM -0700, Linus Torvalds wrote:
>
> The reason? The new git understands packed files natively, which ends up
> being a much bigger win in many many ways.

Interesting. I take a look at your change, it still support delta object
inside the pack file right? For a second I am wondering you drop the delta
feature completely.

Chris

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-27 23:58 ` Christopher Li
@ 2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 10:38     ` Christopher Li
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  3:30 UTC (permalink / raw)
  To: Christopher Li; +Cc: Git Mailing List

On Mon, 27 Jun 2005, Christopher Li wrote:
> On Mon, Jun 27, 2005 at 06:14:40PM -0700, Linus Torvalds wrote:
> >
> > The reason? The new git understands packed files natively, which ends up
> > being a much bigger win in many many ways.
>
> Interesting. I take a look at your change, it still support delta object
> inside the pack file right? For a second I am wondering you drop the delta
> feature completely.

Deltas do exist inside pack-files, yes. They just don't exist as
independent objects any more, so you can never get into the situation that
you find a delta but you don't find the delta it points to.

Because in the pack-files, there are only deltas _within_ a pack-file. You
can't have a delta that points to outside the pack.

This means that pack-files with few objects will inevitably be larger than
they could otherwise be (ie you can never have a pack file that _only_
contains deltas to the outside world), but it's just incredibly reassuring
to me that a pack-file is always self-sufficient.

So when/if we start using pack-files for doing "git pull" etc, the
pack-file won't actually help pack things for small updates: small updates
will probably contain the whole changed file, unless the update has
several changes to the same file (which is not unusual, of course), in
which case it will only contain one version and then deltas from that.

But the savings get increasingly bigger the more history we have. That's
also why the packed git archive is about 1/14th of the size of the fully
unpacked disk usage of the git project, but a packed kernel archive "only"
achieves a packing rate of 1/5th of the fully unpacked kernel archive. The
git archive is all history, while the kernel archive just "appears", and
2/3 of the files have only one single version and thus don't
delta-compress at all.

(Another reason is probably that the kernel has bigger files, which means
that it thus has relatively less loss in filesystem block padding).

But not having any outside deltas not only makes me feel safer, it also
means that you can fully validate a pack archive consistency without even
knowing what project it is from - you can check the SHA1 results of every
file in the pack against the index of the pack, and check that the SHA1's
of the pack files themselves are valid. Again, this is just a data
_consistency_ check, of course - it means that you can validate that it
downloaded fine, and that you don't have disk corruption, but it doesn't
mean that the data isn't evil and nasty and buggy ;)

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread
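The "a pack-file is always self-sufficient" property from the message above can be sketched as a toy check in C. This is only an illustration under stated assumptions: `struct toy_entry` and `toy_pack_is_self_sufficient()` are hypothetical names, and the array-index layout is a stand-in that has nothing to do with git's real on-disk pack format. It shows the rule being described: every delta must resolve, through bases that all live in the same pack, to an object stored in full.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified pack entry: NOT git's real pack layout. */
struct toy_entry {
	int is_delta;	/* 1 if stored as a delta, 0 if stored in full */
	size_t base;	/* index of the base entry; only used when is_delta */
};

/*
 * Return 1 if every delta in the toy pack resolves, through bases that
 * all live inside the same pack, to a fully stored object; 0 otherwise.
 */
int toy_pack_is_self_sufficient(const struct toy_entry *e, size_t n)
{
	assert(e != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		size_t cur = i, steps = 0;
		while (e[cur].is_delta) {
			if (e[cur].base >= n)	/* base lies outside the pack */
				return 0;
			if (++steps > n)	/* cycle: never reaches a full object */
				return 0;
			cur = e[cur].base;
		}
	}
	return 1;
}
```

A pack that contained only deltas against outside objects would fail this check, which is the trade-off mentioned above: small packs end up larger, but each one can be validated on its own.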
* Re: CAREFUL! No more delta object support!
  2005-06-28  3:30   ` Linus Torvalds
@ 2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 11:06       ` Christopher Li
  2005-06-28 14:46       ` Jan Harkes
  2005-06-28 10:38     ` Christopher Li
  1 sibling, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  9:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> But the savings get increasingly bigger the more history we have. That's
LT> also why the packed git archive is about 1/14th of the size of the fully
LT> unpacked disk usage of the git project,...

GIT archive may be an odd-ball because the project itself is so
small, but a fair comparison should include the disk usage of 256
fan-out directories.  Counting them, empty .git/objects/ with
the 1.4MB packed archive and 90KB index file ends up being
somewhere around 2.4MB on my machine, compared with 17MB for the
traditional one.

Still a good space reduction.  Good job!

I am now dreaming if we someday would enhance the mechanism with
append-only updates to the *.pack files with complete rewrite of
the *.idx files, and get rid of files under .git/objects totally.

This would make things reasonably friendly to rsync.  The kernel
pack has around 60M pack with 1.1M index, so everyday use would
involve incremental updates to the pack [*1*] and full download
of the index file.


[Footnote]

*1* Presumably many objects are deltified against older objects
which is suboptimal.  Most likely the newer objects are accessed
far more often and they are what we would want to keep in full
not as delta.  So even with this scheme we would want to have
weekly repacking.  Interestingly enough, pack-objects gets the
objects via usual read_sha1_file() interface so it can produce a
new pack from an existing pack.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28  9:40     ` Junio C Hamano
@ 2005-06-28 11:06       ` Christopher Li
  2005-06-28 14:52         ` Petr Baudis
  2005-06-28 14:46       ` Jan Harkes
  1 sibling, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-28 11:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
> Still a good space reduction.  Good job!
>
> I am now dreaming if we someday would enhance the mechanism with
> append-only updates to the *.pack files with complete rewrite of
> the *.idx files, and get rid of files under .git/objects totally.

No offense my friend, this has been done. It's name is mercurial.

> This would make things reasonably friendly to rsync.  The kernel
> pack has around 60M pack with 1.1M index, so everyday use would
> involve incremental updates to the pack [*1*] and full download
> of the index file.

It still have other open issue. Now it would be harder to not sync
all the heads. If I just want the clean Linus-2.6 tree, I have to
dig it out from the pack file which mixing with other heads.

You could host different projects with it's own pack file. That
will lost the space saving on co-hosting projects.

So I am not convince rsync is the way to go in long run. You need
to have your own network syncing method.

>
> [Footnote]
>
> *1* Presumably many objects are deltified against older objects
> which is suboptimal.  Most likely the newer objects are accessed
> far more often and they are what we would want to keep in full
> not as delta.  So even with this scheme we would want to have
> weekly repacking.  Interestingly enough, pack-objects gets the
> objects via usual read_sha1_file() interface so it can produce a
> new pack from an existing pack.

It sounds like you are suggesting backward delta. Keeping the latest
node in full and using delta to access the old one. It should work
but it will lose the append only property.

Chris

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28 11:06       ` Christopher Li
@ 2005-06-28 14:52         ` Petr Baudis
  2005-06-28 16:35           ` Benjamin LaHaise
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Baudis @ 2005-06-28 14:52 UTC (permalink / raw)
  To: Christopher Li; +Cc: Junio C Hamano, Linus Torvalds, git

Dear diary, on Tue, Jun 28, 2005 at 01:06:25PM CEST, I got a letter
where Christopher Li <git@chrisli.org> told me that...
> On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> > >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
> > Still a good space reduction.  Good job!
> >
> > I am now dreaming if we someday would enhance the mechanism with
> > append-only updates to the *.pack files with complete rewrite of
> > the *.idx files, and get rid of files under .git/objects totally.
>
> No offense my friend, this has been done. It's name is mercurial.
>
> > This would make things reasonably friendly to rsync.  The kernel
> > pack has around 60M pack with 1.1M index, so everyday use would
> > involve incremental updates to the pack [*1*] and full download
> > of the index file.
>
> It still have other open issue. Now it would be harder to not sync
> all the heads. If I just want the clean Linus-2.6 tree, I have to
> dig it out from the pack file which mixing with other heads.
>
> You could host different projects with it's own pack file. That
> will lost the space saving on co-hosting projects.
>
> So I am not convince rsync is the way to go in long run. You need
> to have your own network syncing method.

I think the git-*-pull tools are actually just fine. You will only need
to have some server-side CGI gadget to frontend the file, but we need
that anyway to make the pull reasonably effective.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
<Espy> be careful, some twit might quote you out of context..

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28 14:52         ` Petr Baudis
@ 2005-06-28 16:35           ` Benjamin LaHaise
  2005-06-28 20:30             ` Petr Baudis
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin LaHaise @ 2005-06-28 16:35 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Christopher Li, Junio C Hamano, Linus Torvalds, git

On Tue, Jun 28, 2005 at 04:52:56PM +0200, Petr Baudis wrote:
> I think the git-*-pull tools are actually just fine. You will only need
> to have some server-side CGI gadget to frontend the file, but we need
> that anyway to make the pull reasonably effective.

Not really -- the use of rsync for the objects fails horribly on slow
links when the project scales in the number of commits.  The rsync
protocol has to transfer the names of each file and some information
about it, and that information isn't delta compressed.  This is where
kernel.org is falling over, as well as what makes the kernel tree very
painful to use over a dialup modem link.

		-ben
-- 
"Time is what keeps everything from happening all at once." -- John Wheeler

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28 16:35           ` Benjamin LaHaise
@ 2005-06-28 20:30             ` Petr Baudis
  0 siblings, 0 replies; 38+ messages in thread
From: Petr Baudis @ 2005-06-28 20:30 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Christopher Li, Junio C Hamano, Linus Torvalds, git

Dear diary, on Tue, Jun 28, 2005 at 06:35:51PM CEST, I got a letter
where Benjamin LaHaise <bcrl@kvack.org> told me that...
> On Tue, Jun 28, 2005 at 04:52:56PM +0200, Petr Baudis wrote:
> > I think the git-*-pull tools are actually just fine. You will only need
> > to have some server-side CGI gadget to frontend the file, but we need
> > that anyway to make the pull reasonably effective.
>
> Not really -- the use of rsync for the objects fails horribly on slow
> links when the project scales in the number of commits.  The rsync
> protocol has to transfer the names of each file and some information
> about it, and that information isn't delta compressed.  This is where
> kernel.org is falling over, as well as what makes the kernel tree very
> painful to use over a dialup modem link.

Yes. But isn't that what I'm after all saying too? git-*-pull tools
shouldn't have that problem since they have much less overhead and only
pull stuff you need.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
<Espy> be careful, some twit might quote you out of context..

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 11:06       ` Christopher Li
@ 2005-06-28 14:46       ` Jan Harkes
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Harkes @ 2005-06-28 14:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> I am now dreaming if we someday would enhance the mechanism with
> append-only updates to the *.pack files with complete rewrite of
> the *.idx files, and get rid of files under .git/objects totally.

Stop dreaming, please.

The current separate objects setup might not be space efficient, but it
has many other advantages.

- Objects are written only once, and from then on are only read.
  This works well on filesystems that provide session semantics, as
  opposed to unix semantics. And the resulting objects are perfectly
  cacheable since they are only invalidated if someone ever decides to
  pack the repository.

- The hierarchy and the way the objects directories are updated works
  very well in combination with AFS style directory acls. What surprised
  me was that subdirectories in refs/heads work perfectly with all the
  core git tools, branchnames simply become 'user/branch'.

- Objects that differ in content have different naming, as a result
  multiple developers can safely commit into a shared repository without
  requiring locks. This is also why it is safe to pull from another
  repository without clobbering your own history. Imagine if you
  appended some local changes to a packed archive and the next rsync
  wipes your local commits.

I've been trying to keep an up to date document on how (and why) I use
git on Coda. It started pretty much identical to jgarzik's HOWTO.
But it ended up a lot more complicated, to a point where I needed my own
scripts for just about every action. Until I discovered that the
alternate objects pool would work well in my environment.

    http://www.coda.cs.cmu.edu/git.html

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  9:40     ` Junio C Hamano
@ 2005-06-28 10:38     ` Christopher Li
  2005-06-28 16:45       ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-28 10:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

That is all nice improvement to address the space usage issue.

Should people just run repacking once a while or is it automatically
add new object to the pack file?

Chris

On Mon, Jun 27, 2005 at 08:30:22PM -0700, Linus Torvalds wrote:
>
> Deltas do exist inside pack-files, yes. They just don't exist as
> independent objects any more, so you can never get into the situation that
> you find a delta but you don't find the delta it points to.
>
> Because in the pack-files, there are only deltas _within_ a pack-file. You
> can't have a delta that points to outside the pack.
>
> This means that pack-files with few objects will inevitably be larger than
> they could otherwise be (ie you can never have a pack file that _only_
> contains deltas to the outside world), but it's just incredibly reassuring
> to me that a pack-file is always self-sufficient.
>
> So when/if we start using pack-files for doing "git pull" etc, the
> pack-file won't actually help pack things for small updates: small updates
> will probably contain the whole changed file, unless the update has
> several changes to the same file (which is not unusual, of course), in
> which case it will only contain one version and then deltas from that.
>
> But the savings get increasingly bigger the more history we have. That's
> also why the packed git archive is about 1/14th of the size of the fully
> unpacked disk usage of the git project, but a packed kernel archive "only"
> achieves a packing rate of 1/5th of the fully unpacked kernel archive. The
> git archive is all history, while the kernel archive just "appears", and
> 2/3 of the files have only one single version and thus don't delta-
> compress at all.
>
> (Another reason is probably that the kernel has bigger files, which means
> that it thus has relatively less loss in filesystem block padding).
>
> But not having any outside deltas not only makes me feel safer, it also
> means that you can fully validate a pack archive consistency without even
> knowing what project it is from - you can check the SHA1 results of every
> file in the pack against the index of the pack, and check that the SHA1's
> of the pack files themselves are valid. Again, this is just a data
> _consistency_ check, of course - it means that you can validate that it
> downloaded fine, and that you don't have disk corruption, but it doesn't
> mean that the data isn't evil and nasty and buggy ;)
>
> 		Linus
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support!
  2005-06-28 10:38     ` Christopher Li
@ 2005-06-28 16:45       ` Linus Torvalds
  2005-06-29  0:49         ` [PATCH] Emit base objects of a delta chain when the delta is output Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 16:45 UTC (permalink / raw)
  To: Christopher Li; +Cc: Git Mailing List

On Tue, 28 Jun 2005, Christopher Li wrote:
>
> That is all nice improvement to address the space usage issue.
>
> Should people just run repacking once a while or is it automatically
> add new object to the pack file?

While adding a new object to a pack file is _possible_ (you add it to the
end of the pack-file, and re-generate the index file), I would strongly
suggest against it for several reasons:

 - It's a lot more complex and expensive than just writing a new file.
   Much better to make the pack generation be an off-line thing, and make
   new object creation really cheap.

 - it has serious locking issues, and if something goes wrong you are just
   horribly screwed. This implies, for example, that to be safe you really
   have to use fsync() etc at every point (and be careful about writing
   the index), making the update even _more_ expensive. Over NFS you need
   to be extremely careful to make sure that everybody got the right lock,
   yadda yadda.

   Packing things off-line just means that _all_ of these problems go
   away.

 - There are operations that want to remove objects (I do that all the
   time: I do something stupid, and decide to undo it, or I just do a
   "git-update-cache" and notice that I need to do more work so I edit it
   some more and actually never commit the first version)

   If _adding_ to the file had some serious correctness issues, _removing_
   an object from a file is even worse. MUCH worse. Now you don't just
   have to lock against other people creating new objects, now you have to
   lock against updates (or totally re-write the whole big file and do an
   atomic "rename").

 - it can actually generate worse packing. The current "offline" method
   means that we can pack any version of a file against any other version
   of a file, and we do. We pick the closest version we can find, and we
   try to always pack against the bigger one (deletes are smaller deltas,
   and the biggest one tends to be the latest version, so this not only
   means that the delta is denser, it also means that the latest version -
   which is likely to be the biggest and most often used - tends to be
   non-delta).

   In contrast, updating the pack file means that you always write the
   latest version as a delta, which means that you're doing things
   _exactly_ the wrong way around both for performance and size.

 - Finally: packing allows us to optimize for locality. In particular, I
   write out the pack file in "recency" order, ie the top-most objects go
   first, and in particular, the "commit" objects go at the very top of
   the file. Why? Because it means that the commit objects (which are
   heavily used for the history generation by pretty much anything, since
   "git-rev-list" will access them) are packed together, and in the right
   order.

   Again, you can't do that if you do on-line updates as opposed to
   offline packing.

So the usage pattern I envision is to pack stuff maybe once a month
(depending on how much changes, of course), because then you really do get
the best of both worlds: the simplicity of individual objects for recent
work and the optimal packing and ordering that you can really work on for
the longer range case. And your project never grows very big.

Btw, I'm not claiming that my current pack format is "optimal" of course.
For example, while I write all objects in recency order, right now that
means that if a recent object has been written as a delta that depends on
an older one, I actually write the delta first (correct) but I won't write
the older object until its recency ordering (wrong). That kind of thing is
trivial to fix (eventually), but it's an example of where ordering matters
(ie if it's the other way around: the delta is the older object, it's
probably better to leave it at the end of the file, since it's probably
not going to be accessed much, making the effective packing at the head
more efficient).

It's also an example of the kinds of things we can do exactly because
we're doing the packing off-line.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread
* [PATCH] Emit base objects of a delta chain when the delta is output.
  2005-06-28 16:45       ` Linus Torvalds
@ 2005-06-29  0:49         ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-29  0:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> While adding a new object to a pack file is _possible_ (you add it to the
LT> end of the pack-file, and re-generate the index file), I would strongly
LT> suggest against it for several reasons:

OK, people have convinced me not to dream on ;-).

LT> Btw, I'm not claiming that my current pack format is "optimal" of course.
LT> For example, while I write all objects in recency order, right now that
LT> means that if a recent object has been written as a delta that depends on
LT> an older one, I actually write the delta first (correct) but I won't write
LT> the older object until its recency ordering (wrong).

I agree.  How does this one look?  Lightly tested by packing,
unpacking without -n and fsck'ing, not unpacking but placing it
under .git/objects/pack and running fsck with --full, all using
the current GIT repo.

------------
Deltas are useless by themselves and when you use them you need to get
to their base objects.  A base object should inherit recency from the
most recent deltified object that is based on it and that is what this
patch teaches git-pack-objects.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
cd /opt/packrat/playpen/public/in-place/git/git.junio/
jit-diff
# - master: Use enhanced diff_delta() in the similarity estimator.
# + (working tree)
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -118,6 +118,23 @@ static unsigned long write_object(struct
 	return hdrlen + datalen;
 }
 
+static unsigned long write_one(struct sha1file *f,
+			       struct object_entry *e,
+			       unsigned long offset)
+{
+	if (e->offset)
+		/* offset starts from header size and cannot be zero
+		 * if it is written already.
+		 */
+		return offset;
+	e->offset = offset;
+	offset += write_object(f, e);
+	/* if we are delitified, write out its base object. */
+	if (e->delta)
+		offset = write_one(f, e->delta, offset);
+	return offset;
+}
+
 static void write_pack_file(void)
 {
 	int i;
@@ -135,11 +152,9 @@ static void write_pack_file(void)
 	hdr.hdr_entries = htonl(nr_objects);
 	sha1write(f, &hdr, sizeof(hdr));
 	offset = sizeof(hdr);
-	for (i = 0; i < nr_objects; i++) {
-		struct object_entry *entry = objects + i;
-		entry->offset = offset;
-		offset += write_object(f, entry);
-	}
+	for (i = 0; i < nr_objects; i++)
+		offset = write_one(f, objects + i, offset);
+
 	sha1close(f, pack_file_sha1, 1);
 	mb = offset >> 20;
 	offset &= 0xfffff;

Compilation finished at Tue Jun 28 17:43:31

^ permalink raw reply	[flat|nested] 38+ messages in thread
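The ordering that the patch above aims for can be tried out on a toy model. What follows is a sketch under stated assumptions, not git's code: `struct toy_object`, `toy_write_one()` and `toy_write_pack()` are hypothetical names, an explicit `written` flag stands in for the patch's "nonzero offset means already written" test, and "writing" only records an id in an array. Objects are visited newest first, and each delta pulls its base chain forward so the base follows it immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an object queued for packing. */
struct toy_object {
	int id;				/* label recorded in the output order */
	struct toy_object *delta;	/* base this object is a delta against, or NULL */
	int written;			/* nonzero once emitted */
};

/* Emit obj, then its base chain; append ids to out[], return the new count. */
size_t toy_write_one(struct toy_object *obj, int *out, size_t count)
{
	if (obj->written)
		return count;
	obj->written = 1;
	out[count++] = obj->id;
	if (obj->delta)
		count = toy_write_one(obj->delta, out, count);
	return count;
}

/* Walk objs[] in recency order (index 0 is the most recent object). */
size_t toy_write_pack(struct toy_object *objs, size_t n, int *out)
{
	size_t count = 0;

	assert(objs != NULL && out != NULL);
	for (size_t i = 0; i < n; i++)
		count = toy_write_one(&objs[i], out, count);
	return count;
}
```

With four objects in recency order 0, 1, 2, 3, where object 0 is a delta against object 3, the recorded order becomes 0, 3, 1, 2: the old base inherits the recency of the delta that depends on it, and is not emitted a second time when the loop reaches it.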
* Re: CAREFUL! No more delta object support!
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
@ 2005-06-28  2:01 ` Junio C Hamano
  2005-06-28  2:03   ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  2:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Now, there's still a misfeature there, which is that when you create a new
LT> object, it doesn't check whether that object already exists in the
LT> pack-file, so you'll end up with a few recent objects that you really
LT> don't need (notably tree objects), and we'll fix that eventually.

Patch will be sent separately.

LT> ... Also, please note that the pack-file _only_ packs the commits
LT> and the things reachable from them ...

Shouldn't feeding "git-rev-list --object" output plus
handcrafted list of objects in 2.6.11 tree object to
git-pack-objects just work???

LT> Maybe you might not want to switch over yet, and as mentioned, rsync then
LT> ends up not being a good way to sync (nor git-local-pull), but the
LT> "git-http/ssh-pull" family should hopefully just work.

No.  The pull protocol Dan did expects to throw compressed
representation around on the wire (which is valid if you assume
uncompressed transfer) and does not use read-sha1-file --
write-sha1-file pair, so all three do not work.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* [PATCH] Skip writing out sha1 files for objects in packed git. 2005-06-28 2:01 ` CAREFUL! No more delta object support! Junio C Hamano @ 2005-06-28 2:03 ` Junio C Hamano 2005-06-28 2:43 ` Linus Torvalds 2005-06-28 2:13 ` CAREFUL! No more delta object support! Linus Torvalds 1 sibling, 1 reply; 38+ messages in thread From: Junio C Hamano @ 2005-06-28 2:03 UTC (permalink / raw To: Linus Torvalds; +Cc: git Now, there's still a misfeature there, which is that when you create a new object, it doesn't check whether that object already exists in the pack-file, so you'll end up with a few recent objects that you really don't need (notably tree objects), and this patch fixes it. Signed-off-by: Junio C Hamano <junkio@cox.net> --- apply.c | 2 +- cache.h | 2 +- commit-tree.c | 2 +- convert-cache.c | 6 +++--- mktag.c | 2 +- sha1_file.c | 44 ++++++++++++++++++++++++++++++-------------- unpack-objects.c | 4 ++-- update-cache.c | 2 +- write-tree.c | 2 +- 9 files changed, 41 insertions(+), 25 deletions(-) f4f76b275cdabc038bcb4f3c7ca0d443638df88d diff --git a/apply.c b/apply.c --- a/apply.c +++ b/apply.c @@ -1221,7 +1221,7 @@ static void add_index_file(const char *p if (lstat(path, &st) < 0) die("unable to stat newly created file %s", path); fill_stat_cache_info(ce, &st); - if (write_sha1_file(buf, size, "blob", ce->sha1) < 0) + if (write_sha1_file(buf, size, "blob", ce->sha1, 0) < 0) die("unable to create backing store for newly created file %s", path); if (add_cache_entry(ce, ADD_CACHE_OK_TO_ADD) < 0) die("unable to add cache entry for %s", path); diff --git a/cache.h b/cache.h --- a/cache.h +++ b/cache.h @@ -165,7 +165,7 @@ extern int parse_sha1_header(char *hdr, extern int sha1_object_info(const unsigned char *, char *, unsigned long *); extern void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size); extern void * read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size); -extern int write_sha1_file(void *buf, unsigned long 
len, const char *type, unsigned char *return_sha1); +extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1, int do_expand); extern int check_sha1_signature(const unsigned char *sha1, void *buf, unsigned long size, const char *type); diff --git a/commit-tree.c b/commit-tree.c --- a/commit-tree.c +++ b/commit-tree.c @@ -191,7 +191,7 @@ int main(int argc, char **argv) while (fgets(comment, sizeof(comment), stdin) != NULL) add_buffer(&buffer, &size, "%s", comment); - write_sha1_file(buffer, size, "commit", commit_sha1); + write_sha1_file(buffer, size, "commit", commit_sha1, 0); printf("%s\n", sha1_to_hex(commit_sha1)); return 0; } diff --git a/convert-cache.c b/convert-cache.c --- a/convert-cache.c +++ b/convert-cache.c @@ -111,7 +111,7 @@ static int write_subdirectory(void *buff buffer += len; } - write_sha1_file(new, newlen, "tree", result_sha1); + write_sha1_file(new, newlen, "tree", result_sha1, 0); free(new); return used; } @@ -251,7 +251,7 @@ static void convert_date(void *buffer, u memcpy(new + newlen, buffer, size); newlen += size; - write_sha1_file(new, newlen, "commit", result_sha1); + write_sha1_file(new, newlen, "commit", result_sha1, 0); free(new); } @@ -286,7 +286,7 @@ static struct entry * convert_entry(unsi memcpy(buffer, data, size); if (!strcmp(type, "blob")) { - write_sha1_file(buffer, size, "blob", entry->new_sha1); + write_sha1_file(buffer, size, "blob", entry->new_sha1, 0); } else if (!strcmp(type, "tree")) convert_tree(buffer, size, entry->new_sha1); else if (!strcmp(type, "commit")) diff --git a/mktag.c b/mktag.c --- a/mktag.c +++ b/mktag.c @@ -123,7 +123,7 @@ int main(int argc, char **argv) if (verify_tag(buffer, size) < 0) die("invalid tag signature file"); - if (write_sha1_file(buffer, size, "tag", result_sha1) < 0) + if (write_sha1_file(buffer, size, "tag", result_sha1, 0) < 0) die("unable to write tag file"); printf("%s\n", sha1_to_hex(result_sha1)); return 0; diff --git a/sha1_file.c 
b/sha1_file.c --- a/sha1_file.c +++ b/sha1_file.c @@ -891,31 +891,47 @@ void *read_object_with_reference(const u } } -int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1) +static char *write_sha1_file_prepare(void *buf, + unsigned long len, + const char *type, + unsigned char *sha1, + unsigned char *hdr, + int *hdrlen) { - int size; - unsigned char *compressed; - z_stream stream; - unsigned char sha1[20]; SHA_CTX c; - char *filename; - static char tmpfile[PATH_MAX]; - unsigned char hdr[50]; - int fd, hdrlen, ret; /* Generate the header */ - hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1; + *hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1; /* Sha1.. */ SHA1_Init(&c); - SHA1_Update(&c, hdr, hdrlen); + SHA1_Update(&c, hdr, *hdrlen); SHA1_Update(&c, buf, len); SHA1_Final(sha1, &c); + return sha1_file_name(sha1); +} + +int write_sha1_file(void *buf, unsigned long len, const char *type, + unsigned char *returnsha1, int do_expand) +{ + int size; + unsigned char *compressed; + z_stream stream; + unsigned char sha1[20]; + char *filename; + static char tmpfile[PATH_MAX]; + unsigned char hdr[50]; + int fd, hdrlen, ret; + + /* Normally if we have it in the pack then we do not bother writing + * it out into .git/objects/??/?{38} file. 
+ */ + filename = write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen); if (returnsha1) memcpy(returnsha1, sha1, 20); - - filename = sha1_file_name(sha1); + if (!do_expand && has_sha1_file(sha1)) + return 0; fd = open(filename, O_RDONLY); if (fd >= 0) { /* @@ -1082,7 +1098,7 @@ int index_fd(unsigned char *sha1, int fd if ((int)(long)buf == -1) return -1; - ret = write_sha1_file(buf, size, "blob", sha1); + ret = write_sha1_file(buf, size, "blob", sha1, 0); if (size) munmap(buf, size); return ret; diff --git a/unpack-objects.c b/unpack-objects.c --- a/unpack-objects.c +++ b/unpack-objects.c @@ -126,7 +126,7 @@ static int unpack_non_delta_entry(struct case 'B': type_s = "blob"; break; default: goto err_finish; } - if (write_sha1_file(buffer, size, type_s, sha1) < 0) + if (write_sha1_file(buffer, size, type_s, sha1, 1) < 0) die("failed to write %s (%s)", sha1_to_hex(entry->sha1), type_s); printf("%s %s\n", sha1_to_hex(sha1), type_s); @@ -223,7 +223,7 @@ static int unpack_delta_entry(struct pac die("failed to apply delta"); free(delta_data); - if (write_sha1_file(result, result_size, type, sha1) < 0) + if (write_sha1_file(result, result_size, type, sha1, 1) < 0) die("failed to write %s (%s)", sha1_to_hex(entry->sha1), type); free(result); diff --git a/update-cache.c b/update-cache.c --- a/update-cache.c +++ b/update-cache.c @@ -77,7 +77,7 @@ static int add_file_to_cache(char *path) free(target); return -1; } - if (write_sha1_file(target, st.st_size, "blob", ce->sha1)) + if (write_sha1_file(target, st.st_size, "blob", ce->sha1, 0)) return -1; free(target); break; diff --git a/write-tree.c b/write-tree.c --- a/write-tree.c +++ b/write-tree.c @@ -76,7 +76,7 @@ static int write_tree(struct cache_entry nr++; } - write_sha1_file(buffer, offset, "tree", returnsha1); + write_sha1_file(buffer, offset, "tree", returnsha1, 0); free(buffer); return nr; } ------------ ^ permalink raw reply [flat|nested] 38+ messages in thread
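The patch above does two things: it moves header generation into write_sha1_file_prepare(), and it lets write_sha1_file() return early when the object is already available. What follows is only a minimal sketch of those two steps, not the patch itself. `build_object_header` mirrors the `sprintf(...)+1` line from the diff; `should_write_loose_object` is a hypothetical helper name, and its `already_present` argument stands in for the result of has_sha1_file(), which the patch relies on to report packed objects as well as loose ones. The SHA1 step is left out so the sketch needs nothing beyond libc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the "<type> <len>" object header. The returned length includes
 * the terminating NUL, because that byte is hashed along with the rest
 * of the header (the "+1" in the patch). Returns -1 if hdr is too small.
 */
static int build_object_header(char *hdr, size_t hdrsize,
			       const char *type, unsigned long len)
{
	int n;

	assert(hdr && type);
	n = snprintf(hdr, hdrsize, "%s %lu", type, len);
	if (n < 0 || (size_t)n >= hdrsize)
		return -1;
	return n + 1;
}

/*
 * Hypothetical helper, not part of the patch: the decision taken at the
 * top of the new write_sha1_file(). A loose object is written when the
 * caller forces expansion, or when the object is not available yet.
 */
static int should_write_loose_object(int do_expand, int already_present)
{
	return do_expand || !already_present;
}
```

With `do_expand` set (the unpack-objects.c callers in the patch) the object is always written; every other caller passes 0 and skips objects that are already present.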
* Re: [PATCH] Skip writing out sha1 files for objects in packed git. 2005-06-28 2:03 ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano @ 2005-06-28 2:43 ` Linus Torvalds 2005-06-28 3:33 ` Junio C Hamano 0 siblings, 1 reply; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 2:43 UTC (permalink / raw To: Junio C Hamano; +Cc: git

On Mon, 27 Jun 2005, Junio C Hamano wrote:
>
> Now, there's still a misfeature there, which is that when you
> create a new object, it doesn't check whether that object
> already exists in the pack-file, so you'll end up with a few
> recent objects that you really don't need (notably tree
> objects), and this patch fixes it.
>
> Signed-off-by: Junio C Hamano <junkio@cox.net>

Actually, I don't think that "do_expand" flag should exist.

If we want to expand a packed file and really write the objects to the
.git/objects directories, we should just not have that packed file in the
.git/objects/pack directory.

And if we have a pack-file in .git/objects/ that already has the object,
that may not be the _same_ pack-file that we're expanding at all, so if
that pack file already has the object, then not writing it out is actually
the right thing to do.

That will also simplify your patch a bit. I'll fix it up.

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH] Skip writing out sha1 files for objects in packed git. 2005-06-28 2:43 ` Linus Torvalds @ 2005-06-28 3:33 ` Junio C Hamano 2005-06-28 15:45 ` Linus Torvalds 0 siblings, 1 reply; 38+ messages in thread From: Junio C Hamano @ 2005-06-28 3:33 UTC (permalink / raw To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> If we want to expand a packed file and really write the objects to the
LT> .git/objects directories, we should just not have that packed file in the
LT> .git/objects/pack directory.

What I was aiming for was this:

 (1) Introduce an interface to sha1_file.c that lets you say
     "use this file as one of the packs, although it is not
     under .git/objects/pack";

 (2) Introduce another interface to sha1_file.c that lets you
     enumerate the index entries for a given pack file.

 (3) Remove the unpacking logic from unpack-objects.c; instead
     call the above interfaces to register the pack and
     enumerate entries, and call read_sha1_file() followed by
     write_sha1_file() with do_expand repeatedly.

However, the infrastructure (1) and (2) may end up being a
special case only to support unpack-objects (and removing the
code duplication for unpacking), in which case what you suggest
would make more sense.

LT> And if we have a pack-file in .git/objects/ that already has
LT> the object, that may not be the _same_ pack-file that we're
LT> expanding at all, so if that pack file already has the
LT> object, then not writing it out is actually the right thing
LT> to do.

This I have to think about a bit.

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH] Skip writing out sha1 files for objects in packed git. 2005-06-28 3:33 ` Junio C Hamano @ 2005-06-28 15:45 ` Linus Torvalds 0 siblings, 0 replies; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 15:45 UTC (permalink / raw To: Junio C Hamano; +Cc: git

On Mon, 27 Jun 2005, Junio C Hamano wrote:
>
> LT> And if we have a pack-file in .git/objects/ that already has
> LT> the object, that may not be the _same_ pack-file that we're
> LT> expanding at all, so if that pack file already has the
> LT> object, then not writing it out is actually the right thing
> LT> to do.
>
> This I have to think about a bit.

The most trivial example is doing a "git pull" of a small pack-file
update. We probably don't want to leave it around as a pack-file (we'll
re-pack everything at some later date, but we also don't want to expand
the stuff we already have in our _real_ pack-file).

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 2:01 ` CAREFUL! No more delta object support! Junio C Hamano 2005-06-28 2:03 ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano @ 2005-06-28 2:13 ` Linus Torvalds 2005-06-28 2:32 ` Junio C Hamano ` (2 more replies) 1 sibling, 3 replies; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 2:13 UTC (permalink / raw To: Junio C Hamano; +Cc: git

On Mon, 27 Jun 2005, Junio C Hamano wrote:
>
> LT> ... Also, please note that the pack-file _only_ packs the commits
> LT> and the things reachable from them ...
>
> Shouldn't feeding "git-rev-list --objects" output plus
> handcrafted list of objects in 2.6.11 tree object to
> git-pack-objects just work???

You could do that. And yes, we can add support for "tag" objects too
(which the packing doesn't do at all right now). So this is not a
"fundamental" problem, it's just a practical one right now.

> > [.. git-ssh-pull hopefully working ..]
>
> No. The pull protocol Dan did expects to throw compressed
> representation around on the wire (which is valid if you assume
> uncompressed transfer) and does not use read-sha1-file --
> write-sha1-file pair, so all three do not work.

Fair enough. I'd prefer for the pull/push to push object packs around
anyway, so there's some more work there..

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 2:13 ` CAREFUL! No more delta object support! Linus Torvalds @ 2005-06-28 2:32 ` Junio C Hamano 2005-06-28 2:37 ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano 2005-06-28 2:48 ` CAREFUL! No more delta object support! Linus Torvalds 2005-06-28 5:09 ` Daniel Barkalow 2005-06-29 18:59 ` Linus Torvalds 2 siblings, 2 replies; 38+ messages in thread From: Junio C Hamano @ 2005-06-28 2:32 UTC (permalink / raw To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Fair enough. I'd prefer for the pull/push to push object packs around LT> anyway, so there's some more work there.. Yes, I'd prefer that too. By the way, you broke t/t0000 with the last commit. Now an empty GIT_OBJECT_DIRECTORY has 257 subdirectories. ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack 2005-06-28 2:32 ` Junio C Hamano @ 2005-06-28 2:37 ` Junio C Hamano 2005-06-28 2:48 ` CAREFUL! No more delta object support! Linus Torvalds 1 sibling, 0 replies; 38+ messages in thread From: Junio C Hamano @ 2005-06-28 2:37 UTC (permalink / raw To: Linus Torvalds; +Cc: git Some tests expected the directory not to exist by default. Updated git-init-db prepares it properly so adjust tests to match that behaviour. Signed-off-by: Junio C Hamano <junkio@cox.net> --- t/t0000-basic.sh | 6 +++--- t/t5300-pack-object.sh | 1 - 2 files changed, 3 insertions(+), 4 deletions(-) de500ab0379e4db18d1511cbe91ace106eee7830 diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh --- a/t/t0000-basic.sh +++ b/t/t0000-basic.sh @@ -28,11 +28,11 @@ test_expect_success \ '.git/objects should be empty after git-init-db in an empty repo.' \ 'cmp -s /dev/null should-be-empty' -# also it should have 256 subdirectories. 257 is counting "objects" +# also it should have 257 subdirectories. 258 is counting "objects" find .git/objects -type d -print >full-of-directories test_expect_success \ - '.git/objects should have 256 subdirectories.' \ - 'test $(wc -l < full-of-directories) = 257' + '.git/objects should have 257 subdirectories.' \ + 'test $(wc -l < full-of-directories) = 258' ################################################################ # Basics of the basics diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh --- a/t/t5300-pack-object.sh +++ b/t/t5300-pack-object.sh @@ -99,7 +99,6 @@ test_expect_success \ 'GIT_OBJECT_DIRECTORY=.git2/objects && export GIT_OBJECT_DIRECTORY && git-init-db && - mkdir .git2/objects/pack && cp test-1.pack test-1.idx .git2/objects/pack && { git-diff-tree --root -p $commit && while read object ------------ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 2:32 ` Junio C Hamano 2005-06-28 2:37 ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano @ 2005-06-28 2:48 ` Linus Torvalds 1 sibling, 0 replies; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 2:48 UTC (permalink / raw To: Junio C Hamano; +Cc: git On Mon, 27 Jun 2005, Junio C Hamano wrote: > > By the way, you broke t/t0000 with the last commit. Now an > empty GIT_OBJECT_DIRECTORY has 257 subdirectories. Yup, I noticed that. Fix pushed out (along with another one that was failing because it wanted to create the "pack" directory itself, and was unhappy when it already existed). Linus ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 2:13 ` CAREFUL! No more delta object support! Linus Torvalds 2005-06-28 2:32 ` Junio C Hamano @ 2005-06-28 5:09 ` Daniel Barkalow 2005-06-28 15:49 ` Linus Torvalds 2005-06-29 18:59 ` Linus Torvalds 2 siblings, 1 reply; 38+ messages in thread From: Daniel Barkalow @ 2005-06-28 5:09 UTC (permalink / raw To: Linus Torvalds; +Cc: Junio C Hamano, git

On Mon, 27 Jun 2005, Linus Torvalds wrote:

> > > [.. git-ssh-pull hopefully working ..]
> >
> > No. The pull protocol Dan did expects to throw compressed
> > representation around on the wire (which is valid if you assume
> > uncompressed transfer) and does not use read-sha1-file --
> > write-sha1-file pair, so all three do not work.
>
> Fair enough. I'd prefer for the pull/push to push object packs around
> anyway, so there's some more work there..

It shouldn't be hard to add; the main issue is determining when
transferring a pack file is a good idea, because it probably doesn't make
sense to transfer a pack file just because the source side has an object
that the target side wants in that pack. (If you pull from someone who
packed up the whole history of everything, which you already have, into a
file with one new commit, you'd be sad to get the huge thing; you really
want a little custom (or just limited) pack file.)

The ideal thing is probably to pick up some tricks from Mercurial in
figuring out what needs to be transferred, and have the source side write
a pack file directly to the connection, which the target side would then
save directly. I never worked out exactly what those tricks were, though.

The next trick would be to put something in place of cleverly-chosen
objects to specify what pack file they're in, so that the HTTP client
could find things from a packed repository. (Or we could just have an
option to unpack post-transfer.)

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 5:09 ` Daniel Barkalow @ 2005-06-28 15:49 ` Linus Torvalds 2005-06-28 16:21 ` Linus Torvalds 0 siblings, 1 reply; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 15:49 UTC (permalink / raw To: Daniel Barkalow; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Daniel Barkalow wrote:
>
> It shouldn't be hard to add; the main issue is determining when
> transferring a pack file is a good idea, because it probably doesn't make
> sense to transfer a pack file just because the source side has an object
> that the target side wants in that pack.

Oh, you'd never just transfer the whole big pack-file at all: you'd just
create a new one. And creating a new one is just a matter of finding the
common parent, and then doing

	git-rev-list --objects common..HEAD | git-pack-objects .git/tmp-pack

and then you send the result to the other side..

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 15:49 ` Linus Torvalds @ 2005-06-28 16:21 ` Linus Torvalds 2005-06-28 17:04 ` Daniel Barkalow 0 siblings, 1 reply; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 16:21 UTC (permalink / raw To: Daniel Barkalow; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
> Oh, you'd never just transfer the whole big pack-file at all: you'd just
> create a new one. And creating a new one is just a matter of finding the
> common parent, and then doing
>
>	git-rev-list --objects common..HEAD | git-pack-objects .git/tmp-pack
>
> and then you send the result to the other side..

To clarify: this also works with objects that are already in another
pack-file (now that Junio fixed the "get size of a deltified packed
entry"), so you can have any number of unpacked objects in your objects
directory, _and_ a pack-file (or several), and you can generate a new
temporary pack-file just for sending somewhere else that contains
arbitrary parts of that (ie a mix of objects that are in your "main"
packfiles and objects that are unpacked).

You don't have to use "git-rev-list" to generate the objects, btw:
git-pack-objects takes an arbitrary list of object IDs (plus a "packing
hint" in the form of a filename that is not required, but that can help
the packing heuristics, and that git-rev-list does provide).

I'll also fix up git-pack-objects to be able to pack tag objects (and the
unpacking to understand them), so that any valid object can be packed.
Right now it only handles the objects that git-rev-list knows about.

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 16:21 ` Linus Torvalds @ 2005-06-28 17:04 ` Daniel Barkalow 2005-06-28 17:36 ` Linus Torvalds 0 siblings, 1 reply; 38+ messages in thread From: Daniel Barkalow @ 2005-06-28 17:04 UTC (permalink / raw To: Linus Torvalds; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> I'll also fix up git-pack-objects to be able to pack tag objects (and the
> unpacking to understand them), so that any valid object can be packed.
> Right now it only handles the objects that git-rev-list knows about.

Actually, the ideal thing would be to move the packing code into an object
file that git-ssh-push can include; that way it can write directly to the
socket instead of going through disk, and it can also go from getting the
remote end's list of common ancestors to having a pack to send without
needing to exec a script.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 17:04 ` Daniel Barkalow @ 2005-06-28 17:36 ` Linus Torvalds 2005-06-28 18:17 ` Linus Torvalds 0 siblings, 1 reply; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 17:36 UTC (permalink / raw To: Daniel Barkalow; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Daniel Barkalow wrote:
>
> Actually, the ideal thing would be to move the packing code into an object
> file that git-ssh-push can include; that way it can write directly to the
> socket instead of going through disk

It doesn't work very easily that way because the index file (which
contains the object list and the offsets into the pack file) cannot be
created until after the pack file has been created (and we don't want to
evaluate that one in memory, since it can be quite big).

Now, what we could do is to stream out the pack file first to stdout, and
write the index file afterwards. But since we don't know how big the pack
file will be when we start packing, and the pack-file can contain
basically arbitrary patterns, that requires that the receiver actually
parse the pack-file as it comes in.

The format of the pack-file is a fairly trivial data stream of

 - rinse and repeat for each object:

    - one character of type of file (C, T, B, G, D for "commit", "tree",
      "blob", "tag" or "delta" respectively)

    - four bytes of network-order unpacked data length

    - [ if delta: 20 bytes of delta object ID ]

    - zlib-packed data (length unknown, except we know how much we want
      it to unpack to)

 - Finally at the end: 20 bytes of SHA1 of the pack-file contents (up to
   the SHA1)

so it's actually possible to pick up the objects as they come off the
stream, since the SHA1 name is defined by the contents and you don't need
the index file unless you want to look things up.

So the receiver side could try this algorithm:

 - unpack each object in memory on the receiving side

   If the unpack failed, it must have been the SHA1 at the end, so verify
   it!

 - if it's a delta object and you haven't seen the object it's a delta
   against, keep it in memory.

 - if it's a non-delta object, just write it to the object store, and try
   to resolve any delta objects you have pending that this new object
   satisfies. That in turn creates other objects that may have more
   deltas they satisfy etc.

which looks quite doable. The delta objects are small, so keeping them in
memory shouldn't be a problem (especially since we _tend_ to write deltas
after the object they depend on).

I can certainly add an option to git-pack-objects that disables writing of
the index file, and just writes the pack-file to stdout. I'm not sure I
want to write the "parse incoming pack-file" thing, but git-unpack-objects
comes _reasonably_ close (but right now it seeks around using the index
file to resolve deltas, instead of keeping them in memory and resolving
them when possible).

But I can make the infrastructure ready for it. Sounds like a plan.

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
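The per-object prefix in the stream layout described in the message above (type character, four-byte network-order length, optional 20-byte delta base ID) can be decoded with a few lines of C. This is a sketch built only from that description, not the code in git-unpack-objects; the struct and function names are hypothetical. It stops at the start of the zlib data and makes no attempt to tell a trailing pack SHA1 from an entry header, which is the ambiguity the receiver algorithm has to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one entry header from the stream. */
struct pack_entry_header {
	char type;			/* 'C', 'T', 'B', 'G' or 'D' */
	unsigned long size;		/* unpacked data length */
	unsigned char base_sha1[20];	/* delta base, only when type is 'D' */
};

/*
 * Decode one entry header from buf.
 * Returns the number of bytes consumed (5, or 25 for a delta),
 * 0 if more input is needed, or -1 if the type byte is not recognised.
 */
static int parse_pack_entry_header(const unsigned char *buf, size_t len,
				   struct pack_entry_header *out)
{
	size_t need = 5;

	assert(buf && out);
	if (len < 1)
		return 0;
	if (!memchr("CTBGD", buf[0], 5))
		return -1;
	if (buf[0] == 'D')
		need += 20;
	if (len < need)
		return 0;

	out->type = (char)buf[0];
	/* Four bytes, most significant first (network order). */
	out->size = ((unsigned long)buf[1] << 24) |
		    ((unsigned long)buf[2] << 16) |
		    ((unsigned long)buf[3] << 8) |
		    (unsigned long)buf[4];
	memset(out->base_sha1, 0, sizeof(out->base_sha1));
	if (buf[0] == 'D')
		memcpy(out->base_sha1, buf + 5, 20);
	return (int)need;
}
```

A receiver would call this at the start of every entry, inflate `size` bytes of output from the data that follows, and then either write the object out or park the delta until its base shows up.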
* Re: CAREFUL! No more delta object support! 2005-06-28 17:36 ` Linus Torvalds @ 2005-06-28 18:17 ` Linus Torvalds 2005-06-28 19:49 ` Matthias Urlichs ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Linus Torvalds @ 2005-06-28 18:17 UTC (permalink / raw To: Daniel Barkalow; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
> I can certainly add an option to git-pack-objects that disables writing of
> the index file, and just writes the pack-file to stdout.

Done.

> I'm not sure I
> want to write the "parse incoming pack-file" thing, but git-unpack-objects
> comes _reasonably_ close (but right now it seeks around using the index
> file to resolve deltas, instead of keeping them in memory and resolving
> them when possible).

I'm still thinking about this one. I think I'll just do it.

One problem here is that since we don't know how big the incoming
pack-file will be, in a streaming input environment the receiver needs to
either make the pack-file reception be the last thing it sees, or it will
have to live with the fact that "git-unpack-objects" will read some more
than it needs before it notices that it got it all...

We can handle the latter either by padding (make the rule be that
git-unpack-objects will always read in chunks of 4kB max, and pad the
output with 4kB of zero bytes or something, and then you can execute
git-unpack-objects and continue reading stdin afterwards, removing any
zeroes that git-unpack-objects didn't eat), or by having
git-unpack-objects flush anything after the final SHA1 to _its_ stdout,
so that you can get the following data/commands in the stream from the
unpack-objects thing.

Ugly, in any case.

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 18:17 ` Linus Torvalds @ 2005-06-28 19:49 ` Matthias Urlichs 2005-06-28 20:18 ` Matthias Urlichs 2005-06-28 20:01 ` Daniel Barkalow 2005-06-29 3:53 ` Linus Torvalds 2 siblings, 1 reply; 38+ messages in thread From: Matthias Urlichs @ 2005-06-28 19:49 UTC (permalink / raw To: git Hi, Linus Torvalds wrote: > Ugly, in any case. Why not chunk the thing? In other words, the stream shouldn't be "here's a big-ass packfile of unknown size" but an arbitrary number of "here's a N-byte sized chunk of the current pack file" snippets, followed by a "here's the SHA1 of the whole thing" packet. -- Matthias Urlichs | {M:U} IT Design @ m-u-it.de | smurf@smurf.noris.de Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de - - Be like a duck -- keep calm and unruffled on the surface but paddle like the devil under water. ^ permalink raw reply [flat|nested] 38+ messages in thread
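The chunking idea in the message above comes down to length-prefixed framing. The message does not specify any wire layout, so everything concrete here is an assumption made for illustration: a four-byte network-order length in front of each chunk, a zero-length chunk as the end marker (after which the SHA1 packet would follow), and the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Frame one chunk: a 4-byte network-order length, then n payload bytes.
 * Returns the number of bytes written to out, or 0 if out is too small
 * or n does not fit in 32 bits. A chunk with n == 0 is the (assumed)
 * end-of-pack marker.
 */
static size_t frame_chunk(unsigned char *out, size_t outsize,
			  const unsigned char *data, unsigned long n)
{
	assert(out && (data || n == 0));
	if (n > 0xffffffffUL || outsize < 4 || outsize - 4 < n)
		return 0;
	out[0] = (unsigned char)(n >> 24);
	out[1] = (unsigned char)(n >> 16);
	out[2] = (unsigned char)(n >> 8);
	out[3] = (unsigned char)n;
	if (n)
		memcpy(out + 4, data, n);
	return 4 + (size_t)n;
}

/*
 * Read the length prefix of the next chunk.
 * Returns 1 and stores the length in *n, or 0 if fewer than
 * four bytes are available yet.
 */
static int read_chunk_length(const unsigned char *in, size_t len,
			     unsigned long *n)
{
	assert(in && n);
	if (len < 4)
		return 0;
	*n = ((unsigned long)in[0] << 24) | ((unsigned long)in[1] << 16) |
	     ((unsigned long)in[2] << 8) | (unsigned long)in[3];
	return 1;
}
```

With framing like this the receiver always knows how many bytes belong to the pack, so it never has to read past the end of the pack data to find out that it is done.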
* Re: CAREFUL! No more delta object support! 2005-06-28 19:49 ` Matthias Urlichs @ 2005-06-28 20:18 ` Matthias Urlichs 0 siblings, 0 replies; 38+ messages in thread From: Matthias Urlichs @ 2005-06-28 20:18 UTC (permalink / raw To: git I wrote: > Linus Torvalds wrote: > >> Ugly, in any case. > > Why not chunk the thing? Having the number of files sent first would work too, I'd think. I'm wary of trying to interpret something non-decompressible as a sha1 chunk, however -- the set of random bytes that, to zlib, look like a sufficiently valid zip header that it wants to read more than 20 of them before punting is certainly not zero. -- Matthias Urlichs | {M:U} IT Design @ m-u-it.de | smurf@smurf.noris.de Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de - - I was sure the old fellow would never make it to the other side of the curb when I struck him. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 18:17 ` Linus Torvalds 2005-06-28 19:49 ` Matthias Urlichs @ 2005-06-28 20:01 ` Daniel Barkalow 2005-06-29 3:53 ` Linus Torvalds 2 siblings, 0 replies; 38+ messages in thread From: Daniel Barkalow @ 2005-06-28 20:01 UTC (permalink / raw To: Linus Torvalds; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> On Tue, 28 Jun 2005, Linus Torvalds wrote:
> >
> > I can certainly add an option to git-pack-objects that disables writing of
> > the index file, and just writes the pack-file to stdout.
>
> Done.

What I actually meant was that it would be useful for git-ssh-push to be
able to pack stuff as a function call rather than execing an external
program, because just sticking git-ssh-push at the end of a pipeline
doesn't work if you don't remember what the remote side has.

> > I'm not sure I
> > want to write the "parse incoming pack-file" thing, but git-unpack-objects
> > comes _reasonably_ close (but right now it seeks around using the index
> > file to resolve deltas, instead of keeping them in memory and resolving
> > them when possible).
>
> I'm still thinking about this one. I think I'll just do it.

One possibility would be to put a special type tag (like '\0') before the
hash, so that the format is more deterministic.

> One problem here is that since we don't know how big the incoming
> pack-file will be, in a streaming input environment the receiver needs to
> either make the pack-file reception be the last thing it sees, or it will
> have to live with the fact that "git-unpack-objects" will read some more
> than it needs before it notices that it got it all...

In a completely streaming environment, yes; but the receiving side is the
one sending commands, so you don't run into the next thing unless you're
overlapping requests. Failing that, we can just keep a 4k buffer of stuff
we've already read around; we don't have to worry about reading into
something we won't want to read at all.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 18:17 ` Linus Torvalds 2005-06-28 19:49 ` Matthias Urlichs 2005-06-28 20:01 ` Daniel Barkalow @ 2005-06-29 3:53 ` Linus Torvalds 2 siblings, 0 replies; 38+ messages in thread From: Linus Torvalds @ 2005-06-29 3:53 UTC (permalink / raw To: Daniel Barkalow; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:
>
> > I'm not sure I
> > want to write the "parse incoming pack-file" thing, but git-unpack-objects
> > comes _reasonably_ close (but right now it seeks around using the index
> > file to resolve deltas, instead of keeping them in memory and resolving
> > them when possible).
>
> I'm still thinking about this one. I think I'll just do it.

Ok, done.

I had to basically rewrite that unpacking logic, but the end result is
actually slightly smaller and cleaner, and it can now unpack from a
stream. That stream reading logic that uncompresses directly from the
stream buffer might be considered a bit too subtle (and somebody should
really double-check it), but hey, it works for me.

In fact, I just did this:

	#
	# Create empty git archive "~/unpack"
	#
	mkdir ~/unpack
	cd ~/unpack
	git-init-db

	#
	# Copy the git archive there over a pipe
	#
	cd ~/git
	git-rev-list --objects HEAD |
		git-pack-objects --depth=50 --window=50 --stdout |
		(cd ~/unpack ; git-unpack-objects)

	#
	# Go to new archive, set up the head, and fsck to verify
	#
	cd ~/unpack
	cat ~/git/.git/HEAD > .git/HEAD
	git-fsck-cache --unreachable

Now, the above is a silly example, since I _could_ just have moved the
pack file into .git/objects/pack, but that was not the point of this whole
thing. The point was to do what a "git-ssh-push" would basically boil down
to.

I'd like somebody who knows zlib intimately to take a look at how I do the
streaming input thing (in particular, the "use(len - stream.avail_in);"
part in the inflate loop in the "get_data()" function).

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CAREFUL! No more delta object support! 2005-06-28 2:13 ` CAREFUL! No more delta object support! Linus Torvalds 2005-06-28 2:32 ` Junio C Hamano 2005-06-28 5:09 ` Daniel Barkalow @ 2005-06-29 18:59 ` Linus Torvalds 2005-06-29 21:05 ` Daniel Barkalow 2 siblings, 1 reply; 38+ messages in thread From: Linus Torvalds @ 2005-06-29 18:59 UTC (permalink / raw To: Junio C Hamano; +Cc: Git Mailing List, Daniel Barkalow

On Mon, 27 Jun 2005, Linus Torvalds wrote:
>
> On Mon, 27 Jun 2005, Junio C Hamano wrote:
> >
> > Shouldn't feeding "git-rev-list --objects" output plus
> > handcrafted list of objects in 2.6.11 tree object to
> > git-pack-objects just work???
>
> You could do that. And yes, we can add support for "tag" objects too
> (which the packing doesn't do at all right now). So this is not a
> "fundamental" problem, it's just a practical one right now.

Ok, I've added the logic to "git-rev-list --objects" to handle arbitrary
object dependencies. So you can do things like this, if you want to:

	git-rev-list --objects HEAD ^v2.6.11-tree

which basically generates the complete list of every object reachable from
HEAD, but not reachable from the v2.6.11 tree. It also understands about
tags, so if you do

	git-rev-list --objects v2.6.12 ^v2.6.11-tree

the end result will have the "v2.6.12" tag in it (along with all the
objects reachable from it, but not reachable from v2.6.11-tree).

What does this mean? It means that you can do a "push" from repository "a"
to repository "b" by doing

 - in "b", do

	refs_in_b=($(find .git/refs -type f | xargs cat))

 - in "a" do

	refs_in_a=($(find .git/refs -type f | xargs cat))

 - then, in "a", do

	git-rev-list --objects "${refs_in_a[@]}" --not "${refs_in_b[@]}" |
		git-pack-objects --stdout > push.pack

   to generate the objects pack in "push.pack"

 - then, in "b", do

	git-unpack-objects < push.pack

and you now have moved over _all_ the objects that were referenced in "a",
but not in "b". Including tags etc. So after that last stage, when you've
unpacked the objects, the only thing left to do is to make the refs in "b"
point to the new references from "a" (which basically boils down to a
"cp", except it would be good to verify that the refs in "b" still have
the same values as they did before we did the object push).

Daniel (or anybody else), interested? Please?

Of course, you can do this one branch at a time, too, if you want to, but
the above was meant as an example of how you can actually do all the
branches in one single pack-file, which is a lot more efficient (if you do
it one branch at a time, you'll quite possibly end up transferring objects
that are reachable in other branches multiple times, while the "all in one
go" thing will pack each object just once).

Now, have I actually _tested_ the above? Hell no. But all the heavy
lifting should now be done for doing an efficient "git push" that pushes
all branches in one go (or one at a time, it's your choice on how you end
up using git-rev-list).

		Linus

^ permalink raw reply [flat|nested] 38+ messages in thread
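The last step of the push recipe above, moving a ref in "b" only if it still holds the value seen before the push, is a compare-and-swap. The sketch below shows just that check on an in-memory ref value. The function name and the 40-hex-digit buffer layout are assumptions, and a real implementation would also need to make the check and the write atomic on disk, for example with a lock file, which this sketch does not attempt.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: move a ref to new_value only if it still holds
 * expected_old, the value recorded before the objects were pushed.
 * ref is a 41-byte buffer holding 40 hex digits plus a NUL.
 * Returns 0 on success, -1 if the ref moved in the meantime or
 * new_value is not a 40-character name.
 */
static int update_ref_if_unchanged(char ref[41], const char *expected_old,
				   const char *new_value)
{
	assert(ref && expected_old && new_value);
	if (strlen(new_value) != 40)
		return -1;
	if (strcmp(ref, expected_old) != 0)
		return -1;	/* somebody else updated "b": refuse */
	memcpy(ref, new_value, 41);
	return 0;
}
```

If the call fails, the pushed objects are still harmless in "b"; only the ref update has to be retried after looking at the new state.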
* Re: CAREFUL! No more delta object support! 2005-06-29 18:59 ` Linus Torvalds @ 2005-06-29 21:05 ` Daniel Barkalow 2005-06-29 21:38 ` Linus Torvalds 0 siblings, 1 reply; 38+ messages in thread From: Daniel Barkalow @ 2005-06-29 21:05 UTC (permalink / raw To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List On Wed, 29 Jun 2005, Linus Torvalds wrote: > and you now have moved over _all_ the objects that were referenced in "a", > but not in "b". Including tags etc. So after that last stage, when you've > unpacked the objects, the only thing left to do is to make the refs in "b" > point to the new references from "a" (which basically boils down to a > "cp", except it would be good to verify that the refs in "b" still have > the same values as they did before we did the object push). > > Daniel (or anybody else), interested? Please? I'll probably get to this over the weekend. > Of course, you can do this one branch at a time, too, if you want to, but > the above was meant as an example of how you can actually do all the > branches in one single pack-file, which is a lot more efficient (if you do > it one branch at a time, you'll quite possible end up transferring objects > that are reachable in other branches multiple times, while the "all in one > go" thing will pack each object just once). It should transfer each only once if you recalculate "refs_in_b" after each push, right? Or is the marking for "--objects ^commit" still not tight wrt object and tree files? I think branch-at-a-time is preferable for the case where the source doesn't want to send quite everything, and the target doesn't necessarily want everything named the same. > Now, have I actually _tested_ the above? Hell no. But all the heavy > lifting should now be done for doing an efficient "git push" that pushes > all branches in one go (or one at a time, it's your choice on how you end > up using git-rev-list). 
The one thing I can think of is whether things will blow up if the target
repository has heads that aren't in the source, at which point the source
has no clue what to exclude. I.e.:

  parent -- new-b
         \
          new-a

If I've moved the head on b forward to new-b, and a wants to push new-a
(as a new branch, perhaps), refs_in_b has only new-b, refs_in_a has parent
and new-a, and git-rev-list in a can't see that b has parent (and
everything upwards of that).

You probably just don't want to do this, but I bet that some people will
(e.g. projects that synchronize through a shared-owner repository).

	-Daniel
*This .sig left intentionally blank*
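The situation in the diagram can be sketched as a filter over the target's heads: only heads the sender actually has can be turned into "^commit" exclusions. The function names and the string-based object ids below are hypothetical, chosen purely for illustration.

```c
#include <string.h>

/* Hypothetical helper: does the sending repository have this object? */
static int have_object(const char **local, int nr_local, const char *id)
{
	int i;
	for (i = 0; i < nr_local; i++)
		if (!strcmp(local[i], id))
			return 1;
	return 0;
}

/*
 * Copy into `out` the remote heads that the sender also has, since
 * only those can serve as exclusions; returns how many were kept.
 * Heads unknown to the sender are skipped, so anything reachable only
 * through them is not excluded and may be sent redundantly.
 */
static int usable_exclusions(const char **remote, int nr_remote,
			     const char **local, int nr_local,
			     const char **out)
{
	int i, n = 0;
	for (i = 0; i < nr_remote; i++)
		if (have_object(local, nr_local, remote[i]))
			out[n++] = remote[i];
	return n;
}
```

With refs_in_b containing only new-b and the sender holding parent and new-a, the filter keeps nothing, which is exactly the case where the sender cannot tell that b already has parent.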
* Re: CAREFUL! No more delta object support!
  2005-06-29 21:05 ` Daniel Barkalow
@ 2005-06-29 21:38 ` Linus Torvalds
  2005-06-29 22:24 ` Daniel Barkalow
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-29 21:38 UTC
To: Daniel Barkalow; +Cc: Junio C Hamano, Git Mailing List

On Wed, 29 Jun 2005, Daniel Barkalow wrote:
>
> > Of course, you can do this one branch at a time, too, if you want to, but
> > the above was meant as an example of how you can actually do all the
> > branches in one single pack-file, which is a lot more efficient (if you do
> > it one branch at a time, you'll quite possibly end up transferring objects
> > that are reachable in other branches multiple times, while the "all in one
> > go" thing will pack each object just once).
>
> It should transfer each only once if you recalculate "refs_in_b" after
> each push, right?

Yes, you can do it that way too. It will possibly not pack as well due to
giving you fewer opportunities for deltas, but that's likely not a huge
issue.

> The one thing I can think of is whether things will blow up if the target
> repository has heads that aren't in the source

Right. I think that's a "feature" of pushing: you cannot push to an
archive that has state that you don't know about. Ie you can only push to
something that is a proper subset of what you are (on a per-branch basis,
of course - not necessarily on a "global" stage - so you could push just
_one_ branch, even if another branch was ahead of where you are).

		Linus
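The per-branch "subset" rule can be sketched as an ancestry test: a branch is pushable when the target's current head is somewhere in the sender's history for that branch. The `struct commit` below is a hypothetical toy (at most two parents), not git's own structure, and the walk is a naive recursion with no visited marking.

```c
#include <stddef.h>

/* Hypothetical toy commit with at most two parents. */
struct commit {
	const struct commit *parent[2];
};

/* Return 1 if `old` is `tip` itself or reachable from it via parents. */
static int is_ancestor(const struct commit *old, const struct commit *tip)
{
	int i;
	if (!tip)
		return 0;
	if (tip == old)
		return 1;
	for (i = 0; i < 2; i++)
		if (is_ancestor(old, tip->parent[i]))
			return 1;
	return 0;
}

/* A branch may be pushed only if the target's head is in our history. */
static int can_push_branch(const struct commit *remote_head,
			   const struct commit *local_head)
{
	return is_ancestor(remote_head, local_head);
}
```

In Daniel's example, pushing new-a onto a target whose head is still parent passes the test, while a target that has already moved to new-b fails it.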
* Re: CAREFUL! No more delta object support!
  2005-06-29 21:38 ` Linus Torvalds
@ 2005-06-29 22:24 ` Daniel Barkalow
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-29 22:24 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Wed, 29 Jun 2005, Linus Torvalds wrote:

> On Wed, 29 Jun 2005, Daniel Barkalow wrote:
> > The one thing I can think of is whether things will blow up if the target
> > repository has heads that aren't in the source
>
> Right. I think that's a "feature" of pushing: you cannot push to an
> archive that has state that you don't know about. Ie you can only push to
> something that is a proper subset of what you are (on a per-branch basis,
> of course - not necessarily on a "global" stage - so you could push just
> _one_ branch, even if another branch was ahead of where you are).

The issue is really distinguishing the "other" branches I don't care about
from the one that I do care about. With -w, I almost certainly care about
the ref I'm writing, but that doesn't help for refs that are new (new
branches or tags), for which I care about some other thing. Also, the
failure is a bit hard to detect, I think, in that I could find I do
recognize some ancient thing that's barely useful for exclusion, and miss
something that should exclude almost everything but it's been updated.

In any case, when things go wrong we simply send stuff the recipient
already has, so it's not the end of the world. (And there's probably some
clever way of dealing with it)

	-Daniel
*This .sig left intentionally blank*
* [PATCH] Adjust fsck-cache to packed GIT and alternate object pool.
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
@ 2005-06-28  8:49 ` Junio C Hamano
  2005-06-28 21:56 ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
  2005-06-28 21:58 ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano
  2 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28 8:49 UTC
To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> There are some other issues too, like the fact that "git-fsck-cache"
LT> doesn't know about the pack-files yet, so it will complain about missing
LT> objects etc.

And here is a patch to fix it.

It is interesting to know that the same problem existed for a long
time in a different form and nobody has complained:
GIT_ALTERNATE_OBJECT_DIRECTORIES. Maybe the alternate object pool
mechanism is not so widely used and probably not very useful for
everyday use. I donno.

------------
The fsck-cache complains if objects referred to by files in
.git/refs/ or objects stored in files under .git/objects/??/ are
not found as stand-alone SHA1 files (i.e. they are found only in
alternate object pools GIT_ALTERNATE_OBJECT_DIRECTORIES or in packed
archives stored under .git/objects/pack).

Although this is a good semantics to maintain consistency of a
single .git/objects directory as a self contained set of objects,
it sometimes is useful to consider it is OK as long as these
"outside" objects are available.

This commit introduces a new flag, --standalone, to git-fsck-cache.
When it is not specified, connectivity checks and .git/refs pointer
checks are taught that it is OK when expected objects do not exist
under .git/objects/?? hierarchy but are available from a packed
archive or in an alternate object pool.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 fsck-cache.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

ea4255429bb0b4b760ba2fe327f5806d8d24d8a6
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -12,6 +12,7 @@
 static int show_root = 0;
 static int show_tags = 0;
 static int show_unreachable = 0;
+static int standalone = 0;
 static int keep_cache_objects = 0;
 static unsigned char head_sha1[20];
 
@@ -25,13 +26,17 @@ static void check_connectivity(void)
 		struct object_list *refs;
 
 		if (!obj->parsed) {
-			printf("missing %s %s\n",
-			       obj->type, sha1_to_hex(obj->sha1));
+			if (!standalone && has_sha1_file(obj->sha1))
+				; /* it is in pack */
+			else
+				printf("missing %s %s\n",
+				       obj->type, sha1_to_hex(obj->sha1));
 			continue;
 		}
 
 		for (refs = obj->refs; refs; refs = refs->next) {
-			if (refs->item->parsed)
+			if (refs->item->parsed ||
+			    (!standalone && has_sha1_file(refs->item->sha1)))
 				continue;
 			printf("broken link from %7s %s\n",
 			       obj->type, sha1_to_hex(obj->sha1));
@@ -315,8 +320,11 @@ static int read_sha1_reference(const cha
 		return -1;
 
 	obj = lookup_object(sha1);
-	if (!obj)
+	if (!obj) {
+		if (!standalone && has_sha1_file(sha1))
+			return 0; /* it is in pack */
 		return error("%s: invalid sha1 pointer %.40s", path, hexname);
+	}
 
 	obj->used = 1;
 	mark_reachable(obj, REACHABLE);
@@ -390,6 +398,10 @@ int main(int argc, char **argv)
 			keep_cache_objects = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--standalone")) {
+			standalone = 1;
+			continue;
+		}
 		if (*arg == '-')
 			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] <head-sha1>*]");
 	}
------------
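The rule the patch adds to check_connectivity() can be restated as a small decision function. This is a sketch for illustration only: `enum where` and `report_missing` are hypothetical names that collapse the patch's `obj->parsed` and `has_sha1_file()` tests into one value.

```c
/* Where a lookup found the object; these names are illustrative only. */
enum where { NOWHERE, LOOSE, PACKED_OR_ALTERNATE };

/*
 * Return 1 if fsck should print "missing" for an object.  A loose
 * object that was scanned and parsed is never missing.  An object that
 * is only available from a pack or an alternate pool is accepted
 * unless --standalone was given.
 */
static int report_missing(enum where found, int standalone)
{
	if (found == LOOSE)
		return 0;
	if (found == PACKED_OR_ALTERNATE && !standalone)
		return 0;
	return 1;
}
```

The same test, applied to the target of a ref or of a link between objects, is what suppresses the "broken link" and "invalid sha1 pointer" reports in the other two hunks.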
* [PATCH] Expose packed_git and alt_odb.
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
@ 2005-06-28 21:56 ` Junio C Hamano
  2005-06-28 21:58 ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano
  1 sibling, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:56 UTC
To: Linus Torvalds; +Cc: git

The commands git-fsck-cache and probably git-*-pull need to have a way
to enumerate objects contained in packed GIT archives and alternate
object pools. This commit exposes the data structure used to keep
track of them from sha1_file.c, and adds a couple of accessor interface
functions for use by the enhanced git-fsck-cache command.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 cache.h     |   19 +++++++++++++++++++
 sha1_file.c |   43 ++++++++++++++++++++++++-------------------
 2 files changed, 43 insertions(+), 19 deletions(-)

da37711700d11f8c7f44fcb6819c724978c840b7
diff --git a/cache.h b/cache.h
--- a/cache.h
+++ b/cache.h
@@ -233,4 +233,23 @@ struct checkout {
 
 extern int checkout_entry(struct cache_entry *ce, struct checkout *state);
 
+extern struct alternate_object_database {
+	char *base;
+	char *name;
+} *alt_odb;
+extern void prepare_alt_odb(void);
+
+extern struct packed_git {
+	struct packed_git *next;
+	unsigned long index_size;
+	unsigned long pack_size;
+	unsigned int *index_base;
+	void *pack_base;
+	unsigned int pack_last_used;
+	char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
+} *packed_git;
+extern void prepare_packed_git(void);
+extern int num_packed_objects(const struct packed_git *p);
+extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*);
+
 #endif /* CACHE_H */
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -184,10 +184,7 @@ char *sha1_file_name(const unsigned char
 	return base;
 }
 
-static struct alternate_object_database {
-	char *base;
-	char *name;
-} *alt_odb;
+struct alternate_object_database *alt_odb;
 
 /*
  * Prepare alternate object database registry.
@@ -205,13 +202,15 @@ static struct alternate_object_database
  * pointed by base fields of the array elements with one xmalloc();
  * the string pool immediately follows the array.
  */
-static void prepare_alt_odb(void)
+void prepare_alt_odb(void)
 {
 	int pass, totlen, i;
 	const char *cp, *last;
 	char *op = NULL;
 	const char *alt = gitenv(ALTERNATE_DB_ENVIRONMENT) ? : "";
 
+	if (alt_odb)
+		return;
 	/* The first pass counts how large an area to allocate to
 	 * hold the entire alt_odb structure, including array of
 	 * structs and path buffers for them.  The second pass fills
@@ -258,8 +257,7 @@ static char *find_sha1_file(const unsign
 
 	if (!stat(name, st))
 		return name;
-	if (!alt_odb)
-		prepare_alt_odb();
+	prepare_alt_odb();
 	for (i = 0; (name = alt_odb[i].name) != NULL; i++) {
 		fill_sha1_path(name, sha1);
 		if (!stat(alt_odb[i].base, st))
@@ -271,15 +269,7 @@ static char *find_sha1_file(const unsign
 #define PACK_MAX_SZ (1<<26)
 static int pack_used_ctr;
 static unsigned long pack_mapped;
-static struct packed_git {
-	struct packed_git *next;
-	unsigned long index_size;
-	unsigned long pack_size;
-	unsigned int *index_base;
-	void *pack_base;
-	unsigned int pack_last_used;
-	char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
-} *packed_git;
+struct packed_git *packed_git;
 
 struct pack_entry {
 	unsigned int offset;
@@ -430,7 +420,7 @@ static void prepare_packed_git_one(char
 	}
 }
 
-static void prepare_packed_git(void)
+void prepare_packed_git(void)
 {
 	int i;
 	static int run_once = 0;
@@ -439,8 +429,7 @@ static void prepare_packed_git(void)
 		return;
 
 	prepare_packed_git_one(get_object_directory());
-	if (!alt_odb)
-		prepare_alt_odb();
+	prepare_alt_odb();
 	for (i = 0; alt_odb[i].base != NULL; i++) {
 		alt_odb[i].name[0] = 0;
 		prepare_packed_git_one(alt_odb[i].base);
@@ -750,6 +739,22 @@ static void *unpack_entry(struct pack_en
 	return unpack_non_delta_entry(pack+5, size, left);
 }
 
+int num_packed_objects(const struct packed_git *p)
+{
+	/* See check_packed_git_idx and pack-objects.c */
+	return (p->index_size - 20 - 20 - 4*256) / 24;
+}
+
+int nth_packed_object_sha1(const struct packed_git *p, int n,
+			   unsigned char* sha1)
+{
+	void *index = p->index_base + 256;
+	if (n < 0 || num_packed_objects(p) <= n)
+		return -1;
+	memcpy(sha1, (index + 24 * n + 4), 20);
+	return 0;
+}
+
 static int find_pack_entry_1(const unsigned char *sha1,
 			     struct pack_entry *e, struct packed_git *p)
 {
------------
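The arithmetic in num_packed_objects() and nth_packed_object_sha1() can be checked in isolation. The sketch below restates it with named constants; the function and macro names are hypothetical, the layout is read off the patch only (1024 bytes of leading table, 24-byte entries with the SHA1 starting 4 bytes in, and 40 trailing bytes), and the guard against undersized files is an addition not present in the patch.

```c
/* Sizes read off the patch; the macro names are illustrative. */
#define FANOUT_BYTES  (4 * 256)   /* skipped by "index_base + 256" */
#define ENTRY_BYTES   24          /* the SHA1 starts 4 bytes into an entry */
#define TRAILER_BYTES (20 + 20)   /* the "- 20 - 20" in the patch */

/* Number of entries in an index file of `index_size` bytes. */
static long idx_object_count(unsigned long index_size)
{
	if (index_size < FANOUT_BYTES + TRAILER_BYTES)
		return 0;	/* extra guard, not in the patch */
	return (long)((index_size - TRAILER_BYTES - FANOUT_BYTES) / ENTRY_BYTES);
}

/* Byte offset of the n-th SHA1 from the start of the index, or -1. */
static long idx_sha1_offset(unsigned long index_size, long n)
{
	if (n < 0 || idx_object_count(index_size) <= n)
		return -1;
	return FANOUT_BYTES + ENTRY_BYTES * n + 4;
}
```

As a sanity check against the first message in the thread: 3741 packed objects would give an index of 3741 * 24 + 1024 + 40 = 90848 bytes, roughly 88.7K, which is consistent with the 89K out.idx shown there.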
* [PATCH 3/3] Update fsck-cache (take 2)
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2005-06-28 21:56 ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
@ 2005-06-28 21:58 ` Junio C Hamano
  1 sibling, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:58 UTC
To: Linus Torvalds; +Cc: git

The fsck-cache complains if objects referred to by files in
.git/refs/ or objects stored in files under .git/objects/??/ are
not found as stand-alone SHA1 files (i.e. they are found only in
alternate object pools GIT_ALTERNATE_OBJECT_DIRECTORIES or in packed
archives stored under .git/objects/pack).

Although this is a good semantics to maintain consistency of a
single .git/objects directory as a self contained set of objects,
it sometimes is useful to consider it is OK as long as these
"outside" objects are available.

This commit introduces a new flag, --standalone, to git-fsck-cache.
When it is not specified, connectivity checks and .git/refs pointer
checks are taught that it is OK when expected objects do not exist
under .git/objects/?? hierarchy but are available from a packed
archive or in an alternate object pool.

Another new flag, --full, makes git-fsck-cache check not only the
current GIT_OBJECT_DIRECTORY but also objects found in alternate
object pools and packed GIT archives.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
*** This completes "the other half" that the fsck update I did last
*** night was missing. Please discard that one and use this
*** instead.

 Documentation/git-fsck-cache.txt |   18 +++++++++-
 fsck-cache.c                     |   71 ++++++++++++++++++++++++++++++++------
 2 files changed, 76 insertions(+), 13 deletions(-)

5cae1fa43bfeae6722d916aa764fa75d9ce1839a
diff --git a/Documentation/git-fsck-cache.txt b/Documentation/git-fsck-cache.txt
--- a/Documentation/git-fsck-cache.txt
+++ b/Documentation/git-fsck-cache.txt
@@ -9,7 +9,7 @@ git-fsck-cache - Verifies the connectivi
 
 SYNOPSIS
 --------
-'git-fsck-cache' [--tags] [--root] [--unreachable] [--cache] [<object>*]
+'git-fsck-cache' [--tags] [--root] [--unreachable] [--cache] [--standalone | --full] [<object>*]
 
 DESCRIPTION
 -----------
@@ -37,6 +37,22 @@ OPTIONS
 	Consider any object recorded in the cache also as a head node for
 	an unreachability trace.
 
+--standalone::
+	Limit checks to the contents of GIT_OBJECT_DIRECTORY
+	(.git/objects), making sure that it is consistent and
+	complete without referring to objects found in alternate
+	object pools listed in GIT_ALTERNATE_OBJECT_DIRECTORIES,
+	nor packed GIT archives found in .git/objects/pack;
+	cannot be used with --full.
+
+--full::
+	Check not just objects in GIT_OBJECT_DIRECTORY
+	(.git/objects), but also the ones found in alternate
+	object pools listed in GIT_ALTERNATE_OBJECT_DIRECTORIES,
+	and in packed GIT archives found in .git/objects/pack
+	and corresponding pack subdirectories in alternate
+	object pools; cannot be used with --standalone.
+
 It tests SHA1 and general object sanity, and it does full tracking of
 the resulting reachability and everything else. It prints out any
 corruption it finds (missing or bad objects), and if you use the
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -12,6 +12,8 @@
 static int show_root = 0;
 static int show_tags = 0;
 static int show_unreachable = 0;
+static int standalone = 0;
+static int check_full = 0;
 static int keep_cache_objects = 0;
 static unsigned char head_sha1[20];
 
@@ -25,13 +27,17 @@ static void check_connectivity(void)
 		struct object_list *refs;
 
 		if (!obj->parsed) {
-			printf("missing %s %s\n",
-			       obj->type, sha1_to_hex(obj->sha1));
+			if (!standalone && has_sha1_file(obj->sha1))
+				; /* it is in pack */
+			else
+				printf("missing %s %s\n",
+				       obj->type, sha1_to_hex(obj->sha1));
 			continue;
 		}
 
 		for (refs = obj->refs; refs; refs = refs->next) {
-			if (refs->item->parsed)
+			if (refs->item->parsed ||
+			    (!standalone && has_sha1_file(refs->item->sha1)))
 				continue;
 			printf("broken link from %7s %s\n",
 			       obj->type, sha1_to_hex(obj->sha1));
@@ -315,8 +321,11 @@ static int read_sha1_reference(const cha
 		return -1;
 
 	obj = lookup_object(sha1);
-	if (!obj)
+	if (!obj) {
+		if (!standalone && has_sha1_file(sha1))
+			return 0; /* it is in pack */
 		return error("%s: invalid sha1 pointer %.40s", path, hexname);
+	}
 
 	obj->used = 1;
 	mark_reachable(obj, REACHABLE);
@@ -366,10 +375,20 @@ static void get_default_heads(void)
 		die("No default references");
 }
 
+static void fsck_object_dir(const char *path)
+{
+	int i;
+	for (i = 0; i < 256; i++) {
+		static char dir[4096];
+		sprintf(dir, "%s/%02x", path, i);
+		fsck_dir(i, dir);
+	}
+	fsck_sha1_list();
+}
+
 int main(int argc, char **argv)
 {
 	int i, heads;
-	char *sha1_dir;
 
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
@@ -390,17 +409,45 @@ int main(int argc, char **argv)
 			keep_cache_objects = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--standalone")) {
+			standalone = 1;
+			continue;
+		}
+		if (!strcmp(arg, "--full")) {
+			check_full = 1;
+			continue;
+		}
 		if (*arg == '-')
-			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] <head-sha1>*]");
+			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] [--standalone | --full] <head-sha1>*]");
 	}
 
-	sha1_dir = get_object_directory();
-	for (i = 0; i < 256; i++) {
-		static char dir[4096];
-		sprintf(dir, "%s/%02x", sha1_dir, i);
-		fsck_dir(i, dir);
+	if (standalone && check_full)
+		die("Only one of --standalone or --full can be used.");
+	if (standalone)
+		unsetenv("GIT_ALTERNATE_OBJECT_DIRECTORIES");
+
+	fsck_object_dir(get_object_directory());
+	if (check_full) {
+		int j;
+		struct packed_git *p;
+		prepare_alt_odb();
+		for (j = 0; alt_odb[j].base; j++) {
+			alt_odb[j].name[-1] = 0; /* was slash */
+			fsck_object_dir(alt_odb[j].base);
+			alt_odb[j].name[-1] = '/';
+		}
+		prepare_packed_git();
+		for (p = packed_git; p; p = p->next) {
+			int num = num_packed_objects(p);
+			for (i = 0; i < num; i++) {
+				unsigned char sha1[20];
+				nth_packed_object_sha1(p, i, sha1);
+				if (fsck_sha1(sha1) < 0)
+					fprintf(stderr, "bad sha1 entry '%s'\n", sha1_to_hex(sha1));
+
+			}
+		}
 	}
-	fsck_sha1_list();
 
 	heads = 0;
 	for (i = 1; i < argc; i++) {
------------
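The fan-out walk that fsck_object_dir() performs over the 256 "xx" subdirectories can be sketched on its own. `fanout_dir` is a hypothetical helper, and it uses a bounded snprintf() where the patch uses sprintf() into a fixed 4096-byte buffer; that swap is a choice made for this illustration, not something the patch does.

```c
#include <stdio.h>

/*
 * Format the i-th fan-out directory, "<objdir>/00" .. "<objdir>/ff",
 * into `buf`.  Returns 0 on success, -1 if `i` is out of range or the
 * path does not fit in `len` bytes.
 */
static int fanout_dir(char *buf, size_t len, const char *objdir, int i)
{
	int n;
	if (i < 0 || i > 255)
		return -1;
	n = snprintf(buf, len, "%s/%02x", objdir, i);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* truncated or failed */
	return 0;
}
```

Calling it for i = 0 through 255 on each object directory (the main one, then each alternate under --full) yields the paths the patch hands to fsck_dir().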
end of thread, other threads:[~2005-06-29 22:19 UTC | newest]

Thread overview: 38+ messages
2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
2005-06-27 23:58 ` Christopher Li
2005-06-28  3:30 ` Linus Torvalds
2005-06-28  9:40 ` Junio C Hamano
2005-06-28 11:06 ` Christopher Li
2005-06-28 14:52 ` Petr Baudis
2005-06-28 16:35 ` Benjamin LaHaise
2005-06-28 20:30 ` Petr Baudis
2005-06-28 14:46 ` Jan Harkes
2005-06-28 10:38 ` Christopher Li
2005-06-28 16:45 ` Linus Torvalds
2005-06-29  0:49 ` [PATCH] Emit base objects of a delta chain when the delta is output Junio C Hamano
2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
2005-06-28  2:03 ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
2005-06-28  2:43 ` Linus Torvalds
2005-06-28  3:33 ` Junio C Hamano
2005-06-28 15:45 ` Linus Torvalds
2005-06-28  2:13 ` CAREFUL! No more delta object support! Linus Torvalds
2005-06-28  2:32 ` Junio C Hamano
2005-06-28  2:37 ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano
2005-06-28  2:48 ` CAREFUL! No more delta object support! Linus Torvalds
2005-06-28  5:09 ` Daniel Barkalow
2005-06-28 15:49 ` Linus Torvalds
2005-06-28 16:21 ` Linus Torvalds
2005-06-28 17:04 ` Daniel Barkalow
2005-06-28 17:36 ` Linus Torvalds
2005-06-28 18:17 ` Linus Torvalds
2005-06-28 19:49 ` Matthias Urlichs
2005-06-28 20:18 ` Matthias Urlichs
2005-06-28 20:01 ` Daniel Barkalow
2005-06-29  3:53 ` Linus Torvalds
2005-06-29 18:59 ` Linus Torvalds
2005-06-29 21:05 ` Daniel Barkalow
2005-06-29 21:38 ` Linus Torvalds
2005-06-29 22:24 ` Daniel Barkalow
2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
2005-06-28 21:56 ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
2005-06-28 21:58 ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano