* Using bitmaps to accelerate fetch and clone
@ 2012-09-27  0:47 Shawn Pearce
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27  0:47 UTC (permalink / raw)
  To: git; +Cc: Colby Ranger

Google has published a series of patches (see links below) to JGit to improve fetch and clone performance by adding compressed bitmaps to the pack-*.idx structure.

  Operation                   Index V2               Index VE003
  Clone                       37530ms (524.06 MiB)      82ms (524.06 MiB)
  Fetch (1 commit back)          75ms                  107ms
  Fetch (10 commits back)       456ms (269.51 KiB)     341ms (265.19 KiB)
  Fetch (100 commits back)      449ms (269.91 KiB)     337ms (267.28 KiB)
  Fetch (1000 commits back)    2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
  Fetch (10000 commits back)   2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
  Fetch (100000 commits back) 14340ms (185.83 MiB)    1655ms (189.39 MiB)

The repository tested was Android's platform/frameworks/base. The time shown is the time spent in the "Counting objects" phase of creating a pack for a client using the git:// protocol. The byte size shown is the size of the pack transferred to the client, and "commits back" describes how far behind the client was from the server when it started the fetch. In all test runs the client was git-core and the server was JGit on the same machine.

The amount of disk space used by the compressed bitmaps is tunable, but averages 10-15% of the pack-*.idx file size, so about 8 MiB of additional space for this repository.

A repository owner can reduce the worst-case time of the 100000-commits-back case by using slightly more disk and positioning bitmaps more frequently throughout history. The code doesn't do this by default because the expectation is that a client is probably not 100k commits behind. Instead it populates bitmaps at all branch and tag tips, densely (every few hundred commits) near the tips, and spaces them out more the further back in history it goes.
We assume the older history is accessed less often, and doesn't need to waste additional disk space or precious buffer cache.

The basic gist of the implementation is that a bitmap has a 1 bit set for each object that is reachable from the commit the bitmap is associated with. An index file may have a unique bitmap for hundreds of commits in the corresponding pack file. The set of objects to send is found by doing a simple computation:

  OR (all want lines) AND NOT OR (all have lines)

There are two key patches in the series that implement the file format change and logic involved:

* https://git.eclipse.org/r/7939

  Defines the new E003 index format and the bit set implementation logic.

* https://git.eclipse.org/r/7940

  Uses E003 indexes when available to make packs, and the logic required to make E003 format indexes during GC. :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27  0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce
@ 2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-27 14:33   ` Shawn Pearce
  2012-09-27 17:20   ` Jeff King
  2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
  1 sibling, 2 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-27 12:17 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
> Google has published a series of patches (see links below) to JGit to

Should discussions about this series happen here, on the jgit mailing list, or on Gerrit? I just want to make sure I'll discuss it in the right place.

> improve fetch and clone performance by adding compressed bitmaps to
> the pack-*.idx structure.
>
> Operation                   Index V2               Index VE003
> Clone                       37530ms (524.06 MiB)      82ms (524.06 MiB)
> Fetch (1 commit back)          75ms                  107ms
> Fetch (10 commits back)       456ms (269.51 KiB)     341ms (265.19 KiB)
> Fetch (100 commits back)      449ms (269.91 KiB)     337ms (267.28 KiB)
> Fetch (1000 commits back)    2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
> Fetch (10000 commits back)   2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
> Fetch (100000 commits back) 14340ms (185.83 MiB)    1655ms (189.39 MiB)

Beautiful. And curious: why do 100->1000 and 10000->100000 have such big leaps in time (V2)?

> The basic gist of the implementation is a bitmap has a 1 bit set for
> each object that is reachable from the commit the bitmap is associated
> with. An index file may have a unique bitmap for hundreds of commits
> in the corresponding pack file. The set of objects to send is
> found by doing a simple computation:
>
> OR (all want lines) AND NOT OR (all have lines)
>
> There are two key patches in the series that implement the file format
> change and logic involved:
>
> * https://git.eclipse.org/r/7939
>
> Defines the new E003 index format and the bit set
> implementation logic.

I suppose the index format is not set in stone yet?
My java-foo is rusty and I'm not familiar with jgit, so I may well have read things wrong. It seems the bitmap data follows directly after the regular index content. I'd like to see some sort of extension mechanism like in $GIT_DIR/index, so that we don't have to increase the pack index version often. What I have in mind is an optional commit cache to speed up rev-list and merge, which could be stored in the pack index too.

In the PackIndexVE003 class:

+    // Read the bitmaps for the Git types
+    SimpleDataInput dataInput = new SimpleDataInput(fd);
+    this.commits = readBitmap(dataInput);
+    this.trees = readBitmap(dataInput);
+    this.blobs = readBitmap(dataInput);
+    this.tags = readBitmap(dataInput);

Am I correct in saying that you have four different on-disk bitmaps, one for each object type? If so, is that for compression efficiency reasons?

> :-)

Definitely :-). I have shown my interest in this topic before. So I should probably say that I'm going to work on this on C Git, but sllloooowwwly. As this benefits the server side greatly, perhaps a GitHubber ;-) might want to work on this on C Git, for GitHub itself of course, and, as a side effect, make the rest of us happy?
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
@ 2012-09-27 14:33   ` Shawn Pearce
  2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
  2012-09-27 17:20   ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 14:33 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> Google has published a series of patches (see links below) to JGit to
>
> Should discussions about this series happen in here, jgit mailing or
> gerrit? I just want to make sure I'll discuss it at the right place.

I think we should have a concrete discussion about the implementation in Java on the Gerrit changes (e.g. "you should fix this comment, it doesn't sufficiently describe the method"), and a discussion about the file format and algorithm here, where everyone can contribute.

The format is named E003 because it's still experimental. It's not set in stone. Future iterations might be named E004, etc. If we get something final we will look to rename it to just version 3.

>> improve fetch and clone performance by adding compressed bitmaps to
>> the pack-*.idx structure.
>>
>> Operation                   Index V2               Index VE003
>> Clone                       37530ms (524.06 MiB)      82ms (524.06 MiB)
>> Fetch (1 commit back)          75ms                  107ms
>> Fetch (10 commits back)       456ms (269.51 KiB)     341ms (265.19 KiB)
>> Fetch (100 commits back)      449ms (269.91 KiB)     337ms (267.28 KiB)
>> Fetch (1000 commits back)    2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
>> Fetch (10000 commits back)   2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
>> Fetch (100000 commits back) 14340ms (185.83 MiB)    1655ms (189.39 MiB)
>
> Beautiful. And curious, why do 100->1000 and 10000->100000 have such
> big leaps in time (V2)?

We didn't investigate this. Colby just reported on what the current JGit code does.
I suspect there is something specific about the shape of the history graph for this repository, combined with the way JGit wrote the pack file, that caused these sorts of large increases. Perhaps they could be smaller. Not sure I care anymore when the E003 approach gives us such low times. :-)

>> The basic gist of the implementation is a bitmap has a 1 bit set for
>> each object that is reachable from the commit the bitmap is associated
>> with. An index file may have a unique bitmap for hundreds of commits
>> in the corresponding pack file. The set of objects to send is
>> found by doing a simple computation:
>>
>> OR (all want lines) AND NOT OR (all have lines)
>>
>> There are two key patches in the series that implement the file format
>> change and logic involved:
>>
>> * https://git.eclipse.org/r/7939
>>
>> Defines the new E003 index format and the bit set
>> implementation logic.
>
> I suppose the index format is not set in stone yet?

For E003, yes, we already have some data encoded with it. But as a file format change, no. We are willing to iterate on this if there is tangible benefit displayed by an alternative. Future versions would have to be E004 or some other new version number to disambiguate from E003.

> My java-foo is rusty and I'm not familiar with jgit, so I may well
> have read things wrong.

Or maybe not. :-)

> It seems the bitmap data follows directly after the regular index content.

Correct. It is after the regular content, but before the 2 SHA-1 trailers.

> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase the pack index version
> often.

This might be worthwhile. I dislike the way $GIT_DIR/index encodes extensions. Forcing an extension to fully materialize itself to determine its length, so the length can be placed before the data, is painful to work with when writing the file out to disk. I would prefer writing an index catalog at the trailer of the file.
We already require random access to the index file, so it's possible for a reader to read a fixed-size trailer record that has the 2 SHA-1s we normally end an index with, and an extension catalog footer that has a length and CRC-32 of the catalog. The catalog would immediately appear before the footer, so a reader can find the start of the extension catalog by subtracting from the end of the file the catalog length and the file footer and catalog footer lengths. The catalog can then supply a starting offset for each extension section, and writers don't need to predict in advance how much data they need to store. Readers trying to use extensions aren't really hurt; Git already randomly seeks to read the tail of an index file to compare the pack SHA-1 before assuming the index is valid.

> What I have in mind is an optional commit cache to speed up
> rev-list and merge, which could be stored in the pack index too.

We should also look into using the commit bitmap data to feed the rev-list traversal. The bitmaps can tell us which objects are commits, and their rough ordering given the packing rules. That may be sufficient to feed the walker without having a priority queue.

> In the PackIndexVE003 class:
>
> +    // Read the bitmaps for the Git types
> +    SimpleDataInput dataInput = new SimpleDataInput(fd);
> +    this.commits = readBitmap(dataInput);
> +    this.trees = readBitmap(dataInput);
> +    this.blobs = readBitmap(dataInput);
> +    this.tags = readBitmap(dataInput);
>
> Am I correct in saying that you have four different on-disk bitmaps,
> one for each object type? If so, is that for compression efficiency reasons?

Yes. The packer needs to know the type of each object it's going to pack. Instead of reading this from the individual object headers in the pack files, we store them in compressed type bitmaps. This is much faster to test than reading the base data from the pack file.

> Definitely :-). I have shown my interest in this topic before.
> So I should probably say that I'm going to work on this on C Git, but
> sllloooowwwly. As this benefits the server side greatly, perhaps a
> GitHubber ;-) might want to work on this on C Git, for GitHub itself
> of course, and, as a side effect, make the rest of us happy?

Google may also contribute towards this work; we really want to see an improvement in git-core too. It's just a lot of work, and we are also a limited team. :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 14:33   ` Shawn Pearce
@ 2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28  1:37 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger, Jeff King

On Thu, Sep 27, 2012 at 9:33 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> I'd like to see some sort of extension mechanism like in
>> $GIT_DIR/index, so that we don't have to increase the pack index version
>> often.
>
> This might be worthwhile. I dislike the way $GIT_DIR/index encodes
> extensions. Forcing an extension to fully materialize itself to
> determine its length so the length can be placed before the data is
> painful to work with when writing the file out to disk. I would prefer
> writing an index catalog at the trailer of the file. We already
> require random access to the index file, so it's possible for a reader
> to read a fixed-size trailer record that has the 2 SHA-1s we normally
> end an index with, and an extension catalog footer that has a length
> and CRC-32 of the catalog. The catalog would immediately appear before
> the footer, so a reader can find the start of the extension catalog by
> subtracting from the end of the file the catalog length and the file
> footer and catalog footer lengths. The catalog can then supply a
> starting offset for each extension section, and writers don't need to
> predict in advance how much data they need to store. Readers trying to
> use extensions aren't really hurt, Git already randomly seeks to read
> the tail of an index file to compare the pack SHA-1 before assuming
> the index is valid.

Yeah, that's exactly what I had in mind. But perhaps a separate file (or files?) may be better. On that point, should all extensions be in one new extra file, or one extension per file? I prefer all extensions in one file, so we only need a single additional stat() for the extension check instead of probing the entire pack-XXX.* range.
In that case, the catalog trailer idea still applies.
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-27 14:33   ` Shawn Pearce
@ 2012-09-27 17:20   ` Jeff King
  2012-09-27 17:35     ` Shawn Pearce
  ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Jeff King @ 2012-09-27 17:20 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Shawn Pearce, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote:

> > Operation                   Index V2               Index VE003
> > Clone                       37530ms (524.06 MiB)      82ms (524.06 MiB)
> > Fetch (1 commit back)          75ms                  107ms
> > Fetch (10 commits back)       456ms (269.51 KiB)     341ms (265.19 KiB)
> > Fetch (100 commits back)      449ms (269.91 KiB)     337ms (267.28 KiB)
> > Fetch (1000 commits back)    2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
> > Fetch (10000 commits back)   2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
> > Fetch (100000 commits back) 14340ms (185.83 MiB)    1655ms (189.39 MiB)
>
> Beautiful. And curious, why do 100->1000 and 10000->100000 have such
> big leaps in time (V2)?

Agreed. I'm very excited about these numbers.

> > Defines the new E003 index format and the bit set
> > implementation logic.
> [...]
> It seems the bitmap data follows directly after the regular index content.
> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase the pack index version
> often. What I have in mind is an optional commit cache to speed up
> rev-list and merge, which could be stored in the pack index too.

As I understand it, both the bitmaps and a commit cache are theoretically optional. That is, git can do the job without them, but they speed things up. If that is the case, do we need to bump the index version at all? Why not store a plain v2 index, and then store an additional file "pack-XXX.reachable" that contains the bitmaps and an independent version number. The sha1 in the filename makes sure that the reachability file is always in sync with the actual pack data and index.
Old readers won't know about the new file, and will ignore it. For new readers, if the file is there they can use it; if it's missing (or its version is not understood), they can fall back to the regular index.

I haven't looked at the details of the format change yet. If it is purely an extra chunk of data at the end, this Just Works. If there are changes to the earlier parts of the pack (e.g., I seem to recall the commit cache idea wanted separate indices for each object type), we may still need a v3. But it would be nice if we could make those changes generic (e.g., just the separate indices, which might support many different enhancements), and then let the actual feature work happen in the separate files.

> Definitely :-). I have shown my interest in this topic before. So I
> should probably say that I'm going to work on this on C Git, but
> sllloooowwwly. As this benefits the server side greatly, perhaps a
> GitHubber ;-) might want to work on this on C Git, for GitHub itself
> of course, and, as a side effect, make the rest of us happy?

Yeah, GitHub is definitely interested in this. I may take a shot at it, but I know David Barr (cc'd) is also interested in such things.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:20   ` Jeff King
@ 2012-09-27 17:35     ` Shawn Pearce
  2012-09-27 18:22       ` Jeff King
  2012-09-27 19:47       ` David Michael Barr
  2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
  2 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 10:20 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> > Operation                   Index V2               Index VE003
>> > Clone                       37530ms (524.06 MiB)      82ms (524.06 MiB)
>> > Fetch (1 commit back)          75ms                  107ms
>> > Fetch (10 commits back)       456ms (269.51 KiB)     341ms (265.19 KiB)
>> > Fetch (100 commits back)      449ms (269.91 KiB)     337ms (267.28 KiB)
>> > Fetch (1000 commits back)    2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
>> > Fetch (10000 commits back)   2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
>> > Fetch (100000 commits back) 14340ms (185.83 MiB)    1655ms (189.39 MiB)
>>
>> Beautiful. And curious, why do 100->1000 and 10000->100000 have such
>> big leaps in time (V2)?
>
> Agreed. I'm very excited about these numbers.
>
>> > Defines the new E003 index format and the bit set
>> > implementation logic.
>> [...]
>> It seems the bitmap data follows directly after the regular index content.
>> I'd like to see some sort of extension mechanism like in
>> $GIT_DIR/index, so that we don't have to increase the pack index version
>> often. What I have in mind is an optional commit cache to speed up
>> rev-list and merge, which could be stored in the pack index too.
>
> As I understand it, both the bitmaps and a commit cache are
> theoretically optional. That is, git can do the job without them, but
> they speed things up.

Yes, entirely true.

> If that is the case, do we need to bump the index
> version at all? Why not store a plain v2 index, and then store an
> additional file "pack-XXX.reachable" that contains the bitmaps and an
> independent version number.
This is the alternate version we considered internally. It was a bit more work to define a 3rd file stream per pack in our backend storage system, so we opted for a revision of an existing stream. We could spend a bit more time and add a 3rd stream, keeping the index format unmodified.

But we could have also done this with the CRC-32 table in index v2. We didn't. If the data should almost always be there in order to provide good service, then we should really be embedding it into the files.

I'm on the fence. I could go either way on this. E003 was just the fastest way to prototype and start testing. We would probably be equally happy with the 3rd stream.

> The sha1 in the filename makes sure that the reachability file is always
> in sync with the actual pack data and index.

Depending on the extension dependencies, you may need to also use the trailer SHA-1 from the pack file itself, like the index does. E.g. the bitmap data depends heavily on object order in the pack and is invalid if you repack with a different ordering algorithm, or a different set of results from delta compression.

> Old readers won't know
> about the new file, and will ignore it. For new readers, if the file is
> there they can use it; if it's missing (or its version is not
> understood), they can fall back to the regular index.
>
> I haven't looked at the details of the format change yet. If it is
> purely an extra chunk of data at the end, this Just Works.

Yes, it's just an extra chunk on the end.

> If there are
> changes to the earlier parts of the pack (e.g., I seem to recall the
> commit cache idea wanted separate indices for each object type), we may
> still need a v3. But it would be nice if we could make those changes
> generic (e.g., just the separate indices, which might support many
> different enhancements), and then let the actual feature work happen in
> the separate files.

Yes. One downside is these separate streams aren't removed when you run git repack.
But this could be fixed by a modification to git repack to clean up additional extensions with the same pack base name.

>> Definitely :-). I have shown my interest in this topic before. So I
>> should probably say that I'm going to work on this on C Git, but
>> sllloooowwwly. As this benefits the server side greatly, perhaps a
>> GitHubber ;-) might want to work on this on C Git, for GitHub itself
>> of course, and, as a side effect, make the rest of us happy?
>
> Yeah, GitHub is definitely interested in this. I may take a shot at it,
> but I know David Barr (cc'd) is also interested in such things.

Building this in C is also dependent on having a good implementation of EWAH compressed bitmaps available. So it's a bit more work, but not insurmountable by any means.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:35     ` Shawn Pearce
@ 2012-09-27 18:22       ` Jeff King
  2012-09-27 18:36         ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2012-09-27 18:22 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 10:35:47AM -0700, Shawn O. Pearce wrote:

> > If that is the case, do we need to bump the index
> > version at all? Why not store a plain v2 index, and then store an
> > additional file "pack-XXX.reachable" that contains the bitmaps and an
> > independent version number.
>
> This is the alternate version we considered internally. It was a bit
> more work to define a 3rd file stream per pack in our backend storage
> system, so we opted for a revision of an existing stream. We could
> spend a bit more time and add a 3rd stream, keeping the index format
> unmodified.

I'd rather make the choice that provides the best user experience, even if it is a bit more code refactoring.

> But we could have also done this with the CRC-32 table in index v2. We
> didn't. If the data should almost always be there in order to provide
> good service then we should really be embedding into the files.

Yes, although there were other changes in v2, also (e.g., the fanout to handle larger packfiles). Bumping the version also made the transition take a lot longer. We introduced the reading and writing code, but then couldn't flip the default for quite a while. For big server providers this is not as big a deal (we know which versions of git we will use, and are OK with flipping a config bit). But it's one more tuning thing to deal with for small or single-person servers.

I think clients will also want it. If we can make "git rev-list --objects --all" faster (which this should be able to do), we can speed up "git prune", which in turn is by far the slowest part of "git gc --auto", since in the typical case we are only incrementally packing.
I also like that the general technique can be reused easily. We've talked about a generation-number cache in the past. That would fit this model as well. Removing a backwards-compatibility barrier makes it a lot easier to experiment with these sorts of things.

> > The sha1 in the filename makes sure that the reachability file is always
> > in sync with the actual pack data and index.
>
> Depending on the extension dependencies, you may need to also use the
> trailer SHA-1 from the pack file itself, like the index does. E.g. the
> bitmap data depends heavily on object order in the pack and is invalid
> if you repack with a different ordering algorithm, or a different
> set of results from delta compression.

Interesting. I would have assumed it depended on order in the index. But like I said, I haven't looked. I think you are still OK, though, because the filename comes from the sha1 over the index file, which in turn includes the sha1 over the packfile. Thus any change in the packfile would give you a new pack and index name.

> Yes. One downside is these separate streams aren't removed when you
> run git repack. But this could be fixed by a modification to git
> repack to clean up additional extensions with the same pack base name.

I don't think that's a big deal. We already do it with ".keep" files. If you repack with an older version of git, you may have a stale supplementary file wasting space. But that's OK. The next time you gc with a newer version of git, we could detect and clean up such stale files (we already do so for tmp_pack_* files).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 18:22       ` Jeff King
@ 2012-09-27 18:36         ` Shawn Pearce
  2012-09-27 18:52           ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 18:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 11:22 AM, Jeff King <peff@peff.net> wrote:
>
> I think clients will also want it. If we can make "git rev-list
> --objects --all" faster (which this should be able to do), we can speed
> up "git prune", which in turn is by far the slowest part of "git gc
> --auto", since in the typical case we are only incrementally packing.

Yes, the bitmap can also accelerate prune. We didn't implement this, but it is a trivial use of the existing bitmap.

>> > The sha1 in the filename makes sure that the reachability file is always
>> > in sync with the actual pack data and index.
>>
>> Depending on the extension dependencies, you may need to also use the
>> trailer SHA-1 from the pack file itself, like the index does. E.g. the
>> bitmap data depends heavily on object order in the pack and is invalid
>> if you repack with a different ordering algorithm, or a different
>> set of results from delta compression.
>
> Interesting. I would have assumed it depended on order in the index.

No. We tried that. Assigning bits by order in the index (aka order of sorted SHA-1s) results in horrible compression of the bitmap itself, because of the uniform distribution of SHA-1. Encoding instead by pack order gets us really good bitmap compression, because object graph traversal order tends to take reachability into account. So we see long contiguous runs of 1s and get good compression. Sorting by SHA-1 just makes the space into swiss cheese.

> I think you are still OK, though, because
> the filename comes from the sha1 over the index file, which in turn
> includes the sha1 over the packfile.
> Thus any change in the packfile
> would give you a new pack and index name.

No. The pack file name is composed from the SHA-1 of the sorted SHA-1s in the pack. Any change in compression settings or delta windows, or even just random scheduling variations when repacking, can cause offsets to slide, even if the set of objects being repacked has not differed. The resulting pack and index will have the same file names (as it's the same set of objects), but the offset information and ordering is now different.

Naming a pack after a SHA-1 is a fun feature. Naming it after the SHA-1 of the object list was a mistake. It should have been named after the SHA-1 in the trailer of the file, so that any single bit modified within the pack stream itself would have caused a different name to be used on the filesystem. But alas this is water under the bridge and not likely to change anytime soon.

>> Yes. One downside is these separate streams aren't removed when you
>> run git repack. But this could be fixed by a modification to git
>> repack to clean up additional extensions with the same pack base name.
>
> I don't think that's a big deal. We already do it with ".keep" files. If
> you repack with an older version of git, you may have a stale
> supplementary file wasting space. But that's OK. The next time you gc
> with a newer version of git, we could detect and clean up such stale
> files (we already do so for tmp_pack_* files).

Yes, obviously.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 18:36         ` Shawn Pearce
@ 2012-09-27 18:52           ` Jeff King
  2012-09-27 20:18             ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2012-09-27 18:52 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 11:36:19AM -0700, Shawn O. Pearce wrote:

> > Interesting. I would have assumed it depended on order in the index.
>
> No. We tried that. Assigning bits by order in the index (aka order of
> sorted SHA-1s) results in horrible compression of the bitmap itself,
> because of the uniform distribution of SHA-1. Encoding instead by pack
> order gets us really good bitmap compression, because object graph
> traversal order tends to take reachability into account. So we see
> long contiguous runs of 1s and get good compression. Sorting by SHA-1
> just makes the space into swiss cheese.

Right, that makes a lot of sense.

> > I think you are still OK, though, because
> > the filename comes from the sha1 over the index file, which in turn
> > includes the sha1 over the packfile. Thus any change in the packfile
> > would give you a new pack and index name.
>
> No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
> in the pack. Any change in compression settings or delta windows or
> even just random scheduling variations when repacking can cause
> offsets to slide, even if the set of objects being repacked has not
> differed. The resulting pack and index will have the same file names
> (as it's the same set of objects), but the offset information and
> ordering is now different.

Are you sure? The trailer is computed over the sha1 of the actual pack data (ordering, delta choices, and all), and is computed and written to the packfile via sha1close (see pack-objects.c, ll. 753-763). That trailer sha1 is fed into finish_tmp_packfile (l. 793).
That function feeds it to write_idx_file, which starts a new sha1 computation that includes the sorted sha1 list and other index info. But before we sha1close that computation, we write the _original_ trailer sha1, adding it to the new sha1 calculation. See pack-write.c, ll. 178-180.

And then that sha1 gets returned to finish_tmp_packfile, which uses it to name the resulting files.

Am I reading the code wrong?

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 18:52 ` Jeff King @ 2012-09-27 20:18 ` Jeff King 2012-09-27 21:33 ` Junio C Hamano 0 siblings, 1 reply; 20+ messages in thread From: Jeff King @ 2012-09-27 20:18 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr On Thu, Sep 27, 2012 at 02:52:29PM -0400, Jeff King wrote: > > No. The pack file name is composed from the SHA-1 of the sorted SHA-1s > > in the pack. Any change in compression settings or delta windows or > > even just random scheduling variations when repacking can cause > > offsets to slide, even if the set of objects being repacked has not > > differed. The resulting pack and index will have the same file names > > (as it's the same set of objects), but the offset information and > > ordering is now different. > > Are you sure? The trailer is computed over the sha1 of the actual pack > data (ordering, delta choices, and all), and is computed and written to > the packfile via sha1close (see pack-objects.c, ll. 753-763). That > trailer sha1 is fed into finish_tmp_packfile (l. 793). That function > feeds it to write_idx_file, which starts a new sha1 computation that > includes the sorted sha1 list and other index info. But before we > sha1close that computation, we write the _original_ trailer sha1, adding > it to the new sha1 calculation. See pack-write.c, ll. 178-180. > > And then that sha1 gets returned to finish_tmp_packfile, which uses it > to name the resulting files. > > Am I reading the code wrong? And the answer is...yes. I'm blind. The final bit of code in write_idx_file is: sha1write(f, sha1, 20); sha1close(f, NULL, ((opts->flags & WRITE_IDX_VERIFY) ? CSUM_CLOSE : CSUM_FSYNC)); git_SHA1_Final(sha1, &ctx); So we write the trailer, but the sha1 we pull out is _not_ the sha1 over the index format. It is from "ctx", not "f"; and the former is from the object list. Just like you said.
:) So yeah, we would want to put the pack trailer sha1 into the supplementary index file, and check that it matches when we open it. It's a slight annoyance, but it's O(1). Anything which rewrote the pack and index would also want to rewrite these supplementary files. So the worst case would be: 1. Pack with a new version which builds the supplementary file. 2. Repack with an old version which generates a pack with identical objects, but different ordering. It does not regenerate the supplementary file, because it does not know about it. 3. Try to read with newer git. Without the extra trailer check, we get a wrong answer. With the check, we notice that the supplementary file is bogus, and fall back to the slow path. Which I think is OK, considering that this scenario is reasonably unlikely to come up often (and it is no slower than it would be if you generated a _new_ packfile in step 2). -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
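The check Peff proposes is cheap to sketch. Something like the following toy model (the supplementary-file layout and function names are hypothetical, purely to show the O(1) staleness test; git's eventual format may differ):

```python
import os

def pack_trailer_sha1(pack_path):
    """A *.pack file ends with the 20-byte SHA-1 over everything before it."""
    with open(pack_path, "rb") as f:
        f.seek(-20, os.SEEK_END)
        return f.read(20)

def write_bitmap_file(bitmap_path, pack_path, payload):
    # Prefix the (hypothetical) bitmap payload with the pack trailer SHA-1.
    with open(bitmap_path, "wb") as f:
        f.write(pack_trailer_sha1(pack_path))
        f.write(payload)

def open_bitmap_file(bitmap_path, pack_path):
    """Return the bitmap payload, or None if it was built for a different
    pack (same name, different object ordering). On None the caller falls
    back to the classic "Counting objects" path. The check is O(1): it
    compares 20 bytes, regardless of pack size."""
    with open(bitmap_path, "rb") as f:
        if f.read(20) != pack_trailer_sha1(pack_path):
            return None
        return f.read()
```

This is exactly the step 2/3 scenario above: an old git rewrites the pack in place, the stored trailer no longer matches, and the stale supplementary file is rejected rather than trusted.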
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 20:18 ` Jeff King @ 2012-09-27 21:33 ` Junio C Hamano 2012-09-27 21:36 ` Jeff King 0 siblings, 1 reply; 20+ messages in thread From: Junio C Hamano @ 2012-09-27 21:33 UTC (permalink / raw) To: Jeff King Cc: Shawn Pearce, Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr Jeff King <peff@peff.net> writes: > So yeah, we would want to put the pack trailer sha1 into the > supplementary index file, and check that it matches when we open it. > It's a slight annoyance, but it's O(1). Yes. If I am not mistaken, that is exactly how an .idx file makes sure that it describes the matching .pack file (it has packfile checksum in its trailer). Otherwise you can repack the same set of objects into a new .pack file and make existing .idx very confused. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 21:33 ` Junio C Hamano @ 2012-09-27 21:36 ` Jeff King 0 siblings, 0 replies; 20+ messages in thread From: Jeff King @ 2012-09-27 21:36 UTC (permalink / raw) To: Junio C Hamano Cc: Shawn Pearce, Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr On Thu, Sep 27, 2012 at 02:33:01PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > So yeah, we would want to put the pack trailer sha1 into the > > supplementary index file, and check that it matches when we open it. > > It's a slight annoyance, but it's O(1). > > Yes. If I am not mistaken, that is exactly how an .idx file makes > sure that it describes the matching .pack file (it has packfile > checksum in its trailer). Otherwise you can repack the same set of > objects into a new .pack file and make existing .idx very confused. Yeah. In theory you wouldn't name the new packfile into place without also generating a new index for it. But even if you do it right, there's a race condition, and checking the trailer sha1 at least lets us know that they don't match (I assume we just reject the index, then; I guess this can result in an operation failing, but in practice it doesn't really happen, as we don't bother packing unless there are actually new objects). -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 17:20 ` Jeff King 2012-09-27 17:35 ` Shawn Pearce @ 2012-09-27 19:47 ` David Michael Barr 2012-09-28 1:38 ` Nguyen Thai Ngoc Duy 2 siblings, 0 replies; 20+ messages in thread From: David Michael Barr @ 2012-09-27 19:47 UTC (permalink / raw) To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, Shawn Pearce, git, Colby Ranger Hi all, On Fri, Sep 28, 2012 at 3:20 AM, Jeff King <peff@peff.net> wrote: > On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote: > >> > Operation Index V2 Index VE003 >> > Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) >> > Fetch (1 commit back) 75ms 107ms >> > Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) >> > Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) >> > Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) >> > Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) >> > Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) >> >> Beautiful. And curious, why do 100->1000 and 10000->100000 have such >> big leaps in time (V2)? > > Agreed. I'm very excited about these numbers. +1 >> Definitely :-). I have shown my interest in this topic before. So I >> should probably say that I'm going to work on this on C Git, but >> sllloooowwwly. As this benefits the server side greatly, perhaps a >> GitHubber ;-) might want to work on this on C Git, for GitHub itself >> of course, and, as a side effect, make the rest of us happy? > > Yeah, GitHub is definitely interested in this. I may take a shot at it, > but I know David Barr (cc'd) is also interested in such things. Yeah, I'm definitely interested, I love this stuff. -- David Michael Barr ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 17:20 ` Jeff King 2012-09-27 17:35 ` Shawn Pearce 2012-09-27 19:47 ` David Michael Barr @ 2012-09-28 1:38 ` Nguyen Thai Ngoc Duy 2 siblings, 0 replies; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-09-28 1:38 UTC (permalink / raw) To: Jeff King; +Cc: Shawn Pearce, git, Colby Ranger, David Barr On Fri, Sep 28, 2012 at 12:20 AM, Jeff King <peff@peff.net> wrote: >> Definitely :-). I have shown my interest in this topic before. So I >> should probably say that I'm going to work on this on C Git, but >> sllloooowwwly. As this benefits the server side greatly, perhaps a >> GitHubber ;-) might want to work on this on C Git, for GitHub itself >> of course, and, as a side effect, make the rest of us happy? > > Yeah, GitHub is definitely interested in this. I may take a shot at it, > but I know David Barr (cc'd) is also interested in such things. Great. Now I can just sit back and enjoy :) -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-09-27 0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce 2012-09-27 12:17 ` Nguyen Thai Ngoc Duy @ 2012-09-28 12:00 ` Nguyen Thai Ngoc Duy 2012-10-01 1:07 ` Shawn Pearce 1 sibling, 1 reply; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-09-28 12:00 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Colby Ranger On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote: > * https://git.eclipse.org/r/7939 > > Defines the new E003 index format and the bit set > implementation logic. Quote from the patch's message: "Currently, the new index format can only be used with pack files that contain a complete closure of the object graph e.g. the result of a garbage collection." You mentioned this before in your idea mail a while back. I wonder if it's worth storing bitmaps for all packs, not just the self contained ones. We could have one leaf bitmap per pack to mark all leaves where we'll need to traverse outside the pack. Commit leaves are the best as we can potentially reuse commit bitmaps from other packs. Tree leaves will be followed in the normal/slow way. For connectivity check, fewer trees/commits to deflate/parse means less time. And connectivity check is done on every git-fetch (I suspect the other end of a push also has the same check). It's not unusual for me to fetch some repos once every few months so these incomplete packs could be quite big and it'll take some time for gc --auto to kick in (of course we could adjust gc --auto to start based on the number of non-bitmapped objects, in addition to the number of packs). -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
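For reference, the enumeration these bitmaps accelerate — "OR (all want lines) AND NOT OR (all have lines)" from the thread's opening mail — reduces to pure bit operations once each want/have tip has a bitmap. A toy sketch, with bitmaps modeled as Python ints (bit i standing for object i in pack order):

```python
def objects_to_send(want_bitmaps, have_bitmaps):
    """OR (all want lines) AND NOT OR (all have lines)."""
    wants = 0
    for b in want_bitmaps:
        wants |= b
    haves = 0
    for b in have_bitmaps:
        haves |= b
    return wants & ~haves

# The client wants a tip reaching objects 0..4 and already has 0..2,
# so only objects 3 and 4 need to go into the pack.
want = 0b11111
have = 0b00111
print(bin(objects_to_send([want], [have])))  # 0b11000
```

This is why "Counting objects" collapses from seconds to milliseconds: the server replaces a graph walk with a handful of word-wide OR/AND-NOT operations over precomputed bitmaps.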
* Re: Using bitmaps to accelerate fetch and clone 2012-09-28 12:00 ` Nguyen Thai Ngoc Duy @ 2012-10-01 1:07 ` Shawn Pearce 2012-10-01 1:59 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 20+ messages in thread From: Shawn Pearce @ 2012-10-01 1:07 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger On Fri, Sep 28, 2012 at 5:00 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote: >> * https://git.eclipse.org/r/7939 >> >> Defines the new E003 index format and the bit set >> implementation logic. > > Quote from the patch's message: > > "Currently, the new index format can only be used with pack files that > contain a complete closure of the object graph e.g. the result of a > garbage collection." > > You mentioned this before in your idea mail a while back. I wonder if > it's worth storing bitmaps for all packs, not just the self contained > ones. Colby and I started talking about this late last week too. It seems feasible, but does add a bit more complexity to the algorithm used when enumerating. > We could have one leaf bitmap per pack to mark all leaves where > we'll need to traverse outside the pack. Commit leaves are the best as > we can potentially reuse commit bitmaps from other packs. Tree leaves > will be followed in the normal/slow way. Yes, Colby proposed the same idea. We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in the pack and therefore cannot have a bit assigned to them. We could add a new section that listed the unique leaf SHA-1s in their own private table, and then assigned per bitmap a leaf bitmap that set to 1 for any leaf object that is outside of the pack. This would probably take up the least amount of disk space, vs. storing the list of leaf SHA-1s after each bitmap. If a pack has only 1 bitmap (e.g. it is a small chunk of recent history) there is really no difference in disk usage. 
If the pack has 2 or 3 commit bitmaps along a string of approximately 300 commits, you will have an identical leaf set for each of those bitmaps so using a single leaf SHA-1 table would support reusing the redundant leaf pointers. One of the problems we have seen with these non-closed packs is they waste an incredible amount of disk. As an example, do a `git fetch` from Linus tree when you are more than a few weeks behind. You will get back more than 100 objects, so the thin pack will be saved and completed with additional base objects. That thin pack will go from a few MiBs to more than 40 MiB of data on disk, thanks to the redundant base objects being appended to the end of the pack. For most uses these packs are best eliminated and replaced with a new complete closure pack. The redundant base objects disappear, and Git stops wasting a huge amount of disk. > For connectivity check, fewer trees/commits to deflate/parse means > less time. And connectivity check is done on every git-fetch (I > suspect the other end of a push also has the same check). It's not > unusual for me to fetch some repos once every few months so these > incomplete packs could be quite big and it'll take some time for gc > --auto to kick in (of course we could adjust gc --auto to start based > on the number of non-bitmapped objects, in additional to number of > packs). Yes, of course. ^ permalink raw reply [flat|nested] 20+ messages in thread
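The layout Shawn sketches — one shared table of out-of-pack SHA-1s, plus a per-commit "leaf bitmap" over that table — can be modeled in a few lines. All names here are hypothetical illustration, not the JGit implementation:

```python
class ThinPackBitmaps:
    """Toy model of bitmaps for a pack that is not a complete closure.

    `edge_shas` is the single shared table of SHA-1s that are required
    but live outside this pack. Each commit gets one bitmap over the
    in-pack objects plus one small bitmap over the edge table, so
    bitmaps with identical leaf sets reuse the same table entries.
    Bitmaps are modeled as Python ints for brevity.
    """

    def __init__(self, in_pack_shas, edge_shas):
        self.in_pack = {sha: i for i, sha in enumerate(in_pack_shas)}
        self.edges = {sha: i for i, sha in enumerate(edge_shas)}
        self.commit_bitmaps = {}  # commit sha -> (obj_bits, edge_bits)

    def add_commit(self, commit, reachable):
        obj_bits = edge_bits = 0
        for sha in reachable:
            if sha in self.in_pack:
                obj_bits |= 1 << self.in_pack[sha]
            else:
                edge_bits |= 1 << self.edges[sha]  # external dependency
        self.commit_bitmaps[commit] = (obj_bits, edge_bits)

    def external_deps(self, commit):
        """SHA-1s reachable from `commit` that another pack must supply."""
        _, edge_bits = self.commit_bitmaps[commit]
        return {sha for sha, i in self.edges.items() if edge_bits >> i & 1}
```

With the 2-3 bitmaps per ~300 commits mentioned above sharing one edge table, the redundant leaf pointers cost a few bits per bitmap instead of 20 bytes per SHA-1 per bitmap.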
* Re: Using bitmaps to accelerate fetch and clone 2012-10-01 1:07 ` Shawn Pearce @ 2012-10-01 1:59 ` Nguyen Thai Ngoc Duy 2012-10-01 2:26 ` Shawn Pearce 0 siblings, 1 reply; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-10-01 1:59 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Colby Ranger On Mon, Oct 1, 2012 at 8:07 AM, Shawn Pearce <spearce@spearce.org> wrote: >> You mentioned this before in your idea mail a while back. I wonder if >> it's worth storing bitmaps for all packs, not just the self contained >> ones. > > Colby and I started talking about this late last week too. It seems > feasible, but does add a bit more complexity to the algorithm used > when enumerating. Yes. Though on the server side, if it's too much trouble, the packer can just ignore open packs and use only closed ones. >> We could have one leaf bitmap per pack to mark all leaves where >> we'll need to traverse outside the pack. Commit leaves are the best as >> we can potentially reuse commit bitmaps from other packs. Tree leaves >> will be followed in the normal/slow way. > > Yes, Colby proposed the same idea. > > We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in > the pack and therefore cannot have a bit assigned to them. We could mark all objects _in_ the pack that lead to an external object. That's what I meant by leaves. We need to parse the leaves to find out actual SHA-1s that are outside the pack. Or we could go with your approach below too. > We could > add a new section that listed the unique leaf SHA-1s in their own > private table, and then assigned per bitmap a leaf bitmap that set to > 1 for any leaf object that is outside of the pack. > One of the problems we have seen with these non-closed packs is they > waste an incredible amount of disk. As an example, do a `git fetch` > from Linus tree when you are more than a few weeks behind. You will > get back more than 100 objects, so the thin pack will be saved and > completed with additional base objects.
That thin pack will go from a > few MiBs to more than 40 MiB of data on disk, thanks to the redundant > base objects being appended to the end of the pack. For most uses > these packs are best eliminated and replaced with a new complete > closure pack. The redundant base objects disappear, and Git stops > wasting a huge amount of disk. That's probably a different problem. I appreciate disk savings but I would not want to wait a few more minutes for repack on every git-fetch. But if this bitmap thing makes repack much faster than currently, repacking after every git-fetch may become practical. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-10-01 1:59 ` Nguyen Thai Ngoc Duy @ 2012-10-01 2:26 ` Shawn Pearce 2012-10-01 12:48 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 20+ messages in thread From: Shawn Pearce @ 2012-10-01 2:26 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger On Sun, Sep 30, 2012 at 6:59 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Mon, Oct 1, 2012 at 8:07 AM, Shawn Pearce <spearce@spearce.org> wrote: >>> You mentioned this before in your idea mail a while back. I wonder if >>> it's worth storing bitmaps for all packs, not just the self contained >>> ones. >> >> Colby and I started talking about this late last week too. It seems >> feasible, but does add a bit more complexity to the algorithm used >> when enumerating. > > Yes. Though on the server side, if it's too much trouble, the packer can > just ignore open packs and use only closed ones. It's not trouble once the code is written. We were just trying to be expedient in producing a prototype that we could start to deploy on real-world workloads. Enumerating the non-closed-pack objects using the classical implementation is still slow and consumes CPU time at the server; using partial bitmaps should eliminate most of that CPU usage and reduce server loads. One of the more troublesome problems is that building the bitmaps is difficult from a streaming processor like index-pack. You need the reachability graph for all objects, which is not currently produced when moving data over the wire. We do an fsck after-the-fact to verify we didn't get corrupt data, but this is optional and currently after the pack is stored. We need to refactor this code to run earlier to get the bitmap built. If we take Peff's idea and put the bitmap data into a new stream rather than the pack-*.idx file we can produce the bitmap at the same time as the fsck check, which is probably a simpler change.
>>> We could have one leaf bitmap per pack to mark all leaves where >>> we'll need to traverse outside the pack. Commit leaves are the best as >>> we can potentially reuse commit bitmaps from other packs. Tree leaves >>> will be followed in the normal/slow way. >> >> Yes, Colby proposed the same idea. >> >> We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in >> the pack and therefore cannot have a bit assigned to them. > > We could mark all objects _in_ the pack that lead to an external > object. That's what I meant by leaves. We need to parse the leaves to > find out actual SHA-1s that are outside the pack. OK, yes, I was being pretty stupid. Of course we could mark the objects _in_ the pack as leaves and parse the leaves when we need to find the external pointers. > Or we could go with > your approach below too. I actually think my approach may be better. The root tree of a leaf commit would need to be scanned every time to identify trees that are reachable from this leaf commit, but are not reachable in the ancestry of the leaf's parents. This turns out to be rather expensive to compute every time a server or an fsck algorithm considers the partial pack. It's also somewhat uncommon for such an object to exist; it has to happen by an unrelated side branch introducing the same object and the creator of the pack seeing that the object already exists and not including it in the pack. Defining the pack's "edge" as a list of SHA-1s not in this pack but known to be required allows us to compute that leaf root tree reachability once, and never consider parsing it again. Which saves servers that host frequently accessed Git repositories but aren't repacking all of the time. (FWIW we repack frequently, I hear GitHub does too, because a fully repacked repository serves clients better than a partially packed one.)
>> We could >> add a new section that listed the unique leaf SHA-1s in their own >> private table, and then assigned per bitmap a leaf bitmap that set to >> 1 for any leaf object that is outside of the pack. > >> One of the problems we have seen with these non-closed packs is they >> waste an incredible amount of disk. As an example, do a `git fetch` >> from Linus tree when you are more than a few weeks behind. You will >> get back more than 100 objects, so the thin pack will be saved and >> completed with additional base objects. That thin pack will go from a >> few MiBs to more than 40 MiB of data on disk, thanks to the redundant >> base objects being appended to the end of the pack. For most uses >> these packs are best eliminated and replaced with a new complete >> closure pack. The redundant base objects disappear, and Git stops >> wasting a huge amount of disk. > > That's probably a different problem. I appreciate disk savings but I > would not want to wait a few more minutes for repack on every > git-fetch. But if this bitmap thing makes repack much faster than > currently, repacking after every git-fetch may become practical. I know the "Counting objects" phase of a repack is very expensive, but so too is the time required to do the IO in and out of every object in the repository. Spinning disks only transfer data so fast. Assuming the drive can move 50 MiB/s each way, a 500 MiB repository will take 10 seconds just to write back out the new pack. You aren't ever going to want to wait for repacking during git-fetch. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-10-01 2:26 ` Shawn Pearce @ 2012-10-01 12:48 ` Nguyen Thai Ngoc Duy 2012-10-02 15:00 ` Shawn Pearce 0 siblings, 1 reply; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-10-01 12:48 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Colby Ranger On Mon, Oct 1, 2012 at 9:26 AM, Shawn Pearce <spearce@spearce.org> wrote: > One of the more troublesome problems is building the bitmaps is > difficult from a streaming processor like index-pack. You need the > reachability graph for all objects, which is not currently produced > when moving data over the wire. We do an fsck after-the-fact to verify > we didn't get corrupt data, but this is optional and currently after > the pack is stored. We need to refactor this code to run earlier to > get the bitmap built. If we take Peff's idea and put the bitmap data > into a new stream rather than the pack-*.idx file we can produce the > bitmap at the same time as the fsck check, which is probably a simpler > change. If we need to go through the whole pack, not random sha-1 access, then index-pack's traversal is more efficient. I have some patches that remove pack-check.c and make fsck use index-pack to walk through packs. It takes much less time. But rev walk for building bitmaps probably does not fit this style of traversal because rev walk does not align with delta walk. > Defining the pack's "edge" as a list of SHA-1s not in this pack but > known to be required allows us to compute that leaf root tree > reachability once, and never consider parsing it again. Which saves > servers that host frequently accessed Git repositories but aren't > repacking all of the time. (FWIW we repack frequently, I hear GitHub > does too, because a fully repacked repository serves clients better > than a partially packed one.) Probably off topic. Does saving a list of missing bases in the pack index help storing thin packs directly? 
I may be missing some points because I don't see why thin packs cannot be stored on disk in the first place. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Using bitmaps to accelerate fetch and clone 2012-10-01 12:48 ` Nguyen Thai Ngoc Duy @ 2012-10-02 15:00 ` Shawn Pearce 0 siblings, 0 replies; 20+ messages in thread From: Shawn Pearce @ 2012-10-02 15:00 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger On Mon, Oct 1, 2012 at 5:48 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > Probably off topic. Does saving a list of missing bases in the pack > index help storing thin packs directly? I may be missing some points > because I don't see why thin packs cannot be stored on disk in the > first place. Packs are supposed to be completely self contained. We require all delta bases to be in the pack so that we can always recover the objects stored within it, even if there is no other pack available. This simplifies things like git gc such that they don't need to worry about retaining delta bases for other packs, or avoiding delta chain cycles between packs. I have managed to create an A->B->A delta chain once that made the repository corrupt as soon as the last non-delta copy of A was removed. Both A and B became unreadable as A needed B which needed A which needed B... :-( To keep things simple and easily verified as correct, we don't allow this.
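The A->B->A corruption is easy to reproduce in miniature: with no full copy of either object left, resolution recurses forever. A toy resolver (hypothetical store layout, with byte concatenation standing in for real delta application) that detects the cycle instead of hanging:

```python
def resolve(store, name, seen=None):
    """Recover an object from `store`, a dict mapping name to either
    ("full", data) or ("delta", base_name, patch). A delta-base cycle
    would otherwise recurse forever, so detect it and fail loudly."""
    seen = set() if seen is None else seen
    if name in seen:
        raise ValueError("delta chain cycle involving %r" % name)
    seen.add(name)
    entry = store[name]
    if entry[0] == "full":
        return entry[1]
    _, base_name, patch = entry
    base = resolve(store, base_name, seen)
    return base + patch  # toy stand-in for real delta application

# A healthy chain: B is a delta against a full copy of A.
ok = {"A": ("full", b"a"), "B": ("delta", "A", b"+b")}
# The corrupt case Shawn hit: A and B are deltas against each other,
# and no non-delta copy of either survives.
broken = {"A": ("delta", "B", b"+a"), "B": ("delta", "A", b"+b")}

print(resolve(ok, "B"))  # b"a" + b"+b"
```

Requiring every pack to be a complete closure sidesteps the problem entirely: cycles can only form across pack boundaries, and those boundaries are never followed.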
end of thread, other threads:[~2012-10-02 15:01 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-09-27 0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce 2012-09-27 12:17 ` Nguyen Thai Ngoc Duy 2012-09-27 14:33 ` Shawn Pearce 2012-09-28 1:37 ` Nguyen Thai Ngoc Duy 2012-09-27 17:20 ` Jeff King 2012-09-27 17:35 ` Shawn Pearce 2012-09-27 18:22 ` Jeff King 2012-09-27 18:36 ` Shawn Pearce 2012-09-27 18:52 ` Jeff King 2012-09-27 20:18 ` Jeff King 2012-09-27 21:33 ` Junio C Hamano 2012-09-27 21:36 ` Jeff King 2012-09-27 19:47 ` David Michael Barr 2012-09-28 1:38 ` Nguyen Thai Ngoc Duy 2012-09-28 12:00 ` Nguyen Thai Ngoc Duy 2012-10-01 1:07 ` Shawn Pearce 2012-10-01 1:59 ` Nguyen Thai Ngoc Duy 2012-10-01 2:26 ` Shawn Pearce 2012-10-01 12:48 ` Nguyen Thai Ngoc Duy 2012-10-02 15:00 ` Shawn Pearce