Date: Mon, 8 Aug 2016 10:50:43 -0400
From: Jeff King
To: Junio C Hamano
Cc: git@vger.kernel.org, Michael Haggerty
Subject: Re: [PATCH v2 7/7] pack-objects: use mru list when iterating over packs
Message-ID: <20160808145042.uwrk2m6jq3m4li37@sigill.intra.peff.net>
References: <20160729040422.GA19678@sigill.intra.peff.net>
 <20160729041524.GG22408@sigill.intra.peff.net>
 <20160729054536.GA27343@sigill.intra.peff.net>

On Fri, Jul 29, 2016 at 08:02:51AM -0700, Junio C Hamano wrote:

> > So it's possible that the resulting pack is not as small as it could
> > be (i.e., we break the chain with a base object, but it's possible if
> > we looked that we could have broken the chain by making a delta
> > against an existing base object). So I wonder if it's possible to
> > detect this case earlier, during the "can we reuse this delta" bits
> > of check_object().
>
> I'd let the issue simmer in my mind a bit for now, as I do not think
> of a neat trick offhand.

Sorry, I let this go a bit longer than intended. Here's where I'm at
with my thinking.

To recap the issue for those just joining us: the series in question
changes the order in which pack-objects will look at the existing
packfiles (in order to make the counting step faster when you have many
packs). This can introduce cycles in reused deltas, because we might
find "A" as a delta of "B" in one pack, but "B" as a delta of "A" in
another pack. Whereas the current code, with a static pack ordering,
will always find such an "A" and "B" in the same pack, so as long as
that pack does not have a cycle, our set of reused deltas won't either.

The current code does detect the cycle, but not until the write phase
(at which point we issue a warning, throw out the delta, and output the
full object).
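To make the failure mode concrete, here is a toy sketch (hypothetical
code, not git's actual structures; "object_entry" and "delta_base" are
just illustrative names) of how a reused-delta chain can loop once the
pack order changes, and how such a loop can be detected:

	/*
	 * Simplified sketch only: "object_entry" and "delta_base" are
	 * made-up stand-ins, not git's actual data structures.  Each
	 * to-be-packed object may point at the base of an on-disk delta we
	 * plan to reuse.  With a changing pack order we can pick up A-vs-B
	 * from one pack and B-vs-A from another, and the chain of base
	 * pointers then loops.
	 */
	#include <stdio.h>

	struct object_entry {
		struct object_entry *delta_base; /* reused delta base, or NULL */
	};

	/* Floyd's cycle check over the chain of reused delta bases. */
	static int delta_chain_has_cycle(const struct object_entry *e)
	{
		const struct object_entry *slow = e, *fast = e;

		while (fast && fast->delta_base) {
			slow = slow->delta_base;
			fast = fast->delta_base->delta_base;
			if (slow == fast)
				return 1;
		}
		return 0;
	}

	int main(void)
	{
		struct object_entry a, b;

		a.delta_base = &b; /* pack X stores A as a delta against B */
		b.delta_base = &a; /* pack Y stores B as a delta against A */
		printf("cycle: %d\n", delta_chain_has_cycle(&a)); /* prints 1 */
		return 0;
	}

By the write phase, detecting the loop is all we can do; the only
remaining fallback is to drop the delta and write one object in full.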
Here's a list of approaches I think we can use to fix this:

  1. squelch the warning and ignore it. The downside here, besides not
     warning the user about true in-pack cycles, is that we have no
     opportunity to actually find a new delta (because we realize the
     problem only after the delta-compression phase).

     My test repository is a bad packing of all of the forks of
     torvalds/linux, with 3600 packs. I'm happy to share it if anybody
     wants to see it, but note that it is 11GB.

     The space overhead of the resulting pack in this case is ~3.2%
     (versus a pack generated by the original code, using the static
     pack order). Which is really not that bad, but I'm sure there are
     more pathological cases (i.e., there were on the order of hundreds
     or maybe thousands of cycles that needed to be broken, out of
     about 15 million total objects; but one could imagine the worst
     case as nr_of_objects/2).

  2. break the cycles in check_object(), when we consider whether we
     can reuse the delta. The obvious advantage here (over breaking
     them in the writing phase) is that we can then feed the object
     into the normal delta-compression phase. The question is: how?

  2a. One way to do so is to provide a total ordering over the packs.
      I.e., if "A" is a delta of "B", then allow the delta to be reused
      iff the pack containing "A" sorts before or equal to the pack
      containing "B". The actual ordering doesn't matter for purposes
      of the cycle-breaking, as long as it's consistent. I implemented
      this using the mtime of the packs (breaking ties by their names),
      as that would give us results close to the existing static pack
      order.

      Unsurprisingly, this adds a bit of CPU time (because we have more
      objects that need a fresh delta search), though it's dwarfed by
      the wins you get from the sped-up counting phase (in my tests, 20
      seconds of extra delta search for 39 minutes of "counting
      objects" savings).

      But it doesn't actually help the resulting pack size. It's still
      about 3.2% overhead (it's actually just slightly worse than case
      1). I think what happens is that we throw out a lot of deltas,
      even ones that _aren't_ cycles but just happen to be found in
      packs that are "backwards" in the total order. And then we have
      to do a delta search on them, but of course that can't always
      find a good match.

      I also tried two other variants here. The pack order for the
      cycle-breaking step shouldn't matter, so I wondered what would
      happen if I reversed it. It does indeed produce different
      results: the resulting pack is slightly better, at 2.6% overhead,
      but we spend even more time looking for deltas (IOW, we threw out
      more of them). I also tried bumping the pack window size, under
      the theory that maybe we just aren't finding great deltas for the
      ones we throw out. But it didn't seem to make much difference.

      So I like the simplicity of this idea (which, by the way, was not
      mine, but came from an off-list discussion with Michael). But I
      think it's a dead end, because it throws away too many deltas
      that _could_ be reused.

  2b. Another option is to do real cycle analysis, and break the
      cycles. This is tricky to do in check_object(), because we are
      actually filling out the delta pointers as we go. So I think we
      would have to make two passes: one to come up with a tentative
      list of deltas to reuse, and then one to check them for cycles.

      If done right, the cycle check should be linear in the number of
      objects (it's basically a depth-first search). In fact, I think
      it would look a lot like what write_object() is doing; we'd just
      be moving the logic _before_ the delta-compression phase, instead
      of after.

      This is the one approach I've considered but have not yet
      implemented. So no numbers.
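For concreteness, here is a rough sketch of what that two-pass check
could look like (hypothetical code, not from this series; the names are
invented, and a real version would want an iterative walk to cope with
very long chains):

	/*
	 * Hypothetical sketch of approach (2b); not from the series.  After
	 * check_object() has tentatively decided which on-disk deltas to
	 * reuse, walk each chain once and break any loops *before* the
	 * delta-compression phase, so the broken entries still get a fresh
	 * delta search.  Entries are assumed to start in DFS_NONE (zero).
	 */
	#include <stddef.h>

	enum dfs_state { DFS_NONE = 0, DFS_ACTIVE, DFS_DONE };

	struct object_entry {
		struct object_entry *delta_base; /* tentative reused delta, or NULL */
		enum dfs_state state;
	};

	static void break_delta_cycles(struct object_entry *e)
	{
		if (e->state == DFS_DONE)
			return;               /* chain below is already known acyclic */
		if (e->state == DFS_ACTIVE) {
			e->delta_base = NULL; /* we looped back onto the current chain */
			e->state = DFS_DONE;  /* this entry gets a new delta search */
			return;
		}
		e->state = DFS_ACTIVE;
		if (e->delta_base)
			break_delta_cycles(e->delta_base);
		e->state = DFS_DONE;
	}

	/*
	 * Run once over the whole to-pack list; each entry is visited a
	 * constant number of times, so the pass is linear in the number of
	 * objects.
	 */
	static void break_all_cycles(struct object_entry *entries, size_t nr)
	{
		size_t i;

		for (i = 0; i < nr; i++)
			break_delta_cycles(&entries[i]);
	}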
  3. Pick a consistent pack order up front, and stop changing it
     mid-search. The question, of course, is which order.

     For my test repo, this is actually pretty easy. There's one
     gigantic pack with 15 million objects in it, and then 3600 small
     packs with a handful of objects from pushes. The giant pack is the
     oldest, so the normal "reverse chronological" pack order makes
     counting effectively O(m*n) (packs times objects), because the
     majority of the objects are found in the very last pack we look
     at.

     So an obvious thing to try is just reversing it. And indeed, the
     counting is quite fast then (similar to the MRU numbers). Though
     of course one can come up with a case where the objects are more
     evenly split across the packs, and it would not help (you can come
     up with pathological cases for MRU, too, but they're much less
     likely in practice, because it's really exploiting locality).

     But I was surprised to find that the resulting pack is even worse,
     at 4.5% overhead. I can't really explain that, and I'm starting to
     come to the conclusion that there's a fair bit of luck and black
     magic involved in delta reuse. I would not be at all surprised to
     find a test case where reversing the order actually _improved_
     things. It may be that the numbers I saw in my (2a) tests were not
     because we broke deltas and couldn't find replacements, but simply
     because we get more and better deltas by looking in the smaller,
     newer packs first.

So that's where I'm at. Mostly puzzled, and wondering if any of my
experiments were even valid or showing anything interesting, and not
just somewhat random artifacts of the deltas that happen to be in this
particular test case.

Worse, it may be that looking in the small packs _is_ objectively
better, and we lose that with any kind of pack-ordering change (which
is an inherent part of my speed-up strategy for the counting
phase[1]).

I've yet to implement 2b, true cycle detection before delta
compression. But I have a feeling it will somehow show that my
resulting pack is about 3% worse. :-/

-Peff

[1] So obviously one option is to figure out a way to speed up the
    counting phase without changing the pack order. The only way I can
    think of is to build a master object index, where each entry lists
    all of the packs that a given object can be found in, in their
    correct pack order. That would be fairly easy to implement as a
    hash table, but:

      - it would require a fair bit of memory; the pack .idx files for
        this repo total about 500MB, and we usually get to just mmap
        them. OTOH, the biggest part of that is the sha1s, so we could
        possibly just reference them by pointer into the mmap'd data.
        And you already need well over 500MB just to hold the list of
        objects to be packed, anyway.

      - it's inherently O(nr_of_objects_in_repo) in CPU and memory to
        build the index, whereas some operations are only
        O(nr_of_objects_to_be_packed). In this case it's probably OK
        (we do pay for extra entries for each of the duplicates, but
        it's largely on the order of 15 million objects). But it's a
        big expense if you're just packing a few of the objects.

    So I dunno. I really like the MRU approach if we can salvage it.
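Purely as an illustration of the idea in [1], here is a rough sketch
(hypothetical code; "packed_git_stub", the fixed-size table, and all
other names are invented, not a real git API) of an object-id to
list-of-packs index kept in pack order:

	/*
	 * Hypothetical sketch of the "master object index" from [1]; not
	 * real git code.  One hash-table slot per object id, holding the
	 * packs that contain the object in their preferred (pack) order, so
	 * the counting phase could find the right copy with one lookup
	 * instead of probing every pack.
	 */
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>

	#define OID_RAWSZ 20
	#define TABLE_SIZE (1u << 20)  /* power of two, sized for the repo */

	struct packed_git_stub { const char *name; }; /* stand-in for a pack */

	struct pack_hit {
		struct packed_git_stub *pack;
		uint64_t offset;       /* object's offset within that pack */
		struct pack_hit *next; /* next pack with a copy, in pack order */
	};

	struct index_slot {
		unsigned char oid[OID_RAWSZ];
		struct pack_hit *hits; /* NULL means the slot is unused */
	};

	static struct index_slot table[TABLE_SIZE];

	/* Open-addressed lookup; sha1 bytes are uniform enough to use directly. */
	static struct index_slot *lookup_slot(const unsigned char *oid)
	{
		uint32_t i;

		memcpy(&i, oid, sizeof(i));
		for (i &= TABLE_SIZE - 1; ; i = (i + 1) & (TABLE_SIZE - 1)) {
			if (!table[i].hits || !memcmp(table[i].oid, oid, OID_RAWSZ))
				return &table[i];
		}
	}

	/*
	 * Call once per object per pack, iterating the packs in their
	 * preferred order, so that each per-object list ends up sorted by
	 * that order.
	 */
	static void index_add(const unsigned char *oid,
			      struct packed_git_stub *pack, uint64_t offset)
	{
		struct index_slot *slot = lookup_slot(oid);
		struct pack_hit *hit = malloc(sizeof(*hit));
		struct pack_hit **tail;

		memcpy(slot->oid, oid, OID_RAWSZ);
		hit->pack = pack;
		hit->offset = offset;
		hit->next = NULL;
		for (tail = &slot->hits; *tail; tail = &(*tail)->next)
			; /* append, keeping earlier (preferred) packs first */
		*tail = hit;
	}

	/* Which copy should be reused?  The head of the per-object list. */
	static struct pack_hit *find_preferred_copy(const unsigned char *oid)
	{
		return lookup_slot(oid)->hits;
	}

The memory and build-time concerns above apply directly to a structure
like this: it is one list node per (object, pack) pair, built up front
for every object in the repo.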