From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,T_RP_MATCHES_RCVD shortcircuit=no autolearn=no autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 8DF381F576 for ; Wed, 28 Feb 2018 11:25:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752328AbeB1LY7 (ORCPT ); Wed, 28 Feb 2018 06:24:59 -0500 Received: from mail-oi0-f50.google.com ([209.85.218.50]:36056 "EHLO mail-oi0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752124AbeB1LY6 (ORCPT ); Wed, 28 Feb 2018 06:24:58 -0500 Received: by mail-oi0-f50.google.com with SMTP id u73so1498839oie.3 for ; Wed, 28 Feb 2018 03:24:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=QULPIrFZ+ilTfzc9jjR62mxJgpS1rW7TOh1vZObjuIY=; b=UYHoEN9wZ71YIR09l+GEh6+ZUBptJDz4U1wjMIq4gA2ZczTnGaE2WqKR3YaAkKDGOr HFjwMQnrthtI1EpHds3a/5qj3G6hPNx+oFVvBQtll9Wnq0Cea1jcQVICvTz/ozXoz2HV 5cPkTdalpXw2Wfnu5O7Rrefj34zTO9gDTmFliayaGTaM2n2DLJ7edOjAJd+7/RKGo0eB Th3Jy98/r433hi26g0TFYpHkTmTE9g4RgJ8rTtiXdop0bxvJ0TUYZNof8iHm67VMb9Gr CLWRZjcS13q2uhPivY8XWbUCvMWKj11OC9bvCzpfJ/BbvmFdkXYbNUXl6aOu3ViNOY77 p2QQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=QULPIrFZ+ilTfzc9jjR62mxJgpS1rW7TOh1vZObjuIY=; b=JP3AHV1KrN58TEHSv2fspwobgRmNMvsRUMX8EFRDDLB1aduN9UKrNgaAdGiIjW3F9e rQhMfDhEAloOIHMsoPkLBW4FPkPEsSgC0Tgm2Kf8BZIPLafSCfNMJ7us7n1hRgM9esKu fH0VPODOXnzQUxr1cC3b6bHXiC5obSy9HL6Rmy8cd2kZvyhi/eGD0ikvfOahkJ+oM5DV JaB3bewGddBeOUgJzDp+tnP6MPmN09INsA9VZSeL3CmmMHxpccUsjtQLDoFHo8Opjuzx LttMrlY8ZoJfyjvvaOYMl7OYZ+rhdhC91Nz14zzu92VlJ+mdPtF0dprGwjbOJUdOI9bG Fnug== X-Gm-Message-State: APf1xPDiwcinLiS2vFjouLHnOYA7ybrBUBm2nxcwoJGPfAds/KrVms5X GrsvU7w50dqo3m5lQCuHXBVIyatJz7v/dGpRGUk= X-Google-Smtp-Source: AG47ELuJD2EpspxP/EaazAXufeJgqWFtY8kND4+0OmICtSjxwxWFOfj341UNPTBvdxwF1hQI4POWNTkKinHbuVKCIEk= X-Received: by 10.202.206.71 with SMTP id e68mr11634613oig.34.1519817097646; Wed, 28 Feb 2018 03:24:57 -0800 (PST) MIME-Version: 1.0 Received: by 10.74.25.140 with HTTP; Wed, 28 Feb 2018 03:24:27 -0800 (PST) In-Reply-To: <20180228111121.GA8925@sigill.intra.peff.net> References: <20180228092722.GA25627@ash> <20180228101757.GA11803@sigill.intra.peff.net> <20180228111121.GA8925@sigill.intra.peff.net> From: Duy Nguyen Date: Wed, 28 Feb 2018 18:24:27 +0700 Message-ID: Subject: Re: Reduce pack-objects memory footprint? To: Jeff King Cc: Git Mailing List Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Wed, Feb 28, 2018 at 6:11 PM, Jeff King wrote: >> > The torvalds/linux fork network has ~23 million objects, >> > so it's probably 7-8 GB of book-keeping. Which is gross, but 64GB in a >> > server isn't uncommon these days. >> >> I wonder if we could just do book keeping for some but not all objects >> because all objects simply do not scale. Say we have a big pack of >> many GBs, could we keep the 80% of its bottom untouched, register the >> top 20% (mostly non-blobs, and some more blobs as delta base) for >> repack? We copy the bottom part to the new pack byte-by-byte, then >> pack-objects rebuilds the top part with objects from other sources. > > Yes, though I think it would take a fair bit of surgery to do > internally. And some features (like bitmap generation) just wouldn't > work at all. > > I suspect you could simulate it, though, by just packing your subset > with pack-objects (feeding it directly without using "--revs") and then > catting the resulting packfiles together with a fixed-up header. > > At one point I played with a "fast pack" that would just cat packfiles > together. My goal was to make cases with 10,000 packs workable by > creating one lousy pack, and then repacking that lousy pack with a > "real" repack. In the end I abandoned it in favor of fixing the > performance problems from trying to make a real pack of 10,000 packs. :) > > But I might be able to dig it up if you want to experiment in that > direction. Naah it's ok. I'll go similar direction, but I'd repack those pack files too except the big one. Let's see how it turns out. >> They are 32 bytes per entry, so it should take less than object_entry. >> I briefly wondered if we should fall back to external rev-list too, >> just to free that memory. >> >> So about 200 MB for those objects (or maybe more for commits). Add 256 >> MB delta cache on top, it's still a bit far from 1.7G. There's >> something I'm still missing. > > Are you looking at RSS or heap? Keep in mind that you're mmap-ing what's > probably a 1GB packfile on disk. If you're under memory pressure that > won't all stay resident, but some of it will be counted in RSS. Interesting. It was RSS. >> Pity we can't do the same for 'struct object'. Most of the time we >> have a giant .idx file with most hashes. We could look up in both >> places: the hash table in object.c, and the idx file, to find an >> object. Then those objects that are associated with .idx file will not >> need "oid" field (needed to as key for the hash table). But I see no >> way to make that change. > > Yeah, that would be pretty invasive, I think. I also wonder if it would > perform worse due to cache effects. It should be better because of cache effects, I think. I mean, hash map is the least cache friendly lookup. Moving most objects out of the hash table shrinks it, which is even nicer to cache. But we also lose O(1) when we do binary search on .idx file (after failing to find the same object in the hash table) -- Duy