From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-11.6 required=3.0 tests=AWL,BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,USER_IN_DEF_DKIM_WL shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 233C220A1E for ; Tue, 11 Dec 2018 01:54:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729268AbeLKBy3 (ORCPT ); Mon, 10 Dec 2018 20:54:29 -0500 Received: from mail-ed1-f53.google.com ([209.85.208.53]:41214 "EHLO mail-ed1-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727540AbeLKBy2 (ORCPT ); Mon, 10 Dec 2018 20:54:28 -0500 Received: by mail-ed1-f53.google.com with SMTP id z28so11176590edi.8 for ; Mon, 10 Dec 2018 17:54:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=xtWy0b3a+EA+XimGMbRCTn0r50fF5u2HUaecJi0h3zU=; b=ufh0hHNGGhh6aDFchVS7RL3k5YouUAvXKKbz1zz193gh1Yh2n7XCzeWBGnP29xSXrQ mvFQ3FLQFNWfsjvV5WYsMGb9T0y+BRbR3oIZZwNryhBRfBQucVOOmRwVX7bN+Yw4HWE+ rqOyU1ylALLqm0nsg54k3+jaSzxM24zcIQ63lT+B1/x2PC+6Rn+f7v66biviHr3m2a/V DOnIE0ouGUNMGxgybeZmLDI0m92MswOfxD/Dxcn51aKRR/ZqEzb4m6K9W5ALcEMfsfdK B9H6ZwRK2y6+lP6TuK25s5TtcKYUdJK6bwtFGR483vhYlsL7wEo0Wg+96bz/I/yvsofz QvKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=xtWy0b3a+EA+XimGMbRCTn0r50fF5u2HUaecJi0h3zU=; b=o1J71foF+FBvvtHP7+mmWdl6jWbTG40AmJIR44ECAaR3dchAymdySw58b4IIhlJRg+ Br2W8rijHFbFDamKlp8gdDLMvwgE/e1ZQ4UkOkMFLSs326cIlfGt2nsksEnpN6Vzxl0/ epycZ5iQYLqiFEJIJF8cAjZSMSMMjDhIOkCol2a6viVjPhAEUbeH1kca1uXgzI1tIrsf 7ZMHH3tqpkQf9L/y8jGk5vCtDpG/sEjHGZ+bzZ8b82I8tKETkbGhwUzr+tx/03ndGdDG fGL/btvbF2C0l4fdYSm/ve4iqf5z7KzLFiEd974En8oWPAmRnxV0pPJiZYuCfb7R98h7 1quQ== X-Gm-Message-State: AA+aEWZK4HRM27F6RELknLN2FmBdfMGcc4SOWnWtJr6AKPNeKkIkS5YS 2JSOgwJNfRdrjGmijRdKs6SYumGBOUOHHS6XRlH7mg== X-Google-Smtp-Source: AFSGD/VC0BTh1ZqJdQ3MhuCLXEE7BxPkuLnsEnFnxzADDscrmH2YApdWV60zjmH3AMIH4ILbQR+a+GmE0o0XW9yRAkk= X-Received: by 2002:a50:8343:: with SMTP id 61mr13552894edh.154.1544493266651; Mon, 10 Dec 2018 17:54:26 -0800 (PST) MIME-Version: 1.0 References: <72b213959171af3bfe4d849b925920ddbfb3d4b7.1544465177.git.gitgitgadget@gmail.com> In-Reply-To: <72b213959171af3bfe4d849b925920ddbfb3d4b7.1544465177.git.gitgitgadget@gmail.com> From: Stefan Beller Date: Mon, 10 Dec 2018 17:54:15 -0800 Message-ID: Subject: Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb To: gitgitgadget@gmail.com Cc: git , Jeff King , Jonathan Nieder , =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= , Junio C Hamano , Derrick Stolee Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget wrote: > > From: Derrick Stolee > > In an environment where the multi-pack-index is useful, it is due > to many pack-files and an inability to repack the object store > into a single pack-file. However, it is likely that many of these > pack-files are rather small, and could be repacked into a slightly > larger pack-file without too much effort. It may also be important > to ensure the object store is highly available and the repack > operation does not interrupt concurrent git commands. > > Introduce a 'repack' verb to 'git multi-pack-index' that takes a > '--batch-size' option. The verb will inspect the multi-pack-index > for referenced pack-files whose size is smaller than the batch > size, until collecting a list of pack-files whose sizes sum to > larger than the batch size. Then, a new pack-file will be created > containing the objects from those pack-files that are referenced > by the multi-pack-index. The resulting pack is likely to actually > be smaller than the batch size due to compression and the fact > that there may be objects in the pack-files that have duplicate > copies in other pack-files. This highlights an optimization problem: How do we pick the batches optimally? Ideally we'd pick packs that have an overlap of many same objects to dedup them completely, next best would be to have objects that are very similar, such that they delta very well. (Assuming that the sum of the resulting pack sizes is a metric we'd want to optimize for eventually) For now it seems we just take a random cut of "small" packs. > > The current change introduces the command-line arguments, and we > add a test that ensures we parse these options properly. Since > we specify a small batch size, we will guarantee that future > implementations do not change the list of pack-files. This is another clever trick that makes the test correct despite no implementation yet. :-) > +repack:: > + When given as the verb, collect a batch of pack-files whose > + size are all at most the size given by --batch-size, okay. > but > + whose sizes sum to larger than --batch-size. or less if there are not enough packs. Now it would be interesting if we can specify an upper bound (e.g. my laptop has 8GB of ram, can I use this incremental repacking optimally by telling git to make batches of at most 7.5G), the answer seems to follow... > The batch is > + selected by greedily adding small pack-files starting with > + the oldest pack-files that fit the size. Create a new pack- > + file containing the objects the multi-pack-index indexes > + into thos pack-files, and rewrite the multi-pack-index to those > + contain that pack-file. A later run of 'git multi-pack-index > + expire' will delete the pack-files that were part of this > + batch. ... but the optimization seems to be rather about getting rid of the oldest packs first instead of getting as close to the batch size. (e.g. another way to look at this is to "find the permutation of all packs that (each are smaller than batch size), but in sum are the smallest threshold above the batch size). I guess that the strategy of picking the oldest is just easiest to implement and should be sufficient for now, but memory bounds might be interesting to keep in mind, just as the optimal packing from above.