From: Duy Nguyen
Date: Sat, 27 Oct 2018 16:04:32 +0200
Subject: Re: [RFC PATCH] index-pack: improve performance on NFS
To: Jeff King
Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, gerardu@amazon.com,
    Git Mailing List, Christian Couder

On Sat, Oct 27, 2018 at 11:34 AM Jeff King wrote:
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files.
> And indeed, this whole scheme is just a specialized form of that: it's
> a flag to say "hey, we do not have any objects yet, so do not bother
> looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).

And we do hit this on normal git-fetch. The larger the received pack is
(e.g. if you don't fetch often), the more stat() calls we make, so your
approach is definitely better.

> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).

We don't care about a separate process adding more loose objects while
index-pack is running, do we? I'm guessing we don't, but just to double
check...

> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>  	 */
>  	struct packed_git *all_packs;
>
> +	/*
> +	 * A set of all loose objects we have. This probably ought to be split
> +	 * into a set of 256 caches so that we can fault in one directory at a
> +	 * time.
> +	 */
> +	struct oid_array loose_cache;
> +	enum {
> +		LOOSE_CACHE_UNFILLED = 0,
> +		LOOSE_CACHE_INVALID,
> +		LOOSE_CACHE_VALID
> +	} loose_cache_status;
> +
>  	/*
>  	 * A fast, rough count of the number of objects in the repository.
>  	 * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>  	r->objects->approximate_object_count_valid = 0;
>  	r->objects->packed_git_initialized = 0;
>  	prepare_packed_git(r);
> +	oid_array_clear(&r->objects->loose_cache);
> +	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */

Yes, perhaps with the gc.auto config value (multiplied by 256) as the
cut point. If it's too big, maybe just go with a bloom filter. For this
particular case we expect something like 99% of calls to miss.
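Something along these lines, maybe. This is only a rough standalone
sketch, not a patch: the loose_bloom_* names and the fixed sizes are
made up here, and since object ids are already uniformly distributed we
can just slice the hash bytes into indexes instead of hashing again.

#include <stdint.h>
#include <string.h>

/* hypothetical sketch: fixed-size bloom filter over loose object ids */
#define LOOSE_BLOOM_BITS (1u << 20)	/* ~1M bits, 128kB of memory */
#define LOOSE_BLOOM_HASHES 4

static uint8_t loose_bloom[LOOSE_BLOOM_BITS / 8];

/* derive the i-th bloom index straight from the object id bytes */
static uint32_t loose_bloom_index(const unsigned char *hash, int i)
{
	uint32_t v;
	memcpy(&v, hash + 4 * i, sizeof(v));
	return v % LOOSE_BLOOM_BITS;
}

static void loose_bloom_add(const unsigned char *hash)
{
	int i;
	for (i = 0; i < LOOSE_BLOOM_HASHES; i++) {
		uint32_t idx = loose_bloom_index(hash, i);
		loose_bloom[idx / 8] |= 1 << (idx % 8);
	}
}

/* 0 = definitely not loose, 1 = maybe loose (fall back to stat) */
static int loose_bloom_maybe_has(const unsigned char *hash)
{
	int i;
	for (i = 0; i < LOOSE_BLOOM_HASHES; i++) {
		uint32_t idx = loose_bloom_index(hash, i);
		if (!(loose_bloom[idx / 8] & (1 << (idx % 8))))
			return 0;
	}
	return 1;
}

With 4 probes into ~1M bits the false-positive rate stays roughly below
0.1% up to about 50k loose objects, so nearly every true miss would skip
the stat() entirely, and the memory cost is fixed.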
> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +			    const char *path,
> +			    void *data)
> +{
> +	struct oid_array *cache = data;
> +
> +	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +		return -1;
> +
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +			   struct object_id *oid)
> +{
> +	struct oid_array *cache = &r->loose_cache;
> +
> +	if (r->loose_cache_status == LOOSE_CACHE_UNFILLED) {
> +		if (for_each_loose_object(fill_loose_cache, cache, 0) < 0)
> +			r->loose_cache_status = LOOSE_CACHE_INVALID;
> +		else
> +			r->loose_cache_status = LOOSE_CACHE_VALID;
> +	}
> +
> +	if (r->loose_cache_status == LOOSE_CACHE_INVALID)
> +		return -1;
> +
> +	return oid_array_lookup(cache, oid) >= 0;
> +}
> +
>  static int sha1_loose_object_info(struct repository *r,
>  				  const unsigned char *sha1,
>  				  struct object_info *oi, int flags)
> @@ -1198,6 +1232,19 @@ static int sha1_loose_object_info(struct repository *r,
>  	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
>  		const char *path;
>  		struct stat st;
> +		if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK)) {
> +			struct object_id oid;
> +			hashcpy(oid.hash, sha1);
> +			switch (quick_has_loose(r->objects, &oid)) {
> +			case 0:
> +				return -1; /* missing: error */
> +			case 1:
> +				return 0; /* have: 0 == success */
> +			default:
> +				/* unknown; fall back to stat */
> +				break;
> +			}
> +		}
>  		if (stat_sha1_file(r, sha1, &st, &path) < 0)
>  			return -1;
>  		if (oi->disk_sizep)
>
> That's mostly untested, but it might be enough to run some timing tests
> with. I think if we want to pursue this, we'd want to address the bits I
> mentioned in the comments, and look at unifying this with the loose
> cache from cc817ca3ef (which if I had remembered we added, probably
> would have saved some time writing the above ;) ).
>
> -Peff
--
Duy