Date: Sun, 5 Aug 2018 10:57:18 +0200
From: Duy Nguyen
To: Ben Peart
Cc: Ben Peart, Git Mailing List, Junio C Hamano, Eric Sunshine
Subject: Re: [PATCH v2] checkout: optimize "git checkout -b "
Message-ID: <20180805085718.GA27625@duynguyen.home>
References: <20180724180122.29212-1-benpeart@microsoft.com>
 <20180731163909.19004-1-benpeart@microsoft.com>
 <3900401c-4d7b-233c-2098-9771a06ec0dd@gmail.com>
In-Reply-To: <3900401c-4d7b-233c-2098-9771a06ec0dd@gmail.com>
X-Mailing-List: git@vger.kernel.org

On Thu, Aug 02, 2018 at 02:02:00PM -0400, Ben Peart wrote:
> > And if all go according to plan, there will be no changes made in the
> > index (by either 2-way merge or sparse checkout stuff) we should be
> > able to just skip writing down the index, if we haven't done that
> > already.
> >
>
> That would be great as writing the index is 8.4% of the time spent.

And that is of course not good.
Avoiding writing the index at all is nice and all, but we should not
take that much time writing it (what about other commands?). I have
another idea to reduce index size, which means cheaper trailer hash
calculation, but this may also be a direction you could look into.

I notice that we have a lot of duplicate data in the index. For
example, most files would have the same ctime and even mtime (at least
the seconds part). Group id, device id and file mode should also be
the same in the normal case. I guess on Windows, inode is always the
same as well. So why store (and hash) them repeatedly?

This patch writes "index version 5" that only writes those fields out
if they are different from the previous entry (we already do this for
path name in v4).

On webkit repo (275k files), writing time can be reduced from 0.314s
(v2) to 0.206s (v5). File size of all versions:

 9.8M Aug  5 10:27 .git/index.gz
  38M Aug  5 09:54 .git/index-v2
  22M Aug  5 10:26 .git/index-v4
  15M Aug  5 10:34 .git/index-v5

As you can see there's still room to compress, but I don't think we
want to go full deflate mode because compression time goes up.

Another option I think you could consider is not storing the full
hash. As long as the shortened hash is still unique, we can expand to
the full hash at read time. Of course you can't go too short (I'm
thinking half the current hash size is a reasonable trade-off) and you
need to take into account how much time will be lost looking up and
expanding these short hashes at read time.
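To make the "only write fields that differ from the previous entry"
idea concrete, here is a minimal sketch of the round trip on a toy
record. It is not the patch's code: struct toy_entry, toy_encode() and
toy_decode() are hypothetical names, only four fields are modelled, and
the reading side is an assumption, since the patch below only
implements writing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the stat fields of an index entry. */
struct toy_entry {
	uint32_t ctime_sec;
	uint32_t mtime_sec;
	uint32_t dev;
	uint32_t mode;
};

/* Store v in network byte order; returns the number of bytes written. */
static size_t put_u32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return 4;
}

/* Load a 32-bit value stored in network byte order. */
static uint32_t get_u32(const unsigned char *in)
{
	return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/*
 * Write one mask byte, then only the fields of cur that differ from prev.
 * out must have room for 17 bytes (1 mask byte + 4 fields of 4 bytes).
 * Returns the number of bytes written.
 */
static size_t toy_encode(unsigned char *out, const struct toy_entry *cur,
			 const struct toy_entry *prev)
{
	unsigned char mask = 0;
	size_t n = 1; /* out[0] is reserved for the mask */

	assert(out && cur && prev);
	mask |= (cur->ctime_sec != prev->ctime_sec) << 0;
	mask |= (cur->mtime_sec != prev->mtime_sec) << 1;
	mask |= (cur->dev != prev->dev) << 2;
	mask |= (cur->mode != prev->mode) << 3;

	out[0] = mask;
	if (mask & (1 << 0)) n += put_u32(out + n, cur->ctime_sec);
	if (mask & (1 << 1)) n += put_u32(out + n, cur->mtime_sec);
	if (mask & (1 << 2)) n += put_u32(out + n, cur->dev);
	if (mask & (1 << 3)) n += put_u32(out + n, cur->mode);
	return n;
}

/*
 * Reverse of toy_encode(): start from a copy of prev and overwrite the
 * fields flagged in the mask. Returns the number of bytes consumed.
 */
static size_t toy_decode(const unsigned char *in, struct toy_entry *cur,
			 const struct toy_entry *prev)
{
	unsigned char mask = in[0];
	size_t n = 1;

	*cur = *prev;
	if (mask & (1 << 0)) { cur->ctime_sec = get_u32(in + n); n += 4; }
	if (mask & (1 << 1)) { cur->mtime_sec = get_u32(in + n); n += 4; }
	if (mask & (1 << 2)) { cur->dev = get_u32(in + n); n += 4; }
	if (mask & (1 << 3)) { cur->mode = get_u32(in + n); n += 4; }
	return n;
}
```

In this sketch an entry identical to its predecessor costs a single
byte, and one that differs only in mtime costs five. The first entry
has no predecessor; the patch below handles that by writing entry 0 in
the existing on-disk layout, while a toy caller could simply pass a
zeroed prev.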
-- 8< --
diff --git a/cache.h b/cache.h
index e6f7ee4b64..1d6bc7d122 100644
--- a/cache.h
+++ b/cache.h
@@ -129,7 +129,7 @@ struct cache_header {
 };
 
 #define INDEX_FORMAT_LB 2
-#define INDEX_FORMAT_UB 4
+#define INDEX_FORMAT_UB 5
 
 /*
  * The "cache_time" is just the low 32 bits of the
diff --git a/read-cache.c b/read-cache.c
index b0b5df5de7..9bb095d6b2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2002,7 +2002,7 @@ static int ce_write_flush(git_hash_ctx *context, int fd)
 	return 0;
 }
 
-static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
+static int ce_write(git_hash_ctx *context, int fd, const void *data, unsigned int len)
 {
 	while (len) {
 		unsigned int buffered = write_buffer_len;
@@ -2024,6 +2024,18 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 	return 0;
 }
 
+static int ce_write_u32(git_hash_ctx *context, int fd, uint32_t data)
+{
+	data = htonl(data);
+	return ce_write(context, fd, &data, sizeof(data));
+}
+
+static int ce_write_u16(git_hash_ctx *context, int fd, uint16_t data)
+{
+	data = htons(data);
+	return ce_write(context, fd, &data, sizeof(data));
+}
+
 static int write_index_ext_header(git_hash_ctx *context, int fd,
 				  unsigned int ext, unsigned int sz)
 {
@@ -2133,8 +2145,46 @@ static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
+#define ONDISK_MASK (CE_STAGEMASK | CE_INTENT_TO_ADD | CE_SKIP_WORKTREE)
+
+static int ce_write_entry_v5(git_hash_ctx *c, int fd,
+			     const struct cache_entry *ce,
+			     const struct cache_entry *pce)
+{
+	uint8_t mask = 0;
+	const struct stat_data *st1 = &ce->ce_stat_data;
+	const struct stat_data *st2 = &pce->ce_stat_data;
+	//int ret = 0;
+
+	mask |= (st1->sd_ctime.sec != st2->sd_ctime.sec) << 0;
+	mask |= (st1->sd_mtime.sec != st2->sd_mtime.sec) << 1;
+	mask |= (st1->sd_dev != st2->sd_dev) << 2;
+	mask |= (st1->sd_ino != st2->sd_ino) << 3;
+	mask |= (ce->ce_mode != pce->ce_mode) << 4;
+	mask |= (st1->sd_uid != st2->sd_uid) << 5;
+	mask |= (st1->sd_gid != st2->sd_gid) << 6;
+	mask |= ((ce->ce_flags & ONDISK_MASK) != (pce->ce_flags & ONDISK_MASK)) << 7;
+
+	ce_write(c, fd, &mask, 1);
+	ce_write(c, fd, ce->oid.hash, GIT_SHA1_RAWSZ);
+	ce_write_u32(c, fd, st1->sd_ctime.nsec);
+	ce_write_u32(c, fd, st1->sd_mtime.nsec);
+	ce_write_u32(c, fd, st1->sd_size);
+	if (mask & (1 << 0)) ce_write_u32(c, fd, st1->sd_ctime.sec);
+	if (mask & (1 << 1)) ce_write_u32(c, fd, st1->sd_mtime.sec);
+	if (mask & (1 << 2)) ce_write_u32(c, fd, st1->sd_dev);
+	if (mask & (1 << 3)) ce_write_u32(c, fd, st1->sd_ino);
+	if (mask & (1 << 4)) ce_write_u16(c, fd, ce->ce_mode);
+	if (mask & (1 << 5)) ce_write_u32(c, fd, st1->sd_uid);
+	if (mask & (1 << 6)) ce_write_u32(c, fd, st1->sd_gid);
+	if (mask & (1 << 7)) ce_write_u32(c, fd, ce->ce_flags & ONDISK_MASK);
+	return 0;
+}
+
 static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
-			  struct strbuf *previous_name, struct ondisk_cache_entry *ondisk)
+			  const struct cache_entry *pce,
+			  struct strbuf *previous_name,
+			  struct ondisk_cache_entry *ondisk)
 {
 	int size;
 	int result;
@@ -2173,8 +2223,12 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		to_remove = previous_name->len - common;
 		prefix_size = encode_varint(to_remove, to_remove_vi);
 
-		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
+		if (!pce) {
+			copy_cache_entry_to_ondisk(ondisk, ce);
+			result = ce_write(c, fd, ondisk, size);
+		} else {
+			result = ce_write_entry_v5(c, fd, ce, pce);
+		}
 		if (!result)
 			result = ce_write(c, fd, to_remove_vi, prefix_size);
 		if (!result)
@@ -2313,7 +2367,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
 
-	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
+	previous_name = (hdr_version >= 4) ? &previous_name_buf : NULL;
 
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
@@ -2334,7 +2388,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 			drop_cache_tree = 1;
 		}
-		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
+		if (ce_write_entry(&c, newfd, ce,
+				   hdr_version >= 5 && i > 0 ? cache[i - 1] : NULL,
+				   previous_name,
+				   (struct ondisk_cache_entry *)&ondisk) < 0)
 			err = -1;
 
 		if (err)
-- 8< --
--
Duy