From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-4.0 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 6CA9E1F453 for ; Fri, 15 Feb 2019 20:22:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728523AbfBOUWQ (ORCPT ); Fri, 15 Feb 2019 15:22:16 -0500 Received: from mail-qk1-f194.google.com ([209.85.222.194]:33434 "EHLO mail-qk1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725782AbfBOUWQ (ORCPT ); Fri, 15 Feb 2019 15:22:16 -0500 Received: by mail-qk1-f194.google.com with SMTP id x9so6520901qkf.0 for ; Fri, 15 Feb 2019 12:22:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=NBX2DOKIXsBtY9Qc6j4F2aPcHMuyZCOV17bO1CEAaV4=; b=V05Lnhpxl3PtJ2FVpGV5fl6ey26yuZbS9y+383PxKIz0F6BGyVFG+GXWuCdznR8Er4 BGA6fyBjIcHhilgxIVY3BD/wA5pEPG45rNQ3qMbeM2RvpIJFVxHpBWPSp/IfpZEIQKqH K9/sIiTHv1lrJmToSOOM4LQLatWwbob71QZcXn0/Oslt7rtTQNMpA3F5VQ5mbDnLNqu7 LPRDXVvTDi45PyiqjrPMC+x66kHzuy98j/J0a4XGYX7dp7z2h2CpsBG2pbVVm2W8uJiD Kcx7sn+/dfhsVVwbYleOQ4a49qrWP/UiHebTCPbF7vW8544SEPG846eAWr5SpZCkJcfn CFpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=NBX2DOKIXsBtY9Qc6j4F2aPcHMuyZCOV17bO1CEAaV4=; b=t/u/mZrieHn0pt5UtYTXSHKB95O+AC8btHDF7AS/mAx6F9lD68NqKSGc+Kg7aC2ium zrY4i5tNIgTvQ5vMBv7qhfNgLOl+1BpEdwajnUakN/+mvyw3hcpSu6JLmyBaw7fdc+ii r+8kse8Nmsat7Q362dFFCSb3SQNVr8UCS+xOSonRCkOAhzC6Te3f1DDwO0XWv0LG74yP sruOrzE6OBDKkAQJaKaQ02w8j400+PzHEP6dGxIZwbjWeK9lKTLjt3x1+uBKYuaM9tfX C7gCe1rGVSaOEO7kZ8LcvAs7JcmIFr8kMxNUFwvkJcvWrntY0BSGzJ/fZidBoLB+VLj/ BVTQ== X-Gm-Message-State: AHQUAubAOuBU1FAM+sfaw9wrtW5fplcPx18EgQUnKCF1mAW1krBsFC5p H1UFn1c0x6F8W4JLP8KjCmA= X-Google-Smtp-Source: AHgI3IaWSAYblwsgi4XGbMtV1rJY5sW1QRiycyNZNC3hsqceP4ua1rDdXqNJEpCPKG5Mz+7qruIwhQ== X-Received: by 2002:a37:628a:: with SMTP id w132mr8464839qkb.60.1550262134569; Fri, 15 Feb 2019 12:22:14 -0800 (PST) Received: from [192.168.1.97] (70-33-148-227.unassigned.ntelos.net. [70.33.148.227]) by smtp.gmail.com with ESMTPSA id m12sm3620351qke.17.2019.02.15.12.22.13 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 15 Feb 2019 12:22:13 -0800 (PST) Subject: Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4 To: Duy Nguyen , =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?UTF-8?Q?mason?= Cc: Git Mailing List , Junio C Hamano References: <20190213120807.25326-1-pclouds@gmail.com> <87bm3ek7qr.fsf@evledraar.gmail.com> From: Ben Peart Message-ID: <11875ebb-5a40-4c87-dce7-b337cc922100@gmail.com> Date: Fri, 15 Feb 2019 15:22:12 -0500 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 2/14/2019 5:14 AM, Duy Nguyen wrote: > On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason > wrote: >>> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the >>> same most of the time. ctime should often be the same (or differs just >>> slightly). And sometimes mtime is the same as well. st_ino is also >>> always zero on Windows. We're storing a lot of duplicate values. >>> >>> Index v5 handles this >> >> This looks really promising. > > I was going to reply to Junio. But it turns out I underestimated > "varint" encoding overhead and it increases read time too much. I > might get back and try some optimization when I'm bored, but until > then this is yet another failed experiment. > >>> As a result of this, v5 reduces file size from 30% (git.git) to >>> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file >>> size is reduced by 63%! A 8.4MB index file is _almost_ acceptable. >>> Just for kicks, I tried this out on a couple of repos I have handy. files version index size %savings 200k 2 25,033,758 0.00% 3 25,033,758 0.00% 4 15,269,923 39.00% 5 9,759,844 61.01% 3m 2 446,123,848 0.00% 3 446,123,848 0.00% 4 249,631,640 44.04% 5 82,147,981 81.59% The 81% savings is very impressive. I didn't measure performance but not writing out an extra 167MB to disk has to help. I'm definitely also interested in your 'sparse index' format ideas as in our 3M repos, there are typically only a few thousand that don't have the skip-worktree bit set. I'm not sure if that is the same 'sparse' you had in mind but it would sure be nice! I've also contemplated multi-threading the index write code path. My thought was in the primary thread to allocate a buffer and when it is full have a background thread compute the SHA and write it to disk while the primary thread fills the next buffer. I'm not sure how much it will buy us as I don't know the relative cost of computing the SHA/writing to disk vs filling the buffer. I've suspected the filling the buffer thread would end up blocked on the background thread most of the time which is why I haven't tried it yet. >>> Of course we trade off storage with cpu. We now need to spend more >>> cycles writing or even reading (but still plenty fast compared to >>> zlib). For reading, I'm counting on multi thread to hide away all this >>> even if it becomes significant. >> >> This would be a bigger change, but have we/you ever done a POC >> experiment to see how much of this time is eaten up by zlib that >> wouldn't be eaten up with some of the newer "fast but good enough" >> compression algorithms, e.g. Snappy and Zstandard? > > I'm quite sure I tried zlib at some point, the only lasting impression > I have is "not good enough". Other algorithms might improve a bit, > perhaps on the uncompress/read side, but I find it unlikely we could > reasonably compress like a hundred megabytes in a few dozen > milliseconds (a quick google says Snappy compresses 250MB/s, so about > 400ms per 100MB, too long). Splitting the files and compressing in > parallel might help. But I will probably focus on "sparse index" > approach before going that direction. >