From: Linus Torvalds
Date: Wed, 1 Mar 2017 16:43:24 -0800
Subject: Re: Delta compression not so effective
To: Marius Storm-Olsen
Cc: Git Mailing List

On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen wrote:
>
> No, the list of git verify-objects in the previous post was from the
> bottom of the sorted list, so those are the largest blobs, ~249MB..

.. so with a 6GB window, you should easily still have 20+ objects. Not a
huge window, but it should find some deltas.

But a smaller window - _together_ with a suboptimal sorting choice -
could then result in a lack of successful delta matches.

> So, this repo must be knocking several parts of Git's insides. I was
> curious about why it was so slow on the writing objects part, since the
> whole repo is on a 4x RAID 5, 7k spindles. Now, they are not SSDs, sure,
> but the thing has ~400MB/s continuous throughput available.
>
> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU
> single thread (since the "write objects" stage is single threaded,
> obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just *writing*
much more data, you're compressing all that data with zlib too.

So even with a fast disk subsystem, you won't be able to saturate the
disk, simply because the compression will be slower (and single-threaded).

> Filenames are fairly static, and the bulk of the 6000 biggest
> non-delta'ed blobs are the same DLLs (multiple of them)

I think the first thing you should test is to repack with fewer threads,
and a bigger pack window. Do something like

    -c pack.threads=4 --window-memory=30g

instead. Just to see if that starts finding deltas.
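As a concrete invocation, that would be something along the lines of the
following (just a sketch - the -a/-d/-f flags are the usual "repack
everything into one pack, delete the old packs, recompute deltas"
options, so adjust them to whatever repack command you were already
running):

    git -c pack.threads=4 repack -a -d -f --window-memory=30g

The -c sets pack.threads only for that one command, and repack passes
--window-memory straight through to pack-objects.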
> Right, now on this machine, I really didn't notice much difference
> between standard zlib level and doing -9. The 203GB version was actually
> with zlib=9.

Don't. zlib has *horrible* scaling with higher compression levels. It
doesn't actually improve the end result very much, and it makes things
*much* slower.

zlib was a reasonable choice when git started - well-known, stable, easy
to use. But realistically it's a relatively horrible choice today, just
because there are better alternatives now.

>> How sensitive is your material? Could you make a smaller repo with
>> some of the blobs that still show the symptoms? I don't think I want
>> to download 206GB of data even if my internet access is good.
>
> Pretty sensitive, and not sure how I can reproduce this reasonably well.
> However, I can easily recompile git with any recommended
> instrumentation/printfs, if you have any suggestions of good places to
> start? If anyone has good file/line numbers, I'll give that a go, and
> report back?

So the first thing you might want to do is to just print out the objects
after sorting them, and before it starts trying to find deltas.

See prepare_pack() in builtin/pack-objects.c, where it does something
like this:

        if (nr_deltas && n > 1) {
                unsigned nr_done = 0;
                if (progress)
                        progress_state = start_progress(_("Compressing objects"),
                                                        nr_deltas);
                QSORT(delta_list, n, type_size_sort);
                ll_find_deltas(delta_list, n, window+1, depth, &nr_done);
                stop_progress(&progress_state);

and notice that QSORT() line: that's what sorts the objects.

You can do something like

        for (i = 0; i < n; i++)
                show_object_entry_details(delta_list[i]);

right after that QSORT(), and make that print out the object hash,
filename hash, and size (we don't have the filename that the object was
associated with any more at that stage - they take too much space); a
rough sketch of such a helper is further down.

Save off that array for off-line processing: when you have the object
hash, you can see what the contents are, and match it up with the file
in the git history using something like

    git log --oneline --raw -R --abbrev=40

which shows you the log, but also the "diff" in the form of "this
filename changed from SHA1 to SHA1", so you can match up the object
hashes with where they are in the tree (and where they are in history).

So then you could try to figure out if that type_size_sort() heuristic
is just particularly horrible for you.
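For concreteness, here is a rough sketch of what that
show_object_entry_details() helper could look like. It doesn't exist in
git, so you'd add it to builtin/pack-objects.c yourself; the struct
object_entry field names used below (idx.sha1, hash, size, type) should
match what that file defines, but double-check them against your tree:

        /*
         * Debugging-only helper: dump one delta_list entry so the
         * post-QSORT ordering can be inspected off-line.
         */
        static void show_object_entry_details(struct object_entry *entry)
        {
                fprintf(stderr, "%s %08x %lu %d\n",
                        sha1_to_hex(entry->idx.sha1),  /* object hash */
                        (unsigned)entry->hash,         /* filename hint hash */
                        entry->size,                   /* uncompressed size */
                        (int)entry->type);
        }

Printing to stderr keeps it out of the normal pack-objects output;
redirect it to a file and join it against the git log --raw output above
by object hash.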
In fact, if your data is not *so* sensitive, and you're ok with making
the one-line commit logs and the filenames public, you could make just
those things available, and maybe I'll have time to look at it.

I'm in the middle of the kernel merge window, but I'm in the last
stretch, and because of the SHA1 thing I've been looking at git lately.
No promises, though.

               Linus