Date: Wed, 2 Sep 2020 17:38:05 -0400
From: Konstantin Ryabitsev
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: message bloat over time...
Message-ID: <20200902213805.5kth3pkyxl2owbmg@chatter.i7.local>
References: <20200902190525.GA11126@dcvr>
In-Reply-To: <20200902190525.GA11126@dcvr>

On Wed, Sep 02, 2020 at 07:05:25PM +0000, Eric Wong wrote:
> I've been indexing and reindexing a local mirror of
> https://lore.kernel.org/lkml a bit, and it's kinda depressing to
> see newer messages being more and more bloated even on a
> plain-text-only mailing list :<
>
> The first column ("$X.git" is the epoch number, older epochs
> are lower-numbered: "0.git" is oldest, "8.git" (not shown) is
> the newest. 8.git is omitted since it's still in-progress,
> each epoch is capped at roughly ~1.1G of packed git storage.
>
> The last column is the number of messages in that epoch,
> so fewer messages fit in each epoch:
>
> 7.git counting 17d7e25e3e862d5d99182557bb723374230a8497 ... 312754
> 6.git counting bc9b3c196d0fc92a520e9ad4f92c4d3c1db1943f ... 346017
> 5.git counting 31ed379430c456f90bdd172b223020c0e6d7cb8d ... 379561

I'm not sure it's quite a fair comparison between 4 and 5, since the
initial import was done from email sources that were heavily sanitized
for headers -- both for privacy and for size. Everything we've been
receiving since then carries untouched headers, which includes entire
Received lines and all the DKIM/DMARC/SPF checking junk.

> 4.git counting 88294f6d487193f5984791ee81213a25130d0559 ... 416015
> 3.git counting 93d9eace2721494d8457c7f5f6de803c0d648172 ... 453851
> 2.git counting d48078ceeec1f51313253a56ed3ba0eae7fde909 ... 455366
> 1.git counting 6b67b9f5e0cd82d3c734e6cdc44c1f722ab6fb6a ... 475671
> 0.git counting b67bf7f62c8125d67461cc6e7d1736ddc8844a18 ... 570488
>
> So yeah, old epochs could fit more messages because messages
> were smaller back then...

I've considered doing some header stripping, but I've opted to preserve
them for provenance/authenticity reasons. I may still change my mind at
some point. :)

-K
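
For anyone curious how much of a given message is transport headers, here
is a minimal sketch using Python's standard email module. It is not what
lore.kernel.org or public-inbox does; the function names and the list of
header prefixes are assumptions made for illustration, and which headers
count as "junk" is a judgment call.

```python
from email import message_from_bytes, policy

# Assumed, illustrative list of header-name prefixes added in transit.
# Adjust to taste; this is not a list taken from any real archive setup.
TRANSPORT_PREFIXES = (
    "received",                 # Received, Received-SPF
    "dkim-signature",
    "x-google-dkim-signature",
    "authentication-results",
    "arc-",                     # ARC-Seal, ARC-Message-Signature, ...
)


def strip_transport_headers(raw: bytes) -> bytes:
    """Return a copy of the message with transport headers removed."""
    # compat32 keeps the remaining headers as close to the input as possible.
    msg = message_from_bytes(raw, policy=policy.compat32)
    for name in {key.lower() for key in msg.keys()}:
        if name.startswith(TRANSPORT_PREFIXES):
            del msg[name]  # removes every occurrence, case-insensitively
    return msg.as_bytes()


def header_overhead(raw: bytes) -> int:
    """Bytes that stripping the transport headers would save."""
    return len(raw) - len(strip_transport_headers(raw))
```

Running header_overhead() over a sample from an older and a newer epoch
would give a rough idea of how much of the difference is headers rather
than message bodies. Note that stripping DKIM-Signature discards exactly
the data needed for later authenticity checks, which is the trade-off
described above.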