Date: Wed, 27 Oct 2021 04:50:59 -0400
From: Jeff King
To: Ævar Arnfjörð Bjarmason
Cc: Jean-Noël Avila, git
Subject: Re: [Summit topic] Documentation (translations, FAQ updates, new user-focused, general improvements, etc.)
References: <1c9adc5d-21ac-f6c6-8a87-959be5420636@free.fr> <211022.86r1cdjfe2.gmgdl@evledraar.gmail.com>
In-Reply-To: <211022.86r1cdjfe2.gmgdl@evledraar.gmail.com>
X-Mailing-List: git@vger.kernel.org

On Fri, Oct 22, 2021 at 04:31:46PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I'd very much support this living in-tree just as the po/* directory
> already does. I.e. periodically pulled down.

Just a bit of a tangent here, since weblate was mentioned earlier.

I'd caution a bit against pulling the history generated by weblate
directly. It's pretty sub-optimal from a Git perspective: you have a
bunch of big .po files and then a ton of little commits changing one or
a handful of lines. So the "logical" size of the repository (the sum of
the actual object sizes) ends up growing quite a bit. Deltas can help
with the on-disk size, but:

  - lots of operations scale with the logical size. The client-side
    index-pack of a clone, for instance, but also everyday stuff like
    "git log -S".

  - empirically we don't do a great job of finding these deltas. See
    below for some numbers.

For instance, take https://github.com/phpmyadmin/phpmyadmin, a
repository which uses weblate (I don't mean to pick on them; it's just a
repo whose weblate-related packing I've looked into before).

A fresh clone is 1.3GB. If you do an aggressive repack, you can get it
down to about 550MB. But there's still tons of logical data. Running:

  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  perl -alne '
    $logical += $F[0];
    $disk += $F[1];
    END { print "$logical / $disk = " . $logical / $disk }
  '

shows that there's over 70GB of logical data. It gets an impressive
156:1 compression ratio (for comparison, "normal" repos like linux.git
and git.git are around 40-60x in my experience).
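The tally above can also be wrapped in a small reusable shell function. This is only a sketch: the name size_ratio is made up for illustration, and awk stands in for the perl one-liner. It reads the same "<objectsize> <objectsize:disk>" pairs on stdin and prints the two totals and their ratio.

```shell
#!/bin/sh
# size_ratio (hypothetical helper name): sum "<logical> <disk>" pairs read
# from stdin and print both totals plus the logical-to-disk ratio.
# The expected input is the output of:
#   git cat-file --batch-all-objects \
#       --batch-check='%(objectsize) %(objectsize:disk)'
size_ratio() {
    awk '
        { logical += $1; disk += $2 }
        END {
            # %.0f rather than %d: totals can exceed a 32-bit integer.
            if (disk > 0)
                printf "%.0f / %.0f = %.1f\n", logical, disk, logical / disk
            else
                print "no objects"
        }
    '
}

# Usage, from inside a repository:
#   git cat-file --batch-all-objects \
#       --batch-check='%(objectsize) %(objectsize:disk)' | size_ratio
```

Keeping the arithmetic in a function that only reads stdin means it can be checked against a few hand-written size pairs before pointing it at a large repository.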

If you split it up by directory, like this:

  git rev-list --objects --all --no-object-names -- po |
  git cat-file --batch-check='%(objectsize)' |
  perl -lne '$total += $_; END { print $total }'

you'll see that po/ accounts for almost 60GB of that logical size.

We face some of that in our current po/, too. They're big files, and
that's the nature of the problem space. But our current ones tend to be
edited by taking a pass over the whole file, rather than the one-liners
that a web-based workflow encourages.

To be clear, I'm not arguing against weblate in general. It's cool that
it makes it easier for people to contribute to translations. But I think
it has an outsized impact on size and performance compared to the rest
of the repository. That's a big price to pay for carrying the history
in-tree.

Obviously one option there is to squash the po/ history before pulling
it in. The weblate commit messages themselves aren't that useful.

I'm not actually sure if jnavila's work so far has been using weblate.
The commits in his git-html-l10n are much coarser than what I see in
phpmyadmin, for example (so maybe he's doing similar squashing already).

-Peff