git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Jean-Noël Avila" <jn.avila@free.fr>, git <git@vger.kernel.org>
Subject: Re: [Summit topic] Documentation (translations, FAQ updates, new user-focused, general improvements, etc.)
Date: Wed, 27 Oct 2021 04:50:59 -0400
Message-ID: <YXkS85G5ujqxVf0M@coredump.intra.peff.net>
In-Reply-To: <211022.86r1cdjfe2.gmgdl@evledraar.gmail.com>

On Fri, Oct 22, 2021 at 04:31:46PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I'd very much support this living in-tree just as the po/* directory
> already does. I.e. periodically pulled down.

Just a bit of a tangent here, since weblate was mentioned earlier.

I'd caution a bit against pulling the history generated by weblate
directly. It's pretty sub-optimal from a Git perspective: you have a
bunch of big .po files and then a ton of little commits changing one or
a handful of lines.

So the "logical" size of the repository (the sum of the actual object
sizes) ends up growing quite a bit. Deltas can help with the on-disk
size, but:

  - lots of operations scale with the logical size. The client-side
    index-pack of a clone, for instance, but also everyday stuff like
    "git log -S".

  - empirically we don't do a great job of finding these deltas. See
    below for some numbers.

For instance, take https://github.com/phpmyadmin/phpmyadmin, a
repository which uses weblate (I don't mean to pick on them; it's just a
repo whose weblate-related packing I've looked into before). A fresh
clone is 1.3GB. If you do an aggressive repack, you can get it down to
about 550MB. But there's still tons of logical data. Running:

  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  perl -alne '
    $logical += $F[0]; $disk += $F[1];
    END { print "$logical / $disk = " . $logical / $disk }
  '

shows that there's over 70GB of logical data. It gets an impressive
156:1 compression ratio (for comparison, "normal" repos like linux.git
and git.git are around 40-60x in my experience).

If you split it up by directory, like this:

  git rev-list --objects --all --no-object-names -- po |
  git cat-file --batch-check='%(objectsize)' |
  perl -lne '$total += $_; END { print $total }'

you'll see that po/ accounts for almost 60GB of that logical size.
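
If you don't know in advance which directory to look at, a rough
breakdown by top-level path can point you at it. The following is only
a sketch: the function name sum_by_top_dir is made up for illustration,
and it assumes its input is "<size> <path>" lines as produced by
cat-file's %(objectsize) and %(rest) placeholders. Note that rev-list
lists each object once, under the first path where it was found, so the
attribution is approximate.

```shell
# Sum logical object sizes per first path component.
# Reads "<size> <path>" lines on stdin; the path may be empty
# (commits and root trees have no path in rev-list output).
sum_by_top_dir() {
  awk '
    {
      size = $1
      # Everything after "<size> ", so paths with spaces survive.
      path = substr($0, length($1) + 2)
      if (path == "") {
        top = "(no path)"
      } else {
        split(path, parts, "/")
        top = parts[1]
      }
      total[top] += size
    }
    # %.0f rather than %d: totals can exceed 32 bits.
    END { for (t in total) printf "%.0f\t%s\n", total[t], t }
  ' | sort -rn
}

# Usage, inside a clone:
#   git rev-list --objects --all |
#   git cat-file --batch-check='%(objectsize) %(rest)' |
#   sum_by_top_dir | head
```

Top-level files show up as their own entries next to the directories,
which is usually fine since only the largest few lines matter.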

We face some of that in our current po/, too. They're big files, and
that's the nature of the problem space. But our current ones tend to be
edited by taking a pass over the whole file, rather than the one-liners
that a web-based workflow encourages.

To be clear, I'm not arguing against weblate in general. It's cool that
it makes it easier for people to contribute to translations. But I think
it has an outsized impact on size and performance compared to the rest
of the repository. That's a big price to pay for carrying the history
in-tree.

Obviously one option there is to squash the po/ history before pulling
it in. The weblate commit messages themselves aren't that useful. I'm
not actually sure if jnavila's work so far has been using weblate. The
commits in his git-html-l10n are much coarser than what I see in
phpmyadmin, for example (so maybe he's doing similar squashing already).

-Peff
