git@vger.kernel.org mailing list mirror (one of many)
From: Caleb Gray <hey@calebgray.com>
To: Konstantin Tokarev <annulen@yandex.ru>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>,
	Eric Wong <e@yhbt.net>, Junio C Hamano <gitster@pobox.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Add a "Flattened Cache" to `git --clone`?
Date: Mon, 25 May 2020 07:02:21 -0700
Message-ID: <CAGjfG9YXQ5fkwY6DS8iOymHHXm-r+Rs4yah1afngwkGxNOE9gA@mail.gmail.com>
In-Reply-To: <1061511589863147@mail.yandex.ru>

For a repo like git itself, the assertions hold up: the way git
currently builds its data (in fact, including the `checkout` portion)
does compete directly with the "cached result" methodology! Holy shit
guys, I'm impressed as hell.

tl;dr: The way I read the raw numbers, `git` ends up being as fast as
(or faster than) a "cache" of the .git folder. Without doing further
research, I'm inclined to agree that the previously mentioned bitmap
method is already effectively as efficient as (more efficient
than!?) a cache.


Methodology/Reasoning:
virtualized: verified zero network chatter on eth0 before and after each test.
tcpflow: to gather the bits for the entire transaction, starting the
listener just before `git clone` was executed and closing it just
after execution ended. (not worrying about protocols/overhead)
tar: to compare the size of the repository on disk with the tcpflow
results. (not worrying about compensating for
headers/metadata/overhead)
gzip: in theory (I haven't checked anything), to compensate for
seemingly arbitrary size differences when downloading over HTTPS.
time: (really) rough measure of execution time.


Commands used to generate files:
*.tcpflow: `sudo tcpflow -p -c -i eth0 > $filename.tcpflow`
*.tar: `tar cf $filename.tar .git`
*.gz: `gzip -9 $filename.tar`
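
The commands above can be strung together roughly as follows. This is only a sketch, not the exact script used for these measurements: `archive_git_dir` and `capture_clone` are hypothetical helper names, `eth0` is taken from the command above, and it assumes `sudo` credentials are already cached so the backgrounded `tcpflow` does not prompt. The gzip step writes to a separate file so both the `.tar` and the `.tar.gz` remain for comparison.

```shell
#!/bin/sh
# Hypothetical helpers; names and layout are assumptions for illustration.

# Pack a clone's .git directory and gzip it, keeping both files.
archive_git_dir() {
    repo_dir=$1
    name=$2
    tar cf "$name.tar" -C "$repo_dir" .git
    gzip -9 -c "$name.tar" > "$name.tar.gz"
}

# Capture all traffic on eth0 for the duration of one clone, then archive.
# Assumes tcpflow is installed and sudo will not ask for a password.
capture_clone() {
    url=$1
    name=$2
    sudo tcpflow -p -c -i eth0 > "$name.tcpflow" &
    capture_pid=$!
    sleep 1                          # give the listener a moment to start
    git clone "$url" "$name.clone"
    sudo kill "$capture_pid"         # stop the listener once the clone ends
    wait "$capture_pid" || true
    archive_git_dir "$name.clone" "$name"
    ls -sh "$name.tar" "$name.tar.gz" "$name.tcpflow"
}

# Example invocation (not run here):
# capture_clone git://git.kernel.org/pub/scm/git/git.git kernelorg_git
```

The capture only means something if nothing else is talking on the interface, which is why the idle check before and after each test matters.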


Results:

75M kernelorg.tar
72M kernelorg.tar.gz
69M kernelorg_git.tcpflow
69M kernelorg_https.tcpflow

145M github.tar
143M github.tar.gz
143M github_git.tcpflow
142M github_https.tcpflow


Other Tests (sanity checks):

Cloned a gitea mirror of kernel.org's git:
69M gitea_git.tcpflow
69M gitea_https.tcpflow

Cloned a bitbucket mirror of kernel.org's git:
69M bitbucket_git.tcpflow
69M bitbucket_https.tcpflow

$ time git clone git://git.kernel.org/pub/scm/git/git.git
Cloning into 'git'...
remote: Enumerating objects: 15475, done.
remote: Counting objects: 100% (15475/15475), done.
remote: Compressing objects: 100% (861/861), done.
remote: Total 287977 (delta 14910), reused 14907 (delta 14610), pack-reused 272502
Receiving objects: 100% (287977/287977), 66.09 MiB | 4.87 MiB/s, done.
Resolving deltas: 100% (217420/217420), done.

real    0m20.000s
user    0m15.414s
sys     0m1.606s

$ time wget https://calebgray.com/public/kernelorg.tar.gz
--2020-05-25 06:11:29--  https://calebgray.com/public/kernelorg.tar.gz
Resolving calebgray.com (calebgray.com)... 192.3.203.78
Connecting to calebgray.com (calebgray.com)|192.3.203.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74593708 (71M) [application/octet-stream]
Saving to: ‘kernelorg.tar.gz’

kernelorg.tar.gz 100%[========================================================================================>]  71.14M  4.81MB/s    in 19s

2020-05-25 06:11:48 (3.79 MB/s) - ‘kernelorg.tar.gz’ saved [74593708/74593708]

real 0m19.420s
user 0m0.030s
sys 0m0.280s
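
For a rough side-by-side, the two runs above can be reduced to an effective end-to-end rate: data received divided by "real" time. This is a sketch using only the figures printed in the transcripts; `mib_per_sec` and `bytes_to_mib` are hypothetical helper names. Note that the `git clone` wall-clock time also covers delta resolution and checkout, so the comparison is not purely about transfer.

```shell
#!/bin/sh
# Hypothetical helpers for back-of-the-envelope arithmetic.

# MiB and wall-clock seconds in, MiB/s out (two decimals).
mib_per_sec() {
    awk -v mib="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", mib / secs }'
}

# Convert a byte count to MiB (two decimals).
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# git clone: "Receiving objects ... 66.09 MiB", real 0m20.000s
echo "git clone: $(mib_per_sec 66.09 20.000) MiB/s"

# wget: 74593708 bytes saved, real 0m19.420s
wget_mib=$(bytes_to_mib 74593708)
echo "wget:      $(mib_per_sec "$wget_mib" 19.420) MiB/s"
```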


Thanks everyone for your input and time! I love git, you guys do great work!

P.S. I ran a few other benchmarks outside of these, and the "real"
time it took to download (for both `git` and `wget`) always worked out
to be more or less consistent with the reported transfer rate (as told
by my router, as well).

P.P.S. I haven't investigated the reason for the github repo being
nearly twice the size of the kernel.org hosted copy. That one stands
out as potentially part of the proxy discussion, or there's actually a
difference in the repo's data. Curiosity will likely get the best of
me eventually.




On Mon, May 18, 2020 at 9:40 PM Konstantin Tokarev <annulen@yandex.ru> wrote:
>
>
>
> 18.05.2020, 01:12, "Konstantin Ryabitsev" <konstantin@linuxfoundation.org>:
> > On Fri, May 15, 2020 at 09:42:57PM +0000, Eric Wong wrote:
> >>  That said, I'm not sure if any client-side caching proxies can
> >>  MITM HTTPS and save bandwidth with HTTPS everywhere, nowadays.
> >>  I seem to recall polipo being abandoned because of HTTPS.
> >>  Maybe there's a caching HTTPS MITM proxy out there...
> >
> > Right, this can't operate as a transparent proxy.
>
> AFAIK, Squid can do MITM, caching and operate transparently.
> In the past it was done via ssl_bump directive, but seems like syntax changed a bit
> in modern versions.

