git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Stefan Beller" <sbeller@google.com>, "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Brandon Williams" <bmwill@google.com>,
	git@vger.kernel.org, "Michael Haggerty" <mhagger@alum.mit.edu>
Subject: [PATCH 00/20] Read `packed-refs` using mmap()
Date: Wed, 13 Sep 2017 19:15:54 +0200	[thread overview]
Message-ID: <cover.1505319366.git.mhagger@alum.mit.edu> (raw)

Previously, any time we wanted to read even a single reference from
the `packed-refs` file, we parsed the whole file and stored it in an
elaborate structure in memory called a `ref_cache`. Subsequent
reference lookups or iterations over some or all of the references
could be done by reading from the `ref_cache`.

But for large `packed-refs` files, the time needed to parse the file,
and the memory needed to cache its contents, can be quite significant.
This is partly because building the cache costs lots of little memory
allocations. (And lest you think that most Git commands can be
executed by reading at most a couple of loose references, remember
that almost any command that reads objects has to look for replace
references (unless they are turned off in the config), and *that*
necessarily entails reading packed references.)

Following lots of work to extract the `packed_ref_store` into a
separate module and decouple it from the `files_ref_store`, it is now
possible to fundamentally change how the `packed-refs` file is read.

* `mmap()` the whole file rather than `read()`ing it.

* Instead of parsing the complete file at once into a `ref_cache`,
  parse the references out of the file contents on demand.

* Use a binary search to find, very quickly, the reference or group of
  references that needs to be read. Parse *only* those references out
  of the file contents, without creating in-memory data structures at
  all.

In rare cases this change might force parts of the `packed-refs` file
to be read multiple times, but that cost is far outweighed by the fact
that usually most of the `packed-refs` file doesn't have to be read
*at all*.

Note that the binary search optimization requires the `packed-refs`
file to be sorted by reference name. We have always written them
sorted, but just in case there are clients that don't, we assume the
file is unsorted unless its header lists a `sorted` trait. From now
on, we write the file with that trait. If the file is not sorted, it
is sorted on the fly in memory.

For a repository with only a couple thousand references and a warm
disk cache, this change doesn't make a very significant difference.
But for repositories with very large numbers of references, the
difference start to be significant:

A repository with 54k references (warm cache):

                                  git 2.13.1         this branch
git for-each-ref                      464 ms              452 ms
git for-each-ref (no output)           66 ms               47 ms
git for-each-ref (0.6% of refs)        47 ms                9 ms
git for-each-ref (0.6%, no output)     41 ms                2 ms
git rev-parse                          32 ms                2 ms

A repository (admittedly insane, but a real-life example) with 60M
references (warm cache):

                                  git 2.13.1         this branch
git for-each-ref (no output)        84000 ms            61000 ms
git rev-parse                       40000 ms                2 ms

This branch applies on top of mh/packed-ref-transactions. It can also
be obtained from my git fork [1] as branch `mmap-packed-refs`.

Michael

[1] https://github.com/mhagger/git

Jeff King (1):
  prefix_ref_iterator: break when we leave the prefix

Michael Haggerty (19):
  ref_iterator: keep track of whether the iterator output is ordered
  packed_ref_cache: add a backlink to the associated `packed_ref_store`
  die_unterminated_line(), die_invalid_line(): new functions
  read_packed_refs(): use mmap to read the `packed-refs` file
  read_packed_refs(): only check for a header at the top of the file
  read_packed_refs(): make parsing of the header line more robust
  read_packed_refs(): read references with minimal copying
  packed_ref_cache: remember the file-wide peeling state
  mmapped_ref_iterator: add iterator over a packed-refs file
  mmapped_ref_iterator_advance(): no peeled value for broken refs
  packed_ref_cache: keep the `packed-refs` file open and mmapped
  read_packed_refs(): ensure that references are ordered when read
  packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`
  packed_read_raw_ref(): read the reference from the mmapped buffer
  ref_store: implement `refs_peel_ref()` generically
  packed_ref_store: get rid of the `ref_cache` entirely
  ref_cache: remove support for storing peeled values
  mmapped_ref_iterator: inline into `packed_ref_iterator`
  packed-backend.c: rename a bunch of things and update comments

 refs.c                |  22 +-
 refs/files-backend.c  |  54 +--
 refs/iterator.c       |  47 ++-
 refs/packed-backend.c | 896 +++++++++++++++++++++++++++++++++++++-------------
 refs/ref-cache.c      |  44 +--
 refs/ref-cache.h      |  35 +-
 refs/refs-internal.h  |  26 +-
 7 files changed, 761 insertions(+), 363 deletions(-)

-- 
2.14.1


             reply	other threads:[~2017-09-13 17:16 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-13 17:15 Michael Haggerty [this message]
2017-09-13 17:15 ` [PATCH 01/20] ref_iterator: keep track of whether the iterator output is ordered Michael Haggerty
2017-09-13 17:15 ` [PATCH 02/20] prefix_ref_iterator: break when we leave the prefix Michael Haggerty
2017-09-13 17:15 ` [PATCH 03/20] packed_ref_cache: add a backlink to the associated `packed_ref_store` Michael Haggerty
2017-09-13 17:15 ` [PATCH 04/20] die_unterminated_line(), die_invalid_line(): new functions Michael Haggerty
2017-09-13 17:15 ` [PATCH 05/20] read_packed_refs(): use mmap to read the `packed-refs` file Michael Haggerty
2017-09-13 17:16 ` [PATCH 06/20] read_packed_refs(): only check for a header at the top of the file Michael Haggerty
2017-09-13 17:16 ` [PATCH 07/20] read_packed_refs(): make parsing of the header line more robust Michael Haggerty
2017-09-13 17:16 ` [PATCH 08/20] read_packed_refs(): read references with minimal copying Michael Haggerty
2017-09-13 17:16 ` [PATCH 09/20] packed_ref_cache: remember the file-wide peeling state Michael Haggerty
2017-09-13 17:16 ` [PATCH 10/20] mmapped_ref_iterator: add iterator over a packed-refs file Michael Haggerty
2017-09-13 17:16 ` [PATCH 11/20] mmapped_ref_iterator_advance(): no peeled value for broken refs Michael Haggerty
2017-09-13 17:16 ` [PATCH 12/20] packed_ref_cache: keep the `packed-refs` file open and mmapped Michael Haggerty
2017-09-13 17:16 ` [PATCH 13/20] read_packed_refs(): ensure that references are ordered when read Michael Haggerty
2017-09-13 17:16 ` [PATCH 14/20] packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator` Michael Haggerty
2017-09-13 17:16 ` [PATCH 15/20] packed_read_raw_ref(): read the reference from the mmapped buffer Michael Haggerty
2017-09-13 17:16 ` [PATCH 16/20] ref_store: implement `refs_peel_ref()` generically Michael Haggerty
2017-09-13 17:16 ` [PATCH 17/20] packed_ref_store: get rid of the `ref_cache` entirely Michael Haggerty
2017-09-13 17:16 ` [PATCH 18/20] ref_cache: remove support for storing peeled values Michael Haggerty
2017-09-13 17:16 ` [PATCH 19/20] mmapped_ref_iterator: inline into `packed_ref_iterator` Michael Haggerty
2017-09-13 17:16 ` [PATCH 20/20] packed-backend.c: rename a bunch of things and update comments Michael Haggerty
2017-09-13 23:00   ` Stefan Beller
2017-09-13 21:35 ` [PATCH 00/20] Read `packed-refs` using mmap() Junio C Hamano
2017-09-14 16:12 ` Michael Haggerty
2017-09-14 22:16   ` Johannes Schindelin
2017-09-14 20:23 ` Johannes Schindelin
2017-09-15  4:21   ` Michael Haggerty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1505319366.git.mhagger@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=avarab@gmail.com \
    --cc=bmwill@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).