git@vger.kernel.org mailing list mirror (one of many)
From: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
	Nathaniel Filardo <nwf20@cl.cam.ac.uk>
Subject: [PATCH 0/4] Speed up repacking when lots of pack-kept objects
Date: Tue, 12 Mar 2019 13:18:54 +0000
Message-ID: <20190312131858.26115-1-nwf20@cl.cam.ac.uk>

This patch series improves handling of very large repositories, as generated
by, for example, bup (https://github.com/bup/bup).  Prolonged operation
thereof creates quite a lot of small pack files; repacking improves
filesystem performance of the objects/pack directory, but is quite
expensive, in terms of time and memory.  We have adopted a strategy that
marks "large" (tens of GB) pack files as "kept" and defers repacking
until there are enough un-kept packs or enough bytes of un-kept objects.
(The first patch in the series will make our accounting easier, replacing
some terrible shell scripting with grep.)
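
A minimal sketch of the deferral rule described above, for illustration
only.  It assumes a pack counts as "kept" when a .keep file sits next to
its .pack file, which is how git marks such packs on disk.  The function
names and both thresholds are hypothetical and are not taken from our
scripts or from this series.

```python
import os


def unkept_pack_stats(pack_dir):
    """Return (count, total_bytes) for packs that have no sibling .keep file."""
    count = 0
    total = 0
    for name in sorted(os.listdir(pack_dir)):
        if not name.endswith(".pack"):
            continue  # skip .idx, .keep, .bitmap and anything else
        stem = name[: -len(".pack")]
        if os.path.exists(os.path.join(pack_dir, stem + ".keep")):
            continue  # kept pack: repacking leaves it alone
        count += 1
        total += os.path.getsize(os.path.join(pack_dir, name))
    return count, total


def should_repack(pack_dir, max_packs=50, max_bytes=10 * 2**30):
    """Repack once there are enough un-kept packs or enough un-kept bytes.

    Both default thresholds are placeholders chosen for illustration.
    """
    count, total = unkept_pack_stats(pack_dir)
    return count >= max_packs or total >= max_bytes
```

This measures pack file sizes rather than individual object sizes, which
is a coarser figure than "bytes of un-kept objects" but is cheap to obtain.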

While this strategy has generally improved our lives relative to either
extreme (not repacking, or repacking after every bup save operation), it
still leaves a good bit to be desired.  Because our packs are marked as
kept, repacking will leave the objects therein alone, but it still must
instantiate in memory and walk the entire object graph.  However, because
our kept packs are transitively closed, such that an object in one
necessarily references only objects in other kept packs, we should like to
avoid reasoning about them more or less altogether.
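
To make "transitively closed" concrete, here is a toy check over an
in-memory object graph.  The dictionary-based graph and the function name
are assumptions made for illustration; nothing here reads real pack files
or reflects how git stores objects.

```python
def is_transitively_closed(kept, refs):
    """Return True if every object referenced by a kept object is also kept.

    kept: set of object ids stored in kept packs.
    refs: dict mapping an object id to the ids it references directly
          (a commit to its parents and tree, a tree to its entries).
    """
    return all(child in kept for obj in kept for child in refs.get(obj, ()))


# Toy history: commit c2 -> parent c1 and tree t2; trees point at blobs.
refs = {
    "c2": ["c1", "t2"],
    "c1": ["t1"],
    "t2": ["b1", "b2"],
    "t1": ["b1"],
}
closed = is_transitively_closed({"c2", "c1", "t2", "t1", "b1", "b2"}, refs)
# Dropping blob b2 from the kept set breaks closure, since t2 references it.
broken = is_transitively_closed({"c2", "c1", "t2", "t1", "b1"}, refs)
```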

This series attempts to do just that.  The middle patches are just some
groundwork for the last patch, which carries the punch line.  This last
patch adds an option to builtin/repack that passes the commit and tree
objects within kept packs, marked as UNINTERESTING, to the
builtin/pack-objects command it spawns.  Combined with the use of sparse
reachability, this speeds up the enumeration of candidate objects for
repacking and thereby substantially reduces the runtime of our repack
operations, while producing identical results.
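
The effect can be illustrated with a toy model, which is a simplification
and not the code in this series.  It assumes objects are nodes in a
dictionary-based graph and treats kept objects as a boundary that the walk
never expands.  The function name and the assume_closed parameter are
hypothetical.

```python
def objects_to_pack(tips, refs, kept, assume_closed):
    """Return (un-kept objects reachable from tips, number of objects visited).

    With assume_closed=True, kept objects are treated as a boundary and are
    not expanded; this is only safe when the kept set is transitively closed.
    """
    seen = set()
    to_pack = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        if obj in kept:
            if assume_closed:
                continue  # everything below here is kept as well
        else:
            to_pack.add(obj)
        stack.extend(refs.get(obj, ()))
    return to_pack, len(seen)


# Toy history: c3 is new; c2 and everything it reaches live in kept packs.
refs = {
    "c3": ["c2", "t3"],
    "c2": ["c1", "t2"],
    "c1": ["t1"],
    "t3": ["b3", "b1"],
    "t2": ["b2", "b1"],
    "t1": ["b1"],
}
kept = {"c2", "c1", "t2", "t1", "b2", "b1"}
full_set, full_visited = objects_to_pack(["c3"], refs, kept, assume_closed=False)
fast_set, fast_visited = objects_to_pack(["c3"], refs, kept, assume_closed=True)
```

In this toy graph both walks select the same objects for packing, but the
boundary-aware walk visits fewer of them.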

I am, however, rather a novice when it comes to git internals, so any and
all feedback is quite welcome.

Nathaniel Filardo (4):
  count-objects: report statistics about kept packs
  revision walk: optionally use sparse reachability
  repack: add --sparse and pass to pack-objects
  repack: optionally assume transitive kept packs

 Documentation/git-gc.txt         |  5 +++
 Documentation/git-repack.txt     | 25 +++++++++++++
 bisect.c                         |  2 +-
 blame.c                          |  2 +-
 builtin/checkout.c               |  2 +-
 builtin/commit.c                 |  2 +-
 builtin/count-objects.c          | 17 ++++++++-
 builtin/describe.c               |  2 +-
 builtin/fast-export.c            |  2 +-
 builtin/fmt-merge-msg.c          |  2 +-
 builtin/gc.c                     |  5 +++
 builtin/log.c                    | 10 ++---
 builtin/merge.c                  |  2 +-
 builtin/pack-objects.c           |  4 +-
 builtin/repack.c                 | 64 +++++++++++++++++++++++++++++++-
 builtin/rev-list.c               |  2 +-
 builtin/shortlog.c               |  2 +-
 bundle.c                         |  2 +-
 http-push.c                      |  2 +-
 merge-recursive.c                |  2 +-
 pack-bitmap-write.c              |  2 +-
 pack-bitmap.c                    |  4 +-
 reachable.c                      |  4 +-
 ref-filter.c                     |  2 +-
 remote.c                         |  2 +-
 revision.c                       | 10 +++--
 revision.h                       |  2 +-
 sequencer.c                      |  6 +--
 shallow.c                        |  2 +-
 submodule.c                      |  4 +-
 t/helper/test-revision-walking.c |  2 +-
 31 files changed, 154 insertions(+), 42 deletions(-)

-- 
2.17.1


