git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Matheus Tavares <matheus.bernardino@usp.br>
To: git@vger.kernel.org
Cc: jeffhost@microsoft.com, chriscool@tuxfamily.org, peff@peff.net,
	t.gummerer@gmail.com, newren@gmail.com
Subject: [PATCH v2 00/19] Parallel Checkout (part I)
Date: Tue, 22 Sep 2020 19:49:14 -0300	[thread overview]
Message-ID: <cover.1600814153.git.matheus.bernardino@usp.br> (raw)
In-Reply-To: <cover.1597093021.git.matheus.bernardino@usp.br>

This series adds helper workers to checkout, parallelizing the reading,
filtering and writing of multiple blobs to the working tree.

Since v1, I got the chance to benchmark parallel checkout in more
machines. The results showed that the parallelization is most effective
with repositories located on SSDs or over distributed file systems. For
local file systems on spinning disks, it does not always bring good
performances. In fact, it even brings a slowdown sometimes. But given
the results on the two first cases, I think it's worth having the
parallel code as an optional (and non-default) setting.

The size of the repository being checked out and the compression level
on the packfiles also influence how much performance gain we can get
from parallel checkout. For example, downloading the Linux repo from
GitHub and from kernel.org I got packfiles with 2.9GB and 1.4GB,
respectively. The number of objects was the same, but GitHub's had a
smaller number of delta-chains with size >= 7 [A]. For this reason, the
sequential checkout after GitHub's clone was considerably faster than
the sequential checkout after kernel.org's clone. And the speedup from
parallel checkout was more modest (but it was faster in absolute values,
nevertheless).

[A]: https://docs.google.com/spreadsheets/d/1dDGLym77JAGCVYhKQHe44r3pqtrsvHrjS4NmD_Hqr6k/edit?usp=sharing

V2 got bigger with tests and some additional optimizations, so I decided
to divide the original series into two parts to facilitate reviewing.
This one is constituted of:

- The first 9 patches are preparatory steps in convert.c and entry.c.
- The middle 6 actually implement parallel checkout.
- The last 4 add tests.

Part II will contain some extra optimizations, like work stealing and
the creation of leading directories in parallel. With that, workers
won't need to stat() the path components again before opening the files
for writing. We will also skip some stat() calls during clone.


Major changes since v1:

General:
- Added tests
- Parallel checkout is no longer the default, since not all machines
  benefit from it.
- Rebased on top of master to use the adjusted mem_pool API of
  en/mem-pool.

Patch 10:
- Converted BUG() to error(), in handle_results(), when we finish
  parallel checkout with pending entries. This is not really a BUG; it
  can happen when a worker dies before sending all of its results. Also,
  by emitting an error message instead of die()'ing, we can continue
  processing the next results and, thus, avoid wasting successful work.
- Added missing initialization of ci->status on enqueue_entry().
- Fixed bug on which collision report during clone would not be correct
  when the file that is first written appears after it's colliding pair
  in the cache array.
- Reworded commit message and added comment in handle_results() to
  explain why we retry writing entries with path collisions.
- Renamed CI_RETRY to CI_COLLISION, to make it easier to change the
  behavior on collided entries in the future, if necessary.
- Some other minor changes like:
  * Removed unnecessary PC_HANDLING_RESULTS status.
  * Statically allocated the global parallel_checkout struct.
  * Renamed checkout_item to parallel_checkout_item.

Patch 11:
- Made parse_and_save_result() safer by checking that the received data
  has the expected size, instead of trusting ci->status and possibly
  accessing an invalid address on errors.
- Limited the workers to the number of enqueued entries.
- Added comment in packet_to_ci() mentioning why it's OK to encode
  NULL as a zero length string when sending the working_tree_encoding to
  workers.
- Split subprocess' spawning and finalizing loops, to mitigate the
  spawn/wait cost.
- Don't die() when a worker exits with an error code (only report the
  error), to avoid wasting good work by not updating the index with the 
  stat information from the written entries.
- Renamed checkout.workersThreshold to checkout.thresholdForParallelism.


Jeff Hostetler (4):
  convert: make convert_attrs() and convert structs public
  convert: add [async_]convert_to_working_tree_ca() variants
  convert: add get_stream_filter_ca() variant
  convert: add conv_attrs classification

Matheus Tavares (15):
  entry: extract a header file for entry.c functions
  entry: make fstat_output() and read_blob_entry() public
  entry: extract cache_entry update from write_entry()
  entry: move conv_attrs lookup up to checkout_entry()
  entry: add checkout_entry_ca() which takes preloaded conv_attrs
  unpack-trees: add basic support for parallel checkout
  parallel-checkout: make it truly parallel
  parallel-checkout: support progress displaying
  make_transient_cache_entry(): optionally alloc from mem_pool
  builtin/checkout.c: complete parallel checkout support
  checkout-index: add parallel checkout support
  parallel-checkout: add tests for basic operations
  parallel-checkout: add tests related to clone collisions
  parallel-checkout: add tests related to .gitattributes
  ci: run test round with parallel-checkout enabled

 .gitignore                              |   1 +
 Documentation/config/checkout.txt       |  21 +
 Makefile                                |   2 +
 apply.c                                 |   1 +
 builtin.h                               |   1 +
 builtin/checkout--helper.c              | 142 ++++++
 builtin/checkout-index.c                |  17 +
 builtin/checkout.c                      |  21 +-
 builtin/difftool.c                      |   3 +-
 cache.h                                 |  34 +-
 ci/run-build-and-tests.sh               |   1 +
 convert.c                               | 121 +++--
 convert.h                               |  68 +++
 entry.c                                 | 102 ++--
 entry.h                                 |  54 ++
 git.c                                   |   2 +
 parallel-checkout.c                     | 631 ++++++++++++++++++++++++
 parallel-checkout.h                     | 103 ++++
 read-cache.c                            |  12 +-
 t/README                                |   4 +
 t/lib-encoding.sh                       |  25 +
 t/lib-parallel-checkout.sh              |  45 ++
 t/t0028-working-tree-encoding.sh        |  25 +-
 t/t2080-parallel-checkout-basics.sh     | 197 ++++++++
 t/t2081-parallel-checkout-collisions.sh | 116 +++++
 t/t2082-parallel-checkout-attributes.sh | 174 +++++++
 unpack-trees.c                          |  22 +-
 27 files changed, 1793 insertions(+), 152 deletions(-)
 create mode 100644 builtin/checkout--helper.c
 create mode 100644 entry.h
 create mode 100644 parallel-checkout.c
 create mode 100644 parallel-checkout.h
 create mode 100644 t/lib-encoding.sh
 create mode 100644 t/lib-parallel-checkout.sh
 create mode 100755 t/t2080-parallel-checkout-basics.sh
 create mode 100755 t/t2081-parallel-checkout-collisions.sh
 create mode 100755 t/t2082-parallel-checkout-attributes.sh

-- 
2.28.0


  parent reply	other threads:[~2020-09-22 22:50 UTC|newest]

Thread overview: 154+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-10 21:33 [RFC PATCH 00/21] [RFC] Parallel checkout Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 01/21] convert: make convert_attrs() and convert structs public Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 02/21] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 03/21] convert: add get_stream_filter_ca() variant Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 04/21] convert: add conv_attrs classification Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 05/21] entry: extract a header file for entry.c functions Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 06/21] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 07/21] entry: extract cache_entry update from write_entry() Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 08/21] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 09/21] entry: add checkout_entry_ca() which takes preloaded conv_attrs Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 10/21] unpack-trees: add basic support for parallel checkout Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 11/21] parallel-checkout: make it truly parallel Matheus Tavares
2020-08-19 21:34   ` Jeff Hostetler
2020-08-20  1:33     ` Matheus Tavares Bernardino
2020-08-20 14:39       ` Jeff Hostetler
2020-08-10 21:33 ` [RFC PATCH 12/21] parallel-checkout: add configuration options Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 13/21] parallel-checkout: support progress displaying Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 14/21] make_transient_cache_entry(): optionally alloc from mem_pool Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 15/21] builtin/checkout.c: complete parallel checkout support Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 16/21] checkout-index: add " Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 17/21] parallel-checkout: avoid stat() calls in workers Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 18/21] entry: use is_dir_sep() when checking leading dirs Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 19/21] symlinks: make has_dirs_only_path() track FL_NOENT Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 20/21] parallel-checkout: create leading dirs in workers Matheus Tavares
2020-08-10 21:33 ` [RFC PATCH 21/21] parallel-checkout: skip checking the working tree on clone Matheus Tavares
2020-08-12 16:57 ` [RFC PATCH 00/21] [RFC] Parallel checkout Jeff Hostetler
2020-09-22 22:49 ` Matheus Tavares [this message]
2020-09-22 22:49   ` [PATCH v2 01/19] convert: make convert_attrs() and convert structs public Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 02/19] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 03/19] convert: add get_stream_filter_ca() variant Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 04/19] convert: add conv_attrs classification Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 05/19] entry: extract a header file for entry.c functions Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 06/19] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 07/19] entry: extract cache_entry update from write_entry() Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 08/19] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2020-10-01 15:53     ` Jeff Hostetler
2020-10-01 15:59       ` Jeff Hostetler
2020-09-22 22:49   ` [PATCH v2 09/19] entry: add checkout_entry_ca() which takes preloaded conv_attrs Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 10/19] unpack-trees: add basic support for parallel checkout Matheus Tavares
2020-10-05  6:17     ` [PATCH] parallel-checkout: drop unused checkout state parameter Jeff King
2020-10-05 13:13       ` Matheus Tavares Bernardino
2020-10-05 13:45         ` Jeff King
2020-09-22 22:49   ` [PATCH v2 11/19] parallel-checkout: make it truly parallel Matheus Tavares
2020-09-29 19:52     ` Martin Ågren
2020-09-30 14:02       ` Matheus Tavares Bernardino
2020-09-22 22:49   ` [PATCH v2 12/19] parallel-checkout: support progress displaying Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 13/19] make_transient_cache_entry(): optionally alloc from mem_pool Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 14/19] builtin/checkout.c: complete parallel checkout support Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 15/19] checkout-index: add " Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 16/19] parallel-checkout: add tests for basic operations Matheus Tavares
2020-10-20  1:35     ` Jonathan Nieder
2020-10-20  2:55       ` Taylor Blau
2020-10-20 13:18         ` Matheus Tavares Bernardino
2020-10-20 19:09           ` Junio C Hamano
2020-10-20  3:18       ` Matheus Tavares Bernardino
2020-10-20  4:16         ` Jonathan Nieder
2020-10-20 19:14         ` Junio C Hamano
2020-09-22 22:49   ` [PATCH v2 17/19] parallel-checkout: add tests related to clone collisions Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 18/19] parallel-checkout: add tests related to .gitattributes Matheus Tavares
2020-09-22 22:49   ` [PATCH v2 19/19] ci: run test round with parallel-checkout enabled Matheus Tavares
2020-10-29  2:14   ` [PATCH v3 00/19] Parallel Checkout (part I) Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 01/19] convert: make convert_attrs() and convert structs public Matheus Tavares
2020-10-29 23:40       ` Junio C Hamano
2020-10-30 17:01         ` Matheus Tavares Bernardino
2020-10-30 17:38           ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 02/19] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2020-10-29 23:48       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 03/19] convert: add get_stream_filter_ca() variant Matheus Tavares
2020-10-29 23:51       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 04/19] convert: add conv_attrs classification Matheus Tavares
2020-10-29 23:53       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 05/19] entry: extract a header file for entry.c functions Matheus Tavares
2020-10-30 21:36       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 06/19] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 07/19] entry: extract cache_entry update from write_entry() Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 08/19] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2020-10-30 21:58       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 09/19] entry: add checkout_entry_ca() which takes preloaded conv_attrs Matheus Tavares
2020-10-30 22:02       ` Junio C Hamano
2020-10-29  2:14     ` [PATCH v3 10/19] unpack-trees: add basic support for parallel checkout Matheus Tavares
2020-11-02 19:35       ` Junio C Hamano
2020-11-03  3:48         ` Matheus Tavares Bernardino
2020-10-29  2:14     ` [PATCH v3 11/19] parallel-checkout: make it truly parallel Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 12/19] parallel-checkout: support progress displaying Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 13/19] make_transient_cache_entry(): optionally alloc from mem_pool Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 14/19] builtin/checkout.c: complete parallel checkout support Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 15/19] checkout-index: add " Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 16/19] parallel-checkout: add tests for basic operations Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 17/19] parallel-checkout: add tests related to clone collisions Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 18/19] parallel-checkout: add tests related to .gitattributes Matheus Tavares
2020-10-29  2:14     ` [PATCH v3 19/19] ci: run test round with parallel-checkout enabled Matheus Tavares
2020-10-29 19:48     ` [PATCH v3 00/19] Parallel Checkout (part I) Junio C Hamano
2020-10-30 15:58     ` Jeff Hostetler
2020-11-04 20:32     ` [PATCH v4 " Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 01/19] convert: make convert_attrs() and convert structs public Matheus Tavares
2020-12-05 10:40         ` Christian Couder
2020-12-05 21:53           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 02/19] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2020-12-05 11:10         ` Christian Couder
2020-12-05 22:20           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 03/19] convert: add get_stream_filter_ca() variant Matheus Tavares
2020-12-05 11:45         ` Christian Couder
2020-11-04 20:33       ` [PATCH v4 04/19] convert: add conv_attrs classification Matheus Tavares
2020-12-05 12:07         ` Christian Couder
2020-12-05 22:08           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 05/19] entry: extract a header file for entry.c functions Matheus Tavares
2020-12-06  8:31         ` Christian Couder
2020-11-04 20:33       ` [PATCH v4 06/19] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 07/19] entry: extract cache_entry update from write_entry() Matheus Tavares
2020-12-06  8:53         ` Christian Couder
2020-11-04 20:33       ` [PATCH v4 08/19] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2020-12-06  9:35         ` Christian Couder
2020-12-07 13:52           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 09/19] entry: add checkout_entry_ca() which takes preloaded conv_attrs Matheus Tavares
2020-12-06 10:02         ` Christian Couder
2020-12-07 16:47           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 10/19] unpack-trees: add basic support for parallel checkout Matheus Tavares
2020-12-06 11:36         ` Christian Couder
2020-12-07 19:06           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 11/19] parallel-checkout: make it truly parallel Matheus Tavares
2020-12-16 22:31         ` Emily Shaffer
2020-12-17 15:00           ` Matheus Tavares Bernardino
2020-11-04 20:33       ` [PATCH v4 12/19] parallel-checkout: support progress displaying Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 13/19] make_transient_cache_entry(): optionally alloc from mem_pool Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 14/19] builtin/checkout.c: complete parallel checkout support Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 15/19] checkout-index: add " Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 16/19] parallel-checkout: add tests for basic operations Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 17/19] parallel-checkout: add tests related to clone collisions Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 18/19] parallel-checkout: add tests related to .gitattributes Matheus Tavares
2020-11-04 20:33       ` [PATCH v4 19/19] ci: run test round with parallel-checkout enabled Matheus Tavares
2020-12-16 14:50       ` [PATCH v5 0/9] Parallel Checkout (part I) Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 1/9] convert: make convert_attrs() and convert structs public Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 2/9] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 3/9] convert: add get_stream_filter_ca() variant Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 4/9] convert: add classification for conv_attrs struct Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 5/9] entry: extract a header file for entry.c functions Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 6/9] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 7/9] entry: extract update_ce_after_write() from write_entry() Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 8/9] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2020-12-16 14:50         ` [PATCH v5 9/9] entry: add checkout_entry_ca() taking preloaded conv_attrs Matheus Tavares
2020-12-16 15:27         ` [PATCH v5 0/9] Parallel Checkout (part I) Christian Couder
2020-12-17  1:11         ` Junio C Hamano
2021-03-23 14:19         ` [PATCH v6 0/9] Parallel Checkout (part 1) Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 1/9] convert: make convert_attrs() and convert structs public Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 2/9] convert: add [async_]convert_to_working_tree_ca() variants Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 3/9] convert: add get_stream_filter_ca() variant Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 4/9] convert: add classification for conv_attrs struct Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 5/9] entry: extract a header file for entry.c functions Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 6/9] entry: make fstat_output() and read_blob_entry() public Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 7/9] entry: extract update_ce_after_write() from write_entry() Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 8/9] entry: move conv_attrs lookup up to checkout_entry() Matheus Tavares
2021-03-23 14:19           ` [PATCH v6 9/9] entry: add checkout_entry_ca() taking preloaded conv_attrs Matheus Tavares
2021-03-23 17:34           ` [PATCH v6 0/9] Parallel Checkout (part 1) Junio C Hamano
2020-10-01 16:42 ` [RFC PATCH 00/21] [RFC] Parallel checkout Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1600814153.git.matheus.bernardino@usp.br \
    --to=matheus.bernardino@usp.br \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=jeffhost@microsoft.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=t.gummerer@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).