Re: performance on repack

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Nicolas Pitre" <nico@cam.org>,
	"Martin Koegler" <mkoegler@auto.tuwien.ac.at>,
	"Git Mailing List" <git@vger.kernel.org>
Subject: Re: performance on repack
Date: Wed, 15 Aug 2007 11:08:38 -0400	[thread overview]
Message-ID: <9e4733910708150808x39241071j1a4012f16cd26ef8@mail.gmail.com> (raw)
In-Reply-To: <20070815053231.GJ27913@spearce.org>

On 8/15/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > On Mon, 13 Aug 2007, Shawn O. Pearce wrote:
> Even Linux with its pretty amazing NPTL isn't up to the task.
> No threading package really is.

You need something like fibers instead of threads for task that small.

> Yea, that I agree with.  The other thing is the main thread may need
> to push a couple of windows worth of work into the threads, so that
> if this window's 1/n unit goes really fast on that thread it doesn't
> stall out waiting for the main thread to get it more data.
>
> > Also... read_sha1_file() is currently not thread safe, and threaded
> > delta searching would requires that its usage be serialized with a
> > global mutex (not done in this patch which is a bug), or ideally be made
> > thread aware.

You can avoid making all the low level calls thread safe by using the
main thread to get everything into RAM before starting to search for
the deltas. The worker threads would only deal with things completely
in memory. You may need to ref count these in-memory objects if they
are shared between worker threads. For simplicity the in-memory input
objects should be read only by the threads. The worker threads create
new structures to hand their results back to the main thread for
writing to disk.

A typical solution is to use a queue protected by locks. Main thread
faults in all the needed objects to cache. Main thread builds a queue
entry and increments reference count on all referenced objects. Main
thread uses locks to add entry to queue, while queue is locked it
removes any finished jobs. Main thread writes finished results to
disks, decrements ref counts. Cache logic can then take over about
when objects are actually deleted.

Worker threads wait on the queue. When something is placed in the
queue a waiting worker thread removes it, processes it, puts the
results in RAM, and places the object back on the finished queue. Then
waits for another object. It doesn't call into the main body of code.

Initially I would just ignore very large objects while getting the
basic code to work. After the basic code is working if a very large
object is encountered when the main thread is faulting objects in, the
main thread should just process this object on the spot using the
existing low memory code.

Hopefully the new adaptive read ahead code in the kernel will detect
the pattern of the main thread faulting objects in and make good
guesses about where to read ahead. This is key to not getting stuck
waiting on IO to complete. If not the main thread could use ADVISE to
give kernel read ahead clues about where it is going to read next.

> Yea, that's a problem.  From the Git library perspective that Luiz
> Capitulino has been working on for GSOC this summer being able to
> compile/link Git with -lpthreads and have it actually be thread
> safe would be useful.  Unfortunately its waaaay more than just
> making read_sha1_file() threadsafe.  :-/
>
> read_sha1_file() is actually a reasonably critical part of the
> major functions of Git.  Slowing that down by making it go through
> pthread_mutex_{lock,unlock} is going to hurt.  I tried something
> like the following, and its a bit slower, and really ain't a great
> multi-thread aware implementation.
>
> # unmodified v1.5.3-rc4-91-g9fa3465 from Junio
> /usr/bin/time ../rl-master/git-rev-list HEAD . >/dev/null
>         1.67 real         1.42 user         0.20 sys
>         1.65 real         1.36 user         0.17 sys
>         1.67 real         1.35 user         0.16 sys
>         1.70 real         1.35 user         0.16 sys
>         1.64 real         1.35 user         0.16 sys
>
> # v1.5.3-rc4-91-g9fa3465 + patch below
> /usr/bin/time ../rl-pthread/git-rev-list HEAD . >/dev/null
>         2.86 real         1.48 user         0.29 sys
>         1.68 real         1.37 user         0.17 sys
>         1.59 real         1.37 user         0.16 sys
>         1.66 real         1.37 user         0.17 sys
>         1.68 real         1.37 user         0.17 sys
>
> That's on Mac OS X.  I guess NPTL would probably be faster at this,
> but not so much faster as to make it not be an increase.
>
> -->8--
> diff --git a/Makefile b/Makefile
> index 4eb4637..f31811b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -710,7 +710,7 @@ SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH))
>  PERL_PATH_SQ = $(subst ','\'',$(PERL_PATH))
>  TCLTK_PATH_SQ = $(subst ','\'',$(TCLTK_PATH))
>
> -LIBS = $(GITLIBS) $(EXTLIBS)
> +LIBS = $(GITLIBS) $(EXTLIBS) -lpthread
>
>  BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
>         -DETC_GITCONFIG='"$(ETC_GITCONFIG_SQ)"' $(COMPAT_CFLAGS)
> diff --git a/sha1_file.c b/sha1_file.c
> index aca741b..c239789 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -6,6 +6,7 @@
>   * This handles basic git sha1 object files - packing, unpacking,
>   * creation etc.
>   */
> +#include <pthread.h>
>  #include "cache.h"
>  #include "delta.h"
>  #include "pack.h"
> @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
>  void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
>                      unsigned long *size)
>  {
> +       static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER;
>         unsigned long mapsize;
>         void *map, *buf;
>         struct cached_object *co;
>
> +       pthread_mutex_lock(&locky);
>         co = find_cached_object(sha1);
>         if (co) {
>                 buf = xmalloc(co->size + 1);
> @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
>                 ((char*)buf)[co->size] = 0;
>                 *type = co->type;
>                 *size = co->size;
> +               pthread_mutex_unlock(&locky);
>                 return buf;
>         }
>
>         buf = read_packed_sha1(sha1, type, size);
> -       if (buf)
> +       if (buf) {
> +               pthread_mutex_unlock(&locky);
>                 return buf;
> +       }
>         map = map_sha1_file(sha1, &mapsize);
>         if (map) {
>                 buf = unpack_sha1_file(map, mapsize, type, size, sha1);
>                 munmap(map, mapsize);
> +               pthread_mutex_unlock(&locky);
>                 return buf;
>         }
>         reprepare_packed_git();
> -       return read_packed_sha1(sha1, type, size);
> +       buf = read_packed_sha1(sha1, type, size);
> +       pthread_mutex_unlock(&locky);
> +       return buf;
>  }
>
>  void *read_object_with_reference(const unsigned char *sha1,
> --
> Shawn.
>


-- 
Jon Smirl
jonsmirl@gmail.com

next prev parent reply	other threads:[~2007-08-15 15:08 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-11 21:12 performance on repack Jon Smirl
2007-08-11 22:09 ` David Kastrup
2007-08-11 22:34   ` Linus Torvalds
2007-08-11 23:21     ` Jon Smirl
2007-08-12 10:33 ` Martin Koegler
2007-08-12 13:49   ` Jon Smirl
2007-08-14  3:12     ` Shawn O. Pearce
2007-08-14  4:10       ` Jon Smirl
2007-08-14  5:13         ` Shawn O. Pearce
2007-08-14  5:57           ` Jon Smirl
2007-08-14 14:52       ` Nicolas Pitre
2007-08-14 21:41       ` Nicolas Pitre
2007-08-15  1:20         ` Jon Smirl
2007-08-15  1:59           ` Nicolas Pitre
2007-08-15  5:32         ` Shawn O. Pearce
2007-08-15 15:08           ` Jon Smirl [this message]
2007-08-15 17:11             ` Martin Koegler
2007-08-15 18:38               ` Jon Smirl
2007-08-15 19:00                 ` Nicolas Pitre
2007-08-15 19:42                   ` Jon Smirl
2007-08-16  8:10                   ` David Kastrup
2007-08-16 15:34                     ` Nicolas Pitre
2007-08-16 16:13                       ` Jon Smirl
2007-08-16 16:21                         ` Nicolas Pitre
2007-08-15 21:05             ` Nicolas Pitre
2007-08-15 20:49           ` Nicolas Pitre
2007-08-30  4:27             ` Nicolas Pitre
2007-08-30  4:36               ` Nicolas Pitre
2007-08-30 16:17                 ` Jon Smirl
2007-09-01 21:54                 ` Jon Smirl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910708150808x39241071j1a4012f16cd26ef8@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=mkoegler@auto.tuwien.ac.at \
    --cc=nico@cam.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).