git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] refs: write packed_refs file using stdio
@ 2014-09-10 10:03 Jeff King
  2014-09-10 11:21 ` Michael Haggerty
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2014-09-10 10:03 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty

We write each line of a new packed-refs file individually
using a write() syscall (and sometimes 2, if the ref is
peeled). Since each line is only about 50-100 bytes long,
this creates a lot of system call overhead.

We can instead open a stdio handle around our descriptor and
use fprintf to write to it. The extra buffering is not a
problem for us, because nobody will read our new packed-refs
file until we call commit_lock_file (by which point we have
flushed everything).

On a pathological repository with 8.5 million refs, this
dropped the time to run `git pack-refs` from 20s to 6s.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously that repo is ridiculous (but a sad reality for me).

However, I think the benefits extend to smaller files, too. And it's
pretty easy to do (and I actually think the resulting write_packed_entry
is a lot easier to read, as well as lifting some arbitrary limits).

 cache.h        |  2 ++
 refs.c         | 39 ++++++++++++++++-----------------------
 write_or_die.c | 15 +++++++++++++++
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/cache.h b/cache.h
index 4d5b76c..bc286ce 100644
--- a/cache.h
+++ b/cache.h
@@ -1395,6 +1395,8 @@ extern const char *git_mailmap_blob;
 
 /* IO helper functions */
 extern void maybe_flush_or_die(FILE *, const char *);
+__attribute__((format (printf, 2, 3)))
+extern void fprintf_or_die(FILE *, const char *fmt, ...);
 extern int copy_fd(int ifd, int ofd);
 extern int copy_file(const char *dst, const char *src, int mode);
 extern int copy_file_with_time(const char *dst, const char *src, int mode);
diff --git a/refs.c b/refs.c
index 27927f2..f08faed 100644
--- a/refs.c
+++ b/refs.c
@@ -2191,25 +2191,12 @@ struct ref_lock *lock_any_ref_for_update(const char *refname,
  * Write an entry to the packed-refs file for the specified refname.
  * If peeled is non-NULL, write it as the entry's peeled value.
  */
-static void write_packed_entry(int fd, char *refname, unsigned char *sha1,
+static void write_packed_entry(FILE *fh, char *refname, unsigned char *sha1,
 			       unsigned char *peeled)
 {
-	char line[PATH_MAX + 100];
-	int len;
-
-	len = snprintf(line, sizeof(line), "%s %s\n",
-		       sha1_to_hex(sha1), refname);
-	/* this should not happen but just being defensive */
-	if (len > sizeof(line))
-		die("too long a refname '%s'", refname);
-	write_or_die(fd, line, len);
-
-	if (peeled) {
-		if (snprintf(line, sizeof(line), "^%s\n",
-			     sha1_to_hex(peeled)) != PEELED_LINE_LENGTH)
-			die("internal error");
-		write_or_die(fd, line, PEELED_LINE_LENGTH);
-	}
+	fprintf_or_die(fh, "%s %s\n", sha1_to_hex(sha1), refname);
+	if (peeled)
+		fprintf_or_die(fh, "^%s\n", sha1_to_hex(peeled));
 }
 
 /*
@@ -2217,13 +2204,12 @@ static void write_packed_entry(int fd, char *refname, unsigned char *sha1,
  */
 static int write_packed_entry_fn(struct ref_entry *entry, void *cb_data)
 {
-	int *fd = cb_data;
 	enum peel_status peel_status = peel_entry(entry, 0);
 
 	if (peel_status != PEEL_PEELED && peel_status != PEEL_NON_TAG)
 		error("internal error: %s is not a valid packed reference!",
 		      entry->name);
-	write_packed_entry(*fd, entry->name, entry->u.value.sha1,
+	write_packed_entry(cb_data, entry->name, entry->u.value.sha1,
 			   peel_status == PEEL_PEELED ?
 			   entry->u.value.peeled : NULL);
 	return 0;
@@ -2259,15 +2245,22 @@ int commit_packed_refs(void)
 		get_packed_ref_cache(&ref_cache);
 	int error = 0;
 	int save_errno = 0;
+	FILE *out;
 
 	if (!packed_ref_cache->lock)
 		die("internal error: packed-refs not locked");
-	write_or_die(packed_ref_cache->lock->fd,
-		     PACKED_REFS_HEADER, strlen(PACKED_REFS_HEADER));
 
+	out = fdopen(packed_ref_cache->lock->fd, "w");
+	if (!out)
+		die_errno("unable to fdopen packed-refs descriptor");
+
+	fprintf_or_die(out, "%s", PACKED_REFS_HEADER);
 	do_for_each_entry_in_dir(get_packed_ref_dir(packed_ref_cache),
-				 0, write_packed_entry_fn,
-				 &packed_ref_cache->lock->fd);
+				 0, write_packed_entry_fn, out);
+	if (fclose(out))
+		die_errno("write error");
+	packed_ref_cache->lock->fd = -1;
+
 	if (commit_lock_file(packed_ref_cache->lock)) {
 		save_errno = errno;
 		error = -1;
diff --git a/write_or_die.c b/write_or_die.c
index b50f99a..e7afe7a 100644
--- a/write_or_die.c
+++ b/write_or_die.c
@@ -49,6 +49,21 @@ void maybe_flush_or_die(FILE *f, const char *desc)
 	}
 }
 
+void fprintf_or_die(FILE *f, const char *fmt, ...)
+{
+	va_list ap;
+	int ret;
+
+	va_start(ap, fmt);
+	ret = vfprintf(f, fmt, ap);
+	va_end(ap);
+
+	if (ret < 0) {
+		check_pipe(errno);
+		die_errno("write error");
+	}
+}
+
 void fsync_or_die(int fd, const char *msg)
 {
 	if (fsync(fd) < 0) {
-- 
2.1.0.373.g91ca799

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] refs: write packed_refs file using stdio
  2014-09-10 10:03 [PATCH] refs: write packed_refs file using stdio Jeff King
@ 2014-09-10 11:21 ` Michael Haggerty
  2014-09-10 11:39   ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Michael Haggerty @ 2014-09-10 11:21 UTC (permalink / raw)
  To: Jeff King, git

On 09/10/2014 12:03 PM, Jeff King wrote:
> We write each line of a new packed-refs file individually
> using a write() syscall (and sometimes 2, if the ref is
> peeled). Since each line is only about 50-100 bytes long,
> this creates a lot of system call overhead.
> 
> We can instead open a stdio handle around our descriptor and
> use fprintf to write to it. The extra buffering is not a
> problem for us, because nobody will read our new packed-refs
> file until we call commit_lock_file (by which point we have
> flushed everything).
> 
> On a pathological repository with 8.5 million refs, this
> dropped the time to run `git pack-refs` from 20s to 6s.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Obviously that repo is ridiculous (but a sad reality for me).
> 
> However, I think the benefits extend to smaller files, too. And it's
> pretty easy to do (and I actually think the resulting write_packed_entry
> is a lot easier to read, as well as lifting some arbitrary limits).
> 
>  cache.h        |  2 ++
>  refs.c         | 39 ++++++++++++++++-----------------------
>  write_or_die.c | 15 +++++++++++++++
>  3 files changed, 33 insertions(+), 23 deletions(-)
> 
> diff --git a/cache.h b/cache.h
> index 4d5b76c..bc286ce 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1395,6 +1395,8 @@ extern const char *git_mailmap_blob;
>  
>  /* IO helper functions */
>  extern void maybe_flush_or_die(FILE *, const char *);
> +__attribute__((format (printf, 2, 3)))
> +extern void fprintf_or_die(FILE *, const char *fmt, ...);
>  extern int copy_fd(int ifd, int ofd);
>  extern int copy_file(const char *dst, const char *src, int mode);
>  extern int copy_file_with_time(const char *dst, const char *src, int mode);
> diff --git a/refs.c b/refs.c
> index 27927f2..f08faed 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -2191,25 +2191,12 @@ struct ref_lock *lock_any_ref_for_update(const char *refname,
>   * Write an entry to the packed-refs file for the specified refname.
>   * If peeled is non-NULL, write it as the entry's peeled value.
>   */
> -static void write_packed_entry(int fd, char *refname, unsigned char *sha1,
> +static void write_packed_entry(FILE *fh, char *refname, unsigned char *sha1,
>  			       unsigned char *peeled)
>  {
> -	char line[PATH_MAX + 100];
> -	int len;
> -
> -	len = snprintf(line, sizeof(line), "%s %s\n",
> -		       sha1_to_hex(sha1), refname);
> -	/* this should not happen but just being defensive */
> -	if (len > sizeof(line))
> -		die("too long a refname '%s'", refname);
> -	write_or_die(fd, line, len);
> -
> -	if (peeled) {
> -		if (snprintf(line, sizeof(line), "^%s\n",
> -			     sha1_to_hex(peeled)) != PEELED_LINE_LENGTH)
> -			die("internal error");
> -		write_or_die(fd, line, PEELED_LINE_LENGTH);
> -	}
> +	fprintf_or_die(fh, "%s %s\n", sha1_to_hex(sha1), refname);
> +	if (peeled)
> +		fprintf_or_die(fh, "^%s\n", sha1_to_hex(peeled));
>  }
>  
>  /*
> @@ -2217,13 +2204,12 @@ static void write_packed_entry(int fd, char *refname, unsigned char *sha1,
>   */
>  static int write_packed_entry_fn(struct ref_entry *entry, void *cb_data)
>  {
> -	int *fd = cb_data;
>  	enum peel_status peel_status = peel_entry(entry, 0);
>  
>  	if (peel_status != PEEL_PEELED && peel_status != PEEL_NON_TAG)
>  		error("internal error: %s is not a valid packed reference!",
>  		      entry->name);
> -	write_packed_entry(*fd, entry->name, entry->u.value.sha1,
> +	write_packed_entry(cb_data, entry->name, entry->u.value.sha1,
>  			   peel_status == PEEL_PEELED ?
>  			   entry->u.value.peeled : NULL);
>  	return 0;
> @@ -2259,15 +2245,22 @@ int commit_packed_refs(void)
>  		get_packed_ref_cache(&ref_cache);
>  	int error = 0;
>  	int save_errno = 0;
> +	FILE *out;
>  
>  	if (!packed_ref_cache->lock)
>  		die("internal error: packed-refs not locked");
> -	write_or_die(packed_ref_cache->lock->fd,
> -		     PACKED_REFS_HEADER, strlen(PACKED_REFS_HEADER));
>  
> +	out = fdopen(packed_ref_cache->lock->fd, "w");
> +	if (!out)
> +		die_errno("unable to fdopen packed-refs descriptor");
> +
> +	fprintf_or_die(out, "%s", PACKED_REFS_HEADER);
>  	do_for_each_entry_in_dir(get_packed_ref_dir(packed_ref_cache),
> -				 0, write_packed_entry_fn,
> -				 &packed_ref_cache->lock->fd);
> +				 0, write_packed_entry_fn, out);
> +	if (fclose(out))
> +		die_errno("write error");
> +	packed_ref_cache->lock->fd = -1;

It might be a minuscule bit safer to set `lock->fd = -1` *before*
calling `fclose()`.

TBH, it makes me uncomfortable having code outside of `lockfile.c`
having this level of intimacy with lockfile objects. I think it would be
better to have a

    FILE *fopen_lock_file(struct *lock_file, const char *mode);

that records the `FILE *` inside the `lockfile` instance, and to teach
`commit_lock_file()` and its friends to call `fclose()` if the `FILE *`
was created. I think that such a feature would encourage other lockfile
users to use the more convenient and readable stdio API.

But there is precedent for what you are doing so I will add
`fopen_lock_file()` to my mythical todo list and not bother you further
with the idea.

> +
>  	if (commit_lock_file(packed_ref_cache->lock)) {
>  		save_errno = errno;
>  		error = -1;
> diff --git a/write_or_die.c b/write_or_die.c
> index b50f99a..e7afe7a 100644
> --- a/write_or_die.c
> +++ b/write_or_die.c
> @@ -49,6 +49,21 @@ void maybe_flush_or_die(FILE *f, const char *desc)
>  	}
>  }
>  
> +void fprintf_or_die(FILE *f, const char *fmt, ...)
> +{
> +	va_list ap;
> +	int ret;
> +
> +	va_start(ap, fmt);
> +	ret = vfprintf(f, fmt, ap);
> +	va_end(ap);
> +
> +	if (ret < 0) {
> +		check_pipe(errno);
> +		die_errno("write error");
> +	}
> +}
> +
>  void fsync_or_die(int fd, const char *msg)
>  {
>  	if (fsync(fd) < 0) {
> 

Reviewed-by: Michael Haggerty <mhagger@alum.mit.edu>

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] refs: write packed_refs file using stdio
  2014-09-10 11:21 ` Michael Haggerty
@ 2014-09-10 11:39   ` Jeff King
       [not found]     ` <CAL=YDW=uWP2kWB31MEvJvVP7yUdwoh95PvfEYT6LT1x2UXpAvg@mail.gmail.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2014-09-10 11:39 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

On Wed, Sep 10, 2014 at 01:21:27PM +0200, Michael Haggerty wrote:

> > +	if (fclose(out))
> > +		die_errno("write error");
> > +	packed_ref_cache->lock->fd = -1;
> 
> It might be a minuscule bit safer to set `lock->fd = -1` *before*
> calling `fclose()`.

Yeah, I considered that. The worst case is a signal coming in between
the two calls (or a somebody calling die() between the two :) ). In that
case the lockfile code will close() the fd again, which should be a noop
(since nobody will have opened it in the interim...right?).

That "since" assumption is the dangerous part. But on the other hand, if
we unset the fd first, then the lockfile code may fail to close it if it
is called before the fclose() (definitely by a signal in this case). I
think there are platforms where that would cause us to fail to remove
the file, which is annoying.

So I dunno. We cannot be atomic here. I could go either way.

> TBH, it makes me uncomfortable having code outside of `lockfile.c`
> having this level of intimacy with lockfile objects.

I kind of agree.

> I think it would be better to have a
> 
>     FILE *fopen_lock_file(struct *lock_file, const char *mode);
> 
> that records the `FILE *` inside the `lockfile` instance, and to teach
> `commit_lock_file()` and its friends to call `fclose()` if the `FILE *`
> was created. I think that such a feature would encourage other lockfile
> users to use the more convenient and readable stdio API.

I was tempted by that. We could also do:

  if (fflush(out))
	die_errno("write error");
  if (commit_lock_file(...))
     ...

which sidesteps the issue. We do then have to _later_ call fclose(out)
to free the handle memory, which is going to want to close the fd again.
Putting us back in the "let's hope nobody opened it in the meantime"
case from above.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] refs: write packed_refs file using stdio
       [not found]     ` <CAL=YDW=uWP2kWB31MEvJvVP7yUdwoh95PvfEYT6LT1x2UXpAvg@mail.gmail.com>
@ 2014-09-10 19:14       ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2014-09-10 19:14 UTC (permalink / raw)
  To: Ronnie Sahlberg; +Cc: Michael Haggerty, git@vger.kernel.org

On Wed, Sep 10, 2014 at 07:32:17AM -0700, Ronnie Sahlberg wrote:

> Even better could be to build an iovec for all the data and use writev() ?
> (now without gmail adding formatting)

I'm not sure that would make anything easier. We're literally
sprintf-ing (and calling sha1_to_hex) into a buffer so we can write()
it. That seems like a solid case for using fprintf instead.

We can't just writev() a bunch of buffers we have already. We'd
literally have to make an array of buffers to format into. It's much
simpler (and probably faster) to just format directly into stdio's
buffers.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2014-09-10 19:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-10 10:03 [PATCH] refs: write packed_refs file using stdio Jeff King
2014-09-10 11:21 ` Michael Haggerty
2014-09-10 11:39   ` Jeff King
     [not found]     ` <CAL=YDW=uWP2kWB31MEvJvVP7yUdwoh95PvfEYT6LT1x2UXpAvg@mail.gmail.com>
2014-09-10 19:14       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).