git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: larsxschneider@gmail.com
To: git@vger.kernel.org
Cc: Lars Schneider <larsxschneider@gmail.com>
Subject: [RFC] Native access to Git LFS cache
Date: Mon, 27 Jun 2016 07:38:33 +0200	[thread overview]
Message-ID: <1467005913-6503-1-git-send-email-larsxschneider@gmail.com> (raw)

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

I found a way to make Git LFS faster up to a factor of 100x in
repositories with a large number of Git LFS files. I am looking
for comments if my approach would be acceptable by the Git community.

## What is Git LFS?
Git LFS [1] is an extension to Git that handles large files for Git
repositories. The project gained quite some momentum as almost all major
Git hosting services support it (GitHub [1], Atlassian Bitbucket [2],
GitLab [4]).

## What is the problem with Git LFS?
Git LFS is an application that is executed via Git clean/smudge filter.
The process invocation of these filters requires noticeable time (especially
on Windows) even if the Git LFS executable only accesses its local cache.

Based on my previous findings [5] Steve Streeting (@sinbad) improved the
clone times of Git LFS repositories with a lot of files by a factor of 10
or more [6][7].

Unfortunately that fix helps only with cloning. Any local Git operation
that invokes the clean/smudge filter (e.g. switching branches) is still
slow. Even on the Git mailing list a user reported that issue [8].


## Proposed solution
Git LFS caches its objects under .git/lfs/objects. Most of the time Git
LFS objects are already available in the cache (e.g. if you switch branches
back and forth). I implemented these "cache hits" natively in Git.
Please note that this implementation is just a quick and dirty proof of
concept. If the Git community agrees that this kind of approach would be
acceptable then I will start to work on a proper patch series with cross
platform support and unit tests.


## Performance tests
I executed both test runs on a 2,5 GHz Intel Core i7 with SSD and OS X.
A test run is the consecutive execution of four Git commands:
 1. clone the repo
 2. checkout to the "removed-files" branch
 3. timed: checkout the "master" branch
 4. timed: checkout "removed-files" branch

Test command:
set -x; git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo; cd repo; git checkout removed-files; time git checkout master; time git checkout removed-files

I compiled Git with the following flags:
NO_GETTEXT=YesPlease NEEDS_SSL_WITH_CRYPTO=YesPlease make -j 8 CFLAGS="-I/usr/local/opt/openssl/include" LDFLAGS="-L/usr/local/opt/openssl/lib"


### TEST RUN A -- Default Git 2.9 (ab7797d) and Git LFS 1.2.1
+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.77 MiB/s, done.
Checking connectivity... done.
Checking out files: 100% (15001/15001), done.
Git LFS: (15000 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    6m2.979s
user    2m39.066s
sys 2m41.610s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m1.310s
user    0m0.385s
sys 0m0.881s


### TEST RUN B -- Default Git 2.9 with native LFS cache and Git LFS 1.2.1
https://github.com/larsxschneider/git/tree/lfs-cache
+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.44 MiB/s, done.
Checking connectivity... done.
Git LFS: (15001 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    0m2.267s
user    0m0.295s
sys 0m1.948s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m0.715s
user    0m0.072s
sys 0m0.672s


### Results
Default Git:                      6m2.979s + 0m1.310s = 364s
Git with native LFS cache access: 0m2.267s + 0m0.715s = 4s

The native cache solution is almost 100x faster when switching branches
on my local machine with a test repository containing 15,000 Git LFS files.
Based on my previous experience with Git LFS clone I expect even more
dramatic results on Windows.

Thanks,
Lars

[1] https://git-lfs.github.com/
[2] https://github.com/blog/1986-announcing-git-large-file-storage-lfs
[3] http://blogs.atlassian.com/2016/02/git-lfs-for-designers-game-developers-architects/
[4] https://about.gitlab.com/2015/11/23/announcing-git-lfs-support-in-gitlab/
[5] https://github.com/github/git-lfs/issues/931#issuecomment-172939381
[6] https://github.com/github/git-lfs/pull/988
[7] https://developer.atlassian.com/blog/2016/04/git-lfs-12-clone-faster/
[8] http://article.gmane.org/gmane.comp.version-control.git/297809


---
 cache.h     |   2 +
 convert.c   | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 csum-file.c |  31 +++++++++++++
 csum-file.h |   2 +
 hex.c       |  16 +++++++
 5 files changed, 172 insertions(+), 24 deletions(-)

diff --git a/cache.h b/cache.h
index 6049f86..57fdb18 100644
--- a/cache.h
+++ b/cache.h
@@ -1196,6 +1196,8 @@ extern char *sha1_to_hex_r(char *out, const unsigned char *sha1);
 extern char *sha1_to_hex(const unsigned char *sha1);	/* static buffer result! */
 extern char *oid_to_hex(const struct object_id *oid);	/* same static buffer as sha1_to_hex */

+extern char *sha256_to_hex_r(char *out, const unsigned char *sha1);
+
 extern int interpret_branch_name(const char *str, int len, struct strbuf *);
 extern int get_sha1_mb(const char *str, unsigned char *sha1);

diff --git a/convert.c b/convert.c
index b1614bf..006e9c4 100644
--- a/convert.c
+++ b/convert.c
@@ -18,6 +18,10 @@
 #define CONVERT_STAT_BITS_TXT_CRLF  0x2
 #define CONVERT_STAT_BITS_BIN       0x4

+const char *LFS_VERSION_MARKER = "version https://git-lfs.github.com/spec/v1\n";
+const char *LFS_OID_MARKER = "oid sha256:";
+
+
 enum crlf_action {
 	CRLF_UNDEFINED,
 	CRLF_BINARY,
@@ -427,6 +431,79 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }

+static int cached_lfs_smudge(const char *src, const char *cmd,
+							 struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (src &&
+		strlen(src) > strlen(LFS_VERSION_MARKER) &&
+		!strncmp(LFS_VERSION_MARKER, src, strlen(LFS_VERSION_MARKER)) &&
+		!strcmp("git-lfs smudge %f", cmd)
+	) {
+		const char *lfs_oid_found = strstr(src, LFS_OID_MARKER);
+		if (lfs_oid_found) {
+			const char *lfs_oid = lfs_oid_found + strlen(LFS_OID_MARKER);
+
+			// Construct path to LFS object
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+			strbuf_add(lfsbuf, lfs_oid, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid+2, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid, 64);
+
+			if (access(lfsbuf->buf, F_OK) != -1) {
+				// LFS object found in local LFS cache
+				ret = 1;
+			}
+		}
+	}
+	return ret;
+}
+
+static int cached_lfs_clean(const char *path, const char *cmd,
+							struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (path && !strcmp("git-lfs clean %f", cmd)) {
+
+		// TODO: Is there an easy way to access the content of the last
+		// known committed state of this file in the Git repo? If yes,
+		// then we could read the last known Git LFS OID, construct a
+		// path in the Git LFS cache and compare this file against "path".
+		// If both files are equal then we can use the last known committed
+		// state as "clean" and we could get rid of the SHA256 dependency
+		// here.
+		ssize_t file_size;
+		unsigned char sha256[64];
+		sha256fd(path, &sha256, &file_size);
+
+		char lfs_oid[64];
+		sha256_to_hex_r(lfs_oid, sha256);
+
+		// Construct path to LFS object
+		strbuf_reset(lfsbuf);
+		strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+		strbuf_add(lfsbuf, lfs_oid, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid+2, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid, 64);
+
+		if (access(lfsbuf->buf, F_OK) != -1) {
+			// LFS object found in local LFS cache
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, LFS_VERSION_MARKER);
+			strbuf_addstr(lfsbuf, LFS_OID_MARKER);
+			strbuf_add(lfsbuf, lfs_oid, 64);
+			strbuf_addf(lfsbuf, "\nsize %d\n", file_size);
+			ret = 1;
+		}
+	}
+	return ret;
+}
+
 static int apply_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
@@ -437,6 +514,7 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	 * (child --> cmd) --> us
 	 */
 	int ret = 1;
+	struct strbuf lfsbuf = STRBUF_INIT;
 	struct strbuf nbuf = STRBUF_INIT;
 	struct async async;
 	struct filter_params params;
@@ -447,37 +525,56 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	if (!dst)
 		return 1;

-	memset(&async, 0, sizeof(async));
-	async.proc = filter_buffer_or_fd;
-	async.data = &params;
-	async.out = -1;
-	params.src = src;
-	params.size = len;
-	params.fd = fd;
-	params.cmd = cmd;
-	params.path = path;
-
-	fflush(NULL);
-	if (start_async(&async))
-		return 0;	/* error was already reported */
-
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
-	}
-	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	if (cached_lfs_smudge(src, cmd, &lfsbuf)) {
+		fd = open(lfsbuf.buf, O_RDONLY);
+		if (strbuf_read(&nbuf, fd, len) < 0) {
+			error("reading from cached LFS object failed", lfsbuf.buf);
+			ret = 0;
+		}
+		if (close(fd)) {
+			error("closing cached LFS object failed", lfsbuf.buf);
+			ret = 0;
+		}
 	}
-	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	else if (cached_lfs_clean(path, cmd, &lfsbuf)) {
+		strbuf_reset(&nbuf);
+		strbuf_addstr(&nbuf, lfsbuf.buf);
+	} else {
+		memset(&async, 0, sizeof(async));
+		async.proc = filter_buffer_or_fd;
+		async.data = &params;
+		async.out = -1;
+		params.src = src;
+		params.size = len;
+		params.fd = fd;
+		params.cmd = cmd;
+		params.path = path;
+
+		fflush(NULL);
+		if (start_async(&async))
+			return 0;	/* error was already reported */
+
+		if (strbuf_read(&nbuf, async.out, len) < 0) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (close(async.out)) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (finish_async(&async)) {
+			error("external filter %s failed", cmd);
+			ret = 0;
+		}
 	}

 	if (ret) {
 		strbuf_swap(dst, &nbuf);
 	}
 	strbuf_release(&nbuf);
+	strbuf_release(&lfsbuf);
 	return ret;
 }

diff --git a/csum-file.c b/csum-file.c
index a172199..2ca4d7f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,6 +11,37 @@
 #include "progress.h"
 #include "csum-file.h"

+void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size)
+{
+	int fd;
+	struct stat st;
+	fd = open(name, O_RDONLY);
+	if (fd < 0)
+		die_errno("unable to open '%s'", name);
+	fstat(fd, &st);
+	size_t size = xsize_t(st.st_size);
+	*file_size = size;
+
+	SHA256_CTX ctx;
+	SHA256_Init(&ctx);
+	unsigned char fd_buffer[8192];
+
+	while (size > 0) {
+		ssize_t rsize = size < sizeof(fd_buffer) ? size : sizeof(fd_buffer);
+		ssize_t ret = read_in_full(fd, fd_buffer, rsize);
+
+		if (ret < 0)
+			die_errno("%s: sha256 file read error", name);
+		if (ret != rsize)
+			die("failed to read %d bytes from '%s'", (int)rsize, name);
+		SHA256_Update(&ctx, fd_buffer, rsize);
+		size -= rsize;
+	}
+
+	SHA256_Final(sha256, &ctx);
+	close(fd);
+}
+
 static void flush(struct sha1file *f, const void *buf, unsigned int count)
 {
 	if (0 <= f->check_fd && count)  {
diff --git a/csum-file.h b/csum-file.h
index 7530927..bad9262 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -39,6 +39,8 @@ extern void sha1flush(struct sha1file *f);
 extern void crc32_begin(struct sha1file *);
 extern uint32_t crc32_end(struct sha1file *);

+extern void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size);
+
 static inline void sha1write_u8(struct sha1file *f, uint8_t data)
 {
 	sha1write(f, &data, sizeof(data));
diff --git a/hex.c b/hex.c
index 0519f85..73e2077 100644
--- a/hex.c
+++ b/hex.c
@@ -88,3 +88,19 @@ char *oid_to_hex(const struct object_id *oid)
 {
 	return sha1_to_hex(oid->hash);
 }
+
+char *sha256_to_hex_r(char *buffer, const unsigned char *sha1)
+{
+	static const char hex[] = "0123456789abcdef";
+	char *buf = buffer;
+	int i;
+
+	for (i = 0; i < 32; i++) {
+		unsigned int val = *sha1++;
+		*buf++ = hex[val >> 4];
+		*buf++ = hex[val & 0xf];
+	}
+	*buf = '\0';
+
+	return buffer;
+}
--
2.5.1


             reply	other threads:[~2016-06-27  5:38 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-27  5:38 larsxschneider [this message]
2016-06-27 15:53 ` [RFC] Native access to Git LFS cache Duy Nguyen
2016-06-28  9:40   ` Johannes Schindelin
2016-06-28 13:11     ` Duy Nguyen
2016-06-28 13:14       ` Johannes Schindelin
2016-06-28 13:43         ` Lars Schneider
2016-06-28 16:00           ` Duy Nguyen
2016-06-28 15:50         ` Duy Nguyen
2016-06-27 16:09 ` Junio C Hamano
2016-06-28 13:22   ` Lars Schneider
2016-06-28 13:53     ` Christian Couder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1467005913-6503-1-git-send-email-larsxschneider@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).