git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC] Native access to Git LFS cache
@ 2016-06-27  5:38 larsxschneider
  2016-06-27 15:53 ` Duy Nguyen
  2016-06-27 16:09 ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: larsxschneider @ 2016-06-27  5:38 UTC (permalink / raw)
  To: git; +Cc: Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

I found a way to make Git LFS faster up to a factor of 100x in
repositories with a large number of Git LFS files. I am looking
for comments if my approach would be acceptable by the Git community.

## What is Git LFS?
Git LFS [1] is an extension to Git that handles large files for Git
repositories. The project gained quite some momentum as almost all major
Git hosting services support it (GitHub [1], Atlassian Bitbucket [2],
GitLab [4]).

## What is the problem with Git LFS?
Git LFS is an application that is executed via Git clean/smudge filter.
The process invocation of these filters requires noticeable time (especially
on Windows) even if the Git LFS executable only accesses its local cache.

Based on my previous findings [5] Steve Streeting (@sinbad) improved the
clone times of Git LFS repositories with a lot of files by a factor of 10
or more [6][7].

Unfortunately that fix helps only with cloning. Any local Git operation
that invokes the clean/smudge filter (e.g. switching branches) is still
slow. Even on the Git mailing list a user reported that issue [8].


## Proposed solution
Git LFS caches its objects under .git/lfs/objects. Most of the time Git
LFS objects are already available in the cache (e.g. if you switch branches
back and forth). I implemented these "cache hits" natively in Git.
Please note that this implementation is just a quick and dirty proof of
concept. If the Git community agrees that this kind of approach would be
acceptable then I will start to work on a proper patch series with cross
platform support and unit tests.


## Performance tests
I executed both test runs on a 2,5 GHz Intel Core i7 with SSD and OS X.
A test run is the consecutive execution of four Git commands:
 1. clone the repo
 2. checkout to the "removed-files" branch
 3. timed: checkout the "master" branch
 4. timed: checkout "removed-files" branch

Test command:
set -x; git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo; cd repo; git checkout removed-files; time git checkout master; time git checkout removed-files

I compiled Git with the following flags:
NO_GETTEXT=YesPlease NEEDS_SSL_WITH_CRYPTO=YesPlease make -j 8 CFLAGS="-I/usr/local/opt/openssl/include" LDFLAGS="-L/usr/local/opt/openssl/lib"


### TEST RUN A -- Default Git 2.9 (ab7797d) and Git LFS 1.2.1
+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.77 MiB/s, done.
Checking connectivity... done.
Checking out files: 100% (15001/15001), done.
Git LFS: (15000 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    6m2.979s
user    2m39.066s
sys 2m41.610s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m1.310s
user    0m0.385s
sys 0m0.881s


### TEST RUN B -- Default Git 2.9 with native LFS cache and Git LFS 1.2.1
https://github.com/larsxschneider/git/tree/lfs-cache
+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.44 MiB/s, done.
Checking connectivity... done.
Git LFS: (15001 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    0m2.267s
user    0m0.295s
sys 0m1.948s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m0.715s
user    0m0.072s
sys 0m0.672s


### Results
Default Git:                      6m2.979s + 0m1.310s = 364s
Git with native LFS cache access: 0m2.267s + 0m0.715s = 4s

The native cache solution is almost 100x faster when switching branches
on my local machine with a test repository containing 15,000 Git LFS files.
Based on my previous experience with Git LFS clone I expect even more
dramatic results on Windows.

Thanks,
Lars

[1] https://git-lfs.github.com/
[2] https://github.com/blog/1986-announcing-git-large-file-storage-lfs
[3] http://blogs.atlassian.com/2016/02/git-lfs-for-designers-game-developers-architects/
[4] https://about.gitlab.com/2015/11/23/announcing-git-lfs-support-in-gitlab/
[5] https://github.com/github/git-lfs/issues/931#issuecomment-172939381
[6] https://github.com/github/git-lfs/pull/988
[7] https://developer.atlassian.com/blog/2016/04/git-lfs-12-clone-faster/
[8] http://article.gmane.org/gmane.comp.version-control.git/297809


---
 cache.h     |   2 +
 convert.c   | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 csum-file.c |  31 +++++++++++++
 csum-file.h |   2 +
 hex.c       |  16 +++++++
 5 files changed, 172 insertions(+), 24 deletions(-)

diff --git a/cache.h b/cache.h
index 6049f86..57fdb18 100644
--- a/cache.h
+++ b/cache.h
@@ -1196,6 +1196,8 @@ extern char *sha1_to_hex_r(char *out, const unsigned char *sha1);
 extern char *sha1_to_hex(const unsigned char *sha1);	/* static buffer result! */
 extern char *oid_to_hex(const struct object_id *oid);	/* same static buffer as sha1_to_hex */

+extern char *sha256_to_hex_r(char *out, const unsigned char *sha1);
+
 extern int interpret_branch_name(const char *str, int len, struct strbuf *);
 extern int get_sha1_mb(const char *str, unsigned char *sha1);

diff --git a/convert.c b/convert.c
index b1614bf..006e9c4 100644
--- a/convert.c
+++ b/convert.c
@@ -18,6 +18,10 @@
 #define CONVERT_STAT_BITS_TXT_CRLF  0x2
 #define CONVERT_STAT_BITS_BIN       0x4

+const char *LFS_VERSION_MARKER = "version https://git-lfs.github.com/spec/v1\n";
+const char *LFS_OID_MARKER = "oid sha256:";
+
+
 enum crlf_action {
 	CRLF_UNDEFINED,
 	CRLF_BINARY,
@@ -427,6 +431,79 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }

+static int cached_lfs_smudge(const char *src, const char *cmd,
+							 struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (src &&
+		strlen(src) > strlen(LFS_VERSION_MARKER) &&
+		!strncmp(LFS_VERSION_MARKER, src, strlen(LFS_VERSION_MARKER)) &&
+		!strcmp("git-lfs smudge %f", cmd)
+	) {
+		const char *lfs_oid_found = strstr(src, LFS_OID_MARKER);
+		if (lfs_oid_found) {
+			const char *lfs_oid = lfs_oid_found + strlen(LFS_OID_MARKER);
+
+			// Construct path to LFS object
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+			strbuf_add(lfsbuf, lfs_oid, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid+2, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid, 64);
+
+			if (access(lfsbuf->buf, F_OK) != -1) {
+				// LFS object found in local LFS cache
+				ret = 1;
+			}
+		}
+	}
+	return ret;
+}
+
+static int cached_lfs_clean(const char *path, const char *cmd,
+							struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (path && !strcmp("git-lfs clean %f", cmd)) {
+
+		// TODO: Is there an easy way to access the content of the last
+		// known committed state of this file in the Git repo? If yes,
+		// then we could read the last known Git LFS OID, construct a
+		// path in the Git LFS cache and compare this file against "path".
+		// If both files are equal then we can use the last known committed
+		// state as "clean" and we could get rid of the SHA256 dependency
+		// here.
+		ssize_t file_size;
+		unsigned char sha256[64];
+		sha256fd(path, &sha256, &file_size);
+
+		char lfs_oid[64];
+		sha256_to_hex_r(lfs_oid, sha256);
+
+		// Construct path to LFS object
+		strbuf_reset(lfsbuf);
+		strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+		strbuf_add(lfsbuf, lfs_oid, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid+2, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid, 64);
+
+		if (access(lfsbuf->buf, F_OK) != -1) {
+			// LFS object found in local LFS cache
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, LFS_VERSION_MARKER);
+			strbuf_addstr(lfsbuf, LFS_OID_MARKER);
+			strbuf_add(lfsbuf, lfs_oid, 64);
+			strbuf_addf(lfsbuf, "\nsize %d\n", file_size);
+			ret = 1;
+		}
+	}
+	return ret;
+}
+
 static int apply_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
@@ -437,6 +514,7 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	 * (child --> cmd) --> us
 	 */
 	int ret = 1;
+	struct strbuf lfsbuf = STRBUF_INIT;
 	struct strbuf nbuf = STRBUF_INIT;
 	struct async async;
 	struct filter_params params;
@@ -447,37 +525,56 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	if (!dst)
 		return 1;

-	memset(&async, 0, sizeof(async));
-	async.proc = filter_buffer_or_fd;
-	async.data = &params;
-	async.out = -1;
-	params.src = src;
-	params.size = len;
-	params.fd = fd;
-	params.cmd = cmd;
-	params.path = path;
-
-	fflush(NULL);
-	if (start_async(&async))
-		return 0;	/* error was already reported */
-
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
-	}
-	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	if (cached_lfs_smudge(src, cmd, &lfsbuf)) {
+		fd = open(lfsbuf.buf, O_RDONLY);
+		if (strbuf_read(&nbuf, fd, len) < 0) {
+			error("reading from cached LFS object failed", lfsbuf.buf);
+			ret = 0;
+		}
+		if (close(fd)) {
+			error("closing cached LFS object failed", lfsbuf.buf);
+			ret = 0;
+		}
 	}
-	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	else if (cached_lfs_clean(path, cmd, &lfsbuf)) {
+		strbuf_reset(&nbuf);
+		strbuf_addstr(&nbuf, lfsbuf.buf);
+	} else {
+		memset(&async, 0, sizeof(async));
+		async.proc = filter_buffer_or_fd;
+		async.data = &params;
+		async.out = -1;
+		params.src = src;
+		params.size = len;
+		params.fd = fd;
+		params.cmd = cmd;
+		params.path = path;
+
+		fflush(NULL);
+		if (start_async(&async))
+			return 0;	/* error was already reported */
+
+		if (strbuf_read(&nbuf, async.out, len) < 0) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (close(async.out)) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (finish_async(&async)) {
+			error("external filter %s failed", cmd);
+			ret = 0;
+		}
 	}

 	if (ret) {
 		strbuf_swap(dst, &nbuf);
 	}
 	strbuf_release(&nbuf);
+	strbuf_release(&lfsbuf);
 	return ret;
 }

diff --git a/csum-file.c b/csum-file.c
index a172199..2ca4d7f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,6 +11,37 @@
 #include "progress.h"
 #include "csum-file.h"

+void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size)
+{
+	int fd;
+	struct stat st;
+	fd = open(name, O_RDONLY);
+	if (fd < 0)
+		die_errno("unable to open '%s'", name);
+	fstat(fd, &st);
+	size_t size = xsize_t(st.st_size);
+	*file_size = size;
+
+	SHA256_CTX ctx;
+	SHA256_Init(&ctx);
+	unsigned char fd_buffer[8192];
+
+	while (size > 0) {
+		ssize_t rsize = size < sizeof(fd_buffer) ? size : sizeof(fd_buffer);
+		ssize_t ret = read_in_full(fd, fd_buffer, rsize);
+
+		if (ret < 0)
+			die_errno("%s: sha256 file read error", name);
+		if (ret != rsize)
+			die("failed to read %d bytes from '%s'", (int)rsize, name);
+		SHA256_Update(&ctx, fd_buffer, rsize);
+		size -= rsize;
+	}
+
+	SHA256_Final(sha256, &ctx);
+	close(fd);
+}
+
 static void flush(struct sha1file *f, const void *buf, unsigned int count)
 {
 	if (0 <= f->check_fd && count)  {
diff --git a/csum-file.h b/csum-file.h
index 7530927..bad9262 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -39,6 +39,8 @@ extern void sha1flush(struct sha1file *f);
 extern void crc32_begin(struct sha1file *);
 extern uint32_t crc32_end(struct sha1file *);

+extern void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size);
+
 static inline void sha1write_u8(struct sha1file *f, uint8_t data)
 {
 	sha1write(f, &data, sizeof(data));
diff --git a/hex.c b/hex.c
index 0519f85..73e2077 100644
--- a/hex.c
+++ b/hex.c
@@ -88,3 +88,19 @@ char *oid_to_hex(const struct object_id *oid)
 {
 	return sha1_to_hex(oid->hash);
 }
+
+char *sha256_to_hex_r(char *buffer, const unsigned char *sha1)
+{
+	static const char hex[] = "0123456789abcdef";
+	char *buf = buffer;
+	int i;
+
+	for (i = 0; i < 32; i++) {
+		unsigned int val = *sha1++;
+		*buf++ = hex[val >> 4];
+		*buf++ = hex[val & 0xf];
+	}
+	*buf = '\0';
+
+	return buffer;
+}
--
2.5.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-27  5:38 [RFC] Native access to Git LFS cache larsxschneider
@ 2016-06-27 15:53 ` Duy Nguyen
  2016-06-28  9:40   ` Johannes Schindelin
  2016-06-27 16:09 ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Duy Nguyen @ 2016-06-27 15:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List

On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
> ## Proposed solution
> Git LFS caches its objects under .git/lfs/objects. Most of the time Git
> LFS objects are already available in the cache (e.g. if you switch branches
> back and forth). I implemented these "cache hits" natively in Git.
> Please note that this implementation is just a quick and dirty proof of
> concept. If the Git community agrees that this kind of approach would be
> acceptable then I will start to work on a proper patch series with cross
> platform support and unit tests.

Would it be possible to move all this code to a separate daemon?
Instead of spawning a new process to do the filtering, you send a
command "convert this" over maybe unix socket and either receive the
whole result over the socket, or receive a path of the result.

I don't think hard coding "git-lfs" is a good way to go (if you keep
that in the final impl. of course). I guess the costly part is
spawning processes and going through the same process initialization
for every object. If we keep a daemon running, all that is gone. You
still have to pay for extra context switches and memory copy (unless
you send the path, but then it could be racy), but I think that's
negligible. And all smudge/clean filters can do caching and more if
they want to.
-- 
Duy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-27  5:38 [RFC] Native access to Git LFS cache larsxschneider
  2016-06-27 15:53 ` Duy Nguyen
@ 2016-06-27 16:09 ` Junio C Hamano
  2016-06-28 13:22   ` Lars Schneider
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2016-06-27 16:09 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

larsxschneider@gmail.com writes:

> Unfortunately that fix helps only with cloning. Any local Git operation
> that invokes the clean/smudge filter (e.g. switching branches) is still
> slow.

Do you know where the slowness comes from?  Does Joey's new
clean/smudge interface help GitLFS?

You are not likely to get anything that knows that a blob object may
be named as anything other than SHA-1("blob <len>" + <contents>) to
Git core.  The remote-object-store idea that was floated by Peff and
Christian started running with at least maintains that object naming
property and has a better chance of interacting better with the core,
but LFS, Annex or anything that would not preserve the object naming
would not.

Personally, I view a surrogate blob left by LFS in the tree object
and filtered via clean/smudge a "smarter" kind of symbolic link that
points outside what Git controls.  The area outside what Git
controls is left to be managed by whatever the add-on does; Git
shouldn't even be aware of how they are structured and/or managed.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-27 15:53 ` Duy Nguyen
@ 2016-06-28  9:40   ` Johannes Schindelin
  2016-06-28 13:11     ` Duy Nguyen
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2016-06-28  9:40 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Lars Schneider, Git Mailing List

Hi Duy,

On Mon, 27 Jun 2016, Duy Nguyen wrote:

> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
> > ## Proposed solution
> > Git LFS caches its objects under .git/lfs/objects. Most of the time
> > Git LFS objects are already available in the cache (e.g. if you switch
> > branches back and forth). I implemented these "cache hits" natively in
> > Git.  Please note that this implementation is just a quick and dirty
> > proof of concept. If the Git community agrees that this kind of
> > approach would be acceptable then I will start to work on a proper
> > patch series with cross platform support and unit tests.
> 
> Would it be possible to move all this code to a separate daemon?
> Instead of spawning a new process to do the filtering, you send a
> command "convert this" over maybe unix socket and either receive the
> whole result over the socket, or receive a path of the result.

Unix sockets are not really portable...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28  9:40   ` Johannes Schindelin
@ 2016-06-28 13:11     ` Duy Nguyen
  2016-06-28 13:14       ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Duy Nguyen @ 2016-06-28 13:11 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Lars Schneider, Git Mailing List

On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Duy,
>
> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>
>> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
>> > ## Proposed solution
>> > Git LFS caches its objects under .git/lfs/objects. Most of the time
>> > Git LFS objects are already available in the cache (e.g. if you switch
>> > branches back and forth). I implemented these "cache hits" natively in
>> > Git.  Please note that this implementation is just a quick and dirty
>> > proof of concept. If the Git community agrees that this kind of
>> > approach would be acceptable then I will start to work on a proper
>> > patch series with cross platform support and unit tests.
>>
>> Would it be possible to move all this code to a separate daemon?
>> Instead of spawning a new process to do the filtering, you send a
>> command "convert this" over maybe unix socket and either receive the
>> whole result over the socket, or receive a path of the result.
>
> Unix sockets are not really portable...

It's the same situation as index-helper. I expect you guys will
replace the transport with named pipe or similar.
-- 
Duy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28 13:11     ` Duy Nguyen
@ 2016-06-28 13:14       ` Johannes Schindelin
  2016-06-28 13:43         ` Lars Schneider
  2016-06-28 15:50         ` Duy Nguyen
  0 siblings, 2 replies; 11+ messages in thread
From: Johannes Schindelin @ 2016-06-28 13:14 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Lars Schneider, Git Mailing List

Hi Duy,

On Tue, 28 Jun 2016, Duy Nguyen wrote:

> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > On Mon, 27 Jun 2016, Duy Nguyen wrote:
> >
> >> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
> >> > ## Proposed solution
> >> > Git LFS caches its objects under .git/lfs/objects. Most of the time
> >> > Git LFS objects are already available in the cache (e.g. if you
> >> > switch branches back and forth). I implemented these "cache hits"
> >> > natively in Git.  Please note that this implementation is just a
> >> > quick and dirty proof of concept. If the Git community agrees that
> >> > this kind of approach would be acceptable then I will start to work
> >> > on a proper patch series with cross platform support and unit
> >> > tests.
> >>
> >> Would it be possible to move all this code to a separate daemon?
> >> Instead of spawning a new process to do the filtering, you send a
> >> command "convert this" over maybe unix socket and either receive the
> >> whole result over the socket, or receive a path of the result.
> >
> > Unix sockets are not really portable...
> 
> It's the same situation as index-helper. I expect you guys will
> replace the transport with named pipe or similar.

Yes, I will have to work on that. But I might need to ask for a change in
the design if I hit some obstacle there: named pipes are not the same at
all as Unix sockets.

Read: it will be painful, and not a general solution. So every new Unix
socket that you introduce will introduce new problems for me.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-27 16:09 ` Junio C Hamano
@ 2016-06-28 13:22   ` Lars Schneider
  2016-06-28 13:53     ` Christian Couder
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Schneider @ 2016-06-28 13:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Christian Couder, Jeff King


> On 27 Jun 2016, at 18:09, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> Unfortunately that fix helps only with cloning. Any local Git operation
>> that invokes the clean/smudge filter (e.g. switching branches) is still
>> slow.
> 
> Do you know where the slowness comes from?  Does Joey's new
> clean/smudge interface help GitLFS?

I am pretty sure the startup time of the external clean/smudge process
causes the slowness and consequently I don't think Joey's patch would help. 
The following tests makes me believe that:

I ran the same test as in my original email using the repo with 15,000 
LFS files. Instead of the LFS binary I use the fast and simple shell 
built-in `true` command:

$ git -c filter.lfs.smudge=true -c filter.lfs.clean=true clone https://github.com/larsxschneider/lfstest-manyfiles.git
$ cd lfstest-manyfiles/
$ time git -c filter.lfs.smudge=true -c filter.lfs.clean=true checkout removed-files

real	0m47.030s
user	0m29.521s
sys	0m16.993s

It still takes 47 seconds to switch the branch. Does this test prove my
point or do you see a flaw in the test?


> You are not likely to get anything that knows that a blob object may
> be named as anything other than SHA-1("blob <len>" + <contents>) to
> Git core.  The remote-object-store idea that was floated by Peff and
> Christian started running with at least maintains that object naming
> property and has a better chance of interacting better with the core,
> but LFS, Annex or anything that would not preserve the object naming
> would not.
> 
> Personally, I view a surrogate blob left by LFS in the tree object
> and filtered via clean/smudge a "smarter" kind of symbolic link that
> points outside what Git controls.  The area outside what Git
> controls is left to be managed by whatever the add-on does; Git
> shouldn't even be aware of how they are structured and/or managed.

I understand and somewhat anticipated your point of view. I will try
to find a less intrusive solution.

@Christian/Peff: 
Is there a place to look for more info about your remote-object-store idea? 

Thanks,
Lars

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28 13:14       ` Johannes Schindelin
@ 2016-06-28 13:43         ` Lars Schneider
  2016-06-28 16:00           ` Duy Nguyen
  2016-06-28 15:50         ` Duy Nguyen
  1 sibling, 1 reply; 11+ messages in thread
From: Lars Schneider @ 2016-06-28 13:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Duy Nguyen, Git Mailing List, technoweenie


> On 28 Jun 2016, at 15:14, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> 
> Hi Duy,
> 
> On Tue, 28 Jun 2016, Duy Nguyen wrote:
> 
>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>> <Johannes.Schindelin@gmx.de> wrote:
>>> 
>>> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>>> 
>>>> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
>>>>> ## Proposed solution
>>>>> Git LFS caches its objects under .git/lfs/objects. Most of the time
>>>>> Git LFS objects are already available in the cache (e.g. if you
>>>>> switch branches back and forth). I implemented these "cache hits"
>>>>> natively in Git.  Please note that this implementation is just a
>>>>> quick and dirty proof of concept. If the Git community agrees that
>>>>> this kind of approach would be acceptable then I will start to work
>>>>> on a proper patch series with cross platform support and unit
>>>>> tests.
>>>> 
>>>> Would it be possible to move all this code to a separate daemon?
>>>> Instead of spawning a new process to do the filtering, you send a
>>>> command "convert this" over maybe unix socket and either receive the
>>>> whole result over the socket, or receive a path of the result.
>>> 
>>> Unix sockets are not really portable...
>> 
>> It's the same situation as index-helper. I expect you guys will
>> replace the transport with named pipe or similar.
> 
> Yes, I will have to work on that. But I might need to ask for a change in
> the design if I hit some obstacle there: named pipes are not the same at
> all as Unix sockets.
> 
> Read: it will be painful, and not a general solution. So every new Unix
> socket that you introduce will introduce new problems for me.

Thanks Duy for your suggestion. I considered a daemon, but a daemon makes
it always harder for the user as the user needs to ensure the daemon is 
running! Plus, Dscho's concerns regarding Windows.

I think the core problem is that we invoke the filter for every file:
https://github.com/git/git/blob/master/convert.c#L461-L475

Couldn't we start the filter executable at the beginning of the Git process
and communicate with it via stdin/stdout whenever we hit the Git filter 
code? Would that work?

Alternatively, do you see a way to add a "plugin" system to Git? Where Git
could be configured to dynamically load a "filter" library?

@Dscho:
Do you have a recommendation for interprocess communication that works 
without trouble on Windows? 

Thanks,
Lars

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28 13:22   ` Lars Schneider
@ 2016-06-28 13:53     ` Christian Couder
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Couder @ 2016-06-28 13:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, Git Mailing List, Jeff King

On Tue, Jun 28, 2016 at 3:22 PM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>
> @Christian/Peff:
> Is there a place to look for more info about your remote-object-store idea?

You may want to take a look at:

https://github.com/chriscool/git/commits/external-odb

I just updated it and I may send an updated RFC series from this
branch to the list soon.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28 13:14       ` Johannes Schindelin
  2016-06-28 13:43         ` Lars Schneider
@ 2016-06-28 15:50         ` Duy Nguyen
  1 sibling, 0 replies; 11+ messages in thread
From: Duy Nguyen @ 2016-06-28 15:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Lars Schneider, Git Mailing List

On Tue, Jun 28, 2016 at 3:14 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Duy,
>
> On Tue, 28 Jun 2016, Duy Nguyen wrote:
>
>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>> <Johannes.Schindelin@gmx.de> wrote:
>> >
>> > On Mon, 27 Jun 2016, Duy Nguyen wrote:
>> >
>> >> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
>> >> > ## Proposed solution
>> >> > Git LFS caches its objects under .git/lfs/objects. Most of the time
>> >> > Git LFS objects are already available in the cache (e.g. if you
>> >> > switch branches back and forth). I implemented these "cache hits"
>> >> > natively in Git.  Please note that this implementation is just a
>> >> > quick and dirty proof of concept. If the Git community agrees that
>> >> > this kind of approach would be acceptable then I will start to work
>> >> > on a proper patch series with cross platform support and unit
>> >> > tests.
>> >>
>> >> Would it be possible to move all this code to a separate daemon?
>> >> Instead of spawning a new process to do the filtering, you send a
>> >> command "convert this" over maybe unix socket and either receive the
>> >> whole result over the socket, or receive a path of the result.
>> >
>> > Unix sockets are not really portable...
>>
>> It's the same situation as index-helper. I expect you guys will
>> replace the transport with named pipe or similar.
>
> Yes, I will have to work on that. But I might need to ask for a change in
> the design if I hit some obstacle there: named pipes are not the same at
> all as Unix sockets.
>
> Read: it will be painful, and not a general solution. So every new Unix
> socket that you introduce will introduce new problems for me.

I thought we could have a drop-in replacement (or maybe a higher
abstraction that would be sufficient for git). Thanks for pointing it
out.
-- 
Duy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Native access to Git LFS cache
  2016-06-28 13:43         ` Lars Schneider
@ 2016-06-28 16:00           ` Duy Nguyen
  0 siblings, 0 replies; 11+ messages in thread
From: Duy Nguyen @ 2016-06-28 16:00 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Johannes Schindelin, Git Mailing List, technoweenie

On Tue, Jun 28, 2016 at 3:43 PM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>
>> On 28 Jun 2016, at 15:14, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>>
>> Hi Duy,
>>
>> On Tue, 28 Jun 2016, Duy Nguyen wrote:
>>
>>> On Tue, Jun 28, 2016 at 11:40 AM, Johannes Schindelin
>>> <Johannes.Schindelin@gmx.de> wrote:
>>>>
>>>> On Mon, 27 Jun 2016, Duy Nguyen wrote:
>>>>
>>>>> On Mon, Jun 27, 2016 at 7:38 AM,  <larsxschneider@gmail.com> wrote:
>>>>>> ## Proposed solution
>>>>>> Git LFS caches its objects under .git/lfs/objects. Most of the time
>>>>>> Git LFS objects are already available in the cache (e.g. if you
>>>>>> switch branches back and forth). I implemented these "cache hits"
>>>>>> natively in Git.  Please note that this implementation is just a
>>>>>> quick and dirty proof of concept. If the Git community agrees that
>>>>>> this kind of approach would be acceptable then I will start to work
>>>>>> on a proper patch series with cross platform support and unit
>>>>>> tests.
>>>>>
>>>>> Would it be possible to move all this code to a separate daemon?
>>>>> Instead of spawning a new process to do the filtering, you send a
>>>>> command "convert this" over maybe unix socket and either receive the
>>>>> whole result over the socket, or receive a path of the result.
>>>>
>>>> Unix sockets are not really portable...
>>>
>>> It's the same situation as index-helper. I expect you guys will
>>> replace the transport with named pipe or similar.
>>
>> Yes, I will have to work on that. But I might need to ask for a change in
>> the design if I hit some obstacle there: named pipes are not the same at
>> all as Unix sockets.
>>
>> Read: it will be painful, and not a general solution. So every new Unix
>> socket that you introduce will introduce new problems for me.
>
> Thanks Duy for your suggestion. I considered a daemon, but a daemon makes
> it always harder for the user as the user needs to ensure the daemon is
> running! Plus, Dscho's concerns regarding Windows.
>
> I think the core problem is that we invoke the filter for every file:
> https://github.com/git/git/blob/master/convert.c#L461-L475
>
> Couldn't we start the filter executable at the beginning of the Git process
> and communicate with it via stdin/stdout whenever we hit the Git filter
> code? Would that work?

Yeah if one filter process per one git process still brings
significant perf. gain for you, why not, it's simpler than daemon.
Though you may want to look at Christian's external odb first in the
other mail. Note though that external odb may still spawn process a
lot (because the design is you cache objects locally once and you
don't have to spawn again). Whether that fits in lfs scheme, I have no
idea (I have never used git-lfs myself).

> Alternatively, do you see a way to add a "plugin" system to Git? Where Git
> could be configured to dynamically load a "filter" library?

I don't think plugins as .so files are welcome because it would force
us to freeze some ABI. So far all git extension has always been via an
external process.
-- 
Duy

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-06-28 16:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-27  5:38 [RFC] Native access to Git LFS cache larsxschneider
2016-06-27 15:53 ` Duy Nguyen
2016-06-28  9:40   ` Johannes Schindelin
2016-06-28 13:11     ` Duy Nguyen
2016-06-28 13:14       ` Johannes Schindelin
2016-06-28 13:43         ` Lars Schneider
2016-06-28 16:00           ` Duy Nguyen
2016-06-28 15:50         ` Duy Nguyen
2016-06-27 16:09 ` Junio C Hamano
2016-06-28 13:22   ` Lars Schneider
2016-06-28 13:53     ` Christian Couder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).