* git status OOM on mmap of large file
@ 2019-01-22 22:07 Joey Hess
  2019-01-24  0:39 ` brian m. carlson
  2019-01-24 12:10 ` Jeff King
  0 siblings, 2 replies; 12+ messages in thread
From: Joey Hess @ 2019-01-22 22:07 UTC (permalink / raw)
  To: git

joey@darkstar:~/tmp/t> ls -l big-file
-rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
joey@darkstar:~/tmp/t> git status
fatal: Out of memory, realloc failed

This file is checked into git, but using a smudge/clean filter, so the actual
data checked into git is a hash. I did so using git-annex v7 mode, but I
suppose git lfs would cause the same problem.

[pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

Why status needs to mmap a large file that is not modified
and that is configured to pass through smudge/clean, I don't know.
It seems like it should be possible for status to work in this situation.

-- 
see shy jo

* Re: git status OOM on mmap of large file
  2019-01-22 22:07 git status OOM on mmap of large file Joey Hess
@ 2019-01-24  0:39 ` brian m. carlson
  2019-01-24 12:14   ` Jeff King
  2019-01-24 12:10 ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2019-01-24  0:39 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:
> joey@darkstar:~/tmp/t> ls -l big-file
> -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> joey@darkstar:~/tmp/t> git status
> fatal: Out of memory, realloc failed
> 
> This file is checked into git, but using a smudge/clean filter, so the actual
> data checked into git is a hash. I did so using git-annex v7 mode, but I
> suppose git lfs would cause the same problem.
> 
> [pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> 
> Why status needs to mmap a large file that is not modified
> and that is configured to pass through smudge/clean, I don't know.

I believe that currently, Git stores the smudge/clean output in memory
until it writes it out. When using the persistent filter process, it's
possible for the process to choose to abort the operation, so we store
the data in memory until we get the status.

Theoretically, it should be possible for us to write this to a temporary
file, and if necessary, rename into place, although I'm not sure how
well that will work on Windows. File modes may also be tricky here.
Patches are of course welcome.
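
Roughly the idea, as a POSIX sketch (a hypothetical helper, not actual
Git code; the Windows and file-mode questions above are exactly what it
glosses over):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/stat.h>

  /*
   * Spool the filter output into a temporary file next to the
   * destination, set the mode, then rename() it into place, so the
   * full smudged content never has to sit in memory.  A short write
   * is treated as an error for simplicity.
   */
  static int spool_filter_output(int filter_fd, const char *dst, mode_t mode)
  {
          char tmp[4096];
          char buf[8192];
          ssize_t n;
          int fd, ret = -1;

          /* real code would want a unique name, e.g. via mkstemp() */
          snprintf(tmp, sizeof(tmp), "%s.filter-tmp", dst);
          fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
          if (fd < 0)
                  return -1;

          while ((n = read(filter_fd, buf, sizeof(buf))) > 0)
                  if (write(fd, buf, n) != n)
                          goto out;
          if (n < 0 || fchmod(fd, mode) < 0)
                  goto out;
          ret = 0;
  out:
          if (close(fd) < 0)
                  ret = -1;
          if (!ret && rename(tmp, dst) < 0)
                  ret = -1;
          if (ret)
                  unlink(tmp);
          return ret;
  }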
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: git status OOM on mmap of large file
  2019-01-22 22:07 git status OOM on mmap of large file Joey Hess
  2019-01-24  0:39 ` brian m. carlson
@ 2019-01-24 12:10 ` Jeff King
  2019-01-24 12:54   ` Duy Nguyen
  2019-01-24 18:38   ` Joey Hess
  1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2019-01-24 12:10 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:

> joey@darkstar:~/tmp/t> ls -l big-file
> -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> joey@darkstar:~/tmp/t> git status
> fatal: Out of memory, realloc failed
> 
> This file is checked into git, but using a smudge/clean filter, so the actual
> data checked into git is a hash. I did so using git-annex v7 mode, but I
> suppose git lfs would cause the same problem.
> 
> [pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> 
> Why status needs to mmap a large file that is not modified
> and that is configured to pass through smudge/clean, I don't know.
> It seems like it should be possible for status to work in this situation.

One minor point: I don't think this is us mmap-ing the file. The
descriptor is -1, and Git never uses PROT_WRITE. This is likely your
libc using mmap to fulfill a malloc() request.

That said, it just turns the question into: why did Git try to malloc
that many bytes?  If I reproduce your example (using a 100MB file) and
set GIT_ALLOC_LIMIT in the environment, the backtrace to die() is:

  #1  0x0000555555786d65 in memory_limit_check (size=104857601, gentle=0) at wrapper.c:27
  #2  0x0000555555787084 in xrealloc (ptr=0x0, size=104857601) at wrapper.c:137
  #3  0x000055555575612e in strbuf_grow (sb=0x7fffffffdbf0, extra=104857600) at strbuf.c:98
  #4  0x000055555575731a in strbuf_read (sb=0x7fffffffdbf0, fd=4, hint=104857600) at strbuf.c:429
  #5  0x0000555555664a1f in apply_single_file_filter (path=0x5555558c787c "foo.rand", ...)
  #6  0x0000555555665321 in apply_filter (path=0x5555558c787c "foo.rand", ...)
  ...

Looking at apply_single_file_filter(), it's not the _original_ file that
it's trying to store, but rather the data coming back from the filter.
It's just that we use the original file size as a hint!

In this case (and I'd venture to say in most gigantic-file cases) it's
much larger than we need, to the point of causing a problem.

In other words, I think this patch fixes your problem:

diff --git a/convert.c b/convert.c
index 0d89ae7c23..85aebe2ed3 100644
--- a/convert.c
+++ b/convert.c
@@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
 	if (start_async(&async))
 		return 0;	/* error was already reported */
 
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
+	if (strbuf_read(&nbuf, async.out, 0) < 0) {
 		err = error(_("read from external filter '%s' failed"), cmd);
 	}
 	if (close(async.out)) {

though possibly we should actually continue to use the file size as a
hint up to a certain point, which avoids reallocations for more "normal"
filters where the input and output sizes are in the same ballpark.

Just off the top of my head, something like:

  /* guess that the filtered output will be the same size as the original */
  hint = len;

  /* allocate 10% extra in case the clean size is slightly larger */
  hint *= 1.1;

  /*
   * in any case, never go higher than half of core.bigfileThreshold.
   * We'd like to avoid allocating more bytes than that, and that still
   * gives us room for our strbuf to preemptively double if our guess is
   * just a little on the low side.
   */
  if (hint > big_file_threshold / 2)
	hint = big_file_threshold / 2;

But to be honest, I have no idea if that would even produce measurable
benefits over simply growing the strbuf from scratch (i.e., hint==0).
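
For what it's worth, folded into a compilable helper that heuristic might
look something like this (just a sketch; the name and the integer-math
version of the 10% padding are made up):

  static size_t filter_read_hint(size_t len, unsigned long big_file_threshold)
  {
          /* guess the filtered output is about the size of the original */
          size_t hint = len;

          /* pad ~10% in case the clean output is slightly larger */
          hint += hint / 10;

          /* but never preallocate more than half of core.bigFileThreshold */
          if (hint > big_file_threshold / 2)
                  hint = big_file_threshold / 2;

          return hint;
  }

which would then be passed as the hint argument of strbuf_read() in
apply_single_file_filter().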

-Peff

* Re: git status OOM on mmap of large file
  2019-01-24  0:39 ` brian m. carlson
@ 2019-01-24 12:14   ` Jeff King
  2019-01-24 17:05     ` Joey Hess
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2019-01-24 12:14 UTC (permalink / raw)
  To: brian m. carlson, Joey Hess, git

On Thu, Jan 24, 2019 at 12:39:49AM +0000, brian m. carlson wrote:

> > [pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> > 
> > Why status needs to mmap a large file that is not modified
> > and that is configured to pass through smudge/clean, I don't know.
> 
> I believe that currently, Git stores the smudge/clean output in memory
> until it writes it out. When using the persistent filter process, it's
> possible for the process to choose to abort the operation, so we store
> the data in memory until we get the status.

For the clean step, that should be OK, since that filter output is tiny
(but see my other message for a silly heuristic and an easy fix). And
that should be all that "git status" needs. But...

> Theoretically, it should be possible for us to write this to a temporary
> file, and if necessary, rename into place, although I'm not sure how
> well that will work on Windows. File modes may also be tricky here.
> Patches are of course welcome.

I didn't experiment with the smudge side, but I think it uses the same
apply_filter() code. Which means that yes, it would try to store the
11GB in memory before writing it out. And I agree writing it out to a
file and moving it directly into place is the sanest option there. If
that doesn't work, spooling to a tempfile and then streaming it into
place would also work.

-Peff

* Re: git status OOM on mmap of large file
  2019-01-24 12:10 ` Jeff King
@ 2019-01-24 12:54   ` Duy Nguyen
  2019-01-24 18:38   ` Joey Hess
  1 sibling, 0 replies; 12+ messages in thread
From: Duy Nguyen @ 2019-01-24 12:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Joey Hess, git

On Thu, Jan 24, 2019 at 7:11 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:
>
> > joey@darkstar:~/tmp/t> ls -l big-file
> > -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> > joey@darkstar:~/tmp/t> git status
> > fatal: Out of memory, realloc failed
> >
> > This file is checked into git, but using a smudge/clean filter, so the actual
> > data checked into git is a hash. I did so using git-annex v7 mode, but I
> > suppose git lfs would cause the same problem.
> >
> > [pid  6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> >
> > Why status needs to mmap a large file that is not modified
> > and that is configured to pass through smudge/clean, I don't know.
> > It seems like it should be possible for status to work in this situation.
>
> One minor point: I don't think this is us mmap-ing the file. The
> descriptor is -1, and Git never uses PROT_WRITE. This is likely your
> libc using mmap to fulfill a malloc() request.
>
> That said, it just turns the question into: why did Git try to malloc
> that many bytes?  If I reproduce your example (using a 100MB file) and
> set GIT_ALLOC_LIMIT in the environment, the backtrace to die() is:
>
>   #1  0x0000555555786d65 in memory_limit_check (size=104857601, gentle=0) at wrapper.c:27
>   #2  0x0000555555787084 in xrealloc (ptr=0x0, size=104857601) at wrapper.c:137
>   #3  0x000055555575612e in strbuf_grow (sb=0x7fffffffdbf0, extra=104857600) at strbuf.c:98
>   #4  0x000055555575731a in strbuf_read (sb=0x7fffffffdbf0, fd=4, hint=104857600) at strbuf.c:429
>   #5  0x0000555555664a1f in apply_single_file_filter (path=0x5555558c787c "foo.rand", ...)
>   #6  0x0000555555665321 in apply_filter (path=0x5555558c787c "foo.rand", ...)
>   ...
>
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!

Really cool! I guessed as far as malloc() but did not actually test
it, let alone examine the problem closely like this.
-- 
Duy

* Re: git status OOM on mmap of large file
  2019-01-24 12:14   ` Jeff King
@ 2019-01-24 17:05     ` Joey Hess
  0 siblings, 0 replies; 12+ messages in thread
From: Joey Hess @ 2019-01-24 17:05 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git

Jeff King wrote:
> I didn't experiment with the smudge side, but I think it uses the same
> apply_filter() code. Which means that yes, it would try to store the
> 11GB in memory before writing it out. And I agree writing it out to a
> file and moving it directly into place is the sanest option there. If
> that doesn't work, spooling to a tempfile and then streaming it into
> place would also work.

I have seen that buffering on the smudge side, yes. In git-annex 
I happen to use the smudge filter in an unusual way that avoids
that being a problem, but I think it would affect git-lfs.

-- 
see shy jo

* Re: git status OOM on mmap of large file
  2019-01-24 12:10 ` Jeff King
  2019-01-24 12:54   ` Duy Nguyen
@ 2019-01-24 18:38   ` Joey Hess
  2019-01-24 19:18     ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Joey Hess @ 2019-01-24 18:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!

Thanks much for working that out!

> In other words, I think this patch fixes your problem:
> 
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
>  
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {

Yes, confirmed that does fix it.

> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
> 
> Just off the top of my head, something like:
> 
>   /* guess that the filtered output will be the same size as the original */
>   hint = len;
> 
>   /* allocate 10% extra in case the clean size is slightly larger */
>   hint *= 1.1;
> 
>   /*
>    * in any case, never go higher than half of core.bigfileThreshold.
>    * We'd like to avoid allocating more bytes than that, and that still
>    * gives us room for our strbuf to preemptively double if our guess is
>    * just a little on the low side.
>    */
>   if (hint > big_file_threshold / 2)
> 	hint = big_file_threshold / 2;
> 
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).

Half of 512 MB is still quite a lot of memory to default to using in
this situation. E.g. smaller VPSes still often only have a GB or two of RAM.

When the clean filter is being used in a way that doesn't involve hashes
of large files, it will mostly be operating on typically sized source
code files. So capping the maximum hint size around the size of a typical
source code file would be plenty for both common uses of the clean filter.

I did some benchmarking, using cat as the clean filter:

git status 32 kb file, hint == len
   time                 3.865 ms   (3.829 ms .. 3.943 ms)
                        0.994 R²   (0.987 R² .. 0.999 R²)
   mean                 3.934 ms   (3.889 ms .. 4.021 ms)
   std dev              191.8 μs   (106.8 μs .. 291.8 μs)
git status 32 kb file, hint == 0
   time                 3.887 ms   (3.751 ms .. 4.064 ms)
                        0.992 R²   (0.986 R² .. 0.998 R²)
   mean                 4.002 ms   (3.931 ms .. 4.138 ms)
   std dev              292.2 μs   (189.0 μs .. 498.3 μs)
git status 1 mb file, hint == len
   time                 3.942 ms   (3.816 ms .. 4.078 ms)
                        0.995 R²   (0.991 R² .. 0.999 R²)
   mean                 3.969 ms   (3.916 ms .. 4.054 ms)
   std dev              220.1 μs   (155.1 μs .. 304.3 μs)
git status 1 mb file, hint == 0
   time                 3.869 ms   (3.836 ms .. 3.911 ms)
                        0.998 R²   (0.995 R² .. 1.000 R²)
   mean                 3.895 ms   (3.868 ms .. 3.947 ms)
   std dev              112.3 μs   (47.93 μs .. 182.7 μs)
git status 1 gb file, hint == len
   time                 7.173 s    (6.834 s .. 7.564 s)
                        0.999 R²   (0.999 R² .. 1.000 R²)
   mean                 7.560 s    (7.369 s .. 7.903 s)
   std dev              333.2 ms   (27.65 ms .. 412.2 ms)
git status 1 gb file, hint == 0
   time                 7.652 s    (6.307 s .. 8.263 s)
                        0.996 R²   (0.992 R² .. 1.000 R²)
   mean                 8.082 s    (7.843 s .. 8.202 s)
   std dev              232.3 ms   (2.362 ms .. 277.1 ms)

From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the
smudge/clean filters have to be used for actual content filtering
(not for hash generation purposes as git-annex and git-lfs use it).
An unusual edge case I think. So hint == 0 seems fine.

-- 
see shy jo

* Re: git status OOM on mmap of large file
  2019-01-24 18:38   ` Joey Hess
@ 2019-01-24 19:18     ` Jeff King
  2019-01-24 19:28       ` Jeff King
  2019-01-24 20:36       ` [PATCH] avoid unnecessary malloc of whole file size Joey Hess
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2019-01-24 19:18 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

On Thu, Jan 24, 2019 at 02:38:10PM -0400, Joey Hess wrote:

> > Just off the top of my head, something like:
> > 
> >   /* guess that the filtered output will be the same size as the original */
> >   hint = len;
> > 
> >   /* allocate 10% extra in case the clean size is slightly larger */
> >   hint *= 1.1;
> > 
> >   /*
> >    * in any case, never go higher than half of core.bigfileThreshold.
> >    * We'd like to avoid allocating more bytes than that, and that still
> >    * gives us room for our strbuf to preemptively double if our guess is
> >    * just a little on the low side.
> >    */
> >   if (hint > big_file_threshold / 2)
> > 	hint = big_file_threshold / 2;
> > 
> > But to be honest, I have no idea if that would even produce measurable
> > benefits over simply growing the strbuf from scratch (i.e., hint==0).
> 
> Half of 512 MB is still quite a lot of memory to default to using in
> this situation. E.g. smaller VPSes still often only have a GB or two of RAM.

I think you'd want to drop core.bigFileThreshold on such a server, just
because Git will happily keep 2*(bigFileThreshold-1) in memory to do a
diff. But that nit aside...

> I did some benchmarking, using cat as the clean filter:
> [...]
> From this, it looks like the file has to be quite large before the
> preallocation makes a sizable improvement to runtime, and the
> smudge/clean filters have to be used for actual content filtering
> (not for hash generation purposes as git-annex and git-lfs use it).
> An unusual edge case I think. So hint == 0 seems fine.

Thanks for these timings! I agree that "hint == 0" is probably
reasonable, then.

I suppose there's no reason not to proceed with a patch around this.
For most cases it's really only half the solution (since smudging is
going to run into the same problem). But fixing that is quite a bit more
involved, and the change itself will be largely orthogonal.

-Peff

* Re: git status OOM on mmap of large file
  2019-01-24 19:18     ` Jeff King
@ 2019-01-24 19:28       ` Jeff King
  2019-01-24 20:36       ` [PATCH] avoid unnecessary malloc of whole file size Joey Hess
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2019-01-24 19:28 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

On Thu, Jan 24, 2019 at 02:18:36PM -0500, Jeff King wrote:

> > I did some benchmarking, using cat as the clean filter:
> > [...]
> > From this, it looks like the file has to be quite large before the
> > preallocation makes a sizable improvement to runtime, and the
> > smudge/clean filters have to be used for actual content filtering
> > (not for hash generation purposes as git-annex and git-lfs use it).
> > An unusual edge case I think. So hint == 0 seems fine.
> 
> Thanks for these timings! I agree that "hint == 0" is probably
> reasonable, then.

One other minor point to consider: on some systems over-allocating
actually isn't that expensive, because pages aren't actually allocated
until we write to them, and malloc() is perfectly happy to overcommit
memory.  Your case would only run into problems on Linux when malloc()
actually refuses the allocation (so limiting ourselves to "too large but
still reasonable" is a valid strategy there).

But I doubt that's something we should be relying on in general. There
are many systems that don't overcommit.
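
A standalone demo of that behavior (nothing to do with Git's code; on a
typical overcommitting Linux box the malloc() below may well succeed, and
it's only touching the pages that can get the process killed, while a
system without overcommit refuses the allocation up front):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
          size_t size = (size_t)8 << 30;  /* ask for 8 GB (assumes 64-bit) */
          char *p = malloc(size);

          if (!p) {
                  puts("malloc refused the allocation");
                  return 1;
          }
          puts("malloc succeeded; pages not committed yet");
          memset(p, 0, size);     /* writing is what actually allocates the pages */
          puts("all pages touched");
          free(p);
          return 0;
  }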

-Peff

* [PATCH] avoid unnecessary malloc of whole file size
  2019-01-24 19:18     ` Jeff King
  2019-01-24 19:28       ` Jeff King
@ 2019-01-24 20:36       ` Joey Hess
  2019-01-24 21:12         ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Joey Hess @ 2019-01-24 20:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git

When a worktree file is larger than the available memory, and a clean
filter is in use, this avoids mallocing a buffer the whole size of the
file when reading from the clean filter, which caused commands like git
status and git commit to OOM.

Often in this situation the clean filter will produce a short identifier
for the file, so such a large buffer is not needed.

When the clean filter does output something around the same size as the
worktree file, the buffer will need to be reallocated until it fits,
starting at 8192 and doubling in size. Benchmarking indicates that
reallocation is not a significant overhead for outputs up to a
few MB in size.

Signed-off-by: Joey Hess <id@joeyh.name>
---
 convert.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index 0d89ae7c23..85aebe2ed3 100644
--- a/convert.c
+++ b/convert.c
@@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
 	if (start_async(&async))
 		return 0;	/* error was already reported */
 
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
+	if (strbuf_read(&nbuf, async.out, 0) < 0) {
 		err = error(_("read from external filter '%s' failed"), cmd);
 	}
 	if (close(async.out)) {
-- 
2.20.1


* Re: [PATCH] avoid unnecessary malloc of whole file size
  2019-01-24 20:36       ` [PATCH] avoid unnecessary malloc of whole file size Joey Hess
@ 2019-01-24 21:12         ` Junio C Hamano
  2019-01-24 21:18           ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2019-01-24 21:12 UTC (permalink / raw)
  To: Joey Hess; +Cc: Jeff King, git

Joey Hess <id@joeyh.name> writes:

> When a worktree file is larger than the available memory, and a clean
> filter is in use, this avoids mallocing a buffer the whole size of the
> file when reading from the clean filter, which caused commands like git
> status and git commit to OOM.
>
> Often in this situation the clean filter will produce a short identifier
> for the file, so such a large buffer is not needed.
>
> When the clean filter does output something around the same size as the
> worktree file, the buffer will need to be reallocated until it fits,
> starting at 8192 and doubling in size. Benchmarking indicates that
> reallocation is not a significant overhead for outputs up to a
> few MB in size.

Problem description first, then solution.  "... this avoids ..." is
already talking about the solution while leaving the readers to guess
what the problem is.

    When a worktree file is ... filter is in use, we allocate a
    buffer for the whole size of the file when reading from the
    clean filter.  This can force us to overallocate if the clean
    filter is used to radically shrink a huge file and replace it
    with a small token (e.g. git-annex or git-lfs) and lead to OOM
    in the worst case.  Reading from the filter and growing the
    buffer as we go would avoid such an unnecessary OOM.

    When the clean filter does output ...
    ... few MB in size.

perhaps.

> Signed-off-by: Joey Hess <id@joeyh.name>
> ---
>  convert.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
>  
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {

* Re: [PATCH] avoid unnecessary malloc of whole file size
  2019-01-24 21:12         ` Junio C Hamano
@ 2019-01-24 21:18           ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2019-01-24 21:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Joey Hess, git

On Thu, Jan 24, 2019 at 01:12:15PM -0800, Junio C Hamano wrote:

> Joey Hess <id@joeyh.name> writes:
> 
> > When a worktree file is larger than the available memory, and a clean
> > filter is in use, this avoids mallocing a buffer the whole size of the
> > file when reading from the clean filter, which caused commands like git
> > status and git commit to OOM.
> >
> > Often in this situation the clean filter will produce a short identifier
> > for the file, so such a large buffer is not needed.
> >
> > When the clean filter does output something around the same size as the
> > worktree file, the buffer will need to be reallocated until it fits,
> > starting at 8192 and doubling in size. Benchmarking indicates that
> > reallocation is not a significant overhead for outputs up to a
> > few MB in size.
> 
> Problem description first, then solution.  "... this avoids ..." is
> already talking about the solution while leaving the readers to guess
> what the problem is.
> 
>     When a worktree file is ... filter is in use, we allocate a
>     buffer for the whole size of the file when reading from the
>     clean filter.  This can force us to overallocate if the clean
>     filter is used to radically shrink a huge file and replace it
>     with a small token (e.g. git-annex or git-lfs) and lead to OOM
>     in the worst case.  Reading from the filter and growing the
>     buffer as we go would avoid such an unnecessary OOM.
> 
>     When the clean filter does output ...
>     ... few MB in size.
> 
> perhaps.

Yeah, I agree that organization is nicer. Other than that, the patch
looks good to me.

-Peff
