git@vger.kernel.org mailing list mirror (one of many)
* Cygwin can't handle huge packfiles?
@ 2006-04-03  9:46 Kees-Jan Dijkzeul
  2006-04-03 13:23 ` Johannes Schindelin
  2006-04-03 14:38 ` Alex Riesen
  0 siblings, 2 replies; 21+ messages in thread
From: Kees-Jan Dijkzeul @ 2006-04-03  9:46 UTC (permalink / raw)
  To: git

Hi,

I'm trying to get Git to manage a 5Gb source tree. Under linux, this
works like a charm. Under cygwin, however, I run into difficulties.
For example:

$ git-clone sgp-wa/ sgp-wa.clone
fatal: packfile
./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack
cannot be mapped.
fatal: unexpected EOF
fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.

To figure out what is happening, I printed the value of errno, which
turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is
implemented in cygwin, but if they allocate memory and load the file
into it, then this error is not surprising, as the pack file in
question is 1.5Gb in size.
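
For reference, the errno check described above can be sketched roughly as
follows. This is only an illustration, not code from git:
map_file_readonly() is a made-up helper name and the message wording is
arbitrary. It maps a whole file read-only and reports errno when open() or
mmap() fails, which is where an ENOMEM (12) like the one above would show up.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of git): map a whole file read-only.
 * Returns the mapping and stores its length in *len_out, or returns
 * NULL after printing the errno value that caused the failure.
 */
void *map_file_readonly(const char *path, size_t *len_out)
{
	struct stat st;
	void *map;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		fprintf(stderr, "open %s: %s (errno %d)\n",
			path, strerror(errno), errno);
		return NULL;
	}
	/* mmap() rejects zero-length mappings, so treat empty files as errors */
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return NULL;
	}
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		/* ENOMEM here means the mapping could not be placed or backed */
		fprintf(stderr, "mmap %s (%lld bytes): %s (errno %d)\n",
			path, (long long)st.st_size, strerror(errno), errno);
		close(fd);
		return NULL;
	}
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	*len_out = (size_t)st.st_size;
	return map;
}
```

The caller releases the mapping with munmap(ptr, len).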

I'm not sure how to approach this problem. Any tips would be greatly
appreciated.

Thanks a lot!

Kees-Jan


* Re: Cygwin can't handle huge packfiles?
  2006-04-03  9:46 Cygwin can't handle huge packfiles? Kees-Jan Dijkzeul
@ 2006-04-03 13:23 ` Johannes Schindelin
  2006-04-03 14:26   ` Morten Welinder
  2006-04-03 14:33   ` Linus Torvalds
  2006-04-03 14:38 ` Alex Riesen
  1 sibling, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-04-03 13:23 UTC (permalink / raw)
  To: Kees-Jan Dijkzeul; +Cc: git

Hi,

On Mon, 3 Apr 2006, Kees-Jan Dijkzeul wrote:

> I'm trying to get Git to manage a 5Gb source tree. Under linux, this
> works like a charm. Under cygwin, however, I run into difficulties.
> For example:
> 
> $ git-clone sgp-wa/ sgp-wa.clone
> fatal: packfile
> ./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack
> cannot be mapped.
> fatal: unexpected EOF
> fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.
> 
> To figure out what is happening, I printed the value of errno, which
> turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is
> implemented in cygwin, but if they allocate memory and load the file
> into it, then this error is not surprising, as the pack file in
> question is 1.5Gb in size.

The problem is not mmap() on cygwin, but that a fork() has to jump through 
hoops to reinstall the open file descriptors on cygwin. If the 
corresponding file was deleted, that fails. Therefore, we work around that 
on cygwin by actually reading the file into memory, *not* mmap()ing it.
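
A minimal sketch of the read-into-memory workaround, for illustration only.
This is not the actual gitfakemmap() code; fake_mmap() and fake_munmap() are
made-up names, and the zero-fill past EOF is an assumption chosen to mimic
what a real mapping shows for the tail of its last page. The point it makes is
that the whole requested length is allocated and read up front, so the cost is
proportional to the file size, not to the part that is actually used.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a private read-only mmap(): allocate a
 * buffer and fill it from the file.  Returns NULL on failure.
 */
void *fake_mmap(size_t length, int fd, off_t offset)
{
	size_t done = 0;
	char *buf = malloc(length ? length : 1);

	if (!buf) {
		errno = ENOMEM;
		return NULL;
	}
	while (done < length) {
		ssize_t n = pread(fd, buf + done, length - done,
				  offset + (off_t)done);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			free(buf);
			return NULL;
		}
		if (n == 0) {
			/* hit EOF early: zero-fill the rest of the buffer */
			memset(buf + done, 0, length - done);
			break;
		}
		done += (size_t)n;
	}
	return buf;
}

/* Counterpart of munmap() for buffers returned by fake_mmap(). */
void fake_munmap(void *start)
{
	free(start);
}
```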

Hth,
Dscho


* Re: Cygwin can't handle huge packfiles?
  2006-04-03 13:23 ` Johannes Schindelin
@ 2006-04-03 14:26   ` Morten Welinder
  2006-04-03 14:33   ` Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Morten Welinder @ 2006-04-03 14:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git

> The problem is not mmap() on cygwin, but that a fork() has to jump through
> hoops to reinstall the open file descriptors on cygwin. If the
> corresponding file was deleted, that fails. Therefore, we work around that
> on cygwin by actually reading the file into memory, *not* mmap()ing it.

Maybe, but you aren't going to be able to handle much bigger packs
even on *nix.  Unless you go 64-bit, that is.

M.


* Re: Cygwin can't handle huge packfiles?
  2006-04-03 13:23 ` Johannes Schindelin
  2006-04-03 14:26   ` Morten Welinder
@ 2006-04-03 14:33   ` Linus Torvalds
  2006-04-03 14:36     ` Linus Torvalds
  2006-04-03 15:12     ` Johannes Schindelin
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-04-03 14:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git



On Mon, 3 Apr 2006, Johannes Schindelin wrote:
> 
> The problem is not mmap() on cygwin, but that a fork() has to jump through 
> hoops to reinstall the open file descriptors on cygwin. If the 
> corresponding file was deleted, that fails. Therefore, we work around that 
> on cygwin by actually reading the file into memory, *not* mmap()ing it.

Well, we could actually do a _real_ mmap on pack-files. The pack-files are 
much better mmap'ed - there we don't _want_ them to be removed while we're 
using them. It was the index file etc that was problematic.

Maybe the cygwin fake mmap should be triggered only for the index (and 
possibly the individual objects - if only because there doing a 
malloc+read may actually be faster).

Using malloc+read on pack-files is pretty wasteful, since we usually only 
use a very small part of them (ie if we have a 1.5GB pack-file, it's sad 
to read all of it, when we'd usually actually access just a small 
fraction of it).

That said, I think git _does_ have problems with large pack-files. We have 
some 32-bit issues etc, and just virtual address space things. So for now, 
it's probably best to limit pack-files to the few-hundred-meg size, and 
create several smaller ones rather than one huge one.

		Linus


* Re: Cygwin can't handle huge packfiles?
  2006-04-03 14:33   ` Linus Torvalds
@ 2006-04-03 14:36     ` Linus Torvalds
  2006-04-05 13:24       ` Kees-Jan Dijkzeul
  2006-04-07  8:15       ` Junio C Hamano
  2006-04-03 15:12     ` Johannes Schindelin
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2006-04-03 14:36 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kees-Jan Dijkzeul, git



On Mon, 3 Apr 2006, Linus Torvalds wrote:
> 
> That said, I think git _does_ have problems with large pack-files. We have 
> some 32-bit issues etc

I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
the packfile data structure does. The index has 32-bit offsets into 
individual pack-files. 

That's not hugely fundamental, but I didn't expect people to hit it this 
quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's 
fifteen years of history (so that we'll have another fifteen years before 
we'll have to worry about 4GB pack-files ;)

			Linus


* Re: Cygwin can't handle huge packfiles?
  2006-04-03  9:46 Cygwin can't handle huge packfiles? Kees-Jan Dijkzeul
  2006-04-03 13:23 ` Johannes Schindelin
@ 2006-04-03 14:38 ` Alex Riesen
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Riesen @ 2006-04-03 14:38 UTC (permalink / raw)
  To: Kees-Jan Dijkzeul; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1189 bytes --]

On 4/3/06, Kees-Jan Dijkzeul <k.j.dijkzeul@gmail.com> wrote:
> I'm trying to get Git to manage a 5Gb source tree. Under linux, this
> works like a charm. Under cygwin, however, I run into difficulties.
> For example:
>
> $ git-clone sgp-wa/ sgp-wa.clone
> fatal: packfile
> ./objects/pack/pack-56aa013a0234e198467ed37ae5db925764a6ee98.pack
> cannot be mapped.
> fatal: unexpected EOF
> fetch-pack from '/cygdrive/e/Projects/sgp-wa/.git' failed.
>
> To figure out what is happening, I printed the value of errno, which
> turns out to be 12 (Cannot allocate memory). I'm not sure how mmap is

mmap in git on cygwin does not mmap anything,
but just reads the whole file into memory.

> I'm not sure how to approach this problem. Any tips would be greatly
> appreciated.

I ended up hacking gitfakemmap like in the attached patches (sorry for mime).
It's a very ugly and unsafe hack, which is exactly why it was never
submitted. Still, it helps me (it speeds up rev-list, for instance), and
maybe it'll help you.
It is a really good example of what stupid windows restrictions can do to
a program.

The patch is against git as of 3-Apr-2006, ~10 CET

[-- Attachment #2: cygmmap.patch --]
[-- Type: text/x-patch, Size: 5710 bytes --]

diff --git a/Makefile b/Makefile
index c79d646..8a46436
--- a/Makefile
+++ b/Makefile
@@ -389,7 +389,7 @@ ifdef NO_SETENV
 endif
 ifdef NO_MMAP
 	COMPAT_CFLAGS += -DNO_MMAP
-	COMPAT_OBJS += compat/mmap.o
+	COMPAT_OBJS += compat/mmap.o compat/realmmap.o
 endif
 ifdef NO_IPV6
 	ALL_CFLAGS += -DNO_IPV6
diff --git a/compat/realmmap.c b/compat/realmmap.c
new file mode 100644
index 0000000..8f26641
--- /dev/null
+++ b/compat/realmmap.c
@@ -0,0 +1,26 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/mman.h>
+#include "../git-compat-util.h"
+
+#undef mmap
+#undef munmap
+
+void *realmmap(void *start, size_t length, int prot , int flags, int fd, off_t offset)
+{
+	if (start != NULL || !(flags & MAP_PRIVATE)) {
+		errno = ENOTSUP;
+		return MAP_FAILED;
+	}
+	start = mmap(start, length, prot, flags, fd, offset);
+	return start;
+}
+
+int realmunmap(void *start, size_t length)
+{
+	return munmap(start, length);
+}
+
+
diff --git a/diff.c b/diff.c
index e496905..f1a2cf0 100644
--- a/diff.c
+++ b/diff.c
@@ -450,7 +450,7 @@ int diff_populate_filespec(struct diff_f
 		fd = open(s->path, O_RDONLY);
 		if (fd < 0)
 			goto err_empty;
-		s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
+		s->data = realmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		if (s->data == MAP_FAILED)
 			goto err_empty;
@@ -482,7 +482,7 @@ void diff_free_filespec_data(struct diff
 	if (s->should_free)
 		free(s->data);
 	else if (s->should_munmap)
-		munmap(s->data, s->size);
+		realmunmap(s->data, s->size);
 	s->should_free = s->should_munmap = 0;
 	s->data = NULL;
 	free(s->cnt_data);
diff --git a/git-compat-util.h b/git-compat-util.h
index 5d543d2..85150f8 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -42,22 +42,28 @@ extern int error(const char *err, ...) _
 
 #ifdef NO_MMAP
 
-#ifndef PROT_READ
+#include <sys/mman.h>
+/*#ifndef PROT_READ
 #define PROT_READ 1
 #define PROT_WRITE 2
 #define MAP_PRIVATE 1
 #define MAP_FAILED ((void*)-1)
-#endif
+#endif*/
 
 #define mmap gitfakemmap
 #define munmap gitfakemunmap
 extern void *gitfakemmap(void *start, size_t length, int prot , int flags, int fd, off_t offset);
 extern int gitfakemunmap(void *start, size_t length);
 
+extern void *realmmap(void *start, size_t length, int prot , int flags, int fd, off_t offset);
+extern int realmunmap(void *start, size_t length);
+
 #else /* NO_MMAP */
 
 #include <sys/mman.h>
 
+#define realmmap mmap
+#define realmunmap munmap
 #endif /* NO_MMAP */
 
 #ifdef NO_SETENV
diff --git a/sha1_file.c b/sha1_file.c
index 58edec0..712a068 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -330,14 +330,14 @@ void prepare_alt_odb(void)
 		close(fd);
 		return;
 	}
-	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	map = realmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (map == MAP_FAILED)
 		return;
 
 	link_alt_odb_entries(map, map + st.st_size, '\n',
 			     get_object_directory());
-	munmap(map, st.st_size);
+	realmunmap(map, st.st_size);
 }
 
 static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
@@ -378,7 +378,7 @@ static int check_packed_git_idx(const ch
 		return -1;
 	}
 	idx_size = st.st_size;
-	idx_map = mmap(NULL, idx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	idx_map = realmmap(NULL, idx_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (idx_map == MAP_FAILED)
 		return -1;
@@ -423,7 +423,7 @@ static int unuse_one_packed_git(void)
 	}
 	if (!lru)
 		return 0;
-	munmap(lru->pack_base, lru->pack_size);
+	realmunmap(lru->pack_base, lru->pack_size);
 	lru->pack_base = NULL;
 	return 1;
 }
@@ -460,7 +460,7 @@ int use_packed_git(struct packed_git *p)
 		}
 		if (st.st_size != p->pack_size)
 			die("packfile %s size mismatch.", p->pack_name);
-		map = mmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		map = realmmap(NULL, p->pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		if (map == MAP_FAILED)
 			die("packfile %s cannot be mapped.", p->pack_name);
@@ -494,7 +494,7 @@ struct packed_git *add_packed_git(char *
 	/* do we have a corresponding .pack file? */
 	strcpy(path + path_len - 4, ".pack");
 	if (stat(path, &st) || !S_ISREG(st.st_mode)) {
-		munmap(idx_map, idx_size);
+		realmunmap(idx_map, idx_size);
 		return NULL;
 	}
 	/* ok, it looks sane as far as we can check without
@@ -647,7 +647,7 @@ static void *map_sha1_file_internal(cons
 		 */
 		sha1_file_open_flag = 0;
 	}
-	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	map = realmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (map == MAP_FAILED)
 		return NULL;
@@ -1184,7 +1184,7 @@ int sha1_object_info(const unsigned char
 			*sizep = size;
 	}
 	inflateEnd(&stream);
-	munmap(map, mapsize);
+	realmunmap(map, mapsize);
 	return status;
 }
 
@@ -1210,7 +1210,7 @@ void * read_sha1_file(const unsigned cha
 	map = map_sha1_file_internal(sha1, &mapsize);
 	if (map) {
 		buf = unpack_sha1_file(map, mapsize, type, size);
-		munmap(map, mapsize);
+		realmunmap(map, mapsize);
 		return buf;
 	}
 	return NULL;
@@ -1493,7 +1493,7 @@ int write_sha1_to_fd(int fd, const unsig
 	} while (posn < objsize);
 
 	if (map)
-		munmap(map, objsize);
+		realmunmap(map, objsize);
 	if (temp_obj)
 		free(temp_obj);
 
@@ -1646,7 +1646,7 @@ int index_fd(unsigned char *sha1, int fd
 
 	buf = "";
 	if (size)
-		buf = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		buf = realmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (buf == MAP_FAILED)
 		return -1;
@@ -1660,7 +1660,7 @@ int index_fd(unsigned char *sha1, int fd
 		ret = 0;
 	}
 	if (size)
-		munmap(buf, size);
+		realmunmap(buf, size);
 	return ret;
 }
 



* Re: Cygwin can't handle huge packfiles?
  2006-04-03 14:33   ` Linus Torvalds
  2006-04-03 14:36     ` Linus Torvalds
@ 2006-04-03 15:12     ` Johannes Schindelin
  1 sibling, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2006-04-03 15:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kees-Jan Dijkzeul, git

Hi,

On Mon, 3 Apr 2006, Linus Torvalds wrote:

> On Mon, 3 Apr 2006, Johannes Schindelin wrote:
> > 
> > The problem is not mmap() on cygwin, but that a fork() has to jump through 
> > hoops to reinstall the open file descriptors on cygwin. If the 
> > corresponding file was deleted, that fails. Therefore, we work around that 
> > on cygwin by actually reading the file into memory, *not* mmap()ing it.
> 
> Well, we could actually do a _real_ mmap on pack-files. The pack-files are 
> much better mmap'ed - there we don't _want_ them to be removed while we're 
> using them. It was the index file etc that was problematic.
> 
> Maybe the cygwin fake mmap should be triggered only for the index (and 
> possibly the individual objects - if only because there doing a 
> malloc+read may actually be faster).

I hit the problem *only* with "git-whatchanged -p". Which means that the 
upcoming we-no-longer-write-temp-files-for-diff version should make that 
gitfakemmap() hack obsolete. (I have not checked whether there are other 
places where a file is mmap()ed and then used by a fork()ed process.)

Ciao,
Dscho


* Re: Cygwin can't handle huge packfiles?
  2006-04-03 14:36     ` Linus Torvalds
@ 2006-04-05 13:24       ` Kees-Jan Dijkzeul
  2006-04-05 14:14         ` Johannes Schindelin
  2006-04-06  4:13         ` Junio C Hamano
  2006-04-07  8:15       ` Junio C Hamano
  1 sibling, 2 replies; 21+ messages in thread
From: Kees-Jan Dijkzeul @ 2006-04-05 13:24 UTC (permalink / raw)
  To: git

On 4/3/06, Linus Torvalds <torvalds@osdl.org> wrote:
[...]
> That's not hugely fundamental, but I didn't expect people to hit it this
> quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's
> fifteen years of history (so that we'll have another fifteen years before
> we'll have to worry about 4GB pack-files ;)

I'm trying to get Git to manage my company's source tree. We're
writing software for digital TV sets. Anyway, the archive is about 5Gb
in size and contains binaries, zip files, excel sheets, meeting minutes
and whatnot. So it doesn't compress very well. The 1.5Gb pack file
hardly contains any history at all (five commits or so). On the flip
side, for now I'll be the only one adding to the archive, so at least
it will not grow that fast ;-)

Anyway, to reconstitute the tree, I need very nearly the entire pack,
so limiting the pack size won't do much good, as git will still try to
allocate a total of 1.5Gb memory (which, unfortunately, isn't there
:-)

Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
regular mmap for mapping pack files, only to discover that I compile
without defining "NO_MMAP", so I've been using the stock mmap all
along. So now I'm thinking that the cygwin mmap also does a
malloc-and-read, just like git does with NO_MMAP. So I'll continue to
investigate in that direction.

To be continued...

Groetjes,

Kees-Jan


* Re: Cygwin can't handle huge packfiles?
  2006-04-05 13:24       ` Kees-Jan Dijkzeul
@ 2006-04-05 14:14         ` Johannes Schindelin
  2006-04-05 21:08           ` Christopher Faylor
  2006-04-06  4:13         ` Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2006-04-05 14:14 UTC (permalink / raw)
  To: Kees-Jan Dijkzeul; +Cc: git

Hi,

On Wed, 5 Apr 2006, Kees-Jan Dijkzeul wrote:

> On 4/3/06, Linus Torvalds <torvalds@osdl.org> wrote:
> [...]
> > That's not hugely fundamental, but I didn't expect people to hit it this
> > quickly. What kind of project has a 1.5GB pack-file _already_? I hope it's
> > fifteen years of history (so that we'll have another fifteen years before
> > we'll have to worry about 4GB pack-files ;)
> 
> I'm trying to get Git to manage my company's source tree. We're
> writing software for digital TV sets. Anyway, the archive is about 5Gb
> in size and contains binaries, zip files, excel sheets, meeting minutes
> and whatnot. So it doesn't compress very well. The 1.5Gb pack file
> hardly contains any history at all (five commits or so). On the flip
> side, for now I'll be the only one adding to the archive, so at least
> it will not grow that fast ;-)
> 
> Anyway, to reconstitute the tree, I need very nearly the entire pack,
> so limiting the pack size won't do much good, as git will still try to
> allocate a total of 1.5Gb memory (which, unfortunately, isn't there
> :-)
> 
> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
> regular mmap for mapping pack files, only to discover that I compile
> without defining "NO_MMAP", so I've been using the stock mmap all
> along. So now I'm thinking that the cygwin mmap also does a
> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
> investigate in that direction.

I think cygwin's mmap() is based on the Win32 API equivalent, which could 
mean that it *is* memory mapped, but in a special area (which is smaller 
than 1.5 gigabyte). In this case, it would make sense to limit the pack 
size, thereby having several packs, and mmap() them as they are needed.

Hth,
Dscho


* Re: Cygwin can't handle huge packfiles?
  2006-04-05 14:14         ` Johannes Schindelin
@ 2006-04-05 21:08           ` Christopher Faylor
  2006-04-05 23:27             ` Rutger Nijlunsing
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher Faylor @ 2006-04-05 21:08 UTC (permalink / raw)
  To: Johannes Schindelin, Kees-Jan Dijkzeul, git

On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
>> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
>> regular mmap for mapping pack files, only to discover that I compile
>> without defining "NO_MMAP", so I've been using the stock mmap all
>> along. So now I'm thinking that the cygwin mmap also does a
>> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
>> investigate in that direction.
>
>I think cygwin's mmap() is based on the Win32 API equivalent, which could 
>mean that it *is* memory mapped, but in a special area (which is smaller 
>than 1.5 gigabyte). In this case, it would make sense to limit the pack 
>size, thereby having several packs, and mmap() them as they are needed.

Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile.  IIRC,
Windows might have a 2G limitation lurking under the hood somewhere but
I think that might be tweakable with some registry setting.

cgf


* Re: Cygwin can't handle huge packfiles?
  2006-04-05 21:08           ` Christopher Faylor
@ 2006-04-05 23:27             ` Rutger Nijlunsing
  2006-04-06  0:34               ` Christopher Faylor
  0 siblings, 1 reply; 21+ messages in thread
From: Rutger Nijlunsing @ 2006-04-05 23:27 UTC (permalink / raw)
  To: Christopher Faylor; +Cc: Johannes Schindelin, Kees-Jan Dijkzeul, git

On Wed, Apr 05, 2006 at 05:08:44PM -0400, Christopher Faylor wrote:
> On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
> >> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
> >> regular mmap for mapping pack files, only to discover that I compile
> >> without defining "NO_MMAP", so I've been using the stock mmap all
> >> along. So now I'm thinking that the cygwin mmap also does a
> >> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
> >> investigate in that direction.
> >
> >I think cygwin's mmap() is based on the Win32 API equivalent, which could 
> >mean that it *is* memory mapped, but in a special area (which is smaller 
> >than 1.5 gigabyte). In this case, it would make sense to limit the pack 
> >size, thereby having several packs, and mmap() them as they are needed.
> 
> Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile.  IIRC,
> Windows might have a 2G limitation lurking under the hood somewhere but
> I think that might be tweakable with some registry setting.

Windows places its DLLs criss-cross through the memory space because
every DLL on the system has its own preferred place to be loaded (the
base address). This severely limits the size of the largest contiguous
memory block available, which is needed for one mmap() I think.

Several solutions exist:
  - enlarge the address space with the /3GB boot flag in boot.ini
  - rebase all DLLs with REBASE.EXE (part of platform sdk) .
    Just make them the same and fix them to a low address.
    Problem is rebasing system dlls since those are locked by the system.
  - at start of program before other DLLs are loaded,
    reserve as large a part of the memory as possible with
    VirtualAlloc()

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------


* Re: Cygwin can't handle huge packfiles?
  2006-04-05 23:27             ` Rutger Nijlunsing
@ 2006-04-06  0:34               ` Christopher Faylor
  0 siblings, 0 replies; 21+ messages in thread
From: Christopher Faylor @ 2006-04-06  0:34 UTC (permalink / raw)
  To: git

On Thu, Apr 06, 2006 at 01:27:39AM +0200, Rutger Nijlunsing wrote:
>On Wed, Apr 05, 2006 at 05:08:44PM -0400, Christopher Faylor wrote:
>> On Wed, Apr 05, 2006 at 04:14:20PM +0200, Johannes Schindelin wrote:
>> >> Inspired by a patch of Alex Riesen (thanks, Alex), I tried to use the
>> >> regular mmap for mapping pack files, only to discover that I compile
>> >> without defining "NO_MMAP", so I've been using the stock mmap all
>> >> along. So now I'm thinking that the cygwin mmap also does a
>> >> malloc-and-read, just like git does with NO_MMAP. So I'll continue to
>> >> investigate in that direction.
>> >
>> >I think cygwin's mmap() is based on the Win32 API equivalent, which could 
>> >mean that it *is* memory mapped, but in a special area (which is smaller 
>> >than 1.5 gigabyte). In this case, it would make sense to limit the pack 
>> >size, thereby having several packs, and mmap() them as they are needed.
>> 
>> Yes, cygwin's mmap uses CreateFileMapping and MapViewOfFile.  IIRC,
>> Windows might have a 2G limitation lurking under the hood somewhere but
>> I think that might be tweakable with some registry setting.
>
>Windows places its DLLs criss-cross through the memory space because
>every DLL on the system has its own preferred place to be loaded (the
>base address). This severely limits the size of the largest contiguous
>memory block available, which is needed for one mmap() I think.
>
>Several solutions exist:
>  - enlarge the address space with the /3GB boot flag in boot.ini

Thanks.  The 3GB boot flag is what I was trying to remember.

>  - rebase all DLLs with REBASE.EXE (part of platform sdk) .
>    Just make them the same and fix them to a low address.
>    Problem is rebasing system dlls since those are locked by the system.

Cygwin has its own version of rebase and a method for rebasing all of the
dlls in the distribution.  Using that may help squeeze out a little bit
of memory.

>  - at start of program before other DLLs are loaded,
>    reserve as large a part of the memory as possible with
>    VirtualAlloc()

Cygwin actually uses this trick to try to push DLLs into their right
locations after a fork.  It sort of works but sometimes, in a child
process, Windows puts "stuff" in locations previously occupied by a
DLL.  I could swear that it does that just to be annoying...

There is a chicken/egg problem here in that Cygwin uses Doug Lea's malloc
and that version of malloc will use mmap when sbrk() fails -- as it is
apt to do when allocating gigabytes of memory.  So, using malloc is
not a way to avoid mmap.

cgf


* Re: Cygwin can't handle huge packfiles?
  2006-04-05 13:24       ` Kees-Jan Dijkzeul
  2006-04-05 14:14         ` Johannes Schindelin
@ 2006-04-06  4:13         ` Junio C Hamano
  1 sibling, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2006-04-06  4:13 UTC (permalink / raw)
  To: Kees-Jan Dijkzeul; +Cc: git

"Kees-Jan Dijkzeul" <k.j.dijkzeul@gmail.com> writes:

> I'm trying to get Git to manage my company's source tree. We're
> writing software for digital TV sets. Anyway, the archive is about 5Gb
> in size and contains binaries, zip files, excel sheets, meeting minutes
> and whatnot. So it doesn't compress very well. The 1.5Gb pack file
> hardly contains any history at all (five commits or so). On the flip
> side, for now I'll be the only one adding to the archive, so at least
> it will not grow that fast ;-)
>
> Anyway, to reconstitute the tree, I need very nearly the entire pack,
> so limiting the pack size won't do much good, as git will still try to
> allocate a total of 1.5Gb memory (which, unfortunately, isn't there
> :-)

Right now we LRU the pack files and evict older ones when we
mmap too many, but the unit of eviction is the whole file, so it
would not help the case like yours at all.  It might be possible
to mmap only part of a packfile, but it would involve fairly
major surgery to sha1_file.c.
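
A rough sketch of what mapping only part of a packfile could look like. This
is not how sha1_file.c works at the time of this thread; struct pack_window,
map_window() and unmap_window() are names invented here. The only real
constraint it demonstrates is that the file offset given to mmap() must be a
multiple of the page size, so the window starts at the page boundary below
the requested offset.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical window onto a region of a larger file. */
struct pack_window {
	void *base;			/* page-aligned start of the mapping */
	size_t map_len;			/* length actually mapped */
	const unsigned char *data;	/* points at the requested offset */
};

/* Map len bytes starting at offset; returns 0 on success, -1 on error. */
int map_window(int fd, off_t offset, size_t len, struct pack_window *w)
{
	off_t page = (off_t)sysconf(_SC_PAGESIZE);
	off_t aligned = offset - (offset % page);	/* round down */
	size_t slack = (size_t)(offset - aligned);

	w->map_len = len + slack;
	w->base = mmap(NULL, w->map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
	if (w->base == MAP_FAILED)
		return -1;
	w->data = (const unsigned char *)w->base + slack;
	return 0;
}

/* Release a window obtained from map_window(). */
void unmap_window(struct pack_window *w)
{
	munmap(w->base, w->map_len);
}
```

With something like this, the unit of LRU eviction could be a window of a
fixed size instead of a whole pack.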


* Re: Cygwin can't handle huge packfiles?
@ 2006-04-06 20:57 linux
  2006-04-06 23:53 ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: linux @ 2006-04-06 20:57 UTC (permalink / raw)
  To: git, junkio; +Cc: linux

> Right now we LRU the pack files and evict older ones when we
> mmap too many, but the unit of eviction is the whole file, so it
> would not help the case like yours at all.  It might be possible
> to mmap only part of a packfile, but it would involve fairly
> major surgery to sha1_file.c.

The simplest solution seems to be to limit pack file size to a reasonable
fraction of a 32-bit address space.  Say, 0.5 G.

That should be a fairly straightforward hack to git-pack-objects.
It already emits two files; just make it emit more.

You can tweak the heuristics to try to find a good break point: start
thinking about splitting the pack when you get to one size, but don't
force a break until you hit a harder limit as long as the deltas are
working well.

This can all be adjustable with a command line and/or config file option
to allow for the eventual demise of 32-bit systems.
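
The soft/hard limit idea could be sketched like this. Nothing here exists in
git-pack-objects; struct split_policy and should_split_pack() are invented
for illustration, and "deltas are working well" is reduced to the crude
assumption that the next entry is stored as a delta.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables, e.g. from a command line or config option. */
struct split_policy {
	uint64_t soft_limit;	/* start looking for a break point here */
	uint64_t hard_limit;	/* never let a pack grow beyond this */
};

/*
 * Decide whether to start a new pack before writing the next entry.
 * Between the two limits, only break at an entry that is not a delta,
 * so that delta chains are not cut while they are still paying off.
 */
bool should_split_pack(const struct split_policy *p, uint64_t pack_bytes,
		       uint64_t next_entry_bytes, bool next_is_delta)
{
	uint64_t projected = pack_bytes + next_entry_bytes;

	if (pack_bytes == 0)
		return false;	/* never emit an empty pack */
	if (projected > p->hard_limit)
		return true;	/* forced break */
	if (projected > p->soft_limit && !next_is_delta)
		return true;	/* convenient break point */
	return false;
}
```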


* Re: Cygwin can't handle huge packfiles?
  2006-04-06 20:57 linux
@ 2006-04-06 23:53 ` Junio C Hamano
  2006-04-07  3:05   ` linux
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2006-04-06 23:53 UTC (permalink / raw)
  To: linux; +Cc: git

linux@horizon.com writes:

>> Right now we LRU the pack files and evict older ones when we
>> mmap too many, but the unit of eviction is the whole file, so it
>> would not help the case like yours at all.  It might be possible
>> to mmap only part of a packfile, but it would involve fairly
>> major surgery to sha1_file.c.
>
> The simplest solution seems to be to limit pack file size to a reasonable
> fraction of a 32-bit address space.  Say, 0.5 G.

I do not think that would help the original poster's situation
where only 5 revs result in a 1.5G pack.  I would _almost_ say
"do not pack such a repository", but there is the initial
cloning over git-aware transports which always results in a
repository with a single pack.


* Re: Cygwin can't handle huge packfiles?
  2006-04-06 23:53 ` Junio C Hamano
@ 2006-04-07  3:05   ` linux
  0 siblings, 0 replies; 21+ messages in thread
From: linux @ 2006-04-07  3:05 UTC (permalink / raw)
  To: junkio, linux; +Cc: git

> I do not think that would help the original poster's situation
> where only 5 revs result in a 1.5G pack.  I would _almost_ say
> "do not pack such a repository", but there is the initial
> cloning over git-aware transports which always results in a
> repository with a single pack.

Huh?  Why not?  That repository has a lot of files.  For compression,
you want all versions of a file in one pack, and with few versions that
makes it easier to split up, not harder.

As for network transport of packs, I haven't studied the details,
but if you allow "thin packs" that have deltas relative to
objects not in the pack, then breaking up the pack anywhere
should be legal.

Or, if necessary, you can stuff an arbitrarily large file through
git-unpack-objects, which reads a stream from stdin without
attempting to mmap it.


(Speaking of unpack-objects.c, what's that "static unsigned long eof"
variable in there?  It never seems to be set to a non-zero value.)


* Re: Cygwin can't handle huge packfiles?
  2006-04-03 14:36     ` Linus Torvalds
  2006-04-05 13:24       ` Kees-Jan Dijkzeul
@ 2006-04-07  8:15       ` Junio C Hamano
  2006-04-07  8:27         ` Jakub Narebski
  2006-04-07 14:11         ` Nicolas Pitre
  1 sibling, 2 replies; 21+ messages in thread
From: Junio C Hamano @ 2006-04-07  8:15 UTC (permalink / raw)
  To: git; +Cc: Kees-Jan Dijkzeul, Linus Torvalds

Linus Torvalds <torvalds@osdl.org> writes:

> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> 
>> That said, I think git _does_ have problems with large pack-files. We have 
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> the packfile data structure does. The index has 32-bit offsets into 
> individual pack-files. 
>
> That's not hugely fundamental,...

Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.

 * pack-*.pack file has the following format:

   - The header appears at the beginning and consists of the following:

     4-byte signature
     4-byte version number (network byte order)
     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) and
     more than 4G objects in a pack.

   - The header is followed by number of object entries, each of
     which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name
     compressed delta data

     Observation: length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

  - The trailer records 20-byte SHA1 checksum of all of the above.

 * pack-*.idx file has the following format:

  - The header consists of 256 4-byte network byte order
    integers.  N-th entry of this table records the number of
    objects in the corresponding pack, the first byte of whose
    object name is less than or equal to N.

    Observation: we would need to extend this to an array of
    8-byte integers to go beyond 4G objects per pack, but it is
    not strictly necessary.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

    20-byte object name.

    Observation: we would definitely need to extend this to
    8-byte integer plus 20-byte object name to handle a packfile
    that is larger than 4GB.

  - The file is concluded with a trailer:

    A copy of the 20-byte SHA1 checksum at the end of
    corresponding packfile.

    20-byte SHA1-checksum of all of the above.
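
The idx layout above can be sketched as a small lookup routine. This is
only an illustration, not git's own code: `build_idx` and `find_offset`
are hypothetical names, entries are taken to be 24 bytes (4-byte offset
plus 20-byte name), and fanout[N] is assumed to count the objects whose
first name byte is at most N, so fanout[255] is the total.

```python
import hashlib
import struct

FANOUT_SIZE = 256 * 4   # 256 big-endian 4-byte counters
ENTRY_SIZE = 4 + 20     # 4-byte pack offset + 20-byte object name


def build_idx(objects, pack_checksum=b"\0" * 20):
    """Build idx bytes from a {20-byte object name: pack offset} dict."""
    entries = sorted(objects.items())
    fanout = [0] * 256
    for name, _ in entries:
        fanout[name[0]] += 1
    for i in range(1, 256):          # turn per-byte counts into running totals
        fanout[i] += fanout[i - 1]
    body = struct.pack(">256I", *fanout)
    body += b"".join(struct.pack(">I", off) + name for name, off in entries)
    body += pack_checksum            # copy of the packfile checksum
    return body + hashlib.sha1(body).digest()


def find_offset(idx, name):
    """Return the pack offset recorded for `name`, or None if absent."""
    first = name[0]
    # The fanout table narrows the search to names starting with `first`.
    hi = struct.unpack_from(">I", idx, 4 * first)[0]
    lo = struct.unpack_from(">I", idx, 4 * (first - 1))[0] if first else 0
    while lo < hi:                   # plain binary search over sorted names
        mid = (lo + hi) // 2
        pos = FANOUT_SIZE + mid * ENTRY_SIZE
        entry_name = idx[pos + 4:pos + ENTRY_SIZE]
        if entry_name < name:
            lo = mid + 1
        elif entry_name > name:
            hi = mid
        else:
            return struct.unpack_from(">I", idx, pos)[0]
    return None
```

The `">I"` offset field is where the 4GB ceiling comes from: any object
stored past that point in the packfile has no representable offset.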

This is not fundamental, in that pack idx file is something we
can regenerate from a packfile.  The push/fetch transfer over
git native protocols does not even transfer pack idx file;
instead, the recipient uses git-index-pack to generate pack idx.
git-index-pack would need to be updated to write the necessary
fields as 8-byte integers, without breaking existing packfiles.

The code to read idx file currently has a sanity check logic to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header matches the number
of objects recorded in the pack).  So we could reliably tell
between the current 24-byte version and 28-byte "beyond 4GB"
version, and support both formats at the same time.
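
A rough sketch of that size check, under the assumption that a wider
format would keep the same 1024-byte fanout header and 40-byte trailer
and only grow each entry from 24 to 28 bytes.  `detect_entry_size` is a
made-up name for illustration, not an existing git function.

```python
import struct

FANOUT_SIZE = 256 * 4    # 256 big-endian 4-byte counters
TRAILER_SIZE = 20 + 20   # packfile checksum + idx file checksum


def detect_entry_size(idx):
    """Return 24 or 28 depending on which entry size fits the file size."""
    if len(idx) < FANOUT_SIZE + TRAILER_SIZE:
        raise ValueError("idx file too short")
    # The last fanout entry is the total number of objects in the pack.
    nr_objects = struct.unpack_from(">I", idx, 4 * 255)[0]
    payload = len(idx) - FANOUT_SIZE - TRAILER_SIZE
    # An empty idx matches both sizes; the current 24-byte one is reported.
    for entry_size in (24, 28):
        if payload == nr_objects * entry_size:
            return entry_size
    raise ValueError("idx size is inconsistent with its object count")
```

For any non-zero object count the two sizes can never coincide, which
is what makes the distinction reliable.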

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of pack idx file when the packfile offset can be
expressed with 32-bit.

Having said that, I have to warn that this is not for the faint
of heart.  The necessary changes would be somewhat involved.


----------------------------------------------------------------

Pack idx file

	idx
	    +--------------------------------+
	    | fanout[0] = 2                  |-.
	    +--------------------------------+ |
	    | fanout[1]                      | |
	    +--------------------------------+ |
	    | fanout[2]                      | |
	    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
	    | fanout[255]                    | |
	    +--------------------------------+ |
main	    | offset                         | |
index	    | object name 00XXXXXXXXXXXXXXXX | |
table	    +--------------------------------+ | 
	    | offset                         | |
	    | object name 00XXXXXXXXXXXXXXXX | |
	    +--------------------------------+ |
	  .-| offset                         |<+
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | +--------------------------------+
	  | | offset                         |
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	  | | offset                         |
	  | | object name FFXXXXXXXXXXXXXXXX |
	  | +--------------------------------+
trailer	  | | packfile checksum              |
	  | +--------------------------------+
	  | | idxfile checksum               |
	  | +--------------------------------+
          .-------.      
                  |
Pack file entry: <+

     packed object header:
	1-byte type (bit 4-6)
	       size0 (bit 0-3)
               end-of-length (bit 7)
        n-byte sizeN (as long as MSB is set, each 7-bit)
		size0..sizeN form 4+7+7+..+7 bit integer, size0
		is the least significant part, sizeN the most.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
		is the size before compression).
	If it is DELTA, then
	  20-byte base object name SHA1 (the size above is the
	  	size of the delta data that follows).
          delta data, deflated.
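
The packed object header described above can be decoded along these
lines.  This is a sketch with hypothetical helper names; it assumes
size0 holds the low four bits of the size and each following byte
contributes seven more significant bits, which is how the unpacking
code reads it.

```python
def decode_object_header(buf, pos=0):
    """Decode a packed object header; return (type, size, next_pos)."""
    c = buf[pos]
    pos += 1
    obj_type = (c >> 4) & 0x07      # bits 4-6: object type
    size = c & 0x0F                 # bits 0-3: size0, the low four bits
    shift = 4
    while c & 0x80:                 # bit 7 set: another size byte follows
        c = buf[pos]
        pos += 1
        size |= (c & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def encode_object_header(obj_type, size):
    """Inverse of decode_object_header, used here for round-trip checks."""
    c = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(c | 0x80)        # mark that more size bytes follow
        c = size & 0x7F
        size >>= 7
    out.append(c)
    return bytes(out)
```

Because the length simply keeps consuming bytes while the MSB is set,
nothing in this header limits an object to 32 bits; a size of 5 * 2**32
round-trips through the two helpers unchanged.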


* Re: Cygwin can't handle huge packfiles?
  2006-04-07  8:15       ` Junio C Hamano
@ 2006-04-07  8:27         ` Jakub Narebski
  2006-04-07 14:11         ` Nicolas Pitre
  1 sibling, 0 replies; 21+ messages in thread
From: Jakub Narebski @ 2006-04-07  8:27 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

>  * pack-*.pack file has the following format:
[...]
>  * pack-*.idx file has the following format:
[...]
Could you please put the information in parent post somewhere in
Documentation, for example Documentation/technical/pack-format.txt
(perhaps together with putting description of packing heuristic from
http://marc.theaimsgroup.com/?l=git&m=114134881923320 by Jon Loeliger in
Documentation/technical/pack-heuristics.txt even if it doesn't conform to
"serious documentation" standards)?

Thanks in advance
-- 
Jakub Narebski
Warsaw, Poland


* Re: Cygwin can't handle huge packfiles?
  2006-04-07  8:15       ` Junio C Hamano
  2006-04-07  8:27         ` Jakub Narebski
@ 2006-04-07 14:11         ` Nicolas Pitre
  2006-04-07 18:31           ` Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Nicolas Pitre @ 2006-04-07 14:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kees-Jan Dijkzeul, Linus Torvalds

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >> 
> >> That said, I think git _does_ have problems with large pack-files. We have 
> >> some 32-bit issues etc
> >
> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> > the packfile data structure does. The index has 32-bit offsets into 
> > individual pack-files. 
> >
> > That's not hugely fundamental,...
> 
> Linus _does_ understand what he means, but let me clarify and
> outline a possible future direction.
> 
[...]

For the record, the delta code also has 32-bit limitations of its own 
presently.  It cannot encode a delta against a buffer which is larger 
than 4GB.

I however made sure the byte 0 could be used as a prefix for future 
encoding extensions, like 64-bit file offsets for example.


Nicolas


* Re: Cygwin can't handle huge packfiles?
  2006-04-07 14:11         ` Nicolas Pitre
@ 2006-04-07 18:31           ` Junio C Hamano
  2006-04-07 18:46             ` Nicolas Pitre
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2006-04-07 18:31 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Fri, 7 Apr 2006, Junio C Hamano wrote:
>
>> Linus Torvalds <torvalds@osdl.org> writes:
>> 
>> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> >> 
>> >> That said, I think git _does_ have problems with large pack-files. We have 
>> >> some 32-bit issues etc
>> >
>> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
>> > the packfile data structure does. The index has 32-bit offsets into 
>> > individual pack-files. 
>> >
>> > That's not hugely fundamental,...
>> 
>> Linus _does_ understand what he means, but let me clarify and
>> outline a possible future direction.
>
> For the record, the delta code also has 32-bit limitations of its own 
> presently.  It cannot encode a delta against a buffer which is larger 
> than 4GB.
>
> I however made sure the byte 0 could be used as a prefix for future 
> encoding extensions, like 64-bit file offsets for example.

True the delta data representation, not just the "delta code",
has that limitation, but I do not think you issue "insert 0-byte
literal data" command from the deltifier side right now, so we
should be OK.

Maybe we would want to check (cmd == 0) case to detect delta
extension that we do not handle right now?
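
Such a check might look like the sketch below.  It is an illustration
in Python, not the patch itself: `apply_delta` is a hypothetical name,
and the opcode layout (two size varints, then copy commands with bit 7
set and literal inserts of 1..127 bytes otherwise) is an assumption
about the delta format rather than something spelled out in this
thread.

```python
def read_varint(buf, pos):
    """Read a little-endian base-128 integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        c = buf[pos]
        pos += 1
        value |= (c & 0x7F) << shift
        shift += 7
        if not c & 0x80:
            return value, pos


def apply_delta(base, delta):
    """Apply `delta` to `base`, rejecting the reserved opcode 0."""
    src_size, pos = read_varint(delta, 0)
    dst_size, pos = read_varint(delta, pos)
    if src_size != len(base):
        raise ValueError("delta does not match base size")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:                      # copy a range from the base
            offset = size = 0
            for bit in range(4):            # up to 4 offset bytes
                if cmd & (1 << bit):
                    offset |= delta[pos] << (8 * bit)
                    pos += 1
            for bit in range(3):            # up to 3 size bytes
                if cmd & (0x10 << bit):
                    size |= delta[pos] << (8 * bit)
                    pos += 1
            if size == 0:
                size = 0x10000
            if offset + size > len(base):
                raise ValueError("copy command reaches past end of base")
            out += base[offset:offset + size]
        elif cmd:                           # insert `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:                               # cmd == 0: reserved for extensions
            raise ValueError("unexpected delta opcode 0")
    if len(out) != dst_size:
        raise ValueError("delta produced wrong result size")
    return bytes(out)
```

Failing loudly on opcode 0 means a delta written with a future
extension is reported as unsupported instead of being silently
misapplied.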


* Re: Cygwin can't handle huge packfiles?
  2006-04-07 18:31           ` Junio C Hamano
@ 2006-04-07 18:46             ` Nicolas Pitre
  0 siblings, 0 replies; 21+ messages in thread
From: Nicolas Pitre @ 2006-04-07 18:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Fri, 7 Apr 2006, Junio C Hamano wrote:
> >
> >> Linus Torvalds <torvalds@osdl.org> writes:
> >> 
> >> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >> >> 
> >> >> That said, I think git _does_ have problems with large pack-files. We have 
> >> >> some 32-bit issues etc
> >> >
> >> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> >> > the packfile data structure does. The index has 32-bit offsets into 
> >> > individual pack-files. 
> >> >
> >> > That's not hugely fundamental,...
> >> 
> >> Linus _does_ understand what he means, but let me clarify and
> >> outline a possible future direction.
> >
> > For the record, the delta code also has 32-bit limitations of its own 
> > presently.  It cannot encode a delta against a buffer which is larger 
> > than 4GB.
> >
> > I however made sure the byte 0 could be used as a prefix for future 
> > encoding extensions, like 64-bit file offsets for example.
> 
> True the delta data representation, not just the "delta code",
> has that limitation, but I do not think you issue "insert 0-byte
> literal data" command from the deltifier side right now, so we
> should be OK.
> 
> Maybe we would want to check (cmd == 0) case to detect delta
> extension that we do not handle right now?

Good idea.  Will send you a patch.


Nicolas


