git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: [PATCH/RFC 1/1] gc --auto: exclude the largest giant pack in low-memory config
@ 2018-03-06  0:10 Duy Nguyen
  0 siblings, 0 replies; 5+ messages in thread
From: Duy Nguyen @ 2018-03-06  0:10 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Jeff King, Eric Wong

On Mon, Mar 5, 2018 at 9:00 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Mar 01 2018, Nguyễn Thái Ngọc Duy jotted:
>
>> pack-objects could be a big memory hog especially on large repos,
>> everybody knows that. The suggestion to stick a .keep file on the
>> largest pack to avoid this problem is also known for a long time.
>>
>> Let's do the suggestion automatically instead of waiting for people to
>> come to Git mailing list and get the advice. When a certain condition
>> is met, gc --auto create a .keep file temporary before repack is run,
>> then remove it afterward.
>>
>> gc --auto does this based on an estimation of pack-objects memory
>> usage and whether that fits in one third of system memory (the
>> assumption here is for desktop environment where there are many other
>> applications running).
>>
>> Since the estimation may be inaccurate and that 1/3 threshold is
>> arbitrary, give the user a finer control over this mechanism as well:
>> if the largest pack is larger than gc.bigPackThreshold, it's kept.
>
> This is very promising. Saves lots of memory on my ad-hoc testing of
> adding a *.keep file on an in-house repo.

The good news for you is when we run external rev-list on top of this,
memory consumption seems even better (and I think even peak memory
should be a bit lower too, but I'll need to verify that).

>> +     if (big_pack_threshold)
>> +             return pack->pack_size >= big_pack_threshold;
>> +
>> +     /* First we have to scan through at least one pack */
>> +     mem_want = pack->pack_size + pack->index_size;
>> +     /* then pack-objects needs lots more for book keeping */
>> +     mem_want += sizeof(struct object_entry) * nr_objects;
>> +     /*
>> +      * internal rev-list --all --objects takes up some memory too,
>> +      * let's say half of it is for blobs
>> +      */
>> +     mem_want += sizeof(struct blob) * nr_objects / 2;
>> +     /*
>> +      * and the other half is for trees (commits and tags are
>> +      * usually insignificant)
>> +      */
>> +     mem_want += sizeof(struct tree) * nr_objects / 2;
>> +     /* and then obj_hash[], underestimated in fact */
>> +     mem_want += sizeof(struct object *) * nr_objects;
>> +     /*
>> +      * read_sha1_file() (either at delta calculation phase, or
>> +      * writing phase) also fills up the delta base cache
>> +      */
>> +     mem_want += delta_base_cache_limit;
>> +     /* and of course pack-objects has its own delta cache */
>> +     mem_want += max_delta_cache_size;
>
> I'm not familiar enough with this part to say, but isn't this assuming a
> lot about the distribution of objects in a way that will cause is not to
> repack in some pathological cases?

Yeah this assumes a "normal" case. When the estimation is really off,
either we exclude the base pack or repack everything unnecessarily,
but we always repack. A wrong decision here can only affect
performance, not correctness.

There's one case I probably should address though. This "exclude the
base pack" will create two packs in the end, one big and one small.
But if the second pack is getting as big as the first one, it's time
we merge both into one.

> Probably worth documenting...
>
>> +     /* Only allow 1/3 of memory for pack-objects */
>> +     mem_have = total_ram() / 3;
>
> Would be great to have this be a configurable variable, so you could set
> it to e.g. 33% (like here), 50% etc.

Hmm.. isn't gc.bigPackThreshold enough? I mean in a controlled
environment, you probably already know how much ram is available, and
much of this estimation is based on pack size (well the number of
objects in the pack) anyway, you could avoid all this heuristics by
saying "when the base pack is larger than 1GB, always exclude it in
repack". This estimation should only needed when people do not
configure anything (and still expect reasonable defaults). Or when you
plan multiple 'gc' runs on the same machine?
-- 
Duy

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Reduce pack-objects memory footprint?
@ 2018-02-28  9:27 Duy Nguyen
  2018-03-01  9:20 ` [PATCH/RFC 0/1] Avoid expensive 'repack -ad' in gc --auto Nguyễn Thái Ngọc Duy
  0 siblings, 1 reply; 5+ messages in thread
From: Duy Nguyen @ 2018-02-28  9:27 UTC (permalink / raw)
  To: git

linux-2.6.git current has 6483999 objects. "git gc" on my poor laptop
consumes 1.7G out of 4G RAM, pushing lots of data to swap and making
all apps nearly unusuable (granted the problem is partly Linux I/O
scheduler too). So I wonder if we can reduce pack-objects memory
footprint a bit.

This demonstration patch (probably breaks some tests) would reduce the
size of struct object_entry from from 136 down to 112 bytes on
x86-64. There are 6483999 of these objects, so the saving is 17% or
148 MB.

If we go further, notice that nr_objects is uint32_t, we could convert
the three pointers

	struct object_entry *delta;
	struct object_entry *delta_child;
	struct object_entry *delta_sibling;

to

	uint32_t delta;
	uint32_t delta_child;
	uint32_t delta_sibling;

which saves 12 bytes (or another 74 MB). 222 MB total is plenty of
space to keep some file cache from being evicted.

Is it worth doing this? The struct packing makes it harder to read
(and more fragile too). I added some more artifical limit like max
depth of 2^11. But I think 4096+ depth is getting unreasonable.

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5c674b2843..6a9804daec 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -877,8 +877,11 @@ static void write_pack_file(void)
 			strbuf_addf(&tmpname, "%s-", base_name);
 
 			if (write_bitmap_index) {
+				ALLOC_ARRAY(to_pack.in_pack_pos, to_pack.nr_objects);
 				bitmap_writer_set_checksum(oid.hash);
-				bitmap_writer_build_type_index(written_list, nr_written);
+				bitmap_writer_build_type_index(written_list,
+							       nr_written,
+							       &to_pack);
 			}
 
 			finish_tmp_packfile(&tmpname, pack_tmp_name,
@@ -1407,6 +1410,7 @@ static void check_object(struct object_entry *entry)
 		unsigned long avail;
 		off_t ofs;
 		unsigned char *buf, c;
+		enum object_type type;
 
 		buf = use_pack(p, &w_curs, entry->in_pack_offset, &avail);
 
@@ -1415,8 +1419,9 @@ static void check_object(struct object_entry *entry)
 		 * since non-delta representations could still be reused.
 		 */
 		used = unpack_object_header_buffer(buf, avail,
-						   &entry->in_pack_type,
+						   &type,
 						   &entry->size);
+		entry->in_pack_type = type;
 		if (used == 0)
 			goto give_up;
 
@@ -1559,6 +1564,7 @@ static void drop_reused_delta(struct object_entry *entry)
 {
 	struct object_entry **p = &entry->delta->delta_child;
 	struct object_info oi = OBJECT_INFO_INIT;
+	enum object_type type;
 
 	while (*p) {
 		if (*p == entry)
@@ -1570,7 +1576,7 @@ static void drop_reused_delta(struct object_entry *entry)
 	entry->depth = 0;
 
 	oi.sizep = &entry->size;
-	oi.typep = &entry->type;
+	oi.typep = &type;
 	if (packed_object_info(entry->in_pack, entry->in_pack_offset, &oi) < 0) {
 		/*
 		 * We failed to get the info from this pack for some reason;
@@ -1580,7 +1586,8 @@ static void drop_reused_delta(struct object_entry *entry)
 		 */
 		entry->type = sha1_object_info(entry->idx.oid.hash,
 					       &entry->size);
-	}
+	} else
+		entry->type = type;
 }
 
 /*
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index e01f992884..5c9957a095 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -49,7 +49,8 @@ void bitmap_writer_show_progress(int show)
  * Build the initial type index for the packfile
  */
 void bitmap_writer_build_type_index(struct pack_idx_entry **index,
-				    uint32_t index_nr)
+				    uint32_t index_nr,
+				    struct packing_data *to_pack)
 {
 	uint32_t i;
 
@@ -62,7 +63,7 @@ void bitmap_writer_build_type_index(struct pack_idx_entry **index,
 		struct object_entry *entry = (struct object_entry *)index[i];
 		enum object_type real_type;
 
-		entry->in_pack_pos = i;
+		to_pack->in_pack_pos[entry - to_pack->objects] = i;
 
 		switch (entry->type) {
 		case OBJ_COMMIT:
@@ -147,7 +148,7 @@ static uint32_t find_object_pos(const unsigned char *sha1)
 			"(object %s is missing)", sha1_to_hex(sha1));
 	}
 
-	return entry->in_pack_pos;
+	return writer.to_pack->in_pack_pos[entry - writer.to_pack->objects];
 }
 
 static void show_object(struct object *object, const char *name, void *data)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 9270983e5f..595914fa43 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1032,7 +1032,7 @@ int rebuild_existing_bitmaps(struct packing_data *mapping,
 		oe = packlist_find(mapping, sha1, NULL);
 
 		if (oe)
-			reposition[i] = oe->in_pack_pos + 1;
+			reposition[i] = mapping->in_pack_pos[oe - mapping->objects] + 1;
 	}
 
 	rebuild = bitmap_new();
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3742a00e14..2558f7662a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -44,7 +44,9 @@ int rebuild_existing_bitmaps(struct packing_data *mapping, khash_sha1 *reused_bi
 
 void bitmap_writer_show_progress(int show);
 void bitmap_writer_set_checksum(unsigned char *sha1);
-void bitmap_writer_build_type_index(struct pack_idx_entry **index, uint32_t index_nr);
+void bitmap_writer_build_type_index(struct pack_idx_entry **index,
+				    uint32_t index_nr,
+				    struct packing_data *);
 void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
 		unsigned int indexed_commits_nr, int max_bitmaps);
diff --git a/pack-objects.h b/pack-objects.h
index 03f1191659..caecad23b6 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -1,6 +1,18 @@
 #ifndef PACK_OBJECTS_H
 #define PACK_OBJECTS_H
 
+/*
+ * State flags for depth-first search used for analyzing delta cycles.
+ *
+ * The depth is measured in delta-links to the base (so if A is a delta
+ * against B, then A has a depth of 1, and B a depth of 0).
+ */
+enum dfs_state {
+	DFS_NONE = 0,
+	DFS_ACTIVE,
+	DFS_DONE
+};
+
 struct object_entry {
 	struct pack_idx_entry idx;
 	unsigned long size;	/* uncompressed size */
@@ -14,11 +26,10 @@ struct object_entry {
 	void *delta_data;	/* cached delta (uncompressed) */
 	unsigned long delta_size;	/* delta data size (uncompressed) */
 	unsigned long z_delta_size;	/* delta data size (compressed) */
-	enum object_type type;
-	enum object_type in_pack_type;	/* could be delta */
 	uint32_t hash;			/* name hint hash */
-	unsigned int in_pack_pos;
 	unsigned char in_pack_header_size;
+	unsigned type:3;	 /* enum object_type */
+	unsigned in_pack_type:3; /* enum object_type - could be delta */
 	unsigned preferred_base:1; /*
 				    * we do not pack this, but is available
 				    * to be used as the base object to delta
@@ -27,19 +38,8 @@ struct object_entry {
 	unsigned no_try_delta:1;
 	unsigned tagged:1; /* near the very tip of refs */
 	unsigned filled:1; /* assigned write-order */
-
-	/*
-	 * State flags for depth-first search used for analyzing delta cycles.
-	 *
-	 * The depth is measured in delta-links to the base (so if A is a delta
-	 * against B, then A has a depth of 1, and B a depth of 0).
-	 */
-	enum {
-		DFS_NONE = 0,
-		DFS_ACTIVE,
-		DFS_DONE
-	} dfs_state;
-	int depth;
+	unsigned dfs_state:3;	/* enum dfs_state */
+	unsigned depth:11;
 };
 
 struct packing_data {
@@ -48,6 +48,8 @@ struct packing_data {
 
 	int32_t *index;
 	uint32_t index_size;
+
+	int *in_pack_pos;
 };
 
 struct object_entry *packlist_alloc(struct packing_data *pdata,
-- 8< --

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-03-06  0:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-06  0:10 [PATCH/RFC 1/1] gc --auto: exclude the largest giant pack in low-memory config Duy Nguyen
  -- strict thread matches above, loose matches on Subject: below --
2018-02-28  9:27 Reduce pack-objects memory footprint? Duy Nguyen
2018-03-01  9:20 ` [PATCH/RFC 0/1] Avoid expensive 'repack -ad' in gc --auto Nguyễn Thái Ngọc Duy
2018-03-01  9:20   ` [PATCH/RFC 1/1] gc --auto: exclude the largest giant pack in low-memory config Nguyễn Thái Ngọc Duy
2018-03-01 18:14     ` Junio C Hamano
2018-03-02  0:00       ` Duy Nguyen
2018-03-05 14:00     ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).