git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH 21/32] split-index: the writing part
Date: Mon, 28 Apr 2014 17:55:42 +0700	[thread overview]
Message-ID: <1398682553-11634-22-git-send-email-pclouds@gmail.com> (raw)
In-Reply-To: <1398682553-11634-1-git-send-email-pclouds@gmail.com>

prepare_to_write_split_index() does the major work, classifying
deleted, updated and added entries. write_link_extension() then just
writes it down.

An observation is, deleting an entry, then adding it back is recorded
as "entry X is deleted, entry X is added", not "entry X is replaced".
This is simpler, with small overhead: a replaced entry is stored
without its path, a new entry is store with its path.

A note about unpack_trees() and the deduplication code inside
prepare_to_write_split_index(). Usually tracking updated/removed
entries via read-cache API is enough. unpack_trees() manipulates the
index in a different way: it throws the entire source index out,
builds up a new one, copying/duplicating entries (using dup_entry)
from the source index over if necessary, then returns the new index.

A naive solution would be marking the entire source index "deleted"
and add their duplicates as new. That could bring $GIT_DIR/index back
to the original size. So we try harder and memcmp() between the
original and the duplicate to see if it needs updating.

We could avoid memcmp() too, by avoiding duplicating the original
entry in dup_entry(). The performance gain this way is within noise
level and it complicates unpack-trees.c. So memcmp() is the preferred
way to deal with deduplication.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 split-index.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 split-index.h |   4 +++
 2 files changed, 103 insertions(+), 2 deletions(-)

diff --git a/split-index.c b/split-index.c
index b36c73b..5708807 100644
--- a/split-index.c
+++ b/split-index.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "split-index.h"
+#include "ewah/ewok.h"
 
 struct split_index *init_split_index(struct index_state *istate)
 {
@@ -26,11 +27,22 @@ int read_link_extension(struct index_state *istate,
 	return 0;
 }
 
+static int write_strbuf(void *user_data, const void *data, size_t len)
+{
+	struct strbuf *sb = user_data;
+	strbuf_add(sb, data, len);
+	return len;
+}
+
 int write_link_extension(struct strbuf *sb,
 			 struct index_state *istate)
 {
 	struct split_index *si = istate->split_index;
 	strbuf_add(sb, si->base_sha1, 20);
+	if (!si->delete_bitmap && !si->replace_bitmap)
+		return 0;
+	ewah_serialize_to(si->delete_bitmap, write_strbuf, sb);
+	ewah_serialize_to(si->replace_bitmap, write_strbuf, sb);
 	return 0;
 }
 
@@ -62,14 +74,99 @@ void merge_base_index(struct index_state *istate)
 void prepare_to_write_split_index(struct index_state *istate)
 {
 	struct split_index *si = init_split_index(istate);
-	/* take cache[] out temporarily */
+	struct cache_entry **entries = NULL, *ce;
+	int i, nr_entries = 0, nr_alloc = 0;
+
+	si->delete_bitmap = ewah_new();
+	si->replace_bitmap = ewah_new();
+
+	if (si->base) {
+		/* Go through istate->cache[] and mark CE_MATCHED to
+		 * entry with positive index. We'll go through
+		 * base->cache[] later to delete all entries in base
+		 * that are not marked eith either CE_MATCHED or
+		 * CE_UPDATE_IN_BASE. If istate->cache[i] is a
+		 * duplicate, deduplicate it.
+		 */
+		for (i = 0; i < istate->cache_nr; i++) {
+			struct cache_entry *base;
+			/* namelen is checked separately */
+			const unsigned int ondisk_flags =
+				CE_STAGEMASK | CE_VALID | CE_EXTENDED_FLAGS;
+			unsigned int ce_flags, base_flags, ret;
+			ce = istate->cache[i];
+			if (!ce->index)
+				continue;
+			if (ce->index > si->base->cache_nr) {
+				ce->index = 0;
+				continue;
+			}
+			ce->ce_flags |= CE_MATCHED; /* or "shared" */
+			base = si->base->cache[ce->index - 1];
+			if (ce == base)
+				continue;
+			if (ce->ce_namelen != base->ce_namelen ||
+			    strcmp(ce->name, base->name)) {
+				ce->index = 0;
+				continue;
+			}
+			ce_flags = ce->ce_flags;
+			base_flags = base->ce_flags;
+			/* only on-disk flags matter */
+			ce->ce_flags   &= ondisk_flags;
+			base->ce_flags &= ondisk_flags;
+			ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
+				     offsetof(struct cache_entry, name) -
+				     offsetof(struct cache_entry, ce_stat_data));
+			ce->ce_flags = ce_flags;
+			base->ce_flags = base_flags;
+			if (ret)
+				ce->ce_flags |= CE_UPDATE_IN_BASE;
+			free(base);
+			si->base->cache[ce->index - 1] = ce;
+		}
+		for (i = 0; i < si->base->cache_nr; i++) {
+			ce = si->base->cache[i];
+			if ((ce->ce_flags & CE_REMOVE) ||
+			    !(ce->ce_flags & CE_MATCHED))
+				ewah_set(si->delete_bitmap, i);
+			else if (ce->ce_flags & CE_UPDATE_IN_BASE) {
+				ewah_set(si->replace_bitmap, i);
+				ALLOC_GROW(entries, nr_entries+1, nr_alloc);
+				entries[nr_entries++] = ce;
+			}
+		}
+	}
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		ce = istate->cache[i];
+		if ((!si->base || !ce->index) && !(ce->ce_flags & CE_REMOVE)) {
+			ALLOC_GROW(entries, nr_entries+1, nr_alloc);
+			entries[nr_entries++] = ce;
+		}
+		ce->ce_flags &= ~CE_MATCHED;
+	}
+
+	/*
+	 * take cache[] out temporarily, put entries[] in its place
+	 * for writing
+	 */
+	si->saved_cache = istate->cache;
 	si->saved_cache_nr = istate->cache_nr;
-	istate->cache_nr = 0;
+	istate->cache = entries;
+	istate->cache_nr = nr_entries;
 }
 
 void finish_writing_split_index(struct index_state *istate)
 {
 	struct split_index *si = init_split_index(istate);
+
+	ewah_free(si->delete_bitmap);
+	ewah_free(si->replace_bitmap);
+	si->delete_bitmap = NULL;
+	si->replace_bitmap = NULL;
+	free(istate->cache);
+	istate->cache = si->saved_cache;
 	istate->cache_nr = si->saved_cache_nr;
 }
 
diff --git a/split-index.h b/split-index.h
index 812e510..53b778f 100644
--- a/split-index.h
+++ b/split-index.h
@@ -3,10 +3,14 @@
 
 struct index_state;
 struct strbuf;
+struct ewah_bitmap;
 
 struct split_index {
 	unsigned char base_sha1[20];
 	struct index_state *base;
+	struct ewah_bitmap *delete_bitmap;
+	struct ewah_bitmap *replace_bitmap;
+	struct cache_entry **saved_cache;
 	unsigned int saved_cache_nr;
 	int refcount;
 };
-- 
1.9.1.346.ga2b5940

  parent reply	other threads:[~2014-04-28 10:56 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-28 10:55 [PATCH 00/32] Split index mode for very large indexes Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 01/32] ewah: fix constness of ewah_read_mmap Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 02/32] ewah: delete unused ewah_read_mmap_native declaration Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 03/32] sequencer: do not update/refresh index if the lock cannot be held Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 04/32] read-cache: new API write_locked_index instead of write_index/write_cache Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 05/32] read-cache: relocate and unexport commit_locked_index() Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 06/32] read-cache: store in-memory flags in the first 12 bits of ce_flags Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 07/32] read-cache: be strict about "changed" in remove_marked_cache_entries() Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 08/32] read-cache: be specific what part of the index has changed Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 09/32] update-index: " Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 10/32] resolve-undo: " Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 11/32] unpack-trees: " Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 12/32] cache-tree: mark istate->cache_changed on cache tree invalidation Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 13/32] cache-tree: mark istate->cache_changed on cache tree update Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 14/32] cache-tree: mark istate->cache_changed on prime_cache_tree() Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 15/32] entry.c: update cache_changed if refresh_cache is set in checkout_entry() Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 16/32] read-cache: save index SHA-1 after reading Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 17/32] read-cache: split-index mode Nguyễn Thái Ngọc Duy
2014-04-28 22:46   ` Junio C Hamano
2014-04-29  1:43     ` Duy Nguyen
2014-04-29 17:23       ` Junio C Hamano
2014-04-29 22:45         ` Duy Nguyen
2014-04-30 13:57           ` Junio C Hamano
2014-04-28 10:55 ` [PATCH 18/32] read-cache: mark new entries for split index Nguyễn Thái Ngọc Duy
2014-04-30 20:35   ` Eric Sunshine
2014-04-28 10:55 ` [PATCH 19/32] read-cache: save deleted entries in " Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 20/32] read-cache: mark updated entries for " Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` Nguyễn Thái Ngọc Duy [this message]
2014-04-28 10:55 ` [PATCH 22/32] split-index: the reading part Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 23/32] split-index: do not invalidate cache-tree at read time Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 24/32] split-index: strip pathname of on-disk replaced entries Nguyễn Thái Ngọc Duy
2014-04-29 20:25   ` Junio C Hamano
2014-04-28 10:55 ` [PATCH 25/32] update-index: new options to enable/disable split index mode Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 26/32] update-index --split-index: do not split if $GIT_DIR is read only Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 27/32] rev-parse: add --shared-index-path to get shared index path Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 28/32] read-tree: force split-index mode off on --index-output Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 29/32] read-tree: note about dropping split-index mode or index version Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 30/32] read-cache: force split index mode with GIT_TEST_SPLIT_INDEX Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 31/32] t2104: make sure split index mode is off for the version test Nguyễn Thái Ngọc Duy
2014-04-28 10:55 ` [PATCH 32/32] t1700: new tests for split-index mode Nguyễn Thái Ngọc Duy
2014-04-28 21:18 ` [PATCH 00/32] Split index mode for very large indexes Shawn Pearce
2014-04-29  1:52   ` Duy Nguyen
2014-05-09 10:27   ` Duy Nguyen
2014-05-09 17:55     ` Junio C Hamano
2014-05-13 11:15   ` [PATCH 0/8] Speed up cache loading time Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 1/8] read-cache: allow to keep mmap'd memory after reading Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 2/3] Add read-cache--daemon Nguyễn Thái Ngọc Duy
2014-05-13 11:52       ` Erik Faye-Lund
2014-05-13 12:01         ` Duy Nguyen
2014-05-13 13:01         ` Duy Nguyen
2014-05-13 13:37           ` Erik Faye-Lund
2014-05-13 13:49             ` Duy Nguyen
2014-05-13 14:06               ` Erik Faye-Lund
2014-05-13 14:10                 ` Duy Nguyen
2014-05-13 14:16                   ` Erik Faye-Lund
2014-05-13 11:15     ` [PATCH 2/8] unix-socket: stub impl. for platforms with no unix socket support Nguyễn Thái Ngọc Duy
2014-05-13 11:59       ` Erik Faye-Lund
2014-05-13 12:03         ` Erik Faye-Lund
2014-05-13 11:15     ` [PATCH 3/8] daemonize: set a flag before exiting the main process Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 3/3] read-cache: try index data from shared memory Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 4/8] Add read-cache--daemon for caching index and related stuff Nguyễn Thái Ngọc Duy
2014-05-13 11:56       ` Erik Faye-Lund
2014-05-13 11:15     ` [PATCH 5/8] read-cache: try index data from shared memory Nguyễn Thái Ngọc Duy
2014-05-13 12:13       ` Erik Faye-Lund
2014-05-13 11:15     ` [PATCH 6/8] read-cache--daemon: do not read index " Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 7/8] read-cache: skip verifying trailing SHA-1 on cached index Nguyễn Thái Ngọc Duy
2014-05-13 11:15     ` [PATCH 8/8] read-cache: inform the daemon that the index has been updated Nguyễn Thái Ngọc Duy
2014-05-13 12:17       ` Erik Faye-Lund
2014-05-22 16:38       ` David Turner
2014-05-13 14:24     ` [PATCH 0/8] Speed up cache loading time Stefan Beller
2014-05-13 14:35       ` Duy Nguyen
2014-05-13 11:20   ` [PATCH 9/8] even faster loading time with index version 254 Nguyễn Thái Ngọc Duy
2014-04-28 22:23 ` [PATCH 00/32] Split index mode for very large indexes Junio C Hamano
2014-04-30 20:48 ` Richard Hansen
2014-05-01  0:09   ` Duy Nguyen
  -- strict thread matches above, loose matches on Subject: below --
2014-06-13 12:19 [PATCH 00/32] Split index resend Nguyễn Thái Ngọc Duy
2014-06-13 12:19 ` [PATCH 21/32] split-index: the writing part Nguyễn Thái Ngọc Duy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1398682553-11634-22-git-send-email-pclouds@gmail.com \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).