[PATCH v1 0/2] convert: stream and early out

git@vger.kernel.org mailing list mirror (one of many)

* [PATCH v1 0/2] convert: stream and early out
@ 2016-10-09  9:56 tboegi
  2016-10-10 20:19 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: tboegi @ 2016-10-09  9:56 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

An optimization for the binary/text detection that is run when autocrlf
is used, or when git ls-files --eol is run to analyze the content of
files or blobs.

Torsten Bögershausen (2):
  read-cache: factor out get_sha1_from_index() helper
  convert.c: stream and early out

 cache.h      |   3 +
 convert.c    | 195 +++++++++++++++++++++++++++++++++++++++--------------------
 read-cache.c |  29 +++++----
 3 files changed, 151 insertions(+), 76 deletions(-)

-- 
2.10.0



* Re: [PATCH v1 0/2] convert: stream and early out
  2016-10-09  9:56 [PATCH v1 0/2] convert: stream and early out tboegi
@ 2016-10-10 20:19 ` Junio C Hamano
  2016-10-12 13:47 ` [PATCH v2 0/2] Stream and fast search tboegi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2016-10-10 20:19 UTC (permalink / raw)
  To: tboegi; +Cc: git

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
>
> An optimization for the binary/text detection that is run when autocrlf
> is used, or when git ls-files --eol is run to analyze the content of
> files or blobs.

This looks like a worthwhile thing to do.  Please sign-off the
patches when they are finalized.

Thanks.

>
> Torsten Bögershausen (2):
>   read-cache: factor out get_sha1_from_index() helper
>   convert.c: stream and early out
>
>  cache.h      |   3 +
>  convert.c    | 195 +++++++++++++++++++++++++++++++++++++++--------------------
>  read-cache.c |  29 +++++----
>  3 files changed, 151 insertions(+), 76 deletions(-)


* [PATCH v2 0/2] Stream and fast search
  2016-10-09  9:56 [PATCH v1 0/2] convert: stream and early out tboegi
  2016-10-10 20:19 ` Junio C Hamano
@ 2016-10-12 13:47 ` tboegi
  2016-10-27 17:02   ` Junio C Hamano
  2016-10-12 13:47 ` [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper tboegi
  2016-10-12 13:47 ` [PATCH v2 2/2] convert.c: stream and fast search for binary tboegi
  3 siblings, 1 reply; 11+ messages in thread
From: tboegi @ 2016-10-12 13:47 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Changes since v1:
- Rename earlyout into search_only
- Increase buffer from 2KiB to 16KiB
- s/mask/eol_bits/
- Reduce the "noise"
- Document "split gather_stats() into gather_all_stats()/gather_stats_partly()"

Torsten Bögershausen (2):
  read-cache: factor out get_sha1_from_index() helper
  convert.c: stream and fast search for binary

 cache.h      |   3 +
 convert.c    | 191 ++++++++++++++++++++++++++++++++++++++++-------------------
 read-cache.c |  29 +++++----
 3 files changed, 150 insertions(+), 73 deletions(-)

-- 
2.10.0



* [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper
  2016-10-09  9:56 [PATCH v1 0/2] convert: stream and early out tboegi
  2016-10-10 20:19 ` Junio C Hamano
  2016-10-12 13:47 ` [PATCH v2 0/2] Stream and fast search tboegi
@ 2016-10-12 13:47 ` tboegi
  2016-10-27 19:57   ` Junio C Hamano
  2016-10-29 12:22   ` Duy Nguyen
  2016-10-12 13:47 ` [PATCH v2 2/2] convert.c: stream and fast search for binary tboegi
  3 siblings, 2 replies; 11+ messages in thread
From: tboegi @ 2016-10-12 13:47 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Factor out the retrieval of the sha1 for a given path in
read_blob_data_from_index() into the function get_sha1_from_index().

This will be used in the next commit, when convert.c can do the
analysis for "text=auto" without slurping the whole blob into memory
at once.

Add a wrapper definition get_sha1_from_cache().

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 cache.h      |  3 +++
 read-cache.c | 29 ++++++++++++++++++-----------
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/cache.h b/cache.h
index 1604e29..04de209 100644
--- a/cache.h
+++ b/cache.h
@@ -380,6 +380,7 @@ extern void free_name_hash(struct index_state *istate);
 #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
 #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
 #define read_blob_data_from_cache(path, sz) read_blob_data_from_index(&the_index, (path), (sz))
+#define get_sha1_from_cache(path)  get_sha1_from_index (&the_index, (path))
 #endif
 
 enum object_type {
@@ -1089,6 +1090,8 @@ static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *
 	return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
 }
 
+const unsigned char *get_sha1_from_index(struct index_state *istate, const char *path);
+
 /*
  * This internal function is only declared here for the benefit of
  * lookup_replace_object().  Please do not call it directly.
diff --git a/read-cache.c b/read-cache.c
index 38d67fa..5a1df14 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2290,13 +2290,27 @@ int index_name_is_other(const struct index_state *istate, const char *name,
 
 void *read_blob_data_from_index(struct index_state *istate, const char *path, unsigned long *size)
 {
-	int pos, len;
+	const unsigned char *sha1;
 	unsigned long sz;
 	enum object_type type;
 	void *data;
 
-	len = strlen(path);
-	pos = index_name_pos(istate, path, len);
+	sha1 = get_sha1_from_index(istate, path);
+	if (!sha1)
+		return NULL;
+	data = read_sha1_file(sha1, &type, &sz);
+	if (!data || type != OBJ_BLOB) {
+		free(data);
+		return NULL;
+	}
+	if (size)
+		*size = sz;
+	return data;
+}
+
+const unsigned char *get_sha1_from_index(struct index_state *istate, const char *path)
+{
+	int pos = index_name_pos(istate, path, strlen(path));
 	if (pos < 0) {
 		/*
 		 * We might be in the middle of a merge, in which
@@ -2312,14 +2326,7 @@ void *read_blob_data_from_index(struct index_state *istate, const char *path, un
 	}
 	if (pos < 0)
 		return NULL;
-	data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-	if (!data || type != OBJ_BLOB) {
-		free(data);
-		return NULL;
-	}
-	if (size)
-		*size = sz;
-	return data;
+	return istate->cache[pos]->oid.hash;
 }
 
 void stat_validity_clear(struct stat_validity *sv)
-- 
2.10.0



* [PATCH v2 2/2] convert.c: stream and fast search for binary
  2016-10-09  9:56 [PATCH v1 0/2] convert: stream and early out tboegi
                   ` (2 preceding siblings ...)
  2016-10-12 13:47 ` [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper tboegi
@ 2016-10-12 13:47 ` tboegi
  2016-10-27 21:18   ` Junio C Hamano
  2016-10-29 12:13   ` Duy Nguyen
  3 siblings, 2 replies; 11+ messages in thread
From: tboegi @ 2016-10-12 13:47 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When statistics are gathered for the autocrlf handling, the search in
the content can be stopped early, e.g. if
- a search for binary is done, and a NUL character is found
- a search for CRLF is done, and the first CRLF is found.

Similarly, when statistics for binary vs non-binary are gathered:
whenever a lone CR or NUL is found, the search can be aborted.

When checking out files in "auto" mode, any file that has a "lone CR"
or a CRLF will not be converted, so the search can be aborted early.

Add the new bit, CONVERT_STAT_BITS_ANY_CR,
which is set for either lone CR or CRLF.

Many binary files have a NUL very early and it is often not necessary
to load the whole content of a file or blob into memory.

Split gather_stats() into gather_all_stats() and gather_stats_partly()
to do a streaming handling for blobs and files in the worktree.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 convert.c | 191 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 129 insertions(+), 62 deletions(-)

diff --git a/convert.c b/convert.c
index 077f5e6..2396fe5 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "streaming.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -13,10 +14,12 @@
  * translation when the "text" attribute or "auto_crlf" option is set.
  */
 
-/* Stat bits: When BIN is set, the txt bits are unset */
 #define CONVERT_STAT_BITS_TXT_LF    0x1
 #define CONVERT_STAT_BITS_TXT_CRLF  0x2
 #define CONVERT_STAT_BITS_BIN       0x4
+#define CONVERT_STAT_BITS_ANY_CR    0x8
+
+#define STREAM_BUFFER_SIZE (1024*16)
 
 enum crlf_action {
 	CRLF_UNDEFINED,
@@ -31,30 +34,36 @@ enum crlf_action {
 
 struct text_stat {
 	/* NUL, CR, LF and CRLF counts */
-	unsigned nul, lonecr, lonelf, crlf;
+	unsigned stat_bits, lonecr, lonelf, crlf;
 
 	/* These are just approximations! */
 	unsigned printable, nonprintable;
 };
 
-static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
+static void gather_stats_partly(const char *buf, unsigned long size,
+				struct text_stat *stats, unsigned search_only)
 {
 	unsigned long i;
 
-	memset(stats, 0, sizeof(*stats));
-
+	if (!buf || !size)
+		return;
 	for (i = 0; i < size; i++) {
 		unsigned char c = buf[i];
 		if (c == '\r') {
+			stats->stat_bits |= CONVERT_STAT_BITS_ANY_CR;
 			if (i+1 < size && buf[i+1] == '\n') {
 				stats->crlf++;
 				i++;
-			} else
+				stats->stat_bits |= CONVERT_STAT_BITS_TXT_CRLF;
+			} else {
 				stats->lonecr++;
+				stats->stat_bits |= CONVERT_STAT_BITS_BIN;
+			}
 			continue;
 		}
 		if (c == '\n') {
 			stats->lonelf++;
+			stats->stat_bits |= CONVERT_STAT_BITS_TXT_LF;
 			continue;
 		}
 		if (c == 127)
@@ -67,7 +76,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 				stats->printable++;
 				break;
 			case 0:
-				stats->nul++;
+				stats->stat_bits |= CONVERT_STAT_BITS_BIN;
 				/* fall through */
 			default:
 				stats->nonprintable++;
@@ -75,6 +84,8 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 		}
 		else
 			stats->printable++;
+		if (stats->stat_bits & search_only)
+			break; /* We found what we have been searching for */
 	}
 
 	/* If file ends with EOF then don't count this EOF as non-printable. */
@@ -86,41 +97,62 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
  * The same heuristics as diff.c::mmfile_is_binary()
  * We treat files with bare CR as binary
  */
-static int convert_is_binary(unsigned long size, const struct text_stat *stats)
+static void convert_nonprintable(struct text_stat *stats)
 {
-	if (stats->lonecr)
-		return 1;
-	if (stats->nul)
-		return 1;
 	if ((stats->printable >> 7) < stats->nonprintable)
-		return 1;
-	return 0;
+		stats->stat_bits |= CONVERT_STAT_BITS_BIN;
 }
 
-static unsigned int gather_convert_stats(const char *data, unsigned long size)
+static void gather_all_stats(const char *buf, unsigned long size,
+			 struct text_stat *stats, unsigned search_only)
 {
+	memset(stats, 0, sizeof(*stats));
+	gather_stats_partly(buf, size, stats, search_only);
+	convert_nonprintable(stats);
+}
+
+
+static unsigned get_convert_stats_sha1(unsigned const char *sha1,
+				       unsigned search_only)
+{
+	struct git_istream *st;
 	struct text_stat stats;
-	int ret = 0;
-	if (!data || !size)
-		return 0;
-	gather_stats(data, size, &stats);
-	if (convert_is_binary(size, &stats))
-		ret |= CONVERT_STAT_BITS_BIN;
-	if (stats.crlf)
-		ret |= CONVERT_STAT_BITS_TXT_CRLF;
-	if (stats.lonelf)
-		ret |=  CONVERT_STAT_BITS_TXT_LF;
+	enum object_type type;
+	unsigned long sz;
 
-	return ret;
+	if (!sha1)
+		return 0;
+	memset(&stats, 0, sizeof(stats));
+	st = open_istream(sha1, &type, &sz, NULL);
+	if (!st) {
+		return 0;
+	}
+	if (type != OBJ_BLOB)
+		goto close_and_exit_i;
+	for (;;) {
+		char buf[STREAM_BUFFER_SIZE];
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+		if (readlen < 0)
+			break;
+		if (!readlen)
+			break;
+		gather_stats_partly(buf, (unsigned long)readlen, &stats, search_only);
+		if (stats.stat_bits & search_only)
+			break; /* We found what we have been searching for */
+	}
+close_and_exit_i:
+	close_istream(st);
+	convert_nonprintable(&stats);
+	return stats.stat_bits;
 }
 
-static const char *gather_convert_stats_ascii(const char *data, unsigned long size)
+static const char *convert_stats_ascii(unsigned convert_stats)
 {
-	unsigned int convert_stats = gather_convert_stats(data, size);
-
+	const unsigned eol_bits = CONVERT_STAT_BITS_TXT_LF |
+		CONVERT_STAT_BITS_TXT_CRLF;
 	if (convert_stats & CONVERT_STAT_BITS_BIN)
 		return "-text";
-	switch (convert_stats) {
+	switch (convert_stats & eol_bits) {
 	case CONVERT_STAT_BITS_TXT_LF:
 		return "lf";
 	case CONVERT_STAT_BITS_TXT_CRLF:
@@ -132,24 +164,45 @@ static const char *gather_convert_stats_ascii(const char *data, unsigned long si
 	}
 }
 
+static unsigned get_convert_stats_wt(const char *path)
+{
+	struct text_stat stats;
+	unsigned search_only = CONVERT_STAT_BITS_BIN;
+	int fd;
+	memset(&stats, 0, sizeof(stats));
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+	for (;;) {
+		char buf[STREAM_BUFFER_SIZE];
+		ssize_t readlen = read(fd, buf, sizeof(buf));
+		if (readlen < 0)
+			break;
+		if (!readlen)
+			break;
+		gather_stats_partly(buf, (unsigned long)readlen, &stats, search_only);
+		if (stats.stat_bits & search_only)
+			break; /* We found what we have been searching for */
+	}
+	close(fd);
+	convert_nonprintable(&stats);
+	return stats.stat_bits;
+}
+
 const char *get_cached_convert_stats_ascii(const char *path)
 {
-	const char *ret;
-	unsigned long sz;
-	void *data = read_blob_data_from_cache(path, &sz);
-	ret = gather_convert_stats_ascii(data, sz);
-	free(data);
-	return ret;
+	unsigned convert_stats;
+	unsigned search_only = CONVERT_STAT_BITS_BIN;
+	convert_stats = get_convert_stats_sha1(get_sha1_from_cache(path),
+					       search_only);
+	return convert_stats_ascii(convert_stats);
 }
 
 const char *get_wt_convert_stats_ascii(const char *path)
 {
-	const char *ret = "";
-	struct strbuf sb = STRBUF_INIT;
-	if (strbuf_read_file(&sb, path, 0) >= 0)
-		ret = gather_convert_stats_ascii(sb.buf, sb.len);
-	strbuf_release(&sb);
-	return ret;
+	unsigned convert_stats;
+	convert_stats = get_convert_stats_wt(path);
+	return convert_stats_ascii(convert_stats);
 }
 
 static int text_eol_is_crlf(void)
@@ -213,16 +266,10 @@ static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
 
 static int has_cr_in_index(const char *path)
 {
-	unsigned long sz;
-	void *data;
-	int has_cr;
-
-	data = read_blob_data_from_cache(path, &sz);
-	if (!data)
-		return 0;
-	has_cr = memchr(data, '\r', sz) != NULL;
-	free(data);
-	return has_cr;
+	unsigned convert_stats;
+	convert_stats = get_convert_stats_sha1(get_sha1_from_cache(path),
+					       CONVERT_STAT_BITS_ANY_CR);
+	return convert_stats & CONVERT_STAT_BITS_ANY_CR;
 }
 
 static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
@@ -234,13 +281,13 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 	if (!stats->lonelf)
 		return 0;
 
-	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_INPUT || crlf_action == CRLF_AUTO_CRLF) {
+	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_CRLF) {
 		/* If we have any CR or CRLF line endings, we do not touch it */
 		/* This is the new safer autocrlf-handling */
 		if (stats->lonecr || stats->crlf)
 			return 0;
 
-		if (convert_is_binary(len, stats))
+		if (stats->stat_bits & CONVERT_STAT_BITS_BIN)
 			return 0;
 	}
 	return 1;
@@ -253,7 +300,8 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 {
 	struct text_stat stats;
 	char *dst;
-	int convert_crlf_into_lf;
+	int has_crlf_to_convert;
+	unsigned search_only = 0;
 
 	if (crlf_action == CRLF_BINARY ||
 	    (src && !len))
@@ -266,12 +314,16 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 	if (!buf && !src)
 		return 1;
 
-	gather_stats(src, len, &stats);
+	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_INPUT || crlf_action == CRLF_AUTO_CRLF)
+		search_only = CONVERT_STAT_BITS_BIN;
+
+	gather_all_stats(src, len, &stats, search_only);
+
 	/* Optimization: No CRLF? Nothing to convert, regardless. */
-	convert_crlf_into_lf = !!stats.crlf;
+	has_crlf_to_convert = !!stats.crlf;
 
 	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_INPUT || crlf_action == CRLF_AUTO_CRLF) {
-		if (convert_is_binary(len, &stats))
+		if (stats.stat_bits & CONVERT_STAT_BITS_BIN)
 			return 0;
 		/*
 		 * If the file in the index has any CR in it, do not convert.
@@ -280,24 +332,35 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 		if (checksafe == SAFE_CRLF_RENORMALIZE)
 			checksafe = SAFE_CRLF_FALSE;
 		else if (has_cr_in_index(path))
-			convert_crlf_into_lf = 0;
+			has_crlf_to_convert = 0;
 	}
 	if (checksafe && len) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
-		if (convert_crlf_into_lf) {
+		if (has_crlf_to_convert) {
 			new_stats.lonelf += new_stats.crlf;
 			new_stats.crlf = 0;
+			/* all crlf, if any, are gone. Update the bits */
+			new_stats.stat_bits = stats.stat_bits & CONVERT_STAT_BITS_BIN;
+			if (new_stats.lonelf)
+				new_stats.stat_bits |= CONVERT_STAT_BITS_TXT_LF;
+			if (new_stats.lonecr)
+				new_stats.stat_bits |= CONVERT_STAT_BITS_ANY_CR;
 		}
 		/* simulate "git checkout" */
 		if (will_convert_lf_to_crlf(len, &new_stats, crlf_action)) {
 			new_stats.crlf += new_stats.lonelf;
 			new_stats.lonelf = 0;
+			new_stats.stat_bits = stats.stat_bits & CONVERT_STAT_BITS_BIN;
+			if (new_stats.crlf)
+				new_stats.stat_bits |= CONVERT_STAT_BITS_TXT_CRLF | CONVERT_STAT_BITS_ANY_CR;
+			if (new_stats.lonecr)
+				new_stats.stat_bits |= CONVERT_STAT_BITS_ANY_CR;
 		}
 		check_safe_crlf(path, crlf_action, &stats, &new_stats, checksafe);
 	}
-	if (!convert_crlf_into_lf)
+	if (!has_crlf_to_convert)
 		return 0;
 
 	/*
@@ -338,11 +401,15 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
 {
 	char *to_free = NULL;
 	struct text_stat stats;
+	unsigned search_only = 0;
 
 	if (!len || output_eol(crlf_action) != EOL_CRLF)
 		return 0;
 
-	gather_stats(src, len, &stats);
+	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_CRLF)
+		search_only = CONVERT_STAT_BITS_ANY_CR | CONVERT_STAT_BITS_BIN;
+
+	gather_all_stats(src, len, &stats, search_only);
 	if (!will_convert_lf_to_crlf(len, &stats, crlf_action))
 		return 0;
 
-- 
2.10.0



* Re: [PATCH v2 0/2] Stream and fast search
  2016-10-12 13:47 ` [PATCH v2 0/2] Stream and fast search tboegi
@ 2016-10-27 17:02   ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2016-10-27 17:02 UTC (permalink / raw)
  To: git; +Cc: tboegi, Lars Schneider, Duy Nguyen

Cc'ed those who touched convert.c or read-cache.c in our relatively
recent past with a change that affects the eol conversion codepath.

Do any of you (or others on the list) have the time and inclination
to review this series?

Thanks.


* Re: [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper
  2016-10-12 13:47 ` [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper tboegi
@ 2016-10-27 19:57   ` Junio C Hamano
  2016-10-29 12:22   ` Duy Nguyen
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2016-10-27 19:57 UTC (permalink / raw)
  To: tboegi; +Cc: git

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
>
> Factor out the retrieval of the sha1 for a given path in
> read_blob_data_from_index() into the function get_sha1_from_index().
>
> This will be used in the next commit, when convert.c can do the
> analysis for "text=auto" without slurping the whole blob into memory
> at once.
>
> Add a wrapper definition get_sha1_from_cache().
>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  cache.h      |  3 +++
>  read-cache.c | 29 ++++++++++++++++++-----------
>  2 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 1604e29..04de209 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -380,6 +380,7 @@ extern void free_name_hash(struct index_state *istate);
>  #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
>  #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
>  #define read_blob_data_from_cache(path, sz) read_blob_data_from_index(&the_index, (path), (sz))
> +#define get_sha1_from_cache(path)  get_sha1_from_index (&the_index, (path))

Should have caught this earlier, but there is an extra SP after "from_index"
which I'll remove (the topic is not in 'next' yet, lucky us).

I re-read this to ensure that it does not break read_blob_data_from_index(),
from which the new function borrows its logic.

Thanks.


* Re: [PATCH v2 2/2] convert.c: stream and fast search for binary
  2016-10-12 13:47 ` [PATCH v2 2/2] convert.c: stream and fast search for binary tboegi
@ 2016-10-27 21:18   ` Junio C Hamano
  2016-11-01  9:36     ` Torsten Bögershausen
  2016-10-29 12:13   ` Duy Nguyen
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2016-10-27 21:18 UTC (permalink / raw)
  To: tboegi; +Cc: git

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
>
> When statistics are gathered for the autocrlf handling, the search in
> the content can be stopped early, e.g. if
> - a search for binary is done, and a NUL character is found
> - a search for CRLF is done, and the first CRLF is found.
>
> Similarly, when statistics for binary vs non-binary are gathered:
> whenever a lone CR or NUL is found, the search can be aborted.
>
> When checking out files in "auto" mode, any file that has a "lone CR"
> or a CRLF will not be converted, so the search can be aborted early.
>
> Add the new bit, CONVERT_STAT_BITS_ANY_CR,
> which is set for either lone CR or CRLF.
>
> Many binary files have a NUL very early and it is often not necessary
> to load the whole content of a file or blob into memory.
>
> Split gather_stats() into gather_all_stats() and gather_stats_partly()
> to do a streaming handling for blobs and files in the worktree.
>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---

I'll try to review in reverse order, as this seems to be doing too
many things at once and I cannot get my head around it without going
top down.

> @@ -338,11 +401,15 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
>  {
>  	char *to_free = NULL;
>  	struct text_stat stats;
> +	unsigned search_only = 0;
>  
>  	if (!len || output_eol(crlf_action) != EOL_CRLF)
>  		return 0;
>  
> -	gather_stats(src, len, &stats);
> +	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_CRLF)
> +		search_only = CONVERT_STAT_BITS_ANY_CR | CONVERT_STAT_BITS_BIN;
> +
> +	gather_all_stats(src, len, &stats, search_only);
>  	if (!will_convert_lf_to_crlf(len, &stats, crlf_action))
>  		return 0;

This special case to decide whether we would limit the search_only
flag has too much intimate knowledge of what happens inside
will_convert_lf_to_crlf().  It knows that output_eol(crlf_action)
not being EOL_CRLF is the very first thing the function checks, too.

Makes one wonder if the check for output_eol(crlf_action) can be
removed from will_convert_lf_to_crlf(), no?  It is not apparent if
that is a good idea for the other caller in crlf_to_git().

gather_all_stats() will give up immediately when it sees either
ANY_CR or BIN.  If a CR appears before we see any BIN, stat_bits would
not have BITS_BIN even if the buffer may have a BIN byte later.  It is
OK because either lonecr or crlf would be non-zero, and
will_convert_lf_to_crlf() would return 0.  If BIN appears before we
see any CR, neither lonecr nor crlf will become non-zero even if the
buffer may have a CR byte later, but again it is OK because
will_convert_lf_to_crlf() will return 0 in that case.

This looks too brittle, even though it is correct.
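
To make the two cases above concrete, here is a minimal standalone
sketch.  It is not code from the patch: mini_stat, mini_gather() and
mini_will_convert() are hypothetical stand-ins that keep only the
counters and bits the argument needs, and mini_will_convert() assumes
the "auto" case.  Unlike the patch, whose loop reaches the search_only
check only for bytes that are neither CR nor LF, the sketch tests
after every byte.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values as in the patch; everything else is a simplified stand-in. */
#define STAT_BITS_TXT_LF   0x1
#define STAT_BITS_TXT_CRLF 0x2
#define STAT_BITS_BIN      0x4
#define STAT_BITS_ANY_CR   0x8

struct mini_stat {
	unsigned stat_bits, lonecr, lonelf, crlf;
};

/* Scan buf, stopping as soon as any bit in search_only has been set. */
static void mini_gather(const char *buf, size_t size,
			struct mini_stat *st, unsigned search_only)
{
	size_t i;

	st->stat_bits = st->lonecr = st->lonelf = st->crlf = 0;
	for (i = 0; i < size; i++) {
		unsigned char c = buf[i];
		if (c == '\r') {
			st->stat_bits |= STAT_BITS_ANY_CR;
			if (i + 1 < size && buf[i + 1] == '\n') {
				st->crlf++;
				i++;
				st->stat_bits |= STAT_BITS_TXT_CRLF;
			} else {
				/* a lone CR marks the content as binary */
				st->lonecr++;
				st->stat_bits |= STAT_BITS_BIN;
			}
		} else if (c == '\n') {
			st->lonelf++;
			st->stat_bits |= STAT_BITS_TXT_LF;
		} else if (c == '\0') {
			st->stat_bits |= STAT_BITS_BIN;
		}
		if (st->stat_bits & search_only)
			break; /* found what we were searching for */
	}
}

/* Simplified "auto" decision: 1 means LF would be converted to CRLF. */
static int mini_will_convert(const struct mini_stat *st)
{
	if (!st->lonelf)
		return 0;
	if (st->lonecr || st->crlf)
		return 0;
	if (st->stat_bits & STAT_BITS_BIN)
		return 0;
	return 1;
}
```

With search_only set to ANY_CR | BIN, a buffer whose CRLF precedes a
NUL stops with crlf == 1 and BIN unset; a buffer whose NUL precedes a
CRLF stops with BIN set and crlf == 0.  Either way the decision is
"do not convert", which is the same answer a full scan would give.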

> @@ -253,7 +300,8 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
>  {
>  	struct text_stat stats;
>  	char *dst;
> -	int convert_crlf_into_lf;
> +	int has_crlf_to_convert;
> +	unsigned search_only = 0;
>  
>  	if (crlf_action == CRLF_BINARY ||
>  	    (src && !len))
> @@ -266,12 +314,16 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
>  	if (!buf && !src)
>  		return 1;
>  
> -	gather_stats(src, len, &stats);
> +	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_INPUT || crlf_action == CRLF_AUTO_CRLF)
> +		search_only = CONVERT_STAT_BITS_BIN;
> +
> +	gather_all_stats(src, len, &stats, search_only);
> +
>  	/* Optimization: No CRLF? Nothing to convert, regardless. */
> -	convert_crlf_into_lf = !!stats.crlf;
> +	has_crlf_to_convert = !!stats.crlf;

The comment here may need to say a lot more, now that we do not even
count .crlf in some cases because of the "search_only" setting.

>  	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_AUTO_INPUT || crlf_action == CRLF_AUTO_CRLF) {

The new "search_only" criteria above was added to match this if
block; it is not as bad as the previous one in crlf_to_worktree()
that knows, and must be kept in sync with, what a separate function
will_convert_lf_to_crlf() does, but still it is horrible for both
maintainability and readability.  Can you devise some mechanism to
ensure that these two if statements will stay in sync?
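
One possible mechanism, shown only as a sketch: put the test into a
single predicate and derive the search_only value from it, so that
the two sites cannot drift apart.  The helper names is_auto_action()
and to_git_search_only() are hypothetical and do not exist in
convert.c, and the enum below is an abbreviated stand-in listing only
the members that appear in this thread.

```c
#include <assert.h>

/* Abbreviated stand-in for the enum in convert.c (assumption). */
enum crlf_action {
	CRLF_UNDEFINED,
	CRLF_BINARY,
	CRLF_AUTO,
	CRLF_AUTO_INPUT,
	CRLF_AUTO_CRLF
};

#define CONVERT_STAT_BITS_BIN 0x4

/* Hypothetical helper: the only place that names the "auto" actions. */
static int is_auto_action(enum crlf_action action)
{
	return action == CRLF_AUTO ||
	       action == CRLF_AUTO_INPUT ||
	       action == CRLF_AUTO_CRLF;
}

/* Hypothetical helper: bits that may end the scan early in crlf_to_git(). */
static unsigned to_git_search_only(enum crlf_action action)
{
	return is_auto_action(action) ? CONVERT_STAT_BITS_BIN : 0;
}
```

crlf_to_git() could then call to_git_search_only(crlf_action) before
gathering the stats and test is_auto_action(crlf_action) in the later
if block, so a change to the set of "auto" actions is made once.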

> -		if (convert_is_binary(len, &stats))
> +		if (stats.stat_bits & CONVERT_STAT_BITS_BIN)
>  			return 0;

We no longer need the helper function convert_is_binary() and
instead need only the CONVERT_STAT_BITS_BIN bit, so the control flow
that reaches this point is obviously correct (assuming that
gather_all_stats() that is limited with the search_only option counts
things correctly, that is).

But what happens when we don't return here?  We didn't get the full
stats out of gather_all_stats() and continue.  Let's see what
happens in that case...

>  		/*
>  		 * If the file in the index has any CR in it, do not convert.
> @@ -280,24 +332,35 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
>  		if (checksafe == SAFE_CRLF_RENORMALIZE)
>  			checksafe = SAFE_CRLF_FALSE;
>  		else if (has_cr_in_index(path))
> -			convert_crlf_into_lf = 0;
> +			has_crlf_to_convert = 0;
>  	}

So at this point, we cannot trust what is in "stats" (we may have
come out of the above if() statement because it wasn't binary after
all).  It is unclear to me if we may or may not be able to trust
has_crlf_to_convert at this point.  If crlf_action was one of the
three magic values that caused search_only for BITS_BIN set, then
stats.crlf may or may not have seen CRLF---if a NUL came before any
CRLF, gather_all_stats() would have returned without seeing a CRLF
that exists, and otherwise it may have seen one and counted, so
has_crlf_to_convert that is set immediately after gather_all_stats()
returned cannot be trusted at all, when BITS_BIN was set in the
result.

What saves this codepath is that we would have returned at this
point if BITS_BIN was set in the result, so stats.crlf immediately
after gather_all_stats() returned can be trusted in that case, which
in turn means has_crlf_to_convert can also be trusted here.  Whew.

I hate to say this, and it certainly is not the fault of this patch,
but the result of applying this patch is undecipherable without a
great effort and is too brittle.  A reviewer or any future developer
who has to touch this codepath should not be forced to do this kind
of analysis.  Either the code should make everything I wrote above
clear by itself, or more in-code comment must talk about these
things.

Anyway, let's say we established that has_crlf_to_convert and
checksafe can be trusted at this point in the control flow, and
let's keep reading.

>  	if (checksafe && len) {
>  		struct text_stat new_stats;
>  		memcpy(&new_stats, &stats, sizeof(new_stats));
>  		/* simulate "git add" */
> -		if (convert_crlf_into_lf) {
> +		if (has_crlf_to_convert) {
>  			new_stats.lonelf += new_stats.crlf;
>  			new_stats.crlf = 0;
> +			/* all crlf, if any, are gone. Update the bits */
> +			new_stats.stat_bits = stats.stat_bits & CONVERT_STAT_BITS_BIN;
> +			if (new_stats.lonelf)
> +				new_stats.stat_bits |= CONVERT_STAT_BITS_TXT_LF;
> +			if (new_stats.lonecr)
> +				new_stats.stat_bits |= CONVERT_STAT_BITS_ANY_CR;

What's happening here?  stats.crlf and stats.lonelf are both
trustworthy, because even when the BITS_BIN optimization was requested
in the call to gather_all_stats(), we wouldn't come here if the
optimization actually kicked in.  If we convert crlf to lf, the
result would have more lonelf than the original by the number of
crlf to be converted, and the result would have no crlf, and
new_stats is adjusted to pretend as if we had run the stat over
the buffer after conversion.  You further adjust .stat_bits, which
was not necessary in the old code.  The new comment says "Update
the bits" but what it should make clear is why the new code cares
about the bits when the old code didn't.  ANY_CR is the new thing
and is understandable, but the old code didn't flip the TXT_LF bit.
Is it a bugfix for the old code, which should have done so even
without your "find partially and return early" optimization?  If so,
that should have been a separate patch to be understandable.

I am guessing that this is an attempt to future-proof the contents
of new_stats, so that will_convert_lf_to_crlf() can take a short-cut
by looking at these bits, even though currently it does not look at
any bit other than BITS_BIN.  If that is the case, that needs to be
told to the reader.  "Update the bits" is something any reader can
see.  You need to tell them why you are updating the bits.

>  		}
>  		/* simulate "git checkout" */
>  		if (will_convert_lf_to_crlf(len, &new_stats, crlf_action)) {
>  			new_stats.crlf += new_stats.lonelf;
>  			new_stats.lonelf = 0;
> +			new_stats.stat_bits = stats.stat_bits & CONVERT_STAT_BITS_BIN;
> +			if (new_stats.crlf)
> +				new_stats.stat_bits |= CONVERT_STAT_BITS_TXT_CRLF | CONVERT_STAT_BITS_ANY_CR;
> +			if (new_stats.lonecr)
> +				new_stats.stat_bits |= CONVERT_STAT_BITS_ANY_CR;
>  		}
>  		check_safe_crlf(path, crlf_action, &stats, &new_stats, checksafe);

Likewise.  Is this future-proofing new_stats to allow check_safe_crlf()
to use them in the future?

> @@ -86,41 +97,62 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
>   * The same heuristics as diff.c::mmfile_is_binary()
>   * We treat files with bare CR as binary
>   */
> -static int convert_is_binary(unsigned long size, const struct text_stat *stats)
> +static void convert_nonprintable(struct text_stat *stats)
>  {
> -	if (stats->lonecr)
> -		return 1;
> -	if (stats->nul)
> -		return 1;
>  	if ((stats->printable >> 7) < stats->nonprintable)
> -		return 1;
> -	return 0;
> +		stats->stat_bits |= CONVERT_STAT_BITS_BIN;
>  }

When search_only is set to BIN, stat_bits would have BIN if we saw
any non-printable control byte, so calling convert_nonprintable() at
the end of gather_all_stats() to flip the BIN bit on with
printable/nonprintable ratio has no effect in that case.  If we
didn't see a non-printable control byte, gather_stats_partly()
wouldn't have set BIN in the result but then we know the count of
printable and nonprintable are trustworthy, so comparison to flip
the bit makes sense.

The presence of lonecr does not matter, as that would have flipped
BIN, too.  The check for NUL has likewise become unnecessary.

I think the end result is likely to be correct, but with too
many things going on at once, I cannot be confident in saying so.

This probably should be done as four more patches to become
reviewable.

 - One to use the CONVERT_STAT_BITS a lot more for the conversion
   decision than before, 

 - another to allow the caller to tell gather_stats() to give up
   early with the "search_only" bits, 

 - another to update the get_*_convert_stats() functions to use
   get_convert_stats_sha1(), and then finally 

 - use the streaming interface when reading from blob and file.

or something like that.

Thanks.


* Re: [PATCH v2 2/2] convert.c: stream and fast search for binary
  2016-10-12 13:47 ` [PATCH v2 2/2] convert.c: stream and fast search for binary tboegi
  2016-10-27 21:18   ` Junio C Hamano
@ 2016-10-29 12:13   ` Duy Nguyen
  1 sibling, 0 replies; 11+ messages in thread
From: Duy Nguyen @ 2016-10-29 12:13 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Git Mailing List

On Wed, Oct 12, 2016 at 8:47 PM,  <tboegi@web.de> wrote:
> From: Torsten Bögershausen <tboegi@web.de>
>
> When statistics are gathered for the autocrlf handling, the search in
> the content can be stopped early, e.g. when
> - a search for binary is done, and a NUL character is found
> - a search for CRLF is done, and the first CRLF is found.
>
> Similarly, when statistics for binary vs non-binary are gathered:
> whenever a lone CR or NUL is found, the search can be aborted.
>
> When checking out files in "auto" mode, any file that has a "lone CR"
> or a CRLF will not be converted, so the search can be aborted early.
>
> Add the new bit, CONVERT_STAT_BITS_ANY_CR,
> which is set for either lone CR or CRLF.
>
> Many binary files have a NUL very early and it is often not necessary
> to load the whole content of a file or blob into memory.
>
> Split gather_stats() into gather_all_stats() and gather_stats_partly()
> to do a streaming handling for blobs and files in the worktree.
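
The early-out idea in the commit message quoted above could look roughly like the sketch below: content is fed in chunks, and scanning stops as soon as any bit the caller asked for is set.  All sk_ names, the bit values and the chunked driver are assumptions for illustration; they are not the functions from the patch, and the real code would read chunks from a blob stream or a file rather than from one buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values are assumed for this sketch. */
#define SK_BITS_TXT_LF   0x1u
#define SK_BITS_TXT_CRLF 0x2u
#define SK_BITS_BIN      0x4u
#define SK_BITS_ANY_CR   0x8u

struct sk_state {
	unsigned stat_bits;
	int pending_cr;	/* previous byte was CR, possibly in an earlier chunk */
};

/* Feed one chunk; returns 1 as soon as a bit in search_only is set. */
static int sk_feed(struct sk_state *st, const char *buf, size_t len,
		   unsigned search_only)
{
	size_t i;

	assert(st != NULL);
	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];
		int after_cr = st->pending_cr;

		st->pending_cr = 0;
		if (after_cr && c == '\n') {
			st->stat_bits |= SK_BITS_TXT_CRLF;
		} else {
			if (after_cr)	/* the previous CR was a lone CR */
				st->stat_bits |= SK_BITS_BIN;
			if (c == '\r') {
				st->pending_cr = 1;
				st->stat_bits |= SK_BITS_ANY_CR;
			} else if (c == '\n') {
				st->stat_bits |= SK_BITS_TXT_LF;
			} else if (c == '\0') {
				st->stat_bits |= SK_BITS_BIN;
			}
		}
		if (st->stat_bits & search_only)
			return 1;	/* early out */
	}
	return 0;
}

/* A CR at the very end of the content is a lone CR. */
static void sk_finish(struct sk_state *st)
{
	if (st->pending_cr) {
		st->pending_cr = 0;
		st->stat_bits |= SK_BITS_BIN;
	}
}

/* Walk a buffer in fixed-size chunks; *fed counts bytes handed over. */
static unsigned sk_scan_chunks(const char *data, size_t len, size_t chunk,
			       unsigned search_only, size_t *fed)
{
	struct sk_state st = { 0, 0 };
	size_t off = 0;

	assert(chunk > 0 && fed != NULL);
	*fed = 0;
	while (off < len) {
		size_t n = len - off < chunk ? len - off : chunk;

		*fed += n;
		if (sk_feed(&st, data + off, n, search_only))
			return st.stat_bits;
		off += n;
	}
	sk_finish(&st);
	return st.stat_bits;
}
```

The pending_cr flag is the one piece of state that has to survive a chunk boundary, so that a CRLF split across two chunks is not mistaken for a lone CR.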

Maybe break this commit down a bit? gather_all_stats() and
gather_stats_partly() seem independent and can stand alone. So can the
blob streaming and get_convert_stats_wt.
-- 
Duy


* Re: [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper
  2016-10-12 13:47 ` [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper tboegi
  2016-10-27 19:57   ` Junio C Hamano
@ 2016-10-29 12:22   ` Duy Nguyen
  1 sibling, 0 replies; 11+ messages in thread
From: Duy Nguyen @ 2016-10-29 12:22 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Git Mailing List

On Wed, Oct 12, 2016 at 8:47 PM,  <tboegi@web.de> wrote:
> From: Torsten Bögershausen <tboegi@web.de>
>
> Factor out the retrieval of the sha1 for a given path in
> read_blob_data_from_index() into the function get_sha1_from_index().
>
> This will be used in the next commit, when convert.c can do the
> analysis for "text=auto" without slurping the whole blob into memory
> at once.
>
> Add a wrapper definition get_sha1_from_cache().
>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  cache.h      |  3 +++
>  read-cache.c | 29 ++++++++++++++++++-----------
>  2 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 1604e29..04de209 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -380,6 +380,7 @@ extern void free_name_hash(struct index_state *istate);
>  #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
>  #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
>  #define read_blob_data_from_cache(path, sz) read_blob_data_from_index(&the_index, (path), (sz))
> +#define get_sha1_from_cache(path)  get_sha1_from_index (&the_index, (path))
>  #endif
>
>  enum object_type {
> @@ -1089,6 +1090,8 @@ static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *
>         return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
>  }
>
> +const unsigned char *get_sha1_from_index(struct index_state *istate, const char *path);
> +
>  /*
>   * This internal function is only declared here for the benefit of
>   * lookup_replace_object().  Please do not call it directly.
> diff --git a/read-cache.c b/read-cache.c
> index 38d67fa..5a1df14 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2290,13 +2290,27 @@ int index_name_is_other(const struct index_state *istate, const char *name,
>
>  void *read_blob_data_from_index(struct index_state *istate, const char *path, unsigned long *size)
>  {
> -       int pos, len;
> +       const unsigned char *sha1;
>         unsigned long sz;
>         enum object_type type;
>         void *data;
>
> -       len = strlen(path);
> -       pos = index_name_pos(istate, path, len);
> +       sha1 = get_sha1_from_index(istate, path);
> +       if (!sha1)
> +               return NULL;
> +       data = read_sha1_file(sha1, &type, &sz);
> +       if (!data || type != OBJ_BLOB) {
> +               free(data);
> +               return NULL;
> +       }
> +       if (size)
> +               *size = sz;
> +       return data;
> +}
> +
> +const unsigned char *get_sha1_from_index(struct index_state *istate, const char *path)

Let's try to embrace struct object_id to make our lives easier when
the time comes to convert to a new hash algorithm by returning struct
object_id * here instead of the internal hash.
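
A rough sketch of what that could look like, using a toy index so that it compiles on its own.  Everything prefixed sk_ is hypothetical, including the helper name and the linear lookup that stands in for index_name_pos(); the unmerged-entry handling from the real function is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_RAWSZ 20	/* SHA-1 size; an assumption of this sketch */

/* Toy stand-ins for struct object_id, cache_entry and index_state. */
struct sk_object_id {
	unsigned char hash[SK_RAWSZ];
};

struct sk_cache_entry {
	const char *name;
	struct sk_object_id oid;
};

struct sk_index_state {
	struct sk_cache_entry *cache;
	size_t cache_nr;
};

/*
 * Return a pointer to the entry's object id instead of its raw hash
 * bytes, so callers never depend on the hash layout or length.
 */
static const struct sk_object_id *
sk_get_oid_from_index(const struct sk_index_state *istate, const char *path)
{
	size_t i;

	assert(istate != NULL && path != NULL);
	for (i = 0; i < istate->cache_nr; i++)
		if (!strcmp(istate->cache[i].name, path))
			return &istate->cache[i].oid;
	return NULL;	/* path is not in the index */
}
```

A caller that still needs raw bytes for an older interface would reach through the returned pointer's hash member at the call site, which keeps the hash-specific code in one place.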

> +{
> +       int pos = index_name_pos(istate, path, strlen(path));
>         if (pos < 0) {
>                 /*
>                  * We might be in the middle of a merge, in which
> @@ -2312,14 +2326,7 @@ void *read_blob_data_from_index(struct index_state *istate, const char *path, un
>         }
>         if (pos < 0)
>                 return NULL;
> -       data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
> -       if (!data || type != OBJ_BLOB) {
> -               free(data);
> -               return NULL;
> -       }
> -       if (size)
> -               *size = sz;
> -       return data;
> +       return istate->cache[pos]->oid.hash;
>  }
>
>  void stat_validity_clear(struct stat_validity *sv)
-- 
Duy


* Re: [PATCH v2 2/2] convert.c: stream and fast search for binary
  2016-10-27 21:18   ` Junio C Hamano
@ 2016-11-01  9:36     ` Torsten Bögershausen
  0 siblings, 0 replies; 11+ messages in thread
From: Torsten Bögershausen @ 2016-11-01  9:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[]
> This probably should be done as four more patches to become
> reviewable.
> 
>  - One to use the CONVERT_STAT_BITS a lot more for the conversion
>    decision than before, 
> 
>  - another to allow the caller to tell gather_stats() to give up
>    early with the "search_only" bits, 
> 
>  - another to update the get_*_convert_stats() functions to use
>    get_convert_stats_sha1(), and then finally 
> 
>  - use the streaming interface when reading from blob and file.
> 
> or something like that.

Many thanks for the detailed review. Let's see if I can come up
with a better series in the next few weeks or so.



Thread overview: 11+ messages
2016-10-09  9:56 [PATCH v1 0/2] convert: stream and early out tboegi
2016-10-10 20:19 ` Junio C Hamano
2016-10-12 13:47 ` [PATCH v2 0/2] Stream and fast search tboegi
2016-10-27 17:02   ` Junio C Hamano
2016-10-12 13:47 ` [PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper tboegi
2016-10-27 19:57   ` Junio C Hamano
2016-10-29 12:22   ` Duy Nguyen
2016-10-12 13:47 ` [PATCH v2 2/2] convert.c: stream and fast search for binary tboegi
2016-10-27 21:18   ` Junio C Hamano
2016-11-01  9:36     ` Torsten Bögershausen
2016-10-29 12:13   ` Duy Nguyen
