git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model
@ 2021-10-27  7:49 Johannes Schindelin via GitGitGadget
  2021-10-27  7:49 ` [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

This patch series came in via the Git for Windows fork
[https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
before v2.34.0-rc0, therefore I appreciate every careful review you gentle
people can spare.

The x86_64 variant of Windows uses the LLP64 data model, where the long data
type is 32-bit. This is very different from the LP64 data model used e.g. by
x86_64 Linux, where unsigned long is 64-bit.

Most notably, this means that sizeof(unsigned long) != sizeof(size_t) in
general.

However, since Git was born in the Linux ecosystem, where that inequality
does not hold true, it is understandable that unsigned long is used in many
code locations where size_t should have been used. As a consequence, quite a
few things are broken e.g. on Windows, when it comes to 4GB file contents or
larger.

Using Git LFS [https://git-lfs.github.io/] trying to work around such issues
is one such a broken scenario. You cannot git checkout, say, 5GB files. Huge
files will be truncated to whatever the file size is modulo 4GB (in the case
of a 5GB file, it would be truncated to 1GB).

This patch series primarily fixes the Git LFS scenario, by allowing clean
filters to accept 5GB files, and by allowing smudge filters to produce 5GB
files.

The much larger project to teach Git to use size_t instead of unsigned long
in all the appropriate places is hardly scratched by this patch series.

Side note: The fix for the clean filter included in this series does not
actually affect Git LFS! The reason is that Git LFS marks its filter as
required, and therefore Git streams the file contents to Git LFS via a file
descriptor (which is unaffected by LLP64). A "clean" filter that is not
marked as required, however, lets Git take the code path that is fixed by
this patch series.

Johannes Schindelin (1):
  git-compat-util: introduce more size_t helpers

Matt Cooper (4):
  t1051: introduce a smudge filter test for extremely large files
  odb: teach read_blob_entry to use size_t
  odb: guard against data loss checking out a huge file
  clean/smudge: allow clean filters to process extremely large files

 convert.c                   |  2 +-
 delta.h                     |  6 +++---
 entry.c                     |  8 +++++---
 entry.h                     |  2 +-
 git-compat-util.h           | 25 +++++++++++++++++++++++++
 object-file.c               |  6 +++---
 packfile.c                  |  6 +++---
 parallel-checkout.c         |  2 +-
 t/t1051-large-conversion.sh | 22 ++++++++++++++++++++++
 9 files changed, 64 insertions(+), 15 deletions(-)


base-commit: ebf3c04b262aa27fbb97f8a0156c2347fecafafb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1068%2Fdscho%2Fhuge-file-smudge-clean-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1068/dscho/huge-file-smudge-clean-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1068
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
@ 2021-10-27  7:49 ` Matt Cooper via GitGitGadget
  2021-10-28  7:15   ` Carlo Arenas
  2021-10-27  7:49 ` [PATCH 2/5] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
added to the database or workdir. ("Smudge" when moving to the workdir;
"clean" when moving to the database.) This is used natively to handle CRLF
to LF conversions. It's also employed by Git-LFS to replace large files
from the workdir with small tracking files in the repo and vice versa.

Git pulls the entire smudged file into memory. While this is inefficient,
there's a more insidious problem on some platforms due to inconsistency
between using unsigned long and size_t for the same type of data (size of
a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
unsigned long long in order to be 64 bits).

Practically speaking, this means 64-bit Windows users of Git-LFS can't
handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
this limitation.

This commit introduces a test exposing the issue; future commits make it
pass. The test simulates the way Git-LFS works by having a tiny file
checked into the repository and expanding it to a huge file on checkout.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t1051-large-conversion.sh | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b7640b3ba8..684ba5bd0a5 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -83,4 +83,16 @@ test_expect_success 'ident converts on output' '
 	test_cmp small.clean large.clean
 '
 
+# This smudge filter prepends 5GB of zeros to the file it checks out. This
+# ensures that smudging doesn't mangle large files on 64-bit Windows.
+test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
+	test_commit test small "a small file" &&
+	test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&
+	echo "small filter=makelarge" >.gitattributes &&
+	rm small &&
+	git checkout -- small &&
+	size=$(test_file_size small) &&
+	test "$size" -ge $((5 * 1024 * 1024 * 1024))
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/5] odb: teach read_blob_entry to use size_t
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
  2021-10-27  7:49 ` [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-10-27  7:49 ` Matt Cooper via GitGitGadget
  2021-10-27  7:49 ` [PATCH 3/5] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

There is mixed use of size_t and unsigned long to deal with sizes in the
codebase. Recall that Windows defines unsigned long as 32 bits even on
64-bit platforms, meaning that converting size_t to unsigned long narrows
the range. This mostly doesn't cause a problem since Git rarely deals
with files larger than 2^32 bytes.

But adjunct systems such as Git LFS, which use smudge/clean filters to
keep huge files out of the repository, may have huge file contents passed
through some of the functions in entry.c and convert.c. On Windows, this
results in a truncated file being written to the workdir. I traced this to
one specific use of unsigned long in write_entry (and a similar instance
in write_pc_item_to_fd for parallel checkout). That appeared to be for
the call to read_blob_entry, which expects a pointer to unsigned long.

By altering the signature of read_blob_entry to expect a size_t,
write_entry can be switched to use size_t internally (which all of its
callers and most of its callees already used). To avoid touching dozens of
additional files, read_blob_entry uses a local unsigned long to call a
chain of functions which aren't prepared to accept size_t.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 entry.c                     | 8 +++++---
 entry.h                     | 2 +-
 parallel-checkout.c         | 2 +-
 t/t1051-large-conversion.sh | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/entry.c b/entry.c
index 711ee0693c7..4cb3942dbdc 100644
--- a/entry.c
+++ b/entry.c
@@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
 	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
 }
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
+void *read_blob_entry(const struct cache_entry *ce, size_t *size)
 {
 	enum object_type type;
-	void *blob_data = read_object_file(&ce->oid, &type, size);
+	unsigned long ul;
+	void *blob_data = read_object_file(&ce->oid, &type, &ul);
 
+	*size = ul;
 	if (blob_data) {
 		if (type == OBJ_BLOB)
 			return blob_data;
@@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
 	int fd, ret, fstat_done = 0;
 	char *new_blob;
 	struct strbuf buf = STRBUF_INIT;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 	size_t newsize = 0;
 	struct stat st;
diff --git a/entry.h b/entry.h
index b8c0e170dc7..61ee8c17604 100644
--- a/entry.h
+++ b/entry.h
@@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
  */
 void unlink_entry(const struct cache_entry *ce);
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
+void *read_blob_entry(const struct cache_entry *ce, size_t *size);
 int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
 void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
 			   struct stat *st);
diff --git a/parallel-checkout.c b/parallel-checkout.c
index 6b1af32bb3d..b6f4a25642e 100644
--- a/parallel-checkout.c
+++ b/parallel-checkout.c
@@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
 	struct stream_filter *filter;
 	struct strbuf buf = STRBUF_INIT;
 	char *blob;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 
 	/* Sanity check */
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 684ba5bd0a5..38aa0d8a075 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&
 	echo "small filter=makelarge" >.gitattributes &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 3/5] git-compat-util: introduce more size_t helpers
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
  2021-10-27  7:49 ` [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
  2021-10-27  7:49 ` [PATCH 2/5] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-10-27  7:49 ` Johannes Schindelin via GitGitGadget
  2021-10-27  7:49 ` [PATCH 4/5] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

We will use them in the next commit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a35..7977720655c 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -113,6 +113,14 @@
 #define unsigned_mult_overflows(a, b) \
     ((a) && (b) > maximum_unsigned_value_of_type(a) / (a))
 
+/*
+ * Returns true if the left shift of "a" by "shift" bits will
+ * overflow. The types of "a" and "b" must be unsigned.
+ * Note that this macro evaluates "a" twice!
+ */
+#define unsigned_left_shift_overflows(a, shift) \
+    ((a) > maximum_unsigned_value_of_type(a) >> shift)
+
 #ifdef __GNUC__
 #define TYPEOF(x) (__typeof__(x))
 #else
@@ -859,6 +867,23 @@ static inline size_t st_sub(size_t a, size_t b)
 	return a - b;
 }
 
+static inline size_t st_left_shift(size_t a, unsigned shift)
+{
+	if (unsigned_left_shift_overflows(a, shift))
+		die("size_t overflow: %"PRIuMAX" << %u",
+		    (uintmax_t)a, shift);
+	return a << shift;
+}
+
+static inline unsigned long cast_size_t_to_ulong(size_t a)
+{
+	if (a != (unsigned long)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %lu",
+		    (uintmax_t)a, (unsigned long)a);
+	return (unsigned long)a;
+}
+
 #ifdef HAVE_ALLOCA_H
 # include <alloca.h>
 # define xalloca(size)      (alloca(size))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 4/5] odb: guard against data loss checking out a huge file
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-10-27  7:49 ` [PATCH 3/5] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-10-27  7:49 ` Matt Cooper via GitGitGadget
  2021-10-27  7:49 ` [PATCH 5/5] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

This introduces an additional guard for platforms where `unsigned long`
and `size_t` are not of the same size. If the size of an object in the
database would overflow `unsigned long`, instead we now exit with an
error.

A complete fix will have to update _many_ other functions throughout the
codebase to use `size_t` instead of `unsigned long`. It will have to be
implemented at some stage.

This commit puts in a stop-gap for the time being.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 delta.h       | 6 +++---
 object-file.c | 6 +++---
 packfile.c    | 6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/delta.h b/delta.h
index 2df5fe13d95..8a56ec07992 100644
--- a/delta.h
+++ b/delta.h
@@ -90,15 +90,15 @@ static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
 					       const unsigned char *top)
 {
 	const unsigned char *data = *datap;
-	unsigned long cmd, size = 0;
+	size_t cmd, size = 0;
 	int i = 0;
 	do {
 		cmd = *data++;
-		size |= (cmd & 0x7f) << i;
+		size |= st_left_shift(cmd & 0x7f, i);
 		i += 7;
 	} while (cmd & 0x80 && data < top);
 	*datap = data;
-	return size;
+	return cast_size_t_to_ulong(size);
 }
 
 #endif
diff --git a/object-file.c b/object-file.c
index f233b440b22..70e456fc2a3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1344,7 +1344,7 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 				       unsigned int flags)
 {
 	const char *type_buf = hdr;
-	unsigned long size;
+	size_t size;
 	int type, type_len = 0;
 
 	/*
@@ -1388,12 +1388,12 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 			if (c > 9)
 				break;
 			hdr++;
-			size = size * 10 + c;
+			size = st_add(st_mult(size, 10), c);
 		}
 	}
 
 	if (oi->sizep)
-		*oi->sizep = size;
+		*oi->sizep = cast_size_t_to_ulong(size);
 
 	/*
 	 * The length must be followed by a zero byte
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..3ccea004396 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1059,7 +1059,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		unsigned long len, enum object_type *type, unsigned long *sizep)
 {
 	unsigned shift;
-	unsigned long size, c;
+	size_t size, c;
 	unsigned long used = 0;
 
 	c = buf[used++];
@@ -1073,10 +1073,10 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 			break;
 		}
 		c = buf[used++];
-		size += (c & 0x7f) << shift;
+		size = st_add(size, st_left_shift(c & 0x7f, shift));
 		shift += 7;
 	}
-	*sizep = size;
+	*sizep = cast_size_t_to_ulong(size);
 	return used;
 }
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 5/5] clean/smudge: allow clean filters to process extremely large files
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-10-27  7:49 ` [PATCH 4/5] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
@ 2021-10-27  7:49 ` Matt Cooper via GitGitGadget
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
  2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
  6 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-27  7:49 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 38aa0d8a075..c488850bee4 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -95,4 +95,14 @@ test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output'
 	test "$size" -ge $((5 * 1024 * 1024 * 1024))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on input' '
+	dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files
  2021-10-27  7:49 ` [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-10-28  7:15   ` Carlo Arenas
  2021-10-28  8:54     ` [PATCH] helper/test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón
  0 siblings, 1 reply; 78+ messages in thread
From: Carlo Arenas @ 2021-10-28  7:15 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget; +Cc: git, Johannes Schindelin, Matt Cooper

On Wed, Oct 27, 2021 at 12:03 PM Matt Cooper via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index 8b7640b3ba8..684ba5bd0a5 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -83,4 +83,16 @@ test_expect_success 'ident converts on output' '
>         test_cmp small.clean large.clean
>  '
>
> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
> +test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
> +       test_commit test small "a small file" &&
> +       test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&

/dev/zero doesn't exist in HP NonStop, a portable solution would be to
use `test-tool genzeros` that is available since d5cfd142ec (tests:
teach the test-tool to generate NUL bytes and use it, 2019-02-14)

Carlo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] helper/test-genzeros: allow more than 2G zeros in Windows
  2021-10-28  7:15   ` Carlo Arenas
@ 2021-10-28  8:54     ` Carlo Marcelo Arenas Belón
  2021-10-28 20:32       ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-10-28  8:54 UTC (permalink / raw)
  To: git; +Cc: vtbassmatt, johannes.schindelin, Carlo Marcelo Arenas Belón

d5cfd142ec (tests: teach the test-tool to generate NUL bytes and
use it, 2019-02-14), add a way to generate zeroes in a portable
way without using /dev/zero (needed by HP NonStop), but uses a
long variable that is limited to 2^31 in Windows.

Use instead a (POSIX/C99) intmax_t that is at least 64bit wide
in 64-bit Windows to use in a future test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 t/helper/test-genzeros.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index 9532f5bac9..b1197e91a8 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,14 +3,14 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
-	long count;
+	intmax_t count;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
 		return 1;
 	}
 
-	count = argc > 1 ? strtol(argv[1], NULL, 0) : -1L;
+	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
 	while (count < 0 || count--) {
 		if (putchar(0) == EOF)
-- 
2.33.0.1155.gbdb71ac078


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] helper/test-genzeros: allow more than 2G zeros in Windows
  2021-10-28  8:54     ` [PATCH] helper/test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón
@ 2021-10-28 20:32       ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-10-28 20:32 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, vtbassmatt

[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]

Hi Carlo,

On Thu, 28 Oct 2021, Carlo Marcelo Arenas Belón wrote:

> d5cfd142ec (tests: teach the test-tool to generate NUL bytes and
> use it, 2019-02-14), add a way to generate zeroes in a portable
> way without using /dev/zero (needed by HP NonStop), but uses a
> long variable that is limited to 2^31 in Windows.
>
> Use instead a (POSIX/C99) intmax_t that is at least 64bit wide
> in 64-bit Windows to use in a future test.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

Thank you for this patch. I integrated it into the patch series.

Unfortunately, it is incomplete, not because it does not work, but because
it comes at a hefty performance cost. In my tests, generating a gigabyte
of NULs took around 27 seconds with `genzeros`. Compare that to ~0.75
seconds with `dd`, and it is not funny, stop laughing.

Happily, I was able to rewrite the core part of `genzeros` to write chunks
of a 256kB array instead, which pushed it back down to ~0.6 seconds.

Will send out a new iteration as soon as the CI build passes.

Ciao,
Dscho

> ---
>  t/helper/test-genzeros.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
> index 9532f5bac9..b1197e91a8 100644
> --- a/t/helper/test-genzeros.c
> +++ b/t/helper/test-genzeros.c
> @@ -3,14 +3,14 @@
>
>  int cmd__genzeros(int argc, const char **argv)
>  {
> -	long count;
> +	intmax_t count;
>
>  	if (argc > 2) {
>  		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
>  		return 1;
>  	}
>
> -	count = argc > 1 ? strtol(argv[1], NULL, 0) : -1L;
> +	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
>
>  	while (count < 0 || count--) {
>  		if (putchar(0) == EOF)
> --
> 2.33.0.1155.gbdb71ac078
>
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-10-27  7:49 ` [PATCH 5/5] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-10-28 20:50 ` Johannes Schindelin via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 1/7] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
                     ` (8 more replies)
  2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
  6 siblings, 9 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin

This patch series came in via the Git for Windows fork
[https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
before v2.34.0-rc0, therefore I appreciate every careful review you gentle
people can spare.

The x86_64 variant of Windows uses the LLP64 data model, where the long data
type is 32-bit. This is very different from the LP64 data model used e.g. by
x86_64 Linux, where unsigned long is 64-bit.

Most notably, this means that sizeof(unsigned long) != sizeof(size_t) in
general.

However, since Git was born in the Linux ecosystem, where that inequality
does not hold true, it is understandable that unsigned long is used in many
code locations where size_t should have been used. As a consequence, quite a
few things are broken e.g. on Windows, when it comes to 4GB file contents or
larger.

Using Git LFS [https://git-lfs.github.io/] trying to work around such issues
is one such a broken scenario. You cannot git checkout, say, 5GB files. Huge
files will be truncated to whatever the file size is modulo 4GB (in the case
of a 5GB file, it would be truncated to 1GB).

This patch series primarily fixes the Git LFS scenario, by allowing clean
filters to accept 5GB files, and by allowing smudge filters to produce 5GB
files.

The much larger project to teach Git to use size_t instead of unsigned long
in all the appropriate places is hardly scratched by this patch series.

Side note: The fix for the clean filter included in this series does not
actually affect Git LFS! The reason is that Git LFS marks its filter as
required, and therefore Git streams the file contents to Git LFS via a file
descriptor (which is unaffected by LLP64). A "clean" filter that is not
marked as required, however, lets Git take the code path that is fixed by
this patch series.

Changes since v1:

 * Removed extraneous "Signed-off-by:" lines from "git-compat-util:
   introduce more size_t helpers".
 * Integrated Carlo's patch to allow genzeros to generate large amounts of
   NULs, even in LLP64 data models.
 * Using test-tool genzeros instead of dd if=/dev/zero, to help HP NonStop
   (which appears to use the LP64 data model and therefore should pass the
   new test cases even without the fixes provided in this patch series).
 * Accelerating genzeros to have performance characteristics similar to dd
   if=/dev/zero instead of being ~50x slower.

Carlo Marcelo Arenas Belón (1):
  test-genzeros: allow more than 2G zeros in Windows

Johannes Schindelin (2):
  test-tool genzeros: generate large amounts of data more efficiently
  git-compat-util: introduce more size_t helpers

Matt Cooper (4):
  t1051: introduce a smudge filter test for extremely large files
  odb: teach read_blob_entry to use size_t
  odb: guard against data loss checking out a huge file
  clean/smudge: allow clean filters to process extremely large files

 convert.c                   |  2 +-
 delta.h                     |  6 +++---
 entry.c                     |  8 +++++---
 entry.h                     |  2 +-
 git-compat-util.h           | 25 +++++++++++++++++++++++++
 object-file.c               |  6 +++---
 packfile.c                  |  6 +++---
 parallel-checkout.c         |  2 +-
 t/helper/test-genzeros.c    | 21 +++++++++++++++++----
 t/t1051-large-conversion.sh | 23 +++++++++++++++++++++++
 10 files changed, 82 insertions(+), 19 deletions(-)


base-commit: ebf3c04b262aa27fbb97f8a0156c2347fecafafb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1068%2Fdscho%2Fhuge-file-smudge-clean-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1068/dscho/huge-file-smudge-clean-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1068

Range-diff vs v1:

 -:  ----------- > 1:  068f897b973 test-genzeros: allow more than 2G zeros in Windows
 -:  ----------- > 2:  6edcbae372e test-tool genzeros: generate large amounts of data more efficiently
 1:  449eb5c205e ! 3:  1bdded86f5d t1051: introduce a smudge filter test for extremely large files
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
      +# ensures that smudging doesn't mangle large files on 64-bit Windows.
      +test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
      +	test_commit test small "a small file" &&
     -+	test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&
     ++	test_config filter.makelarge.smudge \
     ++		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
      +	echo "small filter=makelarge" >.gitattributes &&
      +	rm small &&
      +	git checkout -- small &&
 2:  5b9d149ba23 ! 4:  3ffd3a001f7 odb: teach read_blob_entry to use size_t
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
      -test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
      +test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
       	test_commit test small "a small file" &&
     - 	test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&
     - 	echo "small filter=makelarge" >.gitattributes &&
     + 	test_config filter.makelarge.smudge \
     + 		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
 3:  c81ee778b26 ! 5:  32472ae3f98 git-compat-util: introduce more size_t helpers
     @@ Commit message
          We will use them in the next commit.
      
          Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
     -    Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
     -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
      
       ## git-compat-util.h ##
      @@
 4:  bca013a0511 = 6:  c6910584108 odb: guard against data loss checking out a huge file
 5:  20387ce3557 ! 7:  d87d4229bb4 clean/smudge: allow clean filters to process extremely large files
     @@ t/t1051-large-conversion.sh: test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files
      +# This clean filter writes down the size of input it receives. By checking against
      +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
      +test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on input' '
     -+	dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) >big &&
     ++	test-tool genzeros $((5*1024*1024*1024)) >big &&
      +	test_config filter.checklarge.clean "wc -c >big.size" &&
      +	echo "big filter=checklarge" >.gitattributes &&
      +	git add big &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v2 1/7] test-genzeros: allow more than 2G zeros in Windows
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
@ 2021-10-28 20:50   ` Carlo Marcelo Arenas Belón via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin,
	Carlo Marcelo Arenas Belón

From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>

d5cfd142ec (tests: teach the test-tool to generate NUL bytes and
use it, 2019-02-14), add a way to generate zeroes in a portable
way without using /dev/zero (needed by HP NonStop), but uses a
long variable that is limited to 2^31 in Windows.

Use instead a (POSIX/C99) intmax_t that is at least 64bit wide
in 64-bit Windows to use in a future test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index 9532f5bac97..b1197e91a89 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,14 +3,14 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
-	long count;
+	intmax_t count;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
 		return 1;
 	}
 
-	count = argc > 1 ? strtol(argv[1], NULL, 0) : -1L;
+	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
 	while (count < 0 || count--) {
 		if (putchar(0) == EOF)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 1/7] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-10-28 20:50   ` Johannes Schindelin via GitGitGadget
  2021-10-28 22:55     ` Junio C Hamano
  2021-10-28 20:50   ` [PATCH v2 3/7] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

In this developer's tests, producing one gigabyte worth of NULs in a
busy loop that writes out individual bytes, unbuffered, took ~27sec.
Writing chunked 256kB buffers instead only took ~0.6sec

This matters because we are about to introduce a pair of test cases that
want to be able to produce 5GB of NULs, and we cannot use `/dev/zero`
because of the HP NonStop platform's lack of support for that device.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index b1197e91a89..c061c429da9 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,7 +3,10 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
+	/* static, so that it is NUL-initialized */
+	static char zeros[256 * 1024];
 	intmax_t count;
+	ssize_t n;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
@@ -12,9 +15,19 @@ int cmd__genzeros(int argc, const char **argv)
 
 	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
-	while (count < 0 || count--) {
-		if (putchar(0) == EOF)
+	/* Writing out individual NUL bytes is slow... */
+	while (count < 0)
+		if (write(1, zeros, ARRAY_SIZE(zeros) < 0))
 			return -1;
+
+	while (count > 0) {
+		n = write(1, zeros, count < ARRAY_SIZE(zeros) ?
+			  count : ARRAY_SIZE(zeros));
+
+		if (n < 0)
+			return -1;
+
+		count -= n;
 	}
 
 	return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 3/7] t1051: introduce a smudge filter test for extremely large files
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 1/7] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
@ 2021-10-28 20:50   ` Matt Cooper via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 4/7] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
added to the database or workdir. ("Smudge" when moving to the workdir;
"clean" when moving to the database.) This is used natively to handle CRLF
to LF conversions. It's also employed by Git-LFS to replace large files
from the workdir with small tracking files in the repo and vice versa.

Git pulls the entire smudged file into memory. While this is inefficient,
there's a more insidious problem on some platforms due to inconsistency
between using unsigned long and size_t for the same type of data (size of
a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
unsigned long long in order to be 64 bits).

Practically speaking, this means 64-bit Windows users of Git-LFS can't
handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
this limitation.

This commit introduces a test exposing the issue; future commits make it
pass. The test simulates the way Git-LFS works by having a tiny file
checked into the repository and expanding it to a huge file on checkout.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t1051-large-conversion.sh | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b7640b3ba8..7c1a2845005 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -83,4 +83,17 @@ test_expect_success 'ident converts on output' '
 	test_cmp small.clean large.clean
 '
 
+# This smudge filter prepends 5GB of zeros to the file it checks out. This
+# ensures that smudging doesn't mangle large files on 64-bit Windows.
+test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
+	test_commit test small "a small file" &&
+	test_config filter.makelarge.smudge \
+		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
+	echo "small filter=makelarge" >.gitattributes &&
+	rm small &&
+	git checkout -- small &&
+	size=$(test_file_size small) &&
+	test "$size" -ge $((5 * 1024 * 1024 * 1024))
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 4/7] odb: teach read_blob_entry to use size_t
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-10-28 20:50   ` [PATCH v2 3/7] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-10-28 20:50   ` Matt Cooper via GitGitGadget
  2021-10-28 22:14     ` Carlo Arenas
  2021-10-28 20:50   ` [PATCH v2 5/7] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

There is mixed use of size_t and unsigned long to deal with sizes in the
codebase. Recall that Windows defines unsigned long as 32 bits even on
64-bit platforms, meaning that converting size_t to unsigned long narrows
the range. This mostly doesn't cause a problem since Git rarely deals
with files larger than 2^32 bytes.

But adjunct systems such as Git LFS, which use smudge/clean filters to
keep huge files out of the repository, may have huge file contents passed
through some of the functions in entry.c and convert.c. On Windows, this
results in a truncated file being written to the workdir. I traced this to
one specific use of unsigned long in write_entry (and a similar instance
in write_pc_item_to_fd for parallel checkout). That appeared to be for
the call to read_blob_entry, which expects a pointer to unsigned long.

By altering the signature of read_blob_entry to expect a size_t,
write_entry can be switched to use size_t internally (which all of its
callers and most of its callees already used). To avoid touching dozens of
additional files, read_blob_entry uses a local unsigned long to call a
chain of functions which aren't prepared to accept size_t.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 entry.c                     | 8 +++++---
 entry.h                     | 2 +-
 parallel-checkout.c         | 2 +-
 t/t1051-large-conversion.sh | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/entry.c b/entry.c
index 711ee0693c7..4cb3942dbdc 100644
--- a/entry.c
+++ b/entry.c
@@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
 	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
 }
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
+void *read_blob_entry(const struct cache_entry *ce, size_t *size)
 {
 	enum object_type type;
-	void *blob_data = read_object_file(&ce->oid, &type, size);
+	unsigned long ul;
+	void *blob_data = read_object_file(&ce->oid, &type, &ul);
 
+	*size = ul;
 	if (blob_data) {
 		if (type == OBJ_BLOB)
 			return blob_data;
@@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
 	int fd, ret, fstat_done = 0;
 	char *new_blob;
 	struct strbuf buf = STRBUF_INIT;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 	size_t newsize = 0;
 	struct stat st;
diff --git a/entry.h b/entry.h
index b8c0e170dc7..61ee8c17604 100644
--- a/entry.h
+++ b/entry.h
@@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
  */
 void unlink_entry(const struct cache_entry *ce);
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
+void *read_blob_entry(const struct cache_entry *ce, size_t *size);
 int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
 void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
 			   struct stat *st);
diff --git a/parallel-checkout.c b/parallel-checkout.c
index 6b1af32bb3d..b6f4a25642e 100644
--- a/parallel-checkout.c
+++ b/parallel-checkout.c
@@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
 	struct stream_filter *filter;
 	struct strbuf buf = STRBUF_INIT;
 	char *blob;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 
 	/* Sanity check */
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 7c1a2845005..5ba03d02682 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	test_config filter.makelarge.smudge \
 		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 5/7] git-compat-util: introduce more size_t helpers
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-10-28 20:50   ` [PATCH v2 4/7] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-10-28 20:50   ` Johannes Schindelin via GitGitGadget
  2021-10-28 23:05     ` Junio C Hamano
  2021-10-28 20:50   ` [PATCH v2 6/7] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

We will use them in the next commit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a35..7977720655c 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -113,6 +113,14 @@
 #define unsigned_mult_overflows(a, b) \
     ((a) && (b) > maximum_unsigned_value_of_type(a) / (a))
 
+/*
+ * Returns true if the left shift of "a" by "shift" bits will
+ * overflow. The types of "a" and "b" must be unsigned.
+ * Note that this macro evaluates "a" twice!
+ */
+#define unsigned_left_shift_overflows(a, shift) \
+    ((a) > maximum_unsigned_value_of_type(a) >> shift)
+
 #ifdef __GNUC__
 #define TYPEOF(x) (__typeof__(x))
 #else
@@ -859,6 +867,23 @@ static inline size_t st_sub(size_t a, size_t b)
 	return a - b;
 }
 
+static inline size_t st_left_shift(size_t a, unsigned shift)
+{
+	if (unsigned_left_shift_overflows(a, shift))
+		die("size_t overflow: %"PRIuMAX" << %u",
+		    (uintmax_t)a, shift);
+	return a << shift;
+}
+
+static inline unsigned long cast_size_t_to_ulong(size_t a)
+{
+	if (a != (unsigned long)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %lu",
+		    (uintmax_t)a, (unsigned long)a);
+	return (unsigned long)a;
+}
+
 #ifdef HAVE_ALLOCA_H
 # include <alloca.h>
 # define xalloca(size)      (alloca(size))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 6/7] odb: guard against data loss checking out a huge file
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-10-28 20:50   ` [PATCH v2 5/7] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-10-28 20:50   ` Matt Cooper via GitGitGadget
  2021-10-28 20:50   ` [PATCH v2 7/7] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

This introduces an additional guard for platforms where `unsigned long`
and `size_t` are not of the same size. If the size of an object in the
database would overflow `unsigned long`, instead we now exit with an
error.

A complete fix will have to update _many_ other functions throughout the
codebase to use `size_t` instead of `unsigned long`. It will have to be
implemented at some stage.

This commit puts in a stop-gap for the time being.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 delta.h       | 6 +++---
 object-file.c | 6 +++---
 packfile.c    | 6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/delta.h b/delta.h
index 2df5fe13d95..8a56ec07992 100644
--- a/delta.h
+++ b/delta.h
@@ -90,15 +90,15 @@ static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
 					       const unsigned char *top)
 {
 	const unsigned char *data = *datap;
-	unsigned long cmd, size = 0;
+	size_t cmd, size = 0;
 	int i = 0;
 	do {
 		cmd = *data++;
-		size |= (cmd & 0x7f) << i;
+		size |= st_left_shift(cmd & 0x7f, i);
 		i += 7;
 	} while (cmd & 0x80 && data < top);
 	*datap = data;
-	return size;
+	return cast_size_t_to_ulong(size);
 }
 
 #endif
diff --git a/object-file.c b/object-file.c
index f233b440b22..70e456fc2a3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1344,7 +1344,7 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 				       unsigned int flags)
 {
 	const char *type_buf = hdr;
-	unsigned long size;
+	size_t size;
 	int type, type_len = 0;
 
 	/*
@@ -1388,12 +1388,12 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 			if (c > 9)
 				break;
 			hdr++;
-			size = size * 10 + c;
+			size = st_add(st_mult(size, 10), c);
 		}
 	}
 
 	if (oi->sizep)
-		*oi->sizep = size;
+		*oi->sizep = cast_size_t_to_ulong(size);
 
 	/*
 	 * The length must be followed by a zero byte
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..3ccea004396 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1059,7 +1059,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		unsigned long len, enum object_type *type, unsigned long *sizep)
 {
 	unsigned shift;
-	unsigned long size, c;
+	size_t size, c;
 	unsigned long used = 0;
 
 	c = buf[used++];
@@ -1073,10 +1073,10 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 			break;
 		}
 		c = buf[used++];
-		size += (c & 0x7f) << shift;
+		size = st_add(size, st_left_shift(c & 0x7f, shift));
 		shift += 7;
 	}
-	*sizep = size;
+	*sizep = cast_size_t_to_ulong(size);
 	return used;
 }
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 7/7] clean/smudge: allow clean filters to process extremely large files
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-10-28 20:50   ` [PATCH v2 6/7] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
@ 2021-10-28 20:50   ` Matt Cooper via GitGitGadget
  2021-10-28 22:32   ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model brian m. carlson
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
  8 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-28 20:50 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 5ba03d02682..83d9cf485a3 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -96,4 +96,14 @@ test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output'
 	test "$size" -ge $((5 * 1024 * 1024 * 1024))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on input' '
+	test-tool genzeros $((5*1024*1024*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 0/3] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
@ 2021-10-28 20:56 ` Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón
                     ` (2 more replies)
  6 siblings, 3 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-10-28 20:56 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, vtbassmatt, Carlo Marcelo Arenas Belón

Following series is a fixup to avoid breaking in 32-bit systems; first
patch should be added as a prerequisite with the 2 following able to
apply to the corresponding patch in the original series.

Carlo Marcelo Arenas Belón (3):
  test-lib: add prerequisite for 64-bit platforms
  fixup! t1051: introduce a smudge filter test for extremely large files
  fixup! clean/smudge: allow clean filters to process extremely large
    files

 t/t1051-large-conversion.sh | 4 ++--
 t/test-lib.sh               | 4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

-- 
2.33.0.1155.gbdb71ac078


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms
  2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
@ 2021-10-28 20:56   ` Carlo Marcelo Arenas Belón
  2021-10-28 21:45     ` Johannes Schindelin
  2021-10-28 20:56   ` [PATCH 2/3] fixup! t1051: introduce a smudge filter test for extremely large files Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 3/3] fixup! clean/smudge: allow clean filters to process " Carlo Marcelo Arenas Belón
  2 siblings, 1 reply; 78+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-10-28 20:56 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, vtbassmatt, Carlo Marcelo Arenas Belón

Allow tests that assume a 64-bit size_t to be skipped in 32-bit platforms
and regardless of the size of long.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 t/test-lib.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index fc1e521519..5fa7fb5719 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1687,6 +1687,10 @@ build_option () {
 	sed -ne "s/^$1: //p"
 }
 
+test_lazy_prereq IS_64BIT '
+	test 8 -eq "$(build_option sizeof-size_t)"
+'
+
 test_lazy_prereq LONG_IS_64BIT '
 	test 8 -le "$(build_option sizeof-long)"
 '
-- 
2.33.0.1155.gbdb71ac078


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/3] fixup! t1051: introduce a smudge filter test for extremely large files
  2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón
@ 2021-10-28 20:56   ` Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 3/3] fixup! clean/smudge: allow clean filters to process " Carlo Marcelo Arenas Belón
  2 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-10-28 20:56 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, vtbassmatt, Carlo Marcelo Arenas Belón

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 t/t1051-large-conversion.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index c488850bee..dd5f270b20 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
+test_expect_success EXPENSIVE,IS_64BIT,!LONG_IS_64BIT 'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	test_config filter.makelarge.smudge "dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) && cat" &&
 	echo "small filter=makelarge" >.gitattributes &&
-- 
2.33.0.1155.gbdb71ac078


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 3/3] fixup! clean/smudge: allow clean filters to process extremely large files
  2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón
  2021-10-28 20:56   ` [PATCH 2/3] fixup! t1051: introduce a smudge filter test for extremely large files Carlo Marcelo Arenas Belón
@ 2021-10-28 20:56   ` Carlo Marcelo Arenas Belón
  2 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-10-28 20:56 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, vtbassmatt, Carlo Marcelo Arenas Belón

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 t/t1051-large-conversion.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index dd5f270b20..e903570f30 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -97,7 +97,7 @@ test_expect_success EXPENSIVE,IS_64BIT,!LONG_IS_64BIT 'files over 4GB convert on
 
 # This clean filter writes down the size of input it receives. By checking against
 # the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
-test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on input' '
+test_expect_success EXPENSIVE,IS_64BIT,!LONG_IS_64BIT 'files over 4GB convert on input' '
 	dd if=/dev/zero bs=$((1024*1024)) count=$((5*1024)) >big &&
 	test_config filter.checklarge.clean "wc -c >big.size" &&
 	echo "big filter=checklarge" >.gitattributes &&
-- 
2.33.0.1155.gbdb71ac078


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms
  2021-10-28 20:56   ` [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón
@ 2021-10-28 21:45     ` Johannes Schindelin
  2021-10-28 22:09       ` Carlo Arenas
  0 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin @ 2021-10-28 21:45 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, vtbassmatt

[-- Attachment #1: Type: text/plain, Size: 996 bytes --]

Hi Carlo,

On Thu, 28 Oct 2021, Carlo Marcelo Arenas Belón wrote:

> Allow tests that assume a 64-bit size_t to be skipped in 32-bit platforms
> and regardless of the size of long.

Makes sense, but...

>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  t/test-lib.sh | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index fc1e521519..5fa7fb5719 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -1687,6 +1687,10 @@ build_option () {
>  	sed -ne "s/^$1: //p"
>  }
>
> +test_lazy_prereq IS_64BIT '

This should be `SIZE_T_IS_64BIT`.

> +	test 8 -eq "$(build_option sizeof-size_t)"

Since this is clearly copied from `LONG_IS_64BIT`, why the change from
`-le` to `-eq`? It is at least inconsistent to use anything different
here.

Ciao,
Dscho

> +'
> +
>  test_lazy_prereq LONG_IS_64BIT '
>  	test 8 -le "$(build_option sizeof-long)"
>  '
> --
> 2.33.0.1155.gbdb71ac078
>
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms
  2021-10-28 21:45     ` Johannes Schindelin
@ 2021-10-28 22:09       ` Carlo Arenas
  2021-10-28 22:38         ` Junio C Hamano
  0 siblings, 1 reply; 78+ messages in thread
From: Carlo Arenas @ 2021-10-28 22:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, vtbassmatt

On Thu, Oct 28, 2021 at 2:46 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Thu, 28 Oct 2021, Carlo Marcelo Arenas Belón wrote:
>
> > Allow tests that assume a 64-bit size_t to be skipped in 32-bit platforms
> > and regardless of the size of long.
>
> Makes sense, but...
>
> >
> > Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> > ---
> >  t/test-lib.sh | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > index fc1e521519..5fa7fb5719 100644
> > --- a/t/test-lib.sh
> > +++ b/t/test-lib.sh
> > @@ -1687,6 +1687,10 @@ build_option () {
> >       sed -ne "s/^$1: //p"
> >  }
> >
> > +test_lazy_prereq IS_64BIT '
>
> This should be `SIZE_T_IS_64BIT`.

Fair point, but...

> > +     test 8 -eq "$(build_option sizeof-size_t)"
>
> Since this is clearly copied from `LONG_IS_64BIT`, why the change from
> `-le` to `-eq`? It is at least inconsistent to use anything different
> here.

My assumption is that the check for sizeof(size_t) we have is really
about finding the bit width of the platform, and we currently support
2 of them (32-bit and 64-bit), which is why the name I chose was
"IS_64BIT" and also why I was strict on it being exactly 8 bytes
(considering all platforms git supports have bytes with 8 bits).

It can go eitherway IMHO, and your point about being inconsistent
(with my lack of explanation in the commit) suggests we should instead
use your proposal, do you want me to resend or could adjust them in
your tree?

Carlo

PS. I think we should also add a "TODO" comment in the code, but other
than that could also take my "Reviewed-by" for the series

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 4/7] odb: teach read_blob_entry to use size_t
  2021-10-28 20:50   ` [PATCH v2 4/7] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-10-28 22:14     ` Carlo Arenas
  2021-10-28 22:21       ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Carlo Arenas @ 2021-10-28 22:14 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget; +Cc: git, Johannes Schindelin, Matt Cooper

On Thu, Oct 28, 2021 at 1:50 PM Matt Cooper via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> diff --git a/entry.c b/entry.c
> index 711ee0693c7..4cb3942dbdc 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
>         return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
>  }
>
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
>  {
>         enum object_type type;
> -       void *blob_data = read_object_file(&ce->oid, &type, size);
> +       unsigned long ul;
> +       void *blob_data = read_object_file(&ce->oid, &type, &ul);
>
> +       *size = ul;

Considering this unsigned long variable is obviously holding an
incorrect value, might be worth adding a "TODO" comment here,
mentioning it should be changed to a real size_t (probably after
release, since it is obvious that change is too intrusive to need this
as a kludge for now).

Carlo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 4/7] odb: teach read_blob_entry to use size_t
  2021-10-28 22:14     ` Carlo Arenas
@ 2021-10-28 22:21       ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-10-28 22:21 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Matt Cooper via GitGitGadget, git, Matt Cooper

Hi Carlo,

On Thu, 28 Oct 2021, Carlo Arenas wrote:

> On Thu, Oct 28, 2021 at 1:50 PM Matt Cooper via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> > diff --git a/entry.c b/entry.c
> > index 711ee0693c7..4cb3942dbdc 100644
> > --- a/entry.c
> > +++ b/entry.c
> > @@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
> >         return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
> >  }
> >
> > -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> > +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
> >  {
> >         enum object_type type;
> > -       void *blob_data = read_object_file(&ce->oid, &type, size);
> > +       unsigned long ul;
> > +       void *blob_data = read_object_file(&ce->oid, &type, &ul);
> >
> > +       *size = ul;
>
> Considering this unsigned long variable is obviously holding an
> incorrect value, might be worth adding a "TODO" comment here,
> mentioning it should be changed to a real size_t (probably after
> release, since it is obvious that change is too intrusive to need this
> as a kludge for now).

I do not think it would be a good idea to introduce a `TODO` comment here.
Why _here_, of all places? Just because we use `size_t` correctly for a
bit of the function?

No, that `TODO` would easily be forgotten when somebody tackles the big
`unsigned long` -> `size_t` project. I therefore do not even want to add
it. We know about this problem. No need for a code comment that is prone
to become stale.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-10-28 20:50   ` [PATCH v2 7/7] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-10-28 22:32   ` brian m. carlson
  2021-10-28 23:07     ` Junio C Hamano
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
  8 siblings, 1 reply; 78+ messages in thread
From: brian m. carlson @ 2021-10-28 22:32 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 2531 bytes --]

On 2021-10-28 at 20:50:30, Johannes Schindelin via GitGitGadget wrote:
> This patch series came in via the Git for Windows fork
> [https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
> before v2.34.0-rc0, therefore I appreciate every careful review you gentle
> people can spare.
> 
> The x86_64 variant of Windows uses the LLP64 data model, where the long data
> type is 32-bit. This is very different from the LP64 data model used e.g. by
> x86_64 Linux, where unsigned long is 64-bit.
> 
> Most notably, this means that sizeof(unsigned long) != sizeof(size_t) in
> general.
> 
> However, since Git was born in the Linux ecosystem, where that inequality
> does not hold true, it is understandable that unsigned long is used in many
> code locations where size_t should have been used. As a consequence, quite a
> few things are broken e.g. on Windows, when it comes to 4GB file contents or
> larger.
> 
> Using Git LFS [https://git-lfs.github.io/] trying to work around such issues
> is one such a broken scenario. You cannot git checkout, say, 5GB files. Huge
> files will be truncated to whatever the file size is modulo 4GB (in the case
> of a 5GB file, it would be truncated to 1GB).
> 
> This patch series primarily fixes the Git LFS scenario, by allowing clean
> filters to accept 5GB files, and by allowing smudge filters to produce 5GB
> files.
> 
> The much larger project to teach Git to use size_t instead of unsigned long
> in all the appropriate places is hardly scratched by this patch series.
> 
> Side note: The fix for the clean filter included in this series does not
> actually affect Git LFS! The reason is that Git LFS marks its filter as
> required, and therefore Git streams the file contents to Git LFS via a file
> descriptor (which is unaffected by LLP64). A "clean" filter that is not
> marked as required, however, lets Git take the code path that is fixed by
> this patch series.

This series seems sane to me.  I'm delighted we can fix this issue with
so little code, since it's a been a very inconvenient problem for a lot
of Windows users.

I might suggest when we make the giant transition project in the future
that we use size_t for things that are going to be in memory and off_t
or, if necessary, intmax_t for general object sizes so it's clear which
one we want.  However, that has no effect on this series since this
intentionally has a limited scope.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms
  2021-10-28 22:09       ` Carlo Arenas
@ 2021-10-28 22:38         ` Junio C Hamano
  2021-11-02 15:20           ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2021-10-28 22:38 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Johannes Schindelin, git, vtbassmatt

Carlo Arenas <carenas@gmail.com> writes:

>> Since this is clearly copied from `LONG_IS_64BIT`, why the change from
>> `-le` to `-eq`? It is at least inconsistent to use anything different
>> here.
>
> My assumption is that the check for sizeof(size_t) we have is really
> about finding the bit width of the platform, and we currently support
> 2 of them (32-bit and 64-bit), which is why the name I chose was
> "IS_64BIT" and also why I was strict on it being exactly 8 bytes
> (considering all platforms git supports have bytes with 8 bits).
>
> It can go eitherway IMHO, and your point about being inconsistent
> (with my lack of explanation in the commit) suggests we should instead
> use your proposal, do you want me to resend or could adjust them in
> your tree?

Is LONG_IS_64BIT used to ensure that long is _at least_ 64 bit?  If
so, perhaps its name may need to be rethought.  On the other hand,
if it is meant to ensure that long is exactly 64 bit, then it should
use -eq to compare.

And from that point of view, SIZE_T_IS_64BIT and use of -eq look
consistent with each other, I would think.

Thanks.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently
  2021-10-28 20:50   ` [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
@ 2021-10-28 22:55     ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-28 22:55 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

>  int cmd__genzeros(int argc, const char **argv)
>  {
> +	/* static, so that it is NUL-initialized */
> +	static char zeros[256 * 1024];

Of course, another benefit is that you do not waste 256kB of
stackspace for this array.  We could probably even mark it as
"const" to emphasize that it is initialized with and will stay
to be bunch of NULs.

>  	intmax_t count;
> +	ssize_t n;
>  
>  	if (argc > 2) {
>  		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
> @@ -12,9 +15,19 @@ int cmd__genzeros(int argc, const char **argv)
>  
>  	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
>  
> -	while (count < 0 || count--) {
> -		if (putchar(0) == EOF)

> +	/* Writing out individual NUL bytes is slow... */
> +	while (count < 0)
> +		if (write(1, zeros, ARRAY_SIZE(zeros) < 0))
>  			return -1;

Be careful where you close your parentheses.  

I wonder if your compiler didn't warn that ARRAY_SIZE(x) will never
be negative?

> +	while (count > 0) {
> +		n = write(1, zeros, count < ARRAY_SIZE(zeros) ?
> +			  count : ARRAY_SIZE(zeros));

This side looks OK.

> +		if (n < 0)
> +			return -1;
> +
> +		count -= n;
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 5/7] git-compat-util: introduce more size_t helpers
  2021-10-28 20:50   ` [PATCH v2 5/7] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-10-28 23:05     ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-28 23:05 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> We will use them in the next commit.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  git-compat-util.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index a508dbe5a35..7977720655c 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -113,6 +113,14 @@
>  #define unsigned_mult_overflows(a, b) \
>      ((a) && (b) > maximum_unsigned_value_of_type(a) / (a))
>  
> +/*
> + * Returns true if the left shift of "a" by "shift" bits will
> + * overflow. The types of "a" and "b" must be unsigned.

The type of "a" must be unsigned, and there is no "b".

"shift" can be of an integral type, and it probably is a good idea
to feed a positive value that is smaller than bitsizeof(type(a)),
but we probably do not have to say anything about it.

> + * Note that this macro evaluates "a" twice!

maximum_unsigned_value_of_type() does take bitsizeof() of the thing,
but it only needs the type of it, not the value, so I doubt that it
would evaluate 'a' even once.  This macro does need the value of 'a'
so it would evaluate it once.

> + */
> +#define unsigned_left_shift_overflows(a, shift) \
> +    ((a) > maximum_unsigned_value_of_type(a) >> shift)

try:

	unsigned a = 0;
	int ov = unsigned_left_shift_overflows(++a, 4);
	
	printf("a = %d, ov = %d\n", a, ov);
	return 0;

I think you'd get "a = 1".


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-28 22:32   ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model brian m. carlson
@ 2021-10-28 23:07     ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-28 23:07 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	Johannes Schindelin

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> This series seems sane to me.  I'm delighted we can fix this issue with
> so little code, since it's a been a very inconvenient problem for a lot
> of Windows users.

Yup.  I do find the goal this series aims is quite worthy, and the
size of the series is "reviewable in a single sitting" bite-size,
which was pleasant.

Thanks.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-10-28 22:32   ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model brian m. carlson
@ 2021-10-29 13:59   ` Johannes Schindelin via GitGitGadget
  2021-10-29 13:59     ` [PATCH v3 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
                       ` (9 more replies)
  8 siblings, 10 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin

This patch series came in via the Git for Windows fork
[https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
before v2.34.0-rc0, therefore I appreciate every careful review you gentle
people can spare.

The x86_64 variant of Windows uses the LLP64 data model, where the long data
type is 32-bit. This is very different from the LP64 data model used e.g. by
x86_64 Linux, where unsigned long is 64-bit.

Most notably, this means that sizeof(unsigned long) != sizeof(size_t) in
general.

However, since Git was born in the Linux ecosystem, where that inequality
does not hold true, it is understandable that unsigned long is used in many
code locations where size_t should have been used. As a consequence, quite a
few things are broken e.g. on Windows, when it comes to 4GB file contents or
larger.

Using Git LFS [https://git-lfs.github.io/] trying to work around such issues
is one such a broken scenario. You cannot git checkout, say, 5GB files. Huge
files will be truncated to whatever the file size is modulo 4GB (in the case
of a 5GB file, it would be truncated to 1GB).

This patch series primarily fixes the Git LFS scenario, by allowing clean
filters to accept 5GB files, and by allowing smudge filters to produce 5GB
files.

The much larger project to teach Git to use size_t instead of unsigned long
in all the appropriate places is hardly scratched by this patch series.

Side note: The fix for the clean filter included in this series does not
actually affect Git LFS! The reason is that Git LFS marks its filter as
required, and therefore Git streams the file contents to Git LFS via a file
descriptor (which is unaffected by LLP64). A "clean" filter that is not
marked as required, however, lets Git take the code path that is fixed by
this patch series.

Changes since v2:

 * The test cases now use a prereq to avoid running in 32-bit setups (where
   they would be guaranteed to fail).
 * The SIZE_T_IS64BIT prereq specifically does not follow LONG_IS_64BIT, by
   testing for exactly 64 bits.
 * The code comment above unsigned_left_shift_overflows() was fixed.
 * unsigned_left_shift_overflows() now verifies that shift is smaller than
   the bit size of the operand.
 * genzeros now marks its buffer of NULs as const.
 * The error check in the infinite loop genzeros was fixed.

Changes since v1:

 * Removed extraneous "Signed-off-by:" lines from "git-compat-util:
   introduce more size_t helpers".
 * Integrated Carlo's patch to allow genzeros to generate large amounts of
   NULs, even in LLP64 data models.
 * Using test-tool genzeros instead of dd if=/dev/zero, to help HP NonStop
   (which appears to use the LP64 data model and therefore should pass the
   new test cases even without the fixes provided in this patch series).
 * Accelerating genzeros to have performance characteristics similar to dd
   if=/dev/zero instead of being ~50x slower.

Carlo Marcelo Arenas Belón (2):
  test-genzeros: allow more than 2G zeros in Windows
  test-lib: add prerequisite for 64-bit platforms

Johannes Schindelin (2):
  test-tool genzeros: generate large amounts of data more efficiently
  git-compat-util: introduce more size_t helpers

Matt Cooper (4):
  t1051: introduce a smudge filter test for extremely large files
  odb: teach read_blob_entry to use size_t
  odb: guard against data loss checking out a huge file
  clean/smudge: allow clean filters to process extremely large files

 convert.c                   |  2 +-
 delta.h                     |  6 +++---
 entry.c                     |  8 +++++---
 entry.h                     |  2 +-
 git-compat-util.h           | 25 +++++++++++++++++++++++++
 object-file.c               |  6 +++---
 packfile.c                  |  6 +++---
 parallel-checkout.c         |  2 +-
 t/helper/test-genzeros.c    | 21 +++++++++++++++++----
 t/t1051-large-conversion.sh | 25 +++++++++++++++++++++++++
 t/test-lib.sh               |  4 ++++
 11 files changed, 88 insertions(+), 19 deletions(-)


base-commit: ebf3c04b262aa27fbb97f8a0156c2347fecafafb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1068%2Fdscho%2Fhuge-file-smudge-clean-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1068/dscho/huge-file-smudge-clean-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1068

Range-diff vs v2:

 1:  068f897b973 = 1:  068f897b973 test-genzeros: allow more than 2G zeros in Windows
 2:  6edcbae372e ! 2:  05219720014 test-tool genzeros: generate large amounts of data more efficiently
     @@ t/helper/test-genzeros.c
       int cmd__genzeros(int argc, const char **argv)
       {
      +	/* static, so that it is NUL-initialized */
     -+	static char zeros[256 * 1024];
     ++	static const char zeros[256 * 1024];
       	intmax_t count;
      +	ssize_t n;
       
     @@ t/helper/test-genzeros.c: int cmd__genzeros(int argc, const char **argv)
      -		if (putchar(0) == EOF)
      +	/* Writing out individual NUL bytes is slow... */
      +	while (count < 0)
     -+		if (write(1, zeros, ARRAY_SIZE(zeros) < 0))
     ++		if (write(1, zeros, ARRAY_SIZE(zeros)) < 0)
       			return -1;
      +
      +	while (count > 0) {
 -:  ----------- > 3:  489500bb1dc test-lib: add prerequisite for 64-bit platforms
 3:  1bdded86f5d ! 4:  ce9dfaac9f8 t1051: introduce a smudge filter test for extremely large files
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
       
      +# This smudge filter prepends 5GB of zeros to the file it checks out. This
      +# ensures that smudging doesn't mangle large files on 64-bit Windows.
     -+test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
     ++test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
     ++		'files over 4GB convert on output' '
      +	test_commit test small "a small file" &&
      +	test_config filter.makelarge.smudge \
      +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
 4:  3ffd3a001f7 ! 5:  dbef8168bc7 odb: teach read_blob_entry to use size_t
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
       
       # This smudge filter prepends 5GB of zeros to the file it checks out. This
       # ensures that smudging doesn't mangle large files on 64-bit Windows.
     --test_expect_failure EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
     -+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output' '
     +-test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
     ++test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
     + 		'files over 4GB convert on output' '
       	test_commit test small "a small file" &&
       	test_config filter.makelarge.smudge \
     - 		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
 5:  32472ae3f98 ! 6:  18419070c29 git-compat-util: introduce more size_t helpers
     @@ git-compat-util.h
       
      +/*
      + * Returns true if the left shift of "a" by "shift" bits will
     -+ * overflow. The types of "a" and "b" must be unsigned.
     -+ * Note that this macro evaluates "a" twice!
     ++ * overflow. The type of "a" must be unsigned.
      + */
      +#define unsigned_left_shift_overflows(a, shift) \
     -+    ((a) > maximum_unsigned_value_of_type(a) >> shift)
     ++    ((shift) < bitsizeof(a) && \
     ++     (a) > maximum_unsigned_value_of_type(a) >> (shift))
      +
       #ifdef __GNUC__
       #define TYPEOF(x) (__typeof__(x))
 6:  c6910584108 = 7:  f59c523bcc4 odb: guard against data loss checking out a huge file
 7:  d87d4229bb4 ! 8:  acc5591517f clean/smudge: allow clean filters to process extremely large files
     @@ convert.c: static int crlf_to_worktree(const char *src, size_t len, struct strbu
       	const char *path;
      
       ## t/t1051-large-conversion.sh ##
     -@@ t/t1051-large-conversion.sh: test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on output'
     +@@ t/t1051-large-conversion.sh: test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
       	test "$size" -ge $((5 * 1024 * 1024 * 1024))
       '
       
      +# This clean filter writes down the size of input it receives. By checking against
      +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
     -+test_expect_success EXPENSIVE,!LONG_IS_64BIT 'files over 4GB convert on input' '
     ++test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
     ++		'files over 4GB convert on input' '
      +	test-tool genzeros $((5*1024*1024*1024)) >big &&
      +	test_config filter.checklarge.clean "wc -c >big.size" &&
      +	echo "big filter=checklarge" >.gitattributes &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v3 1/8] test-genzeros: allow more than 2G zeros in Windows
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
@ 2021-10-29 13:59     ` Carlo Marcelo Arenas Belón via GitGitGadget
  2021-10-29 13:59     ` [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Carlo Marcelo Arenas Belón

From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>

d5cfd142ec (tests: teach the test-tool to generate NUL bytes and
use it, 2019-02-14), add a way to generate zeroes in a portable
way without using /dev/zero (needed by HP NonStop), but uses a
long variable that is limited to 2^31 in Windows.

Use instead a (POSIX/C99) intmax_t that is at least 64bit wide
in 64-bit Windows to use in a future test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index 9532f5bac97..b1197e91a89 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,14 +3,14 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
-	long count;
+	intmax_t count;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
 		return 1;
 	}
 
-	count = argc > 1 ? strtol(argv[1], NULL, 0) : -1L;
+	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
 	while (count < 0 || count--) {
 		if (putchar(0) == EOF)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
  2021-10-29 13:59     ` [PATCH v3 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-10-29 13:59     ` Johannes Schindelin via GitGitGadget
  2021-10-29 22:50       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
                       ` (7 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

In this developer's tests, producing one gigabyte worth of NULs in a
busy loop that writes out individual bytes, unbuffered, took ~27sec.
Writing chunked 256kB buffers instead only took ~0.6sec

This matters because we are about to introduce a pair of test cases that
want to be able to produce 5GB of NULs, and we cannot use `/dev/zero`
because of the HP NonStop platform's lack of support for that device.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index b1197e91a89..8ca988d6216 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,7 +3,10 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
+	/* static, so that it is NUL-initialized */
+	static const char zeros[256 * 1024];
 	intmax_t count;
+	ssize_t n;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
@@ -12,9 +15,19 @@ int cmd__genzeros(int argc, const char **argv)
 
 	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
-	while (count < 0 || count--) {
-		if (putchar(0) == EOF)
+	/* Writing out individual NUL bytes is slow... */
+	while (count < 0)
+		if (write(1, zeros, ARRAY_SIZE(zeros)) < 0)
 			return -1;
+
+	while (count > 0) {
+		n = write(1, zeros, count < ARRAY_SIZE(zeros) ?
+			  count : ARRAY_SIZE(zeros));
+
+		if (n < 0)
+			return -1;
+
+		count -= n;
 	}
 
 	return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
  2021-10-29 13:59     ` [PATCH v3 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
  2021-10-29 13:59     ` [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
@ 2021-10-29 13:59     ` Carlo Marcelo Arenas Belón via GitGitGadget
  2021-10-29 22:52       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
                       ` (6 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Carlo Marcelo Arenas Belón via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Carlo Marcelo Arenas Belón

From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>

Allow tests that assume a 64-bit `size_t` to be skipped in 32-bit
platforms and regardless of the size of `long`.

This imitates the `LONG_IS_64BIT` prerequisite.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/test-lib.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index adaf03543e8..af1a94c2c20 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1642,6 +1642,10 @@ build_option () {
 	sed -ne "s/^$1: //p"
 }
 
+test_lazy_prereq SIZE_T_IS_64BIT '
+	test 8 -eq "$(build_option sizeof-size_t)"
+'
+
 test_lazy_prereq LONG_IS_64BIT '
 	test 8 -le "$(build_option sizeof-long)"
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-10-29 13:59     ` Matt Cooper via GitGitGadget
  2021-10-29 23:00       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
                       ` (5 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
added to the database or workdir. ("Smudge" when moving to the workdir;
"clean" when moving to the database.) This is used natively to handle CRLF
to LF conversions. It's also employed by Git-LFS to replace large files
from the workdir with small tracking files in the repo and vice versa.

Git pulls the entire smudged file into memory. While this is inefficient,
there's a more insidious problem on some platforms due to inconsistency
between using unsigned long and size_t for the same type of data (size of
a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
unsigned long long in order to be 64 bits).

Practically speaking, this means 64-bit Windows users of Git-LFS can't
handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
this limitation.

This commit introduces a test exposing the issue; future commits make it
pass. The test simulates the way Git-LFS works by having a tiny file
checked into the repository and expanding it to a huge file on checkout.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t1051-large-conversion.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b7640b3ba8..bff86c13208 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -83,4 +83,18 @@ test_expect_success 'ident converts on output' '
 	test_cmp small.clean large.clean
 '
 
+# This smudge filter prepends 5GB of zeros to the file it checks out. This
+# ensures that smudging doesn't mangle large files on 64-bit Windows.
+test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on output' '
+	test_commit test small "a small file" &&
+	test_config filter.makelarge.smudge \
+		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
+	echo "small filter=makelarge" >.gitattributes &&
+	rm small &&
+	git checkout -- small &&
+	size=$(test_file_size small) &&
+	test "$size" -ge $((5 * 1024 * 1024 * 1024))
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 5/8] odb: teach read_blob_entry to use size_t
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-10-29 13:59     ` Matt Cooper via GitGitGadget
  2021-10-29 23:17       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

There is mixed use of size_t and unsigned long to deal with sizes in the
codebase. Recall that Windows defines unsigned long as 32 bits even on
64-bit platforms, meaning that converting size_t to unsigned long narrows
the range. This mostly doesn't cause a problem since Git rarely deals
with files larger than 2^32 bytes.

But adjunct systems such as Git LFS, which use smudge/clean filters to
keep huge files out of the repository, may have huge file contents passed
through some of the functions in entry.c and convert.c. On Windows, this
results in a truncated file being written to the workdir. I traced this to
one specific use of unsigned long in write_entry (and a similar instance
in write_pc_item_to_fd for parallel checkout). That appeared to be for
the call to read_blob_entry, which expects a pointer to unsigned long.

By altering the signature of read_blob_entry to expect a size_t,
write_entry can be switched to use size_t internally (which all of its
callers and most of its callees already used). To avoid touching dozens of
additional files, read_blob_entry uses a local unsigned long to call a
chain of functions which aren't prepared to accept size_t.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 entry.c                     | 8 +++++---
 entry.h                     | 2 +-
 parallel-checkout.c         | 2 +-
 t/t1051-large-conversion.sh | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/entry.c b/entry.c
index 711ee0693c7..4cb3942dbdc 100644
--- a/entry.c
+++ b/entry.c
@@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
 	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
 }
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
+void *read_blob_entry(const struct cache_entry *ce, size_t *size)
 {
 	enum object_type type;
-	void *blob_data = read_object_file(&ce->oid, &type, size);
+	unsigned long ul;
+	void *blob_data = read_object_file(&ce->oid, &type, &ul);
 
+	*size = ul;
 	if (blob_data) {
 		if (type == OBJ_BLOB)
 			return blob_data;
@@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
 	int fd, ret, fstat_done = 0;
 	char *new_blob;
 	struct strbuf buf = STRBUF_INIT;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 	size_t newsize = 0;
 	struct stat st;
diff --git a/entry.h b/entry.h
index b8c0e170dc7..61ee8c17604 100644
--- a/entry.h
+++ b/entry.h
@@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
  */
 void unlink_entry(const struct cache_entry *ce);
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
+void *read_blob_entry(const struct cache_entry *ce, size_t *size);
 int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
 void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
 			   struct stat *st);
diff --git a/parallel-checkout.c b/parallel-checkout.c
index 6b1af32bb3d..b6f4a25642e 100644
--- a/parallel-checkout.c
+++ b/parallel-checkout.c
@@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
 	struct stream_filter *filter;
 	struct strbuf buf = STRBUF_INIT;
 	char *blob;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 
 	/* Sanity check */
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index bff86c13208..8b23d862600 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 		'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	test_config filter.makelarge.smudge \
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 6/8] git-compat-util: introduce more size_t helpers
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-10-29 13:59     ` Johannes Schindelin via GitGitGadget
  2021-10-29 23:10       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

We will use them in the next commit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a35..1f41e5611a1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -113,6 +113,14 @@
 #define unsigned_mult_overflows(a, b) \
     ((a) && (b) > maximum_unsigned_value_of_type(a) / (a))
 
+/*
+ * Returns true if the left shift of "a" by "shift" bits will
+ * overflow. The type of "a" must be unsigned.
+ */
+#define unsigned_left_shift_overflows(a, shift) \
+    ((shift) < bitsizeof(a) && \
+     (a) > maximum_unsigned_value_of_type(a) >> (shift))
+
 #ifdef __GNUC__
 #define TYPEOF(x) (__typeof__(x))
 #else
@@ -859,6 +867,23 @@ static inline size_t st_sub(size_t a, size_t b)
 	return a - b;
 }
 
+static inline size_t st_left_shift(size_t a, unsigned shift)
+{
+	if (unsigned_left_shift_overflows(a, shift))
+		die("size_t overflow: %"PRIuMAX" << %u",
+		    (uintmax_t)a, shift);
+	return a << shift;
+}
+
+static inline unsigned long cast_size_t_to_ulong(size_t a)
+{
+	if (a != (unsigned long)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %lu",
+		    (uintmax_t)a, (unsigned long)a);
+	return (unsigned long)a;
+}
+
 #ifdef HAVE_ALLOCA_H
 # include <alloca.h>
 # define xalloca(size)      (alloca(size))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 7/8] odb: guard against data loss checking out a huge file
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-10-29 13:59     ` Matt Cooper via GitGitGadget
  2021-10-29 23:13       ` Junio C Hamano
  2021-10-29 13:59     ` [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
                       ` (2 subsequent siblings)
  9 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

This introduces an additional guard for platforms where `unsigned long`
and `size_t` are not of the same size. If the size of an object in the
database would overflow `unsigned long`, instead we now exit with an
error.

A complete fix will have to update _many_ other functions throughout the
codebase to use `size_t` instead of `unsigned long`. It will have to be
implemented at some stage.

This commit puts in a stop-gap for the time being.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 delta.h       | 6 +++---
 object-file.c | 6 +++---
 packfile.c    | 6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/delta.h b/delta.h
index 2df5fe13d95..8a56ec07992 100644
--- a/delta.h
+++ b/delta.h
@@ -90,15 +90,15 @@ static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
 					       const unsigned char *top)
 {
 	const unsigned char *data = *datap;
-	unsigned long cmd, size = 0;
+	size_t cmd, size = 0;
 	int i = 0;
 	do {
 		cmd = *data++;
-		size |= (cmd & 0x7f) << i;
+		size |= st_left_shift(cmd & 0x7f, i);
 		i += 7;
 	} while (cmd & 0x80 && data < top);
 	*datap = data;
-	return size;
+	return cast_size_t_to_ulong(size);
 }
 
 #endif
diff --git a/object-file.c b/object-file.c
index f233b440b22..70e456fc2a3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1344,7 +1344,7 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 				       unsigned int flags)
 {
 	const char *type_buf = hdr;
-	unsigned long size;
+	size_t size;
 	int type, type_len = 0;
 
 	/*
@@ -1388,12 +1388,12 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 			if (c > 9)
 				break;
 			hdr++;
-			size = size * 10 + c;
+			size = st_add(st_mult(size, 10), c);
 		}
 	}
 
 	if (oi->sizep)
-		*oi->sizep = size;
+		*oi->sizep = cast_size_t_to_ulong(size);
 
 	/*
 	 * The length must be followed by a zero byte
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..3ccea004396 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1059,7 +1059,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		unsigned long len, enum object_type *type, unsigned long *sizep)
 {
 	unsigned shift;
-	unsigned long size, c;
+	size_t size, c;
 	unsigned long used = 0;
 
 	c = buf[used++];
@@ -1073,10 +1073,10 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 			break;
 		}
 		c = buf[used++];
-		size += (c & 0x7f) << shift;
+		size = st_add(size, st_left_shift(c & 0x7f, shift));
 		shift += 7;
 	}
-	*sizep = size;
+	*sizep = cast_size_t_to_ulong(size);
 	return used;
 }
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
@ 2021-10-29 13:59     ` Matt Cooper via GitGitGadget
  2021-10-29 23:17       ` Junio C Hamano
  2021-10-29 18:34     ` [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Junio C Hamano
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
  9 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-10-29 13:59 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin, Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b23d862600..d4cfe8bf5de 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -97,4 +97,15 @@ test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 	test "$size" -ge $((5 * 1024 * 1024 * 1024))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on input' '
+	test-tool genzeros $((5*1024*1024*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-10-29 13:59     ` [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-10-29 18:34     ` Junio C Hamano
       [not found]       ` <nycvar.QRO.7.76.6.2110292239170.56@tvgsbejvaqbjf.bet>
  2021-11-02 14:41       ` Johannes Schindelin
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
  9 siblings, 2 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 18:34 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> This patch series came in via the Git for Windows fork
> [https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
> before v2.34.0-rc0, therefore I appreciate every careful review you gentle
> people can spare.

It is way too late for my tree to go in before -rc0, but the patches
in the last round, with the "Changes since v2" description below,
all sound sensible, including the decision to stop here, instead of
doing "everything should be either size_t or intmax_t" conversion.

7/8 did not apply for me to the tip of 'master', but "am -3" wiggled
it in.  You may want to double check the results.

As the primary author of the series, given the cover title matches
the title of one step in the series, seems to be Matt, let me queue
them under mc/clean-smudge-with-llp64 topic.

Thanks.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
       [not found]       ` <nycvar.QRO.7.76.6.2110292239170.56@tvgsbejvaqbjf.bet>
@ 2021-10-29 21:12         ` Johannes Schindelin
  2021-10-29 23:25           ` Junio C Hamano
  2021-10-30 15:16           ` Philip Oakley
  0 siblings, 2 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-10-29 21:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson

[Re-sending, as it seems that the Git mailing list is causing trouble
again, at least I do not see this on lore.kernel.org/git]


On Fri, 29 Oct 2021, Johannes Schindelin wrote:

> Hi Junio,
> 
> On Fri, 29 Oct 2021, Junio C Hamano wrote:
> 
> > "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> > writes:
> > 
> > > This patch series came in via the Git for Windows fork
> > > [https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
> > > before v2.34.0-rc0, therefore I appreciate every careful review you gentle
> > > people can spare.
> > 
> > It is way too late for my tree to go in before -rc0, but the patches
> > in the last round, with the "Changes since v2" description below,
> > all sound sensible, including the decision to stop here, instead of
> > doing "everything should be either size_t or intmax_t" conversion.
> 
> Thank you for your careful review, it definitely helped with polishing the
> patches.
> 
> > 7/8 did not apply for me to the tip of 'master', but "am -3" wiggled
> > it in.  You may want to double check the results.
> 
> Right, I purposefully based the patches on v2.32.0 so they would merge
> cleanly into Git for Windows' `main` branch.
> 
> I should have clarified that I was talking about merging that PR into Git
> for Windows before the -rc0 ;-)
> 
> Speaking of which, -rc0 is still coming, right? ;-)
> https://tinyurl.com/gitcal still claims that it was scheduled for
> yesterday.
> 
> > As the primary author of the series, given the cover title matches the
> > title of one step in the series, seems to be Matt, let me queue them
> > under mc/clean-smudge-with-llp64 topic.
> 
> Absolutely. Matt was in the driving seat, I was just reviewing and helping
> here and there, and then shepherding the patch series upstream. It was my
> decision to start upstreaming before merging it into Git for Windows, but
> the plan was all along to get this into Git for Windows v2.34.0 because
> there are Git LFS users using Windows who are eagerly awaiting this fix.
> 
> I am not aware of any other popular platform using the LLP64 data model,
> therefore I do not even think that these patches have to be fast-tracked
> into Git v2.34.0, next cycle would be good enough. Unless you are aware of
> other such platforms that do not rely on the Git for Windows fork, but on
> Git built from your repository?
> 
> Ciao,
> Dscho
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently
  2021-10-29 13:59     ` [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
@ 2021-10-29 22:50       ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 22:50 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> @@ -12,9 +15,19 @@ int cmd__genzeros(int argc, const char **argv)
>  
>  	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
>  
> -	while (count < 0 || count--) {
> -		if (putchar(0) == EOF)
> +	/* Writing out individual NUL bytes is slow... */
> +	while (count < 0)
> +		if (write(1, zeros, ARRAY_SIZE(zeros)) < 0)
>  			return -1;
> +
> +	while (count > 0) {
> +		n = write(1, zeros, count < ARRAY_SIZE(zeros) ?
> +			  count : ARRAY_SIZE(zeros));
> +
> +		if (n < 0)
> +			return -1;
> +
> +		count -= n;
>  	}
>  
>  	return 0;

This round looks OK to me.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms
  2021-10-29 13:59     ` [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-10-29 22:52       ` Junio C Hamano
  2021-11-02 14:35         ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 22:52 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin

"Carlo Marcelo Arenas Belón via GitGitGadget"
<gitgitgadget@gmail.com> writes:

> +test_lazy_prereq SIZE_T_IS_64BIT '
> +	test 8 -eq "$(build_option sizeof-size_t)"
> +'
> +
>  test_lazy_prereq LONG_IS_64BIT '
>  	test 8 -le "$(build_option sizeof-long)"
>  '

In the longer run, LONG_IS_64BIT wants to be renamed to indicate
that it is at least 64-bit long.  LONG_HAS_64BIT, perhaps?

Obviously it can be left outside the scope of this series.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-10-29 13:59     ` [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-10-29 23:00       ` Junio C Hamano
  2021-10-29 23:21         ` Junio C Hamano
  2021-11-02 14:57         ` Johannes Schindelin
  0 siblings, 2 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:00 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Matt Cooper

"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Matt Cooper <vtbassmatt@gmail.com>
>
> The filter system allows for alterations to file contents when they're
> added to the database or workdir. ("Smudge" when moving to the workdir;
> "clean" when moving to the database.) This is used natively to handle CRLF
> to LF conversions. It's also employed by Git-LFS to replace large files
> from the workdir with small tracking files in the repo and vice versa.

Not a huge deal, but make it a habit to spell "working tree" not "workdir",
as someday you'd write end-user facing documentation in our tree ;-).
 
> Git pulls the entire smudged file into memory.

Giving "for what" would be helpful to readers.

    Git reads the entire smudged file into memory to convert it into
    a "clean" form to be used in-core.

> While this is inefficient,
> there's a more insidious problem on some platforms due to inconsistency
> between using unsigned long and size_t for the same type of data (size of
> a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
> size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
> only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
> unsigned long long in order to be 64 bits).
>
> Practically speaking, this means 64-bit Windows users of Git-LFS can't
> handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
> this limitation.
>
> This commit introduces a test exposing the issue; future commits make it
> pass. The test simulates the way Git-LFS works by having a tiny file
> checked into the repository and expanding it to a huge file on checkout.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  t/t1051-large-conversion.sh | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index 8b7640b3ba8..bff86c13208 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -83,4 +83,18 @@ test_expect_success 'ident converts on output' '
>  	test_cmp small.clean large.clean
>  '
>  
> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on output' '
> +	test_commit test small "a small file" &&
> +	test_config filter.makelarge.smudge \
> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
> +	echo "small filter=makelarge" >.gitattributes &&
> +	rm small &&
> +	git checkout -- small &&
> +	size=$(test_file_size small) &&
> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
> +'

Why not exactly 5G, but anything that is at least 5G is OK?

Thanks.

>  test_done

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 6/8] git-compat-util: introduce more size_t helpers
  2021-10-29 13:59     ` [PATCH v3 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-10-29 23:10       ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:10 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +/*
> + * Returns true if the left shift of "a" by "shift" bits will
> + * overflow. The type of "a" must be unsigned.
> + */
> +#define unsigned_left_shift_overflows(a, shift) \
> +    ((shift) < bitsizeof(a) && \
> +     (a) > maximum_unsigned_value_of_type(a) >> (shift))

Cute.  So somebody asks

    if (unsigned_left_shift_overflows(a, 100)

and they get "you are unsafe, regardless of the value of a, you get
an overflow".  Makes perfect sensen.

>  #ifdef __GNUC__
>  #define TYPEOF(x) (__typeof__(x))
>  #else
> @@ -859,6 +867,23 @@ static inline size_t st_sub(size_t a, size_t b)
>  	return a - b;
>  }
>  
> +static inline size_t st_left_shift(size_t a, unsigned shift)
> +{
> +	if (unsigned_left_shift_overflows(a, shift))
> +		die("size_t overflow: %"PRIuMAX" << %u",
> +		    (uintmax_t)a, shift);
> +	return a << shift;
> +}

Makes sense.

> +static inline unsigned long cast_size_t_to_ulong(size_t a)
> +{
> +	if (a != (unsigned long)a)
> +		die("object too large to read on this platform: %"
> +		    PRIuMAX" is cut off to %lu",
> +		    (uintmax_t)a, (unsigned long)a);
> +	return (unsigned long)a;
> +}
> +
>  #ifdef HAVE_ALLOCA_H
>  # include <alloca.h>
>  # define xalloca(size)      (alloca(size))

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 7/8] odb: guard against data loss checking out a huge file
  2021-10-29 13:59     ` [PATCH v3 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
@ 2021-10-29 23:13       ` Junio C Hamano
  0 siblings, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:13 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Matt Cooper

"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/delta.h b/delta.h
> index 2df5fe13d95..8a56ec07992 100644
> --- a/delta.h
> +++ b/delta.h
> @@ -90,15 +90,15 @@ static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
>  					       const unsigned char *top)
>  {
>  	const unsigned char *data = *datap;
> -	unsigned long cmd, size = 0;
> +	size_t cmd, size = 0;
>  	int i = 0;
>  	do {
>  		cmd = *data++;
> -		size |= (cmd & 0x7f) << i;
> +		size |= st_left_shift(cmd & 0x7f, i);
>  		i += 7;
>  	} while (cmd & 0x80 && data < top);
>  	*datap = data;
> -	return size;
> +	return cast_size_t_to_ulong(size);
>  }

OK.  The patch-delta code is reasonably self contained, so the next
step to measure the in-core object size in size_t shouldn't be too
bad, but before everything got size_t aware, we have to have the
"safe casting" somewhere, and I agree this "immediately before
return" is a good place to draw the line.

Nicely done.

>  			if (c > 9)
>  				break;
>  			hdr++;
> -			size = size * 10 + c;
> +			size = st_add(st_mult(size, 10), c);

Nice.

>  	if (oi->sizep)
> -		*oi->sizep = size;
> +		*oi->sizep = cast_size_t_to_ulong(size);

OK.

> @@ -1073,10 +1073,10 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
>  			break;
>  		}
>  		c = buf[used++];
> -		size += (c & 0x7f) << shift;
> +		size = st_add(size, st_left_shift(c & 0x7f, shift));
>  		shift += 7;
>  	}
> -	*sizep = size;
> +	*sizep = cast_size_t_to_ulong(size);

OK.

Looking good.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-10-29 13:59     ` [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-10-29 23:17       ` Junio C Hamano
  2021-11-02 14:59         ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:17 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Matt Cooper

"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +# This clean filter writes down the size of input it receives. By checking against
> +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on input' '
> +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> +	test_config filter.checklarge.clean "wc -c >big.size" &&
> +	echo "big filter=checklarge" >.gitattributes &&
> +	git add big &&
> +	test $(test_file_size big) -eq $(cat big.size)
> +'

I would have expected that the clean filter to be sending the count
to its standard output (to be hashed and made into a blob object),
and the test wuld be doing "git cat-file blob :big" to read the
contents of the raw blob, bypassing the filter system.  But we are
testing with only a single path anyway, use of this single extra
file is OK.

Looking good.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 5/8] odb: teach read_blob_entry to use size_t
  2021-10-29 13:59     ` [PATCH v3 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-10-29 23:17       ` Junio C Hamano
  2021-11-02 15:10         ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:17 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Matt Cooper

"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
>  {
>  	enum object_type type;
> -	void *blob_data = read_object_file(&ce->oid, &type, size);
> +	unsigned long ul;
> +	void *blob_data = read_object_file(&ce->oid, &type, &ul);
>  
> +	*size = ul;

It is a bit curious place to draw the line; we want to make sure
that blob_entry can hold huge data, but in this step we do not mind
read_object_file() is not capable of going full 64-bit?

I guess I'll see soon enough why by reading later steps.  I can see
that for the purpose of making write_entry() aware of the size_t,
this is necessary at the minimum.

Looking good.

>  	if (blob_data) {
>  		if (type == OBJ_BLOB)
>  			return blob_data;
> @@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
>  	int fd, ret, fstat_done = 0;
>  	char *new_blob;
>  	struct strbuf buf = STRBUF_INIT;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>  	size_t newsize = 0;
>  	struct stat st;
> diff --git a/entry.h b/entry.h
> index b8c0e170dc7..61ee8c17604 100644
> --- a/entry.h
> +++ b/entry.h
> @@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
>   */
>  void unlink_entry(const struct cache_entry *ce);
>  
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size);
>  int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
>  void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
>  			   struct stat *st);
> diff --git a/parallel-checkout.c b/parallel-checkout.c
> index 6b1af32bb3d..b6f4a25642e 100644
> --- a/parallel-checkout.c
> +++ b/parallel-checkout.c
> @@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
>  	struct stream_filter *filter;
>  	struct strbuf buf = STRBUF_INIT;
>  	char *blob;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>  
>  	/* Sanity check */
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index bff86c13208..8b23d862600 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
>  
>  # This smudge filter prepends 5GB of zeros to the file it checks out. This
>  # ensures that smudging doesn't mangle large files on 64-bit Windows.
> -test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>  		'files over 4GB convert on output' '
>  	test_commit test small "a small file" &&
>  	test_config filter.makelarge.smudge \

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-10-29 23:00       ` Junio C Hamano
@ 2021-10-29 23:21         ` Junio C Hamano
  2021-11-02 14:56           ` Johannes Schindelin
  2021-11-02 14:57         ` Johannes Schindelin
  1 sibling, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:21 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Matt Cooper

Junio C Hamano <gitster@pobox.com> writes:

>> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
>> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
>> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>> +		'files over 4GB convert on output' '
>> +	test_commit test small "a small file" &&
>> +	test_config filter.makelarge.smudge \
>> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
>> +	echo "small filter=makelarge" >.gitattributes &&
>> +	rm small &&
>> +	git checkout -- small &&
>> +	size=$(test_file_size small) &&
>> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
>> +'
>
> Why not exactly 5G, but anything that is at least 5G is OK?

I know it is more than 5G, thanks to the "&& cat".  THe question was
why aren't we measuring the size of "a small file" so that we can
check against an exact size to be expected.

Thanks.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-29 21:12         ` Johannes Schindelin
@ 2021-10-29 23:25           ` Junio C Hamano
  2021-10-30 15:16           ` Philip Oakley
  1 sibling, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-10-29 23:25 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson

Johannes Schindelin <johannes.schindelin@gmail.com> writes:

>> Speaking of which, -rc0 is still coming, right? ;-)
>> https://tinyurl.com/gitcal still claims that it was scheduled for
>> yesterday.

Was swamped too deeply to even make noises about the delay.  Sorry
about that.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-29 21:12         ` Johannes Schindelin
  2021-10-29 23:25           ` Junio C Hamano
@ 2021-10-30 15:16           ` Philip Oakley
  2021-10-30 17:35             ` Torsten Bögershausen
  1 sibling, 1 reply; 78+ messages in thread
From: Philip Oakley @ 2021-10-30 15:16 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson


On 29/10/2021 22:12, Johannes Schindelin wrote:
> I am not aware of any other popular platform using the LLP64 data model,
> therefore I do not even think that these patches have to be fast-tracked
> into Git v2.34.0, next cycle would be good enough. Unless you are aware of
> other such platforms that do not rely on the Git for Windows fork, but on
> Git built from your repository?

I was under the impression that the original Raspberry Pi also used the
LLP64 model, or similar, and that had started of Torsten (tboegi) on the
extensive early work on this. I was just looking at the zlib parts
following the Git Merge.

Torsten was compiling for Rasbian (gcc (Raspbian 6.3.0-18+rpi1+deb9u1)
6.3.0 20170516)

Philip

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-30 15:16           ` Philip Oakley
@ 2021-10-30 17:35             ` Torsten Bögershausen
  2021-10-30 19:29               ` Philip Oakley
  0 siblings, 1 reply; 78+ messages in thread
From: Torsten Bögershausen @ 2021-10-30 17:35 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Johannes Schindelin, Junio C Hamano,
	Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson

On Sat, Oct 30, 2021 at 04:16:31PM +0100, Philip Oakley wrote:
>
> On 29/10/2021 22:12, Johannes Schindelin wrote:
> > I am not aware of any other popular platform using the LLP64 data model,
> > therefore I do not even think that these patches have to be fast-tracked
> > into Git v2.34.0, next cycle would be good enough. Unless you are aware of
> > other such platforms that do not rely on the Git for Windows fork, but on
> > Git built from your repository?
>
> I was under the impression that the original Raspberry Pi also used the
> LLP64 model, or similar, and that had started of Torsten (tboegi) on the
> extensive early work on this. I was just looking at the zlib parts
> following the Git Merge.
>
> Torsten was compiling for Rasbian (gcc (Raspbian 6.3.0-18+rpi1+deb9u1)
> 6.3.0 20170516)
>
> Philip

The raspi does not use LLP64.
However, the gcc from above did warn about a long - size_t mixup.
Even if both are 32 bit.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-30 17:35             ` Torsten Bögershausen
@ 2021-10-30 19:29               ` Philip Oakley
  0 siblings, 0 replies; 78+ messages in thread
From: Philip Oakley @ 2021-10-30 19:29 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Johannes Schindelin, Junio C Hamano,
	Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson

Hi Torsten,
On 30/10/2021 18:35, Torsten Bögershausen wrote:
> On Sat, Oct 30, 2021 at 04:16:31PM +0100, Philip Oakley wrote:
>> On 29/10/2021 22:12, Johannes Schindelin wrote:
>>> I am not aware of any other popular platform using the LLP64 data model,
>>> therefore I do not even think that these patches have to be fast-tracked
>>> into Git v2.34.0, next cycle would be good enough. Unless you are aware of
>>> other such platforms that do not rely on the Git for Windows fork, but on
>>> Git built from your repository?
>> I was under the impression that the original Raspberry Pi also used the
>> LLP64 model, or similar, and that had started of Torsten (tboegi) on the
>> extensive early work on this. I was just looking at the zlib parts
>> following the Git Merge.
>>
>> Torsten was compiling for Rasbian (gcc (Raspbian 6.3.0-18+rpi1+deb9u1)
>> 6.3.0 20170516)
>>
>> Philip
> The raspi does not use LLP64.
> However, the gcc from above did warn about a long - size_t mixup.
> Even if both are 32 bit.
Thanks for the clarification. Looks like I'd misinterpreted what you'd
said at the time.
--
Philip

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms
  2021-10-29 22:52       ` Junio C Hamano
@ 2021-11-02 14:35         ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 14:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Carlo Marcelo Arenas Belón via GitGitGadget, git,
	Carlo Arenas, brian m. carlson

[-- Attachment #1: Type: text/plain, Size: 794 bytes --]

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Carlo Marcelo Arenas Belón via GitGitGadget"
> <gitgitgadget@gmail.com> writes:
>
> > +test_lazy_prereq SIZE_T_IS_64BIT '
> > +	test 8 -eq "$(build_option sizeof-size_t)"
> > +'
> > +
> >  test_lazy_prereq LONG_IS_64BIT '
> >  	test 8 -le "$(build_option sizeof-long)"
> >  '
>
> In the longer run, LONG_IS_64BIT wants to be renamed to indicate
> that it is at least 64-bit long.  LONG_HAS_64BIT, perhaps?

Or `LONG_AT_LEAST_64BIT`. It does look as if the current users are asking
for that, not for precisely 64-bit.

> Obviously it can be left outside the scope of this series.

Definitely. The patch series already grew from 5 to 8 patches. We really
need to be more conscious of scope here.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-29 18:34     ` [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Junio C Hamano
       [not found]       ` <nycvar.QRO.7.76.6.2110292239170.56@tvgsbejvaqbjf.bet>
@ 2021-11-02 14:41       ` Johannes Schindelin
  1 sibling, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 14:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Carlo Arenas,
	brian m. carlson

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > This patch series came in via the Git for Windows fork
> > [https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
> > before v2.34.0-rc0, therefore I appreciate every careful review you gentle
> > people can spare.
>
> It is way too late for my tree to go in before -rc0,

As I said, the intention was to merge this into Git _for Windows_ before
-rc0. And I did. The non-Windows platforms are much less likely to benefit
from this patch series.

> but the patches in the last round, with the "Changes since v2"
> description below, all sound sensible, including the decision to stop
> here, instead of doing "everything should be either size_t or intmax_t"
> conversion.

Yep, if you want to weep, you can have a look at
https://github.com/git-for-windows/git/pull/2179/files, which tries to do
that `unsigned long` -> `size_t` thing (although admittedly in an as yet
unreviewable shape).

> 7/8 did not apply for me to the tip of 'master', but "am -3" wiggled
> it in.  You may want to double check the results.

The range-diff looks good.

> As the primary author of the series, given the cover title matches
> the title of one step in the series, seems to be Matt, let me queue
> them under mc/clean-smudge-with-llp64 topic.

Yep, this came in via Git for Windows, traditionally I take it upon me
to shepherd those contributions into upstream Git.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-10-29 23:21         ` Junio C Hamano
@ 2021-11-02 14:56           ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 14:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Matt Cooper

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> >> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
> >> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
> >> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> >> +		'files over 4GB convert on output' '
> >> +	test_commit test small "a small file" &&
> >> +	test_config filter.makelarge.smudge \
> >> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
> >> +	echo "small filter=makelarge" >.gitattributes &&
> >> +	rm small &&
> >> +	git checkout -- small &&
> >> +	size=$(test_file_size small) &&
> >> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
> >> +'
> >
> > Why not exactly 5G, but anything that is at least 5G is OK?
>
> I know it is more than 5G, thanks to the "&& cat".  THe question was
> why aren't we measuring the size of "a small file" so that we can
> check against an exact size to be expected.

The main problem solved by this patch series is the fact that by virtue of
using `unsigned long`, Git truncated the contents to less than 4GB. So,
really, what we care about here is that that does not happen anymore.

I will change it to take the extra time to determine `small`'s file size
and add that to 5GB, to avoid confusing readers, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-10-29 23:00       ` Junio C Hamano
  2021-10-29 23:21         ` Junio C Hamano
@ 2021-11-02 14:57         ` Johannes Schindelin
  1 sibling, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 14:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Matt Cooper

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > The filter system allows for alterations to file contents when they're
> > added to the database or workdir. ("Smudge" when moving to the workdir;
> > "clean" when moving to the database.) This is used natively to handle CRLF
> > to LF conversions. It's also employed by Git-LFS to replace large files
> > from the workdir with small tracking files in the repo and vice versa.
>
> Not a huge deal, but make it a habit to spell "working tree" not "workdir",
> as someday you'd write end-user facing documentation in our tree ;-).
>
> > Git pulls the entire smudged file into memory.
>
> Giving "for what" would be helpful to readers.
>
>     Git reads the entire smudged file into memory to convert it into
>     a "clean" form to be used in-core.

I changed both.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-10-29 23:17       ` Junio C Hamano
@ 2021-11-02 14:59         ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 14:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Matt Cooper

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +# This clean filter writes down the size of input it receives. By checking against
> > +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> > +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> > +		'files over 4GB convert on input' '
> > +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> > +	test_config filter.checklarge.clean "wc -c >big.size" &&
> > +	echo "big filter=checklarge" >.gitattributes &&
> > +	git add big &&
> > +	test $(test_file_size big) -eq $(cat big.size)
> > +'
>
> I would have expected that the clean filter to be sending the count
> to its standard output (to be hashed and made into a blob object),
> and the test wuld be doing "git cat-file blob :big" to read the
> contents of the raw blob, bypassing the filter system.

That was exactly what Matt had in his first iteration. But I dislike
unnecessarily spawned processes, they are not "free" on Windows, so I
shortened the design to take this shortcut.

> But we are testing with only a single path anyway, use of this single
> extra file is OK.

Precisely.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 5/8] odb: teach read_blob_entry to use size_t
  2021-10-29 23:17       ` Junio C Hamano
@ 2021-11-02 15:10         ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 15:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Matt Cooper

Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> > +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
> >  {
> >  	enum object_type type;
> > -	void *blob_data = read_object_file(&ce->oid, &type, size);
> > +	unsigned long ul;
> > +	void *blob_data = read_object_file(&ce->oid, &type, &ul);
> >
> > +	*size = ul;
>
> It is a bit curious place to draw the line; we want to make sure
> that blob_entry can hold huge data, but in this step we do not mind
> read_object_file() is not capable of going full 64-bit?

Indeed that is the case. The consideration here is: how large of a patch
series do I want to take this late in the cycle?

Here is the call tree (for full details, look no further than
https://github.com/git-for-windows/git/pull/3487#issuecomment-950727616):

    read_object_file()
        repo_read_object_file()
            read_object_file_extended()
                read_object()
                    oid_object_info_extended()
                        do_oid_object_info_extended()
                            loose_object_info()
                                parse_loose_header_extended()
                            packed_object_info()
                                cache_or_unpack_entry()
                                    unpack_entry()
                                        unpack_object_header()
                                           unpack_object_header_buffer()
                                get_size_from_delta()
                                    get_delta_hdr_size()

All three leaves have code that needs to be adjusted to use `size_t`
instead of `unsigned long`, and all of the other functions in that call
tree need to be adjusted for that. Some of the callers do not even pass an
`unsigned long` pointer around, but instead a pointer to `struct
object_info` (which, you guessed it, also has an `unsigned long` that
should have been a `size_t` from the beginning).

This is too big a change I am willing to work on, let alone accept, this
late in the cycle.

Sure, it would fix the scenario where clean/smudge filters are not even
involved (read: `git add`ing a 5GB file _directly_). But the potential for
bugs to hide!

> I guess I'll see soon enough why by reading later steps.  I can see
> that for the purpose of making write_entry() aware of the size_t,
> this is necessary at the minimum.

I fear that you won't see more about this in the later steps ;-)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms
  2021-10-28 22:38         ` Junio C Hamano
@ 2021-11-02 15:20           ` Johannes Schindelin
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-02 15:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlo Arenas, git, vtbassmatt

Hi Junio,

On Thu, 28 Oct 2021, Junio C Hamano wrote:

> Carlo Arenas <carenas@gmail.com> writes:
>
> >> Since this is clearly copied from `LONG_IS_64BIT`, why the change from
> >> `-le` to `-eq`? It is at least inconsistent to use anything different
> >> here.
> >
> > My assumption is that the check for sizeof(size_t) we have is really
> > about finding the bit width of the platform, and we currently support
> > 2 of them (32-bit and 64-bit), which is why the name I chose was
> > "IS_64BIT" and also why I was strict on it being exactly 8 bytes
> > (considering all platforms git supports have bytes with 8 bits).
> >
> > It can go eitherway IMHO, and your point about being inconsistent
> > (with my lack of explanation in the commit) suggests we should instead
> > use your proposal, do you want me to resend or could adjust them in
> > your tree?
>
> Is LONG_IS_64BIT used to ensure that long is _at least_ 64 bit?  If
> so, perhaps its name may need to be rethought.  On the other hand,
> if it is meant to ensure that long is exactly 64 bit, then it should
> use -eq to compare.

`LONG_IS_64BIT` is used by the `tar` tests to ensure that `long` can
represent file sizes larger than 4GB. So yeah, it is an `_AT_LEAST_`
instead of an `_IS_`.

Not -rc material, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
                       ` (8 preceding siblings ...)
  2021-10-29 18:34     ` [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Junio C Hamano
@ 2021-11-02 15:46     ` Johannes Schindelin via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
                         ` (8 more replies)
  9 siblings, 9 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin

This patch series came in via the Git for Windows fork
[https://github.com/git-for-windows/git/pull/3487], and I intend to merge it
before v2.34.0-rc0, therefore I appreciate every careful review you gentle
people can spare.

The x86_64 variant of Windows uses the LLP64 data model, where the long data
type is 32-bit. This is very different from the LP64 data model used e.g. by
x86_64 Linux, where unsigned long is 64-bit.

Most notably, this means that sizeof(unsigned long) != sizeof(size_t) in
general.

However, since Git was born in the Linux ecosystem, where that inequality
does not hold true, it is understandable that unsigned long is used in many
code locations where size_t should have been used. As a consequence, quite a
few things are broken e.g. on Windows, when it comes to 4GB file contents or
larger.

Using Git LFS [https://git-lfs.github.io/] trying to work around such issues
is one such a broken scenario. You cannot git checkout, say, 5GB files. Huge
files will be truncated to whatever the file size is modulo 4GB (in the case
of a 5GB file, it would be truncated to 1GB).

This patch series primarily fixes the Git LFS scenario, by allowing clean
filters to accept 5GB files, and by allowing smudge filters to produce 5GB
files.

The much larger project to teach Git to use size_t instead of unsigned long
in all the appropriate places is hardly scratched by this patch series.

Side note: The fix for the clean filter included in this series does not
actually affect Git LFS! The reason is that Git LFS marks its filter as
required, and therefore Git streams the file contents to Git LFS via a file
descriptor (which is unaffected by LLP64). A "clean" filter that is not
marked as required, however, lets Git take the code path that is fixed by
this patch series.

Changes since v3:

 * The commit message of the fourth patch no longer talks about a "workdir"
   but about the "working tree".
 * The smudge filter test case is now more precise when verifying the
   result's file size.

Changes since v2:

 * The test cases now use a prereq to avoid running in 32-bit setups (where
   they would be guaranteed to fail).
 * The SIZE_T_IS64BIT prereq specifically does not follow LONG_IS_64BIT, by
   testing for exactly 64 bits.
 * The code comment above unsigned_left_shift_overflows() was fixed.
 * unsigned_left_shift_overflows() now verifies that shift is smaller than
   the bit size of the operand.
 * genzeros now marks its buffer of NULs as const.
 * The error check in the infinite loop genzeros was fixed.

Changes since v1:

 * Removed extraneous "Signed-off-by:" lines from "git-compat-util:
   introduce more size_t helpers".
 * Integrated Carlo's patch to allow genzeros to generate large amounts of
   NULs, even in LLP64 data models.
 * Using test-tool genzeros instead of dd if=/dev/zero, to help HP NonStop
   (which appears to use the LP64 data model and therefore should pass the
   new test cases even without the fixes provided in this patch series).
 * Accelerating genzeros to have performance characteristics similar to dd
   if=/dev/zero instead of being ~50x slower.

Carlo Marcelo Arenas Belón (2):
  test-genzeros: allow more than 2G zeros in Windows
  test-lib: add prerequisite for 64-bit platforms

Johannes Schindelin (2):
  test-tool genzeros: generate large amounts of data more efficiently
  git-compat-util: introduce more size_t helpers

Matt Cooper (4):
  t1051: introduce a smudge filter test for extremely large files
  odb: teach read_blob_entry to use size_t
  odb: guard against data loss checking out a huge file
  clean/smudge: allow clean filters to process extremely large files

 convert.c                   |  2 +-
 delta.h                     |  6 +++---
 entry.c                     |  8 +++++---
 entry.h                     |  2 +-
 git-compat-util.h           | 25 +++++++++++++++++++++++++
 object-file.c               |  6 +++---
 packfile.c                  |  6 +++---
 parallel-checkout.c         |  2 +-
 t/helper/test-genzeros.c    | 21 +++++++++++++++++----
 t/t1051-large-conversion.sh | 26 ++++++++++++++++++++++++++
 t/test-lib.sh               |  4 ++++
 11 files changed, 89 insertions(+), 19 deletions(-)


base-commit: ebf3c04b262aa27fbb97f8a0156c2347fecafafb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1068%2Fdscho%2Fhuge-file-smudge-clean-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1068/dscho/huge-file-smudge-clean-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1068

Range-diff vs v3:

 1:  068f897b973 = 1:  068f897b973 test-genzeros: allow more than 2G zeros in Windows
 2:  05219720014 = 2:  05219720014 test-tool genzeros: generate large amounts of data more efficiently
 3:  489500bb1dc = 3:  489500bb1dc test-lib: add prerequisite for 64-bit platforms
 4:  ce9dfaac9f8 ! 4:  f099b4eaead t1051: introduce a smudge filter test for extremely large files
     @@ Commit message
          t1051: introduce a smudge filter test for extremely large files
      
          The filter system allows for alterations to file contents when they're
     -    added to the database or workdir. ("Smudge" when moving to the workdir;
     -    "clean" when moving to the database.) This is used natively to handle CRLF
     -    to LF conversions. It's also employed by Git-LFS to replace large files
     -    from the workdir with small tracking files in the repo and vice versa.
     +    added to the database or working tree. ("Smudge" when moving to the
     +    working tree; "clean" when moving to the database.) This is used
     +    natively to handle CRLF to LF conversions. It's also employed by Git-LFS
     +    to replace large files from the working tree with small tracking files
     +    in the repo and vice versa.
      
     -    Git pulls the entire smudged file into memory. While this is inefficient,
     -    there's a more insidious problem on some platforms due to inconsistency
     -    between using unsigned long and size_t for the same type of data (size of
     -    a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
     -    size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
     -    only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
     +    Git reads the entire smudged file into memory to convert it into a
     +    "clean" form to be used in-core. While this is inefficient, there's a
     +    more insidious problem on some platforms due to inconsistency between
     +    using unsigned long and size_t for the same type of data (size of a file
     +    in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
     +    size_t is typedef'd to unsigned long. On Windows, however, unsigned long
     +    is only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
          unsigned long long in order to be 64 bits).
      
          Practically speaking, this means 64-bit Windows users of Git-LFS can't
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
      +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
      +		'files over 4GB convert on output' '
      +	test_commit test small "a small file" &&
     ++	small_size=$(test_file_size small) &&
      +	test_config filter.makelarge.smudge \
      +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
      +	echo "small filter=makelarge" >.gitattributes &&
      +	rm small &&
      +	git checkout -- small &&
      +	size=$(test_file_size small) &&
     -+	test "$size" -ge $((5 * 1024 * 1024 * 1024))
     ++	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
      +'
      +
       test_done
 5:  dbef8168bc7 ! 5:  308a8f2a3ad odb: teach read_blob_entry to use size_t
     @@ t/t1051-large-conversion.sh: test_expect_success 'ident converts on output' '
      +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
       		'files over 4GB convert on output' '
       	test_commit test small "a small file" &&
     - 	test_config filter.makelarge.smudge \
     + 	small_size=$(test_file_size small) &&
 6:  18419070c29 = 6:  65bc291b680 git-compat-util: introduce more size_t helpers
 7:  f59c523bcc4 = 7:  7b6655f03f5 odb: guard against data loss checking out a huge file
 8:  acc5591517f ! 8:  41fda423982 clean/smudge: allow clean filters to process extremely large files
     @@ convert.c: static int crlf_to_worktree(const char *src, size_t len, struct strbu
      
       ## t/t1051-large-conversion.sh ##
      @@ t/t1051-large-conversion.sh: test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
     - 	test "$size" -ge $((5 * 1024 * 1024 * 1024))
     + 	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
       '
       
      +# This clean filter writes down the size of input it receives. By checking against

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v4 1/8] test-genzeros: allow more than 2G zeros in Windows
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
@ 2021-11-02 15:46       ` Carlo Marcelo Arenas Belón via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
                         ` (7 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Carlo Marcelo Arenas Belón

From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>

d5cfd142ec (tests: teach the test-tool to generate NUL bytes and
use it, 2019-02-14), add a way to generate zeroes in a portable
way without using /dev/zero (needed by HP NonStop), but uses a
long variable that is limited to 2^31 in Windows.

Use instead a (POSIX/C99) intmax_t that is at least 64bit wide
in 64-bit Windows to use in a future test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index 9532f5bac97..b1197e91a89 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,14 +3,14 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
-	long count;
+	intmax_t count;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
 		return 1;
 	}
 
-	count = argc > 1 ? strtol(argv[1], NULL, 0) : -1L;
+	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
 	while (count < 0 || count--) {
 		if (putchar(0) == EOF)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 2/8] test-tool genzeros: generate large amounts of data more efficiently
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-11-02 15:46       ` Johannes Schindelin via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
                         ` (6 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

In this developer's tests, producing one gigabyte worth of NULs in a
busy loop that writes out individual bytes, unbuffered, took ~27sec.
Writing chunked 256kB buffers instead only took ~0.6sec

This matters because we are about to introduce a pair of test cases that
want to be able to produce 5GB of NULs, and we cannot use `/dev/zero`
because of the HP NonStop platform's lack of support for that device.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/helper/test-genzeros.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-genzeros.c b/t/helper/test-genzeros.c
index b1197e91a89..8ca988d6216 100644
--- a/t/helper/test-genzeros.c
+++ b/t/helper/test-genzeros.c
@@ -3,7 +3,10 @@
 
 int cmd__genzeros(int argc, const char **argv)
 {
+	/* static, so that it is NUL-initialized */
+	static const char zeros[256 * 1024];
 	intmax_t count;
+	ssize_t n;
 
 	if (argc > 2) {
 		fprintf(stderr, "usage: %s [<count>]\n", argv[0]);
@@ -12,9 +15,19 @@ int cmd__genzeros(int argc, const char **argv)
 
 	count = argc > 1 ? strtoimax(argv[1], NULL, 0) : -1;
 
-	while (count < 0 || count--) {
-		if (putchar(0) == EOF)
+	/* Writing out individual NUL bytes is slow... */
+	while (count < 0)
+		if (write(1, zeros, ARRAY_SIZE(zeros)) < 0)
 			return -1;
+
+	while (count > 0) {
+		n = write(1, zeros, count < ARRAY_SIZE(zeros) ?
+			  count : ARRAY_SIZE(zeros));
+
+		if (n < 0)
+			return -1;
+
+		count -= n;
 	}
 
 	return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 3/8] test-lib: add prerequisite for 64-bit platforms
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
@ 2021-11-02 15:46       ` Carlo Marcelo Arenas Belón via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
                         ` (5 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Carlo Marcelo Arenas Belón via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Carlo Marcelo Arenas Belón

From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>

Allow tests that assume a 64-bit `size_t` to be skipped in 32-bit
platforms and regardless of the size of `long`.

This imitates the `LONG_IS_64BIT` prerequisite.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/test-lib.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index adaf03543e8..af1a94c2c20 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1642,6 +1642,10 @@ build_option () {
 	sed -ne "s/^$1: //p"
 }
 
+test_lazy_prereq SIZE_T_IS_64BIT '
+	test 8 -eq "$(build_option sizeof-size_t)"
+'
+
 test_lazy_prereq LONG_IS_64BIT '
 	test 8 -le "$(build_option sizeof-long)"
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 4/8] t1051: introduce a smudge filter test for extremely large files
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
@ 2021-11-02 15:46       ` Matt Cooper via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
                         ` (4 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
added to the database or working tree. ("Smudge" when moving to the
working tree; "clean" when moving to the database.) This is used
natively to handle CRLF to LF conversions. It's also employed by Git-LFS
to replace large files from the working tree with small tracking files
in the repo and vice versa.

Git reads the entire smudged file into memory to convert it into a
"clean" form to be used in-core. While this is inefficient, there's a
more insidious problem on some platforms due to inconsistency between
using unsigned long and size_t for the same type of data (size of a file
in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
size_t is typedef'd to unsigned long. On Windows, however, unsigned long
is only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
unsigned long long in order to be 64 bits).

Practically speaking, this means 64-bit Windows users of Git-LFS can't
handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
this limitation.

This commit introduces a test exposing the issue; future commits make it
pass. The test simulates the way Git-LFS works by having a tiny file
checked into the repository and expanding it to a huge file on checkout.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t1051-large-conversion.sh | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b7640b3ba8..e7f9f0bdc56 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -83,4 +83,19 @@ test_expect_success 'ident converts on output' '
 	test_cmp small.clean large.clean
 '
 
+# This smudge filter prepends 5GB of zeros to the file it checks out. This
+# ensures that smudging doesn't mangle large files on 64-bit Windows.
+test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on output' '
+	test_commit test small "a small file" &&
+	small_size=$(test_file_size small) &&
+	test_config filter.makelarge.smudge \
+		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
+	echo "small filter=makelarge" >.gitattributes &&
+	rm small &&
+	git checkout -- small &&
+	size=$(test_file_size small) &&
+	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 5/8] odb: teach read_blob_entry to use size_t
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
@ 2021-11-02 15:46       ` Matt Cooper via GitGitGadget
  2021-11-02 20:40         ` Torsten Bögershausen
  2021-11-02 15:46       ` [PATCH v4 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
                         ` (3 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

There is mixed use of size_t and unsigned long to deal with sizes in the
codebase. Recall that Windows defines unsigned long as 32 bits even on
64-bit platforms, meaning that converting size_t to unsigned long narrows
the range. This mostly doesn't cause a problem since Git rarely deals
with files larger than 2^32 bytes.

But adjunct systems such as Git LFS, which use smudge/clean filters to
keep huge files out of the repository, may have huge file contents passed
through some of the functions in entry.c and convert.c. On Windows, this
results in a truncated file being written to the workdir. I traced this to
one specific use of unsigned long in write_entry (and a similar instance
in write_pc_item_to_fd for parallel checkout). That appeared to be for
the call to read_blob_entry, which expects a pointer to unsigned long.

By altering the signature of read_blob_entry to expect a size_t,
write_entry can be switched to use size_t internally (which all of its
callers and most of its callees already used). To avoid touching dozens of
additional files, read_blob_entry uses a local unsigned long to call a
chain of functions which aren't prepared to accept size_t.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 entry.c                     | 8 +++++---
 entry.h                     | 2 +-
 parallel-checkout.c         | 2 +-
 t/t1051-large-conversion.sh | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/entry.c b/entry.c
index 711ee0693c7..4cb3942dbdc 100644
--- a/entry.c
+++ b/entry.c
@@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
 	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
 }
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
+void *read_blob_entry(const struct cache_entry *ce, size_t *size)
 {
 	enum object_type type;
-	void *blob_data = read_object_file(&ce->oid, &type, size);
+	unsigned long ul;
+	void *blob_data = read_object_file(&ce->oid, &type, &ul);
 
+	*size = ul;
 	if (blob_data) {
 		if (type == OBJ_BLOB)
 			return blob_data;
@@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
 	int fd, ret, fstat_done = 0;
 	char *new_blob;
 	struct strbuf buf = STRBUF_INIT;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 	size_t newsize = 0;
 	struct stat st;
diff --git a/entry.h b/entry.h
index b8c0e170dc7..61ee8c17604 100644
--- a/entry.h
+++ b/entry.h
@@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
  */
 void unlink_entry(const struct cache_entry *ce);
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
+void *read_blob_entry(const struct cache_entry *ce, size_t *size);
 int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
 void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
 			   struct stat *st);
diff --git a/parallel-checkout.c b/parallel-checkout.c
index 6b1af32bb3d..b6f4a25642e 100644
--- a/parallel-checkout.c
+++ b/parallel-checkout.c
@@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
 	struct stream_filter *filter;
 	struct strbuf buf = STRBUF_INIT;
 	char *blob;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 
 	/* Sanity check */
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index e7f9f0bdc56..e6d52f98b15 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 		'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	small_size=$(test_file_size small) &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 6/8] git-compat-util: introduce more size_t helpers
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-11-02 15:46       ` Johannes Schindelin via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
                         ` (2 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

We will use them in the next commit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a35..1f41e5611a1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -113,6 +113,14 @@
 #define unsigned_mult_overflows(a, b) \
     ((a) && (b) > maximum_unsigned_value_of_type(a) / (a))
 
+/*
+ * Returns true if the left shift of "a" by "shift" bits will
+ * overflow. The type of "a" must be unsigned.
+ */
+#define unsigned_left_shift_overflows(a, shift) \
+    ((shift) < bitsizeof(a) && \
+     (a) > maximum_unsigned_value_of_type(a) >> (shift))
+
 #ifdef __GNUC__
 #define TYPEOF(x) (__typeof__(x))
 #else
@@ -859,6 +867,23 @@ static inline size_t st_sub(size_t a, size_t b)
 	return a - b;
 }
 
+static inline size_t st_left_shift(size_t a, unsigned shift)
+{
+	if (unsigned_left_shift_overflows(a, shift))
+		die("size_t overflow: %"PRIuMAX" << %u",
+		    (uintmax_t)a, shift);
+	return a << shift;
+}
+
+static inline unsigned long cast_size_t_to_ulong(size_t a)
+{
+	if (a != (unsigned long)a)
+		die("object too large to read on this platform: %"
+		    PRIuMAX" is cut off to %lu",
+		    (uintmax_t)a, (unsigned long)a);
+	return (unsigned long)a;
+}
+
 #ifdef HAVE_ALLOCA_H
 # include <alloca.h>
 # define xalloca(size)      (alloca(size))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 7/8] odb: guard against data loss checking out a huge file
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
@ 2021-11-02 15:46       ` Matt Cooper via GitGitGadget
  2021-11-02 15:46       ` [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
  2021-11-02 21:46       ` [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Torsten Bögershausen
  8 siblings, 0 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

This introduces an additional guard for platforms where `unsigned long`
and `size_t` are not of the same size. If the size of an object in the
database would overflow `unsigned long`, instead we now exit with an
error.

A complete fix will have to update _many_ other functions throughout the
codebase to use `size_t` instead of `unsigned long`. It will have to be
implemented at some stage.

This commit puts in a stop-gap for the time being.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 delta.h       | 6 +++---
 object-file.c | 6 +++---
 packfile.c    | 6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/delta.h b/delta.h
index 2df5fe13d95..8a56ec07992 100644
--- a/delta.h
+++ b/delta.h
@@ -90,15 +90,15 @@ static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
 					       const unsigned char *top)
 {
 	const unsigned char *data = *datap;
-	unsigned long cmd, size = 0;
+	size_t cmd, size = 0;
 	int i = 0;
 	do {
 		cmd = *data++;
-		size |= (cmd & 0x7f) << i;
+		size |= st_left_shift(cmd & 0x7f, i);
 		i += 7;
 	} while (cmd & 0x80 && data < top);
 	*datap = data;
-	return size;
+	return cast_size_t_to_ulong(size);
 }
 
 #endif
diff --git a/object-file.c b/object-file.c
index f233b440b22..70e456fc2a3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1344,7 +1344,7 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 				       unsigned int flags)
 {
 	const char *type_buf = hdr;
-	unsigned long size;
+	size_t size;
 	int type, type_len = 0;
 
 	/*
@@ -1388,12 +1388,12 @@ static int parse_loose_header_extended(const char *hdr, struct object_info *oi,
 			if (c > 9)
 				break;
 			hdr++;
-			size = size * 10 + c;
+			size = st_add(st_mult(size, 10), c);
 		}
 	}
 
 	if (oi->sizep)
-		*oi->sizep = size;
+		*oi->sizep = cast_size_t_to_ulong(size);
 
 	/*
 	 * The length must be followed by a zero byte
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..3ccea004396 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1059,7 +1059,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		unsigned long len, enum object_type *type, unsigned long *sizep)
 {
 	unsigned shift;
-	unsigned long size, c;
+	size_t size, c;
 	unsigned long used = 0;
 
 	c = buf[used++];
@@ -1073,10 +1073,10 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 			break;
 		}
 		c = buf[used++];
-		size += (c & 0x7f) << shift;
+		size = st_add(size, st_left_shift(c & 0x7f, shift));
 		shift += 7;
 	}
-	*sizep = size;
+	*sizep = cast_size_t_to_ulong(size);
 	return used;
 }
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (6 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
@ 2021-11-02 15:46       ` Matt Cooper via GitGitGadget
  2021-11-02 20:47         ` Torsten Bögershausen
  2021-11-04 17:26         ` Junio C Hamano
  2021-11-02 21:46       ` [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Torsten Bögershausen
  8 siblings, 2 replies; 78+ messages in thread
From: Matt Cooper via GitGitGadget @ 2021-11-02 15:46 UTC (permalink / raw)
  To: git
  Cc: Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Matt Cooper

From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index e6d52f98b15..042b0e44292 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -98,4 +98,15 @@ test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on input' '
+	test-tool genzeros $((5*1024*1024*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 5/8] odb: teach read_blob_entry to use size_t
  2021-11-02 15:46       ` [PATCH v4 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
@ 2021-11-02 20:40         ` Torsten Bögershausen
  2021-11-04  0:09           ` Johannes Schindelin
  0 siblings, 1 reply; 78+ messages in thread
From: Torsten Bögershausen @ 2021-11-02 20:40 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Johannes Schindelin, Matt Cooper

Nicely explained, some comments inline

On Tue, Nov 02, 2021 at 03:46:08PM +0000, Matt Cooper via GitGitGadget wrote:
> From: Matt Cooper <vtbassmatt@gmail.com>
>
> There is mixed use of size_t and unsigned long to deal with sizes in the
> codebase. Recall that Windows defines unsigned long as 32 bits even on
> 64-bit platforms, meaning that converting size_t to unsigned long narrows
> the range. This mostly doesn't cause a problem since Git rarely deals
> with files larger than 2^32 bytes.

What does this mean ?
> ... This mostly doesn't cause a problem since Git rarely deals
> with files larger than 2^32 bytes.

Is "mostly" is a good wording here ?
May be
This doesn't cause a problem when files smaller than 2^32 bytes are handled by Git.

>
> But adjunct systems such as Git LFS, which use smudge/clean filters to
> keep huge files out of the repository, may have huge file contents passed
> through some of the functions in entry.c and convert.c. On Windows, this
> results in a truncated file being written to the workdir. I traced this to
> one specific use of unsigned long in write_entry (and a similar instance
> in write_pc_item_to_fd for parallel checkout). That appeared to be for
> the call to read_blob_entry, which expects a pointer to unsigned long.
>
> By altering the signature of read_blob_entry to expect a size_t,

"expect" -> "use"

(I am not a native English speaker, would "changing" be better than "altering" ?)
 By changing the signature of read_blob_entry to use size_t,



> write_entry can be switched to use size_t internally (which all of its
> callers and most of its callees already used). To avoid touching dozens of
> additional files, read_blob_entry uses a local unsigned long to call a
> chain of functions which aren't prepared to accept size_t.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  entry.c                     | 8 +++++---
>  entry.h                     | 2 +-
>  parallel-checkout.c         | 2 +-
>  t/t1051-large-conversion.sh | 2 +-
>  4 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/entry.c b/entry.c
> index 711ee0693c7..4cb3942dbdc 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
>  	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
>  }
>
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
>  {
>  	enum object_type type;
> -	void *blob_data = read_object_file(&ce->oid, &type, size);
> +	unsigned long ul;
> +	void *blob_data = read_object_file(&ce->oid, &type, &ul);
>
> +	*size = ul;
>  	if (blob_data) {
>  		if (type == OBJ_BLOB)
>  			return blob_data;
> @@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
>  	int fd, ret, fstat_done = 0;
>  	char *new_blob;
>  	struct strbuf buf = STRBUF_INIT;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>  	size_t newsize = 0;
>  	struct stat st;
> diff --git a/entry.h b/entry.h
> index b8c0e170dc7..61ee8c17604 100644
> --- a/entry.h
> +++ b/entry.h
> @@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
>   */
>  void unlink_entry(const struct cache_entry *ce);
>
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size);
>  int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
>  void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
>  			   struct stat *st);
> diff --git a/parallel-checkout.c b/parallel-checkout.c
> index 6b1af32bb3d..b6f4a25642e 100644
> --- a/parallel-checkout.c
> +++ b/parallel-checkout.c
> @@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
>  	struct stream_filter *filter;
>  	struct strbuf buf = STRBUF_INIT;
>  	char *blob;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>
>  	/* Sanity check */
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index e7f9f0bdc56..e6d52f98b15 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
>
>  # This smudge filter prepends 5GB of zeros to the file it checks out. This
>  # ensures that smudging doesn't mangle large files on 64-bit Windows.
> -test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>  		'files over 4GB convert on output' '
>  	test_commit test small "a small file" &&
>  	small_size=$(test_file_size small) &&
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-11-02 15:46       ` [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-11-02 20:47         ` Torsten Bögershausen
  2021-11-04  0:11           ` Johannes Schindelin
  2021-11-04 17:26         ` Junio C Hamano
  1 sibling, 1 reply; 78+ messages in thread
From: Torsten Bögershausen @ 2021-11-02 20:47 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Johannes Schindelin, Matt Cooper

On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> From: Matt Cooper <vtbassmatt@gmail.com>
>
> The filter system allows for alterations to file contents when they're

Some nit-picking:
looking at
https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
we can read
"...substitutions in files on commit/checkout."

Should we use this wording here as well ?


> moved between the database and the worktree. We already made sure that
> it is possible for smudge filters to produce contents that are larger
> than `unsigned long` can represent (which matters on systems where
> `unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
> Now we make sure that clean filters can _consume_ contents that are
> larger than that.
>
> Note that this commit only allows clean filters' _input_ to be larger
> than can be represented by `unsigned long`.
>
> This change makes only a very minute dent into the much larger project
> to teach Git to use `size_t` instead of `unsigned long` wherever
> appropriate.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  convert.c                   |  2 +-
>  t/t1051-large-conversion.sh | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/convert.c b/convert.c
> index fd9c84b0257..5ad6dfc08a0 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
>
>  struct filter_params {
>  	const char *src;
> -	unsigned long size;
> +	size_t size;
>  	int fd;
>  	const char *cmd;
>  	const char *path;
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index e6d52f98b15..042b0e44292 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -98,4 +98,15 @@ test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>  	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
>  '
>
> +# This clean filter writes down the size of input it receives. By checking against
> +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on input' '
> +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> +	test_config filter.checklarge.clean "wc -c >big.size" &&
> +	echo "big filter=checklarge" >.gitattributes &&
> +	git add big &&
> +	test $(test_file_size big) -eq $(cat big.size)
> +'
> +
>  test_done
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
                         ` (7 preceding siblings ...)
  2021-11-02 15:46       ` [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
@ 2021-11-02 21:46       ` Torsten Bögershausen
  2021-11-03  6:31         ` Johannes Sixt
  8 siblings, 1 reply; 78+ messages in thread
From: Torsten Bögershausen @ 2021-11-02 21:46 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Johannes Schindelin

On Tue, Nov 02, 2021 at 03:46:03PM +0000, Johannes Schindelin via GitGitGadget wrote:

I could not convince my raspi to apply patch 7/8:

  git am </tmp/7
  Applying: odb: guard against data loss checking out a huge file
  error: patch failed: object-file.c:1344
  error: object-file.c: patch does not apply
  Patch failed at 0001 odb: guard against data loss checking out a huge file
  hint: Use 'git am --show-current-patch=diff' to see the failed patch
  When you have resolved this problem, run "git am --continue".
  If you prefer to skip this patch, run "git am --skip" instead.
  To restore the original branch and stop patching, run "git am --abort".

I am not sure, what went wrong. I can retry the next days - or is this
branch/series somewhere available ?

But beside this, up to patch 6/8 it compiled without warnings.
And the series looks good so far.
Some minor nits had been found and reported for 2 of the patches/commit messages.

[snip]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model
  2021-11-02 21:46       ` [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Torsten Bögershausen
@ 2021-11-03  6:31         ` Johannes Sixt
  0 siblings, 0 replies; 78+ messages in thread
From: Johannes Sixt @ 2021-11-03  6:31 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Johannes Schindelin,
	Johannes Schindelin via GitGitGadget

Am 02.11.21 um 22:46 schrieb Torsten Bögershausen:
> On Tue, Nov 02, 2021 at 03:46:03PM +0000, Johannes Schindelin via GitGitGadget wrote:
> 
> I could not convince my raspi to apply patch 7/8:
> 
>   git am </tmp/7
>   Applying: odb: guard against data loss checking out a huge file
>   error: patch failed: object-file.c:1344
>   error: object-file.c: patch does not apply
>   Patch failed at 0001 odb: guard against data loss checking out a huge file
>   hint: Use 'git am --show-current-patch=diff' to see the failed patch
>   When you have resolved this problem, run "git am --continue".
>   If you prefer to skip this patch, run "git am --skip" instead.
>   To restore the original branch and stop patching, run "git am --abort".
> 
> I am not sure, what went wrong. I can retry the next days - or is this
> branch/series somewhere available ?

This was submitted via Gitgitgadget. Therefore, the cover letter
that you are responding to has these footers:

> base-commit: ebf3c04b262aa27fbb97f8a0156c2347fecafafb
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1068%2Fdscho%2Fhuge-file-smudge-clean-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1068/dscho/huge-file-smudge-clean-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1068

-- Hannes

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 5/8] odb: teach read_blob_entry to use size_t
  2021-11-02 20:40         ` Torsten Bögershausen
@ 2021-11-04  0:09           ` Johannes Schindelin
  2021-11-04 12:24             ` Philip Oakley
  0 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-04  0:09 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Johannes Schindelin, Philip Oakley, Matt Cooper

[-- Attachment #1: Type: text/plain, Size: 2688 bytes --]

Hi Torsten,

On Tue, 2 Nov 2021, Torsten Bögershausen wrote:

> On Tue, Nov 02, 2021 at 03:46:08PM +0000, Matt Cooper via GitGitGadget wrote:
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > There is mixed use of size_t and unsigned long to deal with sizes in the
> > codebase. Recall that Windows defines unsigned long as 32 bits even on
> > 64-bit platforms, meaning that converting size_t to unsigned long narrows
> > the range. This mostly doesn't cause a problem since Git rarely deals
> > with files larger than 2^32 bytes.
>
> What does this mean ?

I found the explanation to be quite clear (otherwise I would have changed
it before submitting, obviously).

Git is rarely used with large files, as it was meant to track source code
files, and it is pretty rare that a source code file is larger than 4GB.
Therefore, the described issues are edge cases rather than common ones.

> > ... This mostly doesn't cause a problem since Git rarely deals
> > with files larger than 2^32 bytes.
>
> Is "mostly" is a good wording here ?

I do think so.

> May be

s/May be/Maybe/ since you're nitpicking wording ;-)

> This doesn't cause a problem when files smaller than 2^32 bytes are handled by Git.

That would lose the rather important fact that it is common to encounter
only tracked files that are much smaller than 4GB.

> > But adjunct systems such as Git LFS, which use smudge/clean filters to
> > keep huge files out of the repository, may have huge file contents passed
> > through some of the functions in entry.c and convert.c. On Windows, this
> > results in a truncated file being written to the workdir. I traced this to
> > one specific use of unsigned long in write_entry (and a similar instance
> > in write_pc_item_to_fd for parallel checkout). That appeared to be for
> > the call to read_blob_entry, which expects a pointer to unsigned long.
> >
> > By altering the signature of read_blob_entry to expect a size_t,
>
> "expect" -> "use"

I would say that in the context of talking about a signature, "expect" is
a better verb than "use".

But then, just like you I am not a native speaker, so I think we should
maybe stop telling a native speaker like Matt how to use his native
tongue...

> (I am not a native English speaker, would "changing" be better than "altering" ?)
>  By changing the signature of read_blob_entry to use size_t,

As I said, I am not a native English speaker, either. So I believe that
Matt knows better than the two of us together how to phrase things in
English.

Since you had nothing to say about the patch itself, may I assume that
you're fine with it?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-11-02 20:47         ` Torsten Bögershausen
@ 2021-11-04  0:11           ` Johannes Schindelin
  2021-11-04  8:33             ` Torsten Bögershausen
  0 siblings, 1 reply; 78+ messages in thread
From: Johannes Schindelin @ 2021-11-04  0:11 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Johannes Schindelin, Philip Oakley, Matt Cooper

[-- Attachment #1: Type: text/plain, Size: 883 bytes --]

Hi Torsten,

On Tue, 2 Nov 2021, Torsten Bögershausen wrote:

> On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > The filter system allows for alterations to file contents when they're
>
> Some nit-picking:
> looking at
> https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
> we can read
> "...substitutions in files on commit/checkout."
>
> Should we use this wording here as well ?

Again, I believe that Matt's command of the English language is pretty
good (but then, I have the advantage of knowing him and I very much enjoy
learning new English words while chatting with him). I would therefore
chalk it up to artistic license when he uses the word "alterations".

Since you did not comment on the patch, may I assume that you find it
flawless?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-11-04  0:11           ` Johannes Schindelin
@ 2021-11-04  8:33             ` Torsten Bögershausen
  0 siblings, 0 replies; 78+ messages in thread
From: Torsten Bögershausen @ 2021-11-04  8:33 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Johannes Schindelin, Philip Oakley, Matt Cooper

On Thu, Nov 04, 2021 at 01:11:44AM +0100, Johannes Schindelin wrote:
> Hi Torsten,
>
> On Tue, 2 Nov 2021, Torsten Bögershausen wrote:
>
> > On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> > > From: Matt Cooper <vtbassmatt@gmail.com>
> > >
> > > The filter system allows for alterations to file contents when they're
> >
> > Some nit-picking:
> > looking at
> > https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
> > we can read
> > "...substitutions in files on commit/checkout."
> >
> > Should we use this wording here as well ?
>
> Again, I believe that Matt's command of the English language is pretty
> good (but then, I have the advantage of knowing him and I very much enjoy
> learning new English words while chatting with him). I would therefore
> chalk it up to artistic license when he uses the word "alterations".

That was not really what my comment was about.
We have exising documentations about Git at other places, and my question
was if we can/should/could use the same terminolgy here in the commit message
as well.
This could make it easier for readers, if the same words are used for the same
thing.

>
> Since you did not comment on the patch, may I assume that you find it
> flawless?

I could not find any flaws.

>
> Ciao,
> Dscho


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 5/8] odb: teach read_blob_entry to use size_t
  2021-11-04  0:09           ` Johannes Schindelin
@ 2021-11-04 12:24             ` Philip Oakley
  0 siblings, 0 replies; 78+ messages in thread
From: Philip Oakley @ 2021-11-04 12:24 UTC (permalink / raw)
  To: Johannes Schindelin, Torsten Bögershausen
  Cc: Matt Cooper via GitGitGadget, git, Carlo Arenas, brian m. carlson,
	Johannes Schindelin, Matt Cooper

On 04/11/2021 00:09, Johannes Schindelin wrote:
> I would say that in the context of talking about a signature, "expect" is
> a better verb than "use".
>
> But then, just like you I am not a native speaker, so I think we should
> maybe stop telling a native speaker like Matt how to use his native
> tongue...
>
>> (I am not a native English speaker, would "changing" be better than "altering" ?)
>>  By changing the signature of read_blob_entry to use size_t,
> As I said, I am not a native English speaker, either. So I believe that
> Matt knows better than the two of us together how to phrase things in
> English.
Being a native speaker can be a bit of a Catch 22, especially in
English, as we can use unusual idioms and find them normal, and often
weren't taught 'grammar'.

It is important that the text is understandable to those for whom
English isn't their first language. At least we don't have to use
simplified English [e.g. 1,2]

Philip

[1] https://en.wikipedia.org/wiki/Simplified_Technical_English
[2]
https://www.boeing.com/company/key-orgs/licensing/simplified-english-checker.page
(not actually used;-)



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files
  2021-11-02 15:46       ` [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
  2021-11-02 20:47         ` Torsten Bögershausen
@ 2021-11-04 17:26         ` Junio C Hamano
  1 sibling, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2021-11-04 17:26 UTC (permalink / raw)
  To: Matt Cooper via GitGitGadget
  Cc: git, Carlo Arenas, brian m. carlson, Johannes Schindelin,
	Philip Oakley, Torsten Bögershausen, Johannes Schindelin,
	Matt Cooper

"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  struct filter_params {
>  	const char *src;
> -	unsigned long size;
> +	size_t size;
>  	int fd;
>  	const char *cmd;
>  	const char *path;

OK, this member is used in only two places in the file.  One is used
as a parameter to write_in_full() that is prepared to take size_t.
The other is to assign to this member from a size_t parameter the
apply_single_file_filter() function got from the caller, and the
callchain leading down to the function are size_t aware.

So, this may only make "a small dent" as described in the proposed
log message, but it does move it in the right direction.

Thanks.

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2021-11-04 17:27 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-27  7:49 [PATCH 0/5] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
2021-10-27  7:49 ` [PATCH 1/5] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
2021-10-28  7:15   ` Carlo Arenas
2021-10-28  8:54     ` [PATCH] helper/test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón
2021-10-28 20:32       ` Johannes Schindelin
2021-10-27  7:49 ` [PATCH 2/5] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
2021-10-27  7:49 ` [PATCH 3/5] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
2021-10-27  7:49 ` [PATCH 4/5] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
2021-10-27  7:49 ` [PATCH 5/5] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
2021-10-28 20:50 ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model Johannes Schindelin via GitGitGadget
2021-10-28 20:50   ` [PATCH v2 1/7] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
2021-10-28 20:50   ` [PATCH v2 2/7] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
2021-10-28 22:55     ` Junio C Hamano
2021-10-28 20:50   ` [PATCH v2 3/7] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
2021-10-28 20:50   ` [PATCH v2 4/7] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
2021-10-28 22:14     ` Carlo Arenas
2021-10-28 22:21       ` Johannes Schindelin
2021-10-28 20:50   ` [PATCH v2 5/7] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
2021-10-28 23:05     ` Junio C Hamano
2021-10-28 20:50   ` [PATCH v2 6/7] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
2021-10-28 20:50   ` [PATCH v2 7/7] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
2021-10-28 22:32   ` [PATCH v2 0/7] Allow clean/smudge filters to handle huge files in the LLP64 data model brian m. carlson
2021-10-28 23:07     ` Junio C Hamano
2021-10-29 13:59   ` [PATCH v3 0/8] " Johannes Schindelin via GitGitGadget
2021-10-29 13:59     ` [PATCH v3 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
2021-10-29 13:59     ` [PATCH v3 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
2021-10-29 22:50       ` Junio C Hamano
2021-10-29 13:59     ` [PATCH v3 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
2021-10-29 22:52       ` Junio C Hamano
2021-11-02 14:35         ` Johannes Schindelin
2021-10-29 13:59     ` [PATCH v3 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
2021-10-29 23:00       ` Junio C Hamano
2021-10-29 23:21         ` Junio C Hamano
2021-11-02 14:56           ` Johannes Schindelin
2021-11-02 14:57         ` Johannes Schindelin
2021-10-29 13:59     ` [PATCH v3 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
2021-10-29 23:17       ` Junio C Hamano
2021-11-02 15:10         ` Johannes Schindelin
2021-10-29 13:59     ` [PATCH v3 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
2021-10-29 23:10       ` Junio C Hamano
2021-10-29 13:59     ` [PATCH v3 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
2021-10-29 23:13       ` Junio C Hamano
2021-10-29 13:59     ` [PATCH v3 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
2021-10-29 23:17       ` Junio C Hamano
2021-11-02 14:59         ` Johannes Schindelin
2021-10-29 18:34     ` [PATCH v3 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Junio C Hamano
     [not found]       ` <nycvar.QRO.7.76.6.2110292239170.56@tvgsbejvaqbjf.bet>
2021-10-29 21:12         ` Johannes Schindelin
2021-10-29 23:25           ` Junio C Hamano
2021-10-30 15:16           ` Philip Oakley
2021-10-30 17:35             ` Torsten Bögershausen
2021-10-30 19:29               ` Philip Oakley
2021-11-02 14:41       ` Johannes Schindelin
2021-11-02 15:46     ` [PATCH v4 " Johannes Schindelin via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 1/8] test-genzeros: allow more than 2G zeros in Windows Carlo Marcelo Arenas Belón via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 2/8] test-tool genzeros: generate large amounts of data more efficiently Johannes Schindelin via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 3/8] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 4/8] t1051: introduce a smudge filter test for extremely large files Matt Cooper via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 5/8] odb: teach read_blob_entry to use size_t Matt Cooper via GitGitGadget
2021-11-02 20:40         ` Torsten Bögershausen
2021-11-04  0:09           ` Johannes Schindelin
2021-11-04 12:24             ` Philip Oakley
2021-11-02 15:46       ` [PATCH v4 6/8] git-compat-util: introduce more size_t helpers Johannes Schindelin via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 7/8] odb: guard against data loss checking out a huge file Matt Cooper via GitGitGadget
2021-11-02 15:46       ` [PATCH v4 8/8] clean/smudge: allow clean filters to process extremely large files Matt Cooper via GitGitGadget
2021-11-02 20:47         ` Torsten Bögershausen
2021-11-04  0:11           ` Johannes Schindelin
2021-11-04  8:33             ` Torsten Bögershausen
2021-11-04 17:26         ` Junio C Hamano
2021-11-02 21:46       ` [PATCH v4 0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model Torsten Bögershausen
2021-11-03  6:31         ` Johannes Sixt
2021-10-28 20:56 ` [PATCH 0/3] " Carlo Marcelo Arenas Belón
2021-10-28 20:56   ` [PATCH 1/3] test-lib: add prerequisite for 64-bit platforms Carlo Marcelo Arenas Belón
2021-10-28 21:45     ` Johannes Schindelin
2021-10-28 22:09       ` Carlo Arenas
2021-10-28 22:38         ` Junio C Hamano
2021-11-02 15:20           ` Johannes Schindelin
2021-10-28 20:56   ` [PATCH 2/3] fixup! t1051: introduce a smudge filter test for extremely large files Carlo Marcelo Arenas Belón
2021-10-28 20:56   ` [PATCH 3/3] fixup! clean/smudge: allow clean filters to process " Carlo Marcelo Arenas Belón

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).