git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t
@ 2018-10-18 19:11 tboegi
  2018-10-19  1:58 ` Junio C Hamano
  2018-10-23 16:11 ` [PATCH/RFC v2 1/1] Use off_t instead of size_t for functions dealing with streamed checkin tboegi
  0 siblings, 2 replies; 3+ messages in thread
From: tboegi @ 2018-10-18 19:11 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When streaming data from disk into a blob, use off_t instead of
size_t, which is a better choice for file length.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

This is based on an old patch from 2017, which never made it to the list.
I think it make sense to have off_t/size_t more consistent,
reviews/comments are welcome.

bulk-checkin.c | 4 ++--
 bulk-checkin.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 409ecb566b..2631e82d6c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -189,7 +189,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
 
 static int deflate_to_pack(struct bulk_checkin_state *state,
 			   struct object_id *result_oid,
-			   int fd, size_t size,
+			   int fd, off_t size,
 			   enum object_type type, const char *path,
 			   unsigned flags)
 {
@@ -258,7 +258,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 }
 
 int index_bulk_checkin(struct object_id *oid,
-		       int fd, size_t size, enum object_type type,
+		       int fd, off_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
 	int status = deflate_to_pack(&state, oid, fd, size, type,
diff --git a/bulk-checkin.h b/bulk-checkin.h
index f438f93811..09b2affdf3 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -7,7 +7,7 @@
 #include "cache.h"
 
 extern int index_bulk_checkin(struct object_id *oid,
-			      int fd, size_t size, enum object_type type,
+			      int fd, off_t size, enum object_type type,
 			      const char *path, unsigned flags);
 
 extern void plug_bulk_checkin(void);
-- 
2.19.0.271.gfe8321ec05


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t
  2018-10-18 19:11 [PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t tboegi
@ 2018-10-19  1:58 ` Junio C Hamano
  2018-10-23 16:11 ` [PATCH/RFC v2 1/1] Use off_t instead of size_t for functions dealing with streamed checkin tboegi
  1 sibling, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2018-10-19  1:58 UTC (permalink / raw)
  To: tboegi; +Cc: git

tboegi@web.de writes:

> bulk-checkin.c | 4 ++--
>  bulk-checkin.h | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)

If you lost SP in your editor, then it is OK but if format-patch
lost it for some reason, plasee tell me as we need to find the bug.

>
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 409ecb566b..2631e82d6c 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -189,7 +189,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
>  
>  static int deflate_to_pack(struct bulk_checkin_state *state,
>  			   struct object_id *result_oid,
> -			   int fd, size_t size,
> +			   int fd, off_t size,

The size is once casted to uintmax_t for recording in the object
header (which is fine), and then passed to stream_to_pack(), which
still takes, and more importantly, does comparisons and chunking in,
size_t after this patch.  Without xsize_t() around size passed in
the call to stream_to_pack(), you may silently be truncating off_t
down to size_t in this function.

> @@ -258,7 +258,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
>  }
>  
>  int index_bulk_checkin(struct object_id *oid,
> -		       int fd, size_t size, enum object_type type,
> +		       int fd, off_t size, enum object_type type,
>  		       const char *path, unsigned flags)
>  {
>  	int status = deflate_to_pack(&state, oid, fd, size, type,

This one is a thin wrapper around deflate_to_pack() above.

Its sole caller is sha1-file.c::index_stream() and takes size_t from
its callers, and passes size_t to index_bulk_checkin().

The sole caller of index_stream(), sha1-file.c::index_fd(), wants to
pass st->st_size, and it uses xsize_t() because index_stream() and
callchain underneath currently take size_t.  You want that callchain
to take off_t with this patch.

The whole purpose of stream_to_pack() is to take potentially large
input from the file on the filesystem, chop that into manageable
chunks and feed the underlying hashing and deflating machinery that
takes possibly smaller integer types to represent the sizes of data
they take in a single call, so once that function is taught to take
ofs_t I think you can say you converted the entire callchain from
index_fd() down.

So, this looks like a good starting step, but I suspect it needs one
more level of digging for it to become truly useful.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH/RFC v2 1/1] Use off_t instead of size_t for functions dealing with streamed checkin
  2018-10-18 19:11 [PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t tboegi
  2018-10-19  1:58 ` Junio C Hamano
@ 2018-10-23 16:11 ` tboegi
  1 sibling, 0 replies; 3+ messages in thread
From: tboegi @ 2018-10-23 16:11 UTC (permalink / raw)
  To: git; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When streaming data from disk into a blob, it should be possible to commit
a file with a file size > 4 GiB using the streaming functionality in Git.
Because of the streaming there is no need to load the whole data into
memory at once.
Today this is not possible on e.g. a 32 bit Linux system.
There is no good reason to limit the length of the file by using a size_t
in the code, which is a 32 bit value.
Loosen this restriction and use off_t instead of size_t in the call chain.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

This is a suggestion for V2, changing even sha1-file.c,
so that the whole patch makes more sense.
The initial commit of a >4Gib file was tested on a 32 bit system

I didn't remove the wrapper functions, as I don't know
what their purpose is.

And: The commit message may need some tweaking, though

bulk-checkin.c | 6 +++---
 bulk-checkin.h | 2 +-
 sha1-file.c    | 5 ++---
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 409ecb566b..34dbf5c4ea 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -96,7 +96,7 @@ static int already_written(struct bulk_checkin_state *state, struct object_id *o
  */
 static int stream_to_pack(struct bulk_checkin_state *state,
 			  git_hash_ctx *ctx, off_t *already_hashed_to,
-			  int fd, size_t size, enum object_type type,
+			  int fd, off_t size, enum object_type type,
 			  const char *path, unsigned flags)
 {
 	git_zstream s;
@@ -189,7 +189,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
 
 static int deflate_to_pack(struct bulk_checkin_state *state,
 			   struct object_id *result_oid,
-			   int fd, size_t size,
+			   int fd, off_t size,
 			   enum object_type type, const char *path,
 			   unsigned flags)
 {
@@ -258,7 +258,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 }
 
 int index_bulk_checkin(struct object_id *oid,
-		       int fd, size_t size, enum object_type type,
+		       int fd, off_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
 	int status = deflate_to_pack(&state, oid, fd, size, type,
diff --git a/bulk-checkin.h b/bulk-checkin.h
index f438f93811..09b2affdf3 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -7,7 +7,7 @@
 #include "cache.h"
 
 extern int index_bulk_checkin(struct object_id *oid,
-			      int fd, size_t size, enum object_type type,
+			      int fd, off_t size, enum object_type type,
 			      const char *path, unsigned flags);
 
 extern void plug_bulk_checkin(void);
diff --git a/sha1-file.c b/sha1-file.c
index a4367b8f04..98d0f50ffa 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1934,7 +1934,7 @@ static int index_core(struct object_id *oid, int fd, size_t size,
  * binary blobs, they generally do not want to get any conversion, and
  * callers should avoid this code path when filters are requested.
  */
-static int index_stream(struct object_id *oid, int fd, size_t size,
+static int index_stream(struct object_id *oid, int fd, off_t size,
 			enum object_type type, const char *path,
 			unsigned flags)
 {
@@ -1959,8 +1959,7 @@ int index_fd(struct object_id *oid, int fd, struct stat *st,
 		ret = index_core(oid, fd, xsize_t(st->st_size), type, path,
 				 flags);
 	else
-		ret = index_stream(oid, fd, xsize_t(st->st_size), type, path,
-				   flags);
+		ret = index_stream(oid, fd, st->st_size, type, path, flags);
 	close(fd);
 	return ret;
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2018-10-23 16:11 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-18 19:11 [PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t tboegi
2018-10-19  1:58 ` Junio C Hamano
2018-10-23 16:11 ` [PATCH/RFC v2 1/1] Use off_t instead of size_t for functions dealing with streamed checkin tboegi

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).