git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>,
	gitster@pobox.com, jnareb@gmail.com, tboegi@web.de,
	mlbright@gmail.com, remi.galan-alfonso@ensimag.grenoble-inp.fr,
	pclouds@gmail.com, e@80x24.org, ramsay@ramsayjones.plus.com
Subject: Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
Date: Wed, 27 Jul 2016 19:31:26 +0200	[thread overview]
Message-ID: <5FE50D2C-5D97-4523-9BE2-88745B3F83EA@gmail.com> (raw)
In-Reply-To: <20160727013251.GA12159@sigill.intra.peff.net>


> On 27 Jul 2016, at 03:32, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Jul 27, 2016 at 02:06:05AM +0200, larsxschneider@gmail.com wrote:
> 
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>> +{
>> +	off_t bytes_read;
>> +	off_t total_bytes_read = 0;
> 
> I haven't looked carefully at the whole patch yet, but there seems to be
> some type issues here. off_t is a good type for storing the whole size
> of a file (which may be larger than the amount of memory we can
> allocate). But size_t is the right size for an in-memory object.
> 
> This function takes a size_t size, which makes sense if it is meant to
> read everything into a strbuf.
> 
> So I think our total_bytes_read would probably want to be a size_t here,
> too, because it cannot possibly grow larger than that (and that is
> enforced by the loop below). Otherwise you get weirdness like "sb->buf +
> total_bytes_ref" possibly overflowing memory.

OK


>> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
> 
> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
> 32-bit system)?

Would that be an acceptable solution?

if (size + 1 > SIZE_MAX)
	return die("unrepresentable length for filter buffer");

Can you point me to an example in the Git source how this kind of thing should
be handled?


>> +	do {
>> +		bytes_read = packet_read(
>> +			fd, NULL, NULL,
>> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
>> +			PACKET_READ_GENTLE_ON_EOF
>> +		);
> 
> packet_read() actually returns an int, and may return "-1" on EOF (and
> int is fine because we know that we are constrained to 16-bit values
> by the pkt-line definition). You read it into an "off_t". I _think_ that
> is OK, because I believe POSIX says off_t must be signed. But probably
> "int" is the more correct type here.

OK


>> +		total_bytes_read += bytes_read;
> 
> If you do get "-1", I think you need to detect it here before adjusting
> total_bytes_read.

Correct!


>> +	while (
>> +		bytes_read > 0 && 					// the last packet was no flush
>> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
>> +	);
> 
> And I'm not sure if you need to distinguish between "0" and "-1" when
> checking byte_read here.

We want to finish reading in both cases, no?


> 
>> +	strbuf_setlen(sb, total_bytes_read);
> 
> Passing an off_t to something expecting a size_t, which can involve
> truncation (though I think in practice you really are limited to
> size_t).

OK


>> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
>> +{
>> +	int ret = 1;
>> +	char header[4];
>> +	char buffer[8192];
>> +	off_t bytes_to_write;
>> +	while (ret) {
>> +		if (in >= 0) {
>> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
> 
> Likewise here, xread() is returning ssize_t. Again, OK if we can assume
> off_t is signed, but it probably makes sense to use the correct type (we
> also know it cannot be larger than 8K, of course).

OK


> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
> why not take advantage of that and minimize the framing overhead for
> large data?

I took inspiration from here for 8K MAX_IO_SIZE:
https://github.com/git/git/blob/master/copy.c#L6

Is this read limit correct? Should I read 8 times to fill a pkt-line?


>> +			if (bytes_to_write < 0)
>> +				ret &= 0;
> 
> I think "&= 0" is unusual for our codebase? Would just writing "= 0" be
> more clear?

Yes!


> We do sometimes do "ret |= something()" but that is in cases where
> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
> your function's error-reporting is inverted from our usual style?

I thought it makes the code easier to read and the filter doesn't care
at what point the error happens anyways. The filter either succeeds
or fails. What style would you suggest?


>> +		set_packet_header(header, bytes_to_write + 4);
>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
>> +	}
> 
> If you look at format_packet(), it pulls a slight trick: we have a
> buffer 4 bytes larger than we need, format into "buf + 4", and then
> write the final size at the beginning. That lets us write() it all in
> one go.
> 
> At first I thought this function was simply reinventing packet_write(),
> but I guess you are trying to avoid the extra copy of the data (once
> into the buffer from xread, and then again via format_packet just to add
> the extra bytes at the beginning).

Yes, that was my intention.


> I agree with what Junio said elsewhere, that there may be a way to make
> the pkt-line code handle this zero-copy situation better. Perhaps
> something like:
> 
>  struct pktline {
> 	/* first 4 bytes are reserved for length header */
> 	char buf[LARGE_PACKET_MAX];
>  };
>  #define PKTLINE_DATA_START(pkt) ((pkt)->buf + 4)
>  #define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - 4)
> 
>  ...
>  struct pktline pkt;
>  ssize_t len = xread(fd, PKTLINE_DATA_START(&pkt), PKTLINE_DATA_LEN);
>  packet_send(&pkt, len);
> 
> Then packet_send() knows that the first 4 bytes are reserved for it. I
> suspect that the strbuf used by format_packet() could get away with
> using such a "struct pktline" too, though in practice I doubt there's
> any real efficiency to be gained (we generally reuse the same strbuf
> over and over, so it will grow once to 64K and get reused).

OK, I will try that.


>> +	ret &= write_in_full(out, "0000", 4) == 4;
> 
> packet_flush() ?
> 
> I know the packet functions are keen on write_or_die() versus
> write_in_full().  That is perhaps something that should be fixed.

Yes, the write_or_die calls were the reason for the manual packet flush. 
I will propose a change for these functions to accommodate non "required"
filters as it is OK when they fail. 


> This was just supposed to be a short note about off_t before eating
> dinner (oops), so I didn't read past here.

Thank you :-)

- Lars

  reply	other threads:[~2016-07-27 17:31 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
2016-07-22 15:48 ` [PATCH v1 1/3] convert: quote filter names in error messages larsxschneider
2016-07-22 15:48 ` [PATCH v1 2/3] convert: modernize tests larsxschneider
2016-07-26 15:18   ` Remi Galan Alfonso
2016-07-26 20:40     ` Junio C Hamano
2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
2016-07-22 22:32   ` Torsten Bögershausen
2016-07-24 12:09     ` Lars Schneider
2016-07-22 23:19   ` Ramsay Jones
2016-07-22 23:28     ` Ramsay Jones
2016-07-24 17:16     ` Lars Schneider
2016-07-24 22:36       ` Ramsay Jones
2016-07-24 23:22         ` Jakub Narębski
2016-07-25 20:32           ` Lars Schneider
2016-07-26 10:58             ` Jakub Narębski
2016-07-25 20:24         ` Lars Schneider
2016-07-23  0:11   ` Jakub Narębski
2016-07-23  7:27     ` Eric Wong
2016-07-26 20:00       ` Jeff King
2016-07-24 18:36     ` Lars Schneider
2016-07-24 20:14       ` Jakub Narębski
2016-07-24 21:30         ` Jakub Narębski
2016-07-25 20:16           ` Lars Schneider
2016-07-26 12:24             ` Jakub Narębski
2016-07-25 20:09         ` Lars Schneider
2016-07-26 14:18           ` Jakub Narębski
2016-07-23  8:14   ` Eric Wong
2016-07-24 19:11     ` Lars Schneider
2016-07-25  7:27       ` Eric Wong
2016-07-25 15:48       ` Duy Nguyen
2016-07-22 21:39 ` [PATCH v1 0/3] Git filter protocol Junio C Hamano
2016-07-24 11:24   ` Lars Schneider
2016-07-26 20:11     ` Jeff King
2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
2016-07-27 20:01     ` Jakub Narębski
2016-07-28  8:23       ` Lars Schneider
2016-07-27  0:06   ` [PATCH v2 2/5] convert: modernize tests larsxschneider
2016-07-27  0:06   ` [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function larsxschneider
2016-07-27  0:20     ` Junio C Hamano
2016-07-27  9:13       ` Lars Schneider
2016-07-27 16:31         ` Junio C Hamano
2016-07-27  0:06   ` [PATCH v2 4/5] convert: generate large test files only once larsxschneider
2016-07-27  2:35     ` Torsten Bögershausen
2016-07-27 13:32       ` Jeff King
2016-07-27 16:50         ` Lars Schneider
2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
2016-07-27  1:32     ` Jeff King
2016-07-27 17:31       ` Lars Schneider [this message]
2016-07-27 18:11         ` Jeff King
2016-07-28 12:10           ` Lars Schneider
2016-07-28 13:35             ` Jeff King
2016-07-27  9:41     ` Eric Wong
2016-07-29 10:38       ` Lars Schneider
2016-07-29 11:24         ` Jakub Narębski
2016-07-29 11:31           ` Lars Schneider
2016-08-05 18:55         ` Eric Wong
2016-08-05 23:26           ` Lars Schneider
2016-08-05 23:38             ` Eric Wong
2016-07-27 23:31     ` Jakub Narębski
2016-07-29  8:04       ` Lars Schneider
2016-07-29 17:35         ` Junio C Hamano
2016-07-29 23:11           ` Jakub Narębski
2016-07-29 23:44             ` Lars Schneider
2016-07-30  9:32               ` Jakub Narębski
2016-07-28 10:32     ` Torsten Bögershausen
2016-07-27 19:08   ` [PATCH v2 0/5] Git filter protocol Jakub Narębski
2016-07-28  7:16     ` Lars Schneider
2016-07-28 10:42       ` Jakub Narębski
2016-07-28 13:29       ` Jeff King
2016-07-29  7:40         ` Jakub Narębski
2016-07-29  8:14           ` Lars Schneider
2016-07-29 15:57             ` Jeff King
2016-07-29 16:20               ` Lars Schneider
2016-07-29 16:50                 ` Jeff King
2016-07-29 17:43                   ` Lars Schneider
2016-07-29 18:27                     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5FE50D2C-5D97-4523-9BE2-88745B3F83EA@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=mlbright@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=remi.galan-alfonso@ensimag.grenoble-inp.fr \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).