From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 11 Mar 2024 06:54:11 -0500 (CDT)
From: Carl Edquist <edquist@cs.wisc.edu>
To: Zachary Santer <zsanter@gmail.com>
cc: libc-alpha@sourceware.org, coreutils@gnu.org, p@draigbrady.com
Subject: Re: RFE: enable buffering on null-terminated data
In-Reply-To: <CABkLJULka=Ox-WVNfqzeLYs1dX0h7ovnfjeRdqGSFcqVMJ47KQ@mail.gmail.com>
Message-ID: <8c490a55-598a-adf6-67c2-eb2a6099620a@cs.wisc.edu>
References: <CABkLJULa8c0zr1BkzWLTpAxHBcpb15Xms0-Q2OOVCHiAHuL0uA@mail.gmail.com>
 <9831afe6-958a-fbd3-9434-05dd0c9b602a@draigBrady.com>
 <CABkLJUKdbwP-7Bw5PTXGDh5o9qpX14=7TCxSgd5v+1mDfdoEpQ@mail.gmail.com>
 <317fe0e2-8cf9-d4ac-ed56-e6ebcc2baa55@cs.wisc.edu>
 <CABkLJULka=Ox-WVNfqzeLYs1dX0h7ovnfjeRdqGSFcqVMJ47KQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 BOUNDARY="1769999106-1898998172-1710154819=:437286"
Content-ID: <330cc25a-7233-a752-c105-d4cc0e5c5ee8@cs.wisc.edu>

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--1769999106-1898998172-1710154819=:437286
Content-Type: text/plain; CHARSET=UTF-8; format=flowed
Content-Transfer-Encoding: 8BIT
Content-ID: <481ce199-35ca-1bd1-fa33-bc3c6d74e7a0@cs.wisc.edu>

On Sun, 10 Mar 2024, Zachary Santer wrote:

> On Sun, Mar 10, 2024 at 4:36 PM Carl Edquist <edquist@cs.wisc.edu> wrote:
>>
>> Out of curiosity, do you have an example command line for your use case?
>
> My use for 'stdbuf --output=L' is to be able to run a command within a
> bash coprocess.

Oh, cool, now you're talking!  ;)


> (Really, a background process communicating with the parent process 
> through FIFOs, since Bash prints a warning message if you try to run 
> more than one coprocess at a time. Shouldn't make a difference here.)

(Kind of a side-note ... bash's limited coprocess handling was a 
long-standing annoyance for me in the past, to the point that I wrote a 
bash coprocess management library to handle multiple active coprocesses 
and give convenient methods for interaction.  Perhaps the trickiest bit about 
multiple coprocesses open at once (which I suspect is the reason support 
was never added to bash) is that you don't want the second and subsequent 
coprocesses to inherit the pipe fds of prior open coprocesses.  This can 
result in deadlock if, for instance, you close your write end to coproc1, 
but coproc1 continues to wait for input because coproc2 also has a copy of 
a write end of the pipe to coproc1's input.  So you need to be smart about 
subsequent coprocesses first closing all fds associated with other 
coprocesses.

Word to the wise: you might encounter this issue (coproc2 prevents coproc1 
from seeing its end-of-input) even though you are rigging this up yourself 
with FIFOs rather than bash's coproc builtin.)
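
A rough bash sketch of that fd hygiene, with a FIFO and two background 
processes standing in for coproc1 and coproc2.  The names (tmp, w1, pid1, 
pid2) and the 30-second sleep are made up for illustration; the point is 
only the {w1}>&- redirection on the second process:

```shell
tmp=$(mktemp -d)
mkfifo "$tmp/in1"

# "coproc1": copies its input to a file until it sees end-of-input.
cat "$tmp/in1" >"$tmp/out1" &
pid1=$!

# The parent's write end of coproc1's input, on a shell-chosen fd.
exec {w1}>"$tmp/in1"

# "coproc2": a long-running stand-in.  The {w1}>&- redirection closes the
# inherited copy of the write end in this child only; without it, coproc1
# would not see end-of-input until coproc2 exited.
sleep 30 {w1}>&- &
pid2=$!

SECONDS=0
printf 'hello\n' >&"$w1"
exec {w1}>&-          # the parent closes its write end
wait "$pid1"          # returns promptly: no other writers remain
result=$(cat "$tmp/out1")
waited=$SECONDS

kill "$pid2"
wait "$pid2" 2>/dev/null || true
rm -rf "$tmp"
printf 'coproc1 received [%s] after %ss\n' "$result" "$waited"
```

Dropping the {w1}>&- from the sleep line should make the wait for pid1 
last as long as the sleep does.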


> See coproc-buffering, attached.

Thanks!

> Without making the command's output either line-buffered or unbuffered, 
> what I'm doing there would deadlock. I feed one line in and then expect 
> to be able to read a transformed line immediately. If that transformed 
> line is stuck in a buffer that's still waiting to be filled, then 
> nothing happens.
>
> I swear doing this actually makes sense in my application.

Yeah makes sense!  I am familiar with the problem you're describing.

(In my coprocess management library, I effectively run every coproc with 
--output=L by default, by eval'ing the output of 'env -i stdbuf -oL env', 
because most of the time for a coprocess, that's what's wanted/necessary.)
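
For illustration, a minimal sketch of capturing those settings once and 
reusing them.  It assumes GNU coreutils' stdbuf, which passes its settings 
to the child through environment variables.  It stores them in a bash array 
(stdbuf_env, a made-up name) and applies them with env(1) instead of eval, 
so it is not the library's actual code:

```shell
# Capture the variables stdbuf would place in a child's environment.
# 'env -i' starts from an empty environment, so only stdbuf's additions
# are printed (with GNU coreutils: LD_PRELOAD and _STDBUF_O).
mapfile -t stdbuf_env < <(env -i stdbuf -oL env)
printf '%s\n' "${stdbuf_env[@]}"

# Reuse the captured settings for a later command without calling stdbuf.
expanded=$(printf 'a\tb\n' | env "${stdbuf_env[@]}" expand)
printf '%s\n' "$expanded"
```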


... Although, for your example coprocess use, where the shell both 
produces the input for the coproc and consumes its output, you might be 
able to simplify things by making the producer and consumer separate 
processes.  Then you could do a simpler 'producer | filter | consumer' 
without having to worry about buffering at all.  But if the producer and 
consumer need to be in the same process (eg they share state and are 
logically interdependent), then yeah that's where you need a coprocess for 
the filter.

... On the other hand, if the issue is that someone is producing one line 
at a time _interactively_ (that is, inputting text or commands from a 
terminal), then you might argue that the performance hit for unbuffered 
output will be insignificant compared to time spent waiting for terminal 
input.


> $ ./coproc-buffering 100000
> Line-buffered:
> real    0m17.795s
> user    0m6.234s
> sys     0m11.469s
> Unbuffered:
> real    0m21.656s
> user    0m6.609s
> sys     0m14.906s

Yeah, this makes sense in your particular example.

It looks like expand(1) uses putchar(3), so in unbuffered mode this 
translates to one write(2) call for every byte.  sed(1) is not quite as 
bad - in unbuffered mode it appears to output the line and the newline 
terminator separately, so two write(2) calls for every line.

So in both cases (but especially for expand), line buffering reduces the 
number of write(2) calls.

(Although given your time output, you might say the performance hit for 
unbuffered is not that huge.)


> When I initially implemented this thing, I felt lucky that the data I 
> was passing in were lines ending in newlines, and not null-terminated, 
> since my script gets to benefit from 'stdbuf --output=L'.

:thumbsup:


> Truth be told, I don't currently have a need for --output=N.

Mmm-hmm  :)


> Of course, sed and all sorts of other Linux command-line tools can 
> produce or handle null-terminated data.

Definitely.  So in the general case, theoretically it seems as useful to 
buffer output on nul bytes.

Note that for GNU sed in particular, there is a -u/--unbuffered option, 
which will effectively give you line-buffered output, including buffering 
on nul bytes with -z/--null-data.
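
A small sketch of what that looks like in a bash coprocess, assuming GNU 
sed.  The coproc name SEDZ, the 'got: ' prefix and the 5-second read timeout 
are arbitrary choices for the example; the timeout is only there so the 
shell does not hang if the reply never arrives:

```shell
coproc SEDZ { sed -u -z 's/^/got: /'; }
sed_in=${SEDZ[1]} sed_out=${SEDZ[0]} sed_pid=$SEDZ_PID

# One nul-terminated record with an embedded newline.
printf 'one\ntwo\0' >&"$sed_in"

# Read one nul-terminated record back, without closing sed's input first.
IFS= read -r -d '' -t 5 -u "$sed_out" reply
printf '[%s]\n' "$reply"

exec {sed_in}>&-      # close sed's input so it sees EOF and exits
wait "$sed_pid"
```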

... I'll be honest though, I am having trouble imagining a realistic 
pipeline that filters filenames with embedded newlines using expand(1) 
;)

...

But, I want to be a good sport here and contrive an actual use case.

So for fun, say I want to use cut(1) (which performs poorly when 
unbuffered) in a coprocess that takes null-terminated file paths on input 
and outputs the first directory component (which possibly contains 
embedded newlines).

The basic command in the coprocess would be:

 	cut -d/ -f1 -z

but with the default block buffering for pipe output, that will hang (the 
problem you describe) if you expect to read a record back from it after 
each record sent.


The unbuffered approach works, but (as discussed) is pretty inefficient:

 	stdbuf --output=0  cut -d/ -f1 -z
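
To see the difference between the two, here is a hedged sketch.  The 
helper try_filter is hypothetical: it runs a command as a bash coprocess, 
sends one nul-terminated record, and waits up to 2 seconds for a record to 
come back before giving up.  It assumes GNU cut and stdbuf:

```shell
try_filter () {
    # Run "$@" as a coprocess, send one record, wait up to 2 s for a reply.
    local in out pid reply status
    coproc FILT { "$@"; }
    in=${FILT[1]} out=${FILT[0]} pid=$FILT_PID
    printf 'usr/local/bin\0' >&"$in"
    if IFS= read -r -d '' -t 2 -u "$out" reply; then
        status="got [$reply]"
    else
        status="timed out"
    fi
    exec {in}>&-      # close the filter's input so it exits
    wait "$pid"
    printf '%s\n' "$status"
}

plain=$(try_filter cut -d/ -f1 -z)
unbuf=$(try_filter stdbuf -o0 cut -d/ -f1 -z)
printf 'block-buffered: %s\nunbuffered:     %s\n' "$plain" "$unbuf"
```

With the default buffering the reply should only show up once cut's input 
is closed, which is too late for the read.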


But, if we swap nul bytes and newlines before and after cut, then we can 
run cut with regular newline line buffering, and get the desired effect:

 	stdbuf --output=0 tr '\0\n' '\n\0' |
 	stdbuf --output=L cut -d/ -f1      |
 	stdbuf --output=0 tr '\0\n' '\n\0'


The embedded newlines in filenames will be passed by tr(1) to cut(1) as 
embedded nul bytes, cut will line-buffer its output, and the second tr 
will restore the original embedded newlines & null-terminated records.

Note that unbuffered tr(1) will still output its translated input in 
blocks (with fwrite(3)) rather than a byte at a time, so tr will 
effectively give buffered output with the same size as the input records.

(That is, newline or null-terminated input records will effectively 
produce newline or null-terminated output buffering, respectively.)
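
As a quick check of the swap itself (no buffering involved), this sketch 
dumps what cut would see after the first tr, and confirms that swapping 
twice gives back the original bytes.  The sample data and the names sample, 
seen_by_cut, before and after are made up for the example:

```shell
# Two nul-terminated records; the first contains an embedded newline.
sample () { printf 'a\nb/c\0d/e\0'; }

# What cut sees after the first swap: newline-terminated records, with
# the embedded newline turned into a nul byte.
sample | tr '\0\n' '\n\0' | od -An -c
seen_by_cut=$(sample | tr '\0\n' '\n\0' | od -An -c | tr -d ' \n')

# Swapping twice restores the original bytes.
before=$(sample | od -An -c)
after=$(sample | tr '\0\n' '\n\0' | tr '\0\n' '\n\0' | od -An -c)
```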


I'd venture to guess that most of the standard filters could be made to 
pass along null-terminated records as line-buffered records the same way. 
Might even package it into a convenience function to set them up:


 	swap_znl () { stdbuf -o0 tr '\0\n' '\n\0'; }

 	nulterm_via_linebuf () { swap_znl | stdbuf -oL "$@" | swap_znl; }


Then, for example, stand it up with bash's coproc:

 	$ coproc DC1 { nulterm_via_linebuf cut -d/ -f1; }

 	$ printf 'a\nb/c\nd/efg\0' >&${DC1[1]}
 	$ IFS='' read -rd '' -u ${DC1[0]} DIR
 	$ echo "[$DIR]"
 	[a
 	b]

(or however else you manage your coprocesses.)

It's a workaround, and it keeps the kind of buffering you'd get with a 
(hypothetical) 'stdbuf --output=N', but to be fair the extra data 
shoveling is not exactly free.
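
One way to sanity-check the workaround, sketched here in a plain pipeline 
rather than a coprocess, is to compare its output byte for byte with what 
'cut -z' produces directly.  The two functions are repeated so the snippet 
stands alone; the sample data and the names sample, direct and swapped are 
made up, and GNU cut, tr and stdbuf are assumed:

```shell
swap_znl () { stdbuf -o0 tr '\0\n' '\n\0'; }
nulterm_via_linebuf () { swap_znl | stdbuf -oL "$@" | swap_znl; }

# Two nul-terminated paths; the first has embedded newlines.
sample () { printf 'a\nb/c\nd/efg\0usr/local\0'; }

# Dump both results with od so that nul bytes survive the comparison.
direct=$(sample | cut -d/ -f1 -z | od -An -c)
swapped=$(sample | nulterm_via_linebuf cut -d/ -f1 | od -An -c)

if [ "$direct" = "$swapped" ]; then
    echo "same output"
else
    echo "outputs differ"
fi
```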

...

So ... again in theory I also feel like a null-terminated buffering mode 
for stdbuf(1) (and setbuf(3)) is kind of a missing feature.  It may just 
be that nobody has actually had a real need for it.  (Yet?)


> I'm running bash in MSYS2 on a Windows machine, so hopefully that 
> doesn't invalidate any assumptions.

Ooh.  No idea.  Your strace and sed might have different options than 
mine.  Also, I am not sure if there are different pipe and fd duplication 
semantics, compared to Linux.  But, based on the examples & output you're 
giving, I think we're on the same page for the discussion.


> Now setting up strace around the things within the coprocess, and only 
> passing in one line, I now have coproc-buffering-strace, attached. 
> Giving the argument 'L', both sed and expand call write() once. Giving 
> the argument 0, sed calls write() twice and expand calls it a bunch of 
> times, seemingly once for each character it outputs. So I guess that's 
> it.

:thumbsup:  Yeah that matches what I was seeing also.


Thanks for humoring the peanut gallery here :D

Carl
--1769999106-1898998172-1710154819=:437286--