From: "Randall S. Becker" <rsbecker@nexbridge.com>
To: "'Emily Shaffer'" <emilyshaffer@google.com>,
	"'Jeff Hostetler'" <git@jeffhostetler.com>
Cc: git@vger.kernel.org,
	"'Ævar Arnfjörð Bjarmason'" <avarab@gmail.com>,
	"'Junio C Hamano'" <gitster@pobox.com>,
	"'Bagas Sanjaya'" <bagasdotme@gmail.com>
Subject: RE: [PATCH v2] tr2: log parent process name
Date: Fri, 21 May 2021 16:23:28 -0400
Message-ID: <02be01d74e7f$2c282300$84786900$@nexbridge.com>
In-Reply-To: <YKgSc5OgVOt6HQqW@google.com>

On May 21, 2021 4:05 PM, Emily Shaffer wrote:
>On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> >    this and decided it would be a little more flexible to add a new event
>> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> >    tree stuff into this new event too. Doing it this way is definitely
>> >    easier to parse for Google's trace analysis system (which for now
>> >    completely skips "data_json" as it's polymorphic), and also - I think
>> >    - means that we can add more fields later on if we need to (thread
>> >    info, different fields than just /proc/n/comm like exec path, argv,
>> >    whatever).
>>
>> I could argue both sides of this, so I guess it is fine either way.
>>
>> In GFW I log an array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>>
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>>
>> So both are about equal.
>>
>> (I'll avoid the temptation to make a snarky comment about fixing your
>> post processing. :-) :-) :-) )
>
>;P
>
>(I don't have much to add - this is an accurate summary of what I thought about, too. Thanks for writing it out.)
>
>>
>> It really doesn't matter one way or the other.
>>
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> >    the "command name" - excluding argv, excluding path, etc. It seems
>>
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
>
>It's close to argv[0], yeah. POSIX docs indicate it might be truncated
>in a way that argv[0] hasn't been, but it also doesn't include the
>leading path (as far as I've seen). For example, a long-running helper
>script I use with mutt, right now (gaffing on line length in email to
>help with argv clarity, sorry):
>
>  $ ps aux | grep mutt
>  emilysh+ 4119883  0.0  0.0   6892  3600 pts/6    S+   12:44   0:00 /bin/bash /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>  # comm is truncated to 15ch, except apparently in the cases of some
>  # kernel worker processes I saw with much longer names?
>  $ cat /proc/4119883/comm
>  open-vim-in-new
>  # exe is a link to the executable, which means bash as this is a
>  # script
>  $ ls -lha /proc/4119883/exe
>  lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44 /proc/4119883/exe -> /usr/bin/bash
>  # cmdline has the whole argv, separated on NUL so it runs together in
>  # editor
>  $ cat /proc/4119883/cmdline
>  /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>
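A minimal C sketch of the NUL-separated layout described above, for anyone who wants to print a cmdline buffer readably. The helper name cmdline_to_spaces() is hypothetical and is not part of git or of the patch; it only assumes that the arguments are separated (and the last one terminated) by NUL bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): copy a NUL-separated argv
 * buffer, as read from /proc/<pid>/cmdline, into `out` with a single
 * space between arguments. Returns the number of bytes written,
 * excluding the terminating NUL. Output is truncated to fit `outlen`.
 */
static size_t cmdline_to_spaces(const char *raw, size_t rawlen,
				char *out, size_t outlen)
{
	size_t i, n = 0;

	if (!outlen)
		return 0;
	/* The last argument is NUL-terminated too; drop trailing NULs. */
	while (rawlen && raw[rawlen - 1] == '\0')
		rawlen--;
	for (i = 0; i < rawlen && n + 1 < outlen; i++)
		out[n++] = raw[i] ? raw[i] : ' ';
	out[n] = '\0';
	return n;
}
```

Note that the result is only for display: an argument that itself contains a space can no longer be told apart from two arguments once the NULs are replaced.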
>Jonathan N pointed out that the process name (the thing in 'comm') can
>also be manually manipulated by the process itself, and 'man procfs'
>also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
>'prctl()', so that tracks. (It doesn't look like we can use prctl() to
>find out the names of processes besides the current process, though, so
>the procfs stuff is still needed. Dang.)
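A rough C sketch of the procfs lookup discussed above, assuming a Linux-style /proc is mounted. read_proc_comm() is a hypothetical name used for illustration; it is not the get_process_name() from the patch, whose body is not shown in this thread.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not the patch's get_process_name()): read the
 * short command name of `pid` from /proc/<pid>/comm into `buf`.
 * Returns 0 on success, -1 if the file cannot be read.
 */
static int read_proc_comm(pid_t pid, char *buf, size_t buflen)
{
	char path[64];
	FILE *fp;

	if (!buflen)
		return -1;
	snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
	fp = fopen(path, "r");
	if (!fp)
		return -1;	/* no procfs, or the process is gone */
	if (!fgets(buf, (int)buflen, fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	buf[strcspn(buf, "\n")] = '\0';	/* comm ends with a newline */
	return 0;
}
```

Calling it as read_proc_comm(getppid(), name, sizeof(name)) would give the immediate parent's name. Because a process can rename itself, the value is a hint about the caller rather than a trustworthy identity.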
>
>>
>> >    like this is a little more safe about excluding personal information
>> >    from the traces which take the form of "myscript.sh
>> >    --password=hunter2", but would still be worrisome for something like
>> >    "mysupersecretproject.sh". I'm not sure whether that means we still
>> >    want to guard it with a config flag, though.
>>
>> You might check whether you get the name of the script or just get a
>> lot of entries with just "/usr/bin/bash".
>
>See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
>Yes, that's what we've noticed too, so a process name isn't worrying us that much more.
>
>>
>> > - I also added a lot to the commit message; hopefully it's not too
>> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> >    wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> >    isn't portable so writing tests for it will suck, just scrub it from
>> >    the tests". Maybe it makes sense to do some more
>> >    platform-specific-ness in the test suite instead? I wasn't sure.
>>
>> yeah, that's probably best.  Unless you can tokenize it properly so
>> that you can predict the results in a HEREDOC in the test source.
>>
>> For example, you might try to test tracing a command where a
>> top-level "git foo" (SPACE form) spawns a "git-foo" (DASHED form),
>> and check the output for the child.
>
>Yeah, I had trouble with even deciding when to attempt such a check or not.
>
>> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > +	{
>> > +		/*
>> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> > +		 * see compat/win32/trace2_win32_process_info.c.
>> > +		 */
>> > +		char *names[2];
>> > +		names[0] = get_process_name(getppid());
>> > +		names[1] = NULL;
>>
>> You're only logging 1 parent.  That's fine to get started.
>>
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill, but
>> there are lots of intermediate parents that hide what is happening.
>> For example, a "git push" might spawn "git remote-https",
>> which spawns "git-remote-https", which spawns "git send-pack", which
>> spawns "git pack-objects".
>>
>> And that doesn't include who called push.
>>
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
>Agree. But it's expensive - I didn't find a handy library call to find
>"parent ID of given process ID", so I think we'd have to manipulate
>procfs; and so far I only see parent ID in summary infos like
>/proc/n/status or /proc/n/stat, which contain lots of other info too
>and would need parsing.
>
>We could reduce the cost a little bit by grabbing the process name from
>the status or stat as well, and therefore still only opening one file
>per process, but I'd want to check whether the formats are expected to
>be stable for those things.
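To illustrate the "one file per process" idea, here is a hedged C sketch that pulls both the name and the parent ID out of a single /proc/<pid>/stat line. It assumes the field order documented in proc(5), "pid (comm) state ppid ...", and parse_stat_line() is a hypothetical name, not something in git.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical parser for one line of /proc/<pid>/stat, which starts
 * "pid (comm) state ppid ...". Since comm may itself contain spaces or
 * parentheses, it is taken to lie between the first '(' and the last
 * ')'. Returns 0 on success, -1 on malformed input.
 */
static int parse_stat_line(const char *line, char *comm, size_t commlen,
			   long *ppid)
{
	const char *lparen = strchr(line, '(');
	const char *rparen = strrchr(line, ')');
	char state;
	size_t n;

	if (!lparen || !rparen || rparen < lparen || !commlen)
		return -1;
	n = (size_t)(rparen - lparen - 1);
	if (n >= commlen)
		n = commlen - 1;	/* truncate to fit the buffer */
	memcpy(comm, lparen + 1, n);
	comm[n] = '\0';
	/* After ")" come the state character and then the parent pid. */
	if (sscanf(rparen + 1, " %c %ld", &state, ppid) != 2)
		return -1;
	return 0;
}
```

Walking further up the tree would then be a loop: read the stat file for the returned ppid, record its comm, and stop at a chosen depth or when the parent ID is 0.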
>
>> > +static void fn_command_ancestry_fl(const char *file, int line,
>> > +const char **parent_names) {
>> > +	const char *event_name = "cmd_ancestry";
>> > +	const char *parent_name = NULL;
>> > +	struct json_writer jw = JSON_WRITER_INIT;
>> > +
>> > +	jw_object_begin(&jw, 0);
>> > +	event_fmt_prepare(event_name, file, line, NULL, &jw);
>> > +	jw_object_inline_begin_array(&jw, "ancestry");
>> > +
>> > +	while ((parent_name = *parent_names++))
>> > +		jw_array_string(&jw, parent_name);
>>
>> You're building the array with the immediate parent in a[0], the
>> grandparent in a[1], and so on.  This is the same as I did in GFW.
>>
>> Perhaps state this in the docs somewhere.
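As an illustration of the ordering described above (immediate parent first), here is a small standalone C sketch that renders a NULL-terminated name list as a JSON array. format_ancestry() is a hypothetical stand-in and does not use git's json_writer; it escapes only '"' and '\', so it is not a complete JSON string encoder.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the json_writer calls in the patch: render
 * a NULL-terminated list of names as a JSON array, nearest ancestor
 * first. Only '"' and '\\' are escaped, which is enough for a sketch.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int format_ancestry(const char **names, char *out, size_t outlen)
{
	size_t n = 0;
	const char *p;
	int first = 1;

#define PUT(c) do { if (n + 1 >= outlen) return -1; out[n++] = (c); } while (0)
	PUT('[');
	for (; *names; names++) {
		if (!first)
			PUT(',');
		first = 0;
		PUT('"');
		for (p = *names; *p; p++) {
			if (*p == '"' || *p == '\\')
				PUT('\\');
			PUT(*p);
		}
		PUT('"');
	}
	PUT(']');
#undef PUT
	out[n] = '\0';
	return 0;
}
```

With names = { "bash", "sshd", NULL }, element 0 of the resulting array is the immediate parent and element 1 the grandparent, which is the convention the documentation would need to spell out.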
>
>Sure, makes sense. I think I neglected any doc work whatsoever in this patch anyways, whoops :)
>
>> > +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
>> > +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
>> > +	while ((parent_name = *parent_names++)) {
>> > +		strbuf_addstr(&buf_payload, parent_name);
>>
>> Did you want to quote each parent's name?
>
>I'd rather not - since they're going into an array anyway, I'd expect
>the array delimiters to be enough. Am I being naive? 'normal' looks to
>me like it's supposed to be mostly human readable anyways, rather than
>parseable?
>
>> > +	strbuf_addstr(&buf_payload, "ancestry:[");
>> > +	/* It's not an argv but the rules are basically the same. */
>> > +	sq_append_quote_argv_pretty(&buf_payload, parent_names);
>>
>> This will have whitespace delimiters between the quoted strings rather
>> than commas.  Just checking if that's what you wanted.
>>
>> I'm not sure it matters, since this stream is intended for human
>> parsing.
>
>Yeah, it seems fine to me as is.
>
>>
>> We should update Documentation/technical/api-trace2.txt too.
>
>Yep, thanks.
>
>I appreciate the review, Jeff.

I checked the performance of my NonStop parent lookup implementation.
It's fast enough that no one would notice (8 microseconds on the oldest,
slowest machine I could find).

Just an FYI.
-Randall