From: "Randall S. Becker" <rsbecker@nexbridge.com>
To: "'Emily Shaffer'" <emilyshaffer@google.com>,
"'Jeff Hostetler'" <git@jeffhostetler.com>
Cc: git@vger.kernel.org,
"'Ævar Arnfjörð Bjarmason'" <avarab@gmail.com>,
"'Junio C Hamano'" <gitster@pobox.com>,
"'Bagas Sanjaya'" <bagasdotme@gmail.com>
Subject: RE: [PATCH v2] tr2: log parent process name
Date: Fri, 21 May 2021 16:23:28 -0400 [thread overview]
Message-ID: <02be01d74e7f$2c282300$84786900$@nexbridge.com> (raw)
In-Reply-To: <YKgSc5OgVOt6HQqW@google.com>
<emilyshaffer@google.com>
On May 21, 2021 4:05 PM, Emily Shaffer wrote:
>On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> > this and decided it would be a little more flexible to add a new event
>> > instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> > tree stuff into this new event too. Doing it this way is definitely
>> > easier to parse for Google's trace analysis system (which for now
>> > completely skips "data_json" as it's polymorphic), and also - I think
>> > - means that we can add more fields later on if we need to (thread
>> > info, different fields than just /proc/n/comm like exec path, argv,
>> > whatever).
>>
>> I could argue both sides of this, so I guess it is fine either way.
>>
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>>
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings. You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>>
>> So both are about equal.
>>
>> (I'll avoid the temptation to make a snarky comment about fixing your
>> post processing. :-) :-) :-) )
>
>;P
>
>(I don't have much to add - this is an accurate summary of what I thought about, too. Thanks for writing it out.)
>
>>
>> It really doesn't matter one way or the other.
>>
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> > the "command name" - excluding argv, excluding path, etc. It
>> > seems
>>
>> So you're trying to log argv[0] of the process and not the full
>> command line. That's what I'm doing.
>
>It's close to argv[0], yeah. POSIX docs indicate it might be truncated in a way that argv[0] hasn't been, but it also doesn't
include the
>leading path (as far as I've seen). For example, a long-running helper script I use with mutt, right now (gaffing on line length in
email to
>help with argv clarity, sorry):
>
> $ ps aux | grep mutt
> emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash
/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-
>new-split.sh /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
> # comm is truncated to 15ch, except apparently in the cases of some
> # kernel worker processes I saw with much longer names?
> $ cat /proc/4119883/comm
> open-vim-in-new
> # exe is a link to the executable, which means bash as this is a
> # script
> $ ls -lha /proc/4119883/exe
> lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
> /proc/4119883/exe -> /usr/bin/bash
> # cmdline has the whole argv, separated on NUL so it runs together in
> # editor
> $ cat /proc/4119883/cmdline
> /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-
>7433772284891386689
>
>Jonathan N pointed out that the process name (the thing in 'comm') can also be manually manipulated by the process itself, and 'man
>procfs'
>also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in 'prctl()', so that tracks. (It doesn't look like we can use prctl()
to find
>out the names of processes besides the current process, though, so the procfs stuff is still needed. Dang.)
>
>>
>> > like this is a little more safe about excluding personal information
>> > from the traces which take the form of "myscript.sh
>> > --password=hunter2", but would still be worrisome for something like
>> > "mysupersecretproject.sh". I'm not sure whether that means we still
>> > want to guard it with a config flag, though.
>>
>> You might check whether you get the name of the script or just get a
>> lot of entries with just "/usr/bin/bash".
>
>See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
>Yes, that's what we've noticed too, so a process name isn't worrying us that much more.
>
>>
>> > - I also added a lot to the commit message; hopefully it's not too
>> > rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> > wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> > isn't portable so writing tests for it will suck, just scrub it from
>> > the tests". Maybe it makes sense to do some more
>> > platform-specific-ness in the test suite instead? I wasn't sure.
>>
>> yeah, that's probably best. Unless you can tokenize it properly so
>> that you can predict the results in a HEREDOC in the test source.
>>
>> For example, you might try to test tracing a command (where a
>> top-level "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and
>> check the output for the child.
>
>Yeah, I had trouble with even deciding when to attempt such a check or not.
>> > + if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > + {
>> > + /*
>> > + * NEEDSWORK: we could do the entire ptree in an array instead,
>> > + * see compat/win32/trace2_win32_process_info.c.
>> > + */
>> > + char *names[2];
>> > + names[0] = get_process_name(getppid());
>> > + names[1] = NULL;
>>
>> You're only logging 1 parent. That's fine to get started.
>>
>> I'm logging IIRC 10 parents on GFW. That might seem overkill, but
>> there are lots of intermediate parents that hide what is happening.
>> For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>>
>> And that doesn't include who called push.
>>
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
>Agree. But it's expensive - I didn't find a handy library call to find "parent ID of given process ID", so I think we'd have to
manipulate
>procfs; and so far I only see parent ID in summary infos like /proc/n/status or /proc/n/stat, which contain lots of other info too
and would
>need parsing.
>
>We could reduce the cost a little bit by grabbing the process name from the status or stat as well, and therefore still only
opening one file
>per process, but I'd want to check whether the formats are expected to be stable for those things.
>
>> > +static void fn_command_ancestry_fl(const char *file, int line,
>> > +const char **parent_names) {
>> > + const char *event_name = "cmd_ancestry";
>> > + const char *parent_name = NULL;
>> > + struct json_writer jw = JSON_WRITER_INIT;
>> > +
>> > + jw_object_begin(&jw, 0);
>> > + event_fmt_prepare(event_name, file, line, NULL, &jw);
>> > + jw_object_inline_begin_array(&jw, "ancestry");
>> > +
>> > + while ((parent_name = *parent_names++))
>> > + jw_array_string(&jw, parent_name);
>>
>> You're building the array with the immediate parent in a[0] and the
>> grandparent in a[1], and etc. This is the same as I did in GFW.
>>
>> Perhaps state this in the docs somewhere.
>
>Sure, makes sense. I think I neglected any doc work whatsoever in this patch anyways, whoops :)
>
>> > + /* cmd_ancestry parent <- grandparent <- great-grandparent */
>> > + strbuf_addstr(&buf_payload, "cmd_ancestry ");
>> > + while ((parent_name = *parent_names++)) {
>> > + strbuf_addstr(&buf_payload, parent_name);
>>
>> Did you want to quote each parent's name?
>
>I'd rather not - since they're going into an array anyway, I'd expect the array delimiters to be enough. Am I being naive? 'normal'
looks to
>me like it's supposed to be mostly human readable anyways, rather than parseable?
>
>> > + strbuf_addstr(&buf_payload, "ancestry:[");
>> > + /* It's not an argv but the rules are basically the same. */
>> > + sq_append_quote_argv_pretty(&buf_payload, parent_names);
>>
>> This will have whitespace delimiters between the quoted strings rather
>> than commas. Just checking if that's what you wanted.
>>
>> I'm not sure it matters, since this stream is intended for human
>> parsing.
>
>Yeah, it seems fine to me as is.
>
>>
>> We should update Documentation/technical/api-trace2.txt too.
>
>Yep, thanks.
>
>I appreciate the review, Jeff.
I checked the performance of my NonStop parent lookup implementation. It's fast enough that no one would notice (8 microseconds on
the oldest slowest machine I could find).
Just an FYI.
-Randall
next prev parent reply other threads:[~2021-05-21 20:23 UTC|newest]
Thread overview: 87+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-07 0:29 [PATCH] tr2: log parent process name Emily Shaffer
2021-05-07 3:25 ` Bagas Sanjaya
2021-05-07 17:09 ` Emily Shaffer
2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
2021-05-11 21:31 ` Junio C Hamano
2021-05-14 22:06 ` Emily Shaffer
2021-05-16 3:48 ` Junio C Hamano
2021-05-17 20:17 ` Emily Shaffer
2021-05-11 17:28 ` Jeff Hostetler
2021-05-14 22:07 ` Emily Shaffer
2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
2021-05-20 21:36 ` Randall S. Becker
2021-05-20 23:23 ` Emily Shaffer
2021-05-21 13:20 ` Randall S. Becker
2021-05-21 16:24 ` Randall S. Becker
2021-05-21 2:09 ` Junio C Hamano
2021-05-21 19:02 ` Emily Shaffer
2021-05-21 23:22 ` Junio C Hamano
2021-05-24 18:37 ` Emily Shaffer
2021-05-21 19:15 ` Jeff Hostetler
2021-05-21 20:05 ` Emily Shaffer
2021-05-21 20:23 ` Randall S. Becker [this message]
2021-05-22 11:18 ` Jeff Hostetler
2021-05-24 23:33 ` Ævar Arnfjörð Bjarmason
2021-05-24 20:10 ` [PATCH v3] " Emily Shaffer
2021-05-24 20:49 ` Emily Shaffer
2021-05-25 3:54 ` Junio C Hamano
2021-05-25 13:33 ` Randall S. Becker
2021-06-08 18:58 ` [PATCH v4] " Emily Shaffer
2021-06-08 20:56 ` Emily Shaffer
2021-06-08 22:10 ` [PATCH v5] " Emily Shaffer
2021-06-08 22:16 ` Randall S. Becker
2021-06-08 22:24 ` Emily Shaffer
2021-06-08 22:39 ` Randall S. Becker
2021-06-09 20:17 ` Emily Shaffer
2021-06-16 8:42 ` Junio C Hamano
2021-06-28 16:45 ` Jeff Hostetler
2021-06-29 23:51 ` Emily Shaffer
2021-06-30 6:10 ` Ævar Arnfjörð Bjarmason
2021-07-22 0:21 ` Emily Shaffer
2021-07-22 1:27 ` [PATCH v6 0/2] " Emily Shaffer
2021-07-22 1:27 ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
2021-08-02 9:34 ` Ævar Arnfjörð Bjarmason
2021-07-22 1:27 ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
2021-07-22 21:02 ` Junio C Hamano
2021-08-02 9:38 ` Ævar Arnfjörð Bjarmason
2021-08-02 12:45 ` Ævar Arnfjörð Bjarmason
2021-08-02 10:22 ` Ævar Arnfjörð Bjarmason
2021-08-02 12:47 ` Ævar Arnfjörð Bjarmason
2021-08-02 15:23 ` Jeff Hostetler
2021-08-02 16:10 ` Randall S. Becker
2021-08-02 18:41 ` Ævar Arnfjörð Bjarmason
2021-08-25 23:19 ` [PATCH 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Ævar Arnfjörð Bjarmason
2021-08-25 23:19 ` [PATCH 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-25 23:19 ` [PATCH 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-25 23:19 ` [PATCH 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-26 3:09 ` Taylor Blau
2021-08-25 23:19 ` [PATCH 4/6] tr2: fix memory leak & logic error in 2f732bf15e6 Ævar Arnfjörð Bjarmason
2021-08-26 3:21 ` Taylor Blau
2021-08-25 23:19 ` [PATCH 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-26 3:23 ` Taylor Blau
2021-08-25 23:19 ` [PATCH 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-25 23:49 ` Eric Sunshine
2021-08-26 4:07 ` Taylor Blau
2021-08-26 12:24 ` "I don't know what the author meant by that..." (was "Re: [PATCH 6/6] tr2: log N parent process names on Linux") Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 4/6] tr2: fix memory leak & logic error in 2f732bf15e6 Ævar Arnfjörð Bjarmason
2021-08-26 15:58 ` Eric Sunshine
2021-08-26 16:42 ` Junio C Hamano
2021-08-26 12:22 ` [PATCH v2 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-26 12:22 ` [PATCH v2 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-26 22:38 ` [PATCH v2 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Taylor Blau
2021-08-27 8:02 ` [PATCH v3 " Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 4/6] tr2: leave the parent list empty upon failure & don't leak memory Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-27 8:02 ` [PATCH v3 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-31 0:17 ` [PATCH v3 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Taylor Blau
2021-08-02 10:30 ` [PATCH v6 2/2] tr2: log parent process name Ævar Arnfjörð Bjarmason
2021-08-02 16:24 ` Junio C Hamano
2021-08-02 18:42 ` Ævar Arnfjörð Bjarmason
2021-07-22 16:59 ` [PATCH v6 0/2] " Jeff Hostetler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='02be01d74e7f$2c282300$84786900$@nexbridge.com' \
--to=rsbecker@nexbridge.com \
--cc=avarab@gmail.com \
--cc=bagasdotme@gmail.com \
--cc=emilyshaffer@google.com \
--cc=git@jeffhostetler.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).