From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Emily Shaffer <emilyshaffer@google.com>
Cc: Jeff Hostetler <git@jeffhostetler.com>,
git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH v2] tr2: log parent process name
Date: Tue, 25 May 2021 01:33:27 +0200
Message-ID: <877djnogbk.fsf@evledraar.gmail.com>
In-Reply-To: <YKgSc5OgVOt6HQqW@google.com>

On Fri, May 21 2021, Emily Shaffer wrote:
> On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> > this and decided it would be a little more flexible to add a new event
>> > instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> > tree stuff into this new event too. Doing it this way is definitely
>> > easier to parse for Google's trace analysis system (which for now
>> > completely skips "data_json" as it's polymorphic), and also - I think
>> > - means that we can add more fields later on if we need to (thread
>> > info, different fields than just /proc/n/comm like exec path, argv,
>> > whatever).
>>
>> I could argue both sides of this, so I guess it is fine either way.
>>
>> In GFW I log an array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>>
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings. You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>>
>> So both are about equal.
>>
>> (I'll avoid the temptation to make a snarky comment about fixing
>> your post processing. :-) :-) :-) )
>
> ;P
>
> (I don't have much to add - this is an accurate summary of what I
> thought about, too. Thanks for writing it out.)
>
>>
>> It really doesn't matter one way or the other.
>>
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> > the "command name" - excluding argv, excluding path, etc. It seems
>>
>> So you're trying to log argv[0] of the process and not the full
>> command line. That's what I'm doing.
>
> It's close to argv[0], yeah. POSIX docs indicate it might be truncated
> in a way that argv[0] hasn't been, but it also doesn't include the
> leading path (as far as I've seen). For example, a long-running helper
> script I use with mutt, right now (gaffing on line length in email to
> help with argv clarity, sorry):
>
> $ ps aux | grep mutt
> emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash
> /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh
> /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
> # comm is truncated to 15ch, except apparently in the cases of some
> # kernel worker processes I saw with much longer names?
> $ cat /proc/4119883/comm
> open-vim-in-new
> # exe is a link to the executable, which means bash as this is a
> # script
> $ ls -lha /proc/4119883/exe
> lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
> /proc/4119883/exe -> /usr/bin/bash
> # cmdline has the whole argv, separated on NUL so it runs together in
> # editor
> $ cat /proc/4119883/cmdline
> /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>
> Jonathan N pointed out that the process name (the thing in 'comm') can
> also be manually manipulated by the process itself, and 'man procfs'
> also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
> 'prctl()', so that tracks. (It doesn't look like we can use prctl() to
> find out the names of processes besides the current process, though, so
> the procfs stuff is still needed. Dang.)
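
For what it's worth, here is a minimal sketch of the behaviour described
above, assuming Linux with procfs mounted at /proc. The helper names
read_comm() and rename_self_and_read() are made up for illustration; they
are not the patch's get_process_name(). It renames the calling thread
with PR_SET_NAME and reads the result back from /proc/<pid>/comm, which
is the only one of the two interfaces that works for other PIDs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the contents of /proc/<pid>/comm into buf,
 * minus the trailing newline. Works for any PID we are allowed to read.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int read_comm(pid_t pid, char *buf, size_t len)
{
	char path[64];
	FILE *fp;

	assert(len > 0);
	snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (!fgets(buf, (int)len, fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Rename the calling thread with PR_SET_NAME, then read the name back
 * through procfs. The kernel keeps at most 15 bytes plus a NUL, so a
 * longer name comes back truncated. This assumes the caller is the main
 * thread, since /proc/<pid>/comm reports the thread group leader.
 */
static int rename_self_and_read(const char *name, char *buf, size_t len)
{
	if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0))
		return -1;
	return read_comm(getpid(), buf, len);
}
```

Calling rename_self_and_read("open-vim-in-new-split.sh", ...) from a
single-threaded program should hand back the same 15-character name as
in your "cat /proc/4119883/comm" output.
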
>
>>
>> > like this is a little more safe about excluding personal information
>> > from the traces which take the form of "myscript.sh
>> > --password=hunter2", but would still be worrisome for something like
>> > "mysupersecretproject.sh". I'm not sure whether that means we still
>> > want to guard it with a config flag, though.
>>
>> You might check whether you get the name of the script or just get
>> a lot of entries with just "/usr/bin/bash".
>
> See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
> Yes, that's what we've noticed too, so a process name isn't worrying us
> that much more.
>
>>
>> > - I also added a lot to the commit message; hopefully it's not too
>> > rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> > wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> > isn't portable so writing tests for it will suck, just scrub it from
>> > the tests". Maybe it makes sense to do some more
>> > platform-specific-ness in the test suite instead? I wasn't sure.
>>
>> yeah, that's probably best. Unless you can tokenize it properly
>> so that you can predict the results in a HEREDOC in the test source.
>>
>> For example, you might try to test tracing a command (where a top-level
>> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
>> output for the child.
>
> Yeah, I had trouble with even deciding when to attempt such a check or
> not.
>
>> > + if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > + {
>> > + /*
>> > + * NEEDSWORK: we could do the entire ptree in an array instead,
>> > + * see compat/win32/trace2_win32_process_info.c.
>> > + */
>> > + char *names[2];
>> > + names[0] = get_process_name(getppid());
>> > + names[1] = NULL;
>>
>> You're only logging 1 parent. That's fine to get started.
>>
>> I'm logging IIRC 10 parents on GFW. That might seem overkill,
>> but there are lots of intermediate parents that hide what is
>> happening. For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawns "git send-pack" which
>> spawns "git pack-objects".
>>
>> And that doesn't include who called push.
>>
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
> Agree. But it's expensive - I didn't find a handy library call to find
> "parent ID of given process ID", so I think we'd have to manipulate
> procfs; and so far I only see parent ID in summary infos like
> /proc/n/status or /proc/n/stat, which contain lots of other info too and
> would need parsing.

It sounds a bit like you're fumbling your way towards (re?)discovering:

    pstree -s <pid>

You can look at its implementation (or strace it) to see what it does,
and yes, on Linux there's no handy C library for this; iterating over
procfs is the library.

Aside from the privacy, PII, usefulness of this data etc. discussions in
this & related threads, I don't think the cost per se should be an issue
on a modern Linux system. After all, we'd just need to do it once on
startup. For any sub-process we spawn we'd carry it forward after the
initial /usr/bin/git invocation.
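
To make "iterating over procfs" concrete, here is a rough sketch of such
a walk, assuming Linux with procfs mounted at /proc. It is not what
pstree or the patch does verbatim, and parse_stat_line(), read_stat(),
collect_ancestry() and NAME_LEN are names invented for this example. It
takes the parent PID from /proc/<pid>/stat, whose layout is
"pid (comm) state ppid ...", and splits on the last ')' because comm may
itself contain spaces or parentheses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NAME_LEN 64

/*
 * Parse one /proc/<pid>/stat line: "pid (comm) state ppid ...".
 * Returns 0 and fills comm/ppid on success, -1 on malformed input.
 */
static int parse_stat_line(const char *line, char *comm, size_t len,
			   pid_t *ppid)
{
	const char *open = strchr(line, '(');
	const char *close = strrchr(line, ')');
	char state;
	long parent;
	size_t n;

	if (!open || !close || close < open || !len)
		return -1;
	n = (size_t)(close - open - 1);
	if (n >= len)
		n = len - 1; /* truncate rather than overflow */
	memcpy(comm, open + 1, n);
	comm[n] = '\0';
	if (sscanf(close + 1, " %c %ld", &state, &parent) != 2)
		return -1;
	*ppid = (pid_t)parent;
	return 0;
}

/* Read and parse /proc/<pid>/stat; -1 if it is missing or unreadable. */
static int read_stat(pid_t pid, char *comm, size_t len, pid_t *ppid)
{
	char path[64], line[512];
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (!fgets(line, sizeof(line), fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	return parse_stat_line(line, comm, len, ppid);
}

/*
 * Fill names[] with the ancestors of pid, nearest parent first. Stops
 * after max entries, at parent PID 0, or when a stat file can't be read
 * (e.g. the ancestor exited mid-walk). Returns the number of entries.
 */
static int collect_ancestry(pid_t pid, char names[][NAME_LEN], int max)
{
	char self[NAME_LEN];
	pid_t cur, next;
	int n = 0;

	assert(names && max >= 0);
	if (read_stat(pid, self, sizeof(self), &cur))
		return 0;
	while (n < max && cur > 0) {
		if (read_stat(cur, names[n], NAME_LEN, &next))
			break;
		n++;
		cur = next;
	}
	return n;
}
```

Something like collect_ancestry(getpid(), names, 10) would match the
depth Jeff mentions logging on GfW. Each level costs one small file
read, which is why doing it once at startup seems cheap enough to me.
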