git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] tr2: log parent process name
@ 2021-05-07  0:29 Emily Shaffer
  2021-05-07  3:25 ` Bagas Sanjaya
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-07  0:29 UTC (permalink / raw)
  To: git; +Cc: Emily Shaffer

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE? Knowing where the Git invocation came
from can help with debugging to isolate where the problem came from.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient regardless of platform.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
We briefly discussed hiding this behind a config, internally. However, I
wanted to include the parent name alongside the cmd_start event, which
happens very early (maybe before config gathering?).

Maybe it's better to log the parent_name as its own event, since it
shouldn't change over the lifetime of the process?

procfs is very non-portable, though - I think this won't even work on
MacOS. So I'm curious if anybody has better suggestions for how to do
this.

 - Emily

 Makefile                  |  1 +
 compat/procinfo.c         | 19 +++++++++++++++++++
 git-compat-util.h         |  6 ++++++
 t/t0212-trace2-event.sh   |  8 ++++++++
 t/t0212/parse_events.perl |  1 +
 trace2/tr2_tgt_event.c    |  3 +++
 6 files changed, 38 insertions(+)
 create mode 100644 compat/procinfo.c

diff --git a/Makefile b/Makefile
index 93664d6714..19f5189c6f 100644
--- a/Makefile
+++ b/Makefile
@@ -855,6 +855,7 @@ LIB_OBJS += commit-graph.o
 LIB_OBJS += commit-reach.o
 LIB_OBJS += commit.o
 LIB_OBJS += compat/obstack.o
+LIB_OBJS += compat/procinfo.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
 LIB_OBJS += connect.o
diff --git a/compat/procinfo.c b/compat/procinfo.c
new file mode 100644
index 0000000000..7f4c8dd284
--- /dev/null
+++ b/compat/procinfo.c
@@ -0,0 +1,19 @@
+#include "git-compat-util.h"
+
+#include "strbuf.h"
+
+char *get_process_name(int pid)
+{
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf out = STRBUF_INIT;
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/cmdline", pid);
+	if (strbuf_read_file(&out, procfs_path.buf, 0) > 0)
+	{
+		strbuf_release(&procfs_path);
+		return strbuf_detach(&out, NULL);
+	}
+
+	/* NEEDSWORK: add non-procfs implementations here. */
+	return NULL;
+}
diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a3..cc7d5d8a2a 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1382,4 +1382,10 @@ static inline void *container_of_or_null_offset(void *ptr, size_t offset)
 
 void sleep_millisec(int millisec);
 
+/*
+ * Convert PID to process name (as would show in top/task manager). Returns
+ * NULL if unimplemented - be sure to check for NULL at callsite.
+ */
+char *get_process_name(int pid);
+
 #endif
diff --git a/t/t0212-trace2-event.sh b/t/t0212-trace2-event.sh
index 1529155cf0..3a2a8a5b5f 100755
--- a/t/t0212-trace2-event.sh
+++ b/t/t0212-trace2-event.sh
@@ -61,6 +61,7 @@ test_expect_success JSON_PP 'event stream, error event' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2",
 	|    "name":"trace2",
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  }
 	|};
@@ -115,6 +116,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2",
 	|    "name":"trace2",
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  },
 	|  "_SID0_/_SID1_":{
@@ -143,6 +145,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2/trace2",
 	|    "name":"trace2",
+	|    "parent_name":"test-tool",
 	|    "version":"$V"
 	|  },
 	|  "_SID0_/_SID1_/_SID2_":{
@@ -155,6 +158,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2/trace2/trace2",
 	|    "name":"trace2",
+	|    "parent_name":"$TEST_DIRECTORY/../t/helper//test-tool",
 	|    "version":"$V"
 	|  }
 	|};
@@ -192,6 +196,7 @@ test_expect_success JSON_PP 'event stream, list config' '
 	|        "value":"hello world"
 	|      }
 	|    ],
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  }
 	|};
@@ -229,6 +234,7 @@ test_expect_success JSON_PP 'event stream, list env vars' '
 	|        "value":"hello world"
 	|      }
 	|    ],
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  }
 	|};
@@ -263,6 +269,7 @@ test_expect_success JSON_PP 'basic trace2_data' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2",
 	|    "name":"trace2",
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  }
 	|};
@@ -295,6 +302,7 @@ test_expect_success JSON_PP 'using global config, event stream, error event' '
 	|    "exit_code":0,
 	|    "hierarchy":"trace2",
 	|    "name":"trace2",
+	|    "parent_name":"/bin/sh",
 	|    "version":"$V"
 	|  }
 	|};
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..dd8b1be844 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -99,6 +99,7 @@
     }
 
     elsif ($event eq 'start') {
+	$processes->{$sid}->{'parent_name'} = $line->{'parent_name'};
 	$processes->{$sid}->{'argv'} = $line->{'argv'};
 	$processes->{$sid}->{'argv'}[0] = "_EXE_";
     }
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..d258d5807c 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -145,10 +145,12 @@ static void fn_start_fl(const char *file, int line,
 	const char *event_name = "start";
 	struct json_writer jw = JSON_WRITER_INIT;
 	double t_abs = (double)us_elapsed_absolute / 1000000.0;
+	char *parent_name = get_process_name(getppid());
 
 	jw_object_begin(&jw, 0);
 	event_fmt_prepare(event_name, file, line, NULL, &jw);
 	jw_object_double(&jw, "t_abs", 6, t_abs);
+	jw_object_string(&jw, "parent_name", parent_name ? parent_name : NULL );
 	jw_object_inline_begin_array(&jw, "argv");
 	jw_array_argv(&jw, argv);
 	jw_end(&jw);
@@ -156,6 +158,7 @@ static void fn_start_fl(const char *file, int line,
 
 	tr2_dst_write_line(&tr2dst_event, &jw.json);
 	jw_release(&jw);
+	free(parent_name);
 }
 
 static void fn_exit_fl(const char *file, int line, uint64_t us_elapsed_absolute,
-- 
2.31.1.607.g51e8a6a459-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
@ 2021-05-07  3:25 ` Bagas Sanjaya
  2021-05-07 17:09 ` Emily Shaffer
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Bagas Sanjaya @ 2021-05-07  3:25 UTC (permalink / raw)
  To: Emily Shaffer, git

On 07/05/21 07.29, Emily Shaffer wrote:
> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE? Knowing where the Git invocation came
> from can help with debugging to isolate where the problem came from.
> 
> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient regardless of platform.

What about on Windows?

> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> We briefly discussed hiding this behind a config, internally. However, I
> wanted to include the parent name alongside the cmd_start event, which
> happens very early (maybe before config gathering?).
> 
> Maybe it's better to log the parent_name as its own event, since it
> shouldn't change over the lifetime of the process?
> 
> procfs is very non-portable, though - I think this won't even work on
> MacOS. So I'm curious if anybody has better suggestions for how to do
> this.

Maybe we can say that "currently the method of gathering parent process
name that Git uses only work on Linux".

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
  2021-05-07  3:25 ` Bagas Sanjaya
@ 2021-05-07 17:09 ` Emily Shaffer
  2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-07 17:09 UTC (permalink / raw)
  To: git

On Thu, May 06, 2021 at 05:29:08PM -0700, Emily Shaffer wrote:
> 
> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE? Knowing where the Git invocation came
> from can help with debugging to isolate where the problem came from.
> 
> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient regardless of platform.
> 
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> We briefly discussed hiding this behind a config, internally. However, I
> wanted to include the parent name alongside the cmd_start event, which
> happens very early (maybe before config gathering?).
> 
> Maybe it's better to log the parent_name as its own event, since it
> shouldn't change over the lifetime of the process?
> 
> procfs is very non-portable, though - I think this won't even work on
> MacOS. So I'm curious if anybody has better suggestions for how to do
> this.

I wrote this and then I wrote nonportable tests anyways, bah. Working on
a fix now - the tests break for MacOS and Windows.

The parent_name needs to be set conditionally below.

> diff --git a/t/t0212-trace2-event.sh b/t/t0212-trace2-event.sh
> index 1529155cf0..3a2a8a5b5f 100755
> --- a/t/t0212-trace2-event.sh
> +++ b/t/t0212-trace2-event.sh
> @@ -61,6 +61,7 @@ test_expect_success JSON_PP 'event stream, error event' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  }
>  	|};
> @@ -115,6 +116,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  },
>  	|  "_SID0_/_SID1_":{
> @@ -143,6 +145,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2/trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"test-tool",
>  	|    "version":"$V"
>  	|  },
>  	|  "_SID0_/_SID1_/_SID2_":{
> @@ -155,6 +158,7 @@ test_expect_success JSON_PP 'event stream, return code 0' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2/trace2/trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"$TEST_DIRECTORY/../t/helper//test-tool",
>  	|    "version":"$V"
>  	|  }
>  	|};
> @@ -192,6 +196,7 @@ test_expect_success JSON_PP 'event stream, list config' '
>  	|        "value":"hello world"
>  	|      }
>  	|    ],
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  }
>  	|};
> @@ -229,6 +234,7 @@ test_expect_success JSON_PP 'event stream, list env vars' '
>  	|        "value":"hello world"
>  	|      }
>  	|    ],
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  }
>  	|};
> @@ -263,6 +269,7 @@ test_expect_success JSON_PP 'basic trace2_data' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  }
>  	|};
> @@ -295,6 +302,7 @@ test_expect_success JSON_PP 'using global config, event stream, error event' '
>  	|    "exit_code":0,
>  	|    "hierarchy":"trace2",
>  	|    "name":"trace2",
> +	|    "parent_name":"/bin/sh",
>  	|    "version":"$V"
>  	|  }
>  	|};


 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
  2021-05-07  3:25 ` Bagas Sanjaya
  2021-05-07 17:09 ` Emily Shaffer
@ 2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
  2021-05-11 21:31   ` Junio C Hamano
  2021-05-14 22:06   ` Emily Shaffer
  2021-05-11 17:28 ` Jeff Hostetler
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
  4 siblings, 2 replies; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-10 12:29 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: git


On Thu, May 06 2021, Emily Shaffer wrote:

> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE? Knowing where the Git invocation came
> from can help with debugging to isolate where the problem came from.

Aside from the portability concerns others have raised, I don't really
see why you'd need this.

We already have the nest-level as part of the SID, so isn't it
sufficient (and portable) at the top-level to log what isatty says + set
the initial SID "root" in the IDE (which presumably knows about git).

Wouldn't this log passwords in cases of e.g.:

    some-script --git-password secret # invokes "git"

In older versions of linux reading e.g. smaps from /proc/self would
stall the kernel while the read was happening, I haven't checked whether
cmdline is such a thing (probably not), but it's a subtle thing to have
in mind for this / follow-ups if it's made portable (is that an issue on
other OS's?).

All that being said I've got nothing fundamentally against this.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
                   ` (2 preceding siblings ...)
  2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
@ 2021-05-11 17:28 ` Jeff Hostetler
  2021-05-14 22:07   ` Emily Shaffer
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
  4 siblings, 1 reply; 45+ messages in thread
From: Jeff Hostetler @ 2021-05-11 17:28 UTC (permalink / raw)
  To: Emily Shaffer, git



On 5/6/21 8:29 PM, Emily Shaffer wrote:
> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE? Knowing where the Git invocation came
> from can help with debugging to isolate where the problem came from.
> 
> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient regardless of platform.
> 
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> We briefly discussed hiding this behind a config, internally. However, I
> wanted to include the parent name alongside the cmd_start event, which
> happens very early (maybe before config gathering?).
> 
> Maybe it's better to log the parent_name as its own event, since it
> shouldn't change over the lifetime of the process?
> 
> procfs is very non-portable, though - I think this won't even work on
> MacOS. So I'm curious if anybody has better suggestions for how to do
> this.
> 
>   - Emily


Look at `trace2_collect_process_info()` in `trace2.h` and
`compat/win32/trace2_win32_process_info.c`.

That function is designed to let the process call out to
platform-specific code to do things like getting the name
of the invoking process.

It is called with an enum to indicate when/why it is being
called.

On Windows, I have platform code to get the process ancestry,
the peak VM usage, and whether the process is being debugged.
Two of those were easy....  And all are very platform-specific.

They all generate events of the form:

	trace2_data_*("process", "windows/<something>", ...)

Some of my `t021[012]/` helper scripts exclude "process" category
messages from the output to make the t*.sh tests portable.


To implement a /proc lookup as you have suggested, I would
fixup the Makefile or config.mak.uname to include something
like a "HAVE_..." feature and then use that at the
bottom of trace2.h to declare a trace2_collect_process_info()
function and then have your code in compat/procinfo.c hide
behind my interface.

Your code should emits events of the form:

	trace2_data_*("process", "linux/<something>", ...)

I notice there are several `PROCFS_*` symbols currently defined
in `config.mak.uname` so maybe this should be:

	trace2_data_*("process", "procfs/<something>", ...)

(I'm not sure how similar code for Linux and the various BSD
versions would be.)


Jeff


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
@ 2021-05-11 21:31   ` Junio C Hamano
  2021-05-14 22:06   ` Emily Shaffer
  1 sibling, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2021-05-11 21:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Emily Shaffer, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, May 06 2021, Emily Shaffer wrote:
>
>> It can be useful to tell who invoked Git - was it invoked manually by a
>> user via CLI or script? By an IDE? Knowing where the Git invocation came
>> from can help with debugging to isolate where the problem came from.
>
> Aside from the portability concerns others have raised, I don't really
> see why you'd need this.
>
> We already have the nest-level as part of the SID, so isn't it
> sufficient (and portable) at the top-level to log what isatty says + set
> the initial SID "root" in the IDE (which presumably knows about git).
>
> Wouldn't this log passwords in cases of e.g.:
>
>     some-script --git-password secret # invokes "git"

Both valid and excellent points, I would think.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
  2021-05-11 21:31   ` Junio C Hamano
@ 2021-05-14 22:06   ` Emily Shaffer
  2021-05-16  3:48     ` Junio C Hamano
  1 sibling, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-05-14 22:06 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Mon, May 10, 2021 at 02:29:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Thu, May 06 2021, Emily Shaffer wrote:
> 
> > It can be useful to tell who invoked Git - was it invoked manually by a
> > user via CLI or script? By an IDE? Knowing where the Git invocation came
> > from can help with debugging to isolate where the problem came from.
> 
> Aside from the portability concerns others have raised, I don't really
> see why you'd need this.
> 
> We already have the nest-level as part of the SID, so isn't it
> sufficient (and portable) at the top-level to log what isatty says + set
> the initial SID "root" in the IDE (which presumably knows about git).

If you already know all the IDEs and scripts which invoke Git, sure, you
could set the SID root - we do this with 'repo' tool today. But actually
we want this because we aren't sure exactly who at Google is invoking
Git in their tooling, how, why, etc. - this logline was supposed to help
with that. Chicken, egg, etc.

Or else I'm misunderstanding your suggestion; to me it sounded like "go
fix your VSCode plugin so that it sets an additional envvar" and I
responded accordingly. If you mean something else I didn't see,
implemented in Git land, then I'm curious to hear more.

> 
> Wouldn't this log passwords in cases of e.g.:
> 
>     some-script --git-password secret # invokes "git"

I have nothing but nasty things to say about scripts which would do
that, but your point is valid enough to make me think this logline
should be gated behind a config and posted much later (when config is
available - I think it's not yet, with the patch as-is).

> 
> In older versions of linux reading e.g. smaps from /proc/self would
> stall the kernel while the read was happening, I haven't checked whether
> cmdline is such a thing (probably not), but it's a subtle thing to have
> in mind for this / follow-ups if it's made portable (is that an issue on
> other OS's?).

That's an interesting point and I wonder how I can validate if it
does/doesn't stall in this way - I guess I can go manpage diving?

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-11 17:28 ` Jeff Hostetler
@ 2021-05-14 22:07   ` Emily Shaffer
  0 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-14 22:07 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git

On Tue, May 11, 2021 at 01:28:39PM -0400, Jeff Hostetler wrote:
> 
> 
> 
> On 5/6/21 8:29 PM, Emily Shaffer wrote:
> > It can be useful to tell who invoked Git - was it invoked manually by a
> > user via CLI or script? By an IDE? Knowing where the Git invocation came
> > from can help with debugging to isolate where the problem came from.
> > 
> > Unfortunately, there's no cross-platform reliable way to gather the name
> > of the parent process. If procfs is present, we can use that; otherwise
> > we will need to discover the name another way. However, the process ID
> > should be sufficient regardless of platform.
> > 
> > Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> > ---
> > We briefly discussed hiding this behind a config, internally. However, I
> > wanted to include the parent name alongside the cmd_start event, which
> > happens very early (maybe before config gathering?).
> > 
> > Maybe it's better to log the parent_name as its own event, since it
> > shouldn't change over the lifetime of the process?
> > 
> > procfs is very non-portable, though - I think this won't even work on
> > MacOS. So I'm curious if anybody has better suggestions for how to do
> > this.
> > 
> >   - Emily
> 
> 
> Look at `trace2_collect_process_info()` in `trace2.h` and
> `compat/win32/trace2_win32_process_info.c`.
> 
> That function is designed to let the process call out to
> platform-specific code to do things like getting the name
> of the invoking process.
> 
> It is called with an enum to indicate when/why it is being
> called.
> 
> On Windows, I have platform code to get the process ancestry,
> the peak VM usage, and whether the process is being debugged.
> Two of those were easy....  And all are very platform-specific.
> 
> They all generate events of the form:
> 
> 	trace2_data_*("process", "windows/<something>", ...)
> 
> Some of my `t021[012]/` helper scripts exclude "process" category
> messages from the output to make the t*.sh tests portable.
> 
> 
> To implement a /proc lookup as you have suggested, I would
> fixup the Makefile or config.mak.uname to include something
> like a "HAVE_..." feature and then use that at the
> bottom of trace2.h to declare a trace2_collect_process_info()
> function and then have your code in compat/procinfo.c hide
> behind my interface.
> 
> Your code should emits events of the form:
> 
> 	trace2_data_*("process", "linux/<something>", ...)
> 
> I notice there are several `PROCFS_*` symbols currently defined
> in `config.mak.uname` so maybe this should be:
> 
> 	trace2_data_*("process", "procfs/<something>", ...)
> 
> (I'm not sure how similar code for Linux and the various BSD
> versions would be.)

Thanks a bunch for the pointers - this is great.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-14 22:06   ` Emily Shaffer
@ 2021-05-16  3:48     ` Junio C Hamano
  2021-05-17 20:17       ` Emily Shaffer
  0 siblings, 1 reply; 45+ messages in thread
From: Junio C Hamano @ 2021-05-16  3:48 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Ævar Arnfjörð Bjarmason, git

Emily Shaffer <emilyshaffer@google.com> writes:

> On Mon, May 10, 2021 at 02:29:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> We already have the nest-level as part of the SID, so isn't it
>> sufficient (and portable) at the top-level to log what isatty says + set
>> the initial SID "root" in the IDE (which presumably knows about git).
>
> If you already know all the IDEs and scripts which invoke Git, sure, you
> could set the SID root - we do this with 'repo' tool today. But actually
> we want this because we aren't sure exactly who at Google is invoking
> Git in their tooling, how, why, etc. - this logline was supposed to help
> with that. Chicken, egg, etc.

I agreed with Æver's suggestion exectly because I failed to read the
above motivation from the patch.  If you are trying to find out who
called, then you'd need to do the "find the parent process" dance
when your parent did not give you their SID to append your ident to.

Perhaps it was obvious to the author of the patch, but it was
unclear from the point of view of readers.  Perhaps the first
paragraph of the proposed message wants a bit of rephrasing.  "we
aren't sure exactly ... how, why, etc." part of the above really
helped.

Thanks.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] tr2: log parent process name
  2021-05-16  3:48     ` Junio C Hamano
@ 2021-05-17 20:17       ` Emily Shaffer
  0 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-17 20:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, git

On Sun, May 16, 2021 at 12:48:37PM +0900, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@google.com> writes:
> 
> > On Mon, May 10, 2021 at 02:29:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >
> >> We already have the nest-level as part of the SID, so isn't it
> >> sufficient (and portable) at the top-level to log what isatty says + set
> >> the initial SID "root" in the IDE (which presumably knows about git).
> >
> > If you already know all the IDEs and scripts which invoke Git, sure, you
> > could set the SID root - we do this with 'repo' tool today. But actually
> > we want this because we aren't sure exactly who at Google is invoking
> > Git in their tooling, how, why, etc. - this logline was supposed to help
> > with that. Chicken, egg, etc.
> 
> I agreed with Æver's suggestion exectly because I failed to read the
> above motivation from the patch.  If you are trying to find out who
> called, then you'd need to do the "find the parent process" dance
> when your parent did not give you their SID to append your ident to.
> 
> Perhaps it was obvious to the author of the patch, but it was
> unclear from the point of view of readers.  Perhaps the first
> paragraph of the proposed message wants a bit of rephrasing.  "we
> aren't sure exactly ... how, why, etc." part of the above really
> helped.
> 
> Thanks.

Thanks for the feedback. I was vacationing last week but should be
sending a rework today or tomorrow, will see if I can shore up the
commit-msg as well as the portability stuff Jeff H raised.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2] tr2: log parent process name
  2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
                   ` (3 preceding siblings ...)
  2021-05-11 17:28 ` Jeff Hostetler
@ 2021-05-20 21:05 ` Emily Shaffer
  2021-05-20 21:36   ` Randall S. Becker
                     ` (3 more replies)
  4 siblings, 4 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-20 21:05 UTC (permalink / raw)
  To: git
  Cc: Emily Shaffer, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff Hostetler, Bagas Sanjaya

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
we can influence the source code and set the GIT_TRACE2_PARENT_SID
environment variable from the caller process. In 'repo''s case, that
parent SID is manipulated to include the string "repo", which means we
can positively identify when Git was invoked by 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient regardless of platform.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process and a consistent, parseable way.

Git for Windows also gathers information about more than one parent. In
Linux further ancestry info can be gathered with procfs, but it's
unwieldy to do so. In the interest of later moving Git for Windows
ancestry logging to the 'cmd_ancestry' event, and in the interest of
later adding more ancestry to the Linux implementation - or of adding
this functionality to other platforms which have an easier time walking
the process tree - let's make 'cmd_ancestry' accept an array of
parentage.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Hi folks, the comments I received in v1 were of two varieties:
1) "There are better ways to make this platform-safe", and
2) "Your commit message doesn't convince me".
Since I sent v1, though, I also learned a little more about procfs, and
about the trace2 structure overall, so there are some pretty significant
differences from v1:

- I took a look at Jeff H's advice on using a "data_json" event to log
  this and decided it would be a little more flexible to add a new event
  instead. If we want, it'd be feasible to then shoehorn the GfW parent
  tree stuff into this new event too. Doing it this way is definitely
  easier to parse for Google's trace analysis system (which for now
  completely skips "data_json" as it's polymorphic), and also - I think
  - means that we can add more fields later on if we need to (thread
  info, different fields than just /proc/n/comm like exec path, argv,
  whatever).
- Jonathan N also pointed out to me that /proc/n/comm exists, and logs
  the "command name" - excluding argv, excluding path, etc. It seems
  like this is a little more safe about excluding personal information
  from the traces which take the form of "myscript.sh
  --password=hunter2", but would still be worrisome for something like
  "mysupersecretproject.sh". I'm not sure whether that means we still
  want to guard it with a config flag, though.
- I also added a lot to the commit message; hopefully it's not too
  rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
  wasn't going to cut it.
- As for testing, I followed the lead of GfW's parentage info - "this
  isn't portable so writing tests for it will suck, just scrub it from
  the tests". Maybe it makes sense to do some more
  platform-specific-ness in the test suite instead? I wasn't sure.

Thanks, all.
 - Emily

 Makefile                  |  5 ++++
 compat/procinfo.c         | 53 +++++++++++++++++++++++++++++++++++++++
 config.mak.uname          |  1 +
 git-compat-util.h         |  6 +++++
 t/t0210/scrub_normal.perl |  6 +++++
 t/t0211/scrub_perf.perl   |  5 ++++
 t/t0212/parse_events.perl |  5 +++-
 trace2.c                  | 13 ++++++++++
 trace2.h                  | 12 ++++++++-
 trace2/tr2_tgt.h          |  3 +++
 trace2/tr2_tgt_event.c    | 21 ++++++++++++++++
 trace2/tr2_tgt_normal.c   | 19 ++++++++++++++
 trace2/tr2_tgt_perf.c     | 16 ++++++++++++
 13 files changed, 163 insertions(+), 2 deletions(-)
 create mode 100644 compat/procinfo.c

diff --git a/Makefile b/Makefile
index 93664d6714..330e4fa011 100644
--- a/Makefile
+++ b/Makefile
@@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
 endif
 
+ifdef HAVE_PROCFS_LINUX
+	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
+	COMPAT_OBJS += compat/procinfo.o
+endif
+
 ifdef HAVE_NS_GET_EXECUTABLE_PATH
 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
 endif
diff --git a/compat/procinfo.c b/compat/procinfo.c
new file mode 100644
index 0000000000..523600673f
--- /dev/null
+++ b/compat/procinfo.c
@@ -0,0 +1,53 @@
+#include "cache.h"
+
+#include "strbuf.h"
+#include "trace2.h"
+
+char *get_process_name(int pid)
+{
+#ifdef HAVE_PROCFS_LINUX
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf out = STRBUF_INIT;
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
+	if (!strbuf_read_file(&out, procfs_path.buf, 0))
+	{
+		/* All done with file reads, clean up early */
+		strbuf_release(&procfs_path);
+		return strbuf_detach(&out, NULL);
+	}
+#endif
+
+	/* NEEDSWORK: add non-procfs implementations here. */
+	return NULL;
+}
+
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+	if (!trace2_is_enabled())
+		return;
+
+	/* someday we may want to write something extra here, but not today */
+	if (reason == TRACE2_PROCESS_INFO_EXIT)
+		return;
+
+	if (reason == TRACE2_PROCESS_INFO_STARTUP)
+	{
+		/*
+		 * NEEDSWORK: we could do the entire ptree in an array instead,
+		 * see compat/win32/trace2_win32_process_info.c.
+		 */
+		char *names[2];
+		names[0] = get_process_name(getppid());
+		names[1] = NULL;
+
+		if (!names[0])
+			return;
+
+		trace2_cmd_ancestry((const char**)names);
+
+		free(names[0]);
+	}
+
+	return;
+}
diff --git a/config.mak.uname b/config.mak.uname
index cb443b4e02..7ad110a1d2 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	BASIC_CFLAGS += -DHAVE_SYSINFO
 	PROCFS_EXECUTABLE_PATH = /proc/self/exe
+	HAVE_PROCFS_LINUX = YesPlease
 endif
 ifeq ($(uname_S),GNU/kFreeBSD)
 	HAVE_ALLOCA_H = YesPlease
diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a3..cc7d5d8a2a 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1382,4 +1382,10 @@ static inline void *container_of_or_null_offset(void *ptr, size_t offset)
 
 void sleep_millisec(int millisec);
 
+/*
+ * Convert PID to process name (as would show in top/task manager). Returns
+ * NULL if unimplemented - be sure to check for NULL at callsite.
+ */
+char *get_process_name(int pid);
+
 #endif
diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
index c65d1a815e..7cc4de392a 100644
--- a/t/t0210/scrub_normal.perl
+++ b/t/t0210/scrub_normal.perl
@@ -42,6 +42,12 @@
 	# so just omit it for testing purposes.
 	# print "cmd_path _EXE_\n";
     }
+    elsif ($line =~ m/^cmd_ancestry/) {
+	# 'cmd_ancestry' is not implemented everywhere, so for portability's
+	# sake, skip it when parsing normal.
+	#
+	# print "$line";
+    }
     else {
 	print "$line";
     }
diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
index 351af7844e..d164b750ff 100644
--- a/t/t0211/scrub_perf.perl
+++ b/t/t0211/scrub_perf.perl
@@ -44,6 +44,11 @@
 	# $tokens[$col_rest] = "_EXE_";
 	goto SKIP_LINE;
     }
+    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
+	# so skip it.
+	goto SKIP_LINE;
+    }
     elsif ($tokens[$col_event] =~ m/child_exit/) {
 	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
     }
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..b6408560c0 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -132,7 +132,10 @@
 	# just omit it for testing purposes.
 	# $processes->{$sid}->{'path'} = "_EXE_";
     }
-    
+    elsif ($event eq 'cmd_ancestry') {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
+	# just skip it for testing purposes.
+    }
     elsif ($event eq 'cmd_name') {
 	$processes->{$sid}->{'name'} = $line->{'name'};
 	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
diff --git a/trace2.c b/trace2.c
index 256120c7fd..b9b154ac44 100644
--- a/trace2.c
+++ b/trace2.c
@@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
 			tgt_j->pfn_command_path_fl(file, line, pathname);
 }
 
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	struct tr2_tgt *tgt_j;
+	int j;
+
+	if (!trace2_enabled)
+		return;
+
+	for_each_wanted_builtin (j, tgt_j)
+		if (tgt_j->pfn_command_ancestry_fl)
+			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
+}
+
 void trace2_cmd_name_fl(const char *file, int line, const char *name)
 {
 	struct tr2_tgt *tgt_j;
diff --git a/trace2.h b/trace2.h
index ede18c2e06..23743ac62b 100644
--- a/trace2.h
+++ b/trace2.h
@@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
 
 #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
 
+/*
+ * Emit an 'ancestry' event with the process name of the current process's
+ * parent process.
+ * This gives post-processors a way to determine what invoked the command and
+ * learn more about usage patterns.
+ */
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
+
+#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
+
 /*
  * Emit a 'cmd_name' event with the canonical name of the command.
  * This gives post-processors a simple field to identify the command
@@ -492,7 +502,7 @@ enum trace2_process_info_reason {
 	TRACE2_PROCESS_INFO_EXIT,
 };
 
-#if defined(GIT_WINDOWS_NATIVE)
+#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
 void trace2_collect_process_info(enum trace2_process_info_reason reason);
 #else
 #define trace2_collect_process_info(reason) \
diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
index 7b90469212..1f66fd6573 100644
--- a/trace2/tr2_tgt.h
+++ b/trace2/tr2_tgt.h
@@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
 
 typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
 					    const char *command_path);
+typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
+						const char **parent_names);
 typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
 					    const char *name,
 					    const char *hierarchy);
@@ -108,6 +110,7 @@ struct tr2_tgt {
 	tr2_tgt_evt_atexit_t                    *pfn_atexit;
 	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
 	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
+	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
 	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
 	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
 	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..578a9a5287 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	jw_release(&jw);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	const char *parent_name = NULL;
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	event_fmt_prepare(event_name, file, line, NULL, &jw);
+	jw_object_inline_begin_array(&jw, "ancestry");
+
+	while ((parent_name = *parent_names++))
+		jw_array_string(&jw, parent_name);
+
+	jw_end(&jw); /* 'ancestry' array */
+	jw_end(&jw); /* event object */
+
+	tr2_dst_write_line(&tr2dst_event, &jw.json);
+	jw_release(&jw);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 31b602c171..a5751c8864 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *parent_name = NULL;
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	/* cmd_ancestry parent <- grandparent <- great-grandparent */
+	strbuf_addstr(&buf_payload, "cmd_ancestry ");
+	while ((parent_name = *parent_names++)) {
+		strbuf_addstr(&buf_payload, parent_name);
+		/* if we'll write another one after this, add a delimiter */
+		if (parent_names && *parent_names)
+			strbuf_addstr(&buf_payload, " <- ");
+	}
+
+	normal_io_write_fl(file, line, &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index a8018f18cc..af4d65a0a5 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	strbuf_addstr(&buf_payload, "ancestry:[");
+	/* It's not an argv but the rules are basically the same. */
+	sq_append_quote_argv_pretty(&buf_payload, parent_names);
+	strbuf_addch(&buf_payload, ']');
+
+	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
+			 &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v2] tr2: log parent process name
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
@ 2021-05-20 21:36   ` Randall S. Becker
  2021-05-20 23:23     ` Emily Shaffer
  2021-05-21  2:09   ` Junio C Hamano
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Randall S. Becker @ 2021-05-20 21:36 UTC (permalink / raw)
  To: 'Emily Shaffer', git
  Cc: 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On May 20, 2021 5:06 PM, Emily Shaffer wrote:
>To: git@vger.kernel.org
>Cc: Emily Shaffer <emilyshaffer@google.com>; Ævar Arnfjörð Bjarmason <avarab@gmail.com>; Junio C Hamano <gitster@pobox.com>;
>Jeff Hostetler <git@jeffhostetler.com>; Bagas Sanjaya <bagasdotme@gmail.com>
>Subject: [PATCH v2] tr2: log parent process name
>
>It can be useful to tell who invoked Git - was it invoked manually by a user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
>we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In 'repo''s case,
>that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by 'repo' tool.
>However, identifying parents that way requires both that we know which tools invoke Git and that we have the ability to modify the source
>code of those tools. It cannot scale to keep up with the various IDEs and wrappers which use Git, most of which we don't know about.
>Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and performance.
>
>Unfortunately, there's no cross-platform reliable way to gather the name of the parent process. If procfs is present, we can use that;
>otherwise we will need to discover the name another way. However, the process ID should be sufficient regardless of platform.

I like this idea, but there are some platforms where this is unlikely to work. NonStop, in particular, can initiate git - and I frequently do - from a non-POSIX environment where process name is entirely different. In fact, it is something like $ABC (always beginning with a $, which makes life very difficult for shell scripts and screws up GIT_SSH_COMMAND, but I digress). I'm going to need to plug in something very platform-specific to make this work. getppid() always returns 1 in this situation, which is extraordinarily meaningless on the platform and does not represent the actual parent.

I will try to put the appropriate compat hooks in once this moves into master but I can't promise it will be particularly efficient at this stage.

Regards,
Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-20 21:36   ` Randall S. Becker
@ 2021-05-20 23:23     ` Emily Shaffer
  2021-05-21 13:20       ` Randall S. Becker
  0 siblings, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-05-20 23:23 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On Thu, May 20, 2021 at 05:36:25PM -0400, Randall S. Becker wrote:
> 
> On May 20, 2021 5:06 PM, Emily Shaffer wrote:
> >To: git@vger.kernel.org
> >Cc: Emily Shaffer <emilyshaffer@google.com>; Ævar Arnfjörð Bjarmason <avarab@gmail.com>; Junio C Hamano <gitster@pobox.com>;
> >Jeff Hostetler <git@jeffhostetler.com>; Bagas Sanjaya <bagasdotme@gmail.com>
> >Subject: [PATCH v2] tr2: log parent process name
> >
> >It can be useful to tell who invoked Git - was it invoked manually by a user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
> >we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In 'repo''s case,
> >that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by 'repo' tool.
> >However, identifying parents that way requires both that we know which tools invoke Git and that we have the ability to modify the source
> >code of those tools. It cannot scale to keep up with the various IDEs and wrappers which use Git, most of which we don't know about.
> >Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and performance.
> >
> >Unfortunately, there's no cross-platform reliable way to gather the name of the parent process. If procfs is present, we can use that;
> >otherwise we will need to discover the name another way. However, the process ID should be sufficient regardless of platform.
> 
> I like this idea, but there are some platforms where this is unlikely to work. NonStop, in particular, can initiate git - and I frequently do - from a non-POSIX environment where process name is entirely different. In fact, it is something like $ABC (always beginning with a $, which makes life very difficult for shell scripts and screws up GIT_SSH_COMMAND, but I digress). I'm going to need to plug in something very platform-specific to make this work. getppid() always returns 1 in this situation, which is extraordinarily meaningless on the platform and does not represent the actual parent.

Ok. It sounds like you're saying I should be more conservative in the
commit message as well as in the #ifdef scope? Do you think this needs a
reroll to made the #ifdef more aggressive, or would you rather get to it
when you get to it?

It looks like the change in config.mak.uname won't affect NonStop; I
think also the compat/procinfo.c is probably indicative enough of "this
stuff is for procfs" that it won't look like it *should* work for
NonStop, which means that you should still get the stub for
'trace2_collect_process_info()'. But if you think the guards aren't
readable enough I can try to move them around a little more.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
  2021-05-20 21:36   ` Randall S. Becker
@ 2021-05-21  2:09   ` Junio C Hamano
  2021-05-21 19:02     ` Emily Shaffer
  2021-05-21 19:15   ` Jeff Hostetler
  2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
  3 siblings, 1 reply; 45+ messages in thread
From: Junio C Hamano @ 2021-05-21  2:09 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya

Emily Shaffer <emilyshaffer@google.com> writes:

> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient regardless of platform.

Not a strong objection, but I wonder if seeing random integer(s) is
better than not having cmd_ancestry info at all.  The latter better
signals that the platform does not yet have the "parent process
name" feature, I would think.

> Git for Windows also gathers information about more than one parent. In
> Linux further ancestry info can be gathered with procfs, but it's
> unwieldy to do so. In the interest of later moving Git for Windows
> ancestry logging to the 'cmd_ancestry' event, and in the interest of
> later adding more ancestry to the Linux implementation - or of adding
> this functionality to other platforms which have an easier time walking
> the process tree - let's make 'cmd_ancestry' accept an array of
> parentage.

Could we rephrase "more than one parent" at the beginning to
clarify?  I initially had to wonder what "an array of parentage"
contains (father and mother, or a sole parent and its sole parent,
which is a sole grandparent).  Since there is no "multiple processes
meet and spawn a single process", I take it is the latter.  Perhaps
"more than one generation of" or something?

> +ifdef HAVE_PROCFS_LINUX
> +	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
> +	COMPAT_OBJS += compat/procinfo.o
> +endif
> +
>  ifdef HAVE_NS_GET_EXECUTABLE_PATH
>  	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
>  endif
> diff --git a/compat/procinfo.c b/compat/procinfo.c
> new file mode 100644
> index 0000000000..523600673f
> --- /dev/null
> +++ b/compat/procinfo.c
> @@ -0,0 +1,53 @@
> +#include "cache.h"
> +
> +#include "strbuf.h"
> +#include "trace2.h"
> +
> +char *get_process_name(int pid)
> +{
> +#ifdef HAVE_PROCFS_LINUX
> +	struct strbuf procfs_path = STRBUF_INIT;
> +	struct strbuf out = STRBUF_INIT;
> +	/* try to use procfs if it's present. */
> +	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
> +	if (!strbuf_read_file(&out, procfs_path.buf, 0))
> +	{

Place this opening brace at the end of the previous line.

> +		/* All done with file reads, clean up early */
> +		strbuf_release(&procfs_path);
> +		return strbuf_detach(&out, NULL);
> +	}
> +#endif
> +
> +	/* NEEDSWORK: add non-procfs implementations here. */
> +	return NULL;
> +}

> +void trace2_collect_process_info(enum trace2_process_info_reason reason)
> +{
> +	if (!trace2_is_enabled())
> +		return;
> +
> +	/* someday we may want to write something extra here, but not today */
> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
> +		return;
> +
> +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
> +	{

Ditto.

> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		char *names[2];
> +		names[0] = get_process_name(getppid());
> +		names[1] = NULL;
> +
> +		if (!names[0])
> +			return;

OK, so if there is no name given, we do not show pid as a
placeholder.

> +		trace2_cmd_ancestry((const char**)names);
> +
> +		free(names[0]);
> +	}
> +
> +	return;
> +}
> diff --git a/config.mak.uname b/config.mak.uname
> index cb443b4e02..7ad110a1d2 100644
> --- a/config.mak.uname
> +++ b/config.mak.uname
> @@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
>  	FREAD_READS_DIRECTORIES = UnfortunatelyYes
>  	BASIC_CFLAGS += -DHAVE_SYSINFO
>  	PROCFS_EXECUTABLE_PATH = /proc/self/exe
> +	HAVE_PROCFS_LINUX = YesPlease

Have all Linux instances procfs enabled and mounted?  It might be
that we need to detect this at runtime anyway?

    ... goes and thinks ...

Ah, OK, that "try reading from proc/%d/comm" is the runtime
detection, so it is only this Makefile variable is slightly
misnamed (it is not "HAVE" but "is worth checking for it").

Makes sense.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v2] tr2: log parent process name
  2021-05-20 23:23     ` Emily Shaffer
@ 2021-05-21 13:20       ` Randall S. Becker
  2021-05-21 16:24         ` Randall S. Becker
  0 siblings, 1 reply; 45+ messages in thread
From: Randall S. Becker @ 2021-05-21 13:20 UTC (permalink / raw)
  To: 'Emily Shaffer'
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On May 20, 2021 7:24 PM, Emily Shaffer wrote:
>
>On Thu, May 20, 2021 at 05:36:25PM -0400, Randall S. Becker wrote:
>>
>> On May 20, 2021 5:06 PM, Emily Shaffer wrote:
>> >To: git@vger.kernel.org
>> >Cc: Emily Shaffer <emilyshaffer@google.com>; Ævar Arnfjörð Bjarmason
>> ><avarab@gmail.com>; Junio C Hamano <gitster@pobox.com>; Jeff
>> >Hostetler <git@jeffhostetler.com>; Bagas Sanjaya
>> ><bagasdotme@gmail.com>
>> >Subject: [PATCH v2] tr2: log parent process name
>> >
>> >It can be useful to tell who invoked Git - was it invoked manually by
>> >a user via CLI or script? By an IDE?  In some cases - like 'repo'
>> >tool - we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In
'repo''s
>case, that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by
'repo'
>tool.
>> >However, identifying parents that way requires both that we know
>> >which tools invoke Git and that we have the ability to modify the source code of those tools. It cannot scale to keep up with
the various
>IDEs and wrappers which use Git, most of which we don't know about.
>> >Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and
>performance.
>> >
>> >Unfortunately, there's no cross-platform reliable way to gather the
>> >name of the parent process. If procfs is present, we can use that; otherwise we will need to discover the name another way.
However,
>the process ID should be sufficient regardless of platform.
>>
>> I like this idea, but there are some platforms where this is unlikely to work. NonStop, in particular, can initiate git - and I
frequently do -
>from a non-POSIX environment where process name is entirely different. In fact, it is something like $ABC (always beginning with a
$,
>which makes life very difficult for shell scripts and screws up GIT_SSH_COMMAND, but I digress). I'm going to need to plug in
something
>very platform-specific to make this work. getppid() always returns 1 in this situation, which is extraordinarily meaningless on the
platform
>and does not represent the actual parent.
>
>Ok. It sounds like you're saying I should be more conservative in the commit message as well as in the #ifdef scope? Do you think
this
>needs a reroll to made the #ifdef more aggressive, or would you rather get to it when you get to it?

I'll get to it pretty quickly once it's rolled in.

>It looks like the change in config.mak.uname won't affect NonStop; I think also the compat/procinfo.c is probably indicative enough
of "this
>stuff is for procfs" that it won't look like it *should* work for NonStop, which means that you should still get the stub for
>'trace2_collect_process_info()'. But if you think the guards aren't readable enough I can try to move them around a little more.

Guards are fine. There's just a lot more work to do for me. We need to make sure that the rendering of ancestor processes are
generic enough not to be just pid_t through any interfaces where this is queried. In NonStop's case, char[25], should be sufficient
for the short term, but I would prefer something longer, say char[128] to be safe for the future in which to stick the ancestor. To
be completely unique, the ancestor is going to look like \node.$proc:sequence (where node is a 7 character name, proc is a 5
character name, and sequence is currently a long.

-Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v2] tr2: log parent process name
  2021-05-21 13:20       ` Randall S. Becker
@ 2021-05-21 16:24         ` Randall S. Becker
  0 siblings, 0 replies; 45+ messages in thread
From: Randall S. Becker @ 2021-05-21 16:24 UTC (permalink / raw)
  To: 'Emily Shaffer'
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On May 21, 2021 9:21 AM, I wrote:
>To: 'Emily Shaffer' <emilyshaffer@google.com>
>Cc: git@vger.kernel.org; 'Ævar Arnfjörð Bjarmason' <avarab@gmail.com>; 'Junio C Hamano' <gitster@pobox.com>; 'Jeff Hostetler'
><git@jeffhostetler.com>; 'Bagas Sanjaya' <bagasdotme@gmail.com>
>Subject: RE: [PATCH v2] tr2: log parent process name
>
>On May 20, 2021 7:24 PM, Emily Shaffer wrote:
>>
>>On Thu, May 20, 2021 at 05:36:25PM -0400, Randall S. Becker wrote:
>>>
>>> On May 20, 2021 5:06 PM, Emily Shaffer wrote:
>>> >To: git@vger.kernel.org
>>> >Cc: Emily Shaffer <emilyshaffer@google.com>; Ævar Arnfjörð Bjarmason
>>> ><avarab@gmail.com>; Junio C Hamano <gitster@pobox.com>; Jeff
>>> >Hostetler <git@jeffhostetler.com>; Bagas Sanjaya
>>> ><bagasdotme@gmail.com>
>>> >Subject: [PATCH v2] tr2: log parent process name
>>> >
>>> >It can be useful to tell who invoked Git - was it invoked manually
>>> >by a user via CLI or script? By an IDE?  In some cases - like 'repo'
>>> >tool - we can influence the source code and set the
>>> >GIT_TRACE2_PARENT_SID environment variable from the caller process.
>>> >In
>'repo''s
>>case, that parent SID is manipulated to include the string "repo",
>>which means we can positively identify when Git was invoked by
>'repo'
>>tool.
>>> >However, identifying parents that way requires both that we know
>>> >which tools invoke Git and that we have the ability to modify the
>>> >source code of those tools. It cannot scale to keep up with
>the various
>>IDEs and wrappers which use Git, most of which we don't know about.
>>> >Learning which tools and wrappers invoke Git, and how, would give us
>>> >insight to decide where to improve Git's usability and
>>performance.
>>> >
>>> >Unfortunately, there's no cross-platform reliable way to gather the
>>> >name of the parent process. If procfs is present, we can use that; otherwise we will need to discover the name another way.
>However,
>>the process ID should be sufficient regardless of platform.
>>>
>>> I like this idea, but there are some platforms where this is unlikely
>>> to work. NonStop, in particular, can initiate git - and I
>frequently do -
>>from a non-POSIX environment where process name is entirely different.
>>In fact, it is something like $ABC (always beginning with a
>$,
>>which makes life very difficult for shell scripts and screws up
>>GIT_SSH_COMMAND, but I digress). I'm going to need to plug in
>something
>>very platform-specific to make this work. getppid() always returns 1 in
>>this situation, which is extraordinarily meaningless on the
>platform
>>and does not represent the actual parent.
>>
>>Ok. It sounds like you're saying I should be more conservative in the
>>commit message as well as in the #ifdef scope? Do you think
>this
>>needs a reroll to made the #ifdef more aggressive, or would you rather get to it when you get to it?
>
>I'll get to it pretty quickly once it's rolled in.
>
>>It looks like the change in config.mak.uname won't affect NonStop; I
>>think also the compat/procinfo.c is probably indicative enough
>of "this
>>stuff is for procfs" that it won't look like it *should* work for
>>NonStop, which means that you should still get the stub for 'trace2_collect_process_info()'. But if you think the guards aren't
readable
>enough I can try to move them around a little more.
>
>Guards are fine. There's just a lot more work to do for me. We need to make sure that the rendering of ancestor processes are
generic
>enough not to be just pid_t through any interfaces where this is queried. In NonStop's case, char[25], should be sufficient for the
short
>term, but I would prefer something longer, say char[128] to be safe for the future in which to stick the ancestor. To be completely
unique,
>the ancestor is going to look like \node.$proc:sequence (where node is a 7 character name, proc is a 5 character name, and sequence
is
>currently a long.

Just so we know what's coming, the code snippet to get a parent on NonStop is as follows (roughly):
        pid_t ossParent = getppid();
        if (ossParent == 1) {
                short pHandle[10];
                short pAncestor[10];
                short ancestorLength;
                short error;
                short attributes[] = { 40 }; /* MOM Process */
                short processNameLength;
                char processName[64];

                PROCESSHANDLE_NULLIT_(pHandle);
                PROCESS_GETINFO_(pHandle);
                error = PROCESS_GETINFOLIST_(,,,, pHandle,
                        attributes, (short) sizeof(attributes)/sizeof(attributes[0]),
                        pAncestor, (short) sizeof(pAncestor), &ancestorLength);
                if (error) {
                        printf("Cannot process parent. Error %d\n", error);
                        return;
                }
                PROCESSHANDLE_TO_FILENAME_(pAncestor, processName, (short) sizeof(processName),
                        &processNameLength);
                processName[processNameLength] = '\0';
                printf("GUARDIAN Parent %s\n", processName);
        } else {
                printf("OSS Parent %d\n", ossParent);
        }

Which in the test program generates
GUARDIAN Parent \HPITUG.$:3:1100:583555076

Regards,
Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21  2:09   ` Junio C Hamano
@ 2021-05-21 19:02     ` Emily Shaffer
  2021-05-21 23:22       ` Junio C Hamano
  0 siblings, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-05-21 19:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya

On Fri, May 21, 2021 at 11:09:49AM +0900, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@google.com> writes:
> 
> > Unfortunately, there's no cross-platform reliable way to gather the name
> > of the parent process. If procfs is present, we can use that; otherwise
> > we will need to discover the name another way. However, the process ID
> > should be sufficient regardless of platform.
> 
> Not a strong objection, but I wonder if seeing random integer(s) is
> better than not having cmd_ancestry info at all.  The latter better
> signals that the platform does not yet have the "parent process
> name" feature, I would think.

Hm, we could; I guess then analyzers could still correlate "all these
things were called by some kind of wrapper, even if we don't know if
that was an IDE or a script or just the user or what, we can guess based
on other heuristics". Ok.

> 
> > Git for Windows also gathers information about more than one parent. In
> > Linux further ancestry info can be gathered with procfs, but it's
> > unwieldy to do so. In the interest of later moving Git for Windows
> > ancestry logging to the 'cmd_ancestry' event, and in the interest of
> > later adding more ancestry to the Linux implementation - or of adding
> > this functionality to other platforms which have an easier time walking
> > the process tree - let's make 'cmd_ancestry' accept an array of
> > parentage.
> 
> Could we rephrase "more than one parent" at the beginning to
> clarify?  I initially had to wonder what "an array of parentage"
> contains (father and mother, or a sole parent and its sole parent,
> which is a sole grandparent).  Since there is no "multiple processes
> meet and spawn a single process", I take it is the latter.  Perhaps
> "more than one generation of" or something?

Good point, will refer to "more than one generation". Thanks.

> > +	if (!strbuf_read_file(&out, procfs_path.buf, 0))
> > +	{
> 
> Place this opening brace at the end of the previous line.

Will polish up this and others for v3, hopefully today.
> > +		if (!names[0])
> > +			return;
> 
> OK, so if there is no name given, we do not show pid as a
> placeholder.

Based on your suggestion above I think it will make sense to show pid as
placeholder after all, though. So I will change that for v3.

> >  	PROCFS_EXECUTABLE_PATH = /proc/self/exe
> > +	HAVE_PROCFS_LINUX = YesPlease
> 
> Have all Linux instances procfs enabled and mounted?  It might be
> that we need to detect this at runtime anyway?
> 
>     ... goes and thinks ...
> 
> Ah, OK, that "try reading from proc/%d/comm" is the runtime
> detection, so it is only this Makefile variable is slightly
> misnamed (it is not "HAVE" but "is worth checking for it").

I wonder what is better. "MAYBE_PROCFS_LINUX"? I don't see any other
vars in config.mak.uname that indicate "pretty sure but not totally
sure" in a quick scan. However, right above this line we seem to feel
certain in our guess about "PROCFS_EXECUTABLE_PATH"...

...but when it is used in exec-cmd.c:git_get_exec_path_procfs(), invoked
by exec-cmd.c:git_get_exec_path(), we're very tolerant to faults if it's
not there:

  static int git_get_exec_path(struct strbuf *buf, const char *argv0)
  {
  	/*
  	 * [snip]
  	 * Each of these functions returns 0 on success, so evaluation will stop
  	 * after the first successful method.
  	 */
  	if (
  #ifdef HAVE_BSD_KERN_PROC_SYSCTL
  		git_get_exec_path_bsd_sysctl(buf) &&
  #endif /* HAVE_BSD_KERN_PROC_SYSCTL */
  
  #ifdef HAVE_NS_GET_EXECUTABLE_PATH
  		git_get_exec_path_darwin(buf) &&
  #endif /* HAVE_NS_GET_EXECUTABLE_PATH */
  
  #ifdef PROCFS_EXECUTABLE_PATH
  		git_get_exec_path_procfs(buf) &&  /*** <- OK if fails ***/
  #endif /* PROCFS_EXECUTABLE_PATH */
  
  #ifdef HAVE_WPGMPTR
  		git_get_exec_path_wpgmptr(buf) &&
  #endif /* HAVE_WPGMPTR */
  
  		git_get_exec_path_from_argv0(buf, argv0)) {
  		return -1;
  	}
  
  [snip]
  
  #ifdef PROCFS_EXECUTABLE_PATH
  /*
   * Resolves the executable path by examining a procfs symlink.
   *
   * Returns 0 on success, -1 on failure.
   */
  static int git_get_exec_path_procfs(struct strbuf *buf)
  {
  	if (strbuf_realpath(buf, PROCFS_EXECUTABLE_PATH, 0)) {
  		trace_printf(
  			"trace: resolved executable path from procfs: %s\n",
  			buf->buf);
  		return 0;
  	}
  	return -1;
  }
  #endif /* PROCFS_EXECUTABLE_PATH */


So it seems this other procfs bit takes the "probably but not
definitely" step and is tolerant at runtime as well. Which doesn't help
me much to decide how to rename HAVE_PROCFS_LINUX.

I'll switch it to MAYBE_PROCFS_LINUX for v3 unless someone yells, I
guess.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
  2021-05-20 21:36   ` Randall S. Becker
  2021-05-21  2:09   ` Junio C Hamano
@ 2021-05-21 19:15   ` Jeff Hostetler
  2021-05-21 20:05     ` Emily Shaffer
  2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
  3 siblings, 1 reply; 45+ messages in thread
From: Jeff Hostetler @ 2021-05-21 19:15 UTC (permalink / raw)
  To: Emily Shaffer, git
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Bagas Sanjaya



On 5/20/21 5:05 PM, Emily Shaffer wrote:
> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
> we can influence the source code and set the GIT_TRACE2_PARENT_SID
> environment variable from the caller process. In 'repo''s case, that
> parent SID is manipulated to include the string "repo", which means we
> can positively identify when Git was invoked by 'repo' tool. However,
> identifying parents that way requires both that we know which tools
> invoke Git and that we have the ability to modify the source code of
> those tools. It cannot scale to keep up with the various IDEs and
> wrappers which use Git, most of which we don't know about. Learning
> which tools and wrappers invoke Git, and how, would give us insight to
> decide where to improve Git's usability and performance.
> 
> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient regardless of platform.
> 
> Git for Windows gathers similar information and logs it as a "data_json"
> event. However, since "data_json" has a variable format, it is difficult
> to parse effectively in some languages; instead, let's pursue a
> dedicated "cmd_ancestry" event to record information about the ancestry
> of the current process and a consistent, parseable way.
> 
> Git for Windows also gathers information about more than one parent. In
> Linux further ancestry info can be gathered with procfs, but it's
> unwieldy to do so. In the interest of later moving Git for Windows
> ancestry logging to the 'cmd_ancestry' event, and in the interest of
> later adding more ancestry to the Linux implementation - or of adding
> this functionality to other platforms which have an easier time walking
> the process tree - let's make 'cmd_ancestry' accept an array of
> parentage.
> 
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> 
> Hi folks, the comments I received in v1 were of two varieties:
> 1) "There are better ways to make this platform-safe", and
> 2) "Your commit message doesn't convince me".
> Since I sent v1, though, I also learned a little more about procfs, and
> about the trace2 structure overall, so there are some pretty significant
> differences from v1:
> 
> - I took a look at Jeff H's advice on using a "data_json" event to log
>    this and decided it would be a little more flexible to add a new event
>    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>    tree stuff into this new event too. Doing it this way is definitely
>    easier to parse for Google's trace analysis system (which for now
>    completely skips "data_json" as it's polymorphic), and also - I think
>    - means that we can add more fields later on if we need to (thread
>    info, different fields than just /proc/n/comm like exec path, argv,
>    whatever).

I could argue both sides of this, so I guess it is fine either way.

In GFW I log a array of argv[0] strings in a generic "data_json" event.
I could also log additional "data_json" events with more structured
data if needed.

On the other hand, you're proposing a "cmd_ancestry" event with a
single array of strings.  You would have to expand the call signature
of the trace2_cmd_ancestry() API to add additional data and inside
tr2_tgt_event.c add additional fields to the JSON being composed.

So both are about equal.

(I'll avoid the temptation to make a snarky comment about fixing
your post processing. :-) :-) :-) )

It really doesn't matter one way or the other.

> - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>    the "command name" - excluding argv, excluding path, etc. It seems

So you're trying to log argv[0] of the process and not the full
command line.  That's what I'm doing.

>    like this is a little more safe about excluding personal information
>    from the traces which take the form of "myscript.sh
>    --password=hunter2", but would still be worrisome for something like
>    "mysupersecretproject.sh". I'm not sure whether that means we still
>    want to guard it with a config flag, though.

You might check whether you get the name of the script or just get
a lot of entries with just "/usr/bin/bash".

There's lots of PII in the data stream to worry about.
The name of the command is just one aspect, but I digress.

> - I also added a lot to the commit message; hopefully it's not too
>    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>    wasn't going to cut it.
> - As for testing, I followed the lead of GfW's parentage info - "this
>    isn't portable so writing tests for it will suck, just scrub it from
>    the tests". Maybe it makes sense to do some more
>    platform-specific-ness in the test suite instead? I wasn't sure.

yeah, that's probably best.  Unless you can tokenize it properly
so that you can predict the results in a HEREDOC in the test source.

For example, you might try to test tracing a command (where a top-level
"git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
output for the child.

> 
> Thanks, all.
>   - Emily
> 
>   Makefile                  |  5 ++++
>   compat/procinfo.c         | 53 +++++++++++++++++++++++++++++++++++++++
>   config.mak.uname          |  1 +
>   git-compat-util.h         |  6 +++++
>   t/t0210/scrub_normal.perl |  6 +++++
>   t/t0211/scrub_perf.perl   |  5 ++++
>   t/t0212/parse_events.perl |  5 +++-
>   trace2.c                  | 13 ++++++++++
>   trace2.h                  | 12 ++++++++-
>   trace2/tr2_tgt.h          |  3 +++
>   trace2/tr2_tgt_event.c    | 21 ++++++++++++++++
>   trace2/tr2_tgt_normal.c   | 19 ++++++++++++++
>   trace2/tr2_tgt_perf.c     | 16 ++++++++++++
>   13 files changed, 163 insertions(+), 2 deletions(-)
>   create mode 100644 compat/procinfo.c
> 
> diff --git a/Makefile b/Makefile
> index 93664d6714..330e4fa011 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
>   	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
>   endif
>   
> +ifdef HAVE_PROCFS_LINUX
> +	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
> +	COMPAT_OBJS += compat/procinfo.o
> +endif
> +
>   ifdef HAVE_NS_GET_EXECUTABLE_PATH
>   	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
>   endif
> diff --git a/compat/procinfo.c b/compat/procinfo.c
> new file mode 100644
> index 0000000000..523600673f
> --- /dev/null
> +++ b/compat/procinfo.c
> @@ -0,0 +1,53 @@
> +#include "cache.h"
> +
> +#include "strbuf.h"
> +#include "trace2.h"
> +
> +char *get_process_name(int pid)
> +{
> +#ifdef HAVE_PROCFS_LINUX
> +	struct strbuf procfs_path = STRBUF_INIT;
> +	struct strbuf out = STRBUF_INIT;
> +	/* try to use procfs if it's present. */
> +	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
> +	if (!strbuf_read_file(&out, procfs_path.buf, 0))
> +	{
> +		/* All done with file reads, clean up early */
> +		strbuf_release(&procfs_path);
> +		return strbuf_detach(&out, NULL);
> +	}
> +#endif
> +
> +	/* NEEDSWORK: add non-procfs implementations here. */
> +	return NULL;
> +}
> +
> +void trace2_collect_process_info(enum trace2_process_info_reason reason)
> +{
> +	if (!trace2_is_enabled())
> +		return;
> +
> +	/* someday we may want to write something extra here, but not today */
> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
> +		return;
> +
> +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
> +	{
> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		char *names[2];
> +		names[0] = get_process_name(getppid());
> +		names[1] = NULL;

You're only logging 1 parent.  That's fine to get started.

I'm logging IIRC 10 parents on GFW.  That might seem overkill,
but there are lots of intermediate parents that hide what is
happening.  For example, a "git push" might spawn "git remote-https"
which spawns "git-remote-https" which spawn "git send-pack" which
spawns "git pack-objects".

And that doesn't include who called push.

And it's not uncommon to see 2 or 3 "bash" entries in the array
because of the bash scripts being run.

> +
> +		if (!names[0])
> +			return;
> +
> +		trace2_cmd_ancestry((const char**)names);
> +
> +		free(names[0]);
> +	}
> +
> +	return;
> +}
> diff --git a/config.mak.uname b/config.mak.uname
> index cb443b4e02..7ad110a1d2 100644
> --- a/config.mak.uname
> +++ b/config.mak.uname
> @@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
>   	FREAD_READS_DIRECTORIES = UnfortunatelyYes
>   	BASIC_CFLAGS += -DHAVE_SYSINFO
>   	PROCFS_EXECUTABLE_PATH = /proc/self/exe
> +	HAVE_PROCFS_LINUX = YesPlease
>   endif
>   ifeq ($(uname_S),GNU/kFreeBSD)
>   	HAVE_ALLOCA_H = YesPlease
> diff --git a/git-compat-util.h b/git-compat-util.h
> index a508dbe5a3..cc7d5d8a2a 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -1382,4 +1382,10 @@ static inline void *container_of_or_null_offset(void *ptr, size_t offset)
>   
>   void sleep_millisec(int millisec);
>   
> +/*
> + * Convert PID to process name (as would show in top/task manager). Returns
> + * NULL if unimplemented - be sure to check for NULL at callsite.
> + */
> +char *get_process_name(int pid);
> +
>   #endif
> diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
> index c65d1a815e..7cc4de392a 100644
> --- a/t/t0210/scrub_normal.perl
> +++ b/t/t0210/scrub_normal.perl
> @@ -42,6 +42,12 @@
>   	# so just omit it for testing purposes.
>   	# print "cmd_path _EXE_\n";
>       }
> +    elsif ($line =~ m/^cmd_ancestry/) {
> +	# 'cmd_ancestry' is not implemented everywhere, so for portability's
> +	# sake, skip it when parsing normal.
> +	#
> +	# print "$line";
> +    }
>       else {
>   	print "$line";
>       }
> diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
> index 351af7844e..d164b750ff 100644
> --- a/t/t0211/scrub_perf.perl
> +++ b/t/t0211/scrub_perf.perl
> @@ -44,6 +44,11 @@
>   	# $tokens[$col_rest] = "_EXE_";
>   	goto SKIP_LINE;
>       }
> +    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
> +	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
> +	# so skip it.
> +	goto SKIP_LINE;
> +    }
>       elsif ($tokens[$col_event] =~ m/child_exit/) {
>   	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
>       }
> diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
> index 6584bb5634..b6408560c0 100644
> --- a/t/t0212/parse_events.perl
> +++ b/t/t0212/parse_events.perl
> @@ -132,7 +132,10 @@
>   	# just omit it for testing purposes.
>   	# $processes->{$sid}->{'path'} = "_EXE_";
>       }
> -
> +    elsif ($event eq 'cmd_ancestry') {
> +	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
> +	# just skip it for testing purposes.
> +    }
>       elsif ($event eq 'cmd_name') {
>   	$processes->{$sid}->{'name'} = $line->{'name'};
>   	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
> diff --git a/trace2.c b/trace2.c
> index 256120c7fd..b9b154ac44 100644
> --- a/trace2.c
> +++ b/trace2.c
> @@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
>   			tgt_j->pfn_command_path_fl(file, line, pathname);
>   }
>   
> +void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	struct tr2_tgt *tgt_j;
> +	int j;
> +
> +	if (!trace2_enabled)
> +		return;
> +
> +	for_each_wanted_builtin (j, tgt_j)
> +		if (tgt_j->pfn_command_ancestry_fl)
> +			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
> +}
> +
>   void trace2_cmd_name_fl(const char *file, int line, const char *name)
>   {
>   	struct tr2_tgt *tgt_j;
> diff --git a/trace2.h b/trace2.h
> index ede18c2e06..23743ac62b 100644
> --- a/trace2.h
> +++ b/trace2.h
> @@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
>   
>   #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
>   
> +/*
> + * Emit an 'ancestry' event with the process name of the current process's
> + * parent process.
> + * This gives post-processors a way to determine what invoked the command and
> + * learn more about usage patterns.
> + */
> +void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
> +
> +#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
> +
>   /*
>    * Emit a 'cmd_name' event with the canonical name of the command.
>    * This gives post-processors a simple field to identify the command
> @@ -492,7 +502,7 @@ enum trace2_process_info_reason {
>   	TRACE2_PROCESS_INFO_EXIT,
>   };
>   
> -#if defined(GIT_WINDOWS_NATIVE)
> +#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
>   void trace2_collect_process_info(enum trace2_process_info_reason reason);
>   #else
>   #define trace2_collect_process_info(reason) \
> diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
> index 7b90469212..1f66fd6573 100644
> --- a/trace2/tr2_tgt.h
> +++ b/trace2/tr2_tgt.h
> @@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
>   
>   typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
>   					    const char *command_path);
> +typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
> +						const char **parent_names);
>   typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
>   					    const char *name,
>   					    const char *hierarchy);
> @@ -108,6 +110,7 @@ struct tr2_tgt {
>   	tr2_tgt_evt_atexit_t                    *pfn_atexit;
>   	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
>   	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
> +	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
>   	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
>   	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
>   	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
> diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
> index 6353e8ad91..578a9a5287 100644
> --- a/trace2/tr2_tgt_event.c
> +++ b/trace2/tr2_tgt_event.c
> @@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	jw_release(&jw);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *event_name = "cmd_ancestry";
> +	const char *parent_name = NULL;
> +	struct json_writer jw = JSON_WRITER_INIT;
> +
> +	jw_object_begin(&jw, 0);
> +	event_fmt_prepare(event_name, file, line, NULL, &jw);
> +	jw_object_inline_begin_array(&jw, "ancestry");
> +
> +	while ((parent_name = *parent_names++))
> +		jw_array_string(&jw, parent_name);

You're building the array with the immediate parent in a[0]
and the grandparent in a[1], and etc.  This is the same as
I did in GFW.

Perhaps state this in the docs somewhere.

> +
> +	jw_end(&jw); /* 'ancestry' array */
> +	jw_end(&jw); /* event object */
> +
> +	tr2_dst_write_line(&tr2dst_event, &jw.json);
> +	jw_release(&jw);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
> index 31b602c171..a5751c8864 100644
> --- a/trace2/tr2_tgt_normal.c
> +++ b/trace2/tr2_tgt_normal.c
> @@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	strbuf_release(&buf_payload);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *parent_name = NULL;
> +	struct strbuf buf_payload = STRBUF_INIT;
> +
> +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
> +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
> +	while ((parent_name = *parent_names++)) {
> +		strbuf_addstr(&buf_payload, parent_name);

Did you want to quote each parent's name?

> +		/* if we'll write another one after this, add a delimiter */
> +		if (parent_names && *parent_names)
> +			strbuf_addstr(&buf_payload, " <- ");
> +	}
> +
> +	normal_io_write_fl(file, line, &buf_payload);
> +	strbuf_release(&buf_payload);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
> index a8018f18cc..af4d65a0a5 100644
> --- a/trace2/tr2_tgt_perf.c
> +++ b/trace2/tr2_tgt_perf.c
> @@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	strbuf_release(&buf_payload);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *event_name = "cmd_ancestry";
> +	struct strbuf buf_payload = STRBUF_INIT;
> +
> +	strbuf_addstr(&buf_payload, "ancestry:[");
> +	/* It's not an argv but the rules are basically the same. */
> +	sq_append_quote_argv_pretty(&buf_payload, parent_names);

This will have whitespace delimiters between the quoted strings
rather than commas.  Just checking if that's what you wanted.

I'm not sure it matters, since this stream is intended for human
parsing.

> +	strbuf_addch(&buf_payload, ']');
> +
> +	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
> +			 &buf_payload);
> +	strbuf_release(&buf_payload);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> 

We should update Documentation/technical/api-trace2.txt too.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21 19:15   ` Jeff Hostetler
@ 2021-05-21 20:05     ` Emily Shaffer
  2021-05-21 20:23       ` Randall S. Becker
                         ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-21 20:05 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: git, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Bagas Sanjaya

On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
> On 5/20/21 5:05 PM, Emily Shaffer wrote:
> > - I took a look at Jeff H's advice on using a "data_json" event to log
> >    this and decided it would be a little more flexible to add a new event
> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
> >    tree stuff into this new event too. Doing it this way is definitely
> >    easier to parse for Google's trace analysis system (which for now
> >    completely skips "data_json" as it's polymorphic), and also - I think
> >    - means that we can add more fields later on if we need to (thread
> >    info, different fields than just /proc/n/comm like exec path, argv,
> >    whatever).
> 
> I could argue both sides of this, so I guess it is fine either way.
> 
> In GFW I log a array of argv[0] strings in a generic "data_json" event.
> I could also log additional "data_json" events with more structured
> data if needed.
> 
> On the other hand, you're proposing a "cmd_ancestry" event with a
> single array of strings.  You would have to expand the call signature
> of the trace2_cmd_ancestry() API to add additional data and inside
> tr2_tgt_event.c add additional fields to the JSON being composed.
> 
> So both are about equal.
> 
> (I'll avoid the temptation to make a snarky comment about fixing
> your post processing. :-) :-) :-) )

;P

(I don't have much to add - this is an accurate summary of what I
thought about, too. Thanks for writing it out.)

> 
> It really doesn't matter one way or the other.
> 
> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
> >    the "command name" - excluding argv, excluding path, etc. It seems
> 
> So you're trying to log argv[0] of the process and not the full
> command line.  That's what I'm doing.

It's close to argv[0], yeah. POSIX docs indicate it might be truncated
in a way that argv[0] hasn't been, but it also doesn't include the
leading path (as far as I've seen). For example, a long-running helper
script I use with mutt, right now (gaffing on line length in email to
help with argv clarity, sorry):

  $ ps aux | grep mutt
  emilysh+ 4119883  0.0  0.0   6892  3600 pts/6    S+   12:44   0:00 /bin/bash /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
  # comm is truncated to 15ch, except apparently in the cases of some
  # kernel worker processes I saw with much longer names?
  $ cat /proc/4119883/comm
  open-vim-in-new
  # exe is a link to the executable, which means bash as this is a
  # script
  $ ls -lha /proc/4119883/exe
  lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
  /proc/4119883/exe -> /usr/bin/bash
  # cmdline has the whole argv, separated on NUL so it runs together in
  # editor
  $ cat /proc/4119883/cmdline
  /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689

Jonathan N pointed out that the process name (the thing in 'comm') can
also be manually manipulated by the process itself, and 'man procfs'
also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
'prctl()', so that tracks. (It doesn't look like we can use prctl() to
find out the names of processes besides the current process, though, so
the procfs stuff is still needed. Dang.)

> 
> >    like this is a little more safe about excluding personal information
> >    from the traces which take the form of "myscript.sh
> >    --password=hunter2", but would still be worrisome for something like
> >    "mysupersecretproject.sh". I'm not sure whether that means we still
> >    want to guard it with a config flag, though.
> 
> You might check whether you get the name of the script or just get
> a lot of entries with just "/usr/bin/bash".

See above :)

> There's lots of PII in the data stream to worry about.
> The name of the command is just one aspect, but I digress.

Yes, that's what we've noticed too, so a process name isn't worrying us
that much more.

> 
> > - I also added a lot to the commit message; hopefully it's not too
> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
> >    wasn't going to cut it.
> > - As for testing, I followed the lead of GfW's parentage info - "this
> >    isn't portable so writing tests for it will suck, just scrub it from
> >    the tests". Maybe it makes sense to do some more
> >    platform-specific-ness in the test suite instead? I wasn't sure.
> 
> yeah, that's probably best.  Unless you can tokenize it properly
> so that you can predict the results in a HEREDOC in the test source.
> 
> For example, you might try to test tracing a command (where a top-level
> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
> output for the child.

Yeah, I had trouble with even deciding when to attempt such a check or
not.
> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
> > +	{
> > +		/*
> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> > +		 * see compat/win32/trace2_win32_process_info.c.
> > +		 */
> > +		char *names[2];
> > +		names[0] = get_process_name(getppid());
> > +		names[1] = NULL;
> 
> You're only logging 1 parent.  That's fine to get started.
> 
> I'm logging IIRC 10 parents on GFW.  That might seem overkill,
> but there are lots of intermediate parents that hide what is
> happening.  For example, a "git push" might spawn "git remote-https"
> which spawns "git-remote-https" which spawn "git send-pack" which
> spawns "git pack-objects".
> 
> And that doesn't include who called push.
> 
> And it's not uncommon to see 2 or 3 "bash" entries in the array
> because of the bash scripts being run.

Agree. But it's expensive - I didn't find a handy library call to find
"parent ID of given process ID", so I think we'd have to manipulate
procfs; and so far I only see parent ID in summary infos like
/proc/n/status or /proc/n/stat, which contain lots of other info too and
would need parsing.

We could reduce the cost a little bit by grabbing the process name from
the status or stat as well, and therefore still only opening one file per
process, but I'd want to check whether the formats are expected to be
stable for those things.

> > +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> > +{
> > +	const char *event_name = "cmd_ancestry";
> > +	const char *parent_name = NULL;
> > +	struct json_writer jw = JSON_WRITER_INIT;
> > +
> > +	jw_object_begin(&jw, 0);
> > +	event_fmt_prepare(event_name, file, line, NULL, &jw);
> > +	jw_object_inline_begin_array(&jw, "ancestry");
> > +
> > +	while ((parent_name = *parent_names++))
> > +		jw_array_string(&jw, parent_name);
> 
> You're building the array with the immediate parent in a[0]
> and the grandparent in a[1], and etc.  This is the same as
> I did in GFW.
> 
> Perhaps state this in the docs somewhere.

Sure, makes sense. I think I neglected any doc work whatsoever in this
patch anyways, whoops :)

> > +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
> > +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
> > +	while ((parent_name = *parent_names++)) {
> > +		strbuf_addstr(&buf_payload, parent_name);
> 
> Did you want to quote each parent's name?

I'd rather not - since they're going into an array anyway, I'd expect
the array delimiters to be enough. Am I being naive? 'normal' looks to
me like it's supposed to be mostly human readable anyways, rather than
parseable?

> > +	strbuf_addstr(&buf_payload, "ancestry:[");
> > +	/* It's not an argv but the rules are basically the same. */
> > +	sq_append_quote_argv_pretty(&buf_payload, parent_names);
> 
> This will have whitespace delimiters between the quoted strings
> rather than commas.  Just checking if that's what you wanted.
> 
> I'm not sure it matters, since this stream is intended for human
> parsing.

Yeah, it seems fine to me as is.

> 
> We should update Documentation/technical/api-trace2.txt too.

Yep, thanks.

I appreciate the review, Jeff.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v2] tr2: log parent process name
  2021-05-21 20:05     ` Emily Shaffer
@ 2021-05-21 20:23       ` Randall S. Becker
  2021-05-22 11:18       ` Jeff Hostetler
  2021-05-24 23:33       ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 45+ messages in thread
From: Randall S. Becker @ 2021-05-21 20:23 UTC (permalink / raw)
  To: 'Emily Shaffer', 'Jeff Hostetler'
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Bagas Sanjaya'

<emilyshaffer@google.com>
On May 21, 2021 4:05 PM, Emily Shaffer wrote:
>On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> >    this and decided it would be a little more flexible to add a new event
>> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> >    tree stuff into this new event too. Doing it this way is definitely
>> >    easier to parse for Google's trace analysis system (which for now
>> >    completely skips "data_json" as it's polymorphic), and also - I think
>> >    - means that we can add more fields later on if we need to (thread
>> >    info, different fields than just /proc/n/comm like exec path, argv,
>> >    whatever).
>>
>> I could argue both sides of this, so I guess it is fine either way.
>>
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>>
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>>
>> So both are about equal.
>>
>> (I'll avoid the temptation to make a snarky comment about fixing your
>> post processing. :-) :-) :-) )
>
>;P
>
>(I don't have much to add - this is an accurate summary of what I thought about, too. Thanks for writing it out.)
>
>>
>> It really doesn't matter one way or the other.
>>
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> >    the "command name" - excluding argv, excluding path, etc. It
>> > seems
>>
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
>
>It's close to argv[0], yeah. POSIX docs indicate it might be truncated in a way that argv[0] hasn't been, but it also doesn't
include the
>leading path (as far as I've seen). For example, a long-running helper script I use with mutt, right now (gaffing on line length in
email to
>help with argv clarity, sorry):
>
>  $ ps aux | grep mutt
>  emilysh+ 4119883  0.0  0.0   6892  3600 pts/6    S+   12:44   0:00 /bin/bash
/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-
>new-split.sh /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>  # comm is truncated to 15ch, except apparently in the cases of some
>  # kernel worker processes I saw with much longer names?
>  $ cat /proc/4119883/comm
>  open-vim-in-new
>  # exe is a link to the executable, which means bash as this is a
>  # script
>  $ ls -lha /proc/4119883/exe
>  lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
>  /proc/4119883/exe -> /usr/bin/bash
>  # cmdline has the whole argv, separated on NUL so it runs together in
>  # editor
>  $ cat /proc/4119883/cmdline
>  /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-
>7433772284891386689
>
>Jonathan N pointed out that the process name (the thing in 'comm') can also be manually manipulated by the process itself, and 'man
>procfs'
>also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in 'prctl()', so that tracks. (It doesn't look like we can use prctl()
to find
>out the names of processes besides the current process, though, so the procfs stuff is still needed. Dang.)
>
>>
>> >    like this is a little more safe about excluding personal information
>> >    from the traces which take the form of "myscript.sh
>> >    --password=hunter2", but would still be worrisome for something like
>> >    "mysupersecretproject.sh". I'm not sure whether that means we still
>> >    want to guard it with a config flag, though.
>>
>> You might check whether you get the name of the script or just get a
>> lot of entries with just "/usr/bin/bash".
>
>See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
>Yes, that's what we've noticed too, so a process name isn't worrying us that much more.
>
>>
>> > - I also added a lot to the commit message; hopefully it's not too
>> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> >    wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> >    isn't portable so writing tests for it will suck, just scrub it from
>> >    the tests". Maybe it makes sense to do some more
>> >    platform-specific-ness in the test suite instead? I wasn't sure.
>>
>> yeah, that's probably best.  Unless you can tokenize it properly so
>> that you can predict the results in a HEREDOC in the test source.
>>
>> For example, you might try to test tracing a command (where a
>> top-level "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and
>> check the output for the child.
>
>Yeah, I had trouble with even deciding when to attempt such a check or not.
>> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > +	{
>> > +		/*
>> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> > +		 * see compat/win32/trace2_win32_process_info.c.
>> > +		 */
>> > +		char *names[2];
>> > +		names[0] = get_process_name(getppid());
>> > +		names[1] = NULL;
>>
>> You're only logging 1 parent.  That's fine to get started.
>>
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill, but
>> there are lots of intermediate parents that hide what is happening.
>> For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>>
>> And that doesn't include who called push.
>>
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
>Agree. But it's expensive - I didn't find a handy library call to find "parent ID of given process ID", so I think we'd have to
manipulate
>procfs; and so far I only see parent ID in summary infos like /proc/n/status or /proc/n/stat, which contain lots of other info too
and would
>need parsing.
>
>We could reduce the cost a little bit by grabbing the process name from the status or stat as well, and therefore still only
opening one file
>per process, but I'd want to check whether the formats are expected to be stable for those things.
>
>> > +static void fn_command_ancestry_fl(const char *file, int line,
>> > +const char **parent_names) {
>> > +	const char *event_name = "cmd_ancestry";
>> > +	const char *parent_name = NULL;
>> > +	struct json_writer jw = JSON_WRITER_INIT;
>> > +
>> > +	jw_object_begin(&jw, 0);
>> > +	event_fmt_prepare(event_name, file, line, NULL, &jw);
>> > +	jw_object_inline_begin_array(&jw, "ancestry");
>> > +
>> > +	while ((parent_name = *parent_names++))
>> > +		jw_array_string(&jw, parent_name);
>>
>> You're building the array with the immediate parent in a[0] and the
>> grandparent in a[1], and etc.  This is the same as I did in GFW.
>>
>> Perhaps state this in the docs somewhere.
>
>Sure, makes sense. I think I neglected any doc work whatsoever in this patch anyways, whoops :)
>
>> > +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
>> > +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
>> > +	while ((parent_name = *parent_names++)) {
>> > +		strbuf_addstr(&buf_payload, parent_name);
>>
>> Did you want to quote each parent's name?
>
>I'd rather not - since they're going into an array anyway, I'd expect the array delimiters to be enough. Am I being naive? 'normal'
looks to
>me like it's supposed to be mostly human readable anyways, rather than parseable?
>
>> > +	strbuf_addstr(&buf_payload, "ancestry:[");
>> > +	/* It's not an argv but the rules are basically the same. */
>> > +	sq_append_quote_argv_pretty(&buf_payload, parent_names);
>>
>> This will have whitespace delimiters between the quoted strings rather
>> than commas.  Just checking if that's what you wanted.
>>
>> I'm not sure it matters, since this stream is intended for human
>> parsing.
>
>Yeah, it seems fine to me as is.
>
>>
>> We should update Documentation/technical/api-trace2.txt too.
>
>Yep, thanks.
>
>I appreciate the review, Jeff.

I checked the performance of my NonStop parent lookup implementation. It's fast enough that no one would notice (8 microseconds on
the oldest slowest machine I could find).

Just an FYI.
-Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21 19:02     ` Emily Shaffer
@ 2021-05-21 23:22       ` Junio C Hamano
  2021-05-24 18:37         ` Emily Shaffer
  0 siblings, 1 reply; 45+ messages in thread
From: Junio C Hamano @ 2021-05-21 23:22 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya

Emily Shaffer <emilyshaffer@google.com> writes:

>> > we will need to discover the name another way. However, the process ID
>> > should be sufficient regardless of platform.
>> 
>> Not a strong objection, but I wonder if seeing random integer(s) is
>> better than not having cmd_ancestry info at all.  The latter better
>> signals that the platform does not yet have the "parent process
>> name" feature, I would think.
>
> Hm, we could...

Please don't.  There is a misreading here.

You mentioned "However, the process ID should be sufficient" and I
read it as "In the worst case we can emit the process ID if we do
not know how to turn it into name", and to that I said "showing
process IDs is not all that useful as they are random integers
without extra info on processes that were running back when the log
entry was taken".  Similarly, my later "OK, we do not show pid as a
placeholder." is "Contrary to what I thought you said earlier, you
do not give raw process IDs and instead honestly say we do not have
that information by omitting the record.  I am happy to see what the
actual patch does".

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21 20:05     ` Emily Shaffer
  2021-05-21 20:23       ` Randall S. Becker
@ 2021-05-22 11:18       ` Jeff Hostetler
  2021-05-24 23:33       ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 45+ messages in thread
From: Jeff Hostetler @ 2021-05-22 11:18 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: git, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Bagas Sanjaya



On 5/21/21 4:05 PM, Emily Shaffer wrote:
> On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>>> - I took a look at Jeff H's advice on using a "data_json" event to log
>>>     this and decided it would be a little more flexible to add a new event
>>>     instead. If we want, it'd be feasible to then shoehorn the GfW parent
>>>     tree stuff into this new event too. Doing it this way is definitely
>>>     easier to parse for Google's trace analysis system (which for now
>>>     completely skips "data_json" as it's polymorphic), and also - I think
>>>     - means that we can add more fields later on if we need to (thread
>>>     info, different fields than just /proc/n/comm like exec path, argv,
>>>     whatever).
>>
>> I could argue both sides of this, so I guess it is fine either way.
>>
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>>
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>>
>> So both are about equal.
>>
>> (I'll avoid the temptation to make a snarky comment about fixing
>> your post processing. :-) :-) :-) )
> 
> ;P
> 
> (I don't have much to add - this is an accurate summary of what I
> thought about, too. Thanks for writing it out.)
> 
>>
>> It really doesn't matter one way or the other.
>>
>>> - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>>>     the "command name" - excluding argv, excluding path, etc. It seems
>>
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
> 
> It's close to argv[0], yeah. POSIX docs indicate it might be truncated
> in a way that argv[0] hasn't been, but it also doesn't include the
> leading path (as far as I've seen). For example, a long-running helper
> script I use with mutt, right now (gaffing on line length in email to
> help with argv clarity, sorry):
> 
>    $ ps aux | grep mutt
>    emilysh+ 4119883  0.0  0.0   6892  3600 pts/6    S+   12:44   0:00 /bin/bash /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>    # comm is truncated to 15ch, except apparently in the cases of some
>    # kernel worker processes I saw with much longer names?
>    $ cat /proc/4119883/comm
>    open-vim-in-new
>    # exe is a link to the executable, which means bash as this is a
>    # script
>    $ ls -lha /proc/4119883/exe
>    lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
>    /proc/4119883/exe -> /usr/bin/bash
>    # cmdline has the whole argv, separated on NUL so it runs together in
>    # editor
>    $ cat /proc/4119883/cmdline
>    /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
> 
> Jonathan N pointed out that the process name (the thing in 'comm') can
> also be manually manipulated by the process itself, and 'man procfs'
> also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
> 'prctl()', so that tracks. (It doesn't look like we can use prctl() to
> find out the names of processes besides the current process, though, so
> the procfs stuff is still needed. Dang.)
> 
>>
>>>     like this is a little more safe about excluding personal information
>>>     from the traces which take the form of "myscript.sh
>>>     --password=hunter2", but would still be worrisome for something like
>>>     "mysupersecretproject.sh". I'm not sure whether that means we still
>>>     want to guard it with a config flag, though.
>>
>> You might check whether you get the name of the script or just get
>> a lot of entries with just "/usr/bin/bash".
> 
> See above :)
> 
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
> 
> Yes, that's what we've noticed too, so a process name isn't worrying us
> that much more.
> 
>>
>>> - I also added a lot to the commit message; hopefully it's not too
>>>     rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>>>     wasn't going to cut it.
>>> - As for testing, I followed the lead of GfW's parentage info - "this
>>>     isn't portable so writing tests for it will suck, just scrub it from
>>>     the tests". Maybe it makes sense to do some more
>>>     platform-specific-ness in the test suite instead? I wasn't sure.
>>
>> yeah, that's probably best.  Unless you can tokenize it properly
>> so that you can predict the results in a HEREDOC in the test source.
>>
>> For example, you might try to test tracing a command (where a top-level
>> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
>> output for the child.
> 
> Yeah, I had trouble with even deciding when to attempt such a check or
> not.
>>> +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>>> +	{
>>> +		/*
>>> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>>> +		 * see compat/win32/trace2_win32_process_info.c.
>>> +		 */
>>> +		char *names[2];
>>> +		names[0] = get_process_name(getppid());
>>> +		names[1] = NULL;
>>
>> You're only logging 1 parent.  That's fine to get started.
>>
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill,
>> but there are lots of intermediate parents that hide what is
>> happening.  For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>>
>> And that doesn't include who called push.
>>
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
> 
> Agree. But it's expensive - I didn't find a handy library call to find
> "parent ID of given process ID", so I think we'd have to manipulate
> procfs; and so far I only see parent ID in summary infos like
> /proc/n/status or /proc/n/stat, which contain lots of other info too and
> would need parsing.
> 
> We could reduce the cost a little bit by grabbing the process name from
> the status or stat as well, and therefore still only opening one file per
> process, but I'd want to check whether the formats are expected to be
> stable for those things.

Yeah, don't force it collect a bunch of data for n parents that you
don't need.  I guess my point was that there were lots of time when
there were too many "unhelpful" parents in the ancestry chain.  So
maybe confirm that you can get what you need from just 1 parent.
I needed more than 1 to tell that a particular command was interactive
vs launched by VS or VSCode or etc.


> 
>>> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
>>> +{
>>> +	const char *event_name = "cmd_ancestry";
>>> +	const char *parent_name = NULL;
>>> +	struct json_writer jw = JSON_WRITER_INIT;
>>> +
>>> +	jw_object_begin(&jw, 0);
>>> +	event_fmt_prepare(event_name, file, line, NULL, &jw);
>>> +	jw_object_inline_begin_array(&jw, "ancestry");
>>> +
>>> +	while ((parent_name = *parent_names++))
>>> +		jw_array_string(&jw, parent_name);
>>
>> You're building the array with the immediate parent in a[0]
>> and the grandparent in a[1], and etc.  This is the same as
>> I did in GFW.
>>
>> Perhaps state this in the docs somewhere.
> 
> Sure, makes sense. I think I neglected any doc work whatsoever in this
> patch anyways, whoops :)
> 
>>> +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
>>> +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
>>> +	while ((parent_name = *parent_names++)) {
>>> +		strbuf_addstr(&buf_payload, parent_name);
>>
>> Did you want to quote each parent's name?
> 
> I'd rather not - since they're going into an array anyway, I'd expect
> the array delimiters to be enough. Am I being naive? 'normal' looks to
> me like it's supposed to be mostly human readable anyways, rather than
> parseable?
> 
>>> +	strbuf_addstr(&buf_payload, "ancestry:[");
>>> +	/* It's not an argv but the rules are basically the same. */
>>> +	sq_append_quote_argv_pretty(&buf_payload, parent_names);
>>
>> This will have whitespace delimiters between the quoted strings
>> rather than commas.  Just checking if that's what you wanted.
>>
>> I'm not sure it matters, since this stream is intended for human
>> parsing.
> 
> Yeah, it seems fine to me as is.
> 
>>
>> We should update Documentation/technical/api-trace2.txt too.
> 
> Yep, thanks.
> 
> I appreciate the review, Jeff.
> 
>   - Emily
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21 23:22       ` Junio C Hamano
@ 2021-05-24 18:37         ` Emily Shaffer
  0 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-24 18:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya

On Sat, May 22, 2021 at 08:22:46AM +0900, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@google.com> writes:
> 
> >> > we will need to discover the name another way. However, the process ID
> >> > should be sufficient regardless of platform.
> >> 
> >> Not a strong objection, but I wonder if seeing random integer(s) is
> >> better than not having cmd_ancestry info at all.  The latter better
> >> signals that the platform does not yet have the "parent process
> >> name" feature, I would think.
> >
> > Hm, we could...
> 
> Please don't.  There is a misreading here.
> 
> You mentioned "However, the process ID should be sufficient" and I
> read it as "In the worst case we can emit the process ID if we do
> not know how to turn it into name", and to that I said "showing
> process IDs is not all that useful as they are random integers
> without extra info on processes that were running back when the log
> entry was taken".  Similarly, my later "OK, we do not show pid as a
> placeholder." is "Contrary to what I thought you said earlier, you
> do not give raw process IDs and instead honestly say we do not have
> that information by omitting the record.  I am happy to see what the
> actual patch does".

Ah, thanks for clarifying. I'll see if I can make the "PID should be
enough" statement less confusing, instead - what I meant was "on all
systems, the result of getppid() should be sufficient to look up the
process name, so this code is probably shareable", and Randall has
pointed out elsewhere to me that that's false.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3] tr2: log parent process name
  2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
                     ` (2 preceding siblings ...)
  2021-05-21 19:15   ` Jeff Hostetler
@ 2021-05-24 20:10   ` Emily Shaffer
  2021-05-24 20:49     ` Emily Shaffer
                       ` (2 more replies)
  3 siblings, 3 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-24 20:10 UTC (permalink / raw)
  To: git
  Cc: Emily Shaffer, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff Hostetler, Bagas Sanjaya, Randall S. Becker

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
we can influence the source code and set the GIT_TRACE2_PARENT_SID
environment variable from the caller process. In 'repo''s case, that
parent SID is manipulated to include the string "repo", which means we
can positively identify when Git was invoked by 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient to look up the process name on most platforms, so
that code may be shareable.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process and a consistent, parseable way.

Git for Windows also gathers information about more than one parent. In
Linux further ancestry info can be gathered with procfs, but it's
unwieldy to do so. In the interest of later moving Git for Windows
ancestry logging to the 'cmd_ancestry' event, and in the interest of
later adding more ancestry to the Linux implementation - or of adding
this functionality to other platforms which have an easier time walking
the process tree - let's make 'cmd_ancestry' accept an array of
parentage.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Since v2, only some nits fixed - a few brace placements and some update
to the commit message language. Thanks Jeff, Junio, and Randall for
looking through the patch more thoroughly.

This iteration also has passing CI:
https://github.com/nasamuffin/git/actions/runs/872341549

I personally think it's ready to go; Jeff H mentioned adding more
parents but later said there's value in confirming we can get the info
we want from just one parent, first, which matches my intent too. I see
room for additional work, but it can happen later - adding more parents
to the Linux impl if we want, and logging the Windows impl to the same
cmd_ancestry event if we want.

 - Emily



 Makefile                  |  5 ++++
 compat/procinfo.c         | 51 +++++++++++++++++++++++++++++++++++++++
 config.mak.uname          |  1 +
 git-compat-util.h         |  6 +++++
 t/t0210/scrub_normal.perl |  6 +++++
 t/t0211/scrub_perf.perl   |  5 ++++
 t/t0212/parse_events.perl |  5 +++-
 trace2.c                  | 13 ++++++++++
 trace2.h                  | 12 ++++++++-
 trace2/tr2_tgt.h          |  3 +++
 trace2/tr2_tgt_event.c    | 21 ++++++++++++++++
 trace2/tr2_tgt_normal.c   | 19 +++++++++++++++
 trace2/tr2_tgt_perf.c     | 16 ++++++++++++
 13 files changed, 161 insertions(+), 2 deletions(-)
 create mode 100644 compat/procinfo.c

diff --git a/Makefile b/Makefile
index 93664d6714..330e4fa011 100644
--- a/Makefile
+++ b/Makefile
@@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
 endif
 
+ifdef HAVE_PROCFS_LINUX
+	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
+	COMPAT_OBJS += compat/procinfo.o
+endif
+
 ifdef HAVE_NS_GET_EXECUTABLE_PATH
 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
 endif
diff --git a/compat/procinfo.c b/compat/procinfo.c
new file mode 100644
index 0000000000..0e92fb8b7c
--- /dev/null
+++ b/compat/procinfo.c
@@ -0,0 +1,51 @@
+#include "cache.h"
+
+#include "strbuf.h"
+#include "trace2.h"
+
+char *get_process_name(int pid)
+{
+#ifdef HAVE_PROCFS_LINUX
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf out = STRBUF_INIT;
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
+	if (!strbuf_read_file(&out, procfs_path.buf, 0)) {
+		/* All done with file reads, clean up early */
+		strbuf_release(&procfs_path);
+		return strbuf_detach(&out, NULL);
+	}
+#endif
+
+	/* NEEDSWORK: add non-procfs implementations here. */
+	return NULL;
+}
+
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+	if (!trace2_is_enabled())
+		return;
+
+	/* someday we may want to write something extra here, but not today */
+	if (reason == TRACE2_PROCESS_INFO_EXIT)
+		return;
+
+	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
+		/*
+		 * NEEDSWORK: we could do the entire ptree in an array instead,
+		 * see compat/win32/trace2_win32_process_info.c.
+		 */
+		char *names[2];
+		names[0] = get_process_name(getppid());
+		names[1] = NULL;
+
+		if (!names[0])
+			return;
+
+		trace2_cmd_ancestry((const char**)names);
+
+		free(names[0]);
+	}
+
+	return;
+}
diff --git a/config.mak.uname b/config.mak.uname
index cb443b4e02..7ad110a1d2 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	BASIC_CFLAGS += -DHAVE_SYSINFO
 	PROCFS_EXECUTABLE_PATH = /proc/self/exe
+	HAVE_PROCFS_LINUX = YesPlease
 endif
 ifeq ($(uname_S),GNU/kFreeBSD)
 	HAVE_ALLOCA_H = YesPlease
diff --git a/git-compat-util.h b/git-compat-util.h
index a508dbe5a3..cc7d5d8a2a 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1382,4 +1382,10 @@ static inline void *container_of_or_null_offset(void *ptr, size_t offset)
 
 void sleep_millisec(int millisec);
 
+/*
+ * Convert PID to process name (as would show in top/task manager). Returns
+ * NULL if unimplemented - be sure to check for NULL at callsite.
+ */
+char *get_process_name(int pid);
+
 #endif
diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
index c65d1a815e..7cc4de392a 100644
--- a/t/t0210/scrub_normal.perl
+++ b/t/t0210/scrub_normal.perl
@@ -42,6 +42,12 @@
 	# so just omit it for testing purposes.
 	# print "cmd_path _EXE_\n";
     }
+    elsif ($line =~ m/^cmd_ancestry/) {
+	# 'cmd_ancestry' is not implemented everywhere, so for portability's
+	# sake, skip it when parsing normal.
+	#
+	# print "$line";
+    }
     else {
 	print "$line";
     }
diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
index 351af7844e..d164b750ff 100644
--- a/t/t0211/scrub_perf.perl
+++ b/t/t0211/scrub_perf.perl
@@ -44,6 +44,11 @@
 	# $tokens[$col_rest] = "_EXE_";
 	goto SKIP_LINE;
     }
+    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
+	# so skip it.
+	goto SKIP_LINE;
+    }
     elsif ($tokens[$col_event] =~ m/child_exit/) {
 	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
     }
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..b6408560c0 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -132,7 +132,10 @@
 	# just omit it for testing purposes.
 	# $processes->{$sid}->{'path'} = "_EXE_";
     }
-    
+    elsif ($event eq 'cmd_ancestry') {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
+	# just skip it for testing purposes.
+    }
     elsif ($event eq 'cmd_name') {
 	$processes->{$sid}->{'name'} = $line->{'name'};
 	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
diff --git a/trace2.c b/trace2.c
index 256120c7fd..b9b154ac44 100644
--- a/trace2.c
+++ b/trace2.c
@@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
 			tgt_j->pfn_command_path_fl(file, line, pathname);
 }
 
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	struct tr2_tgt *tgt_j;
+	int j;
+
+	if (!trace2_enabled)
+		return;
+
+	for_each_wanted_builtin (j, tgt_j)
+		if (tgt_j->pfn_command_ancestry_fl)
+			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
+}
+
 void trace2_cmd_name_fl(const char *file, int line, const char *name)
 {
 	struct tr2_tgt *tgt_j;
diff --git a/trace2.h b/trace2.h
index ede18c2e06..23743ac62b 100644
--- a/trace2.h
+++ b/trace2.h
@@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
 
 #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
 
+/*
+ * Emit an 'ancestry' event with the process name of the current process's
+ * parent process.
+ * This gives post-processors a way to determine what invoked the command and
+ * learn more about usage patterns.
+ */
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
+
+#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
+
 /*
  * Emit a 'cmd_name' event with the canonical name of the command.
  * This gives post-processors a simple field to identify the command
@@ -492,7 +502,7 @@ enum trace2_process_info_reason {
 	TRACE2_PROCESS_INFO_EXIT,
 };
 
-#if defined(GIT_WINDOWS_NATIVE)
+#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
 void trace2_collect_process_info(enum trace2_process_info_reason reason);
 #else
 #define trace2_collect_process_info(reason) \
diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
index 7b90469212..1f66fd6573 100644
--- a/trace2/tr2_tgt.h
+++ b/trace2/tr2_tgt.h
@@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
 
 typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
 					    const char *command_path);
+typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
+						const char **parent_names);
 typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
 					    const char *name,
 					    const char *hierarchy);
@@ -108,6 +110,7 @@ struct tr2_tgt {
 	tr2_tgt_evt_atexit_t                    *pfn_atexit;
 	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
 	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
+	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
 	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
 	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
 	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..578a9a5287 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	jw_release(&jw);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	const char *parent_name = NULL;
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	event_fmt_prepare(event_name, file, line, NULL, &jw);
+	jw_object_inline_begin_array(&jw, "ancestry");
+
+	while ((parent_name = *parent_names++))
+		jw_array_string(&jw, parent_name);
+
+	jw_end(&jw); /* 'ancestry' array */
+	jw_end(&jw); /* event object */
+
+	tr2_dst_write_line(&tr2dst_event, &jw.json);
+	jw_release(&jw);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 31b602c171..a5751c8864 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *parent_name = NULL;
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	/* cmd_ancestry parent <- grandparent <- great-grandparent */
+	strbuf_addstr(&buf_payload, "cmd_ancestry ");
+	while ((parent_name = *parent_names++)) {
+		strbuf_addstr(&buf_payload, parent_name);
+		/* if we'll write another one after this, add a delimiter */
+		if (parent_names && *parent_names)
+			strbuf_addstr(&buf_payload, " <- ");
+	}
+
+	normal_io_write_fl(file, line, &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index a8018f18cc..af4d65a0a5 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	strbuf_addstr(&buf_payload, "ancestry:[");
+	/* It's not an argv but the rules are basically the same. */
+	sq_append_quote_argv_pretty(&buf_payload, parent_names);
+	strbuf_addch(&buf_payload, ']');
+
+	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
+			 &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] tr2: log parent process name
  2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
@ 2021-05-24 20:49     ` Emily Shaffer
  2021-05-25  3:54     ` Junio C Hamano
  2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
  2 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-05-24 20:49 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff Hostetler, Bagas Sanjaya, Randall S. Becker

On Mon, May 24, 2021 at 01:10:07PM -0700, Emily Shaffer wrote:
> 
> Git for Windows also gathers information about more than one parent. In
Oh, bah, I guess I missed changing this to "more than one generation of
parent". It doesn't seem worth a reroll, but I'll make the change
locally now so I don't miss it with any v3->v4 nits that may come in.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2] tr2: log parent process name
  2021-05-21 20:05     ` Emily Shaffer
  2021-05-21 20:23       ` Randall S. Becker
  2021-05-22 11:18       ` Jeff Hostetler
@ 2021-05-24 23:33       ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-24 23:33 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Jeff Hostetler, git, Junio C Hamano, Bagas Sanjaya


On Fri, May 21 2021, Emily Shaffer wrote:

> On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> >    this and decided it would be a little more flexible to add a new event
>> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> >    tree stuff into this new event too. Doing it this way is definitely
>> >    easier to parse for Google's trace analysis system (which for now
>> >    completely skips "data_json" as it's polymorphic), and also - I think
>> >    - means that we can add more fields later on if we need to (thread
>> >    info, different fields than just /proc/n/comm like exec path, argv,
>> >    whatever).
>> 
>> I could argue both sides of this, so I guess it is fine either way.
>> 
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>> 
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>> 
>> So both are about equal.
>> 
>> (I'll avoid the temptation to make a snarky comment about fixing
>> your post processing. :-) :-) :-) )
>
> ;P
>
> (I don't have much to add - this is an accurate summary of what I
> thought about, too. Thanks for writing it out.)
>
>> 
>> It really doesn't matter one way or the other.
>> 
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> >    the "command name" - excluding argv, excluding path, etc. It seems
>> 
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
>
> It's close to argv[0], yeah. POSIX docs indicate it might be truncated
> in a way that argv[0] hasn't been, but it also doesn't include the
> leading path (as far as I've seen). For example, a long-running helper
> script I use with mutt, right now (gaffing on line length in email to
> help with argv clarity, sorry):
>
>   $ ps aux | grep mutt
>   emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash
> /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh
> /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>   # comm is truncated to 15ch, except apparently in the cases of some
>   # kernel worker processes I saw with much longer names?
>   $ cat /proc/4119883/comm
>   open-vim-in-new
>   # exe is a link to the executable, which means bash as this is a
>   # script
>   $ ls -lha /proc/4119883/exe
>   lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
>   /proc/4119883/exe -> /usr/bin/bash
>   # cmdline has the whole argv, separated on NUL so it runs together in
>   # editor
>   $ cat /proc/4119883/cmdline
>   /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>
> Jonathan N pointed out that the process name (the thing in 'comm') can
> also be manually manipulated by the process itself, and 'man procfs'
> also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
> 'prctl()', so that tracks. (It doesn't look like we can use prctl() to
> find out the names of processes besides the current process, though, so
> the procfs stuff is still needed. Dang.)
>
>> 
>> >    like this is a little more safe about excluding personal information
>> >    from the traces which take the form of "myscript.sh
>> >    --password=hunter2", but would still be worrisome for something like
>> >    "mysupersecretproject.sh". I'm not sure whether that means we still
>> >    want to guard it with a config flag, though.
>> 
>> You might check whether you get the name of the script or just get
>> a lot of entries with just "/usr/bin/bash".
>
> See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
> Yes, that's what we've noticed too, so a process name isn't worrying us
> that much more.
>
>> 
>> > - I also added a lot to the commit message; hopefully it's not too
>> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> >    wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> >    isn't portable so writing tests for it will suck, just scrub it from
>> >    the tests". Maybe it makes sense to do some more
>> >    platform-specific-ness in the test suite instead? I wasn't sure.
>> 
>> yeah, that's probably best.  Unless you can tokenize it properly
>> so that you can predict the results in a HEREDOC in the test source.
>> 
>> For example, you might try to test tracing a command (where a top-level
>> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
>> output for the child.
>
> Yeah, I had trouble with even deciding when to attempt such a check or
> not.
>> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > +	{
>> > +		/*
>> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> > +		 * see compat/win32/trace2_win32_process_info.c.
>> > +		 */
>> > +		char *names[2];
>> > +		names[0] = get_process_name(getppid());
>> > +		names[1] = NULL;
>> 
>> You're only logging 1 parent.  That's fine to get started.
>> 
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill,
>> but there are lots of intermediate parents that hide what is
>> happening.  For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>> 
>> And that doesn't include who called push.
>> 
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
> Agree. But it's expensive - I didn't find a handy library call to find
> "parent ID of given process ID", so I think we'd have to manipulate
> procfs; and so far I only see parent ID in summary infos like
> /proc/n/status or /proc/n/stat, which contain lots of other info too and
> would need parsing.

It sounds a bit like you're fumbling your way towards (re?)discovering:

    pstree -s <pid>

You can look at its implementation (or strace it) to see what it does,
and yes, on Linux there's no handy C library for this, iterating over
procfs is the library.

Aside from the privacy, PII, usefulness of this data etc. discussions in
this & related threads I don't think that per-se should be an issue on a
modern Linux system. After all we'd just need to do it once on
startup. For any sub-process we spawn we'd carry it forward after the
initial /usr/bin/git invocation.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] tr2: log parent process name
  2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
  2021-05-24 20:49     ` Emily Shaffer
@ 2021-05-25  3:54     ` Junio C Hamano
  2021-05-25 13:33       ` Randall S. Becker
  2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
  2 siblings, 1 reply; 45+ messages in thread
From: Junio C Hamano @ 2021-05-25  3:54 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya, Randall S. Becker

Emily Shaffer <emilyshaffer@google.com> writes:

> diff --git a/compat/procinfo.c b/compat/procinfo.c
> new file mode 100644
> index 0000000000..0e92fb8b7c
> --- /dev/null
> +++ b/compat/procinfo.c
> @@ -0,0 +1,51 @@
> +#include "cache.h"
> +
> +#include "strbuf.h"
> +#include "trace2.h"
> +
> +char *get_process_name(int pid)
> +{
> +#ifdef HAVE_PROCFS_LINUX
> +	struct strbuf procfs_path = STRBUF_INIT;
> +	struct strbuf out = STRBUF_INIT;
> +	/* try to use procfs if it's present. */
> +	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
> +	if (!strbuf_read_file(&out, procfs_path.buf, 0)) {
> +		/* All done with file reads, clean up early */
> +		strbuf_release(&procfs_path);
> +		return strbuf_detach(&out, NULL);
> +	}
> +#endif
> +
> +	/* NEEDSWORK: add non-procfs implementations here. */
> +	return NULL;
> +}

Is the reason why this takes "int" and not "pid_t" because we may
port to non-POSIX platforms that do not have pid_t defined?

    ... goes and greps ...

Nah, we use pid_t everywhere (including compat/mingw.c); unless
there is a reason not to, let's use that type.

> +void trace2_collect_process_info(enum trace2_process_info_reason reason)
> +{
> +	if (!trace2_is_enabled())
> +		return;
> +
> +	/* someday we may want to write something extra here, but not today */
> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
> +		return;
> +
> +	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		char *names[2];
> +		names[0] = get_process_name(getppid());
> +		names[1] = NULL;

Makes me wonder if get_process_name() is an appropriate
abstraction; specifically, something like

		const char **names = get_ancestry_names();
                int cnt;
		if (names)
			trace2_cmd_ancestry(names);
		for (cnt = 0; names[cnt]; cnt++)
                	free((char *)names[cnt]);
		free(names);

would allow platforms to decide how many levels is easy for them to
grab for reporting, for example (and they do not even have to have
to assume that getting process IDs to feed get_process_name() one by
one is the easiest way to show ancestry).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v3] tr2: log parent process name
  2021-05-25  3:54     ` Junio C Hamano
@ 2021-05-25 13:33       ` Randall S. Becker
  0 siblings, 0 replies; 45+ messages in thread
From: Randall S. Becker @ 2021-05-25 13:33 UTC (permalink / raw)
  To: 'Junio C Hamano', 'Emily Shaffer'
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Jeff Hostetler', 'Bagas Sanjaya'

On May 24, 2021 11:54 PM, Junio C Hamano wrote:
>Emily Shaffer <emilyshaffer@google.com> writes:
>
>> diff --git a/compat/procinfo.c b/compat/procinfo.c new file mode
>> 100644 index 0000000000..0e92fb8b7c
>> --- /dev/null
>> +++ b/compat/procinfo.c
>> @@ -0,0 +1,51 @@
>> +#include "cache.h"
>> +
>> +#include "strbuf.h"
>> +#include "trace2.h"
>> +
>> +char *get_process_name(int pid)
>> +{
>> +#ifdef HAVE_PROCFS_LINUX
>> +	struct strbuf procfs_path = STRBUF_INIT;
>> +	struct strbuf out = STRBUF_INIT;
>> +	/* try to use procfs if it's present. */
>> +	strbuf_addf(&procfs_path, "/proc/%d/comm", pid);
>> +	if (!strbuf_read_file(&out, procfs_path.buf, 0)) {
>> +		/* All done with file reads, clean up early */
>> +		strbuf_release(&procfs_path);
>> +		return strbuf_detach(&out, NULL);
>> +	}
>> +#endif
>> +
>> +	/* NEEDSWORK: add non-procfs implementations here. */
>> +	return NULL;
>> +}
>
>Is the reason why this takes "int" and not "pid_t" because we may port to non-POSIX platforms that do not have pid_t defined?
>
>    ... goes and greps ...
>
>Nah, we use pid_t everywhere (including compat/mingw.c); unless there is a reason not to, let's use that type.
>
>> +void trace2_collect_process_info(enum trace2_process_info_reason
>> +reason) {
>> +	if (!trace2_is_enabled())
>> +		return;
>> +
>> +	/* someday we may want to write something extra here, but not today */
>> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
>> +		return;
>> +
>> +	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
>> +		/*
>> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> +		 * see compat/win32/trace2_win32_process_info.c.
>> +		 */
>> +		char *names[2];
>> +		names[0] = get_process_name(getppid());
>> +		names[1] = NULL;
>
>Makes me wonder if get_process_name() is an appropriate abstraction; specifically, something like
>
>		const char **names = get_ancestry_names();
>                int cnt;
>		if (names)
>			trace2_cmd_ancestry(names);
>		for (cnt = 0; names[cnt]; cnt++)
>                	free((char *)names[cnt]);
>		free(names);
>
>would allow platforms to decide how many levels is easy for them to grab for reporting, for example (and they do not even have to have
>to assume that getting process IDs to feed get_process_name() one by one is the easiest way to show ancestry).

Passing pid_t == 1 to get_process_names is non-informative. We can't tell what is really intended inside get_process_names(). Knowing that we are getting the ancestor, however, gives us an the important glue that the parent is wanted so we can interpret 1 as a non-ancestor. NonStop has pid_t defined regardless of whether it is used or relevant, so a higher-level of abstraction makes sense. The git process always has a valid pid_t, but the ancestor may not, but we do not know this at compile time. The platform code previously shared appears to be the correct technique for us. Going with get_ancestry_names() is interesting, but I think a count (how many ancestors do you want) is important (1 - just immediate, 2 - parent, grandparent, -1 - everyone). The ancestry tree can be very large (although each lookup is only about 8us). During a trace, I really would not want to care about more than 2 ancestors, typically, not the likely 8 or 10 that I can see happening in my situation (too much noise). The number of ancestors to trace probably needs to be in .git.config.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4] tr2: log parent process name
  2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
  2021-05-24 20:49     ` Emily Shaffer
  2021-05-25  3:54     ` Junio C Hamano
@ 2021-06-08 18:58     ` Emily Shaffer
  2021-06-08 20:56       ` Emily Shaffer
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
  2 siblings, 2 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-06-08 18:58 UTC (permalink / raw)
  To: git
  Cc: Emily Shaffer, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff Hostetler, Bagas Sanjaya, Randall S. Becker

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
we can influence the source code and set the GIT_TRACE2_PARENT_SID
environment variable from the caller process. In 'repo''s case, that
parent SID is manipulated to include the string "repo", which means we
can positively identify when Git was invoked by 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient to look up the process name on most platforms, so
that code may be shareable.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process and a consistent, parseable way.

Git for Windows also gathers information about more than one generation
of parent. In Linux further ancestry info can be gathered with procfs,
but it's unwieldy to do so. In the interest of later moving Git for
Windows ancestry logging to the 'cmd_ancestry' event, and in the
interest of later adding more ancestry to the Linux implementation - or
of adding this functionality to other platforms which have an easier
time walking the process tree - let's make 'cmd_ancestry' accept an
array of parentage.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Notes:
    Since v3:
    
    Junio and Randall suggested ditching "get_process_name(ppid)" as the API to
    implement on various platforms, and I liked the suggestion a lot. So instead,
    I added "get_ancestry_names(strvec *names)".
    
     - Using a strvec instead of a char** makes cleanup easier, since strvec takes
       care of counting the number of strings in the array for us. Otherwise we'd
       need to include size as a return or out-param in get_ancestry_names(), and
       that's what the util class is for, right? :)
     - I made get_ancestry_names() static instead of putting it into
       git-compat-util.h. I think I had put get_process_name() into that header to
       facilitate non-procfs implementations, but compat/procinfo.c doesn't seem to
       me to indicate "procfs only", and if we do need to implement
       get_process_name() somewhere else, it'll be pretty easy to move it.
     - I added a description of "cmd_ancestry" to
       Documentation/technical/api-trace2.txt. I didn't see any user-facing docs to
       update (for example, "git grep cmd_path" produces only that one doc file).
    
    Thanks, all.
    
     - Emily

 Documentation/technical/api-trace2.txt | 14 ++++++
 Makefile                               |  5 +++
 compat/procinfo.c                      | 61 ++++++++++++++++++++++++++
 config.mak.uname                       |  1 +
 t/t0210/scrub_normal.perl              |  6 +++
 t/t0211/scrub_perf.perl                |  5 +++
 t/t0212/parse_events.perl              |  5 ++-
 trace2.c                               | 13 ++++++
 trace2.h                               | 12 ++++-
 trace2/tr2_tgt.h                       |  3 ++
 trace2/tr2_tgt_event.c                 | 21 +++++++++
 trace2/tr2_tgt_normal.c                | 19 ++++++++
 trace2/tr2_tgt_perf.c                  | 16 +++++++
 13 files changed, 179 insertions(+), 2 deletions(-)
 create mode 100644 compat/procinfo.c

diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
index 3f52f981a2..8a0b360a0e 100644
--- a/Documentation/technical/api-trace2.txt
+++ b/Documentation/technical/api-trace2.txt
@@ -493,6 +493,20 @@ about specific error arguments.
 }
 ------------
 
+`"cmd_ancestry"`::
+	This event contains the text command name for the parent (and earlier
+	generations of parents) of the current process, in an array ordered from
+	nearest parent to furthest great-grandparent. It may not be implemented
+	on all platforms.
++
+------------
+{
+	"event":"cmd_ancestry",
+	...
+	"ancestry":["bash","tmux: server","systemd"]
+}
+------------
+
 `"cmd_name"`::
 	This event contains the command name for this git process
 	and the hierarchy of commands from parent git processes.
diff --git a/Makefile b/Makefile
index 93664d6714..330e4fa011 100644
--- a/Makefile
+++ b/Makefile
@@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
 endif
 
+ifdef HAVE_PROCFS_LINUX
+	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
+	COMPAT_OBJS += compat/procinfo.o
+endif
+
 ifdef HAVE_NS_GET_EXECUTABLE_PATH
 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
 endif
diff --git a/compat/procinfo.c b/compat/procinfo.c
new file mode 100644
index 0000000000..9417c3aa54
--- /dev/null
+++ b/compat/procinfo.c
@@ -0,0 +1,61 @@
+#include "cache.h"
+
+#include "strbuf.h"
+#include "strvec.h"
+#include "trace2.h"
+
+static void get_ancestry_names(struct strvec *names)
+{
+#ifdef HAVE_PROCFS_LINUX
+	/*
+	 * NEEDSWORK: We could gather the entire pstree into an array to match
+	 * functionality with compat/win32/trace2_win32_process_info.c.
+	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
+	 * gather the immediate parent name which is readily accessible from
+	 * /proc/$(getppid())/comm.
+	 */
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf name = STRBUF_INIT;
+
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
+	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
+		strbuf_release(&procfs_path);
+		strvec_push(names, strbuf_detach(&name, NULL));
+	}
+
+	return;
+#endif
+	/* NEEDSWORK: add non-procfs-linux implementations here */
+}
+
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+	if (!trace2_is_enabled())
+		return;
+
+	/* someday we may want to write something extra here, but not today */
+	if (reason == TRACE2_PROCESS_INFO_EXIT)
+		return;
+
+	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
+		/*
+		 * NEEDSWORK: we could do the entire ptree in an array instead,
+		 * see compat/win32/trace2_win32_process_info.c.
+		 */
+		struct strvec names = STRVEC_INIT;
+
+		get_ancestry_names(&names);
+
+		if (names.nr == 0) {
+			strvec_clear(&names);
+			return;
+		}
+
+		trace2_cmd_ancestry(names.v);
+
+		strvec_clear(&names);
+	}
+
+	return;
+}
diff --git a/config.mak.uname b/config.mak.uname
index cb443b4e02..7ad110a1d2 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	BASIC_CFLAGS += -DHAVE_SYSINFO
 	PROCFS_EXECUTABLE_PATH = /proc/self/exe
+	HAVE_PROCFS_LINUX = YesPlease
 endif
 ifeq ($(uname_S),GNU/kFreeBSD)
 	HAVE_ALLOCA_H = YesPlease
diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
index c65d1a815e..7cc4de392a 100644
--- a/t/t0210/scrub_normal.perl
+++ b/t/t0210/scrub_normal.perl
@@ -42,6 +42,12 @@
 	# so just omit it for testing purposes.
 	# print "cmd_path _EXE_\n";
     }
+    elsif ($line =~ m/^cmd_ancestry/) {
+	# 'cmd_ancestry' is not implemented everywhere, so for portability's
+	# sake, skip it when parsing normal.
+	#
+	# print "$line";
+    }
     else {
 	print "$line";
     }
diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
index 351af7844e..d164b750ff 100644
--- a/t/t0211/scrub_perf.perl
+++ b/t/t0211/scrub_perf.perl
@@ -44,6 +44,11 @@
 	# $tokens[$col_rest] = "_EXE_";
 	goto SKIP_LINE;
     }
+    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
+	# so skip it.
+	goto SKIP_LINE;
+    }
     elsif ($tokens[$col_event] =~ m/child_exit/) {
 	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
     }
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..b6408560c0 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -132,7 +132,10 @@
 	# just omit it for testing purposes.
 	# $processes->{$sid}->{'path'} = "_EXE_";
     }
-    
+    elsif ($event eq 'cmd_ancestry') {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
+	# just skip it for testing purposes.
+    }
     elsif ($event eq 'cmd_name') {
 	$processes->{$sid}->{'name'} = $line->{'name'};
 	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
diff --git a/trace2.c b/trace2.c
index 256120c7fd..b9b154ac44 100644
--- a/trace2.c
+++ b/trace2.c
@@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
 			tgt_j->pfn_command_path_fl(file, line, pathname);
 }
 
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	struct tr2_tgt *tgt_j;
+	int j;
+
+	if (!trace2_enabled)
+		return;
+
+	for_each_wanted_builtin (j, tgt_j)
+		if (tgt_j->pfn_command_ancestry_fl)
+			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
+}
+
 void trace2_cmd_name_fl(const char *file, int line, const char *name)
 {
 	struct tr2_tgt *tgt_j;
diff --git a/trace2.h b/trace2.h
index ede18c2e06..23743ac62b 100644
--- a/trace2.h
+++ b/trace2.h
@@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
 
 #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
 
+/*
+ * Emit an 'ancestry' event with the process name of the current process's
+ * parent process.
+ * This gives post-processors a way to determine what invoked the command and
+ * learn more about usage patterns.
+ */
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
+
+#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
+
 /*
  * Emit a 'cmd_name' event with the canonical name of the command.
  * This gives post-processors a simple field to identify the command
@@ -492,7 +502,7 @@ enum trace2_process_info_reason {
 	TRACE2_PROCESS_INFO_EXIT,
 };
 
-#if defined(GIT_WINDOWS_NATIVE)
+#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
 void trace2_collect_process_info(enum trace2_process_info_reason reason);
 #else
 #define trace2_collect_process_info(reason) \
diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
index 7b90469212..1f66fd6573 100644
--- a/trace2/tr2_tgt.h
+++ b/trace2/tr2_tgt.h
@@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
 
 typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
 					    const char *command_path);
+typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
+						const char **parent_names);
 typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
 					    const char *name,
 					    const char *hierarchy);
@@ -108,6 +110,7 @@ struct tr2_tgt {
 	tr2_tgt_evt_atexit_t                    *pfn_atexit;
 	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
 	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
+	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
 	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
 	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
 	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..578a9a5287 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	jw_release(&jw);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	const char *parent_name = NULL;
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	event_fmt_prepare(event_name, file, line, NULL, &jw);
+	jw_object_inline_begin_array(&jw, "ancestry");
+
+	while ((parent_name = *parent_names++))
+		jw_array_string(&jw, parent_name);
+
+	jw_end(&jw); /* 'ancestry' array */
+	jw_end(&jw); /* event object */
+
+	tr2_dst_write_line(&tr2dst_event, &jw.json);
+	jw_release(&jw);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 31b602c171..a5751c8864 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *parent_name = NULL;
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	/* cmd_ancestry parent <- grandparent <- great-grandparent */
+	strbuf_addstr(&buf_payload, "cmd_ancestry ");
+	while ((parent_name = *parent_names++)) {
+		strbuf_addstr(&buf_payload, parent_name);
+		/* if we'll write another one after this, add a delimiter */
+		if (parent_names && *parent_names)
+			strbuf_addstr(&buf_payload, " <- ");
+	}
+
+	normal_io_write_fl(file, line, &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index a8018f18cc..af4d65a0a5 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	strbuf_addstr(&buf_payload, "ancestry:[");
+	/* It's not an argv but the rules are basically the same. */
+	sq_append_quote_argv_pretty(&buf_payload, parent_names);
+	strbuf_addch(&buf_payload, ']');
+
+	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
+			 &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
-- 
2.32.0.rc1.229.g3e70b5a671-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] tr2: log parent process name
  2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
@ 2021-06-08 20:56       ` Emily Shaffer
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
  1 sibling, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-06-08 20:56 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff Hostetler, Bagas Sanjaya, Randall S. Becker

On Tue, Jun 08, 2021 at 11:58:55AM -0700, Emily Shaffer wrote:

Hmph, this is failing CI. I'll investigate it today and send a v5.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5] tr2: log parent process name
  2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
  2021-06-08 20:56       ` Emily Shaffer
@ 2021-06-08 22:10       ` Emily Shaffer
  2021-06-08 22:16         ` Randall S. Becker
                           ` (3 more replies)
  1 sibling, 4 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-06-08 22:10 UTC (permalink / raw)
  To: git
  Cc: Emily Shaffer, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff Hostetler, Bagas Sanjaya, Randall S. Becker

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
we can influence the source code and set the GIT_TRACE2_PARENT_SID
environment variable from the caller process. In 'repo''s case, that
parent SID is manipulated to include the string "repo", which means we
can positively identify when Git was invoked by 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient to look up the process name on most platforms, so
that code may be shareable.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process and a consistent, parseable way.

Git for Windows also gathers information about more than one generation
of parent. In Linux further ancestry info can be gathered with procfs,
but it's unwieldy to do so. In the interest of later moving Git for
Windows ancestry logging to the 'cmd_ancestry' event, and in the
interest of later adding more ancestry to the Linux implementation - or
of adding this functionality to other platforms which have an easier
time walking the process tree - let's make 'cmd_ancestry' accept an
array of parentage.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Since v4:

Since I'm reading from a file to discover the name of the process, the
file is newline-terminated. This newline caused some havoc in tests, so
it's better to strip it if it's there. It's not part of the process name
itself.

Passing tests at
https://github.com/nasamuffin/git/actions/runs/919722509

Range-diff against v4:
1:  efb0a3ccb4 ! 1:  7a7e1ebbfa tr2: log parent process name
    @@ compat/procinfo.c (new)
     +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
     +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
     +		strbuf_release(&procfs_path);
    ++		strbuf_trim_trailing_newline(&name);
     +		strvec_push(names, strbuf_detach(&name, NULL));
     +	}
     +

 Documentation/technical/api-trace2.txt | 14 ++++++
 Makefile                               |  5 +++
 compat/procinfo.c                      | 62 ++++++++++++++++++++++++++
 config.mak.uname                       |  1 +
 t/t0210/scrub_normal.perl              |  6 +++
 t/t0211/scrub_perf.perl                |  5 +++
 t/t0212/parse_events.perl              |  5 ++-
 trace2.c                               | 13 ++++++
 trace2.h                               | 12 ++++-
 trace2/tr2_tgt.h                       |  3 ++
 trace2/tr2_tgt_event.c                 | 21 +++++++++
 trace2/tr2_tgt_normal.c                | 19 ++++++++
 trace2/tr2_tgt_perf.c                  | 16 +++++++
 13 files changed, 180 insertions(+), 2 deletions(-)
 create mode 100644 compat/procinfo.c

diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
index 3f52f981a2..8a0b360a0e 100644
--- a/Documentation/technical/api-trace2.txt
+++ b/Documentation/technical/api-trace2.txt
@@ -493,6 +493,20 @@ about specific error arguments.
 }
 ------------
 
+`"cmd_ancestry"`::
+	This event contains the text command name for the parent (and earlier
+	generations of parents) of the current process, in an array ordered from
+	nearest parent to furthest great-grandparent. It may not be implemented
+	on all platforms.
++
+------------
+{
+	"event":"cmd_ancestry",
+	...
+	"ancestry":["bash","tmux: server","systemd"]
+}
+------------
+
 `"cmd_name"`::
 	This event contains the command name for this git process
 	and the hierarchy of commands from parent git processes.
diff --git a/Makefile b/Makefile
index 93664d6714..330e4fa011 100644
--- a/Makefile
+++ b/Makefile
@@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
 endif
 
+ifdef HAVE_PROCFS_LINUX
+	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
+	COMPAT_OBJS += compat/procinfo.o
+endif
+
 ifdef HAVE_NS_GET_EXECUTABLE_PATH
 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
 endif
diff --git a/compat/procinfo.c b/compat/procinfo.c
new file mode 100644
index 0000000000..f8763cacf8
--- /dev/null
+++ b/compat/procinfo.c
@@ -0,0 +1,62 @@
+#include "cache.h"
+
+#include "strbuf.h"
+#include "strvec.h"
+#include "trace2.h"
+
+static void get_ancestry_names(struct strvec *names)
+{
+#ifdef HAVE_PROCFS_LINUX
+	/*
+	 * NEEDSWORK: We could gather the entire pstree into an array to match
+	 * functionality with compat/win32/trace2_win32_process_info.c.
+	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
+	 * gather the immediate parent name which is readily accessible from
+	 * /proc/$(getppid())/comm.
+	 */
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf name = STRBUF_INIT;
+
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
+	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
+		strbuf_release(&procfs_path);
+		strbuf_trim_trailing_newline(&name);
+		strvec_push(names, strbuf_detach(&name, NULL));
+	}
+
+	return;
+#endif
+	/* NEEDSWORK: add non-procfs-linux implementations here */
+}
+
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+	if (!trace2_is_enabled())
+		return;
+
+	/* someday we may want to write something extra here, but not today */
+	if (reason == TRACE2_PROCESS_INFO_EXIT)
+		return;
+
+	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
+		/*
+		 * NEEDSWORK: we could do the entire ptree in an array instead,
+		 * see compat/win32/trace2_win32_process_info.c.
+		 */
+		struct strvec names = STRVEC_INIT;
+
+		get_ancestry_names(&names);
+
+		if (names.nr == 0) {
+			strvec_clear(&names);
+			return;
+		}
+
+		trace2_cmd_ancestry(names.v);
+
+		strvec_clear(&names);
+	}
+
+	return;
+}
diff --git a/config.mak.uname b/config.mak.uname
index cb443b4e02..7ad110a1d2 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	BASIC_CFLAGS += -DHAVE_SYSINFO
 	PROCFS_EXECUTABLE_PATH = /proc/self/exe
+	HAVE_PROCFS_LINUX = YesPlease
 endif
 ifeq ($(uname_S),GNU/kFreeBSD)
 	HAVE_ALLOCA_H = YesPlease
diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
index c65d1a815e..7cc4de392a 100644
--- a/t/t0210/scrub_normal.perl
+++ b/t/t0210/scrub_normal.perl
@@ -42,6 +42,12 @@
 	# so just omit it for testing purposes.
 	# print "cmd_path _EXE_\n";
     }
+    elsif ($line =~ m/^cmd_ancestry/) {
+	# 'cmd_ancestry' is not implemented everywhere, so for portability's
+	# sake, skip it when parsing normal.
+	#
+	# print "$line";
+    }
     else {
 	print "$line";
     }
diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
index 351af7844e..d164b750ff 100644
--- a/t/t0211/scrub_perf.perl
+++ b/t/t0211/scrub_perf.perl
@@ -44,6 +44,11 @@
 	# $tokens[$col_rest] = "_EXE_";
 	goto SKIP_LINE;
     }
+    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
+	# so skip it.
+	goto SKIP_LINE;
+    }
     elsif ($tokens[$col_event] =~ m/child_exit/) {
 	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
     }
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..b6408560c0 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -132,7 +132,10 @@
 	# just omit it for testing purposes.
 	# $processes->{$sid}->{'path'} = "_EXE_";
     }
-    
+    elsif ($event eq 'cmd_ancestry') {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
+	# just skip it for testing purposes.
+    }
     elsif ($event eq 'cmd_name') {
 	$processes->{$sid}->{'name'} = $line->{'name'};
 	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
diff --git a/trace2.c b/trace2.c
index 256120c7fd..b9b154ac44 100644
--- a/trace2.c
+++ b/trace2.c
@@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
 			tgt_j->pfn_command_path_fl(file, line, pathname);
 }
 
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	struct tr2_tgt *tgt_j;
+	int j;
+
+	if (!trace2_enabled)
+		return;
+
+	for_each_wanted_builtin (j, tgt_j)
+		if (tgt_j->pfn_command_ancestry_fl)
+			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
+}
+
 void trace2_cmd_name_fl(const char *file, int line, const char *name)
 {
 	struct tr2_tgt *tgt_j;
diff --git a/trace2.h b/trace2.h
index ede18c2e06..23743ac62b 100644
--- a/trace2.h
+++ b/trace2.h
@@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
 
 #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
 
+/*
+ * Emit an 'ancestry' event with the process name of the current process's
+ * parent process.
+ * This gives post-processors a way to determine what invoked the command and
+ * learn more about usage patterns.
+ */
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
+
+#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
+
 /*
  * Emit a 'cmd_name' event with the canonical name of the command.
  * This gives post-processors a simple field to identify the command
@@ -492,7 +502,7 @@ enum trace2_process_info_reason {
 	TRACE2_PROCESS_INFO_EXIT,
 };
 
-#if defined(GIT_WINDOWS_NATIVE)
+#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
 void trace2_collect_process_info(enum trace2_process_info_reason reason);
 #else
 #define trace2_collect_process_info(reason) \
diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
index 7b90469212..1f66fd6573 100644
--- a/trace2/tr2_tgt.h
+++ b/trace2/tr2_tgt.h
@@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
 
 typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
 					    const char *command_path);
+typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
+						const char **parent_names);
 typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
 					    const char *name,
 					    const char *hierarchy);
@@ -108,6 +110,7 @@ struct tr2_tgt {
 	tr2_tgt_evt_atexit_t                    *pfn_atexit;
 	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
 	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
+	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
 	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
 	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
 	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..578a9a5287 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	jw_release(&jw);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	const char *parent_name = NULL;
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	event_fmt_prepare(event_name, file, line, NULL, &jw);
+	jw_object_inline_begin_array(&jw, "ancestry");
+
+	while ((parent_name = *parent_names++))
+		jw_array_string(&jw, parent_name);
+
+	jw_end(&jw); /* 'ancestry' array */
+	jw_end(&jw); /* event object */
+
+	tr2_dst_write_line(&tr2dst_event, &jw.json);
+	jw_release(&jw);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 31b602c171..a5751c8864 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *parent_name = NULL;
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	/* cmd_ancestry parent <- grandparent <- great-grandparent */
+	strbuf_addstr(&buf_payload, "cmd_ancestry ");
+	while ((parent_name = *parent_names++)) {
+		strbuf_addstr(&buf_payload, parent_name);
+		/* if we'll write another one after this, add a delimiter */
+		if (parent_names && *parent_names)
+			strbuf_addstr(&buf_payload, " <- ");
+	}
+
+	normal_io_write_fl(file, line, &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index a8018f18cc..af4d65a0a5 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	strbuf_addstr(&buf_payload, "ancestry:[");
+	/* It's not an argv but the rules are basically the same. */
+	sq_append_quote_argv_pretty(&buf_payload, parent_names);
+	strbuf_addch(&buf_payload, ']');
+
+	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
+			 &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
-- 
2.32.0.rc1.229.g3e70b5a671-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5] tr2: log parent process name
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
@ 2021-06-08 22:16         ` Randall S. Becker
  2021-06-08 22:24           ` Emily Shaffer
  2021-06-16  8:42         ` Junio C Hamano
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Randall S. Becker @ 2021-06-08 22:16 UTC (permalink / raw)
  To: 'Emily Shaffer', git
  Cc: 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On June 8, 2021 6:11 PM, Emily Shaffer wrote:
>It can be useful to tell who invoked Git - was it invoked manually by a user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
>we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In 'repo''s case,
>that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by 'repo' tool.
>However, identifying parents that way requires both that we know which tools invoke Git and that we have the ability to modify the source
>code of those tools. It cannot scale to keep up with the various IDEs and wrappers which use Git, most of which we don't know about.
>Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and performance.
>
>Unfortunately, there's no cross-platform reliable way to gather the name of the parent process. If procfs is present, we can use that;
>otherwise we will need to discover the name another way. However, the process ID should be sufficient to look up the process name on
>most platforms, so that code may be shareable.
>
>Git for Windows gathers similar information and logs it as a "data_json"
>event. However, since "data_json" has a variable format, it is difficult to parse effectively in some languages; instead, let's pursue a
>dedicated "cmd_ancestry" event to record information about the ancestry of the current process and a consistent, parseable way.
>
>Git for Windows also gathers information about more than one generation of parent. In Linux further ancestry info can be gathered with
>procfs, but it's unwieldy to do so. In the interest of later moving Git for Windows ancestry logging to the 'cmd_ancestry' event, and in the
>interest of later adding more ancestry to the Linux implementation - or of adding this functionality to other platforms which have an
>easier time walking the process tree - let's make 'cmd_ancestry' accept an array of parentage.

We are probably going to have to discuss this one at more length. On NonStop, in some cases, I have access to the program arguments of the parent (rather like ps -ef) in POSIX-land, but not from the other personality. I do have access to the program object name, in both sides, although if someone replaces the object - which is not actually possible for a running program, but a rename is - the object may end up being somewhat meaningless or mangled. My suspicion is that I'm going to have to supply different things for the two personalities, but I'm not sure what as of yet.

Regards,
Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-08 22:16         ` Randall S. Becker
@ 2021-06-08 22:24           ` Emily Shaffer
  2021-06-08 22:39             ` Randall S. Becker
  0 siblings, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-06-08 22:24 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On Tue, Jun 08, 2021 at 06:16:57PM -0400, Randall S. Becker wrote:
> 
> On June 8, 2021 6:11 PM, Emily Shaffer wrote:
> >It can be useful to tell who invoked Git - was it invoked manually by a user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
> >we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In 'repo''s case,
> >that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by 'repo' tool.
> >However, identifying parents that way requires both that we know which tools invoke Git and that we have the ability to modify the source
> >code of those tools. It cannot scale to keep up with the various IDEs and wrappers which use Git, most of which we don't know about.
> >Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and performance.
> >
> >Unfortunately, there's no cross-platform reliable way to gather the name of the parent process. If procfs is present, we can use that;
> >otherwise we will need to discover the name another way. However, the process ID should be sufficient to look up the process name on
> >most platforms, so that code may be shareable.
> >
> >Git for Windows gathers similar information and logs it as a "data_json"
> >event. However, since "data_json" has a variable format, it is difficult to parse effectively in some languages; instead, let's pursue a
> >dedicated "cmd_ancestry" event to record information about the ancestry of the current process and a consistent, parseable way.
> >
> >Git for Windows also gathers information about more than one generation of parent. In Linux further ancestry info can be gathered with
> >procfs, but it's unwieldy to do so. In the interest of later moving Git for Windows ancestry logging to the 'cmd_ancestry' event, and in the
> >interest of later adding more ancestry to the Linux implementation - or of adding this functionality to other platforms which have an
> >easier time walking the process tree - let's make 'cmd_ancestry' accept an array of parentage.
> 
> We are probably going to have to discuss this one at more length. On
> NonStop, in some cases, I have access to the program arguments of the
> parent (rather like ps -ef) in POSIX-land, but not from the other
> personality. I do have access to the program object name, in both
> sides, although if someone replaces the object - which is not actually
> possible for a running program, but a rename is - the object may end
> up being somewhat meaningless or mangled. My suspicion is that I'm
> going to have to supply different things for the two personalities,
> but I'm not sure what as of yet.

I guess I'm having trouble understanding - it sounds like you're
describing one process with a graph of ancestry instead of a line. Does
that mean we shouldn't be tracing the ancestry the way we are?

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5] tr2: log parent process name
  2021-06-08 22:24           ` Emily Shaffer
@ 2021-06-08 22:39             ` Randall S. Becker
  2021-06-09 20:17               ` Emily Shaffer
  0 siblings, 1 reply; 45+ messages in thread
From: Randall S. Becker @ 2021-06-08 22:39 UTC (permalink / raw)
  To: 'Emily Shaffer'
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On June 8, 2021 6:25 PM, Emily Shaffer wrote"
>To: Randall S. Becker <rsbecker@nexbridge.com>
>Cc: git@vger.kernel.org; 'Ævar Arnfjörð Bjarmason' <avarab@gmail.com>; 'Junio C Hamano' <gitster@pobox.com>; 'Jeff Hostetler'
><git@jeffhostetler.com>; 'Bagas Sanjaya' <bagasdotme@gmail.com>
>Subject: Re: [PATCH v5] tr2: log parent process name
>
>On Tue, Jun 08, 2021 at 06:16:57PM -0400, Randall S. Becker wrote:
>>
>> On June 8, 2021 6:11 PM, Emily Shaffer wrote:
>> >It can be useful to tell who invoked Git - was it invoked manually by
>> >a user via CLI or script? By an IDE?  In some cases - like 'repo'
>> >tool - we can influence the source code and set the GIT_TRACE2_PARENT_SID environment variable from the caller process. In
'repo''s
>case, that parent SID is manipulated to include the string "repo", which means we can positively identify when Git was invoked by
'repo'
>tool.
>> >However, identifying parents that way requires both that we know
>> >which tools invoke Git and that we have the ability to modify the source code of those tools. It cannot scale to keep up with
the various
>IDEs and wrappers which use Git, most of which we don't know about.
>> >Learning which tools and wrappers invoke Git, and how, would give us insight to decide where to improve Git's usability and
>performance.
>> >
>> >Unfortunately, there's no cross-platform reliable way to gather the
>> >name of the parent process. If procfs is present, we can use that;
>> >otherwise we will need to discover the name another way. However, the process ID should be sufficient to look up the process
name on
>most platforms, so that code may be shareable.
>> >
>> >Git for Windows gathers similar information and logs it as a "data_json"
>> >event. However, since "data_json" has a variable format, it is
>> >difficult to parse effectively in some languages; instead, let's pursue a dedicated "cmd_ancestry" event to record information
about the
>ancestry of the current process and a consistent, parseable way.
>> >
>> >Git for Windows also gathers information about more than one
>> >generation of parent. In Linux further ancestry info can be gathered
>> >with procfs, but it's unwieldy to do so. In the interest of later
>> >moving Git for Windows ancestry logging to the 'cmd_ancestry' event, and in the interest of later adding more ancestry to the
Linux
>implementation - or of adding this functionality to other platforms which have an easier time walking the process tree - let's make
>'cmd_ancestry' accept an array of parentage.
>>
>> We are probably going to have to discuss this one at more length. On
>> NonStop, in some cases, I have access to the program arguments of the
>> parent (rather like ps -ef) in POSIX-land, but not from the other
>> personality. I do have access to the program object name, in both
>> sides, although if someone replaces the object - which is not actually
>> possible for a running program, but a rename is - the object may end
>> up being somewhat meaningless or mangled. My suspicion is that I'm
>> going to have to supply different things for the two personalities,
>> but I'm not sure what as of yet.
>
>I guess I'm having trouble understanding - it sounds like you're describing one process with a graph of ancestry instead of a line.
Does that
>mean we shouldn't be tracing the ancestry the way we are?

It's more like this (g = Guardian, p=Posix, for illustration), for a typical interactive situation coming from the non-POSIX side):

gMonitor -> gAncestor1 -> gAncestor2 -> pAncestor3 (/bin/sh) -> git

And when started from an SSH window:

gMonitor -> gAncestor1 (SSH) -> pAncestor2 (/bin/sh) -> git

I can get the program object name from any of the above, and the pid from a POSIX process, or the name (or cpu and process number)
of a Guardian process. In the case of POSIX, obtaining program arguments may be possible. An ancestor, as with Linux, can have
multiple children in a tree but a child can only have one parent - well, technically one at a time anyway because there are some
funky exceptions where a child can adopt a different parent in Guardian-land. In both cases, I can get the program object file of
the process (like /usr/local/bin/git), but if someone renames git because an install happened during a long-running operation, like
git gc --aggressive, the object may be named something else, like /usr/local/bin/ZZLDAG01, for argument sake).

I'm not sure any of this is really relevant, but describes some of what is possible. It also might be useful to pull out the tty
that the process was running on. That is easy to get if the terminal is still connected. I think that particular bit of information
might be very useful, as well as user information.

-Randall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-08 22:39             ` Randall S. Becker
@ 2021-06-09 20:17               ` Emily Shaffer
  0 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-06-09 20:17 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: git, 'Ævar Arnfjörð Bjarmason',
	'Junio C Hamano', 'Jeff Hostetler',
	'Bagas Sanjaya'

On Tue, Jun 08, 2021 at 06:39:15PM -0400, Randall S. Becker wrote:
> On June 8, 2021 6:25 PM, Emily Shaffer wrote"
> >> We are probably going to have to discuss this one at more length. On
> >> NonStop, in some cases, I have access to the program arguments of the
> >> parent (rather like ps -ef) in POSIX-land, but not from the other
> >> personality. I do have access to the program object name, in both
> >> sides, although if someone replaces the object - which is not actually
> >> possible for a running program, but a rename is - the object may end
> >> up being somewhat meaningless or mangled. My suspicion is that I'm
> >> going to have to supply different things for the two personalities,
> >> but I'm not sure what as of yet.
> >
> >I guess I'm having trouble understanding - it sounds like you're
> >describing one process with a graph of ancestry instead of a line.
> >Does that mean we shouldn't be tracing the ancestry the way we are?
> 
> It's more like this (g = Guardian, p=Posix, for illustration), for a
> typical interactive situation coming from the non-POSIX side):
> 
> gMonitor -> gAncestor1 -> gAncestor2 -> pAncestor3 (/bin/sh) -> git
> 
> And when started from an SSH window:
> 
> gMonitor -> gAncestor1 (SSH) -> pAncestor2 (/bin/sh) -> git
> 
> I can get the program object name from any of the above, and the pid
> from a POSIX process, or the name (or cpu and process number) of a
> Guardian process. In the case of POSIX, obtaining program arguments
> may be possible. An ancestor, as with Linux, can have multiple
> children in a tree but a child can only have one parent - well,
> technically one at a time anyway because there are some funky
> exceptions where a child can adopt a different parent in
> Guardian-land. In both cases, I can get the program object file of the
> process (like /usr/local/bin/git), but if someone renames git because
> an install happened during a long-running operation, like git gc
> --aggressive, the object may be named something else, like
> /usr/local/bin/ZZLDAG01, for argument sake).

Interesting. One thing that might be helpful (if you don't care about
the later name, since it looks like it might be junk) is that
trace2_collect_process_info() provides some hint about when it was
invoked, via the 'enum trace2_process_info_reason' arg. So if you're
worried about a long-running process being renamed, the
TRACE2_PROCESS_INFO_STARTUP reason happens very early in common-main.c.
(And you could capture info like "the name changed later" at
TRACE2_PROCESS_INFO_EXIT if you wanted.)

> I'm not sure any of this is really relevant, but describes some of
> what is possible. It also might be useful to pull out the tty that the
> process was running on. That is easy to get if the terminal is still
> connected. I think that particular bit of information might be very
> useful, as well as user information.
> 
> -Randall
> 

(I don't have much insight into what would or wouldn't be useful to log
here in your case, but I will say that it all sounds very cool and I
appreciate your thorough explanation.)

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
  2021-06-08 22:16         ` Randall S. Becker
@ 2021-06-16  8:42         ` Junio C Hamano
  2021-06-28 16:45         ` Jeff Hostetler
  2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
  3 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2021-06-16  8:42 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: git, Ævar Arnfjörð Bjarmason, Jeff Hostetler,
	Bagas Sanjaya, Randall S. Becker

Emily Shaffer <emilyshaffer@google.com> writes:

Other than your back-and-forth with Randall on NonStop specifics
that didn't result in any code change, there was no comment on this
patch.  Are other people totally happy with this version, or are
they totally uninterested?

> +	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		struct strvec names = STRVEC_INIT;
> +
> +		get_ancestry_names(&names);
> +
> +		if (names.nr == 0) {
>
> +			strvec_clear(&names);
> +			return;
> +		}
>
> +		trace2_cmd_ancestry(names.v);
> +
> +		strvec_clear(&names);


Micronit.  CodingGuidelines tells us not to explicitly compare with
constant 0, '\0', or NULL.  It may be more concise and easier to
follow if written like this:

		get_ancestry_names(&names);
		if (names.nr)
			trace2_cmd_ancestry(names.v);
		strvec_clear(&names);


Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
  2021-06-08 22:16         ` Randall S. Becker
  2021-06-16  8:42         ` Junio C Hamano
@ 2021-06-28 16:45         ` Jeff Hostetler
  2021-06-29 23:51           ` Emily Shaffer
  2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
  3 siblings, 1 reply; 45+ messages in thread
From: Jeff Hostetler @ 2021-06-28 16:45 UTC (permalink / raw)
  To: Emily Shaffer, git
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Bagas Sanjaya, Randall S. Becker



On 6/8/21 6:10 PM, Emily Shaffer wrote:
> It can be useful to tell who invoked Git - was it invoked manually by a
> user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
> we can influence the source code and set the GIT_TRACE2_PARENT_SID
> environment variable from the caller process. In 'repo''s case, that
> parent SID is manipulated to include the string "repo", which means we
> can positively identify when Git was invoked by 'repo' tool. However,
> identifying parents that way requires both that we know which tools
> invoke Git and that we have the ability to modify the source code of
> those tools. It cannot scale to keep up with the various IDEs and
> wrappers which use Git, most of which we don't know about. Learning
> which tools and wrappers invoke Git, and how, would give us insight to
> decide where to improve Git's usability and performance.
> 
> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient to look up the process name on most platforms, so
> that code may be shareable.
> 
> Git for Windows gathers similar information and logs it as a "data_json"
> event. However, since "data_json" has a variable format, it is difficult
> to parse effectively in some languages; instead, let's pursue a
> dedicated "cmd_ancestry" event to record information about the ancestry
> of the current process and a consistent, parseable way.
> 
> Git for Windows also gathers information about more than one generation
> of parent. In Linux further ancestry info can be gathered with procfs,
> but it's unwieldy to do so. In the interest of later moving Git for
> Windows ancestry logging to the 'cmd_ancestry' event, and in the
> interest of later adding more ancestry to the Linux implementation - or
> of adding this functionality to other platforms which have an easier
> time walking the process tree - let's make 'cmd_ancestry' accept an
> array of parentage.
> 
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> 
> Since v4:
> 
> Since I'm reading from a file to discover the name of the process, the
> file is newline-terminated. This newline caused some havoc in tests, so
> it's better to strip it if it's there. It's not part of the process name
> itself.
> 
> Passing tests at
> https://github.com/nasamuffin/git/actions/runs/919722509
> 
> Range-diff against v4:
> 1:  efb0a3ccb4 ! 1:  7a7e1ebbfa tr2: log parent process name
>      @@ compat/procinfo.c (new)
>       +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
>       +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
>       +		strbuf_release(&procfs_path);
>      ++		strbuf_trim_trailing_newline(&name);
>       +		strvec_push(names, strbuf_detach(&name, NULL));
>       +	}
>       +

You're only getting the name of the command (argv[0]) and not the
full command line, right?  That is a good thing.

> 
>   Documentation/technical/api-trace2.txt | 14 ++++++
>   Makefile                               |  5 +++
>   compat/procinfo.c                      | 62 ++++++++++++++++++++++++++
>   config.mak.uname                       |  1 +
>   t/t0210/scrub_normal.perl              |  6 +++
>   t/t0211/scrub_perf.perl                |  5 +++
>   t/t0212/parse_events.perl              |  5 ++-
>   trace2.c                               | 13 ++++++
>   trace2.h                               | 12 ++++-
>   trace2/tr2_tgt.h                       |  3 ++
>   trace2/tr2_tgt_event.c                 | 21 +++++++++
>   trace2/tr2_tgt_normal.c                | 19 ++++++++
>   trace2/tr2_tgt_perf.c                  | 16 +++++++
>   13 files changed, 180 insertions(+), 2 deletions(-)
>   create mode 100644 compat/procinfo.c
> 
> diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
> index 3f52f981a2..8a0b360a0e 100644
> --- a/Documentation/technical/api-trace2.txt
> +++ b/Documentation/technical/api-trace2.txt
> @@ -493,6 +493,20 @@ about specific error arguments.
>   }
>   ------------
>   
> +`"cmd_ancestry"`::
> +	This event contains the text command name for the parent (and earlier
> +	generations of parents) of the current process, in an array ordered from
> +	nearest parent to furthest great-grandparent. It may not be implemented
> +	on all platforms.
> ++
> +------------
> +{
> +	"event":"cmd_ancestry",
> +	...
> +	"ancestry":["bash","tmux: server","systemd"]

Is the second element really "tmux: server".  Seems odd that that's what
the command name (argv[0]) is.  Perhaps I misread something??

> +}

This array is bounded and that implies that you captured all of
the grand parents back to "init" (or whatever it is called these
days).

Is there value in having a final "..." or "(truncated)" element
to indicate that the list incomplete?  I did the latter in the
Windows version.


> +------------
> +
>   `"cmd_name"`::
>   	This event contains the command name for this git process
>   	and the hierarchy of commands from parent git processes.
> diff --git a/Makefile b/Makefile
> index 93664d6714..330e4fa011 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1889,6 +1889,11 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
>   	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
>   endif
>   
> +ifdef HAVE_PROCFS_LINUX
> +	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
> +	COMPAT_OBJS += compat/procinfo.o
> +endif
> +
>   ifdef HAVE_NS_GET_EXECUTABLE_PATH
>   	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
>   endif
> diff --git a/compat/procinfo.c b/compat/procinfo.c
> new file mode 100644
> index 0000000000..f8763cacf8
> --- /dev/null
> +++ b/compat/procinfo.c
> @@ -0,0 +1,62 @@
> +#include "cache.h"
> +
> +#include "strbuf.h"
> +#include "strvec.h"
> +#include "trace2.h"
> +
> +static void get_ancestry_names(struct strvec *names)
> +{
> +#ifdef HAVE_PROCFS_LINUX
> +	/*
> +	 * NEEDSWORK: We could gather the entire pstree into an array to match
> +	 * functionality with compat/win32/trace2_win32_process_info.c.
> +	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
> +	 * gather the immediate parent name which is readily accessible from
> +	 * /proc/$(getppid())/comm.
> +	 */
> +	struct strbuf procfs_path = STRBUF_INIT;
> +	struct strbuf name = STRBUF_INIT;
> +
> +	/* try to use procfs if it's present. */
> +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> +		strbuf_release(&procfs_path);
> +		strbuf_trim_trailing_newline(&name);
> +		strvec_push(names, strbuf_detach(&name, NULL));
> +	}
> +
> +	return;
> +#endif
> +	/* NEEDSWORK: add non-procfs-linux implementations here */
> +}

Perhaps this has already been discussed, but would it be better
to have a "compat/linux/trace2_linux_process_info.c"
or "compat/procfs/trace2_procfs_process_info.c" source file and
only compile it in Linux-compatible builds -- rather than #ifdef'ing
the source.  This is a highly platform-specific feature.

For example, if I convert the Win32 version to use your new event,
I wouldn't want to move the code.

I just noticed that you have both "BASIC_CFLAGS+=" and a "COMPAT_OBSJ+="
lines.  If you made this source file procfs-specific, you wouldn't need
the ifdef and you could avoid the new CFLAG.

> +
> +void trace2_collect_process_info(enum trace2_process_info_reason reason)
> +{
> +	if (!trace2_is_enabled())
> +		return;
> +
> +	/* someday we may want to write something extra here, but not today */
> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
> +		return;
> +
> +	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		struct strvec names = STRVEC_INIT;
> +
> +		get_ancestry_names(&names);
> +
> +		if (names.nr == 0) {
> +			strvec_clear(&names);
> +			return;
> +		}
> +
> +		trace2_cmd_ancestry(names.v);
> +
> +		strvec_clear(&names);

I agree with Junio here, it would be simpler to say it like this:

		get_ancestry_names(&names);
		if (names.nr)
			trace2_cmd_ancestry(names.v);
		strvec_clear(&names);

> +	}
> +
> +	return;
> +}
> diff --git a/config.mak.uname b/config.mak.uname
> index cb443b4e02..7ad110a1d2 100644
> --- a/config.mak.uname
> +++ b/config.mak.uname
> @@ -58,6 +58,7 @@ ifeq ($(uname_S),Linux)
>   	FREAD_READS_DIRECTORIES = UnfortunatelyYes
>   	BASIC_CFLAGS += -DHAVE_SYSINFO
>   	PROCFS_EXECUTABLE_PATH = /proc/self/exe
> +	HAVE_PROCFS_LINUX = YesPlease
>   endif
>   ifeq ($(uname_S),GNU/kFreeBSD)
>   	HAVE_ALLOCA_H = YesPlease
> diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
> index c65d1a815e..7cc4de392a 100644
> --- a/t/t0210/scrub_normal.perl
> +++ b/t/t0210/scrub_normal.perl
> @@ -42,6 +42,12 @@
>   	# so just omit it for testing purposes.
>   	# print "cmd_path _EXE_\n";
>       }
> +    elsif ($line =~ m/^cmd_ancestry/) {
> +	# 'cmd_ancestry' is not implemented everywhere, so for portability's
> +	# sake, skip it when parsing normal.
> +	#
> +	# print "$line";
> +    }
>       else {
>   	print "$line";
>       }
> diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
> index 351af7844e..d164b750ff 100644
> --- a/t/t0211/scrub_perf.perl
> +++ b/t/t0211/scrub_perf.perl
> @@ -44,6 +44,11 @@
>   	# $tokens[$col_rest] = "_EXE_";
>   	goto SKIP_LINE;
>       }
> +    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
> +	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
> +	# so skip it.
> +	goto SKIP_LINE;
> +    }
>       elsif ($tokens[$col_event] =~ m/child_exit/) {
>   	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
>       }
> diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
> index 6584bb5634..b6408560c0 100644
> --- a/t/t0212/parse_events.perl
> +++ b/t/t0212/parse_events.perl
> @@ -132,7 +132,10 @@
>   	# just omit it for testing purposes.
>   	# $processes->{$sid}->{'path'} = "_EXE_";
>       }
> -
> +    elsif ($event eq 'cmd_ancestry') {
> +	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
> +	# just skip it for testing purposes.
> +    }
>       elsif ($event eq 'cmd_name') {
>   	$processes->{$sid}->{'name'} = $line->{'name'};
>   	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
> diff --git a/trace2.c b/trace2.c
> index 256120c7fd..b9b154ac44 100644
> --- a/trace2.c
> +++ b/trace2.c
> @@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
>   			tgt_j->pfn_command_path_fl(file, line, pathname);
>   }
>   
> +void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	struct tr2_tgt *tgt_j;
> +	int j;
> +
> +	if (!trace2_enabled)
> +		return;
> +
> +	for_each_wanted_builtin (j, tgt_j)
> +		if (tgt_j->pfn_command_ancestry_fl)
> +			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
> +}
> +
>   void trace2_cmd_name_fl(const char *file, int line, const char *name)
>   {
>   	struct tr2_tgt *tgt_j;
> diff --git a/trace2.h b/trace2.h
> index ede18c2e06..23743ac62b 100644
> --- a/trace2.h
> +++ b/trace2.h
> @@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
>   
>   #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
>   
> +/*
> + * Emit an 'ancestry' event with the process name of the current process's
> + * parent process.
> + * This gives post-processors a way to determine what invoked the command and
> + * learn more about usage patterns.
> + */
> +void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
> +
> +#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
> +
>   /*
>    * Emit a 'cmd_name' event with the canonical name of the command.
>    * This gives post-processors a simple field to identify the command
> @@ -492,7 +502,7 @@ enum trace2_process_info_reason {
>   	TRACE2_PROCESS_INFO_EXIT,
>   };
>   
> -#if defined(GIT_WINDOWS_NATIVE)
> +#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
>   void trace2_collect_process_info(enum trace2_process_info_reason reason);
>   #else
>   #define trace2_collect_process_info(reason) \
> diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
> index 7b90469212..1f66fd6573 100644
> --- a/trace2/tr2_tgt.h
> +++ b/trace2/tr2_tgt.h
> @@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
>   
>   typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
>   					    const char *command_path);
> +typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
> +						const char **parent_names);
>   typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
>   					    const char *name,
>   					    const char *hierarchy);
> @@ -108,6 +110,7 @@ struct tr2_tgt {
>   	tr2_tgt_evt_atexit_t                    *pfn_atexit;
>   	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
>   	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
> +	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
>   	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
>   	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
>   	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
> diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
> index 6353e8ad91..578a9a5287 100644
> --- a/trace2/tr2_tgt_event.c
> +++ b/trace2/tr2_tgt_event.c
> @@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	jw_release(&jw);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *event_name = "cmd_ancestry";
> +	const char *parent_name = NULL;
> +	struct json_writer jw = JSON_WRITER_INIT;
> +
> +	jw_object_begin(&jw, 0);
> +	event_fmt_prepare(event_name, file, line, NULL, &jw);
> +	jw_object_inline_begin_array(&jw, "ancestry");
> +
> +	while ((parent_name = *parent_names++))
> +		jw_array_string(&jw, parent_name);
> +
> +	jw_end(&jw); /* 'ancestry' array */
> +	jw_end(&jw); /* event object */
> +
> +	tr2_dst_write_line(&tr2dst_event, &jw.json);
> +	jw_release(&jw);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
> index 31b602c171..a5751c8864 100644
> --- a/trace2/tr2_tgt_normal.c
> +++ b/trace2/tr2_tgt_normal.c
> @@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	strbuf_release(&buf_payload);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *parent_name = NULL;
> +	struct strbuf buf_payload = STRBUF_INIT;
> +
> +	/* cmd_ancestry parent <- grandparent <- great-grandparent */
> +	strbuf_addstr(&buf_payload, "cmd_ancestry ");
> +	while ((parent_name = *parent_names++)) {
> +		strbuf_addstr(&buf_payload, parent_name);
> +		/* if we'll write another one after this, add a delimiter */
> +		if (parent_names && *parent_names)
> +			strbuf_addstr(&buf_payload, " <- ");
> +	}
> +
> +	normal_io_write_fl(file, line, &buf_payload);
> +	strbuf_release(&buf_payload);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
> index a8018f18cc..af4d65a0a5 100644
> --- a/trace2/tr2_tgt_perf.c
> +++ b/trace2/tr2_tgt_perf.c
> @@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
>   	strbuf_release(&buf_payload);
>   }
>   
> +static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
> +{
> +	const char *event_name = "cmd_ancestry";
> +	struct strbuf buf_payload = STRBUF_INIT;
> +
> +	strbuf_addstr(&buf_payload, "ancestry:[");
> +	/* It's not an argv but the rules are basically the same. */
> +	sq_append_quote_argv_pretty(&buf_payload, parent_names);
> +	strbuf_addch(&buf_payload, ']');
> +
> +	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
> +			 &buf_payload);
> +	strbuf_release(&buf_payload);
> +}
> +
>   static void fn_command_name_fl(const char *file, int line, const char *name,
>   			       const char *hierarchy)
>   {
> @@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
>   	fn_atexit,
>   	fn_error_va_fl,
>   	fn_command_path_fl,
> +	fn_command_ancestry_fl,
>   	fn_command_name_fl,
>   	fn_command_mode_fl,
>   	fn_alias_fl,
> 

Otherwise, this looks good to me.
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-28 16:45         ` Jeff Hostetler
@ 2021-06-29 23:51           ` Emily Shaffer
  2021-06-30  6:10             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-06-29 23:51 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: git, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Bagas Sanjaya, Randall S. Becker

On Mon, Jun 28, 2021 at 12:45:24PM -0400, Jeff Hostetler wrote:
> On 6/8/21 6:10 PM, Emily Shaffer wrote:
> > Range-diff against v4:
> > 1:  efb0a3ccb4 ! 1:  7a7e1ebbfa tr2: log parent process name
> >      @@ compat/procinfo.c (new)
> >       +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> >       +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> >       +		strbuf_release(&procfs_path);
> >      ++		strbuf_trim_trailing_newline(&name);
> >       +		strvec_push(names, strbuf_detach(&name, NULL));
> >       +	}
> >       +
> 
> You're only getting the name of the command (argv[0]) and not the
> full command line, right?  That is a good thing.

Roughly. The name can be reset by the process itself (that's what
happened, I guess, in the tmux case I pasted below) but by default it's
argv[0]. It's also truncated to 15ch or something.
> >   ------------
> > +`"cmd_ancestry"`::
> > +	This event contains the text command name for the parent (and earlier
> > +	generations of parents) of the current process, in an array ordered from
> > +	nearest parent to furthest great-grandparent. It may not be implemented
> > +	on all platforms.
> > ++
> > +------------
> > +{
> > +	"event":"cmd_ancestry",
> > +	...
> > +	"ancestry":["bash","tmux: server","systemd"]
> 
> Is the second element really "tmux: server".  Seems odd that that's what
> the command name (argv[0]) is.  Perhaps I misread something??

See above. This is what shows up in pstree, though, and by poking around in
/proc I confirmed that this is indeed the content of /proc/<tmux-pid>/comm:

        ├─tmux: server─┬─bash───mutt───open-vim-in-new───vim
        │              ├─bash───pstree
        │              └─mutt

This is a somewhat contrived example, though, because in Linux as of this patch,
only one ancestor is gathered. So maybe I had better make the doc
reflect what's actually possible. I'm planning on sending a follow-on
sometime soon exposing more generations of ancestry, so I guess I could
update the docs back to this state around then.

> 
> > +}
> 
> This array is bounded and that implies that you captured all of
> the grand parents back to "init" (or whatever it is called these
> days).

In this case it does - pid 1 is systemd, which hasn't got a parent
process.

> Is there value in having a final "..." or "(truncated)" element
> to indicate that the list incomplete?  I did the latter in the
> Windows version.

Hrm. I'm not the one who wants to parse these - it's someone else who's
working with our team internally - so I'll ask around and see what they
think is best.

> > +#ifdef HAVE_PROCFS_LINUX
> > +	/*
> > +	 * NEEDSWORK: We could gather the entire pstree into an array to match
> > +	 * functionality with compat/win32/trace2_win32_process_info.c.
> > +	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
> > +	 * gather the immediate parent name which is readily accessible from
> > +	 * /proc/$(getppid())/comm.
> > +	 */
> > +	struct strbuf procfs_path = STRBUF_INIT;
> > +	struct strbuf name = STRBUF_INIT;
> > +
> > +	/* try to use procfs if it's present. */
> > +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> > +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> > +		strbuf_release(&procfs_path);
> > +		strbuf_trim_trailing_newline(&name);
> > +		strvec_push(names, strbuf_detach(&name, NULL));
> > +	}
> > +
> > +	return;
> > +#endif
> > +	/* NEEDSWORK: add non-procfs-linux implementations here */
> > +}
> 
> Perhaps this has already been discussed, but would it be better
> to have a "compat/linux/trace2_linux_process_info.c"
> or "compat/procfs/trace2_procfs_process_info.c" source file and
> only compile it in Linux-compatible builds -- rather than #ifdef'ing
> the source.  This is a highly platform-specific feature.
> 
> For example, if I convert the Win32 version to use your new event,
> I wouldn't want to move the code.
> 
> I just noticed that you have both "BASIC_CFLAGS+=" and a "COMPAT_OBSJ+="
> lines.  If you made this source file procfs-specific, you wouldn't need
> the ifdef and you could avoid the new CFLAG.

Sure, I'll investigate it, thanks.

> > +
> > +		if (names.nr == 0) {
> > +			strvec_clear(&names);
> > +			return;
> > +		}
> > +
> > +		trace2_cmd_ancestry(names.v);
> > +
> > +		strvec_clear(&names);
> 
> I agree with Junio here, it would be simpler to say it like this:
> 
> 		get_ancestry_names(&names);
> 		if (names.nr)
> 			trace2_cmd_ancestry(names.v);
> 		strvec_clear(&names);
> 

Thanks both, done locally.

> Otherwise, this looks good to me.

Thanks. Look for a v6 from me this week, hopefully with the build stuff
sorted out.

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-29 23:51           ` Emily Shaffer
@ 2021-06-30  6:10             ` Ævar Arnfjörð Bjarmason
  2021-07-22  0:21               ` Emily Shaffer
  0 siblings, 1 reply; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-30  6:10 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Jeff Hostetler, git, Junio C Hamano, Bagas Sanjaya, Randall S. Becker


On Tue, Jun 29 2021, Emily Shaffer wrote:

> On Mon, Jun 28, 2021 at 12:45:24PM -0400, Jeff Hostetler wrote:
>> On 6/8/21 6:10 PM, Emily Shaffer wrote:
>> > Range-diff against v4:
>> > 1:  efb0a3ccb4 ! 1:  7a7e1ebbfa tr2: log parent process name
>> >      @@ compat/procinfo.c (new)
>> >       +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
>> >       +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
>> >       +		strbuf_release(&procfs_path);
>> >      ++		strbuf_trim_trailing_newline(&name);
>> >       +		strvec_push(names, strbuf_detach(&name, NULL));
>> >       +	}
>> >       +
>> 
>> You're only getting the name of the command (argv[0]) and not the
>> full command line, right?  That is a good thing.
>
> Roughly. The name can be reset by the process itself (that's what
> happened, I guess, in the tmux case I pasted below) but by default it's
> argv[0]. It's also truncated to 15ch or something.

16 including the \0. See prctl(2). Linux has two different ways to
set/get the name, one is the argv method, the other is
prctl(PR_SET_NAME). They don't need to match at all. The ps(1) utility
and some top-like utilities allow you to switch between viewing the two
versions.

As noted in the linked manual pages you'll also potentially need to deal
with multithreaded programs having different names for each thread.

I don't think we use this now, but FWIW one thing I've wanted to do for
a while was to have the progress.c code update this, so you see if git's
at N% counting objects or whatever in top.

>> > +#ifdef HAVE_PROCFS_LINUX
>> > +	/*
>> > +	 * NEEDSWORK: We could gather the entire pstree into an array to match
>> > +	 * functionality with compat/win32/trace2_win32_process_info.c.
>> > +	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
>> > +	 * gather the immediate parent name which is readily accessible from
>> > +	 * /proc/$(getppid())/comm.
>> > +	 */
>> > +	struct strbuf procfs_path = STRBUF_INIT;
>> > +	struct strbuf name = STRBUF_INIT;
>> > +
>> > +	/* try to use procfs if it's present. */
>> > +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
>> > +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
>> > +		strbuf_release(&procfs_path);
>> > +		strbuf_trim_trailing_newline(&name);
>> > +		strvec_push(names, strbuf_detach(&name, NULL));
>> > +	}
>> > +
>> > +	return;
>> > +#endif
>> > +	/* NEEDSWORK: add non-procfs-linux implementations here */
>> > +}
>> 
>> Perhaps this has already been discussed, but would it be better
>> to have a "compat/linux/trace2_linux_process_info.c"
>> or "compat/procfs/trace2_procfs_process_info.c" source file and
>> only compile it in Linux-compatible builds -- rather than #ifdef'ing
>> the source.  This is a highly platform-specific feature.
>> 
>> For example, if I convert the Win32 version to use your new event,
>> I wouldn't want to move the code.
>> 
>> I just noticed that you have both "BASIC_CFLAGS+=" and a "COMPAT_OBSJ+="
>> lines.  If you made this source file procfs-specific, you wouldn't need
>> the ifdef and you could avoid the new CFLAG.


In general we've preferred not using ifdefs at all except for the small
bits that absolutely need it.

So e.g. in this case the whole code should compile on non-Linux, we just
need a small boolean guard somewhere to check what the platform is.

It means we don't have significant pieces of code that don't compile
except on platform X. It's easy to get into your code not compiling if
you overuse ifdefs.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5] tr2: log parent process name
  2021-06-30  6:10             ` Ævar Arnfjörð Bjarmason
@ 2021-07-22  0:21               ` Emily Shaffer
  0 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-07-22  0:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff Hostetler, git, Junio C Hamano, Bagas Sanjaya, Randall S. Becker

On Wed, Jun 30, 2021 at 08:10:59AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Tue, Jun 29 2021, Emily Shaffer wrote:
> 
> > On Mon, Jun 28, 2021 at 12:45:24PM -0400, Jeff Hostetler wrote:
> >> On 6/8/21 6:10 PM, Emily Shaffer wrote:
> >> > Range-diff against v4:
> >> > 1:  efb0a3ccb4 ! 1:  7a7e1ebbfa tr2: log parent process name
> >> >      @@ compat/procinfo.c (new)
> >> >       +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> >> >       +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> >> >       +		strbuf_release(&procfs_path);
> >> >      ++		strbuf_trim_trailing_newline(&name);
> >> >       +		strvec_push(names, strbuf_detach(&name, NULL));
> >> >       +	}
> >> >       +
> >> 
> >> You're only getting the name of the command (argv[0]) and not the
> >> full command line, right?  That is a good thing.
> >
> > Roughly. The name can be reset by the process itself (that's what
> > happened, I guess, in the tmux case I pasted below) but by default it's
> > argv[0]. It's also truncated to 15ch or something.
> 
> 16 including the \0. See prctl(2). Linux has two different ways to
> set/get the name, one is the argv method, the other is
> prctl(PR_SET_NAME). They don't need to match at all. The ps(1) utility
> and some top-like utilities allow you to switch between viewing the two
> versions.
> 
> As noted in the linked manual pages you'll also potentially need to deal
> with multithreaded programs having different names for each thread.
> 
> I don't think we use this now, but FWIW one thing I've wanted to do for
> a while was to have the progress.c code update this, so you see if git's
> at N% counting objects or whatever in top.
> 
> >> > +#ifdef HAVE_PROCFS_LINUX
> >> > +	/*
> >> > +	 * NEEDSWORK: We could gather the entire pstree into an array to match
> >> > +	 * functionality with compat/win32/trace2_win32_process_info.c.
> >> > +	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
> >> > +	 * gather the immediate parent name which is readily accessible from
> >> > +	 * /proc/$(getppid())/comm.
> >> > +	 */
> >> > +	struct strbuf procfs_path = STRBUF_INIT;
> >> > +	struct strbuf name = STRBUF_INIT;
> >> > +
> >> > +	/* try to use procfs if it's present. */
> >> > +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> >> > +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> >> > +		strbuf_release(&procfs_path);
> >> > +		strbuf_trim_trailing_newline(&name);
> >> > +		strvec_push(names, strbuf_detach(&name, NULL));
> >> > +	}
> >> > +
> >> > +	return;
> >> > +#endif
> >> > +	/* NEEDSWORK: add non-procfs-linux implementations here */
> >> > +}
> >> 
> >> Perhaps this has already been discussed, but would it be better
> >> to have a "compat/linux/trace2_linux_process_info.c"
> >> or "compat/procfs/trace2_procfs_process_info.c" source file and
> >> only compile it in Linux-compatible builds -- rather than #ifdef'ing
> >> the source.  This is a highly platform-specific feature.
> >> 
> >> For example, if I convert the Win32 version to use your new event,
> >> I wouldn't want to move the code.
> >> 
> >> I just noticed that you have both "BASIC_CFLAGS+=" and a "COMPAT_OBSJ+="
> >> lines.  If you made this source file procfs-specific, you wouldn't need
> >> the ifdef and you could avoid the new CFLAG.
> 
> 
> In general we've preferred not using ifdefs at all except for the small
> bits that absolutely need it.
> 
> So e.g. in this case the whole code should compile on non-Linux, we just
> need a small boolean guard somewhere to check what the platform is.
> 
> It means we don't have significant pieces of code that don't compile
> except on platform X. It's easy to get into your code not compiling if
> you overuse ifdefs.

Hmm. I see what you mean.

However, since win32 is already using some conditionally compiling code,
that's the approach I went for. It's still running CI to make sure I
didn't break Windows in the process, but I'll post it (and the Actions
runs) later today, assuming it passed.

Yes, the implementation I wrote would compile on any platform, but as I
understand it, other platforms may need drastically different
implementations which may not compile as easily as this. So, I'm not
sure if it's really appropriate to do run-time platform checks here
(which is what I think you were describing).

Anyway, I'll look forward to seeing what you folks think of the next
iteration (again, hopefully later today).

 - Emily

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v6 0/2] tr2: log parent process name
  2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
                           ` (2 preceding siblings ...)
  2021-06-28 16:45         ` Jeff Hostetler
@ 2021-07-22  1:27         ` Emily Shaffer
  2021-07-22  1:27           ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
                             ` (2 more replies)
  3 siblings, 3 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-07-22  1:27 UTC (permalink / raw)
  To: git
  Cc: Emily Shaffer, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff Hostetler, Bagas Sanjaya, Randall S. Becker,
	Jonathan Nieder

Since v5, I reshuffled some of the platform-specific compilation recipe
around per Jeff H's comments on v5. I think these are at odds with
Ævar's comments, though, so I guess let's take a look and see how
strongly we feel? :)

Otherwise I also made the logic inversion Junio suggested to make the
logic easier to follow in procinfo.c:trace2_collect_process_info.

Thanks,
 - Emily

Emily Shaffer (2):
  tr2: make process info collection platform-generic
  tr2: log parent process name

 Documentation/technical/api-trace2.txt | 14 +++++++
 Makefile                               |  4 ++
 compat/linux/procinfo.c                | 55 ++++++++++++++++++++++++++
 compat/stub/procinfo.c                 | 11 ++++++
 config.mak.uname                       |  3 ++
 t/t0210/scrub_normal.perl              |  6 +++
 t/t0211/scrub_perf.perl                |  5 +++
 t/t0212/parse_events.perl              |  5 ++-
 trace2.c                               | 13 ++++++
 trace2.h                               | 16 +++++---
 trace2/tr2_tgt.h                       |  3 ++
 trace2/tr2_tgt_event.c                 | 21 ++++++++++
 trace2/tr2_tgt_normal.c                | 19 +++++++++
 trace2/tr2_tgt_perf.c                  | 16 ++++++++
 14 files changed, 184 insertions(+), 7 deletions(-)
 create mode 100644 compat/linux/procinfo.c
 create mode 100644 compat/stub/procinfo.c

Range-diff against v5:
-:  ---------- > 1:  80084448e4 tr2: make process info collection platform-generic
1:  7a7e1ebbfa ! 2:  485f9a24f0 tr2: log parent process name
    @@ Documentation/technical/api-trace2.txt: about specific error arguments.
      	This event contains the command name for this git process
      	and the hierarchy of commands from parent git processes.
     
    - ## Makefile ##
    -@@ Makefile: ifneq ($(PROCFS_EXECUTABLE_PATH),)
    - 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
    - endif
    - 
    -+ifdef HAVE_PROCFS_LINUX
    -+	BASIC_CFLAGS += -DHAVE_PROCFS_LINUX
    -+	COMPAT_OBJS += compat/procinfo.o
    -+endif
    -+
    - ifdef HAVE_NS_GET_EXECUTABLE_PATH
    - 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
    - endif
    -
    - ## compat/procinfo.c (new) ##
    + ## compat/linux/procinfo.c (new) ##
     @@
     +#include "cache.h"
     +
    @@ compat/procinfo.c (new)
     +
     +static void get_ancestry_names(struct strvec *names)
     +{
    -+#ifdef HAVE_PROCFS_LINUX
     +	/*
     +	 * NEEDSWORK: We could gather the entire pstree into an array to match
     +	 * functionality with compat/win32/trace2_win32_process_info.c.
    @@ compat/procinfo.c (new)
     +	}
     +
     +	return;
    -+#endif
     +	/* NEEDSWORK: add non-procfs-linux implementations here */
     +}
     +
    @@ compat/procinfo.c (new)
     +
     +		get_ancestry_names(&names);
     +
    -+		if (names.nr == 0) {
    -+			strvec_clear(&names);
    -+			return;
    -+		}
    -+
    -+		trace2_cmd_ancestry(names.v);
    -+
    ++		if (names.nr)
    ++			trace2_cmd_ancestry(names.v);
     +		strvec_clear(&names);
     +	}
     +
    @@ config.mak.uname: ifeq ($(uname_S),Linux)
      	FREAD_READS_DIRECTORIES = UnfortunatelyYes
      	BASIC_CFLAGS += -DHAVE_SYSINFO
      	PROCFS_EXECUTABLE_PATH = /proc/self/exe
    -+	HAVE_PROCFS_LINUX = YesPlease
    ++	HAVE_PLATFORM_PROCINFO = YesPlease
    ++	COMPAT_OBJS += compat/linux/procinfo.o
      endif
      ifeq ($(uname_S),GNU/kFreeBSD)
      	HAVE_ALLOCA_H = YesPlease
    @@ trace2.h: void trace2_cmd_path_fl(const char *file, int line, const char *pathna
      /*
       * Emit a 'cmd_name' event with the canonical name of the command.
       * This gives post-processors a simple field to identify the command
    -@@ trace2.h: enum trace2_process_info_reason {
    - 	TRACE2_PROCESS_INFO_EXIT,
    - };
    - 
    --#if defined(GIT_WINDOWS_NATIVE)
    -+#if ( defined(GIT_WINDOWS_NATIVE) || defined(HAVE_PROCFS_LINUX) )
    - void trace2_collect_process_info(enum trace2_process_info_reason reason);
    - #else
    - #define trace2_collect_process_info(reason) \
     
      ## trace2/tr2_tgt.h ##
     @@ trace2/tr2_tgt.h: typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
-- 
2.32.0.402.g57bb445576-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v6 1/2] tr2: make process info collection platform-generic
  2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
@ 2021-07-22  1:27           ` Emily Shaffer
  2021-07-22  1:27           ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
  2021-07-22 16:59           ` [PATCH v6 0/2] " Jeff Hostetler
  2 siblings, 0 replies; 45+ messages in thread
From: Emily Shaffer @ 2021-07-22  1:27 UTC (permalink / raw)
  To: git; +Cc: Emily Shaffer, Jonathan Nieder

To pave the way for non-Windows platforms to define
trace2_collect_process_info(), reorganize the stub-or-definition schema
to something which doesn't directly reference Windows.

Platforms which want to collect parent process information in the
future should:

 1. Add an implementation to compat/ (e.g. compat/somearch/procinfo.c)
 2. Add that object to COMPAT_OBJS to config.mak.uname
    (e.g. COMPAT_OBJS += compat/somearch/procinfo.o)
 3. Define HAVE_PLATFORM_PROCINFO in config.mak.uname

In the Windows case, this definition lives in
compat/win32/trace2_win32_process_info.c, which is already conditionally
added to COMPAT_OBJS; so let's add HAVE_PLATFORM_PROCINFO to hint to the
build that compat/stub/procinfo.c should not be used.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
Helped-by: Jonathan Nieder <jrnieder@gmail.com>
---
 Makefile               |  4 ++++
 compat/stub/procinfo.c | 11 +++++++++++
 config.mak.uname       |  1 +
 trace2.h               |  6 ------
 4 files changed, 16 insertions(+), 6 deletions(-)
 create mode 100644 compat/stub/procinfo.c

diff --git a/Makefile b/Makefile
index 93664d6714..bd76689c79 100644
--- a/Makefile
+++ b/Makefile
@@ -1889,6 +1889,10 @@ ifneq ($(PROCFS_EXECUTABLE_PATH),)
 	BASIC_CFLAGS += '-DPROCFS_EXECUTABLE_PATH="$(procfs_executable_path_SQ)"'
 endif
 
+ifndef HAVE_PLATFORM_PROCINFO
+	COMPAT_OBJS += compat/stub/procinfo.o
+endif
+
 ifdef HAVE_NS_GET_EXECUTABLE_PATH
 	BASIC_CFLAGS += -DHAVE_NS_GET_EXECUTABLE_PATH
 endif
diff --git a/compat/stub/procinfo.c b/compat/stub/procinfo.c
new file mode 100644
index 0000000000..12c0a23c9e
--- /dev/null
+++ b/compat/stub/procinfo.c
@@ -0,0 +1,11 @@
+#include "git-compat-util.h"
+
+#include "trace2.h"
+
+/*
+ * Stub. See sample implementations in compat/linux/procinfo.c and
+ * compat/win32/trace2_win32_process_info.c.
+ */
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+}
diff --git a/config.mak.uname b/config.mak.uname
index cb443b4e02..185ff79b14 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -612,6 +612,7 @@ ifneq (,$(findstring MINGW,$(uname_S)))
 	ETAGS_TARGET = ETAGS
 	NO_POSIX_GOODIES = UnfortunatelyYes
 	DEFAULT_HELP_FORMAT = html
+	HAVE_PLATFORM_PROCINFO = YesPlease
 	BASIC_LDFLAGS += -municode
 	COMPAT_CFLAGS += -DNOGDI -Icompat -Icompat/win32
 	COMPAT_CFLAGS += -DSTRIP_EXTENSION=\".exe\"
diff --git a/trace2.h b/trace2.h
index ede18c2e06..0d990db817 100644
--- a/trace2.h
+++ b/trace2.h
@@ -492,13 +492,7 @@ enum trace2_process_info_reason {
 	TRACE2_PROCESS_INFO_EXIT,
 };
 
-#if defined(GIT_WINDOWS_NATIVE)
 void trace2_collect_process_info(enum trace2_process_info_reason reason);
-#else
-#define trace2_collect_process_info(reason) \
-	do {                                \
-	} while (0)
-#endif
 
 const char *trace2_session_id(void);
 
-- 
2.32.0.402.g57bb445576-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v6 2/2] tr2: log parent process name
  2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
  2021-07-22  1:27           ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
@ 2021-07-22  1:27           ` Emily Shaffer
  2021-07-22 21:02             ` Junio C Hamano
  2021-07-22 16:59           ` [PATCH v6 0/2] " Jeff Hostetler
  2 siblings, 1 reply; 45+ messages in thread
From: Emily Shaffer @ 2021-07-22  1:27 UTC (permalink / raw)
  To: git; +Cc: Emily Shaffer

It can be useful to tell who invoked Git - was it invoked manually by a
user via CLI or script? By an IDE?  In some cases - like 'repo' tool -
we can influence the source code and set the GIT_TRACE2_PARENT_SID
environment variable from the caller process. In 'repo''s case, that
parent SID is manipulated to include the string "repo", which means we
can positively identify when Git was invoked by 'repo' tool. However,
identifying parents that way requires both that we know which tools
invoke Git and that we have the ability to modify the source code of
those tools. It cannot scale to keep up with the various IDEs and
wrappers which use Git, most of which we don't know about. Learning
which tools and wrappers invoke Git, and how, would give us insight to
decide where to improve Git's usability and performance.

Unfortunately, there's no cross-platform reliable way to gather the name
of the parent process. If procfs is present, we can use that; otherwise
we will need to discover the name another way. However, the process ID
should be sufficient to look up the process name on most platforms, so
that code may be shareable.

Git for Windows gathers similar information and logs it as a "data_json"
event. However, since "data_json" has a variable format, it is difficult
to parse effectively in some languages; instead, let's pursue a
dedicated "cmd_ancestry" event to record information about the ancestry
of the current process and a consistent, parseable way.

Git for Windows also gathers information about more than one generation
of parent. In Linux further ancestry info can be gathered with procfs,
but it's unwieldy to do so. In the interest of later moving Git for
Windows ancestry logging to the 'cmd_ancestry' event, and in the
interest of later adding more ancestry to the Linux implementation - or
of adding this functionality to other platforms which have an easier
time walking the process tree - let's make 'cmd_ancestry' accept an
array of parentage.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---

Notes:
    Since v3:
    
    Junio and Randall suggested ditching "get_process_name(ppid)" as the API to
    implement on various platforms, and I liked the suggestion a lot. So instead,
    I added "get_ancestry_names(strvec *names)".
    
     - Using a strvec instead of a char** makes cleanup easier, since strvec takes
       care of counting the number of strings in the array for us. Otherwise we'd
       need to include size as a return or out-param in get_ancestry_names(), and
       that's what the util class is for, right? :)
     - I made get_ancestry_names() static instead of putting it into
       git-compat-util.h. I think I had put get_process_name() into that header to
       facilitate non-procfs implementations, but compat/procinfo.c doesn't seem to
       me to indicate "procfs only", and if we do need to implement
       get_process_name() somewhere else, it'll be pretty easy to move it.
     - I added a description of "cmd_ancestry" to
       Documentation/technical/api-trace2.txt. I didn't see any user-facing docs to
       update (for example, "git grep cmd_path" produces only that one doc file).
    
    Thanks, all.
    
     - Emily

 Documentation/technical/api-trace2.txt | 14 +++++++
 compat/linux/procinfo.c                | 55 ++++++++++++++++++++++++++
 config.mak.uname                       |  2 +
 t/t0210/scrub_normal.perl              |  6 +++
 t/t0211/scrub_perf.perl                |  5 +++
 t/t0212/parse_events.perl              |  5 ++-
 trace2.c                               | 13 ++++++
 trace2.h                               | 10 +++++
 trace2/tr2_tgt.h                       |  3 ++
 trace2/tr2_tgt_event.c                 | 21 ++++++++++
 trace2/tr2_tgt_normal.c                | 19 +++++++++
 trace2/tr2_tgt_perf.c                  | 16 ++++++++
 12 files changed, 168 insertions(+), 1 deletion(-)
 create mode 100644 compat/linux/procinfo.c

diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
index 3f52f981a2..8a0b360a0e 100644
--- a/Documentation/technical/api-trace2.txt
+++ b/Documentation/technical/api-trace2.txt
@@ -493,6 +493,20 @@ about specific error arguments.
 }
 ------------
 
+`"cmd_ancestry"`::
+	This event contains the text command name for the parent (and earlier
+	generations of parents) of the current process, in an array ordered from
+	nearest parent to furthest great-grandparent. It may not be implemented
+	on all platforms.
++
+------------
+{
+	"event":"cmd_ancestry",
+	...
+	"ancestry":["bash","tmux: server","systemd"]
+}
+------------
+
 `"cmd_name"`::
 	This event contains the command name for this git process
 	and the hierarchy of commands from parent git processes.
diff --git a/compat/linux/procinfo.c b/compat/linux/procinfo.c
new file mode 100644
index 0000000000..578fed4cd3
--- /dev/null
+++ b/compat/linux/procinfo.c
@@ -0,0 +1,55 @@
+#include "cache.h"
+
+#include "strbuf.h"
+#include "strvec.h"
+#include "trace2.h"
+
+static void get_ancestry_names(struct strvec *names)
+{
+	/*
+	 * NEEDSWORK: We could gather the entire pstree into an array to match
+	 * functionality with compat/win32/trace2_win32_process_info.c.
+	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
+	 * gather the immediate parent name which is readily accessible from
+	 * /proc/$(getppid())/comm.
+	 */
+	struct strbuf procfs_path = STRBUF_INIT;
+	struct strbuf name = STRBUF_INIT;
+
+	/* try to use procfs if it's present. */
+	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
+	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
+		strbuf_release(&procfs_path);
+		strbuf_trim_trailing_newline(&name);
+		strvec_push(names, strbuf_detach(&name, NULL));
+	}
+
+	return;
+	/* NEEDSWORK: add non-procfs-linux implementations here */
+}
+
+void trace2_collect_process_info(enum trace2_process_info_reason reason)
+{
+	if (!trace2_is_enabled())
+		return;
+
+	/* someday we may want to write something extra here, but not today */
+	if (reason == TRACE2_PROCESS_INFO_EXIT)
+		return;
+
+	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
+		/*
+		 * NEEDSWORK: we could do the entire ptree in an array instead,
+		 * see compat/win32/trace2_win32_process_info.c.
+		 */
+		struct strvec names = STRVEC_INIT;
+
+		get_ancestry_names(&names);
+
+		if (names.nr)
+			trace2_cmd_ancestry(names.v);
+		strvec_clear(&names);
+	}
+
+	return;
+}
diff --git a/config.mak.uname b/config.mak.uname
index 185ff79b14..d3bd4c6843 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -58,6 +58,8 @@ ifeq ($(uname_S),Linux)
 	FREAD_READS_DIRECTORIES = UnfortunatelyYes
 	BASIC_CFLAGS += -DHAVE_SYSINFO
 	PROCFS_EXECUTABLE_PATH = /proc/self/exe
+	HAVE_PLATFORM_PROCINFO = YesPlease
+	COMPAT_OBJS += compat/linux/procinfo.o
 endif
 ifeq ($(uname_S),GNU/kFreeBSD)
 	HAVE_ALLOCA_H = YesPlease
diff --git a/t/t0210/scrub_normal.perl b/t/t0210/scrub_normal.perl
index c65d1a815e..7cc4de392a 100644
--- a/t/t0210/scrub_normal.perl
+++ b/t/t0210/scrub_normal.perl
@@ -42,6 +42,12 @@
 	# so just omit it for testing purposes.
 	# print "cmd_path _EXE_\n";
     }
+    elsif ($line =~ m/^cmd_ancestry/) {
+	# 'cmd_ancestry' is not implemented everywhere, so for portability's
+	# sake, skip it when parsing normal.
+	#
+	# print "$line";
+    }
     else {
 	print "$line";
     }
diff --git a/t/t0211/scrub_perf.perl b/t/t0211/scrub_perf.perl
index 351af7844e..d164b750ff 100644
--- a/t/t0211/scrub_perf.perl
+++ b/t/t0211/scrub_perf.perl
@@ -44,6 +44,11 @@
 	# $tokens[$col_rest] = "_EXE_";
 	goto SKIP_LINE;
     }
+    elsif ($tokens[$col_event] =~ m/cmd_ancestry/) {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere,
+	# so skip it.
+	goto SKIP_LINE;
+    }
     elsif ($tokens[$col_event] =~ m/child_exit/) {
 	$tokens[$col_rest] =~ s/ pid:\d* / pid:_PID_ /;
     }
diff --git a/t/t0212/parse_events.perl b/t/t0212/parse_events.perl
index 6584bb5634..b6408560c0 100644
--- a/t/t0212/parse_events.perl
+++ b/t/t0212/parse_events.perl
@@ -132,7 +132,10 @@
 	# just omit it for testing purposes.
 	# $processes->{$sid}->{'path'} = "_EXE_";
     }
-    
+    elsif ($event eq 'cmd_ancestry') {
+	# 'cmd_ancestry' is platform-specific and not implemented everywhere, so
+	# just skip it for testing purposes.
+    }
     elsif ($event eq 'cmd_name') {
 	$processes->{$sid}->{'name'} = $line->{'name'};
 	$processes->{$sid}->{'hierarchy'} = $line->{'hierarchy'};
diff --git a/trace2.c b/trace2.c
index 256120c7fd..b9b154ac44 100644
--- a/trace2.c
+++ b/trace2.c
@@ -260,6 +260,19 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname)
 			tgt_j->pfn_command_path_fl(file, line, pathname);
 }
 
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	struct tr2_tgt *tgt_j;
+	int j;
+
+	if (!trace2_enabled)
+		return;
+
+	for_each_wanted_builtin (j, tgt_j)
+		if (tgt_j->pfn_command_ancestry_fl)
+			tgt_j->pfn_command_ancestry_fl(file, line, parent_names);
+}
+
 void trace2_cmd_name_fl(const char *file, int line, const char *name)
 {
 	struct tr2_tgt *tgt_j;
diff --git a/trace2.h b/trace2.h
index 0d990db817..9b7286c572 100644
--- a/trace2.h
+++ b/trace2.h
@@ -133,6 +133,16 @@ void trace2_cmd_path_fl(const char *file, int line, const char *pathname);
 
 #define trace2_cmd_path(p) trace2_cmd_path_fl(__FILE__, __LINE__, (p))
 
+/*
+ * Emit an 'ancestry' event with the process name of the current process's
+ * parent process.
+ * This gives post-processors a way to determine what invoked the command and
+ * learn more about usage patterns.
+ */
+void trace2_cmd_ancestry_fl(const char *file, int line, const char **parent_names);
+
+#define trace2_cmd_ancestry(v) trace2_cmd_ancestry_fl(__FILE__, __LINE__, (v))
+
 /*
  * Emit a 'cmd_name' event with the canonical name of the command.
  * This gives post-processors a simple field to identify the command
diff --git a/trace2/tr2_tgt.h b/trace2/tr2_tgt.h
index 7b90469212..1f66fd6573 100644
--- a/trace2/tr2_tgt.h
+++ b/trace2/tr2_tgt.h
@@ -27,6 +27,8 @@ typedef void(tr2_tgt_evt_error_va_fl_t)(const char *file, int line,
 
 typedef void(tr2_tgt_evt_command_path_fl_t)(const char *file, int line,
 					    const char *command_path);
+typedef void(tr2_tgt_evt_command_ancestry_fl_t)(const char *file, int line,
+						const char **parent_names);
 typedef void(tr2_tgt_evt_command_name_fl_t)(const char *file, int line,
 					    const char *name,
 					    const char *hierarchy);
@@ -108,6 +110,7 @@ struct tr2_tgt {
 	tr2_tgt_evt_atexit_t                    *pfn_atexit;
 	tr2_tgt_evt_error_va_fl_t               *pfn_error_va_fl;
 	tr2_tgt_evt_command_path_fl_t           *pfn_command_path_fl;
+	tr2_tgt_evt_command_ancestry_fl_t	*pfn_command_ancestry_fl;
 	tr2_tgt_evt_command_name_fl_t           *pfn_command_name_fl;
 	tr2_tgt_evt_command_mode_fl_t           *pfn_command_mode_fl;
 	tr2_tgt_evt_alias_fl_t                  *pfn_alias_fl;
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 6353e8ad91..578a9a5287 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -261,6 +261,26 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	jw_release(&jw);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	const char *parent_name = NULL;
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	event_fmt_prepare(event_name, file, line, NULL, &jw);
+	jw_object_inline_begin_array(&jw, "ancestry");
+
+	while ((parent_name = *parent_names++))
+		jw_array_string(&jw, parent_name);
+
+	jw_end(&jw); /* 'ancestry' array */
+	jw_end(&jw); /* event object */
+
+	tr2_dst_write_line(&tr2dst_event, &jw.json);
+	jw_release(&jw);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -584,6 +604,7 @@ struct tr2_tgt tr2_tgt_event = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 31b602c171..a5751c8864 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -160,6 +160,24 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *parent_name = NULL;
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	/* cmd_ancestry parent <- grandparent <- great-grandparent */
+	strbuf_addstr(&buf_payload, "cmd_ancestry ");
+	while ((parent_name = *parent_names++)) {
+		strbuf_addstr(&buf_payload, parent_name);
+		/* if we'll write another one after this, add a delimiter */
+		if (parent_names && *parent_names)
+			strbuf_addstr(&buf_payload, " <- ");
+	}
+
+	normal_io_write_fl(file, line, &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -306,6 +324,7 @@ struct tr2_tgt tr2_tgt_normal = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index a8018f18cc..af4d65a0a5 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -253,6 +253,21 @@ static void fn_command_path_fl(const char *file, int line, const char *pathname)
 	strbuf_release(&buf_payload);
 }
 
+static void fn_command_ancestry_fl(const char *file, int line, const char **parent_names)
+{
+	const char *event_name = "cmd_ancestry";
+	struct strbuf buf_payload = STRBUF_INIT;
+
+	strbuf_addstr(&buf_payload, "ancestry:[");
+	/* It's not an argv but the rules are basically the same. */
+	sq_append_quote_argv_pretty(&buf_payload, parent_names);
+	strbuf_addch(&buf_payload, ']');
+
+	perf_io_write_fl(file, line, event_name, NULL, NULL, NULL, NULL,
+			 &buf_payload);
+	strbuf_release(&buf_payload);
+}
+
 static void fn_command_name_fl(const char *file, int line, const char *name,
 			       const char *hierarchy)
 {
@@ -532,6 +547,7 @@ struct tr2_tgt tr2_tgt_perf = {
 	fn_atexit,
 	fn_error_va_fl,
 	fn_command_path_fl,
+	fn_command_ancestry_fl,
 	fn_command_name_fl,
 	fn_command_mode_fl,
 	fn_alias_fl,
-- 
2.32.0.402.g57bb445576-goog


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v6 0/2] tr2: log parent process name
  2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
  2021-07-22  1:27           ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
  2021-07-22  1:27           ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
@ 2021-07-22 16:59           ` Jeff Hostetler
  2 siblings, 0 replies; 45+ messages in thread
From: Jeff Hostetler @ 2021-07-22 16:59 UTC (permalink / raw)
  To: Emily Shaffer, git
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Bagas Sanjaya, Randall S. Becker, Jonathan Nieder



On 7/21/21 9:27 PM, Emily Shaffer wrote:
> Since v5, I reshuffled some of the platform-specific compilation recipe
> around per Jeff H's comments on v5. I think these are at odds with
> Ævar's comments, though, so I guess let's take a look and see how
> strongly we feel? :)
> 
> Otherwise I also made the logic inversion Junio suggested to make the
> logic easier to follow in procinfo.c:trace2_collect_process_info.
> 
> Thanks,
>   - Emily
...

LGTM

Thanks,
Jeff


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v6 2/2] tr2: log parent process name
  2021-07-22  1:27           ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
@ 2021-07-22 21:02             ` Junio C Hamano
  0 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2021-07-22 21:02 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: git

Emily Shaffer <emilyshaffer@google.com> writes:

> Unfortunately, there's no cross-platform reliable way to gather the name
> of the parent process. If procfs is present, we can use that; otherwise
> we will need to discover the name another way. However, the process ID
> should be sufficient to look up the process name on most platforms, so
> that code may be shareable.

Is the sentence that begin with "However" still relevant?  The
latest get_ancestry_names() does not add anything when the result
read from the procfs is unusable, but the sentence gives a false
impression that it may somehow fall back to show the PID.

> Git for Windows also gathers information about more than one generation
> of parent. In Linux further ancestry info can be gathered with procfs,
> but it's unwieldy to do so. In the interest of later moving Git for
> Windows ancestry logging to the 'cmd_ancestry' event, and in the
> interest of later adding more ancestry to the Linux implementation - or
> of adding this functionality to other platforms which have an easier
> time walking the process tree - let's make 'cmd_ancestry' accept an
> array of parentage.

Clearly written.  Nice.

> +`"cmd_ancestry"`::
> +	This event contains the text command name for the parent (and earlier
> +	generations of parents) of the current process, in an array ordered from
> +	nearest parent to furthest great-grandparent. It may not be implemented
> +	on all platforms.
> ++
> +------------
> +{
> +	"event":"cmd_ancestry",
> +	...
> +	"ancestry":["bash","tmux: server","systemd"]
> +}
> +------------

OK.

> diff --git a/compat/linux/procinfo.c b/compat/linux/procinfo.c
> new file mode 100644
> index 0000000000..578fed4cd3
> --- /dev/null
> +++ b/compat/linux/procinfo.c
> @@ -0,0 +1,55 @@
> +#include "cache.h"
> +
> +#include "strbuf.h"
> +#include "strvec.h"
> +#include "trace2.h"
> +
> +static void get_ancestry_names(struct strvec *names)
> +{
> +	/*
> +	 * NEEDSWORK: We could gather the entire pstree into an array to match
> +	 * functionality with compat/win32/trace2_win32_process_info.c.
> +	 * To do so, we may want to examine /proc/<pid>/stat. For now, just
> +	 * gather the immediate parent name which is readily accessible from
> +	 * /proc/$(getppid())/comm.
> +	 */
> +	struct strbuf procfs_path = STRBUF_INIT;
> +	struct strbuf name = STRBUF_INIT;
> +
> +	/* try to use procfs if it's present. */
> +	strbuf_addf(&procfs_path, "/proc/%d/comm", getppid());
> +	if (strbuf_read_file(&name, procfs_path.buf, 0)) {
> +		strbuf_release(&procfs_path);
> +		strbuf_trim_trailing_newline(&name);
> +		strvec_push(names, strbuf_detach(&name, NULL));

At this point, we successfully read from /proc/$(ppid)/comm
and pushed the result into names strvec.  I think you would want to
have an explicit "return;" here.

Alternatively, you could put the rest of the function in the
corresponding "else" clause, but I think an early return after each
method successfully collects the result would make a lot more sense.

> +	}
> +
> +	return;
> +	/* NEEDSWORK: add non-procfs-linux implementations here */

This looks the other way around.  A future non-procfs-linux
implementation written here will be unreachable code ;-)

Just lose "return;" from here, have one in the if(){} body, and keep
the "here is where you add more/other implementations" comment and
we'd be OK, I think.

> +void trace2_collect_process_info(enum trace2_process_info_reason reason)
> +{
> +	if (!trace2_is_enabled())
> +		return;
> +
> +	/* someday we may want to write something extra here, but not today */
> +	if (reason == TRACE2_PROCESS_INFO_EXIT)
> +		return;
> +
> +	if (reason == TRACE2_PROCESS_INFO_STARTUP) {
> +		/*
> +		 * NEEDSWORK: we could do the entire ptree in an array instead,
> +		 * see compat/win32/trace2_win32_process_info.c.
> +		 */
> +		struct strvec names = STRVEC_INIT;
> +
> +		get_ancestry_names(&names);
> +
> +		if (names.nr)
> +			trace2_cmd_ancestry(names.v);
> +		strvec_clear(&names);
> +	}
> +
> +	return;
> +}

Good.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-07-22 21:02 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
2021-05-07  3:25 ` Bagas Sanjaya
2021-05-07 17:09 ` Emily Shaffer
2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
2021-05-11 21:31   ` Junio C Hamano
2021-05-14 22:06   ` Emily Shaffer
2021-05-16  3:48     ` Junio C Hamano
2021-05-17 20:17       ` Emily Shaffer
2021-05-11 17:28 ` Jeff Hostetler
2021-05-14 22:07   ` Emily Shaffer
2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
2021-05-20 21:36   ` Randall S. Becker
2021-05-20 23:23     ` Emily Shaffer
2021-05-21 13:20       ` Randall S. Becker
2021-05-21 16:24         ` Randall S. Becker
2021-05-21  2:09   ` Junio C Hamano
2021-05-21 19:02     ` Emily Shaffer
2021-05-21 23:22       ` Junio C Hamano
2021-05-24 18:37         ` Emily Shaffer
2021-05-21 19:15   ` Jeff Hostetler
2021-05-21 20:05     ` Emily Shaffer
2021-05-21 20:23       ` Randall S. Becker
2021-05-22 11:18       ` Jeff Hostetler
2021-05-24 23:33       ` Ævar Arnfjörð Bjarmason
2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
2021-05-24 20:49     ` Emily Shaffer
2021-05-25  3:54     ` Junio C Hamano
2021-05-25 13:33       ` Randall S. Becker
2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
2021-06-08 20:56       ` Emily Shaffer
2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
2021-06-08 22:16         ` Randall S. Becker
2021-06-08 22:24           ` Emily Shaffer
2021-06-08 22:39             ` Randall S. Becker
2021-06-09 20:17               ` Emily Shaffer
2021-06-16  8:42         ` Junio C Hamano
2021-06-28 16:45         ` Jeff Hostetler
2021-06-29 23:51           ` Emily Shaffer
2021-06-30  6:10             ` Ævar Arnfjörð Bjarmason
2021-07-22  0:21               ` Emily Shaffer
2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
2021-07-22  1:27           ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
2021-07-22  1:27           ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
2021-07-22 21:02             ` Junio C Hamano
2021-07-22 16:59           ` [PATCH v6 0/2] " Jeff Hostetler

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.version-control.git
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.version-control.git
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git