[RFC PATCH v1] telemetry design overview (part 1)

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC PATCH v1] telemetry design overview (part 1)
@ 2018-06-07 14:53 git
  2018-06-07 14:53 ` [RFC PATCH v1] telemetry: design documenation git
  2018-06-07 21:10 ` [RFC PATCH v1] telemetry design overview (part 1) Johannes Sixt
  0 siblings, 2 replies; 27+ messages in thread
From: git @ 2018-06-07 14:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".

I'm going to break this up into several parts rather than sending one large
patch series.  I think it is easier to review in pieces and in stages.

Part 1 contains the overall design documentation.
Part 2 will contain the basic telemetry event mechanism and the cmd_start
and cmd_exit events.

I'll post part 2 shortly -- as soon as I can convert my tests from python
to perl.  :-)

Jeff Hostetler (1):
  telemetry: design documenation

 Documentation/technical/telemetry.txt | 475 ++++++++++++++++++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

-- 
2.9.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v1] telemetry: design documenation
  2018-06-07 14:53 [RFC PATCH v1] telemetry design overview (part 1) git
@ 2018-06-07 14:53 ` git
  2018-06-08 11:06   ` Ævar Arnfjörð Bjarmason
  2018-06-07 21:10 ` [RFC PATCH v1] telemetry design overview (part 1) Johannes Sixt
  1 sibling, 1 reply; 27+ messages in thread
From: git @ 2018-06-07 14:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create design documentation to describe the telemetry feature.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Documentation/technical/telemetry.txt | 475 ++++++++++++++++++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

diff --git a/Documentation/technical/telemetry.txt b/Documentation/technical/telemetry.txt
new file mode 100644
index 0000000..0a708ad
--- /dev/null
+++ b/Documentation/technical/telemetry.txt
@@ -0,0 +1,475 @@
+Telemetry Design Notes
+======================
+
+The telemetry feature allows Git to generate structured telemetry data
+for executed commands.  Data includes command line arguments, execution
+times, error codes and messages, and information about child processes.
+
+Structued data is produced in a JSON-like format.  (See the UTF-8 related
+"limitations" described in json-writer.h)
+
+Telemetry data can be written to a local file or sent to a dynamically
+loaded shared library via a plugin API.
+
+The telemetry feature is similar to the existing trace API (defined in
+Documentation/technical/api-trace.txt).  Telemetry events are generated
+thoughout the life of a Git command just like trace messages.  But where
+as trace messages are essentially developer debug messages, telemetry
+events are intended for logging and automated analysis.
+
+The goal of the telemetry feature is to be able to gather usage data across
+a group of production users to identify real-world performance problems in
+production.  Additionally, it might help identify common user errors and
+guide future user training.
+
+By default, telemetry is disabled.  Telemetry is controlled using config
+settings (see "telemetry.*" in Documentation/config.txt).
+
+
+Telemetry Events
+================
+
+Telemetry data is generated as a series of events.  Each event is written
+as a self-describing JSON object.
+
+Events: cmd_start and cmd_exit
+------------------------------
+
+The `cmd_start` event is emitted the very beginning of the git.exe process
+in cmd_main() and `cmd_exit` event is emitted at the end of the process in
+the atexit cleanup routine.
+
+For example, running "git version" produces:
+
+{
+  "event_name": "cmd_start",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509976086000,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460"
+}
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509980903391,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460",
+  "is_interactive": false,
+  "exit_code": 0,
+  "elapsed_time_core": 0.004814,
+  "elapsed_time_total": 0.004817,
+  "builtin": {
+    "name": "version"
+  }
+}
+
+Fields common to all events:
+ * `event_name` is the name of the event.
+ * `argv` is the array of command line arguments.
+ * `clock` is the time of the event in nanoseconds since the epoch.
+ * `pid` is the process id.
+ * `git_version` is the git version string.
+ * `telemetry_version` is the version of the telemetry format.
+ * `session_id` is described in a later section.
+
+Additional fields in cmd_exit:
+ * `is_interactive` is true if git.exe spawned an interactive child process,
+      such as a pager, editor, prompt, or gui tool.
+ * `exit_code` is the value passed to exit() from main().
+ * `error_message` (not shown) is the array of error messages.
+ * `elapsed-core-time` measures the time in seconds until exit() was called.
+ * `elapsed-total-time` measures the time until the atexit() routine starts
+      (which will include time spend in other atexit() routines cleaning up
+      child processes and etc.).
+ * `alias` (not shown) the updated argv after alias expansion.
+ * `builtin.name` is the canonical command name (from the cmd_struct[]
+      table) of a builtin command.
+ * `builtin.mode` (not shown) is shown for some commands that have different
+      major modes and performance times.  For example, checkout can switch
+      branches or repair a single file.
+ * `child_summary` (not shown) is described in a later section.
+ * `timers` (not shown) is described in a later section.
+ * `aux` (not shown) is described in a later section.
+
+
+Events: child_start and child_exit
+----------------------------------
+
+The child-start event is emitted just before a child process is started.
+It includes a unique child-id and the child's command line arguments.
+
+The child-exit event is emitted after a child process exits and has
+been reaped.  This event extends the start event with the child's exit
+status and elapsed time.
+
+For example, during a "git fetch origin", git.exe runs gc in the background
+and these events are emitted by the fetch process before and after the
+child gc process:
+
+{
+  "event_name": "child_start",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979478738132887,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "child_detail": {
+    "number": 3,
+    "class": "gc",
+    "argv": [
+      "gc",
+      "--auto"
+    ]
+  }
+}
+{
+  "event_name": "child_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979479024707085,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "child_detail": {
+    "number": 3,
+    "class": "gc",
+    "argv": [
+      "gc",
+      "--auto"
+    ],
+    "pid": 19608,
+    "exit_code": 0,
+    "elapsed_time": 0.286574
+  }
+}
+
+The common fields (`event_name` through `session_id`) are the same as
+in the `cmd_start` and `cmd_exit` events and refer to the parent process.
+
+The `child_detail` structure describes the child process:
+ * `number` is a simple counter incremented for each child event.
+ * `class` is a rough characterization of the type of child process.  Child
+      class is described in a later section.
+ * `argv` is the child's command line.
+ * `pid` is the child's process id.
+ * `exit_code` is the exit code of the child process.
+ * `elapsed_time` measures the time in seconds observed by the parent process
+      between the child_start and child_exit events.  This will be greater
+      than the elapsed time that the child internally observes because of
+      process startup and shutdown overhead.  For synchronous child processes,
+      this is the time that the parent spent waiting for the child.
+
+
+Event: perf
+-----------
+
+Perf events are a debugging aid to report on suspected hot spots in the
+code and collect data from production users.  This is intended to be a
+generic message with context-specific data.  New messages may be added
+in the future as the need arises to help with debugging.
+
+Perf events are organized by category, much like the various GIT_TRACE_*
+environment variables.  The "telemetry.perf" config setting can be set to
+true or to a string of the perf categories that should be enabled.
+
+Currently, the categories "index" and "status" are defined.  Others may
+be added later.
+
+For example, could be used to instrument read_index_from():
+
+{
+  "event_name": "perf",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  "clock": 1525979478735438090,
+  "pid": 18332,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525979470792747000-18332",
+  "category": "index",
+  "label": "read_index_from",
+  "elapsed_time": 0.001536,
+  "data": {
+    "path": ".git/index",
+    "cache_nr": 3311
+  }
+}
+
+The common fields (`event_name` through `session_id`) are the same as
+in the `cmd_start` and `cmd_exit` events.
+
+All `perf` events also have:
+ * `category` is descriptive category and used like different GIT_TRACE_*
+      variables.
+ * `label` is the name of a function or region of interest.
+ * `elapsed_time` measures the time in seconds spent in the function or
+      region.
+ * `data` is an optional structure of context-specific (debug) data.
+
+
+More Details for Event Fields
+=============================
+
+Field: session_id
+-----------------
+
+A session_id (SID) is a cheap, unique-enough string to associate all of
+the events generated by a single process.  They incorporate the inherited
+SID from their parent process.
+
+SIDs should be considerd opaque data, but are constructed as:
+
+    [<parent_sid>]/<start_time>-<pid>
+
+This scheme is used rather than a simple PID or {PPID, PID} because PIDs
+are recycled by the OS (after sufficient time).  Also, if telemetry data
+is aggregated from multiple systems, PIDs are not sufficient.
+
+This also has the advantage of allowing telemetry analysis to associate
+Git child processes with their Git parent process even if there are
+intermediate shell processes.
+
+Note: we could use UUIDs or GUIDs for this, but that seemed overkill at
+this point.  It also required platform-specific code to generate which
+muddied up the code.
+
+
+Field: child_details.class
+--------------------------
+
+enum telemetry_class contains a set of classification values.  These attempt
+to roughly classify a child process from the point of view of the parent
+process.
+ * unclass: unclassified
+ * unclass-async: unclassified asynchronous child (see sub-process.c)
+ * alias: an alias expansion using a child process
+ * hook: a hook process that may do anything (including prompting, scanning,
+   and network operations) and wildly affect command run times.
+ * pager: a pager (indicating an interactive command)
+ * editor: an editor (indicating an interactive command)
+ * prompt: a prompt or credential or askpass process (also interactive)
+ * network: a command that might do network operations
+ * convert: an attribute filter process such as LFS or CRLF
+ * tool: a tool, such as difftool or mergetool, that may be interactive
+ * gc: an auto gc process
+
+struct child_process has been extended to have a telemetry_class field.  Some
+callers of start_command() and/or run_command() have been updated to suggest
+a classification when appropriate.  For example, child processes created by
+launch_editor() are marked with TELEMETRY_CLASS__EDITOR.
+
+The primary intent is to identify which child processes are likely to block
+on the user or network.  For example, "git commit" and "git commit -m <msg>"
+will have different performance characteristics because the former has to
+launch an editor and wait for the user to compose a message.  The former will
+have a child event which child_detail.class=editor and its exit event will
+have child_summary.editor.count=1 and child_summary.editor.elapsed_time=<t>.
+Analysis tools can choose to report average commit time for non-interactive
+commands or subtract the editor elapsed time from the commit elapsed time.
+
+For example, fetch runs rev-list, ssh, index-pack, and maybe (auto) gc.  The
+ssh child is marked as TELEMETRY_CLASS__NETWORK and the gc child is marked
+as TELEMETRY_CLASS__GC (since it is optional and possibly time consuming).
+The others are left unclassified (TELEMETRY_CLASS__UNCLASS) since we don't
+expect blocking operations.
+
+
+Field: child_summary
+--------------------
+
+The `child_summary` structure within the `cmd_exit` event summarizes the
+child processes created by the parent process.
+
+For example, "git fetch origin" spawns 4 child processes:
+
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "fetch",
+    "origin"
+  ],
+  ...
+  "child_summary": {
+    "unclass": {
+      "count": 2,
+      "elapsed_time": 0.496387
+    },
+    "network": {
+      "count": 1,
+      "elapsed_time": 7.712466
+    },
+    "gc": {
+      "count": 1,
+      "elapsed_time": 0.286574
+    }
+  },
+  "exit_code": 0,
+  "elapsed_time_core": 8.232965,
+  "elapsed_time_total": 8.232968,
+  "builtin": {
+    "name": "fetch"
+  }
+}
+
+Within each `child_summary[<class>]` is a count of the number of child
+processes and the cummulative elapsed time.
+
+Analysis tools interested in a net-elapsed-time of the parent process may
+want to subtract the elapsed time of the child processes.  This approach is
+mostly valid, since most child processes are run synchronously.  However,
+some processes are run asynchronously, such as the pager and processes in
+the unclass-async pool, so care should be taken.
+
+
+Field: timers
+-------------
+
+A "telemetry timer" is a stopwatch-like timer with a counter.  It can be
+used to time a specific region of code, such as an expensive computation
+within the body of a larger loop.  It defines a generic way to collect
+perf data without causing an telemetry perf event to be fired on each
+iteration.  Instead, a timer is registered with the telemetry layer and
+the data will be included in a "timers" sub-section in the `cmd_exit` event.
+
+For example, a timer was added to do_read_index() and do_write_index()
+to measure the time spent reading and writing the index.
+
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "status"
+  ],
+  ...
+  "timers": {
+    "do_read_index": {
+      "count": 1,
+      "total": 0.000740,
+      "min": 0.000740,
+      "max": 0.000740,
+      "avg": 0.000740
+    },
+    "do_write_index": {
+      "count": 1,
+      "total": 0.004724,
+      "min": 0.004724,
+      "max": 0.004724,
+      "avg": 0.004724
+    }
+  },
+  "exit_code": 0,
+  "elapsed_time_core": 0.049865,
+  "elapsed_time_total": 0.049867,
+  "builtin": {
+    "name": "status"
+  }
+}
+
+The `timers` structure contains a named member for each defined timer.
+Within each individual timer, we have:
+ * `count` is the number of times it was started/stopped.
+ * `total` is the total time the timer was running.
+ * `min` is the shortest interval.
+ * `max` is the longest interval.
+ * `avg` is the average interval.
+
+
+Field: aux
+---------------
+
+The `aux` structure within the `cmd_exit` event contains additional
+information about the process.  This is intended as a generic container for
+various fields, such as important config settings or repo data shape that
+may affect performance or help identify the repository for aggregation
+purposes.
+
+{
+  "event_name": "cmd_exit",
+  ...
+  "aux": {
+    "remote_origin_url": "git@github.com:git/git.git",
+    "index_count": 3311,
+    "sparse_checkout_count": 3
+  },
+  ...
+}
+
+Other fields (and even sub-structures) can be added to this container
+as needed.
+
+
+Telemetry Destination
+=====================
+
+Telemetry events are sent to a "destination".  This can be a file or a
+plugin.  Telemetry is disabled if a destination is not set.
+
+telemetry.path
+--------------
+
+If the config setting "telemetry.path" contains a pathname, telemetry
+events will be appended to that file using the builtin destination
+handler.  (File rotation is beyond the scope of this design.)
+
+Events are written as a series of JSON records.  When "telemetry.pretty"
+is false, each event record will be written on one line.
+
+(All of the examples in this document were prepared with "telemetry.pretty"
+set to true.)
+
+telemetry.plugin
+----------------
+
+If the config setting "telemetry.plugin" contains the pathname to a shared
+library, the library will be dynamically loaded during start up and events
+will be sent to it using the plugin API.
+
+This plugin model allows an organization to define a custom or private
+telemetry solution while using a stock version of Git.
+
+For example, on Windows, it allows telemetry events to go directly to the
+kernel via the plugin using the high performance Event Tracing for Windows
+(ETW) facility.
+
+The contrib/telemetry-plugin-examples directory contains two example
+plugins:
+ * A trivial log to stderr
+ * A trivial ETW writer
+
+
+GDPR and Privacy
+================
+
+The telemetry feature can log possibly sensitive user information (such as
+command line arguments, which may contain URLs, user names, and file names).
+
+The base telemetry feature can write telemetry data to a file on the system.
+
+The plugin facility can be used to publish the telemetry data to more general
+destinations (such as ETW or the network).
+
+In both cases, it is up to the user or system administrator to decide what
+is appropriate and sanitize the data accordingly before broadcasting it.
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-07 14:53 [RFC PATCH v1] telemetry design overview (part 1) git
  2018-06-07 14:53 ` [RFC PATCH v1] telemetry: design documenation git
@ 2018-06-07 21:10 ` Johannes Sixt
  2018-06-08  9:07   ` Jeff King
  2018-06-08  9:40   ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 27+ messages in thread
From: Johannes Sixt @ 2018-06-07 21:10 UTC (permalink / raw)
  To: git; +Cc: git, gitster, peff, Jeff Hostetler

Am 07.06.2018 um 16:53 schrieb git@jeffhostetler.com:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> I've been working to add code to Git to optionally collect telemetry data.
> The goal is to be able to collect performance data from Git commands and
> allow it to be aggregated over a user community to find "slow commands".

Seriously? "add code to collect telemetry data" said by somebody whose 
email address ends with @microsoft.com is very irritating. I really 
don't want to have yet another switch that I must check after every 
update that it is still off.

-- Hannes

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-07 21:10 ` [RFC PATCH v1] telemetry design overview (part 1) Johannes Sixt
@ 2018-06-08  9:07   ` Jeff King
  2018-06-08 16:00     ` Thomas Braun
  2018-06-08  9:40   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-06-08  9:07 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, git, gitster, Jeff Hostetler

On Thu, Jun 07, 2018 at 11:10:52PM +0200, Johannes Sixt wrote:

> Am 07.06.2018 um 16:53 schrieb git@jeffhostetler.com:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> > 
> > I've been working to add code to Git to optionally collect telemetry data.
> > The goal is to be able to collect performance data from Git commands and
> > allow it to be aggregated over a user community to find "slow commands".
> 
> Seriously? "add code to collect telemetry data" said by somebody whose email
> address ends with @microsoft.com is very irritating. I really don't want to
> have yet another switch that I must check after every update that it is
> still off.

If you look at the design document, it's off by default and would write
to a file on the filesystem. That doesn't seem all that different from
GIT_TRACE.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-07 21:10 ` [RFC PATCH v1] telemetry design overview (part 1) Johannes Sixt
  2018-06-08  9:07   ` Jeff King
@ 2018-06-08  9:40   ` Ævar Arnfjörð Bjarmason
  2018-06-08 15:46     ` Duy Nguyen
  1 sibling, 1 reply; 27+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-08  9:40 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, git, gitster, peff, Jeff Hostetler

On Thu, Jun 07 2018, Johannes Sixt wrote:

> Am 07.06.2018 um 16:53 schrieb git@jeffhostetler.com:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> I've been working to add code to Git to optionally collect telemetry data.
>> The goal is to be able to collect performance data from Git commands and
>> allow it to be aggregated over a user community to find "slow commands".
>
> Seriously? "add code to collect telemetry data" said by somebody whose
> email address ends with @microsoft.com is very irritating. I really
> don't want to have yet another switch that I must check after every
> update that it is still off.

To elaborate a bit on Jeff's reply (since this was discussed in more
detail at Git Merge this year), the point of this feature is not to ship
git.git with some default telemetry, but so that in-house users of git
like Microsoft, Dropbox, Booking.com etc. can build & configure their
internal versions of git to turn on telemetry for their own users.

There's numerous in-house monkeypatches to git on various sites (at
least Microsoft & Dropbox reported having internal patches already).
Something like this facility would allow us to agree on some
implementation that could be shipped by default (but would be off by
default), those of us who'd make use of this feature already have "root"
on those user's machines, and control what git binary they use etc,
their /etc/gitconfig and so on.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry: design documenation
  2018-06-07 14:53 ` [RFC PATCH v1] telemetry: design documenation git
@ 2018-06-08 11:06   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 27+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-08 11:06 UTC (permalink / raw)
  To: git; +Cc: git, gitster, peff, Jeff Hostetler


On Thu, Jun 07 2018, git@jeffhostetler.com wrote:

> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Create design documentation to describe the telemetry feature.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>  Documentation/technical/telemetry.txt | 475 ++++++++++++++++++++++++++++++++++
>  1 file changed, 475 insertions(+)
>  create mode 100644 Documentation/technical/telemetry.txt
>
> diff --git a/Documentation/technical/telemetry.txt b/Documentation/technical/telemetry.txt
> new file mode 100644
> index 0000000..0a708ad
> --- /dev/null
> +++ b/Documentation/technical/telemetry.txt
> @@ -0,0 +1,475 @@
> +Telemetry Design Notes
> +======================
> +
> +The telemetry feature allows Git to generate structured telemetry data
> +for executed commands.  Data includes command line arguments, execution
> +times, error codes and messages, and information about child processes.
> +
> +Structued data is produced in a JSON-like format.  (See the UTF-8 related

s/Structued/Structured/

> +"limitations" described in json-writer.h)
> +
> +Telemetry data can be written to a local file or sent to a dynamically
> +loaded shared library via a plugin API.
> +
> +The telemetry feature is similar to the existing trace API (defined in
> +Documentation/technical/api-trace.txt).  Telemetry events are generated
> +thoughout the life of a Git command just like trace messages.  But where

s/thoughout/throughout/

> +as trace messages are essentially developer debug messages, telemetry
> +events are intended for logging and automated analysis.
> +
> +The goal of the telemetry feature is to be able to gather usage data across
> +a group of production users to identify real-world performance problems in
> +production.  Additionally, it might help identify common user errors and
> +guide future user training.
> +
> +By default, telemetry is disabled.  Telemetry is controlled using config
> +settings (see "telemetry.*" in Documentation/config.txt).
> +
> +
> +Telemetry Events
> +================
> +
> +Telemetry data is generated as a series of events.  Each event is written
> +as a self-describing JSON object.
> +
> +Events: cmd_start and cmd_exit
> +------------------------------
> +
> +The `cmd_start` event is emitted the very beginning of the git.exe process
> +in cmd_main() and `cmd_exit` event is emitted at the end of the process in
> +the atexit cleanup routine.
> +
> +For example, running "git version" produces:
> +
> +{
> +  "event_name": "cmd_start",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "version"
> +  ],
> +  "clock": 1525978509976086000,
> +  "pid": 25460,
> +  "git_version": "2.17.0.windows.1",
> +  "telemetry_version": "1",
> +  "session_id": "1525978509976086000-25460"
> +}
> +{
> +  "event_name": "cmd_exit",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "version"
> +  ],
> +  "clock": 1525978509980903391,
> +  "pid": 25460,
> +  "git_version": "2.17.0.windows.1",
> +  "telemetry_version": "1",
> +  "session_id": "1525978509976086000-25460",
> +  "is_interactive": false,
> +  "exit_code": 0,
> +  "elapsed_time_core": 0.004814,
> +  "elapsed_time_total": 0.004817,
> +  "builtin": {
> +    "name": "version"
> +  }
> +}
> +
> +Fields common to all events:
> + * `event_name` is the name of the event.
> + * `argv` is the array of command line arguments.
> + * `clock` is the time of the event in nanoseconds since the epoch.
> + * `pid` is the process id.
> + * `git_version` is the git version string.
> + * `telemetry_version` is the version of the telemetry format.
> + * `session_id` is described in a later section.

I wonder if we should add the path to git to these if RUNTIME_PREFIX is
defined. Would be useful to log experiments with different git versions,
and as noted by the RUNTIME_PREFIX feature this information is not
guaranteed to be something you can extract from argv.

> +Additional fields in cmd_exit:
> + * `is_interactive` is true if git.exe spawned an interactive child process,
> +      such as a pager, editor, prompt, or gui tool.
> + * `exit_code` is the value passed to exit() from main().
> + * `error_message` (not shown) is the array of error messages.
> + * `elapsed-core-time` measures the time in seconds until exit() was called.
> + * `elapsed-total-time` measures the time until the atexit() routine starts
> +      (which will include time spend in other atexit() routines cleaning up
> +      child processes and etc.).
> + * `alias` (not shown) the updated argv after alias expansion.
> + * `builtin.name` is the canonical command name (from the cmd_struct[]
> +      table) of a builtin command.
> + * `builtin.mode` (not shown) is shown for some commands that have different
> +      major modes and performance times.  For example, checkout can switch
> +      branches or repair a single file.
> + * `child_summary` (not shown) is described in a later section.
> + * `timers` (not shown) is described in a later section.
> + * `aux` (not shown) is described in a later section.
> +
> +
> +Events: child_start and child_exit
> +----------------------------------
> +
> +The child-start event is emitted just before a child process is started.
> +It includes a unique child-id and the child's command line arguments.
> +
> +The child-exit event is emitted after a child process exits and has
> +been reaped.  This event extends the start event with the child's exit
> +status and elapsed time.
> +
> +For example, during a "git fetch origin", git.exe runs gc in the background
> +and these events are emitted by the fetch process before and after the
> +child gc process:
> +
> +{
> +  "event_name": "child_start",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "fetch",
> +    "origin"
> +  ],
> +  "clock": 1525979478738132887,
> +  "pid": 18332,
> +  "git_version": "2.17.0.windows.1",
> +  "telemetry_version": "1",
> +  "session_id": "1525979470792747000-18332",
> +  "child_detail": {
> +    "number": 3,
> +    "class": "gc",
> +    "argv": [
> +      "gc",
> +      "--auto"
> +    ]
> +  }
> +}
> +{
> +  "event_name": "child_exit",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "fetch",
> +    "origin"
> +  ],
> +  "clock": 1525979479024707085,
> +  "pid": 18332,
> +  "git_version": "2.17.0.windows.1",
> +  "telemetry_version": "1",
> +  "session_id": "1525979470792747000-18332",
> +  "child_detail": {
> +    "number": 3,
> +    "class": "gc",
> +    "argv": [
> +      "gc",
> +      "--auto"
> +    ],
> +    "pid": 19608,
> +    "exit_code": 0,
> +    "elapsed_time": 0.286574
> +  }
> +}
> +
> +The common fields (`event_name` through `session_id`) are the same as
> +in the `cmd_start` and `cmd_exit` events and refer to the parent process.
> +
> +The `child_detail` structure describes the child process:
> + * `number` is a simple counter incremented for each child event.
> + * `class` is a rough characterization of the type of child process.  Child
> +      class is described in a later section.
> + * `argv` is the child's command line.
> + * `pid` is the child's process id.
> + * `exit_code` is the exit code of the child process.
> + * `elapsed_time` measures the time in seconds observed by the parent process
> +      between the child_start and child_exit events.  This will be greater
> +      than the elapsed time that the child internally observes because of
> +      process startup and shutdown overhead.  For synchronous child processes,
> +      this is the time that the parent spent waiting for the child.
> +
> +
> +Event: perf
> +-----------
> +
> +Perf events are a debugging aid to report on suspected hot spots in the
> +code and collect data from production users.  This is intended to be a
> +generic message with context-specific data.  New messages may be added
> +in the future as the need arises to help with debugging.
> +
> +Perf events are organized by category, much like the various GIT_TRACE_*
> +environment variables.  The "telemetry.perf" config setting can be set to
> +true or to a string of the perf categories that should be enabled.
> +
> +Currently, the categories "index" and "status" are defined.  Others may
> +be added later.
> +
> +For example, could be used to instrument read_index_from():
> +
> +{
> +  "event_name": "perf",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "fetch",
> +    "origin"
> +  ],
> +  "clock": 1525979478735438090,
> +  "pid": 18332,
> +  "git_version": "2.17.0.windows.1",
> +  "telemetry_version": "1",
> +  "session_id": "1525979470792747000-18332",
> +  "category": "index",
> +  "label": "read_index_from",
> +  "elapsed_time": 0.001536,
> +  "data": {
> +    "path": ".git/index",
> +    "cache_nr": 3311
> +  }
> +}
> +
> +The common fields (`event_name` through `session_id`) are the same as
> +in the `cmd_start` and `cmd_exit` events.
> +
> +All `perf` events also have:
> + * `category` is descriptive category and used like different GIT_TRACE_*
> +      variables.
> + * `label` is the name of a function or region of interest.
> + * `elapsed_time` measures the time in seconds spent in the function or
> +      region.
> + * `data` is an optional structure of context-specific (debug) data.
> +
> +
> +More Details for Event Fields
> +=============================
> +
> +Field: session_id
> +-----------------
> +
> +A session_id (SID) is a cheap, unique-enough string to associate all of
> +the events generated by a single process.  They incorporate the inherited
> +SID from their parent process.
> +
> +SIDs should be considerd opaque data, but are constructed as:

s/considerd/considered/

> +
> +    [<parent_sid>]/<start_time>-<pid>
> +
> +This scheme is used rather than a simple PID or {PPID, PID} because PIDs
> +are recycled by the OS (after sufficient time).  Also, if telemetry data
> +is aggregated from multiple systems, PIDs are not sufficient.
> +
> +This also has the advantage of allowing telemetry analysis to associate
> +Git child processes with their Git parent process even if there are
> +intermediate shell processes.
> +
> +Note: we could use UUIDs or GUIDs for this, but that seemed overkill at
> +this point.  It also required platform-specific code to generate which
> +muddied up the code.

Fully agreed, UUID would be overkill, and it's a very useful property to
have the timestamp in there, since this is very likely to end up in a
storage system where this is a primary key, and being able to just look
at those and see from what time they without also looking at "clock" is
useful.

Just as a bit of bikeshedding, I wonder if this would be better with the
<start_time> always at the start,
i.e. <start_time>-<parent_sid>-<pid>. Without a parent it would be
TIME--PID but with a parent TIME-PARENT_SID-PID.

The reason I'm suggesting that that is that if I e.g. make a MySQL table
to insert these records and use the SID as a primary key, the inserts
are much more efficient if they're in sorted order and all sort after
existing data. That's not *guaranteed* with my proposed amendmend, but
as a practical matter would be the case most of the time.

This is a property shared by a lot of DB storage formats, and it would
be useful to cater to it.

That's mostly true of <parent_sid>-<start_time>-<pid> too, but there's
going to be cases (e.g. gc) where we're starting subprocesses for a long
time, so by inserting new entries starting with <parent_sid> at the
beginning we're having to write behind in such a system.

But maybe you were aiming for the property of having all events invoked
by a given git (and the children) really closely packed together in such
a system.

I don't feel strongly about it, just something to consider.

> +Field: child_details.class
> +--------------------------
> +
> +enum telemetry_class contains a set of classification values.  These attempt
> +to roughly classify a child process from the point of view of the parent
> +process.
> + * unclass: unclassified
> + * unclass-async: unclassified asynchronous child (see sub-process.c)
> + * alias: an alias expansion using a child process
> + * hook: a hook process that may do anything (including prompting, scanning,
> +   and network operations) and wildly affect command run times.
> + * pager: a pager (indicating an interactive command)
> + * editor: an editor (indicating an interactive command)
> + * prompt: a prompt or credential or askpass process (also interactive)
> + * network: a command that might do network operations
> + * convert: an attribute filter process such as LFS or CRLF
> + * tool: a tool, such as difftool or mergetool, that may be interactive
> + * gc: an auto gc process
> +
> +struct child_process has been extended to have a telemetry_class field.  Some
> +callers of start_command() and/or run_command() have been updated to suggest
> +a classification when appropriate.  For example, child processes created by
> +launch_editor() are marked with TELEMETRY_CLASS__EDITOR.
> +
> +The primary intent is to identify which child processes are likely to block
> +on the user or network.  For example, "git commit" and "git commit -m <msg>"
> +will have different performance characteristics because the former has to
> +launch an editor and wait for the user to compose a message.  The former will
> +have a child event which child_detail.class=editor and its exit event will
> +have child_summary.editor.count=1 and child_summary.editor.elapsed_time=<t>.
> +Analysis tools can choose to report average commit time for non-interactive
> +commands or subtract the editor elapsed time from the commit elapsed time.
> +
> +For example, fetch runs rev-list, ssh, index-pack, and maybe (auto) gc.  The
> +ssh child is marked as TELEMETRY_CLASS__NETWORK and the gc child is marked
> +as TELEMETRY_CLASS__GC (since it is optional and possibly time consuming).
> +The others are left unclassified (TELEMETRY_CLASS__UNCLASS) since we don't
> +expect blocking operations.

This whole thing is very useful, and worth the small effort on our end
to classify commands as "might use network" etc.

> +Field: child_summary
> +--------------------
> +
> +The `child_summary` structure within the `cmd_exit` event summarizes the
> +child processes created by the parent process.
> +
> +For example, "git fetch origin" spawns 4 child processes:
> +
> +{
> +  "event_name": "cmd_exit",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "fetch",
> +    "origin"
> +  ],
> +  ...
> +  "child_summary": {
> +    "unclass": {
> +      "count": 2,
> +      "elapsed_time": 0.496387
> +    },
> +    "network": {
> +      "count": 1,
> +      "elapsed_time": 7.712466
> +    },
> +    "gc": {
> +      "count": 1,
> +      "elapsed_time": 0.286574
> +    }
> +  },

If we're going to have "clock" be the nanosecond epoch, let's not have
FLOAT seconds here, but instead just make "elapsed_time" be the elapsed
time in nanoseconds.

That'll both simplify dealing with this manually, and e.g. the schema of
any DB that this gets inserted into.

> +  "exit_code": 0,
> +  "elapsed_time_core": 8.232965,
> +  "elapsed_time_total": 8.232968,

Ditto the parent's elapsed time....

> +  "builtin": {
> +    "name": "fetch"
> +  }
> +}
> +
> +Within each `child_summary[<class>]` is a count of the number of child
> +processes and the cummulative elapsed time.

s/cummulative/cumulative/

> +Analysis tools interested in a net-elapsed-time of the parent process may
> +want to subtract the elapsed time of the child processes.  This approach is
> +mostly valid, since most child processes are run synchronously.  However,
> +some processes are run asynchronously, such as the pager and processes in
> +the unclass-async pool, so care should be taken.

... Exactly for this reason, now anyone wanting to do that math is
having to deal with the minefield that is IEEE Floating-Point
Arithmetic.

> +Field: timers
> +-------------
> +
> +A "telemetry timer" is a stopwatch-like timer with a counter.  It can be
> +used to time a specific region of code, such as an expensive computation
> +within the body of a larger loop.  It defines a generic way to collect
> +perf data without causing an telemetry perf event to be fired on each
> +iteration.  Instead, a timer is registered with the telemetry layer and
> +the data will be included in a "timers" sub-section in the `cmd_exit` event.
> +
> +For example, a timer was added to do_read_index() and do_write_index()
> +to measure the time spent reading and writing the index.
> +
> +{
> +  "event_name": "cmd_exit",
> +  "argv": [
> +    "C:\\work\\gfw\\git.exe",
> +    "status"
> +  ],
> +  ...
> +  "timers": {
> +    "do_read_index": {
> +      "count": 1,
> +      "total": 0.000740,
> +      "min": 0.000740,
> +      "max": 0.000740,
> +      "avg": 0.000740
> +    },
> +    "do_write_index": {
> +      "count": 1,
> +      "total": 0.004724,
> +      "min": 0.004724,
> +      "max": 0.004724,
> +      "avg": 0.004724
> +    }
> +  },
> +  "exit_code": 0,
> +  "elapsed_time_core": 0.049865,
> +  "elapsed_time_total": 0.049867,
> +  "builtin": {
> +    "name": "status"
> +  }
> +}
> +
> +The `timers` structure contains a named member for each defined timer.
> +Within each individual timer, we have:
> + * `count` is the number of times it was started/stopped.
> + * `total` is the total time the timer was running.
> + * `min` is the shortest interval.
> + * `max` is the longest interval.
> + * `avg` is the average interval.

I wonder if we should add "median" here, although that would (unless
I've missed some clever trick) require keeping all the times around,
sorting them and then picking the 50th percentile, whereas we can
compute a running min/avg/max. So maybe it should be optional (this can
always be added later).

Once we have that we can easily add 10/25/50/75/90/95/99th percentiles
as well.

> +Field: aux
> +---------------
> +
> +The `aux` structure within the `cmd_exit` event contains additional
> +information about the process.  This is intended as a generic container for
> +various fields, such as important config settings or repo data shape that
> +may affect performance or help identify the repository for aggregation
> +purposes.
> +
> +{
> +  "event_name": "cmd_exit",
> +  ...
> +  "aux": {
> +    "remote_origin_url": "git@github.com:git/git.git",
> +    "index_count": 3311,
> +    "sparse_checkout_count": 3
> +  },
> +  ...
> +}
> +
> +Other fields (and even sub-structures) can be added to this container
> +as needed.
> +
> +
> +Telemetry Destination
> +=====================
> +
> +Telemetry events are sent to a "destination".  This can be a file or a
> +plugin.  Telemetry is disabled if a destination is not set.
> +
> +telemetry.path

There is the telemetry.plugin escape hatch noted below I can change it,
but I don't like the design of this, because:

1) On POSIX systems (I don't know about Windows) if you O_APPPEND to
such a file you're guaranteed at most PIPE_MAX bytes written to the file
before there's no guarantee that your write is atomic, i.e. you will get
interleaved JSON if we ever write something bigger than that.

From the spec above it seems like this will be unlikely in practice for
most simple invocations, but with "aux" payloads and a sufficient amount
of child processes & timers I can see us easily going over PIPE_MAX,
which is commonly 4096 (e.g. the linux default).

2) It makes it hard to process these events piecemeal and in parallel,
i.e. imagine having all the dev machines NFS mount some logging FS where
they all write to this file, now I need to periodically move that file,
or ensure the config is set so we write to a daily/hourly file etc.

A tiny modification on this would address #1 and #2. We would make
telemetry.path the path to a directory where metrics are written, and
then write e.g. <THAT_DIRECTORY>/<first 2 chars of sid>/<remaining chars
of sid>, similar to how we write loose objects.

Then a bunch of processes could write to those files in parallel without
fear of clobbering anything, and anything consuming them could easily do
so with a standard readdir() + stat() + rename(F -> F.doing) &&
unlink(F.doing) pattern.

But reading between the lines you're not planning to use this yourself
at all, and we can always add a telemetry.directory later with this
behavior, leaving telemetry.path for a "simple" backend (while
prominently advertising that PIPE_MAX caveat).

> +If the config setting "telemetry.path" contains a pathname, telemetry
> +events will be appended to that file using the builtin destination
> +handler.  (File rotation is beyond the scope of this design.)
> +
> +Events are written as a series of JSON records.  When "telemetry.pretty"
> +is false, each event record will be written on one line.
> +
> +(All of the examples in this document were prepared with "telemetry.pretty"
> +set to true.)
> +
> +telemetry.plugin
> +----------------
> +
> +If the config setting "telemetry.plugin" contains the pathname to a shared
> +library, the library will be dynamically loaded during start up and events
> +will be sent to it using the plugin API.
> +
> +This plugin model allows an organization to define a custom or private
> +telemetry solution while using a stock version of Git.
> +
> +For example, on Windows, it allows telemetry events to go directly to the
> +kernel via the plugin using the high performance Event Tracing for Windows
> +(ETW) facility.
> +
> +The contrib/telemetry-plugin-examples directory contains two example
> +plugins:
> + * A trivial log to stderr
> + * A trivial ETW writer

Maybe we should also have a telemetry.hook for flexibility on systems
where invoking processes isn't expensive :)

This is a bit of scope creep on what you're trying to accomplish, but I
wonder if we could keep the requirements for some sort of general *.so
hook facility in mind with this.

Our hooks now do relatively simple things like take a stream on STDIN,
get ENV vars, and inspect ARGV. Could we generalize this to using some
callback mechanism that provides those, so this could also be used for
other existing hooks?

> +GDPR and Privacy
> +================
> +
> +The telemetry feature can log possibly sensitive user information (such as
> +command line arguments, which may contain URLs, user names, and file names).
> +
> +The base telemetry feature can write telemetry data to a file on the system.
> +
> +The plugin facility can be used to publish the telemetry data to more general
> +destinations (such as ETW or the network).
> +
> +In both cases, it is up to the user or system administrator to decide what
> +is appropriate and sanitize the data accordingly before broadcasting it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08  9:40   ` Ævar Arnfjörð Bjarmason
@ 2018-06-08 15:46     ` Duy Nguyen
  0 siblings, 0 replies; 27+ messages in thread
From: Duy Nguyen @ 2018-06-08 15:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Sixt, Jeff Hostetler, Git Mailing List, Junio C Hamano,
	Jeff King, Jeff Hostetler

On Fri, Jun 8, 2018 at 11:40 AM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Jun 07 2018, Johannes Sixt wrote:
>
>> Am 07.06.2018 um 16:53 schrieb git@jeffhostetler.com:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> I've been working to add code to Git to optionally collect telemetry data.
>>> The goal is to be able to collect performance data from Git commands and
>>> allow it to be aggregated over a user community to find "slow commands".
>>
>> Seriously? "add code to collect telemetry data" said by somebody whose
>> email address ends with @microsoft.com is very irritating. I really
>> don't want to have yet another switch that I must check after every
>> update that it is still off.
>
> To elaborate a bit on Jeff's reply (since this was discussed in more
> detail at Git Merge this year), the point of this feature is not to ship
> git.git with some default telemetry, but so that in-house users of git
> like Microsoft, Dropbox, Booking.com etc. can build & configure their
> internal versions of git to turn on telemetry for their own users.
>
> There's numerous in-house monkeypatches to git on various sites (at
> least Microsoft & Dropbox reported having internal patches already).
> Something like this facility would allow us to agree on some
> implementation that could be shipped by default (but would be off by
> default), those of us who'd make use of this feature already have "root"
> on those user's machines, and control what git binary they use etc,
> their /etc/gitconfig and so on.

This elaboration should have been in the commit message of the first
patch (and I hope it will be in v2). You guys knew because it was
discussed at Git Merge but the rest of us may not.. It would be much
more convincing to have something like this spelled out.
-- 
Duy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08  9:07   ` Jeff King
@ 2018-06-08 16:00     ` Thomas Braun
  2018-06-08 22:01       ` Johannes Sixt
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Braun @ 2018-06-08 16:00 UTC (permalink / raw)
  To: Jeff King, Johannes Sixt; +Cc: git, git, gitster, Jeff Hostetler

Am 08.06.2018 um 11:07 schrieb Jeff King:
> On Thu, Jun 07, 2018 at 11:10:52PM +0200, Johannes Sixt wrote:
> 
>> Am 07.06.2018 um 16:53 schrieb git@jeffhostetler.com:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> I've been working to add code to Git to optionally collect telemetry data.
>>> The goal is to be able to collect performance data from Git commands and
>>> allow it to be aggregated over a user community to find "slow commands".
>>
>> Seriously? "add code to collect telemetry data" said by somebody whose email
>> address ends with @microsoft.com is very irritating. I really don't want to
>> have yet another switch that I must check after every update that it is
>> still off.
> 
> If you look at the design document, it's off by default and would write
> to a file on the filesystem. That doesn't seem all that different from
> GIT_TRACE.

The patch also includes the following part

+telemetry.plugin
+----------------
+
+If the config setting "telemetry.plugin" contains the pathname to a shared
+library, the library will be dynamically loaded during start up and events
+will be sent to it using the plugin API.
+
+This plugin model allows an organization to define a custom or private
+telemetry solution while using a stock version of Git.
+
+For example, on Windows, it allows telemetry events to go directly to the
+kernel via the plugin using the high performance Event Tracing for Windows
+(ETW) facility.
+
+The contrib/telemetry-plugin-examples directory contains two example
+plugins:
+ * A trivial log to stderr
+ * A trivial ETW writer

which is not a file but, if enabled, some windows internal thingie where the data is gone/duplicated/sent out/whatever.

I for my part would much rather prefer that to be a compile time option so that I don't need to check on every git update on windows if this is now enabled or not.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08 16:00     ` Thomas Braun
@ 2018-06-08 22:01       ` Johannes Sixt
  2018-06-08 22:20         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2018-06-08 22:01 UTC (permalink / raw)
  To: Thomas Braun, Jeff King; +Cc: git, git, gitster, Jeff Hostetler

Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> I for my part would much rather prefer that to be a compile time
> option so that I don't need to check on every git update on windows
> if  this is now enabled or not.

This exactly my concern, too! A compile-time option may make it a good 
deal less worrisome.

-- Hannes

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08 22:01       ` Johannes Sixt
@ 2018-06-08 22:20         ` Ævar Arnfjörð Bjarmason
  2018-06-09  5:03           ` Duy Nguyen
  2018-06-09  6:56           ` Johannes Sixt
  0 siblings, 2 replies; 27+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-08 22:20 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Thomas Braun, Jeff King, git, git, gitster, Jeff Hostetler


On Fri, Jun 08 2018, Johannes Sixt wrote:

> Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>> I for my part would much rather prefer that to be a compile time
>> option so that I don't need to check on every git update on windows
>> if  this is now enabled or not.
>
> This exactly my concern, too! A compile-time option may make it a good
> deal less worrisome.

Can you elaborate on how someone who can maintain inject malicious code
into your git package + config would be thwarted by this being some
compile-time option, wouldn't they just compile it in?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08 22:20         ` Ævar Arnfjörð Bjarmason
@ 2018-06-09  5:03           ` Duy Nguyen
  2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
  2018-06-09  6:51             ` Jeff King
  2018-06-09  6:56           ` Johannes Sixt
  1 sibling, 2 replies; 27+ messages in thread
From: Duy Nguyen @ 2018-06-09  5:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Sixt, Thomas Braun, Jeff King, Jeff Hostetler,
	Git Mailing List, Junio C Hamano, Jeff Hostetler

On Sat, Jun 9, 2018 at 12:22 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Jun 08 2018, Johannes Sixt wrote:
>
> > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> >> I for my part would much rather prefer that to be a compile time
> >> option so that I don't need to check on every git update on windows
> >> if  this is now enabled or not.
> >
> > This exactly my concern, too! A compile-time option may make it a good
> > deal less worrisome.
>
> Can you elaborate on how someone who can maintain inject malicious code
> into your git package + config would be thwarted by this being some
> compile-time option, wouldn't they just compile it in?


Look at this from a different angle. This is driven by the needs to
collect telemetry in _controlled_ environment (mostly server side, I
guess) and it should be no problem to make custom builds there for
you. Not making it a compile-time option could force [1] linux distro
to carry this function to everybody even if they don't use it (and
it's kinda dangerous to misuse if you don't anonymize the data
properly). I also prefer this a compile time option.

[1] Of course many distros can choose to patch it out. But it's the
same argument as bringing this option in in the first place: you guys
already have that code in private and now want to put it in stock git
to reduce maintenance cost, why add extra cost on linux distro
maintenance?
-- 
Duy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  5:03           ` Duy Nguyen
@ 2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
  2018-06-09  6:56               ` Jeff King
  2018-06-09  7:31               ` Duy Nguyen
  2018-06-09  6:51             ` Jeff King
  1 sibling, 2 replies; 27+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-09  6:31 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Johannes Sixt, Thomas Braun, Jeff King, Jeff Hostetler,
	Git Mailing List, Junio C Hamano, Jeff Hostetler

On Sat, Jun 09 2018, Duy Nguyen wrote:

> On Sat, Jun 9, 2018 at 12:22 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Fri, Jun 08 2018, Johannes Sixt wrote:
>>
>> > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>> >> I for my part would much rather prefer that to be a compile time
>> >> option so that I don't need to check on every git update on windows
>> >> if  this is now enabled or not.
>> >
>> > This exactly my concern, too! A compile-time option may make it a good
>> > deal less worrisome.
>>
>> Can you elaborate on how someone who can maintain inject malicious code
>> into your git package + config would be thwarted by this being some
>> compile-time option, wouldn't they just compile it in?
>
>
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you.

Let's say you're in a corporate environment with Linux, OSX and Windows
boxes, but all of whom have some shared mounts provisioned & ability to
ship an /etc/gitconfig (wherever that lives on Windows).

It's much easier to just do that than figure out how to build a custom
Git on all three platforms.

I guess you might make the argument that "that's good", because in
practice that'll mean that it's such a hassle that fewer administrators
will turn this on.

But I think that would be a loss, because that's taking the default view
that people with the rights (i.e. managed config access) to turn on
something like this by default have nefarious motives, and we should do
what we can to stop them.

I don't think that's true, e.g. what I intend to use this for is:

 a) Getting aggregate data on what commands/switches are used, for
    purposes of training and prioritizing my upstream contributions.

 b) Aggregate performance data to figure out what hotspots to tackle in
    the code.

That's things that'll both benefit the users I'm responsible for, and
the wider git community.

> Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this a compile time option.

Setting GIT_TRACE to a filename that you published is also similarly
dangerous, so would setting up a trivial 4-line shell alias to wrap
"git" and log what it's doing.

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

Because:

1) I really don't see the basis for this argument that they'd need to
   patch it out, they're not patching out e.g. GIT_TRACE now, which has
   all the same sort of concerns, it's just a format that's more of a
   hassle to parse than this proposed format.

2) I think you and Johannes are just seeing the "telemetry" part of
   this, but if you look past that all this *really* is is "GIT_TRACE
   facility that doesn't suck to parse".

   There's a lot of use-cases for that which have nothing to do with
   what this facility is originally written for, for example, a novice
   git user could turn it on and have it log in ~ somewhere, and then
   run some contrib script which analyzes his git usage and spews out
   suggestions ("you use this command/option, but not this related
   useful command/option").

   Users asking for help on the #git IRC channel or on this mailing list
   could turn this on if they have a problem, and paste it into some
   tool they could show to others to see exactly what they're doing /
   where it went wrong.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  5:03           ` Duy Nguyen
  2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
@ 2018-06-09  6:51             ` Jeff King
  2018-06-09  7:04               ` Johannes Sixt
  2018-06-12 16:04               ` Junio C Hamano
  1 sibling, 2 replies; 27+ messages in thread
From: Jeff King @ 2018-06-09  6:51 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Thomas Braun, Jeff Hostetler, Git Mailing List, Junio C Hamano,
	Jeff Hostetler

On Sat, Jun 09, 2018 at 07:03:53AM +0200, Duy Nguyen wrote:

> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > >> I for my part would much rather prefer that to be a compile time
> > >> option so that I don't need to check on every git update on windows
> > >> if  this is now enabled or not.
> > >
> > > This exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> >
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you. Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this a compile time option.

I actually think this could be useful for normal users, too. We have
GIT_TRACE for dumping debug output, and we sometimes ask users to turn
it on to help diagnose a problem (not to mention that we use it
ourselves).

AFAICT this telemetry data is the same thing, but for performance. When
somebody says "why does this command take so long", wouldn't it be nice
for us to be able to tell them to flip a switch that will collect data
on which operations are taking a long time?

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

We don't do performance telemetry like this at GitHub, but we do collect
a few variables that we dump to a custom over a Unix socket (e.g., our
upload-pack records clone vs fetch and shallow vs non-shallow for each
request).

In my experience the maintenance burden is not in the "connect to a
socket" part, but the fact that you have to sprinkle the entry points
throughout the code (e.g., "set 'cloning' flag to 1" or "I'm entering
the pack generation phase of the fetch"). So I'd love to see us do that
sprinkling _once_ where everyone can benefit, whether they want better
tracing for debugging, telemetry across their installed base, or
whatever.

The mechanism to handle those calls is much easier to plug in and out,
then. And I don't have a huge preference for compile-time versus
run-time, or whether bits are in-tree or out-of-tree. Obviously we'd
have some basic "write to stderr or a file" consumer in-tree.

For myself, I'm happy with compile-time (I'm instrumenting the server
side, so I really only care about my specific build) and out-of-tree
(the protocol to our custom daemon consumer is not something anybody
else would care about anyway).

For people collecting from clients, I imagine it's more convenient to be
able to let them use official builds and just tweak a knob, even if they
_could_ build a custom Git and push it out to everybody. I don't know
anything about Windows event tracing, but if it's a standard facility
then I doubt we're talking about a huge maintenance burden to have it
in-tree as a configurable option.

So IMHO that really leaves the "is it too scary to have a config knob
that lets tracing go to this event facility versus to a file"? I guess I
just don't see it, as long as the user has to enable it explicitly. That
seems like complaining that GIT_TRACE could go to syslog. If you don't
want to store it, then don't turn on the feature.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-08 22:20         ` Ævar Arnfjörð Bjarmason
  2018-06-09  5:03           ` Duy Nguyen
@ 2018-06-09  6:56           ` Johannes Sixt
  2018-06-09 20:43             ` Johannes Schindelin
  2018-06-10  0:00             ` brian m. carlson
  1 sibling, 2 replies; 27+ messages in thread
From: Johannes Sixt @ 2018-06-09  6:56 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Thomas Braun, Jeff King, git, git, gitster, Jeff Hostetler

Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> 
> On Fri, Jun 08 2018, Johannes Sixt wrote:
> 
>> Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>>> I for my part would much rather prefer that to be a compile time
>>> option so that I don't need to check on every git update on windows
>>> if  this is now enabled or not.
>>
>> This exactly my concern, too! A compile-time option may make it a good
>> deal less worrisome.
> 
> Can you elaborate on how someone who can maintain inject malicious code
> into your git package + config would be thwarted by this being some
> compile-time option, wouldn't they just compile it in?

Of course they can. But would we, the Git community do that?

 From the design document:

 > The goal of the telemetry feature is to be able to gather usage data
 > across a group of production users to identify real-world performance
 > problems in production.  Additionally, it might help identify common
 > user errors and guide future user training.

The goal to gather usage data may be valid for a small subset of Git 
installations. But it is wrong to put this into the software itself, in 
particular when the implementations includes scary things like loading 
unspecified dynamic libraries:

 > If the config setting "telemetry.plugin" contains the pathname to a
 > shared library, the library will be dynamically loaded during start up
 > and events will be sent to it using the plugin API.

When you want usage data, ask your users for feedback. Look over their 
shoulders. But do not ask the software itself to gather usage data. It 
will be abused.

Do not offer open source software that has a "call-home" method built-in.

If you want to peek into the workplaces of YOUR users, then monkey-patch 
survaillance into YOUR version of Git. But please do not burden the rest 
of us.

-- Hannes

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
@ 2018-06-09  6:56               ` Jeff King
  2018-06-09 20:05                 ` Johannes Schindelin
  2018-06-09  7:31               ` Duy Nguyen
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-06-09  6:56 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, Johannes Sixt, Thomas Braun, Jeff Hostetler,
	Git Mailing List, Junio C Hamano, Jeff Hostetler

On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:

> 1) I really don't see the basis for this argument that they'd need to
>    patch it out, they're not patching out e.g. GIT_TRACE now, which has
>    all the same sort of concerns, it's just a format that's more of a
>    hassle to parse than this proposed format.
> 
> 2) I think you and Johannes are just seeing the "telemetry" part of
>    this, but if you look past that all this *really* is is "GIT_TRACE
>    facility that doesn't suck to parse".

So that actually takes me to another question, which is: instead of
introducing a new "telemetry" feature, could we make GIT_TRACE suck less
to parse?

E.g., could we have a flag or environment variable to have the existing
traces output JSON? I guess right now they're inherently free-form via
trace_printf, so it would involve adding some structured interface
calls. Which is more or less what I guess JeffH's proposed feature to
look like.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:51             ` Jeff King
@ 2018-06-09  7:04               ` Johannes Sixt
  2018-06-09  7:31                 ` Jeff King
  2018-06-12 16:04               ` Junio C Hamano
  1 sibling, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2018-06-09  7:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, Ævar Arnfjörð Bjarmason, Thomas Braun,
	Jeff Hostetler, Git Mailing List, Junio C Hamano, Jeff Hostetler

Am 09.06.2018 um 08:51 schrieb Jeff King:
> I actually think this could be useful for normal users, too. We have
> GIT_TRACE for dumping debug output, and we sometimes ask users to turn
> it on to help diagnose a problem (not to mention that we use it
> ourselves).

The BIG difference is in "we ask the users". Asking for a trace is a 
one-shot thing; this telemetry thing is obviously intended for long-term 
survaillance.

> AFAICT this telemetry data is the same thing, but for performance. When
> somebody says "why does this command take so long", wouldn't it be nice
> for us to be able to tell them to flip a switch that will collect data
> on which operations are taking a long time?

Why do we need long-term survaillance to answer this question and why 
can it not be made a mode of GIT_TRACE?

-- Hannes

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  7:04               ` Johannes Sixt
@ 2018-06-09  7:31                 ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-06-09  7:31 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Duy Nguyen, Ævar Arnfjörð Bjarmason, Thomas Braun,
	Jeff Hostetler, Git Mailing List, Junio C Hamano, Jeff Hostetler

On Sat, Jun 09, 2018 at 09:04:16AM +0200, Johannes Sixt wrote:

> > AFAICT this telemetry data is the same thing, but for performance. When
> > somebody says "why does this command take so long", wouldn't it be nice
> > for us to be able to tell them to flip a switch that will collect data
> > on which operations are taking a long time?
> 
> Why do we need long-term survaillance to answer this question and why can it
> not be made a mode of GIT_TRACE?

I guess I don't see how this isn't simply a mode of GIT_TRACE.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
  2018-06-09  6:56               ` Jeff King
@ 2018-06-09  7:31               ` Duy Nguyen
  1 sibling, 0 replies; 27+ messages in thread
From: Duy Nguyen @ 2018-06-09  7:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Sixt, Thomas Braun, Jeff King, Jeff Hostetler,
	Git Mailing List, Junio C Hamano, Jeff Hostetler

On Sat, Jun 9, 2018 at 8:32 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> Let's say you're in a corporate environment with Linux, OSX and Windows
> boxes, but all of whom have some shared mounts provisioned & ability to
> ship an /etc/gitconfig (wherever that lives on Windows).
>
> It's much easier to just do that than figure out how to build a custom
> Git on all three platforms.

Let's say I don't care about making it easier on corporate
environment. You have resources to do that.

> > Not making it a compile-time option could force [1] linux distro
> > to carry this function to everybody even if they don't use it (and
> > it's kinda dangerous to misuse if you don't anonymize the data
> > properly). I also prefer this a compile time option.
>
> Setting GIT_TRACE to a filename that you published is also similarly
> dangerous, so would setting up a trivial 4-line shell alias to wrap
> "git" and log what it's doing.
>
> > [1] Of course many distros can choose to patch it out. But it's the
> > same argument as bringing this option in in the first place: you guys
> > already have that code in private and now want to put it in stock git
> > to reduce maintenance cost, why add extra cost on linux distro
> > maintenance?
>
> Because:
>
> 1) I really don't see the basis for this argument that they'd need to
>    patch it out, they're not patching out e.g. GIT_TRACE now, which has
>    all the same sort of concerns, it's just a format that's more of a
>    hassle to parse than this proposed format.
>
> 2) I think you and Johannes are just seeing the "telemetry" part of
>    this, but if you look past that all this *really* is is "GIT_TRACE
>    facility that doesn't suck to parse".

If it's simply an improvement over GIT_TRACE, then I have no problem
with that. That is:

- there is no new, permanent way to turn this one like config keys

- there's no plugin support or whatever for keeping output. Keep it to
either a file or a file descriptor like we do with GIT_TRACE.

>    There's a lot of use-cases for that which have nothing to do with
>    what this facility is originally written for, for example, a novice
>    git user could turn it on and have it log in ~ somewhere, and then
>    run some contrib script which analyzes his git usage and spews out
>    suggestions ("you use this command/option, but not this related
>    useful command/option").
>
>    Users asking for help on the #git IRC channel or on this mailing list
>    could turn this on if they have a problem, and paste it into some
>    tool they could show to others to see exactly what they're doing /
>    where it went wrong.

And they can use GIT_TRACE now. If GIT_TRACE does not give enough info
(e.g. too few trace points), add them. This new proposal is more about
automation. With GIT_TRACE, which is more human friendly, I could skim
over the output and remove sensitive info before I send it out for
help. Machine-friendly format (even json) tends to be a lot more
complex and harder to read/filter this way.
-- 
Duy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:56               ` Jeff King
@ 2018-06-09 20:05                 ` Johannes Schindelin
  2018-06-11  5:56                   ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Schindelin @ 2018-06-09 20:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Duy Nguyen, Johannes Sixt,
	Thomas Braun, Jeff Hostetler, Git Mailing List, Junio C Hamano,
	Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 1893 bytes --]

Hi Peff,

On Sat, 9 Jun 2018, Jeff King wrote:

> On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> > 1) I really don't see the basis for this argument that they'd need to
> >    patch it out, they're not patching out e.g. GIT_TRACE now, which has
> >    all the same sort of concerns, it's just a format that's more of a
> >    hassle to parse than this proposed format.
> >
> > 2) I think you and Johannes are just seeing the "telemetry" part of
> >    this, but if you look past that all this *really* is is "GIT_TRACE
> >    facility that doesn't suck to parse".
> 
> So that actually takes me to another question, which is: instead of
> introducing a new "telemetry" feature, could we make GIT_TRACE suck less
> to parse?

That's a different question, and even a different concern.

Please understand that this feature is needed for our own in-house use, to
be able to help our users pro-actively. We really want to be able to
contact a user when they struggle with the performance of any given Git
command, before they have to think of reaching out for assistance.

We discussed this plan at the contributor summit at GitMerge, and IIRC at
least Autodesk and DropBox were interested (but of course, the changes of
both Lars' and Alex' roles might make them less interested now), hence
JeffHost wanted to contribute this, for the benefit of the larger Git
community.

> E.g., could we have a flag or environment variable to have the existing
> traces output JSON? I guess right now they're inherently free-form via
> trace_printf, so it would involve adding some structured interface
> calls. Which is more or less what I guess JeffH's proposed feature to
> look like.

I think that is a much larger project than what JeffHost proposed, and
would unfortunately put too much of a brake on his project.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:56           ` Johannes Sixt
@ 2018-06-09 20:43             ` Johannes Schindelin
  2018-06-09 22:44               ` Johannes Sixt
  2018-06-10  0:00             ` brian m. carlson
  1 sibling, 1 reply; 27+ messages in thread
From: Johannes Schindelin @ 2018-06-09 20:43 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Ævar Arnfjörð Bjarmason, Thomas Braun, Jeff King,
	git, git, gitster, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 2532 bytes --]

Hi Hannes,

On Sat, 9 Jun 2018, Johannes Sixt wrote:

> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > 
> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > > > I for my part would much rather prefer that to be a compile time
> > > > option so that I don't need to check on every git update on windows
> > > > if  this is now enabled or not.
> > >
> > > This exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> > 
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
> particular when the implementations includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will be
> abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> survaillance into YOUR version of Git. But please do not burden the rest of
> us.

We already offer hooks. You can do anything with those hooks. Even, if you
do not pay close attention, to transfer all your bitcoin to a specific
account.

I agree with Peff: this is something you as a user need to be aware of,
and need to make sure you configure your Git just like you want. As long
as this is a purely opt-in feature, it is useful and helpful.

We do need it in-house, for the many thousands of Git users we try to
support with a relatively small team of Git developers.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09 20:43             ` Johannes Schindelin
@ 2018-06-09 22:44               ` Johannes Sixt
  2018-06-11  6:08                 ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2018-06-09 22:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Thomas Braun, Jeff King,
	git, git, gitster, Jeff Hostetler

Am 09.06.2018 um 22:43 schrieb Johannes Schindelin:
> On Sat, 9 Jun 2018, Johannes Sixt wrote:
>> When you want usage data, ask your users for feedback. Look over their
>> shoulders. But do not ask the software itself to gather usage data. It will be
>> abused.
>>
>> Do not offer open source software that has a "call-home" method built-in.
>>
>> If you want to peek into the workplaces of YOUR users, then monkey-patch
>> survaillance into YOUR version of Git. But please do not burden the rest of
>> us.
> 
> We already offer hooks. You can do anything with those hooks. Even, if you
> do not pay close attention, to transfer all your bitcoin to a specific
> account.
> 
> I agree with Peff: this is something you as a user need to be aware of,
> and need to make sure you configure your Git just like you want. As long
> as this is a purely opt-in feature, it is useful and helpful.

The problem with this feature is not so much that it enables someone to 
do bad things, but that it is specifically targeted at recording *how 
users use Git*.

> We do need it in-house, for the many thousands of Git users we try to
> support with a relatively small team of Git developers.

Then please build your private version and sell it to your thousands of 
Git users. "in-house" == Microsoft? Small team of Git developers? So, 
instead of shelling out the bucks for this extra burden, they put the 
burden on the Community?

-- Hannes

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:56           ` Johannes Sixt
  2018-06-09 20:43             ` Johannes Schindelin
@ 2018-06-10  0:00             ` brian m. carlson
  2018-06-11  6:14               ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: brian m. carlson @ 2018-06-10  0:00 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Ævar Arnfjörð Bjarmason, Thomas Braun, Jeff King,
	git, git, gitster, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 3389 bytes --]

On Sat, Jun 09, 2018 at 08:56:00AM +0200, Johannes Sixt wrote:
> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
> particular when the implementations includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will
> be abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> survaillance into YOUR version of Git. But please do not burden the rest of
> us.

I understand there's an interest in supporting the most people with the
fewest amount of staff.  I'm certainly in the situation where I, with
only minimal assistance, support every Git user in my division of the
company, regardless of technical ability, and I know how overwhelming
that can be.  (Burnout, I can tell you, is a thing.)

I also have to look at this issue from the interests of what is best for
the FLOSS community and for users as a whole.  Adding in functionality
that sends off usage data from a command-line tool, especially one that
is as widely used as Git is, is not in the interests of users as a
whole, nor is it common practice in FLOSS tools.

As a highly capable and technical user, I would find it very undesirable
to have my development tools reporting data like this, even if it is to
make my experience better.

The ability to load arbitrary libraries makes me concerned about people
using this to spirit away personal or company data or to subtly steal
data in a rootkit-like situation.  These are real threats in the kinds
of environments I distribute to in my work role.

I agree with Duy's point of view that GIT_TRACE-level output to a file
descriptor or file is fine, but a persistently enabled feature is not.

I expect this feature, if implemented, would be patched out of Debian's
Git, and it would be patched out of any Git I would distribute in my
work role for legal and ethical reasons.

As developers, we have a duty to be mindful of how our software can be
misused and abused and try to avoid that when possible.  I don't think
this feature is on the right side of that balance.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09 20:05                 ` Johannes Schindelin
@ 2018-06-11  5:56                   ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-06-11  5:56 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Duy Nguyen, Johannes Sixt,
	Thomas Braun, Jeff Hostetler, Git Mailing List, Junio C Hamano,
	Jeff Hostetler

On Sat, Jun 09, 2018 at 10:05:49PM +0200, Johannes Schindelin wrote:

> > E.g., could we have a flag or environment variable to have the existing
> > traces output JSON? I guess right now they're inherently free-form via
> > trace_printf, so it would involve adding some structured interface
> > calls. Which is more or less what I guess JeffH's proposed feature to
> > look like.
> 
> I think that is a much larger project than what JeffHost proposed, and
> would unfortunately put too much of a brake on his project.

I definitely don't want to stall somebody else's momentum with a bunch
of what-if's. But I also don't want to end up down the road with two
nearly-identical systems for tracing information. That's confusing to
users, and to developers who must choose which system to use for any new
tracing information they add.

So I think it's worth at least giving a little thought to how we might
leverage similarities between the trace system and this. Even if we
don't implement it now, it would be nice to have a vague sense of how
they could grow together in the long run.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09 22:44               ` Johannes Sixt
@ 2018-06-11  6:08                 ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-06-11  6:08 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	Thomas Braun, git, git, gitster, Jeff Hostetler

On Sun, Jun 10, 2018 at 12:44:25AM +0200, Johannes Sixt wrote:

> > I agree with Peff: this is something you as a user need to be aware of,
> > and need to make sure you configure your Git just like you want. As long
> > as this is a purely opt-in feature, it is useful and helpful.
> 
> The problem with this feature is not so much that it enables someone to do
> bad things, but that it is specifically targeted at recording *how users use
> Git*.

I think one issue here is that we are not looking at concrete patches.
So for instance, I've seen a claim that Git should never have a way to
turn on tracing all the time. But at GitHub we have found it useful to
have a config option to log the sha1 of every object that is dropped by
git-prune or by "repack -ad". It's helped both as a developer (tracking
down races or bugs in our code) and as an administrator (figuring out
where a corruption was introduced). It needs to be on all the time to be
useful, since the point is to have an audit trail to look at _after_ a
bad thing happens.

That's something we do completely on the server side; I don't think
there are any privacy or "spying" issues there. And I don't think it's a
huge maintenance burden. Inside the existing code, it's literally a
one-line "log this" (the log code itself is a hundred or so lines in its
own file).

Now most users probably don't care that much about this use case. And
I'm OK to apply it as a custom patch. But doesn't it seem like that's
something other people hosting Git repos might want? Or that the concept
might extend to other loggable items that _are_ interesting on the
client side?

That's why I think it is worth taking this step-by-step. Let's log more
things. Let's make enabling tracing more flexible. Those are hopefully
uncontentious and universally useful. If you want to draw the line on
"spying", then I think the right place to draw it is when somebody wants
to ship code to actually move those logs out of the user's control.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-10  0:00             ` brian m. carlson
@ 2018-06-11  6:14               ` Jeff King
  2018-06-11  8:30                 ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-06-11  6:14 UTC (permalink / raw)
  To: brian m. carlson, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Thomas Braun, git, git,
	gitster, Jeff Hostetler

On Sun, Jun 10, 2018 at 12:00:57AM +0000, brian m. carlson wrote:

> I also have to look at this issue from the interests of what is best for
> the FLOSS community and for users as a whole.  Adding in functionality
> that sends off usage data from a command-line tool, especially one that
> is as widely used as Git is, is not in the interests of users as a
> whole, nor is it common practice in FLOSS tools.

I'm not sure if this last statement is true. For instance, many tools
have crash-reporting functionality, and some contain more advanced
telemetry. E.g.:

  https://wiki.mozilla.org/Telemetry

Personally I do not like this sort of data collection at all, and never
enable it. And I would be highly suspicious of any FLOSS tool that
shipped with it enabled by default. But it seems like at least some
projects _do_ find it useful, and enough people enable it to make it
worth carrying as a feature.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-11  6:14               ` Jeff King
@ 2018-06-11  8:30                 ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-06-11  8:30 UTC (permalink / raw)
  To: brian m. carlson, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Thomas Braun, git, git,
	gitster, Jeff Hostetler

On Mon, Jun 11, 2018 at 02:14:41AM -0400, Jeff King wrote:

> On Sun, Jun 10, 2018 at 12:00:57AM +0000, brian m. carlson wrote:
> 
> > I also have to look at this issue from the interests of what is best for
> > the FLOSS community and for users as a whole.  Adding in functionality
> > that sends off usage data from a command-line tool, especially one that
> > is as widely used as Git is, is not in the interests of users as a
> > whole, nor is it common practice in FLOSS tools.
> 
> I'm not sure if this last statement is true. For instance, many tools
> have crash-reporting functionality, and some contain more advanced
> telemetry. E.g.:
> 
>   https://wiki.mozilla.org/Telemetry
> 
> Personally I do not like this sort of data collection at all, and never
> enable it. And I would be highly suspicious of any FLOSS tool that
> shipped with it enabled by default. But it seems like at least some
> projects _do_ find it useful, and enough people enable it to make it
> worth carrying as a feature.

Re-reading what I wrote, I really want to emphasize something so that it
isn't missed. I think the important feature here is user consent, and
that seems to be the line that projects like Mozilla use (default to
off, and require explicit consent before sending anything off-host).

We don't have to use that line. I think "do not collect even locally" is
a reasonable line, because even collecting has performance and storage
management implications. And I don't think anybody is proposing to even
collect local telemetry without the user turning it on.

The line under discussion seems to be "implement code to collect
anything at all". That's a valid line, too, but IMHO it ends up blocking
useful features (and I think we already violate it with things like
GIT_TRACE!).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v1] telemetry design overview (part 1)
  2018-06-09  6:51             ` Jeff King
  2018-06-09  7:04               ` Johannes Sixt
@ 2018-06-12 16:04               ` Junio C Hamano
  1 sibling, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2018-06-12 16:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Thomas Braun, Jeff Hostetler, Git Mailing List, Jeff Hostetler

Jeff King <peff@peff.net> writes:

> In my experience the maintenance burden is not in the "connect to a
> socket" part, but the fact that you have to sprinkle the entry points
> throughout the code (e.g., "set 'cloning' flag to 1" or "I'm entering
> the pack generation phase of the fetch"). So I'd love to see us do that
> sprinkling _once_ where everyone can benefit, whether they want better
> tracing for debugging, telemetry across their installed base, or
> whatever.

I tend to agree.  Transport (or file) can stay outside the core of
this "telemetry" thing---agreeing on what and when to trace, and how
the trace is represented, and having an API and solid guideline,
would allow us to annotate the code just once and get useful data in
a consistent way.

> The mechanism to handle those calls is much easier to plug in and out,
> then. And I don't have a huge preference for compile-time versus
> run-time, or whether bits are in-tree or out-of-tree. Obviously we'd
> have some basic "write to stderr or a file" consumer in-tree.

If it makes readers fearful of big brother feel safer, I think we
probably can add code that is runtime controllable that is compiled
in conditionally ;-)

I do not quite buy "Big brothers who want this for their own in
house use case can afford to patch" argument, by the way.  If
anything, having a solid tracing mechanism (to which existing
GIT_TRACE support _should_ be able to serve as a starting point)
would mean small budget guys do not have to do their own investing
to collect useful data over their customer base (I am seeing an
analog in "popcon", opt-in stats of installed package in a distro,
which we can view as "telemetry data for distro installer program").

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-06-12 16:04 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07 14:53 [RFC PATCH v1] telemetry design overview (part 1) git
2018-06-07 14:53 ` [RFC PATCH v1] telemetry: design documenation git
2018-06-08 11:06   ` Ævar Arnfjörð Bjarmason
2018-06-07 21:10 ` [RFC PATCH v1] telemetry design overview (part 1) Johannes Sixt
2018-06-08  9:07   ` Jeff King
2018-06-08 16:00     ` Thomas Braun
2018-06-08 22:01       ` Johannes Sixt
2018-06-08 22:20         ` Ævar Arnfjörð Bjarmason
2018-06-09  5:03           ` Duy Nguyen
2018-06-09  6:31             ` Ævar Arnfjörð Bjarmason
2018-06-09  6:56               ` Jeff King
2018-06-09 20:05                 ` Johannes Schindelin
2018-06-11  5:56                   ` Jeff King
2018-06-09  7:31               ` Duy Nguyen
2018-06-09  6:51             ` Jeff King
2018-06-09  7:04               ` Johannes Sixt
2018-06-09  7:31                 ` Jeff King
2018-06-12 16:04               ` Junio C Hamano
2018-06-09  6:56           ` Johannes Sixt
2018-06-09 20:43             ` Johannes Schindelin
2018-06-09 22:44               ` Johannes Sixt
2018-06-11  6:08                 ` Jeff King
2018-06-10  0:00             ` brian m. carlson
2018-06-11  6:14               ` Jeff King
2018-06-11  8:30                 ` Jeff King
2018-06-08  9:40   ` Ævar Arnfjörð Bjarmason
2018-06-08 15:46     ` Duy Nguyen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).