git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* RFC: error codes on exit
@ 2021-05-19 23:34 Jonathan Nieder
  2021-05-20  0:40 ` Felipe Contreras
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Jonathan Nieder @ 2021-05-19 23:34 UTC (permalink / raw)
  To: git; +Cc: Josh Steadmon, Jeff King, Jeff Hostetler

Hi,

(Danger, jrn is wading into error handling again...)

At $DAYJOB we are setting up some alerting for some bot fleets and
developer workstations, using trace2 as the data source.  Having
trace2 has been great --- combined with gradual weekly rollouts of
"next", it helps us to understand quickly when a change is creating a
regression for users, which hopefully improves the quality of Git for
everyone.

One kind of signal we haven't been able to make good use of is error
rates.  The problem is that a die() call can be an indication of

 a. the user asked to do something that isn't sensible, and we kindly
    rebuked the user

 b. we contacted a server, and the server was not happy with our
    request

 c. the local Git repository is corrupt

 d. we ran out of resources (e.g., disk space)

 e. we encountered an internal error in handling the user's
    legitimate request

and these different cases do not all motivate the same response.
(E.g., if (c) affects just a single bot but produces a high error rate
from that bot, we shouldn't be alarmed; if (d) is happening on a bot,
then we should look into giving it more disk; if (e) is increasing
significantly during a rollout then we should roll back quickly.)

In order to do this, I would like to annotate "exit" events with a
classification of the error.  I'm not too opinionated about what that
classification looks like (bikeshedding welcome!) --- e.g., something
like the enumeration at
https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto
is likely to work fine.

(I'm particularly fond of how that maps to HTTP statuses.  See also
https://github.com/abseil/abseil-cpp/blob/HEAD/absl/status/status.h
for an example of using that kind of enumeration within a single
process.)

The API could look something like

	--- a/cache.h
	+++ b/cache.h
	@@ -590,6 +590,15 @@ int is_git_directory(const char *path);
	  */
	 int is_nonbare_repository_dir(struct strbuf *path);

	+enum git_error_code {
	+	/*
	+	 * Not an error (= HTTP 200)
	+	 */
	+	OK = 0,
	+};
	+NORETURN void fatal(enum git_error_code code, const char *err, ...)
	+	__attribute__((format (printf, 2, 3)));
	+
	 #define READ_GITFILE_ERR_STAT_FAILED 1
	 #define READ_GITFILE_ERR_NOT_A_FILE 2
	 #define READ_GITFILE_ERR_OPEN_FAILED 3

(with new error codes added when they first get used) and a typical
caller could look like

	Subject: xsize_t: tag "cannot handle files this big" as a failed precondition

	Unlike retriable errors, failed preconditions indicate that some
	aspect of the state needs to be changed in order to recover.  Mark
	this error as such to make signals from monitoring in controlled
	environments (e.g., bot fleets or corporate installations of Git)
	easier to understand.

	Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
[...]
	+       /*
	+        * The system is not in a state required for the operation to succeed.
	+        * For example, a file on disk is larger than we can handle.
	+        * (= HTTP 400)
	+        */
	+       FAILED_PRECONDITION = 9,
[...]
	 static inline size_t xsize_t(off_t len)
	 {
		if (len < 0 || len > SIZE_MAX)
	-               die("Cannot handle files this big");
	+               fatal(FAILED_PRECONDITION, "Cannot handle files this big");

Further down the line I can imagine making use of git_error_code
elsewhere for e.g. some limited retries of the corresponding
transaction when we fail to lock a file.

Thoughts?  Good idea?  Bad idea?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
@ 2021-05-20  0:40 ` Felipe Contreras
  2021-05-21 16:53   ` Alex Henrie
  2021-05-20  0:49 ` Junio C Hamano
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Felipe Contreras @ 2021-05-20  0:40 UTC (permalink / raw)
  To: Jonathan Nieder, git; +Cc: Josh Steadmon, Jeff King, Jeff Hostetler

Jonathan Nieder wrote:
> The API could look something like
> 
> 	--- a/cache.h
> 	+++ b/cache.h
> 	@@ -590,6 +590,15 @@ int is_git_directory(const char *path);
> 	  */
> 	 int is_nonbare_repository_dir(struct strbuf *path);
> 
> 	+enum git_error_code {
> 	+	/*
> 	+	 * Not an error (= HTTP 200)
> 	+	 */
> 	+	OK = 0,

It's good to not include many initial codes, but I would start with at
least three:

  OK = 0,
  UNKNOWN = 1,
  NORMAL = 2,

die() could be mapped to UNKNOWN.

> 	+};
> 	+NORETURN void fatal(enum git_error_code code, const char *err, ...)
> 	+	__attribute__((format (printf, 2, 3)));
> 	+

fatal() for me sounds 1) very dramatic, 2), not a verb, 3) and not a
complete thing (fatal what?)

I would prefer "fail", or "fault", or anything that is a verb.

> Thoughts?  Good idea?  Bad idea?

Great idea.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
  2021-05-20  0:40 ` Felipe Contreras
@ 2021-05-20  0:49 ` Junio C Hamano
  2021-05-20  1:19   ` Felipe Contreras
  2021-05-20  1:55   ` Jonathan Nieder
  2021-05-20 13:28 ` Jeff King
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2021-05-20  0:49 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler

Jonathan Nieder <jrnieder@gmail.com> writes:

> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
>
>  a. the user asked to do something that isn't sensible, and we kindly
>     rebuked the user
> ...
>  e. we encountered an internal error in handling the user's
>     legitimate request
>
> and these different cases do not all motivate the same response.
> ...
> In order to do this, I would like to annotate "exit" events with a
> classification of the error.

We already have BUG() for e. and die() for everything else, and
"everything else" may be overly broad for your purpose.

I am sympathetic to the cause and I agree that introducing a
finer-grained classification might be a solution.  I however am not
sure how we can enforce developers to apply such a manually assigned
"error code" cosistently.

Just to throw in a totally different alternative to see if it works
better, I wonder if you can teach die() to report to the trace2
stream where in the code it was called from and which vintage of Git
it is running.

The stat collection side that cares about certain class of failures
can have function that maps "die() at <filename>:<lineno>@<version>"
to "what kind of die() it is".  

E.g.  blame.c:50@v2.32.0-rc0-184-gbbde7e6616" may be BUG(), while
blame.c:2740@v2.32.0-rc0-184-gbbde7e6616 may be an user-error.

That way, our developers do not have to do anything special and
cannot do anything to screw up the classification.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20  0:49 ` Junio C Hamano
@ 2021-05-20  1:19   ` Felipe Contreras
  2021-05-20  1:55   ` Jonathan Nieder
  1 sibling, 0 replies; 26+ messages in thread
From: Felipe Contreras @ 2021-05-20  1:19 UTC (permalink / raw)
  To: Junio C Hamano, Jonathan Nieder
  Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

> > In order to do this, I would like to annotate "exit" events with a
> > classification of the error.
> 
> We already have BUG() for e. and die() for everything else, and
> "everything else" may be overly broad for your purpose.

BUG() and die() can call fatal().

> I am sympathetic to the cause and I agree that introducing a
> finer-grained classification might be a solution.  I however am not
> sure how we can enforce developers to apply such a manually assigned
> "error code" cosistently.

You don't enforce developers to do this--just like you don't enforce
developers to use advice() instead of fprintf(stderr, )... You nudge
them in that direction, and eventually it becomes a habit.

Developers in other languages and stacks have no problem with this
granularity. They do this in languages like JavaScript, C++, Ruby and
Python regularlly. And developers dealing with HTTP have no trouble with
error codes (like 200, 400, and 404).

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20  0:49 ` Junio C Hamano
  2021-05-20  1:19   ` Felipe Contreras
@ 2021-05-20  1:55   ` Jonathan Nieder
  2021-05-20  2:28     ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jonathan Nieder @ 2021-05-20  1:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler

Hi,

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>> In order to do this, I would like to annotate "exit" events with a
>> classification of the error.
>
> We already have BUG() for e. and die() for everything else, and
> "everything else" may be overly broad for your purpose.
>
> I am sympathetic to the cause and I agree that introducing a
> finer-grained classification might be a solution.  I however am not
> sure how we can enforce developers to apply such a manually assigned
> "error code" cosistently.

I think two things you're hinting at are "what about maintainability?"
and "what is the migration path?"

I suspect that a number of error paths will remain unclassified for a
long time, possibly indefinitely.  The way I've seen this treated in
other tools is that it's okay for something to show up as an INTERNAL
error if it doesn't happen frequently: sure, that can cause us a bit
of unnecessary worry when it starts occuring more often, but at that
point we're in a good place to replace it with something more
appropriate.

That means that we would still want to keep die() or some equivalent.
That in turn might suggest that the API I suggested is overly verbose;
it might make sense to have a different die()-style helper for each
type of error, matching what we do with die() and BUG().

Side note: you might wonder why keeping die() would even be a
question.  For example, there are all the outstanding patches that
still use die(); changing such a fundamental API would seem to be a
nonstarter.  Fortunately, though, the tools in
contrib/coccinelle/README allow changing an API in three steps:

 1. Introduce the new API.  Keep the old API around for backward
    compatibility.

 2. Add a "pending" coccinelle semantic patch to automatically
    update callers to the new API.  Update existing callers using
    'make coccicheck-pending'.

 3. Remove the old API and mark the semantic patch as no longer
    pending.  Patches using the old API can be fixed using 'make
    coccicheck'

So we can make this decision based on whether the resulting API is one
we like more; in this example, I suspect that keeping die() is
preferable _even though_ it would be possible to remove by staying in
step 2 for a while without too much fuss.

> Just to throw in a totally different alternative to see if it works
> better, I wonder if you can teach die() to report to the trace2
> stream where in the code it was called from and which vintage of Git
> it is running.
>
> The stat collection side that cares about certain class of failures
> can have function that maps "die() at <filename>:<lineno>@<version>"
> to "what kind of die() it is".
>
> E.g.  blame.c:50@v2.32.0-rc0-184-gbbde7e6616" may be BUG(), while
> blame.c:2740@v2.32.0-rc0-184-gbbde7e6616 may be an user-error.

For ad hoc queries, this is a rather nice tool.  Traces already record
filename, version, and line number, though I believe in the die() case
it currently just points to the implementation of die(). ;-)

However, for analysis in aggregate (for example, to define an SLO[1])
that would require us to maintain a database that maps
<filename>:<lineno> to error code.  That database would be essentially
the same as patches to record the error codes, so what it would really
amount to is having these deployments using a permanent fork of Git.
It would also get rid of the chance to discuss and improve common
error paths on-list.

If we expect the error codes to not be useful to anyone else, then
that is the right choice to make (or rather, we'd have to use other
heuristics, such as having the traces record a collection of offsets
in the binary and a build-id so we can key off of stack trace
signatures).  Part of the reason I started this thread is to get a
sense of whether these can be useful to others.

Thanks,
Jonathan

[1] https://sre.google/sre-book/service-level-objectives/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20  1:55   ` Jonathan Nieder
@ 2021-05-20  2:28     ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2021-05-20  2:28 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler

Jonathan Nieder <jrnieder@gmail.com> writes:

>> I am sympathetic to the cause and I agree that introducing a
>> finer-grained classification might be a solution.  I however am not
>> sure how we can enforce developers to apply such a manually assigned
>> "error code" cosistently.
>
> I think two things you're hinting at are "what about maintainability?"
> and "what is the migration path?"

Not really.  There is not much to migrate from, and mislabeling by
developers, or more realistically disagreement between developers
what label is appropriate for which condition that leads to die(),
would be a usability problem and not maintainability that I would be
worried about.

> However, for analysis in aggregate (for example, to define an SLO[1])
> that would require us to maintain a database that maps
> <filename>:<lineno> to error code.  That database would be essentially
> the same as patches to record the error codes, so what it would really
> amount to is having these deployments using a permanent fork of Git.

Yes, the suggestion was to start from there to gain experience,
because ...

> If we expect the error codes to not be useful to anyone else, then
> that is the right choice to make (or rather, we'd have to use other
> heuristics, such as having the traces record a collection of offsets
> in the binary and a build-id so we can key off of stack trace
> signatures).  Part of the reason I started this thread is to get a
> sense of whether these can be useful to others.

... others will find it useful if the classification matches _their_
needs, but I suspect the "bin" a single die() location wants to be
classfied into would end up to be different depending on what the
log collecting entity is after.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
  2021-05-20  0:40 ` Felipe Contreras
  2021-05-20  0:49 ` Junio C Hamano
@ 2021-05-20 13:28 ` Jeff King
  2021-05-20 17:47   ` Jonathan Nieder
  2021-05-20 15:09 ` Jeff Hostetler
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2021-05-20 13:28 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff Hostetler

On Wed, May 19, 2021 at 04:34:24PM -0700, Jonathan Nieder wrote:

> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
> 
>  a. the user asked to do something that isn't sensible, and we kindly
>     rebuked the user
> 
>  b. we contacted a server, and the server was not happy with our
>     request
> 
>  c. the local Git repository is corrupt
> 
>  d. we ran out of resources (e.g., disk space)
> 
>  e. we encountered an internal error in handling the user's
>     legitimate request

I've run into this problem, too. If you run a website that runs Git
commands on behalf of users and try to get metrics on failing exit
codes, it's hard to tell the difference between "the repo is broken",
"Git has a bug", "the user (or other caller) asked for something
stupid", and "some transient error occurred".

But I'm not sure that even Git can always tell the difference between
those things. Some real-world examples I've run into:

  - "rev-list $oid" can't find object $oid. Is the repo corrupt? Or is
    the caller unreasonable to ask for that object? Or was there a race
    or other transient error which made the object invisible?

  - upload-pack is writing out a packfile, but gets EPIPE. Did the
    network drop out? Or is a Git bug causing one side to break
    protocol?

Some rough categorization may help, but a lot of those need to propagate
the specific errors back to the caller. For instance, the rev-list
example could be FAILED_PRECONDITION in your terminology. But really, we
want to tell the caller "the object you asked for doesn't exist". And
then it can decide if that was user error (somebody hitting a URL for an
object that we have no reason to think exists), or a sign of problems
elsewhere in the system (if we just got $oid from Git, we expect it to
be there).

So it seems like the most useful thing is specific error codes for
specific cases. And that gets very daunting to think about annotating
and communicating about each such case (we don't even pass that level of
detailed information inside the program in a machine-readable way;
scraping stderr is the best way to figure this stuff out now).

I dunno. Maybe a rougher categorization would help your case, but not
mine. But I'm a bit skeptical that we'll have enough coverage of various
conditions to be useful, and that it won't turn into a headache trying
to categorize everything.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
                   ` (2 preceding siblings ...)
  2021-05-20 13:28 ` Jeff King
@ 2021-05-20 15:09 ` Jeff Hostetler
  2021-05-21  1:33   ` brian m. carlson
  2021-05-21  1:20 ` brian m. carlson
  2021-05-26  8:21 ` Ævar Arnfjörð Bjarmason
  5 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2021-05-20 15:09 UTC (permalink / raw)
  To: Jonathan Nieder, git; +Cc: Josh Steadmon, Jeff King, Jeff Hostetler



On 5/19/21 7:34 PM, Jonathan Nieder wrote:
> Hi,
> 
> (Danger, jrn is wading into error handling again...)
> 
> At $DAYJOB we are setting up some alerting for some bot fleets and
> developer workstations, using trace2 as the data source.  Having
> trace2 has been great --- combined with gradual weekly rollouts of
> "next", it helps us to understand quickly when a change is creating a
> regression for users, which hopefully improves the quality of Git for
> everyone.
> 
> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
> 
>   a. the user asked to do something that isn't sensible, and we kindly
>      rebuked the user
> 
>   b. we contacted a server, and the server was not happy with our
>      request
> 
>   c. the local Git repository is corrupt
> 
>   d. we ran out of resources (e.g., disk space)
> 
>   e. we encountered an internal error in handling the user's
>      legitimate request
...

For the error event that `error()` and `die()` and friends generate,
I emit both the fully formatted error message and the format string.

The latter, if used as a dictionary key, would let you group like
events from different processes without worrying about the filename
or blob id or remote name or etc. in any one particular instance.

Would that be sufficient as an error classification and something
that you can key off of in your post-processing ?

Granted the same format message might be used in multiple places in
the source, but I also provide the source filename and line number.

If it turns out that all of the error events come from "usage.c"
(i.e. error_builtin() or die_builtin()), then maybe we need to look
at another way of wrapping those calls to pass the F/L of actual
caller.  I hesitated to do that because of the existing indirection
tricks in usage.c WRT the `set_error_routine()` and friends.
(And that assumes that the format string is a viable solution for
you problem.)

Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20 13:28 ` Jeff King
@ 2021-05-20 17:47   ` Jonathan Nieder
  2021-05-21  9:43     ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Nieder @ 2021-05-20 17:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Josh Steadmon, Jeff Hostetler

Hi,

Jeff King wrote:
> On Wed, May 19, 2021 at 04:34:24PM -0700, Jonathan Nieder wrote:

>> One kind of signal we haven't been able to make good use of is error
>> rates.  The problem is that a die() call can be an indication of
[... list of some categories snipped ...]
> I've run into this problem, too. If you run a website that runs Git
> commands on behalf of users and try to get metrics on failing exit
> codes, it's hard to tell the difference between "the repo is broken",
> "Git has a bug", "the user (or other caller) asked for something
> stupid", and "some transient error occurred".
>
> But I'm not sure that even Git can always tell the difference between
> those things. Some real-world examples I've run into:
>
>   - "rev-list $oid" can't find object $oid. Is the repo corrupt? Or is
>     the caller unreasonable to ask for that object? Or was there a race
>     or other transient error which made the object invisible?
>
>   - upload-pack is writing out a packfile, but gets EPIPE. Did the
>     network drop out? Or is a Git bug causing one side to break
>     protocol?
>
> Some rough categorization may help, but a lot of those need to propagate
> the specific errors back to the caller. For instance, the rev-list
> example could be FAILED_PRECONDITION in your terminology. But really, we
> want to tell the caller "the object you asked for doesn't exist". And
> then it can decide if that was user error (somebody hitting a URL for an
> object that we have no reason to think exists), or a sign of problems
> elsewhere in the system (if we just got $oid from Git, we expect it to
> be there).
>
> So it seems like the most useful thing is specific error codes for
> specific cases.

As a bit of precedent: in the server we have both of these: we have
application-specific error codes like AUTHENTICATOR_EXPIRED, and we
have generic codes that those map to like UNAUTHENTICATED.  The
application-specific codes tend to be useful for ad hoc queries as
part of incident response, versus the generic codes that have been
more useful for defining an SLO (because they are about "how do I want
to respond to this error" instead of about the cause).

More specifically for the "missing object" case: it's common enough
for a user to ask for an object that doesn't exist that we indeed call
it FAILED_PRECONDITION, which has worked well.  We have other
monitoring in place for checking that all repositories pass fsck and
that fetching an object after pushing it succeeds.  (In general, this
kind of case is very common in monitoring any service that has state.)

>                 And that gets very daunting to think about annotating
> and communicating about each such case (we don't even pass that level of
> detailed information inside the program in a machine-readable way;
> scraping stderr is the best way to figure this stuff out now).

This feels like good news to me: it sounds like if we add
application-specific codes like MISSING_OBJECT to Git, then it would
be useful to both of us.

The mapping to HTTP-status-style generic codes could then wait for
later, to be submitted if and when others have interest.  (I.e., that
part is easy to keep maintained internally.)

So I'm feeling encouraged. :)

> I dunno. Maybe a rougher categorization would help your case, but not
> mine. But I'm a bit skeptical that we'll have enough coverage of various
> conditions to be useful, and that it won't turn into a headache trying
> to categorize everything.

Two more points I want to emphasize:

 1. We don't have to be exhaustive: as Felipe suggested, it's fine for
    some errors (even most error paths!) to use a code such as
    UNKNOWN.  I care more about coverage of commonly occuring errors
    than categorizing everything, especially because this sets up a
    feedback loop that can lead to improved coverage over time.

 2. By focusing on the practical and ignoring everything else, I think
    we can avoid this becoming an unbounded taxonomy exercise.  That's
    part of the appeal of code.proto /
    https://github.com/abseil/abseil-cpp/blob/HEAD/absl/status/status.h
    for me: by using a preexisting list of codes based on "here is
    what a user would be expected to do in response to this error",
    they make the error classification decision relatively simple.  I
    think we can maintain that kind of simplicity with a Git-specific
    enum, too, so I think this is doable (and I'd make sure to be
    available over time to help answer questions about the
    classification as the project gets used to it).

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
                   ` (3 preceding siblings ...)
  2021-05-20 15:09 ` Jeff Hostetler
@ 2021-05-21  1:20 ` brian m. carlson
  2021-05-26  8:21 ` Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 26+ messages in thread
From: brian m. carlson @ 2021-05-21  1:20 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 2467 bytes --]

On 2021-05-19 at 23:34:24, Jonathan Nieder wrote:
> Hi,
> 
> (Danger, jrn is wading into error handling again...)
> 
> At $DAYJOB we are setting up some alerting for some bot fleets and
> developer workstations, using trace2 as the data source.  Having
> trace2 has been great --- combined with gradual weekly rollouts of
> "next", it helps us to understand quickly when a change is creating a
> regression for users, which hopefully improves the quality of Git for
> everyone.
> 
> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
> 
>  a. the user asked to do something that isn't sensible, and we kindly
>     rebuked the user
> 
>  b. we contacted a server, and the server was not happy with our
>     request
> 
>  c. the local Git repository is corrupt
> 
>  d. we ran out of resources (e.g., disk space)
> 
>  e. we encountered an internal error in handling the user's
>     legitimate request
> 
> and these different cases do not all motivate the same response.
> (E.g., if (c) affects just a single bot but produces a high error rate
> from that bot, we shouldn't be alarmed; if (d) is happening on a bot,
> then we should look into giving it more disk; if (e) is increasing
> significantly during a rollout then we should roll back quickly.)

In general, I'm in favor of adding some sort of error code here.  Even
though I don't normally use trace2, I think there's a lot of benefit to
having a standardized set of error codes, and this seems like as good a
place as any to introduce them.

A future iteration of this might look like us returning a negative error
code from a function instead of -1 for us to signal to the caller that a
particular error case occurred.  We need not implement that now, of
course, but I bring it up in case we want to accommodate that in our
design now for future us.

I do agree with Peff that this may not necessarily provide all of the
insight you want, since it can be hard to distinguish why the error
occurred.  For example, in Git LFS, we sometimes will pass objects we
don't have to git rev-list (with --missing) and that's completely
expected, whereas a missing object with git fsck would generally be
cause for alarm.  Provided you're comfortable with some ambiguity, I
think this would be a nice improvement.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20 15:09 ` Jeff Hostetler
@ 2021-05-21  1:33   ` brian m. carlson
  0 siblings, 0 replies; 26+ messages in thread
From: brian m. carlson @ 2021-05-21  1:33 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Jonathan Nieder, git, Josh Steadmon, Jeff King, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]

On 2021-05-20 at 15:09:44, Jeff Hostetler wrote:
> For the error event that `error()` and `die()` and friends generate,
> I emit both the fully formatted error message and the format string.
> 
> The latter, if used as a dictionary key, would let you group like
> events from different processes without worrying about the filename
> or blob id or remote name or etc. in any one particular instance.
> 
> Would that be sufficient as an error classification and something
> that you can key off of in your post-processing ?

I don't think that's going to be sufficient.  Many calls to die()
contain a translated string.  I am a native Anglophone and work in a
company where English is the sole language of communication, but I have
my computer configured in French, and I have had it configured in
Spanish as well.

It's totally possible that one of my colleagues who has a non-English
native language might be using a different language as well, so it would
be difficult to reliably map the format string into a fixed error case
in a typical corporate setting, since many languages might be in use.

> Granted the same format message might be used in multiple places in
> the source, but I also provide the source filename and line number.

The source filename and line number would be more helpful, but
inconveniently, people frequently change the code of Git, so the line
numbers aren't always stable over time.  So I think an error code would
be helpful nevertheless.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20 17:47   ` Jonathan Nieder
@ 2021-05-21  9:43     ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2021-05-21  9:43 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff Hostetler

On Thu, May 20, 2021 at 10:47:38AM -0700, Jonathan Nieder wrote:

> >                 And that gets very daunting to think about annotating
> > and communicating about each such case (we don't even pass that level of
> > detailed information inside the program in a machine-readable way;
> > scraping stderr is the best way to figure this stuff out now).
> 
> This feels like good news to me: it sounds like if we add
> application-specific codes like MISSING_OBJECT to Git, then it would
> be useful to both of us.

Perhaps. I think the context matters between "missing an object from the
command line" and "missing an object I expected to find while
traversing". And I'm not sure all spots which look up an object will
know that context.

In some sense that's "just" a programming problem; surfacing the errors
to the right spot that can decide how to exit. But I worry a bit that
it's fighting uphill against the current code structure. There's
probably going to be a period where MISSING_OBJECT versus UNKNOWN is
wildly inaccurate, and a long tail of cases to fix.

Erring to say "UNKNOWN" is probably OK for most callers (they are happy
to learn of a specific error and act on it appropriately, but if Git
can't tell it to them, they have a generic path). But erring in the
other direction might be bad (you fail to realize a repo is corrupt, and
instead attribute it to caller error).

So again, I return "I dunno". Something of this magnitude probably has
to be done incrementally and over time. But I'd be loathe to trust it
and convert existing callers use it for a while. And that creates a
chicken-and-egg problem for finding the places which need improvement.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-20  0:40 ` Felipe Contreras
@ 2021-05-21 16:53   ` Alex Henrie
  2021-05-21 23:20     ` H. Peter Anvin
  2021-05-22  9:12     ` Philip Oakley
  0 siblings, 2 replies; 26+ messages in thread
From: Alex Henrie @ 2021-05-21 16:53 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Jonathan Nieder, Git mailing list, Josh Steadmon, Jeff King,
	Jeff Hostetler

On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
<felipe.contreras@gmail.com> wrote:
>
> It's good to not include many initial codes, but I would start with at
> least three:
>
>   OK = 0,
>   UNKNOWN = 1,
>   NORMAL = 2,

If you go that route, could you please pick a word other than "normal"
to describe errors that are not entirely unexpected? I'm worried that
someone will see "normal" and use it instead of "OK" to indicate
success.

-Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-21 16:53   ` Alex Henrie
@ 2021-05-21 23:20     ` H. Peter Anvin
  2021-05-22  4:06       ` Bagas Sanjaya
  2021-05-22  8:49       ` Junio C Hamano
  2021-05-22  9:12     ` Philip Oakley
  1 sibling, 2 replies; 26+ messages in thread
From: H. Peter Anvin @ 2021-05-21 23:20 UTC (permalink / raw)
  To: Alex Henrie, Felipe Contreras
  Cc: Jonathan Nieder, Git mailing list, Josh Steadmon, Jeff King,
	Jeff Hostetler



On 5/21/21 9:53 AM, Alex Henrie wrote:
> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
>>
>> It's good to not include many initial codes, but I would start with at
>> least three:
>>
>>    OK = 0,
>>    UNKNOWN = 1,
>>    NORMAL = 2,
> 
> If you go that route, could you please pick a word other than "normal"
> to describe errors that are not entirely unexpected? I'm worried that
> someone will see "normal" and use it instead of "OK" to indicate
> success.
> 

<sysexits.h>

	-hpa


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-21 23:20     ` H. Peter Anvin
@ 2021-05-22  4:06       ` Bagas Sanjaya
  2021-05-22  8:49       ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Bagas Sanjaya @ 2021-05-22  4:06 UTC (permalink / raw)
  To: H. Peter Anvin, Alex Henrie, Felipe Contreras
  Cc: Jonathan Nieder, Git mailing list, Josh Steadmon, Jeff King,
	Jeff Hostetler

On 22/05/21 06.20, H. Peter Anvin wrote:
> 
> <sysexits.h>
> 
>      -hpa
> 
Looking that header file you mention, I saw:

> #define EX_OK		0	/* successful termination */
> 
> #define EX__BASE	64	/* base value for error messages */
> 
> #define EX_USAGE	64	/* command line usage error */
> #define EX_DATAERR	65	/* data format error */
> #define EX_NOINPUT	66	/* cannot open input */
> #define EX_NOUSER	67	/* addressee unknown */
> #define EX_NOHOST	68	/* host name unknown */
> #define EX_UNAVAILABLE	69	/* service unavailable */
> #define EX_SOFTWARE	70	/* internal software error */
> #define EX_OSERR	71	/* system error (e.g., can't fork) */
> #define EX_OSFILE	72	/* critical OS file missing */
> #define EX_CANTCREAT	73	/* can't create (user) output file */
> #define EX_IOERR	74	/* input/output error */
> #define EX_TEMPFAIL	75	/* temp failure; user is invited to retry */
> #define EX_PROTOCOL	76	/* remote error in protocol */
> #define EX_NOPERM	77	/* permission denied */
> #define EX_CONFIG	78	/* configuration error */

For EX_USAGE case, we may sometimes display correct usage syntax so that
users can fix their typing.

We may use EX_CONFIG when we encounter any errors when parsing .gitconfig.

EX_OSFILE isn't necessary for Git, though.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-21 23:20     ` H. Peter Anvin
  2021-05-22  4:06       ` Bagas Sanjaya
@ 2021-05-22  8:49       ` Junio C Hamano
  2021-05-22  9:08         ` H. Peter Anvin
  2021-05-22 21:22         ` Felipe Contreras
  1 sibling, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2021-05-22  8:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alex Henrie, Felipe Contreras, Jonathan Nieder, Git mailing list,
	Josh Steadmon, Jeff King, Jeff Hostetler

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 5/21/21 9:53 AM, Alex Henrie wrote:
>> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
>> <felipe.contreras@gmail.com> wrote:
>>>
>>> It's good to not include many initial codes, but I would start with at
>>> least three:
>>>
>>>    OK = 0,
>>>    UNKNOWN = 1,
>>>    NORMAL = 2,
>> If you go that route, could you please pick a word other than
>> "normal"
>> to describe errors that are not entirely unexpected? I'm worried that
>> someone will see "normal" and use it instead of "OK" to indicate
>> success.
>> 
>
> <sysexits.h>

Is the value assignment standardized across systems?

We want human-readable names in the source to help developers while
we want platform neutral output in the log so that log collectors
can do some "intelligent" things about the output.  If EX_USAGE is
always 64 everywhere, that is great---we can emit "64" in the log
and log collectors can take it as if they saw "EX_USAGE".  But if
the value assignment is platform-dependent, it does not help all
that much.

    Side note.  We had a similar discussion on <errno.h> and
    strerror(); the numbers do not help without knowing which
    platform the error came from, and strerror() output is localized
    and not suitable for machine consumption.

In a sense, it is worse than we keep a central mapping between names
programmers use to give to the new fatal() helper function and the
string the tracing machinery will emit for these names.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22  8:49       ` Junio C Hamano
@ 2021-05-22  9:08         ` H. Peter Anvin
  2021-05-22 21:22         ` Felipe Contreras
  1 sibling, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2021-05-22  9:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Alex Henrie, Felipe Contreras, Jonathan Nieder, Git mailing list,
	Josh Steadmon, Jeff King, Jeff Hostetler

sysexits.h numbers are the same across all platforms which have it, I believe. I think it originated with Sendmail wanting to know a bit more about why any subprocesses exited.

On May 22, 2021 1:49:09 AM PDT, Junio C Hamano <gitster@pobox.com> wrote:
>"H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 5/21/21 9:53 AM, Alex Henrie wrote:
>>> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
>>> <felipe.contreras@gmail.com> wrote:
>>>>
>>>> It's good to not include many initial codes, but I would start with
>at
>>>> least three:
>>>>
>>>>    OK = 0,
>>>>    UNKNOWN = 1,
>>>>    NORMAL = 2,
>>> If you go that route, could you please pick a word other than
>>> "normal"
>>> to describe errors that are not entirely unexpected? I'm worried
>that
>>> someone will see "normal" and use it instead of "OK" to indicate
>>> success.
>>> 
>>
>> <sysexits.h>
>
>Is the value assignment standardized across systems?
>
>We want human-readable names in the source to help developers while
>we want platform neutral output in the log so that log collectors
>can do some "intelligent" things about the output.  If EX_USAGE is
>always 64 everywhere, that is great---we can emit "64" in the log
>and log collectors can take it as if they saw "EX_USAGE".  But if
>the value assignment is platform-dependent, it does not help all
>that much.
>
>    Side note.  We had a similar discussion on <errno.h> and
>    strerror(); the numbers do not help without knowing which
>    platform the error came from, and strerror() output is localized
>    and not suitable for machine consumption.
>
>In a sense, it is worse than we keep a central mapping between names
>programmers use to give to the new fatal() helper function and the
>string the tracing machinery will emit for these names.
>
>Thanks.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-21 16:53   ` Alex Henrie
  2021-05-21 23:20     ` H. Peter Anvin
@ 2021-05-22  9:12     ` Philip Oakley
  2021-05-22 21:19       ` Felipe Contreras
  1 sibling, 1 reply; 26+ messages in thread
From: Philip Oakley @ 2021-05-22  9:12 UTC (permalink / raw)
  To: Alex Henrie, Felipe Contreras
  Cc: Jonathan Nieder, Git mailing list, Josh Steadmon, Jeff King,
	Jeff Hostetler

On 21/05/2021 17:53, Alex Henrie wrote:
> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
>> It's good to not include many initial codes, but I would start with at
>> least three:
>>
>>   OK = 0,
>>   UNKNOWN = 1,
>>   NORMAL = 2,
> If you go that route, could you please pick a word other than "normal"
> to describe errors that are not entirely unexpected? I'm worried that
> someone will see "normal" and use it instead of "OK" to indicate
> success.
>
> -Alex
Typical <== Normal

Though abnormal and atypical often have different implications ;-)
P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22  9:12     ` Philip Oakley
@ 2021-05-22 21:19       ` Felipe Contreras
  2021-05-25 17:24         ` Alex Henrie
  0 siblings, 1 reply; 26+ messages in thread
From: Felipe Contreras @ 2021-05-22 21:19 UTC (permalink / raw)
  To: Philip Oakley, Alex Henrie, Felipe Contreras
  Cc: Jonathan Nieder, Git mailing list, Josh Steadmon, Jeff King,
	Jeff Hostetler

Philip Oakley wrote:
> On 21/05/2021 17:53, Alex Henrie wrote:
> > On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
> > <felipe.contreras@gmail.com> wrote:
> >> It's good to not include many initial codes, but I would start with at
> >> least three:
> >>
> >>   OK = 0,
> >>   UNKNOWN = 1,
> >>   NORMAL = 2,
> > If you go that route, could you please pick a word other than "normal"
> > to describe errors that are not entirely unexpected? I'm worried that
> > someone will see "normal" and use it instead of "OK" to indicate
> > success.
> >
> > -Alex
> Typical <== Normal
> 
> Though abnormal and atypical often have different implications ;-)
> P.

Or USUAL.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22  8:49       ` Junio C Hamano
  2021-05-22  9:08         ` H. Peter Anvin
@ 2021-05-22 21:22         ` Felipe Contreras
  2021-05-22 21:29           ` H. Peter Anvin
  1 sibling, 1 reply; 26+ messages in thread
From: Felipe Contreras @ 2021-05-22 21:22 UTC (permalink / raw)
  To: Junio C Hamano, H. Peter Anvin
  Cc: Alex Henrie, Felipe Contreras, Jonathan Nieder, Git mailing list,
	Josh Steadmon, Jeff King, Jeff Hostetler

Junio C Hamano wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> 
> > On 5/21/21 9:53 AM, Alex Henrie wrote:
> >> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
> >> <felipe.contreras@gmail.com> wrote:
> >>>
> >>> It's good to not include many initial codes, but I would start with at
> >>> least three:
> >>>
> >>>    OK = 0,
> >>>    UNKNOWN = 1,
> >>>    NORMAL = 2,
> >> If you go that route, could you please pick a word other than
> >> "normal"
> >> to describe errors that are not entirely unexpected? I'm worried that
> >> someone will see "normal" and use it instead of "OK" to indicate
> >> success.
> >> 
> >
> > <sysexits.h>
> 
> Is the value assignment standardized across systems?

I think his intention was to suggest to use that list as inspiration...
As in have USAGE, NOINPUT, UNAVAILABE, etc.

I would prefer to start with something easy... UNKNOWN = 1, USUAL = 2.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22 21:22         ` Felipe Contreras
@ 2021-05-22 21:29           ` H. Peter Anvin
  2021-05-22 21:53             ` Felipe Contreras
  0 siblings, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2021-05-22 21:29 UTC (permalink / raw)
  To: Felipe Contreras, Junio C Hamano
  Cc: Alex Henrie, Jonathan Nieder, Git mailing list, Josh Steadmon,
	Jeff King, Jeff Hostetler

No, please use the standardized numbers when they apply.

On May 22, 2021 2:22:42 PM PDT, Felipe Contreras <felipe.contreras@gmail.com> wrote:
>Junio C Hamano wrote:
>> "H. Peter Anvin" <hpa@zytor.com> writes:
>> 
>> > On 5/21/21 9:53 AM, Alex Henrie wrote:
>> >> On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
>> >> <felipe.contreras@gmail.com> wrote:
>> >>>
>> >>> It's good to not include many initial codes, but I would start
>with at
>> >>> least three:
>> >>>
>> >>>    OK = 0,
>> >>>    UNKNOWN = 1,
>> >>>    NORMAL = 2,
>> >> If you go that route, could you please pick a word other than
>> >> "normal"
>> >> to describe errors that are not entirely unexpected? I'm worried
>that
>> >> someone will see "normal" and use it instead of "OK" to indicate
>> >> success.
>> >> 
>> >
>> > <sysexits.h>
>> 
>> Is the value assignment standardized across systems?
>
>I think his intention was to suggest to use that list as inspiration...
>As in have USAGE, NOINPUT, UNAVAILABE, etc.
>
>I would prefer to start with something easy... UNKNOWN = 1, USUAL = 2.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22 21:29           ` H. Peter Anvin
@ 2021-05-22 21:53             ` Felipe Contreras
  2021-05-22 23:02               ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Felipe Contreras @ 2021-05-22 21:53 UTC (permalink / raw)
  To: H. Peter Anvin, Felipe Contreras, Junio C Hamano
  Cc: Alex Henrie, Jonathan Nieder, Git mailing list, Josh Steadmon,
	Jeff King, Jeff Hostetler

H. Peter Anvin wrote:
> No, please use the standardized numbers when they apply.

I wasn't talking about the numbers, but the names.

Do you see something wrong with USAGE = EX__USAGE?

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22 21:53             ` Felipe Contreras
@ 2021-05-22 23:02               ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2021-05-22 23:02 UTC (permalink / raw)
  To: Felipe Contreras, Junio C Hamano
  Cc: Alex Henrie, Jonathan Nieder, Git mailing list, Josh Steadmon,
	Jeff King, Jeff Hostetler

Why avoid the standard symbols?

On May 22, 2021 2:53:21 PM PDT, Felipe Contreras <felipe.contreras@gmail.com> wrote:
>H. Peter Anvin wrote:
>> No, please use the standardized numbers when they apply.
>
>I wasn't talking about the numbers, but the names.
>
>Do you see something wrong with USAGE = EX__USAGE?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-22 21:19       ` Felipe Contreras
@ 2021-05-25 17:24         ` Alex Henrie
  2021-05-25 18:43           ` Felipe Contreras
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Henrie @ 2021-05-25 17:24 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Philip Oakley, Jonathan Nieder, Git mailing list, Josh Steadmon,
	Jeff King, Jeff Hostetler

On Sat, May 22, 2021 at 3:19 PM Felipe Contreras
<felipe.contreras@gmail.com> wrote:
>
> Philip Oakley wrote:
> > On 21/05/2021 17:53, Alex Henrie wrote:
> > > On Wed, May 19, 2021 at 6:40 PM Felipe Contreras
> > > <felipe.contreras@gmail.com> wrote:
> > >> It's good to not include many initial codes, but I would start with at
> > >> least three:
> > >>
> > >>   OK = 0,
> > >>   UNKNOWN = 1,
> > >>   NORMAL = 2,
> > > If you go that route, could you please pick a word other than "normal"
> > > to describe errors that are not entirely unexpected? I'm worried that
> > > someone will see "normal" and use it instead of "OK" to indicate
> > > success.
> > >
> > > -Alex
> > Typical <== Normal
> >
> > Though abnormal and atypical often have different implications ;-)
> > P.
>
> Or USUAL.

The words "typical" and "usual" have the same problem of making it
sound like there was no error. I would suggest terms like "user
error", "network error", etc. instead.

-Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-25 17:24         ` Alex Henrie
@ 2021-05-25 18:43           ` Felipe Contreras
  0 siblings, 0 replies; 26+ messages in thread
From: Felipe Contreras @ 2021-05-25 18:43 UTC (permalink / raw)
  To: Alex Henrie, Felipe Contreras
  Cc: Philip Oakley, Jonathan Nieder, Git mailing list, Josh Steadmon,
	Jeff King, Jeff Hostetler

Alex Henrie wrote:
> On Sat, May 22, 2021 at 3:19 PM Felipe Contreras
> <felipe.contreras@gmail.com> wrote:

> > > Though abnormal and atypical often have different implications ;-)
> >
> > Or USUAL.
> 
> The words "typical" and "usual" have the same problem of making it
> sound like there was no error.

Not to me. "Usual" is typically an adjective, which makes me think: usual
what? (something is missing). Although both can be used as nouns ("the
usual", "the new normal") in this context "usual" is much less confusing
(I don't recall hearing "the command returned usual status").

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RFC: error codes on exit
  2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
                   ` (4 preceding siblings ...)
  2021-05-21  1:20 ` brian m. carlson
@ 2021-05-26  8:21 ` Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-26  8:21 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Josh Steadmon, Jeff King, Jeff Hostetler


On Wed, May 19 2021, Jonathan Nieder wrote:

> Hi,
>
> (Danger, jrn is wading into error handling again...)
>
> At $DAYJOB we are setting up some alerting for some bot fleets and
> developer workstations, using trace2 as the data source.  Having
> trace2 has been great --- combined with gradual weekly rollouts of
> "next", it helps us to understand quickly when a change is creating a
> regression for users, which hopefully improves the quality of Git for
> everyone.
>
> One kind of signal we haven't been able to make good use of is error
> rates.  The problem is that a die() call can be an indication of
>
>  a. the user asked to do something that isn't sensible, and we kindly
>     rebuked the user
>
>  b. we contacted a server, and the server was not happy with our
>     request
>
>  c. the local Git repository is corrupt
>
>  d. we ran out of resources (e.g., disk space)
>
>  e. we encountered an internal error in handling the user's
>     legitimate request
> [...]
> Further down the line I can imagine making use of git_error_code
> elsewhere for e.g. some limited retries of the corresponding
> transaction when we fail to lock a file.
>
> Thoughts?  Good idea?  Bad idea?

Having read the thread at large (and some of this is a more general
response) a few points, not against or as a retort to this, just related
thoughts, complimentary suggestions etc:

 1. As shown in my f6d25d78789 (api docs: document that BUG() emits a
    trace2 error event, 2021-04-13) all of BUG/die/error/warning just
    emit "error" under trace2.

    It seems to me a good place to start with this effort would be for
    someone to split that up. It requires changing the trace2 schema,
    but it can be done in some backwards compatible way. Perhaps event:
    error, error_type: [bug,die,error,warning] ?

 1.5. Split up error_errno() from error() for trace2 purposes? This gets
      you partway to your "d".

 2. Similarly we need to log the correct line numbers for
    die/error/warning. They need to be a macro/function like BUG() /
    BUG_fl().

 3. You can then key error events/frequencies on the "fmt".

 4. To the extent tha #3 isn't true on client machines due to i18n we
    could change the API in a backwards-compatible way from
    e.g. error(_("string") to error(_N("string")). We'd then always
    transmit the C locale "fmt".

Basically I wonder if a more granular approach with just better logging
of information we have now (but lose in trace2) + maybe some split-up of
the current functions, e.g. having a user_error() distinct from
repository_error() or whatever wouldn't get us most/all of the way to
this.

> Further down the line I can imagine making use of git_error_code
> elsewhere for e.g. some limited retries of the corresponding
> transaction when we fail to lock a file.

Maybe, but that seems highly problem-dependant, and not e.g. something
where we'd like to just do a blind retry in one of our own porcelain
tools if a plumbing one failed with a "had an issue, retries might work"
code.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-05-26  9:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 23:34 RFC: error codes on exit Jonathan Nieder
2021-05-20  0:40 ` Felipe Contreras
2021-05-21 16:53   ` Alex Henrie
2021-05-21 23:20     ` H. Peter Anvin
2021-05-22  4:06       ` Bagas Sanjaya
2021-05-22  8:49       ` Junio C Hamano
2021-05-22  9:08         ` H. Peter Anvin
2021-05-22 21:22         ` Felipe Contreras
2021-05-22 21:29           ` H. Peter Anvin
2021-05-22 21:53             ` Felipe Contreras
2021-05-22 23:02               ` H. Peter Anvin
2021-05-22  9:12     ` Philip Oakley
2021-05-22 21:19       ` Felipe Contreras
2021-05-25 17:24         ` Alex Henrie
2021-05-25 18:43           ` Felipe Contreras
2021-05-20  0:49 ` Junio C Hamano
2021-05-20  1:19   ` Felipe Contreras
2021-05-20  1:55   ` Jonathan Nieder
2021-05-20  2:28     ` Junio C Hamano
2021-05-20 13:28 ` Jeff King
2021-05-20 17:47   ` Jonathan Nieder
2021-05-21  9:43     ` Jeff King
2021-05-20 15:09 ` Jeff Hostetler
2021-05-21  1:33   ` brian m. carlson
2021-05-21  1:20 ` brian m. carlson
2021-05-26  8:21 ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).