Review of git multimail

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Review of git multimail
@ 2013-07-02 19:23 Ramkumar Ramachandra
  2013-07-02 20:51 ` John Keeping
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ramkumar Ramachandra @ 2013-07-02 19:23 UTC (permalink / raw)
  To: Git List; +Cc: Michael Haggerty

Hi,

I figured that we should quickly read through git-multimail and give
it an on-list review.  Hopefully, it'll educate the list about what
this is, and help improve the script itself.

Sources: https://github.com/mhagger/git-multimail

git_multimail.py wrote:
> #! /usr/bin/env python2

Do all distributions ship it as python2 now?

> class CommandError(Exception):
>     def __init__(self, cmd, retcode):
>         self.cmd = cmd
>         self.retcode = retcode
>         Exception.__init__(
>             self,
>             'Command "%s" failed with retcode %s' % (' '.join(cmd), retcode,)

So cmd is a list.

> class ConfigurationException(Exception):
>     pass

Dead code?

> def read_git_output(args, input=None, keepends=False, **kw):
>     """Read the output of a Git command."""
> 
>     return read_output(
>         ['git', '-c', 'i18n.logoutputencoding=%s' % (ENCODING,)] + args,
>         input=input, keepends=keepends, **kw
>         )

Okay, although I'm wondering what i18n.logoutputencoding has to do with anything.

> def read_output(cmd, input=None, keepends=False, **kw):
>     if input:
>         stdin = subprocess.PIPE
>     else:
>         stdin = None
>     p = subprocess.Popen(
>         cmd, stdin=stdin, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **kw
>         )
>     (out, err) = p.communicate(input)
>     retcode = p.wait()
>     if retcode:
>         raise CommandError(cmd, retcode)
>     if not keepends:
>         out = out.rstrip('\n\r')
>     return out

Helper function that serves a single caller, read_git_output().

> def read_git_lines(args, keepends=False, **kw):
>     """Return the lines output by Git command.
> 
>     Return as single lines, with newlines stripped off."""
> 
>     return read_git_output(args, keepends=True, **kw).splitlines(keepends)

Okay.

> class Config(object):
>     def __init__(self, section, git_config=None):
>         """Represent a section of the git configuration.
> 
>         If git_config is specified, it is passed to "git config" in
>         the GIT_CONFIG environment variable, meaning that "git config"
>         will read the specified path rather than the Git default
>         config paths."""
> 
>         self.section = section
>         if git_config:
>             self.env = os.environ.copy()
>             self.env['GIT_CONFIG'] = git_config
>         else:
>             self.env = None

Okay.

>     @staticmethod
>     def _split(s):
>         """Split NUL-terminated values."""
> 
>         words = s.split('\0')
>         assert words[-1] == ''
>         return words[:-1]

Ugh.  Two callers of this poorly-defined static method: I wonder if
we'd be better off inlining it.

>     def get(self, name, default=''):
>         try:
>             values = self._split(read_git_output(
>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
>                     env=self.env, keepends=True,
>                     ))

Wait, what is the point of using --null and then splitting by hand
using a poorly-defined static method?  Why not drop the --null and
splitlines() as usual?

>             assert len(values) == 1

When does this assert fail?

>             return values[0]
>         except CommandError:
>             return default

If you're emulating the dictionary get method, default=None.  This is
not C, where all codepaths of the function must return the same type.

>     def get_bool(self, name, default=None):
>         try:
>             value = read_git_output(
>                 ['config', '--get', '--bool', '%s.%s' % (self.section, name)],
>                 env=self.env,
>                 )
>         except CommandError:
>             return default
>         return value == 'true'

Correct.  On success, return bool.  On failure, return None.

>     def get_all(self, name, default=None):
>         """Read a (possibly multivalued) setting from the configuration.
> 
>         Return the result as a list of values, or default if the name
>         is unset."""
> 
>         try:
>             return self._split(read_git_output(
>                 ['config', '--get-all', '--null', '%s.%s' % (self.section, name)],
>                 env=self.env, keepends=True,
>                 ))
>         except CommandError, e:

CommandError as e?

>             if e.retcode == 1:

What does this cryptic retcode mean?

>                 return default
>             else:
>                 raise

raise what?

You've instantiated the Config class in two places: user and
multimailhook sections.  Considering that you're going to read all the
keys in that section, why not --get-regexp, pre-load the configuration
into a dictionary and refer to that instead of spawning 'git config'
every time you need a configuration value?

>     def get_recipients(self, name, default=None):
>         """Read a recipients list from the configuration.
> 
>         Return the result as a comma-separated list of email
>         addresses, or default if the option is unset.  If the setting
>         has multiple values, concatenate them with comma separators."""
> 
>         lines = self.get_all(name, default=None)
>         if lines is None:
>             return default
>         return ', '.join(line.strip() for line in lines)

Ugh.

>     def set(self, name, value):
>         read_git_output(
>             ['config', '%s.%s' % (self.section, name), value],
>             env=self.env,
>             )
>
>     def add(self, name, value):
>         read_git_output(
>             ['config', '--add', '%s.%s' % (self.section, name), value],
>             env=self.env,
>             )
> 
>     def has_key(self, name):
>         return self.get_all(name, default=None) is not None
> 
>     def unset_all(self, name):
>         try:
>             read_git_output(
>                 ['config', '--unset-all', '%s.%s' % (self.section, name)],
>                 env=self.env,
>                 )
>         except CommandError, e:
>             if e.retcode == 5:
>                 # The name doesn't exist, which is what we wanted anyway...
>                 pass
>             else:
>                 raise
> 
>     def set_recipients(self, name, value):
>         self.unset_all(name)
>         for pair in getaddresses([value]):
>             self.add(name, formataddr(pair))

Dead code?

> def generate_summaries(*log_args):
>     """Generate a brief summary for each revision requested.
> 
>     log_args are strings that will be passed directly to "git log" as
>     revision selectors.  Iterate over (sha1_short, subject) for each
>     commit specified by log_args (subject is the first line of the
>     commit message as a string without EOLs)."""
> 
>     cmd = [
>         'log', '--abbrev', '--format=%h %s',
>         ] + list(log_args) + ['--']

What is log_args if not a list?

But yeah, log is the best way to generate summaries.

>     for line in read_git_lines(cmd):
>         yield tuple(line.split(' ', 1))

Okay, let's see how you use this iterator.

> def limit_lines(lines, max_lines):
>     for (index, line) in enumerate(lines):
>         if index < max_lines:
>             yield line
> 
>     if index >= max_lines:
>         yield '... %d lines suppressed ...\n' % (index + 1 - max_lines,)

Random helper.

> def limit_linelength(lines, max_linelength):
>     for line in lines:
>         # Don't forget that lines always include a trailing newline.
>         if len(line) > max_linelength + 1:
>             line = line[:max_linelength - 7] + ' [...]\n'
>         yield line

Random helper.

> class GitObject(object):
>     def __init__(self, sha1, type=None):
>         if sha1 == ZEROS:
>             self.sha1 = self.type = self.commit = None
>         else:
>             self.sha1 = sha1
>             self.type = type or read_git_output(['cat-file', '-t', self.sha1])
> 
>             if self.type == 'commit':
>                 self.commit = self
>             elif self.type == 'tag':
>                 try:
>                     self.commit = GitObject(
>                         read_git_output(['rev-parse', '--verify', '%s^0' % (self.sha1,)]),
>                         type='commit',
>                         )
>                 except CommandError:
>                     self.commit = None
>             else:
>                 self.commit = None
> 
>         self.short = read_git_output(['rev-parse', '--short', sha1])

Just rev-parse --verify --short $SHA1^0: if it resolves, set
self.short; one liner?

>     def get_summary(self):
>         """Return (sha1_short, subject) for this commit."""
> 
>         if not self.sha1:
>             raise ValueError('Empty commit has no summary')

What is the point of letting the user instantiate a GitObject without
a valid .sha1 in the first place?

>         return iter(generate_summaries('--no-walk', self.sha1)).next()

Not exactly fond of this, but I don't have a concrete replacement at
the moment.

>     def __eq__(self, other):
>         return isinstance(other, GitObject) and self.sha1 == other.sha1
> 
>     def __hash__(self):
>         return hash(self.sha1)
> 
>     def __nonzero__(self):
>         return bool(self.sha1)

Okay.

>     def __str__(self):
>         return self.sha1 or ZEROS

I wonder what value this adds when .short is around.

> class Change(object):
>     """A Change that has been made to the Git repository.
> 
>     Abstract class from which both Revisions and ReferenceChanges are
>     derived.  A Change knows how to generate a notification email
>     describing itself."""
> 
>     def __init__(self, environment):
>         self.environment = environment
>         self._values = None
> 
>     def _compute_values(self):
>         """Return a dictionary {keyword : expansion} for this Change.
> 
>         Derived classes overload this method to add more entries to
>         the return value.  This method is used internally by
>         get_values().  The return value should always be a new
>         dictionary."""
> 
>         return self.environment.get_values()

Why is this an "internal function"?  What is your criterion for
internal versus non-internal?

>     def get_values(self, **extra_values):
>         """Return a dictionary {keyword : expansion} for this Change.
> 
>         Return a dictionary mapping keywords to the values that they
>         should be expanded to for this Change (used when interpolating
>         template strings).  If any keyword arguments are supplied, add
>         those to the return value as well.  The return value is always
>         a new dictionary."""
> 
>         if self._values is None:
>             self._values = self._compute_values()
> 
>         values = self._values.copy()
>         if extra_values:
>             values.update(extra_values)
>         return values

Unsure what this is about.

>     def expand(self, template, **extra_values):
>         """Expand template.
> 
>         Expand the template (which should be a string) using string
>         interpolation of the values for this Change.  If any keyword
>         arguments are provided, also include those in the keywords
>         available for interpolation."""
> 
>         return template % self.get_values(**extra_values)
> 
>     def expand_lines(self, template, **extra_values):
>         """Break template into lines and expand each line."""
> 
>         values = self.get_values(**extra_values)
>         for line in template.splitlines(True):
>             yield line % values

Okay.

>     def expand_header_lines(self, template, **extra_values):
>         """Break template into lines and expand each line as an RFC 2822 header.
> 
>         Encode values and split up lines that are too long.  Silently
>         skip lines that contain references to unknown variables."""
> 
>         values = self.get_values(**extra_values)
>         for line in template.splitlines(True):
>             (name, value) = line.split(':', 1)
>             value = value.rstrip('\n\r')

Doesn't splitlines() make the rstrip() redundant?

>             try:
>                 value = value % values
>             except KeyError, e:
>                 if DEBUG:
>                     sys.stderr.write(
>                         'Warning: unknown variable %r in the following line; line skipped:\n'
>                         '    %s'
>                         % (e.args[0], line,)
>                         )

If DEBUG isn't on, you risk leaving the value string interpolated
without even telling the user.  What does it mean to the end user?

>             else:
>                 try:
>                     h = Header(value, header_name=name)
>                 except UnicodeDecodeError:
>                     h = Header(value, header_name=name, charset=CHARSET, errors='replace')
>                 for splitline in ('%s: %s\n' % (name, h.encode(),)).splitlines(True):
>                     yield splitline

Not elated by this exception cascading, but I suppose it's cheaper
than actually checking everything.

>     def generate_email_header(self):
>         """Generate the RFC 2822 email headers for this Change, a line at a time.
> 
>         The output should not include the trailing blank line."""
> 
>         raise NotImplementedError()
> 
>     def generate_email_intro(self):
>         """Generate the email intro for this Change, a line at a time.
> 
>         The output will be used as the standard boilerplate at the top
>         of the email body."""
> 
>         raise NotImplementedError()
> 
>     def generate_email_body(self):
>         """Generate the main part of the email body, a line at a time.
> 
>         The text in the body might be truncated after a specified
>         number of lines (see multimailhook.emailmaxlines)."""
> 
>         raise NotImplementedError()
> 
>     def generate_email_footer(self):
>         """Generate the footer of the email, a line at a time.
> 
>         The footer is always included, irrespective of
>         multimailhook.emailmaxlines."""
> 
>         raise NotImplementedError()

Unsure what these are about.

>     def generate_email(self, push, body_filter=None):
>         """Generate an email describing this change.
> 
>         Iterate over the lines (including the header lines) of an
>         email describing this change.  If body_filter is not None,
>         then use it to filter the lines that are intended for the
>         email body."""
> 
>         for line in self.generate_email_header():
>             yield line
>         yield '\n'
>         for line in self.generate_email_intro():
>             yield line
> 
>         body = self.generate_email_body(push)
>         if body_filter is not None:

Redundant "is not None".

>             body = body_filter(body)
>         for line in body:
>             yield line
> 
>         for line in self.generate_email_footer():
>             yield line

Nicely done with yield.

> class Revision(Change):
>     """A Change consisting of a single git commit."""
> 
>     def __init__(self, reference_change, rev, num, tot):
>         Change.__init__(self, reference_change.environment)

super?

>         self.reference_change = reference_change
>         self.rev = rev
>         self.change_type = self.reference_change.change_type
>         self.refname = self.reference_change.refname
>         self.num = num
>         self.tot = tot
>         self.author = read_git_output(['log', '--no-walk', '--format=%aN <%aE>', self.rev.sha1])

Determining author using log --format.  Okay.

>         self.recipients = self.environment.get_revision_recipients(self)
> 
>     def _compute_values(self):
>         values = Change._compute_values(self)

super.

>         # First line of commit message:
>         try:
>             oneline = read_git_output(
>                 ['log', '--format=%s', '--no-walk', self.rev.sha1]
>                 )
>         except CommandError:
>             oneline = self.rev.sha1

What does this mean?  When will you get a CommandError?  And how do
you respond to it?

>         values['rev'] = self.rev.sha1
>         values['rev_short'] = self.rev.short
>         values['change_type'] = self.change_type
>         values['refname'] = self.refname
>         values['short_refname'] = self.reference_change.short_refname
>         values['refname_type'] = self.reference_change.refname_type
>         values['reply_to_msgid'] = self.reference_change.msgid
>         values['num'] = self.num
>         values['tot'] = self.tot
>         values['recipients'] = self.recipients
>         values['oneline'] = oneline
>         values['author'] = self.author

Ugh.  Use

  { rev: self.rev.sha1,
    rev_short: self.rev.short
    ...
  }

and merge it with the existing dictionary.  Unsure why you're building
a dictionary in the first place.

>     def generate_email_header(self):
>         for line in self.expand_header_lines(REVISION_HEADER_TEMPLATE):
>             yield line
> 
>     def generate_email_intro(self):
>         for line in self.expand_lines(REVISION_INTRO_TEMPLATE):
>             yield line
>
>     def generate_email_body(self, push):
>         """Show this revision."""
> 
>         return read_git_lines(
>             [
>                 'log', '-C',
>                  '--stat', '-p', '--cc',
>                 '-1', self.rev.sha1,
>                 ],
>             keepends=True,
>             )
>
>     def generate_email_footer(self):
>         return self.expand_lines(REVISION_FOOTER_TEMPLATE)

Okay.

> class ReferenceChange(Change):
>     """A Change to a Git reference.
> 
>     An abstract class representing a create, update, or delete of a
>     Git reference.  Derived classes handle specific types of reference
>     (e.g., tags vs. branches).  These classes generate the main
>     reference change email summarizing the reference change and
>     whether it caused any any commits to be added or removed.
> 
>     ReferenceChange objects are usually created using the static
>     create() method, which has the logic to decide which derived class
>     to instantiate."""
> 
>     REF_RE = re.compile(r'^refs\/(?P<area>[^\/]+)\/(?P<shortname>.*)$')

Okay.

>     @staticmethod

Unsure what such a huge static method is doing here, but we'll find
out soon enough.

>     def create(environment, oldrev, newrev, refname):
>         """Return a ReferenceChange object representing the change.
> 
>         Return an object that represents the type of change that is being
>         made. oldrev and newrev should be SHA1s or ZEROS."""

Like I said before, use the typesystem effectively: why is using a
string with 40 zeros somehow better than None in your program _logic_?
I can understand converting None to 40 zeros for display purposes.

>         old = GitObject(oldrev)
>         new = GitObject(newrev)
>         rev = new or old
> 
>         # The revision type tells us what type the commit is, combined with
>         # the location of the ref we can decide between
>         #  - working branch
>         #  - tracking branch
>         #  - unannotated tag
>         #  - annotated tag

Could be simpler.

>         return klass(
>             environment,
>             refname=refname, short_refname=short_refname,
>             old=old, new=new, rev=rev,
>             )

Everything inherits from ReferenceChange anyway, so it should be safe.

>     def __init__(self, environment, refname, short_refname, old, new, rev):
>         Change.__init__(self, environment)
>         self.change_type = {
>             (False, True) : 'create',
>             (True, True) : 'update',
>             (True, False) : 'delete',
>             }[bool(old), bool(new)]

As a general principle, avoid casting: if new is a dictionary, what
does bool(new) even mean?  You just have to trust types, and let go of
that much safety.

>     def get_subject(self):
>         template = {
>             'create' : REF_CREATED_SUBJECT_TEMPLATE,
>             'update' : REF_UPDATED_SUBJECT_TEMPLATE,
>             'delete' : REF_DELETED_SUBJECT_TEMPLATE,
>             }[self.change_type]
>         return self.expand(template)
> 
>     def generate_email_header(self):
>         for line in self.expand_header_lines(
>             REFCHANGE_HEADER_TEMPLATE, subject=self.get_subject(),
>             ):
>             yield line
> 
>     def generate_email_intro(self):
>         for line in self.expand_lines(REFCHANGE_INTRO_TEMPLATE):
>             yield line
> 
>     def generate_email_body(self, push):
>         """Call the appropriate body-generation routine.
> 
>         Call one of generate_create_summary() /
>         generate_update_summary() / generate_delete_summary()."""
> 
>         change_summary = {
>             'create' : self.generate_create_summary,
>             'delete' : self.generate_delete_summary,
>             'update' : self.generate_update_summary,
>             }[self.change_type](push)
>         for line in change_summary:
>             yield line
> 
>         for line in self.generate_revision_change_summary(push):
>             yield line
> 
>     def generate_email_footer(self):
>         return self.expand_lines(FOOTER_TEMPLATE)

Mostly boring string interpolation.  Okay.

>     def generate_revision_change_log(self, new_commits_list):
>         if self.showlog:
>             yield '\n'
>             yield 'Detailed log of new commits:\n\n'
>             for line in read_git_lines(
>                     ['log', '--no-walk']
>                     + self.logopts
>                     + new_commits_list
>                     + ['--'],
>                     keepends=True,
>                 ):
>                 yield line

Okay.

Got bored.  Skipping to the next class.

> class BranchChange(ReferenceChange):
>     refname_type = 'branch'

Unsure what new information this conveys over the type.

> class AnnotatedTagChange(ReferenceChange):
>     refname_type = 'annotated tag'
> 
>     def __init__(self, environment, refname, short_refname, old, new, rev):
>         ReferenceChange.__init__(
>             self, environment,
>             refname=refname, short_refname=short_refname,
>             old=old, new=new, rev=rev,
>             )
>         self.recipients = environment.get_announce_recipients(self)
>         self.show_shortlog = environment.announce_show_shortlog
> 
>     ANNOTATED_TAG_FORMAT = (
>         '%(*objectname)\n'
>         '%(*objecttype)\n'
>         '%(taggername)\n'
>         '%(taggerdate)'
>         )

Now I'm curious what you get by differentiating between annotated and
unannotated tags.

>     def describe_tag(self, push):
>         """Describe the new value of an annotated tag."""
> 
>         # Use git for-each-ref to pull out the individual fields from
>         # the tag
>         [tagobject, tagtype, tagger, tagged] = read_git_lines(
>             ['for-each-ref', '--format=%s' % (self.ANNOTATED_TAG_FORMAT,), self.refname],
>             )

You could've saved yourself a lot of trouble by running one f-e-r on
refs/tags and filtering that.  I don't know what you're gaining from
this overzealous object-orientation.

>         yield self.expand(
>             BRIEF_SUMMARY_TEMPLATE, action='tagging',
>             rev_short=tagobject, text='(%s)' % (tagtype,),
>             )
>         if tagtype == 'commit':
>             # If the tagged object is a commit, then we assume this is a
>             # release, and so we calculate which tag this tag is
>             # replacing
>             try:
>                 prevtag = read_git_output(['describe', '--abbrev=0', '%s^' % (self.new,)])
>             except CommandError:
>                 prevtag = None
>             if prevtag:
>                 yield '  replaces  %s\n' % (prevtag,)
>         else:
>             prevtag = None
>             yield '    length  %s bytes\n' % (read_git_output(['cat-file', '-s', tagobject]),)
> 
>         yield ' tagged by  %s\n' % (tagger,)
>         yield '        on  %s\n' % (tagged,)
>         yield '\n'

Okay, this information isn't present in an unannotated tag.  So you
differentiate to exploit the additional information you can get.

>         # Show the content of the tag message; this might contain a
>         # change log or release notes so is worth displaying.
>         yield LOGBEGIN
>         contents = list(read_git_lines(['cat-file', 'tag', self.new.sha1], keepends=True))

You could've easily batched this.

>         contents = contents[contents.index('\n') + 1:]
>         if contents and contents[-1][-1:] != '\n':
>             contents.append('\n')
>         for line in contents:
>             yield line
> 
>         if self.show_shortlog and tagtype == 'commit':
>             # Only commit tags make sense to have rev-list operations
>             # performed on them
>             yield '\n'
>             if prevtag:
>                 # Show changes since the previous release
>                 revlist = read_git_output(
>                     ['rev-list', '--pretty=short', '%s..%s' % (prevtag, self.new,)],
>                     keepends=True,
>                     )
>             else:
>                 # No previous tag, show all the changes since time
>                 # began
>                 revlist = read_git_output(
>                     ['rev-list', '--pretty=short', '%s' % (self.new,)],
>                     keepends=True,
>                     )
>             for line in read_git_lines(['shortlog'], input=revlist, keepends=True):
>                 yield line
> 
>         yield LOGEND
>         yield '\n'

Way too many git invocations, I think.

> class OtherReferenceChange(ReferenceChange):
>     refname_type = 'reference'
> 
>     def __init__(self, environment, refname, short_refname, old, new, rev):
>         # We use the full refname as short_refname, because otherwise
>         # the full name of the reference would not be obvious from the
>         # text of the email.
>         ReferenceChange.__init__(
>             self, environment,
>             refname=refname, short_refname=refname,
>             old=old, new=new, rev=rev,
>             )
>         self.recipients = environment.get_refchange_recipients(self)

What is the point of this?  Why not just use ReferenceChange directly?

> class Mailer(object):
>     """An object that can send emails."""
> 
>     def send(self, lines, to_addrs):
>         """Send an email consisting of lines.
> 
>         lines must be an iterable over the lines constituting the
>         header and body of the email.  to_addrs is a list of recipient
>         addresses (can be needed even if lines already contains a
>         "To:" field).  It can be either a string (comma-separated list
>         of email addresses) or a Python list of individual email
>         addresses.
> 
>         """
> 
>         raise NotImplementedError()

Abstract base class (abc)?  Or do you want to support Python <2.6?

> class SendMailer(Mailer):
>     """Send emails using '/usr/sbin/sendmail -t'."""
> 
>     def __init__(self, command=None, envelopesender=None):
>         """Construct a SendMailer instance.
> 
>         command should be the command and arguments used to invoke
>         sendmail, as a list of strings.  If an envelopesender is
>         provided, it will also be passed to the command, via '-f
>         envelopesender'."""
> 
>         if command:
>             self.command = command[:]
>         else:
>             self.command = ['/usr/sbin/sendmail', '-t']

If you want to DWIM when the configuration variable is missing, do it
fully using a list of good candidates like /usr/lib/sendmail,
/usr/sbin/sendmail, /usr/ucblib/sendmail, /usr/bin/msmtp.  Also, what
happened to our faithful 'git send-email' Perl script?  Isn't that
most likely to be installed?

>         if envelopesender:
>             self.command.extend(['-f', envelopesender])
> 
>     def send(self, lines, to_addrs):
>         try:
>             p = subprocess.Popen(self.command, stdin=subprocess.PIPE)
>         except OSError, e:
>             sys.stderr.write(
>                 '*** Cannot execute command: %s\n' % ' '.join(self.command)
>                 + '*** %s\n' % str(e)
>                 + '*** Try setting multimailhook.mailer to "smtp"\n'
>                 '*** to send emails without using the sendmail command.\n'
>                 )
>             sys.exit(1)

Why do you need to concatenate strings using +?  This can take a list of strings, no?

> class SMTPMailer(Mailer):
>     """Send emails using Python's smtplib."""
> 
>     def __init__(self, envelopesender, smtpserver):
>         if not envelopesender:
>             sys.stderr.write(
>                 'fatal: git_multimail: cannot use SMTPMailer without a sender address.\n'
>                 'please set either multimailhook.envelopeSender or user.email\n'
>                 )
>             sys.exit(1)
>         self.envelopesender = envelopesender
>         self.smtpserver = smtpserver
>         try:
>             self.smtp = smtplib.SMTP(self.smtpserver)
>         except Exception, e:
>             sys.stderr.write('*** Error establishing SMTP connection to %s***\n' % self.smtpserver)
>             sys.stderr.write('*** %s\n' % str(e))
>             sys.exit(1)

Let's hope Python's smtplib is robust.

>     def __del__(self):
>         self.smtp.quit()

So you close the connection when the object is destroyed by the GC.

>     def send(self, lines, to_addrs):
>         try:
>             msg = ''.join(lines)
>             # turn comma-separated list into Python list if needed.
>             if isinstance(to_addrs, basestring):
>                 to_addrs = [email for (name, email) in getaddresses([to_addrs])]
>             self.smtp.sendmail(self.envelopesender, to_addrs, msg)
>         except Exception, e:
>             sys.stderr.write('*** Error sending email***\n')
>             sys.stderr.write('*** %s\n' % str(e))
>             self.smtp.quit()
>             sys.exit(1)

Okay.

> class OutputMailer(Mailer):
>     """Write emails to an output stream, bracketed by lines of '=' characters.
> 
>     This is intended for debugging purposes."""
> 
>     SEPARATOR = '=' * 75 + '\n'
> 
>     def __init__(self, f):
>         self.f = f
> 
>     def send(self, lines, to_addrs):
>         self.f.write(self.SEPARATOR)
>         self.f.writelines(lines)
>         self.f.write(self.SEPARATOR)

Unsure what this is.

> def get_git_dir():
>     """Determine GIT_DIR.
> 
>     Determine GIT_DIR either from the GIT_DIR environment variable or
>     from the working directory, using Git's usual rules."""
> 
>     try:
>         return read_git_output(['rev-parse', '--git-dir'])
>     except CommandError:
>         sys.stderr.write('fatal: git_multimail: not in a git working copy\n')
>         sys.exit(1)

Why do you need a working copy?  Will a bare repository not suffice?

> class Environment(object):

New-style class.  I wonder why you suddenly switched.

>     REPO_NAME_RE = re.compile(r'^(?P<name>.+?)(?:\.git)$')
> 
>     def __init__(self, osenv=None):
>         self.osenv = osenv or os.environ
>         self.announce_show_shortlog = False
>         self.maxcommitemails = 500
>         self.diffopts = ['--stat', '--summary', '--find-copies-harder']
>         self.logopts = []
>         self.refchange_showlog = False
> 
>         self.COMPUTED_KEYS = [
>             'administrator',
>             'charset',
>             'emailprefix',
>             'fromaddr',
>             'pusher',
>             'pusher_email',
>             'repo_path',
>             'repo_shortname',
>             'sender',
>             ]
> 
>         self._values = None

Okay.

> [...]

Seems to be some boilerplate thing.  I'll skip to the next class.

> class ConfigEnvironmentMixin(Environment):
>     """A mixin that sets self.config to its constructor's config argument.
> 
>     This class's constructor consumes the "config" argument.
> 
>     Mixins that need to inspect the config should inherit from this
>     class (1) to make sure that "config" is still in the constructor
>     arguments with its own constructor runs and/or (2) to be sure that
>     self.config is set after construction."""
> 
>     def __init__(self, config, **kw):
>         super(ConfigEnvironmentMixin, self).__init__(**kw)
>         self.config = config

Overdoing the OO factories, much?

I'll skip a few boring factory classes.

> class GenericEnvironment(
>     ProjectdescEnvironmentMixin,
>     ConfigMaxlinesEnvironmentMixin,
>     ConfigFilterLinesEnvironmentMixin,
>     ConfigRecipientsEnvironmentMixin,
>     PusherDomainEnvironmentMixin,
>     ConfigOptionsEnvironmentMixin,
>     GenericEnvironmentMixin,
>     Environment,
>     ):
>     pass

Sigh.  I might as well be reading some Java now :/

Sorry, I'm exhausted.

Let's take a step back and look at what this gigantic script is doing.
It uses the information from a push to string-interpolate a template
and generate emails, right?  The rest of the script is about churning
on the updated refs to prettify the emails.

>From my quick reading, it seems to be unnecessarily complicated and
inefficient.  Why are there so many factories, and why do you call out
to git at every opportunity, instead of cleanly separating computation
from rendering?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-02 19:23 Review of git multimail Ramkumar Ramachandra
@ 2013-07-02 20:51 ` John Keeping
  2013-07-02 21:34   ` Ramkumar Ramachandra
  2013-07-02 22:21 ` Junio C Hamano
  2013-07-03  0:10 ` Michael Haggerty
  2 siblings, 1 reply; 15+ messages in thread
From: John Keeping @ 2013-07-02 20:51 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List, Michael Haggerty

On Wed, Jul 03, 2013 at 12:53:39AM +0530, Ramkumar Ramachandra wrote:
> > class CommandError(Exception):
> >     def __init__(self, cmd, retcode):
> >         self.cmd = cmd
> >         self.retcode = retcode
> >         Exception.__init__(
> >             self,
> >             'Command "%s" failed with retcode %s' % (' '.join(cmd), retcode,)
> 
> So cmd is a list.
> 
> > class ConfigurationException(Exception):
> >     pass
> 
> Dead code?

Huh?  ConfigurationException is used elsewhere.

> > def read_git_output(args, input=None, keepends=False, **kw):
> >     """Read the output of a Git command."""
> > 
> >     return read_output(
> >         ['git', '-c', 'i18n.logoutputencoding=%s' % (ENCODING,)] + args,
> >         input=input, keepends=keepends, **kw
> >         )
> 
> Okay, although I'm wondering what i18n.logoutputencoding has to do with anything.

Making sure the output is what's expected and not influenced by
environment variables?

> > def read_output(cmd, input=None, keepends=False, **kw):
> >     if input:
> >         stdin = subprocess.PIPE
> >     else:
> >         stdin = None
> >     p = subprocess.Popen(
> >         cmd, stdin=stdin, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **kw
> >         )
> >     (out, err) = p.communicate(input)
> >     retcode = p.wait()
> >     if retcode:
> >         raise CommandError(cmd, retcode)
> >     if not keepends:
> >         out = out.rstrip('\n\r')
> >     return out
> 
> Helper function that serves a single caller, read_git_output().
> 
> > def read_git_lines(args, keepends=False, **kw):
> >     """Return the lines output by Git command.
> > 
> >     Return as single lines, with newlines stripped off."""
> > 
> >     return read_git_output(args, keepends=True, **kw).splitlines(keepends)
> 
> Okay.
> 
> > class Config(object):
> >     def __init__(self, section, git_config=None):
> >         """Represent a section of the git configuration.
> > 
> >         If git_config is specified, it is passed to "git config" in
> >         the GIT_CONFIG environment variable, meaning that "git config"
> >         will read the specified path rather than the Git default
> >         config paths."""
> > 
> >         self.section = section
> >         if git_config:
> >             self.env = os.environ.copy()
> >             self.env['GIT_CONFIG'] = git_config
> >         else:
> >             self.env = None
> 
> Okay.
> 
> >     @staticmethod
> >     def _split(s):
> >         """Split NUL-terminated values."""
> > 
> >         words = s.split('\0')
> >         assert words[-1] == ''
> >         return words[:-1]
> 
> Ugh.  Two callers of this poorly-defined static method: I wonder if
> we'd be better off inlining it.

In what way poorly defined?  Personally I'd make it _split_null at the
top level but it seems sensible.

> >     def get(self, name, default=''):
> >         try:
> >             values = self._split(read_git_output(
> >                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
> >                     env=self.env, keepends=True,
> >                     ))
> 
> Wait, what is the point of using --null and then splitting by hand
> using a poorly-defined static method?  Why not drop the --null and
> splitlines() as usual?
> 
> >             assert len(values) == 1
> 
> When does this assert fail?

In can't, which is presumably why it's an assert - it checks that we're
processing the Git output as expected.

> >             return values[0]
> >         except CommandError:
> >             return default
> 
> If you're emulating the dictionary get method, default=None.  This is
> not C, where all codepaths of the function must return the same type.
> 
> >     def get_bool(self, name, default=None):
> >         try:
> >             value = read_git_output(
> >                 ['config', '--get', '--bool', '%s.%s' % (self.section, name)],
> >                 env=self.env,
> >                 )
> >         except CommandError:
> >             return default
> >         return value == 'true'
> 
> Correct.  On success, return bool.  On failure, return None.
> 
> >     def get_all(self, name, default=None):
> >         """Read a (possibly multivalued) setting from the configuration.
> > 
> >         Return the result as a list of values, or default if the name
> >         is unset."""
> > 
> >         try:
> >             return self._split(read_git_output(
> >                 ['config', '--get-all', '--null', '%s.%s' % (self.section, name)],
> >                 env=self.env, keepends=True,
> >                 ))
> >         except CommandError, e:
> 
> CommandError as e?

Not before Python 2.6.

> >             if e.retcode == 1:
> 
> What does this cryptic retcode mean?

It mirrors subprocess.CalledProcessError, retcode is the return code of
the process.

> >                 return default
> >             else:
> >                 raise
> 
> raise what?

The current exception - this is pretty idiomatic Python.

> You've instantiated the Config class in two places: user and
> multimailhook sections.  Considering that you're going to read all the
> keys in that section, why not --get-regexp, pre-load the configuration
> into a dictionary and refer to that instead of spawning 'git config'
> every time you need a configuration value?
> 
> >     def get_recipients(self, name, default=None):
> >         """Read a recipients list from the configuration.
> > 
> >         Return the result as a comma-separated list of email
> >         addresses, or default if the option is unset.  If the setting
> >         has multiple values, concatenate them with comma separators."""
> > 
> >         lines = self.get_all(name, default=None)
> >         if lines is None:
> >             return default
> >         return ', '.join(line.strip() for line in lines)
> 
> Ugh.

What's so bad here?  This seems pretty clear to me.

> >     def set(self, name, value):
> >         read_git_output(
> >             ['config', '%s.%s' % (self.section, name), value],
> >             env=self.env,
> >             )
> >
> >     def add(self, name, value):
> >         read_git_output(
> >             ['config', '--add', '%s.%s' % (self.section, name), value],
> >             env=self.env,
> >             )
> > 
> >     def has_key(self, name):
> >         return self.get_all(name, default=None) is not None
> > 
> >     def unset_all(self, name):
> >         try:
> >             read_git_output(
> >                 ['config', '--unset-all', '%s.%s' % (self.section, name)],
> >                 env=self.env,
> >                 )
> >         except CommandError, e:
> >             if e.retcode == 5:
> >                 # The name doesn't exist, which is what we wanted anyway...
> >                 pass
> >             else:
> >                 raise
> > 
> >     def set_recipients(self, name, value):
> >         self.unset_all(name)
> >         for pair in getaddresses([value]):
> >             self.add(name, formataddr(pair))
> 
> Dead code?
> 
> > def generate_summaries(*log_args):
> >     """Generate a brief summary for each revision requested.
> > 
> >     log_args are strings that will be passed directly to "git log" as
> >     revision selectors.  Iterate over (sha1_short, subject) for each
> >     commit specified by log_args (subject is the first line of the
> >     commit message as a string without EOLs)."""
> > 
> >     cmd = [
> >         'log', '--abbrev', '--format=%h %s',
> >         ] + list(log_args) + ['--']
> 
> What is log_args if not a list?

Given the function signature, how can it not be a list?

> But yeah, log is the best way to generate summaries.
> 
> >     for line in read_git_lines(cmd):
> >         yield tuple(line.split(' ', 1))
> 
> Okay, let's see how you use this iterator.

[technically a generator...]

> > def limit_lines(lines, max_lines):
> >     for (index, line) in enumerate(lines):
> >         if index < max_lines:
> >             yield line
> > 
> >     if index >= max_lines:
> >         yield '... %d lines suppressed ...\n' % (index + 1 - max_lines,)
> 
> Random helper.
> 
> > def limit_linelength(lines, max_linelength):
> >     for line in lines:
> >         # Don't forget that lines always include a trailing newline.
> >         if len(line) > max_linelength + 1:
> >             line = line[:max_linelength - 7] + ' [...]\n'
> >         yield line
> 
> Random helper.
> 
> > class GitObject(object):
> >     def __init__(self, sha1, type=None):
> >         if sha1 == ZEROS:
> >             self.sha1 = self.type = self.commit = None
> >         else:
> >             self.sha1 = sha1
> >             self.type = type or read_git_output(['cat-file', '-t', self.sha1])
> > 
> >             if self.type == 'commit':
> >                 self.commit = self
> >             elif self.type == 'tag':
> >                 try:
> >                     self.commit = GitObject(
> >                         read_git_output(['rev-parse', '--verify', '%s^0' % (self.sha1,)]),
> >                         type='commit',
> >                         )
> >                 except CommandError:
> >                     self.commit = None
> >             else:
> >                 self.commit = None
> > 
> >         self.short = read_git_output(['rev-parse', '--short', sha1])
> 
> Just rev-parse --verify --short $SHA1^0: if it resolves, set
> self.short; one liner?

Because it can't create the recursive tag structure if it does that?
Did you miss the fact the in the 'tag' case it assigns a new GitObject
to self.commit?

> >     def get_summary(self):
> >         """Return (sha1_short, subject) for this commit."""
> > 
> >         if not self.sha1:
> >             raise ValueError('Empty commit has no summary')
> 
> What is the point of letting the user instantiate a GitObject without
> a valid .sha1 in the first place?

It looks like this is used for input from the hook, in which case it's
possible for either newrev or oldref to be the null SHA-1, which creates
this empty object.

> >         return iter(generate_summaries('--no-walk', self.sha1)).next()
> 
> Not exactly fond of this, but I don't have a concrete replacement at
> the moment.
> 
> >     def __eq__(self, other):
> >         return isinstance(other, GitObject) and self.sha1 == other.sha1
> > 
> >     def __hash__(self):
> >         return hash(self.sha1)
> > 
> >     def __nonzero__(self):
> >         return bool(self.sha1)
> 
> Okay.
> 
> >     def __str__(self):
> >         return self.sha1 or ZEROS
> 
> I wonder what value this adds when .short is around.

Debugging?  A friendlier API?  Having a oneline __str__ method isn't
exactly expensive and it can make some things much simpler.

> > class Change(object):
> >     """A Change that has been made to the Git repository.
> > 
> >     Abstract class from which both Revisions and ReferenceChanges are
> >     derived.  A Change knows how to generate a notification email
> >     describing itself."""
> > 
> >     def __init__(self, environment):
> >         self.environment = environment
> >         self._values = None
> > 
> >     def _compute_values(self):
> >         """Return a dictionary {keyword : expansion} for this Change.
> > 
> >         Derived classes overload this method to add more entries to
> >         the return value.  This method is used internally by
> >         get_values().  The return value should always be a new
> >         dictionary."""
> > 
> >         return self.environment.get_values()
> 
> Why is this an "internal function"?  What is your criterion for
> internal versus non-internal?
> 
> >     def get_values(self, **extra_values):
> >         """Return a dictionary {keyword : expansion} for this Change.
> > 
> >         Return a dictionary mapping keywords to the values that they
> >         should be expanded to for this Change (used when interpolating
> >         template strings).  If any keyword arguments are supplied, add
> >         those to the return value as well.  The return value is always
> >         a new dictionary."""
> > 
> >         if self._values is None:
> >             self._values = self._compute_values()
> > 
> >         values = self._values.copy()
> >         if extra_values:
> >             values.update(extra_values)
> >         return values
> 
> Unsure what this is about.
> 
> >     def expand(self, template, **extra_values):
> >         """Expand template.
> > 
> >         Expand the template (which should be a string) using string
> >         interpolation of the values for this Change.  If any keyword
> >         arguments are provided, also include those in the keywords
> >         available for interpolation."""
> > 
> >         return template % self.get_values(**extra_values)
> > 
> >     def expand_lines(self, template, **extra_values):
> >         """Break template into lines and expand each line."""
> > 
> >         values = self.get_values(**extra_values)
> >         for line in template.splitlines(True):
> >             yield line % values
> 
> Okay.
> 
> >     def expand_header_lines(self, template, **extra_values):
> >         """Break template into lines and expand each line as an RFC 2822 header.
> > 
> >         Encode values and split up lines that are too long.  Silently
> >         skip lines that contain references to unknown variables."""
> > 
> >         values = self.get_values(**extra_values)
> >         for line in template.splitlines(True):
> >             (name, value) = line.split(':', 1)
> >             value = value.rstrip('\n\r')
> 
> Doesn't splitlines() make the rstrip() redundant?
> 
> >             try:
> >                 value = value % values
> >             except KeyError, e:
> >                 if DEBUG:
> >                     sys.stderr.write(
> >                         'Warning: unknown variable %r in the following line; line skipped:\n'
> >                         '    %s'
> >                         % (e.args[0], line,)
> >                         )
> 
> If DEBUG isn't on, you risk leaving the value string interpolated
> without even telling the user.  What does it mean to the end user?
> 
> >             else:
> >                 try:
> >                     h = Header(value, header_name=name)
> >                 except UnicodeDecodeError:
> >                     h = Header(value, header_name=name, charset=CHARSET, errors='replace')
> >                 for splitline in ('%s: %s\n' % (name, h.encode(),)).splitlines(True):
> >                     yield splitline
> 
> Not elated by this exception cascading, but I suppose it's cheaper
> than actually checking everything.

It's generally considered more idiomatic in Python to use exceptions
than check explicitly.  In CPython the cost is roughly equivalent.

> >     def generate_email_header(self):
> >         """Generate the RFC 2822 email headers for this Change, a line at a time.
> > 
> >         The output should not include the trailing blank line."""
> > 
> >         raise NotImplementedError()
> > 
> >     def generate_email_intro(self):
> >         """Generate the email intro for this Change, a line at a time.
> > 
> >         The output will be used as the standard boilerplate at the top
> >         of the email body."""
> > 
> >         raise NotImplementedError()
> > 
> >     def generate_email_body(self):
> >         """Generate the main part of the email body, a line at a time.
> > 
> >         The text in the body might be truncated after a specified
> >         number of lines (see multimailhook.emailmaxlines)."""
> > 
> >         raise NotImplementedError()
> > 
> >     def generate_email_footer(self):
> >         """Generate the footer of the email, a line at a time.
> > 
> >         The footer is always included, irrespective of
> >         multimailhook.emailmaxlines."""
> > 
> >         raise NotImplementedError()
> 
> Unsure what these are about.
> 
> >     def generate_email(self, push, body_filter=None):
> >         """Generate an email describing this change.
> > 
> >         Iterate over the lines (including the header lines) of an
> >         email describing this change.  If body_filter is not None,
> >         then use it to filter the lines that are intended for the
> >         email body."""
> > 
> >         for line in self.generate_email_header():
> >             yield line
> >         yield '\n'
> >         for line in self.generate_email_intro():
> >             yield line
> > 
> >         body = self.generate_email_body(push)
> >         if body_filter is not None:
> 
> Redundant "is not None".

Why redundant?  body_filter could be non-None but evaluate as false and
"explicit is better than implicit".

> >             body = body_filter(body)
> >         for line in body:
> >             yield line
> > 
> >         for line in self.generate_email_footer():
> >             yield line
> 
> Nicely done with yield.
> 
> > class Revision(Change):
> >     """A Change consisting of a single git commit."""
> > 
> >     def __init__(self, reference_change, rev, num, tot):
> >         Change.__init__(self, reference_change.environment)
> 
> super?
> 
> >         self.reference_change = reference_change
> >         self.rev = rev
> >         self.change_type = self.reference_change.change_type
> >         self.refname = self.reference_change.refname
> >         self.num = num
> >         self.tot = tot
> >         self.author = read_git_output(['log', '--no-walk', '--format=%aN <%aE>', self.rev.sha1])
> 
> Determining author using log --format.  Okay.
> 
> >         self.recipients = self.environment.get_revision_recipients(self)
> > 
> >     def _compute_values(self):
> >         values = Change._compute_values(self)
> 
> super.
> 
> >         # First line of commit message:
> >         try:
> >             oneline = read_git_output(
> >                 ['log', '--format=%s', '--no-walk', self.rev.sha1]
> >                 )
> >         except CommandError:
> >             oneline = self.rev.sha1
> 
> What does this mean?  When will you get a CommandError?  And how do
> you respond to it?
> 
> >         values['rev'] = self.rev.sha1
> >         values['rev_short'] = self.rev.short
> >         values['change_type'] = self.change_type
> >         values['refname'] = self.refname
> >         values['short_refname'] = self.reference_change.short_refname
> >         values['refname_type'] = self.reference_change.refname_type
> >         values['reply_to_msgid'] = self.reference_change.msgid
> >         values['num'] = self.num
> >         values['tot'] = self.tot
> >         values['recipients'] = self.recipients
> >         values['oneline'] = oneline
> >         values['author'] = self.author
> 
> Ugh.  Use
> 
>   { rev: self.rev.sha1,
>     rev_short: self.rev.short
>     ...
>   }
> 
> and merge it with the existing dictionary.  Unsure why you're building
> a dictionary in the first place.
> 
> >     def generate_email_header(self):
> >         for line in self.expand_header_lines(REVISION_HEADER_TEMPLATE):
> >             yield line
> > 
> >     def generate_email_intro(self):
> >         for line in self.expand_lines(REVISION_INTRO_TEMPLATE):
> >             yield line
> >
> >     def generate_email_body(self, push):
> >         """Show this revision."""
> > 
> >         return read_git_lines(
> >             [
> >                 'log', '-C',
> >                  '--stat', '-p', '--cc',
> >                 '-1', self.rev.sha1,
> >                 ],
> >             keepends=True,
> >             )
> >
> >     def generate_email_footer(self):
> >         return self.expand_lines(REVISION_FOOTER_TEMPLATE)
> 
> Okay.
> 
> > class ReferenceChange(Change):
> >     """A Change to a Git reference.
> > 
> >     An abstract class representing a create, update, or delete of a
> >     Git reference.  Derived classes handle specific types of reference
> >     (e.g., tags vs. branches).  These classes generate the main
> >     reference change email summarizing the reference change and
> >     whether it caused any any commits to be added or removed.
> > 
> >     ReferenceChange objects are usually created using the static
> >     create() method, which has the logic to decide which derived class
> >     to instantiate."""
> > 
> >     REF_RE = re.compile(r'^refs\/(?P<area>[^\/]+)\/(?P<shortname>.*)$')
> 
> Okay.
> 
> >     @staticmethod
> 
> Unsure what such a huge static method is doing here, but we'll find
> out soon enough.
> 
> >     def create(environment, oldrev, newrev, refname):
> >         """Return a ReferenceChange object representing the change.
> > 
> >         Return an object that represents the type of change that is being
> >         made. oldrev and newrev should be SHA1s or ZEROS."""
> 
> Like I said before, use the typesystem effectively: why is using a
> string with 40 zeros somehow better than None in your program _logic_?
> I can understand converting None to 40 zeros for display purposes.
> 
> >         old = GitObject(oldrev)
> >         new = GitObject(newrev)
> >         rev = new or old
> > 
> >         # The revision type tells us what type the commit is, combined with
> >         # the location of the ref we can decide between
> >         #  - working branch
> >         #  - tracking branch
> >         #  - unannotated tag
> >         #  - annotated tag
> 
> Could be simpler.
> 
> >         return klass(
> >             environment,
> >             refname=refname, short_refname=short_refname,
> >             old=old, new=new, rev=rev,
> >             )
> 
> Everything inherits from ReferenceChange anyway, so it should be safe.
> 
> >     def __init__(self, environment, refname, short_refname, old, new, rev):
> >         Change.__init__(self, environment)
> >         self.change_type = {
> >             (False, True) : 'create',
> >             (True, True) : 'update',
> >             (True, False) : 'delete',
> >             }[bool(old), bool(new)]
> 
> As a general principle, avoid casting: if new is a dictionary, what
> does bool(new) even mean?

nonempty.

>                            You just have to trust types, and let go of
> that much safety.
> 
> >     def get_subject(self):
> >         template = {
> >             'create' : REF_CREATED_SUBJECT_TEMPLATE,
> >             'update' : REF_UPDATED_SUBJECT_TEMPLATE,
> >             'delete' : REF_DELETED_SUBJECT_TEMPLATE,
> >             }[self.change_type]
> >         return self.expand(template)
> > 
> >     def generate_email_header(self):
> >         for line in self.expand_header_lines(
> >             REFCHANGE_HEADER_TEMPLATE, subject=self.get_subject(),
> >             ):
> >             yield line
> > 
> >     def generate_email_intro(self):
> >         for line in self.expand_lines(REFCHANGE_INTRO_TEMPLATE):
> >             yield line
> > 
> >     def generate_email_body(self, push):
> >         """Call the appropriate body-generation routine.
> > 
> >         Call one of generate_create_summary() /
> >         generate_update_summary() / generate_delete_summary()."""
> > 
> >         change_summary = {
> >             'create' : self.generate_create_summary,
> >             'delete' : self.generate_delete_summary,
> >             'update' : self.generate_update_summary,
> >             }[self.change_type](push)
> >         for line in change_summary:
> >             yield line
> > 
> >         for line in self.generate_revision_change_summary(push):
> >             yield line
> > 
> >     def generate_email_footer(self):
> >         return self.expand_lines(FOOTER_TEMPLATE)
> 
> Mostly boring string interpolation.  Okay.
> 
> >     def generate_revision_change_log(self, new_commits_list):
> >         if self.showlog:
> >             yield '\n'
> >             yield 'Detailed log of new commits:\n\n'
> >             for line in read_git_lines(
> >                     ['log', '--no-walk']
> >                     + self.logopts
> >                     + new_commits_list
> >                     + ['--'],
> >                     keepends=True,
> >                 ):
> >                 yield line
> 
> Okay.
> 
> Got bored.  Skipping to the next class.
> 
> > class BranchChange(ReferenceChange):
> >     refname_type = 'branch'
> 
> Unsure what new information this conveys over the type.

Presumably the "refname_type" member is used to provide more friendly
text in the message.

> > class AnnotatedTagChange(ReferenceChange):
> >     refname_type = 'annotated tag'
> > 
> >     def __init__(self, environment, refname, short_refname, old, new, rev):
> >         ReferenceChange.__init__(
> >             self, environment,
> >             refname=refname, short_refname=short_refname,
> >             old=old, new=new, rev=rev,
> >             )
> >         self.recipients = environment.get_announce_recipients(self)
> >         self.show_shortlog = environment.announce_show_shortlog
> > 
> >     ANNOTATED_TAG_FORMAT = (
> >         '%(*objectname)\n'
> >         '%(*objecttype)\n'
> >         '%(taggername)\n'
> >         '%(taggerdate)'
> >         )
> 
> Now I'm curious what you get by differentiating between annotated and
> unannotated tags.
> 
> >     def describe_tag(self, push):
> >         """Describe the new value of an annotated tag."""
> > 
> >         # Use git for-each-ref to pull out the individual fields from
> >         # the tag
> >         [tagobject, tagtype, tagger, tagged] = read_git_lines(
> >             ['for-each-ref', '--format=%s' % (self.ANNOTATED_TAG_FORMAT,), self.refname],
> >             )
> 
> You could've saved yourself a lot of trouble by running one f-e-r on
> refs/tags and filtering that.  I don't know what you're gaining from
> this overzealous object-orientation.
> 
> >         yield self.expand(
> >             BRIEF_SUMMARY_TEMPLATE, action='tagging',
> >             rev_short=tagobject, text='(%s)' % (tagtype,),
> >             )
> >         if tagtype == 'commit':
> >             # If the tagged object is a commit, then we assume this is a
> >             # release, and so we calculate which tag this tag is
> >             # replacing
> >             try:
> >                 prevtag = read_git_output(['describe', '--abbrev=0', '%s^' % (self.new,)])
> >             except CommandError:
> >                 prevtag = None
> >             if prevtag:
> >                 yield '  replaces  %s\n' % (prevtag,)
> >         else:
> >             prevtag = None
> >             yield '    length  %s bytes\n' % (read_git_output(['cat-file', '-s', tagobject]),)
> > 
> >         yield ' tagged by  %s\n' % (tagger,)
> >         yield '        on  %s\n' % (tagged,)
> >         yield '\n'
> 
> Okay, this information isn't present in an unannotated tag.  So you
> differentiate to exploit the additional information you can get.
> 
> >         # Show the content of the tag message; this might contain a
> >         # change log or release notes so is worth displaying.
> >         yield LOGBEGIN
> >         contents = list(read_git_lines(['cat-file', 'tag', self.new.sha1], keepends=True))
> 
> You could've easily batched this.
> 
> >         contents = contents[contents.index('\n') + 1:]
> >         if contents and contents[-1][-1:] != '\n':
> >             contents.append('\n')
> >         for line in contents:
> >             yield line
> > 
> >         if self.show_shortlog and tagtype == 'commit':
> >             # Only commit tags make sense to have rev-list operations
> >             # performed on them
> >             yield '\n'
> >             if prevtag:
> >                 # Show changes since the previous release
> >                 revlist = read_git_output(
> >                     ['rev-list', '--pretty=short', '%s..%s' % (prevtag, self.new,)],
> >                     keepends=True,
> >                     )
> >             else:
> >                 # No previous tag, show all the changes since time
> >                 # began
> >                 revlist = read_git_output(
> >                     ['rev-list', '--pretty=short', '%s' % (self.new,)],
> >                     keepends=True,
> >                     )
> >             for line in read_git_lines(['shortlog'], input=revlist, keepends=True):
> >                 yield line
> > 
> >         yield LOGEND
> >         yield '\n'
> 
> Way too many git invocations, I think.
> 
> > class OtherReferenceChange(ReferenceChange):
> >     refname_type = 'reference'
> > 
> >     def __init__(self, environment, refname, short_refname, old, new, rev):
> >         # We use the full refname as short_refname, because otherwise
> >         # the full name of the reference would not be obvious from the
> >         # text of the email.
> >         ReferenceChange.__init__(
> >             self, environment,
> >             refname=refname, short_refname=refname,
> >             old=old, new=new, rev=rev,
> >             )
> >         self.recipients = environment.get_refchange_recipients(self)
> 
> What is the point of this?  Why not just use ReferenceChange directly?
> 
> > class Mailer(object):
> >     """An object that can send emails."""
> > 
> >     def send(self, lines, to_addrs):
> >         """Send an email consisting of lines.
> > 
> >         lines must be an iterable over the lines constituting the
> >         header and body of the email.  to_addrs is a list of recipient
> >         addresses (can be needed even if lines already contains a
> >         "To:" field).  It can be either a string (comma-separated list
> >         of email addresses) or a Python list of individual email
> >         addresses.
> > 
> >         """
> > 
> >         raise NotImplementedError()
> 
> Abstract base class (abc)?  Or do you want to support Python <2.6?
> 
> > class SendMailer(Mailer):
> >     """Send emails using '/usr/sbin/sendmail -t'."""
> > 
> >     def __init__(self, command=None, envelopesender=None):
> >         """Construct a SendMailer instance.
> > 
> >         command should be the command and arguments used to invoke
> >         sendmail, as a list of strings.  If an envelopesender is
> >         provided, it will also be passed to the command, via '-f
> >         envelopesender'."""
> > 
> >         if command:
> >             self.command = command[:]
> >         else:
> >             self.command = ['/usr/sbin/sendmail', '-t']
> 
> If you want to DWIM when the configuration variable is missing, do it
> fully using a list of good candidates like /usr/lib/sendmail,
> /usr/sbin/sendmail, /usr/ucblib/sendmail, /usr/bin/msmtp.  Also, what
> happened to our faithful 'git send-email' Perl script?  Isn't that
> most likely to be installed?
> 
> >         if envelopesender:
> >             self.command.extend(['-f', envelopesender])
> > 
> >     def send(self, lines, to_addrs):
> >         try:
> >             p = subprocess.Popen(self.command, stdin=subprocess.PIPE)
> >         except OSError, e:
> >             sys.stderr.write(
> >                 '*** Cannot execute command: %s\n' % ' '.join(self.command)
> >                 + '*** %s\n' % str(e)
> >                 + '*** Try setting multimailhook.mailer to "smtp"\n'
> >                 '*** to send emails without using the sendmail command.\n'
> >                 )
> >             sys.exit(1)
> 
> Why do you need to concatenate strings using +?  This can take a list of strings, no?
> 
> > class SMTPMailer(Mailer):
> >     """Send emails using Python's smtplib."""
> > 
> >     def __init__(self, envelopesender, smtpserver):
> >         if not envelopesender:
> >             sys.stderr.write(
> >                 'fatal: git_multimail: cannot use SMTPMailer without a sender address.\n'
> >                 'please set either multimailhook.envelopeSender or user.email\n'
> >                 )
> >             sys.exit(1)
> >         self.envelopesender = envelopesender
> >         self.smtpserver = smtpserver
> >         try:
> >             self.smtp = smtplib.SMTP(self.smtpserver)
> >         except Exception, e:
> >             sys.stderr.write('*** Error establishing SMTP connection to %s***\n' % self.smtpserver)
> >             sys.stderr.write('*** %s\n' % str(e))
> >             sys.exit(1)
> 
> Let's hope Python's smtplib is robust.
> 
> >     def __del__(self):
> >         self.smtp.quit()
> 
> So you close the connection when the object is destroyed by the GC.
> 
> >     def send(self, lines, to_addrs):
> >         try:
> >             msg = ''.join(lines)
> >             # turn comma-separated list into Python list if needed.
> >             if isinstance(to_addrs, basestring):
> >                 to_addrs = [email for (name, email) in getaddresses([to_addrs])]
> >             self.smtp.sendmail(self.envelopesender, to_addrs, msg)
> >         except Exception, e:
> >             sys.stderr.write('*** Error sending email***\n')
> >             sys.stderr.write('*** %s\n' % str(e))
> >             self.smtp.quit()
> >             sys.exit(1)
> 
> Okay.
> 
> > class OutputMailer(Mailer):
> >     """Write emails to an output stream, bracketed by lines of '=' characters.
> > 
> >     This is intended for debugging purposes."""
> > 
> >     SEPARATOR = '=' * 75 + '\n'
> > 
> >     def __init__(self, f):
> >         self.f = f
> > 
> >     def send(self, lines, to_addrs):
> >         self.f.write(self.SEPARATOR)
> >         self.f.writelines(lines)
> >         self.f.write(self.SEPARATOR)
> 
> Unsure what this is.
> 
> > def get_git_dir():
> >     """Determine GIT_DIR.
> > 
> >     Determine GIT_DIR either from the GIT_DIR environment variable or
> >     from the working directory, using Git's usual rules."""
> > 
> >     try:
> >         return read_git_output(['rev-parse', '--git-dir'])
> >     except CommandError:
> >         sys.stderr.write('fatal: git_multimail: not in a git working copy\n')
> >         sys.exit(1)
> 
> Why do you need a working copy?  Will a bare repository not suffice?
> 
> > class Environment(object):
> 
> New-style class.  I wonder why you suddenly switched.
> 
> >     REPO_NAME_RE = re.compile(r'^(?P<name>.+?)(?:\.git)$')
> > 
> >     def __init__(self, osenv=None):
> >         self.osenv = osenv or os.environ
> >         self.announce_show_shortlog = False
> >         self.maxcommitemails = 500
> >         self.diffopts = ['--stat', '--summary', '--find-copies-harder']
> >         self.logopts = []
> >         self.refchange_showlog = False
> > 
> >         self.COMPUTED_KEYS = [
> >             'administrator',
> >             'charset',
> >             'emailprefix',
> >             'fromaddr',
> >             'pusher',
> >             'pusher_email',
> >             'repo_path',
> >             'repo_shortname',
> >             'sender',
> >             ]
> > 
> >         self._values = None
> 
> Okay.
> 
> > [...]
> 
> Seems to be some boilerplate thing.  I'll skip to the next class.
> 
> > class ConfigEnvironmentMixin(Environment):
> >     """A mixin that sets self.config to its constructor's config argument.
> > 
> >     This class's constructor consumes the "config" argument.
> > 
> >     Mixins that need to inspect the config should inherit from this
> >     class (1) to make sure that "config" is still in the constructor
> >     arguments with its own constructor runs and/or (2) to be sure that
> >     self.config is set after construction."""
> > 
> >     def __init__(self, config, **kw):
> >         super(ConfigEnvironmentMixin, self).__init__(**kw)
> >         self.config = config
> 
> Overdoing the OO factories, much?
> 
> I'll skip a few boring factory classes.
> 
> > class GenericEnvironment(
> >     ProjectdescEnvironmentMixin,
> >     ConfigMaxlinesEnvironmentMixin,
> >     ConfigFilterLinesEnvironmentMixin,
> >     ConfigRecipientsEnvironmentMixin,
> >     PusherDomainEnvironmentMixin,
> >     ConfigOptionsEnvironmentMixin,
> >     GenericEnvironmentMixin,
> >     Environment,
> >     ):
> >     pass
> 
> Sigh.  I might as well be reading some Java now :/
> 
> Sorry, I'm exhausted.
> 
> Let's take a step back and look at what this gigantic script is doing.
> It uses the information from a push to string-interpolate a template
> and generate emails, right?  The rest of the script is about churning
> on the updated refs to prettify the emails.
> 
> From my quick reading, it seems to be unnecessarily complicated and
> inefficient.  Why are there so many factories, and why do you call out
> to git at every opportunity, instead of cleanly separating computation
> from rendering?

I thought the point of this was to produce a much more flexible way to
generate emails when commits happen.  AFAICT all of the "complexity" is
flexibility to make it easy to use this for new use cases.

I have to say that I don't think this is a particularly useful review,
you seem to have skipped huge portions of the code and spent a lot of
time making us read your thought process rather than providing
constructive feedback.  What feedback there is mostly seems to be
expressions of disgust rather than actionable points.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-02 20:51 ` John Keeping
@ 2013-07-02 21:34   ` Ramkumar Ramachandra
  0 siblings, 0 replies; 15+ messages in thread
From: Ramkumar Ramachandra @ 2013-07-02 21:34 UTC (permalink / raw)
  To: John Keeping; +Cc: Git List, Michael Haggerty

John Keeping wrote:
> I have to say that I don't think this is a particularly useful review,
> you seem to have skipped huge portions of the code and spent a lot of
> time making us read your thought process rather than providing
> constructive feedback.  What feedback there is mostly seems to be
> expressions of disgust rather than actionable points.

Well, ignore it then.  I'm sorry for having wasted your time.

I did what I could in the time I was willing to spend reading it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-02 19:23 Review of git multimail Ramkumar Ramachandra
  2013-07-02 20:51 ` John Keeping
@ 2013-07-02 22:21 ` Junio C Hamano
  2013-07-03  8:02   ` Michael Haggerty
  2013-07-03  0:10 ` Michael Haggerty
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-07-02 22:21 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List, Michael Haggerty

Ramkumar Ramachandra <artagnon@gmail.com> writes:

>>     def get(self, name, default=''):
>>         try:
>>             values = self._split(read_git_output(
>>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
>>                     env=self.env, keepends=True,
>>                     ))
>
> Wait, what is the point of using --null and then splitting by hand
> using a poorly-defined static method?  Why not drop the --null and
> splitlines() as usual?

You may actually have spotted a bug or misuse of "--get" here.

With this sample configuration:

        $ cat >sample <<\EOF
        [a]
                one = value
                one = another

        [b]
                one = "value\nanother"
        EOF

A script cannot differentiate between them without using '--null'.

	$ git config -f sample --get-all a.one
        $ git config -f sample --get-all b.one

But that matters only when you use "--get-all", not "--get".  If
this method wants to make sure that the user did not misuse a.one
as a multi-valued configuration variable, use of "--null --get-all"
followed by checking how many items the command gives you back would
be a way to do so.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-02 19:23 Review of git multimail Ramkumar Ramachandra
  2013-07-02 20:51 ` John Keeping
  2013-07-02 22:21 ` Junio C Hamano
@ 2013-07-03  0:10 ` Michael Haggerty
  2013-07-03 10:23   ` Ramkumar Ramachandra
  2 siblings, 1 reply; 15+ messages in thread
From: Michael Haggerty @ 2013-07-03  0:10 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

On 07/02/2013 09:23 PM, Ramkumar Ramachandra wrote:
> I figured that we should quickly read through git-multimail and give
> it an on-list review.  Hopefully, it'll educate the list about what
> this is, and help improve the script itself.

Wonderful, thanks!

> Sources: https://github.com/mhagger/git-multimail
> 
> git_multimail.py wrote:
>> #! /usr/bin/env python2
> 
> Do all distributions ship it as python2 now?

No, but nor is "python" always Python version 2.x (I believe that Arch
Linux now installs Python 3 as "python").  This topic was discussed here
[1].  Basically, my reasoning is that (a) PEP 394 [2] says that
"python2" is the correct name for a Python 2.x interpreter, (b) I
believe that other distros are moving in that direction, and (c) if the
script says "python2" but no python2 is installed, the error is pretty
obvious, whereas if the script says "python" and that is actually Python
3.x, the error would be more cryptic.

>> class CommandError(Exception):
>>     def __init__(self, cmd, retcode):
>>         self.cmd = cmd
>>         self.retcode = retcode
>>         Exception.__init__(
>>             self,
>>             'Command "%s" failed with retcode %s' % (' '.join(cmd), retcode,)
> 
> So cmd is a list.

Yes, commands are always lists in my code because it is less error-prone
than trying to quote arguments correctly for the shell.  Do you think I
should document that here, or elsewhere, or everywhere, or ...?

>> class ConfigurationException(Exception):
>>     pass
> 
> Dead code?

No, this defines an exception class that inherits all of its methods
(including its constructors) from Exception.  This is useful because an
exception of type ConfigurationException is distinguishable from other
types of Exceptions, and can be caught using "except
ConfigurationException, e".

>> def read_git_output(args, input=None, keepends=False, **kw):
>>     """Read the output of a Git command."""
>>
>>     return read_output(
>>         ['git', '-c', 'i18n.logoutputencoding=%s' % (ENCODING,)] + args,
>>         input=input, keepends=keepends, **kw
>>         )
> 
> Okay, although I'm wondering what i18n.logoutputencoding has to do with anything.

Ultimately, a lot of the output of these commands is going to be
inserted into an email that claims to be UTF-8, so this is here to
hopefully avoid at least one source of non-UTF-8 text.

> [...]
>> class Config(object):
>>     def __init__(self, section, git_config=None):
>>         """Represent a section of the git configuration.
>>
>>         If git_config is specified, it is passed to "git config" in
>>         the GIT_CONFIG environment variable, meaning that "git config"
>>         will read the specified path rather than the Git default
>>         config paths."""
>>
>>         self.section = section
>>         if git_config:
>>             self.env = os.environ.copy()
>>             self.env['GIT_CONFIG'] = git_config
>>         else:
>>             self.env = None
> 
> Okay.
> 
>>     @staticmethod
>>     def _split(s):
>>         """Split NUL-terminated values."""
>>
>>         words = s.split('\0')
>>         assert words[-1] == ''
>>         return words[:-1]
> 
> Ugh.  Two callers of this poorly-defined static method: I wonder if
> we'd be better off inlining it.
> 
>>     def get(self, name, default=''):
>>         try:
>>             values = self._split(read_git_output(
>>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
>>                     env=self.env, keepends=True,
>>                     ))
> 
> Wait, what is the point of using --null and then splitting by hand
> using a poorly-defined static method?  Why not drop the --null and
> splitlines() as usual?

To avoid confusion if a single config value contains end-of-line
characters.  In this case we are using --get, so only a single value is
allowed anyway, and presumably we could take the output and strip a
single last '\n' from it.  But null-terminated output is simply easier
to handle in general and I don't see an advantage to avoiding its usage.

>>             assert len(values) == 1
> 
> When does this assert fail?

It shouldn't; treat this as a verification that everything is sane.  For
example if I had mistakenly used splitlines(), then the assertion could
have failed; without the assertion the next line would have masked the
error ;-)

>>             return values[0]
>>         except CommandError:
>>             return default
> 
> If you're emulating the dictionary get method, default=None.  This is
> not C, where all codepaths of the function must return the same type.

You are right.  When I designed this class, I though that the empty
string would often be convenient.  But I just reviewed callers, and many
(most?) of them explicitly set the default to None anyway.  I will
change this.

> [...]
>>     def get_all(self, name, default=None):
>>         """Read a (possibly multivalued) setting from the configuration.
>>
>>         Return the result as a list of values, or default if the name
>>         is unset."""
>>
>>         try:
>>             return self._split(read_git_output(
>>                 ['config', '--get-all', '--null', '%s.%s' % (self.section, name)],
>>                 env=self.env, keepends=True,
>>                 ))
>>         except CommandError, e:
> 
> CommandError as e?

The new syntax is not available before Python 2.6.  It will have to be
changed when we try to support Python 3.x, but until then the change
wouldn't bring any benefits and would definitely prevent the script from
running under Python 2.4 or 2.5 (which are currently supported).

>>             if e.retcode == 1:
> 
> What does this cryptic retcode mean?

According to git-config(1), it means "the section or key is invalid" and
empirically this is the error code you get when you try to read a key
that is not defined.  I will add a comment.

>>                 return default
>>             else:
>>                 raise
> 
> raise what?

This is the Python construct to re-throw the exception that was caught
in the catch block containing it; i.e., the CommandError from a few
lines earlier.

> You've instantiated the Config class in two places: user and
> multimailhook sections.  Considering that you're going to read all the
> keys in that section, why not --get-regexp, pre-load the configuration
> into a dictionary and refer to that instead of spawning 'git config'
> every time you need a configuration value?

Yes, it's on my todo list.

>>     def get_recipients(self, name, default=None):
>>         """Read a recipients list from the configuration.
>>
>>         Return the result as a comma-separated list of email
>>         addresses, or default if the option is unset.  If the setting
>>         has multiple values, concatenate them with comma separators."""
>>
>>         lines = self.get_all(name, default=None)
>>         if lines is None:
>>             return default
>>         return ', '.join(line.strip() for line in lines)
> 
> Ugh.

?

>>     def set(self, name, value):
>>         read_git_output(
>>             ['config', '%s.%s' % (self.section, name), value],
>>             env=self.env,
>>             )
>>
>>     def add(self, name, value):
>>         read_git_output(
>>             ['config', '--add', '%s.%s' % (self.section, name), value],
>>             env=self.env,
>>             )
>>
>>     def has_key(self, name):
>>         return self.get_all(name, default=None) is not None
>>
>>     def unset_all(self, name):
>>         try:
>>             read_git_output(
>>                 ['config', '--unset-all', '%s.%s' % (self.section, name)],
>>                 env=self.env,
>>                 )
>>         except CommandError, e:
>>             if e.retcode == 5:
>>                 # The name doesn't exist, which is what we wanted anyway...
>>                 pass
>>             else:
>>                 raise
>>
>>     def set_recipients(self, name, value):
>>         self.unset_all(name)
>>         for pair in getaddresses([value]):
>>             self.add(name, formataddr(pair))
> 
> Dead code?

git_multimail is used as a library by migrate-mailhook-config, and that
script uses these methods.

>> def generate_summaries(*log_args):
>>     """Generate a brief summary for each revision requested.
>>
>>     log_args are strings that will be passed directly to "git log" as
>>     revision selectors.  Iterate over (sha1_short, subject) for each
>>     commit specified by log_args (subject is the first line of the
>>     commit message as a string without EOLs)."""
>>
>>     cmd = [
>>         'log', '--abbrev', '--format=%h %s',
>>         ] + list(log_args) + ['--']
> 
> What is log_args if not a list?

It is a tuple and therefore needs to be converted to a list here.

> But yeah, log is the best way to generate summaries.
> 
> [...]
>> class GitObject(object):
>>     def __init__(self, sha1, type=None):
>>         if sha1 == ZEROS:
>>             self.sha1 = self.type = self.commit = None
>>         else:
>>             self.sha1 = sha1
>>             self.type = type or read_git_output(['cat-file', '-t', self.sha1])
>>
>>             if self.type == 'commit':
>>                 self.commit = self
>>             elif self.type == 'tag':
>>                 try:
>>                     self.commit = GitObject(
>>                         read_git_output(['rev-parse', '--verify', '%s^0' % (self.sha1,)]),
>>                         type='commit',
>>                         )
>>                 except CommandError:
>>                     self.commit = None
>>             else:
>>                 self.commit = None
>>
>>         self.short = read_git_output(['rev-parse', '--short', sha1])
> 
> Just rev-parse --verify --short $SHA1^0: if it resolves, set
> self.short; one liner?

I don't follow.  We need both the long and the short SHA-1s to fill in
the templates.  What code exactly do you propose to replace with your
one-liner?

>>     def get_summary(self):
>>         """Return (sha1_short, subject) for this commit."""
>>
>>         if not self.sha1:
>>             raise ValueError('Empty commit has no summary')
> 
> What is the point of letting the user instantiate a GitObject without
> a valid .sha1 in the first place?

'0'*40 is passed to the post-receive script to indicate "no object"; for
example, a branch deletion is represented as

0123456789abcdef0123456789abcdef01234567
0000000000000000000000000000000000000000 refs/heads/branch

It is convenient to treat this as if it were a GitObject.
GitObject.__nonzero__() (which is called if a GitObject is evaluated in
a boolean context) returns False for these non-objects.

>>         return iter(generate_summaries('--no-walk', self.sha1)).next()
> 
> Not exactly fond of this, but I don't have a concrete replacement at
> the moment.
> 
> [...]
> Okay.
> 
>>     def __str__(self):
>>         return self.sha1 or ZEROS
> 
> I wonder what value this adds when .short is around.

The full object name is used in the X-Git-{Oldrev,Newrev} email headers
and probably in some error messages and stuff.

>> class Change(object):
>>     """A Change that has been made to the Git repository.
>>
>>     Abstract class from which both Revisions and ReferenceChanges are
>>     derived.  A Change knows how to generate a notification email
>>     describing itself."""
>>
>>     def __init__(self, environment):
>>         self.environment = environment
>>         self._values = None
>>
>>     def _compute_values(self):
>>         """Return a dictionary {keyword : expansion} for this Change.
>>
>>         Derived classes overload this method to add more entries to
>>         the return value.  This method is used internally by
>>         get_values().  The return value should always be a new
>>         dictionary."""
>>
>>         return self.environment.get_values()
> 
> Why is this an "internal function"?  What is your criterion for
> internal versus non-internal?

This method is meant to be overridden by derived classes to affect the
map returned by get_values().  But elsewhere get_values() should be
called, not this method (because get_values() memoizes its return value).

>>     def get_values(self, **extra_values):
>>         """Return a dictionary {keyword : expansion} for this Change.
>>
>>         Return a dictionary mapping keywords to the values that they
>>         should be expanded to for this Change (used when interpolating
>>         template strings).  If any keyword arguments are supplied, add
>>         those to the return value as well.  The return value is always
>>         a new dictionary."""
>>
>>         if self._values is None:
>>             self._values = self._compute_values()
>>
>>         values = self._values.copy()
>>         if extra_values:
>>             values.update(extra_values)
>>         return values
> 
> Unsure what this is about.

The dictionary is mainly used to provide values that can be interpolated
into the email templates.  It also has the advantage that it is only
called once, and then its value is used multiple times, which limits the
amount of boilerplate needed for derived classes to override the getter
methods without forcing those methods to be called many times.

> [...]
>>     def expand_header_lines(self, template, **extra_values):
>>         """Break template into lines and expand each line as an RFC 2822 header.
>>
>>         Encode values and split up lines that are too long.  Silently
>>         skip lines that contain references to unknown variables."""
>>
>>         values = self.get_values(**extra_values)
>>         for line in template.splitlines(True):
>>             (name, value) = line.split(':', 1)
>>             value = value.rstrip('\n\r')
> 
> Doesn't splitlines() make the rstrip() redundant?

As written it doesn't because I pass keepends=True to splitlines().  But
if I remove the keepends argument then I can indeed drop the rstrip().
Will change.

>>             try:
>>                 value = value % values
>>             except KeyError, e:
>>                 if DEBUG:
>>                     sys.stderr.write(
>>                         'Warning: unknown variable %r in the following line; line skipped:\n'
>>                         '    %s'
>>                         % (e.args[0], line,)
>>                         )
> 
> If DEBUG isn't on, you risk leaving the value string interpolated
> without even telling the user.  What does it mean to the end user?

There are some "nice-to-have" values in the templates that are not
necessary and might be missing if the user hasn't gone to the trouble to
configure every last setting.  For example, if no emaildomain is defined
then the pusher_email cannot be determined, resulting in the Reply-To
header being omitted.

My assumption is that a sysadmin would turn on DEBUG when testing the
script, check that any missing headers are uninteresting, and then turn
off DEBUG for production use so that users don't have to see the
warnings every time they push.

If you have another suggestion, let me know.

>>             else:
>>                 try:
>>                     h = Header(value, header_name=name)
>>                 except UnicodeDecodeError:
>>                     h = Header(value, header_name=name, charset=CHARSET, errors='replace')
>>                 for splitline in ('%s: %s\n' % (name, h.encode(),)).splitlines(True):
>>                     yield splitline
> 
> Not elated by this exception cascading, but I suppose it's cheaper
> than actually checking everything.
> 
>>     def generate_email_header(self):
>>         """Generate the RFC 2822 email headers for this Change, a line at a time.
>>
>>         The output should not include the trailing blank line."""
>>
>>         raise NotImplementedError()
>>
>>     def generate_email_intro(self):
>>         """Generate the email intro for this Change, a line at a time.
>>
>>         The output will be used as the standard boilerplate at the top
>>         of the email body."""
>>
>>         raise NotImplementedError()
>>
>>     def generate_email_body(self):
>>         """Generate the main part of the email body, a line at a time.
>>
>>         The text in the body might be truncated after a specified
>>         number of lines (see multimailhook.emailmaxlines)."""
>>
>>         raise NotImplementedError()
>>
>>     def generate_email_footer(self):
>>         """Generate the footer of the email, a line at a time.
>>
>>         The footer is always included, irrespective of
>>         multimailhook.emailmaxlines."""
>>
>>         raise NotImplementedError()
> 
> Unsure what these are about.

These are basically just to allow code sharing across the various Change
classes.

>>     def generate_email(self, push, body_filter=None):
>>         """Generate an email describing this change.
>>
>>         Iterate over the lines (including the header lines) of an
>>         email describing this change.  If body_filter is not None,
>>         then use it to filter the lines that are intended for the
>>         email body."""
>>
>>         for line in self.generate_email_header():
>>             yield line
>>         yield '\n'
>>         for line in self.generate_email_intro():
>>             yield line
>>
>>         body = self.generate_email_body(push)
>>         if body_filter is not None:
> 
> Redundant "is not None".

This way of writing the test is robust against objects for which
bool(body_filter) might return False.

>>             body = body_filter(body)
>>         for line in body:
>>             yield line
>>
>>         for line in self.generate_email_footer():
>>             yield line
> 
> Nicely done with yield.
> 
>> class Revision(Change):
>>     """A Change consisting of a single git commit."""
>>
>>     def __init__(self, reference_change, rev, num, tot):
>>         Change.__init__(self, reference_change.environment)
> 
> super?

IMO, in Python 2.x, super() is really only useful in a class hierarchy
where multiple inheritance is going to be supported, like in the
Environment classes.  The problem is that even if you use super(), you
have to type the name of the containing class explicitly; e.g.,

    super(Revision, self).__init__(reference_change.environment)

It is even longer than the explicit reference to the parent class, and
though it doesn't break if another class is inserted into the
inheritance chain, it *does* break if the class itself is renamed.  So I
usually don't bother with super() unless I'm using multiple inheritance.

In Python 3, where super() doesn't require redundant arguments, it is
much less cumbersome to use.

> [...]
>>         # First line of commit message:
>>         try:
>>             oneline = read_git_output(
>>                 ['log', '--format=%s', '--no-walk', self.rev.sha1]
>>                 )
>>         except CommandError:
>>             oneline = self.rev.sha1
> 
> What does this mean?  When will you get a CommandError?

I can't think of a plausible reason that this command would fail.

>                                                          And how do
> you respond to it?

By using the commit's SHA-1 in place of its subject line.

>>         values['rev'] = self.rev.sha1
>>         values['rev_short'] = self.rev.short
>>         values['change_type'] = self.change_type
>>         values['refname'] = self.refname
>>         values['short_refname'] = self.reference_change.short_refname
>>         values['refname_type'] = self.reference_change.refname_type
>>         values['reply_to_msgid'] = self.reference_change.msgid
>>         values['num'] = self.num
>>         values['tot'] = self.tot
>>         values['recipients'] = self.recipients
>>         values['oneline'] = oneline
>>         values['author'] = self.author
> 
> Ugh.  Use
> 
>   { rev: self.rev.sha1,
>     rev_short: self.rev.short
>     ...
>   }
> 
> and merge it with the existing dictionary.

Yes, I could do that (though it needs quotes around the key strings).
Or the even more attractive

    values.update(
        rev=self.rev.sha1,
        rev_short=self.rev.short,
        ...
        )

I had the latter in an earlier version of the script, but I thought it
might be too unfamiliar for non-Python-experts.  I guess I'm using
pretty highfalutin Python anyway so this change wouldn't hurt.  What do
you think?

>                                             Unsure why you're building
> a dictionary in the first place.

To use in template interpolation and also the other reasons mentioned above.

> [...]
>>     @staticmethod
> 
> Unsure what such a huge static method is doing here, but we'll find
> out soon enough.
> 
>>     def create(environment, oldrev, newrev, refname):
>>         """Return a ReferenceChange object representing the change.
>>
>>         Return an object that represents the type of change that is being
>>         made. oldrev and newrev should be SHA1s or ZEROS."""
> 
> Like I said before, use the typesystem effectively: why is using a
> string with 40 zeros somehow better than None in your program _logic_?
> I can understand converting None to 40 zeros for display purposes.

The ZEROS come straight from the post-receive script input, and as soon
as they are wrapped in a GitObject they are turned into None.

>>         old = GitObject(oldrev)
>>         new = GitObject(newrev)
>>         rev = new or old
>>
>>         # The revision type tells us what type the commit is, combined with
>>         # the location of the ref we can decide between
>>         #  - working branch
>>         #  - tracking branch
>>         #  - unannotated tag
>>         #  - annotated tag
> 
> Could be simpler.

If you mean the distinction between four types of ref is
overcomplicated, this is something taken over from the old
post-receive-email script.  If you just mean that the code could be
simplified, then please make a suggestion.

> [...]
>>     def __init__(self, environment, refname, short_refname, old, new, rev):
>>         Change.__init__(self, environment)
>>         self.change_type = {
>>             (False, True) : 'create',
>>             (True, True) : 'update',
>>             (True, False) : 'delete',
>>             }[bool(old), bool(new)]
> 
> As a general principle, avoid casting: if new is a dictionary, what
> does bool(new) even mean?  You just have to trust types, and let go of
> that much safety.

old and new are not dictionaries, they are GitObject instances.  And
this is not casting, it is calling old.__nonzero__() and
new.__nonzero__() to see whether they are real objects vs. ZEROS and to
canonicalize their values so that they can be used as indexes for the
literal dictionary that decides what type of change is being described.

> [...]
>> class BranchChange(ReferenceChange):
>>     refname_type = 'branch'
> 
> Unsure what new information this conveys over the type.

It is made available for template interpolation.

> [...]
>>     def describe_tag(self, push):
>>         """Describe the new value of an annotated tag."""
>>
>>         # Use git for-each-ref to pull out the individual fields from
>>         # the tag
>>         [tagobject, tagtype, tagger, tagged] = read_git_lines(
>>             ['for-each-ref', '--format=%s' % (self.ANNOTATED_TAG_FORMAT,), self.refname],
>>             )
> 
> You could've saved yourself a lot of trouble by running one f-e-r on
> refs/tags and filtering that.  I don't know what you're gaining from
> this overzealous object-orientation.

It's only needed for the tags that have changed (which is probably zero
in most cases).

> [...]
>>         # Show the content of the tag message; this might contain a
>>         # change log or release notes so is worth displaying.
>>         yield LOGBEGIN
>>         contents = list(read_git_lines(['cat-file', 'tag', self.new.sha1], keepends=True))
> 
> You could've easily batched this.

I don't understand what you mean.

>>         contents = contents[contents.index('\n') + 1:]
>>         if contents and contents[-1][-1:] != '\n':
>>             contents.append('\n')
>>         for line in contents:
>>             yield line
>>
>>         if self.show_shortlog and tagtype == 'commit':
>>             # Only commit tags make sense to have rev-list operations
>>             # performed on them
>>             yield '\n'
>>             if prevtag:
>>                 # Show changes since the previous release
>>                 revlist = read_git_output(
>>                     ['rev-list', '--pretty=short', '%s..%s' % (prevtag, self.new,)],
>>                     keepends=True,
>>                     )
>>             else:
>>                 # No previous tag, show all the changes since time
>>                 # began
>>                 revlist = read_git_output(
>>                     ['rev-list', '--pretty=short', '%s' % (self.new,)],
>>                     keepends=True,
>>                     )
>>             for line in read_git_lines(['shortlog'], input=revlist, keepends=True):
>>                 yield line
>>
>>         yield LOGEND
>>         yield '\n'
> 
> Way too many git invocations, I think.

Luckily git is very fast :-)

I'm not to worried about performance here.  The script will typically
only be run on pushes, and most pushes affect only one or a few
references.  I don't think these few git invocations will be prohibitive.

>> class OtherReferenceChange(ReferenceChange):
>>     refname_type = 'reference'
>>
>>     def __init__(self, environment, refname, short_refname, old, new, rev):
>>         # We use the full refname as short_refname, because otherwise
>>         # the full name of the reference would not be obvious from the
>>         # text of the email.
>>         ReferenceChange.__init__(
>>             self, environment,
>>             refname=refname, short_refname=refname,
>>             old=old, new=new, rev=rev,
>>             )
>>         self.recipients = environment.get_refchange_recipients(self)
> 
> What is the point of this?  Why not just use ReferenceChange directly?

Maybe you missed "short_refname=refname" (one of the arguments is not
being passed through 1:1).  The reason is explained in the comment.

>> class Mailer(object):
>>     """An object that can send emails."""
>>
>>     def send(self, lines, to_addrs):
>>         """Send an email consisting of lines.
>>
>>         lines must be an iterable over the lines constituting the
>>         header and body of the email.  to_addrs is a list of recipient
>>         addresses (can be needed even if lines already contains a
>>         "To:" field).  It can be either a string (comma-separated list
>>         of email addresses) or a Python list of individual email
>>         addresses.
>>
>>         """
>>
>>         raise NotImplementedError()
> 
> Abstract base class (abc)?  Or do you want to support Python <2.6?

Yes, AFAIK the script works with any Python >= 2.4.

>> class SendMailer(Mailer):
>>     """Send emails using '/usr/sbin/sendmail -t'."""
>>
>>     def __init__(self, command=None, envelopesender=None):
>>         """Construct a SendMailer instance.
>>
>>         command should be the command and arguments used to invoke
>>         sendmail, as a list of strings.  If an envelopesender is
>>         provided, it will also be passed to the command, via '-f
>>         envelopesender'."""
>>
>>         if command:
>>             self.command = command[:]
>>         else:
>>             self.command = ['/usr/sbin/sendmail', '-t']
> 
> If you want to DWIM when the configuration variable is missing, do it
> fully using a list of good candidates like /usr/lib/sendmail,
> /usr/sbin/sendmail, /usr/ucblib/sendmail, /usr/bin/msmtp.

OK, I just added /usr/sbin/sendmail and /usr/lib/sendmail, which are the
paths checked by "git send-mail".

>                                                            Also, what
> happened to our faithful 'git send-email' Perl script?  Isn't that
> most likely to be installed?

We could use "git send-email" to generate and send the revision emails,
but then we would lose most control over the contents of the emails.

>>         if envelopesender:
>>             self.command.extend(['-f', envelopesender])
>>
>>     def send(self, lines, to_addrs):
>>         try:
>>             p = subprocess.Popen(self.command, stdin=subprocess.PIPE)
>>         except OSError, e:
>>             sys.stderr.write(
>>                 '*** Cannot execute command: %s\n' % ' '.join(self.command)
>>                 + '*** %s\n' % str(e)
>>                 + '*** Try setting multimailhook.mailer to "smtp"\n'
>>                 '*** to send emails without using the sendmail command.\n'
>>                 )
>>             sys.exit(1)
> 
> Why do you need to concatenate strings using +?  This can take a list of strings, no?

sys.stderr.write() can only take a single string argument.  You might
have seen it called like this:

    sys.stderr.write(
        'foo\n'
        'bar\n'
        )

This is using the Python compiler's feature that literal strings can be
appended to each other by juxtaposition (notice there are no commas).
But this only works for literal strings, not for string expressions.

> [...]
>>     def __del__(self):
>>         self.smtp.quit()
> 
> So you close the connection when the object is destroyed by the GC.

Yes, where here (since we are talking about CPython) reference counting
is used and objects are deleted as soon as the reference count goes to
zero.  The point is to send all of the emails through one connection to
the SMTP server, which I think saves a lot of time.

> [...]
>> class OutputMailer(Mailer):
>>     """Write emails to an output stream, bracketed by lines of '=' characters.
>>
>>     This is intended for debugging purposes."""
>>
>>     SEPARATOR = '=' * 75 + '\n'
>>
>>     def __init__(self, f):
>>         self.f = f
>>
>>     def send(self, lines, to_addrs):
>>         self.f.write(self.SEPARATOR)
>>         self.f.writelines(lines)
>>         self.f.write(self.SEPARATOR)
> 
> Unsure what this is.

For testing and debugging (e.g., via the --stdout command-line option).

>> def get_git_dir():
>>     """Determine GIT_DIR.
>>
>>     Determine GIT_DIR either from the GIT_DIR environment variable or
>>     from the working directory, using Git's usual rules."""
>>
>>     try:
>>         return read_git_output(['rev-parse', '--git-dir'])
>>     except CommandError:
>>         sys.stderr.write('fatal: git_multimail: not in a git working copy\n')
>>         sys.exit(1)
> 
> Why do you need a working copy?  Will a bare repository not suffice?

Yes, a bare repo definitely suffices.  I think the error message is just
misleading, correct?  Will fix.

>> class Environment(object):
> 
> New-style class.  I wonder why you suddenly switched.

?  All of the classes are new-style classes.

> [...]
>> class ConfigEnvironmentMixin(Environment):
>>     """A mixin that sets self.config to its constructor's config argument.
>>
>>     This class's constructor consumes the "config" argument.
>>
>>     Mixins that need to inspect the config should inherit from this
>>     class (1) to make sure that "config" is still in the constructor
>>     arguments with its own constructor runs and/or (2) to be sure that
>>     self.config is set after construction."""
>>
>>     def __init__(self, config, **kw):
>>         super(ConfigEnvironmentMixin, self).__init__(**kw)
>>         self.config = config
> 
> Overdoing the OO factories, much?

I went to a lot of trouble to make the Environment mixin classes
composable, because what I've learned from the feedback in the last
months is that everybody wants to do something different with this
script.  I tried out a few designs before I settled on this one.

> I'll skip a few boring factory classes.
> 
>> class GenericEnvironment(
>>     ProjectdescEnvironmentMixin,
>>     ConfigMaxlinesEnvironmentMixin,
>>     ConfigFilterLinesEnvironmentMixin,
>>     ConfigRecipientsEnvironmentMixin,
>>     PusherDomainEnvironmentMixin,
>>     ConfigOptionsEnvironmentMixin,
>>     GenericEnvironmentMixin,
>>     Environment,
>>     ):
>>     pass
> 
> Sigh.  I might as well be reading some Java now :/

No, Java doesn't allow multiple inheritance :-)

> Sorry, I'm exhausted.
> 
> Let's take a step back and look at what this gigantic script is doing.
> It uses the information from a push to string-interpolate a template
> and generate emails, right?  The rest of the script is about churning
> on the updated refs to prettify the emails.
> 
> From my quick reading, it seems to be unnecessarily complicated and
> inefficient.  Why are there so many factories, and why do you call out
> to git at every opportunity, instead of cleanly separating computation
> from rendering?

Regarding size: post-receive-email is 748 lines of shell script,
including comments and string literals.  The extent of its
configurability is approximately this block of code:

> projectdesc=$(sed -ne '1p' "$GIT_DIR/description" 2>/dev/null)
> # Check if the description is unchanged from it's default, and shorten it to
> # a more manageable length if it is
> if expr "$projectdesc" : "Unnamed repository.*$" >/dev/null
> then
> 	projectdesc="UNNAMED PROJECT"
> fi
> 
> recipients=$(git config hooks.mailinglist)
> announcerecipients=$(git config hooks.announcelist)
> envelopesender=$(git config hooks.envelopesender)
> emailprefix=$(git config hooks.emailprefix || echo '[SCM] ')
> custom_showrev=$(git config hooks.showrev)
> maxlines=$(git config hooks.emailmaxlines)
> diffopts=$(git config hooks.diffopts)
> : ${diffopts:="--stat --summary --find-copies-harder"}

The script has to be edited to make any non-trivial configuration change.

git_multimail.py is 2398 lines of Python script, including comments and
string literals.  The fraction of that code that is dedicated to
configurability is approximately 1000 lines.  Relative to
post-receive-email, it adds

* much more configurability, without the need to edit the script.

* optional separate emails for each commit

* non-buggy determination of which commits have been added by a
reference change, and distinction between commits that have been added
to a branch vs. commits that have been added altogether and between
commits that have been deleted from a branch vs. commits that have been
deleted altogether.

* migration code to migrate a post-receive-email configuration into a
git-multimail configuration (mostly via a supplemental script)

* support for gitolite environments

* support for sendmail vs. smtplib

* improved utf-8 correctness.

Regarding efficiency, I don't think it is a problem.  But patches or
concrete suggestions are certainly welcome.

Regarding separation of computation and rendering, yes, they could be
separated better.  (BTW, it would make the script even longer.)  The
rendering is already largely done via templates that can be changed from
outside of the script.  But I might work on separating them more
strictly so that some of the code could be reused, for example, to send
notifications via IRC or XMPP.

Thanks for all of your comments!  I hope I have addressed most of them
in this email and in the commits that I just pushed to GitHub.

Michael

[1] https://github.com/mhagger/git-multimail/pull/2
[2] http://www.python.org/dev/peps/pep-0394/

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-02 22:21 ` Junio C Hamano
@ 2013-07-03  8:02   ` Michael Haggerty
  2013-07-03  8:16     ` Junio C Hamano
  2013-07-03  8:29     ` John Keeping
  0 siblings, 2 replies; 15+ messages in thread
From: Michael Haggerty @ 2013-07-03  8:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Git List

On 07/03/2013 12:21 AM, Junio C Hamano wrote:
> Ramkumar Ramachandra <artagnon@gmail.com> writes:
> 
>>>     def get(self, name, default=''):
>>>         try:
>>>             values = self._split(read_git_output(
>>>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
>>>                     env=self.env, keepends=True,
>>>                     ))
>>
>> Wait, what is the point of using --null and then splitting by hand
>> using a poorly-defined static method?  Why not drop the --null and
>> splitlines() as usual?
> 
> You may actually have spotted a bug or misuse of "--get" here.
> 
> With this sample configuration:
> 
>         $ cat >sample <<\EOF
>         [a]
>                 one = value
>                 one = another
> 
>         [b]
>                 one = "value\nanother"
>         EOF
> 
> A script cannot differentiate between them without using '--null'.
> 
> 	$ git config -f sample --get-all a.one
>         $ git config -f sample --get-all b.one
> 
> But that matters only when you use "--get-all", not "--get".  If
> this method wants to make sure that the user did not misuse a.one
> as a multi-valued configuration variable, use of "--null --get-all"
> followed by checking how many items the command gives you back would
> be a way to do so.

No, the code in question was a simple sanity check (i.e., mostly a check
of my own sanity and understanding of "git config" behavior) preceding
the information-losing next line "return values[0]".  If it had been
meant as a check that the user hadn't misconfigured the system, then I
wouldn't have used assert but rather raised a ConfigurationException
with an explanatory message.

I would be happy to add the checking that you described, but I didn't
have the impression that it is the usual convention.  Does code that
wants a single value from the config usually verify that there is
one-and-only-one value, or does it typically just do the equivalent of
"git config --get" and use the returned (effectively the last) value?

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03  8:02   ` Michael Haggerty
@ 2013-07-03  8:16     ` Junio C Hamano
  2013-07-03  8:29     ` John Keeping
  1 sibling, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-07-03  8:16 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Ramkumar Ramachandra, Git List

Michael Haggerty <mhagger@alum.mit.edu> writes:

> I would be happy to add the checking that you described, but I didn't
> have the impression that it is the usual convention.  Does code that
> wants a single value from the config usually verify that there is
> one-and-only-one value, or does it typically just do the equivalent of
> "git config --get" and use the returned (effectively the last) value?

In most cases, variables are "one value per key" and follow "the
last one wins" rule, which is the reason why we read from the most
generic to the most specific (i.e. $GIT_DIR/config is read last).
For such uses, reading from "--get", and not from "--get-all", is
absolutely the right thing to do.

But then as Ram said, there probably is not a need for --null; you
can just read from textual "--get" to the end without any splitting
(using splitlines is of course wrong if you do so).

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03  8:02   ` Michael Haggerty
  2013-07-03  8:16     ` Junio C Hamano
@ 2013-07-03  8:29     ` John Keeping
  2013-07-03  8:33       ` John Keeping
  1 sibling, 1 reply; 15+ messages in thread
From: John Keeping @ 2013-07-03  8:29 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Wed, Jul 03, 2013 at 10:02:34AM +0200, Michael Haggerty wrote:
> On 07/03/2013 12:21 AM, Junio C Hamano wrote:
> > Ramkumar Ramachandra <artagnon@gmail.com> writes:
> > 
> >>>     def get(self, name, default=''):
> >>>         try:
> >>>             values = self._split(read_git_output(
> >>>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
> >>>                     env=self.env, keepends=True,
> >>>                     ))
> >>
> >> Wait, what is the point of using --null and then splitting by hand
> >> using a poorly-defined static method?  Why not drop the --null and
> >> splitlines() as usual?
> > 
> > You may actually have spotted a bug or misuse of "--get" here.
> > 
> > With this sample configuration:
> > 
> >         $ cat >sample <<\EOF
> >         [a]
> >                 one = value
> >                 one = another
> > 
> >         [b]
> >                 one = "value\nanother"
> >         EOF
> > 
> > A script cannot differentiate between them without using '--null'.
> > 
> > 	$ git config -f sample --get-all a.one
> >         $ git config -f sample --get-all b.one
> > 
> > But that matters only when you use "--get-all", not "--get".  If
> > this method wants to make sure that the user did not misuse a.one
> > as a multi-valued configuration variable, use of "--null --get-all"
> > followed by checking how many items the command gives you back would
> > be a way to do so.
> 
> No, the code in question was a simple sanity check (i.e., mostly a check
> of my own sanity and understanding of "git config" behavior) preceding
> the information-losing next line "return values[0]".  If it had been
> meant as a check that the user hadn't misconfigured the system, then I
> wouldn't have used assert but rather raised a ConfigurationException
> with an explanatory message.
> 
> I would be happy to add the checking that you described, but I didn't
> have the impression that it is the usual convention.  Does code that
> wants a single value from the config usually verify that there is
> one-and-only-one value, or does it typically just do the equivalent of
> "git config --get" and use the returned (effectively the last) value?

Doesn't "git config --get" return an error if there are multiple values?
The answer is apparently "no" - I wrote the text below from
git-config(1) and then checked the behaviour.  This seems to be a
regression in git-config (bisect running now).

I think the "correct" answer is what's below, but it doesn't work like
this in current Git:

    If you want a single value then I think it's normal to just read the
    output of "git config" and let it handle the error cases, without
    needing to split the result at all.

    I think there is a different issue in the "except" block following
    the code quoted at the top though - you will return "default" if a
    key happens to be multi-valued.  The script should check the return
    code and raise a ConfigurationException if it is 2.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03  8:29     ` John Keeping
@ 2013-07-03  8:33       ` John Keeping
  0 siblings, 0 replies; 15+ messages in thread
From: John Keeping @ 2013-07-03  8:33 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Ramkumar Ramachandra, Git List

On Wed, Jul 03, 2013 at 09:29:02AM +0100, John Keeping wrote:
> Doesn't "git config --get" return an error if there are multiple values?
> The answer is apparently "no" - I wrote the text below from
> git-config(1) and then checked the behaviour.  This seems to be a
> regression in git-config (bisect running now).

Ah, that was an intentional change in commit 00b347d (git-config: do not
complain about duplicate entries, 2012-10-23).  So the issue is that the
documentation was not updated when the behaviour was changed.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03  0:10 ` Michael Haggerty
@ 2013-07-03 10:23   ` Ramkumar Ramachandra
  2013-07-03 11:02     ` Ramkumar Ramachandra
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ramkumar Ramachandra @ 2013-07-03 10:23 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git List

Michael Haggerty wrote:
> On 07/02/2013 09:23 PM, Ramkumar Ramachandra wrote:
>> git_multimail.py wrote:
>>> #! /usr/bin/env python2
>>
>> Do all distributions ship it as python2 now?
>
> No, but nor is "python" always Python version 2.x (I believe that Arch
> Linux now installs Python 3 as "python").  This topic was discussed here
> [1].  Basically, my reasoning is that (a) PEP 394 [2] says that
> "python2" is the correct name for a Python 2.x interpreter, (b) I
> believe that other distros are moving in that direction, and (c) if the
> script says "python2" but no python2 is installed, the error is pretty
> obvious, whereas if the script says "python" and that is actually Python
> 3.x, the error would be more cryptic.

Yeah, this is good reasoning.  And yes, I'm on Arch: python points to
python3, and python2 points to python2.  A couple of thoughts while
we're on the subject:

1. We should probably convert git-remote-{hg,bzr} to use this style
too: they give cryptic errors now, and I have a ~/bin/python pointing
to python2 which is higher up in $PATH to work around it.  Debian uses
an alternatives mechanism to have multiple versions of the same
package, but I personally think the system is ugly.

2. Is there a way to determine the python version in-script and
error-out quickly?  Is it worth the ugliness?

>>>             'Command "%s" failed with retcode %s' % (' '.join(cmd), retcode,)
>>
>> So cmd is a list.
>
> Yes, commands are always lists in my code because it is less error-prone
> than trying to quote arguments correctly for the shell.  Do you think I
> should document that here, or elsewhere, or everywhere, or ...?

If you look at the prototype of execvpe(), the calling semantics are
immediately clear, but we don't have that luxury in Python: probably
rename the variable cmd_argv?

>>> class ConfigurationException(Exception):
>>>     pass
>>
>> Dead code?
>
> No, this defines an exception class that inherits all of its methods
> (including its constructors) from Exception.  This is useful because an
> exception of type ConfigurationException is distinguishable from other
> types of Exceptions, and can be caught using "except
> ConfigurationException, e".

Okay.  I was under the impression that you had some future extension plans.

>>>     def get(self, name, default=''):
>>>         try:
>>>             values = self._split(read_git_output(
>>>                     ['config', '--get', '--null', '%s.%s' % (self.section, name)],
>>>                     env=self.env, keepends=True,
>>>                     ))
>>
>> Wait, what is the point of using --null and then splitting by hand
>> using a poorly-defined static method?  Why not drop the --null and
>> splitlines() as usual?
>
> To avoid confusion if a single config value contains end-of-line
> characters.  In this case we are using --get, so only a single value is
> allowed anyway, and presumably we could take the output and strip a
> single last '\n' from it.  But null-terminated output is simply easier
> to handle in general and I don't see an advantage to avoiding its usage.

My rationale for avoiding it comes from Python's lack of inbuilt
functions to handle it; but yeah, Junio pointed out that quirk in the
previous email.

>>>                 return default
>>>             else:
>>>                 raise
>>
>> raise what?
>
> This is the Python construct to re-throw the exception that was caught
> in the catch block containing it; i.e., the CommandError from a few
> lines earlier.

Ah, thanks.

>>>     def get_recipients(self, name, default=None):
>>>         lines = self.get_all(name, default=None)
>>>         if lines is None:
>>>             return default
>>>         return ', '.join(line.strip() for line in lines)
>>
>> Ugh.
>
> ?

I would expect it to return a list that can be churned on further, not
a string that's ready for rendering.  Doesn't it resemble the
dictionary get(), and even your own get_all() in name and signature?

>> Dead code?
>
> git_multimail is used as a library by migrate-mailhook-config, and that
> script uses these methods.

I see.  Perhaps clean separation to avoid confusing readers?

>>> def generate_summaries(*log_args):
>>>     cmd = [
>>>         'log', '--abbrev', '--format=%h %s',
>>>         ] + list(log_args) + ['--']
>>
>> What is log_args if not a list?
>
> It is a tuple and therefore needs to be converted to a list here.

Oh, I always thought *log_args meant list; to my surprise, it's a tuple.

>> Just rev-parse --verify --short $SHA1^0: if it resolves, set
>> self.short; one liner?
>
> I don't follow.  We need both the long and the short SHA-1s to fill in
> the templates.  What code exactly do you propose to replace with your
> one-liner?

Oh, you need both.  I was hoping to cut the cat-file -t too, but you
need another call resolving $SHA1^{object} to distinguish between
commits and tags; so never mind.

>> What is the point of letting the user instantiate a GitObject without
>> a valid .sha1 in the first place?
>
> '0'*40 is passed to the post-receive script to indicate "no object"; for
> example, a branch deletion is represented as
>
> 0123456789abcdef0123456789abcdef01234567
> 0000000000000000000000000000000000000000 refs/heads/branch
>
> It is convenient to treat this as if it were a GitObject.
> GitObject.__nonzero__() (which is called if a GitObject is evaluated in
> a boolean context) returns False for these non-objects.

Fair enough, although my objection mainly has to do with not
separating logic and rendering cleanly.

>>>     def __str__(self):
>>>         return self.sha1 or ZEROS
>>
>> I wonder what value this adds when .short is around.
>
> The full object name is used in the X-Git-{Oldrev,Newrev} email headers
> and probably in some error messages and stuff.

If I manage to separate out the logic from the rendering, I'll send
PRs directly.

>>>     def _compute_values(self):
>>>         return self.environment.get_values()
>>
>> Why is this an "internal function"?  What is your criterion for
>> internal versus non-internal?
>
> This method is meant to be overridden by derived classes to affect the
> map returned by get_values().  But elsewhere get_values() should be
> called, not this method (because get_values() memoizes its return value).

Fair rationale.

>>>     def get_values(self, **extra_values):
>>>         if self._values is None:
>>>             self._values = self._compute_values()
>>>
>>>         values = self._values.copy()
>>>         if extra_values:
>>>             values.update(extra_values)
>>>         return values
>>
>> Unsure what this is about.
>
> The dictionary is mainly used to provide values that can be interpolated
> into the email templates.  It also has the advantage that it is only
> called once, and then its value is used multiple times, which limits the
> amount of boilerplate needed for derived classes to override the getter
> methods without forcing those methods to be called many times.

Makes sense.

>>>             try:
>>>                 value = value % values
>>>             except KeyError, e:
>>>                 if DEBUG:
>>>                     sys.stderr.write(
>>>                         'Warning: unknown variable %r in the following line; line skipped:\n'
>>>                         '    %s'
>>>                         % (e.args[0], line,)
>>>                         )
>>
>> If DEBUG isn't on, you risk leaving the value string interpolated
>> without even telling the user.  What does it mean to the end user?
>
> There are some "nice-to-have" values in the templates that are not
> necessary and might be missing if the user hasn't gone to the trouble to
> configure every last setting.  For example, if no emaildomain is defined
> then the pusher_email cannot be determined, resulting in the Reply-To
> header being omitted.
>
> My assumption is that a sysadmin would turn on DEBUG when testing the
> script, check that any missing headers are uninteresting, and then turn
> off DEBUG for production use so that users don't have to see the
> warnings every time they push.

Ah, so that is the intended usage.  If the impact of omitting certain
headers (due to lack of information) doesn't result in unusable emails
being generated, I think we're good.  Are you sure that doesn't
happen?

>>>         raise NotImplementedError()
>>
>> Unsure what these are about.
>
> These are basically just to allow code sharing across the various Change
> classes.

I'm not sure it's worth supporting Python < 2.6, especially if it
costs more to port it to Python 3+ (no abstract base classes; ugliness
like this).  Which brings me to: considering that the first commit is
in late 2012, why didn't you choose to code it in python3 directly?
Unless I'm mistaken, the only reason git-remote-{bzr,hg} aren't in
python3 is because the dependencies bzrlib and hglib are in python2.

>>>         body = self.generate_email_body(push)
>>>         if body_filter is not None:
>>
>> Redundant "is not None".
>
> This way of writing the test is robust against objects for which
> bool(body_filter) might return False.

If body_filter isn't a function, we have much larger problems, don't we? ;)

Nevertheless, I'm not going to argue with idioms (I've not written any
Python in years now, so I don't know them).

>>> class Revision(Change):
>>>     """A Change consisting of a single git commit."""
>>>
>>>     def __init__(self, reference_change, rev, num, tot):
>>>         Change.__init__(self, reference_change.environment)
>>
>> super?
>
> IMO, in Python 2.x, super() is really only useful in a class hierarchy
> where multiple inheritance is going to be supported, like in the
> Environment classes.  The problem is that even if you use super(), you
> have to type the name of the containing class explicitly; e.g.,
>
>     super(Revision, self).__init__(reference_change.environment)
>
> It is even longer than the explicit reference to the parent class, and
> though it doesn't break if another class is inserted into the
> inheritance chain, it *does* break if the class itself is renamed.  So I
> usually don't bother with super() unless I'm using multiple inheritance.
>
> In Python 3, where super() doesn't require redundant arguments, it is
> much less cumbersome to use.

I see.  Thanks for the explanation.

>> [...]
>>>         # First line of commit message:
>>>         try:
>>>             oneline = read_git_output(
>>>                 ['log', '--format=%s', '--no-walk', self.rev.sha1]
>>>                 )
>>>         except CommandError:
>>>             oneline = self.rev.sha1
>>
>> What does this mean?  When will you get a CommandError?
>
> I can't think of a plausible reason that this command would fail.
>
>>                                                          And how do
>> you respond to it?
>
> By using the commit's SHA-1 in place of its subject line.

What you have written translates to: "If there is a valid commit whose
subject cannot be determined (empty subject is still determinate), I
will use the commit's SHA-1 hex in its place", which implies that you
do not trust git to be sane ;)

Isn't the entire premise of your script that you have a sane git?

> Yes, I could do that (though it needs quotes around the key strings).
> Or the even more attractive
>
>     values.update(
>         rev=self.rev.sha1,
>         rev_short=self.rev.short,
>         ...
>         )
>
> I had the latter in an earlier version of the script, but I thought it
> might be too unfamiliar for non-Python-experts.  I guess I'm using
> pretty highfalutin Python anyway so this change wouldn't hurt.  What do
> you think?

I like pretty, maintainable, modern code with low redundancy.  As a
general principle, I always tilt towards educating users/ developers
about new (or "advanced") features, not abstaining from them because
they are too unfamiliar (or "complicated").

On this specifically, a beginner can look up help(values.update) to
understand what the code is doing.  So, yes: values.update() is
definitely better.

>>>         # The revision type tells us what type the commit is, combined with
>>>         # the location of the ref we can decide between
>>>         #  - working branch
>>>         #  - tracking branch
>>>         #  - unannotated tag
>>>         #  - annotated tag
>>
>> Could be simpler.
>
> If you mean the distinction between four types of ref is
> overcomplicated, this is something taken over from the old
> post-receive-email script.  If you just mean that the code could be
> simplified, then please make a suggestion.

I'll send a PR directly if I manage to simplify it.

>>>     def __init__(self, environment, refname, short_refname, old, new, rev):
>>>         Change.__init__(self, environment)
>>>         self.change_type = {
>>>             (False, True) : 'create',
>>>             (True, True) : 'update',
>>>             (True, False) : 'delete',
>>>             }[bool(old), bool(new)]
>>
>> As a general principle, avoid casting: if new is a dictionary, what
>> does bool(new) even mean?  You just have to trust types, and let go of
>> that much safety.
>
> old and new are not dictionaries, they are GitObject instances.  And
> this is not casting, it is calling old.__nonzero__() and
> new.__nonzero__() to see whether they are real objects vs. ZEROS and to
> canonicalize their values so that they can be used as indexes for the
> literal dictionary that decides what type of change is being described.

Ah, I missed that.

>>>     def describe_tag(self, push):
>>>         """Describe the new value of an annotated tag."""
>>>
>>>         # Use git for-each-ref to pull out the individual fields from
>>>         # the tag
>>>         [tagobject, tagtype, tagger, tagged] = read_git_lines(
>>>             ['for-each-ref', '--format=%s' % (self.ANNOTATED_TAG_FORMAT,), self.refname],
>>>             )
>>
>> You could've saved yourself a lot of trouble by running one f-e-r on
>> refs/tags and filtering that.  I don't know what you're gaining from
>> this overzealous object-orientation.
>
> It's only needed for the tags that have changed (which is probably zero
> in most cases).

Hm, we'll have to discuss performance in the "typical case" soon.

>>>         contents = list(read_git_lines(['cat-file', 'tag', self.new.sha1], keepends=True))
>>
>> You could've easily batched this.
>
> I don't understand what you mean.

Whenever I call cat-file from a script, I always find myself using the
--batch variant.

>> Way too many git invocations, I think.
>
> Luckily git is very fast :-)
>
> I'm not to worried about performance here.  The script will typically
> only be run on pushes, and most pushes affect only one or a few
> references.  I don't think these few git invocations will be prohibitive.

I personally push very often, so it's not a problem.  I'm thinking of
a mirroring batched push, where the maintainer pushes out history to a
"release" server every major release (every few months): is the script
intended to be used in such a scenario, when multiple refs and tags
are updated non-trivially?

>>>         ReferenceChange.__init__(
>>>             self, environment,
>>>             refname=refname, short_refname=refname,
>>>             old=old, new=new, rev=rev,
>>>             )
>>>         self.recipients = environment.get_refchange_recipients(self)
>>
>> What is the point of this?  Why not just use ReferenceChange directly?
>
> Maybe you missed "short_refname=refname" (one of the arguments is not
> being passed through 1:1).  The reason is explained in the comment.

Gah, my bad habit of not reading comments.

>> If you want to DWIM when the configuration variable is missing, do it
>> fully using a list of good candidates like /usr/lib/sendmail,
>> /usr/sbin/sendmail, /usr/ucblib/sendmail, /usr/bin/msmtp.
>
> OK, I just added /usr/sbin/sendmail and /usr/lib/sendmail, which are the
> paths checked by "git send-mail".

I'm on Arch, and sendmail is dead (only available in AUR now).  I
think we should support modern sendmail-compatible alternatives like
msmtp (which I have and use).

>>                                                            Also, what
>> happened to our faithful 'git send-email' Perl script?  Isn't that
>> most likely to be installed?
>
> We could use "git send-email" to generate and send the revision emails,
> but then we would lose most control over the contents of the emails.

I'm talking about the <file>... form.  Does it necessarily mangle the
headers of an mbox that is fed to it, or am I missing something?

> sys.stderr.write() can only take a single string argument.  You might
> have seen it called like this:
>
>     sys.stderr.write(
>         'foo\n'
>         'bar\n'
>         )
>
> This is using the Python compiler's feature that literal strings can be
> appended to each other by juxtaposition (notice there are no commas).
> But this only works for literal strings, not for string expressions.

Ouch, that's ugly.

>>> class Environment(object):
>>
>> New-style class.  I wonder why you suddenly switched.
>
> ?  All of the classes are new-style classes.

When you say class Foo:, aren't you declaring an old-style class by
default in python2?  New-style classes are those that explicitly
inherit from object (implicit in python3).

>> Overdoing the OO factories, much?
>
> I went to a lot of trouble to make the Environment mixin classes
> composable, because what I've learned from the feedback in the last
> months is that everybody wants to do something different with this
> script.  I tried out a few designs before I settled on this one.

I see.  I'm not a user, so I can't comment.

>>> class GenericEnvironment(
>>>     ProjectdescEnvironmentMixin,
>>>     ConfigMaxlinesEnvironmentMixin,
>>>     ConfigFilterLinesEnvironmentMixin,
>>>     ConfigRecipientsEnvironmentMixin,
>>>     PusherDomainEnvironmentMixin,
>>>     ConfigOptionsEnvironmentMixin,
>>>     GenericEnvironmentMixin,
>>>     Environment,
>>>     ):
>>>     pass
>>
>> Sigh.  I might as well be reading some Java now :/
>
> No, Java doesn't allow multiple inheritance :-)

Ha, yes: I meant in the way you factory'ified everything.  I don't
recall seeing this kind of code in a real-world application though: I
skimmed through Django's code many years ago, and haven't found such a
pattern.  I have seen instances where multiple inheritance may be
classified as borderline useful, but nothing as extensive as this.
Additionally, Ruby does not have multiple inheritance.  So, I tilt
towards the conclusion that multiple inheritance is Bad and Wrong.
Then again, I'm not an OO person in general, so I can't have a mature
opinion on the subject.  Can you present a case for this kind of
extensive usage quoting real-world examples?

> git_multimail.py is 2398 lines of Python script, including comments and
> string literals.  The fraction of that code that is dedicated to
> configurability is approximately 1000 lines.  Relative to
> post-receive-email, it adds
>
> * much more configurability, without the need to edit the script.

So this is the main selling point.

> * non-buggy determination of which commits have been added by a
> reference change, and distinction between commits that have been added
> to a branch vs. commits that have been added altogether and between
> commits that have been deleted from a branch vs. commits that have been
> deleted altogether.

Okay, so the post-receive-email script has corner-case bugs.

> [...]

The others seem to be small'ish features.

> Regarding efficiency, I don't think it is a problem.  But patches or
> concrete suggestions are certainly welcome.

Pre-optimization is the root of all evil :)  Can you give us some
numbers from real-world usecases, so we know whether or not it _needs_
to be optimized?  I ran your test script (test-email, I think) and it
generated 35 emails in ~10 seconds; but the repository was
super-trivial.

> Regarding separation of computation and rendering, yes, they could be
> separated better.  (BTW, it would make the script even longer.)  The
> rendering is already largely done via templates that can be changed from
> outside of the script.  But I might work on separating them more
> strictly so that some of the code could be reused, for example, to send
> notifications via IRC or XMPP.

My reasons for wanting to separate out the computation from rendering
are different: they centre around clarity, speed, ease of
maintainability and composability.  It's possible that I'm
irrationally biased towards such a separation because of my experience
with MVC-like frameworks.  Then again, I don't have code to show; talk
is cheap.

> Thanks for all of your comments!  I hope I have addressed most of them
> in this email and in the commits that I just pushed to GitHub.

Glad to be of help.  Good turnaround time.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03 10:23   ` Ramkumar Ramachandra
@ 2013-07-03 11:02     ` Ramkumar Ramachandra
  2013-07-03 11:41     ` Michael Haggerty
  2013-07-03 21:09     ` Jed Brown
  2 siblings, 0 replies; 15+ messages in thread
From: Ramkumar Ramachandra @ 2013-07-03 11:02 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git List

Ramkumar Ramachandra wrote:
>>> New-style class.  I wonder why you suddenly switched.
>>
>> ?  All of the classes are new-style classes.
>
> When you say class Foo:, aren't you declaring an old-style class by
> default in python2?  New-style classes are those that explicitly
> inherit from object (implicit in python3).

I just noticed that all your classes are new-style.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03 10:23   ` Ramkumar Ramachandra
  2013-07-03 11:02     ` Ramkumar Ramachandra
@ 2013-07-03 11:41     ` Michael Haggerty
  2013-07-03 21:09     ` Jed Brown
  2 siblings, 0 replies; 15+ messages in thread
From: Michael Haggerty @ 2013-07-03 11:41 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

On 07/03/2013 12:23 PM, Ramkumar Ramachandra wrote:
> Michael Haggerty wrote:
>> On 07/02/2013 09:23 PM, Ramkumar Ramachandra wrote:
>>> git_multimail.py wrote:
>>>> #! /usr/bin/env python2
>>>
>>> Do all distributions ship it as python2 now?
>>
>> No, but nor is "python" always Python version 2.x (I believe that Arch
>> Linux now installs Python 3 as "python").  This topic was discussed here
>> [1].  Basically, my reasoning is that (a) PEP 394 [2] says that
>> "python2" is the correct name for a Python 2.x interpreter, (b) I
>> believe that other distros are moving in that direction, and (c) if the
>> script says "python2" but no python2 is installed, the error is pretty
>> obvious, whereas if the script says "python" and that is actually Python
>> 3.x, the error would be more cryptic.
> 
> Yeah, this is good reasoning.  And yes, I'm on Arch: python points to
> python3, and python2 points to python2.  A couple of thoughts while
> we're on the subject:
> 
> 1. We should probably convert git-remote-{hg,bzr} to use this style
> too: they give cryptic errors now, and I have a ~/bin/python pointing
> to python2 which is higher up in $PATH to work around it.  Debian uses
> an alternatives mechanism to have multiple versions of the same
> package, but I personally think the system is ugly.
> 
> 2. Is there a way to determine the python version in-script and
> error-out quickly?  Is it worth the ugliness?

The problems is that the whole script is parsed before execution starts,
so using the wrong interpreter likely leads to a SyntaxError before the
script even gains control.

The correct solution, I'm afraid, is to have a build step that
determines the correct Python shebang contents at build times and
rewrites the script like the filename.perl -> filename transformations
that are already done for Perl.

>>>>             'Command "%s" failed with retcode %s' % (' '.join(cmd), retcode,)
>>>
>>> So cmd is a list.
>>
>> Yes, commands are always lists in my code because it is less error-prone
>> than trying to quote arguments correctly for the shell.  Do you think I
>> should document that here, or elsewhere, or everywhere, or ...?
> 
> If you look at the prototype of execvpe(), the calling semantics are
> immediately clear, but we don't have that luxury in Python: probably
> rename the variable cmd_argv?

Added to to-do list.

> [...]
>>>>     def get_recipients(self, name, default=None):
>>>>         lines = self.get_all(name, default=None)
>>>>         if lines is None:
>>>>             return default
>>>>         return ', '.join(line.strip() for line in lines)
>>>
>>> Ugh.
>>
>> ?
> 
> I would expect it to return a list that can be churned on further, not
> a string that's ready for rendering.  Doesn't it resemble the
> dictionary get(), and even your own get_all() in name and signature?

We support both single and multiple-valued options, and each value can
contain multiple comma-separated RFC 2822 email addresses:

    git config multimailhook.mailinglist "larry@example.com"
    git config --add multimailhook.mailinglist "moe@example.com,
curley@example.com"
    git config --add multimailhook.mailinglist '"Shemp, the other one"
<shemp@example.com>'

(I think that last one is valid.)

So we could turn the arguments into a list, but to be useful it would
require the individual values to be parsed into possibly multiple
addresses.  That seemed overkill, given that we only need the result as
a single string.

> [...]
>>> Dead code?
>>
>> git_multimail is used as a library by migrate-mailhook-config, and that
>> script uses these methods.
> 
> I see.  Perhaps clean separation to avoid confusing readers?

Think of git_multimail.py as a library that can be included, e.g. by a
post-receive script, but just happens to be executable as well.

Splitting it up more would prevent a one-file install, which I think
would be unfortunate.

> [...]
>>>>             try:
>>>>                 value = value % values
>>>>             except KeyError, e:
>>>>                 if DEBUG:
>>>>                     sys.stderr.write(
>>>>                         'Warning: unknown variable %r in the following line; line skipped:\n'
>>>>                         '    %s'
>>>>                         % (e.args[0], line,)
>>>>                         )
>>>
>>> If DEBUG isn't on, you risk leaving the value string interpolated
>>> without even telling the user.  What does it mean to the end user?
>>
>> There are some "nice-to-have" values in the templates that are not
>> necessary and might be missing if the user hasn't gone to the trouble to
>> configure every last setting.  For example, if no emaildomain is defined
>> then the pusher_email cannot be determined, resulting in the Reply-To
>> header being omitted.
>>
>> My assumption is that a sysadmin would turn on DEBUG when testing the
>> script, check that any missing headers are uninteresting, and then turn
>> off DEBUG for production use so that users don't have to see the
>> warnings every time they push.
> 
> Ah, so that is the intended usage.  If the impact of omitting certain
> headers (due to lack of information) doesn't result in unusable emails
> being generated, I think we're good.  Are you sure that doesn't
> happen?

I believe that the Environment classes themselves will scream if a
required value is missing, though of course all bets are off if the user
overrides the templates.

>>>>         raise NotImplementedError()
>>>
>>> Unsure what these are about.
>>
>> These are basically just to allow code sharing across the various Change
>> classes.
> 
> I'm not sure it's worth supporting Python < 2.6, especially if it
> costs more to port it to Python 3+ (no abstract base classes; ugliness
> like this).  Which brings me to: considering that the first commit is
> in late 2012, why didn't you choose to code it in python3 directly?
> Unless I'm mistaken, the only reason git-remote-{bzr,hg} aren't in
> python3 is because the dependencies bzrlib and hglib are in python2.

Yes, someday.  But I think there are still widely-used server
distributions without a solid Python 3.x, and this script will mostly be
used on servers.  Whereas I doubt that there will be any significant
distribution without a Python 2.x for the foreseeable future.

> [...]
>>>>         body = self.generate_email_body(push)
>>>>         if body_filter is not None:
>>>
>>> Redundant "is not None".
>>
>> This way of writing the test is robust against objects for which
>> bool(body_filter) might return False.
> 
> If body_filter isn't a function, we have much larger problems, don't we? ;)
> Nevertheless, I'm not going to argue with idioms (I've not written any
> Python in years now, so I don't know them).

body_filter could be an instance of a class with a __call__() method and
a __nonzero__() method.  Since it is user-provided we should be as
robust as possible.

>>> [...]
>>>>         # First line of commit message:
>>>>         try:
>>>>             oneline = read_git_output(
>>>>                 ['log', '--format=%s', '--no-walk', self.rev.sha1]
>>>>                 )
>>>>         except CommandError:
>>>>             oneline = self.rev.sha1
>>>
>>> What does this mean?  When will you get a CommandError?
>>
>> I can't think of a plausible reason that this command would fail.
>>
>>>                                                          And how do
>>> you respond to it?
>>
>> By using the commit's SHA-1 in place of its subject line.
> 
> What you have written translates to: "If there is a valid commit whose
> subject cannot be determined (empty subject is still determinate), I
> will use the commit's SHA-1 hex in its place", which implies that you
> do not trust git to be sane ;)
> 
> Isn't the entire premise of your script that you have a sane git?

It is rather my own sanity that I doubt more than git's.  But I will
remove this redundant error-handling.

>> Yes, I could do that (though it needs quotes around the key strings).
>> Or the even more attractive
>>
>>     values.update(
>>         rev=self.rev.sha1,
>>         rev_short=self.rev.short,
>>         ...
>>         )
>>
>> I had the latter in an earlier version of the script, but I thought it
>> might be too unfamiliar for non-Python-experts.  I guess I'm using
>> pretty highfalutin Python anyway so this change wouldn't hurt.  What do
>> you think?
> 
> I like pretty, maintainable, modern code with low redundancy.  As a
> general principle, I always tilt towards educating users/ developers
> about new (or "advanced") features, not abstaining from them because
> they are too unfamiliar (or "complicated").
> 
> On this specifically, a beginner can look up help(values.update) to
> understand what the code is doing.  So, yes: values.update() is
> definitely better.

Will change.

>>>>         contents = list(read_git_lines(['cat-file', 'tag', self.new.sha1], keepends=True))
>>>
>>> You could've easily batched this.
>>
>> I don't understand what you mean.
> 
> Whenever I call cat-file from a script, I always find myself using the
> --batch variant.

OK, now I understand what you mean.  Yes, if people complain about
performance, this is one of the possibilities that I have in mind.

>>> Way too many git invocations, I think.
>>
>> Luckily git is very fast :-)
>>
>> I'm not to worried about performance here.  The script will typically
>> only be run on pushes, and most pushes affect only one or a few
>> references.  I don't think these few git invocations will be prohibitive.
> 
> I personally push very often, so it's not a problem.  I'm thinking of
> a mirroring batched push, where the maintainer pushes out history to a
> "release" server every major release (every few months): is the script
> intended to be used in such a scenario, when multiple refs and tags
> are updated non-trivially?

The script certainly supports such scenarios (at least modulo bugs).
Performance in a script like this is something that I try not to put
extra work into unless there are obvious concerns or until somebody
complains.

>>> If you want to DWIM when the configuration variable is missing, do it
>>> fully using a list of good candidates like /usr/lib/sendmail,
>>> /usr/sbin/sendmail, /usr/ucblib/sendmail, /usr/bin/msmtp.
>>
>> OK, I just added /usr/sbin/sendmail and /usr/lib/sendmail, which are the
>> paths checked by "git send-mail".
> 
> I'm on Arch, and sendmail is dead (only available in AUR now).  I
> think we should support modern sendmail-compatible alternatives like
> msmtp (which I have and use).

It's configurable.  This is the kind of thing that I would expect the
sysadmin (or the packager) to be familiar with, and I'd rather leave it
to them than make lots of wild guesses about what might be installed.

AFAIK there is a rather well-established convention that *whatever* mail
program is installed, it should put some kind of sendmail-compatible
binary at /usr/sbin/sendmail.  Perhaps you should suggest this to the
msmtp packager?

>>>                                                            Also, what
>>> happened to our faithful 'git send-email' Perl script?  Isn't that
>>> most likely to be installed?
>>
>> We could use "git send-email" to generate and send the revision emails,
>> but then we would lose most control over the contents of the emails.
> 
> I'm talking about the <file>... form.  Does it necessarily mangle the
> headers of an mbox that is fed to it, or am I missing something?

I only glanced at the documentation but it didn't look like an obvious
improvement over invoking sendmail directly or using smtplib.  But it
would be easy to add another Mailer class that delegates to "git
send-email" if somebody is motivated to do so.

>>>> class Environment(object):
>>>
>>> New-style class.  I wonder why you suddenly switched.
>>
>> ?  All of the classes are new-style classes.
> 
> When you say class Foo:, aren't you declaring an old-style class by
> default in python2?  New-style classes are those that explicitly
> inherit from object (implicit in python3).

That is correct.  But I don't think I use "class Foo:" anywhere in
git-multimail; in other words, I think I only use new-style classes in
this project.

>>>> class GenericEnvironment(
>>>>     ProjectdescEnvironmentMixin,
>>>>     ConfigMaxlinesEnvironmentMixin,
>>>>     ConfigFilterLinesEnvironmentMixin,
>>>>     ConfigRecipientsEnvironmentMixin,
>>>>     PusherDomainEnvironmentMixin,
>>>>     ConfigOptionsEnvironmentMixin,
>>>>     GenericEnvironmentMixin,
>>>>     Environment,
>>>>     ):
>>>>     pass
>>>
>>> Sigh.  I might as well be reading some Java now :/
>>
>> No, Java doesn't allow multiple inheritance :-)
> 
> Ha, yes: I meant in the way you factory'ified everything.  I don't
> recall seeing this kind of code in a real-world application though: I
> skimmed through Django's code many years ago, and haven't found such a
> pattern.  I have seen instances where multiple inheritance may be
> classified as borderline useful, but nothing as extensive as this.
> Additionally, Ruby does not have multiple inheritance.  So, I tilt
> towards the conclusion that multiple inheritance is Bad and Wrong.
> Then again, I'm not an OO person in general, so I can't have a mature
> opinion on the subject.  Can you present a case for this kind of
> extensive usage quoting real-world examples?

Multiple inheritance has to be used with great care.  I've never used it
this extensively before, so it was kindof an experiment for me.  But I
am pleased with the result, especially that it makes it possible to
separate concerns quite well.

I believe that Zope uses mixins quite extensively (and did so back in
the very old days), though that is not necessarily a strong endorsement :-)

> [...]
>> git_multimail.py is 2398 lines of Python script, including comments and
>> string literals.  The fraction of that code that is dedicated to
>> configurability is approximately 1000 lines.  Relative to
>> post-receive-email, it adds
>>
>> * much more configurability, without the need to edit the script.
> 
> So this is the main selling point.

Yes, that plus the one-email-per-commit feature and (hopefully!) better
maintainability due to the use of a more powerful language are the main
selling points.

> [...]
>> Regarding efficiency, I don't think it is a problem.  But patches or
>> concrete suggestions are certainly welcome.
> 
> Pre-optimization is the root of all evil :)  Can you give us some
> numbers from real-world usecases, so we know whether or not it _needs_
> to be optimized?  I ran your test script (test-email, I think) and it
> generated 35 emails in ~10 seconds; but the repository was
> super-trivial.

We use the script at work for our main Git repo, which is quite large.
On the same repo we use a rather extensive pre-receive hook as well, so
it is not obvious what part of the push time comes from each script.
Together they create a noticeable delay if many commits are being pushed
at once, but (except when I once pushed thousands of commits at a time)
the total delay is max a few seconds.

Thanks again for your feedback!
Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03 10:23   ` Ramkumar Ramachandra
  2013-07-03 11:02     ` Ramkumar Ramachandra
  2013-07-03 11:41     ` Michael Haggerty
@ 2013-07-03 21:09     ` Jed Brown
  2013-07-04  8:11       ` Matthieu Moy
  2013-07-04  8:27       ` Michael Haggerty
  2 siblings, 2 replies; 15+ messages in thread
From: Jed Brown @ 2013-07-03 21:09 UTC (permalink / raw)
  To: Ramkumar Ramachandra, Michael Haggerty; +Cc: Git List

[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> Yeah, this is good reasoning.  And yes, I'm on Arch: python points to
> python3, and python2 points to python2.  

I'm also on Arch and it has been this way since October 2010 [1].
Ubuntu plans to remove python2 from the desktop CD images in 14.04 [2],
so having code that does not work with python3 will become more painful
pretty soon.

Note that RHEL5 has only python2.4 and will be supported through March,
2017.  Since it is not feasible to have code that works in both python3
and any versions prior to python2.6, any chosen dialect will be broken
by default on some major distributions that still have full vendor
support.

> A couple of thoughts while we're on the subject:
>
> 1. We should probably convert git-remote-{hg,bzr} to use this style
> too: 

Python-2.6.8 from python.org installs only python2.6 and python, but not
python2, so this will break on a lot of older systems.  Some
distributions have been nice enough to provide python2 symlinks anyway.

Michael's rationale that at least the error message is obvious still
stands.

> Debian uses an alternatives mechanism to have multiple versions of the
> same package

Alternatives is global configuration that doesn't really solve this
problem anyway.

[1] https://www.archlinux.org/news/python-is-now-python-3/
[2] https://wiki.ubuntu.com/Python/3

[-- Attachment #2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03 21:09     ` Jed Brown
@ 2013-07-04  8:11       ` Matthieu Moy
  2013-07-04  8:27       ` Michael Haggerty
  1 sibling, 0 replies; 15+ messages in thread
From: Matthieu Moy @ 2013-07-04  8:11 UTC (permalink / raw)
  To: Jed Brown; +Cc: Ramkumar Ramachandra, Michael Haggerty, Git List

Jed Brown <jed@59A2.org> writes:

> Note that RHEL5 has only python2.4 and will be supported through March,
> 2017.  Since it is not feasible to have code that works in both python3
> and any versions prior to python2.6, any chosen dialect will be broken
> by default on some major distributions that still have full vendor
> support.

At worse, if git-multimail is ported to Python 3, people using old
distros will still be able to use today's version which works in Python
2 and already does a good job.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Review of git multimail
  2013-07-03 21:09     ` Jed Brown
  2013-07-04  8:11       ` Matthieu Moy
@ 2013-07-04  8:27       ` Michael Haggerty
  1 sibling, 0 replies; 15+ messages in thread
From: Michael Haggerty @ 2013-07-04  8:27 UTC (permalink / raw)
  To: Jed Brown; +Cc: Ramkumar Ramachandra, Git List

Thanks for all of the information.

On 07/03/2013 11:09 PM, Jed Brown wrote:
> Ramkumar Ramachandra <artagnon@gmail.com> writes:
> 
>> Yeah, this is good reasoning.  And yes, I'm on Arch: python points to
>> python3, and python2 points to python2.  
> 
> I'm also on Arch and it has been this way since October 2010 [1].
> Ubuntu plans to remove python2 from the desktop CD images in 14.04 [2],
> so having code that does not work with python3 will become more painful
> pretty soon.

It may not be on the CD image, but python2 will undoubtedly continue to
be supported in the Ubuntu repositories; i.e., it is just an "apt-get
install" away.  (For that matter, I don't think Git itself is on the
Ubuntu CD image.)

> Note that RHEL5 has only python2.4 and will be supported through March,
> 2017.  Since it is not feasible to have code that works in both python3
> and any versions prior to python2.6, any chosen dialect will be broken
> by default on some major distributions that still have full vendor
> support.

I think for a server-oriented program like git-multimail it is more
important to support old versions of Python than to support the
bleeding-edge versions.  For user-oriented programs a different
conclusion might be reached.

My vague, long-term plan is roughly:

* Continue to support Python 2.4 or at least 2.5 for the next year or
two.  This excludes any reasonable hope of simultaneously being Python
3.x compatible, so don't worry about 3.x for now (though
backwards-compatible and non-hideous changes that move in the direction
of Python 3.x compatibility are of course welcome).

* At some point, abandon support for the older Python 2.x releases and
start using 3.x-compatibility features that were added in Python 2.6 and
2.7.

* Make string handling Unicode-correct.

* Then evaluate the situation and decide between two courses of action:

  * Evolve the script to work with both Python 2.7 (and maybe 2.6) and
Python 3.3 (and maybe 3.2) simultaneously.

  * Fork development of a Python 3.x version while retaining a 2.x
version in maintenance mode.

>> A couple of thoughts while we're on the subject:
>>
>> 1. We should probably convert git-remote-{hg,bzr} to use this style
>> too: 
> 
> Python-2.6.8 from python.org installs only python2.6 and python, but not
> python2, so this will break on a lot of older systems.  Some
> distributions have been nice enough to provide python2 symlinks anyway.
> 
> Michael's rationale that at least the error message is obvious still
> stands.

The approach I've taken in git-multimail isn't necessarily applicable to
git-remote-*.  The main difference is that git-multimail *has to* be
installed by a repository administrator to have an effect (either by
being copied or linked to $GIT_DIR/hooks or by adding a script there
that refers to git_multimail.py).  So the admin, at that time, can
decide what python is best to use on his system and adjust the shebang
line or create a "python2" symlink or whatever.

The git-remote-* scripts are meant for users, who simply want to execute
them without thinking.  So in that case, the scripts should be installed
to the default $PATH with the correct shebang line already in place for
the local environment.  Thus their shebang lines will tend to be decided
by packagers, not end users, and this difference changes the situation.
 I think they should be managed via build step that rewrites the scripts
to use a build-time-configured Python interpreter.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-07-04  8:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-02 19:23 Review of git multimail Ramkumar Ramachandra
2013-07-02 20:51 ` John Keeping
2013-07-02 21:34   ` Ramkumar Ramachandra
2013-07-02 22:21 ` Junio C Hamano
2013-07-03  8:02   ` Michael Haggerty
2013-07-03  8:16     ` Junio C Hamano
2013-07-03  8:29     ` John Keeping
2013-07-03  8:33       ` John Keeping
2013-07-03  0:10 ` Michael Haggerty
2013-07-03 10:23   ` Ramkumar Ramachandra
2013-07-03 11:02     ` Ramkumar Ramachandra
2013-07-03 11:41     ` Michael Haggerty
2013-07-03 21:09     ` Jed Brown
2013-07-04  8:11       ` Matthieu Moy
2013-07-04  8:27       ` Michael Haggerty

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).