git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Nieder <jrnieder@gmail.com>
To: Andrew Donnellan <ajd@linux.ibm.com>
Cc: patchwork@lists.ozlabs.org,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Alexandre Belloni <alexandre.belloni@bootlin.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] parser: Unmangle From: headers that have been mangled for DMARC purposes
Date: Thu, 10 Oct 2019 12:41:32 -0700	[thread overview]
Message-ID: <20191010194132.GA191800@google.com> (raw)
In-Reply-To: <20191010062047.21549-1-ajd@linux.ibm.com>

Hi,

Andrew Donnellan wrote:

> To avoid triggering spam filters due to failed signature validation, many
> mailing lists mangle the From header to change the From address to be the
> address of the list, typically where the sender's domain has a strict DMARC
> policy enabled.
>
> In this case, we should try to unmangle the From header.
>
> Add support for using the X-Original-Sender or Reply-To headers, as used by
> Google Groups and Mailman respectively, to unmangle the From header when
> necessary.
>
> Closes: #64 ("Incorrect submitter when using googlegroups")
> Reported-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
> ---
>  patchwork/parser.py            | 75 ++++++++++++++++++++++++++++++----
>  patchwork/tests/test_parser.py | 68 ++++++++++++++++++++++++++++--
>  2 files changed, 130 insertions(+), 13 deletions(-)

Interesting!  I'm cc-ing the Git mailing list in case "git am" might
wnat to learn the same support.

Thanks,
Jonathan

(patch left unsnipped for reference)
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 7dc66bc05a5b..79beac5617b1 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -321,12 +321,7 @@ def find_series(project, mail, author):
>      return _find_series_by_markers(project, mail, author)
>  
>  
> -def get_or_create_author(mail):
> -    from_header = clean_header(mail.get('From'))
> -
> -    if not from_header:
> -        raise ValueError("Invalid 'From' header")
> -
> +def split_from_header(from_header):
>      name, email = (None, None)
>  
>      # tuple of (regex, fn)
> @@ -355,6 +350,65 @@ def get_or_create_author(mail):
>              (name, email) = fn(match.groups())
>              break
>  
> +    return (name, email)
> +
> +
> +# Unmangle From addresses that have been mangled for DMARC purposes.
> +#
> +# To avoid triggering spam filters due to failed signature validation, many
> +# mailing lists mangle the From header to change the From address to be the
> +# address of the list, typically where the sender's domain has a strict
> +# DMARC policy enabled.
> +#
> +# Unfortunately, there's no standardised way of preserving the original
> +# From address.
> +#
> +# Google Groups adds an X-Original-Sender header. If present, we use that.
> +#
> +# Mailman preserves the original address by adding a Reply-To, except in the
> +# case where the list is set to either reply to list, or reply to a specific
> +# address, in which case the original From is added to Cc instead. These corner
> +# cases are dumb, but we try and handle things as sensibly as possible by
> +# looking for a name in Reply-To/Cc that matches From. It's inexact but should
> +# be good enough for our purposes.
> +def get_original_sender(mail, name, email):
> +    if name and ' via ' in name:
> +        # Mailman uses the format "<name> via <list>"
> +        # Google Groups uses "'<name>' via <list>"
> +        stripped_name = name[:name.rfind(' via ')].strip().strip("'")
> +
> +    original_sender = clean_header(mail.get('X-Original-Sender', ''))
> +    if original_sender:
> +        new_email = split_from_header(original_sender)[1].strip()[:255]
> +        return (stripped_name, new_email)
> +
> +    addrs = []
> +    reply_to_headers = mail.get_all('Reply-To') or []
> +    cc_headers = mail.get_all('Cc') or []
> +    for header in reply_to_headers + cc_headers:
> +        header = clean_header(header)
> +        addrs = header.split(",")
> +        for addr in addrs:
> +            new_name, new_email = split_from_header(addr)
> +            if new_name:
> +                new_name = new_name.strip()[:255]
> +            if new_email:
> +                new_email = new_email.strip()[:255]
> +            if new_name == stripped_name:
> +                return (stripped_name, new_email)
> +
> +    # If we can't figure out the original sender, just keep it as is
> +    return (name, email)
> +
> +
> +def get_or_create_author(mail, project=None):
> +    from_header = clean_header(mail.get('From'))
> +
> +    if not from_header:
> +        raise ValueError("Invalid 'From' header")
> +
> +    name, email = split_from_header(from_header)
> +
>      if not email:
>          raise ValueError("Invalid 'From' header")
>  
> @@ -362,6 +416,9 @@ def get_or_create_author(mail):
>      if name is not None:
>          name = name.strip()[:255]
>  
> +    if project and email.lower() == project.listemail.lower():
> +        name, email = get_original_sender(mail, name, email)
> +
>      # this correctly handles the case where we lose the race to create
>      # the person and another process beats us to it. (If the record
>      # does not exist, g_o_c invokes _create_object_from_params which
> @@ -1004,7 +1061,7 @@ def parse_mail(mail, list_id=None):
>  
>      if not is_comment and (diff or pull_url):  # patches or pull requests
>          # we delay the saving until we know we have a patch.
> -        author = get_or_create_author(mail)
> +        author = get_or_create_author(mail, project)
>  
>          delegate = find_delegate_by_header(mail)
>          if not delegate and diff:
> @@ -1099,7 +1156,7 @@ def parse_mail(mail, list_id=None):
>                  is_cover_letter = True
>  
>          if is_cover_letter:
> -            author = get_or_create_author(mail)
> +            author = get_or_create_author(mail, project)
>  
>              # we don't use 'find_series' here as a cover letter will
>              # always be the first item in a thread, thus the references
> @@ -1159,7 +1216,7 @@ def parse_mail(mail, list_id=None):
>      if not submission:
>          return
>  
> -    author = get_or_create_author(mail)
> +    author = get_or_create_author(mail, project)
>  
>      try:
>          comment = Comment.objects.create(
> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> index e5a2fca3044e..85c6e7550f93 100644
> --- a/patchwork/tests/test_parser.py
> +++ b/patchwork/tests/test_parser.py
> @@ -265,11 +265,23 @@ class SenderCorrelationTest(TestCase):
>      """
>  
>      @staticmethod
> -    def _create_email(from_header):
> +    def _create_email(from_header, reply_tos=None, ccs=None,
> +                      x_original_sender=None):
>          mail = 'Message-Id: %s\n' % make_msgid() + \
> -               'From: %s\n' % from_header + \
> -               'Subject: Tests\n\n'\
> -               'test\n'
> +               'From: %s\n' % from_header
> +
> +        if reply_tos:
> +            mail += 'Reply-To: %s\n' % ', '.join(reply_tos)
> +
> +        if ccs:
> +            mail += 'Cc: %s\n' % ', '.join(ccs)
> +
> +        if x_original_sender:
> +            mail += 'X-Original-Sender: %s\n' % x_original_sender
> +
> +        mail += 'Subject: Tests\n\n'\
> +            'test\n'
> +
>          return message_from_string(mail)
>  
>      def test_existing_sender(self):
> @@ -311,6 +323,54 @@ class SenderCorrelationTest(TestCase):
>          self.assertEqual(person_b._state.adding, False)
>          self.assertEqual(person_b.id, person_a.id)
>  
> +    def test_mailman_dmarc_munging(self):
> +        project = create_project()
> +        real_sender = 'Existing Sender <existing@example.com>'
> +        munged_sender = 'Existing Sender via List <{}>'.format(
> +            project.listemail)
> +        other_email = 'Other Person <other@example.com>'
> +
> +        # Unmunged author
> +        mail = self._create_email(real_sender)
> +        person_a = get_or_create_author(mail, project)
> +        person_a.save()
> +
> +        # Single Reply-To
> +        mail = self._create_email(munged_sender, [real_sender])
> +        person_b = get_or_create_author(mail, project)
> +        self.assertEqual(person_b._state.adding, False)
> +        self.assertEqual(person_b.id, person_a.id)
> +
> +        # Single Cc
> +        mail = self._create_email(munged_sender, [], [real_sender])
> +        person_b = get_or_create_author(mail, project)
> +        self.assertEqual(person_b._state.adding, False)
> +        self.assertEqual(person_b.id, person_a.id)
> +
> +        # Multiple Reply-Tos and Ccs
> +        mail = self._create_email(munged_sender, [other_email, real_sender],
> +                                  [other_email, other_email])
> +        person_b = get_or_create_author(mail, project)
> +        self.assertEqual(person_b._state.adding, False)
> +        self.assertEqual(person_b.id, person_a.id)
> +
> +    def test_google_dmarc_munging(self):
> +        project = create_project()
> +        real_sender = 'Existing Sender <existing@example.com>'
> +        munged_sender = "'Existing Sender' via List <{}>".format(
> +            project.listemail)
> +
> +        # Unmunged author
> +        mail = self._create_email(real_sender)
> +        person_a = get_or_create_author(mail, project)
> +        person_a.save()
> +
> +        # X-Original-Sender header
> +        mail = self._create_email(munged_sender, None, None, real_sender)
> +        person_b = get_or_create_author(mail, project)
> +        self.assertEqual(person_b._state.adding, False)
> +        self.assertEqual(person_b.id, person_a.id)
> +
>  
>  class SeriesCorrelationTest(TestCase):
>      """Validate correct behavior of find_series."""

       reply	other threads:[~2019-10-10 19:41 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20191010062047.21549-1-ajd@linux.ibm.com>
2019-10-10 19:41 ` Jonathan Nieder [this message]
2019-10-10 21:13   ` [PATCH] parser: Unmangle From: headers that have been mangled for DMARC purposes Andrew Donnellan
2019-10-10 23:16     ` Daniel Axtens
2019-10-10 23:40       ` Stephen Rothwell
2019-10-10 22:54   ` Jeff King
2019-10-10 23:01     ` Andrew Donnellan
2019-10-10 23:06       ` Jeff King
2019-10-11 15:42       ` Daniel Axtens
2019-10-11 15:51         ` Jeff King
2019-10-13 11:05           ` Andrew Donnellan
2019-10-11  4:29     ` Junio C Hamano
2019-10-11  4:36       ` Andrew Donnellan
2019-10-11  4:50         ` Andrew Donnellan
2019-10-11 13:13           ` Christian Schoenebeck
2019-10-11 17:36             ` Ian Kelling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191010194132.GA191800@google.com \
    --to=jrnieder@gmail.com \
    --cc=ajd@linux.ibm.com \
    --cc=alexandre.belloni@bootlin.com \
    --cc=git@vger.kernel.org \
    --cc=patchwork@lists.ozlabs.org \
    --cc=sfr@canb.auug.org.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).