git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] gitweb: redacted e-mail addresses feature.
@ 2021-03-20 23:42 Georgios Kontaxis via GitGitGadget
  2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-20 23:42 UTC (permalink / raw)
  To: git; +Cc: Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers. This may result in unsolicited messages.
This is a feature for redacting e-mail addresses from the generated
HTML content.

This feature does not prevent someone from downloading the
unredacted commit log and extracting information from it.
It aims to hinder the low-effort bulk collection of e-mail
addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: Redacted e-mail addresses feature.
    
    Gitweb extracts content from the Git log and makes it accessible over
    HTTP. As a result, e-mail addresses found in commits are exposed to web
    crawlers. This may result in unsolicited messages. This is a feature for
    redacting e-mail addresses from the generated HTML content.
    
    This feature does not prevent someone from downloading the unredacted
    commit log and extracting information from it. It aims to hinder the
    low-effort bulk collection of e-mail addresses by web crawlers.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/910

 Documentation/gitweb.conf.txt | 12 ++++++++++++
 gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..10653d8670a8 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -896,6 +896,18 @@ same as of the snippet above:
 It is an error to specify a ref that does not pass "git check-ref-format"
 scrutiny. Duplicated values are filtered.
 
+email_privacy::
+    Redact e-mail addresses from the generated HTML, etc. content.
+    This hides e-mail addresses found in the commit log from web crawlers.
+    Enabled by default.
++
+It is highly recommended to keep this feature enabled unless web crawlers
+are hindered in some other way. You can disable this feature as shown below:
++
+---------------------------------------------------------------------------
+$feature{'email_privacy'}{'default'} = [0];
+---------------------------------------------------------------------------
+
 
 EXAMPLES
 --------
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..9d21c2583e18 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -569,6 +569,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+    # Redact e-mail addresses.
+
+    # To disable system wide have in $GITWEB_CONFIG
+    # $feature{'email_privacy'}{'default'} = [0];
+	'email_privacy' => {
+		'sub' => sub { feature_bool('email_privacy', @_) },
+		'override' => 0,
+		'default' => [1]},
 );
 
 sub gitweb_get_feature {
@@ -3471,6 +3480,10 @@ sub parse_tag {
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$tag{'author_name'}  = $1;
 				$tag{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$tag{'author_email'} = "private";
+					$tag{'author'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$tag{'author_name'} = $tag{'author'};
 			}
@@ -3519,6 +3532,10 @@ sub parse_commit_text {
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'author_name'}  = $1;
 				$co{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'author_email'} = "private";
+					$co{'author'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$co{'author_name'} = $co{'author'};
 			}
@@ -3529,6 +3546,10 @@ sub parse_commit_text {
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'committer_name'}  = $1;
 				$co{'committer_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'committer_email'} = "private";
+					$co{'committer'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$co{'committer_name'} = $co{'committer'};
 			}
@@ -3568,9 +3589,13 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		if (gitweb_check_feature('email_privacy') &&
+			$line =~ m/^([^<]+) <([^>]*)>/) {
+			$line =~ s/<([^>]+)>/<private>/;
+		}
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -8060,8 +8085,13 @@ sub git_commitdiff {
 		close $fd
 			or print "Reading git-diff-tree failed\n";
 	} elsif ($format eq 'patch') {
-		local $/ = undef;
-		print <$fd>;
+		while (my $line = <$fd>) {
+			if (gitweb_check_feature('email_privacy') &&
+				$line =~ m/^([^<]+) <([^>]*)>/) {
+				$line =~ s/<([^>]+)>/<private>/;
+			}
+			print $line;
+		}
 		close $fd
 			or print "Reading git-format-patch failed\n";
 	}

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
@ 2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
  2021-03-21  1:27   ` brian m. carlson
  2021-03-21  3:30   ` Georgios Kontaxis
  2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
  2021-03-21  6:00 ` [PATCH] " Junio C Hamano
  2 siblings, 2 replies; 41+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-21  0:42 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget; +Cc: git, Georgios Kontaxis


On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers. This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses from the generated
> HTML content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log and extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

So web crawlers that aren't going to obey robots.txt?

I'm not opposed to this feature, but a glance at gitweb's documentation
seems to show that we don't discuss how to set robots.txt up for it at
all.

Perhaps having that in the docs or otherwise in the default setup would
get us most of the win of this feature?

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---

Odd:

>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

To have this duplication of the patch here below "---", some GGG feature
gone awry?

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..10653d8670a8 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,18 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Enabled by default.
> ++
> +It is highly recommended to keep this feature enabled unless web crawlers
> +are hindered in some other way. You can disable this feature as shown below:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [0];
> +---------------------------------------------------------------------------

I think there's plenty of gitweb users that are going to be relying on
the current behavior, so doesn't it make more sense for this to be
opt-in rather than opt-out?

>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..9d21c2583e18 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +    # Redact e-mail addresses.
> +
> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );
> [...]
>  sub gitweb_get_feature {
> @@ -3471,6 +3480,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		if (gitweb_check_feature('email_privacy') &&
> +			$line =~ m/^([^<]+) <([^>]*)>/) {
> +			$line =~ s/<([^>]+)>/<private>/;
> +		}
>  	}
>  	$co{'comment'} = \@commit_lines;

All of these hunks (and the below) should use some new function that
does this feature check + sanitizing instead of copy/pasting mostly the
same code N times. e.g.:
    
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);

Also it looks like this isn't a new issue, but does this need to
implement its own E-Mail parser? We ship with Mail::Address for
git-send-email, can gitweb (and the elided hide_email() function above)
use that too?


> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			if (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/) {
> +				$line =~ s/<([^>]+)>/<private>/;
> +			}
> +			print $line;
> +		}
>  		close $fd
>  			or print "Reading git-format-patch failed\n";

Is that "patch" output meant for "git am"? Won't this severely break
that use-case if so?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
@ 2021-03-21  1:27   ` brian m. carlson
  2021-03-21  3:30   ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: brian m. carlson @ 2021-03-21  1:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Georgios Kontaxis via GitGitGadget, git, Georgios Kontaxis

[-- Attachment #1: Type: text/plain, Size: 3268 bytes --]

On 2021-03-21 at 00:42:58, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
> 
> > From: Georgios Kontaxis <geko1702+commits@99rst.org>
> >
> > Gitweb extracts content from the Git log and makes it accessible
> > over HTTP. As a result, e-mail addresses found in commits are
> > exposed to web crawlers. This may result in unsolicited messages.
> > This is a feature for redacting e-mail addresses from the generated
> > HTML content.
> >
> > This feature does not prevent someone from downloading the
> > unredacted commit log and extracting information from it.
> > It aims to hinder the low-effort bulk collection of e-mail
> > addresses by web crawlers.
> 
> So web crawlers that aren't going to obey robots.txt?
> 
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
> 
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?

I'm going to guess that the two features are orthogonal.  robots.txt is
great for communicating to well-meaning actors what you do and don't
want crawled.  For example, one might ask a web crawler not to crawl
individual commits because that creates excessive load on the server.

This option is about preventing email harvesting, usually for the
purposes of sending spam.  Spam is email abuse and all reasonable people
know it's unacceptable, so by definition the people doing this are bad
actors and are not likely to honor the robots.txt.  As someone who runs
his own mail server, that is certainly my experience.

So I am in favor of this feature.  I think it mirrors what many other
tools do in this space and having it as an option is valuable.

> > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> > index 7963a79ba98b..10653d8670a8 100644
> > --- a/Documentation/gitweb.conf.txt
> > +++ b/Documentation/gitweb.conf.txt
> > @@ -896,6 +896,18 @@ same as of the snippet above:
> >  It is an error to specify a ref that does not pass "git check-ref-format"
> >  scrutiny. Duplicated values are filtered.
> >  
> > +email_privacy::
> > +    Redact e-mail addresses from the generated HTML, etc. content.
> > +    This hides e-mail addresses found in the commit log from web crawlers.
> > +    Enabled by default.
> > ++
> > +It is highly recommended to keep this feature enabled unless web crawlers
> > +are hindered in some other way. You can disable this feature as shown below:
> > ++
> > +---------------------------------------------------------------------------
> > +$feature{'email_privacy'}{'default'} = [0];
> > +---------------------------------------------------------------------------
> 
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?

I agree this make sense as an opt-in feature.  While many people will
want to enable it, users who are performing an upgrade won't necessarily
want the behavior to change right away.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
  2021-03-21  1:27   ` brian m. carlson
@ 2021-03-21  3:30   ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-21  3:30 UTC (permalink / raw)
  To: "Ævar Arnfjörð Bjarmason"
  Cc: Georgios Kontaxis via GitGitGadget, git

>
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers. This may result in unsolicited messages.
>> This is a feature for redacting e-mail addresses from the generated
>> HTML content.
>>
>> This feature does not prevent someone from downloading the
>> unredacted commit log and extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> So web crawlers that aren't going to obey robots.txt?
>
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
>
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?
>
File robots.txt is basically asking nicely and we should work on that.
At the same time crawlers that look for addresses to send SPAM to
will probably ignore it so this change is meant for them.

>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>
> Odd:
>
>>     gitweb: Redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers. This may result in unsolicited messages. This is a feature
>> for
>>     redacting e-mail addresses from the generated HTML content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log and extracting information from it. It aims to hinder the
>>     low-effort bulk collection of e-mail addresses by web crawlers.
>>
>>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> To have this duplication of the patch here below "---", some GGG feature
> gone awry?
>
>> Published-As:
>> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
>> pr-910/kontaxis/kontaxis/email_privacy-v1
>> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>>
>>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gitweb.conf.txt
>> b/Documentation/gitweb.conf.txt
>> index 7963a79ba98b..10653d8670a8 100644
>> --- a/Documentation/gitweb.conf.txt
>> +++ b/Documentation/gitweb.conf.txt
>> @@ -896,6 +896,18 @@ same as of the snippet above:
>>  It is an error to specify a ref that does not pass "git
>> check-ref-format"
>>  scrutiny. Duplicated values are filtered.
>>
>> +email_privacy::
>> +    Redact e-mail addresses from the generated HTML, etc. content.
>> +    This hides e-mail addresses found in the commit log from web
>> crawlers.
>> +    Enabled by default.
>> ++
>> +It is highly recommended to keep this feature enabled unless web
>> crawlers
>> +are hindered in some other way. You can disable this feature as shown
>> below:
>> ++
>> +---------------------------------------------------------------------------
>> +$feature{'email_privacy'}{'default'} = [0];
>> +---------------------------------------------------------------------------
>
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?
>
My concern is that Gitweb operators may not understand the need
for this feature or maybe won't be aware the feature exists.

Nevertheless, I've changed the feature to be off by default.
Perhaps we can revisit this decision in the future? :)

>>
>>  EXAMPLES
>>  --------
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..9d21c2583e18 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -569,6 +569,15 @@ sub evaluate_uri {
>>  		'sub' => \&feature_extra_branch_refs,
>>  		'override' => 0,
>>  		'default' => []},
>> +
>> +    # Redact e-mail addresses.
>> +
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>> [...]
>>  sub gitweb_get_feature {
>> @@ -3471,6 +3480,10 @@ sub parse_tag {
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$tag{'author_name'}  = $1;
>>  				$tag{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$tag{'author_name'} = $tag{'author'};
>>  			}
>> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'author_name'}  = $1;
>>  				$co{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'author_email'} = "private";
>> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'author_name'} = $co{'author'};
>>  			}
>> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'committer_name'}  = $1;
>>  				$co{'committer_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'committer_email'} = "private";
>> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'committer_name'} = $co{'committer'};
>>  			}
>> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>>  	}
>> -	# remove added spaces
>> +	# remove added spaces, redact e-mail addresses if applicable.
>>  	foreach my $line (@commit_lines) {
>>  		$line =~ s/^    //;
>> +		if (gitweb_check_feature('email_privacy') &&
>> +			$line =~ m/^([^<]+) <([^>]*)>/) {
>> +			$line =~ s/<([^>]+)>/<private>/;
>> +		}
>>  	}
>>  	$co{'comment'} = \@commit_lines;
>
> All of these hunks (and the below) should use some new function that
> does this feature check + sanitizing instead of copy/pasting mostly the
> same code N times. e.g.:
>
>     sub maybe_hide_email {
>         my $email = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         return hide_email($email);
>     }
>
> then:
>
>     $tag{author_email} = maybe_hide_email($2);
>
> Also it looks like this isn't a new issue, but does this need to
> implement its own E-Mail parser? We ship with Mail::Address for
> git-send-email, can gitweb (and the elided hide_email() function above)
> use that too?
>
Thanks for the suggestion.
I've removed the duplicate code.

The places where there are e-mail addresses today are pretty specific.
(Pretty much the author, committer fields and the Signed-off-by lines)

In theory someone could write a comment with a bunch of addresses in it
but that would be unstructured text and I think Mail::Address is not good
with that.

I can definitely keep working on this topic but maybe in subsequent PRs?
Assuming we're not exposing any addresses right now or redacting things
we shouldn't.

>
>> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			if (gitweb_check_feature('email_privacy') &&
>> +				$line =~ m/^([^<]+) <([^>]*)>/) {
>> +				$line =~ s/<([^>]+)>/<private>/;
>> +			}
>> +			print $line;
>> +		}
>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>
> Is that "patch" output meant for "git am"? Won't this severely break
> that use-case if so?
>
Not sure who may be consuming that output.
I've added a note in the documentation for gitweb.conf.

If a web crawler can get to the information by following URLs
then I think we should redact it.

Hopefully by documenting this as a potential issue Gitweb operators
can create a workaround specific to their use case.
Possibly implementing access control and leaving this feature off.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2] gitweb: redacted e-mail addresses feature.
  2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
  2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
@ 2021-03-21  3:32 ` Georgios Kontaxis via GitGitGadget
  2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
  2021-03-21  6:00 ` [PATCH] " Junio C Hamano
  2 siblings, 1 reply; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-21  3:32 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This may result in unsolicited messages.
This is a feature for redacting e-mail addresses
from the generated HTML, etc. content.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it.
It aims to hinder the low-effort bulk collection of e-mail
addresses by web crawlers.

Changes since v1:
- Turned off the feature by default.
- Removed duplicate code.
- Added note about Gitweb consumers receiving redacted logs.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: Redacted e-mail addresses feature.
    
    Gitweb extracts content from the Git log and makes it accessible over
    HTTP. As a result, e-mail addresses found in commits are exposed to web
    crawlers. This may result in unsolicited messages. This is a feature for
    redacting e-mail addresses from the generated HTML content.
    
    This feature does not prevent someone from downloading the unredacted
    commit log and extracting information from it. It aims to hinder the
    low-effort bulk collection of e-mail addresses by web crawlers.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v1:

 1:  6fe6ebdb8e59 ! 1:  74af11ca8bf2 gitweb: redacted e-mail addresses feature.
     @@ Commit message
      
          Gitweb extracts content from the Git log and makes it accessible
          over HTTP. As a result, e-mail addresses found in commits are
     -    exposed to web crawlers. This may result in unsolicited messages.
     -    This is a feature for redacting e-mail addresses from the generated
     -    HTML content.
     +    exposed to web crawlers and they may not respect robots.txt.
     +    This may result in unsolicited messages.
     +    This is a feature for redacting e-mail addresses
     +    from the generated HTML, etc. content.
      
          This feature does not prevent someone from downloading the
     -    unredacted commit log and extracting information from it.
     +    unredacted commit log, e.g., by cloning the repository, and
     +    extracting information from it.
          It aims to hinder the low-effort bulk collection of e-mail
          addresses by web crawlers.
      
     +    Changes since v1:
     +    - Turned off the feature by default.
     +    - Removed duplicate code.
     +    - Added note about Gitweb consumers receiving redacted logs.
     +
          Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
      
       ## Documentation/gitweb.conf.txt ##
     @@ Documentation/gitweb.conf.txt: same as of the snippet above:
      +email_privacy::
      +    Redact e-mail addresses from the generated HTML, etc. content.
      +    This hides e-mail addresses found in the commit log from web crawlers.
     -+    Enabled by default.
     ++    Disabled by default.
      ++
     -+It is highly recommended to keep this feature enabled unless web crawlers
     -+are hindered in some other way. You can disable this feature as shown below:
     ++It is highly recommended to enable this feature unless web crawlers are
     ++hindered in some other way. Note that crawlers intent on harvesting e-mail
     ++addresses may disregard robots.txt. You can enable this feature like so:
      ++
      +---------------------------------------------------------------------------
     -+$feature{'email_privacy'}{'default'} = [0];
     ++$feature{'email_privacy'}{'default'} = [1];
      +---------------------------------------------------------------------------
     +++
     ++Note that if Gitweb is not the final step in a workflow then subsequent
     ++steps may misbehave because of the redacted information they receive.
      +
       
       EXAMPLES
     @@ gitweb/gitweb.perl: sub evaluate_uri {
       		'override' => 0,
       		'default' => []},
      +
     -+    # Redact e-mail addresses.
     ++	# Redact e-mail addresses.
      +
     -+    # To disable system wide have in $GITWEB_CONFIG
     -+    # $feature{'email_privacy'}{'default'} = [0];
     ++	# To enable system wide have in $GITWEB_CONFIG
     ++	# $feature{'email_privacy'}{'default'} = [1];
      +	'email_privacy' => {
      +		'sub' => sub { feature_bool('email_privacy', @_) },
      +		'override' => 0,
     -+		'default' => [1]},
     ++		'default' => [0]},
       );
       
       sub gitweb_get_feature {
     +@@ gitweb/gitweb.perl: sub parse_date {
     + 	return %date;
     + }
     + 
     ++sub hide_mailaddr_if_private {
     ++	my $line = shift;
     ++	return $line unless (gitweb_check_feature('email_privacy') &&
     ++						$line =~ m/^([^<]+) <([^>]*)>/);
     ++	return hide_mailaddr($line)
     ++}
     ++
     ++sub hide_mailaddr {
     ++	my $mailaddr = shift;
     ++	$mailaddr =~ s/<([^>]*)>/<private>/;
     ++	return $mailaddr;
     ++}
     ++
     + sub parse_tag {
     + 	my $tag_id = shift;
     + 	my %tag;
      @@ gitweb/gitweb.perl: sub parse_tag {
       			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
       				$tag{'author_name'}  = $1;
       				$tag{'author_email'} = $2;
      +				if (gitweb_check_feature('email_privacy')) {
      +					$tag{'author_email'} = "private";
     -+					$tag{'author'} =~ s/<([^>]+)>/<private>/;
     ++					$tag{'author'} = hide_mailaddr($tag{'author'});
      +				}
       			} else {
       				$tag{'author_name'} = $tag{'author'};
     @@ gitweb/gitweb.perl: sub parse_commit_text {
       				$co{'author_email'} = $2;
      +				if (gitweb_check_feature('email_privacy')) {
      +					$co{'author_email'} = "private";
     -+					$co{'author'} =~ s/<([^>]+)>/<private>/;
     ++					$co{'author'} = hide_mailaddr($co{'author'});
      +				}
       			} else {
       				$co{'author_name'} = $co{'author'};
     @@ gitweb/gitweb.perl: sub parse_commit_text {
       				$co{'committer_email'} = $2;
      +				if (gitweb_check_feature('email_privacy')) {
      +					$co{'committer_email'} = "private";
     -+					$co{'committer'} =~ s/<([^>]+)>/<private>/;
     ++					$co{'committer'} = hide_mailaddr($co{'committer'});
      +				}
       			} else {
       				$co{'committer_name'} = $co{'committer'};
     @@ gitweb/gitweb.perl: sub parse_commit_text {
      +	# remove added spaces, redact e-mail addresses if applicable.
       	foreach my $line (@commit_lines) {
       		$line =~ s/^    //;
     -+		if (gitweb_check_feature('email_privacy') &&
     -+			$line =~ m/^([^<]+) <([^>]*)>/) {
     -+			$line =~ s/<([^>]+)>/<private>/;
     -+		}
     ++		$line = hide_mailaddr_if_private($line);
       	}
       	$co{'comment'} = \@commit_lines;
       
     @@ gitweb/gitweb.perl: sub git_commitdiff {
      -		local $/ = undef;
      -		print <$fd>;
      +		while (my $line = <$fd>) {
     -+			if (gitweb_check_feature('email_privacy') &&
     -+				$line =~ m/^([^<]+) <([^>]*)>/) {
     -+				$line =~ s/<([^>]+)>/<private>/;
     -+			}
     -+			print $line;
     ++			print hide_mailaddr_if_private($line);
      +		}
       		close $fd
       			or print "Reading git-format-patch failed\n";


 Documentation/gitweb.conf.txt | 16 +++++++++++++
 gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..b7af3240177d 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -896,6 +896,22 @@ same as of the snippet above:
 It is an error to specify a ref that does not pass "git check-ref-format"
 scrutiny. Duplicated values are filtered.
 
+email_privacy::
+    Redact e-mail addresses from the generated HTML, etc. content.
+    This hides e-mail addresses found in the commit log from web crawlers.
+    Disabled by default.
++
+It is highly recommended to enable this feature unless web crawlers are
+hindered in some other way. Note that crawlers intent on harvesting e-mail
+addresses may disregard robots.txt. You can enable this feature like so:
++
+---------------------------------------------------------------------------
+$feature{'email_privacy'}{'default'} = [1];
+---------------------------------------------------------------------------
++
+Note that if Gitweb is not the final step in a workflow then subsequent
+steps may misbehave because of the redacted information they receive.
+
 
 EXAMPLES
 --------
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..210228f32efd 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -569,6 +569,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email_privacy'}{'default'} = [1];
+	'email_privacy' => {
+		'sub' => sub { feature_bool('email_privacy', @_) },
+		'override' => 0,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3458,19 @@ sub parse_date {
 	return %date;
 }
 
+sub hide_mailaddr_if_private {
+	my $line = shift;
+	return $line unless (gitweb_check_feature('email_privacy') &&
+						$line =~ m/^([^<]+) <([^>]*)>/);
+	return hide_mailaddr($line)
+}
+
+sub hide_mailaddr {
+	my $mailaddr = shift;
+	$mailaddr =~ s/<([^>]*)>/<private>/;
+	return $mailaddr;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3471,6 +3493,10 @@ sub parse_tag {
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$tag{'author_name'}  = $1;
 				$tag{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$tag{'author_email'} = "private";
+					$tag{'author'} = hide_mailaddr($tag{'author'});
+				}
 			} else {
 				$tag{'author_name'} = $tag{'author'};
 			}
@@ -3519,6 +3545,10 @@ sub parse_commit_text {
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'author_name'}  = $1;
 				$co{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'author_email'} = "private";
+					$co{'author'} = hide_mailaddr($co{'author'});
+				}
 			} else {
 				$co{'author_name'} = $co{'author'};
 			}
@@ -3529,6 +3559,10 @@ sub parse_commit_text {
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'committer_name'}  = $1;
 				$co{'committer_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'committer_email'} = "private";
+					$co{'committer'} = hide_mailaddr($co{'committer'});
+				}
 			} else {
 				$co{'committer_name'} = $co{'committer'};
 			}
@@ -3568,9 +3602,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddr_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -8060,8 +8095,9 @@ sub git_commitdiff {
 		close $fd
 			or print "Reading git-diff-tree failed\n";
 	} elsif ($format eq 'patch') {
-		local $/ = undef;
-		print <$fd>;
+		while (my $line = <$fd>) {
+			print hide_mailaddr_if_private($line);
+		}
 		close $fd
 			or print "Reading git-format-patch failed\n";
 	}

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
  2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
  2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
@ 2021-03-21  6:00 ` Junio C Hamano
  2021-03-21  6:18   ` Junio C Hamano
  2021-03-21  6:43   ` Georgios Kontaxis
  2 siblings, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21  6:00 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget; +Cc: git, Georgios Kontaxis

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );

I do not see why this should default to true.

It would break existing installations, who have been perfectly happy
with the convenience of supplying a ready access to potential new
contributors who/which addresses to contact plausible mentors in the
projects they are interested in.

And more importantly, I do not see why it should be made impossible
to override per repository/project in a multi-tenant installation.
Some projects may be more "privacy" sensitive than others.  Those
who want to use tighter setting should be able to enable it even
when the side-wide default is set to false, no?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-21  6:00 ` [PATCH] " Junio C Hamano
@ 2021-03-21  6:18   ` Junio C Hamano
  2021-03-21  6:43   ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21  6:18 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget; +Cc: git, Georgios Kontaxis

Junio C Hamano <gitster@pobox.com> writes:

> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the side-wide default is set to false, no?

To answer an inevitable and natural follow-up question preemptively,
the primary reason why we have the override => 0 mechanism is so
that site administrators can disable certain expensive features
(like blame, snapshot, etc.) no matter what each project hosted by
them wish.

And hiding contributor identity would not be a choice that is based
on how expensive the feature is to run.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-21  6:00 ` [PATCH] " Junio C Hamano
  2021-03-21  6:18   ` Junio C Hamano
@ 2021-03-21  6:43   ` Georgios Kontaxis
  2021-03-21 16:55     ` Junio C Hamano
  1 sibling, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-21  6:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Georgios Kontaxis via GitGitGadget, git

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>
> I do not see why this should default to true.
>
I've changed the default to "false". V2 should reflect the change.

> It would break existing installations, who have been perfectly happy
> with the convenience of supplying a ready access to potential new
> contributors who/which addresses to contact plausible mentors in the
> projects they are interested in.
>
> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the side-wide default is set to false, no?
>
> Thanks.
>
I was actually thinking about the other way around;
preventing projects from disabling this feature.

Sounds like the "override" flag is for other types
of use cases though. I'll change it to "true".

Thanks for the feedback.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] gitweb: redacted e-mail addresses feature.
  2021-03-21  6:43   ` Georgios Kontaxis
@ 2021-03-21 16:55     ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 16:55 UTC (permalink / raw)
  To: Georgios Kontaxis; +Cc: Georgios Kontaxis via GitGitGadget, git

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> And more importantly, I do not see why it should be made impossible
>> to override per repository/project in a multi-tenant installation.
>> Some projects may be more "privacy" sensitive than others.  Those
>> who want to use tighter setting should be able to enable it even
>> when the side-wide default is set to false, no?
>>
>> Thanks.
>>
> I was actually thinking about the other way around;
> preventing projects from disabling this feature.

Yes, it cuts both ways, and override should be allowed for most
cases (unless absolutely necessary for healthy operation of the
system) for a simple reason that anybody who sets site-wide default
is not in a better position than those who set per-repository or
per-project setting to judge what is good for them.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
@ 2021-03-21 17:28   ` Georgios Kontaxis via GitGitGadget
  2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
                       ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-21 17:28 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis, Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This may result in unsolicited messages.
This is a feature for redacting e-mail addresses
from the generated HTML, etc. content.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it.
It aims to hinder the low-effort bulk collection of e-mail
addresses by web crawlers.

Changes since v1:
- Turned off the feature by default.
- Removed duplicate code.
- Added note about Gitweb consumers receiving redacted logs.

Changes since v2:
- The feature can be set on a per-project basis. ('override' => 1)

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: Redacted e-mail addresses feature.
    
    Gitweb extracts content from the Git log and makes it accessible over
    HTTP. As a result, e-mail addresses found in commits are exposed to web
    crawlers. This may result in unsolicited messages. This is a feature for
    redacting e-mail addresses from the generated HTML content.
    
    This feature does not prevent someone from downloading the unredacted
    commit log and extracting information from it. It aims to hinder the
    low-effort bulk collection of e-mail addresses by web crawlers.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v2:

 1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses feature.
     @@ Commit message
          - Removed duplicate code.
          - Added note about Gitweb consumers receiving redacted logs.
      
     +    Changes since v2:
     +    - The feature can be set on a per-project basis. ('override' => 1)
     +
          Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
      
       ## Documentation/gitweb.conf.txt ##
     @@ gitweb/gitweb.perl: sub evaluate_uri {
      +	# $feature{'email_privacy'}{'default'} = [1];
      +	'email_privacy' => {
      +		'sub' => sub { feature_bool('email_privacy', @_) },
     -+		'override' => 0,
     ++		'override' => 1,
      +		'default' => [0]},
       );
       


 Documentation/gitweb.conf.txt | 16 +++++++++++++
 gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..b7af3240177d 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -896,6 +896,22 @@ same as of the snippet above:
 It is an error to specify a ref that does not pass "git check-ref-format"
 scrutiny. Duplicated values are filtered.
 
+email_privacy::
+    Redact e-mail addresses from the generated HTML, etc. content.
+    This hides e-mail addresses found in the commit log from web crawlers.
+    Disabled by default.
++
+It is highly recommended to enable this feature unless web crawlers are
+hindered in some other way. Note that crawlers intent on harvesting e-mail
+addresses may disregard robots.txt. You can enable this feature like so:
++
+---------------------------------------------------------------------------
+$feature{'email_privacy'}{'default'} = [1];
+---------------------------------------------------------------------------
++
+Note that if Gitweb is not the final step in a workflow then subsequent
+steps may misbehave because of the redacted information they receive.
+
 
 EXAMPLES
 --------
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..174cc566d97d 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -569,6 +569,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email_privacy'}{'default'} = [1];
+	'email_privacy' => {
+		'sub' => sub { feature_bool('email_privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3458,19 @@ sub parse_date {
 	return %date;
 }
 
+sub hide_mailaddr_if_private {
+	my $line = shift;
+	return $line unless (gitweb_check_feature('email_privacy') &&
+						$line =~ m/^([^<]+) <([^>]*)>/);
+	return hide_mailaddr($line)
+}
+
+sub hide_mailaddr {
+	my $mailaddr = shift;
+	$mailaddr =~ s/<([^>]*)>/<private>/;
+	return $mailaddr;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3471,6 +3493,10 @@ sub parse_tag {
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$tag{'author_name'}  = $1;
 				$tag{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$tag{'author_email'} = "private";
+					$tag{'author'} = hide_mailaddr($tag{'author'});
+				}
 			} else {
 				$tag{'author_name'} = $tag{'author'};
 			}
@@ -3519,6 +3545,10 @@ sub parse_commit_text {
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'author_name'}  = $1;
 				$co{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'author_email'} = "private";
+					$co{'author'} = hide_mailaddr($co{'author'});
+				}
 			} else {
 				$co{'author_name'} = $co{'author'};
 			}
@@ -3529,6 +3559,10 @@ sub parse_commit_text {
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'committer_name'}  = $1;
 				$co{'committer_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'committer_email'} = "private";
+					$co{'committer'} = hide_mailaddr($co{'committer'});
+				}
 			} else {
 				$co{'committer_name'} = $co{'committer'};
 			}
@@ -3568,9 +3602,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddr_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -8060,8 +8095,9 @@ sub git_commitdiff {
 		close $fd
 			or print "Reading git-diff-tree failed\n";
 	} elsif ($format eq 'patch') {
-		local $/ = undef;
-		print <$fd>;
+		while (my $line = <$fd>) {
+			print hide_mailaddr_if_private($line);
+		}
 		close $fd
 			or print "Reading git-format-patch failed\n";
 	}

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
@ 2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
  2021-03-21 18:48       ` Junio C Hamano
  2021-03-21 19:48       ` Georgios Kontaxis
  2021-03-21 18:42     ` Junio C Hamano
  2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
  2 siblings, 2 replies; 41+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-21 18:26 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, brian m. carlson, Georgios Kontaxis


On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
>
> Changes since v1:
> - Turned off the feature by default.
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
> - The feature can be set on a per-project basis. ('override' => 1)
>
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
> Range-diff vs v2:
>
>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses feature.
>      @@ Commit message
>           - Removed duplicate code.
>           - Added note about Gitweb consumers receiving redacted logs.
>       
>      +    Changes since v2:
>      +    - The feature can be set on a per-project basis. ('override' => 1)
>      +
>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>       
>        ## Documentation/gitweb.conf.txt ##
>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>       +	# $feature{'email_privacy'}{'default'} = [1];
>       +	'email_privacy' => {
>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>      -+		'override' => 0,
>      ++		'override' => 1,
>       +		'default' => [0]},
>        );
>        
>
>
>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..b7af3240177d 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,22 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Disabled by default.
> ++
> +It is highly recommended to enable this feature unless web crawlers are
> +hindered in some other way. Note that crawlers intent on harvesting e-mail
> +addresses may disregard robots.txt. You can enable this feature like so:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [1];
> +---------------------------------------------------------------------------
> ++
> +Note that if Gitweb is not the final step in a workflow then subsequent
> +steps may misbehave because of the redacted information they receive.
> +
>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..174cc566d97d 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email_privacy'}{'default'} = [1];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );
>  
>  sub gitweb_get_feature {
> @@ -3449,6 +3458,19 @@ sub parse_date {
>  	return %date;
>  }
>  
> [snip]

So in the v1 feedback I suggested:

BEGIN QUOTE
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);
END QUOTE

But:

>  sub parse_tag {
>  	my $tag_id = shift;
>  	my %tag;
> @@ -3471,6 +3493,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} = hide_mailaddr($tag{'author'});
> +				}

This code seems quite awkward, we've already done the regex match, but
this code:

> [snip]
> +sub hide_mailaddr_if_private {
> +	my $line = shift;
> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +						$line =~ m/^([^<]+) <([^>]*)>/);
> +	return hide_mailaddr($line)
> +}
> +
> +sub hide_mailaddr {
> +	my $mailaddr = shift;
> +	$mailaddr =~ s/<([^>]*)>/<private>/;
> +	return $mailaddr;
> +}

Is going to do it again incrementally, and then just act on a
search-replacement if we've got the feature enabled.

It seems much simpler to just turn your:

> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} = hide_mailaddr($tag{'author'});
> +				}

Into:

    $tag{'author'} = maybe_hide_mailaddr($tag{author}, \$tag{author_email});

Using:

    sub maybe_hide_email {
        my ($email, $ref) = shift;
        return $email unless gitweb_check_feature('email_privacy');
        $$ref = "private" if $ref;
        return hide_email($email);
    }

Which also works for the case where you don't have a "private" hash key
to assign to. But maybe it overcomplicates things...

>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3545,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} = hide_mailaddr($co{'author'});
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3559,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} = hide_mailaddr($co{'committer'});
> +				}
> [...]
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddr_if_private($line);
>  	}
>  	$co{'comment'} = \@commit_lines;
>  
> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			print hide_mailaddr_if_private($line);
> +		}

Urm, have you tested this? How does a while loop over a <$fd> make sense
when $/ is undef, the readline() operator will always return just one
record, so having a while loop doesn't make sense.

I'm not sure of the input here, but given that if you're expecting to
replace all e-mail addresses on all lines with this function that's not
how it'll work, the s/// doesn't have a /g, so it'll stop at the first
replacement.

>  		close $fd
>  			or print "Reading git-format-patch failed\n";
>  	}
>
> base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
  2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
@ 2021-03-21 18:42     ` Junio C Hamano
  2021-03-21 18:57       ` Junio C Hamano
  2021-03-21 20:07       ` Georgios Kontaxis
  2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
  2 siblings, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 18:42 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

Lines from here ...

>
> Changes since v1:
> - Turned off the feature by default.
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
> - The feature can be set on a per-project basis. ('override' => 1)

... to here are to help reviewers on the mailing list while
iterations of the same change is being reviewed and polished.

But it is useless noise for those who only read "git log".  They
simply do not have access to earlier iterations.

Such "changes between iterations" comments needs to be written after
the three-dash lines.

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
> Range-diff vs v2:
>
>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses feature.
>      @@ Commit message
>           - Removed duplicate code.
>           - Added note about Gitweb consumers receiving redacted logs.
>       
>      +    Changes since v2:
>      +    - The feature can be set on a per-project basis. ('override' => 1)
>      +
>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>       
>        ## Documentation/gitweb.conf.txt ##
>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>       +	# $feature{'email_privacy'}{'default'} = [1];
>       +	'email_privacy' => {
>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>      -+		'override' => 0,
>      ++		'override' => 1,
>       +		'default' => [0]},
>        );
>        
>
>
>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..b7af3240177d 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,22 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Disabled by default.
> ++
> +It is highly recommended to enable this feature unless web crawlers are
> +hindered in some other way. Note that crawlers intent on harvesting e-mail
> +addresses may disregard robots.txt.

Up to this line is more-or-less OK, but with a few comments:

 - "This hides ... from web crawlers"?  Doesn't this hide it
   indiscriminately, whether the requester is an interactive
   end-user or a crawler?

 - The name of the configured feature, 'email_privacy'.  There is
   another existing feature with underscore in its name
   (remote_heads), but I think it is an odd-ball mistake we do not
   want to imitate.  Instead, I think the "extra branch refs" may be
   a good example to follow.  The feature name (written in Perl
   code) is "extra-branch-refs" (downcased words with inter-word
   dashes) and the corresponding configuration (I am not saying we
   should add one to support this feature, if we do not have one
   already) is "gitweb.extraBranchRefs" (camelCased words).

 - Do not exaggerate with words like "highly", but trust the
   intelligence of your readers to make the right decision when they
   understand the reason why this feature exists in the first place.

 - I do not think this entry should be added to the end of feature
   list.  How about listing it just after 'avatar', which is also
   about how author/committer identity is shown?

> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [1];
> +---------------------------------------------------------------------------

I do not think this should be here.  None of the boolean features
listed early in "Features in `%feature`" section like blame,
snapshot, grep, pickaxe, ... features tell readers how to enable a
specific feature like that.

Unless the syntax to configure a feature is one-off oddball that is
different from all other features, we shouldn't clutter the
description with an example like this.

At the beginning of the section "Configuring Gitweb Features" there
is a general description and that should be sufficient (and if it is
not sufficient, it is OK to add an example to that section).

> +Note that if Gitweb is not the final step in a workflow then subsequent
> +steps may misbehave because of the redacted information they receive.

In other words, this will break crawlers that expect email addresses
are there and want to use it for some good purpose.  But that is an
natural consequence of the "feature", and should be described as
such when we said "Redact e-mail addresses", not as a "by the way"
mention in a footnote.

	... after reading more, readers realize that the damage is
	far worse in the current incarnation of the patch, but let's
	read on ...

> @@ -3449,6 +3458,19 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub hide_mailaddr_if_private {
> +	my $line = shift;
> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +						$line =~ m/^([^<]+) <([^>]*)>/);

I find that the second line is way too deeply indented.  Wouldn't

> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/);

be enough?

> +	return hide_mailaddr($line)
> +}
> +
> +sub hide_mailaddr {
> +	my $mailaddr = shift;
> +	$mailaddr =~ s/<([^>]*)>/<private>/;

s/private/redacted/ perhaps?

> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddr_if_private($line);

A commit log that ends with

	Thank you, A <a@example.com> and B <b@example.com>, for discovering
	this bug and quickly proposing a solution.

would recact only the first one but not the other, I suspect.

Much more problematic is that I see too many hits from:

    $ git log v2.30.0..v2.31.0 | grep ' <[^>@]*>'

That is, "find a line in the commit log message that has a SP, open <bra,
run of characters that are not ket> or at@sign, closed with ket>".  These
are clearly not e-mail addresses.

This "feature" butchers a commit title:

    mktag doc: say <hash> not <sha1>

to

    mktag doc: say <private> not <sha1>

doesn't it?

> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			print hide_mailaddr_if_private($line);
> +		}

And this is even worse.

 $ git log -p --no-merges v2.30.0..v2.31.0 | grep ' <[^>@]*>' | wc -l

gives me ~4700 hits.  Because the patch text as well as the log
message is munged, the "feature" makes the output utterly unreliable
as an input to "git am" by munging too much.  Interesting examples
are like these:

        -	for (i = 0; i < pairs->nr; ++i) {
                'git <command> [<revision>...] -- [<file>...]'
        +	for (i = split; i < geometry->pack_nr; i++) {

Now, I am *NOT* saying that you should tighten the e-mail address
syntax detection and keep munging the patch text.  The lines that
begin with minus and SP in a patch must match the preimage text,
so munging out existing e-mail addresses from them will make the
patch fail to find the part that needs to be modified.  And
replacing an e-mail address on a line that begins with plus would
redact it from the source---if they wrote an address, they must have
meant it to be available to those who consume the source, so I doubt
the wisdom of munging the patch part at all.

I may be sympathetic to the cause of the patch, but, I do not agree
with its execution in this iteration of the patch.

Thanks.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
@ 2021-03-21 18:48       ` Junio C Hamano
  2021-03-21 19:48       ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 18:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Georgios Kontaxis via GitGitGadget, git, brian m. carlson,
	Georgios Kontaxis

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But:
>
>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3471,6 +3493,10 @@ sub parse_tag {
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$tag{'author_name'}  = $1;
>>  				$tag{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} = hide_mailaddr($tag{'author'});
>> +				}
>
> This code seems quite awkward, we've already done the regex match, but
> this code:
>
>> [snip]
>> +sub hide_mailaddr_if_private {
>> +	my $line = shift;
>> +	return $line unless (gitweb_check_feature('email_privacy') &&
>> +						$line =~ m/^([^<]+) <([^>]*)>/);
>> +	return hide_mailaddr($line)
>> +}
>> +
>> +sub hide_mailaddr {
>> +	my $mailaddr = shift;
>> +	$mailaddr =~ s/<([^>]*)>/<private>/;
>> +	return $mailaddr;
>> +}
>
> Is going to do it again incrementally, and then just act on a
> search-replacement if we've got the feature enabled.

I think you misread the patch the same way as I did initially.  What
is called in parse_tag is not the "if-private" version---the caller
knows that $tag{'author'} MUST BE an address so unconditionally redact
when the feature is set by bypassing the _if_private variant.

Having said that, I do think

>     sub maybe_hide_email {
>         my ($email, $ref) = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         $$ref = "private" if $ref;
>         return hide_email($email);
>     }

this helper is how you would simplify these numerous callers.

>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			print hide_mailaddr_if_private($line);
>> +		}
>
> Urm, have you tested this? How does a while loop over a <$fd> make sense
> when $/ is undef, the readline() operator will always return just one
> record, so having a while loop doesn't make sense.
>
> I'm not sure of the input here, but given that if you're expecting to
> replace all e-mail addresses on all lines with this function that's not
> how it'll work, the s/// doesn't have a /g, so it'll stop at the first
> replacement.
>
>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>>  	}

All true.  This one is reading from "format-patch --stdout" output,
but for the reasons I mentioned in my review, I do not think it
should touch the 'patch' output at all.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 18:42     ` Junio C Hamano
@ 2021-03-21 18:57       ` Junio C Hamano
  2021-03-21 19:05         ` Junio C Hamano
  2021-03-21 20:07       ` Georgios Kontaxis
  1 sibling, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 18:57 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

Junio C Hamano <gitster@pobox.com> writes:

>> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			print hide_mailaddr_if_private($line);
>> +		}
>
> And this is even worse.

And also Ævar is right.  The original just has a format-patch output
in $_ and prints it to the output stream in one go.  It won't make
any sense to attempt reading from that stream, which the original
code used to write to, at all.

In any case, I've already said that I do not think it is a good idea
to touch the 'patch' codepath at all, so this code won't do any
damage in the end by reading from the output pipe connected to
end-user's browser ;-)

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 18:57       ` Junio C Hamano
@ 2021-03-21 19:05         ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 19:05 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

> And also Ævar is right.  The original just has a format-patch output
> in $_ and prints it to the output stream in one go.

Sorry but I misread the patch.  By dropping the $/ = undef, it wants
to read from $fd (which is "format-patch --stdout" output) one line
at a time.  So unless a line has two addresses, this would "work" as
intended.

Again, as I said, I do not agree with the definition of "work" here,
though ;-).

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
  2021-03-21 18:48       ` Junio C Hamano
@ 2021-03-21 19:48       ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-21 19:48 UTC (permalink / raw)
  To: "Ævar Arnfjörð Bjarmason"
  Cc: Georgios Kontaxis via GitGitGadget, git, brian m. carlson

>
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>>
>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>>
>> Changes since v1:
>> - Turned off the feature by default.
>> - Removed duplicate code.
>> - Added note about Gitweb consumers receiving redacted logs.
>>
>> Changes since v2:
>> - The feature can be set on a per-project basis. ('override' => 1)
>>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: Redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers. This may result in unsolicited messages. This is a feature
>> for
>>     redacting e-mail addresses from the generated HTML content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log and extracting information from it. It aims to hinder the
>>     low-effort bulk collection of e-mail addresses by web crawlers.
>>
>>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>>
>> Published-As:
>> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
>> pr-910/kontaxis/kontaxis/email_privacy-v3
>> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>>
>> Range-diff vs v2:
>>
>>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses
>> feature.
>>      @@ Commit message
>>           - Removed duplicate code.
>>           - Added note about Gitweb consumers receiving redacted logs.
>>
>>      +    Changes since v2:
>>      +    - The feature can be set on a per-project basis. ('override'
>> => 1)
>>      +
>>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>>        ## Documentation/gitweb.conf.txt ##
>>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>>       +	# $feature{'email_privacy'}{'default'} = [1];
>>       +	'email_privacy' => {
>>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>>      -+		'override' => 0,
>>      ++		'override' => 1,
>>       +		'default' => [0]},
>>        );
>>
>>
>>
>>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>>  2 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gitweb.conf.txt
>> b/Documentation/gitweb.conf.txt
>> index 7963a79ba98b..b7af3240177d 100644
>> --- a/Documentation/gitweb.conf.txt
>> +++ b/Documentation/gitweb.conf.txt
>> @@ -896,6 +896,22 @@ same as of the snippet above:
>>  It is an error to specify a ref that does not pass "git
>> check-ref-format"
>>  scrutiny. Duplicated values are filtered.
>>
>> +email_privacy::
>> +    Redact e-mail addresses from the generated HTML, etc. content.
>> +    This hides e-mail addresses found in the commit log from web
>> crawlers.
>> +    Disabled by default.
>> ++
>> +It is highly recommended to enable this feature unless web crawlers are
>> +hindered in some other way. Note that crawlers intent on harvesting
>> e-mail
>> +addresses may disregard robots.txt. You can enable this feature like
>> so:
>> ++
>> +---------------------------------------------------------------------------
>> +$feature{'email_privacy'}{'default'} = [1];
>> +---------------------------------------------------------------------------
>> ++
>> +Note that if Gitweb is not the final step in a workflow then subsequent
>> +steps may misbehave because of the redacted information they receive.
>> +
>>
>>  EXAMPLES
>>  --------
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..174cc566d97d 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -569,6 +569,15 @@ sub evaluate_uri {
>>  		'sub' => \&feature_extra_branch_refs,
>>  		'override' => 0,
>>  		'default' => []},
>> +
>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email_privacy'}{'default'} = [1];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3458,19 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> [snip]
>
> So in the v1 feedback I suggested:
>
> BEGIN QUOTE
>     sub maybe_hide_email {
>         my $email = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         return hide_email($email);
>     }
>
> then:
>
>     $tag{author_email} = maybe_hide_email($2);
> END QUOTE
>
> But:
>
>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3471,6 +3493,10 @@ sub parse_tag {
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$tag{'author_name'}  = $1;
>>  				$tag{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} = hide_mailaddr($tag{'author'});
>> +				}
>
> This code seems quite awkward, we've already done the regex match, but
> this code:
>
>> [snip]
>> +sub hide_mailaddr_if_private {
>> +	my $line = shift;
>> +	return $line unless (gitweb_check_feature('email_privacy') &&
>> +						$line =~ m/^([^<]+) <([^>]*)>/);
>> +	return hide_mailaddr($line)
>> +}
>> +
>> +sub hide_mailaddr {
>> +	my $mailaddr = shift;
>> +	$mailaddr =~ s/<([^>]*)>/<private>/;
>> +	return $mailaddr;
>> +}
>
> Is going to do it again incrementally, and then just act on a
> search-replacement if we've got the feature enabled.
>
I think hide_mailaddr_if_private (which checks the feature flag
and looks for the pattern) is only called when we haven't done so
already. In every other case we call hide_mailaddr directly.
So there's no duplication of checks.

> It seems much simpler to just turn your:
>
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} = hide_mailaddr($tag{'author'});
>> +				}
>
> Into:
>
>     $tag{'author'} = maybe_hide_mailaddr($tag{author},
> \$tag{author_email});
>
> Using:
>
>     sub maybe_hide_email {
>         my ($email, $ref) = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         $$ref = "private" if $ref;
>         return hide_email($email);
>     }
>
> Which also works for the case where you don't have a "private" hash key
> to assign to. But maybe it overcomplicates things...
>
>>  			} else {
>>  				$tag{'author_name'} = $tag{'author'};
>>  			}
>> @@ -3519,6 +3545,10 @@ sub parse_commit_text {
>>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'author_name'}  = $1;
>>  				$co{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'author_email'} = "private";
>> +					$co{'author'} = hide_mailaddr($co{'author'});
>> +				}
>>  			} else {
>>  				$co{'author_name'} = $co{'author'};
>>  			}
>> @@ -3529,6 +3559,10 @@ sub parse_commit_text {
>>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'committer_name'}  = $1;
>>  				$co{'committer_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'committer_email'} = "private";
>> +					$co{'committer'} = hide_mailaddr($co{'committer'});
>> +				}
>> [...]
>>  			} else {
>>  				$co{'committer_name'} = $co{'committer'};
>>  			}
>> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>>  	}
>> -	# remove added spaces
>> +	# remove added spaces, redact e-mail addresses if applicable.
>>  	foreach my $line (@commit_lines) {
>>  		$line =~ s/^    //;
>> +		$line = hide_mailaddr_if_private($line);
>>  	}
>>  	$co{'comment'} = \@commit_lines;
>>
>> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			print hide_mailaddr_if_private($line);
>> +		}
>
> Urm, have you tested this? How does a while loop over a <$fd> make sense
> when $/ is undef, the readline() operator will always return just one
> record, so having a while loop doesn't make sense.
>
$/ is not undef, the opposite, it's the default value which results
in reading from <$fd> one line at a time. (where line is a sequence
terminated by CRLF or LF)

> I'm not sure of the input here, but given that if you're expecting to
> replace all e-mail addresses on all lines with this function that's not
> how it'll work, the s/// doesn't have a /g, so it'll stop at the first
> replacement.
>
So we are reading from <$fd> one line at a time and looking for the
first "expr1 <expr2>" occurrence in each line. This successfully redacts
the author, committer, signed-off-by lines.

Not doing a global replacement per line is intentional so as to avoid
false positives.
So far I haven't noticed missing any addresses that should have been
redacted.

>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>>  	}
>>
>> base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
>
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 18:42     ` Junio C Hamano
  2021-03-21 18:57       ` Junio C Hamano
@ 2021-03-21 20:07       ` Georgios Kontaxis
  2021-03-21 22:17         ` Junio C Hamano
  1 sibling, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-21 20:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>>
>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> Lines from here ...
>
>>
>> Changes since v1:
>> - Turned off the feature by default.
>> - Removed duplicate code.
>> - Added note about Gitweb consumers receiving redacted logs.
>>
>> Changes since v2:
>> - The feature can be set on a per-project basis. ('override' => 1)
>
> ... to here are to help reviewers on the mailing list while
> iterations of the same change is being reviewed and polished.
>
> But it is useless noise for those who only read "git log".  They
> simply do not have access to earlier iterations.
>
> Such "changes between iterations" comments needs to be written after
> the three-dash lines.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: Redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers. This may result in unsolicited messages. This is a feature
>> for
>>     redacting e-mail addresses from the generated HTML content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log and extracting information from it. It aims to hinder the
>>     low-effort bulk collection of e-mail addresses by web crawlers.
>>
>>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>>
>> Published-As:
>> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
>> pr-910/kontaxis/kontaxis/email_privacy-v3
>> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>>
>> Range-diff vs v2:
>>
>>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses
>> feature.
>>      @@ Commit message
>>           - Removed duplicate code.
>>           - Added note about Gitweb consumers receiving redacted logs.
>>
>>      +    Changes since v2:
>>      +    - The feature can be set on a per-project basis. ('override'
>> => 1)
>>      +
>>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>>        ## Documentation/gitweb.conf.txt ##
>>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>>       +	# $feature{'email_privacy'}{'default'} = [1];
>>       +	'email_privacy' => {
>>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>>      -+		'override' => 0,
>>      ++		'override' => 1,
>>       +		'default' => [0]},
>>        );
>>
>>
>>
>>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>>  2 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gitweb.conf.txt
>> b/Documentation/gitweb.conf.txt
>> index 7963a79ba98b..b7af3240177d 100644
>> --- a/Documentation/gitweb.conf.txt
>> +++ b/Documentation/gitweb.conf.txt
>> @@ -896,6 +896,22 @@ same as of the snippet above:
>>  It is an error to specify a ref that does not pass "git
>> check-ref-format"
>>  scrutiny. Duplicated values are filtered.
>>
>> +email_privacy::
>> +    Redact e-mail addresses from the generated HTML, etc. content.
>> +    This hides e-mail addresses found in the commit log from web
>> crawlers.
>> +    Disabled by default.
>> ++
>> +It is highly recommended to enable this feature unless web crawlers are
>> +hindered in some other way. Note that crawlers intent on harvesting
>> e-mail
>> +addresses may disregard robots.txt.
>
> Up to this line is more-or-less OK, but with a few comments:
>
>  - "This hides ... from web crawlers"?  Doesn't this hide it
>    indiscriminately, whether the requester is an interactive
>    end-user or a crawler?
>
Correct. It doesn't say "it only hides from".
I can see how it may be confusing. I'll rephrase.

>  - The name of the configured feature, 'email_privacy'.  There is
>    another existing feature with underscore in its name
>    (remote_heads), but I think it is an odd-ball mistake we do not
>    want to imitate.  Instead, I think the "extra branch refs" may be
>    a good example to follow.  The feature name (written in Perl
>    code) is "extra-branch-refs" (downcased words with inter-word
>    dashes) and the corresponding configuration (I am not saying we
>    should add one to support this feature, if we do not have one
>    already) is "gitweb.extraBranchRefs" (camelCased words).
>
I'll rename the feature.

>  - Do not exaggerate with words like "highly", but trust the
>    intelligence of your readers to make the right decision when they
>    understand the reason why this feature exists in the first place.
>
Will remove.

>  - I do not think this entry should be added to the end of feature
>    list.  How about listing it just after 'avatar', which is also
>    about how author/committer identity is shown?
>
Will do.

>> +---------------------------------------------------------------------------
>> +$feature{'email_privacy'}{'default'} = [1];
>> +---------------------------------------------------------------------------
>
> I do not think this should be here.  None of the boolean features
> listed early in "Features in `%feature`" section like blame,
> snapshot, grep, pickaxe, ... features tell readers how to enable a
> specific feature like that.
>
> Unless the syntax to configure a feature is one-off oddball that is
> different from all other features, we shouldn't clutter the
> description with an example like this.
>
> At the beginning of the section "Configuring Gitweb Features" there
> is a general description and that should be sufficient (and if it is
> not sufficient, it is OK to add an example to that section).
>
Will remove.

>> +Note that if Gitweb is not the final step in a workflow then subsequent
>> +steps may misbehave because of the redacted information they receive.
>
> In other words, this will break crawlers that expect email addresses
> are there and want to use it for some good purpose.  But that is an
> natural consequence of the "feature", and should be described as
> such when we said "Redact e-mail addresses", not as a "by the way"
> mention in a footnote.
>
> 	... after reading more, readers realize that the damage is
> 	far worse in the current incarnation of the patch, but let's
> 	read on ...
>
Will move this out of the footnote.

>> @@ -3449,6 +3458,19 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub hide_mailaddr_if_private {
>> +	my $line = shift;
>> +	return $line unless (gitweb_check_feature('email_privacy') &&
>> +						$line =~ m/^([^<]+) <([^>]*)>/);
>
> I find that the second line is way too deeply indented.  Wouldn't
>
>> +	return $line unless (gitweb_check_feature('email_privacy') &&
>> +				$line =~ m/^([^<]+) <([^>]*)>/);
>
> be enough?
>
Will adjust.

>> +	return hide_mailaddr($line)
>> +}
>> +
>> +sub hide_mailaddr {
>> +	my $mailaddr = shift;
>> +	$mailaddr =~ s/<([^>]*)>/<private>/;
>
> s/private/redacted/ perhaps?
>
Will adjust.

>> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>>  	}
>> -	# remove added spaces
>> +	# remove added spaces, redact e-mail addresses if applicable.
>>  	foreach my $line (@commit_lines) {
>>  		$line =~ s/^    //;
>> +		$line = hide_mailaddr_if_private($line);
>
> A commit log that ends with
>
> 	Thank you, A <a@example.com> and B <b@example.com>, for discovering
> 	this bug and quickly proposing a solution.
>
> would recact only the first one but not the other, I suspect.
>
> Much more problematic is that I see too many hits from:
>
>     $ git log v2.30.0..v2.31.0 | grep ' <[^>@]*>'
>
> That is, "find a line in the commit log message that has a SP, open <bra,
> run of characters that are not ket> or at@sign, closed with ket>".  These
> are clearly not e-mail addresses.
>
> This "feature" butchers a commit title:
>
>     mktag doc: say <hash> not <sha1>
>
> to
>
>     mktag doc: say <private> not <sha1>
>
> doesn't it?
>
True. Thanks for pointing these out.
Sounds like the pattern we're looking for needs to be more specific.

>> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			print hide_mailaddr_if_private($line);
>> +		}
>
> And this is even worse.
>
>  $ git log -p --no-merges v2.30.0..v2.31.0 | grep ' <[^>@]*>' | wc -l
>
> gives me ~4700 hits.  Because the patch text as well as the log
> message is munged, the "feature" makes the output utterly unreliable
> as an input to "git am" by munging too much.  Interesting examples
> are like these:
>
>         -	for (i = 0; i < pairs->nr; ++i) {
>                 'git <command> [<revision>...] -- [<file>...]'
>         +	for (i = split; i < geometry->pack_nr; i++) {
>
> Now, I am *NOT* saying that you should tighten the e-mail address
> syntax detection and keep munging the patch text.  The lines that
> begin with minus and SP in a patch must match the preimage text,
> so munging out existing e-mail addresses from them will make the
> patch fail to find the part that needs to be modified.  And
> replacing an e-mail address on a line that begins with plus would
> redact it from the source---if they wrote an address, they must have
> meant it to be available to those who consume the source, so I doubt
> the wisdom of munging the patch part at all.
>
> I may be sympathetic to the cause of the patch, but, I do not agree
> with its execution in this iteration of the patch.
>
I see your point.

It seems hiding e-mail addresses should be limited to the commit message,
i.e., stop at the "---" line.

> Thanks.
>
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 20:07       ` Georgios Kontaxis
@ 2021-03-21 22:17         ` Junio C Hamano
  2021-03-21 23:14           ` Georgios Kontaxis
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-03-21 22:17 UTC (permalink / raw)
  To: Georgios Kontaxis
  Cc: Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> ... so I doubt
>> the wisdom of munging the patch part at all.
>>
>> I may be sympathetic to the cause of the patch, but, I do not agree
>> with its execution in this iteration of the patch.
>>
> I see your point.
>
> It seems hiding e-mail addresses should be limited to the commit message,
> i.e., stop at the "---" line.

I doubt it makes sense to redact anything in the 'patch' view at
all, actually.  What kind of URL does the crawler need to formulate
and what pieces of information (like commit object names or branch
names) does it need to fill in the URL to get a series of patches
out of gitweb?  As long as it takes more effort than running "git
clone" against the repository, the crawler would not have much
incentive to crawl and harvest addresses from the 'patch' pages, and
even in the log message part, the downsides of butchering the
payload would outweigh the "privacy benefit", I would have to say.

Quite honestly, if a site claims to offer a 'patch' download UI but
returns corrupt data back, I would say it is much worse than not
offering the service at all.  Perhaps disabling the 'patch' feature
in repositories that enable 'privacy' feature may be a much better
approach.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 22:17         ` Junio C Hamano
@ 2021-03-21 23:14           ` Georgios Kontaxis
  2021-03-22  4:25             ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-21 23:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> "Georgios Kontaxis" <geko1702+commits@99rst.org> writes:
>
>>> ... so I doubt
>>> the wisdom of munging the patch part at all.
>>>
>>> I may be sympathetic to the cause of the patch, but, I do not agree
>>> with its execution in this iteration of the patch.
>>>
>> I see your point.
>>
>> It seems hiding e-mail addresses should be limited to the commit
>> message,
>> i.e., stop at the "---" line.
>
> I doubt it makes sense to redact anything in the 'patch' view at
> all, actually.  What kind of URL does the crawler need to formulate
> and what pieces of information (like commit object names or branch
> names) does it need to fill in the URL to get a series of patches
> out of gitweb?  As long as it takes more effort than running "git
> clone" against the repository, the crawler would not have much
> incentive to crawl and harvest addresses from the 'patch' pages, and
> even in the log message part, the downsides of butchering the
> payload would outweigh the "privacy benefit", I would have to say.
>
No effort at all I would say.
E..g, somehow the web crawler gets to git.kernel.org.
It then follows every link, eventually arriving at a commitdiff page.
It then follows every link, which includes the URL for the patch output.
See how "wget --mirror" behaves for instance.

Just to clarify, my goal is not to stop someone who wants to extract
e-mail address from git.kernel.org specifically.
They can just "git clone" the repositories and grep through the logs.
My goal is to stop generic crawlers (pretty much "wget --mirror | grep"
scripts) from making their way to the logs.

> Quite honestly, if a site claims to offer a 'patch' download UI but
> returns corrupt data back, I would say it is much worse than not
> offering the service at all.  Perhaps disabling the 'patch' feature
> in repositories that enable 'privacy' feature may be a much better
> approach.
>
Good point. I think I'll try that.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
  2021-03-21 23:14           ` Georgios Kontaxis
@ 2021-03-22  4:25             ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-22  4:25 UTC (permalink / raw)
  To: Georgios Kontaxis
  Cc: Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> Quite honestly, if a site claims to offer a 'patch' download UI but
>> returns corrupt data back, I would say it is much worse than not
>> offering the service at all.  Perhaps disabling the 'patch' feature
>> in repositories that enable 'privacy' feature may be a much better
>> approach.
>>
> Good point. I think I'll try that.

Alternatively, we could add 'captcha' to stop crawlers, perhaps?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
  2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
  2021-03-21 18:42     ` Junio C Hamano
@ 2021-03-22  6:57     ` Georgios Kontaxis via GitGitGadget
  2021-03-22 18:32       ` Junio C Hamano
  2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
  2 siblings, 2 replies; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-22  6:57 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis, Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This may result in unsolicited messages.
This is a feature for redacting e-mail addresses
from the generated HTML, etc. content.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it.
It aims to hinder the low-effort bulk collection of e-mail
addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: redacted e-mail addresses feature.
    
    Gitweb extracts content from the Git log and makes it accessible over
    HTTP. As a result, e-mail addresses found in commits are exposed to web
    crawlers and they may not respect robots.txt. This may result in
    unsolicited messages. This is a feature for redacting e-mail addresses
    from the generated HTML, etc. content.
    
    This feature does not prevent someone from downloading the unredacted
    commit log, e.g., by cloning the repository, and extracting information
    from it. It aims to hinder the low-effort bulk collection of e-mail
    addresses by web crawlers.
    
    Changes since v1:
    
     * Turned off the feature by default.
     * Removed duplicate code.
     * Added note about Gitweb consumers receiving redacted logs.
    
    Changes since v2:
    
     * The feature can be set on a per-project basis. ('override' => 1)
    
    Changes since v3:
    
     * Renamed feature to "email-privacy" and improved documentation.
     * Removed UI elements for git-format-patch since it won't be redacted.
     * Simplified calls to the address redaction logic.
     * Mail::Address is now used to reduce false-positive redactions.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v3:

 1:  930cdefe7ee0 ! 1:  03a3f41c37ef gitweb: redacted e-mail addresses feature.
     @@ Commit message
          It aims to hinder the low-effort bulk collection of e-mail
          addresses by web crawlers.
      
     -    Changes since v1:
     -    - Turned off the feature by default.
     -    - Removed duplicate code.
     -    - Added note about Gitweb consumers receiving redacted logs.
     -
     -    Changes since v2:
     -    - The feature can be set on a per-project basis. ('override' => 1)
     -
          Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
      
       ## Documentation/gitweb.conf.txt ##
     -@@ Documentation/gitweb.conf.txt: same as of the snippet above:
     - It is an error to specify a ref that does not pass "git check-ref-format"
     - scrutiny. Duplicated values are filtered.
     +@@ Documentation/gitweb.conf.txt: default font sizes or lineheights are changed (e.g. via adding extra
     + CSS stylesheet in `@stylesheets`), it may be appropriate to change
     + these values.
       
     -+email_privacy::
     -+    Redact e-mail addresses from the generated HTML, etc. content.
     -+    This hides e-mail addresses found in the commit log from web crawlers.
     -+    Disabled by default.
     -++
     -+It is highly recommended to enable this feature unless web crawlers are
     -+hindered in some other way. Note that crawlers intent on harvesting e-mail
     -+addresses may disregard robots.txt. You can enable this feature like so:
     -++
     -+---------------------------------------------------------------------------
     -+$feature{'email_privacy'}{'default'} = [1];
     -+---------------------------------------------------------------------------
     -++
     -+Note that if Gitweb is not the final step in a workflow then subsequent
     -+steps may misbehave because of the redacted information they receive.
     ++email-privacy::
     ++	Redact e-mail addresses from the generated HTML, etc. content.
     ++	This hides e-mail addresses found in the commit log from HTTP clients.
     ++	It is meant to hinder web crawlers that harvest and abuse addresses.
     ++	Such crawlers may not respect robots.txt.
     ++	Note that users and user tools also see the addresses redacted.
     ++	If Gitweb is not the final step in a workflow then subsequent steps
     ++	may misbehave because of the redacted information they receive.
     ++	Disabled by default.
      +
     - 
     - EXAMPLES
     - --------
     + highlight::
     + 	Server-side syntax highlight support in "blob" view.  It requires
     + 	`$highlight_bin` program to be available (see the description of
      
       ## gitweb/gitweb.perl ##
     +@@
     + use File::Basename qw(basename);
     + use Time::HiRes qw(gettimeofday tv_interval);
     + use Digest::MD5 qw(md5_hex);
     ++use Git::LoadCPAN::Mail::Address;
     + 
     + binmode STDOUT, ':utf8';
     + 
      @@ gitweb/gitweb.perl: sub evaluate_uri {
       		'sub' => \&feature_extra_branch_refs,
       		'override' => 0,
     @@ gitweb/gitweb.perl: sub evaluate_uri {
      +	# Redact e-mail addresses.
      +
      +	# To enable system wide have in $GITWEB_CONFIG
     -+	# $feature{'email_privacy'}{'default'} = [1];
     -+	'email_privacy' => {
     -+		'sub' => sub { feature_bool('email_privacy', @_) },
     ++	# $feature{'email-privacy'}{'default'} = [1];
     ++	'email-privacy' => {
     ++		'sub' => sub { feature_bool('email-privacy', @_) },
      +		'override' => 1,
      +		'default' => [0]},
       );
     @@ gitweb/gitweb.perl: sub parse_date {
       	return %date;
       }
       
     -+sub hide_mailaddr_if_private {
     -+	my $line = shift;
     -+	return $line unless (gitweb_check_feature('email_privacy') &&
     -+						$line =~ m/^([^<]+) <([^>]*)>/);
     -+	return hide_mailaddr($line)
     ++sub is_mailaddr {
     ++	my @addrs = Mail::Address->parse(shift);
     ++	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
     ++		return 0;
     ++	}
     ++	return 1;
      +}
      +
     -+sub hide_mailaddr {
     -+	my $mailaddr = shift;
     -+	$mailaddr =~ s/<([^>]*)>/<private>/;
     -+	return $mailaddr;
     ++sub hide_mailaddrs_if_private {
     ++	my $line = shift;
     ++	return $line unless gitweb_check_feature('email-privacy');
     ++	while ($line =~ m/(<[^>]+>)/g) {
     ++		my $match = $1;
     ++		if (!is_mailaddr($match)) {
     ++			next;
     ++		}
     ++		my $offset = pos $line;
     ++		my $head = substr $line, 0, $offset - length($match);
     ++		my $redaction = "<redacted>";
     ++		my $tail = substr $line, $offset;
     ++		$line = $head . $redaction . $tail;
     ++		pos $line = length($head) + length($redaction);
     ++	}
     ++	return $line;
      +}
      +
       sub parse_tag {
       	my $tag_id = shift;
       	my %tag;
      @@ gitweb/gitweb.perl: sub parse_tag {
     + 		} elsif ($line =~ m/^tag (.+)$/) {
     + 			$tag{'name'} = $1;
     + 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
     +-			$tag{'author'} = $1;
     ++			$tag{'author'} = hide_mailaddrs_if_private($1);
     + 			$tag{'author_epoch'} = $2;
     + 			$tag{'author_tz'} = $3;
       			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
     - 				$tag{'author_name'}  = $1;
     - 				$tag{'author_email'} = $2;
     -+				if (gitweb_check_feature('email_privacy')) {
     -+					$tag{'author_email'} = "private";
     -+					$tag{'author'} = hide_mailaddr($tag{'author'});
     -+				}
     - 			} else {
     - 				$tag{'author_name'} = $tag{'author'};
     - 			}
      @@ gitweb/gitweb.perl: sub parse_commit_text {
     + 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
     + 			push @parents, $1;
     + 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
     +-			$co{'author'} = to_utf8($1);
     ++			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
     + 			$co{'author_epoch'} = $2;
     + 			$co{'author_tz'} = $3;
       			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
     - 				$co{'author_name'}  = $1;
     - 				$co{'author_email'} = $2;
     -+				if (gitweb_check_feature('email_privacy')) {
     -+					$co{'author_email'} = "private";
     -+					$co{'author'} = hide_mailaddr($co{'author'});
     -+				}
     - 			} else {
     +@@ gitweb/gitweb.perl: sub parse_commit_text {
       				$co{'author_name'} = $co{'author'};
       			}
     -@@ gitweb/gitweb.perl: sub parse_commit_text {
     + 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
     +-			$co{'committer'} = to_utf8($1);
     ++			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
     + 			$co{'committer_epoch'} = $2;
     + 			$co{'committer_tz'} = $3;
       			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
     - 				$co{'committer_name'}  = $1;
     - 				$co{'committer_email'} = $2;
     -+				if (gitweb_check_feature('email_privacy')) {
     -+					$co{'committer_email'} = "private";
     -+					$co{'committer'} = hide_mailaddr($co{'committer'});
     -+				}
     - 			} else {
     - 				$co{'committer_name'} = $co{'committer'};
     - 			}
      @@ gitweb/gitweb.perl: sub parse_commit_text {
       	if (! defined $co{'title'} || $co{'title'} eq "") {
       		$co{'title'} = $co{'title_short'} = '(no commit message)';
     @@ gitweb/gitweb.perl: sub parse_commit_text {
      +	# remove added spaces, redact e-mail addresses if applicable.
       	foreach my $line (@commit_lines) {
       		$line =~ s/^    //;
     -+		$line = hide_mailaddr_if_private($line);
     ++		$line = hide_mailaddrs_if_private($line);
       	}
       	$co{'comment'} = \@commit_lines;
       
     -@@ gitweb/gitweb.perl: sub git_commitdiff {
     - 		close $fd
     - 			or print "Reading git-diff-tree failed\n";
     - 	} elsif ($format eq 'patch') {
     --		local $/ = undef;
     --		print <$fd>;
     -+		while (my $line = <$fd>) {
     -+			print hide_mailaddr_if_private($line);
     -+		}
     - 		close $fd
     - 			or print "Reading git-format-patch failed\n";
     +@@ gitweb/gitweb.perl: sub git_log_generic {
     + 			         -accesskey => "n", -title => "Alt-n"}, "next");
     + 	}
     + 	my $patch_max = gitweb_get_feature('patches');
     +-	if ($patch_max && !defined $file_name) {
     ++	if ($patch_max && !defined $file_name &&
     ++		!gitweb_check_feature('email-privacy')) {
     + 		if ($patch_max < 0 || @commitlist <= $patch_max) {
     + 			$paging_nav .= " &sdot; " .
     + 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
     +@@ gitweb/gitweb.perl: sub git_commit {
     + 			} @$parents ) .
     + 			')';
       	}
     +-	if (gitweb_check_feature('patches') && @$parents <= 1) {
     ++	if (gitweb_check_feature('patches') && @$parents <= 1 &&
     ++		!gitweb_check_feature('email-privacy')) {
     + 		$formats_nav .= " | " .
     + 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
     + 				"patch");
     +@@ gitweb/gitweb.perl: sub git_commitdiff {
     + 		$formats_nav =
     + 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
     + 			        "raw");
     +-		if ($patch_max && @{$co{'parents'}} <= 1) {
     ++		if ($patch_max && @{$co{'parents'}} <= 1 &&
     ++			!gitweb_check_feature('email-privacy')) {
     + 			$formats_nav .= " | " .
     + 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
     + 					"patch");
     +
     + ## t/lib-gitweb.sh ##
     +@@ t/lib-gitweb.sh: gitweb_run () {
     + 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
     + 	export GITWEB_CONFIG
     + 
     ++	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
     ++	export PERL5LIB
     ++
     + 	# some of git commands write to STDERR on error, but this is not
     + 	# written to web server logs, so we are not interested in that:
     + 	# we are interested only in properly formatted errors/warnings


 Documentation/gitweb.conf.txt | 10 +++++++
 gitweb/gitweb.perl            | 54 ++++++++++++++++++++++++++++++-----
 t/lib-gitweb.sh               |  3 ++
 3 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..47d85717ccaa 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -751,6 +751,16 @@ default font sizes or lineheights are changed (e.g. via adding extra
 CSS stylesheet in `@stylesheets`), it may be appropriate to change
 these values.
 
+email-privacy::
+	Redact e-mail addresses from the generated HTML, etc. content.
+	This hides e-mail addresses found in the commit log from HTTP clients.
+	It is meant to hinder web crawlers that harvest and abuse addresses.
+	Such crawlers may not respect robots.txt.
+	Note that users and user tools also see the addresses redacted.
+	If Gitweb is not the final step in a workflow then subsequent steps
+	may misbehave because of the redacted information they receive.
+	Disabled by default.
+
 highlight::
 	Server-side syntax highlight support in "blob" view.  It requires
 	`$highlight_bin` program to be available (see the description of
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..6630c76d92fd 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -21,6 +21,7 @@
 use File::Basename qw(basename);
 use Time::HiRes qw(gettimeofday tv_interval);
 use Digest::MD5 qw(md5_hex);
+use Git::LoadCPAN::Mail::Address;
 
 binmode STDOUT, ':utf8';
 
@@ -569,6 +570,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email-privacy'}{'default'} = [1];
+	'email-privacy' => {
+		'sub' => sub { feature_bool('email-privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3459,32 @@ sub parse_date {
 	return %date;
 }
 
+sub is_mailaddr {
+	my @addrs = Mail::Address->parse(shift);
+	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
+		return 0;
+	}
+	return 1;
+}
+
+sub hide_mailaddrs_if_private {
+	my $line = shift;
+	return $line unless gitweb_check_feature('email-privacy');
+	while ($line =~ m/(<[^>]+>)/g) {
+		my $match = $1;
+		if (!is_mailaddr($match)) {
+			next;
+		}
+		my $offset = pos $line;
+		my $head = substr $line, 0, $offset - length($match);
+		my $redaction = "<redacted>";
+		my $tail = substr $line, $offset;
+		$line = $head . $redaction . $tail;
+		pos $line = length($head) + length($redaction);
+	}
+	return $line;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3465,7 +3501,7 @@ sub parse_tag {
 		} elsif ($line =~ m/^tag (.+)$/) {
 			$tag{'name'} = $1;
 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
-			$tag{'author'} = $1;
+			$tag{'author'} = hide_mailaddrs_if_private($1);
 			$tag{'author_epoch'} = $2;
 			$tag{'author_tz'} = $3;
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3513,7 +3549,7 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = to_utf8($1);
+			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3523,7 +3559,7 @@ sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = to_utf8($1);
+			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3568,9 +3604,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddrs_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -7489,7 +7526,8 @@ sub git_log_generic {
 			         -accesskey => "n", -title => "Alt-n"}, "next");
 	}
 	my $patch_max = gitweb_get_feature('patches');
-	if ($patch_max && !defined $file_name) {
+	if ($patch_max && !defined $file_name &&
+		!gitweb_check_feature('email-privacy')) {
 		if ($patch_max < 0 || @commitlist <= $patch_max) {
 			$paging_nav .= " &sdot; " .
 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
@@ -7550,7 +7588,8 @@ sub git_commit {
 			} @$parents ) .
 			')';
 	}
-	if (gitweb_check_feature('patches') && @$parents <= 1) {
+	if (gitweb_check_feature('patches') && @$parents <= 1 &&
+		!gitweb_check_feature('email-privacy')) {
 		$formats_nav .= " | " .
 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
 				"patch");
@@ -7863,7 +7902,8 @@ sub git_commitdiff {
 		$formats_nav =
 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
 			        "raw");
-		if ($patch_max && @{$co{'parents'}} <= 1) {
+		if ($patch_max && @{$co{'parents'}} <= 1 &&
+			!gitweb_check_feature('email-privacy')) {
 			$formats_nav .= " | " .
 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
 					"patch");
diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
index 1f32ca66ea51..77fc1298d4c6 100644
--- a/t/lib-gitweb.sh
+++ b/t/lib-gitweb.sh
@@ -67,6 +67,9 @@ gitweb_run () {
 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
 	export GITWEB_CONFIG
 
+	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
+	export PERL5LIB
+
 	# some of git commands write to STDERR on error, but this is not
 	# written to web server logs, so we are not interested in that:
 	# we are interested only in properly formatted errors/warnings

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
@ 2021-03-22 18:32       ` Junio C Hamano
  2021-03-22 18:58         ` Georgios Kontaxis
  2021-03-23  4:27         ` Georgios Kontaxis
  2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
  1 sibling, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-22 18:32 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

[note to other reviewers. input from those who are more familiar
with gitweb and Perl is very much appreciated on this patch].

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.

"... are exposed to web crawlers, which spammers may use." would be
sufficient as a problem description.

After giving a problem description, it is customery to describe the
solution as if you are ordering the codebase to "be like so", so
instead of this ...

> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
    
... we may say something like

    Introduce an 'email-privacy' feature, which defaults to false,
    that redacts e-mail addresses that appear as author/committer
    info and in log messages from the generated HTML content.

> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

And this is a good thing to add.  Overall, nicely written.

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
>     gitweb: redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers and they may not respect robots.txt. This may result in
>     unsolicited messages. This is a feature for redacting e-mail addresses
>     from the generated HTML, etc. content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log, e.g., by cloning the repository, and extracting information
>     from it. It aims to hinder the low-effort bulk collection of e-mail
>     addresses by web crawlers.

You do not need to repeat the above, which is in the log message above.

>     Changes since v1:
>     
>      * Turned off the feature by default.
>      * Removed duplicate code.
>      * Added note about Gitweb consumers receiving redacted logs.
>     
>     Changes since v2:
>     
>      * The feature can be set on a per-project basis. ('override' => 1)
>     
>     Changes since v3:
>     
>      * Renamed feature to "email-privacy" and improved documentation.
>      * Removed UI elements for git-format-patch since it won't be redacted.
>      * Simplified calls to the address redaction logic.
>      * Mail::Address is now used to reduce false-positive redactions.

Having these under the --- line like this is very helpful.

> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..6630c76d92fd 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -21,6 +21,7 @@
>  use File::Basename qw(basename);
>  use Time::HiRes qw(gettimeofday tv_interval);
>  use Digest::MD5 qw(md5_hex);
> +use Git::LoadCPAN::Mail::Address;

I'll defer to others who are more familiar with gitweb and Perl
ecosystem if this is warranted, but I have a feeling that importing
and using Mail::Address->parse() only because we want to see if a
given "<string>" is an address is a bit overkill and it might be
sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i

> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email-privacy'}{'default'} = [1];
> +	'email-privacy' => {
> +		'sub' => sub { feature_bool('email-privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );
>  
>  sub gitweb_get_feature {
> @@ -3449,6 +3459,32 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub is_mailaddr {
> +	my @addrs = Mail::Address->parse(shift);
> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +sub hide_mailaddrs_if_private {
> +	my $line = shift;
> +	return $line unless gitweb_check_feature('email-privacy');
> +	while ($line =~ m/(<[^>]+>)/g) {
> +		my $match = $1;
> +		if (!is_mailaddr($match)) {
> +			next;
> +		}
> +		my $offset = pos $line;
> +		my $head = substr $line, 0, $offset - length($match);
> +		my $redaction = "<redacted>";
> +		my $tail = substr $line, $offset;
> +		$line = $head . $redaction . $tail;
> +		pos $line = length($head) + length($redaction);

Hmmmm, Perl suggestions from others?  It looks quite strange to see
that s/// operator is not used and replacement is done manually with
byte position in a Perl script.

>  sub parse_tag {
>  	my $tag_id = shift;
>  	my %tag;
> @@ -3465,7 +3501,7 @@ sub parse_tag {
>  		} elsif ($line =~ m/^tag (.+)$/) {
>  			$tag{'name'} = $1;
>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
> -			$tag{'author'} = $1;
> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>  			$tag{'author_epoch'} = $2;
>  			$tag{'author_tz'} = $3;
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {

This (and the others that follow the same pattern) looks sensible.

> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>  	}
>  	my $patch_max = gitweb_get_feature('patches');
> -	if ($patch_max && !defined $file_name) {
> +	if ($patch_max && !defined $file_name &&
> +		!gitweb_check_feature('email-privacy')) {
>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>  			$paging_nav .= " &sdot; " .
>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
> @@ -7550,7 +7588,8 @@ sub git_commit {
>  			} @$parents ) .
>  			')';
>  	}
> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
> +		!gitweb_check_feature('email-privacy')) {
>  		$formats_nav .= " | " .
>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  				"patch");
> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>  		$formats_nav =
>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>  			        "raw");
> -		if ($patch_max && @{$co{'parents'}} <= 1) {
> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
> +			!gitweb_check_feature('email-privacy')) {
>  			$formats_nav .= " | " .
>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  					"patch");

I wouldn't have expected to see the above three hunks in the
"patch" codepath.  Rather, I was hoping that you'd do something
like this at startup when you find out that the privacy feature
is enabled:

	$feature{'patches'}{'default'} = [0]
		if gitweb_get_feature('email-privacy');

so that nothing related to the 'patches' has to be modified.
That way, even if there were fourth place that can leak an e-mail
address in the 'patch' codepath that above three hunks in this patch
missed, crawlers won't be able to get at it, no?

> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
> index 1f32ca66ea51..77fc1298d4c6 100644
> --- a/t/lib-gitweb.sh
> +++ b/t/lib-gitweb.sh
> @@ -67,6 +67,9 @@ gitweb_run () {
>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>  	export GITWEB_CONFIG
>  
> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
> +	export PERL5LIB
> +

Why is this change suddenly become necessary?  This addition is only
about tests---do we need to do something similar in the runtime
environment when the updated gitweb that requires Mail::Address gets
deployed, or is that covered already somewhere else?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-22 18:32       ` Junio C Hamano
@ 2021-03-22 18:58         ` Georgios Kontaxis
  2021-03-28  1:41           ` Junio C Hamano
  2021-03-23  4:27         ` Georgios Kontaxis
  1 sibling, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-22 18:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> [note to other reviewers. input from those who are more familiar
> with gitweb and Perl is very much appreciated on this patch].
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>
> "... are exposed to web crawlers, which spammers may use." would be
> sufficient as a problem description.
>
Using robots.txt was suggested earlier in this review so
I want to make sure people don't rely on it instead.

> After giving a problem description, it is customery to describe the
> solution as if you are ordering the codebase to "be like so", so
> instead of this ...
>
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>
> ... we may say something like
>
>     Introduce an 'email-privacy' feature, which defaults to false,
>     that redacts e-mail addresses that appear as author/committer
>     info and in log messages from the generated HTML content.
>
Will rephrase.

>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> And this is a good thing to add.  Overall, nicely written.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers and they may not respect robots.txt. This may result in
>>     unsolicited messages. This is a feature for redacting e-mail
>> addresses
>>     from the generated HTML, etc. content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log, e.g., by cloning the repository, and extracting
>> information
>>     from it. It aims to hinder the low-effort bulk collection of e-mail
>>     addresses by web crawlers.
>
> You do not need to repeat the above, which is in the log message above.
>
>>     Changes since v1:
>>
>>      * Turned off the feature by default.
>>      * Removed duplicate code.
>>      * Added note about Gitweb consumers receiving redacted logs.
>>
>>     Changes since v2:
>>
>>      * The feature can be set on a per-project basis. ('override' => 1)
>>
>>     Changes since v3:
>>
>>      * Renamed feature to "email-privacy" and improved documentation.
>>      * Removed UI elements for git-format-patch since it won't be
>> redacted.
>>      * Simplified calls to the address redaction logic.
>>      * Mail::Address is now used to reduce false-positive redactions.
>
> Having these under the --- line like this is very helpful.
>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..6630c76d92fd 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -21,6 +21,7 @@
>>  use File::Basename qw(basename);
>>  use Time::HiRes qw(gettimeofday tv_interval);
>>  use Digest::MD5 qw(md5_hex);
>> +use Git::LoadCPAN::Mail::Address;
>
> I'll defer to others who are more familiar with gitweb and Perl
> ecosystem if this is warranted, but I have a feeling that importing
> and using Mail::Address->parse() only because we want to see if a
> given "<string>" is an address is a bit overkill and it might be
> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>
I think Mail::Address->parse() pretty much does the same thing.
It was suggested earlier in the review to use that function,
similar to what git-send-mail does.
I'm assuming because the maintainers of the Mail package are already
invested in the (accurate) parsing of e-mail addresses.

>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email-privacy'}{'default'} = [1];
>> +	'email-privacy' => {
>> +		'sub' => sub { feature_bool('email-privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3459,32 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub is_mailaddr {
>> +	my @addrs = Mail::Address->parse(shift);
>> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
>> +		return 0;
>> +	}
>> +	return 1;
>> +}
>> +
>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	while ($line =~ m/(<[^>]+>)/g) {
>> +		my $match = $1;
>> +		if (!is_mailaddr($match)) {
>> +			next;
>> +		}
>> +		my $offset = pos $line;
>> +		my $head = substr $line, 0, $offset - length($match);
>> +		my $redaction = "<redacted>";
>> +		my $tail = substr $line, $offset;
>> +		$line = $head . $redaction . $tail;
>> +		pos $line = length($head) + length($redaction);
>
> Hmmmm, Perl suggestions from others?  It looks quite strange to see
> that s/// operator is not used and replacement is done manually with
> byte position in a Perl script.
>
If there's a more elegant way to do the above we can certain do that instead.

>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3465,7 +3501,7 @@ sub parse_tag {
>>  		} elsif ($line =~ m/^tag (.+)$/) {
>>  			$tag{'name'} = $1;
>>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
>> -			$tag{'author'} = $1;
>> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>>  			$tag{'author_epoch'} = $2;
>>  			$tag{'author_tz'} = $3;
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>
> This (and the others that follow the same pattern) looks sensible.
>
>> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>>  	}
>>  	my $patch_max = gitweb_get_feature('patches');
>> -	if ($patch_max && !defined $file_name) {
>> +	if ($patch_max && !defined $file_name &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>>  			$paging_nav .= " &sdot; " .
>>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
>> @@ -7550,7 +7588,8 @@ sub git_commit {
>>  			} @$parents ) .
>>  			')';
>>  	}
>> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
>> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		$formats_nav .= " | " .
>>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  				"patch");
>> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>>  		$formats_nav =
>>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>>  			        "raw");
>> -		if ($patch_max && @{$co{'parents'}} <= 1) {
>> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
>> +			!gitweb_check_feature('email-privacy')) {
>>  			$formats_nav .= " | " .
>>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  					"patch");
>
> I wouldn't have expected to see the above three hunks in the
> "patch" codepath.  Rather, I was hoping that you'd do something
> like this at startup when you find out that the privacy feature
> is enabled:
>
> 	$feature{'patches'}{'default'} = [0]
> 		if gitweb_get_feature('email-privacy');
>
> so that nothing related to the 'patches' has to be modified.
> That way, even if there were fourth place that can leak an e-mail
> address in the 'patch' codepath that above three hunks in this patch
> missed, crawlers won't be able to get at it, no?
>
Or there's multiple feature flags / triggers that lead to the "patch"
codepath.
Either way is fine with me. (disabling the "patches" feature or preventing
the
generation of UI elements)
Like you said, someone who regularly reviews Gitweb code may have an opinion
on how do this.

>> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
>> index 1f32ca66ea51..77fc1298d4c6 100644
>> --- a/t/lib-gitweb.sh
>> +++ b/t/lib-gitweb.sh
>> @@ -67,6 +67,9 @@ gitweb_run () {
>>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>>  	export GITWEB_CONFIG
>>
>> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
>> +	export PERL5LIB
>> +
>
> Why is this change suddenly become necessary?  This addition is only
> about tests---do we need to do something similar in the runtime
> environment when the updated gitweb that requires Mail::Address gets
> deployed, or is that covered already somewhere else?
>
This is necessary because osx-clang, osx-gcc tests were failing.
If there's a better way to address this I'm open to suggestions.

I haven't been able to experience deployment on osx but on linux
i didn't have to change anything.

> Thanks.
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-22 18:32       ` Junio C Hamano
  2021-03-22 18:58         ` Georgios Kontaxis
@ 2021-03-23  4:27         ` Georgios Kontaxis
  1 sibling, 0 replies; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-23  4:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> [note to other reviewers. input from those who are more familiar
> with gitweb and Perl is very much appreciated on this patch].
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>
> "... are exposed to web crawlers, which spammers may use." would be
> sufficient as a problem description.
>
> After giving a problem description, it is customery to describe the
> solution as if you are ordering the codebase to "be like so", so
> instead of this ...
>
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>
> ... we may say something like
>
>     Introduce an 'email-privacy' feature, which defaults to false,
>     that redacts e-mail addresses that appear as author/committer
>     info and in log messages from the generated HTML content.
>
>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> And this is a good thing to add.  Overall, nicely written.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers and they may not respect robots.txt. This may result in
>>     unsolicited messages. This is a feature for redacting e-mail
>> addresses
>>     from the generated HTML, etc. content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log, e.g., by cloning the repository, and extracting
>> information
>>     from it. It aims to hinder the low-effort bulk collection of e-mail
>>     addresses by web crawlers.
>
> You do not need to repeat the above, which is in the log message above.
>
>>     Changes since v1:
>>
>>      * Turned off the feature by default.
>>      * Removed duplicate code.
>>      * Added note about Gitweb consumers receiving redacted logs.
>>
>>     Changes since v2:
>>
>>      * The feature can be set on a per-project basis. ('override' => 1)
>>
>>     Changes since v3:
>>
>>      * Renamed feature to "email-privacy" and improved documentation.
>>      * Removed UI elements for git-format-patch since it won't be
>> redacted.
>>      * Simplified calls to the address redaction logic.
>>      * Mail::Address is now used to reduce false-positive redactions.
>
> Having these under the --- line like this is very helpful.
>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..6630c76d92fd 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -21,6 +21,7 @@
>>  use File::Basename qw(basename);
>>  use Time::HiRes qw(gettimeofday tv_interval);
>>  use Digest::MD5 qw(md5_hex);
>> +use Git::LoadCPAN::Mail::Address;
>
> I'll defer to others who are more familiar with gitweb and Perl
> ecosystem if this is warranted, but I have a feeling that importing
> and using Mail::Address->parse() only because we want to see if a
> given "<string>" is an address is a bit overkill and it might be
> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>
>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email-privacy'}{'default'} = [1];
>> +	'email-privacy' => {
>> +		'sub' => sub { feature_bool('email-privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3459,32 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub is_mailaddr {
>> +	my @addrs = Mail::Address->parse(shift);
>> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
>> +		return 0;
>> +	}
>> +	return 1;
>> +}
>> +
>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	while ($line =~ m/(<[^>]+>)/g) {
>> +		my $match = $1;
>> +		if (!is_mailaddr($match)) {
>> +			next;
>> +		}
>> +		my $offset = pos $line;
>> +		my $head = substr $line, 0, $offset - length($match);
>> +		my $redaction = "<redacted>";
>> +		my $tail = substr $line, $offset;
>> +		$line = $head . $redaction . $tail;
>> +		pos $line = length($head) + length($redaction);
>
> Hmmmm, Perl suggestions from others?  It looks quite strange to see
> that s/// operator is not used and replacement is done manually with
> byte position in a Perl script.
>
>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3465,7 +3501,7 @@ sub parse_tag {
>>  		} elsif ($line =~ m/^tag (.+)$/) {
>>  			$tag{'name'} = $1;
>>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
>> -			$tag{'author'} = $1;
>> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>>  			$tag{'author_epoch'} = $2;
>>  			$tag{'author_tz'} = $3;
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>
> This (and the others that follow the same pattern) looks sensible.
>
>> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>>  	}
>>  	my $patch_max = gitweb_get_feature('patches');
>> -	if ($patch_max && !defined $file_name) {
>> +	if ($patch_max && !defined $file_name &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>>  			$paging_nav .= " &sdot; " .
>>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
>> @@ -7550,7 +7588,8 @@ sub git_commit {
>>  			} @$parents ) .
>>  			')';
>>  	}
>> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
>> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		$formats_nav .= " | " .
>>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  				"patch");
>> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>>  		$formats_nav =
>>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>>  			        "raw");
>> -		if ($patch_max && @{$co{'parents'}} <= 1) {
>> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
>> +			!gitweb_check_feature('email-privacy')) {
>>  			$formats_nav .= " | " .
>>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  					"patch");
>
> I wouldn't have expected to see the above three hunks in the
> "patch" codepath.  Rather, I was hoping that you'd do something
> like this at startup when you find out that the privacy feature
> is enabled:
>
> 	$feature{'patches'}{'default'} = [0]
> 		if gitweb_get_feature('email-privacy');
>
> so that nothing related to the 'patches' has to be modified.
> That way, even if there were fourth place that can leak an e-mail
> address in the 'patch' codepath that above three hunks in this patch
> missed, crawlers won't be able to get at it, no?
>
I forgot to mention one of the reasons for doing it this way.
If someone knows this URL exists and is relying on it (e.g., in a script)
they can still call it with "email-privacy" enabled.
The only thing that's changing is that the UI element (link) that a web
crawler would follow just because it's there is now gone.

But, as mentioned before, I don't feel strongly about either approach.
Let me know.

>> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
>> index 1f32ca66ea51..77fc1298d4c6 100644
>> --- a/t/lib-gitweb.sh
>> +++ b/t/lib-gitweb.sh
>> @@ -67,6 +67,9 @@ gitweb_run () {
>>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>>  	export GITWEB_CONFIG
>>
>> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
>> +	export PERL5LIB
>> +
>
> Why is this change suddenly become necessary?  This addition is only
> about tests---do we need to do something similar in the runtime
> environment when the updated gitweb that requires Mail::Address gets
> deployed, or is that covered already somewhere else?
>
> Thanks.
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
  2021-03-22 18:32       ` Junio C Hamano
@ 2021-03-27  3:56       ` Georgios Kontaxis via GitGitGadget
  2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
  2021-03-29  1:47         ` [PATCH v5] " Eric Wong
  1 sibling, 2 replies; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-27  3:56 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis, Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This can result in unsolicited messages.

Introduce an 'email-privacy' feature which redacts e-mail addresses
from the generated HTML content. Specifically, obscure addresses
retrieved from the the author/committer and comment sections of the
Git log. The feature is off by default.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it. It aims to hinder the low-
effort, bulk collection of e-mail addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: redacted e-mail addresses feature.
    
    Changes since v1:
    
     * Turned off the feature by default.
     * Removed duplicate code.
     * Added note about Gitweb consumers receiving redacted logs.
    
    Changes since v2:
    
     * The feature can be set on a per-project basis. ('override' => 1)
    
    Changes since v3:
    
     * Renamed feature to "email-privacy" and improved documentation.
     * Removed UI elements for git-format-patch since it won't be redacted.
     * Simplified calls to the address redaction logic.
     * Mail::Address is now used to reduce false-positive redactions.
    
    Changes since v4:
    
     * Rephrased the commit comment.
     * hide_mailaddrs_if_private is slighly more compact.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v4:

 1:  03a3f41c37ef ! 1:  1427231f9db5 gitweb: redacted e-mail addresses feature.
     @@ Commit message
          Gitweb extracts content from the Git log and makes it accessible
          over HTTP. As a result, e-mail addresses found in commits are
          exposed to web crawlers and they may not respect robots.txt.
     -    This may result in unsolicited messages.
     -    This is a feature for redacting e-mail addresses
     -    from the generated HTML, etc. content.
     +    This can result in unsolicited messages.
     +
     +    Introduce an 'email-privacy' feature which redacts e-mail addresses
     +    from the generated HTML content. Specifically, obscure addresses
     +    retrieved from the the author/committer and comment sections of the
     +    Git log. The feature is off by default.
      
          This feature does not prevent someone from downloading the
          unredacted commit log, e.g., by cloning the repository, and
     -    extracting information from it.
     -    It aims to hinder the low-effort bulk collection of e-mail
     -    addresses by web crawlers.
     +    extracting information from it. It aims to hinder the low-
     +    effort, bulk collection of e-mail addresses by web crawlers.
      
          Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
      
     @@ Documentation/gitweb.conf.txt: default font sizes or lineheights are changed (e.
       
      +email-privacy::
      +	Redact e-mail addresses from the generated HTML, etc. content.
     -+	This hides e-mail addresses found in the commit log from HTTP clients.
     ++	This obscures e-mail addresses retrieved from the author/committer
     ++	and comment sections of the Git log.
      +	It is meant to hinder web crawlers that harvest and abuse addresses.
      +	Such crawlers may not respect robots.txt.
     -+	Note that users and user tools also see the addresses redacted.
     ++	Note that users and user tools also see the addresses as redacted.
      +	If Gitweb is not the final step in a workflow then subsequent steps
      +	may misbehave because of the redacted information they receive.
      +	Disabled by default.
     @@ gitweb/gitweb.perl: sub parse_date {
      +		if (!is_mailaddr($match)) {
      +			next;
      +		}
     -+		my $offset = pos $line;
     -+		my $head = substr $line, 0, $offset - length($match);
     ++		my $match_offset = pos($line) - length($match);
     ++		pos $line = $match_offset;
     ++
      +		my $redaction = "<redacted>";
     -+		my $tail = substr $line, $offset;
     -+		$line = $head . $redaction . $tail;
     -+		pos $line = length($head) + length($redaction);
     ++		$line =~ s/\G(<[^>]+>)/$redaction/;
     ++
     ++		pos $line = $match_offset + length($redaction);
      +	}
      +	return $line;
      +}


 Documentation/gitweb.conf.txt | 11 +++++++
 gitweb/gitweb.perl            | 55 ++++++++++++++++++++++++++++++-----
 t/lib-gitweb.sh               |  3 ++
 3 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..34b1d6e22435 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -751,6 +751,17 @@ default font sizes or lineheights are changed (e.g. via adding extra
 CSS stylesheet in `@stylesheets`), it may be appropriate to change
 these values.
 
+email-privacy::
+	Redact e-mail addresses from the generated HTML, etc. content.
+	This obscures e-mail addresses retrieved from the author/committer
+	and comment sections of the Git log.
+	It is meant to hinder web crawlers that harvest and abuse addresses.
+	Such crawlers may not respect robots.txt.
+	Note that users and user tools also see the addresses as redacted.
+	If Gitweb is not the final step in a workflow then subsequent steps
+	may misbehave because of the redacted information they receive.
+	Disabled by default.
+
 highlight::
 	Server-side syntax highlight support in "blob" view.  It requires
 	`$highlight_bin` program to be available (see the description of
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..fe1dbc266ea7 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -21,6 +21,7 @@
 use File::Basename qw(basename);
 use Time::HiRes qw(gettimeofday tv_interval);
 use Digest::MD5 qw(md5_hex);
+use Git::LoadCPAN::Mail::Address;
 
 binmode STDOUT, ':utf8';
 
@@ -569,6 +570,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email-privacy'}{'default'} = [1];
+	'email-privacy' => {
+		'sub' => sub { feature_bool('email-privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3459,33 @@ sub parse_date {
 	return %date;
 }
 
+sub is_mailaddr {
+	my @addrs = Mail::Address->parse(shift);
+	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
+		return 0;
+	}
+	return 1;
+}
+
+sub hide_mailaddrs_if_private {
+	my $line = shift;
+	return $line unless gitweb_check_feature('email-privacy');
+	while ($line =~ m/(<[^>]+>)/g) {
+		my $match = $1;
+		if (!is_mailaddr($match)) {
+			next;
+		}
+		my $match_offset = pos($line) - length($match);
+		pos $line = $match_offset;
+
+		my $redaction = "<redacted>";
+		$line =~ s/\G(<[^>]+>)/$redaction/;
+
+		pos $line = $match_offset + length($redaction);
+	}
+	return $line;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3465,7 +3502,7 @@ sub parse_tag {
 		} elsif ($line =~ m/^tag (.+)$/) {
 			$tag{'name'} = $1;
 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
-			$tag{'author'} = $1;
+			$tag{'author'} = hide_mailaddrs_if_private($1);
 			$tag{'author_epoch'} = $2;
 			$tag{'author_tz'} = $3;
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3513,7 +3550,7 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = to_utf8($1);
+			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3523,7 +3560,7 @@ sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = to_utf8($1);
+			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3568,9 +3605,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddrs_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -7489,7 +7527,8 @@ sub git_log_generic {
 			         -accesskey => "n", -title => "Alt-n"}, "next");
 	}
 	my $patch_max = gitweb_get_feature('patches');
-	if ($patch_max && !defined $file_name) {
+	if ($patch_max && !defined $file_name &&
+		!gitweb_check_feature('email-privacy')) {
 		if ($patch_max < 0 || @commitlist <= $patch_max) {
 			$paging_nav .= " &sdot; " .
 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
@@ -7550,7 +7589,8 @@ sub git_commit {
 			} @$parents ) .
 			')';
 	}
-	if (gitweb_check_feature('patches') && @$parents <= 1) {
+	if (gitweb_check_feature('patches') && @$parents <= 1 &&
+		!gitweb_check_feature('email-privacy')) {
 		$formats_nav .= " | " .
 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
 				"patch");
@@ -7863,7 +7903,8 @@ sub git_commitdiff {
 		$formats_nav =
 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
 			        "raw");
-		if ($patch_max && @{$co{'parents'}} <= 1) {
+		if ($patch_max && @{$co{'parents'}} <= 1 &&
+			!gitweb_check_feature('email-privacy')) {
 			$formats_nav .= " | " .
 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
 					"patch");
diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
index 1f32ca66ea51..77fc1298d4c6 100644
--- a/t/lib-gitweb.sh
+++ b/t/lib-gitweb.sh
@@ -67,6 +67,9 @@ gitweb_run () {
 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
 	export GITWEB_CONFIG
 
+	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
+	export PERL5LIB
+
 	# some of git commands write to STDERR on error, but this is not
 	# written to web server logs, so we are not interested in that:
 	# we are interested only in properly formatted errors/warnings

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-22 18:58         ` Georgios Kontaxis
@ 2021-03-28  1:41           ` Junio C Hamano
  2021-03-28 21:43             ` Georgios Kontaxis
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-03-28  1:41 UTC (permalink / raw)
  To: Georgios Kontaxis
  Cc: Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> I'll defer to others who are more familiar with gitweb and Perl
>> ecosystem if this is warranted, but I have a feeling that importing
>> and using Mail::Address->parse() only because we want to see if a
>> given "<string>" is an address is a bit overkill and it might be
>> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>> ...
>>> +	while ($line =~ m/(<[^>]+>)/g) {
>>> +		my $match = $1;
>>> +		if (!is_mailaddr($match)) {
>>> +			next;
>>> +		}
>>> +		my $offset = pos $line;
>>> +		my $head = substr $line, 0, $offset - length($match);
>>> +		my $redaction = "<redacted>";
>>> +		my $tail = substr $line, $offset;
>>> +		$line = $head . $redaction . $tail;
>>> +		pos $line = length($head) + length($redaction);
>>
>> Hmmmm, Perl suggestions from others?  It looks quite strange to see
>> that s/// operator is not used and replacement is done manually with
>> byte position in a Perl script.
>>
> If there's a more elegant way to do the above we can certain do that instead.

For example, if we do not insist on using overkill Mail::Address->parse(),
we could do something silly like this:

	$line =~ s/<[^@>]+@[a-z0-9-.]+>/<redacted@address>/ig;

no?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-28  1:41           ` Junio C Hamano
@ 2021-03-28 21:43             ` Georgios Kontaxis
  2021-03-28 22:35               ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-28 21:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> "Georgios Kontaxis" <geko1702+commits@99rst.org> writes:
>
>>> I'll defer to others who are more familiar with gitweb and Perl
>>> ecosystem if this is warranted, but I have a feeling that importing
>>> and using Mail::Address->parse() only because we want to see if a
>>> given "<string>" is an address is a bit overkill and it might be
>>> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>>> ...
>>>> +	while ($line =~ m/(<[^>]+>)/g) {
>>>> +		my $match = $1;
>>>> +		if (!is_mailaddr($match)) {
>>>> +			next;
>>>> +		}
>>>> +		my $offset = pos $line;
>>>> +		my $head = substr $line, 0, $offset - length($match);
>>>> +		my $redaction = "<redacted>";
>>>> +		my $tail = substr $line, $offset;
>>>> +		$line = $head . $redaction . $tail;
>>>> +		pos $line = length($head) + length($redaction);
>>>
>>> Hmmmm, Perl suggestions from others?  It looks quite strange to see
>>> that s/// operator is not used and replacement is done manually with
>>> byte position in a Perl script.
>>>
>> If there's a more elegant way to do the above we can certain do that
>> instead.
>
> For example, if we do not insist on using overkill Mail::Address->parse(),
> we could do something silly like this:
>
> 	$line =~ s/<[^@>]+@[a-z0-9-.]+>/<redacted@address>/ig;
>
> no?
>
It's not clear if you think it's overkill because we have to depend on an
external module or because we don't need accurate parsing.

The above expression seems to get the job done but it isn't the correct
way to parse an e-mail address.
But if it's good enough for the reviewers I don't feel strongly about it.

If we prefer accurate parsing but don't like depending on Mail::Address,
it's easy to write complete expressions ourselves. (In a separate,
internal, Perl module)

Thoughts?



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
  2021-03-28 21:43             ` Georgios Kontaxis
@ 2021-03-28 22:35               ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-28 22:35 UTC (permalink / raw)
  To: Georgios Kontaxis
  Cc: Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

> It's not clear if you think it's overkill because we have to depend on an
> external module or because we don't need accurate parsing.

It is neither; what we need to parse is not exactly 'e-mail
addresses' as known to Mail::Address.

The thing is, unlike send-email that needs to interact with the
real-world MTAs and e-mail addresses, the codepaths we are talking
about are mostly about author/committer ident, where the definition
is quite narrower than the Mail::Address's "has to cover all the
possible ways to spell e-mail addresses under the sun" requirement.

Having said all that ...

> If we prefer accurate parsing but don't like depending on Mail::Address,
> it's easy to write complete expressions ourselves. (In a separate,
> internal, Perl module)

... I do not have strong opinions either way.  I won't have the
final say on the way things are done in Perl and what is done to
Gitweb.

It was just that what I saw earlier with the offsets and manual
parsing instead of s///g operator did not smell not like a Perl
program to me, and the message you are responding to was my reaction
to it.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
@ 2021-03-28 23:26         ` Georgios Kontaxis via GitGitGadget
  2021-03-29 20:00           ` Junio C Hamano
  2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
  2021-03-29  1:47         ` [PATCH v5] " Eric Wong
  1 sibling, 2 replies; 41+ messages in thread
From: Georgios Kontaxis via GitGitGadget @ 2021-03-28 23:26 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This can result in unsolicited messages.

Introduce an 'email-privacy' feature which redacts e-mail addresses
from the generated HTML content. Specifically, obscure addresses
retrieved from the the author/committer and comment sections of the
Git log. The feature is off by default.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it. It aims to hinder the low-
effort, bulk collection of e-mail addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: redacted e-mail addresses feature.
    
    Changes since v1:
    
     * Turned off the feature by default.
     * Removed duplicate code.
     * Added note about Gitweb consumers receiving redacted logs.
    
    Changes since v2:
    
     * The feature can be set on a per-project basis. ('override' => 1)
    
    Changes since v3:
    
     * Renamed feature to "email-privacy" and improved documentation.
     * Removed UI elements for git-format-patch since it won't be redacted.
     * Simplified calls to the address redaction logic.
     * Mail::Address is now used to reduce false-positive redactions.
    
    Changes since v4:
    
     * Rephrased the commit comment.
     * hide_mailaddrs_if_private is slighly more compact.
    
    Changes since v5:
    
     * A simple <local@domain> filter is used instead of Mail::Address to
       identify addresses.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v6
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v6
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v5:

 1:  1427231f9db5 ! 1:  245cfed8ad58 gitweb: redacted e-mail addresses feature.
     @@ Documentation/gitweb.conf.txt: default font sizes or lineheights are changed (e.
       	`$highlight_bin` program to be available (see the description of
      
       ## gitweb/gitweb.perl ##
     -@@
     - use File::Basename qw(basename);
     - use Time::HiRes qw(gettimeofday tv_interval);
     - use Digest::MD5 qw(md5_hex);
     -+use Git::LoadCPAN::Mail::Address;
     - 
     - binmode STDOUT, ':utf8';
     - 
      @@ gitweb/gitweb.perl: sub evaluate_uri {
       		'sub' => \&feature_extra_branch_refs,
       		'override' => 0,
     @@ gitweb/gitweb.perl: sub parse_date {
       	return %date;
       }
       
     -+sub is_mailaddr {
     -+	my @addrs = Mail::Address->parse(shift);
     -+	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
     -+		return 0;
     -+	}
     -+	return 1;
     -+}
     -+
      +sub hide_mailaddrs_if_private {
      +	my $line = shift;
      +	return $line unless gitweb_check_feature('email-privacy');
     -+	while ($line =~ m/(<[^>]+>)/g) {
     -+		my $match = $1;
     -+		if (!is_mailaddr($match)) {
     -+			next;
     -+		}
     -+		my $match_offset = pos($line) - length($match);
     -+		pos $line = $match_offset;
     -+
     -+		my $redaction = "<redacted>";
     -+		$line =~ s/\G(<[^>]+>)/$redaction/;
     -+
     -+		pos $line = $match_offset + length($redaction);
     -+	}
     ++	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;
      +	return $line;
      +}
      +
     @@ gitweb/gitweb.perl: sub git_commitdiff {
       			$formats_nav .= " | " .
       				$cgi->a({-href => href(action=>"patch", -replay=>1)},
       					"patch");
     -
     - ## t/lib-gitweb.sh ##
     -@@ t/lib-gitweb.sh: gitweb_run () {
     - 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
     - 	export GITWEB_CONFIG
     - 
     -+	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
     -+	export PERL5LIB
     -+
     - 	# some of git commands write to STDERR on error, but this is not
     - 	# written to web server logs, so we are not interested in that:
     - 	# we are interested only in properly formatted errors/warnings


 Documentation/gitweb.conf.txt | 11 +++++++++++
 gitweb/gitweb.perl            | 34 +++++++++++++++++++++++++++-------
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..34b1d6e22435 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -751,6 +751,17 @@ default font sizes or lineheights are changed (e.g. via adding extra
 CSS stylesheet in `@stylesheets`), it may be appropriate to change
 these values.
 
+email-privacy::
+	Redact e-mail addresses from the generated HTML, etc. content.
+	This obscures e-mail addresses retrieved from the author/committer
+	and comment sections of the Git log.
+	It is meant to hinder web crawlers that harvest and abuse addresses.
+	Such crawlers may not respect robots.txt.
+	Note that users and user tools also see the addresses as redacted.
+	If Gitweb is not the final step in a workflow then subsequent steps
+	may misbehave because of the redacted information they receive.
+	Disabled by default.
+
 highlight::
 	Server-side syntax highlight support in "blob" view.  It requires
 	`$highlight_bin` program to be available (see the description of
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..01c6faf88006 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -569,6 +569,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email-privacy'}{'default'} = [1];
+	'email-privacy' => {
+		'sub' => sub { feature_bool('email-privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3458,13 @@ sub parse_date {
 	return %date;
 }
 
+sub hide_mailaddrs_if_private {
+	my $line = shift;
+	return $line unless gitweb_check_feature('email-privacy');
+	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;
+	return $line;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3465,7 +3481,7 @@ sub parse_tag {
 		} elsif ($line =~ m/^tag (.+)$/) {
 			$tag{'name'} = $1;
 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
-			$tag{'author'} = $1;
+			$tag{'author'} = hide_mailaddrs_if_private($1);
 			$tag{'author_epoch'} = $2;
 			$tag{'author_tz'} = $3;
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3513,7 +3529,7 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = to_utf8($1);
+			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3523,7 +3539,7 @@ sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = to_utf8($1);
+			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3568,9 +3584,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddrs_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -7489,7 +7506,8 @@ sub git_log_generic {
 			         -accesskey => "n", -title => "Alt-n"}, "next");
 	}
 	my $patch_max = gitweb_get_feature('patches');
-	if ($patch_max && !defined $file_name) {
+	if ($patch_max && !defined $file_name &&
+		!gitweb_check_feature('email-privacy')) {
 		if ($patch_max < 0 || @commitlist <= $patch_max) {
 			$paging_nav .= " &sdot; " .
 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
@@ -7550,7 +7568,8 @@ sub git_commit {
 			} @$parents ) .
 			')';
 	}
-	if (gitweb_check_feature('patches') && @$parents <= 1) {
+	if (gitweb_check_feature('patches') && @$parents <= 1 &&
+		!gitweb_check_feature('email-privacy')) {
 		$formats_nav .= " | " .
 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
 				"patch");
@@ -7863,7 +7882,8 @@ sub git_commitdiff {
 		$formats_nav =
 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
 			        "raw");
-		if ($patch_max && @{$co{'parents'}} <= 1) {
+		if ($patch_max && @{$co{'parents'}} <= 1 &&
+			!gitweb_check_feature('email-privacy')) {
 			$formats_nav .= " | " .
 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
 					"patch");

base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
  2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
@ 2021-03-29  1:47         ` Eric Wong
  2021-03-29  3:17           ` Georgios Kontaxis
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Wong @ 2021-03-29  1:47 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This can result in unsolicited messages.

> Introduce an 'email-privacy' feature which redacts e-mail addresses
> from the generated HTML content

A general reply to the topic: have you considered munging
addresses in a way that is still human readable, but obviously
obfuscated?

On some other project, I settled on HTML "&#8226;" as a replacement
for '.' for admins who enable that option.  The $USER@$NO_DOT
remains as-is for easy identification+recognition of hosts.

I also considered Unicode homographs which can look identical
to replacement characters, too; but rejected that idea since
it would cause grief for legitimate users who would not notice
the homograph when pasting into their mail client.

Anyways, here's the list of candidates I tried:

homograph∂80x24.org
homograph@80x24ͺorg
homograph@80x24·org
homograph@80x24•org
homograph@80x24.org
homograph﹫80x24.org

https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
https://en.wikipedia.org/wiki/Enclosed_A

homographⒶ80x24.org
homograph@80x24 org
homograph@80x24․org
homograph@80x24ꓸorg

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-03-29  1:47         ` [PATCH v5] " Eric Wong
@ 2021-03-29  3:17           ` Georgios Kontaxis
  2021-04-08 17:16             ` Eric Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Georgios Kontaxis @ 2021-03-29  3:17 UTC (permalink / raw)
  To: Eric Wong
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

> Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>
>> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> from the generated HTML content
>
> A general reply to the topic: have you considered munging
> addresses in a way that is still human readable, but obviously
> obfuscated?
>
> On some other project, I settled on HTML "&#8226;" as a replacement
> for '.' for admins who enable that option.  The $USER@$NO_DOT
> remains as-is for easy identification+recognition of hosts.
>
Thanks for the suggestion.

People have been trying to hinder address harvesting for a while now.
Replacing '@' with "at", the dot with "dot", adding spaces, etc.
was pretty common at some point. May still be.
I would expect crawlers to have caught up and this includes
all sorts of character encodings and unicode look-alike substitutions.

At the end of the day we are looking for something that's easy for humans
to read but hard for scripts to parse as an e-mail address.
(And that scripts cannot learn through an additional regex)
I'm not aware of anything like that. (I know CAPTCHAs, etc.)

> I also considered Unicode homographs which can look identical
> to replacement characters, too; but rejected that idea since
> it would cause grief for legitimate users who would not notice
> the homograph when pasting into their mail client.
>
> Anyways, here's the list of candidates I tried:
>
> homograph∂80x24.org
> homograph@80x24ͺorg
> homograph@80x24·org
> homograph@80x24•org
> homograph@80x24.org
> homograph﹫80x24.org
>
> https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
> https://en.wikipedia.org/wiki/Enclosed_A
>
> homographⒶ80x24.org
> homograph@80x24 org
> homograph@80x24․org
> homograph@80x24ꓸorg
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
@ 2021-03-29 20:00           ` Junio C Hamano
  2021-03-31 21:14             ` Junio C Hamano
  2021-04-06  0:56             ` Junio C Hamano
  2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-29 20:00 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This can result in unsolicited messages.
>
> Introduce an 'email-privacy' feature which redacts e-mail addresses
> from the generated HTML content. Specifically, obscure addresses
> retrieved from the the author/committer and comment sections of the
> Git log. The feature is off by default.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it. It aims to hinder the low-
> effort, bulk collection of e-mail addresses by web crawlers.
>
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---

> @@ -751,6 +751,17 @@ default font sizes or lineheights are changed (e.g. via adding extra
>  CSS stylesheet in `@stylesheets`), it may be appropriate to change
>  these values.
>  
> +email-privacy::
> +	Redact e-mail addresses from the generated HTML, etc. content.
> +	This obscures e-mail addresses retrieved from the author/committer
> +	and comment sections of the Git log.
> +	It is meant to hinder web crawlers that harvest and abuse addresses.
> +	Such crawlers may not respect robots.txt.
> +	Note that users and user tools also see the addresses as redacted.
> +	If Gitweb is not the final step in a workflow then subsequent steps
> +	may misbehave because of the redacted information they receive.
> +	Disabled by default.

OK.  I still think everything after "Note that" is a bit redandant
(as any intelligent reader can read what the feature does and reach
its natural consequence themselves), but I do not strongly oppose
leaving it in.

> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..01c6faf88006 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email-privacy'}{'default'} = [1];
> +	'email-privacy' => {
> +		'sub' => sub { feature_bool('email-privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );

Sensible.

While reviewing this part, I noticed a few things

 - A few other features explain how to toggle it system wide and
   nothing else, like this one does.

 - Others explain, in addition, talk about how to toggle per
   repository (which may necessitate explaining how to toggle the
   override bit if the default is false).

It might be a good idea to standardise the explanation somehow.

This is just an observation (in other words, nothing needs to be
done in this patch).


> @@ -3449,6 +3458,13 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub hide_mailaddrs_if_private {
> +	my $line = shift;
> +	return $line unless gitweb_check_feature('email-privacy');
> +	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;
> +	return $line;
> +}

OK.  The "this catches AUTHOR_EMAIL" pattern is a bit embarrassing
(my fault); "s/<[^@>]+@[-a-z0-9.]+>/<redacted>/ig" might be less
embarrassing but I do not care deeply enough either way.  The version
in this patch is probably closer to what ident.c uses anyway ;-)

> @@ -7489,7 +7506,8 @@ sub git_log_generic {
>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>  	}
>  	my $patch_max = gitweb_get_feature('patches');
> -	if ($patch_max && !defined $file_name) {
> +	if ($patch_max && !defined $file_name &&
> +		!gitweb_check_feature('email-privacy')) {
>  		if ($patch_max < 0 || @commitlist <= $patch_max) {

An observation unrelated to this change.  I think checking for
negative patch_max is a bug in the original code.  Everywhere else,
the way to disable the 'patch' view is to set it to 0, not negative,
and this block is already protected with "if ($patch_max".

> @@ -7550,7 +7568,8 @@ sub git_commit {
>  			} @$parents ) .
>  			')';
>  	}
> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
> +		!gitweb_check_feature('email-privacy')) {
>  		$formats_nav .= " | " .
>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  				"patch");
> @@ -7863,7 +7882,8 @@ sub git_commitdiff {
>  		$formats_nav =
>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>  			        "raw");
> -		if ($patch_max && @{$co{'parents'}} <= 1) {
> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
> +			!gitweb_check_feature('email-privacy')) {
>  			$formats_nav .= " | " .
>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  					"patch");

I see your wish is "we do not show any link to a patch page, but
anybody who knows what a link to the patch page looks like can craft
and ask", but I am not sure if that is worth the above three hunks.
It still feels more robust to just disable the 'patches' feature
when the email-privacy feature is enabled, but that may be just me.

Will queue as-is.  Input from those who are more adept at Perl
and/or interested in helping polish gitweb are still welcome, but at
my level of interest on the topic, this version looks as good as it
gets ;-)

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-03-29 20:00           ` Junio C Hamano
@ 2021-03-31 21:14             ` Junio C Hamano
  2021-04-06  0:56             ` Junio C Hamano
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-03-31 21:14 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

Junio C Hamano <gitster@pobox.com> writes:

> Will queue as-is.  Input from those who are more adept at Perl
> and/or interested in helping polish gitweb are still welcome, but at
> my level of interest on the topic, this version looks as good as it
> gets ;-)

Oh by the way, as nobody commented on the title, I'd have to X-<.

I'll retitle it to

	gitweb: add "e-mail privacy" feature to redact e-mail addresses

while queuing.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-03-29 20:00           ` Junio C Hamano
  2021-03-31 21:14             ` Junio C Hamano
@ 2021-04-06  0:56             ` Junio C Hamano
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-04-06  0:56 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, brian m. carlson,
	Georgios Kontaxis

Junio C Hamano <gitster@pobox.com> writes:

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>> ...
> Will queue as-is.  Input from those who are more adept at Perl
> and/or interested in helping polish gitweb are still welcome, but at
> my level of interest on the topic, this version looks as good as it
> gets ;-)

Comments by anybody who is more adept at Perl and/or more well
versed in gitweb and e-mail privacy issues?

Otherwise I am inclined to merge it down to 'next' as-is in a couple
of days.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-03-29  3:17           ` Georgios Kontaxis
@ 2021-04-08 17:16             ` Eric Wong
  2021-04-08 21:04               ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Wong @ 2021-04-08 17:16 UTC (permalink / raw)
  To: Georgios Kontaxis
  Cc: Georgios Kontaxis via GitGitGadget, git,
	"Ævar Arnfjörð Bjarmason",
	brian m. carlson

Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
> >> from the generated HTML content
> >
> Eric Wong wrote:
> > A general reply to the topic: have you considered munging
> > addresses in a way that is still human readable, but obviously
> > obfuscated?
> >
> > On some other project, I settled on HTML "&#8226;" as a replacement
> > for '.' for admins who enable that option.  The $USER@$NO_DOT
> > remains as-is for easy identification+recognition of hosts.
> >
> Thanks for the suggestion.
> 
> People have been trying to hinder address harvesting for a while now.
> Replacing '@' with "at", the dot with "dot", adding spaces, etc.
> was pretty common at some point. May still be.
> I would expect crawlers to have caught up and this includes
> all sorts of character encodings and unicode look-alike substitutions.

I figure the crawlers hit a combinatorial explosion and
give up since they'd be wasting time with false-positives.

> > I also considered Unicode homographs which can look identical
> > to replacement characters, too; but rejected that idea since
> > it would cause grief for legitimate users who would not notice
> > the homograph when pasting into their mail client.

As a data point, none of the homograph@ candidates I posted here
on Mar 29 have attracted any attempts on my mail server.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-04-08 17:16             ` Eric Wong
@ 2021-04-08 21:04               ` Junio C Hamano
  2021-04-08 21:19                 ` Eric Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-04-08 21:04 UTC (permalink / raw)
  To: Eric Wong
  Cc: Georgios Kontaxis, Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

Eric Wong <e@80x24.org> writes:

> Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
>> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> >> from the generated HTML content
>> >
>> Eric Wong wrote:
>> > A general reply to the topic: have you considered munging
>> > addresses in a way that is still human readable, but obviously
>> > obfuscated?
>> ...
>> > I also considered Unicode homographs which can look identical
>> > to replacement characters, too; but rejected that idea since
>> > it would cause grief for legitimate users who would not notice
>> > the homograph when pasting into their mail client.
>
> As a data point, none of the homograph@ candidates I posted here
> on Mar 29 have attracted any attempts on my mail server.

That is an interesting observation.  All homograph@ non-addresses,
if a human corrected the funnies in their spelling, would have hit
whoever handles @80x24.org mailboxes.

I take it to mean that as a future direction, replacing <redacted>
with the obfuscated-but-readable-by-humans homographs is a likely
improvement that would help human users while still inconveniencing
the crawlers.  It may however need some provision to prevent casual
end-users from cutting-and-pasting these homographs, as you said in
your original mention of the homograph approach.

But other than that, does the patch look reasonable?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-04-08 21:04               ` Junio C Hamano
@ 2021-04-08 21:19                 ` Eric Wong
  2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Wong @ 2021-04-08 21:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Georgios Kontaxis, Georgios Kontaxis via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, brian m. carlson

Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > As a data point, none of the homograph@ candidates I posted here
> > on Mar 29 have attracted any attempts on my mail server.
> 
> That is an interesting observation.  All homograph@ non-addresses,
> if a human corrected the funnies in their spelling, would have hit
> whoever handles @80x24.org mailboxes.
> 
> I take it to mean that as a future direction, replacing <redacted>
> with the obfuscated-but-readable-by-humans homographs is a likely
> improvement that would help human users while still inconveniencing
> the crawlers.  It may however need some provision to prevent casual
> end-users from cutting-and-pasting these homographs, as you said in
> your original mention of the homograph approach.

Yes, exactly.

> But other than that, does the patch look reasonable?

I only took a cursory glance at it, but v6 seemed fine.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
  2021-03-29 20:00           ` Junio C Hamano
@ 2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
  2021-04-08 22:51             ` Junio C Hamano
  1 sibling, 1 reply; 41+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-04-08 22:43 UTC (permalink / raw)
  To: Georgios Kontaxis via GitGitGadget
  Cc: git, brian m. carlson, Georgios Kontaxis


On Mon, Mar 29 2021, Georgios Kontaxis via GitGitGadget wrote:

> [...]
> +email-privacy::
> +	Redact e-mail addresses from the generated HTML, etc. content.
> +	This obscures e-mail addresses retrieved from the author/committer
> +	and comment sections of the Git log.
> +	It is meant to hinder web crawlers that harvest and abuse addresses.
> +	Such crawlers may not respect robots.txt.
> +	Note that users and user tools also see the addresses as redacted.
> +	If Gitweb is not the final step in a workflow then subsequent steps
> +	may misbehave because of the redacted information they receive.
> +	Disabled by default.
> +
>  highlight::
>  	Server-side syntax highlight support in "blob" view.  It requires
>  	`$highlight_bin` program to be available (see the description of
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..01c6faf88006 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email-privacy'}{'default'} = [1];
> +	'email-privacy' => {
> +		'sub' => sub { feature_bool('email-privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );
>  
>  sub gitweb_get_feature {
> @@ -3449,6 +3458,13 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub hide_mailaddrs_if_private {
> +	my $line = shift;
> +	return $line unless gitweb_check_feature('email-privacy');
> +	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;

The /i here is redundant, since you have nothing that'll case-fold on
the LHS of the s///, doesn't harm anything either. Just a small note
since it's new in v6...

> +	return $line;
> +}
> +
>  sub parse_tag {
>  	my $tag_id = shift;
>  	my %tag;
> @@ -3465,7 +3481,7 @@ sub parse_tag {
>  		} elsif ($line =~ m/^tag (.+)$/) {
>  			$tag{'name'} = $1;
>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
> -			$tag{'author'} = $1;
> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>  			$tag{'author_epoch'} = $2;
>  			$tag{'author_tz'} = $3;
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3513,7 +3529,7 @@ sub parse_commit_text {
>  		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
>  			push @parents, $1;
>  		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
> -			$co{'author'} = to_utf8($1);
> +			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
>  			$co{'author_epoch'} = $2;
>  			$co{'author_tz'} = $3;
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3523,7 +3539,7 @@ sub parse_commit_text {
>  				$co{'author_name'} = $co{'author'};
>  			}
>  		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
> -			$co{'committer'} = to_utf8($1);
> +			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
>  			$co{'committer_epoch'} = $2;
>  			$co{'committer_tz'} = $3;
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3568,9 +3584,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddrs_if_private($line);
>  	}
>  	$co{'comment'} = \@commit_lines;
>  
> @@ -7489,7 +7506,8 @@ sub git_log_generic {
>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>  	}
>  	my $patch_max = gitweb_get_feature('patches');
> -	if ($patch_max && !defined $file_name) {
> +	if ($patch_max && !defined $file_name &&
> +		!gitweb_check_feature('email-privacy')) {
>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>  			$paging_nav .= " &sdot; " .
>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
> @@ -7550,7 +7568,8 @@ sub git_commit {
>  			} @$parents ) .
>  			')';
>  	}
> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
> +		!gitweb_check_feature('email-privacy')) {
>  		$formats_nav .= " | " .
>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  				"patch");
> @@ -7863,7 +7882,8 @@ sub git_commitdiff {
>  		$formats_nav =
>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>  			        "raw");
> -		if ($patch_max && @{$co{'parents'}} <= 1) {
> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
> +			!gitweb_check_feature('email-privacy')) {
>  			$formats_nav .= " | " .
>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  					"patch");

I didn't run this, and hadn't kept up for a few rounds. I'm happy to see
the pos/while etc. looping gone, this LGTM.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-04-08 21:19                 ` Eric Wong
@ 2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
  2021-04-08 22:54                     ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-04-08 22:45 UTC (permalink / raw)
  To: Eric Wong
  Cc: Junio C Hamano, Georgios Kontaxis,
	Georgios Kontaxis via GitGitGadget, git, brian m. carlson


On Thu, Apr 08 2021, Eric Wong wrote:

> Junio C Hamano <gitster@pobox.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > As a data point, none of the homograph@ candidates I posted here
>> > on Mar 29 have attracted any attempts on my mail server.
>> 
>> That is an interesting observation.  All homograph@ non-addresses,
>> if a human corrected the funnies in their spelling, would have hit
>> whoever handles @80x24.org mailboxes.
>> 
>> I take it to mean that as a future direction, replacing <redacted>
>> with the obfuscated-but-readable-by-humans homographs is a likely
>> improvement that would help human users while still inconveniencing
>> the crawlers.  It may however need some provision to prevent casual
>> end-users from cutting-and-pasting these homographs, as you said in
>> your original mention of the homograph approach.
>
> Yes, exactly.
>
>> But other than that, does the patch look reasonable?
>
> I only took a cursory glance at it, but v6 seemed fine.

Ditto, I left a small nit comment about a needless /i in a regex, but I
don't think that needs a re-roll.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6] gitweb: redacted e-mail addresses feature.
  2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
@ 2021-04-08 22:51             ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-04-08 22:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Georgios Kontaxis via GitGitGadget, git, brian m. carlson,
	Georgios Kontaxis

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;
>
> The /i here is redundant, since you have nothing that'll case-fold on
> the LHS of the s///, doesn't harm anything either. Just a small note
> since it's new in v6...

True.  If it were left original version that was suggested during
the review, e.g.

	s/<[^@]+@[-.a-z0-9]+>/<red@cted>/ig;

the /i/ would have been needed, but without the "after the @-sign
should be a run of DNS-valid characters", I agree that there is no
need.

Thanks.  Will locally tweak.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5] gitweb: redacted e-mail addresses feature.
  2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
@ 2021-04-08 22:54                     ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-04-08 22:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Eric Wong, Georgios Kontaxis, Georgios Kontaxis via GitGitGadget,
	git, brian m. carlson

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Apr 08 2021, Eric Wong wrote:
>
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Eric Wong <e@80x24.org> writes:
>>> > As a data point, none of the homograph@ candidates I posted here
>>> > on Mar 29 have attracted any attempts on my mail server.
>>> 
>>> That is an interesting observation.  All homograph@ non-addresses,
>>> if a human corrected the funnies in their spelling, would have hit
>>> whoever handles @80x24.org mailboxes.
>>> 
>>> I take it to mean that as a future direction, replacing <redacted>
>>> with the obfuscated-but-readable-by-humans homographs is a likely
>>> improvement that would help human users while still inconveniencing
>>> the crawlers.  It may however need some provision to prevent casual
>>> end-users from cutting-and-pasting these homographs, as you said in
>>> your original mention of the homograph approach.
>>
>> Yes, exactly.
>>
>>> But other than that, does the patch look reasonable?
>>
>> I only took a cursory glance at it, but v6 seemed fine.
>
> Ditto, I left a small nit comment about a needless /i in a regex, but I
> don't think that needs a re-roll.

Thanks, both.

Will tweak the /i out, and re-queue with acked-by from you two.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-04-08 22:54 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
2021-03-21  1:27   ` brian m. carlson
2021-03-21  3:30   ` Georgios Kontaxis
2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
2021-03-21 18:48       ` Junio C Hamano
2021-03-21 19:48       ` Georgios Kontaxis
2021-03-21 18:42     ` Junio C Hamano
2021-03-21 18:57       ` Junio C Hamano
2021-03-21 19:05         ` Junio C Hamano
2021-03-21 20:07       ` Georgios Kontaxis
2021-03-21 22:17         ` Junio C Hamano
2021-03-21 23:14           ` Georgios Kontaxis
2021-03-22  4:25             ` Junio C Hamano
2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
2021-03-22 18:32       ` Junio C Hamano
2021-03-22 18:58         ` Georgios Kontaxis
2021-03-28  1:41           ` Junio C Hamano
2021-03-28 21:43             ` Georgios Kontaxis
2021-03-28 22:35               ` Junio C Hamano
2021-03-23  4:27         ` Georgios Kontaxis
2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
2021-03-29 20:00           ` Junio C Hamano
2021-03-31 21:14             ` Junio C Hamano
2021-04-06  0:56             ` Junio C Hamano
2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
2021-04-08 22:51             ` Junio C Hamano
2021-03-29  1:47         ` [PATCH v5] " Eric Wong
2021-03-29  3:17           ` Georgios Kontaxis
2021-04-08 17:16             ` Eric Wong
2021-04-08 21:04               ` Junio C Hamano
2021-04-08 21:19                 ` Eric Wong
2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
2021-04-08 22:54                     ` Junio C Hamano
2021-03-21  6:00 ` [PATCH] " Junio C Hamano
2021-03-21  6:18   ` Junio C Hamano
2021-03-21  6:43   ` Georgios Kontaxis
2021-03-21 16:55     ` Junio C Hamano

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).