git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Georgios Kontaxis" <geko1702+commits@99rst.org>
To: "Junio C Hamano" <gitster@pobox.com>
Cc: "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org,
	"\"Ævar Arnfjörð Bjarmason\"" <avarab@gmail.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
Date: Tue, 23 Mar 2021 04:27:10 -0000	[thread overview]
Message-ID: <f604de775736756180d382d2f54e8b1c.squirrel@mail.kodaksys.org> (raw)
In-Reply-To: <xmqqlfaf6nu9.fsf@gitster.g>

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> [note to other reviewers. input from those who are more familiar
> with gitweb and Perl is very much appreciated on this patch].
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>
> "... are exposed to web crawlers, which spammers may use." would be
> sufficient as a problem description.
>
> After giving a problem description, it is customery to describe the
> solution as if you are ordering the codebase to "be like so", so
> instead of this ...
>
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>
> ... we may say something like
>
>     Introduce an 'email-privacy' feature, which defaults to false,
>     that redacts e-mail addresses that appear as author/committer
>     info and in log messages from the generated HTML content.
>
>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> And this is a good thing to add.  Overall, nicely written.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers and they may not respect robots.txt. This may result in
>>     unsolicited messages. This is a feature for redacting e-mail
>> addresses
>>     from the generated HTML, etc. content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log, e.g., by cloning the repository, and extracting
>> information
>>     from it. It aims to hinder the low-effort bulk collection of e-mail
>>     addresses by web crawlers.
>
> You do not need to repeat the above, which is in the log message above.
>
>>     Changes since v1:
>>
>>      * Turned off the feature by default.
>>      * Removed duplicate code.
>>      * Added note about Gitweb consumers receiving redacted logs.
>>
>>     Changes since v2:
>>
>>      * The feature can be set on a per-project basis. ('override' => 1)
>>
>>     Changes since v3:
>>
>>      * Renamed feature to "email-privacy" and improved documentation.
>>      * Removed UI elements for git-format-patch since it won't be
>> redacted.
>>      * Simplified calls to the address redaction logic.
>>      * Mail::Address is now used to reduce false-positive redactions.
>
> Having these under the --- line like this is very helpful.
>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..6630c76d92fd 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -21,6 +21,7 @@
>>  use File::Basename qw(basename);
>>  use Time::HiRes qw(gettimeofday tv_interval);
>>  use Digest::MD5 qw(md5_hex);
>> +use Git::LoadCPAN::Mail::Address;
>
> I'll defer to others who are more familiar with gitweb and Perl
> ecosystem if this is warranted, but I have a feeling that importing
> and using Mail::Address->parse() only because we want to see if a
> given "<string>" is an address is a bit overkill and it might be
> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>
>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email-privacy'}{'default'} = [1];
>> +	'email-privacy' => {
>> +		'sub' => sub { feature_bool('email-privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3459,32 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub is_mailaddr {
>> +	my @addrs = Mail::Address->parse(shift);
>> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
>> +		return 0;
>> +	}
>> +	return 1;
>> +}
>> +
>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	while ($line =~ m/(<[^>]+>)/g) {
>> +		my $match = $1;
>> +		if (!is_mailaddr($match)) {
>> +			next;
>> +		}
>> +		my $offset = pos $line;
>> +		my $head = substr $line, 0, $offset - length($match);
>> +		my $redaction = "<redacted>";
>> +		my $tail = substr $line, $offset;
>> +		$line = $head . $redaction . $tail;
>> +		pos $line = length($head) + length($redaction);
>
> Hmmmm, Perl suggestions from others?  It looks quite strange to see
> that s/// operator is not used and replacement is done manually with
> byte position in a Perl script.
>
>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3465,7 +3501,7 @@ sub parse_tag {
>>  		} elsif ($line =~ m/^tag (.+)$/) {
>>  			$tag{'name'} = $1;
>>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
>> -			$tag{'author'} = $1;
>> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>>  			$tag{'author_epoch'} = $2;
>>  			$tag{'author_tz'} = $3;
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>
> This (and the others that follow the same pattern) looks sensible.
>
>> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>>  	}
>>  	my $patch_max = gitweb_get_feature('patches');
>> -	if ($patch_max && !defined $file_name) {
>> +	if ($patch_max && !defined $file_name &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>>  			$paging_nav .= " &sdot; " .
>>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
>> @@ -7550,7 +7588,8 @@ sub git_commit {
>>  			} @$parents ) .
>>  			')';
>>  	}
>> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
>> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		$formats_nav .= " | " .
>>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  				"patch");
>> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>>  		$formats_nav =
>>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>>  			        "raw");
>> -		if ($patch_max && @{$co{'parents'}} <= 1) {
>> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
>> +			!gitweb_check_feature('email-privacy')) {
>>  			$formats_nav .= " | " .
>>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  					"patch");
>
> I wouldn't have expected to see the above three hunks in the
> "patch" codepath.  Rather, I was hoping that you'd do something
> like this at startup when you find out that the privacy feature
> is enabled:
>
> 	$feature{'patches'}{'default'} = [0]
> 		if gitweb_get_feature('email-privacy');
>
> so that nothing related to the 'patches' has to be modified.
> That way, even if there were fourth place that can leak an e-mail
> address in the 'patch' codepath that above three hunks in this patch
> missed, crawlers won't be able to get at it, no?
>
I forgot to mention one of the reasons for doing it this way.
If someone knows this URL exists and is relying on it (e.g., in a script)
they can still call it with "email-privacy" enabled.
The only thing that's changing is that the UI element (link) that a web
crawler would follow just because it's there is now gone.

But, as mentioned before, I don't feel strongly about either approach.
Let me know.

>> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
>> index 1f32ca66ea51..77fc1298d4c6 100644
>> --- a/t/lib-gitweb.sh
>> +++ b/t/lib-gitweb.sh
>> @@ -67,6 +67,9 @@ gitweb_run () {
>>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>>  	export GITWEB_CONFIG
>>
>> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
>> +	export PERL5LIB
>> +
>
> Why is this change suddenly become necessary?  This addition is only
> about tests---do we need to do something similar in the runtime
> environment when the updated gitweb that requires Mail::Address gets
> deployed, or is that covered already somewhere else?
>
> Thanks.
>



  parent reply	other threads:[~2021-03-23  4:27 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
2021-03-21  1:27   ` brian m. carlson
2021-03-21  3:30   ` Georgios Kontaxis
2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
2021-03-21 18:48       ` Junio C Hamano
2021-03-21 19:48       ` Georgios Kontaxis
2021-03-21 18:42     ` Junio C Hamano
2021-03-21 18:57       ` Junio C Hamano
2021-03-21 19:05         ` Junio C Hamano
2021-03-21 20:07       ` Georgios Kontaxis
2021-03-21 22:17         ` Junio C Hamano
2021-03-21 23:14           ` Georgios Kontaxis
2021-03-22  4:25             ` Junio C Hamano
2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
2021-03-22 18:32       ` Junio C Hamano
2021-03-22 18:58         ` Georgios Kontaxis
2021-03-28  1:41           ` Junio C Hamano
2021-03-28 21:43             ` Georgios Kontaxis
2021-03-28 22:35               ` Junio C Hamano
2021-03-23  4:27         ` Georgios Kontaxis [this message]
2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
2021-03-29 20:00           ` Junio C Hamano
2021-03-31 21:14             ` Junio C Hamano
2021-04-06  0:56             ` Junio C Hamano
2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
2021-04-08 22:51             ` Junio C Hamano
2021-03-29  1:47         ` [PATCH v5] " Eric Wong
2021-03-29  3:17           ` Georgios Kontaxis
2021-04-08 17:16             ` Eric Wong
2021-04-08 21:04               ` Junio C Hamano
2021-04-08 21:19                 ` Eric Wong
2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
2021-04-08 22:54                     ` Junio C Hamano
2021-03-21  6:00 ` [PATCH] " Junio C Hamano
2021-03-21  6:18   ` Junio C Hamano
2021-03-21  6:43   ` Georgios Kontaxis
2021-03-21 16:55     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f604de775736756180d382d2f54e8b1c.squirrel@mail.kodaksys.org \
    --to=geko1702+commits@99rst.org \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).