git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Georgios Kontaxis" <geko1702+commits@99rst.org>
To: "Junio C Hamano" <gitster@pobox.com>
Cc: "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org,
	"\"Ævar Arnfjörð Bjarmason\"" <avarab@gmail.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
Date: Mon, 22 Mar 2021 18:58:41 -0000	[thread overview]
Message-ID: <6f656f62497fb7c9322432b5eb151b86.squirrel@mail.kodaksys.org> (raw)
In-Reply-To: <xmqqlfaf6nu9.fsf@gitster.g>

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> [note to other reviewers. input from those who are more familiar
> with gitweb and Perl is very much appreciated on this patch].
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This may result in unsolicited messages.
>
> "... are exposed to web crawlers, which spammers may use." would be
> sufficient as a problem description.
>
Using robots.txt was suggested earlier in this review so
I want to make sure people don't rely on it instead.

> After giving a problem description, it is customery to describe the
> solution as if you are ordering the codebase to "be like so", so
> instead of this ...
>
>> This is a feature for redacting e-mail addresses
>> from the generated HTML, etc. content.
>
> ... we may say something like
>
>     Introduce an 'email-privacy' feature, which defaults to false,
>     that redacts e-mail addresses that appear as author/committer
>     info and in log messages from the generated HTML content.
>
Will rephrase.

>> This feature does not prevent someone from downloading the
>> unredacted commit log, e.g., by cloning the repository, and
>> extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> And this is a good thing to add.  Overall, nicely written.
>
>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>>     gitweb: redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers and they may not respect robots.txt. This may result in
>>     unsolicited messages. This is a feature for redacting e-mail
>> addresses
>>     from the generated HTML, etc. content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log, e.g., by cloning the repository, and extracting
>> information
>>     from it. It aims to hinder the low-effort bulk collection of e-mail
>>     addresses by web crawlers.
>
> You do not need to repeat the above, which is in the log message above.
>
>>     Changes since v1:
>>
>>      * Turned off the feature by default.
>>      * Removed duplicate code.
>>      * Added note about Gitweb consumers receiving redacted logs.
>>
>>     Changes since v2:
>>
>>      * The feature can be set on a per-project basis. ('override' => 1)
>>
>>     Changes since v3:
>>
>>      * Renamed feature to "email-privacy" and improved documentation.
>>      * Removed UI elements for git-format-patch since it won't be
>> redacted.
>>      * Simplified calls to the address redaction logic.
>>      * Mail::Address is now used to reduce false-positive redactions.
>
> Having these under the --- line like this is very helpful.
>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..6630c76d92fd 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -21,6 +21,7 @@
>>  use File::Basename qw(basename);
>>  use Time::HiRes qw(gettimeofday tv_interval);
>>  use Digest::MD5 qw(md5_hex);
>> +use Git::LoadCPAN::Mail::Address;
>
> I'll defer to others who are more familiar with gitweb and Perl
> ecosystem if this is warranted, but I have a feeling that importing
> and using Mail::Address->parse() only because we want to see if a
> given "<string>" is an address is a bit overkill and it might be
> sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
>
I think Mail::Address->parse() pretty much does the same thing.
It was suggested earlier in the review to use that function,
similar to what git-send-mail does.
I'm assuming because the maintainers of the Mail package are already
invested in the (accurate) parsing of e-mail addresses.

>> +	# Redact e-mail addresses.
>> +
>> +	# To enable system wide have in $GITWEB_CONFIG
>> +	# $feature{'email-privacy'}{'default'} = [1];
>> +	'email-privacy' => {
>> +		'sub' => sub { feature_bool('email-privacy', @_) },
>> +		'override' => 1,
>> +		'default' => [0]},
>>  );
>>
>>  sub gitweb_get_feature {
>> @@ -3449,6 +3459,32 @@ sub parse_date {
>>  	return %date;
>>  }
>>
>> +sub is_mailaddr {
>> +	my @addrs = Mail::Address->parse(shift);
>> +	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
>> +		return 0;
>> +	}
>> +	return 1;
>> +}
>> +
>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	while ($line =~ m/(<[^>]+>)/g) {
>> +		my $match = $1;
>> +		if (!is_mailaddr($match)) {
>> +			next;
>> +		}
>> +		my $offset = pos $line;
>> +		my $head = substr $line, 0, $offset - length($match);
>> +		my $redaction = "<redacted>";
>> +		my $tail = substr $line, $offset;
>> +		$line = $head . $redaction . $tail;
>> +		pos $line = length($head) + length($redaction);
>
> Hmmmm, Perl suggestions from others?  It looks quite strange to see
> that s/// operator is not used and replacement is done manually with
> byte position in a Perl script.
>
If there's a more elegant way to do the above we can certain do that instead.

>>  sub parse_tag {
>>  	my $tag_id = shift;
>>  	my %tag;
>> @@ -3465,7 +3501,7 @@ sub parse_tag {
>>  		} elsif ($line =~ m/^tag (.+)$/) {
>>  			$tag{'name'} = $1;
>>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
>> -			$tag{'author'} = $1;
>> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>>  			$tag{'author_epoch'} = $2;
>>  			$tag{'author_tz'} = $3;
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>
> This (and the others that follow the same pattern) looks sensible.
>
>> @@ -7489,7 +7526,8 @@ sub git_log_generic {
>>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>>  	}
>>  	my $patch_max = gitweb_get_feature('patches');
>> -	if ($patch_max && !defined $file_name) {
>> +	if ($patch_max && !defined $file_name &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>>  			$paging_nav .= " &sdot; " .
>>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
>> @@ -7550,7 +7588,8 @@ sub git_commit {
>>  			} @$parents ) .
>>  			')';
>>  	}
>> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
>> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
>> +		!gitweb_check_feature('email-privacy')) {
>>  		$formats_nav .= " | " .
>>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  				"patch");
>> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
>>  		$formats_nav =
>>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>>  			        "raw");
>> -		if ($patch_max && @{$co{'parents'}} <= 1) {
>> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
>> +			!gitweb_check_feature('email-privacy')) {
>>  			$formats_nav .= " | " .
>>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>>  					"patch");
>
> I wouldn't have expected to see the above three hunks in the
> "patch" codepath.  Rather, I was hoping that you'd do something
> like this at startup when you find out that the privacy feature
> is enabled:
>
> 	$feature{'patches'}{'default'} = [0]
> 		if gitweb_get_feature('email-privacy');
>
> so that nothing related to the 'patches' has to be modified.
> That way, even if there were fourth place that can leak an e-mail
> address in the 'patch' codepath that above three hunks in this patch
> missed, crawlers won't be able to get at it, no?
>
Or there's multiple feature flags / triggers that lead to the "patch"
codepath.
Either way is fine with me. (disabling the "patches" feature or preventing
the
generation of UI elements)
Like you said, someone who regularly reviews Gitweb code may have an opinion
on how do this.

>> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
>> index 1f32ca66ea51..77fc1298d4c6 100644
>> --- a/t/lib-gitweb.sh
>> +++ b/t/lib-gitweb.sh
>> @@ -67,6 +67,9 @@ gitweb_run () {
>>  	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
>>  	export GITWEB_CONFIG
>>
>> +	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
>> +	export PERL5LIB
>> +
>
> Why is this change suddenly become necessary?  This addition is only
> about tests---do we need to do something similar in the runtime
> environment when the updated gitweb that requires Mail::Address gets
> deployed, or is that covered already somewhere else?
>
This is necessary because osx-clang, osx-gcc tests were failing.
If there's a better way to address this I'm open to suggestions.

I haven't been able to experience deployment on osx but on linux
i didn't have to change anything.

> Thanks.
>



  reply	other threads:[~2021-03-22 18:59 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
2021-03-21  0:42 ` Ævar Arnfjörð Bjarmason
2021-03-21  1:27   ` brian m. carlson
2021-03-21  3:30   ` Georgios Kontaxis
2021-03-21  3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
2021-03-21 17:28   ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
2021-03-21 18:26     ` Ævar Arnfjörð Bjarmason
2021-03-21 18:48       ` Junio C Hamano
2021-03-21 19:48       ` Georgios Kontaxis
2021-03-21 18:42     ` Junio C Hamano
2021-03-21 18:57       ` Junio C Hamano
2021-03-21 19:05         ` Junio C Hamano
2021-03-21 20:07       ` Georgios Kontaxis
2021-03-21 22:17         ` Junio C Hamano
2021-03-21 23:14           ` Georgios Kontaxis
2021-03-22  4:25             ` Junio C Hamano
2021-03-22  6:57     ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
2021-03-22 18:32       ` Junio C Hamano
2021-03-22 18:58         ` Georgios Kontaxis [this message]
2021-03-28  1:41           ` Junio C Hamano
2021-03-28 21:43             ` Georgios Kontaxis
2021-03-28 22:35               ` Junio C Hamano
2021-03-23  4:27         ` Georgios Kontaxis
2021-03-27  3:56       ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
2021-03-28 23:26         ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
2021-03-29 20:00           ` Junio C Hamano
2021-03-31 21:14             ` Junio C Hamano
2021-04-06  0:56             ` Junio C Hamano
2021-04-08 22:43           ` Ævar Arnfjörð Bjarmason
2021-04-08 22:51             ` Junio C Hamano
2021-03-29  1:47         ` [PATCH v5] " Eric Wong
2021-03-29  3:17           ` Georgios Kontaxis
2021-04-08 17:16             ` Eric Wong
2021-04-08 21:04               ` Junio C Hamano
2021-04-08 21:19                 ` Eric Wong
2021-04-08 22:45                   ` Ævar Arnfjörð Bjarmason
2021-04-08 22:54                     ` Junio C Hamano
2021-03-21  6:00 ` [PATCH] " Junio C Hamano
2021-03-21  6:18   ` Junio C Hamano
2021-03-21  6:43   ` Georgios Kontaxis
2021-03-21 16:55     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6f656f62497fb7c9322432b5eb151b86.squirrel@mail.kodaksys.org \
    --to=geko1702+commits@99rst.org \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).