user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* [PATCH 0/3] solver bugfixes and tweaks
@ 2020-01-02  9:24 Eric Wong
  2020-01-02  9:24 ` [PATCH 1/3] solver: try the next patch on apply failures Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02  9:24 UTC (permalink / raw)
  To: meta

A couple of improvements from errors I've seen with solver
and hopefully no regressions.  [PATCH 2/3] is a bit scary :x

Eric Wong (3):
  solver: try the next patch on apply failures
  solver: extract_diff: deal with missing "diff --git" line
  qspawn: use per-call quiet flag for solver

 lib/PublicInbox/Qspawn.pm    |   6 +-
 lib/PublicInbox/SolverGit.pm | 168 ++++++++++++++++++-----------------
 2 files changed, 89 insertions(+), 85 deletions(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] solver: try the next patch on apply failures
  2020-01-02  9:24 [PATCH 0/3] solver bugfixes and tweaks Eric Wong
@ 2020-01-02  9:24 ` Eric Wong
  2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
  2020-01-02  9:24 ` [PATCH 3/3] qspawn: use per-call quiet flag for solver Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02  9:24 UTC (permalink / raw)
  To: meta

Sometimes a patch is corrupted and resent to create the same
OID.  We need to account for that case and actually move onto
the next patch instead of blindly trying "git ls-files" to get
nothing out of it.
---
 lib/PublicInbox/SolverGit.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index c57fb4c6..3e3a5899 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -452,6 +452,7 @@ sub apply_result ($$) {
 		if ($nxt && oids_same_ish($nxt->{oid_b}, $di->{oid_b})) {
 			dbg($self, $msg);
 			dbg($self, 'trying '.di_url($self, $nxt));
+			return do_git_apply($self);
 		} else {
 			ERR($self, $msg);
 		}

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line
  2020-01-02  9:24 [PATCH 0/3] solver bugfixes and tweaks Eric Wong
  2020-01-02  9:24 ` [PATCH 1/3] solver: try the next patch on apply failures Eric Wong
@ 2020-01-02  9:24 ` Eric Wong
  2020-01-02  9:36   ` Eric Wong
                     ` (2 more replies)
  2020-01-02  9:24 ` [PATCH 3/3] qspawn: use per-call quiet flag for solver Eric Wong
  2 siblings, 3 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02  9:24 UTC (permalink / raw)
  To: meta

Rewrite the patch extraction loop using a single regexp which
accounts for missing "diff --git ..." lines and is capable of
extracting pathnames off the "+++ b/foo" line.

This fixes the solving of blob "96f1c7f" off
<2841d2de-32ad-eae8-6039-9251a40bb00e@tngtech.com>
in git@vger archives.
---
 lib/PublicInbox/SolverGit.pm | 165 ++++++++++++++++++-----------------
 1 file changed, 85 insertions(+), 80 deletions(-)

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index 3e3a5899..e1b82f3e 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -10,6 +10,7 @@
 package PublicInbox::SolverGit;
 use strict;
 use warnings;
+use 5.010_001;
 use File::Temp 0.19 ();
 use Fcntl qw(SEEK_SET);
 use PublicInbox::Git qw(git_unquote git_quote);
@@ -38,7 +39,7 @@ my $MAX_PATCH = 9999;
 #	oid_a => abbreviated pre-image oid,
 #	oid_b => abbreviated post-image oid,
 #	tmp => anonymous file handle with the diff,
-#	hdr_lines => arrayref of various header lines for mode information
+#	hdr_lines => string of various header lines for mode information
 #	mode_a => original mode of oid_a (string, not integer),
 #	ibx => PublicInbox::Inbox object containing the diff
 #	smsg => PublicInbox::SearchMsg object containing diff
@@ -47,10 +48,6 @@ my $MAX_PATCH = 9999;
 #	n => numeric path of the patch (relative to worktree)
 # }
 
-# don't bother if somebody sends us a patch with these path components,
-# it's junk at best, an attack attempt at worse:
-my %bad_component = map { $_ => 1 } ('', '.', '..');
-
 sub dbg ($$) {
 	print { $_[0]->{out} } $_[1], "\n" or ERR($_[0], "print(dbg): $!");
 }
@@ -100,10 +97,8 @@ sub solve_existing ($$) {
 
 sub extract_diff ($$) {
 	my ($p, $arg) = @_;
-	my ($self, $diffs, $re, $ibx, $smsg) = @$arg;
+	my ($self, $diffs, $pre, $post, $ibx, $smsg) = @$arg;
 	my ($part) = @$p; # ignore $depth and @idx;
-	my $hdr_lines; # diff --git a/... b/...
-	my $tmp;
 	my $ct = $part->content_type || 'text/plain';
 	my ($s, undef) = msg_part_text($part, $ct);
 	defined $s or return;
@@ -116,66 +111,85 @@ sub extract_diff ($$) {
 		$s =~ s/\r\n/\n/sg;
 	}
 
-	foreach my $l (split(/^/m, $s)) {
-		if ($l =~ $re) {
-			$di->{oid_a} = $1;
-			$di->{oid_b} = $2;
-			if (defined($3)) {
-				my $mode_a = $3;
-				if ($mode_a =~ /\A(?:100644|120000|100755)\z/) {
-					$di->{mode_a} = $mode_a;
-				}
-			}
-
-
-			# start writing the diff out to a tempfile
-			my $path = ++$self->{tot};
-			$di->{n} = $path;
-			open($tmp, '>', $self->{tmp}->dirname . "/$path") or
-							die "open(tmp): $!";
-
-			push @$hdr_lines, $l;
-			$di->{hdr_lines} = $hdr_lines;
-			utf8::encode($_) for @$hdr_lines;
-			print $tmp @$hdr_lines or die "print(tmp): $!";
-
-			# for debugging/diagnostics:
-			$di->{ibx} = $ibx;
-			$di->{smsg} = $smsg;
-		} elsif ($l =~ m!\Adiff --git ("?[^/]+/.+) ("?[^/]+/.+)$!) {
-			last if $tmp; # got our blob, done!
-
-			my ($path_a, $path_b) = ($1, $2);
-
-			# diff header lines won't have \r because git
-			# will quote them, but Email::MIME gives CRLF
-			# for quoted-printable:
-			$path_b =~ tr/\r//d;
-
-			# don't care for leading 'a/' and 'b/'
-			my (undef, @a) = split(m{/}, git_unquote($path_a));
-			my (undef, @b) = split(m{/}, git_unquote($path_b));
-
-			# get rid of path-traversal attempts and junk patches:
-			foreach (@a, @b) {
-				return if $bad_component{$_};
-			}
-
-			$di->{path_a} = join('/', @a);
-			$di->{path_b} = join('/', @b);
-			$hdr_lines = [ $l ];
-		} elsif ($tmp) {
-			utf8::encode($l);
-			print $tmp $l or die "print(tmp): $!";
-		} elsif ($hdr_lines) {
-			push @$hdr_lines, $l;
-			if ($l =~ /\Anew file mode (100644|120000|100755)$/) {
-				$di->{mode_a} = $1;
-			}
-		}
-	}
-	return undef unless $tmp;
+	state $LF = qr!\r?\n!;
+	state $FN = qr!(?:("?[^/\n]+/[^\r\n]+)|/dev/null)!;
+
+	$s =~ m!( # $1 start header lines we save for debugging:
+
+		# everything before ^index is optional, but we don't
+		# want to match ^(old|copy|rename|deleted|...) unless
+		# we match /^diff --git/ first:
+		(?: # begin optional stuff:
+
+		# try to get the pre-and-post filenames as $2 and $3
+		(?:^diff\x20--git\x20$FN\x20$FN$LF)
+
+		# old mode $4
+		(?:^old mode\x20(100644|120000|100755)$LF)?
+
+		# ignore other info
+		(?:^(?:copy|rename|deleted|dissimilarity|similarity).*$LF)?
+
+		# new mode (possibly new file) ($5)
+		(?:^new\x20(?:file\x20)?mode\x20(100644|120000|100755)$LF)?
+
+		# ignore other info
+		(?:^(?:copy|rename|deleted|dissimilarity|similarity).*$LF)?
+
+		)? # end of optional stuff, everything below is required
+
+		# match the pre and post-image OIDs as $6 $7
+		^index\x20(${pre}[a-f0-9]*)\.\.(${post}[a-f0-9]*)
+			# mode if unchanged $8
+			(?:\x20(100644|120000|100755))?$LF
+	) # end of header lines ($1)
+	( # $9 is the patch body
+		# "--- a/foo.c" sets pre-filename ($10) in case
+		# $2 is missing
+		(?:^---\x20$FN$LF)
+
+		# "+++ b/foo.c" sets post-filename ($11) in case
+		# $3 is missing
+		(?:^\+{3}\x20$FN$LF)
+
+		# the meat of the diff, including "^\\No newline ..."
+		(?:^[\@\+\x20\-\\][^\r\n]*$LF)+
+	)!smx or return;
+
+	my $hdr_lines = $1;
+	my $path_a = $2 // $10;
+	my $path_b = $3 // $11;
+	$di->{oid_a} = $6;
+	$di->{oid_b} = $7;
+	$di->{mode_a} = $5 // $8 // $6; # new (file) // unchanged // old
+	my $patch = $9;
+
+	# don't care for leading 'a/' and 'b/'
+	my (undef, @a) = split(m{/}, git_unquote($path_a));
+	my (undef, @b) = split(m{/}, git_unquote($path_b));
+
+	# get rid of path-traversal attempts and junk patches:
+	# it's junk at best, an attack attempt at worse:
+	state %bad_component = map { $_ => 1 } ('', '.', '..');
+	foreach (@a, @b) { return if $bad_component{$_} }
+
+	$di->{path_a} = join('/', @a);
+	$di->{path_b} = join('/', @b);
+
+	utf8::encode($hdr_lines);
+	utf8::encode($patch);
+	my $path = ++$self->{tot};
+	$di->{n} = $path;
+	open(my $tmp, '>', $self->{tmp}->dirname . "/$path") or
+		die "open(tmp): $!";
+	print $tmp $hdr_lines, $patch or die "print(tmp): $!";
 	close $tmp or die "close(tmp): $!";
+
+	# for debugging/diagnostics:
+	$di->{ibx} = $ibx;
+	$di->{smsg} = $smsg;
+	$di->{hdr_lines} = $hdr_lines;
+
 	push @$diffs, $di;
 }
 
@@ -214,13 +228,13 @@ sub find_extract_diffs ($$$) {
 	}
 
 	my $msgs = $srch->query($q, { relevance => 1 });
-	my $re = qr/\Aindex ($pre[a-f0-9]*)\.\.($post[a-f0-9]*)(?: ([0-9]+))?/;
+
 	my $diffs = [];
 	foreach my $smsg (@$msgs) {
 		$ibx->smsg_mime($smsg) or next;
 		my $mime = delete $smsg->{mime};
 		msg_iter($mime, \&extract_diff,
-				[$self, $diffs, $re, $ibx, $smsg]);
+				[$self, $diffs, $pre, $post, $ibx, $smsg]);
 	}
 	@$diffs ? $diffs : undef;
 }
@@ -251,7 +265,7 @@ sub prepare_index ($) {
 
 	my $oid_full = $existing->[1];
 	my $path_a = $di->{path_a} or die "BUG: path_a missing for $oid_full";
-	my $mode_a = $di->{mode_a} || extract_old_mode($di);
+	my $mode_a = $di->{mode_a} // '100644';
 
 	my $in = tmpfile("update-index.$oid_full") or die "tmpfile: $!";
 	print $in "$mode_a $oid_full\t$path_a\0" or die "print: $!";
@@ -307,15 +321,6 @@ EOF
 	prepare_index($self);
 }
 
-sub extract_old_mode ($) {
-	my ($di) = @_;
-	if (join('', @{$di->{hdr_lines}}) =~
-			/^old mode (100644|100755|120000)\b/) {
-		return $1;
-	}
-	'100644';
-}
-
 sub do_finish ($) {
 	my ($self) = @_;
 	my ($found, $oid_want) = @$self{qw(found oid_want)};
@@ -484,7 +489,7 @@ sub do_git_apply ($) {
 		my $i = ++$self->{nr};
 		$di = shift @$patches;
 		dbg($self, "\napplying [$i/$total] " . di_url($self, $di) .
-			"\n" . join('', @{$di->{hdr_lines}}));
+			"\n" . $di->{hdr_lines});
 		my $path = $di->{n};
 		$len += length($path) + 1;
 		push @cmd, $path;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] qspawn: use per-call quiet flag for solver
  2020-01-02  9:24 [PATCH 0/3] solver bugfixes and tweaks Eric Wong
  2020-01-02  9:24 ` [PATCH 1/3] solver: try the next patch on apply failures Eric Wong
  2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
@ 2020-01-02  9:24 ` Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02  9:24 UTC (permalink / raw)
  To: meta

solver can spawn multiple processes per HTTP request, but
"git apply" failures are needlessly noisy due to corrupt
patches.  We also don't want to silence "git ls-files"
or "git update-index" errors using $env->{'qspawn.quiet'},
either, so this granularity is needed.

Admins can check for 500 errors in access logs to detect
(and reproduce) solver failures, anyways, so there's no
need to log every time "git apply" rejects a corrupt patch.
---
 lib/PublicInbox/Qspawn.pm    | 6 ++----
 lib/PublicInbox/SolverGit.pm | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Qspawn.pm b/lib/PublicInbox/Qspawn.pm
index 1a2b70e7..65bb178a 100644
--- a/lib/PublicInbox/Qspawn.pm
+++ b/lib/PublicInbox/Qspawn.pm
@@ -56,9 +56,7 @@ sub _do_spawn {
 
 	($self->{rpipe}, $self->{pid}) = popen_rd($cmd, $cmd_env, \%opts);
 
-	# drop any IO handles opt was holding open via $opt->{hold}
-	# No need to hold onto the descriptor once the child process has it.
-	$self->{args} = $cmd; # keep this around for logging
+	$self->{args} = $opts{quiet} ? undef : $cmd;
 
 	if (defined $self->{pid}) {
 		$limiter->{running}++;
@@ -108,7 +106,7 @@ sub waitpid_err ($$) {
 
 	if ($err) {
 		$self->{err} = $err;
-		if ($env && !$env->{'qspawn.quiet'}) {
+		if ($env && $self->{args}) {
 			log_err($env, join(' ', @{$self->{args}}) . ": $err");
 		}
 	}
diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index e1b82f3e..ef76d706 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -497,7 +497,7 @@ sub do_git_apply ($) {
 	} while (@$patches && $len < $ARG_SIZE_MAX &&
 		 !oids_same_ish($patches->[0]->{oid_b}, $prv_oid_b));
 
-	my $opt = { 2 => 1, -C => $dn };
+	my $opt = { 2 => 1, -C => $dn, quiet => 1 };
 	my $qsp = PublicInbox::Qspawn->new(\@cmd, $self->{git_env}, $opt);
 	$self->{-cur_di} = $di;
 	$self->{-qsp} = $qsp;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line
  2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
@ 2020-01-02  9:36   ` Eric Wong
  2020-01-02 19:04     ` Eric Wong
  2020-01-02 21:12   ` Eric Wong
  2020-01-03  0:43   ` Eric Wong
  2 siblings, 1 reply; 8+ messages in thread
From: Eric Wong @ 2020-01-02  9:36 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> -my %bad_component = map { $_ => 1 } ('', '.', '..');

<snip>

> +	# get rid of path-traversal attempts and junk patches:
> +	# it's junk at best, an attack attempt at worse:
> +	state %bad_component = map { $_ => 1 } ('', '.', '..');

Nope, that fails on Perl 5.24.1.

"Initialization of state variables in list context currently forbidden"

:<  Otherwise, "state" is a nice 5.10.x feature.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line
  2020-01-02  9:36   ` Eric Wong
@ 2020-01-02 19:04     ` Eric Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02 19:04 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Eric Wong <e@80x24.org> wrote:
> > -my %bad_component = map { $_ => 1 } ('', '.', '..');
> 
> <snip>
> 
> > +	# get rid of path-traversal attempts and junk patches:
> > +	# it's junk at best, an attack attempt at worse:
> > +	state %bad_component = map { $_ => 1 } ('', '.', '..');
> 
> Nope, that fails on Perl 5.24.1.
> 
> "Initialization of state variables in list context currently forbidden"
> 
> :<  Otherwise, "state" is a nice 5.10.x feature.

The following is a simple fix which preserves compatibility
with older Perls while keeping the variable contained,
will squash:

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index ef76d706..ade94bd3 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -170,8 +170,8 @@ sub extract_diff ($$) {
 
 	# get rid of path-traversal attempts and junk patches:
 	# it's junk at best, an attack attempt at worse:
-	state %bad_component = map { $_ => 1 } ('', '.', '..');
-	foreach (@a, @b) { return if $bad_component{$_} }
+	state $bad_component = { map { $_ => 1 } ('', '.', '..') };
+	foreach (@a, @b) { return if $bad_component->{$_} }
 
 	$di->{path_a} = join('/', @a);
 	$di->{path_b} = join('/', @b);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line
  2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
  2020-01-02  9:36   ` Eric Wong
@ 2020-01-02 21:12   ` Eric Wong
  2020-01-03  0:43   ` Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-02 21:12 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
>  sub extract_diff ($$) {
> +		# old mode $4
> +		(?:^old mode\x20(100644|120000|100755)$LF)?

<snip>

> +	$di->{oid_a} = $6;
> +	$di->{oid_b} = $7;
> +	$di->{mode_a} = $5 // $8 // $6; # new (file) // unchanged // old

$6 is totally wrong, since it's already {oid_a}

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index ef76d706..7c367895 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -161,7 +161,7 @@ sub extract_diff ($$) {
 	my $path_b = $3 // $11;
 	$di->{oid_a} = $6;
 	$di->{oid_b} = $7;
-	$di->{mode_a} = $5 // $8 // $6; # new (file) // unchanged // old
+	$di->{mode_a} = $5 // $8 // $4; # new (file) // unchanged // old
 	my $patch = $9;
 
 	# don't care for leading 'a/' and 'b/'

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line
  2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
  2020-01-02  9:36   ` Eric Wong
  2020-01-02 21:12   ` Eric Wong
@ 2020-01-03  0:43   ` Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-01-03  0:43 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Rewrite the patch extraction loop using a single regexp which
> accounts for missing "diff --git ..." lines and is capable of
> extracting pathnames off the "+++ b/foo" line.
> 
> This fixes the solving of blob "96f1c7f" off
> <2841d2de-32ad-eae8-6039-9251a40bb00e@tngtech.com>
> in git@vger archives.
> ---
>  lib/PublicInbox/SolverGit.pm | 165 ++++++++++++++++++-----------------
>  1 file changed, 85 insertions(+), 80 deletions(-)
> 
> diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
> index 3e3a5899..e1b82f3e 100644
> --- a/lib/PublicInbox/SolverGit.pm
> +++ b/lib/PublicInbox/SolverGit.pm
> @@ -10,6 +10,7 @@
>  package PublicInbox::SolverGit;
>  use strict;
>  use warnings;
> +use 5.010_001;
>  use File::Temp 0.19 ();
>  use Fcntl qw(SEEK_SET);
>  use PublicInbox::Git qw(git_unquote git_quote);
> @@ -38,7 +39,7 @@ my $MAX_PATCH = 9999;
>  #	oid_a => abbreviated pre-image oid,
>  #	oid_b => abbreviated post-image oid,
>  #	tmp => anonymous file handle with the diff,
> -#	hdr_lines => arrayref of various header lines for mode information
> +#	hdr_lines => string of various header lines for mode information
>  #	mode_a => original mode of oid_a (string, not integer),
>  #	ibx => PublicInbox::Inbox object containing the diff
>  #	smsg => PublicInbox::SearchMsg object containing diff
> @@ -47,10 +48,6 @@ my $MAX_PATCH = 9999;
>  #	n => numeric path of the patch (relative to worktree)
>  # }
>  
> -# don't bother if somebody sends us a patch with these path components,
> -# it's junk at best, an attack attempt at worse:
> -my %bad_component = map { $_ => 1 } ('', '.', '..');
> -
>  sub dbg ($$) {
>  	print { $_[0]->{out} } $_[1], "\n" or ERR($_[0], "print(dbg): $!");
>  }
> @@ -100,10 +97,8 @@ sub solve_existing ($$) {
>  
>  sub extract_diff ($$) {
>  	my ($p, $arg) = @_;
> -	my ($self, $diffs, $re, $ibx, $smsg) = @$arg;
> +	my ($self, $diffs, $pre, $post, $ibx, $smsg) = @$arg;
>  	my ($part) = @$p; # ignore $depth and @idx;
> -	my $hdr_lines; # diff --git a/... b/...
> -	my $tmp;
>  	my $ct = $part->content_type || 'text/plain';
>  	my ($s, undef) = msg_part_text($part, $ct);
>  	defined $s or return;
> @@ -116,66 +111,85 @@ sub extract_diff ($$) {
>  		$s =~ s/\r\n/\n/sg;
>  	}
>  
> -	foreach my $l (split(/^/m, $s)) {
> -		if ($l =~ $re) {
> -			$di->{oid_a} = $1;
> -			$di->{oid_b} = $2;
> -			if (defined($3)) {
> -				my $mode_a = $3;
> -				if ($mode_a =~ /\A(?:100644|120000|100755)\z/) {
> -					$di->{mode_a} = $mode_a;
> -				}
> -			}
> -
> -
> -			# start writing the diff out to a tempfile
> -			my $path = ++$self->{tot};
> -			$di->{n} = $path;
> -			open($tmp, '>', $self->{tmp}->dirname . "/$path") or
> -							die "open(tmp): $!";
> -
> -			push @$hdr_lines, $l;
> -			$di->{hdr_lines} = $hdr_lines;
> -			utf8::encode($_) for @$hdr_lines;
> -			print $tmp @$hdr_lines or die "print(tmp): $!";
> -
> -			# for debugging/diagnostics:
> -			$di->{ibx} = $ibx;
> -			$di->{smsg} = $smsg;
> -		} elsif ($l =~ m!\Adiff --git ("?[^/]+/.+) ("?[^/]+/.+)$!) {
> -			last if $tmp; # got our blob, done!
> -
> -			my ($path_a, $path_b) = ($1, $2);
> -
> -			# diff header lines won't have \r because git
> -			# will quote them, but Email::MIME gives CRLF
> -			# for quoted-printable:
> -			$path_b =~ tr/\r//d;
> -
> -			# don't care for leading 'a/' and 'b/'
> -			my (undef, @a) = split(m{/}, git_unquote($path_a));
> -			my (undef, @b) = split(m{/}, git_unquote($path_b));
> -
> -			# get rid of path-traversal attempts and junk patches:
> -			foreach (@a, @b) {
> -				return if $bad_component{$_};
> -			}
> -
> -			$di->{path_a} = join('/', @a);
> -			$di->{path_b} = join('/', @b);
> -			$hdr_lines = [ $l ];
> -		} elsif ($tmp) {
> -			utf8::encode($l);
> -			print $tmp $l or die "print(tmp): $!";
> -		} elsif ($hdr_lines) {
> -			push @$hdr_lines, $l;
> -			if ($l =~ /\Anew file mode (100644|120000|100755)$/) {
> -				$di->{mode_a} = $1;
> -			}
> -		}
> -	}
> -	return undef unless $tmp;
> +	state $LF = qr!\r?\n!;
> +	state $FN = qr!(?:("?[^/\n]+/[^\r\n]+)|/dev/null)!;
> +
> +	$s =~ m!( # $1 start header lines we save for debugging:
> +
> +		# everything before ^index is optional, but we don't
> +		# want to match ^(old|copy|rename|deleted|...) unless
> +		# we match /^diff --git/ first:
> +		(?: # begin optional stuff:
> +
> +		# try to get the pre-and-post filenames as $2 and $3
> +		(?:^diff\x20--git\x20$FN\x20$FN$LF)
> +
> +		# old mode $4
> +		(?:^old mode\x20(100644|120000|100755)$LF)?
> +
> +		# ignore other info
> +		(?:^(?:copy|rename|deleted|dissimilarity|similarity).*$LF)?
> +
> +		# new mode (possibly new file) ($5)
> +		(?:^new\x20(?:file\x20)?mode\x20(100644|120000|100755)$LF)?
> +
> +		# ignore other info
> +		(?:^(?:copy|rename|deleted|dissimilarity|similarity).*$LF)?
> +
> +		)? # end of optional stuff, everything below is required
> +
> +		# match the pre and post-image OIDs as $6 $7
> +		^index\x20(${pre}[a-f0-9]*)\.\.(${post}[a-f0-9]*)
> +			# mode if unchanged $8
> +			(?:\x20(100644|120000|100755))?$LF
> +	) # end of header lines ($1)
> +	( # $9 is the patch body
> +		# "--- a/foo.c" sets pre-filename ($10) in case
> +		# $2 is missing
> +		(?:^---\x20$FN$LF)
> +
> +		# "+++ b/foo.c" sets post-filename ($11) in case
> +		# $3 is missing
> +		(?:^\+{3}\x20$FN$LF)
> +
> +		# the meat of the diff, including "^\\No newline ..."
> +		(?:^[\@\+\x20\-\\][^\r\n]*$LF)+
> +	)!smx or return;

...And we need to support context lines w/o leading space

a) because it happens in the real world because of MUAs

b) it was apparently proposed for GNU diff in the past
   (thankfully, it's not the default)

cf. https://public-inbox.org/git/b507b465f7831612b9d9fc643e3e5218b64e5bfa/s/
(oddly, I can't find the mail for that patch specifically)

This fixes a regression on git/5cd8845/s/?b=submodule.c

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index ab7d5c19..a78360fd 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -153,7 +153,9 @@ sub extract_diff ($$) {
 		(?:^\+{3}\x20$FN$LF)
 
 		# the meat of the diff, including "^\\No newline ..."
-		(?:^[\@\+\x20\-\\][^\r\n]*$LF)+
+		# We also allow for totally blank lines w/o leading spaces,
+		# because git-apply(1) handles that case, too
+		(?:^(?:[\@\+\x20\-\\][^\r\n]*|)$LF)+
 	)!smx or return;
 
 	my $hdr_lines = $1;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-01-03  0:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-02  9:24 [PATCH 0/3] solver bugfixes and tweaks Eric Wong
2020-01-02  9:24 ` [PATCH 1/3] solver: try the next patch on apply failures Eric Wong
2020-01-02  9:24 ` [PATCH 2/3] solver: extract_diff: deal with missing "diff --git" line Eric Wong
2020-01-02  9:36   ` Eric Wong
2020-01-02 19:04     ` Eric Wong
2020-01-02 21:12   ` Eric Wong
2020-01-03  0:43   ` Eric Wong
2020-01-02  9:24 ` [PATCH 3/3] qspawn: use per-call quiet flag for solver Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).