From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 67B771F9FD;
	Mon, 1 Mar 2021 05:47:36 +0000 (UTC)
Date: Mon, 1 Mar 2021 11:47:36 +0600
From: Eric Wong
To: Kyle Meyer
Cc: meta@public-inbox.org
Subject: [PATCH 4/3] lei p2q: fix /dev/null filenames, fix phrase quoting rules
Message-ID: <20210301054736.GA24278@dcvr>
References: <20210228122528.18552-1-e@80x24.org>
 <20210228122528.18552-2-e@80x24.org>
 <87k0qrrhve.fsf@kyleam.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87k0qrrhve.fsf@kyleam.com>
List-Id:

Kyle Meyer wrote:
> I noticed an unexpected term when trying dfa:
>
>   $ curl -fSs \
>       https://public-inbox.org/meta/20210228122528.18552-2-e@80x24.org/raw >msg
>   $ lei p2q --want=dfa msg
>   dfa:my @WQ_KEYS = qw dfa:"lxs l2m imp mrr cnv" dfa:"internal workers" dfa:dev/null
>
> So I think the upstream "--- " filename regexp needs to be adjusted to
> account for "/dev/null".

Thanks. Also, "my @WQ_KEYS = qw" needs to be quoted, at least
(and maybe '(' and ')', need to check Xapian more closely...).

And I'll have to fix them in SearchIdx (and probably switch to
use a common parser for indexing + term generation).

On a side note: I find myself mega-confused using public-inbox
patches as test data. I thought Perl was choking and spitting
code back out at me :x

---8<---
Subject: [PATCH] lei p2q: fix /dev/null filenames, fix phrase quoting rules

/dev/null mis-handling was reported by Kyle Meyer.

Phrase quoting rules are also refined to avoid leaving spaces
unquoted when "phrase generator" characters exist.

Also, context-free hunk headers no longer clobber the in_diff
state of the parser, since git can still generate those.

Link: https://public-inbox.org/meta/87k0qrrhve.fsf@kyleam.com/
---
 lib/PublicInbox/LeiP2q.pm | 10 +++++++---
 t/lei-p2q.t               |  3 +++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/LeiP2q.pm b/lib/PublicInbox/LeiP2q.pm
index d1dd125e..e7ddc852 100644
--- a/lib/PublicInbox/LeiP2q.pm
+++ b/lib/PublicInbox/LeiP2q.pm
@@ -12,6 +12,7 @@ use PublicInbox::MsgIter qw(msg_part_text);
 use PublicInbox::Git qw(git_unquote);
 use PublicInbox::Spawn qw(popen_rd);
 use URI::Escape qw(uri_escape_utf8);
+my $FN = qr!((?:"?[^/\n]+/[^\r\n]+)|/dev/null)!;
 
 sub xphrase ($) {
 	my ($s) = @_;
@@ -23,7 +24,7 @@ sub xphrase ($) {
 	map {
 		s/\A\s*//;
 		s/\s+\z//;
-		/[\|=><,\sA-Z]/ && !m![\./:\\\@]! ? qq("$_") : $_;
+		m![^\./:\\\@\-\w]! ? qq("$_") : $_ ;
 	} ($s =~ m!(\w[\|=><,\./:\\\@\-\w\s]+)!g);
 }
 
@@ -40,7 +41,7 @@ sub extract_terms { # eml->each_part callback
 			push @{$lei->{qterms}->{dfctx}}, xphrase($_);
 		} elsif (/^-- $/) { # email signature begins
 			$in_diff = undef;
-		} elsif (m!^diff --git "?[^/]+/.+ "?[^/]+/.+\z!) {
+		} elsif (m!^diff --git $FN $FN!) {
 			# wait until "---" and "+++" to capture filenames
 			$in_diff = 1;
 		} elsif (/^index ([a-f0-9]+)\.\.([a-f0-9]+)\b/) {
@@ -48,13 +49,16 @@ sub extract_terms { # eml->each_part callback
 			push @{$lei->{qterms}->{dfpre}}, $oa;
 			push @{$lei->{qterms}->{dfpost}}, $ob;
 			# who uses dfblob?
-		} elsif (m!^(?:---|\+{3}) ("?[^/]+/.+)!) {
+		} elsif (m!^(?:---|\+{3}) ($FN)!) {
+			next if $1 eq '/dev/null';
 			my $fn = (split(m!/!, git_unquote($1.''), 2))[1];
 			push @{$lei->{qterms}->{dfn}}, xphrase($fn);
 		} elsif ($in_diff && s/^\+//) { # diff added
 			push @{$lei->{qterms}->{dfb}}, xphrase($_);
 		} elsif ($in_diff && s/^-//) { # diff removed
 			push @{$lei->{qterms}->{dfa}}, xphrase($_);
+		} elsif (/^@@ (?:\S+) (?:\S+) @@\s*$/) {
+			# traditional diff w/o -p
 		} elsif (/^@@ (?:\S+) (?:\S+) @@\s*(\S+.*)/) {
 			push @{$lei->{qterms}->{dfhh}}, xphrase($1);
 		} elsif (/^(?:dis)similarity index/ ||
diff --git a/t/lei-p2q.t b/t/lei-p2q.t
index 1a2c2e4f..87cf9fa7 100644
--- a/t/lei-p2q.t
+++ b/t/lei-p2q.t
@@ -25,5 +25,8 @@ test_lei(sub {
 		"dfpost:6e006fd73b OR " .
 		"dfpost:6e006fd73\n",
 		'3-byte chop');
+
+	lei_ok(qw(p2q t/data/message_embed.eml --want=dfb));
+	like($lei_out, qr/\bdfb:\S+/, 'got dfb off /dev/null file');
 });
 done_testing;
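
For anyone wanting to poke at the two rules outside of lei, here is a
rough standalone sketch. The $FN pattern and the quoting test are copied
from the patch above; header_path() and needs_quote() are hypothetical
helper names made up for this illustration and do not exist in
LeiP2q.pm. It also skips git_unquote, so C-quoted filenames are not
handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Filename pattern copied from the patch: either a "prefix/path" form
# (optionally starting with a double quote) or the literal /dev/null.
my $FN = qr!((?:"?[^/\n]+/[^\r\n]+)|/dev/null)!;

# Hypothetical helper (not in LeiP2q.pm): return the path with its
# leading "a/" or "b/" component removed, or nothing for /dev/null
# and for lines that are not "---"/"+++" headers.
sub header_path {
	my ($line) = @_;
	return unless $line =~ m!^(?:---|\+{3}) $FN!;
	return if $1 eq '/dev/null';
	(split(m!/!, $1, 2))[1];
}

# Hypothetical helper wrapping the test used by the patched xphrase():
# a candidate phrase gets quoted when it contains any character
# outside of ., /, :, \, @, - and \w (so any whitespace forces quotes).
sub needs_quote { $_[0] =~ m![^\./:\\\@\-\w]! ? 1 : 0 }

for my $l ('--- a/lib/PublicInbox/LeiP2q.pm', '+++ /dev/null',
		'not a header') {
	my $p = header_path($l);
	printf "%-34s => %s\n", $l, defined($p) ? $p : '(skipped)';
}

for my $s ('internal workers', 'my @WQ_KEYS = qw',
		'lib/PublicInbox/LeiP2q.pm') {
	printf "%-28s => %s\n", $s, needs_quote($s) ? qq("$s") : $s;
}
```

The second alternative in $FN is what lets "+++ /dev/null" match at all,
so the caller can skip it explicitly instead of falling through to the
added/removed line branches.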