git@vger.kernel.org mailing list mirror (one of many)
From: Pratyush Yadav <me@yadavpratyush.com>
To: Paul Mackerras <paulus@ozlabs.org>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v2] gitk: Make web links clickable
Date: Sat, 14 Sep 2019 20:00:50 +0530
Message-ID: <20190914143050.jiax3vhm3ng7glew@yadavpratyush.com>
In-Reply-To: <20190913233307.GA29205@blackberry>

On 14/09/19 09:33AM, Paul Mackerras wrote:
> On Fri, Aug 30, 2019 at 12:02:07AM +0530, Pratyush Yadav wrote:
> > On 29/08/19 11:27AM, Paul Mackerras wrote:
> > 
> > I know I suggested searching till the first non-whitespace character, 
> > but thinking more about it, there are some problematic cases. Say 
> > someone has a commit message like:
> >   
> >   Foo bar baz (more details at https://example.com/hello)
> > 
> > Or like:
> > 
> >   Check out https://foo.com, https://bar.com
> > 
> > In the first example, the closing parenthesis gets included in the link, 
> > but shouldn't be. In the second, the comma after foo.com would be 
> > included in the link, but shouldn't be. So maybe use a more 
> > sophisticated regex?
> 
> I did think about that, but it seems to be impossible to get it right
> in all cases, so I went for simple and obvious.  In particular I don't
> see how to handle the common case of a '.' immediately following the
> URL, since '.' is a legal character in a URL.
> 
> > A quick Google search gives out the following options [0][1].
> > 
> > [0] gives the following regex:
> > 
> >   https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
> > 
> > It is kind of ugly to look at, and I'm not even sure if there are any 
> > syntax differences with Tcl's regex library.
> > 
> > [1] lists a bunch of regexes and which URLs they work on and which ones 
> > they don't. The smallest among them I found is:
> > 
> >   @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
> > 
> > Again, I'm not sure how well this would work with Tcl's regex library, 
> > or how commonly these URL patterns appear in actual commit messages.  
> > Just something to consider.
> > 
> > [0] https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
> > [1] https://mathiasbynens.be/demo/url-regex
> 
> I think I would be inclined to make the regex customizable, since that
> would also allow the user to match ftp or other URLs if they want.
> The only difficulty with that is if there are subexpressions, that
> will change how we have to interpret the list returned by the
> regexp -indices -all -inline command.

That just puts the responsibility of parsing the URL on the user; it 
doesn't solve the problem.

I don't have any numbers, but I think most problematic cases are when 
there are some trailing characters. We aren't dealing with malicious 
actors that want to do something bad or make gitk crash. IMO it is 
reasonable to expect legal URLs in a commit message.

So instead of trying to match every legal URL and reject every illegal 
one, how about using a simple regex for basic filtering to weed out some 
false positives, and then trimming illegal trailing characters? These 
trailing characters would most likely be commas, periods, parentheses, 
question marks, quotation marks, etc. This way the logic stays simple 
and we tackle more real-world problems.
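
To make the idea concrete, here is a rough sketch in Python rather than 
Tcl, so it is not the gitk implementation. The pattern, the set of 
trailing characters, and the function names (find_url_spans, 
extract_urls) are all assumptions picked for illustration. It matches 
loosely up to the next whitespace, then strips punctuation from the end 
of each match:

```python
import re

# Assumption: a deliberately loose filter, scheme followed by non-whitespace.
URL_RE = re.compile(r"https?://\S+")

# Assumption: characters that rarely end a real URL in a commit message.
TRAILING = ".,;:!?)]}>'\""


def find_url_spans(text):
    """Return (start, end) index pairs of URLs, trailing punctuation trimmed."""
    spans = []
    for m in URL_RE.finditer(text):
        url = m.group(0).rstrip(TRAILING)
        # Skip matches where nothing but the scheme is left after trimming.
        if url.endswith("://"):
            continue
        spans.append((m.start(), m.start() + len(url)))
    return spans


def extract_urls(text):
    """Return the trimmed URL strings found in text."""
    return [text[start:end] for start, end in find_url_spans(text)]


if __name__ == "__main__":
    print(extract_urls("Foo bar baz (more details at https://example.com/hello)"))
    print(extract_urls("Check out https://foo.com, https://bar.com"))
```

This handles both examples from my earlier mail and the trailing '.' 
case. The known cost is that a URL which legitimately ends in one of 
these characters, say a closing parenthesis, gets cut short.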

Sounds reasonable?

-- 
Regards,
Pratyush Yadav
