Quick programming bounty: anti-phishing regex

theymos (OP)

Administrator
Legendary

Offline

Activity: 5390
Merit: 13426

⇾ Quick programming bounty: anti-phishing regex - 0.2 BTC

November 09, 2013, 12:53:49 AM

Create a single PCRE regex (PHP preg_match) that accurately matches phishing BBcode like:

Code:

[url=http://phishing.com]http://safe-site.com/login.php[/url]
[url=phishing.com]safe-site.com[/url]
[iurl=http://phishing.com]safe-site.com[/url]
[url=http://phishing.com][b]safe[/b]-site.com[/url]
[url=http://phishing.com]safe-site.io[/url]
[url=http://phishing.com]safe-site⠠com[/url] (notice Unicode . lookalike)
[url=http://phishing.com]safe-site .com[/url] (notice Unicode hair space)
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url] (a link containing both text and an image)

but does not match:

Code:

[url=http://safe-site.com]http://safe-site.com[/url]
[url=safe-site.com]safe-site.com[/url]
[url=http://safe-site.com]safe-site.com[/url]
[url=safe-site.com]http://safe-site.com[/url]
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]  [/url] (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]

These parts of the URL should be captured:

Code:

[url=$1]$2[/url]

And there should be no other capturing groups.

Make your regex somewhat readable so that I can check your logic, perhaps using the x modifier.

Post your solution here when you have it. The person who posts the first solution that I find acceptable gets the bounty. I may split the bounty among several people if my chosen solution is a derivative of a previously-posted attempted solution.
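
For reference, the requested shape (exactly two capturing groups, laid out with the x modifier) might look roughly like the skeleton below. This is only a sketch: it uses Python's re module rather than PHP's preg_match, the name URL_SHAPE is invented for illustration, and it matches every [url] tag rather than only phishing ones.

```python
import re

# Skeleton only: captures the link target and the link text, nothing else.
# It does NOT decide whether a link is phishing.
URL_SHAPE = re.compile(r"""
    \[ i?url =        # opening tag, url or iurl
    ( [^\]]+ )        # group 1: link target
    \]
    ( .*? )           # group 2: link text, lazily up to the first closing tag
    \[ /url \]        # closing tag
""", re.X | re.I | re.S)

m = URL_SHAPE.search("[iurl=http://phishing.com]safe-site.com[/url]")
print(URL_SHAPE.groups, m.groups())
```

Any extra grouping needed for the actual phishing logic would have to use non-capturing groups, (?: ... ), to keep the group count at two.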

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD

dserrano5

Legendary

Offline

Activity: 1974
Merit: 1029

Re: Quick programming bounty: anti-phishing regex - 0.2 BTC

November 09, 2013, 01:48:58 AM

Let's see whether I understood the problem, and whether PHP's "Perl-compatible regexes" are actually Perl compatible. This is Perl code:

Code:

#!/usr/bin/perl

use warnings;
use strict;

################################# useful stuff #######################
my $phishing_domain = 'phishing.com';
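my $re = qr{
    ^\[
    i?                                        # optional i, so both url and iurl tags match
    url=
    ( (?:http://)? \Q$phishing_domain\E ] )   # group 1: optional scheme plus the quoted domain (the closing bracket is captured too)
    (.* (?=\[/url]) )                         # group 2: link text, greedy, up to the last closing url tag
}x;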
################################# /useful stuff #######################

my @bad = split /\n/, <<EOF;
[url=http://phishing.com]http://safe-site.com/login.php[/url]
[url=phishing.com]safe-site.com[/url]
[iurl=http://phishing.com]safe-site.com[/url]
[url=http://phishing.com][b]safe[/b]-site.com[/url]
[url=http://phishing.com]safe-site.io[/url]
[url=http://phishing.com]safe-site⠠com[/url] (notice Unicode . lookalike)
[url=http://phishing.com]safe-site .com[/url] (notice Unicode hair space)
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url] (a link containing both text and an image)
EOF

my @good = split /\n/, <<EOF;
[url=http://safe-site.com]http://safe-site.com[/url]
[url=safe-site.com]safe-site.com[/url]
[url=http://safe-site.com]safe-site.com[/url]
[url=safe-site.com]http://safe-site.com[/url]
[url=http://safe-site.com][img]http://asdf.com/image.png[/img]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]   (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]
EOF

foreach my $bad (@bad) {
    if ($bad =~ /$re/) {
        printf "\$1 (%s) 2\$ (%s)\n", $1, $2;
    } else {
        print "oops line '$bad' should have matched\n";
    }
}

foreach my $good (@good) {
    print "oops, line '$good' unexpectedly matched\n" if $good =~ /$re/;
}

Edit: output is:

Code:

$1 (http://phishing.com]) 2$ (http://safe-site.com/login.php)
$1 (phishing.com]) 2$ (safe-site.com)
$1 (http://phishing.com]) 2$ (safe-site.com)
$1 (http://phishing.com]) 2$ ([b]safe[/b]-site.com)
$1 (http://phishing.com]) 2$ (safe-site.io)
$1 (http://phishing.com]) 2$ (safe-site⠠com)
$1 (http://phishing.com]) 2$ (safe-site .com)
$1 (http://phishing.com]) 2$ (safe-site[img]http://asdf.com/period.png[/img]com)

theymos (OP)

November 09, 2013, 01:55:03 AM

I don't want to match particular domains. phishing.com and safe-site.com are just examples. I want the regex to match all [url] links where the link text appears to be an auto-linkified URL on casual examination, but where the actual link URL is different.

Example:
http://bitcointalk.org (http://bitcointalk.org)
http://google.com ([url=http://google.com]http://google.com[/url])
http://google.com ([url=http://google.com]http://bitcointalk.org[/url])

I want the regex to match the last link's BBcode (without knowing about "bitcointalk.org" or "google.com"), and I don't want it to be possible for someone to bypass the regex using Unicode tricks, images, etc.

theymos (OP)

November 09, 2013, 02:07:49 AM

What the forum was previously doing was replacing all instances of:

Code:

(
\[i?url [^]]+ \]
    \W* ((?:http|www) [^[]+ )
\[/i?url\]
)ix

with the captured text ($1). So [url=http://google.com]http://asdf.com[/url] becomes just http://asdf.com. But this can be defeated in many ways.
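
A rough sketch of that old replacement and of one way around it, assuming Python's re in place of PHP's preg_replace (OLD_FILTER and old_filter are names invented here, and the bracket characters inside the classes are escaped for Python's benefit):

```python
import re

# Port of the old forum pattern: only link text that starts with
# "http" or "www" (after optional non-word characters) is caught.
OLD_FILTER = re.compile(r"""
    \[i?url [^\]]+ \]
        \W* ((?:http|www) [^\[]+ )
    \[/i?url\]
""", re.X | re.I)

def old_filter(post):
    # Replace the whole tag with the captured link text (group 1).
    return OLD_FILTER.sub(r"\1", post)

print(old_filter("[url=http://google.com]http://asdf.com[/url]"))
print(old_filter("[url=http://phishing.com]safe-site.com[/url]"))
```

The first link is reduced to its text. The second passes through untouched, because its text does not begin with http or www even though it still reads as a domain name.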

dserrano5

November 09, 2013, 02:23:35 AM

IMHO you don't want a catch-all regex for such a fine-grained job.

Dragooon

Member

Offline

Activity: 88
Merit: 10

November 09, 2013, 07:48:00 AM

Why don't you simply capture individual parts and parse that in SMF's URL BBC parser (fairly sure SMF supports callbacks for BBC)? That should be faster than a single regex doing that.
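
A minimal sketch of that capture-then-callback idea, written in Python rather than against SMF's actual parser (whose callback API is not shown in this thread). All names here (URL_TAG, host_of, defang, filter_links) are hypothetical. It strips nested BBCode from the link text, looks for something shaped like a hostname, and compares it with the real target's host. It makes no attempt at the Unicode or image tricks.

```python
import re
from urllib.parse import urlsplit

URL_TAG = re.compile(r"\[i?url=([^\]]+)\](.*?)\[/i?url\]", re.I | re.S)
ANY_TAG = re.compile(r"\[/?[a-z]+[^\]]*\]", re.I)
HOSTLIKE = re.compile(r"(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})", re.I)

def host_of(target):
    # urlsplit only fills in the hostname when a "//" is present.
    if "//" not in target:
        target = "//" + target
    return (urlsplit(target).hostname or "").lower()

def defang(match):
    target, text = match.group(1), match.group(2)
    visible = ANY_TAG.sub("", text)        # drop [b], [/b], etc.
    shown = HOSTLIKE.search(visible)
    if shown and shown.group(1).lower() != host_of(target):
        # Text names a different host: show the real target instead.
        return "[url={0}]{0}[/url]".format(target)
    return match.group(0)

def filter_links(post):
    return URL_TAG.sub(defang, post)

print(filter_links("[url=http://phishing.com][b]safe[/b]-site.com[/url]"))
print(filter_links("[url=http://safe-site.com]safe-site.com is a good site[/url]"))
```

Doing the comparison in a callback keeps each regex small, at the cost of deciding in code (not in the pattern) what counts as "looks like a URL".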

sdp

Sr. Member

Offline

Activity: 469
Merit: 281

November 09, 2013, 04:09:23 PM

Can we just do the exact opposite instead? Instead of matching those that are bad, we match those that are good.

Now I submit that these shouldn't match either:

Code:

[url=http://safe-site.com][img]http://asdf.com/image.png[/img]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]   (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]

First, if you allow images such as

Code:

[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]

then why not:

Code:

[url=phishing.com][img]http://asdf.com/image.png[/img][/url]

and suppose that image looks just like the text www.safe-site.com?

12648430

Full Member

Offline

Activity: 144
Merit: 100

November 10, 2013, 02:12:11 AM

There is no general solution that flags URL lookalikes built from Unicode characters without also flagging text that contains no URL at all.

dudeami

Full Member

Offline

Activity: 126
Merit: 100

November 10, 2013, 04:08:06 AM
Last edit: November 11, 2013, 01:04:45 AM by dudeami

Hey, I have a pure regex solution that will work in most circumstances.

Code:

<?php
$regex = <<<'REGEXP'
@\[url=                         # Start of URL BBCode
(                               # Group 1
	((?:https?:\/\/)?)            # Group 2, capture protocol if it exists
	([\da-z\.-]+)                 # Group 3, capture the hostname, without the TLD
	\.                            # Need a period between the hostname and TLD
	([a-z\.]{2,6})                # Group 4, TLD
	((?:[\/\w \.-]*)*\/?)         # Group 5, path
)
\]
\s*?                            # Multiline support
(                               # Group 6
	(?!.*\3.+\4|.*\[img\].*)    # Lookahead to check for the non-phishing domain and images
	.*?
	((?:https?:\/\/)?)            # Group 7, phishing URL protocol
	(                             # Group 8, phishing URL host
		(?:[\da-z\.-]               # Match any characters normally found in a URL
		|                           # or
		\[[^\]]+\]                  # Match any BBCode
		|                           # or
		[^\x00-\x7F])+              # Match any unicode characters
	)                            
	(?:\.|[^\x00-\x7F]+)          # Need a period, but also look for unicode characters
	([a-z\.]{2,6})                # TLD
	((?:[\/\w \.-]*)*\/?)         # Path
	.*?
	|                             # or
	[^ ]+\[img\].*\[\/img\][^ ]+  # An image with anything other than spaces surrounding it
)
\s*?                            # Find any whitespace in between
\[/url\]                        # End of URL BBCode
@xmi
REGEXP;

This will fail if any Unicode characters are used inside of words, but other than that it should be selective enough. There is an example of this failing at the bottom of the post.

Edit: It will also fail when the "." in the legitimate-looking link text is replaced with " " (or any other non-alphanumeric ASCII character) while the link itself points at a phishing site. Expanding the TLD sections to look only for real-world TLDs could fix this issue to an extent.

Each row below gives the test result, the input BBcode, and the BBcode after replacement:

success	Code: [url=http://phishing.com]http://safe-site.com/login.php[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=phishing.com] safe-site.com[/url]	Code: [url=phishing.com]phishing.com[/url]
success	Code: [url=http://phishing.com]http://safe-site.com/login.php[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]	Code: [url=http://phishing.com]http://phishing.com[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
success	Code: [url=http://phishing.com]Welcome to safe-site.com![/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com]safe-site.com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com][b]safe[/b]-site.com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com]safe-site.io[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com]safe-site⠠com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com]safe-site .com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://phishing.com]s a f e-s i t e.com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
success	Code: [url=http://safe-site.io]safe-site.com[/url]	Code: [url=http://safe-site.io]http://safe-site.io[/url]
success	Code: [url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url]	Code: [url=http://phishing.com]http://phishing.com[/url]
failure	Code: [url=http://phishing.com]http://safe-site com[/url]	Code: [url=http://phishing.com]http://safe-site com[/url]
success	Code: [url=http://safe-site.com]http://safe-site.com[/url]	Code: [url=http://safe-site.com]http://safe-site.com[/url]
success	Code: [url=safe-site.com]safe-site.com[/url]	Code: [url=safe-site.com]safe-site.com[/url]
success	Code: [url=http://safe-site.com]safe-site.com[/url]	Code: [url=http://safe-site.com]safe-site.com[/url]
success	Code: [url=safe-site.com]http://safe-site.com[/url]	Code: [url=safe-site.com]http://safe-site.com[/url]
success	Code: [url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]	Code: [url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
success	Code: [url=http://safe-site.com] [img]http://asdf.com/image.png[/img][/url]	Code: [url=http://safe-site.com] [img]http://asdf.com/image.png[/img][/url]
success	Code: [url=http://safe-site.com]safe-site.com is a good site[/url]	Code: [url=http://safe-site.com]safe-site.com is a good site[/url]
success	Code: [url=http://safe-site.com]Welcome to safe-site.com![/url]	Code: [url=http://safe-site.com]Welcome to safe-site.com![/url]
success	Code: [url=http://safe-site.com]こんにちは。[/url]	Code: [url=http://safe-site.com]こんにちは。[/url]
success	Code: Some normal text	Code: Some normal text
success	Code: [url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]	Code: [url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]
failure	Code: [url=http://safe-site.com]Hello☺World[/url]	Code: [url=http://safe-site.com]http://safe-site.com[/url]
success	Code: [url=http://safe-site.com]Hello ☺ World[/url]	Code: [url=http://safe-site.com]Hello ☺ World[/url]
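
The Hello☺World failure above follows from accepting any run of non-ASCII characters where a period is expected. The sketch below isolates just that piece in Python; DOTLIKE and looks_like_url are invented names, and the pattern is a simplified stand-in, not the full regex from this post.

```python
import re

# ASCII label, then a dot OR any non-ASCII run, then a TLD-like tail.
DOTLIKE = re.compile(r"[\da-z-]+(?:\.|[^\x00-\x7F]+)[a-z.]{2,6}", re.I)

def looks_like_url(text):
    return DOTLIKE.search(text) is not None

samples = [
    "safe-site\u2820com",    # braille dot standing in for "."
    "safe-site\u200a.com",   # hair space before the real "."
    "こんにちは。",            # no ASCII label at all
    "Hello ☺ World",         # spaces keep the emoji away from the words
    "Hello☺World",           # emoji wedged between two ASCII words
]
for s in samples:
    print(repr(s), looks_like_url(s))
```

The first two and the last sample are flagged and the middle two are not. To the pattern, the last sample is indistinguishable from a lookalike domain, which is the trade-off 12648430 described earlier in the thread.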

theymos (OP)

November 11, 2013, 12:20:05 AM

Seems pretty close, but that's too greedy (and expensive). Your regex sees this entire string (which is valid BBcode) as a match:

Code:

[url=http://phishing.com]http://safe-site.com/login.php[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
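
The over-matching can be reproduced with a much simpler pattern than the one posted above. In this Python sketch (GREEDY and LAZY are illustrative names, not the regex under review), the only difference between the two patterns is the quantifier on the link text:

```python
import re

GREEDY = re.compile(r"\[url=([^\]]+)\](.*)\[/url\]", re.I | re.S)
LAZY = re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S)

post = ("[url=http://phishing.com]http://safe-site.com/login.php[/url]"
        "[nobbc]http://safe-site.com/login.php[/url][/nobbc]")

# Greedy .* runs to the LAST closing tag, swallowing the [nobbc] text.
print(GREEDY.search(post).group(2))
# Lazy .*? stops at the FIRST closing tag.
print(LAZY.search(post).group(2))
```

A lazy quantifier fixes this particular string, but it still knows nothing about [nobbc], so a [url] tag placed inside a [nobbc] block would need separate handling.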

dudeami

November 11, 2013, 01:26:44 AM

Fixed the too-greedy part; tests are shown in the original post. As for expensive, yes it is :p Kinda stretching the limits of regex here, at least to my knowledge.
