Bitcoin Forum
Topic: Quick programming bounty: anti-phishing regex - 0.2 BTC
theymos (OP)
November 09, 2013, 12:53:49 AM
 #1

Create a single PCRE regex (PHP preg_match) that accurately matches phishing BBcode like:

Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url]
[url=phishing.com]safe-site.com[/url]
[iurl=http://phishing.com]safe-site.com[/url]
[url=http://phishing.com][b]safe[/b]-site.com[/url]
[url=http://phishing.com]safe-site.io[/url]
[url=http://phishing.com]safe-site⠠com[/url] (notice Unicode . lookalike)
[url=http://phishing.com]safe-site .com[/url] (notice Unicode hair space)
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url] (a link containing both text and an image)

but does not match:

Code:
[url=http://safe-site.com]http://safe-site.com[/url]
[url=safe-site.com]safe-site.com[/url]
[url=http://safe-site.com]safe-site.com[/url]
[url=safe-site.com]http://safe-site.com[/url]
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]   [/url] (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]

These parts of the URL should be captured:
Code:
[url=$1]$2[/url]
And there should be no other capturing groups.

Make your regex somewhat readable so that I can check your logic, perhaps using the x modifier.
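
A minimal sketch of the shape requested above, written with Python's `re` module as a stand-in for PHP's `preg_match` (the two engines differ in places, so treat it as an illustration only). It shows how verbose mode and non-capturing groups `(?:...)` keep the pattern readable with exactly two captures. It does not decide whether a link is phishing, and the name `BBCODE_LINK` is hypothetical.

```python
import re

# Hypothetical skeleton: exactly two capturing groups. Anything else that
# needs grouping uses (?:...) so it does not create an extra capture.
BBCODE_LINK = re.compile(r"""
    \[ i? url =            # opening tag, url or iurl
    ( [^\]]+ )             # capture 1: the real link target
    \]
    ( .*? )                # capture 2: the visible link text (lazy)
    \[ / (?:i?url) \]      # closing tag, grouped without capturing
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

m = BBCODE_LINK.search("[url=http://phishing.com]safe-site.com[/url]")
print(BBCODE_LINK.groups)          # number of capturing groups in the pattern
print(m.group(1), m.group(2))      # link target, then visible text
```

In PHP the same idea would presumably use the `x`, `i` and `s` modifiers; that mapping is an assumption and is not tested here.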

Post your solution here when you have it. The person who posts the first solution that I find acceptable gets the bounty. I may split the bounty among several people if my chosen solution is a derivative of a previously-posted attempted solution.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
dserrano5
November 09, 2013, 01:48:58 AM
 #2

Let's see whether:

  • I understood the problem
  • PHP's "Perl-compatible regexes" are actually Perl-compatible. This is Perl code :P

Code:
#!/usr/bin/perl

use warnings;
use strict;

################################# useful stuff #######################
my $phishing_domain = 'phishing.com';
my $re = qr{
    ^\[
    i?
    url=
    ( (?:http://)? \Q$phishing_domain\E ] )   # domain is quoted so "." is literal; the "]" ends up inside capture 1
    (.* (?=\[/url]) )
}x;
################################# /useful stuff #######################

my @bad = split /\n/, <<EOF;
[url=http://phishing.com]http://safe-site.com/login.php[/url]
[url=phishing.com]safe-site.com[/url]
[iurl=http://phishing.com]safe-site.com[/url]
[url=http://phishing.com][b]safe[/b]-site.com[/url]
[url=http://phishing.com]safe-site.io[/url]
[url=http://phishing.com]safe-site⠠com[/url] (notice Unicode . lookalike)
[url=http://phishing.com]safe-site .com[/url] (notice Unicode hair space)
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url] (a link containing both text and an image)
EOF

my @good = split /\n/, <<EOF;
[url=http://safe-site.com]http://safe-site.com[/url]
[url=safe-site.com]safe-site.com[/url]
[url=http://safe-site.com]safe-site.com[/url]
[url=safe-site.com]http://safe-site.com[/url]
[url=http://safe-site.com][img]http://asdf.com/image.png[/img]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]   (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]
EOF

foreach my $bad (@bad) {
    if ($bad =~ /$re/) {
        printf "\$1 (%s) 2\$ (%s)\n", $1, $2;
    } else {
        print "oops line '$bad' should have matched\n";
    }
}

foreach my $good (@good) {
    print "oops, line '$good' unexpectedly matched\n" if $good =~ /$re/;
}

Edit: output is:

Code:
$1 (http://phishing.com]) 2$ (http://safe-site.com/login.php)
$1 (phishing.com]) 2$ (safe-site.com)
$1 (http://phishing.com]) 2$ (safe-site.com)
$1 (http://phishing.com]) 2$ ([b]safe[/b]-site.com)
$1 (http://phishing.com]) 2$ (safe-site.io)
$1 (http://phishing.com]) 2$ (safe-site⠠com)
$1 (http://phishing.com]) 2$ (safe-site .com)
$1 (http://phishing.com]) 2$ (safe-site[img]http://asdf.com/period.png[/img]com)
theymos (OP)
November 09, 2013, 01:55:03 AM
 #3

I don't want to match particular domains. phishing.com and safe-site.com are just examples. I want the regex to match all [url] links where the link text appears to be an auto-linkified URL on casual examination, but where the actual link URL is different.

Example:
http://bitcointalk.org (http://bitcointalk.org)
http://google.com ([url=http://google.com]http://google.com[/url])
http://google.com ([url=http://google.com]http://bitcointalk.org[/url])

I want the regex to match the last link's BBcode (without knowing about "bitcointalk.org" or "google.com"), and I don't want it to be possible for someone to bypass the regex using Unicode tricks, images, etc.

theymos (OP)
November 09, 2013, 02:07:49 AM
 #4

What the forum was previously doing was replacing all instances of:
Code:
(
\[i?url [^]]+ \]
    \W* ((?:http|www) [^[]+ )
\[/i?url\]
)ix
with the captured stuff ($1). So [url=http://google.com]http://asdf.com[/url] becomes just http://asdf.com. But this can be defeated in many ways.
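
A rough sketch of why that rule is easy to bypass, using Python's `re` as a stand-in for the forum's PHP code. The pattern is the one quoted above, with the brackets inside character classes escaped. The helper name `old_filter` and the third test string (a variation on the `[b]` example from the first post) are assumptions made for illustration.

```python
import re

# The old forum rule quoted above, ported to Python's re module.
OLD_RULE = re.compile(r"""
    \[i?url [^\]]+ \]
        \W* ((?:http|www) [^\[]+ )
    \[/i?url\]
""", re.VERBOSE | re.IGNORECASE)

def old_filter(post: str) -> str:
    """Replace each matching link with its captured visible text."""
    return OLD_RULE.sub(r"\1", post)

# Caught: the visible text starts with "http", so the link is flattened.
print(old_filter("[url=http://google.com]http://asdf.com[/url]"))

# Missed: the visible text has no http/www prefix.
print(old_filter("[url=phishing.com]safe-site.com[/url]"))

# Missed: a BBCode tag sits between the opening tag and the "www".
print(old_filter("[url=http://phishing.com][b]www[/b].safe-site.com[/url]"))
```

Only the first link is rewritten; the other two pass through unchanged, which is the kind of gap the bounty is meant to close.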

dserrano5
November 09, 2013, 02:23:35 AM
 #5

IMHO you don't want a catch-all regex for doing such a fine-grained job.
Dragooon
November 09, 2013, 07:48:00 AM
 #6

Why don't you simply capture individual parts and parse that in SMF's URL BBC parser (fairly sure SMF supports callbacks for BBC)? That should be faster than a single regex doing that.
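
One way the callback idea could look, sketched in Python rather than in SMF's PHP parser (whose hook API is not shown in this thread, so nothing here is SMF's actual interface). A simple regex captures the two parts of each link and ordinary code compares the hosts. All names (`LINK`, `check_link`, `sanitize`, and so on) are hypothetical. The sketch compares hosts exactly and handles ASCII text only; Unicode lookalikes and text-plus-image tricks are not covered.

```python
import re
from urllib.parse import urlsplit

LINK = re.compile(r"\[i?url=([^\]]+)\](.*?)\[/i?url\]", re.IGNORECASE | re.DOTALL)
IMG = re.compile(r"\[img[^\]]*\].*?\[/img\]", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"\[/?[a-z]+[^\]]*\]", re.IGNORECASE)
HOSTLIKE = re.compile(r"(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})", re.IGNORECASE)

def host_of(url: str) -> str:
    """Return the lowercased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()

def check_link(match: re.Match) -> str:
    """Callback: rewrite a link whose visible text names a different host."""
    target, text = match.group(1), match.group(2)
    visible = TAG.sub("", IMG.sub("", text))    # drop images, then other tags
    shown = HOSTLIKE.search(visible)
    if shown and shown.group(1).lower() != host_of(target):
        # Show the real destination instead of the misleading text.
        return f"[url={target}]{target}[/url]"
    return match.group(0)

def sanitize(post: str) -> str:
    """Run the callback over every [url=...]...[/url] pair in a post."""
    return LINK.sub(check_link, post)

print(sanitize("[url=http://phishing.com][b]safe[/b]-site.com[/url]"))
print(sanitize("[url=safe-site.com]http://safe-site.com[/url]"))
```

Splitting the work this way keeps the regex small and puts the comparison logic in code that is easier to read and test than one large pattern.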

sdp
November 09, 2013, 04:09:23 PM
 #7

Can we just do the exact opposite instead? Instead of matching those that are bad, we match those that are good.

Now I submit that these three shouldn't match either:
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img]   [/url] (notice whitespace)
[url=http://safe-site.com]safe-site.com is a good site[/url]
[url=http://safe-site.com]こんにちは。[/url]


First, if you allow images such as
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
then why not:
Code:
[url=phishing.com][img]http://asdf.com/image.png[/img][/url]

and suppose that image looks just like the text www.safe-site.com?


12648430
November 10, 2013, 02:12:11 AM
 #8

There is no general solution that flags Unicode-using URL lookalikes but doesn't flag text that doesn't contain any URL.
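
A small sketch of that tension, in Python and under a deliberately naive assumption: every non-ASCII character is treated as a possible "." lookalike. The helper `looks_like_host` and its pattern are hypothetical. The three inputs are the Braille-dot example from the first post, a smiley between two words, and the Japanese greeting from the first post.

```python
import re

# Hypothetical, deliberately naive host test: labels, then a 2-6 letter TLD.
HOSTLIKE = re.compile(r"^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,6}$", re.IGNORECASE)

def looks_like_host(text: str) -> bool:
    """Fold every non-ASCII character to "." and test for a host-like shape."""
    folded = re.sub(r"[^\x00-\x7F]", ".", text)
    return bool(HOSTLIKE.match(folded))

print(looks_like_host("safe-site\u2820com"))   # Braille dot posing as "."
print(looks_like_host("Hello\u263aWorld"))     # harmless text with a smiley
print(looks_like_host("こんにちは。"))           # Japanese greeting
```

Under this heuristic the lookalike and the smiley text both look like hosts, while the Japanese greeting does not. Any rule loose enough to catch the first string also flags the second, which is the trade-off described above.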
dudeami
November 10, 2013, 04:08:06 AM
Last edit: November 11, 2013, 01:04:45 AM by dudeami
 #9

Hey, I have a pure RegExp solution that will work in most circumstances.

Code:
<?php
$regex = <<<'REGEXP'
@\[url=                           # Start of URL BBCode
(                                 # Group 1
  ((?:https?:\/\/)?)              # Group 2, capture protocol if it exists
  ([\da-z\.-]+)                   # Group 3, capture the hostname, without the TLD
  \.                              # Need a period between the hostname and TLD
  ([a-z\.]{2,6})                  # Group 4, TLD
  ((?:[\/\w \.-]*)*\/?)           # Group 5, path
)
\]
\s*?                              # Multiline support
(                                 # Group 6
  (?!.*\3.+\4|.*\[img\].*)        # Lookahead to check for the non-phishing domain and images
  .*?
  ((?:https?:\/\/)?)              # Group 7, phishing URL protocol
  (                               # Group 8, phishing URL host
    (?:[\da-z\.-]                 # Match any characters normally found in a URL
    |                             # or
    \[[^\]]+\]                    # Match any BBCode
    |                             # or
    [^\x00-\x7F])+                # Match any unicode characters
  )
  (?:\.|[^\x00-\x7F]+)            # Need a period, but also look for unicode characters
  ([a-z\.]{2,6})                  # TLD
  ((?:[\/\w \.-]*)*\/?)           # Path
  .*?
  |                               # or
  [^ ]+\[img\].*\[\/img\][^ ]+    # An image with anything other than spaces surrounding it
)
\s*?                              # Find any whitespace in between
\[/url\]                          # End of URL BBCode
@xmi
REGEXP;

This will fail if any unicode characters are used inside of words, but other than that should be selective enough. Example of this failing at the bottom of the post.

Edit: Will also fail when replacing the legit URL's "." with " " (or any other non-alphanumeric, non-unicode character) and having a phishing site for the URL. Expanding the TLD sections to look for real world TLDs could fix this issue to an extent.


success
Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=phishing.com]
safe-site.com[/url]
Code:
[url=phishing.com]phishing.com[/url]
success
Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
Code:
[url=http://phishing.com]http://phishing.com[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
success
Code:
[url=http://phishing.com]Welcome to safe-site.com![/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com][b]safe[/b]-site.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site.io[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site⠠com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site .com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]s a f e-s i t e.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://safe-site.io]safe-site.com[/url]
Code:
[url=http://safe-site.io]http://safe-site.io[/url]
success
Code:
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
failure
Code:
[url=http://phishing.com]http://safe-site com[/url]
Code:
[url=http://phishing.com]http://safe-site com[/url]
success
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
success
Code:
[url=safe-site.com]safe-site.com[/url]
Code:
[url=safe-site.com]safe-site.com[/url]
success
Code:
[url=http://safe-site.com]safe-site.com[/url]
Code:
[url=http://safe-site.com]safe-site.com[/url]
success
Code:
[url=safe-site.com]http://safe-site.com[/url]
Code:
[url=safe-site.com]http://safe-site.com[/url]
success
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
success
Code:
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img][/url]
Code:
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img][/url]
success
Code:
[url=http://safe-site.com]safe-site.com is a good site[/url]
Code:
[url=http://safe-site.com]safe-site.com is a good site[/url]
success
Code:
[url=http://safe-site.com]Welcome to safe-site.com![/url]
Code:
[url=http://safe-site.com]Welcome to safe-site.com![/url]
success
Code:
[url=http://safe-site.com]こんにちは。[/url]
Code:
[url=http://safe-site.com]こんにちは。[/url]
success
Code:
Some normal text
Code:
Some normal text
success
Code:
[url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]
Code:
[url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]
failure
Code:
[url=http://safe-site.com]Hello☺World[/url]
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
success
Code:
[url=http://safe-site.com]Hello ☺ World[/url]
Code:
[url=http://safe-site.com]Hello ☺ World[/url]

Put your heart on the line, it determines your fate.
BTC: 1DUDEAMiV54PFJFSe5fen3wr1e71unkaGj
theymos (OP)
November 11, 2013, 12:20:05 AM
 #10

Seems pretty close, but that's too greedy (and expensive). Your regex sees this entire string (which is valid BBcode) as a match:
Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
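
A stripped-down sketch of the greediness problem on that exact string, in Python's `re`. The two patterns here are simplified stand-ins written for this illustration, not the regex posted above, and they say nothing about how that regex was later changed.

```python
import re

post = ("[url=http://phishing.com]http://safe-site.com/login.php[/url]"
        "[nobbc]http://safe-site.com/login.php[/url][/nobbc]")

# Greedy: .* runs to the LAST [/url] in the string.
greedy = re.compile(r"\[url=([^\]]+)\](.*)\[/url\]", re.DOTALL)
# Lazy: .*? stops at the FIRST [/url] after the opening tag.
lazy = re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.DOTALL)

print(greedy.search(post).group(2))
print(lazy.search(post).group(2))
```

The greedy version swallows the `[nobbc]` section into the link text, while the lazy version stops at the link's own closing tag.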

dudeami
November 11, 2013, 01:26:44 AM
 #11

Fixed the too-greedy part; tests are shown in the original post. As for expensive, yes it is :p Kinda stretching the limits of RegExp here, at least to my knowledge.
