Topic: [Anti-plagiarism] The full list of homographs
Coin-1 (OP)
Legendary
Activity: 2478
Merit: 2216
September 03, 2018, 12:10:28 PM
Last edit: September 05, 2018, 01:33:07 PM by Coin-1
 #1

Bounty hunters in BitcoinTalk signature campaigns are usually required to write a certain number of posts per week, and participants are credited with stakes for this activity. Sometimes unscrupulous users copy other members' messages, or paragraphs from external articles on the Internet, and post them here on the forum. Such posts can easily be compared and tracked by SEO services, so these bounty hunters have begun using homographs to complicate detection.

Put simply, homographs (also called homoglyphs) are different characters in the international Unicode table that look the same visually. The English alphabet uses only ASCII characters.

If homographs from different languages are mixed into a text, a human reader will not notice any difference, but analysis systems will not be able to detect plagiarism by simply comparing the UTF-8 encoded texts.

For example:
  • "SEO". Only ASCII characters are used here, with no homographs. The word is 3 bytes long in UTF-8.
  • "ЅΕО". Here the first symbol "Ѕ" (0x0405) is taken from the Macedonian alphabet, the second symbol "Ε" (0x0395) from the Greek alphabet, and the third symbol "О" (0x041E) from the Russian alphabet. These non-English letters look the same as the ASCII characters, but each of them is encoded in two bytes, so the word is 6 bytes long in UTF-8.
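
The byte counts in this example can be checked with a few lines of Python. This is only a small sketch; the variable names are my own, and the homographs are written as `\u` escapes so that they cannot be confused with the ASCII letters in the source code.

```python
# Plain ASCII word.
ascii_word = "SEO"

# Look-alike word: U+0405 (Macedonian S), U+0395 (Greek E), U+041E (Russian O).
homograph_word = "\u0405\u0395\u041e"

ascii_bytes = ascii_word.encode("utf-8")
homograph_bytes = homograph_word.encode("utf-8")

print(len(ascii_bytes))              # 3 bytes: one byte per ASCII character
print(len(homograph_bytes))          # 6 bytes: two bytes per homograph
print(ascii_word == homograph_word)  # False: a plain comparison sees two different words
```

The last line is the whole trick: the two words look the same to a reader, but a simple string comparison treats them as unrelated.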

In this way, some members use homographs to write posts on the forum by simply copying and modifying other people's texts. I therefore decided to create a full list of the homographs that can be used in English texts.



According to the HTML code, the forum uses the following CSS style:
Code:
style="font-family: Verdana, Arial, sans-serif;"
Thus, messages use three font families: "Verdana", "Arial" and the generic "sans-serif" fallback (listed as "Sans Serif" in the table). In addition, "Courier New" is used for monospaced text.

The table shows each ASCII character next to its homograph, written in all four of these fonts. See my next post below.
Coin-1 (OP)
Legendary
Activity: 2478
Merit: 2216
September 03, 2018, 12:11:00 PM
Last edit: September 05, 2018, 01:38:46 PM by Coin-1
 #2

The list of homographs for ASCII:

#   ASCII char   Unicode number Comment     Verdana  Arial  Sans Serif  Courier New
1   A (65)       0x0391 (913)   Greek       A Α      A Α    A Α         A Α
2   B (66)       0x0392 (914)   Greek       B Β      B Β    B Β         B Β
3   E (69)       0x0395 (917)   Greek       E Ε      E Ε    E Ε         E Ε
4   Z (90)       0x0396 (918)   Greek       Z Ζ      Z Ζ    Z Ζ         Z Ζ
5   H (72)       0x0397 (919)   Greek       H Η      H Η    H Η         H Η
6   I (73)       0x0399 (921)   Greek       I Ι      I Ι    I Ι         I Ι
7   K (75)       0x039A (922)   Greek       K Κ      K Κ    K Κ         K Κ
8   M (77)       0x039C (924)   Greek       M Μ      M Μ    M Μ         M Μ
9   N (78)       0x039D (925)   Greek       N Ν      N Ν    N Ν         N Ν
10  O (79)       0x039F (927)   Greek       O Ο      O Ο    O Ο         O Ο
11  P (80)       0x03A1 (929)   Greek       P Ρ      P Ρ    P Ρ         P Ρ
12  T (84)       0x03A4 (932)   Greek       T Τ      T Τ    T Τ         T Τ
13  Y (89)       0x03A5 (933)   Greek       Y Υ      Y Υ    Y Υ         Y Υ
14  X (88)       0x03A7 (935)   Greek       X Χ      X Χ    X Χ         X Χ
15  o (111)      0x03BF (959)   Greek       o ο      o ο    o ο         o ο
16  c (99) [4]   0x03F2 (1010)  Greek       c ϲ      c ϲ    c ϲ         c ϲ
17  j (106) [2]  0x03F3 (1011)  Greek       j ϳ      j ϳ    j ϳ         j ϳ
18  C (67) [4]   0x03F9 (1017)              C Ϲ      C Ϲ    C Ϲ         C Ϲ
19  S (83)       0x0405 (1029)  Macedonian  S Ѕ      S Ѕ    S Ѕ         S Ѕ
20  I (73)       0x0406 (1030)              I І      I І    I І         I І
21  J (74)       0x0408 (1032)  Macedonian  J Ј      J Ј    J Ј         J Ј
22  A (65)       0x0410 (1040)  Russian     A А      A А    A А         A А
23  B (66)       0x0412 (1042)  Russian     B В      B В    B В         B В
24  E (69)       0x0415 (1045)  Russian     E Е      E Е    E Е         E Е
25  K (75) [1]   0x041A (1050)  Russian     K К      K К    K К         K К
26  M (77)       0x041C (1052)  Russian     M М      M М    M М         M М
27  H (72)       0x041D (1053)  Russian     H Н      H Н    H Н         H Н
28  O (79)       0x041E (1054)  Russian     O О      O О    O О         O О
29  P (80)       0x0420 (1056)  Russian     P Р      P Р    P Р         P Р
30  C (67)       0x0421 (1057)  Russian     C С      C С    C С         C С
31  T (84)       0x0422 (1058)  Russian     T Т      T Т    T Т         T Т
32  X (88)       0x0425 (1061)  Russian     X Х      X Х    X Х         X Х
33  a (97)       0x0430 (1072)  Russian     a а      a а    a а         a а
34  e (101)      0x0435 (1077)  Russian     e е      e е    e е         e е
35  o (111)      0x043E (1086)  Russian     o о      o о    o о         o о
36  p (112)      0x0440 (1088)  Russian     p р      p р    p р         p р
37  c (99)       0x0441 (1089)  Russian     c с      c с    c с         c с
38  y (121) [3]  0x0443 (1091)  Russian     y у      y у    y у         y у
39  x (120)      0x0445 (1093)  Russian     x х      x х    x х         x х
40  s (115)      0x0455 (1109)  Macedonian  s ѕ      s ѕ    s ѕ         s ѕ
41  i (105)      0x0456 (1110)              i і      i і    i і         i і
42  j (106)      0x0458 (1112)  Macedonian  j ј      j ј    j ј         j ј
43  Y (89)       0x04AE (1198)              Y Ү      Y Ү    Y Ү         Y Ү
44  h (104)      0x04BB (1211)              h һ      h һ    h һ         h һ
45  I (73) [2]   0x04C0 (1216)              I Ӏ      I Ӏ    I Ӏ         I Ӏ
46  l (108) [2]  0x04CF (1231)              l ӏ      l ӏ    l ӏ         l ӏ
47  G (71) [1]   0x050C (1292)              G Ԍ      G Ԍ    G Ԍ         G Ԍ
48  Q (81)       0x051A (1306)              Q Ԛ      Q Ԛ    Q Ԛ         Q Ԛ
49  q (113)      0x051B (1307)              q ԛ      q ԛ    q ԛ         q ԛ
50  W (87)       0x051C (1308)              W Ԝ      W Ԝ    W Ԝ         W Ԝ
51  w (119)      0x051D (1309)              w ԝ      w ԝ    w ԝ         w ԝ

[1] almost identical in all fonts
[2] identical in all fonts except "Verdana" (v5.02)
[3] identical in all fonts except "Courier New" (v5.11)
[4] identical only in the font "Arial" (v5.06)
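
All homographs in the list above come from the Greek and Cyrillic parts of the Unicode table, so one rough way to spot them is to look for words whose letters belong to more than one script. The sketch below is only an illustration of that idea, not a tool used by the forum. The function names are made up, and the script is approximated by the first word of the official Unicode character name (for example "LATIN", "GREEK" or "CYRILLIC").

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of scripts used by the letters of `word`.

    The script is approximated by the first word of the Unicode
    character name, e.g. "LATIN", "GREEK" or "CYRILLIC".
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def suspicious_words(text):
    """Return the words whose letters come from more than one script."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]


# U+043E is the Russian "o" hidden inside an otherwise ASCII word.
sample = "Bitc\u043ein price is rising"
print(suspicious_words(sample))
```

A word written entirely in one foreign script (for example, only Russian letters) is not flagged by this check, so it can only complement a full replacement list, not replace it.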
Coin-1 (OP)
Legendary
Activity: 2478
Merit: 2216
September 03, 2018, 12:11:27 PM
 #3

Reserved.
o_e_l_e_o
In memoriam
Legendary
Activity: 2268
Merit: 18566
September 03, 2018, 12:40:19 PM
 #4

This has recently been addressed by theymos following posts by iasenko over the last few months. See here: https://bitcointalk.org/index.php?topic=4967143.0

Essentially, all homographs that look the same as Latin characters are retroactively auto-replaced with the Latin characters in the English boards.
Coin-1 (OP)
Legendary
Activity: 2478
Merit: 2216
September 05, 2018, 03:27:07 PM
Last edit: September 05, 2018, 03:40:28 PM by Coin-1
 #5

Quote from: o_e_l_e_o
This has recently been addressed by theymos following posts by iasenko over the last few months. See here: https://bitcointalk.org/index.php?topic=4967143.0

Yes, iasenko's posts encouraged me to create this table. It seems that I was a little late. :)

As far as I can see, nkampala's list does not contain these homographs:
0x03F2 (1010): ϲ -> c
0x03F9 (1017): Ϲ -> C
0x041A (1050): К -> K
0x04C0 (1216): Ӏ -> I
0x04CF (1231): ӏ -> l
0x051A (1306): Ԛ -> Q
0x051C (1308): Ԝ -> W

So I guess that my list can also be useful.
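
For anyone who wants to double-check the hexadecimal and decimal numbers of such symbols, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is my own invention; it prints each character in the same "hex (decimal)" form used in this thread, followed by its official Unicode name.

```python
import unicodedata


def describe(ch):
    """Format one character as 'HEX (DECIMAL): UNICODE NAME'."""
    code = ord(ch)
    name = unicodedata.name(ch, "<unnamed>")
    return f"0x{code:04X} ({code}): {name}"


# The seven symbols listed above, written as escapes to avoid any confusion
# with the ASCII letters they imitate.
for ch in "\u03f2\u03f9\u041a\u04c0\u04cf\u051a\u051c":
    print(describe(ch))
```

The printed name also shows which script a symbol really belongs to, which helps when a look-alike has been pasted from an unknown source.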

By the way, I think that the following symbols posted by nkampala are significantly different from ASCII characters:
Code:
Ғ -> F
Լ -> L
ε -> e
ι -> i
κ -> k
յ -> j
η -> n
ρ -> p
զ -> q
τ -> t
υ -> u
ν -> v
ω -> w
χ -> x
γ -> y



Quote from: o_e_l_e_o
Essentially, all homographs that look the same as Latin characters are retroactively auto-replaced with the Latin characters in the English boards.

As I understand it, the posted homographs are still stored in the forum database, but they are replaced with ASCII characters when posts are displayed in the non-local sections. In my opinion, since homographs are only abused in the other English boards, the Meta section should be allowed to show homographs for reporting or other administrative purposes.
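
To illustrate what I mean by replacing at display time, here is a minimal sketch. It is only my assumption of how such a filter could work, not the forum's actual code: the mapping contains just a few entries, and the names `HOMOGRAPH_TO_ASCII`, `render_post` and `replace_homographs` are hypothetical.

```python
# Hypothetical subset of a replacement table: {homograph code point: ASCII}.
HOMOGRAPH_TO_ASCII = {
    0x0405: "S",  # Macedonian
    0x0395: "E",  # Greek
    0x041E: "O",  # Russian
    0x0430: "a",  # Russian
    0x0435: "e",  # Russian
    0x043E: "o",  # Russian
}


def render_post(stored_text, replace_homographs):
    """Return the text to display; the stored text itself is not modified."""
    if replace_homographs:
        return stored_text.translate(HOMOGRAPH_TO_ASCII)
    return stored_text


# Text as it would be saved in the database, with three homographs.
stored = "\u0405\u0395\u041e services"

print(render_post(stored, replace_homographs=True))             # cleaned view
print(render_post(stored, replace_homographs=False) == stored)  # raw view is unchanged
```

With a switch like this, a section such as Meta could show the raw text for reports while the other English boards show the cleaned text.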

Anyway, I will look for other homographs in the Unicode table later.
TheBeardedBaby
Legendary
Activity: 2198
Merit: 3134
September 05, 2018, 08:20:27 PM
 #6

It's good to have more alternatives. Good work, OP, even though the problem has already been fixed; theymos will probably add the ones that are missing from nkampala's list.
