Bitcoin Forum
Topic: [MERGED] BIP-39 List of words in Portuguese accepted!!
bitmover (OP)
September 04, 2020, 05:09:18 PM
Last edit: December 21, 2020, 11:52:05 AM by bitmover
Merited by DarkStar_ (10), ABCbits (9), malevolent (7), Coding Enthusiast (5), naufragus (4), o_e_l_e_o (2), fillippone (2), 1miau (2), jackg (1), Pmalek (1), alegotardo (1), nc50lc (1), Husna QA (1), NotATether (1), Heisenberg_Hunter (1), Charles-Tim (1), PawGo (1)
 #1

Hello everyone

I am part of a group of 4 users (sabotag3x, alegotardo, Tryninja and me) in the Portuguese board who are creating a list of 2048 words in Portuguese to be submitted to https://github.com/bitcoin/bips/tree/master/bip-0039

Our bitcointalk topic for discussion is:
[2020] Lista de Palavras em Português para o BIP-0039

We followed a set of rules when adding words, which can be seen here:
https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md
Quote
Words can be uniquely determined typing the first 4 characters.
No accents or special characters.
No complex verb forms.
No plural words, unless there's no singular form.
No words with double spelling.
No words with the exact sound of another word with different spelling.
No offensive words.
No words already used in other language mnemonic sets.
The words which have not the same spelling in Brazil and in Portugal are excluded.
No words that remind negative/sad/bad things.
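
Two of the rules above (unique first 4 characters, no accents or special characters) are easy to check mechanically. A minimal sketch, assuming the list is plain lowercase words; the function name `check_wordlist` and the sample words are made up for illustration and are not taken from the actual list:

```python
from collections import defaultdict
import string

ALLOWED = set(string.ascii_lowercase)  # assumption: only plain a-z is acceptable


def check_wordlist(words, prefix_len=4):
    """Return (words containing characters outside a-z, groups sharing a prefix)."""
    bad_chars = [w for w in words if not set(w) <= ALLOWED]

    # Group words by their first `prefix_len` characters.
    by_prefix = defaultdict(list)
    for w in words:
        by_prefix[w[:prefix_len]].append(w)

    # Any prefix owned by more than one word breaks the uniqueness rule.
    clashes = {p: ws for p, ws in by_prefix.items() if len(ws) > 1}
    return bad_chars, clashes


# Illustrative sample, not the real list.
sample = ["abaixo", "abater", "cadeira", "cadeado", "ação"]
bad_chars, clashes = check_wordlist(sample)
print(bad_chars)  # words with accents or other special characters
print(clashes)    # prefixes shared by more than one word
```

The other rules (plurals, offensive words, regional spelling) still need human review.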


Our work is nearly done. We now have slightly more than 2048 words, so the surplus will be carefully trimmed, but all of them follow the criteria above. The list is almost ready for a pull request to the main branch.

I would like to know if there are any suggestions, or any special procedure we have missed, before we make the pull request.

Our list can be seen here:
https://github.com/sabotag3x/bips/blob/master/bip-0039/portuguese.txt

I hope our small group will be able to get into bitcoin history.

Thanks everyone.

NotATether
September 05, 2020, 12:23:17 AM
Merited by DarkStar_ (10), ABCbits (6), bitmover (4), TryNinja (3), Husna QA (1), Heisenberg_Hunter (1), Coding Enthusiast (1)
 #2

I really like your initiative, I was waiting until I got on PC to type this.

Make sure all the words in your list are words that people have heard of. Words not familiar to most people in your locale should be avoided. In some of the previous PRs for other wordlists, there were such words inside. Here's an example of this in the French wordlist.

Also, I'd recommend limiting the maximum length of each word to 8 characters, per the comment below. It will save you from having to revise your PR later:

Hi. As I am interested in the creation of all word lists (to a reasonable extent), not only the German one, let me express my thoughts here as well. I am glad to see that there are contributors willing to work on word lists. However, what bothers me is that whenever a person (a group of people) shows up, take(s) care of just one list. I.e. to be exact, what bothers me is the fact that for each new list very similar problems needs to be tackled. For example requirements - for languages with Latin alphabet the maximum word lenght should be 8, due to the limitations of the displays of hardware wallets. Or requirements that first 4 letters should uniquely define a word? Not too mention about requirements like the one related to Levenshtein distance. Can't such requirements be shared across many languages? Especially that once developed tools (to ease work with Levenshtein distance) could be reused. That is why I launched a separate repository just for the creation of word lists: https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of PR with Polish word list and I know how much time is needed to create such list from scratch. I just wanted to mention here there is also another approach possible. Thank you.

So apparently hardware wallets can only display up to 8 characters of a word. The rest won't be visible so there is a possibility for collision when using hardware wallets.

The Levenshtein distance between two words is the minimum number of single-character edits (substitutions, insertions or deletions) needed to transform the first word into the second. Make sure the distance between every pair of words is not too low; there isn't a defined minimum, but I would make it at least 2.
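
To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is only an illustration (the function name `levenshtein` is my own choice), not the code of any particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] = distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("abaixo", "baixo"))    # one deletion apart
print(levenshtein("anel", "anil"))       # one substitution apart
print(levenshtein("kitten", "sitting"))  # classic textbook pair
```

With a minimum distance of 2, pairs like the first two would have to be fixed by dropping one word of each pair.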

If everything goes well then judging by the opening and closing times of previous PRs, it should take about a month between opening the PR and getting it merged to the tree. Good luck!

bitmover (OP)
September 05, 2020, 02:00:30 AM
 #3

Thank you so much for your input.

I will take a closer look at the first-4-letters requirement (which we don't fully meet yet) and at this Levenshtein distance.

I will try to write or find a Levenshtein distance script in Python to check our list.

ABCbits
September 05, 2020, 11:29:57 AM
Last edit: September 07, 2020, 12:15:15 PM by ETFbitcoin
Merited by bitmover (3), Husna QA (1), NotATether (1), Coding Enthusiast (1)
 #4

Good initiative, I hope it will be accepted quickly :)

Quote
I will try to make or find a Levenshtein distance script in python to check our list.

I would recommend the Jellyfish library (https://pypi.org/project/jellyfish/), which has better performance (since it's a wrapper around a C library).

NotATether
September 06, 2020, 12:44:06 PM
Merited by Cyrus (2), bitmover (1), naufragus (1)
 #5

@bitmover

I am developing a Python program that evaluates Latin-alphabet wordlists for word length, the number of identical characters at the beginning of words, and the Levenshtein distance between every pair of words. I have not finished it yet, but I plan on putting the code on GitHub and PyPI soon. Hopefully it won't stall like some of my previous projects have.

If you are still looking for or writing a script for your task, keep doing that, since I don't have an ETA for when mine will be done. My program is relatively simple to code, though, and I hope I can finish it within a few days.

The idea is that future groups making wordlists can use it to validate their words, so it will still be useful even if your group has finished checking the Portuguese list.

bitmover (OP)
September 06, 2020, 12:55:05 PM
Last edit: September 06, 2020, 02:05:03 PM by bitmover
 #6

Hello NotATether and ETFbitcoin

Thanks for your suggestions

I think I will make a 2048x2048 matrix like this one. I think it is the best option.
https://stackoverflow.com/questions/47152344/how-to-calculate-levenshtein-ratio-distance-for-rows-in-my-column-in-python

In a matrix I can compare each of the 2048 words with every other word.

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

NotATether
September 06, 2020, 08:01:25 PM
Last edit: September 06, 2020, 10:44:33 PM by NotATether
 #7

Quote
NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Good question. I haven't written that part of the program yet but I want it to look something along the lines of this:

Code:
$wordlistvalidator -i wordlist.txt # Name subject to change
# Generic copyright notice here
<xxx> words read.

Performing Levenshtein distance test
Evaluating Levenshtein distances between <yyy> pairs of words...
-----
Pairs with Levenshtein distance 1:    # Omitted if there are no pairs with such distance
<word>, line <a>, and <word2>, line <b>   # EDIT: This is how I want all the words to be printed, with their lines, but I am editing on mobile right now so changing the other lines like this takes too long.
<word3> <word4>
...

Pairs with Levenshtein distance 2:    # And so on, up until and including a maximum configurable by command line argument.
...

# Or don't display this part of output if no pairs with distances up to that much are found

Finished performing Levenshtein distance test
Performing matching initial characters test
Comparing first <n> characters between <yyy> pairs of words...
----
Pairs with matching first <n> identical characters:    # Omitted if there are no pairs with such identical characters
<word> <word2>
<word3> <word4>
...

No pairs found with matching first <n> identical characters  # Displayed if there are no such pairs
Finished matching initial characters test
Performing word length test
Checking length of <xxx> words...
----
Words longer than <m> characters:    # Omitted if no such words
<word>
<word2>
...
No words found longer than <m> characters  # Displayed if no such words
Finished word length test

<0/1/2/3>/3 tests passed

It is meant to be human-readable output that makes problems easy to find and address. I'm not designing the output to be parsed by a second program, but I'll add an API to compensate for that.


Update: the backend for the unique initial characters test and the word length test is complete, still working on the Levenshtein distance test
Update 2: finished the Levenshtein distance test, working on the front-end and command-line argument parser

bitmover (OP)
September 08, 2020, 04:46:19 AM
Merited by fillippone (2), Coding Enthusiast (2), ABCbits (1), Husna QA (1), HCP (1), Heisenberg_Hunter (1)
 #8

@ETFBitcoin and @NotATether

I wanna share this with you both. I did it thanks to your suggestions.

Following ETFBitcoin's library suggestion (which is pretty fast btw), I wrote code that generated this matrix (we had to delete some words and now we only have 2005).



As you can see, this matrix shows the Levenshtein distance of each of the 2005 words to every other word, computed in a loop. Word 0 compared to word 0 is the same word, so its distance is 0. Word 1 compared to word 1 is also 0, and so on, which is why you can see a diagonal line of zeros running to the end of the last row.

I was able to identify all the values where distance was 1 and I generated a dictionary with those coordinates in the matrix:
Code:
{1: 164, 4: 182, 16: 1521, 23: 516, 31: 567, 32: 33, 33: 32, 35: 67, 39: 677, 51: 1305, 57: 126, 60: 51, 67: 35, 75: 78, 76: 1261, 78: 75, 83: 1655, 103: 107, 104: 140, 105: 106, 106: 105, 107: 103, 117: 1376, 126: 57, 128: 690, 140: 1928, 148: 176, 158: 178, 161: 1767, 164: 1, 166: 1910, 169: 1914, 175: 181, 176: 148, 178: 158, 181: 175, 182: 4, 183: 1019, 187: 221, 188: 205, 190: 697, 194: 234, 195: 1681, 200: 708, 205: 188, 220: 730, 221: 187, 228: 247, 231: 236, 233: 1610, 234: 194, 235: 236, 236: 235, 238: 228, 244: 750, 245: 1617, 247: 228, 252: 255, 254: 1869, 255: 252, 266: 471, 270: 292, 272: 1237, 274: 672, 280: 1102, 281: 1642, 283: 678, 284: 286, 286: 1528, 287: 280, 292: 270, 313: 1135, 314: 318, 315: 1538, 317: 1373, 318: 347, 321: 1014, 329: 1384, 338: 1928, 340: 341, 341: 340, 343: 703, 345: 348, 346: 621, 347: 318, 348: 345, 371: 886, 382: 1598, 395: 408, 399: 976, 402: 737, 403: 405, 405: 403, 408: 1202, 426: 446, 427: 1833, 432: 438, 433: 1214, 438: 432, 439: 433, 443: 1844, 446: 426, 461: 338, 466: 1497, 471: 266, 474: 795, 493: 1573, 516: 23, 517: 1590, 523: 1429, 538: 1976, 539: 550, 540: 544, 542: 538, 544: 1076, 549: 1082, 550: 539, 554: 1279, 557: 857, 567: 31, 568: 726, 613: 659, 619: 1585, 621: 346, 627: 635, 635: 627, 659: 1947, 672: 274, 677: 39, 678: 283, 687: 705, 690: 128, 691: 725, 694: 1027, 696: 1677, 697: 703, 699: 1394, 702: 710, 703: 705, 705: 703, 708: 760, 710: 1811, 715: 1957, 724: 715, 725: 691, 726: 568, 730: 220, 733: 1095, 735: 1073, 737: 402, 740: 1451, 742: 1207, 746: 1214, 747: 1838, 748: 742, 750: 244, 758: 761, 759: 1573, 760: 708, 761: 1862, 764: 759, 765: 1615, 768: 769, 769: 768, 770: 773, 771: 991, 773: 770, 782: 51, 785: 1002, 790: 792, 792: 790, 794: 1923, 795: 785, 798: 802, 801: 1681, 802: 798, 803: 1799, 825: 1196, 827: 1450, 834: 1614, 839: 1851, 842: 1468, 846: 1321, 853: 1878, 857: 557, 859: 1090, 882: 1458, 886: 371, 896: 1092, 912: 915, 915: 912, 919: 945, 945: 919, 955: 1717, 958: 971, 961: 1361, 969: 
1381, 971: 958, 973: 1393, 976: 399, 978: 1073, 979: 978, 987: 989, 989: 987, 991: 771, 993: 1334, 997: 1102, 1002: 785, 1010: 997, 1014: 321, 1015: 1025, 1016: 1143, 1018: 1379, 1019: 183, 1025: 1015, 1027: 694, 1028: 1053, 1031: 1155, 1036: 1801, 1038: 1573, 1042: 1414, 1044: 1038, 1046: 1042, 1048: 1065, 1051: 1055, 1053: 1028, 1054: 1070, 1055: 1051, 1057: 1307, 1058: 1962, 1063: 1069, 1065: 1066, 1066: 1065, 1069: 1600, 1070: 1054, 1073: 978, 1074: 1716, 1075: 1829, 1076: 1074, 1077: 1205, 1082: 549, 1087: 1738, 1090: 1094, 1092: 896, 1094: 1090, 1095: 733, 1098: 1112, 1102: 997, 1112: 1113, 1113: 1112, 1122: 1654, 1131: 1166, 1134: 1920, 1135: 313, 1143: 1016, 1147: 1381, 1154: 1928, 1155: 1031, 1166: 1131, 1169: 1154, 1177: 1184, 1184: 1177, 1195: 1607, 1196: 1200, 1200: 1196, 1202: 408, 1203: 1305, 1205: 1207, 1207: 1205, 1214: 746, 1221: 1147, 1225: 1624, 1231: 1225, 1237: 1248, 1248: 1381, 1252: 1385, 1261: 76, 1268: 1277, 1272: 1332, 1277: 1268, 1279: 554, 1305: 1203, 1307: 1057, 1321: 1339, 1332: 1272, 1334: 1844, 1339: 1321, 1346: 1360, 1358: 1775, 1360: 1346, 1361: 961, 1367: 1381, 1373: 317, 1374: 1413, 1376: 117, 1379: 1461, 1381: 1468, 1384: 329, 1385: 1252, 1392: 1395, 1393: 1413, 1394: 699, 1395: 1392, 1398: 1931, 1403: 1686, 1405: 1573, 1409: 1413, 1413: 1409, 1414: 1415, 1415: 1414, 1429: 523, 1448: 2001, 1450: 827, 1451: 1452, 1452: 1451, 1454: 1829, 1456: 1458, 1457: 1462, 1458: 1456, 1461: 1379, 1462: 1620, 1464: 1462, 1468: 1477, 1470: 1468, 1473: 1478, 1477: 1585, 1478: 1473, 1485: 1491, 1491: 1485, 1495: 1501, 1497: 466, 1501: 1495, 1519: 1527, 1521: 16, 1523: 1908, 1524: 1767, 1525: 1541, 1527: 1519, 1528: 286, 1529: 1771, 1538: 1541, 1541: 1538, 1548: 1568, 1551: 1574, 1556: 1559, 1559: 1673, 1564: 1574, 1568: 1548, 1571: 1801, 1573: 1405, 1574: 1564, 1585: 1811, 1590: 1600, 1596: 1600, 1598: 382, 1600: 1596, 1607: 1195, 1608: 1980, 1610: 233, 1612: 1829, 1614: 834, 1615: 765, 1617: 245, 1620: 1462, 1624: 1225, 1631: 1644, 1635: 1905, 
1637: 1638, 1638: 1637, 1642: 281, 1644: 1631, 1654: 1122, 1655: 83, 1659: 1662, 1662: 1659, 1667: 1681, 1668: 1713, 1673: 1559, 1677: 1681, 1679: 1795, 1681: 1700, 1686: 1403, 1687: 1804, 1695: 1946, 1700: 1681, 1713: 1668, 1715: 1730, 1716: 1721, 1717: 1736, 1721: 1716, 1727: 1730, 1730: 1727, 1736: 1717, 1738: 1754, 1749: 1738, 1754: 1738, 1767: 1524, 1768: 1775, 1769: 1910, 1771: 1769, 1775: 1768, 1788: 1795, 1790: 1798, 1791: 1799, 1793: 1928, 1795: 1788, 1798: 1790, 1799: 1791, 1801: 1571, 1804: 1687, 1811: 1585, 1822: 1851, 1824: 1872, 1829: 1612, 1833: 1824, 1834: 1840, 1838: 747, 1840: 1834, 1844: 1334, 1851: 1822, 1859: 1872, 1862: 1859, 1869: 254, 1872: 1859, 1878: 853, 1884: 1888, 1885: 1888, 1888: 1885, 1905: 1925, 1906: 1991, 1907: 1949, 1908: 1924, 1909: 1912, 1910: 1914, 1912: 1909, 1914: 1910, 1920: 1966, 1923: 1972, 1924: 1908, 1925: 1905, 1928: 1793, 1931: 1398, 1932: 1933, 1933: 1932, 1946: 1695, 1947: 659, 1949: 1966, 1953: 1970, 1955: 1970, 1957: 1959, 1959: 1957, 1962: 1058, 1966: 1949, 1970: 1955, 1972: 1923, 1976: 538, 1977: 2000, 1980: 1984, 1984: 1980, 1985: 1986, 1986: 1985, 1991: 1906, 2000: 1977, 2001: 1448}

Now it was easy. With the coordinates in the matrix, I just generated an array with all colliding pairs:
Code:
['abaixo - baixo',
 'abater - bater',
 'achar - rachar',
 'adiante - diante',
 'afetivo - efetivo',
 'aflito - afoito',
 'afoito - aflito',
 'agora - amora',
 'agulha - fagulha',
 'alho - olho',
 'altitude - atitude',
 'alvo - alho',
 'amora - agora',
 'anel - anil',
 'anexo - nexo',
 'anil - anel',
 'anta - santa',
 'arca - arma',
 'areia - aveia',
 'argila - argola',
 'argola - argila',
 'arma - arca',
 'assado - passado',
 'atitude - altitude',
 'ator - fator',
 'aveia - veia',
 'babado - barbado',
 'bagulho - barulho',
 'bainha - tainha',
 'baixo - abaixo',
 'bala - vala',
 'balsa - valsa',
 'barata - batata',
 'barbado - babado',
 'barulho - bagulho',
 'batata - barata',
 'bater - abater',
 'batido - latido',
 'beato - boato',
 'beco - bico',
 'beira - feira',
 'beliche - boliche',
 'belo - selo',
 'besta - festa',
 'bico - beco',
 'bloco - floco',
 'boato - beato',
 'bode - boxe',
 'boldo - bolso',
 'bolha - rolha',
 'boliche - beliche',
 'bolo - bolso',
 'bolso - bolo',
 'bonde - bode',
 'bossa - fossa',
 'botina - rotina',
 'boxe - bode',
 'briga - brita',
 'brincar - trincar',
 'brita - briga',
 'busto - custo',
 'cabelo - camelo',
 'cabo - nabo',
 'cabuloso - fabuloso',
 'cadeira - madeira',
 'caibro - saibro',
 'caixa - faixa',
 'cajado - calado',
 'calado - ralado',
 'caldeira - cadeira',
 'camelo - cabelo',
 'carinho - marinho',
 'carneiro - carteiro',
 'caro - raro',
 'carreira - parreira',
 'carteiro - certeiro',
 'casca - lasca',
 'causar - pausar',
 'ceia - veia',
 'cenoura - censura',
 'censura - cenoura',
 'cera - fera',
 'cereja - cerveja',
 'cerrado - errado',
 'certeiro - carteiro',
 'cerveja - cereja',
 'cidade - idade',
 'cisco - risco',
 'coceira - coleira',
 'coelho - joelho',
 'coice - foice',
 'coifa - coisa',
 'coisa - coifa',
 'coleira - moleira',
 'copeiro - coveiro',
 'copo - topo',
 'corja - coruja',
 'corno - morno',
 'coruja - corja',
 'corvo - corno',
 'couro - touro',
 'coveiro - copeiro',
 'cuia - ceia',
 'cunhado - punhado',
 'custo - busto',
 'data - gata',
 'dente - rente',
 'diante - adiante',
 'dica - rica',
 'dinheiro - pinheiro',
 'doador - voador',
 'dobrado - dourado',
 'doca - dona',
 'domador - doador',
 'dona - lona',
 'dotado - lotado',
 'dourado - dobrado',
 'dublado - nublado',
 'dueto - gueto',
 'efetivo - afetivo',
 'eixo - fixo',
 'enxame - exame',
 'ereto - reto',
 'errado - cerrado',
 'escola - esmola',
 'esmola - escola',
 'exame - vexame',
 'fabuloso - cabuloso',
 'fagulha - agulha',
 'faixa - caixa',
 'farpa - ferpa',
 'fator - ator',
 'favela - fivela',
 'febre - lebre',
 'feio - seio',
 'feira - fera',
 'feixe - peixe',
 'feno - feto',
 'fera - ferpa',
 'ferpa - fera',
 'festa - fresta',
 'feto - teto',
 'figa - viga',
 'fita - figa',
 'fivela - favela',
 'fixo - eixo',
 'floco - bloco',
 'fluxo - luxo',
 'fogo - logo',
 'foice - coice',
 'folia - polia',
 'fonte - monte',
 'forno - morno',
 'forrar - torrar',
 'forte - fonte',
 'fossa - bossa',
 'freio - frevo',
 'frente - rente',
 'fresta - festa',
 'frevo - trevo',
 'fronte - frente',
 'frota - rota',
 'fundo - fungo',
 'fungo - fundo',
 'funil - fuzil',
 'furado - jurado',
 'fuzil - funil',
 'galho - alho',
 'gama - lama',
 'garoupa - garupa',
 'garupa - garoupa',
 'gasto - vasto',
 'gata - gama',
 'geada - gemada',
 'gelo - selo',
 'gemada - geada',
 'gemido - temido',
 'goela - moela',
 'goleiro - poleiro',
 'gosto - rosto',
 'gralha - tralha',
 'grato - prato',
 'grelha - orelha',
 'gruta - truta',
 'gueto - dueto',
 'gula - lula',
 'horta - porta',
 'idade - cidade',
 'ilustre - lustre',
 'incolor - indolor',
 'indolor - incolor',
 'inferno - inverno',
 'inverno - inferno',
 'isolado - solado',
 'jaca - jeca',
 'janela - panela',
 'jato - pato',
 'jeca - jaca',
 'jeito - peito',
 'joelho - coelho',
 'jogo - logo',
 'joio - jogo',
 'julho - junho',
 'junho - julho',
 'jurado - furado',
 'juro - ouro',
 'ladeira - madeira',
 'lama - gama',
 'lareira - ladeira',
 'lasca - casca',
 'laser - lazer',
 'lastro - mastro',
 'latente - patente',
 'latido - batido',
 'lazer - laser',
 'lebre - febre',
 'legado - ligado',
 'leigo - meigo',
 'lenda - tenda',
 'lente - rente',
 'lesado - pesado',
 'leste - lente',
 'levado - lesado',
 'liberal - literal',
 'licitar - limitar',
 'ligado - legado',
 'ligeiro - lixeiro',
 'limitar - licitar',
 'limpo - olimpo',
 'linda - vinda',
 'lisa - lixa',
 'literal - litoral',
 'litoral - literal',
 'lixa - rixa',
 'lixeiro - ligeiro',
 'logo - jogo',
 'loja - soja',
 'lombo - tombo',
 'lona - loja',
 'longe - monge',
 'lotado - dotado',
 'luar - suar',
 'lula - luva',
 'lustre - ilustre',
 'luva - lula',
 'luxo - fluxo',
 'machado - malhado',
 'madeira - ladeira',
 'malhado - malvado',
 'malvado - malhado',
 'mangue - sangue',
 'marcador - mercador',
 'margem - vargem',
 'marinho - carinho',
 'mastro - lastro',
 'mato - pato',
 'meia - veia',
 'meigo - leigo',
 'mercador - marcador',
 'mesa - meia',
 'miado - mimado',
 'mimado - miado',
 'moedor - roedor',
 'moela - mola',
 'mola - moela',
 'moleira - coleira',
 'molho - olho',
 'monge - monte',
 'monte - monge',
 'morno - forno',
 'moto - mato',
 'mugido - rugido',
 'munido - mugido',
 'nabo - nato',
 'nato - pato',
 'navio - pavio',
 'nexo - anexo',
 'noivo - novo',
 'nosso - osso',
 'novo - noivo',
 'nublado - dublado',
 'olho - molho',
 'olimpo - limpo',
 'orelha - ovelha',
 'osso - nosso',
 'ouro - touro',
 'ovelha - orelha',
 'padeiro - pandeiro',
 'pampa - tampa',
 'pandeiro - padeiro',
 'panela - janela',
 'papo - pato',
 'parreira - carreira',
 'parto - perto',
 'passado - assado',
 'patente - potente',
 'pato - prato',
 'pausar - causar',
 'pavio - navio',
 'pegada - pelada',
 'peito - perto',
 'peixe - feixe',
 'pelada - pegada',
 'peludo - veludo',
 'penhor - senhor',
 'pente - rente',
 'perito - perto',
 'perto - perito',
 'pesado - pescado',
 'pescado - pesado',
 'pinheiro - dinheiro',
 'poeira - zoeira',
 'poleiro - goleiro',
 'polia - polpa',
 'polpa - polia',
 'pombo - tombo',
 'ponta - porta',
 'porco - pouco',
 'porta - ponta',
 'potente - patente',
 'pouco - rouco',
 'pouso - pouco',
 'prato - preto',
 'prazo - prato',
 'pregar - prezar',
 'preto - reto',
 'prezar - pregar',
 'profeta - proveta',
 'proveta - profeta',
 'pular - puxar',
 'punhado - cunhado',
 'puxar - pular',
 'rabada - rajada',
 'rachar - achar',
 'raiar - vaiar',
 'rainha - tainha',
 'raio - raso',
 'rajada - rabada',
 'ralado - calado',
 'ralo - talo',
 'raro - raso',
 'raso - raro',
 'reator - reitor',
 'recente - repente',
 'redator - redutor',
 'redutor - sedutor',
 'regente - repente',
 'reitor - reator',
 'renda - tenda',
 'rente - pente',
 'repente - regente',
 'reto - teto',
 'rica - rixa',
 'ripa - rixa',
 'risco - cisco',
 'rixa - ripa',
 'roedor - moedor',
 'rolante - volante',
 'rolha - bolha',
 'rombo - tombo',
 'rosto - gosto',
 'rota - frota',
 'rotina - botina',
 'rouco - pouco',
 'rugido - mugido',
 'sacada - salada',
 'sadio - vadio',
 'safira - safra',
 'safra - safira',
 'saibro - caibro',
 'salada - sacada',
 'sangue - mangue',
 'santa - anta',
 'sarda - sarna',
 'sarna - sarda',
 'sebo - selo',
 'secar - socar',
 'sedutor - redutor',
 'seio - selo',
 'selar - telar',
 'selo - silo',
 'senhor - penhor',
 'sentar - tentar',
 'setor - vetor',
 'silo - selo',
 'socar - secar',
 'sogro - soro',
 'soja - soma',
 'solado - sovado',
 'soma - soja',
 'sono - soro',
 'soro - sono',
 'sovado - solado',
 'suar - suor',
 'sujar - suar',
 'suor - suar',
 'tainha - rainha',
 'taipa - tampa',
 'tala - vala',
 'talo - tala',
 'tampa - taipa',
 'tear - telar',
 'tecer - temer',
 'tecido - temido',
 'teia - veia',
 'telar - tear',
 'temer - tecer',
 'temido - tecido',
 'tenda - renda',
 'tentar - sentar',
 'teto - reto',
 'toalha - tralha',
 'toco - troco',
 'tombo - rombo',
 'topo - toco',
 'tora - tosa',
 'torrar - forrar',
 'tosa - tora',
 'touro - ouro',
 'tralha - toalha',
 'treco - troco',
 'trevo - treco',
 'trincar - brincar',
 'troco - treco',
 'truta - gruta',
 'turbo - turvo',
 'turco - turvo',
 'turvo - turco',
 'vadio - vazio',
 'vaga - zaga',
 'vagem - viagem',
 'vaiar - vazar',
 'vaidade - validade',
 'vala - valsa',
 'validade - vaidade',
 'valsa - vala',
 'vargem - virgem',
 'vasto - visto',
 'vazar - vaiar',
 'vazio - vadio',
 'veia - teia',
 'veludo - peludo',
 'vencedor - vendedor',
 'vendedor - vencedor',
 'vetor - setor',
 'vexame - exame',
 'viagem - virgem',
 'videira - viseira',
 'vieira - viseira',
 'viga - vigia',
 'vigia - viga',
 'vinda - linda',
 'virgem - viagem',
 'viseira - vieira',
 'visto - vasto',
 'voador - doador',
 'voar - zoar',
 'volante - votante',
 'votante - volante',
 'vulgo - vulto',
 'vulto - vulgo',
 'zaga - vaga',
 'zoar - voar',
 'zoeira - poeira']

The results are not as bad as they look. Every pair appears twice: just as "abaixo" is at distance 1 from "baixo", "baixo" is at distance 1 from "abaixo". So everything is doubled here.
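
A minimal sketch of how the doubled entries could be collapsed, assuming the results are strings of the form 'a - b' as in the list above. `unique_pairs` is a hypothetical helper name, not part of the script that produced the list:

```python
def unique_pairs(reported):
    """Collapse 'a - b' and 'b - a' entries into one sorted tuple per pair."""
    seen = set()
    for entry in reported:
        left, right = entry.split(" - ")
        # Sorting the two words makes ('raso', 'raro') and ('raro', 'raso') identical.
        seen.add(tuple(sorted((left, right))))
    return sorted(seen)


reported = ["pular - puxar", "puxar - pular", "raro - raso", "raso - raro", "rota - frota"]
for pair in unique_pairs(reported):
    print(pair)
```

Pairs that were reported in only one direction (such as 'rota - frota') are kept once, so no information is lost.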

I learned a lot along the way. I never thought that generating those words was going to get so complicated. We will go back to work on the local board to generate more words, as I deleted a lot just now.

I hope this approach can help your library/program, NotATether.

Thank you both again.

Coding Enthusiast
Legendary
*
Offline Offline

Activity: 1039
Merit: 2783


Bitcoin and C♯ Enthusiast


View Profile WWW
September 08, 2020, 06:47:42 AM
Merited by bitmover (4), ABCbits (2), fillippone (2), Heisenberg_Hunter (1)
 #9

I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).
Other languages don't seem to be any better. Here are some examples (not an exhaustive list):
Italian: the first word in the list is "abaco", which has the similar word "baco"; likewise "sino" and "asino".
French seems better, but it has words with distance=2 like "apaiser", "abaisser".
Spanish has "bono", "abono" and "abrazo", "brazo".
Czech has words with distance=2 like "abeceda", "beseda" and "adresa", "agrese".
Japanese has "あいさつ", "かいさつ" and "あきる", "あける".
Korean has "가격", "간격".
The Chinese results don't make much sense, but since the last three use more complicated scripts I'm not sure the Levenshtein distance is even meaningful for them.
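
For readers who want to reproduce this kind of check, here is a minimal Python sketch. It is not the Bitcoin.Net implementation (which is C#); `levenshtein` and `close_pairs` are hypothetical names, and the five English words are the ones quoted above:

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def close_pairs(words, max_distance=1):
    """Return each unordered pair once, together with its distance."""
    result = []
    for a, b in combinations(words, 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            result.append((a, b, d))
    return result


print(close_pairs(["able", "cable", "table", "unable", "viable"]))
```

Using `combinations` visits every unordered pair exactly once, so results are not doubled. For a 2048-word list that is about two million comparisons, which is still quick for words this short.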

Projects List+Suggestion box
Donate: 1Q9s or bc1q
|
|
|
FinderOuter(0.19.1)Ann-git
Denovo(0.7.0)Ann-git
Bitcoin.Net(0.26.0)Ann-git
|
|
|
BitcoinTransactionTool(0.11.0)Ann-git
WatchOnlyBitcoinWallet(3.2.1)Ann-git
SharpPusher(0.12.0)Ann-git
Coding Enthusiast
September 08, 2020, 11:12:59 AM
Merited by ABCbits (1)
 #10

Quote
Interesting find. What do you think about Jaro similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and I think it'd be easier to determine the threshold.

  • The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
  • Damerau–Levenshtein adds transposition to the above 3.
  • Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
  • Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are at the beginning of the string (i.e. AAAXYZ and AAAUVW score as more similar than XZYAAA and UVWAAA).

I believe the results would pretty much lead to the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.

PS. You can run the Jaro-Winkler algorithm on sharplab; just change s1 and s2 in the Main() method.
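
The 0.866 figure can be checked with a short Python sketch of Jaro and Jaro-Winkler. This is an illustrative version, not the sharplab code mentioned above: the function names are hypothetical, the prefix weight 0.1 and the 4-character prefix cap are the commonly used defaults, and Winkler's usual "only boost above 0.7" threshold is left out for brevity.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in halves.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro plus a bonus for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro_winkler("cable", "table"), 3))
```

Because "cable" and "table" share no prefix, the Winkler bonus is zero and the score is the plain Jaro value: four matching characters out of five give (4/5 + 4/5 + 4/4) / 3.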

bitmover (OP)
September 08, 2020, 11:13:45 AM
Last edit: September 08, 2020, 03:36:35 PM by bitmover
 #11

Quote
I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).

This is a great find.
Allowing distance 2 is certainly OK, since forbidding it would be too restrictive, but I didn't know whether a distance of 1 would be acceptable.

Looking carefully at https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md I found that only the French list addresses Levenshtein distance:

Quote
French
10. No very similar words with 1 letter of difference.
https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#french


This is the same as requiring Levenshtein distance > 1.

What do you think, Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task, but it is certainly doable.

Coding Enthusiast
September 08, 2020, 11:40:33 AM
 #12

Quote
What do you think, Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task, but it is certainly doable.

I think you should try to avoid distance-1 pairs if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, bad handwriting alone could cause confusion between the letters 'c' and 't' in "cable" and "table", and having several of these mistakes in a mnemonic could potentially make recovery impossible.

NotATether
Legendary
*
Offline Offline

Activity: 1596
Merit: 6728


bitcoincleanup.com / bitmixlist.org


View Profile WWW
September 08, 2020, 10:24:29 PM
Merited by fillippone (2), ABCbits (1), Coding Enthusiast (1)
 #13

Quote
Interesting find. What do you think about Jaro similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and I think it'd be easier to determine the threshold.
  • The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
  • Damerau–Levenshtein adds transposition to the above 3.
  • Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
  • Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results would pretty much lead to the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.

I am against using Jaro-Winkler similarity for measuring distances because it is tainted by weighting earlier characters more heavily. I feel like it tries to take on the task of both measuring distance and counting unique initial characters, but it is not effective at either, because it just adds the distance metric and a heavily scaled-down initial-character measurement together. IMHO adding two metrics together just ruins the measurement.

Jaro similarity is a little better: it just measures distance. But I notice that character swaps weigh less on the metric than the presence of characters unique to either word. That makes sense if you are feeding a program input that contains several similar words, but swaps of adjacent characters and deletions are the most common mistakes people make when copying from a wordlist. Plus, a fractional score doesn't lend itself well to quantifying how many character replacements you need to get from one score to another, at least for the human brain. I can't just say, "for a Jaro score of 0.7 I need to change <x> characters to make it 0.6".

For typing, there are also typos made not by swapping but by hitting an adjacent key on the QWERTY keyboard. Such a typo is just a substitution, which can be modeled as a delete/insert pair; additional typos can replace one of the deletes and one of the inserts with a swap*. There are probably production algorithms that take the proximity of neighboring QWERTY keys into account when measuring similarity, penalizing insert/delete pairs of nearby keys more harshly than distant ones; search engines can detect typos, after all. And I think that should be the goal when making a wordlist: to filter out as many opportunities for typos as possible. The guidelines for similarity checking were only created because a small group can't be expected to build such a sophisticated checker :)

*That's why I think counting a swap as one point instead of the two points of an insert/delete pair is bad for distance measurements, as we are falsely making the word pair look more unique. Hence my argument against using Damerau-Levenshtein.
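
To make the footnote concrete, here is a small sketch that computes both scores with one function. It uses the restricted ("optimal string alignment") variant of Damerau-Levenshtein; the function name, the `transpositions` flag and the example pair "parto"/"prato" are illustrative choices, not taken from any of the programs discussed in this thread:

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True an adjacent swap costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # The last two characters are swapped: count that as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("parto", "prato"))                       # plain Levenshtein
print(edit_distance("parto", "prato", transpositions=True))  # swap-aware
```

For this pair the plain distance is 2 and the swap-aware distance is 1, so a "distance > 1" rule would keep the pair under Levenshtein but flag it under the swap-aware metric.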

So for simpletons like us, none of the alternative algorithms are good for our needs, and Levenshtein is our best measuring ruler that is not complex to implement.

Quote
What do you think, Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task, but it is certainly doable.

A Levenshtein distance of one means a single edit separates the two words. In the substitution case, like "fish" --> "fist", there is a risk of the word being written incorrectly, as Coding Enthusiast mentioned. Insertions and deletions are harder to get wrong, though, because a user has to subconsciously type an extra character or omit one. So if you absolutely must use distance-1 word pairs, then use ones with an extra or missing letter. It is unlikely, however, that users will get two substituted characters wrong.
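
A rough sketch of how distance-1 pairs could be split into the two kinds described above. `classify_single_edit` is a hypothetical helper, and the example words come from the Portuguese pair list earlier in the thread:

```python
def classify_single_edit(a, b):
    """Return 'substitution', 'insertion/deletion', or None if not one edit apart."""
    if a == b:
        return None
    if len(a) == len(b):
        # Same length: one edit apart only if exactly one position differs.
        differing = sum(x != y for x, y in zip(a, b))
        return "substitution" if differing == 1 else None
    if abs(len(a) - len(b)) != 1:
        return None
    shorter, longer = sorted((a, b), key=len)
    # One edit apart if dropping a single character from the longer word
    # yields the shorter one.
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            return "insertion/deletion"
    return None


for pair in [("fish", "fist"), ("rota", "frota"), ("safira", "safra")]:
    print(pair, classify_single_edit(*pair))
```

With such a helper, a list author who cannot remove every distance-1 pair could at least prioritize removing the substitution pairs.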

P.S. As for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

bitmover (OP)
September 09, 2020, 12:07:50 AM
 #14

Quote
What do you think, Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task, but it is certainly doable.
I think you should try to avoid distance-1 pairs if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, bad handwriting alone could cause confusion between the letters 'c' and 't' in "cable" and "table", and having several of these mistakes in a mnemonic could potentially make recovery impossible.

Thanks for your suggestion. We decided to keep that restriction (removing all pairs with Levenshtein distance = 1). Our wordlist is going to be the one with the most restrictive rules.
The French wordlist followed the same distance rule; however, they didn't worry about repeating words from other lists like we did.

https://github.com/bitcoin/bips/pull/152#issuecomment-412618598


I hope our list will be accepted quickly. We did nice work.

Quote
P.S. As for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

Share the code with us when you are done. There are still other languages that need a wordlist, and your program may also be used in other projects that we don't know of yet.

NotATether
September 12, 2020, 05:21:12 PM
 #15

There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Second, the Spanish wordlist (https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt) contains words with accents, but that wordlist's rules say:

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?
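
One standard-library option is `unicodedata`: decompose each character into its base letter plus combining marks, then drop the marks. The sketch below is only one possible answer to the question; `strip_accents` and `is_plain_latin` are hypothetical names, and the "a" to "z" test mirrors the check described above:

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks, so 'á' becomes 'a' and 'ñ' becomes 'n'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_plain_latin(word):
    """True if every character is a-z once accents are removed."""
    return all("a" <= ch <= "z" for ch in strip_accents(word))


for word in ["ábaco", "ñandú", "pingüino"]:
    print(word, "->", strip_accents(word), is_plain_latin(word))
```

This needs no hand-written table, but it only handles letters that Unicode decomposes into a base letter plus marks; a letter such as "ø" has no such decomposition and would pass through unchanged.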

bitmover (OP)
September 12, 2020, 08:18:07 PM
Merited by NotATether (3), ABCbits (2)
 #16

Quote
There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is basic and elegant to submit your list sorted. In no way would I submit my list in a random order, unless there were a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

Quote

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Personally, I think that accepting words with accents is a big mistake. I would reject it straight away, and I wouldn't use that word list.

"ábaco" and "abaco" are different words, and that could lead to problems in some software.

Portuguese list won't have words with special characters.

Quote

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?

I made a dictionary with all special characters and replaced the Spanish special characters using that dictionary. Then I checked for words common to my list and the Spanish one. I can share the code if you wish.

My dictionary is here
https://bitcointalk.org/index.php?topic=5272106.msg55131643#msg55131643

NotATether
September 12, 2020, 10:45:15 PM
 #17

Quote
Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is basic and elegant to submit your list sorted. In no way would I submit my list in a random order, unless there were a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

I designed my program to have as little user interaction as possible, because it's easier to show a kind of report card with the status of each test and where exactly a test failed, so you can immediately go to that part of the file and fix it. In simple terms, my program lets you control which tests to enable from the command line, prints progress and status messages as it performs each test, and tells you which tests passed and failed.

I am not comfortable with modifying the wordlist file in place, because my code could have bugs that mess up the wordlist. I've made that mistake too many times in other projects and I don't want to take any chances here. So I think I will just print a warning if it detects the list isn't sorted.
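
A read-only check along those lines might look like the sketch below. It is not the validator's actual code: the function names and the warning text are made up for illustration, and words are compared by plain code-point order.

```python
def first_unsorted_index(words):
    """Return the index of the first word that sorts before its predecessor, or None."""
    for i in range(1, len(words)):
        if words[i - 1] > words[i]:
            return i
    return None


def sort_warning(words):
    """Build a warning message without touching the list itself."""
    i = first_unsorted_index(words)
    if i is None:
        return None
    return f"warning: list is not sorted at line {i + 1}: {words[i]!r} follows {words[i - 1]!r}"


print(sort_warning(["raro", "raio", "raso"]))
```

Reporting the first offending line keeps the tool read-only while still telling the author exactly where to look.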

Quote
I made a dictionary with all special characters and replaced the Spanish special characters using that dictionary. Then I checked for words common to my list and the Spanish one. I can share the code if you wish.

My dictionary is here
https://bitcointalk.org/index.php?topic=5272106.msg55131643#msg55131643

Great job, your dictionary is a good start. To utilize it I could write something like this:

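Code:
# bitmover_table: the accent -> plain-letter dictionary linked above
chars = list(word)  # str is immutable, so edit a list of characters instead
for i, char in enumerate(chars):
  # keep the character unchanged when it has no entry in the table
  chars[i] = bitmover_table.get(char, char)
word = "".join(chars)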

So the letters in your dictionary, are they only from the Spanish alphabet, or did you include letters used in other European languages? In the long term, I want to avoid bugs where a new wordlist contains characters that aren't in the table and those pass through without being de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead: it checks whether the character's Unicode name contains "WITH" (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then looks up the plain character from the remaining name.

As a side effect, this removes diacritics not only from Latin characters but also from non-Latin characters such as Greek and Arabic, as long as the Unicode name contains the word WITH; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable, but I don't need such support for this project. This Stack Overflow code will be very handy if I ever work on another project that needs this capability.
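
The idea can be sketched with the standard `unicodedata` module. This is an illustration of the approach described above, not a copy of the linked Stack Overflow answer; `strip_with` and `deaccent` are hypothetical names:

```python
import unicodedata


def strip_with(char):
    """Map a character to its base form by cutting ' WITH ...' from its Unicode name."""
    try:
        name = unicodedata.name(char)
    except ValueError:
        return char  # unnamed character (e.g. a control code): leave it alone
    base_name, separator, _ = name.partition(" WITH ")
    if not separator:
        return char  # the name has no WITH part, so there is nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return char  # no character carries the shortened name


def deaccent(word):
    return "".join(strip_with(ch) for ch in word)


print(deaccent("ábaco"), deaccent("ά"))
```

One caveat: this works on precomposed characters only. If a file stores "á" as "a" followed by a separate combining accent, the combining mark has no WITH in its name and is left in place.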

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

Coding Enthusiast
September 13, 2020, 03:09:01 AM
Merited by ABCbits (1)
 #18

Quote
Sorting is not required

Sometimes implementations want the option to perform a binary search on their word list, which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`
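
As a small illustration of that point, Python's standard `bisect` module can look a word up in a sorted list in about 11 comparisons for 2048 entries. `word_index` is a hypothetical helper and the five words are just sample data:

```python
from bisect import bisect_left


def word_index(sorted_words, word):
    """Return the position of word in a sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return -1


words = ["raio", "raro", "raso", "reator", "reitor"]  # must already be sorted
print(word_index(words, "raso"), word_index(words, "rato"))
```

On an unsorted list the same call can silently miss words that are present, which is why a sortedness check in a validator is useful.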

NotATether
September 13, 2020, 10:56:56 AM
 #19

Quote
Sometimes implementations want the option to perform a binary search on their word list, which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`

Good catch, and it reminds me of the exact number of words that should be in a wordlist. I will also implement a check which validates that there are exactly 2048 words in the list.
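
Such a check is short to sketch. The count of 2048 is 2^11, since each mnemonic word encodes 11 bits; the duplicate test is an extra assumption on top of what is described above, and `size_problems` is a hypothetical name:

```python
from collections import Counter

EXPECTED_WORDS = 2048  # 2**11: each mnemonic word encodes 11 bits


def size_problems(words):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if len(words) != EXPECTED_WORDS:
        problems.append(f"expected {EXPECTED_WORDS} words, found {len(words)}")
    duplicates = sorted(w for w, n in Counter(words).items() if n > 1)
    if duplicates:
        problems.append("duplicate words: " + ", ".join(duplicates))
    return problems


print(size_problems(["raio", "raro", "raro"]))
```

Checking for duplicates alongside the count matters because a list can contain 2048 lines and still have fewer than 2048 distinct words.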

bitmover (OP)
September 14, 2020, 12:24:00 AM
 #20

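Quote
Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:
# bitmover_table: the accent -> plain-letter dictionary linked above
chars = list(word)  # str is immutable, so edit a list of characters instead
for i, char in enumerate(chars):
  # keep the character unchanged when it has no entry in the table
  chars[i] = bitmover_table.get(char, char)
word = "".join(chars)

So the letters in your dictionary, are they only from the Spanish alphabet, or did you include letters used in other European languages?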

You can add more letters if you think it is necessary; I am not sure this dictionary covers all possibilities.
About your code: I don't like to use loops unless strictly necessary. Loops are computationally costly and make your code slow.

I did this in my code:
Code:
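import pandas as pd
accent_dict = {...}  # placeholder for the accent -> plain-letter dictionary linked earlier
# one word per line, no header row
spanish = pd.read_csv('spanish.txt', header=None)
# with regex=True, each dictionary key is replaced wherever it occurs inside a word
spanish = spanish.replace(accent_dict, regex=True)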

The code is cleaner:
one line and faster processing instead of a loop.
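
For comparison, the same table-driven replacement can be done without pandas using `str.translate`. This is only a sketch: the seven-character table below is a small illustrative subset, not the dictionary linked earlier in the thread, and `deaccent_all` is a hypothetical name.

```python
# Illustrative subset only; a real table would need every accented letter in the list.
ACCENT_TABLE = str.maketrans("áéíóúüñ", "aeiouun")


def deaccent_all(words):
    """Apply the translation table to every word; unlisted characters are kept."""
    return [word.translate(ACCENT_TABLE) for word in words]


print(deaccent_all(["ábaco", "pingüino", "niño"]))
```

`str.translate` does the per-character lookup inside the interpreter, so there is no explicit character loop to write, and characters missing from the table simply pass through unchanged.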

Quote
In the long term, I want to avoid bugs where a new wordlist contains characters that aren't in the table and those pass through without being de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead: it checks whether the character's Unicode name contains "WITH" (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then looks up the plain character from the remaining name.

As a side effect, this removes diacritics not only from Latin characters but also from non-Latin characters such as Greek and Arabic, as long as the Unicode name contains the word WITH; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable, but I don't need such support for this project. This Stack Overflow code will be very handy if I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

In my opinion, a dictionary is a better approach than doing it in code.
