[MERGED] BIP-39 List of words in Portuguese accepted!!

Re: BIP-39 List of words in Portuguese ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 05, 2020, 12:23:17 AM

Merited by DarkStar_ (10), ABCbits (6), bitmover (4), TryNinja (3), Husna QA (1), Heisenberg_Hunter (1), Coding Enthusiast (1)

I really like your initiative. I was waiting until I got to a PC to type this.

Make sure all the words in your list are words that people have heard of. Words not familiar to most people in your locale should be avoided. In some of the previous PRs for other wordlists, there were such words inside. Here's an example of this in the French wordlist.

Also, I'd recommend limiting the maximum length of each word to 8, as the comment below suggests. It will save you the time of having to revise your PR:

Quote from: https://github.com/bitcoin/bips/pull/942#issuecomment-663078429

Hi. As I am interested in the creation of all word lists (to a reasonable extent), not only the German one, let me express my thoughts here as well. I am glad to see that there are contributors willing to work on word lists. However, what bothers me is that whenever a person (a group of people) shows up, take(s) care of just one list. I.e. to be exact, what bothers me is the fact that for each new list very similar problems needs to be tackled. For example requirements - for languages with Latin alphabet the maximum word lenght should be 8, due to the limitations of the displays of hardware wallets. Or requirements that first 4 letters should uniquely define a word? Not too mention about requirements like the one related to Levenshtein distance. Can't such requirements be shared across many languages? Especially that once developed tools (to ease work with Levenshtein distance) could be reused. That is why I launched a separate repository just for the creation of word lists: https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of PR with Polish word list and I know how much time is needed to create such list from scratch. I just wanted to mention here there is also another approach possible. Thank you.

So apparently hardware wallets can only display up to 8 characters of a word. The rest won't be visible so there is a possibility for collision when using hardware wallets.
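The two display-related rules from the quoted comment (at most 8 characters, first 4 letters unique) can be checked mechanically. Here is a minimal sketch of such a check. The function names and the English sample words are made up for illustration and do not come from any existing tool.

```python
from collections import defaultdict

MAX_LEN = 8      # display limit mentioned in the quoted comment
PREFIX_LEN = 4   # "first 4 letters should uniquely define a word"


def too_long(words, max_len=MAX_LEN):
    """Return the words that would not fit on the display."""
    return [w for w in words if len(w) > max_len]


def prefix_collisions(words, prefix_len=PREFIX_LEN):
    """Group words that share their first `prefix_len` letters."""
    groups = defaultdict(list)
    for w in words:
        groups[w[:prefix_len]].append(w)
    # Keep only the prefixes claimed by more than one word.
    return {p: ws for p, ws in groups.items() if len(ws) > 1}


# Illustrative sample; the English words are invented for the demo.
sample = ["abaixo", "carneiro", "carteiro", "marker", "market", "understanding"]
print(too_long(sample))
print(prefix_collisions(sample))
```

Note that "carneiro" and "carteiro" pass the prefix test ("carn" vs "cart") even though they differ by a single letter, which is why the distance check below is a separate rule.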

The Levenshtein distance between two words is the number of characters you need to alter, add or remove to transform the first word into the second. Make sure the distance between any two words is not too low. There isn't a defined minimum, but I would make it at least 2.
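For anyone who wants to see the definition in code, here is a small pure-Python sketch of the textbook dynamic-programming approach. The function name is my own choice, and this is not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("fish", "fist"))
print(levenshtein("kitten", "sitting"))
```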

If everything goes well then judging by the opening and closing times of previous PRs, it should take about a month between opening the PR and getting it merged to the tree. Good luck!

⇾ Re: BIP-39 List of words in Portuguese ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 05, 2020, 02:00:30 AM

Quote from: NotATether on September 05, 2020, 12:23:17 AM

Quote from: https://github.com/bitcoin/bips/pull/942#issuecomment-663078429

Thank you so much for your input.

I will take a closer look at the first-4-letters requirement (which is not 100% met yet) and at this Levenshtein distance.

I will try to write or find a Levenshtein distance script in Python to check our list.


Re: BIP-39 List of words in Portuguese ready for submission

ABCbits

Legendary

Offline

Activity: 2968
Merit: 7780

Crypto Swap Exchange

September 05, 2020, 11:29:57 AM
Last edit: September 07, 2020, 12:15:15 PM by ETFbitcoin

Merited by bitmover (3), Husna QA (1), NotATether (1), Coding Enthusiast (1)

Good initiative, I hope it will be accepted quickly.

Quote from: bitmover on September 05, 2020, 02:00:30 AM

I will try to make or find a Levenshtein distance script in python to check our list.

I would recommend the Jellyfish library (https://pypi.org/project/jellyfish/), which has better performance (since it's a wrapper around a C library).
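A minimal usage sketch, assuming the package is installed with pip: `levenshtein_distance` is the function jellyfish exposes for this. The pure-Python fallback is my own addition, included only so the snippet still runs where jellyfish is missing.

```python
try:
    # Provided by the jellyfish package (pip install jellyfish).
    from jellyfish import levenshtein_distance
except ImportError:
    from functools import lru_cache

    def levenshtein_distance(a, b):
        """Slow pure-Python fallback so the snippet runs without jellyfish."""
        @lru_cache(maxsize=None)
        def d(i, j):
            # d(i, j) is the distance between a[:i] and b[:j].
            if i == 0 or j == 0:
                return i + j
            return min(
                d(i - 1, j) + 1,                           # deletion
                d(i, j - 1) + 1,                           # insertion
                d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution
            )
        return d(len(a), len(b))


print(levenshtein_distance("cadeira", "madeira"))
```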

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

Re: BIP-39 List of words in Portuguese ready for submission

September 06, 2020, 12:44:06 PM

Merited by Cyrus (2), bitmover (1), naufragus (1)

@bitmover

I am developing a Python program that evaluates Latin-alphabet wordlists for word length, the number of identical characters at the beginning of words, and the Levenshtein distance between every pair of words. I have not finished it yet, but I plan on putting the code on GitHub and PyPI soon. Hopefully it won't stall like some of my previous projects have.

If you are still looking for or writing a script for your task, keep doing that since I don't have an ETA for when it will be done, though my program is relatively simple and not complicated to code. I hope I can finish it within a few days.

The idea is that future groups making wordlists can use it to validate their words, so it will still be useful even if your group has finished checking the Portuguese list.

Re: BIP-39 List of words in Portuguese ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 06, 2020, 12:55:05 PM
Last edit: September 06, 2020, 02:05:03 PM by bitmover

Hello NotATether and ETFbitcoin

Thanks for your suggestions

I think I will make a 2048x2048 matrix like this one. I think it is the best option.
https://stackoverflow.com/questions/47152344/how-to-calculate-levenshtein-ratio-distance-for-rows-in-my-column-in-python

In a matrix I can compare each of the 2048 words with every other word.
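A rough sketch of what such a matrix could look like in plain Python, with no pandas or NumPy. The helper names are hypothetical. Since the distance is symmetric, only the upper triangle needs computing, which for 2048 words is 2048 × 2047 / 2 = 2,096,128 pairs.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def distance_matrix(words):
    """Return an n x n list of lists with matrix[i][j] = distance(words[i], words[j])."""
    n = len(words)
    matrix = [[0] * n for _ in range(n)]  # the diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):         # upper triangle only
            d = levenshtein(words[i], words[j])
            matrix[i][j] = matrix[j][i] = d  # mirror into the lower triangle
    return matrix


words = ["abaixo", "baixo", "bico", "beco"]
for row in distance_matrix(words):
    print(row)
```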

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Re: BIP-39 List of words in Portuguese ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 06, 2020, 08:01:25 PM
Last edit: September 06, 2020, 10:44:33 PM by NotATether

Quote from: bitmover on September 06, 2020, 12:55:05 PM

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Good question. I haven't written that part of the program yet but I want it to look something along the lines of this:

Code:

$wordlistvalidator -i wordlist.txt # Name subject to change
# Generic copyright notice here
<xxx> words read.

Performing Levenshtein distance test
Evaluating Levenshtein distances between <yyy> pairs of words...
-----
Pairs with Levenshtein distance 1:    # Omitted if there are no pairs with such distance
<word>, line <a>, and <word2>, line <b>   # EDIT: This is how I want all the words to be printed, with their lines, but I am editing on mobile right now so changing the other lines like this takes too long.
<word3> <word4>
...

Pairs with Levenshtein distance 2:    # And so on, up until and including a maximum configurable by command line argument.
...

# Or don't display this part of output if no pairs with distances up to that much are found

Finished performing Levenshtein distance test
Performing matching initial characters test
Comparing first <n> characters between <yyy> pairs of words...
----
Pairs with matching first <n> identical characters:    # Omitted if there are no pairs with such identical characters
<word> <word2>
<word3> <word4>
...

No pairs found with matching first <n> identical characters  # Displayed if there are no such pairs
Finished matching initial characters test
Performing word length test
Checking length of <xxx> words...
----
Words longer than <m> characters:    # Omitted if no such words
<word>
<word2>
...
No words found longer than <m> characters  # Displayed if no such words
Finished word length test

<0/1/2/3>/3 tests passed

It is meant to be human-readable output, so that problems are easy to find and address. I'm not designing the output to be parsed by a second program, but I'll add an API to compensate for that.

Update: the backends for the unique-initial-characters and word-length tests are complete; still working on the Levenshtein distance test.
Update 2: finished the Levenshtein distance test; working on the front-end and the command-line argument parser.

Re: BIP-39 List of words in Portuguese nearly ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 08, 2020, 04:46:19 AM

Merited by fillippone (2), Coding Enthusiast (2), ABCbits (1), Husna QA (1), HCP (1), Heisenberg_Hunter (1)

@ETFBitcoin and @NotATether

I wanna share this with you both. I did it thanks to your suggestions.

Following ETFbitcoin's library suggestion (which is pretty fast, btw), I wrote code that generated this matrix (we had to delete some words, and now we only have 2005).

As you can see, this matrix shows the Levenshtein distance of all 2005 words against each other, computed in a loop. Word 0 compared to word 0 is the same word, so its distance is 0. Word 1 compared to word 1 is also 0, and so on, so you can see a diagonal line of zeros running to the end of the last row.

I was able to identify all the values where distance was 1 and I generated a dictionary with those coordinates in the matrix:

Code:

{1: 164, 4: 182, 16: 1521, 23: 516, 31: 567, 32: 33, 33: 32, 35: 67, 39: 677, 51: 1305, 57: 126, 60: 51, 67: 35, 75: 78, 76: 1261, 78: 75, 83: 1655, 103: 107, 104: 140, 105: 106, 106: 105, 107: 103, 117: 1376, 126: 57, 128: 690, 140: 1928, 148: 176, 158: 178, 161: 1767, 164: 1, 166: 1910, 169: 1914, 175: 181, 176: 148, 178: 158, 181: 175, 182: 4, 183: 1019, 187: 221, 188: 205, 190: 697, 194: 234, 195: 1681, 200: 708, 205: 188, 220: 730, 221: 187, 228: 247, 231: 236, 233: 1610, 234: 194, 235: 236, 236: 235, 238: 228, 244: 750, 245: 1617, 247: 228, 252: 255, 254: 1869, 255: 252, 266: 471, 270: 292, 272: 1237, 274: 672, 280: 1102, 281: 1642, 283: 678, 284: 286, 286: 1528, 287: 280, 292: 270, 313: 1135, 314: 318, 315: 1538, 317: 1373, 318: 347, 321: 1014, 329: 1384, 338: 1928, 340: 341, 341: 340, 343: 703, 345: 348, 346: 621, 347: 318, 348: 345, 371: 886, 382: 1598, 395: 408, 399: 976, 402: 737, 403: 405, 405: 403, 408: 1202, 426: 446, 427: 1833, 432: 438, 433: 1214, 438: 432, 439: 433, 443: 1844, 446: 426, 461: 338, 466: 1497, 471: 266, 474: 795, 493: 1573, 516: 23, 517: 1590, 523: 1429, 538: 1976, 539: 550, 540: 544, 542: 538, 544: 1076, 549: 1082, 550: 539, 554: 1279, 557: 857, 567: 31, 568: 726, 613: 659, 619: 1585, 621: 346, 627: 635, 635: 627, 659: 1947, 672: 274, 677: 39, 678: 283, 687: 705, 690: 128, 691: 725, 694: 1027, 696: 1677, 697: 703, 699: 1394, 702: 710, 703: 705, 705: 703, 708: 760, 710: 1811, 715: 1957, 724: 715, 725: 691, 726: 568, 730: 220, 733: 1095, 735: 1073, 737: 402, 740: 1451, 742: 1207, 746: 1214, 747: 1838, 748: 742, 750: 244, 758: 761, 759: 1573, 760: 708, 761: 1862, 764: 759, 765: 1615, 768: 769, 769: 768, 770: 773, 771: 991, 773: 770, 782: 51, 785: 1002, 790: 792, 792: 790, 794: 1923, 795: 785, 798: 802, 801: 1681, 802: 798, 803: 1799, 825: 1196, 827: 1450, 834: 1614, 839: 1851, 842: 1468, 846: 1321, 853: 1878, 857: 557, 859: 1090, 882: 1458, 886: 371, 896: 1092, 912: 915, 915: 912, 919: 945, 945: 919, 955: 1717, 958: 971, 961: 1361, 969: 
1381, 971: 958, 973: 1393, 976: 399, 978: 1073, 979: 978, 987: 989, 989: 987, 991: 771, 993: 1334, 997: 1102, 1002: 785, 1010: 997, 1014: 321, 1015: 1025, 1016: 1143, 1018: 1379, 1019: 183, 1025: 1015, 1027: 694, 1028: 1053, 1031: 1155, 1036: 1801, 1038: 1573, 1042: 1414, 1044: 1038, 1046: 1042, 1048: 1065, 1051: 1055, 1053: 1028, 1054: 1070, 1055: 1051, 1057: 1307, 1058: 1962, 1063: 1069, 1065: 1066, 1066: 1065, 1069: 1600, 1070: 1054, 1073: 978, 1074: 1716, 1075: 1829, 1076: 1074, 1077: 1205, 1082: 549, 1087: 1738, 1090: 1094, 1092: 896, 1094: 1090, 1095: 733, 1098: 1112, 1102: 997, 1112: 1113, 1113: 1112, 1122: 1654, 1131: 1166, 1134: 1920, 1135: 313, 1143: 1016, 1147: 1381, 1154: 1928, 1155: 1031, 1166: 1131, 1169: 1154, 1177: 1184, 1184: 1177, 1195: 1607, 1196: 1200, 1200: 1196, 1202: 408, 1203: 1305, 1205: 1207, 1207: 1205, 1214: 746, 1221: 1147, 1225: 1624, 1231: 1225, 1237: 1248, 1248: 1381, 1252: 1385, 1261: 76, 1268: 1277, 1272: 1332, 1277: 1268, 1279: 554, 1305: 1203, 1307: 1057, 1321: 1339, 1332: 1272, 1334: 1844, 1339: 1321, 1346: 1360, 1358: 1775, 1360: 1346, 1361: 961, 1367: 1381, 1373: 317, 1374: 1413, 1376: 117, 1379: 1461, 1381: 1468, 1384: 329, 1385: 1252, 1392: 1395, 1393: 1413, 1394: 699, 1395: 1392, 1398: 1931, 1403: 1686, 1405: 1573, 1409: 1413, 1413: 1409, 1414: 1415, 1415: 1414, 1429: 523, 1448: 2001, 1450: 827, 1451: 1452, 1452: 1451, 1454: 1829, 1456: 1458, 1457: 1462, 1458: 1456, 1461: 1379, 1462: 1620, 1464: 1462, 1468: 1477, 1470: 1468, 1473: 1478, 1477: 1585, 1478: 1473, 1485: 1491, 1491: 1485, 1495: 1501, 1497: 466, 1501: 1495, 1519: 1527, 1521: 16, 1523: 1908, 1524: 1767, 1525: 1541, 1527: 1519, 1528: 286, 1529: 1771, 1538: 1541, 1541: 1538, 1548: 1568, 1551: 1574, 1556: 1559, 1559: 1673, 1564: 1574, 1568: 1548, 1571: 1801, 1573: 1405, 1574: 1564, 1585: 1811, 1590: 1600, 1596: 1600, 1598: 382, 1600: 1596, 1607: 1195, 1608: 1980, 1610: 233, 1612: 1829, 1614: 834, 1615: 765, 1617: 245, 1620: 1462, 1624: 1225, 1631: 1644, 1635: 1905, 
1637: 1638, 1638: 1637, 1642: 281, 1644: 1631, 1654: 1122, 1655: 83, 1659: 1662, 1662: 1659, 1667: 1681, 1668: 1713, 1673: 1559, 1677: 1681, 1679: 1795, 1681: 1700, 1686: 1403, 1687: 1804, 1695: 1946, 1700: 1681, 1713: 1668, 1715: 1730, 1716: 1721, 1717: 1736, 1721: 1716, 1727: 1730, 1730: 1727, 1736: 1717, 1738: 1754, 1749: 1738, 1754: 1738, 1767: 1524, 1768: 1775, 1769: 1910, 1771: 1769, 1775: 1768, 1788: 1795, 1790: 1798, 1791: 1799, 1793: 1928, 1795: 1788, 1798: 1790, 1799: 1791, 1801: 1571, 1804: 1687, 1811: 1585, 1822: 1851, 1824: 1872, 1829: 1612, 1833: 1824, 1834: 1840, 1838: 747, 1840: 1834, 1844: 1334, 1851: 1822, 1859: 1872, 1862: 1859, 1869: 254, 1872: 1859, 1878: 853, 1884: 1888, 1885: 1888, 1888: 1885, 1905: 1925, 1906: 1991, 1907: 1949, 1908: 1924, 1909: 1912, 1910: 1914, 1912: 1909, 1914: 1910, 1920: 1966, 1923: 1972, 1924: 1908, 1925: 1905, 1928: 1793, 1931: 1398, 1932: 1933, 1933: 1932, 1946: 1695, 1947: 659, 1949: 1966, 1953: 1970, 1955: 1970, 1957: 1959, 1959: 1957, 1962: 1058, 1966: 1949, 1970: 1955, 1972: 1923, 1976: 538, 1977: 2000, 1980: 1984, 1984: 1980, 1985: 1986, 1986: 1985, 1991: 1906, 2000: 1977, 2001: 1448}
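One caveat worth flagging: a plain dictionary keeps a single value per key, so a word with several distance-1 neighbours would show only one of them in a mapping like the one above. A hedged sketch of collecting every partner per row follows. The function name is hypothetical, and the tiny matrix was filled in by hand for five words from the list below.

```python
from collections import defaultdict


def close_pairs(matrix, distance=1):
    """Map each row index to ALL column indices at the given distance."""
    partners = defaultdict(list)
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            if i != j and matrix[i][j] == distance:
                partners[i].append(j)
    return dict(partners)


words = ["alho", "olho", "alvo", "galho", "bico"]
# Hand-filled Levenshtein distances for the five words above.
matrix = [
    [0, 1, 1, 1, 3],
    [1, 0, 2, 2, 3],
    [1, 2, 0, 2, 3],
    [1, 2, 2, 0, 4],
    [3, 3, 3, 4, 0],
]
for i, js in close_pairs(matrix).items():
    print(words[i], "->", [words[j] for j in js])
```

Here "alho" has three neighbours at distance 1, which a one-value-per-key dictionary would collapse to one.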

Now it was easy. With the coordinates in the matrix, I just generated an array with all colliding pairs:

Code:

['abaixo - baixo',
 'abater - bater',
 'achar - rachar',
 'adiante - diante',
 'afetivo - efetivo',
 'aflito - afoito',
 'afoito - aflito',
 'agora - amora',
 'agulha - fagulha',
 'alho - olho',
 'altitude - atitude',
 'alvo - alho',
 'amora - agora',
 'anel - anil',
 'anexo - nexo',
 'anil - anel',
 'anta - santa',
 'arca - arma',
 'areia - aveia',
 'argila - argola',
 'argola - argila',
 'arma - arca',
 'assado - passado',
 'atitude - altitude',
 'ator - fator',
 'aveia - veia',
 'babado - barbado',
 'bagulho - barulho',
 'bainha - tainha',
 'baixo - abaixo',
 'bala - vala',
 'balsa - valsa',
 'barata - batata',
 'barbado - babado',
 'barulho - bagulho',
 'batata - barata',
 'bater - abater',
 'batido - latido',
 'beato - boato',
 'beco - bico',
 'beira - feira',
 'beliche - boliche',
 'belo - selo',
 'besta - festa',
 'bico - beco',
 'bloco - floco',
 'boato - beato',
 'bode - boxe',
 'boldo - bolso',
 'bolha - rolha',
 'boliche - beliche',
 'bolo - bolso',
 'bolso - bolo',
 'bonde - bode',
 'bossa - fossa',
 'botina - rotina',
 'boxe - bode',
 'briga - brita',
 'brincar - trincar',
 'brita - briga',
 'busto - custo',
 'cabelo - camelo',
 'cabo - nabo',
 'cabuloso - fabuloso',
 'cadeira - madeira',
 'caibro - saibro',
 'caixa - faixa',
 'cajado - calado',
 'calado - ralado',
 'caldeira - cadeira',
 'camelo - cabelo',
 'carinho - marinho',
 'carneiro - carteiro',
 'caro - raro',
 'carreira - parreira',
 'carteiro - certeiro',
 'casca - lasca',
 'causar - pausar',
 'ceia - veia',
 'cenoura - censura',
 'censura - cenoura',
 'cera - fera',
 'cereja - cerveja',
 'cerrado - errado',
 'certeiro - carteiro',
 'cerveja - cereja',
 'cidade - idade',
 'cisco - risco',
 'coceira - coleira',
 'coelho - joelho',
 'coice - foice',
 'coifa - coisa',
 'coisa - coifa',
 'coleira - moleira',
 'copeiro - coveiro',
 'copo - topo',
 'corja - coruja',
 'corno - morno',
 'coruja - corja',
 'corvo - corno',
 'couro - touro',
 'coveiro - copeiro',
 'cuia - ceia',
 'cunhado - punhado',
 'custo - busto',
 'data - gata',
 'dente - rente',
 'diante - adiante',
 'dica - rica',
 'dinheiro - pinheiro',
 'doador - voador',
 'dobrado - dourado',
 'doca - dona',
 'domador - doador',
 'dona - lona',
 'dotado - lotado',
 'dourado - dobrado',
 'dublado - nublado',
 'dueto - gueto',
 'efetivo - afetivo',
 'eixo - fixo',
 'enxame - exame',
 'ereto - reto',
 'errado - cerrado',
 'escola - esmola',
 'esmola - escola',
 'exame - vexame',
 'fabuloso - cabuloso',
 'fagulha - agulha',
 'faixa - caixa',
 'farpa - ferpa',
 'fator - ator',
 'favela - fivela',
 'febre - lebre',
 'feio - seio',
 'feira - fera',
 'feixe - peixe',
 'feno - feto',
 'fera - ferpa',
 'ferpa - fera',
 'festa - fresta',
 'feto - teto',
 'figa - viga',
 'fita - figa',
 'fivela - favela',
 'fixo - eixo',
 'floco - bloco',
 'fluxo - luxo',
 'fogo - logo',
 'foice - coice',
 'folia - polia',
 'fonte - monte',
 'forno - morno',
 'forrar - torrar',
 'forte - fonte',
 'fossa - bossa',
 'freio - frevo',
 'frente - rente',
 'fresta - festa',
 'frevo - trevo',
 'fronte - frente',
 'frota - rota',
 'fundo - fungo',
 'fungo - fundo',
 'funil - fuzil',
 'furado - jurado',
 'fuzil - funil',
 'galho - alho',
 'gama - lama',
 'garoupa - garupa',
 'garupa - garoupa',
 'gasto - vasto',
 'gata - gama',
 'geada - gemada',
 'gelo - selo',
 'gemada - geada',
 'gemido - temido',
 'goela - moela',
 'goleiro - poleiro',
 'gosto - rosto',
 'gralha - tralha',
 'grato - prato',
 'grelha - orelha',
 'gruta - truta',
 'gueto - dueto',
 'gula - lula',
 'horta - porta',
 'idade - cidade',
 'ilustre - lustre',
 'incolor - indolor',
 'indolor - incolor',
 'inferno - inverno',
 'inverno - inferno',
 'isolado - solado',
 'jaca - jeca',
 'janela - panela',
 'jato - pato',
 'jeca - jaca',
 'jeito - peito',
 'joelho - coelho',
 'jogo - logo',
 'joio - jogo',
 'julho - junho',
 'junho - julho',
 'jurado - furado',
 'juro - ouro',
 'ladeira - madeira',
 'lama - gama',
 'lareira - ladeira',
 'lasca - casca',
 'laser - lazer',
 'lastro - mastro',
 'latente - patente',
 'latido - batido',
 'lazer - laser',
 'lebre - febre',
 'legado - ligado',
 'leigo - meigo',
 'lenda - tenda',
 'lente - rente',
 'lesado - pesado',
 'leste - lente',
 'levado - lesado',
 'liberal - literal',
 'licitar - limitar',
 'ligado - legado',
 'ligeiro - lixeiro',
 'limitar - licitar',
 'limpo - olimpo',
 'linda - vinda',
 'lisa - lixa',
 'literal - litoral',
 'litoral - literal',
 'lixa - rixa',
 'lixeiro - ligeiro',
 'logo - jogo',
 'loja - soja',
 'lombo - tombo',
 'lona - loja',
 'longe - monge',
 'lotado - dotado',
 'luar - suar',
 'lula - luva',
 'lustre - ilustre',
 'luva - lula',
 'luxo - fluxo',
 'machado - malhado',
 'madeira - ladeira',
 'malhado - malvado',
 'malvado - malhado',
 'mangue - sangue',
 'marcador - mercador',
 'margem - vargem',
 'marinho - carinho',
 'mastro - lastro',
 'mato - pato',
 'meia - veia',
 'meigo - leigo',
 'mercador - marcador',
 'mesa - meia',
 'miado - mimado',
 'mimado - miado',
 'moedor - roedor',
 'moela - mola',
 'mola - moela',
 'moleira - coleira',
 'molho - olho',
 'monge - monte',
 'monte - monge',
 'morno - forno',
 'moto - mato',
 'mugido - rugido',
 'munido - mugido',
 'nabo - nato',
 'nato - pato',
 'navio - pavio',
 'nexo - anexo',
 'noivo - novo',
 'nosso - osso',
 'novo - noivo',
 'nublado - dublado',
 'olho - molho',
 'olimpo - limpo',
 'orelha - ovelha',
 'osso - nosso',
 'ouro - touro',
 'ovelha - orelha',
 'padeiro - pandeiro',
 'pampa - tampa',
 'pandeiro - padeiro',
 'panela - janela',
 'papo - pato',
 'parreira - carreira',
 'parto - perto',
 'passado - assado',
 'patente - potente',
 'pato - prato',
 'pausar - causar',
 'pavio - navio',
 'pegada - pelada',
 'peito - perto',
 'peixe - feixe',
 'pelada - pegada',
 'peludo - veludo',
 'penhor - senhor',
 'pente - rente',
 'perito - perto',
 'perto - perito',
 'pesado - pescado',
 'pescado - pesado',
 'pinheiro - dinheiro',
 'poeira - zoeira',
 'poleiro - goleiro',
 'polia - polpa',
 'polpa - polia',
 'pombo - tombo',
 'ponta - porta',
 'porco - pouco',
 'porta - ponta',
 'potente - patente',
 'pouco - rouco',
 'pouso - pouco',
 'prato - preto',
 'prazo - prato',
 'pregar - prezar',
 'preto - reto',
 'prezar - pregar',
 'profeta - proveta',
 'proveta - profeta',
 'pular - puxar',
 'punhado - cunhado',
 'puxar - pular',
 'rabada - rajada',
 'rachar - achar',
 'raiar - vaiar',
 'rainha - tainha',
 'raio - raso',
 'rajada - rabada',
 'ralado - calado',
 'ralo - talo',
 'raro - raso',
 'raso - raro',
 'reator - reitor',
 'recente - repente',
 'redator - redutor',
 'redutor - sedutor',
 'regente - repente',
 'reitor - reator',
 'renda - tenda',
 'rente - pente',
 'repente - regente',
 'reto - teto',
 'rica - rixa',
 'ripa - rixa',
 'risco - cisco',
 'rixa - ripa',
 'roedor - moedor',
 'rolante - volante',
 'rolha - bolha',
 'rombo - tombo',
 'rosto - gosto',
 'rota - frota',
 'rotina - botina',
 'rouco - pouco',
 'rugido - mugido',
 'sacada - salada',
 'sadio - vadio',
 'safira - safra',
 'safra - safira',
 'saibro - caibro',
 'salada - sacada',
 'sangue - mangue',
 'santa - anta',
 'sarda - sarna',
 'sarna - sarda',
 'sebo - selo',
 'secar - socar',
 'sedutor - redutor',
 'seio - selo',
 'selar - telar',
 'selo - silo',
 'senhor - penhor',
 'sentar - tentar',
 'setor - vetor',
 'silo - selo',
 'socar - secar',
 'sogro - soro',
 'soja - soma',
 'solado - sovado',
 'soma - soja',
 'sono - soro',
 'soro - sono',
 'sovado - solado',
 'suar - suor',
 'sujar - suar',
 'suor - suar',
 'tainha - rainha',
 'taipa - tampa',
 'tala - vala',
 'talo - tala',
 'tampa - taipa',
 'tear - telar',
 'tecer - temer',
 'tecido - temido',
 'teia - veia',
 'telar - tear',
 'temer - tecer',
 'temido - tecido',
 'tenda - renda',
 'tentar - sentar',
 'teto - reto',
 'toalha - tralha',
 'toco - troco',
 'tombo - rombo',
 'topo - toco',
 'tora - tosa',
 'torrar - forrar',
 'tosa - tora',
 'touro - ouro',
 'tralha - toalha',
 'treco - troco',
 'trevo - treco',
 'trincar - brincar',
 'troco - treco',
 'truta - gruta',
 'turbo - turvo',
 'turco - turvo',
 'turvo - turco',
 'vadio - vazio',
 'vaga - zaga',
 'vagem - viagem',
 'vaiar - vazar',
 'vaidade - validade',
 'vala - valsa',
 'validade - vaidade',
 'valsa - vala',
 'vargem - virgem',
 'vasto - visto',
 'vazar - vaiar',
 'vazio - vadio',
 'veia - teia',
 'veludo - peludo',
 'vencedor - vendedor',
 'vendedor - vencedor',
 'vetor - setor',
 'vexame - exame',
 'viagem - virgem',
 'videira - viseira',
 'vieira - viseira',
 'viga - vigia',
 'vigia - viga',
 'vinda - linda',
 'virgem - viagem',
 'viseira - vieira',
 'visto - vasto',
 'voador - doador',
 'voar - zoar',
 'volante - votante',
 'votante - volante',
 'vulgo - vulto',
 'vulto - vulgo',
 'zaga - vaga',
 'zoar - voar',
 'zoeira - poeira']

The results are not as bad as they look. They are doubled, because just as "abaixo" is at distance 1 from "baixo", "baixo" is also at distance 1 from "abaixo". So we will see everything doubled here.

I learned a lot along the way. I never thought that generating those words was going to get so complicated. We will go back to work in the local board generating more words, as I have deleted a lot now.

I hope this approach can help your library/program, NotATether.
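A small sketch of collapsing the doubled entries into unordered pairs, assuming the 'word - word' string format of the array above. The function name is my own.

```python
def unique_pairs(lines):
    """Collapse 'a - b' and 'b - a' into a single sorted (a, b) tuple."""
    seen = set()
    for line in lines:
        a, b = line.split(" - ")
        seen.add(tuple(sorted((a, b))))  # order inside the pair no longer matters
    return sorted(seen)


listed = [
    "abaixo - baixo",
    "baixo - abaixo",
    "aflito - afoito",
    "afoito - aflito",
    "alvo - alho",
]
for pair in unique_pairs(listed):
    print(pair)
```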

Thank you both again.

Re: BIP-39 List of words in Portuguese nearly ready for submission

Coding Enthusiast

Legendary

Offline

Activity: 1040
Merit: 2784

Bitcoin and C♯ Enthusiast

September 08, 2020, 06:47:42 AM

Merited by bitmover (4), ABCbits (2), fillippone (2), Heisenberg_Hunter (1)

I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).
Other languages don't seem to be any better. Here are just a few example words, not all of them:
The first word in the Italian list is "abaco", which has a similar one, "baco"; there is also "sino", "asino".
French seems better but it has words with distance=2 like "apaiser", "abaisser"
Spanish has "bono", "abono" and "abrazo", "brazo"
Czech has words with distance=2 like "abeceda", "beseda" and "adresa", "agrese"
Japanese has "あいさつ", "かいさつ" and "あきる", "あける"
Korean has "가격", "간격"
Chinese results don't make much sense but since the last 3 are complicated languages I'm not sure if the Levenshtein distance is even valid for them.
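The pairs quoted above can be re-checked with a few lines of Python. This is a sketch that is independent of Bitcoin.Net. It compares Unicode code points after NFKD normalization (the form BIP-39 specifies for mnemonics), so whether that is a meaningful unit of difference for Japanese or Korean remains an open question, as noted above.

```python
import unicodedata


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance over code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def nfkd_distance(a, b):
    """Distance after NFKD normalization (Hangul syllables split into jamo)."""
    return levenshtein(unicodedata.normalize("NFKD", a),
                       unicodedata.normalize("NFKD", b))


pairs = [
    ("able", "cable"), ("able", "table"), ("cable", "table"),
    ("able", "unable"), ("able", "viable"),
    ("abaco", "baco"), ("apaiser", "abaisser"), ("bono", "abono"),
    ("abeceda", "beseda"), ("あいさつ", "かいさつ"), ("가격", "간격"),
]
for a, b in pairs:
    print(a, b, nfkd_distance(a, b))
```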

Re: BIP-39 List of words in Portuguese nearly ready for submission

Coding Enthusiast

Legendary

Offline

Activity: 1040
Merit: 2784

Bitcoin and C♯ Enthusiast

September 08, 2020, 11:12:59 AM

Merited by ABCbits (1)

#10

Quote from: ETFbitcoin on September 08, 2020, 09:40:27 AM

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I believe the results would lead to much the same conclusions about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (a similarity score, where closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.

PS. You can run the Jaro-Winkler algorithm here on sharplab; just change s1 and s2 in the Main() method.
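For readers without the C# snippet at hand, here is a rough Python sketch of the standard Jaro and Jaro-Winkler similarity definitions, where 1.0 means identical strings. It is not the Bitcoin.Net or sharplab code. The scaling factor 0.1 and the 4-character prefix cap are the usual defaults, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # how far apart matches may be
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("cable", "table"), 3))
print(jaro_winkler("AAAXYZ", "AAAUVW") > jaro_winkler("XZYAAA", "UVWAAA"))
```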

Re: BIP-39 List of words in Portuguese nearly ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 08, 2020, 11:13:45 AM
Last edit: September 08, 2020, 03:36:35 PM by bitmover

#11

Quote from: Coding Enthusiast on September 08, 2020, 06:47:42 AM

This is a great find.
A distance of 2 is certainly OK, because forbidding it would be too restrictive, but I didn't know whether a distance of 1 would be acceptable.

Looking carefully at https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md, I found that only the French list is concerned with Levenshtein distance:

Quote

French
10. No very similar words with 1 letter of difference.

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#french

This is the same as requiring Levenshtein distance > 1.

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

Re: BIP-39 List of words in Portuguese nearly ready for submission

Coding Enthusiast

Legendary

Offline

Activity: 1040
Merit: 2784

Bitcoin and C♯ Enthusiast

September 08, 2020, 11:40:33 AM

#12

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

I think you should try to avoid distance 1 if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, bad handwriting alone could cause confusion between the letters 'c' and 't' in "cable" and "table", and several such mistakes in one mnemonic could potentially make recovery impossible.

Re: BIP-39 List of words in Portuguese nearly ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 08, 2020, 10:24:29 PM

Merited by fillippone (2), ABCbits (1), Coding Enthusiast (1)

#13

Quote from: Coding Enthusiast on September 08, 2020, 11:12:59 AM

Quote from: ETFbitcoin on September 08, 2020, 09:40:27 AM

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are from the beginning of the string (ie. AAAXYZ and AAAUVW are more similar than XZYAAA and UVWAAA).

I am against using Jaro-Winkler similarity for measuring distances because it is tainted by its weighting earlier characters more. I feel like it's trying to take on the task of both measuring distance and counting initial unique characters, but it is not effective at either, because it just adds the distance metric and a heavily scaled-down initial-character uniqueness measurement together. IMHO adding two metrics together just ruins the measurement.

Jaro similarity is a little better. It just measures distance, but I notice that character swaps have less weight in the metric than the presence of unique characters in either of the two words. That makes sense if you are feeding a program input that has several similar words in it, but swaps of adjacent characters and deletions are the most common mistakes people make when writing from a wordlist. Plus, a percentage-style metric doesn't lend itself well to quantifying the number of character replacements you need to get from one score to another, at least not for the human brain. I can't just say, "for a Jaro distance of 0.7 I need to change <x> characters to make it 0.6".

For typing, there are also typos made not by swapping but by hitting the adjacent key on the Qwerty keyboard. A typo is just a substitution, which can be modeled by a deletion and insertion pair; additional typos can replace one of the deletes and one of the inserts with a swap*. There are probably production algorithms that take the proximity of neighboring Qwerty keys into account when measuring similarity, penalizing insert/delete pairs of nearby keyboard characters more harshly than distant ones; search engines can detect typos, after all. And I think that should be the goal when making a wordlist: to filter out as many opportunities to make typos as possible. The guidelines for similarity checking were only created because a small group can't be expected to make such a sophisticated checker.

*That's why I think counting swaps with one point instead of two points of insert/delete is bad for distance measurements, as we are falsely making the word pair look more unique. Hence my argument against using Damerau-Levenshtein.

So for simpletons like us, none of the alternative algorithms are good for our needs, and Levenshtein is our best measuring ruler that is not complex to implement.
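
For reference, plain Levenshtein fits in a few lines. This is a minimal sketch of the textbook dynamic-programming version (the function name is illustrative, not taken from any wordlist tool), with two sample calls showing how a substitution and an adjacent swap are scored:

```python
def levenshtein(a, b):
    """Classic Levenshtein distance: insert, delete and substitute cost 1 each."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("casa", "cama"))  # one substitution -> 1
print(levenshtein("form", "from"))  # adjacent swap, counted as two edits -> 2
```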

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

A Levenshtein distance of one means only one edit needs to be made, such as the substitution "fish" --> "fist", which carries the risk of being misspelt, like Coding Enthusiast mentioned. Insertions and deletions are harder to get wrong though, because a user has to subconsciously type an extra character or omit one. So if you absolutely must use distance-1 word pairs, then use ones with an extra or missing letter. It is unlikely that users will get 2 substituted characters wrong.
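
A minimal sketch of how a checker might tell the two kinds of distance-1 pairs apart. The function names are hypothetical, and the brute-force pairwise loop is an assumption that is only reasonable for lists of a few thousand words:

```python
from itertools import combinations


def one_edit_kind(a, b):
    """Return 'substitution' or 'indel' if a and b are one edit apart, else None."""
    if a == b:
        return None
    if len(a) == len(b):
        diffs = sum(x != y for x, y in zip(a, b))
        return "substitution" if diffs == 1 else None
    if abs(len(a) - len(b)) == 1:
        short, long_ = sorted((a, b), key=len)
        # Removing one letter from the longer word must give the shorter word.
        if any(long_[:i] + long_[i + 1:] == short for i in range(len(long_))):
            return "indel"
    return None


def distance_one_pairs(words):
    """Yield (word1, word2, kind) for every pair exactly one edit apart."""
    for a, b in combinations(words, 2):
        kind = one_edit_kind(a, b)
        if kind:
            yield a, b, kind
```

With this split, the "substitution" pairs can be rejected outright while the "indel" pairs are only reported for manual review.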

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

Re: BIP-39 List of words in Portuguese nearly ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 09, 2020, 12:07:50 AM

#14

Quote from: Coding Enthusiast on September 08, 2020, 11:40:33 AM

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

Thanks for your suggestion. We decided to keep that restriction (removing all pairs with Levenshtein distance = 1). Our wordlist is going to be the one with the strictest rules.
The French wordlist followed the Levenshtein distance = 1 rule; however, they didn't worry about repeating words from other lists like we did.

https://github.com/bitcoin/bips/pull/152#issuecomment-412618598

I hope our list will be accepted quickly. We did nice work.

Quote from: NotATether on September 08, 2020, 10:24:29 PM

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing, and then it should be ready for publishing.

Share the code with us when you are done. There are still other languages that need a wordlist, and your program may also be used in other projects that we don't know of yet.

Re: BIP-39 List of words in Portuguese nearly ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 12, 2020, 05:21:12 PM

#15

There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Second, the Spanish wordlist (https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt) contains words with accents in them, but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?
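
One standard-library option is `unicodedata`. The sketch below assumes that Unicode NFD decomposition is an acceptable definition of "non-accented form"; the helper names are made up for illustration. It strips accents, re-applies the a-to-z test, and checks the first-4-letters rule on the unaccented words:

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_plain_word(word):
    """True if the word consists only of a-z once accents are removed."""
    return all("a" <= ch <= "z" for ch in strip_accents(word))


def ambiguous_prefixes(words, length=4):
    """Group words whose first `length` unaccented letters coincide."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_accents(word)[:length]].append(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}
```

Note that letters such as "ø" or "ł" have no Unicode decomposition, so this approach leaves them unchanged and `is_plain_word` would reject them.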

Re: BIP-39 List of words in Portuguese nearly ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 12, 2020, 08:18:07 PM

Merited by NotATether (3), ABCbits (2)

#16

Quote from: NotATether on September 12, 2020, 05:21:12 PM

Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is basic and elegant to submit your list sorted. There is no way I would submit my list in a random order, unless there was a reason to do so.

I think you could implement a prompt like "Your word list is not sorted. Would you like to sort it now?"

Quote

Second, the Spanish wordlist (https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt) contains words with accents in them, but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Personally, I think that accepting words with accents is a big mistake. I would reject it straight away, and I wouldn't use that word list.

Because "ábaco" and "abaco" are different words, and that could lead to problems in some software.

The Portuguese list won't have words with special characters.

Quote

I made a dictionary with all special characters and replaced the Spanish special characters using that dictionary. Then I checked for words common to my list and the Spanish one. I can share the code if you wish.

My dictionary is here
https://bitcointalk.org/index.php?topic=5272106.msg55131643#msg55131643
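
The replace-then-compare step described above could look roughly like this. It is only a sketch: `accent_map` is a tiny illustrative mapping rather than the linked dictionary, and the sample words are made up instead of being taken from the real wordlists.

```python
# Tiny illustrative mapping; the dictionary linked above is larger.
accent_map = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ü": "u", "ñ": "n"}
table = str.maketrans(accent_map)


def normalise(words):
    """Return the set of words with accented letters replaced."""
    return {w.translate(table) for w in words}


def common_words(list_a, list_b):
    """Words that appear in both lists after accent replacement, sorted."""
    return sorted(normalise(list_a) & normalise(list_b))


# Made-up sample data, not the actual wordlists.
portuguese_sample = ["abacate", "animal", "banco"]
spanish_sample = ["ábaco", "animal", "árbol", "banco"]
print(common_words(portuguese_sample, spanish_sample))  # ['animal', 'banco']
```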

Re: BIP-39 List of words in Portuguese nearly ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 12, 2020, 10:45:15 PM

#17

Quote from: bitmover on September 12, 2020, 08:18:07 PM

I designed my program to have as little user interaction as possible, because it's easier to show a sort of report card with the status of each test and the exact place where a test failed, so you can immediately go to that part of the file and fix it. In simple terms, my program lets you control which tests to enable from the command line, prints progress and status messages as it performs each test, and tells you which tests passed and which failed.

I am not comfortable with modifying the wordlist file in-place because my code could have bugs that mistakenly mess up the wordlist. I've made that mistake too many times in other projects and I don't want to take any chances here. So I think I will just print a warning if it detects the list isn't sorted.
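
A warning-only check along these lines might look like the sketch below. The function names and message format are hypothetical; the wordlist is read but never written back.

```python
def first_unsorted_index(words):
    """Index of the first word that sorts before its predecessor, or None."""
    for i in range(1, len(words)):
        if words[i] < words[i - 1]:
            return i
    return None


def check_sorted(words):
    """Report only, never modify: return a warning string, or None if sorted."""
    i = first_unsorted_index(words)
    if i is None:
        return None
    return (f"warning: line {i + 1} ({words[i]!r}) sorts before "
            f"line {i} ({words[i - 1]!r})")
```

The comparison here is Python's default code-point ordering, so accented letters sort after "z"; a list sorted by locale rules could trigger the warning.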

Quote from: bitmover on September 12, 2020, 08:18:07 PM

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

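chars = list(word)  # str is immutable, so work on a list of characters
for i in range(len(chars)):
  # use the unaccented form if the table has one, else keep the character
  chars[i] = bitmover_table.get(chars[i], chars[i])
word = "".join(chars)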

So the letters in your dictionary, are they only from the Spanish alphabet or did you include letters used in other European languages? In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't in the dictionary, and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name contains "WITH" (WITH ACCENT, WITH CIRCUMFLEX, etc.), removes that part from the name, and then does a reverse lookup of the (normal) character from the shortened name.

As a side effect, this not only removes diacritics from Latin characters, it also removes them from non-Latin characters like Greek and Arabic, as long as the Unicode name has the word WITH in it; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable, but I don't need such support for this project. This Stack Overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.
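
The Unicode-name approach described above can be sketched as follows. This illustrates the technique with the standard `unicodedata` module and is not necessarily identical to the linked Stack Overflow code; the function names are illustrative.

```python
import unicodedata


def remove_diacritic(char):
    """Return the base character by cutting ' WITH ...' from the Unicode name."""
    name = unicodedata.name(char, "")  # "" for characters without a name
    cutoff = name.find(" WITH ")
    if cutoff == -1:
        return char  # nothing to strip
    try:
        return unicodedata.lookup(name[:cutoff])
    except KeyError:
        return char  # the shortened name is not a real character


def remove_diacritics(word):
    """Apply remove_diacritic to every character of a word."""
    return "".join(remove_diacritic(ch) for ch in word)
```

Because it works on names, this also maps "ø" (LATIN SMALL LETTER O WITH STROKE) to "o", which decomposition-based normalisation does not.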

Re: BIP-39 List of words in Portuguese nearly ready for submission

Coding Enthusiast

Legendary

Offline

Activity: 1040
Merit: 2784

Bitcoin and C♯ Enthusiast

September 13, 2020, 03:09:01 AM

Merited by ABCbits (1)

#18

Quote from: bitmover on September 12, 2020, 08:18:07 PM

Sorting is not required

Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`
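
As a small illustration of why sorting matters for lookups, here is a sketch using Python's `bisect` module. The function name and the four-word sample list are made up; a real list would have 2048 entries.

```python
from bisect import bisect_left


def word_index(wordlist, word):
    """Binary-search a sorted wordlist; return the index, or -1 if absent."""
    i = bisect_left(wordlist, word)
    if i < len(wordlist) and wordlist[i] == word:
        return i
    return -1


sample = ["abaco", "banco", "cama", "dedo"]  # must already be sorted
print(word_index(sample, "cama"))   # 2
print(word_index(sample, "zebra"))  # -1
```

On an unsorted list the same call can miss words that are present, which is why the search only works when the sort order is guaranteed.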

Re: BIP-39 List of words in Portuguese nearly ready for submission

NotATether

Legendary

Offline

Activity: 1694
Merit: 7155

In memory of o_e_l_e_o

September 13, 2020, 10:56:56 AM

#19

Quote from: Coding Enthusiast on September 13, 2020, 03:09:01 AM

Sometimes the implementations want the option to perform a binary search on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`

Good catch, reminding me of the exact number of words that should be in a wordlist. I will also implement a check which validates that there are exactly 2048 words in the list.
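
Such a count check could be as small as the sketch below. The names are hypothetical, and the duplicate test is an extra assumption added because repeated words would make the count misleading.

```python
EXPECTED_WORDS = 2048  # 2**11, so each word encodes 11 bits


def check_word_count(words):
    """Return a list of problem descriptions (empty if the list passes)."""
    problems = []
    if len(words) != EXPECTED_WORDS:
        problems.append(f"expected {EXPECTED_WORDS} words, found {len(words)}")
    seen = set()
    for line_no, word in enumerate(words, start=1):
        if word in seen:
            problems.append(f"line {line_no}: duplicate word {word!r}")
        seen.add(word)
    return problems
```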

Re: BIP-39 List of words in Portuguese nearly ready for submission

bitmover (OP)

Legendary

Offline

Activity: 2394
Merit: 6181

Crypto Swap Exchange🈺

September 14, 2020, 12:24:00 AM

#20

Quote from: NotATether on September 12, 2020, 10:45:15 PM

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

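chars = list(word)  # str is immutable, so work on a list of characters
for i in range(len(chars)):
  # use the unaccented form if the table has one, else keep the character
  chars[i] = bitmover_table.get(chars[i], chars[i])
word = "".join(chars)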

So the letters in your dictionary, are they only from the Spanish alphabet or did you include letters used in other European languages?

You can add more letters if you think it is necessary; I am not sure this dictionary covers all possibilities.
About your code, I don't like to use loops unless it is absolutely necessary. Loops are computationally costly and make your code slow.

I did this in my code:

Code:

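import pandas as pd
# accent -> plain-letter mapping (contents elided)
accent_dict = {...}
# one word per line, no header row
spanish = pd.read_csv('spanish.txt', header=None)
# regex=True makes replace() substitute inside each string, not just whole cells
spanish = spanish.replace(accent_dict, regex=True)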

The code will be cleaner.
One line and faster processing instead of a loop.

Quote

In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't in the dictionary, and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name contains "WITH" (WITH ACCENT, WITH CIRCUMFLEX, etc.), removes that part from the name, and then does a reverse lookup of the (normal) character from the shortened name.

As a side effect, this not only removes diacritics from Latin characters, it also removes them from non-Latin characters like Greek and Arabic, as long as the Unicode name has the word WITH in it; if it doesn't, the character is left alone. E.g.:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable, but I don't need such support for this project. This Stack Overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

In my opinion, a dictionary is a better approach than relying on code.
