Title: [MERGED] BIP-39 List of words in Portuguese accepted!!
Post by: bitmover on September 04, 2020, 05:09:18 PM

Hello everyone

I am part of a group of 4 users (sabotag3x (https://bitcointalk.org/index.php?topic=5272106.0), alegotardo (https://bitcointalk.org/index.php?topic=5272106.msg55128616#msg55128616), Tryninja (https://bitcointalk.org/index.php?topic=5272106.msg55085882#msg55085882) and me) in the Portuguese board who are creating a list of 2048 words in Portuguese to be submitted to https://github.com/bitcoin/bips/tree/master/bip-0039

Our bitcointalk topic for discussion is:
[2020] Lista de Palavras em Português para o BIP-0039 (https://bitcointalk.org/index.php?topic=5272106.0)

We followed a set of rules when adding words, which can be seen here:
https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md

Quote

Words can be uniquely determined typing the first 4 characters.
No accents or special characters.
No complex verb forms.
No plural words, unless there's no singular form.
No words with double spelling.
No words with the exact sound of another word with different spelling.
No offensive words.
No words already used in other language mnemonic sets.
The words which have not the same spelling in Brazil and in Portugal are excluded.
No words that remind negative/sad/bad things.
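Two of the rules above (unique first 4 characters, no accents or special characters) are mechanical enough to check with a script. This is only a minimal sketch and not the script the group used; the function names and the sample words are made up for illustration.

```python
from collections import defaultdict

def shared_prefix_groups(words, n=4):
    """Group words by their first n characters and keep groups with a clash."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:n]].append(word)
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}

def non_plain_words(words):
    """Return words containing anything other than lowercase a-z."""
    return [w for w in words if not (w.isascii() and w.isalpha() and w.islower())]

# Hypothetical sample, not taken from the real list.
sample = ["abaixo", "abater", "abatido", "acao", "ação"]
print(shared_prefix_groups(sample))  # {'abat': ['abater', 'abatido']}
print(non_plain_words(sample))       # ['ação']
```

The other rules (offensive words, regional spelling, negative connotations) need human judgment and are not covered here.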

Our work is nearly done (we now have slightly more than 2048 words, and the surplus will be carefully trimmed, but every word follows the criteria above), and the list is almost ready for a pull request to the main branch.

I would like to know whether there are any suggestions, or any special procedure we have missed, before we make the pull request.

Our list can be seen here:
https://github.com/sabotag3x/bips/blob/master/bip-0039/portuguese.txt

I hope our small group will be able to get into bitcoin history.

Thanks everyone.

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: NotATether on September 05, 2020, 12:23:17 AM

I really like your initiative. I was waiting until I got to a PC to type this.

Make sure all the words in your list are words that people have heard of. Words not familiar to most people in your locale should be avoided. Some of the previous PRs for other wordlists contained such words. Here's an example of this in the French wordlist (https://github.com/bitcoin/bips/pull/152#issuecomment-100645396).

Also, I'd recommend limiting the maximum length of each word to 8 characters, according to the comment below. It will save you from having to revise your PR later:

Quote from: https://github.com/bitcoin/bips/pull/942#issuecomment-663078429

Hi. As I am interested in the creation of all word lists (to a reasonable extent), not only the German one, let me express my thoughts here as well. I am glad to see that there are contributors willing to work on word lists. However, what bothers me is that whenever a person (a group of people) shows up, take(s) care of just one list. I.e. to be exact, what bothers me is the fact that for each new list very similar problems needs to be tackled. For example requirements - for languages with Latin alphabet the maximum word lenght should be 8, due to the limitations of the displays of hardware wallets. Or requirements that first 4 letters should uniquely define a word? Not too mention about requirements like the one related to Levenshtein distance. Can't such requirements be shared across many languages? Especially that once developed tools (to ease work with Levenshtein distance) could be reused. That is why I launched a separate repository just for the creation of word lists: https://github.com/p2w34/wlips. I have launched it with a vision of tackling the creation of all word lists, not just one. Please do not get me wrong - I am not saying the work in this PR needs to be somehow stopped/abandoned/whatever - I am not the one to judge which approach is better. Let me also mention that I am the author of PR with Polish word list and I know how much time is needed to create such list from scratch. I just wanted to mention here there is also another approach possible. Thank you.

So apparently hardware wallets can only display up to 8 characters of a word. The rest won't be visible so there is a possibility for collision when using hardware wallets.

Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) between two words is the minimum number of single-character substitutions, insertions or deletions needed to transform the first word into the second. Make sure the distance between any two words is not too low; there isn't a defined minimum, but I would make it at least 2.
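For reference, the distance can be computed with the standard dynamic-programming approach. The following is a plain-Python sketch (the function name is my own choice); a library implementation will be faster on a full 2048-word list.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    if len(a) < len(b):
        a, b = b, a                  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("fish", "fist"))     # 1 (one substitution)
print(levenshtein("abaixo", "baixo"))  # 1 (one deletion)
print(levenshtein("unable", "viable")) # 2 (two substitutions)
```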

If everything goes well then judging by the opening and closing times of previous PRs, it should take about a month between opening the PR and getting it merged to the tree. Good luck!

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: bitmover on September 05, 2020, 02:00:30 AM

Quote from: NotATether on September 05, 2020, 12:23:17 AM

Quote from: https://github.com/bitcoin/bips/pull/942#issuecomment-663078429

Thank you so much for your input.

I will take a closer look at the first-4-letters requirement (which is not 100% met yet) and at this Levenshtein distance.

I will try to make or find a Levenshtein distance script in Python to check our list.

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: ABCbits on September 05, 2020, 11:29:57 AM

Good initiative, I hope it will be accepted quickly :)

Quote from: bitmover on September 05, 2020, 02:00:30 AM

I will try to make or find a Levenshtein distance script in python to check our list.

I would recommend the Jellyfish library (https://pypi.org/project/jellyfish/), which has better performance (since it's a wrapper around a C library).

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: NotATether on September 06, 2020, 12:44:06 PM

@bitmover

I am developing a Python program that evaluates Latin wordlists for their word length, the number of identical characters at the beginning, and the Levenshtein distance between every pair of words. I have not finished it yet, but I plan on putting the code on GitHub and PyPI soon. Hopefully it won't stall like some of my previous projects have.

If you are still looking for or writing a script for your task, keep doing that, since I don't have an ETA for when mine will be done, though the program is relatively simple to code. I hope I can finish it within a few days.

The idea is that it can be used by future groups making wordlists to validate their words, so it will still be useful even if your group has finished checking the Portuguese list.

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: bitmover on September 06, 2020, 12:55:05 PM

Hello NotATether and ETFbitcoin

Thanks for your suggestions

I think I will make a 2048x2048 matrix like this one. I think it is the best option.
https://stackoverflow.com/questions/47152344/how-to-calculate-levenshtein-ratio-distance-for-rows-in-my-column-in-python

In a matrix I can compare each of the 2048 words with every other word.
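A rough sketch of that matrix idea in plain Python (helper names are mine, and the four sample words are only for illustration). Since the distance is symmetric, only the upper triangle needs to be computed, which for 2048 words is about 2.1 million pairs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_matrix(words):
    """Square matrix where m[i][j] is the distance between words[i] and words[j]."""
    n = len(words)
    m = [[0] * n for _ in range(n)]  # the diagonal stays 0 (word vs itself)
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = levenshtein(words[i], words[j])
    return m

words = ["alho", "olho", "galho", "navio"]
matrix = distance_matrix(words)
for word, row in zip(words, matrix):
    print(f"{word:>6}", row)
```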

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Title: Re: BIP-39 List of words in Portuguese ready for submission
Post by: NotATether on September 06, 2020, 08:01:25 PM

Quote from: bitmover on September 06, 2020, 12:55:05 PM

NotATether, please send me your code when it's ready so I can check with it as well. What is the expected output format of your program?

Good question. I haven't written that part of the program yet but I want it to look something along the lines of this:

Code:

$wordlistvalidator -i wordlist.txt # Name subject to change
# Generic copyright notice here
<xxx> words read.

Performing Levenshtein distance test
Evaluating Levenshtein distances between <yyy> pairs of words...
-----
Pairs with Levenshtein distance 1:    # Omitted if there are no pairs with such distance
<word>, line <a>, and <word2>, line <b>   # EDIT: This is how I want all the words to be printed, with their lines, but I am editing on mobile right now so changing the other lines like this takes too long.
<word3> <word4>
...

Pairs with Levenshtein distance 2:    # And so on, up until and including a maximum configurable by command line argument.
...

# Or don't display this part of output if no pairs with distances up to that much are found

Finished performing Levenshtein distance test
Performing matching initial characters test
Comparing first <n> characters between <yyy> pairs of words...
----
Pairs with matching first <n> identical characters:    # Omitted if there are no pairs with such identical characters
<word> <word2>
<word3> <word4>
...

No pairs found with matching first <n> identical characters  # Displayed if there are no such pairs
Finished matching initial characters test
Performing word length test
Checking length of <xxx> words...
----
Words longer than <m> characters:    # Omitted if no such words
<word>
<word2>
...
No words found longer than <m> characters  # Displayed if no such words
Finished word length test

<0/1/2/3>/3 tests passed

It is meant to be human-readable output that makes it easy to find and address the problems. I'm not designing the output to be parsed by a second program, but I'll add an API to compensate for that.
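To make the three tests concrete, here is a small independent sketch of what such a validator could compute. It is not the actual program described above: the function `validate`, its parameters and defaults, and the sample words are all assumptions for illustration.

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def validate(words, min_distance=2, prefix_len=4, max_len=8):
    """Run the three checks and return the offending items for each one."""
    pairs = list(combinations(words, 2))
    report = {
        "too_close": [(a, b) for a, b in pairs if levenshtein(a, b) < min_distance],
        "same_prefix": [(a, b) for a, b in pairs if a[:prefix_len] == b[:prefix_len]],
        "too_long": [w for w in words if len(w) > max_len],
    }
    checks = ("too_close", "same_prefix", "too_long")
    report["passed"] = sum(1 for name in checks if not report[name])
    return report

# Hypothetical input chosen so that every test fails at least once.
words = ["abaixo", "baixo", "carneiro", "carteiro", "desenho", "desenvolver"]
report = validate(words)
print("Pairs with Levenshtein distance below 2:", report["too_close"])
print("Pairs with matching first 4 characters:", report["same_prefix"])
print("Words longer than 8 characters:", report["too_long"])
print(f"{report['passed']}/3 tests passed")
```

Returning a plain dict keeps the results usable from other code, while the print calls give the human-readable summary.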

Update: the backends of the unique initial characters and word length tests are complete; still working on the Levenshtein distance test
Update 2: finished the Levenshtein distance test, working on the front-end and command-line argument parser

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 08, 2020, 04:46:19 AM

@ETFBitcoin and @NotATether

I wanna share this with you both. I did it thanks to your suggestions.

Following ETFBitcoin's library suggestion (which is pretty fast, btw), I wrote code that generated this matrix (we had to delete some words, and now we only have 2005).

https://i.imgur.com/qmkJXDC.png

As you can see, this matrix shows the Levenshtein distance of all 2005 words with each other, computed in a loop. As word 0 compared to word 0 is the same word, its distance is 0. As word 1 compared to word 1 is also 0, you can see a diagonal line of zeros, where each word is compared with itself, running to the end of the last row.

I was able to identify all the values where distance was 1 and I generated a dictionary with those coordinates in the matrix:

Code:

{1: 164, 4: 182, 16: 1521, 23: 516, 31: 567, 32: 33, 33: 32, 35: 67, 39: 677, 51: 1305, 57: 126, 60: 51, 67: 35, 75: 78, 76: 1261, 78: 75, 83: 1655, 103: 107, 104: 140, 105: 106, 106: 105, 107: 103, 117: 1376, 126: 57, 128: 690, 140: 1928, 148: 176, 158: 178, 161: 1767, 164: 1, 166: 1910, 169: 1914, 175: 181, 176: 148, 178: 158, 181: 175, 182: 4, 183: 1019, 187: 221, 188: 205, 190: 697, 194: 234, 195: 1681, 200: 708, 205: 188, 220: 730, 221: 187, 228: 247, 231: 236, 233: 1610, 234: 194, 235: 236, 236: 235, 238: 228, 244: 750, 245: 1617, 247: 228, 252: 255, 254: 1869, 255: 252, 266: 471, 270: 292, 272: 1237, 274: 672, 280: 1102, 281: 1642, 283: 678, 284: 286, 286: 1528, 287: 280, 292: 270, 313: 1135, 314: 318, 315: 1538, 317: 1373, 318: 347, 321: 1014, 329: 1384, 338: 1928, 340: 341, 341: 340, 343: 703, 345: 348, 346: 621, 347: 318, 348: 345, 371: 886, 382: 1598, 395: 408, 399: 976, 402: 737, 403: 405, 405: 403, 408: 1202, 426: 446, 427: 1833, 432: 438, 433: 1214, 438: 432, 439: 433, 443: 1844, 446: 426, 461: 338, 466: 1497, 471: 266, 474: 795, 493: 1573, 516: 23, 517: 1590, 523: 1429, 538: 1976, 539: 550, 540: 544, 542: 538, 544: 1076, 549: 1082, 550: 539, 554: 1279, 557: 857, 567: 31, 568: 726, 613: 659, 619: 1585, 621: 346, 627: 635, 635: 627, 659: 1947, 672: 274, 677: 39, 678: 283, 687: 705, 690: 128, 691: 725, 694: 1027, 696: 1677, 697: 703, 699: 1394, 702: 710, 703: 705, 705: 703, 708: 760, 710: 1811, 715: 1957, 724: 715, 725: 691, 726: 568, 730: 220, 733: 1095, 735: 1073, 737: 402, 740: 1451, 742: 1207, 746: 1214, 747: 1838, 748: 742, 750: 244, 758: 761, 759: 1573, 760: 708, 761: 1862, 764: 759, 765: 1615, 768: 769, 769: 768, 770: 773, 771: 991, 773: 770, 782: 51, 785: 1002, 790: 792, 792: 790, 794: 1923, 795: 785, 798: 802, 801: 1681, 802: 798, 803: 1799, 825: 1196, 827: 1450, 834: 1614, 839: 1851, 842: 1468, 846: 1321, 853: 1878, 857: 557, 859: 1090, 882: 1458, 886: 371, 896: 1092, 912: 915, 915: 912, 919: 945, 945: 919, 955: 1717, 958: 971, 961: 1361, 969: 
1381, 971: 958, 973: 1393, 976: 399, 978: 1073, 979: 978, 987: 989, 989: 987, 991: 771, 993: 1334, 997: 1102, 1002: 785, 1010: 997, 1014: 321, 1015: 1025, 1016: 1143, 1018: 1379, 1019: 183, 1025: 1015, 1027: 694, 1028: 1053, 1031: 1155, 1036: 1801, 1038: 1573, 1042: 1414, 1044: 1038, 1046: 1042, 1048: 1065, 1051: 1055, 1053: 1028, 1054: 1070, 1055: 1051, 1057: 1307, 1058: 1962, 1063: 1069, 1065: 1066, 1066: 1065, 1069: 1600, 1070: 1054, 1073: 978, 1074: 1716, 1075: 1829, 1076: 1074, 1077: 1205, 1082: 549, 1087: 1738, 1090: 1094, 1092: 896, 1094: 1090, 1095: 733, 1098: 1112, 1102: 997, 1112: 1113, 1113: 1112, 1122: 1654, 1131: 1166, 1134: 1920, 1135: 313, 1143: 1016, 1147: 1381, 1154: 1928, 1155: 1031, 1166: 1131, 1169: 1154, 1177: 1184, 1184: 1177, 1195: 1607, 1196: 1200, 1200: 1196, 1202: 408, 1203: 1305, 1205: 1207, 1207: 1205, 1214: 746, 1221: 1147, 1225: 1624, 1231: 1225, 1237: 1248, 1248: 1381, 1252: 1385, 1261: 76, 1268: 1277, 1272: 1332, 1277: 1268, 1279: 554, 1305: 1203, 1307: 1057, 1321: 1339, 1332: 1272, 1334: 1844, 1339: 1321, 1346: 1360, 1358: 1775, 1360: 1346, 1361: 961, 1367: 1381, 1373: 317, 1374: 1413, 1376: 117, 1379: 1461, 1381: 1468, 1384: 329, 1385: 1252, 1392: 1395, 1393: 1413, 1394: 699, 1395: 1392, 1398: 1931, 1403: 1686, 1405: 1573, 1409: 1413, 1413: 1409, 1414: 1415, 1415: 1414, 1429: 523, 1448: 2001, 1450: 827, 1451: 1452, 1452: 1451, 1454: 1829, 1456: 1458, 1457: 1462, 1458: 1456, 1461: 1379, 1462: 1620, 1464: 1462, 1468: 1477, 1470: 1468, 1473: 1478, 1477: 1585, 1478: 1473, 1485: 1491, 1491: 1485, 1495: 1501, 1497: 466, 1501: 1495, 1519: 1527, 1521: 16, 1523: 1908, 1524: 1767, 1525: 1541, 1527: 1519, 1528: 286, 1529: 1771, 1538: 1541, 1541: 1538, 1548: 1568, 1551: 1574, 1556: 1559, 1559: 1673, 1564: 1574, 1568: 1548, 1571: 1801, 1573: 1405, 1574: 1564, 1585: 1811, 1590: 1600, 1596: 1600, 1598: 382, 1600: 1596, 1607: 1195, 1608: 1980, 1610: 233, 1612: 1829, 1614: 834, 1615: 765, 1617: 245, 1620: 1462, 1624: 1225, 1631: 1644, 1635: 1905, 
1637: 1638, 1638: 1637, 1642: 281, 1644: 1631, 1654: 1122, 1655: 83, 1659: 1662, 1662: 1659, 1667: 1681, 1668: 1713, 1673: 1559, 1677: 1681, 1679: 1795, 1681: 1700, 1686: 1403, 1687: 1804, 1695: 1946, 1700: 1681, 1713: 1668, 1715: 1730, 1716: 1721, 1717: 1736, 1721: 1716, 1727: 1730, 1730: 1727, 1736: 1717, 1738: 1754, 1749: 1738, 1754: 1738, 1767: 1524, 1768: 1775, 1769: 1910, 1771: 1769, 1775: 1768, 1788: 1795, 1790: 1798, 1791: 1799, 1793: 1928, 1795: 1788, 1798: 1790, 1799: 1791, 1801: 1571, 1804: 1687, 1811: 1585, 1822: 1851, 1824: 1872, 1829: 1612, 1833: 1824, 1834: 1840, 1838: 747, 1840: 1834, 1844: 1334, 1851: 1822, 1859: 1872, 1862: 1859, 1869: 254, 1872: 1859, 1878: 853, 1884: 1888, 1885: 1888, 1888: 1885, 1905: 1925, 1906: 1991, 1907: 1949, 1908: 1924, 1909: 1912, 1910: 1914, 1912: 1909, 1914: 1910, 1920: 1966, 1923: 1972, 1924: 1908, 1925: 1905, 1928: 1793, 1931: 1398, 1932: 1933, 1933: 1932, 1946: 1695, 1947: 659, 1949: 1966, 1953: 1970, 1955: 1970, 1957: 1959, 1959: 1957, 1962: 1058, 1966: 1949, 1970: 1955, 1972: 1923, 1976: 538, 1977: 2000, 1980: 1984, 1984: 1980, 1985: 1986, 1986: 1985, 1991: 1906, 2000: 1977, 2001: 1448}

Now it was easy. With the coordinates in the matrix, I just generated an array with all collided pairs:

Code:

['abaixo - baixo',
 'abater - bater',
 'achar - rachar',
 'adiante - diante',
 'afetivo - efetivo',
 'aflito - afoito',
 'afoito - aflito',
 'agora - amora',
 'agulha - fagulha',
 'alho - olho',
 'altitude - atitude',
 'alvo - alho',
 'amora - agora',
 'anel - anil',
 'anexo - nexo',
 'anil - anel',
 'anta - santa',
 'arca - arma',
 'areia - aveia',
 'argila - argola',
 'argola - argila',
 'arma - arca',
 'assado - passado',
 'atitude - altitude',
 'ator - fator',
 'aveia - veia',
 'babado - barbado',
 'bagulho - barulho',
 'bainha - tainha',
 'baixo - abaixo',
 'bala - vala',
 'balsa - valsa',
 'barata - batata',
 'barbado - babado',
 'barulho - bagulho',
 'batata - barata',
 'bater - abater',
 'batido - latido',
 'beato - boato',
 'beco - bico',
 'beira - feira',
 'beliche - boliche',
 'belo - selo',
 'besta - festa',
 'bico - beco',
 'bloco - floco',
 'boato - beato',
 'bode - boxe',
 'boldo - bolso',
 'bolha - rolha',
 'boliche - beliche',
 'bolo - bolso',
 'bolso - bolo',
 'bonde - bode',
 'bossa - fossa',
 'botina - rotina',
 'boxe - bode',
 'briga - brita',
 'brincar - trincar',
 'brita - briga',
 'busto - custo',
 'cabelo - camelo',
 'cabo - nabo',
 'cabuloso - fabuloso',
 'cadeira - madeira',
 'caibro - saibro',
 'caixa - faixa',
 'cajado - calado',
 'calado - ralado',
 'caldeira - cadeira',
 'camelo - cabelo',
 'carinho - marinho',
 'carneiro - carteiro',
 'caro - raro',
 'carreira - parreira',
 'carteiro - certeiro',
 'casca - lasca',
 'causar - pausar',
 'ceia - veia',
 'cenoura - censura',
 'censura - cenoura',
 'cera - fera',
 'cereja - cerveja',
 'cerrado - errado',
 'certeiro - carteiro',
 'cerveja - cereja',
 'cidade - idade',
 'cisco - risco',
 'coceira - coleira',
 'coelho - joelho',
 'coice - foice',
 'coifa - coisa',
 'coisa - coifa',
 'coleira - moleira',
 'copeiro - coveiro',
 'copo - topo',
 'corja - coruja',
 'corno - morno',
 'coruja - corja',
 'corvo - corno',
 'couro - touro',
 'coveiro - copeiro',
 'cuia - ceia',
 'cunhado - punhado',
 'custo - busto',
 'data - gata',
 'dente - rente',
 'diante - adiante',
 'dica - rica',
 'dinheiro - pinheiro',
 'doador - voador',
 'dobrado - dourado',
 'doca - dona',
 'domador - doador',
 'dona - lona',
 'dotado - lotado',
 'dourado - dobrado',
 'dublado - nublado',
 'dueto - gueto',
 'efetivo - afetivo',
 'eixo - fixo',
 'enxame - exame',
 'ereto - reto',
 'errado - cerrado',
 'escola - esmola',
 'esmola - escola',
 'exame - vexame',
 'fabuloso - cabuloso',
 'fagulha - agulha',
 'faixa - caixa',
 'farpa - ferpa',
 'fator - ator',
 'favela - fivela',
 'febre - lebre',
 'feio - seio',
 'feira - fera',
 'feixe - peixe',
 'feno - feto',
 'fera - ferpa',
 'ferpa - fera',
 'festa - fresta',
 'feto - teto',
 'figa - viga',
 'fita - figa',
 'fivela - favela',
 'fixo - eixo',
 'floco - bloco',
 'fluxo - luxo',
 'fogo - logo',
 'foice - coice',
 'folia - polia',
 'fonte - monte',
 'forno - morno',
 'forrar - torrar',
 'forte - fonte',
 'fossa - bossa',
 'freio - frevo',
 'frente - rente',
 'fresta - festa',
 'frevo - trevo',
 'fronte - frente',
 'frota - rota',
 'fundo - fungo',
 'fungo - fundo',
 'funil - fuzil',
 'furado - jurado',
 'fuzil - funil',
 'galho - alho',
 'gama - lama',
 'garoupa - garupa',
 'garupa - garoupa',
 'gasto - vasto',
 'gata - gama',
 'geada - gemada',
 'gelo - selo',
 'gemada - geada',
 'gemido - temido',
 'goela - moela',
 'goleiro - poleiro',
 'gosto - rosto',
 'gralha - tralha',
 'grato - prato',
 'grelha - orelha',
 'gruta - truta',
 'gueto - dueto',
 'gula - lula',
 'horta - porta',
 'idade - cidade',
 'ilustre - lustre',
 'incolor - indolor',
 'indolor - incolor',
 'inferno - inverno',
 'inverno - inferno',
 'isolado - solado',
 'jaca - jeca',
 'janela - panela',
 'jato - pato',
 'jeca - jaca',
 'jeito - peito',
 'joelho - coelho',
 'jogo - logo',
 'joio - jogo',
 'julho - junho',
 'junho - julho',
 'jurado - furado',
 'juro - ouro',
 'ladeira - madeira',
 'lama - gama',
 'lareira - ladeira',
 'lasca - casca',
 'laser - lazer',
 'lastro - mastro',
 'latente - patente',
 'latido - batido',
 'lazer - laser',
 'lebre - febre',
 'legado - ligado',
 'leigo - meigo',
 'lenda - tenda',
 'lente - rente',
 'lesado - pesado',
 'leste - lente',
 'levado - lesado',
 'liberal - literal',
 'licitar - limitar',
 'ligado - legado',
 'ligeiro - lixeiro',
 'limitar - licitar',
 'limpo - olimpo',
 'linda - vinda',
 'lisa - lixa',
 'literal - litoral',
 'litoral - literal',
 'lixa - rixa',
 'lixeiro - ligeiro',
 'logo - jogo',
 'loja - soja',
 'lombo - tombo',
 'lona - loja',
 'longe - monge',
 'lotado - dotado',
 'luar - suar',
 'lula - luva',
 'lustre - ilustre',
 'luva - lula',
 'luxo - fluxo',
 'machado - malhado',
 'madeira - ladeira',
 'malhado - malvado',
 'malvado - malhado',
 'mangue - sangue',
 'marcador - mercador',
 'margem - vargem',
 'marinho - carinho',
 'mastro - lastro',
 'mato - pato',
 'meia - veia',
 'meigo - leigo',
 'mercador - marcador',
 'mesa - meia',
 'miado - mimado',
 'mimado - miado',
 'moedor - roedor',
 'moela - mola',
 'mola - moela',
 'moleira - coleira',
 'molho - olho',
 'monge - monte',
 'monte - monge',
 'morno - forno',
 'moto - mato',
 'mugido - rugido',
 'munido - mugido',
 'nabo - nato',
 'nato - pato',
 'navio - pavio',
 'nexo - anexo',
 'noivo - novo',
 'nosso - osso',
 'novo - noivo',
 'nublado - dublado',
 'olho - molho',
 'olimpo - limpo',
 'orelha - ovelha',
 'osso - nosso',
 'ouro - touro',
 'ovelha - orelha',
 'padeiro - pandeiro',
 'pampa - tampa',
 'pandeiro - padeiro',
 'panela - janela',
 'papo - pato',
 'parreira - carreira',
 'parto - perto',
 'passado - assado',
 'patente - potente',
 'pato - prato',
 'pausar - causar',
 'pavio - navio',
 'pegada - pelada',
 'peito - perto',
 'peixe - feixe',
 'pelada - pegada',
 'peludo - veludo',
 'penhor - senhor',
 'pente - rente',
 'perito - perto',
 'perto - perito',
 'pesado - pescado',
 'pescado - pesado',
 'pinheiro - dinheiro',
 'poeira - zoeira',
 'poleiro - goleiro',
 'polia - polpa',
 'polpa - polia',
 'pombo - tombo',
 'ponta - porta',
 'porco - pouco',
 'porta - ponta',
 'potente - patente',
 'pouco - rouco',
 'pouso - pouco',
 'prato - preto',
 'prazo - prato',
 'pregar - prezar',
 'preto - reto',
 'prezar - pregar',
 'profeta - proveta',
 'proveta - profeta',
 'pular - puxar',
 'punhado - cunhado',
 'puxar - pular',
 'rabada - rajada',
 'rachar - achar',
 'raiar - vaiar',
 'rainha - tainha',
 'raio - raso',
 'rajada - rabada',
 'ralado - calado',
 'ralo - talo',
 'raro - raso',
 'raso - raro',
 'reator - reitor',
 'recente - repente',
 'redator - redutor',
 'redutor - sedutor',
 'regente - repente',
 'reitor - reator',
 'renda - tenda',
 'rente - pente',
 'repente - regente',
 'reto - teto',
 'rica - rixa',
 'ripa - rixa',
 'risco - cisco',
 'rixa - ripa',
 'roedor - moedor',
 'rolante - volante',
 'rolha - bolha',
 'rombo - tombo',
 'rosto - gosto',
 'rota - frota',
 'rotina - botina',
 'rouco - pouco',
 'rugido - mugido',
 'sacada - salada',
 'sadio - vadio',
 'safira - safra',
 'safra - safira',
 'saibro - caibro',
 'salada - sacada',
 'sangue - mangue',
 'santa - anta',
 'sarda - sarna',
 'sarna - sarda',
 'sebo - selo',
 'secar - socar',
 'sedutor - redutor',
 'seio - selo',
 'selar - telar',
 'selo - silo',
 'senhor - penhor',
 'sentar - tentar',
 'setor - vetor',
 'silo - selo',
 'socar - secar',
 'sogro - soro',
 'soja - soma',
 'solado - sovado',
 'soma - soja',
 'sono - soro',
 'soro - sono',
 'sovado - solado',
 'suar - suor',
 'sujar - suar',
 'suor - suar',
 'tainha - rainha',
 'taipa - tampa',
 'tala - vala',
 'talo - tala',
 'tampa - taipa',
 'tear - telar',
 'tecer - temer',
 'tecido - temido',
 'teia - veia',
 'telar - tear',
 'temer - tecer',
 'temido - tecido',
 'tenda - renda',
 'tentar - sentar',
 'teto - reto',
 'toalha - tralha',
 'toco - troco',
 'tombo - rombo',
 'topo - toco',
 'tora - tosa',
 'torrar - forrar',
 'tosa - tora',
 'touro - ouro',
 'tralha - toalha',
 'treco - troco',
 'trevo - treco',
 'trincar - brincar',
 'troco - treco',
 'truta - gruta',
 'turbo - turvo',
 'turco - turvo',
 'turvo - turco',
 'vadio - vazio',
 'vaga - zaga',
 'vagem - viagem',
 'vaiar - vazar',
 'vaidade - validade',
 'vala - valsa',
 'validade - vaidade',
 'valsa - vala',
 'vargem - virgem',
 'vasto - visto',
 'vazar - vaiar',
 'vazio - vadio',
 'veia - teia',
 'veludo - peludo',
 'vencedor - vendedor',
 'vendedor - vencedor',
 'vetor - setor',
 'vexame - exame',
 'viagem - virgem',
 'videira - viseira',
 'vieira - viseira',
 'viga - vigia',
 'vigia - viga',
 'vinda - linda',
 'virgem - viagem',
 'viseira - vieira',
 'visto - vasto',
 'voador - doador',
 'voar - zoar',
 'volante - votante',
 'votante - volante',
 'vulgo - vulto',
 'vulto - vulgo',
 'zaga - vaga',
 'zoar - voar',
 'zoeira - poeira']

The results are not as bad as they look. They are doubled, because just as "abaixo" is at distance 1 from "baixo", "baixo" is also at distance 1 from "abaixo". So we will see everything doubled here.
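One way to avoid the doubling is to visit each unordered pair only once. The sketch below (my own helper names, tiny sample list) uses itertools.combinations for that. It also collects pairs in a list rather than in a dict keyed by row index, since a dict can hold only one partner per word and some words have several distance-1 neighbours.

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_pairs(words, distance=1):
    """Each unordered pair at exactly the given distance, listed once."""
    return [(a, b) for a, b in combinations(words, 2) if levenshtein(a, b) == distance]

words = ["dente", "lente", "pente", "rente", "navio"]
pairs = close_pairs(words)
for a, b in pairs:
    print(f"{a} - {b}")
print(len(pairs), "pairs")  # 6 pairs: the four "-ente" words all collide
```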

I learned a lot along the way. I never thought that generating those words was going to get so complicated. We will go back to work on the local board to generate more words, as I have deleted a lot now.

I hope this approach can help with your library/program, NotATether.

Thank you both again.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 08, 2020, 06:47:42 AM

I just added the code to compute Levenshtein distance for all existing BIP-39 lists to Bitcoin.Net (https://github.com/Autarkysoft/Denovo/commit/e7587b808575d933b4af2c33c397d0ef43d2e6c3) and the first thing I noticed is that the English list contains words such as "able", "cable", "table", "unable", "viable" with very short distances (1 for the first three and 2 for the other two).
Other languages don't seem to be any better. Here are some examples (not an exhaustive list):
The first word in the Italian list is "abaco", which has the similar word "baco"; likewise "sino" and "asino".
French seems better but it has words with distance=2 like "apaiser", "abaisser"
Spanish has "bono", "abono" and "abrazo", "brazo"
Czech has words with distance=2 like "abeceda", "beseda" and "adresa", "agrese"
Japanese has "あいさつ", "かいさつ" and "あきる", "あける"
Korean has "가격", "간격"
Chinese results don't make much sense but since the last 3 are complicated languages I'm not sure if the Levenshtein distance is even valid for them.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 08, 2020, 11:12:59 AM

Quote from: ETFbitcoin on September 08, 2020, 09:40:27 AM

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are at the beginning of the string (e.g. AAAXYZ and AAAUVW are more similar than XYZAAA and UVWAAA).

I believe the results would lead to pretty much the same conclusion about the word lists BIP-39 is dealing with. But I think Levenshtein may be better here. For example, Jaro-Winkler gives 0.866 for "cable" and "table" (closer to 1 means more similar), while Levenshtein returns 1, which is a much clearer indication of the pair being bad.
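For anyone who prefers Python, below is a sketch of the textbook Jaro and Jaro-Winkler similarity (my own implementation, with the usual prefix scale of 0.1 and a prefix cap of 4 characters as assumptions). Here 1.0 means identical strings and 0.0 means nothing in common.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    k = out_of_order = 0
    for i, ch in enumerate(s1):
        if used1[i]:
            while not used2[k]:
                k += 1
            out_of_order += ch != s2[k]
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)

print(round(jaro_winkler("cable", "table"), 3))    # 0.867
print(round(jaro_winkler("AAAXYZ", "AAAUVW"), 3))  # 0.767
print(round(jaro_winkler("XYZAAA", "UVWAAA"), 3))  # 0.667
```

The last two lines show the prefix bonus: both pairs share three characters, but only the pair that shares them at the start gets the higher score.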

PS. You can run Jaro-Winkler algorithm here on sharplab (https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEUCuA7AHwHoiACAYQGJSBLAWwAcAbGOmPDAQwxoj1IgAzUp2AQAbjBEMGUCJzAALUgFgAUAAEATAEZVa9dtIBxAGLH9Ab33qAkCVIaArDz6kMEUmE5MwOJtxSGIpSdg4AUpxypADK9DQBUDQYAJ4CwhgA7p4xGEl4AOYAzvq2GjpIpAAmEDjALKQAVlEQAPpVNEVceGAwABTlAAykRTpojjrDRVoAlKXWdvZkAJIZIbF5NIUlUVIwAI44PqW2NMJ9o6QAvFcjsydlAOykOgB0gwDcpYsOADLsBWC6Xc2Q2+WKJy2GFILDweluo1e/0KQK+iyhMPYWmudyRAOCaLUth+ZAAspwEPQcHRqp1ur1SDgGB5SJlFDQlKQ6NwlFsCicHJ0REwmBBMjAqpCOFyKe06Ti+lCZn1ycFXqZRdA+n0anUWMqHrYjSruIpXuSEH1YWNMXh7mR7nAXjMXYTiUSlhRatKhDKMEoYCV0dLuf7lLdPt8PQ4ABKcIrKQTQP0BoMejEAbQAuop44pWpdbngYJlaBwM4jkYDFFm3adpdncwmC9iiyWyxgK1o8Sia26SaQACpQTiSKBFIKKOQ4ArKYJSQQ0cfQ3Lgk5JqCkRXSmg4yO7gA8tp0XxoAGoz3NFgsiYtPeQQmAANa0NYwWC0HZ4NKh1MPDdbhijQ4qqZoWn0gzjLuTrcggcpdDMdZGsBR6geaWxWliUGkGeMpwR0XQ4c6XyNBeV63h6RoOKs7ghB+QqcCmyh3qc5yjBmNBZtcCJaBmjRcQAZAJpBNvm0x8VxNykIM5FGkaN5yUaokFjoHGSS8SFKXmLYSTiJ6GrYv6KBemm2MAsCcE+mkAL4nLZdgDjR870SUeCeEZkLnB5UkyQ8GjPIM7z9tGZAAHLUsA77AnknB4EUDAQEUyS8HFJy6vUQR7sFRoYglGIRtl97etCeARVFvoQGAfgjj0gYCmQbLvkEoJKFECgYO+JQecAODQgOzlSAxtFLpKRkSl4uYjmAHWbgOWykJFWQwOwtGDXgHS9GmRoAdu0K7gVtCkEe1qnmRDxnFuynsZx3HOg8CksQ4phbJK86kMWCDQmNkqtVNM2GoK/ATpA61gnyhpsgkUh9FdvF5RwkkRrJinwxgJlRopF0XKpN0AIQ8RmqMXlxyNyWjZ5IfZFFGtCRC3FohUOAASjAGA4FA/BvZE0RxHQCRRMkaSLP5W7aul+pGXMZBi7UGUzNapO2LhMt6jAMyS6Q0s6rL+qwvcLHK9rqvqzyyhOhgUui+LauS6TZAAMxBaUVPup6AAiSRjl4EBVKERLaNidgKWUkwjPCpAAETeBlEfjNMOIR1wMeM2QAAK+TQs0PPxIkgvRaCq58ltIcAJx9FnbQEfS/SjHHsyIc7+hU+oDiDuyJSQL7n7exwSQ9R1krAGkACCeBPskTOcDQgx6OoLdkOQPtSJADA0ONghyDSigYBgDBFCAJCZEfrwFMtT5FBup8wOfrzQAURAV3AsVVHAj+ZFsT4sFAcBJXzuepEQdQQA=) just change the s1 and s2 in Main() method.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 08, 2020, 11:13:45 AM

Quote from: Coding Enthusiast on September 08, 2020, 06:47:42 AM

This is a great find.
A distance of 2 is certainly OK, because forbidding it would be too restrictive, but I didn't know whether a distance of 1 would be acceptable.

Looking carefully at https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md, I found that only the French list has a rule about Levenshtein distance:

Quote

French
10. No very similar words with 1 letter of difference.

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#french

This is the same as requiring a Levenshtein distance > 1.

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 08, 2020, 11:40:33 AM

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

I think you should try to avoid it if possible. It's definitely beneficial to keep all the words as distinct as possible. For example, in the case above, bad handwriting alone could cause confusion between the letters 'c' and 't' in "cable" and "table", and having several of these mistakes in one mnemonic could potentially make recovery impossible.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 08, 2020, 10:24:29 PM

Quote from: Coding Enthusiast on September 08, 2020, 11:12:59 AM

Quote from: ETFbitcoin on September 08, 2020, 09:40:27 AM

Interesting find. What do you think about Jaro Similarity? No idea how it works, but the output is fixed from 0.0 to 1.0 and i think it'd be easier to determine the threshold.

The Levenshtein algorithm considers 3 factors: deletion, insertion and substitution
Damerau–Levenshtein adds transposition to the above 3.
Meanwhile the Jaro algorithm only considers similarity (characters in common but only in a short distance)
Jaro-Winkler is the same as Jaro but gives a more favorable result when the characters in common are at the beginning of the string (e.g. AAAXYZ and AAAUVW are more similar than XYZAAA and UVWAAA).

I am against using Jaro-Winkler similarity for measuring distances because it is tainted by its heavier weighting of earlier characters. I feel like it's trying to take on the task of both measuring distance and counting initial unique characters, but it is not effective at either, because it just adds the distance metric and a heavily scaled-down initial character uniqueness measurement together. IMHO adding two metrics together just ruins the measurement.

Jaro similarity is a little better, since it only measures distance. However, I notice that character swaps carry less weight in the metric than the presence of unique characters in either of the two words. That makes sense if you are feeding a program input that contains several similar words, but swaps of adjacent characters and deletions are the most common mistakes people make when copying from a wordlist. Also, the fractional score doesn't lend itself to quantifying how many character replacements it takes to get from one value to another, at least not for the human brain. I can't just say, "for a Jaro distance of 0.7 I need to change <x> characters to make it 0.6".

For typing there are also typos made not by swapping but by hitting an adjacent key on the Qwerty keyboard. Such a typo is just a substitution, which can be modeled as a deletion and insertion pair; additional typos can replace one of the deletes and one of the inserts with a swap*. There are probably production algorithms that take the proximity of neighboring Qwerty keys into account when measuring similarity, penalizing insert/delete pairs of nearby keys more harshly than distant ones; search engines can detect typos, so something like this must exist. And I think that should be the goal when making a wordlist: to filter out as many opportunities to make typos as possible. The guidelines for similarity checking were only created because a small group can't be expected to make such a sophisticated checker :)

*That's why I think counting swaps with one point instead of two points of insert/delete is bad for distance measurements, as we are falsely making the word pair look more unique. Hence my argument against using Damerau-Levenshtein.

So for simpletons like us, none of the alternative algorithms are good for our needs, and Levenshtein is our best measuring ruler that is not complex to implement.

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

A Levenshtein distance of one means only one edit needs to be made, such as the substitution "fish" --> "fist", which carries the risk of being misspelt, like Coding Enthusiast mentioned. Insertions and deletions are harder to get wrong though, because a user has to subconsciously type an extra character or omit one. So if you absolutely must use distance 1 word pairs, then use ones with an extra or missing letter. It is unlikely, however, that users will get 2 substituted characters wrong.
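To make the distance-1 discussion concrete, here is a minimal sketch in plain Python. It is not taken from any validator mentioned in this thread; the function names `levenshtein` and `distance_one_pairs` are made up for illustration. It computes the plain Levenshtein distance (insert, delete, substitute, no swaps) and labels each distance-1 pair as a substitution or an insert/delete, following the distinction drawn above.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def distance_one_pairs(words):
    """Return (word, word, kind) for every pair of words at distance 1."""
    pairs = []
    for first, second in combinations(words, 2):
        if levenshtein(first, second) == 1:
            # Equal lengths at distance 1 can only mean a single substitution.
            kind = "substitution" if len(first) == len(second) else "insert/delete"
            pairs.append((first, second, kind))
    return pairs
```

Comparing every pair in a 2048-word list is roughly two million comparisons of short strings, which is slow but workable for a one-off check.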

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing and then it should be ready for publishing.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 09, 2020, 12:07:50 AM

Quote from: Coding Enthusiast on September 08, 2020, 11:40:33 AM

Quote from: bitmover on September 08, 2020, 11:13:45 AM

What do you think Coding Enthusiast, should we try to keep Levenshtein distance > 1? That is not an easy task but certainly doable.

Thanks for your suggestion. We decided to keep that restriction (removing all word pairs with Levenshtein distance = 1). Our wordlist is going to be the one with the most restrictive rules.
The French wordlist followed the Levenshtein distance = 1 rule; however, they didn't worry about repeating words from other lists like we did.

https://github.com/bitcoin/bips/pull/152#issuecomment-412618598
https://i.imgur.com/56eIbFY.png

I hope our list will be quickly accepted. We did nice work.

Quote from: NotATether on September 08, 2020, 10:24:29 PM

P.S. as for the progress of my wordlist validator program, I have implemented the command line arguments and logging functionality. I am currently implementing a progress bar so people can see how far along the validation is. When that's finished the program will be complete; I'll just have to write a stable API, documentation and unit tests for the whole thing and then it should be ready for publishing.

Share this code with us when you are done. There are still other languages that need a wordlist, and your program may also be used in other projects that we don't know of yet.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 12, 2020, 05:21:12 PM

There are two oddities that have me stumped while writing the logic to process the word list file. All of the words in every wordlist I checked are sorted. Is it a hard requirement for submitted wordlists to be sorted? My code assumes valid wordlists might be in a random order. If sorting is absolutely required, I could introduce another check that tests whether the words are in order.

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Special Spanish characters like 'ñ', 'ü', 'á', etc... are considered equal to 'n', 'u', 'a', etc... in terms of identifying a word. Therefore, there is no need to use a Spanish keyboard to introduce the passphrase, an application with the Spanish wordlist will be able to identify the words after the first 4 chars have been typed even if the chars with accents have been replaced with the equivalent without accents.

Despite the list having accented characters in it, should applications accept the words typed only in the form without accents, so that Spanish wordlist processing is consistent with submitted wordlists in other Latin languages? Personally I'm not in favor of implementing a special case during validation for handling words in Spanish, because I want my implementation to be reusable.

The accented characters also make it slightly harder to check the validity of a word because I now have to convert a Latin character to its non-accented form (my current code tests if the character is between "a" and "z"). Does anyone know a Python function or module that can do this?

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 12, 2020, 08:18:07 PM

Quote from: NotATether on September 12, 2020, 05:21:12 PM

Sorting is not required, but as it is extremely easy to do in any software or language (even in Excel), I think it is very basic and elegant to submit your list sorted. In no way would I submit my list in a random order, unless there were a reason to do so.

I think you could implement a question like "your word list is not sorted. Would you like to sort it now?"

Quote

Second, the Spanish wordlist contains words with accents in them. https://github.com/sabotag3x/bips/blob/master/bip-0039/spanish.txt but that wordlist's rules say:

Quote from: https://github.com/sabotag3x/bips/blob/master/bip-0039/bip-0039-wordlists.md#spanish

Personally, I think that accepting words with accents is a big mistake. I would reject it straight away. And I wouldn't use that word list.

Because "ábaco" and "abaco" are different strings, and that could lead to problems in some software.

Portuguese list won't have words with special characters.

Quote

I made a dictionary with all special characters and replaced Spanish special characters using that dictionary. Then I checked for common words in my list and spanish. I can share the code if you wish.

My dictionary is here
https://bitcointalk.org/index.php?topic=5272106.msg55131643#msg55131643

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 12, 2020, 10:45:15 PM

Quote from: bitmover on September 12, 2020, 08:18:07 PM

I designed my program to have as little user interaction as possible, because it's easier to show some sort of report card with the status of each test and where specifically each test failed, so you can immediately go to that part of the file and fix it. In simple terms, my program lets you control which tests to enable from the command line, it prints progress and status messages as it performs each test, and it tells you which tests passed and failed.

I am not comfortable with modifying the wordlist file in-place because the program could have bugs that mistakenly mess up the wordlist. I've made that mistake too many times in other projects and I don't want to take any chances here. So I think I will just print a warning if it detects the list isn't sorted.

Quote from: bitmover on September 12, 2020, 08:18:07 PM

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

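# Strings are immutable in Python, so collect the replaced characters and join them.
chars = []
for char in word:
  # Fall back to the original character when it is not in the table.
  chars.append(bitmover_table.get(char, char))
word = "".join(chars)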

So the letters in your dictionary, are they only from the Spanish alphabet or did you include letters used in other European languages? In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't in the table and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name has "WITH" in it (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then reverse-looks up the (normal) character from the name.

As a side effect, this not only removes diacritics from Latin characters, it also removes diacritics from non-Latin characters like Greek and Arabic, as long as the Unicode name has the word WITH in it; if it doesn't, the character is left alone. e.g:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.
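For readers who want to try the Unicode-name idea, here is a minimal sketch using only the standard `unicodedata` module. It is written in the spirit of the approach described above and is not a copy of the linked Stack Overflow answer; the names `strip_diacritic` and `deaccent` are made up for illustration. The NFC normalization step is an addition of this sketch, so that a letter followed by a separate combining accent is handled too.

```python
import unicodedata


def strip_diacritic(char: str) -> str:
    """Return the base character if the Unicode name contains ' WITH '."""
    try:
        name = unicodedata.name(char)
    except ValueError:  # the character has no name (e.g. control characters)
        return char
    base_name, separator, _ = name.partition(" WITH ")
    if not separator:
        return char  # nothing to strip, leave the character alone
    try:
        return unicodedata.lookup(base_name)
    except KeyError:  # no character exists under the shortened name
        return char


def deaccent(word: str) -> str:
    # NFC first, so "a" + combining acute becomes one precomposed character.
    composed = unicodedata.normalize("NFC", word)
    return "".join(strip_diacritic(char) for char in composed)
```

With this, `deaccent("ábaco")` gives `"abaco"`, and characters without "WITH" in their name pass through unchanged.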

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 13, 2020, 03:09:01 AM

Quote from: bitmover on September 12, 2020, 08:18:07 PM

Sorting is not required

Sometimes the implementations want the option to perform a binary search (https://en.wikipedia.org/wiki/Binary_search_algorithm) on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`
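As a small illustration of why sorting matters for lookups, here is a sketch using Python's standard `bisect` module. The helper name `word_index` and the four-word list in the usage note are made up for the example; a real list would hold 2048 entries.

```python
from bisect import bisect_left


def word_index(sorted_words, word):
    """Return the position of word in sorted_words, or -1 if it is absent.

    Only correct if sorted_words is sorted in Python's default string order.
    """
    position = bisect_left(sorted_words, word)
    if position < len(sorted_words) and sorted_words[position] == word:
        return position
    return -1
```

For example, with `words = ["abaco", "abrir", "casa", "zebra"]`, `word_index(words, "casa")` returns 2 and `word_index(words, "gato")` returns -1. On an unsorted list the binary search can miss words that are actually present, which is why an implementation relying on it needs the sorted guarantee.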

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 13, 2020, 10:56:56 AM

Quote from: Coding Enthusiast on September 13, 2020, 03:09:01 AM

Sometimes the implementations want the option to perform a binary search (https://en.wikipedia.org/wiki/Binary_search_algorithm) on their word list which is only possible if the array they are working with (the string[2048] here) is sorted.
This is also suggested in BIP-39 `Wordlist > c) sorted wordlists`

Good catch, reminding me of the exact number of words that should be in a wordlist. I will also implement a check which validates that there are exactly 2048 words in the list.
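A rough sketch of how the checks discussed so far (word count, sort order, and the unique first 4 characters rule from the opening post) could be combined into one report. This is not the code of the validator being developed in this thread; `check_wordlist` and its parameters are hypothetical names, and the sort check assumes plain code point ordering.

```python
def check_wordlist(words, expected_count=2048, prefix_length=4):
    """Return a list of human-readable problems; an empty list means all passed."""
    problems = []
    if len(words) != expected_count:
        problems.append(f"expected {expected_count} words, found {len(words)}")
    if words != sorted(words):
        # Only report it; the file itself is never modified.
        problems.append("words are not in sorted order")
    seen = {}
    for word in words:
        prefix = word[:prefix_length]
        if prefix in seen:
            problems.append(
                f"'{word}' shares its first {prefix_length} characters with '{seen[prefix]}'"
            )
        else:
            seen[prefix] = word
    return problems
```

Returning messages instead of fixing the list matches the "print a warning, don't edit in place" preference stated earlier.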

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 14, 2020, 12:24:00 AM

Quote from: NotATether on September 12, 2020, 10:45:15 PM

Great job, your dictionary is a good start. To utilize it I could write something like this:

Code:

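# Strings are immutable in Python, so collect the replaced characters and join them.
chars = []
for char in word:
  # Fall back to the original character when it is not in the table.
  chars.append(bitmover_table.get(char, char))
word = "".join(chars)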

*So the letters in your dictionary, are they only from the spanish alphabet or did you include letters used in other European languages?

You can add more letters if you think it is necessary, I am not sure this dictionary will cover all possibilities.
About your code, I don't like to use loops unless it is absolutely necessary. Loops are computationally costly and make your code slow.

I did this in my code:

Code:

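import pandas as pd
# Mapping of accented characters to plain ones (contents elided in the original post).
accent_dict = {...}
# One word per line, so read the list as a single-column DataFrame.
spanish = pd.read_csv('spanish.txt', header=None)
# With regex=True, each dictionary key is applied as a pattern inside every word.
spanish = spanish.replace(accent_dict, regex=True)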

Code will be cleaner.
1 line and faster processing instead of a loop

Quote

In the long term, I want to avoid dealing with bugs where a new wordlist is made that has characters that aren't in the table and those pass through without getting de-accented. So I decided to use the code in https://stackoverflow.com/a/15547803/12452330 instead. It checks if the character's Unicode name has "WITH" in it (WITH ACUTE, WITH CIRCUMFLEX, etc.), removes that part from the name, and then reverse-looks up the (normal) character from the name.

As a side effect, this not only removes diacritics from Latin characters, it also removes diacritics from non-Latin characters like Greek and Arabic, as long as the Unicode name has the word WITH in it; if it doesn't, the character is left alone. e.g:

"ά": GREEK SMALL LETTER ALPHA WITH TONOS -> "α": GREEK SMALL LETTER ALPHA

It even strips them from symbols and emojis, which is probably not desirable but I don't need such support for this project. This stack overflow code is very handy in case I ever work on another project that needs this capability.

Bear in mind that the rest of the program fundamentally isn't designed to process non-Latin languages, since the validation rules could be completely different for them, or have more complex combining rules than 1 character/letter. This is just some food for thought for me.

A dictionary is a better approach than using code like that, in my opinion.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 15, 2020, 12:15:50 AM

Someone just made a pull request for a Romanian wordlist: https://github.com/bitcoin/bips/pull/993

That should be a wake up call for me to speed up development. I almost finished the programmatic API. Docs are easy to write, and I think I'll use Nose as my testing suite.

EDIT: Levenshtein distance section of the API completed, initial unique characters and maximum length are almost finished

What would you guys think of me adding a method to fetch the wordlist from the internet, by supplying the URL? It would allow you to instantly check the wordlist if it's on places like Github without having to download it first.

e.g. In your web browser you go to the wordlist file on Github, and then click on View Raw to get this kind of link https://raw.githubusercontent.com/bitcoin/bips/03ff98d00717804723a2c4db8188c0b5cf0cbfbf/bip-0039/romanian.txt, which is a plain text file which can be downloaded and processed easily.
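A minimal sketch of what such a fetch helper might look like, using only the standard library. This is an assumption about the shape of the feature, not the validator's actual API; `parse_wordlist` and `fetch_wordlist` are hypothetical names, and the file is assumed to be UTF-8 with one word per line.

```python
from urllib.request import urlopen


def parse_wordlist(text: str):
    """Split raw file contents into words, ignoring blank lines and padding."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_wordlist(url: str, timeout: float = 10.0):
    """Download a plain-text wordlist (e.g. a raw Github link) and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_wordlist(response.read().decode("utf-8"))
```

Keeping the parsing separate from the download means the same checks can run on a local file or on fetched text.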

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 15, 2020, 03:30:07 AM

Quote from: NotATether on September 15, 2020, 12:15:50 AM

You can add a field to upload the file directly or to add the URL. This makes little difference, as the person who is working with the wordlist certainly has the txt file on their computer or can access Github directly.

It is more important to be able to upload the txt file imo

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 15, 2020, 10:10:34 AM

Quote from: bitmover on September 15, 2020, 03:30:07 AM

Quote from: NotATether on September 15, 2020, 12:15:50 AM

Where are likely sites where people would want to upload the wordlist though? At least for Github and other source control sites, it can already be tracked and committed with git, so I don't really see a point in adding such a feature.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: NotATether on September 25, 2020, 11:59:40 AM

Small update, in case you thought my initiative stagnated.

I am currently writing documentation for my validator library. Once I finish that, I'll make a release candidate and put it on Github for you guys to test. I'm anticipating a proper release within 1 or 2 weeks.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: bitmover on September 25, 2020, 03:40:31 PM

I am going to post an update as well.
We submitted our wordlist as a pull request here:
https://github.com/bitcoin/bips/pull/998

But no answer from contributors who could approve it. Any ideas?

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 25, 2020, 04:04:04 PM

Quote from: bitmover on September 25, 2020, 03:40:31 PM

https://github.com/bitcoin/bips/pull/998

You should squash your 151 commits into 1 (https://stackoverflow.com/questions/5189560/squash-my-last-x-commits-together-using-git).

Quote from: bitmover on September 25, 2020, 03:40:31 PM

But no answer from contributors who could approve it. Any ideas?

Sometimes PRs on the BIPs repo go unnoticed for a very long time. The last commit to the wordlists was merged by luke-jr (https://github.com/bitcoin/bips/commits/master/bip-0039/bip-0039-wordlists.md); try mentioning their username under your PR.

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: TryNinja on September 27, 2020, 12:51:56 AM

Quote from: Coding Enthusiast on September 25, 2020, 04:04:04 PM

You should squash your 151 commits into 1 (https://stackoverflow.com/questions/5189560/squash-my-last-x-commits-together-using-git).

It's done. I would appreciate if you can confirm I didn't make any mistake. It's the first time I do that. :)

https://github.com/bitcoin/bips/pull/998

Title: Re: BIP-39 List of words in Portuguese nearly ready for submission
Post by: Coding Enthusiast on September 27, 2020, 03:27:04 AM

Quote from: TryNinja on September 27, 2020, 12:51:56 AM

It's done. I would appreciate if you can confirm I didn't make any mistake. It's the first time I do that. :)
https://github.com/bitcoin/bips/pull/998

Looks fine to me. You can also see the actual change in Files (https://github.com/bitcoin/bips/pull/998/files) tab to see if everything is alright.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: NotATether on September 30, 2020, 10:07:38 PM

As promised, here is version 1.0.0rc1 of BIP39 Validator (https://github.com/ZenulAbidin/bip39validator/tree/release-1.0.0rc1). Please make sure you clone the release-1.0.0rc1 branch. You are all welcome to test it and check for bugs, but be warned that this is alpha-quality software: the API and main program may not even work and may throw trivial syntax exceptions.

@bitmover: here's the page you can download source code from: https://github.com/ZenulAbidin/bip39validator/releases

I haven't written the unit tests yet, I should do so sometime soon to at least make the program runnable.

Note: I heavily borrowed README formatting from a completely unrelated project called nanogui (https://github.com/wjakob/nanogui) so there may be broken links in my project pointing to resources there. This will be fixed soon. Screenshot also needs to be replaced preferably with one from a live run of BIP39 Validator.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: bitmover on October 03, 2020, 12:58:46 PM

Quote from: NotATether on September 30, 2020, 10:07:38 PM

Hello NotATether,

I tried to test your software, but I had some problems. I will give you a few suggestions:

1- I think you should make an ANN thread for it.
2- This link is broken (documentation) https://bip39validator.readthedocs.io/
3 - I was not able to install it, saw this message:

Code:

pip install bip39validator
ERROR: Could not find a version that satisfies the requirement bip39validator (from versions: none)
ERROR: No matching distribution found for bip39validator

There was no clear message about which requirements were not installed. Personally, I don't like to install too many Python libraries because they can mess up my environments. However, I tried to install it in both my testing env and my base env, and got the same error.

I think Python lacks a good way to share software. It would be nice if you could create a standalone version of your program, or even an online web version (you just send your wordlist.txt file to a server and it returns your Python output). I don't know if it can be done, but I think it can be done with Ruby.
You are also supposing that people who create lists have Anaconda installed, which is a 400 MB piece of software and not newbie-friendly. In my group of the 10 people who created the Portuguese wordlist, only I (afaik) use Python.

There are also some libraries which allow you to create a standalone version of your code. However, they do not work very well, AFAIK. Once I created one of those standalones, which came to nearly 300 MB (basically the whole Anaconda package).

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: NotATether on October 03, 2020, 06:05:50 PM

It's going to be quite a challenge to compose this on mobile, but I hope it comes out correctly. One draft already got trashed by Android so I had to type this again (on mobile again as I'm not available yet on desktop)

Quote from: bitmover on October 03, 2020, 12:58:46 PM

Code:

pip install bip39validator
ERROR: Could not find a version that satisfies the requirement bip39validator (from versions: none)
ERROR: No matching distribution found for bip39validator

First of all thank you for taking the time to try to run this, I really appreciate it especially since it's difficult to find collaborators for small projects.

Yes there are a ton of things which are currently broken, I need to get around testing the library sometime tonight.

Testing has been constantly delayed by continuous power outages etc. so I'm in the final stages of leasing a dedicated server for staging my development on so I don't have to reopen my applications each time.

I have not pushed bip39validator to PyPI or Readthedocs yet. I hope I can do the Readthedocs listing tonight; as for PyPI, I am waiting until I have a stable release before I publish it. I don't want to expose development versions of libraries to PyPI, because people expect them to just work. For now the only way to get the source is to download it from the Releases page on Github, or to clone the release-1.0.0rc1 branch (the other branches are subject to breaking changes).

Then you'd run the main.py script with arguments as I'm not sure if my setup.py script from setuptools works yet. Do not install it using pip, it won't work until I push it to PyPI.

Quote

bip39validator only has three dependencies: jellyfish to compute Levenshtein distances, requests to support downloading wordlists from URLs, and rich for formatting console output. The list of dependencies can also be viewed in the requirements.txt file (it's meant to be read by pip to automatically install dependencies).

I always test libraries in virtualenvs and not on the base Python installation, because the setup.py logic might be broken and need to be debugged. No one wants to deal with a broken library that won't install in their main env, when they can simply delete a virtualenv, create a new one and start over.

Quote

I think Python lacks a good way to share software. It would be nice if you could create a standalone version of your program, or even an online web version (you just send your wordlist.txt file to a server and it returns your Python output). I don't know if it can be done, but I think it can be done with Ruby.

Setuptools can make a standalone program out of python programs, if you give it the file that's supposed to represent the main script. Then you can just run it like a shell command. I just haven't got around implementing that yet.

Maybe I'll develop a GUI version as well, which primarily benefits Windows users who don't want to run programs from Command Prompt. But no more features will be added to this version, so that will have to wait until 1.1.0.

I could also add self-hosting support using Django so people can submit wordlists to a server that does the validation for them; that'll allow me to advertise this to wordlist drafters who don't want to install Python. It will have to wait until the next major version though, because of its complexity. Thank you for this suggestion, @bitmover (and I even figured this out only after writing this a second time! Some things stick in your head much later.)

For now though, I will host a Jupyter notebook out of my own pocket so at least you guys can experience a demo version without installing it.

I put several man-hours in this project already so it's too late to change languages and frameworks.

Quote

You are also supposing that people who create lists have anaconda installed, which is a 400mb software and not newbie-friendly. In my group of the 10 people who created the portuguese wordlist, only I (afaik) use python.

There are also some libraries which allow you to create a standalone version of your code. However, they do not work very well, AFAIK. Once I created one of those standalones, which came to nearly 300 MB (basically the whole Anaconda package).

This project does not use Anaconda and you don't need it to install this. It only requires the three dependencies I mentioned above, jellyfish, requests and rich.

Urgent TODO list:

- Finish setup.py, so people can at least install it sanely.
- Run main.py and make sure there are no basic runtime errors
- ~~Push documentation to Readthedocs~~ Done, please go to https://bip39validator.readthedocs.io (though it's missing all the docstrings, I need to fix that)
- Create unit tests and test wordlists to find bugs in edge cases

Then we can talk about releasing rc2.

I'm not making another post just to announce this, but you all can view progress of this at

https://github.com/ZenulAbidin/bip39validator/projects/1

and

https://github.com/ZenulAbidin/bip39validator/milestone/1

(Github project and milestone dashboards respectively)

There are only three bugs/enhancements I need to attend to before I mark another release candidate.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: NotATether on November 05, 2020, 11:52:37 PM

BIP39 Validator version 1.0.0rc2 released!

This took way longer than it should have, but now that the heavy work is done, only a few finishing touches are left and then the stable version 1.0.0 should be ready. I weeded most of the bugs out of this version, so you can get the source code and run python setup.py install to install bip39validator and try it out yourselves.

Quote

Pre-release version of BIP39 Validator 1.0.0. Much more stable than 1.0.0rc1 and should run without errors.

Notable changes in this release:

- Setup.py now works and can be used to install bip39validator locally
- Unit tests created
- New documentation theme

Get the source code here: https://github.com/ZenulAbidin/bip39validator/releases/tag/1.0.0rc2

And check out the new documentation for it. https://bip39validator.readthedocs.io/en/latest/

(apologies for bumping your thread again @bitmover, this is the last pre-release I'll announce here and then I'll fork 1.0.0 to my own thread. How is your pull request doing by the way?)

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: bitmover on November 06, 2020, 01:01:36 PM

Quote from: NotATether on November 05, 2020, 11:52:37 PM

BIP39 Validator version 1.0.0rc2 released!

Congratulations NotATether. I will take a look at it later.

I was thinking about trying to implement your software in Javascript, in future versions. I have been studying Javascript a lot, and I think it would be easier to use if you could just create a webpage where you upload your wordlist.txt and receive the feedback in an HTML page as well. What do you think?

However, the Levenshtein distance would be hard to implement in JS, unless it is possible to find a library for it.

Quote

(apologies for bumping your thread again @bitmover, this is the last pre-release I'll announce here and then I'll fork 1.0.0 to my own thread. How is your pull request doing by the way?)

No problem.

Our PR is still pending.
Some people answered it, but I believe our PR is not a high priority...
There is some support from some users, but still no answers from the people with the authority to merge the pull request.

https://github.com/bitcoin/bips/pull/998

Edit: Maybe you could test our wordlist in your software, and then if it passes your tool check you can make a comment? This would be also an opportunity to test and spread the word about your app.

Edit 2: Someone tested our wordlist here:
https://travis-ci.org/github/bitcoin/bips/builds/730941730

The Travis CI build. I don't know what it is, maybe something similar to your tool?

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: Coding Enthusiast on November 06, 2020, 04:01:00 PM

Quote from: bitmover on November 06, 2020, 01:01:36 PM

Edit 2: Someone tested our wordlist here:
https://travis-ci.org/github/bitcoin/bips/builds/730941730

The Travis CI build. I don't know what it is, maybe something similar to your tool?

It is not testing your wordlist. Tools such as travis-ci are "hosted continuous integration service" which are automatically called each time there is a new commit (or if it is set, a new pull request) and they run a script (the .yml file) that is sent to their server on each call to run. The script is usually defined to build and run tests but it can be anything.
For example I use it to build my projects on Linux-x64 and also run all my tests there; it helps catch any problems with the build or any bugs in the code that tests can catch.
In case of bips they are running this script (https://github.com/bitcoin/bips/blob/master/.travis.yml) that performs some basic checks.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: NotATether on November 06, 2020, 05:48:09 PM

Quote from: bitmover on November 06, 2020, 01:01:36 PM

I was thinking about trying to implement your software in Javascript, in future versions. I have been studying Javascript a lot, and I think it would be easier to use if you could just create a webpage where you upload your wordlist.txt and receive the feedback in an HTML page as well. What do you think?

A webpage for uploading is on my TODO list. The beauty about modern programming languages is that they all have some sort of web framework where you can create a web app using the language you want to use. I’ll most likely use Django or Flask libraries to create the webpage which I’ll then host on my spinning rust.

Quote from: bitmover on November 06, 2020, 01:01:36 PM

Our PR is still pending.
Some people answered it, but I believe our PR is not a high priority...
There is some support from some users, but still no answers from the people with the authority to merge the pull request.

https://github.com/bitcoin/bips/pull/998

That other guy in the conversation was talking about how pull requests for BIP39 wordlists in general don’t get merged. Many of the PRs stalled and are left without anyone actively working on replacing words. I think this is because of the sheer amount of manual verification you have to do with 2048 words, and that can overwhelm even a group of 4.

For the record, I believe the BIP39 comments (https://github.com/bitcoin/bips/wiki/Comments:BIP-0039) are why there's a backlog of unmerged wordlists. Something about a flaw in using secure RNGs as entropy. And notice how the comment in the BIP39 standard (https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki) itself says Comments-Summary: Unanimously Discourage for implementation. This comment was retroactively added on March 15, 2017 (https://github.com/bitcoin/bips/commit/d9396155350b16b098a0d0b5ec10da80ebd96d5f#diff-7c99d0bc5824cfd5ac997aa737fd2496c440028563e76308acedb135f193dfa7) along with a bunch of other BIPs.

The problem with using a foreign wordlist, then, seems to be that while two words in different lists with the same line number correspond to each other, PBKDF2 will make a different binary seed for each wordlist because PBKDF2 uses the word characters themselves, not their line numbers in binary form. This means if you e.g. change the English words to corresponding French words, it'll treat it as a different binary seed and will make completely different private keys.

Quote from: https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki#from-mnemonic-to-seed

From mnemonic to seed

A user may decide to protect their mnemonic with a passphrase. If a passphrase is not present, an empty string "" is used instead.

To create a binary seed from the mnemonic, we use the PBKDF2 function with a mnemonic sentence (in UTF-8 NFKD) used as the password and the string "mnemonic" + passphrase (again in UTF-8 NFKD) used as the salt. The iteration count is set to 2048 and HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512 bits (= 64 bytes).
~snip
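
The quoted derivation can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code of any particular wallet; the function name `mnemonic_to_seed` and the two sample sentences are made up for the example, and the sentences are not checked against any wordlist or checksum.

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """Derive the 64-byte seed from a mnemonic sentence as described in the quote."""
    # Password: the mnemonic sentence itself, in UTF-8 NFKD.
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    # Salt: the string "mnemonic" + passphrase, again in UTF-8 NFKD.
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    # 2048 iterations of HMAC-SHA512, 64-byte derived key.
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


# Hypothetical sample sentences, used only to show that the text of the
# words (not their line numbers) is what goes into PBKDF2.
seed_a = mnemonic_to_seed("abandon ability able")
seed_b = mnemonic_to_seed("abacate abaixo abalar")

print(len(seed_a))       # 64
print(seed_a == seed_b)  # False
```

Because the sentence is hashed as text, translating a mnemonic word by word changes the password and so changes the seed, even if every translated word sits on the same line number as the original.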

I'm not even sure if maintainers are taking PRs for wordlists any more. Filtering the GitHub pull requests to the ones created after the day it was unanimously discouraged (https://github.com/bitcoin/bips/pulls?page=1&q=is%3Apr+created%3A%3E2017-03-15++bip39) shows that no proposed wordlist has been merged, except for the Korean wordlist in August 2017.

Our best shot at mass adoption of foreign language wordlists, then, is if we gather them together for Electrum users like this (https://github.com/spesmilo/electrum/tree/master/electrum/wordlist) where PRs will likely be considered.

Quote from: bitmover on November 06, 2020, 01:01:36 PM

Edit: Maybe you could test our wordlist in your software, and then if it passes your tool check you can make a comment? This would be also an opportunity to test and spread the word about your app.

I could, or you could even do it yourself by running bip39validator WORDLIST_FILE, which would make your job tremendously easier. However, one crucial test, checking for similar words in other merged languages' wordlists, has not been implemented yet, and without it you'll just be asked to check the lists for duplicates. Not to discourage you or anything, but also see my reply above on why the wordlists might never be merged into BIP39.

I don’t have an ETA for this feature, it’s not trivial to implement.
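
Until then, the manual duplicate check can be approximated with a short script. The sketch below is not how bip39validator works internally; the function names and the sample words are hypothetical. It covers only two simple checks: words that also appear in another list, and words that share their first 4 characters (one of the rules listed at the top of this thread). It does not attempt the harder "similar words" test.

```python
from collections import defaultdict


def prefix_collisions(words, length=4):
    """Group words that share the same first `length` characters."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:length]].append(word)
    # Keep only prefixes claimed by more than one word.
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}


def shared_words(candidate, merged):
    """Return the words that appear in both lists, sorted alphabetically."""
    return sorted(set(candidate) & set(merged))


# Made-up sample data; in practice each list would be read from a file
# with one word per line.
candidate = ["abacate", "abaixo", "abalar", "abalo"]
other_language = ["abandon", "abalar"]

print(prefix_collisions(candidate))             # {'abal': ['abalar', 'abalo']}
print(shared_words(candidate, other_language))  # ['abalar']
```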

Quote from: Coding Enthusiast on November 06, 2020, 04:01:00 PM

Quote from: bitmover on November 06, 2020, 01:01:36 PM

Edit 2: Someone tested our wordlist here:
https://travis-ci.org/github/bitcoin/bips/builds/730941730

The Travis CI build. I don't know what it is, maybe something similar to your tool?

That integration test has nothing to do with wordlists; it runs:

- scripts/link-format-chk.sh
and
- scripts/buildtable.pl

to assemble all the BIPs together in the README file. That makes sense because all the pull requests are for updating BIPs.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: TryNinja on December 20, 2020, 07:20:20 PM

Our list just got merged! 🎉

https://github.com/bitcoin/bips/commit/cf0b529e78860fa2d4fe77944091aa98c5e04624
https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md#portuguese

Thank you for all the help,
@NotATether
@Coding Enthusiast
@ETFbitcoin

And the people who helped write it,
@alegotardo
@bitmover
@brenorb
@kuthullu
@sabotag3x
@Trimegistus

Obrigado! ;D

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: Coding Enthusiast on December 21, 2020, 10:12:24 AM

Quote from: TryNinja on December 20, 2020, 07:20:20 PM

Our list just got merged! 🎉

I just added the new wordlist (https://github.com/Autarkysoft/Denovo/commit/e3b52e3309e56514e1d9899a2f532c94fce3bf22) to Bitcoin.Net, which means the next version of my other project, FinderOuter, will also be able to automatically recover any mnemonic that uses the Portuguese word list. ;)

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: NotATether on December 21, 2020, 10:48:28 AM

Do you think that this should get merged into Electrum's seed version system too?

They seem to already have a Portuguese wordlist which is not going to be removed for compatibility reasons. What is the likelihood of such a pull request getting merged? I'm not sure if they are willing to add a "Portuguese v2" wordlist to the codebase given that they are also missing some other official BIP39 wordlists.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: alegotardo on December 21, 2020, 11:33:46 AM

Quote from: TryNinja on December 20, 2020, 07:20:20 PM

Our list just got merged! 🎉

Congratulations to everyone who contributed to this!!!
After years without new additions to BIP39, I thought that day would never come.

Quote from: NotATether on December 21, 2020, 10:48:28 AM

Sorry for my ignorance.
But could you tell me where I can find this wordlist?

I changed the language of my Electrum desktop wallet to Portuguese and tried to create a new wallet, but the seeds I received were still all in English.

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: bitmover on December 21, 2020, 11:38:53 AM

Quote from: alegotardo on December 21, 2020, 11:33:46 AM

I made a simple search and found that:
https://github.com/spesmilo/electrum/blob/master/electrum/wordlist/portuguese.txt

I didn't know that it existed either. This is so crazy.

Quote from: NotATether on December 21, 2020, 10:48:28 AM

It was done in 2014 and it is a Monero list?

Quote

It has only 1626 words and it cannot be used in BIP39 lists like ours.
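
The reason is the list size: BIP39 cuts the entropy plus checksum into 11-bit groups and uses each group as an index into the wordlist, so a list needs exactly 2^11 = 2048 entries. A small sketch of that arithmetic (the helper name `index_bits` is made up for this example):

```python
def index_bits(list_size: int):
    """Return the bits encoded per word if list_size is a power of two, else None."""
    if list_size > 0 and list_size & (list_size - 1) == 0:
        return list_size.bit_length() - 1
    return None


print(index_bits(2048))  # 11
print(index_bits(1626))  # None

# 128 bits of entropy plus a 4-bit checksum gives 132 bits,
# which splits into twelve 11-bit word indices.
entropy_bits = 128
checksum_bits = entropy_bits // 32
print((entropy_bits + checksum_bits) // 11)  # 12
```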

Title: Re: BIP-39 List of words in Portuguese submitted!
Post by: alegotardo on December 21, 2020, 12:25:01 PM

Quote from: bitmover on December 21, 2020, 11:38:53 AM

I made a simple search and found that:
https://github.com/spesmilo/electrum/blob/master/electrum/wordlist/portuguese.txt

I didn't know that it existed either. This is so crazy.

Tks bitmover!

I'm terrified of what I saw on that list :o
It has several unusual words, words that differ by only one letter, words whose accents were stripped, words longer than 8 letters... a very, very bad list.

Title: Re: [MERGED] BIP-39 List of words in Portuguese accepted!!
Post by: naufragus on December 22, 2020, 06:04:20 AM

Thanks for your effort.

Maybe a quote from rudimentary UNIX philosophers:

Those who don't understand Unix are condemned to reinvent it

I reckon that refers to communal computing and its joy.

Title: Re: [MERGED] BIP-39 List of words in Portuguese accepted!!
Post by: bitmover on March 20, 2021, 12:40:51 AM

After being merged into BIP39, I am happy to announce that the Portuguese word list finally got into iancoleman.io/bip39 :)

https://i.imgur.com/BQUy68C.png
https://iancoleman.io/bip39/#portuguese

I hope we will soon see it in some good wallets :)

Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: bitmover on September 04, 2020, 05:09:18 PM