There was a power outage, so I couldn't start working on the fix. Power just got restored, so I will start working on it now. In my haste I forgot to upload the Changelog file and to make bip39validator tell a URL input apart from a file. I have a long list of TODOs tonight.
It needs about 200MB to run comfortably. Inside a Jupyter Notebook, however, it can use up to 700MB (I don't know why; it is definitely a problem with Jupyter).
This is the main problem with Python: it is very hard to distribute any software.
Your code probably takes less than 100 KB, but due to all the dependencies and libraries it uses more than 200 MB... I had this problem as well with very small apps that I made.
Probably the best way to use it is through Google Colab (which is not ideal).
On the contrary, the large footprint comes from my own code
I did a few memory profiles of bip39validator using mprof, and it turns out that most of my own code, and the functions I use from my dependencies, use no more than 50MB. Requests and rich have almost no memory footprint, and the same goes for jellyfish.levenshtein_distance (it just takes two strings as input and returns a number; it's not taking a whole array like I am).
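For reference, the profiling setup is nothing fancy. It looks roughly like this; the driver script below is a hypothetical example, not a file from the repo:

# profile_driver.py -- hypothetical driver script
# "mprof run profile_driver.py" followed by "mprof plot" graphs memory over time;
# "python -m memory_profiler profile_driver.py" prints a line-by-line report instead.
from memory_profiler import profile

@profile  # mark this function for memory_profiler
def run_levenshtein_test():
    # Load a wordlist and run the Levenshtein distance test here;
    # the exact bip39validator calls are omitted on purpose.
    pass

if __name__ == "__main__":
    run_levenshtein_test()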
The large memory usage is coming from the code in my Levenshtein distance results class. As that test runs, the program builds an array of Levenshtein distances across all words. This takes O(N^2) space, but I was able to reduce the space by a factor of 12 by serializing this intermediate array as a string. After the distances are serialized like this, the string is loaded directly into the LevDistResult class.
I used to store the intermediate distance results as a list of dictionaries with keys and values, but this took 2.4GB (!) to run the Levenshtein distance test. Replacing that with a string array where fields and entries are separated by commas and dashes brought it down to 1.4GB. Finally, I rewrote the LevDistResult class so it doesn't carelessly create more intermediate representations of the array, bringing its memory usage to where it is today.
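To make the difference concrete, here is a toy illustration of the two representations (the field names and separators are only illustrative, not necessarily the exact ones in the code):

import itertools
import jellyfish

words = ["abandon", "ability", "able", "about"]  # toy wordlist

# Old representation: one dict per pair, with a lot of per-object overhead.
pairs_as_dicts = [
    {"word1": a, "word2": b, "dist": jellyfish.levenshtein_distance(a, b)}
    for a, b in itertools.combinations(words, 2)
]

# New representation: a single delimited string of the form
# "word1,word2,dist-word1,word2,dist-...", with far less overhead per entry.
pairs_as_string = "-".join(
    f"{a},{b},{jellyfish.levenshtein_distance(a, b)}"
    for a, b in itertools.combinations(words, 2)
)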
I could also split the string array into pieces and compress each of them with LZMA, but honestly that takes so long to run that I abandoned the idea.
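For the record, the abandoned idea looked roughly like this (the chunk size is arbitrary):

import lzma

def compress_chunks(serialized, chunk_size=1_000_000):
    # Split the big delimited string into fixed-size pieces and LZMA-compress
    # each one; pieces would be decompressed lazily when needed.
    data = serialized.encode("utf-8")
    return [lzma.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]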
To be honest, I have no idea what Jupyter is importing to triple bip39validator's memory usage. Vanilla Python doesn't have this problem, so just use that and you should be fine; regular end users don't use Jupyter anyway.
I'm surprised, however, that I get normal memory usage when it's running inside Anaconda without any Jupyter libraries imported. (In fact, the sample run shown in the GIF was run in a virtualenv within Anaconda.)
Features:
- A Levenshtein distance check
- A unique prefix check
- A maximum length check
- Supports wordlists from local files and plain text remote URLs
- Powerful statistical classes for analyzing the results and drawing conclusions from wordlists
Planned features:
- Spawn a self-hosted webpage that can validate an uploaded wordlist
- Add a fourth test that checks there are no similar words in other languages' wordlists
I think the best solution would be to move it all to JS; then anyone could run it with less than 500 KB (at most). JS has a very similar syntax, but it lacks pandas, so you will probably need to deal with the JSON format instead of pandas dataframes...
I don't know exactly how to do it, especially the Levenshtein distance check, but all the other checks are simple and probably easy to do in JS. It would be nice if someone could just upload a TXT file and get the results on an HTML page.
I really don't want to rewrite it in another language after I just finished it
but for real though, the memory usage stems from an algorithmic problem, not a language problem. It would probably still take 200MB if I wrote it in Node.js.
As I wrote very similar code to create the Portuguese Wordlist, I notice that an important feature is missing from your code.
If there is a word such as "mamá" in one list and the word "mama" in another list, your code will miss that (I guess).
I made a dictionary to solve this problem. I just replaced all the special characters in the lists using this dictionary, and then I ran the checks.
~snip
Then I just replaced all lists using
italian = italian.replace(repl_dict, regex=True)
I actually already have that feature; see rmdiacritics in internal/util.py
https://github.com/ZenulAbidin/bip39validator/blob/04fdb443fae7c614d544d7fbc2434ec28a31548b/bip39validator/internal/util.py#L50-L67. I look up the Unicode name of the character, such as "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" for Â, remove everything after WITH until I'm left with "LATIN CAPITAL LETTER A", and then transform the Unicode name back into a character.
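Roughly, the idea is this (a simplified sketch, not a copy of the code in util.py):

import unicodedata

def strip_diacritic(ch):
    # "Â" -> "LATIN CAPITAL LETTER A WITH CIRCUMFLEX"
    name = unicodedata.name(ch)
    # Keep everything before " WITH ": "LATIN CAPITAL LETTER A"
    base = name.split(" WITH ")[0]
    try:
        # Turn the trimmed name back into a character: "A"
        return unicodedata.lookup(base)
    except KeyError:
        return ch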
It's not normalization per se, because someone could type a capital A followed by a combining circumflex diacritic and cause the sanity check (the very first test) to fail. This is also a bug I have to fix.
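One way I could fix it is to NFC-normalize the input before the checks run, so a base letter plus a combining mark collapses into the single precomposed character (just a sketch of the idea, not code from the repo):

import unicodedata

decomposed = "A\u0302"                      # "A" followed by a combining circumflex
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00c2"                 # the single precomposed character "Â"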
I tried to test your Portuguese wordlist with this, but it simultaneously outputs success and failure (the bug reported by @ETFbitcoin), and no word pairs are printed. So either your wordlist is flawless, or the program is doing something wrong. I'm leaning towards your wordlist being flawless, as I'm sure the part of the program that prints the words is fully debugged.