Even with a 5-bit alphabet you waste a lot of space; you need to compress the message first, using arithmetic coding, for example.
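Just to illustrate the idea: here is a toy float-based arithmetic coder in Python. It is fine for short messages only; a real implementation would use integer ranges and renormalization to avoid precision loss.

```python
from collections import Counter

def build_ranges(freqs):
    """Map each symbol to a sub-interval of [0, 1) proportional to its frequency."""
    total = sum(freqs.values())
    ranges, low = {}, 0.0
    for sym, f in sorted(freqs.items()):
        ranges[sym] = (low, low + f / total)
        low += f / total
    return ranges

def encode(msg, ranges):
    """Narrow [low, high) by each symbol's sub-interval; return a point inside."""
    low, high = 0.0, 1.0
    for sym in msg:
        lo, hi = ranges[sym]
        span = high - low
        high = low + span * hi
        low = low + span * lo
    return (low + high) / 2

def decode(code, ranges, length):
    """Invert the narrowing: find which sub-interval the code falls into."""
    out = []
    for _ in range(length):
        for sym, (lo, hi) in ranges.items():
            if lo <= code < hi:
                out.append(sym)
                code = (code - lo) / (hi - lo)
                break
    return ''.join(out)
```

The whole message collapses into a single number whose bit length approaches the entropy of the text.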
You can even do it at the word level rather than the letter level.
Right, I have thought about both options: first, to assign a different number of bits to symbols depending on their frequency (from 4 to 6 or 7 bits, though I would need to measure the efficiency on English texts first), and second, to use dictionaries for each language, like
https://github.com/bitcoin/bips/blob/master/bip-0039/english.txt
For example, the first two bits of the "lexical token" would select one of:
a) the following 6 bits encode one of the 64 most popular words;
b) the following 14 bits encode one of the 16000+ less popular ones;
c) the following is a UTF-8 encoded chunk whose length (in symbols, not bytes) is 1..32 if the next bit is 0, or a VARINT-encoded length shifted by three bits if it's 1;
d) the same for UTF-16LE, where the length is the number of Unicode symbols.
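If it helps, here is a rough Python sketch of how decoding such tokens might look. The concrete prefix values (00 and 01 for cases a and b) and the word lists are assumptions on my part, and the raw-chunk cases c) and d) are omitted for brevity:

```python
class BitReader:
    """Reads big-endian bit fields from a byte string."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def read(self, n: int) -> int:
        val = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            val = (val << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return val

def decode_token(r: BitReader, top64: list, rest: list) -> str:
    prefix = r.read(2)
    if prefix == 0b00:       # a) one of the 64 most popular words
        return top64[r.read(6)]
    if prefix == 0b01:       # b) one of up to 16384 less popular words
        return rest[r.read(14)]
    raise NotImplementedError("raw UTF-8 / UTF-16LE chunks not sketched here")
```

For example, the byte `0b00000101` decodes as case a) with index 5.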
The problem with this is that I want to provide it for at least half a dozen languages, and the project is likely to outgrow its hobbyist nature :=)
Another argument in favour of the 5-bit scheme is that it's very simple and can be easily ported to many programming languages if it becomes popular.
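As a quick illustration of that simplicity, packing and unpacking 5-bit codes takes only a few lines (the 32-symbol alphabet here is just a placeholder, not the actual one):

```python
# Placeholder alphabet: exactly 32 symbols, so each fits in 5 bits.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,!?-"

def pack5(text: str) -> bytes:
    """Pack symbols as consecutive 5-bit codes, zero-padding the last byte."""
    acc = nbits = 0
    out = bytearray()
    for ch in text:
        acc = (acc << 5) | ALPHABET.index(ch)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack5(data: bytes, count: int) -> str:
    """Extract `count` 5-bit codes back into symbols."""
    acc = nbits = 0
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 5 and len(out) < count:
            nbits -= 5
            out.append(ALPHABET[(acc >> nbits) & 0x1F])
    return ''.join(out)
```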
For the "Morse telegraph" option I would at least need to create frequency tables of letters for each language (or find them on the web).
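Building such a table is nearly a one-liner once you have a corpus:

```python
from collections import Counter

def letter_frequencies(text: str) -> dict:
    """Relative frequency of each letter, most common first."""
    letters = [c.lower() for c in text if c.isalpha()]
    total = len(letters)
    return {c: n / total for c, n in Counter(letters).most_common()}
```

Run it over a few megabytes of plain text per language and the variable-length code assignment falls out of the sorted result.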
P.S. In the original solution, the codepage number can occupy more than 3 bits if it's at the end: the most popular ones will be numbered 0..7, in little-endian order, and any trailing zeroes can be dropped if they do not fit into the "last byte".
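If I understand the trick correctly, it could be sketched like this; the 5-bit width and the bit-string representation are purely for illustration, and the exact byte-boundary rule is my interpretation:

```python
def encode_tail(payload_bits: str, codepage: int, width: int = 5) -> str:
    """Append the codepage id LSB-first; drop zero bits past the last whole byte."""
    tail = ''.join(str((codepage >> i) & 1) for i in range(width))  # LSB first
    bits = payload_bits + tail
    extra = len(bits) % 8
    if extra:
        # only safe when the spilled bits are all zero, i.e. a popular codepage
        assert set(bits[-extra:]) <= {"0"}, "non-zero bits would be lost"
        bits = bits[:-extra]
    return bits

def decode_tail(bits: str, payload_len: int, width: int = 5) -> int:
    """Restore the dropped zeroes and read the codepage id back."""
    tail = bits[payload_len:].ljust(width, "0")
    return sum(int(b) << i for i, b in enumerate(tail[:width]))
```

Because a popular codepage's high bits are all zero, the decoder can pad the missing bits with zeroes and still recover the same number.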