The concept's not complicated, but it's hard to explain quickly, so brace yourself for a wall of text...

Originally, the program simply went through every raw byte in the input file and output the Unicode character (0x3900 + raw value), which happens to fall in the "CJK Unified Ideographs Extension A" block, and did the opposite (subtracted 0x3900 from each character's code point) to decode. I chose that block because my computers are already configured to display those characters for other reasons, so it's just more convenient. The program used the Windows API calls WideCharToMultiByte() and MultiByteToWideChar() to actually encode and decode the UTF-8 bytes.

However, torch-rnn doesn't support "seeding" the network with Unicode characters on the command line, and, just so that I could seed it with the start of the training data if I wanted, I modified my program to use a lookup table: it assigns all the non-problematic, printable ASCII characters first, and only switches to Unicode once it runs out of unique ASCII characters.

The table has 2 columns (raw byte, Unicode character) and up to 256 rows. For every raw byte read from the input, the encoder scans the table from the start looking for a matching raw byte. If it finds one, it outputs that row's Unicode character in UTF-8 (note: UTF-8 and ASCII are identical in the printable ASCII range). If it reaches the end of the table (as it exists so far) without a match, it hasn't come across this raw byte before and needs to add a new row for it. It chooses the next printable ASCII character not yet used by the table, or, if there are none left, the next unused Unicode character (in that CJK block I mentioned before), and then outputs the newly chosen character just as it would have if the row had already existed. From then on, that raw byte will always produce that Unicode character. When it reaches the end of the input file, the encoder also dumps the lookup table to a separate file, which is needed in order to "decode" the UTF-8 back to raw bytes.

Decoding is much simpler because it doesn't have to worry about building the lookup table - it simply loads the table that was dumped during encoding. Then it's just a case of reading each Unicode character, finding it in the table's 2nd column and outputting the corresponding raw byte from the table's 1st column. I still use the same 2 Windows APIs for converting the characters to/from actual UTF-8 bytes for convenience, but really, you could just make the 2nd column of the table an array of bytes (the UTF-8 representation of the Unicode character). (In hindsight, that'd probably make it much faster - I should probably change my code...)

Anyway, using the lookup table means that the encoded text for the first ~96 unique byte values is pure ASCII and can be passed as a seed (torch-rnn's "-start_text") on the command line. That usually covers much more than 96 bytes of input (a thousand or so is easily possible), since byte values repeat many times before enough unique ones have appeared to force the encoder over to Unicode. For my convenience, the encoder also writes a little TXT file with info such as the byte number in the input data at which it had to start using Unicode characters, so that I can easily select that many ASCII characters and use them with -start_text if I want.
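To make the table logic concrete, here's a rough C++ sketch of how an encoder along these lines could look. It's not my actual code: it hand-rolls the UTF-8 output instead of calling WideCharToMultiByte(), assumes the ASCII range to hand out is simply 0x20-0x7E (the real program skips a few problematic characters), and starts the Unicode fallback at 0x3900, the same base as the original scheme.

    // Sketch of the table-based encoder (illustrative, not the real program).
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Row { uint8_t raw; uint32_t codePoint; };   // one lookup-table row

    // Write the UTF-8 bytes for a code point (only the 1-byte and 3-byte cases matter here).
    static void emitUtf8(uint32_t cp, FILE* out)
    {
        if (cp < 0x80) {
            std::fputc(static_cast<int>(cp), out);
        } else {                                        // 0x0800..0xFFFF -> 3 bytes
            std::fputc(0xE0 | (cp >> 12), out);
            std::fputc(0x80 | ((cp >> 6) & 0x3F), out);
            std::fputc(0x80 | (cp & 0x3F), out);
        }
    }

    // Pick the next unused code point: printable ASCII first, then the CJK block.
    static uint32_t nextCodePoint(size_t rowsSoFar)
    {
        const uint32_t kAsciiFirst = 0x20, kAsciiLast = 0x7E;   // assumption: all printable ASCII
        const uint32_t kCjkBase    = 0x3900;                    // assumption: same base as before
        const uint32_t asciiCount  = kAsciiLast - kAsciiFirst + 1;
        if (rowsSoFar < asciiCount)
            return kAsciiFirst + static_cast<uint32_t>(rowsSoFar);
        return kCjkBase + static_cast<uint32_t>(rowsSoFar - asciiCount);
    }

    int main(int argc, char** argv)
    {
        if (argc < 4) { std::fprintf(stderr, "usage: encode <in> <out.utf8> <table>\n"); return 1; }
        FILE* in    = std::fopen(argv[1], "rb");
        FILE* out   = std::fopen(argv[2], "wb");
        FILE* table = std::fopen(argv[3], "wb");
        if (!in || !out || !table) { std::fprintf(stderr, "open failed\n"); return 1; }

        std::vector<Row> rows;                          // the growing lookup table
        int c;
        while ((c = std::fgetc(in)) != EOF) {
            uint8_t raw = static_cast<uint8_t>(c);
            uint32_t cp = 0;
            bool found = false;
            for (const Row& r : rows)                   // linear scan from the start, as described
                if (r.raw == raw) { cp = r.codePoint; found = true; break; }
            if (!found) {                               // first time seeing this byte: add a row
                cp = nextCodePoint(rows.size());
                rows.push_back({ raw, cp });
            }
            emitUtf8(cp, out);
        }

        for (const Row& r : rows)                       // dump the table for the decoder
            std::fprintf(table, "%u %u\n", static_cast<unsigned>(r.raw),
                         static_cast<unsigned>(r.codePoint));

        std::fclose(in); std::fclose(out); std::fclose(table);
        return 0;
    }

The decoder is then just the reverse: load those (raw byte, code point) pairs back in, decode each UTF-8 character from the encoded file, find its code point in the 2nd column and write out the raw byte from the 1st.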