Archive for May, 2017

BinToUTF8 – Public release

Thursday, May 4th, 2017

Because several people have asked for it, I’ve decided to release my program for converting any binary file to a valid UTF-8-encoded text file (and vice-versa). This is the program I made to be able to train the open-source neural network software “torch-rnn” on audio, even though it’s only designed to work with text, in these previous videos.

My program is a console-mode program, so it has no graphical interface, and it’s an EXE, so it’ll only run on Windows (and maybe Wine). It’s also slow, because I hadn’t had the pressure (from the idea of making it public) to optimize it until I suddenly decided to release it this evening. It comes with pseudocode and a technical description for any programmers who want to remake it to run on other OSes, though (they’re the same text files I linked to in the blog post for my first neural network video).

The download contains BinToUTF8.exe, which you can use yourself on the command prompt (run it from the command prompt without any parameters to see usage instructions). It also contains several batch files, which make it much more convenient to use – you only have to drag a binary or text file onto the batch file on Windows Explorer to automatically launch BinToUTF8.exe with the appropriate command line parameters.

A brief description is below, but make sure you read the included “info.txt” to find out what each batch file does and avoid accidentally overwriting any of your own files!

The program works by assigning a unique Unicode or ASCII character to each of the 256 possible byte values in your binary file. There are 2 modes for this:

  • Byte/Character Lookup (BCL) mode (recommended):

Characters are assigned on a “first-come, first-served” basis, meaning that bytes appearing near the beginning of the file will be assigned ASCII characters, and Chinese Unicode characters will only be used once no more ACSII characters are available. This is done to allow you to pass text from the start of the file to torch-rnn using torch-rnn’s -start_text parameter, which does not support Unicode characters. A utf8.bcl file is made when converting to text and is required when converting back to binary. This file is the lookup table for converting between bytes and Unicode characters which the program made when converting the binary file to text.

  • Non-BCL mode (default, not recommended for torch-rnn):

All bytes are converted to Chinese Unicode characters and none are converted to ASCII. This means the text file will be larger, but more importantly, you won’t be able to use any of this text with torch-rnn’s -start_text parameter. The conversion in this mode may be faster, and no utf8.bcl file is made or required.

Text files made using the BCL mode cannot be converted back to binary using the non-BCL mode, and vice-versa. To convert text back to binary correctly, you must use the same mode that you used when converting the original binary file to text.

You can download BinToUTF8 from here (19 KB). Now, have fun!

(By the way, if training torch-rnn on audio files, you should use an 8-bit audio encoding such as 8-bit PCM, U-law or A-law, to be kind to torch-rnn.)