BinToUTF8 – Public release

Posted on 2017-05-04 at 23:48 in Programs by Robbi-985.

Because several people have asked for it, I’ve decided to release my program for converting any binary file to a valid UTF-8-encoded text file (and vice-versa). This is the program I made for my previous videos so that I could train the open-source neural network software “torch-rnn” on audio, even though it’s only designed to work with text.

My program is a console-mode program, so it has no graphical interface, and it’s an EXE, so it’ll only run on Windows (and maybe Wine). It’s also slow, because I had no pressure to optimize it (from the idea of making it public) until I suddenly decided to release it this evening. However, it comes with pseudocode and a technical description for any programmers who want to remake it to run on other OSes (they’re the same text files I linked to in the blog post for my first neural network video).

The download contains BinToUTF8.exe, which you can use yourself from the command prompt (run it without any parameters to see usage instructions). It also contains several batch files, which make it much more convenient to use – you only have to drag a binary or text file onto the batch file in Windows Explorer to automatically launch BinToUTF8.exe with the appropriate command-line parameters.

A brief description is below, but make sure you read the included “info.txt” to find out what each batch file does and avoid accidentally overwriting any of your own files!

The program works by assigning a unique Unicode or ASCII character to each of the 256 possible byte values in your binary file. There are 2 modes for this:

  • Byte/Character Lookup (BCL) mode (recommended):

Characters are assigned on a “first-come, first-served” basis, meaning that bytes appearing near the beginning of the file will be assigned ASCII characters, and Chinese Unicode characters will only be used once no more ASCII characters are available. This is done to allow you to pass text from the start of the file to torch-rnn using torch-rnn’s -start_text parameter, which does not support Unicode characters. A utf8.bcl file is made when converting to text and is required when converting back to binary. This file is the lookup table between bytes and characters that the program built when converting the binary file to text.

  • Non-BCL mode (default, not recommended for torch-rnn):

All bytes are converted to Chinese Unicode characters and none are converted to ASCII. This means the text file will be larger, but more importantly, you won’t be able to use any of this text with torch-rnn’s -start_text parameter. The conversion in this mode may be faster, and no utf8.bcl file is made or required.

Text files made using the BCL mode cannot be converted back to binary using the non-BCL mode, and vice-versa. To convert text back to binary correctly, you must use the same mode that you used when converting the original binary file to text.
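
For anyone remaking the program, the BCL idea described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not BinToUTF8’s actual code: the character pools (printable ASCII from “!” to “~”, then 256 CJK ideographs starting at U+4E00), the function names and the in-memory table are all hypothetical, and the real utf8.bcl file format is described in the included text files rather than here.

```python
# Assumed character pools; BinToUTF8's real choices may differ.
ASCII_POOL = [chr(c) for c in range(0x21, 0x7F)]           # 94 printable ASCII characters
CJK_POOL = [chr(c) for c in range(0x4E00, 0x4E00 + 256)]   # CJK Unified Ideographs


def build_bcl_table(data: bytes) -> dict[int, str]:
    """BCL-style table: assign characters to byte values in order of first appearance."""
    pool = iter(ASCII_POOL + CJK_POOL)  # ASCII is handed out first, then CJK
    table: dict[int, str] = {}
    for byte in data:
        if byte not in table:
            table[byte] = next(pool)
    return table


def build_fixed_table() -> dict[int, str]:
    """Non-BCL-style table: every byte value maps to a fixed CJK character."""
    return {b: CJK_POOL[b] for b in range(256)}


def encode(data: bytes, table: dict[int, str]) -> str:
    """Convert bytes to text, one character per byte."""
    return "".join(table[b] for b in data)


def decode(text: str, table: dict[int, str]) -> tuple[bytes, int]:
    """Convert text back to bytes; return the bytes and the count of skipped characters."""
    reverse = {ch: b for b, ch in table.items()}
    out = bytearray()
    failed = 0
    for ch in text:
        if ch in reverse:
            out.append(reverse[ch])
        else:
            failed += 1  # e.g. a trailing CR/LF, which is in neither pool
    return bytes(out), failed
```

Because the table in the first function depends on the order in which bytes appear in the input, it has to be saved alongside the text and reused when decoding, which is the role the utf8.bcl file plays. Decoding with a table from the other mode would map characters to the wrong bytes or fail to recognise them at all.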

You can download BinToUTF8 from here (19 KB). Now, have fun!

(By the way, if training torch-rnn on audio files, you should use an 8-bit audio encoding such as 8-bit PCM, U-law or A-law, to be kind to torch-rnn.)
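
With an 8-bit encoding, every audio sample is exactly one byte, so each sample becomes exactly one character in the text. As a rough sketch of getting there from 16-bit audio, the Python snippet below reduces raw signed 16-bit little-endian PCM samples to unsigned 8-bit PCM. The function name is hypothetical, it assumes headerless mono sample data, and it is not part of BinToUTF8; any audio editor that can export 8-bit PCM does the same job.

```python
import struct


def pcm16_to_pcm8(raw: bytes) -> bytes:
    """Convert signed 16-bit little-endian PCM samples to unsigned 8-bit PCM."""
    # Keep the top 8 bits of each sample, then shift the range from -128..127 to 0..255.
    return bytes((sample >> 8) + 128 for (sample,) in struct.iter_unpack("<h", raw))
```

The input length must be a multiple of 2 bytes, otherwise struct.iter_unpack raises an error.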


5 Responses to BinToUTF8 – Public release

  1. Joel B says:

    I don’t know if you would be willing to release the source but it would be fun to see what would happen if the network could real-time ingest audio and ‘interact’ with people…

  2. Reece says:

    Hey, I tried replicating the results you got in your video “Feeding my voice into a neural network”. How did you get torch to output a file that could be converted back into a sound file? I get an error when trying to convert the output file into audio.

    • Robbi-985 says:

      I simply did e.g.
      th sample.lua -gpu 0 -checkpoint cv/checkpoint_1000.t7 -length 100000 > pphina_01000.txt

      What error are you getting, and from what program, and what does it say?

      • Lasperic says:

        I am facing the same problem. Getting the UTF-8 text from the t7 checkpoint goes through, as does running it through your utftobin conversion. The final binary file is 10 KB (when the length is 100000) and it does not produce any sound.

        While running the bat I get the error:

        Starting conversion…
        ——————————————————— (i) ———————————————————-
        [Done with errors]
        Done (2 / 10,402 failed)
        Press any key to continue . . .

        Any idea what could make so many failed results?
        Thanks

        BTW, many thanks for the tool, it helped a lot.

        • Robbi-985 says:

          Oh, don’t worry about that message! That’s just due to a final carriage return and line feed (2 characters) added to the end of the text by torch-rnn “print”ing the text out to StdOut (as it’s expecting the text to be displayed on a console, and wants any following text to be on a new line). BinToUTF8 doesn’t use those 2 particular characters in its conversion (to prevent cases like this from messing up the conversion), so this message is expected behaviour – that’s why I made it display the number of failed character conversions too, so you can see whether it’s a real problem (thousands failed) or not. It simply ignores those two characters and should produce an output of exactly 10,400 bytes (instead of 10,402 bytes).

          I assume the silence is due to the training data, though – torch-rnn output 10,400 recognised characters, which got converted back into silence, but I can’t tell why torch-rnn predicted silence. To confirm, you could look at the actual text output by torch-rnn and see if it’s the same character (or handful of characters, all around the waveform centre) throughout.
