Neural Network Tries to Generate English Speech (RNN/LSTM)

Posted on 2016-12-24 at 20:56 in Programs, Videos by Robbi-985.

By popular demand, I threw my own voice into a neural network (3 times) and got it to recreate what it had learned along the way!

[Watch in HD]

This is 3 different recurrent neural networks (LSTM type) trying to find patterns in raw audio and reproduce them as well as they can. The networks are quite small considering the complexity of the data. I recorded 3 different vocal sessions as training data, trying to get more impressive results out of the network each time. The audio is 8-bit and at a low sample rate because sound files get very big very quickly, which makes training the network take a very long time. Well over 300 hours of training in total went into the experiments with my voice that led to this video.

The graphs are created from log files made during training, and show the progress that it was making leading up to immediately before the audio that you hear at every point in the video. Their scrolling speeds up at points where I only show a short sample of the sound, because I wanted to dedicate more time to the more impressive parts. I included a lot of information in the video itself where it’s relevant (and at the end), especially details about each of the 3 neural networks at the beginning of each of the 3 sections, so please be sure to check that if you’d like more details.

I’m less happy with the results this time around than in my last RNN+voice video, because I’ve experimented much less with my own voice than I have with higher-pitched voices from various games and haven’t found the ideal combination of settings yet. That’s because I don’t really want to hear the sound of my own voice, but so many people commented on my old video that they wanted to hear a neural network trained on a male English voice, so here we are now! Also, learning from a low-pitched voice is not as easy as with a high-pitched voice, for reasons explained in the first part of the video (basically, the most fundamental patterns are longer with a low-pitched voice).
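To put a rough number on why low-pitched voices are harder, here is the arithmetic: the RNN has to span one full fundamental period to pick up the most basic repeating pattern, and that period covers more samples the lower the pitch is. The sample rate and pitches below are hypothetical illustrations, not the actual values used in the video.

```python
# Rough arithmetic behind "the most fundamental patterns are longer with a
# low-pitched voice". Sample rate and pitches are assumed example values.

SAMPLE_RATE = 8000  # Hz (assumed; the post only says "a low sample rate")

def period_in_samples(pitch_hz: float) -> float:
    """Length of one fundamental period, measured in audio samples."""
    return SAMPLE_RATE / pitch_hz

low = period_in_samples(110.0)   # a typical low male speaking pitch
high = period_in_samples(250.0)  # a higher-pitched voice
print(round(low), round(high))   # prints: 73 32
```

So at these example values, the network must correlate bytes roughly 73 samples apart for the low voice, versus about 32 for the high one, before it can even reproduce the basic periodicity.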

The neural network software is the open-source “torch-rnn”, although that is only designed to learn from plain text. Frankly, I’m still amazed at what a good job it does of learning from raw audio, with many overlapping patterns over longer timeframes than text. I made a program (explained here, and available for download here) that substitutes the raw bytes in any file (e.g. audio) with valid UTF-8 text characters, and torch-rnn happily learned from it. My program also substitutes torch-rnn’s generated text back into raw bytes to get audio again. I do not understand the mathematics and low-level algorithms that make a neural network work, and I cannot program my own, so please check the code and .md files at torch-rnn’s Github page for details. Also, torch-rnn is actually a more efficient fork of an earlier program called char-rnn, whose project page also has a lot of useful information.
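The substitution idea described above can be sketched in a few lines. This is not the author’s actual VB6 program or its real mapping; it just shows one hypothetical way to map each byte value to a distinct valid UTF-8 character and back, which is all torch-rnn needs in order to treat arbitrary binary data as “text”.

```python
# Hypothetical sketch of the byte <-> text substitution described above.
# Each byte value 0..255 is mapped to a distinct code point (U+0100..U+01FF,
# an arbitrary assumed range), so any file becomes valid UTF-8 text.

BASE = 0x100  # assumed offset into a run of 256 distinct, valid code points

def bytes_to_utf8(data: bytes) -> str:
    """Turn raw bytes (e.g. 8-bit PCM audio) into trainable UTF-8 text."""
    return "".join(chr(BASE + b) for b in data)

def utf8_to_bytes(text: str) -> bytes:
    """Turn generated text back into raw bytes (e.g. audio)."""
    return bytes(ord(ch) - BASE for ch in text)

# The round trip must be byte-identical for this scheme to be lossless:
sample = bytes(range(256))
assert utf8_to_bytes(bytes_to_utf8(sample)) == sample
```

Because every byte value gets its own character, the round trip is exactly lossless, matching the byte-identical behaviour the author describes in the comments below.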

I will probably soon release the program that I wrote to create the line graphs from CSV files. It can make images up to 16383 pixels wide/tall with customisable colours, from CSV files with hundreds of thousands of lines, in a few seconds. All free software I could find failed hideously at this (e.g. OpenOffice Calc took over a minute to refresh the screen with only a fraction of that many lines, during which time it stopped responding; the lines overlapped in an ugly way that meant you couldn’t even see the average value; and “exporting” graphs is limited to pressing Print Screen, so you’re limited to the width of your screen… really?).

3 Responses to Neural Network Tries to Generate English Speech (RNN/LSTM)

  1. Cachot says:

Dear Robbi, great work! Can I have the VB6 code for the BINUTF transfer? I was not able to get the C++ pseudocode running… :-( Many thanks from Berlin! Cachot

  2. nbw says:

Hi. I have tried for the last few days to replicate this process as per the methodology you posted on the previous experiment’s post. I’ve verified that bintoutf8 and utf8tobin work bidirectionally by running one audio file through both (although there’s a little quality loss, but I guess that’s to be expected). I got the RNN running and it will train on my generated file, but when I use utf8tobin to turn the output back into an audio file, all I ever get is a very loud solid sound like radio static.

    Would you mind posting a more in-depth end-to-end explanation of the process involved in this? I have been starting with an ogg file, converting it to a raw 8 bit PCM or 16 bit PCM file, using bintoutf8_bcl, and then just using that as the training data on the RNN (same settings as in your other post), then writing a sample to a text file and converting it to a bin file with utf8tobin. What am I missing here?

    • Robbi-985 says:


You can’t use 16-bit PCM because my program works at the byte level, and the network won’t be able to learn to produce pairs of bytes reliably. Changing my program to work with 16 bits isn’t feasible either, as you’d need ~65 thousand input and output neurons, which would probably make the network gigabytes in size and would certainly make it unusably slow.

You need to use a format that is raw and one byte per sample. So 8-bit PCM will work, as will u-law and A-law. I don’t know why it didn’t work for you, but there should be zero loss in the conversion process to and from text – you should be able to take any file type, even executables, images or other text files, convert in both directions and get a byte-identical file. Are you sure that you were correctly importing the converted file into Audacity? E.g. using “unsigned 8-bit PCM” and not “signed 8-bit PCM” if you exported as “unsigned 8-bit PCM”? Audacity will try to auto-detect the format when importing raw data, but it’s quite unreliable. Also:
      - It may take a few thousand iterations before you get anything sounding like noise.
      - I never had success with torch-rnn when using more than 4 layers, so maybe start with 2 or 3 layers.
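The signed-vs-unsigned mix-up mentioned above is worth illustrating, since it is a common way to end up with loud static: the same bytes decode to very different waveforms depending on interpretation, and converting between the two conventions is just flipping the top bit. This is a minimal sketch of that relationship, not part of the author’s tools.

```python
# Signed vs unsigned 8-bit PCM: identical bytes, different interpretation.
# Unsigned: 0..255 with silence at 128. Signed: -128..127 with silence at 0.
# Converting between them is an XOR of the sign/offset bit (0x80).

def unsigned_to_signed(data: bytes) -> bytes:
    """Reinterpret unsigned 8-bit PCM samples as signed (and vice versa)."""
    return bytes(b ^ 0x80 for b in data)

silence = bytes([128] * 4)                 # silence in unsigned 8-bit PCM
print(list(unsigned_to_signed(silence)))   # prints: [0, 0, 0, 0]
```

If a player misreads unsigned audio as signed, every quiet sample near 128 becomes a sample near the extremes, which sounds like the loud solid static described in the comment above.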
