Neural Network Learns to Generate Voice (RNN/LSTM)

Posted on 2016-05-24 at 05:15 in Programs, Videos by Robbi-985.

This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it’s learned. (WARNING: Although I decreased the volume and there’s a visual indication of what sound is to come, please don’t have your volume too high.)


This is a recurrent neural network (LSTM type) with 3 layers of 680 neurons each, trying to find patterns in audio and reproduce them as well as it can. It’s not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me even more impressed with what it managed to do.

The audio that the network was learning from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. I used 11025 Hz, 8-bit audio because sound files get big quickly, at least compared to text files – 10 minutes already runs to 6.29MB, while that much plain text would take weeks or months for a human to read.
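As a quick sanity check on that file size, here is a minimal Python sketch of the arithmetic. It assumes the audio is mono and counts raw samples only (no file header); neither detail is stated above, and the function name is just for illustration.

```python
def raw_audio_bytes(sample_rate_hz, bits_per_sample, seconds, channels=1):
    """Size in bytes of uncompressed PCM audio (header not included)."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * seconds


# 10 minutes at 11025 Hz, 8-bit, assumed mono: one byte per sample.
voice_bytes = raw_audio_bytes(11025, 8, 10 * 60)

# The same 10 minutes at CD quality (44100 Hz, 16-bit, stereo), for comparison.
cd_bytes = raw_audio_bytes(44100, 16, 10 * 60, channels=2)

print(voice_bytes, "bytes =", round(voice_bytes / 2**20, 2), "MiB")
print("CD quality would be", cd_bytes // voice_bytes, "times larger")
```

Under these assumptions the result is about 6.3 MiB, which is in line with the 6.29MB quoted for roughly 10 minutes of audio.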

I was using the program “torch-rnn”, which is actually designed to learn from and generate plain text. I wrote a program that converts any data into UTF-8 text and vice versa, and to my excitement, torch-rnn happily processed that text as if there were nothing unusual about it. I did this because I don’t know where to begin coding my own neural network program, but this workaround has some annoying constraints. For example, torch-rnn doesn’t like to output more than about 300KB of data, which is why all the generated sounds are only ~27 seconds long.
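The link between the ~300KB output limit and the ~27-second clips follows from one byte per sample at 11025 Hz. The sketch below shows that calculation, plus one possible way to wrap decoded bytes as a playable WAV using Python's standard `wave` module. It assumes mono, headerless 8-bit samples; the function names are hypothetical and this is not necessarily how the original files were produced.

```python
import io
import wave

SAMPLE_RATE = 11025  # Hz, as stated in the post


def duration_seconds(num_bytes, sample_rate=SAMPLE_RATE):
    """Playing time of 8-bit mono audio: one byte is one sample."""
    return num_bytes / sample_rate


def wrap_as_wav(raw_samples, out_file, sample_rate=SAMPLE_RATE):
    """Write raw unsigned 8-bit mono samples to a WAV file or file-like object."""
    with wave.open(out_file, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(1)   # 1 byte per sample = 8-bit
        wav.setframerate(sample_rate)
        wav.writeframes(raw_samples)


# 300,000 bytes of silence (128 is the midpoint of unsigned 8-bit audio).
silence = bytes([128]) * 300_000
buffer = io.BytesIO()
wrap_as_wav(silence, buffer)
print(round(duration_seconds(len(silence)), 1), "seconds")
```

Passing a filename such as `"sample.wav"` instead of the `BytesIO` buffer would write the result to disk.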

It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (output audio). These times are quite approximate as the same server was both training and sampling (from past network “checkpoints”) at the same time, which slowed it down. Huge thanks go to Melan for letting me use his server for this fun project! Let’s try a bigger network next time, if you can stand waiting an hour for 27 seconds of potentially-useless audio. xD
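The training figures can be cross-checked with a little arithmetic. The time per iteration follows directly from the numbers above. The epoch check additionally assumes things the post does not state: that the training data was about 10 minutes of audio (one character per sample), that 80% of it went to the training split, and that the batch size and sequence length were both 50, which are believed to be torch-rnn's defaults. Treat it as a plausibility sketch only.

```python
TRAIN_HOURS = 29          # from the post (approximate)
ITERATIONS = 74_000       # from the post

# Average wall-clock time per training iteration.
seconds_per_iteration = TRAIN_HOURS * 3600 / ITERATIONS

# --- Everything below rests on assumptions, not on the post ---
TOTAL_CHARS = 11025 * 10 * 60   # assumed: 10 minutes, one character per sample
TRAIN_PERCENT = 80              # assumed: 10% validation + 10% test held out
BATCH_SIZE = 50                 # assumed torch-rnn default
SEQ_LENGTH = 50                 # assumed torch-rnn default


def iterations_per_epoch(total_chars, train_percent, batch_size, seq_length):
    """Number of full batches needed to see the training split once."""
    train_chars = total_chars * train_percent // 100
    return train_chars // (batch_size * seq_length)


per_epoch = iterations_per_epoch(TOTAL_CHARS, TRAIN_PERCENT, BATCH_SIZE, SEQ_LENGTH)
estimated_epochs = ITERATIONS / per_epoch

print(round(seconds_per_iteration, 2), "s per iteration")
print(per_epoch, "iterations per epoch,", round(estimated_epochs, 1), "epochs")
```

Under those assumptions the estimate lands close to the ~35 epochs reported, which suggests the quoted figures are mutually consistent.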

I feel that my target audience couldn’t possibly get any smaller than it is right now…

EDIT: Because I’ve been asked a lot, the settings I used for training were: rnn_size: 680, num_layers: 3, wordvec_size: 110. Also, here are some graphs showing losses during training:


Training loss (at every iteration) (linear time scale)


Training loss (at every iteration) (logarithmic time scale)


Validation loss (at every checkpoint, i.e. every 1000th iteration) (linear time scale)


Validation loss (at every checkpoint, i.e. every 1000th iteration) (logarithmic time scale)

For sampling, I simply used torch-rnn’s default settings (which include a temperature of 1), specifying only the checkpoint and length and redirecting the output to a file.

For training an RNN on voice in this way, I think the most important aspect is how “clear” the audio is, i.e. how obvious the patterns are against the noise, plus the fact that it’s 8-bit, so the network only has to learn from 256 unique symbols. This relatively sharp-sounding voice is very close to a filtered sawtooth signal, whereas other voices are more breathy/noisy (the difference is visible just by looking at the waveform), so I think the network had an easier time learning this voice than it would have had with some others. There’s also the simple fact that, because the voice is high-pitched, the patterns that it needs to learn are shorter.
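To make the point about pitch concrete: one cycle of a voiced sound lasts `sample_rate / pitch` samples, so a higher pitch means fewer characters per repeating pattern. The pitches in this small sketch are hypothetical round numbers chosen for illustration, not measurements of the voice used here.

```python
SAMPLE_RATE = 11025  # Hz, as used for the training audio


def period_in_samples(pitch_hz, sample_rate=SAMPLE_RATE):
    """Length of one pitch cycle, measured in samples (i.e. characters)."""
    return sample_rate / pitch_hz


# Hypothetical example pitches, from a low voice to a high one.
for pitch_hz in (120, 220, 400):
    cycle = period_in_samples(pitch_hz)
    print(f"{pitch_hz} Hz -> about {cycle:.0f} samples per cycle")
```

At this sample rate, a voice around 400 Hz repeats roughly every 28 samples, while one around 120 Hz needs about 92, so the network has to remember over three times as much context to capture a single cycle of the lower voice.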

EDIT 2: I have been asked several times about my binary-to-UTF-8 program. The program basically replaces each raw byte value with the valid UTF-8 encoding of a character, so after conversion, there’ll be a maximum of 256 unique UTF-8 characters. I threw the program together in VB6, so it will only run on Windows. However, I rewrote all the important code in a C++-like pseudocode here. Also, here is an English explanation of how my binary-to-UTF-8 program works.
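For readers who cannot run the VB6 program, the general idea can be sketched in a few lines of Python. This is not the author's mapping: the alphabet below is a hypothetical one that uses printable ASCII first (in the spirit of the remark in the comments about using ASCII for as long as possible) and then continues from U+00A1. Any 256 distinct characters would work, as long as encoding and decoding use the same table.

```python
def build_alphabet():
    """Return 256 distinct characters, one per byte value (hypothetical table)."""
    # Printable ASCII, U+0020..U+007E: 95 single-byte characters.
    chars = [chr(code) for code in range(0x20, 0x7F)]
    # Fill the remaining 161 slots with consecutive code points from U+00A1.
    code = 0xA1
    while len(chars) < 256:
        chars.append(chr(code))
        code += 1
    return chars


ALPHABET = build_alphabet()
REVERSE = {char: value for value, char in enumerate(ALPHABET)}


def bytes_to_text(data):
    """Map every byte to its character; the result is safe to save as UTF-8."""
    return "".join(ALPHABET[value] for value in data)


def text_to_bytes(text):
    """Inverse mapping; raises KeyError for characters outside the table."""
    return bytes(REVERSE[char] for char in text)


sample = bytes([0, 1, 128, 255])
as_text = bytes_to_text(sample)
print(repr(as_text), "->", list(text_to_bytes(as_text)))
```

In use, the text from `bytes_to_text` would be written to a file with `encoding="utf-8"` for training, and generated text would be read back the same way and passed through `text_to_bytes`. A model can occasionally emit a character outside the table, so a real tool would need to decide how to handle that case rather than raising an error.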


5 Responses to Neural Network Learns to Generate Voice (RNN/LSTM)

  1. Kyle Polich says:

    Hi Robbi,

    Apologies for using the comment box to reach out, I wasn’t able to find any contact info on your site. I think this is a really interesting post and I’d like to feature it on the Data Skeptic podcast. If you’re interested, would you mind shooting an email to discuss the details?

    Thanks!

    Kyle

  2. Giancarlo says:

    I’m very interested in your results. Did you use -start_text to generate the samples in this video? Your answer would be much appreciated. Thank you for sharing.

    • Robbi-985 says:

      I planned to, which is why I went through the effort of making my binary-to-UTF-8 program use ASCII characters for as long as possible before using Unicode, but in the end, I never did use -start_text.

  3. Ayat says:

    Thanks for doing all this work. I am following your methodology for a similar project. I managed to install and get some test results. However, being able to execute your pseudocode would be so amazing. What I am doing now is encoding the .wav into a text file with a tool called sox, and then (semi-)manually deleting the timestamps for each sample. I then replace every digit with a letter of the alphabet. This whole thing takes a long time, as my programming skills are very basic. Is there any way of using your code? Would appreciate it a lot.
