This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it’s learned. (WARNING: Although I decreased the volume and there’s visual indication of what sound is to come, please don’t have your volume too high.)
This is a recurrent neural network (LSTM type) with 3 layers of 680 neurons each, trying to find patterns in audio and reproduce them as well as it can. It’s not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me even more impressed with what it managed to do.
The audio that the network was learning from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. I used 11025 Hz, 8-bit audio because sound files get big quickly, at least compared to text files – 10 minutes already runs to 6.29MB, while that much plain text would take weeks or months for a human to read.
I was using the program “torch-rnn“, which is actually designed to learn from and generate plain text. I wrote a program that converts any data into UTF-8 text and vice-versa, and to my excitement, torch-rnn happily processed that text as if there was nothing unusual. I did this because I don’t know where to begin coding my own neural network program, but this workaround has some annoying restraints. E.g. torch-rnn doesn’t like to output more than about 300KB of data, hence all generated sounds being only ~27 seconds long.
It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (output audio). These times are quite approximate as the same server was both training and sampling (from past network “checkpoints”) at the same time, which slowed it down. Huge thanks go to Melan for letting me use his server for this fun project! Let’s try a bigger network next time, if you can stand waiting an hour for 27 seconds of potentially-useless audio. xD
I feel that my target audience couldn’t possibly get any smaller than it is right now…
EDIT: Because I’ve been asked a lot, the settings I used for training were: rnn_size: 680, num_layers: 3, wordvec_size: 110. Also, here are some graphs showing losses during training (click to see full-size versions):
For sampling, I simply used torch-rnn’s default settings (which is a temperature of 1), specifying only the checkpoint and length and redirecting it to a file. For training an RNN on voice in this way, I think the most important aspect is how “clear” the audio is, i.e. how obvious patterns are against noise, plus the fact that it’s 8-bit so it only has to learn from 256 unique symbols. This relatively sharp-sounding voice is very close to a filtered sawtooth signal, compared to other voices which are more breathy/noisy (the difference is even visible to human eyes just by looking at the waveform), so I think it had an easier time learning this voice than it would some others. There’s also the simple fact that, because the voice is high-pitched, the lengths of the patterns that it needs to learn are shorter.
EDIT 2: I have been asked several times about my binary-to-UTF-8 program. The program basically substitutes any raw byte value for a valid UTF-8 encoding of a character. So after conversion, there’ll be a maximum of 256 unique UTF-8 characters. I threw the program together in VB6, so it will only run on Windows. However, I rewrote all the important code in a C++-like pseudocode here. Also, here is an English explanation of how my binary-to-UTF-8 program works.