Archive for the ‘Videos’ Category
By popular demand, I threw my own voice into a neural network (3 times) and got it to recreate what it had learned along the way!
This is 3 different recurrent neural networks (LSTM type) trying to find patterns in raw audio and reproduce them as well as they can. The networks are quite small considering the complexity of the data. I recorded 3 different vocal sessions as training data for the network, trying to get more impressive results out of the network each time. The audio is 8-bit at a low sample rate because sound files get very big very quickly, making the training of the network take a very long time. Well over 300 hours of training in total went into the experiments with my voice that led to this video.
The graphs are created from log files made during training, and show the progress that it was making leading up to immediately before the audio that you hear at every point in the video. Their scrolling speeds up at points where I only show a short sample of the sound, because I wanted to dedicate more time to the more impressive parts. I included a lot of information in the video itself where it’s relevant (and at the end), especially details about each of the 3 neural networks at the beginning of each of the 3 sections, so please be sure to check that if you’d like more details.
I’m less happy with the results this time around than in my last RNN+voice video, because I’ve experimented much less with my own voice than I have with higher-pitched voices from various games and haven’t found the ideal combination of settings yet. That’s because I don’t really want to hear the sound of my own voice, but so many people commented on my old video that they wanted to hear a neural network trained on a male English voice, so here we are now! Also, learning from a low-pitched voice is not as easy as with a high-pitched voice, for reasons explained in the first part of the video (basically, the most fundamental patterns are longer with a low-pitched voice).
The neural network software is the open-source “torch-rnn”, although that is only designed to learn from plain text. Frankly, I’m still amazed at what a good job it does of learning from raw audio, with many overlapping patterns over longer timeframes than text. I made a program (explained here) that replaces each raw byte in any file (e.g. audio) with a valid UTF-8 text character, and torch-rnn happily learned from the result. My program also converted torch-rnn’s generated text back into raw bytes to get audio again. I do not understand the mathematics and low-level algorithms that make a neural network work, and I cannot program my own, so please check the code and .md files at torch-rnn’s GitHub page for details. Also, torch-rnn is actually a more-efficient fork of an earlier program called char-rnn, whose project page also has a lot of useful information.
I will probably soon release the program that I wrote to create the line graphs from CSV files. It can make images up to 16383 pixels wide/tall with customisable colours, from CSV files with hundreds of thousands of lines, in a few seconds. All free software I could find failed hideously at this (e.g. OpenOffice Calc took over a minute to refresh the screen with only a fraction of that many lines, during which time it stopped responding; the lines overlapped in an ugly way that meant you couldn’t even see the average value; and “exporting” graphs is limited to pressing Print Screen, so you’re limited to the width of your screen… really?).
This isn’t going to become a thing on this channel – I was just hungry and wanted to record it… I think I should stick to computer stuff. If I hadn’t put any effort into editing this, I would’ve put this on my other channel.
I started playing around using the melody and chord progression that a huge number of people created together at CrowdSound, and ended up making this little arrangement for Bawami, my MIDI synth. It took a few hours over 3 nights.
CrowdSound is a site where people were given a chord progression and song structure, and were then allowed to vote note-by-note to make a melody. It’s an experiment to see if lots of people can work together to gradually make an entire song by voting on many tiny additions. Since people are making remixes already, I decided I’d try, too.
As of the 15th of August 2016, only the melody is complete, so I imported the MIDI of the melody (from here) into Sekaiju (the MIDI editor I use). From there, based on the chord progression, I made tracks for bass, percussion, overdriven and acoustic guitar parts, 2-part pad and a portamento synth sequence to liven things up a bit. Then I decided on how I’d switch between the various backing parts so they weren’t all fighting for the spotlight at the same time. After that, I changed the velocities of all the melody notes (since I’m using a velocity-sensitive lead instrument on Bawami), to make it sound less annoying and repetitive and to complement the beat. I also shortened some long notes (which is within CrowdSound’s rules for arranging) to let the lead stop for breath every now and then, added modulation (vibrato) sparingly, and decided to sometimes pitch-bend from one note to another during the conclusion instead of instantly jumping (I think this should be allowed, because a real human voice would have to do this all the time =P).
In keeping with the openness of CrowdSound, you can download my MIDI (designed to be played on Bawami rev.132 or later) here. It uses several GS “variation” instruments, so it will sound worse on GM synths. It also uses an instrument (12-string Guitar) which is not present in Bawami rev.131, the currently-released version, but it should still sound fine on that version (it’ll fall back to the “Acoustic Guitar (Steel)” instrument). That, along with many other changes, will be in the next version I release!
This MIDI is playing on BaWaMI, which is a freeware, retro-sounding MIDI synth that uses subtractive synthesis. I’ve been working on it every now and then since 2010. You can find out more (and grab the latest version) here (click its title to get to the download page).
The 3D scrolling view of notes is MIDITrail.
Here’s my MIDI software synth Bawami doing its best to even keep responding while trying to play TheSuperMarioBros2’s black MIDI “Arecibo”. The left view shows how it’s processing every MIDI message. Not shown: About 5 minutes of Bawami loading the 12MB MIDI file hideously inefficiently (tempo changes make it even worse).
This problem of my player stopping responding when maxed out is something I need to (re-)fix. I fixed this a long time ago (probably before releasing Bawami), but broke it again afterwards somehow, also a long time ago now… As always, the most recent version of Bawami can be downloaded here (also check the most recently tagged posts to see recent changes).
TheSuperMarioBros2 have made a lot of great black MIDIs that are often fun to stress-test MIDI players with. You can see lots playing at their channel (they also provide download links for the MIDI files). However, Bawami’s loading of MIDIs is inefficient, so I’d recommend not trying to torture it with black MIDIs too much. I also suggest unticking “Loop” so that, if it stops responding during playback, it’ll eventually start responding again at the end.
These are a few of the bits that I cut from the main video because it was too long, including running it at full speed and a comparison with a super-simple system! There’s a strobe light in this video, too.
You might want to watch part 1 if you haven’t already, so that this makes more sense.
I threw this together from an old toy’s motor, an old printer’s IR sensor, a pizza box and some other things, to try out the PID controller algorithm after discovering it on Wikipedia and seeing that there was pseudocode, meaning that I didn’t have to get a PhD in mathematics to be able to read the crazy-looking formulas that Wikipedia seems to be so fond of. There’s a strobe light in this video.
I had planned to screen-capture my program while recording but completely forgot to at the time, so please try to survive my camcorder pointing at my laptop screen…
Here, the PID controller is trying to keep the motor at a precise speed (and get it there as quickly as possible). It doesn’t work well half the time because the L298 (H-bridge), responsible for switching power to the motor, doesn’t seem to like making the motor brake. That means it speeds up much more quickly than it slows down, which the algorithm doesn’t like (it’s designed for linear systems) – it basically ends up trying too hard to slow down, resulting in a big undershoot. I might be able to somewhat compensate for that in code.
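For readers who haven't met the algorithm, here is a minimal Python sketch of a textbook discrete PID loop driving a toy motor model. It is not the code from the video: the gains, the 200 Hz top speed, the 0.5 s time constant and the `can_brake` switch are all made-up values for illustration. Clamping the drive signal at 0 instead of -1 is only a rough stand-in for a driver that can't brake the motor.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller (gains are illustrative, not from the video)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt                   # I term accumulates error over time
        derivative = (error - self.prev_error) / dt   # D term reacts to the error's rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(setpoint_hz=100.0, steps=2000, dt=0.01, can_brake=True):
    """Run the loop against a hypothetical first-order motor; return the speed history."""
    pid = PID(kp=0.05, ki=0.5, kd=0.0005)
    speed = 0.0
    history = []
    lowest_drive = -1.0 if can_brake else 0.0  # 0.0 means power on/off only, no braking
    for _ in range(steps):
        drive = pid.update(setpoint_hz, speed, dt)
        drive = max(lowest_drive, min(1.0, drive))  # the driver can only deliver so much
        # Toy motor: speed drifts towards drive * 200 Hz with a 0.5 s time constant.
        speed += (drive * 200.0 - speed) * dt / 0.5
        history.append(speed)
    return history


if __name__ == "__main__":
    for can_brake in (True, False):
        speeds = simulate(can_brake=can_brake)
        print(f"can_brake={can_brake}: peak {max(speeds):.1f} Hz, final {speeds[-1]:.1f} Hz")
```

In this toy model both variants settle on the target eventually. The point is only to show where the asymmetry between speeding up and slowing down enters the loop, namely the lower clamp on `drive`.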
I might try this with a Sabertooth motor speed controller (as used in my old singing motors project) in place of the L298, which can certainly force a motor to stop spinning, but the Sabertooth gives such a boost to the motor to get it up to speed that 90% of the PID’s job becomes redundant… Oh well, at least it’d be able to hit any given note without me having to calibrate it first like I did with the singing motors. By the way, that’s why this system measures speed in Hz – I originally intended for it to play music like a new kind of “singing motor”.
Originally, I planned to use a 3-pin computer fan instead of this motor, using the tachometer pin to measure the speed, but that required me to have a common ground for the motor and the tachometer, and I didn’t have the right components available (I only had N-channel MOSFETs, but I needed a P-channel MOSFET). So I ended up throwing my own motor assembly together and using an N-channel MOSFET only (could only turn power on/off, not brake), which the PID system didn’t like. I thought the L298 would fix that problem, since it’d allow the PID system to reverse power to the motor and brake it, but it turns out it’s too weak to have much of an effect after all… =/
Part 2/2 will show it running at full speed (with a more powerful PSU), show a much more naïve speed controller algorithm for the lulz, and just clear up a couple of details.
This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it’s learned. (WARNING: Although I decreased the volume and there’s visual indication of what sound is to come, please don’t have your volume too high.)
This is a recurrent neural network (LSTM type) with 3 layers of 680 neurons each, trying to find patterns in audio and reproduce them as well as it can. It’s not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me even more impressed with what it managed to do.
The audio that the network was learning from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. I used 11025 Hz, 8-bit audio because sound files get big quickly, at least compared to text files – 10 minutes already runs to 6.29MB, while that much plain text would take weeks or months for a human to read.
I was using the program “torch-rnn”, which is actually designed to learn from and generate plain text. I wrote a program that converts any data into UTF-8 text and vice-versa, and to my excitement, torch-rnn happily processed that text as if there was nothing unusual. I did this because I don’t know where to begin coding my own neural network program, but this workaround has some annoying constraints. E.g. torch-rnn doesn’t like to output more than about 300KB of data, hence all generated sounds being only ~27 seconds long.
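The size and duration figures above follow from simple arithmetic, sketched here in Python. It assumes mono audio with one byte per sample (the post says 8-bit but doesn't mention channels) and takes "300KB" as 300,000 bytes. The function names are just for illustration.

```python
SAMPLE_RATE_HZ = 11025
BYTES_PER_SAMPLE = 1  # 8-bit audio; mono is an assumption


def audio_bytes(seconds: float) -> int:
    """Size in bytes of raw audio lasting the given number of seconds."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def audio_seconds(n_bytes: int) -> float:
    """Playing time in seconds of the given number of raw audio bytes."""
    return n_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


if __name__ == "__main__":
    ten_minutes = audio_bytes(10 * 60)
    print(ten_minutes, "bytes =", round(ten_minutes / 2**20, 2), "MiB")
    print(round(audio_seconds(300_000), 1), "seconds from 300,000 bytes")
```

Ten minutes comes out at 6,615,000 bytes (roughly 6.3 MiB), and 300,000 bytes lasts a little over 27 seconds, both in line with the numbers quoted.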
It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (output audio). These times are quite approximate as the same server was both training and sampling (from past network “checkpoints”) at the same time, which slowed it down. Huge thanks go to Melan for letting me use his server for this fun project! Let’s try a bigger network next time, if you can stand waiting an hour for 27 seconds of potentially-useless audio. xD
I feel that my target audience couldn’t possibly get any smaller than it is right now…
EDIT: Because I’ve been asked a lot, the settings I used for training were: rnn_size: 680, num_layers: 3, wordvec_size: 110. Also, here are some graphs showing losses during training (click to see full-size versions):
For sampling, I simply used torch-rnn’s default settings (which is a temperature of 1), specifying only the checkpoint and length and redirecting it to a file. For training an RNN on voice in this way, I think the most important aspect is how “clear” the audio is, i.e. how obvious patterns are against noise, plus the fact that it’s 8-bit so it only has to learn from 256 unique symbols. This relatively sharp-sounding voice is very close to a filtered sawtooth signal, compared to other voices which are more breathy/noisy (the difference is even visible to human eyes just by looking at the waveform), so I think it had an easier time learning this voice than it would some others. There’s also the simple fact that, because the voice is high-pitched, the lengths of the patterns that it needs to learn are shorter.
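As a rough illustration of what "temperature" means when sampling, here is a generic Python sketch of temperature-scaled softmax sampling over a set of symbol scores. It shows the general technique only and is not torch-rnn's actual code. The function names and example scores are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_symbol(logits, temperature=1.0, rng=random):
    """Pick one symbol index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1]  # made-up scores for three symbols
    for t in (1.0, 0.5):
        print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
    print("sampled index:", sample_symbol(scores))
```

A temperature of 1 leaves the probabilities as the network produced them. Values below 1 make the likeliest symbol win more often, and values above 1 make the output more random.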
EDIT 2: I have been asked several times about my binary-to-UTF-8 program. The program basically replaces each raw byte value with a valid UTF-8 encoding of a character. So after conversion, there’ll be a maximum of 256 unique UTF-8 characters. I threw the program together in VB6, so it will only run on Windows. However, I rewrote all the important code in a C++-like pseudocode here. Also, here is an English explanation of how my binary-to-UTF-8 program works.
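For anyone who just wants the general idea, here is a minimal Python sketch of one possible byte-to-character mapping. It is not the VB6 program or its pseudocode: the choice of mapping byte value b to code point U+0100 + b is an assumption made here so that every byte becomes a distinct, printable, valid character. The function names are hypothetical.

```python
OFFSET = 0x100  # assumed mapping: byte b -> code point U+0100 + b (256 distinct characters)


def bytes_to_text(data: bytes) -> str:
    """Map every raw byte to one character, so at most 256 unique characters appear."""
    return "".join(chr(OFFSET + b) for b in data)


def text_to_bytes(text: str) -> bytes:
    """Reverse the mapping, skipping any character outside the 256-character alphabet."""
    out = bytearray()
    for ch in text:
        value = ord(ch) - OFFSET
        if 0 <= value <= 0xFF:
            out.append(value)
    return bytes(out)


if __name__ == "__main__":
    raw = bytes(range(256))
    text = bytes_to_text(raw)
    # Written to disk as UTF-8, this text is valid input for a text-only tool.
    encoded = text.encode("utf-8")
    print(len(set(text)), "unique characters,", len(encoded), "bytes as UTF-8")
    print("round trip ok:", text_to_bytes(encoded.decode("utf-8")) == raw)
```

With this particular mapping each character takes 2 bytes in UTF-8, so the text file is twice the size of the original audio. The decoder skips unexpected characters (such as a stray newline in generated text) instead of failing.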
I was playing around with Windows 98 drivers and found a combination of settings that printed the slowest, loudest, darkest black line I’ve ever seen this dot matrix printer print. And then a second one on top of the first one, just in case it wasn’t dark enough already.
I’d guess that that’s about a week’s worth of wear in 30 seconds.
And now, I’m enjoying random fainter black parts in my prints because that part of the ribbon’s worn much more than the rest, lol.
(Printer is an Epson LQ-300+II)
The second and much-shorter part, as I clear out some random rubbish in my room. There are a few more old electronic devices, including a ~25-year-old LCD game, plus some paper stuff…
This time, I didn’t throw away or dismantle everything in the video! I did thoroughly rearrange whatever remained afterwards, though.