Archive for the ‘Programs’ Category
By popular demand, I threw my own voice into a neural network (3 times) and got it to recreate what it had learned along the way!
This is 3 different recurrent neural networks (LSTM type) trying to find patterns in raw audio and reproduce them as well as they can. The networks are quite small considering the complexity of the data. I recorded 3 different vocal sessions as training data for the network, trying to get more impressive results out of the network each time. The audio is 8-bit with a low sample rate because sound files get very big very quickly, making the training of the network take a very long time. Well over 300 hours of training in total went into the experiments with my voice that led to this video.
The graphs are created from log files made during training, and show the progress that it was making leading up to immediately before the audio that you hear at every point in the video. Their scrolling speeds up at points where I only show a short sample of the sound, because I wanted to dedicate more time to the more impressive parts. I included a lot of information in the video itself where it’s relevant (and at the end), especially details about each of the 3 neural networks at the beginning of each of the 3 sections, so please be sure to check that if you’d like more details.
I’m less happy with the results this time around than in my last RNN+voice video, because I’ve experimented much less with my own voice than I have with higher-pitched voices from various games and haven’t found the ideal combination of settings yet. That’s because I don’t really want to hear the sound of my own voice, but so many people commented on my old video that they wanted to hear a neural network trained on a male English voice, so here we are now! Also, learning from a low-pitched voice is not as easy as with a high-pitched voice, for reasons explained in the first part of the video (basically, the most fundamental patterns are longer with a low-pitched voice).
The neural network software is the open-source “torch-rnn”, although that is only designed to learn from plain text. Frankly, I’m still amazed at what a good job it does of learning from raw audio, with many overlapping patterns over longer timeframes than text. I made a program (explained here) that substitutes raw bytes in any file (e.g. audio) for valid UTF-8 text characters and torch-rnn happily learned from it. My program also substituted torch-rnn’s generated text back into raw bytes to get audio again. I do not understand the mathematics and low-level algorithms that make a neural network work, and I cannot program my own, so please check the code and .md files at torch-rnn’s Github page for details. Also, torch-rnn is actually a more-efficient fork of an earlier program called char-rnn, whose project page also has a lot of useful information.
I will probably soon release the program that I wrote to create the line graphs from CSV files. It can make images up to 16383 pixels wide/tall with customisable colours, from CSV files with hundreds of thousands of lines, in a few seconds. All free software I could find failed hideously at this (e.g. OpenOffice Calc took over a minute to refresh the screen with only a fraction of that many lines, during which time it stopped responding; the lines overlapped in an ugly way that meant you couldn’t even see the average value; and “exporting” graphs is limited to pressing Print Screen, so you’re limited to the width of your screen… really?).
Here’s my MIDI software synth Bawami doing its best to even keep responding while trying to play TheSuperMarioBros2’s black MIDI “Arecibo”. The left view shows how it’s processing every MIDI message. Not shown: About 5 minutes of Bawami loading the 12MB MIDI file hideously inefficiently (tempo changes make it even worse).
This problem of my player stopping responding when maxed out is something I need to (re-)fix. I fixed this a long time ago (probably before releasing Bawami), but broke it again afterwards somehow, also a long time ago now… As always, the most recent version of Bawami can be downloaded here (also check the most recently tagged posts to see recent changes).
TheSuperMarioBros2 have made a lot of great black MIDIs that are often fun to stress-test MIDI players with. You can see lots playing at their channel (they also provide download links for the MIDI files). However, Bawami’s loading of MIDIs is inefficient, so I’d recommend not trying to torture it with black MIDIs too much. I also suggest unticking “Loop” so that, if it stops responding during playback, it’ll eventually start responding again at the end.
This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it’s learned. (WARNING: Although I decreased the volume and there’s visual indication of what sound is to come, please don’t have your volume too high.)
This is a recurrent neural network (LSTM type) with 3 layers of 680 neurons each, trying to find patterns in audio and reproduce them as well as it can. It’s not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me even more impressed with what it managed to do.
The audio that the network was learning from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. I used 11025 Hz, 8-bit audio because sound files get big quickly, at least compared to text files – 10 minutes already runs to 6.29MB, while that much plain text would take weeks or months for a human to read.
I was using the program “torch-rnn”, which is actually designed to learn from and generate plain text. I wrote a program that converts any data into UTF-8 text and vice-versa, and to my excitement, torch-rnn happily processed that text as if there was nothing unusual. I did this because I don’t know where to begin coding my own neural network program, but this workaround has some annoying constraints. E.g. torch-rnn doesn’t like to output more than about 300KB of data, hence all generated sounds being only ~27 seconds long.
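As a quick sanity check on the size and duration figures above, here is a minimal sketch of the arithmetic. The 11025 Hz sample rate and 8-bit depth come from the post; mono audio and "300KB" meaning 300,000 bytes are assumptions, and the function names are made up for illustration.

```python
SAMPLE_RATE_HZ = 11025   # stated in the post
BYTES_PER_SAMPLE = 1     # 8-bit audio; a single (mono) channel is an assumption


def audio_bytes(seconds, sample_rate=SAMPLE_RATE_HZ, bytes_per_sample=BYTES_PER_SAMPLE):
    """Size in bytes of raw PCM audio lasting the given number of seconds."""
    return int(seconds * sample_rate * bytes_per_sample)


def audio_seconds(num_bytes, sample_rate=SAMPLE_RATE_HZ, bytes_per_sample=BYTES_PER_SAMPLE):
    """Duration in seconds of raw PCM audio occupying the given number of bytes."""
    return num_bytes / (sample_rate * bytes_per_sample)


if __name__ == "__main__":
    ten_minutes = audio_bytes(10 * 60)
    print(ten_minutes, "bytes, i.e. about", round(ten_minutes / 2**20, 2), "MiB")
    # Assumes "300KB" means 300,000 bytes of generated output.
    print("about", round(audio_seconds(300_000), 1), "seconds")
```

Under these assumptions, ten minutes comes out at roughly 6.3 MiB and 300,000 bytes at roughly 27 seconds, which is in line with the numbers quoted.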
It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (output audio). These times are quite approximate as the same server was both training and sampling (from past network “checkpoints”) at the same time, which slowed it down. Huge thanks go to Melan for letting me use his server for this fun project! Let’s try a bigger network next time, if you can stand waiting an hour for 27 seconds of potentially-useless audio. xD
I feel that my target audience couldn’t possibly get any smaller than it is right now…
EDIT: Because I’ve been asked a lot, the settings I used for training were: rnn_size: 680, num_layers: 3, wordvec_size: 110. Also, here are some graphs showing losses during training (click to see full-size versions):
For sampling, I simply used torch-rnn’s default settings (which is a temperature of 1), specifying only the checkpoint and length and redirecting it to a file. For training an RNN on voice in this way, I think the most important aspect is how “clear” the audio is, i.e. how obvious patterns are against noise, plus the fact that it’s 8-bit so it only has to learn from 256 unique symbols. This relatively sharp-sounding voice is very close to a filtered sawtooth signal, compared to other voices which are more breathy/noisy (the difference is even visible to human eyes just by looking at the waveform), so I think it had an easier time learning this voice than it would some others. There’s also the simple fact that, because the voice is high-pitched, the lengths of the patterns that it needs to learn are shorter.
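To put a rough number on "the lengths of the patterns are shorter": one cycle of a voiced sound lasts sample rate divided by pitch, measured in samples. The sketch below uses the 11025 Hz rate from this post; the two pitch values are purely illustrative assumptions, not measurements of any voice mentioned here.

```python
SAMPLE_RATE_HZ = 11025  # stated in the post


def samples_per_cycle(pitch_hz, sample_rate=SAMPLE_RATE_HZ):
    """Number of audio samples (i.e. symbols the network sees) in one pitch cycle."""
    return sample_rate / pitch_hz


if __name__ == "__main__":
    # Hypothetical pitches chosen only to show the trend.
    for label, pitch_hz in [("lower voice (assumed 110 Hz)", 110.0),
                            ("higher voice (assumed 330 Hz)", 330.0)]:
        print(f"{label}: ~{samples_per_cycle(pitch_hz):.0f} samples per cycle")
```

With these example pitches, the lower voice needs about three times as many samples per cycle, so the network has to remember further back just to reproduce a single repetition of the waveform.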
EDIT 2: I have been asked several times about my binary-to-UTF-8 program. The program basically substitutes any raw byte value for a valid UTF-8 encoding of a character. So after conversion, there’ll be a maximum of 256 unique UTF-8 characters. I threw the program together in VB6, so it will only run on Windows. However, I rewrote all the important code in a C++-like pseudocode here. Also, here is an English explanation of how my binary-to-UTF-8 program works.
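For readers who just want the idea in runnable form, here is a minimal Python sketch of a byte-to-character substitution of this kind. It is not the VB6 program: the choice of mapping (byte value b becomes code point 0x100 + b) and the decision to skip unknown characters when converting back are assumptions made for illustration.

```python
# Hypothetical mapping: byte value b <-> code point BASE + b.
# 0x100..0x1FF are ordinary assigned letters (Latin Extended-A/B), so the text
# contains no control characters, and each one encodes to 2 bytes of UTF-8.
BASE = 0x100


def bytes_to_text(data: bytes) -> str:
    """Substitute every raw byte with one character (at most 256 distinct ones)."""
    return "".join(chr(BASE + b) for b in data)


def text_to_bytes(text: str) -> bytes:
    """Substitute characters back into raw bytes.

    Characters outside the mapped range are skipped; that is an assumption
    about how stray characters in generated text could be handled.
    """
    out = bytearray()
    for ch in text:
        value = ord(ch) - BASE
        if 0 <= value <= 255:
            out.append(value)
    return bytes(out)


if __name__ == "__main__":
    raw = bytes(range(256))
    utf8 = bytes_to_text(raw).encode("utf-8")          # what would be fed to the network
    restored = text_to_bytes(utf8.decode("utf-8"))     # what would become audio again
    print(len(raw), "raw bytes ->", len(utf8), "bytes of UTF-8; round trip OK:", restored == raw)
```

The round trip is lossless for all 256 byte values, which is the only property the network-facing side really needs.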
This is a small update which fixes OGG file rendering, and a couple of other superficial things.
- Fixed tabbing order of controls on “Mod shape” tab of the config window.
- Writing OGG files works again – I accidentally removed a file needed by the OGG encoder in revision 130. Users who extracted revision 130’s folder over revision 129’s would not have experienced the problem.
- Stopped warning about buggy controls from being displayed on the “MIDI params” tab of the config window, since it’s no longer true.
This update lets Bawami write WAV files in paths containing non-ASCII characters (something long-overdue), and fixes a bunch of complicated interwoven bugs related to the handling of “bad” MIDI files and of percussive notes. There’s also a load of other fixes, including for portamento-related bugs, crashing in /translator mode, and visual fixes, and a couple of new instruments.
You can grab the latest version from here (7.81 MB), and see details of the changes below the page break:
Showing off a couple more things not possible with the Epson driver: multi-strike printing and “quiet” (not really) mode, along with CMYK colour correction which is nearly invisible to my camcorder, so that part was a waste of video…
This is a program I’m casually working on every now and then to print images on any 24-pin ESC/P2 dot matrix printer (ESC/P2 is Epson’s control language for their dot matrix printers). It directly controls the printer by sending raw commands to it; you just need to set it up in Windows as a “Generic / Text Only” printer instead of using the official Epson driver, and Windows will pass the commands straight on to the printer without trying to translate them.
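To give a feel for what "raw commands" look like, here is a small sketch that builds (but does not send) the bytes for a printer reset and one strip of 24-pin graphics using the classic `ESC @` and `ESC *` commands. Whether the program described here uses `ESC *` or a different graphics command is not stated, so treat that choice as an assumption. The mode numbers follow Epson's ESC/P reference as far as I know (32 for 24-pin 60 DPI, 39 for 24-pin 180 DPI) and are worth checking against your printer's manual; the function names are made up.

```python
ESC = b"\x1b"


def init_printer() -> bytes:
    """ESC @ : reset the printer to its power-on defaults."""
    return ESC + b"@"


def bit_image_24pin(columns, mode=39):
    """Build ESC * m nL nH d1..dk for one strip of 24-pin graphics.

    columns: one 24-bit integer per dot column; the most significant bit is
             the top pin, so each column becomes 3 bytes, top byte first.
    mode:    density selector (assumed: 32 = 60 DPI, 39 = 180 DPI, both 24-pin).
    """
    count = len(columns)
    if not 0 < count <= 0xFFFF:
        raise ValueError("column count must fit in nL/nH (1..65535)")
    header = ESC + b"*" + bytes([mode, count & 0xFF, count >> 8])
    body = b"".join(column.to_bytes(3, "big") for column in columns)
    return header + body


if __name__ == "__main__":
    # Two columns: all 24 pins fired, then only the bottom pin.
    job = init_printer() + bit_image_24pin([0xFFFFFF, 0x000001], mode=32) + b"\r\n"
    print(job.hex(" "))
```

Getting those bytes to the printer (for example through the Windows print spooler) is left out on purpose, since it depends on the platform and setup.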
This is a standalone program for printing image files, not a driver for printing from any program. I’ve not yet released it, but I intend to some time. Compared to the driver, it currently allows:
- Printing in (lower) resolutions for high speed (down to 60 DPI).
- Detailed control over colour dithering/thresholding.
- Very tall print-outs not restricted to a paper length (e.g. for continuous paper).
- Printing only individual component colour(s) of an image.
- * Faster colour printing by doing large blocks of each colour at once.
- * Multi-strike printing (optionally offsetting each one to fill in the gaps between the earlier ones’ dots).
- * “Quiet” (multi-pass) printing (unfortunately, I can’t control the actual speed).
*The last three are somewhat “hacks”, abusing commands to try to force unofficial behaviour, and as such, they rarely work properly in combination with each other. In particular, the last two often don’t work when printing colour.
By the way, printing in blocks of colour is no longer done by relying on sending commands with the correct timing (as it did in the previous video), which means it’s now much more reliable and doesn’t get messed-up by pausing the printer, image content, etc.
This is a nice big bunch of fixes – mostly related to /console-mode, but a couple of more serious things (which affect more people) are fixed, too.
You can grab this latest version from here (7.87 MB), and see details of all the changes below.
General bug fixes
- Fixed bug where button to browse for a new MIDI file remained disabled after opening a MIDI by dropping it onto the main window.
- Now recognises when a filename is passed on the command line without speech marks (and without any command line parameters). This fixes the problem of Bawami not opening a path+file containing no spaces if such a file was dragged onto its icon (or if Bawami is associated with .MID files), because Windows does not automatically surround such a path+file with speech marks when passing it to the program.
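The unquoted-filename case above can be sketched roughly as follows. This is only an illustration of the idea, not Bawami's actual code: the function name is hypothetical, and treating anything that starts with a slash as a parameter is an assumption.

```python
import os


def filename_from_command_line(raw: str, file_exists=os.path.isfile):
    """Return the file path if the whole command line is just one filename.

    Handles a path wrapped in speech marks, and a bare path without them
    (which is what Windows passes when the path contains no spaces).
    Returns None if the command line looks like it contains parameters.
    """
    text = raw.strip()
    if not text:
        return None
    # Quoted: one pair of speech marks around the entire command line.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"' and '"' not in text[1:-1]:
        return text[1:-1]
    # Unquoted: accept it only if it is not a parameter and names a real file.
    if not text.startswith("/") and file_exists(text):
        return text
    return None


if __name__ == "__main__":
    print(filename_from_command_line('"C:\\my midis\\song.mid"'))
    print(filename_from_command_line("/console"))
```

Passing `file_exists` as an argument just makes the sketch easy to try without real files on disk.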
Fixes for when using /console
- No longer crashes if you’re redirecting Bawami’s output to a file or other program.
- When using “/infolevel 2”, log file text is now sent to StdOut instead of (accidentally) to StdErr.
- “Finished.” is now (always) displayed when shutting down.
- Key prompts for answers to questions are now sent to StdErr instead of StdOut (as was already being done for the question’s title and message), so that they’re visible on the console even if you’re redirecting Bawami’s StdOut to a file.
- Green “settings” button’s tooltip is now loaded in the chosen language (instead of having no tooltip) (broken in revision 128).
- Corrected Z-order of several controls on the config window so that the dotted border that indicates a control that has focus isn’t having one of its horizontal edges cut off unnecessarily.
- Corrected a reference to how green “settings” button appears in info.txt.
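The StdOut/StdErr split described in the /console fixes above is a common pattern: anything the user must see or answer goes to StdErr, so it still reaches the console when StdOut is redirected to a file. Here is a minimal sketch of that pattern; the function names and the key-reading callback are hypothetical and not taken from Bawami.

```python
import sys


def log(message, out=None):
    """Write ordinary log text to standard output (this is what gets redirected)."""
    print(message, file=sys.stdout if out is None else out)


def ask(question, keys="yn", err=None, read_key=input):
    """Show a question and its key prompt on standard error, then return the answer.

    Because nothing here touches standard output, the prompt stays visible
    on the console even when running as:  program > log.txt
    """
    err = sys.stderr if err is None else err
    print(question, file=err)
    while True:
        print(f"Press one of [{keys}]: ", end="", file=err, flush=True)
        answer = read_key().strip().lower()
        if len(answer) == 1 and answer in keys:
            return answer


if __name__ == "__main__":
    log("Decoding track 1")
    if ask("Output file already exists. Overwrite?") == "y":
        log("Overwriting.")
```

Running this with StdOut redirected to a file leaves only the question and prompt on screen, while the log lines end up in the file.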
Logging (also affects text shown when using /console /infolevel >=1)
- No longer shows “Closing MIDI input port” when shutting down, if MIDI-in wasn’t in use.
- No longer shows “Decoding absolute timings” a few lines after the last track has finished being decoded.
This version includes a whole load more bug fixes, plus some serious support for running Bawami from the console (command prompt), which also guarantees to never display message boxes, along with several other fun new command line parameters. It should also start slightly faster (less writing to log file, and some language files are now only loaded when they’re needed), seeking to a different playback position is easier on the ears, and there’s a bunch of safeguards added to the code related to writing OGG files. A few command line parameters have been renamed to make them less long-winded, they all begin with a slash instead of a hyphen now, and Bawami now lets you know exactly which ones it didn’t understand (if any). But despite a lot of work being focussed on command line parameters, there are several GUI-related bug fixes, too!
NOTE: In order for Bawami to be able to output to the Windows console (command prompt) when using /console option, the EXE file is now compiled as a console-mode program. Annoyingly, this causes a console window to appear for a brief moment before Bawami’s main window appears. However, this doesn’t slow anything down; it’s only for part of the amount of time where, previously, nothing at all was displayed.
You can download the new version here (7.87 MB). Full details on all of the changes and bug fixes are below, but allow me to first introduce a new feature of Bawami:
If you run Bawami from the command line, this option should be very useful, and is highly recommended instead of /invisible. Text is output to “standard output” (meaning you can see what Bawami’s doing in the console), and you can respond to any messages by pressing keys on the console, too. In this mode, Bawami also starts faster and is safe to crash in the middle of playback (nothing is leaked – but for now, make sure that you also kill the OGG encoder if you crash it in the middle of writing an OGG file). Just for fun, you can also view every single raw MIDI message scrolling up in the console as it plays (/stdmidi), and use /infolevel 1 or 2 to get a more in-depth look at exactly what’s going on internally. Of course, these options which spam text to the console will slow Bawami down a bit. Please check the “COMMAND LINE PARAMETERS” section in info.txt for full details, and see below for an overview of all changes/additions.
View all changes below the page break:
When I first published revision 127 of Bawami about 25 minutes ago, the download included a 64-bit version of “vcut.exe” (official Vorbis splitter, used by this version of Bawami) which would not work on 32-bit Windows. A universally-compatible, 32-bit version is now included instead.
To avoid problems (OGG file export not working), I highly advise that you re-download it if you downloaded it within the past 25 minutes. I am sorry for the inconvenience.
This is essentially a whole bunch of bug fixes, including one that I really should’ve released sooner (no longer freezes at the end of exporting a WAV/OGG file under certain conditions). There aren’t really new features, which should mean that, for once, the total number of bugs has actually decreased! ^^;
Most fixes are regarding WAV/OGG file-writing, click artifacts due to release times, and a couple of visual glitches. Full details are below the page-break.
EDIT: When I first published this post, the download included a 64-bit version of “vcut.exe” (official Vorbis splitter, used by this version of Bawami) which would not work on 32-bit Windows. A universally-compatible, 32-bit version is now included instead. Sorry for the inconvenience.