
Adventures in Augmenting Creativity

When humans help computers help humans make art.

Robin Sloan has been generating extremely interesting audio using machine learning models (see: Making the music of the Mazg and Expressive temperature).1

Inspired, I took these techniques for a spin.


First question: what audio to use to train the machine learning model?

Oneohtrix Point Never’s R Plus Seven immediately came to mind. From Pitchfork’s review:

On R Plus Seven, Oneohtrix Point Never’s follow-up to 2011’s Replica, Daniel Lopatin builds new music using the bright yet cold textures of the early computing age. The album plays with our collective unconscious of music technology to develop something that comes off as strange and otherworldly and, most importantly, rich with feeling, despite the icy surface layer.

Perfect.


A quick detour: Iliana and I visited the beautiful Cape Breton Island in 2015, and found ourselves drawn to R Plus Seven, listening to it over and over and over again.

The album’s ethereal quality was a match for the scenery:

Haunting, yet beautiful, and not unlike the audio I’m hoping to produce.


After spinning up a server2 and configuring the necessary dependencies, I fed the preprocessed audio (all of R Plus Seven) into SampleRNN_torch, and started the training process.
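
For the curious, here's a rough sketch of what that preprocessing looks like. SampleRNN-style models train on short, fixed-length chunks of raw audio, so the album gets sliced into a folder of uniform WAV files. (This is an illustrative Python sketch, not the actual SampleRNN_torch preprocessing script; the sample rate, chunk length, and file names are stand-in values.)

```python
# Illustrative sketch: slice an album's tracks into short, fixed-length WAV
# chunks of raw audio, the kind of input SampleRNN-style models train on.
# The 16 kHz sample rate and 8-second chunk length are assumptions, not the
# exact settings used here.
from pathlib import Path

import librosa
import soundfile as sf

SAMPLE_RATE = 16_000   # mono, modest sample rate is typical for these models
CHUNK_SECONDS = 8      # each training example is one fixed-length chunk
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def preprocess(album_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_idx = 0
    for track in sorted(Path(album_dir).glob("*.flac")):
        audio, _ = librosa.load(track, sr=SAMPLE_RATE, mono=True)
        # Drop the trailing partial chunk so every example has the same length.
        for start in range(0, len(audio) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
            chunk = audio[start:start + CHUNK_SAMPLES]
            sf.write(str(out / f"chunk_{chunk_idx:05d}.wav"), chunk, SAMPLE_RATE)
            chunk_idx += 1

preprocess("r_plus_seven/", "dataset/chunks/")
```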

I let the model train for approximately 72 hours. (After 48 hours of training, the sounds were starting to take shape; the model was clearly picking up on the structure of the source material. By the 72nd hour, it seemed to have learned all it was going to learn.)

It took another day of processing to generate about two-and-a-half hours of audio.3 (It’s a good idea to generate an abundance of samples, and later discard the dreary and the banal. The quality of samples will vary widely.)

Let’s listen to a few examples. Note that with the exception of the conversion to MP3, these samples are not processed or otherwise edited; they’re pulled directly from SampleRNN.

It’s striking how clear the lineage is to R Plus Seven. It sounds related, but it’s also something very different.

Note: the overall audio has a rougher, grungier texture than the source material; it’s as if the speaker that’s playing back the sound has a low-grade headache. Expect the audio produced by systems like these to improve in the future.

There are moments sprinkled throughout that you can trace straight back to specific tracks from the album. It sounds almost as if they've been sampled.

And it’s even more pronounced in this example.


Now, let’s borrow from Robin’s technique of varying the temperature:

In machine learning systems designed to generate text or audio, there’s often a final step between the system’s model of the training data and some concrete output that a human can evaluate or enjoy. Technically, this is the sampling of a multinomial distribution, but call it rolling a giant dice, many-many-sided, and also weighted. Weird dice.

To that final dice roll, there’s a factor applied that is called, by convention, the “sampling temperature.” Its default value is 1.0, which means we roll the weighted dice just as the model has provided it. But we can choose other values, too. We can tamper with the dice.

I’ve tampered with the dice, generating samples with a starting temperature of 0.875, increasing at a constant rate until reaching 1.15 (by the end of the sample).
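
In code, the tampering looks roughly like this: divide the model's output logits by the temperature before sampling, and ramp the temperature linearly from 0.875 to 1.15 over the length of the clip. (A sketch, not the actual SampleRNN_torch generation code; the model interface here is hypothetical.)

```python
# Sketch of temperature-scaled sampling: logits are divided by a temperature
# before the softmax, and the temperature rises at a constant rate from 0.875
# to 1.15 over the generated clip. The model itself is a stand-in; only the
# "tampering with the dice" step is shown.
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Roll the weighted, many-sided die, reshaped by the temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model, n_steps: int, t_start: float = 0.875, t_end: float = 1.15):
    samples = []
    for step in range(n_steps):
        # Temperature increases linearly from t_start to t_end.
        temperature = t_start + (t_end - t_start) * step / max(n_steps - 1, 1)
        logits = model.next_logits(samples)   # hypothetical model interface
        samples.append(sample_with_temperature(logits, temperature))
    return samples
```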

Let’s listen:

This begins slow, cautious. Boring. Eventually, it starts going somewhere, but not really, because it's mostly moving in circles, until the end—temperature cranked all the way up!—when it's unhinged, free of constraints, soaring.

I love that break in the middle of the sample; surely, it’s deliberate preparation for the chaotic outburst that will be the final segment.

What an ending.4

Expressive temperature, indeed.


By remixing and combining the generated sounds, it’s possible to create entirely new works.

Here’s an original composition; let’s call it “RNN+7”:

This particular piece was created by splicing together audio samples generated by SampleRNN, with no additional sounds added. (Some noise reduction was applied to the mix.)
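
If you wanted to script that splicing rather than do it by hand, it might look something like this: load the keeper clips, join them end to end with short crossfades, and export the mix. (The file names and crossfade length are assumptions, and the noise reduction isn't shown; it was a separate step.)

```python
# Toy sketch of splicing generated clips end to end with short crossfades,
# using pydub. Clip file names, ordering, and the 250 ms crossfade are
# assumptions for illustration; noise reduction would happen elsewhere.
from pathlib import Path

from pydub import AudioSegment

def splice(clips_dir: str, out_path: str, crossfade_ms: int = 250) -> None:
    clips = sorted(Path(clips_dir).glob("*.wav"))
    mix = AudioSegment.from_wav(str(clips[0]))
    for clip in clips[1:]:
        mix = mix.append(AudioSegment.from_wav(str(clip)), crossfade=crossfade_ms)
    mix.export(out_path, format="mp3")

splice("keepers/", "rnn_plus_seven.mp3")
```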

In the hands of a musician (i.e., not me), this quirky, idiosyncratic, probabilistic sampler could make for one hell of an instrument.


Where all of this really shines is as input to the human creative process.

To create RNN+7, I used the samples produced by the model as input—the only input—to the final piece.

But those samples could instead have been used strictly as creative input. As inspiration.

Today’s machine learning models can produce material for the artist to dissect, probe, and riff on. Tomorrow’s will create material to help the artist explore the space of possible solutions, without the burden of preconceived notions or a nagging inner critic. They will be capable of taking the artist down paths that they may never have explored on their own.5

And that’s where this gets really exciting.

Regardless of the medium you work in—words written or spoken, images still or moving, sounds melodic or atonal, et cetera, et cetera—one thing is for sure:

The inspiration machines are coming.


  1. And likewise, for text. (See: Writing with the machine, and Voyages in sentence space.) 

  2. I used Salamander to provision an instance with an NVIDIA K80 GPU (and 4x vCPU, with 61 GB RAM), at a cost of approximately $0.37 (USD) per hour. 

  3. 215 samples, each 45 seconds long. 

  4. And, dare I say, “strange and otherworldly”? 

  5. Machine as muse: One day, it will be standard practice for artists to find inspiration with the help of machine learning models. (Expect incredible care to go into the selection of the training material.)