The story of music and AI is a young one, and a rapidly progressing one. But let’s start at the very beginning.
One of the earliest uses of AI with music was genre classification. The first approach involved a simple feature extractor that converts a song or a portion of sound into an n-dimensional vector, where each number represents some measurable quality of the audio. For instance, one dimension might represent the centroid of the frequency domain, while others might represent various MFCCs (Mel Frequency Cepstral Coefficients).
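To make that concrete, here is a minimal ChucK sketch (the same unit analyzer pattern used in the full program at the end of this post) that turns microphone input into a 3-dimensional feature vector of centroid, flux, and RMS and prints it:

    // microphone into an FFT for spectral analysis
    adc => FFT fft;
    // collects the features below into one vector
    FeatureCollector combo => blackhole;
    fft =^ Centroid centroid =^ combo;
    fft =^ Flux flux =^ combo;
    fft =^ RMS rms =^ combo;
    // analysis window setup
    1024 => fft.size;
    Windowing.hann(fft.size()) => fft.window;

    while( true )
    {
        // trigger analysis on everything upstream of combo
        combo.upchuck();
        // print the resulting feature vector
        for( int d; d < combo.fvals().size(); d++ )
        {
            chout <= combo.fval(d) <= " ";
        }
        chout <= IO.newline();
        // advance one window
        fft.size()::samp => now;
    }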
My first experiment involved training simple KNN (k-nearest neighbors) models on 1000 audio samples spanning 10 genres, using feature vectors of increasing complexity, to see how well each model performed at genre classification (a sketch of the classification step follows the results below).
Model 0: I started with a 1-dimensional feature vector containing just the centroid. This model had an accuracy of about 17%.
Model 1: Then I added flux and RMS for a total of 3 dimensions, and it performed much better, with an accuracy of about 30%.
Model 2: Model 2 used 8 dimensions, 5 of which were MFCCs. This model had an accuracy of about 34%.
Model 3: Model 3 used 20 MFCCs for a total of 23 dimensions. This model had an accuracy of about 43%.
Model 4: Model 4 was the same as Model 3, with 2 more dimensions added for the 25% and 75% spectral roll-offs. The accuracy was also around 43%.
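Under the hood, each of these experiments came down to the same few calls to ChucK's KNN2 object (the same object the synthesis code at the end of this post uses). Here is a hedged sketch with placeholder feature values and made-up genre labels; the real models were trained on vectors extracted from the 1000 samples:

    // toy training set: one feature vector per audio sample (placeholder values)
    [ [0.12, 0.03, 0.40],
      [0.75, 0.20, 0.10],
      [0.70, 0.22, 0.15] ] @=> float trainFeatures[][];
    // unique index per training point
    [ 0, 1, 2 ] @=> int uids[];
    // genre label for each point (e.g., 0 = rock, 1 = jazz)
    [ 0, 1, 1 ] @=> int genreOf[];

    // train the model on (feature vector, uid) pairs
    KNN2 knn;
    knn.train( trainFeatures, uids );

    // classify a new feature vector by majority vote among its K nearest neighbors
    3 => int K;
    int nearest[K];
    [ 0.68, 0.21, 0.12 ] @=> float query[];
    knn.search( query, K, nearest );

    // tally the genre votes
    int votes[10];
    for( int i; i < K; i++ ) votes[ genreOf[nearest[i]] ]++;
    // report the winning genre
    0 => int best;
    for( int g; g < votes.size(); g++ )
    {
        if( votes[g] > votes[best] ) g => best;
    }
    <<< "predicted genre label:", best >>>;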
The second experiment was to find a way to use this classification system to generate new music. Specifically, I wanted to convert my voice into a drum and bass (dnb) track.
So far, my project kinda does that. Essentially, the program takes input from the microphone, finds the dnb samples (1 drum and 1 bass) whose feature vectors are most similar to the input, and then plays them. Boiled down, the idea looks like the sketch below.
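A minimal self-contained sketch of the loop, with placeholder feature values and hypothetical sample filenames (the real program, with all its bookkeeping, is at the end of this post):

    // minimal end-to-end sketch: mic features -> KNN search -> play the nearest sample
    adc => FFT fft;
    FeatureCollector combo => blackhole;
    fft =^ Centroid centroid =^ combo;
    fft =^ Flux flux =^ combo;
    fft =^ RMS rms =^ combo;
    1024 => fft.size;
    Windowing.hann(fft.size()) => fft.window;

    // tiny stand-in corpus: one feature vector per sample file (placeholder values)
    [ [0.20, 0.01, 0.05],
      [0.60, 0.10, 0.20] ] @=> float corpus[][];
    // hypothetical pre-chopped dnb sample files
    [ "drum-a.wav", "drum-b.wav" ] @=> string corpusFiles[];
    int uids[corpus.size()];
    for( int i; i < uids.size(); i++ ) i => uids[i];

    // train the retrieval model
    KNN2 knn;
    knn.train( corpus, uids );

    SndBuf buf => dac;
    float query[3];
    int nearest[1];

    while( true )
    {
        // extract the current feature vector from the mic
        combo.upchuck();
        for( int d; d < 3; d++ ) combo.fval(d) => query[d];
        // retrieve the most similar sample...
        knn.search( query, 1, nearest );
        // ...and play it
        corpusFiles[nearest[0]] => buf.read;
        0 => buf.pos;
        // wait one analysis window, then repeat
        fft.size()::samp => now;
    }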
Here it is kinda working. Meet dnb-synthesis-mic.ck:
I was impressed by the program's robustness and how well it mimicked the sound of my voice. Next, I had to think about improvements I could make to what I had already created.
One missing feature was the ability to tweak hyperparameters while the program was running. I wanted to be able to turn certain bass and drum tracks on and off so I could better create shifts in dynamics throughout a performance. For this, I added the ability to increase or decrease the number of bass and drum tracks played simultaneously. There could be, for example, 1 bass track running while 4 drum tracks are running.
My original program was also limited by a fixed synth window that was relatively long. So, in part two, I added the ability to change the length of the synth window for the bass and drum tracks individually. Now you can synthesize the bass and drums at different rates while they always stay in sync (a sketch of the control handling follows the list below).
Controls (on keyboard):
number of bass tracks: w increases, and s decreases
number of drum tracks: e increases, and d decreases
bass synth window size: r doubles it, and f halves it
drum synth window size: t doubles it, and g halves it
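Here is a hedged sketch of how that keyboard handling can be wired up with ChucK's Hid interface (the variable names are illustrative, not the exact ones in the final code; the synthesis loop would read these values when deciding how many voices to trigger and how long each window lasts):

    // illustrative shared parameters the synthesis loop would consult
    1 => int numBassTracks;
    1 => int numDrumTracks;
    4096 => int bassWindowSamps;
    4096 => int drumWindowSamps;

    // open the default keyboard as a HID device
    Hid kb;
    HidMsg msg;
    if( !kb.openKeyboard( 0 ) ) me.exit();

    while( true )
    {
        // wait for a keyboard event
        kb => now;
        while( kb.recv( msg ) )
        {
            // only respond to key presses
            if( !msg.isButtonDown() ) continue;
            if( msg.ascii == 87 ) numBassTracks++;            // w: more bass tracks
            else if( msg.ascii == 83 && numBassTracks > 0 )   // s: fewer bass tracks
                numBassTracks--;
            else if( msg.ascii == 69 ) numDrumTracks++;       // e: more drum tracks
            else if( msg.ascii == 68 && numDrumTracks > 0 )   // d: fewer drum tracks
                numDrumTracks--;
            else if( msg.ascii == 82 ) 2 *=> bassWindowSamps; // r: double bass window
            else if( msg.ascii == 70 ) 2 /=> bassWindowSamps; // f: halve bass window
            else if( msg.ascii == 84 ) 2 *=> drumWindowSamps; // t: double drum window
            else if( msg.ascii == 71 ) 2 /=> drumWindowSamps; // g: halve drum window
        }
    }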
The video below demonstrates the sonic results of the final version. The footage in the background is mostly irrelevant; it just shows my friend messing around with the program. The text flying across the screen shows how often each drum and bass window is synthesized.
Many things could be improved about this program; however, I do not have infinite time. If I did, I would add envelopes to the various sounds to remove the pops during playback, and I would also add a low-pass filter controlled by your voice (I think that would sound pretty cool, since it would sound like the filter was coming out of your mouth).
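For what it's worth, both ideas are cheap to sketch in ChucK. Something like this (a sketch under my own assumptions, not part of the final program; "special:dope" is just a built-in placeholder sample) would wrap each window in a short envelope and steer a low-pass filter with the voice's spectral centroid:

    // analysis side: track the spectral centroid of the voice
    adc => FFT fft =^ Centroid cent => blackhole;
    1024 => fft.size;
    Windowing.hann(fft.size()) => fft.window;

    // synthesis side: buffer -> envelope (kills the pops) -> low-pass filter
    SndBuf buf => ADSR env => LPF lpf => dac;
    "special:dope" => buf.read;             // placeholder sample
    env.set( 10::ms, 10::ms, 1.0, 50::ms ); // short fades remove clicks

    while( true )
    {
        // get the current centroid (normalized 0..1)
        cent.upchuck();
        // map it into an audible cutoff range
        100 + cent.fval(0) * 8000 => lpf.freq;
        // retrigger the buffer inside the envelope
        0 => buf.pos;
        env.keyOn();
        200::ms => now;
        env.keyOff();
        env.releaseTime() => now;
    }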
But the most interesting part of this project was how it felt to create music with it. It didn’t feel like playing a smart instrument, but it also didn’t feel like playing a dumb instrument. There were essentially 9 inputs to the instrument: 8 keys and my voice. I had to play around with it for a while before I figured out interesting ways to transition between sections and add dynamics to arrangements. But there was also an aspect of the instrument that was simply mysterious. Maybe I just don’t know how to play it well enough, but part of me thinks that the underlying math it uses is inherently unintuitive. We don’t listen to sounds and hear the centroid or the MFCCs of a sound, so in some regards, the sounds that came out of the program were surprising.
When using dnb-synthesis-mic.ck, it felt like I was performing while also experiencing a performance. It was, in some ways, a duet between man and machine. It was a partner dance where I took the lead, but she took me places I didn’t even consider.
It’s not a perfect tool by any means, but I think the experience of using dnb-synthesis-mic.ck perfectly walks the line between incorporating enough AI elements and maintaining control through various knobs and inputs.
Here is the code in all of its glory. Huge thank you to Ge Wang for writing most of it.
// input: pre-extracted model files
string DRUM_FEATURES_FILE;
string BASS_FEATURES_FILE;
// if we have arguments, override filenames
if( me.args() > 1 )
{
    me.arg(0) => DRUM_FEATURES_FILE;
    me.arg(1) => BASS_FEATURES_FILE;
}
else
{
    // print usage
    <<< "usage: chuck dnb-synthesis-mic.ck:INPUT", "" >>>;
    <<< " |- INPUT: drum model file : bass model file", "" >>>;
    // nothing to do without model files
    me.exit();
}
//------------------------------------------------------------------------------
// unit analyzer network: *** this must match the features in the features file
//------------------------------------------------------------------------------
// audio input into a FFT
adc => FFT fft;
// a thing for collecting multiple features into one vector
FeatureCollector combo => blackhole;
// add spectral feature: Centroid
fft =^ Centroid centroid =^ combo;
// add spectral feature: Flux
fft =^ Flux flux =^ combo;
// add spectral feature: RMS
fft =^ RMS rms =^ combo;
// add spectral feature: MFCC
fft =^ MFCC mfcc =^ combo;


//-----------------------------------------------------------------------------
// setting analysis parameters -- also should match what was used during extraction
//-----------------------------------------------------------------------------
// set number of coefficients in MFCC (how many we get out)
// 13 is a commonly used value; using 20 here
20 => mfcc.numCoeffs;
// set number of mel filters in MFCC
10 => mfcc.numFilters;

// do one .upchuck() so FeatureCollector knows how many total dimensions
combo.upchuck();
// get number of total feature dimensions
combo.fvals().size() => int NUM_DIMENSIONS;

// set FFT size
// 4096 => fft.size;
15207 => fft.size;
// set window type and size
Windowing.hann(fft.size()) => fft.window;
// our hop size (how often to perform analysis)
// (fft.size()/2)::samp => dur HOP;
(fft.size())::samp => dur HOP;
// how many frames to aggregate before averaging?
// (this does not need to match extraction; might play with this number)
4 => int NUM_FRAMES;
// how much time to aggregate features for each file
fft.size()::samp * NUM_FRAMES => dur EXTRACT_TIME;


//------------------------------------------------------------------------------
// unit generator network: for real-time sound synthesis
//------------------------------------------------------------------------------
// how many max at any time?
2 => int NUM_VOICES_BASS;
2 => int NUM_VOICES_DRUMS;
// a number of audio buffers to cycle between
SndBuf buffers_bass[NUM_VOICES_BASS];
SndBuf buffers_drums[NUM_VOICES_DRUMS];
ADSR envs[NUM_VOICES_BASS];
// set parameters
for( int i; i < NUM_VOICES_BASS; i++ )
{
    // connect audio
    // buffers_bass[i] => envs[i] => pans[i] => dac;
    buffers_bass[i] => NRev rev => Pan2 pan => dac;
    0.8 => buffers_bass[i].gain;
    Math.random2f(-.75,.75) => pan.pan;
    Math.random2f(0,.5) => rev.mix;
    // set chunk size (how much to load at a time)
    // this is important when reading from large files
    // if this is not set, SndBuf.read() will load the entire file immediately
    fft.size() => buffers_bass[i].chunks;

    // randomize pan => pans[i].pan;
    // set envelope parameters
    envs[i].set( EXTRACT_TIME, EXTRACT_TIME/2, 1, EXTRACT_TIME );
}
for( int i; i < NUM_VOICES_DRUMS; i++ )
{
    // connect audio
    // buffers_bass[i] => envs[i] => pans[i] => dac;
    buffers_drums[i] => Pan2 panR => dac;
    // 0.5 => panR.pan;
    // set chunk size (how much to load at a time)
    // this is important when reading from large files
    // if this is not set, SndBuf.read() will load the entire file immediately
    fft.size() => buffers_drums[i].chunks;
}

//------------------------------------------------------------------------------
// load feature data; read important global values like numPoints and numCoeffs
//------------------------------------------------------------------------------
// values to be read from file
0 => int numPointsDrums; // number of drum points in data
0 => int numPointsBass; // number of bass points in data
0 => int numCoeffs; // number of dimensions in data
// file read PART 1: read over the file to get numPoints and numCoeffs
<<< "LOADING FILES" >>>;
loadFile( DRUM_FEATURES_FILE, 1 ) @=> FileIO @ fin_drum;
loadFile( BASS_FEATURES_FILE, 0 ) @=> FileIO @ fin_bass;
<<< "LOADED FILES", numPointsBass, numPointsDrums >>>;
// check
if( !fin_drum.good() ) me.exit();
if( !fin_bass.good() ) me.exit();
// check dimension at least
if( numCoeffs != NUM_DIMENSIONS )
{
    // error
    <<< "[error] expecting:", NUM_DIMENSIONS, "dimensions; but features file has:", numCoeffs >>>;
    // stop
    me.exit();
}


//------------------------------------------------------------------------------
// each Point corresponds to one line in the input file, which is one audio window
//------------------------------------------------------------------------------
class AudioWindow
{
    // unique point index (use this to look up the feature vector)
    int uid;
    // which file this came from (index into the files array)
    int fileIndex;
    // starting time in that file (in seconds)
    float windowTime;

    // set
    fun void set( int id, int fi, float wt )
    {
        id => uid;
        fi => fileIndex;
        wt => windowTime;
    }
}

// array of all points in the model files
// (drum windows occupy indices 0..numPointsDrums-1; bass windows follow)
AudioWindow windows[numPointsBass + numPointsDrums];
// unique filenames; we will append to this
string files[0];
// map of filenames loaded
int filename2state[0];
// feature vectors of data points
float inFeaturesBass[numPointsBass][numCoeffs];
float inFeaturesDrums[numPointsDrums][numCoeffs];
// generate arrays of unique indices
int uids_bass[numPointsBass]; for( int i; i < numPointsBass; i++ ) i => uids_bass[i];
int uids_drums[numPointsDrums]; for( int i; i < numPointsDrums; i++ ) i => uids_drums[i];

// uids currently playing (for loop detection); -1 means free
int uids_playing[NUM_VOICES_BASS + NUM_VOICES_DRUMS];
for( int i; i < uids_playing.size(); i++ ) -1 => uids_playing[i];

// use this for new input
float features[NUM_FRAMES][numCoeffs];
// average values of coefficients across frames
float featureMean[numCoeffs];


//------------------------------------------------------------------------------
// read the data
//------------------------------------------------------------------------------
readData( fin_drum, 1 );
readData( fin_bass, 0 );

//------------------------------------------------------------------------------
// set up our KNN objects to use for retrieval
// (KNN2 is a fancier version of the KNN object)
// -- run KNN2.help(); in a separate program to see its available functions --
//------------------------------------------------------------------------------
KNN2 knn_drums;
KNN2 knn_bass;
// k nearest neighbors
2 => int K;
// results vectors (indices of k nearest points)
int knnResultDrums[K];
int knnResultBass[K];
// knn train
knn_drums.train( inFeaturesDrums, uids_drums );
knn_bass.train( inFeaturesBass, uids_bass );


// used to rotate sound buffers
0 => int which_bass;
0 => int which_drums;


fun void synthesize_both( int uid_drums, int uid_bass, int loop_num )
{
    if( checkIfLooping( uid_drums, which_drums ) == 0 )
    {
        buffers_drums[which_drums] @=> SndBuf @ sound;
        // increment and wrap if needed
        which_drums++; if( which_drums >= buffers_drums.size() ) 0 => which_drums;

        // get a reference to the audio fragment to synthesize
        windows[uid_drums] @=> AudioWindow @ win;
        // get filename
        // chout <= files[0];
        files[win.fileIndex] => string filename;
        <<< filename, win.fileIndex, uid_drums >>>;
        // load into sound buffer; loop the whole file for the drum layer
        filename => sound.read;
        chout <= filename <= " ";
        sound.loop(1);
        chout <= "synthesizing drum window:";
        chout <= win.uid <= "["
              <= win.fileIndex <= ":"
              <= win.windowTime <= ":POSITION="
              <= sound.pos() <= "]";
        chout <= IO.newline();
    }
    else
    {
        chout <= "ALREADY PLAYING" <= IO.newline();
    }

    // if( checkIfLooping( uid_bass, which_bass + NUM_VOICES_BASS ) == 0 ) {

    buffers_bass[which_bass] @=> SndBuf @ sound;
    envs[which_bass] @=> ADSR @ envelope;
    which_bass++; if( which_bass >= buffers_bass.size() ) 0 => which_bass;

    // bass windows are stored after the drum windows
    windows[uid_bass + numPointsDrums] @=> AudioWindow @ win;
    files[win.fileIndex] => string filename;
    filename => sound.read;
    chout <= filename <= " ";
    0 => sound.pos;

    chout <= "synthesizing bass window:";
    chout <= win.uid <= "["
          <= win.fileIndex <= ":"
          <= win.windowTime <= ":POSITION="
          <= sound.pos() <= "]";
    chout <= IO.newline();

    envelope.keyOn();
    30000::samp => now;
    envelope.keyOff();
    envelope.releaseTime() => now;

    // } else {
    //     chout <= "ALREADY PLAYING" <= IO.newline();
    //     chout <= uid_drums, which_bass, NUM_VOICES_BASS;
    // }
}

// returns 1 if uid is already playing; otherwise claims a voice slot and returns 0
fun int checkIfLooping( int uid, int whichIndex )
{
    for( 0 => int i; i < uids_playing.size(); i++ )
    {
        if( uids_playing[i] == uid )
        {
            return 1;
        }
    }
    uid => uids_playing[whichIndex];
    return 0;
}

//------------------------------------------------------------------------------
// real-time similarity retrieval loop
//------------------------------------------------------------------------------
0 => int loop_num;
while( true )
{
    // aggregate features over a period of time
    for( int frame; frame < NUM_FRAMES; frame++ )
    {
        //-------------------------------------------------------------
        // a single upchuck() will trigger analysis on everything
        // connected upstream from combo via the upchuck operator (=^)
        // the total number of output dimensions is the sum of
        // dimensions of all the connected unit analyzers
        //-------------------------------------------------------------
        combo.upchuck();
        // get features
        for( int d; d < NUM_DIMENSIONS; d++ )
        {
            // store them in current frame
            combo.fval(d) => features[frame][d];
        }
        // advance time
        2 * 15206::samp => now;
    }

    // compute means for each coefficient across frames
    for( int d; d < NUM_DIMENSIONS; d++ )
    {
        // zero out
        0.0 => featureMean[d];
        // loop over frames
        for( int j; j < NUM_FRAMES; j++ )
        {
            // add
            features[j][d] +=> featureMean[d];
        }
        // average
        NUM_FRAMES /=> featureMean[d];
    }

    //-------------------------------------------------
    // search using KNN2; results are filled into the
    // knnResult arrays: indices of the k nearest points
    //-------------------------------------------------
    knn_bass.search( featureMean, K, knnResultBass );
    knn_drums.search( featureMean, K, knnResultDrums );

    // SYNTHESIZE THIS
    // spork ~ synthesize_both( knnResultDrums[Math.random2(0,knnResultDrums.size()-1)],
    //                          knnResultBass[Math.random2(0,knnResultBass.size()-1)],
    //                          loop_num );
    spork ~ synthesize_both( knnResultDrums[0], knnResultBass[0], loop_num );
    loop_num++;
    // if( loop_num % 1 == 0 ) {
    //     spork ~ synthesize_bass( knnResultBass[Math.random2(0,knnResultBass.size()-1)] );
    // }
    // if( loop_num % 4 == 0 ) {
    //     spork ~ synthesize_drums( knnResultDrums[Math.random2(0,knnResultDrums.size()-1)] );
    // }
    // 15207::samp => now;
}
//------------------------------------------------------------------------------
// end of real-time similarity retrieval loop
//------------------------------------------------------------------------------


//------------------------------------------------------------------------------
// function: load data file
//------------------------------------------------------------------------------
fun FileIO loadFile( string filepath, int isDrums )
{
    // reset
    if( isDrums == 1 )
    {
        0 => numPointsDrums;
    }
    else
    {
        0 => numPointsBass;
    }
    0 => numCoeffs;

    // load data
    FileIO fio;
    if( !fio.open( filepath, FileIO.READ ) )
    {
        // error
        <<< "cannot open file:", filepath >>>;
        // close
        fio.close();
        // return
        return fio;
    }

    string str;
    string line;
    // read each line; count the non-empty ones
    while( fio.more() )
    {
        fio.readLine().trim() => str;
        // check if empty line
        if( str != "" )
        {
            if( isDrums == 1 )
            {
                numPointsDrums++;
            }
            else
            {
                numPointsBass++;
            }
            str => line;
        }
    }

    // a string tokenizer
    StringTokenizer tokenizer;
    // set to last non-empty line
    tokenizer.set( line );
    // start negative (to account for the filePath and windowTime fields)
    -2 => numCoeffs;
    // count the remaining tokens: the feature dimensions
    while( tokenizer.more() )
    {
        tokenizer.next();
        numCoeffs++;
    }

    // see if we made it past the initial fields
    if( numCoeffs < 0 ) 0 => numCoeffs;

    // check
    if( (isDrums == 1 && numPointsDrums == 0) || (isDrums == 0 && numPointsBass == 0) || numCoeffs <= 0 )
    {
        <<< "no data in file:", filepath >>>;
        fio.close();
        return fio;
    }

    // print
    <<< "# of drum data points:", numPointsDrums, "# of bass data points:", numPointsBass, "dimensions:", numCoeffs >>>;

    // done for now
    return fio;
}


//------------------------------------------------------------------------------
// function: read the data
//------------------------------------------------------------------------------
fun void readData( FileIO fio, int isDrums )
{
    // rewind the file reader
    fio.seek( 0 );

    // a line
    string line;
    // a string tokenizer
    StringTokenizer tokenizer;

    // points index
    0 => int index;
    // file index
    0 => int fileIndex;
    // file name
    string filename;
    // window start time
    float windowTime;
    // coefficient
    int c;

    // read each non-empty line
    while( fio.more() )
    {
        fio.readLine().trim() => line;
        // check if empty line
        if( line != "" )
        {
            // tokenize this line
            tokenizer.set( line );
            // file name
            tokenizer.next() => filename;
            // window start time
            tokenizer.next() => Std.atof => windowTime;
            // have we seen this filename yet?
            if( filename2state[filename] == 0 )
            {
                // append
                filename => string sss;
                files << sss;
                // new id
                files.size() => filename2state[filename];
            }
            // get fileindex
            filename2state[filename]-1 => fileIndex;
            // set (bass windows are offset past the drum windows)
            if( isDrums == 1 )
            {
                windows[index].set( index, fileIndex, windowTime );
            }
            else
            {
                windows[index + numPointsDrums].set( index, fileIndex, windowTime );
            }

            // zero out
            0 => c;
            // for each dimension in the data
            repeat( numCoeffs )
            {
                // read next coefficient
                if( isDrums == 0 )
                {
                    tokenizer.next() => Std.atof => inFeaturesBass[index][c];
                }
                else
                {
                    tokenizer.next() => Std.atof => inFeaturesDrums[index][c];
                }
                // increment
                c++;
            }

            // increment global index
            index++;
        }
    }
}
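To run it, pass the pre-extracted drum and bass feature files as arguments (filenames here are hypothetical):

    chuck dnb-synthesis-mic.ck:drum-features.txt:bass-features.txt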