Audio Synthesis using Generative Adversarial Networks

Benjamin Havenaar, Job Vink, Niels Witte

On this website we present our progress for the 2021 course Audio Processing and Indexing at the Leiden Institute of Advanced Computation. Repository

Project results (03-06-2021)

These are the final samples for our project. We generated them from the model snapshots that we recorded over a week of training. The snapshots were taken at 21k, 30k and 70k steps. Please note that these are not the 'raw' outputs: our dataset was slowed down (130 -> 120 bpm) before training, so these samples have been sped up again (120 -> 130 bpm).
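The tempo change mentioned above can be illustrated with a small sketch. It is not our actual processing code: it assumes the speed change is done by plain resampling with linear interpolation (which also shifts the pitch), and the function names `tempo_ratio` and `change_speed` are made up for this example.

```python
def tempo_ratio(source_bpm: float, target_bpm: float) -> float:
    """Playback-speed factor that moves audio from source_bpm to target_bpm."""
    return target_bpm / source_bpm


def change_speed(samples, ratio: float):
    """Resample by linear interpolation so playback is `ratio` times faster.

    Assumption: plain resampling, so the pitch shifts together with the tempo.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not samples:
        return []
    out_len = max(1, round(len(samples) / ratio))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * ratio              # fractional position in the input signal
        lo = min(int(pos), last)     # sample just before that position
        hi = min(lo + 1, last)       # sample just after it (clamped at the end)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


slow_down = tempo_ratio(130, 120)  # factor applied to the dataset (130 -> 120 bpm)
speed_up = tempo_ratio(120, 130)   # factor applied to the generated samples (120 -> 130 bpm)
```

The two factors are each other's inverse, so a generated sample that is sped up by `speed_up` plays back at the tempo of the original, unmodified recordings.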

Training samples

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Samples after 21k steps

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Samples after 30k steps

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Samples after 72k steps

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

We also updated our similarity table to reflect our new data.

Set      Dself          Dtrain
Train    0              1.07 ± 0.29
Test     1.07 ± 0.20    1.07 ± 0.20
21k      6.87 ± 0.12    0.65 ± 0.77
29k      6.87 ± 0.14    0.65 ± 0.94
72k      6.87 ± 0.14    0.65 ± 0.10

Lastly, we surveyed a group of 16 people. The survey can be found here. Our results are presented in the table below.

Set      Mean    St. dev
Train    2.03    0.94
72k      3.41    1.01
29k      3.49    1.25
21k      3.54    1.07

Techno music synthesized - Presentation (06-05-2021)

The goal of our paper was to train a GAN on techno music. To achieve this, we created a dataset consisting of techno music audio samples.

We trained our network for 30,000 steps and these are our results. Please note that these are not the 'raw' outputs: our dataset was slowed down (130 -> 120 bpm) before training, so these samples have been sped up again (120 -> 130 bpm).

Sample 1

Sample 2

Sample 3

One of the ways we measure the fitness of our model is by comparing properties of our datasets. One of these measures is the Euclidean distance from each item in a query set to its k-nearest neighbor in the training set. By comparing these measures we can estimate the diversity (Dself) and the similarity (Dtrain) of a set of samples.


Set             Dself          Dtrain
Train (real)    1.06 ± 0.29    0
Test (real)     1.08 ± 0.19    1.28 ± 0.23
Inference       0.83 ± 0.82    2.50 ± 0.12
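The nearest-neighbor measures in the table above can be sketched roughly as follows. This is a simplified illustration and not our evaluation code: it assumes k = 1, it assumes every audio sample has already been turned into a fixed-length feature vector (the feature extraction is left out), and the function names and toy vectors are made up for this example.

```python
import math
from statistics import mean, stdev


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest vector in `others`."""
    return min(math.dist(point, other) for other in others)


def d_self(features):
    """Diversity: distance from each vector to its nearest *other* vector
    in the same set, summarised as (mean, standard deviation)."""
    dists = [
        nearest_distance(p, features[:i] + features[i + 1:])
        for i, p in enumerate(features)
    ]
    return mean(dists), stdev(dists)


def d_train(query, train):
    """Similarity: distance from each query vector to its nearest training
    vector, summarised as (mean, standard deviation)."""
    dists = [nearest_distance(q, train) for q in query]
    return mean(dists), stdev(dists)


# Hypothetical 2-D feature vectors, only to show how the functions are called.
train = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
query = [(0.0, 1.0), (3.0, 4.0)]

train_diversity = d_self(train)           # spread within the training set
query_similarity = d_train(query, train)  # closeness of the query set to the training set
```

Under these definitions, comparing the training set with itself gives a Dtrain of 0, because every training item is its own nearest neighbor. A low Dself points to samples that sit close together (little diversity), and a low Dtrain points to samples that sit close to the training data.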

First audio synthesized (22-04-2021)

We started off with code from the original authors of the Audio Synthesis paper.

The authors' main focus was on a dataset of spoken recordings of the digits 0 through 9 (SC09). Since our focus lies on generating music samples, we first tried the Generative Adversarial Network (GAN) on piano sounds.

Here are some samples we generated along the way:

audio after 200 epochs

audio after 500 epochs

audio after 1000 epochs

In the future we hope to train the network on some audio recordings of techno music.