Audio Classification¶
The Task¶
The task of identifying what is in an audio file is called audio classification. Typically, audio classification is used to identify audio files containing sounds or spoken words. The task predicts the ‘class’ that the sound most likely belongs to, with a degree of certainty. A class is a label that describes the sounds in an audio file, such as ‘children_playing’, ‘jackhammer’, or ‘siren’.
Example¶
Let’s look at the task of predicting whether an audio file contains sounds of an air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, or street_music using the UrbanSound8k spectrogram images dataset.
The dataset contains train, val, and test folders. Each of these folders contains an air_conditioner folder with spectrograms generated from air-conditioner sounds, a siren folder with spectrograms generated from siren sounds, and so on for the other classes.
urban8k_images
├── train
│ ├── air_conditioner
│ ├── car_horn
│ ├── children_playing
│ ├── dog_bark
│ ├── drilling
│ ├── engine_idling
│ ├── gun_shot
│ ├── jackhammer
│ ├── siren
│ └── street_music
├── test
│ ├── air_conditioner
│ ├── car_horn
│ ├── children_playing
│ ├── dog_bark
│ ├── drilling
│ ├── engine_idling
│ ├── gun_shot
│ ├── jackhammer
│ ├── siren
│ └── street_music
└── val
├── air_conditioner
├── car_horn
├── children_playing
├── dog_bark
├── drilling
├── engine_idling
├── gun_shot
├── jackhammer
├── siren
└── street_music
...
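If you want to sanity-check the extracted data before training, a quick way is to walk the directory tree and count the spectrogram images per class. This is a minimal sketch using only the Python standard library; it assumes the archive was extracted to data/urban8k_images as in the example below, and that the spectrograms are .jpg files:
from pathlib import Path

root = Path("data/urban8k_images")  # assumed extraction path; adjust if needed

for split in ("train", "val", "test"):
    for class_dir in sorted((root / split).iterdir()):
        if class_dir.is_dir():
            # Count the spectrogram images for this class
            n_images = sum(1 for _ in class_dir.glob("*.jpg"))
            print(f"{split}/{class_dir.name}: {n_images} spectrograms")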
Once we’ve downloaded the data using download_data(), we create the AudioClassificationData. We select a pre-trained backbone to use for our ImageClassifier and fine-tune on the UrbanSound8k spectrogram images data. We then use the trained ImageClassifier for inference. Finally, we save the model.
Here’s the full example:
import flash
import torch

from flash.audio import AudioClassificationData
from flash.core.data.utils import download_data
from flash.image import ImageClassifier

# 1. Create the DataModule
download_data("https://pl-flash-data.s3.amazonaws.com/urban8k_images.zip", "./data")

datamodule = AudioClassificationData.from_folders(
    train_folder="data/urban8k_images/train",
    val_folder="data/urban8k_images/val",
    transform_kwargs={"spectrogram_size": (64, 64)},
    batch_size=4,
)

# 2. Build the model
model = ImageClassifier(backbone="resnet18", labels=datamodule.labels)

# 3. Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())
trainer.finetune(model, datamodule=datamodule, strategy=("freeze_unfreeze", 1))

# 4. Predict what's in a few images! air_conditioner, children_playing, siren, etc.
datamodule = AudioClassificationData.from_files(
    predict_files=[
        "data/urban8k_images/test/air_conditioner/13230-0-0-5.wav.jpg",
        "data/urban8k_images/test/children_playing/9223-2-0-15.wav.jpg",
        "data/urban8k_images/test/jackhammer/22883-7-10-0.wav.jpg",
    ],
    batch_size=3,
)
predictions = trainer.predict(model, datamodule=datamodule, output="labels")
print(predictions)

# 5. Save the model!
trainer.save_checkpoint("audio_classification_model.pt")
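The saved checkpoint can later be reloaded for inference without retraining. Here is a minimal sketch, reusing the checkpoint path from step 5 and the predict datamodule from step 4 above; load_from_checkpoint is inherited from PyTorch Lightning:
# Reload the fine-tuned model from the saved checkpoint
model = ImageClassifier.load_from_checkpoint("audio_classification_model.pt")

# Run inference with a fresh Trainer
predictions = flash.Trainer().predict(model, datamodule=datamodule, output="labels")
print(predictions)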
To learn how to view the available backbones / heads for this task, see Backbones and Heads.
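You can also list the backbones programmatically. This is a short sketch assuming the available_backbones() helper exposed by Flash task classes such as ImageClassifier:
from flash.image import ImageClassifier

# Print the names of all backbones registered for this task
print(ImageClassifier.available_backbones())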
Flash Zero¶
The audio classifier can be used directly from the command line with zero code using Flash Zero. You can run the above example with:
flash audio_classification
To view the configuration options, and for details on running the audio classifier with your own data, use:
flash audio_classification --help
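For instance, an invocation that points Flash Zero at your own folders and overrides a trainer setting may look like the following. This is a sketch only: the exact flag names and argument ordering vary between Flash releases, so verify them against the --help output first:
flash audio_classification --trainer.max_epochs 3 from_folders --train_folder data/urban8k_images/train --val_folder data/urban8k_images/val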