
Audio Classification

The Task

The task of identifying what is in an audio file is called audio classification. Typically, audio classification is used to identify audio files containing sounds or words. The task predicts which ‘class’ the sound or words most likely belong to, with a degree of certainty. A class is a label that describes the sounds in an audio file, such as ‘children_playing’, ‘jackhammer’, ‘siren’, etc.


Example

Let’s look at the task of predicting whether an audio file contains sounds of an air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, or street_music, using the UrbanSound8k spectrogram images dataset. The dataset contains train, val and test folders. Each of these folders contains an air_conditioner folder, with spectrograms generated from air-conditioner sounds, a siren folder, with spectrograms generated from siren sounds, and the same goes for the other classes.

urban8k_images
├── train
│   ├── air_conditioner
│   ├── car_horn
│   ├── children_playing
│   ├── dog_bark
│   ├── drilling
│   ├── engine_idling
│   ├── gun_shot
│   ├── jackhammer
│   ├── siren
│   └── street_music
├── test
│   ├── air_conditioner
│   ├── car_horn
│   ├── children_playing
│   ├── dog_bark
│   ├── drilling
│   ├── engine_idling
│   ├── gun_shot
│   ├── jackhammer
│   ├── siren
│   └── street_music
└── val
    ├── air_conditioner
    ├── car_horn
    ├── children_playing
    ├── dog_bark
    ├── drilling
    ├── engine_idling
    ├── gun_shot
    ├── jackhammer
    ├── siren
    └── street_music


Once we’ve downloaded the data using download_data(), we create the AudioClassificationData. We select a pre-trained backbone to use for our ImageClassifier and fine-tune on the UrbanSound8k spectrogram images data. We then use the trained ImageClassifier for inference. Finally, we save the model. Here’s the full example:

import torch

import flash
from flash.audio import AudioClassificationData
from flash.core.data.utils import download_data
from flash.core.finetuning import FreezeUnfreeze
from flash.image import ImageClassifier

# 1. Create the DataModule
download_data("https://pl-flash-data.s3.amazonaws.com/urban8k_images.zip", "./data")

datamodule = AudioClassificationData.from_folders(
    train_folder="data/urban8k_images/train",
    val_folder="data/urban8k_images/val",
    spectrogram_size=(64, 64),
)

# 2. Build the model.
model = ImageClassifier(backbone="resnet18", num_classes=datamodule.num_classes)

# 3. Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())
trainer.finetune(model, datamodule=datamodule, strategy=FreezeUnfreeze(unfreeze_epoch=1))

# 4. Predict what's in a few spectrogram images: air_conditioner, children_playing, siren, etc.
predictions = model.predict(
    [
        "data/urban8k_images/test/air_conditioner/13230-0-0-5.wav.jpg",
        "data/urban8k_images/test/children_playing/9223-2-0-15.wav.jpg",
        "data/urban8k_images/test/jackhammer/22883-7-10-0.wav.jpg",
    ]
)
print(predictions)

# 5. Save the model!
trainer.save_checkpoint("audio_classification_model.pt")
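
The saved checkpoint can later be reloaded for inference with load_from_checkpoint(), the standard PyTorch Lightning classmethod that ImageClassifier inherits. A minimal sketch, reusing the checkpoint path from step 5:

# Reload the fine-tuned model from the saved checkpoint
model = ImageClassifier.load_from_checkpoint("audio_classification_model.pt")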

Flash Zero

The audio classifier can be used directly from the command line with zero code using Flash Zero. You can run the above example with:

flash audio_classification

To view configuration options and options for running the audio classifier with your own data, use:

flash audio_classification --help
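
To run on your own folders, the from_folders arguments can be passed directly on the command line. A sketch, assuming the LightningCLI-style subcommand and flag names Flash Zero builds on (run the --help command above to confirm the exact options for your version):

flash audio_classification --trainer.max_epochs 3 from_folders --train_folder ./train_folder --val_folder ./val_folder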

Loading Data

This section details the available ways to load your own data into the AudioClassificationData.

from_folders

Construct the AudioClassificationData from folders.

The supported file extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp, .npy.

For train, test, and val data, the folders are expected to contain a sub-folder for each class. Here’s the required structure:

train_folder
├── class_1
│   ├── file1.jpg
│   ├── file2.jpg
│   ...
└── class_2
    ├── file1.jpg
    ├── file2.jpg
    ...

For prediction, the folder is expected to contain the files for inference, like this:

predict_folder
├── file1.jpg
├── file2.jpg
...

Example:

data_module = AudioClassificationData.from_folders(
    train_folder="./train_folder",
    predict_folder="./predict_folder",
    ...
)

from_files

Construct the AudioClassificationData from lists of files and corresponding lists of targets.

The supported file extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp, .npy.

Example:

train_files = ["file1.jpg", "file2.jpg", "file3.jpg", ...]
train_targets = [0, 1, 0, ...]

datamodule = AudioClassificationData.from_files(
    train_files=train_files,
    train_targets=train_targets,
    ...
)

from_datasets

Construct the AudioClassificationData from the given datasets for each stage.

Example:

from torch.utils.data.dataset import Dataset

train_dataset: Dataset = ...

datamodule = AudioClassificationData.from_datasets(
    train_dataset=train_dataset,
    ...
)

Note

The __getitem__ of your datasets should return a dictionary with "input" and "target" keys which map to the input spectrogram image (as a NumPy array) and the target (as an int or list of ints) respectively.
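
As a minimal sketch of a dataset that satisfies this contract (the SpectrogramDataset class and the random toy spectrograms are hypothetical, for illustration only):

import numpy as np
from torch.utils.data import Dataset

from flash.audio import AudioClassificationData


class SpectrogramDataset(Dataset):
    """Hypothetical dataset wrapping in-memory spectrograms and integer labels."""

    def __init__(self, spectrograms, targets):
        self.spectrograms = spectrograms  # list of NumPy arrays
        self.targets = targets  # list of ints

    def __len__(self):
        return len(self.spectrograms)

    def __getitem__(self, index):
        # Return the dictionary format described in the note above.
        return {"input": self.spectrograms[index], "target": self.targets[index]}


# Two random 64x64 "spectrograms" standing in for real data.
train_dataset = SpectrogramDataset(
    spectrograms=[np.random.rand(64, 64, 3).astype("float32") for _ in range(2)],
    targets=[0, 1],
)

datamodule = AudioClassificationData.from_datasets(train_dataset=train_dataset)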
