PANN ResNet Exploration For DS 4440

An Analysis of "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition"

By Sam Phillippo and Arnav Joshi

Paper: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Implementation: GitHub (https://github.com/joshiarnav/ds4440final)

The paper highlights the approach of training a large neural network on a very wide range of input data and then fine-tuning it for a specific task, yielding significant performance gains. However, the paper only demonstrates the fine-tuning capabilities of one of its various models. We wish to expand upon their research into PANNs and hope to achieve higher performance than that reported in the paper.

INTRO:

In Kong et al.'s recent research into Pretrained Audio Neural Networks (PANNs) [1], the authors highlight a novel approach to the problem of audio labeling. Prior to this paper, state-of-the-art approaches to audio labeling were models trained from scratch on task-specific data. The authors show that it is instead more effective to train a larger model on a much wider range of audio clips [2] before fine-tuning it to a specific task.

The work done by this paper is impressive, and its central concept, large-scale pretraining followed by task-specific fine-tuning, has since become influential not just in audio labeling but across deep learning. The specific model the researchers fine-tuned for their testing, however, holds potential for improvement. All of the hyperparameter tuning and testing done in this paper used a specific CNN architecture they titled CNN14. Upon reading their results section, though, we found that their best-case CNN14 was actually outperformed by a residual network called ResNet38, which achieved this without any hyperparameter tuning.

Our project investigated the viability of this alternate architecture by testing it on one of the fine-tuning tasks the paper applied to CNN14. To do this, we imported their pre-trained ResNet38 model and constructed a fine-tuning pipeline to train it on the GTZAN song genre labeling task [3].

REVIEW:

Kong et al.'s work in this paper is very detailed, providing not only exact architectural details for all models used, but also thorough investigations into their hyperparameters. The paper generally follows the format of describing a potential technique and testing how it performs on the AudioSet dataset [2]. Their work is comprehensive, testing many cutting-edge deep-learning techniques on top of designing their own. They also include a final section on how their best CNN14 model performs when fine-tuned for a few specific tasks.

Throughout the paper, model performance is measured with mean average precision (mAP). According to the paper, the best mAP CNN14 achieved on the AudioSet tagging task was 0.431. This score was raised to 0.439 by a novel "Wavegram-Logmel" input representation, but that technique is later set aside for the fine-tuning tasks. In the results section, the paper highlights an alternative 38-layer residual network that achieves an mAP of 0.434, beating the CNN14. Our project investigates the potential of this alternative ResNet architecture, comparing its performance against the proposed CNN14 on a fine-tuning task. Specifically, we explore fine-tuning on the GTZAN music genre classification dataset [3], on which the CNN14 achieved a near-state-of-the-art accuracy of 91.5%.

TECHNICAL DETAILS:

Code: https://github.com/joshiarnav/ds4440final

In order to test how the pretrained 38-layer residual network would perform on the GTZAN dataset, we needed to implement our own fine-tuning pipeline. The exhaustive work by the paper's authors made this straightforward: they provide clear architecture layouts and publicly available source code. To keep our results consistent with those reported by Kong et al., we kept our training methodology aligned with theirs (including hyperparameters, loss function, etc.).

Our first task was to retrieve the fine-tuning dataset and parse it into model-readable features. We downloaded the 1000 GTZAN samples through HuggingFace as .wav files labeled by genre, then used Python's librosa library to load each file into memory as a NumPy array. Finally, we padded or truncated each array to a consistent 30 seconds and normalized each waveform to the range [0, 1]. After converting each genre label to a one-hot vector and setting aside one tenth of the samples as a validation set, the data was ready to be processed.
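
In sketch form, this preprocessing looks roughly like the snippet below. The 32 kHz sample rate matches what the PANN models expect; the genre ordering and the exact form of the [0, 1] normalization are illustrative assumptions rather than verbatim code from our pipeline.

```python
import numpy as np
import librosa

SAMPLE_RATE = 32000  # PANN models are trained on 32 kHz audio
CLIP_SAMPLES = SAMPLE_RATE * 30  # pad/truncate everything to 30 seconds
GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def load_clip(path):
    """Load one GTZAN .wav file as a fixed-length, normalized waveform."""
    waveform, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    if len(waveform) < CLIP_SAMPLES:  # pad short clips with silence
        waveform = np.pad(waveform, (0, CLIP_SAMPLES - len(waveform)))
    else:  # truncate long clips
        waveform = waveform[:CLIP_SAMPLES]
    peak = np.abs(waveform).max()
    if peak > 0:  # one reading of the [0, 1] normalization: rescale
        waveform = (waveform / peak + 1.0) / 2.0  # [-1, 1] -> [0, 1]
    return waveform.astype(np.float32)

def one_hot(genre):
    """Convert a genre label into a one-hot target vector."""
    target = np.zeros(len(GENRES), dtype=np.float32)
    target[GENRES.index(genre)] = 1.0
    return target
```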

Figure 1: Architecture of the ResNet38 model

Our model architecture follows exactly what the paper lays out for ResNet38. In total, it consists of five sub-classes, which build on one another and combine into a final "ResNet38_Transfer" class used for fine-tuning. Perhaps the most distinctive class is "ResNet38", which includes our log-mel spectrogram extractor and augmenter, and our mixup implementation. When an input is passed to a "ResNet38", it is first run through the spectrogram extractor and the log-mel extractor, both imported from torchlibrosa (a package by Qiuqiang Kong!). After a BatchNorm is applied, we then perform spectrogram augmentation, again imported from torchlibrosa. We then run a final data augmentation step, mixup, before passing the result into the structure highlighted above.
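
The front end of that forward pass, written as a standalone module, looks roughly like the sketch below. The torchlibrosa classes and hyperparameter values are the ones the PANNs code uses for 32 kHz audio; the class name and the simplified scalar-lambda mixup are our illustrative shorthand.

```python
import torch.nn as nn
from torchlibrosa.stft import Spectrogram, LogmelFilterBank
from torchlibrosa.augmentation import SpecAugmentation

class ResNet38Frontend(nn.Module):  # illustrative name
    def __init__(self, sample_rate=32000, n_fft=1024, hop_length=320,
                 mel_bins=64, fmin=50, fmax=14000):
        super().__init__()
        # Waveform -> spectrogram -> log-mel spectrogram.
        self.spectrogram_extractor = Spectrogram(
            n_fft=n_fft, hop_length=hop_length, win_length=n_fft)
        self.logmel_extractor = LogmelFilterBank(
            sr=sample_rate, n_fft=n_fft, n_mels=mel_bins,
            fmin=fmin, fmax=fmax)
        self.bn0 = nn.BatchNorm2d(mel_bins)
        # SpecAugment-style time/frequency masking.
        self.spec_augmenter = SpecAugmentation(
            time_drop_width=64, time_stripes_num=2,
            freq_drop_width=8, freq_stripes_num=2)

    def forward(self, waveform, mixup_lambda=None):
        x = self.spectrogram_extractor(waveform)  # (batch, 1, time, freq)
        x = self.logmel_extractor(x)              # (batch, 1, time, mel)
        # BatchNorm over the mel-bin axis, as in the PANNs source.
        x = self.bn0(x.transpose(1, 3)).transpose(1, 3)
        if self.training:
            x = self.spec_augmenter(x)
        if self.training and mixup_lambda is not None:
            # Simplified mixup: blend even- and odd-indexed batch items.
            x = x[0::2] * mixup_lambda + x[1::2] * (1.0 - mixup_lambda)
        return x  # fed into the residual backbone shown in Figure 1
```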

Our "ResNet38_Transfer" class is our outermost class of our network, and what we use to transfer the pretrained "ResNet38" to the GTZAN task. Its structure consists of a ResNet38 block, followed immediately by a fully-connected transfer layer, mapping the 2048 classes ResNet38 is trained on to the 10 GTZAN genres. This class also includes a function to load in our previously downloaded pre-trained model into the ResNet38 base layer using pytorch's built-in "load_state_dict" function.

In order to fine-tune the above model on a novel task, we implemented a training function that takes in the parsed dataset and the pretrained model. We use PyTorch's "DataLoader" class alongside custom batch samplers and collate functions to generate random batches of training and validation data. We use the Adam optimizer with a learning rate of 0.0001 and no weight decay, and binary cross-entropy as our loss function. As the system trains, we use PyTorch's built-in "save" function to save checkpoints of the model's state dictionary for later testing.
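
Stripped of the custom samplers, collate functions, and logging, the core of that loop looks roughly like this (the dataset object, checkpoint path, and iteration counts are placeholders):

```python
import os
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def finetune(model, train_dataset, device, num_iterations=2000):
    model.to(device)
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
    os.makedirs("checkpoints", exist_ok=True)

    iteration = 0
    while iteration < num_iterations:
        for waveforms, targets in loader:
            model.train()
            outputs = model(waveforms.to(device))
            # Binary cross-entropy against one-hot genre targets.
            loss = F.binary_cross_entropy(outputs, targets.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if iteration % 200 == 0:  # periodic checkpoints for later testing
                torch.save({"model": model.state_dict()},
                           f"checkpoints/iter_{iteration}.pth")
            iteration += 1
            if iteration >= num_iterations:
                break
```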

FINDINGS:

Utilizing a training strategy similar to that detailed in the paper, we imported the ResNet38 PANN to determine whether it could adapt to downstream tasks in the manner the paper describes. The ResNet38 was fine-tuned with data augmentation and evaluated on a held-out fold of the data (the paper uses full 10-fold cross-validation; we evaluate on a single fold). After 760 iterations, the fine-tuned model achieved 92% accuracy in categorizing audio by music genre, despite not being originally trained on this task. This is higher than the 91.5% the paper reports for the CNN14. However, this peak only lasts for one evaluation, and the model generally stabilizes around 86-89% accuracy, meaning it performs similarly to the CNN14 rather than clearly better.

The 92% accuracy was achieved with a batch size of 16 and a single holdout set (10% of the total data) after 760 iterations. The model's accuracy plateaued around this point and stabilized in the high 80s (generally between 86% and 89%). Fine-tuning thus shows diminishing returns, especially after 1000 iterations at this training speed and with these hyperparameters. Additionally, because the dataset is small (only 1000 samples), the holdout-set accuracy can fluctuate dramatically from run to run, even when the data is augmented well prior to training.

Figure 2: Abridged accuracy of the model over time (first 1000 iterations)


Figure 3: Full training accuracy of the model (missing iterations 1000-1400)

We also recorded the mAP (mean average precision) scores used throughout the paper during model training, although the paper does not report this statistic for the CNN14 on the GTZAN dataset. These measurements therefore stand as a potential reference point for future studies or for comparison with other models.
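
For reference, the mAP we logged is the mean over genres of the per-class average precision. A minimal way to compute it with scikit-learn (our actual logging code may differ in detail) is:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(targets, scores):
    """targets, scores: (num_clips, num_classes) arrays of one-hot labels
    and predicted probabilities."""
    ap_per_class = [average_precision_score(targets[:, k], scores[:, k])
                    for k in range(targets.shape[1])]
    return float(np.mean(ap_per_class))
```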

Figure 4: mAP values during training

The confusion matrices for the two fine-tuned models were broadly similar and showed high accuracy for most genres. Both models struggled particularly with the "rock" genre: the CNN14 model made a correct prediction only 4/10 times, and while the ResNet38 model did better at 7/10, rock was still its most difficult genre. Additionally, both models tended to over-assign songs to the "pop" genre. These biases are likely due to the small dataset size and the inherent difficulty of distinguishing between similar genres. Overall, the confusion matrices suggest the ResNet38 model generally outperformed the CNN14 model, but the differences were not significant.

Figure 5: ResNet38 confusion matrix


Figure 6: CNN14 confusion matrix

Due to the limitations of our hardware, even fine-tuning the model required a significant amount of time. The paper trained on an Nvidia Tesla V100, while we used an RTX 2060; as a result, training to the original paper's stopping point would take multiple days, and reaching what we believe to be a good local minimum (no major change in loss for several hundred iterations) took about a day. With the Tesla V100, the authors were able to reach convergence within an hour, showcasing both the training speed and the accuracy of the CNN14. Extrapolating across the hardware gap, a CNN14 would likely converge much faster on our hardware as well. This is likely a contributing factor in the original paper's choice to focus on the CNN14 over the ResNet38.

CONCLUSION:

This study demonstrates the efficacy of ResNet38, a pre-trained deep learning model, in audio pattern recognition tasks, specifically music genre classification on the GTZAN dataset. Our experimental results show that the ResNet38 model achieved accuracy comparable to the previously benchmarked CNN14, and they highlight the potential for broader application of deep learning models pre-trained on extensive datasets like AudioSet. This investigation underscores the importance of selecting the right model architecture and training strategy, even in the presence of hardware limitations.

Despite the extended training times imposed by less capable hardware, the fine-tuned ResNet38 model reached a peak accuracy of 92%, suggesting that larger pretrained models may offer performance improvements when computational resources permit. These findings have practical implications for developing more robust and accurate audio recognition systems, which could enhance applications in digital music services, content management systems, and automated monitoring environments.

For future work, it would be valuable to further explore the trade-offs between model complexity and computational efficiency. Additionally, applying such models to other audio recognition tasks could reveal insights into the generalizability and adaptability of pretrained networks.

Continued research in this direction could pave the way for advances in audio analysis technologies, benefiting both commercial and academic work. Because fine-tuned neural networks are applied in tasks requiring high performance, the insights from this analysis may help practitioners select architectures deliberately, trading computational efficiency for performance where appropriate.

Ultimately, this analysis demonstrates that further performance gains can be found by fine-tuning large pretrained neural networks for downstream tasks, and such gains will only grow in importance in the current neural network landscape.

References

[1] Q. Kong et al., "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880-2894, Oct. 2020.

[2] J. F. Gemmeke et al., "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 776-780.

[3] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293-302, Jul. 2002.
