
The Song Search
Deep Learning - CS 7150
Prof. David Bau
September 29, 2022
Team members
- Praveen Kumar Sridhar (sridhar.p@northeastern.edu)
- Isha Hemant Arora (arora.isha@northeastern.edu)
Literature Review
- Main Paper/Blog
- Links:
- Hawthorne, Curtis, et al. "Sequence-to-sequence piano transcription with Transformers." arXiv preprint arXiv:2107.09142 (2021).
- Gardner, Josh, et al. "MT3: Multi-Task Multitrack Music Transcription." arXiv preprint arXiv:2111.03017 (2021).
- https://magenta.tensorflow.org/transcription-with-transformers (the original blog)
- A brief review of these papers:
- The blog (and the papers) start by describing the task of Automatic Music Transcription (AMT), which is the task of extracting symbolic representations of music from raw audio.
- The authors then describe the course of their research, which initially focused on AMT for piano (as published in November 2021) and is now gradually extending to other instruments.
- To achieve their results, they implemented a T5-small encoder-decoder model (a minimal sketch of this seq2seq setup follows this section).
- Their major focus now is on building a general-purpose AMT system.
- For this general-purpose AMT they use MT3 (Multi-Task Multitrack Music Transcription), which we found very interesting (and which was the focus of the second paper, published in March 2022).
- Auxiliary Papers/Blogs
- Links:
- https://towardsdatascience.com/3-reasons-why-music-is-ideal-for-learning-and-teaching-data-science-59d892913608 (Max Hilsdorf)
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
- Gong, Yuan, Yu-An Chung, and James Glass. "AST: Audio Spectrogram Transformer." arXiv preprint arXiv:2104.01778 (2021).
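To make the seq2seq formulation concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the idea behind the T5-small transcription model: log-mel spectrogram frames are projected to the model dimension and an encoder-decoder is trained to emit MIDI-like event tokens. The file name, vocabulary size, and layer dimensions are placeholders, and we use the Hugging Face T5 implementation rather than Magenta's own codebase.

    import librosa
    import torch
    from transformers import T5Config, T5ForConditionalGeneration

    # Load audio and compute a log-mel spectrogram (one frame per time step).
    audio, sr = librosa.load("piano_clip.wav", sr=16000)   # placeholder file name
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=229, hop_length=128)
    frames = torch.tensor(librosa.power_to_db(mel).T, dtype=torch.float32).unsqueeze(0)  # (1, T, 229)

    # A T5-small-sized encoder-decoder; the decoder vocabulary would hold the
    # MIDI-like events (note-on/off, velocity, time shift). Sizes are illustrative.
    config = T5Config(d_model=512, d_ff=1024, num_layers=6, num_heads=8, vocab_size=1536)
    model = T5ForConditionalGeneration(config)
    project = torch.nn.Linear(229, config.d_model)  # project mel frames to the model dimension

    # Teacher-forced training step: spectrogram frames in, event tokens out.
    dummy_events = torch.randint(0, config.vocab_size, (1, 64))  # placeholder targets
    loss = model(inputs_embeds=project(frames), labels=dummy_events).loss
    loss.backward()

In the actual papers the targets come from audio aligned with MIDI; here random integers merely stand in for those event tokens.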
Proposal for the Main Question
Aim
The aim of our project is to build an Information Retrieval system for music.
Flow
- As a first step, we have chosen to use Google Magenta’s research work on Music Transcription with Transformers to obtain a string representation of the music.
- Once we have understood and familiarized ourselves with this research work, we aim to work with this string representation of the music.
- As a next step, we want to extract vector representations of the music, which the transformer produces internally as a consequence of its encoder-decoder architecture.
- We will then use these vector representations to perform information retrieval tasks such as song identification and retrieval (a minimal sketch of this idea follows this list).
- Finally, we want to explore and implement other extensions that make use of these representations.
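Below is a minimal sketch (our own, not from the reviewed papers) of how such vector representations could support song identification: every song in a catalogue is embedded once, and a query clip is matched to its nearest neighbours by cosine similarity. The random vectors stand in for pooled encoder outputs, and the function and variable names are our own placeholders.

    import numpy as np

    def cosine_scores(query: np.ndarray, catalogue: np.ndarray) -> np.ndarray:
        """Cosine similarity between one query vector and every catalogue row."""
        query = query / np.linalg.norm(query)
        catalogue = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
        return catalogue @ query

    def identify_song(query: np.ndarray, catalogue: np.ndarray, titles, k=5):
        """Return the k catalogue titles whose embeddings are closest to the query."""
        scores = cosine_scores(query, catalogue)
        top = np.argsort(scores)[::-1][:k]
        return [(titles[i], float(scores[i])) for i in top]

    # Illustrative usage: 1000 songs with 512-dimensional placeholder embeddings.
    rng = np.random.default_rng(0)
    catalogue = rng.normal(size=(1000, 512))
    titles = [f"song_{i}" for i in range(1000)]
    query = catalogue[42] + 0.05 * rng.normal(size=512)   # a noisy clip of song_42
    print(identify_song(query, catalogue, titles))

In a real system the placeholder embeddings would be replaced by vectors pooled from the transcription model's encoder, and an approximate nearest-neighbour index could replace the brute-force similarity computation for large catalogues.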