Content-Based Music Retrieval: Searching Audio and Notation Databases

Content-Based Music Retrieval: An Overview

Systems for content-based music information retrieval (MIR) fall into two broad categories: those that search audio data and those that search notated music. Hybrid systems also exist; they begin by transcribing an audio signal into a symbolic description of notes before searching a notation database. Klapuri’s work exemplifies such transcription, focusing on multiple fundamental frequency estimation and musical metre estimation — tasks that involve ordering rhythmic aspects of music. Part of this research draws on known properties of the human auditory system.

Content-based music search serves a range of audiences and purposes:

  • In record stores, a customer may recognise a tune but not know its title, composer, or performers. Salespeople with encyclopaedic knowledge who can identify humming are rare. Computers could take on this task, matching melodies and suggesting records.
  • A search engine that finds musical scores similar to a given query can help musicologists trace influences among composers or connections between works. This work has been done manually for centuries; automating it could yield faster insights with less effort.
  • Composers concerned about copyright infringement could discover whether someone is plagiarising them, or whether a new work exposes them to plagiarism accusations. A content-based retrieval system would simplify such searches.

Audio-specific search mechanisms also serve useful purposes:

  • Pointing a smartphone at loudspeakers for a few seconds can identify music played on the radio or in a bar, using audio fingerprinting systems.
  • Surveillance recordings can be searched for suspicious sounds.
  • Analysing audio content, including music, can make video retrieval more powerful.
  • Theatre companies, filmmakers, and broadcasters might benefit from a search engine that finds sound effects similar to a given query or description in a large audio library.

Although MIR is a young field, commercial applications already exist. For instance, several companies offer automatic recording identification via smartphones, charging customers for tune recognition and selling matching ringtones and CDs.

Music Formats

Three basic music representations are commonly considered: music notation, time-stamped events, and audio. Most research focuses on mainstream Western music based on notes with definite pitch. Byrd and Crawford provide a more detailed overview of the challenges involved.

Common Music Notation (CMN) dates back to the Middle Ages. Scores represent time horizontally and pitch vertically. Each event appears as a note on a five-line staff; the note’s shape indicates its relative duration (whole, half, quarter notes, etc.). Additional symbols give detailed performance instructions.

MIDI is an industry-standard, time-stamped protocol that allows electronic instruments and computers to communicate. Each pitch is stored as note-on and note-off events. Information about tempo, key signatures, track names, and instruments can also be saved.

Digital audio exists in many formats. Files typically store resolution, sampling rate, and compression type. MPEG created open standards like MP3, while PCM (Pulse Code Modulation) is a common uncompressed format used on audio CDs and WAV files.

background image

The figure compares music formats across dimensions: structure, the ease of converting to lower or higher formats, and analogy to text and image retrieval.

Retrieval Tasks

Content-based retrieval serves three main user groups: industry (broadcast, performance), consumers, and professionals (performers, teachers, musicologists). Retrieval happens at four levels:

  1. Work instance — the individual score or sound object.
  2. Work — a set of instances considered essentially the same.
  3. Artist — creator or performer.
  4. Genre — broad categories like classical, jazz, pop, or world music.

These levels form a continuum rather than a strict hierarchy. An artist may work in multiple genres, and one work can be performed (or created) by several artists. The concept of a “work” itself is flexible: changing even a single note of a composer’s definitive score may violate the work, yet wildly different performances of “My Way” are usually considered the same work.

background image

Common MIR tasks include:

  • Copyright and royalties: tracking broadcast or publication payments.
  • Plagiarism detection: identifying unauthorised use of musical ideas.
  • Recommendation: finding music matching a personal profile.
  • Sounds-as: finding music that resembles a given recording.
  • Mood: music suited to a certain atmosphere.
  • Emotion: music reflecting or contrasting an emotional state.
  • Style: music belonging to a generic category.
  • Performer: music by a specific performer or type of performer.
  • Feature: technical features used to retrieve works in a genre or by an artist.
  • Composer: finding works by one composer.
  • Intertextuality: works employing the same material or alluding to each other.
  • Identification: finding works containing a given theme (e.g., query by humming).
  • Source: identifying the work to which an instance belongs when metadata is missing.

For task frequencies, see Lee and Downie. Figure 2 maps tasks to user classes and retrieval levels. Audio fingerprinting excels at identifying recordings (work instances) because only audio differs between renditions of the same musical content. Audio data also works well for general tasks like genre and artist recognition.

Query-by-humming systems help consumers lacking expertise to enter interval sequences or contours textually, identifying works or finding similar ones.

Audio and symbolic methods suit different tasks. Identifying instances requires audio; identifying works benefits from symbolic representation. Genre determination from audio is promising, but symbolic approaches also work by enabling more complex queries.

Searching Symbolic Music

String-Based Methods for Monophonic Melodies

Monophonic music can be represented as one-dimensional character strings, where each character represents a note or a note pair. Strings can encode interval sequences, gross contour, pitch sequences, and more. Standard string matching algorithms — editing distance, longest common subsequence, substring search — have been adapted for melody matching.

Some systems check only for exact matches or substring occurrences, using algorithms like Knuth-Morris-Pratt or Boyer-Moore. Themefinder uses regular expressions; different strings can match the same expression. For approximate matching, dynamic programming computes an editing distance. Musipedia does this, but computing editing distance between entire strings is insufficient when pieces differ in length, so suitable substrings must be selected first.

Cilibrasi and colleagues applied an approximation of Kolmogorov distance to compute clusters of music. They process MIDI into strings over a finite alphabet and use Normalized Compression Distance (NCD), approximating Kolmogorov complexity with compressed string length. This method yields good genre and composer clusters.

For substring matching, standard text indexing works (inverted files, B-trees). Since music lacks word equivalents, melodies are cut into N-grams (sequences of N pitch intervals). Indexing methods relying on the triangle inequality — such as metric trees or vantage point trees — are used for editing distances.

Geometry-Based Methods for Polyphonic Melodies

Set-based methods treat notes as unordered events with onset time, pitch, and duration. Clausen and colleagues view scores and queries as sets; exact matches are supersets of queries, and approximate matching involves supersets of query subsets or alternative sets. Quantisation and segmentation into measures enable inverted file use.

Typke and colleagues also use sets of notes but apply transportation distances like the Earth Mover’s Distance. They exploit the triangle inequality for indexing without quantisation, pre-calculating distances from database entries to fixed vantage objects. Queries only need comparison to entries with similar distances.

Ukkonen and colleagues propose several algorithms: finding translations where all query note onsets and pitches match; finding translations where some match; and finding translations maximising shared time (notes sounding simultaneously with same pitch, regardless of onset alignment).

Probabilistic Matching

Probabilistic methods determine properties of candidate pieces and compare them to queries. The GUIDO system computes Markov models from features like interval sequences, pitch sequences, or rhythm in pieces. States correspond to pitches, intervals, or durations; transition probabilities reflect occurrences of consecutive states. Similarity between a query and database candidate is the product of transition probabilities for each consecutive query state pair. Transition matrices are organised as a tree: leaves hold piece matrices, inner nodes concatenate subtree matrices.

Searching Musical Audio

Most audio retrieval overviews focus on speech, so the emphasis here is on musical audio. Clausen and Kurth extended their geometry-based method for audio, using feature extraction to convert PCM signals into sets treatable like note sets. Self-Organizing Maps (SOMs) cluster and classify pieces by features like rhythm patterns extracted from audio.

Feature Extraction

A common way to compare audio recordings is to extract abstract, perceptually relevant descriptions from the signal and apply a distance function. Recordings are segmented into short, overlapping frames of 25–40 milliseconds. Commonly extracted features include:

  • Loudness: approximated by the square root of signal energy from the short-time Fourier transform, measured in decibels.
  • Pitch: the Fourier spectrum yields a fundamental frequency via an approximate greatest common divisor algorithm.
  • Tone (brightness and bandwidth): brightness measures higher-frequency content; bandwidth is the magnitude-weighted average of the differences between spectral components and the centroid. Bandwidth is zero for a pure sine wave but infinite for white noise.
  • Mel-filtered Cepstral Coefficients (MFCCs): computed by applying mel-spaced triangular filters to the short-time Fourier transform, followed by a discrete cosine transform. The mel scale accounts for human perception — linear below 1000 Hz, logarithmic above. This transforms the spectrum into perception-based characteristics.
  • Derivatives: dynamic behaviour matters, so instantaneous derivatives (time differences) of all features are often calculated.

Other measured attributes include frequency, attack (the time needed to go from zero to peak amplitude), decay (the interval from peak amplitude to a stable sustain level), sustain (the steady-state amplitude magnitude), release (the duration from sustain back to zero amplitude), zero crossing rate, and spectral centroid (the amplitude-weighted average frequency of a spectrum). Numerous audio retrieval systems compare feature vectors of this type to identify recordings that sound similar to a user's query.

Alternatively, coefficients from compression schemes can be used directly. For instance, [11] employs coefficients taken from MP3 coded audio—these represent the output of the polyphase filters utilized in MP3 encoding. The polyphase filter bank splits the audio signal into 32 equal-width frequency subbands. A psychoacoustic model, based on the human auditory system, decides whether each subband's coefficient should be encoded.

Audio Fingerprinting

When the objective is to identify a specific recording rather than just the work itself, audio fingerprinting methods are highly effective. An audio fingerprint is a content-based compact signature that summarizes an audio recording [2]. All phone-based music identification systems rely on some variant of audio fingerprinting. A feature extractor characterizes short segments of recordings in a manner that is highly robust to typical distortions—poor speakers, cheap microphones, cellular phone connections, and background noise (such as conversation in a bar). These features need not relate to human perception or the musical content; they merely must be unique across different recordings and resilient to degradation. Such fingerprints, typically just a few bytes per segment, are stored in a database index alongside pointers to the recordings they come from. The same feature extractor processes the query, and from the extracted fingerprints, candidate matching recordings are quickly retrieved. The number of candidates can be narrowed by verifying that the fingerprints occur in the correct order and with consistent local timing. Common fingerprint requirements include [2]:

* Discriminative power over massive fingerprint collections, minimizing false positives. * Robustness against distortions such as additive noise and microphone artifacts. * Compactness for efficient processing. * Computational simplicity to ensure fast processing.

Concluding Remarks

For both audio and notated music, retrieval performance could likely be greatly enhanced by adopting human-centered features instead of purely technology-driven ones. After all, music is not primarily perceived or remembered as a sequence of discrete notes or frequencies. Theories of music cognition and perception may therefore play a pivotal role in future retrieval systems.