Music Understanding Through Computer Systems

Music Understanding refers to the ability of a computer to recognize structure and pattern in musical data. Several research projects have tackled different aspects of this challenge. The first involves Computer Accompaniment of Melodic Instruments, where a system listens to a live performer and plays along using a prewritten score. The second extends this capability to handle polyphonic keyboard input, allowing a computer to accompany a pianist. The third project deals with blues improvisation, a situation where the score is not predetermined, requiring deeper comprehension of musical structure. The fourth, called Beat Tracking, identifies musical beats in a live performance without a score, producing both tempo data and a transcription as byproducts. The final project is the Piano Tutor, an intelligent system designed to teach beginning piano students.

Introduction to music understanding

Although research in Music Understanding is relatively new, it holds considerable importance for the evolution of computer music. As music systems grow more complex, using them effectively becomes harder. Artificial intelligence offers ways to simplify these systems, making them accessible to a wider audience. The heart of such intelligent, user-friendly systems is Music Understanding.

Music Understanding investigates methods by which computers can identify pattern and structure within musical data. This understanding opens the door to numerous interesting applications. For example, keyboard performers face limits in producing multiple timbres at once due to the restricted information flow from a keyboard. Producing trumpet, trombone, and saxophone sounds simultaneously is difficult, even if suitable synthesizers exist. A Music Understanding system capable of listening to chords could serve as a real-time arranger, intelligently selecting notes for a synthetic brass section.

Music Understanding systems could also control reverberation and equalization in real time, enhancing live or synthetic instruments much like a recording engineer does today. Possibilities are plentiful, and this field remains fresh.

Five Music Understanding projects are outlined in the following sections. The first two focus on responsive synchronization of an accompaniment during the live performance of a composed score. The third explores following a jazz improvisation where the underlying chord sequence is known but the pitches are not. The fourth targets "foot-tapping": determining the timing and length of beats in a metrical performance, a skill that supports both synchronization and transcription. The final project is the Piano Tutor, which uses Music Understanding for diagnosing student errors.

Research shows that these seemingly low-level musical tasks are far from simple to automate. All five share several traits: they are well-defined yet challenging problems, they provide fairly objective success measures, and their solutions address practical issues in real-time computer music performance. The sections that follow present the work accomplished so far in each of these projects, and the concluding section offers a summary and thoughts on future directions.

Score following and computer accompaniment

A fundamental skill for any musician fluent in notation is reading the score while listening to a live performance. Humans can track complex scores in real time without prior exposure to either the music or the notation. The aim of Computer Accompaniment is to follow a real-time performance using a score and synchronize a computer playback. The computer plays a precomposed part, so the requirement is responsive synchronization rather than real-time composition.

Several computer accompaniment systems have been created by the author and his collaborators. These differ from Accompaniment systems developed by others primarily in the algorithms used for score tracking. Computer Accompaniment can be broken into two tasks. The first is tracking the live performer within the score, and the second is generating the accompaniment in sync with the live playing.

Score following

The first task, score following, requires knowing, at each instant, exactly where the performer is within the score. This is done by matching the notes that are played live with those written in the score. The matching must function even when the performer changes tempo or plays incorrect pitches.

The matching process deals solely with pitches; durations and timings are not considered during this first step. Because the order of pitches does not depend on tempo, this approach stays robust when the performer speeds up or slows down. Each match that is found gives the performer's current position in the score.
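The pitch-only matching described above can be illustrated with a small sketch. The text does not specify the matching algorithm, so this example assumes a longest-common-subsequence style table that is updated one performed note at a time; the function name `follow_score` and the use of MIDI note numbers are hypothetical choices made for illustration.

```python
def follow_score(score, performance):
    """Report a score position after each performed pitch (None if no new match).

    Assumption: an LCS-style dynamic-programming table over pitches only.
    Timing and durations are deliberately ignored, as in the text.
    """
    n = len(score)
    prev = [0] * (n + 1)  # prev[j]: most notes matched so far using score[:j]
    best = 0              # highest match count reported so far
    positions = []
    for pitch in performance:
        cur = [0] * (n + 1)
        matched_at = None
        for j in range(1, n + 1):
            # Either skip this performed note or skip this score note.
            cur[j] = max(prev[j], cur[j - 1])
            # Or pair the performed note with score[j - 1] if the pitches agree.
            if score[j - 1] == pitch and prev[j - 1] + 1 > cur[j]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:  # report only matches that extend the best one
                    best = cur[j]
                    matched_at = j - 1
        positions.append(matched_at)
        prev = cur
    return positions


score = [60, 62, 64, 65, 67]                       # C D E F G
print(follow_score(score, [60, 62, 63, 65, 67]))   # wrong note in third place
print(follow_score(score, [60, 64, 65]))           # performer skips the D
```

In this sketch a wrong note simply produces no match, and a skipped note does not prevent later notes from matching, which is the kind of tolerance the text asks of the matcher.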

Accompaniment

Matches are then handed to an Accompaniment task, which relies on musical knowledge to run a synthesizer. The basic logic is straightforward: when a match indicates the accompaniment is lagging, it speeds up; when it is ahead, it slows down.

In practice, the Accompaniment task evaluates the performer's speed by tracking recent match times and positions. With estimates of speed and location, the Accompaniment task can re-synchronize more quickly with the performer. However, overly rapid correction can sound clumsy and unmusical. Consequently, the Accompaniment task employs multiple guidelines to adjust appropriately for different conditions. For instance, if the Accompaniment needs to jump ahead by several seconds, it skips the corresponding interval in the accompaniment part rather than rushing through every note.
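A minimal sketch of this adjustment logic follows. The actual guidelines used by the Accompaniment task are not given in the text, so the two-second jump threshold, the `gain` factor, and the function names are assumptions chosen only to make the idea concrete.

```python
def estimate_speed(recent_matches):
    """Estimate performer speed in score beats per second.

    recent_matches: (real_time_sec, score_position_beats) pairs, oldest first.
    Returns None when there is too little data to form an estimate.
    """
    if len(recent_matches) < 2:
        return None
    (t0, s0), (t1, s1) = recent_matches[0], recent_matches[-1]
    if t1 <= t0:
        return None
    return (s1 - s0) / (t1 - t0)


def resync(accomp_pos, performer_pos, speed, jump_after_sec=2.0, gain=0.5):
    """Return (new_accompaniment_position, playback_rate).

    Positions are in beats and rates in beats per second. Both
    jump_after_sec and gain are assumed values, not taken from the source.
    """
    lag_sec = (performer_pos - accomp_pos) / speed  # > 0: accompaniment is behind
    if lag_sec > jump_after_sec:
        # Far behind: skip the interval instead of rushing through every note.
        return performer_pos, speed
    # Slightly behind: play faster. Ahead (negative lag): play slower.
    rate = max(0.0, speed * (1.0 + gain * lag_sec))
    return accomp_pos, rate


speed = estimate_speed([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)])
print(speed)                      # performer speed in beats per second
print(resync(4.0, 10.0, speed))   # large lag: jump ahead
print(resync(4.0, 5.0, speed))    # small lag: speed up
print(resync(5.0, 4.0, speed))    # ahead of the performer: slow down
```

A real system would need additional rules (for example, limits on how fast the rate may change) to avoid the clumsy corrections the text warns about.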

Systems built with these techniques have yielded very good results. More recent versions extend the Matcher to handle trills, glissandos, and grace notes as special cases that otherwise posed difficulties, and this newer version has been used successfully in several concerts.

Polyphonic accompaniment

The accompaniment described so far functions with melodic instruments that produce one pitch at a time. (The computer's accompaniment itself faces no limitations—it can include multiple instruments with complex polyphony.) The next step was to expand this capability to keyboard input, including chords and polyphony.

Polyphonic accompaniment necessitated a fresh matching strategy, and two alternatives emerged. One approach groups notes that occur roughly at the same moment into structures called compound events. Once the definition of "matches" is adjusted, the monophonic matcher can seek a correspondence between sequences of compound events. This method is somewhat time-sensitive because it relies on timing to group notes.
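The grouping step of the first approach might look like the sketch below. The 50 ms window and the name `group_compound_events` are assumptions; the text says only that notes occurring at roughly the same moment are grouped.

```python
def group_compound_events(notes, window=0.05):
    """Group notes into compound events by onset time.

    notes: (onset_sec, pitch) pairs sorted by onset.
    A note joins the current group if it starts within `window` seconds
    of the group's first note (an assumed rule and an assumed value).
    """
    events = []
    group_start = None
    for onset, pitch in notes:
        if group_start is None or onset - group_start > window:
            events.append(set())  # start a new compound event
            group_start = onset
        events[-1].add(pitch)
    return events


notes = [(0.00, 60), (0.01, 64), (0.03, 67),  # slightly rolled C major chord
         (0.50, 62),                          # single note
         (1.00, 65), (1.02, 69)]              # two-note chord
print(group_compound_events(notes))
```

Because the grouping depends on the window, a chord rolled more slowly than the window allows would be split into several events, which is the time sensitivity the text mentions.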

The other approach processes each incoming event at the moment it occurs, ignoring timing relationships to other notes. In this case, the algorithm must allow notes belonging to a chord (compound event in the score) to arrive in any order, making it time-independent.
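The order-independence of the second approach can be shown with a deliberately simplified sketch. It is a greedy follower rather than the matching algorithm used in the actual system, which the text does not describe in detail; the function name and the rule of looking only one chord ahead are assumptions.

```python
def follow_chords(score, performed):
    """Return the index of the current score chord after each incoming pitch.

    score: list of sets of pitches (compound events in the score).
    performed: pitches in arrival order; no timing information is used.
    Greedy simplification: only the current and the next chord are considered.
    """
    pos = 0
    heard = set()  # pitches of the current chord already played
    positions = []
    for pitch in performed:
        if pos < len(score) and pitch in score[pos] and pitch not in heard:
            heard.add(pitch)   # another note of the current chord, any order
        elif pos + 1 < len(score) and pitch in score[pos + 1]:
            pos += 1           # first note of the next chord
            heard = {pitch}
        # Otherwise treat the pitch as a wrong or extra note and stay put.
        positions.append(pos)
    return positions


score = [{60, 64, 67}, {62, 65}, {64, 67, 72}]
print(follow_chords(score, [67, 60, 64, 65, 62, 72, 64]))  # chords played top-down
print(follow_chords(score, [60, 64, 67, 62, 65, 64, 67]))  # chords played bottom-up
```

Both orderings yield the same sequence of positions, since nothing in the sketch depends on when the notes of a chord arrive relative to one another.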

Otherwise, a polyphonic system works similarly to a monophonic one: matches give score positions to an Accompaniment task, which makes use of musical rules to keep the accompaniment in sync with the performance. Later on, we will revisit further applications for polyphonic score following and matching.

Following improvisations

Another Music Understanding challenge involves following improvisations. Skilled listeners can frequently identify a well-known song even if the melody is absent. Harmonic and rhythmic structures persist without the melody. Even the solo monophonic improvisation of a single instrument can carry enough hints for listeners to make out the chords and rhythm underneath. Could a computer display the same depth of understanding?

To address this question, work began on a system able to hear a trumpet performance improvising over a twelve-bar blues progression. The objective was for the computer to synchronize an accompaniment by working out the tempo and location within that blues harmonic progression. The "location" refers to the position within the progression that repeats every 12 measures (48 beats) of the twelve-bar blues. Once the computer detects location and tempo, it can contribute drums, bass, and piano.

Like the earlier Accompaniment project, this improvisation understanding system divides into two duties: first, finding location and tempo; second, generating synchronized accompaniment. The first required new approaches; the second reused techniques from the original Accompaniment system.

Following extensive discussions with Bernard Mont-Reynaud, who had built beat-tracking software for an automatic music transcription system, we joined forces on a "blues follower" program. Mont-Reynaud developed the beat-following (foot tapper) module, and I worked on harmonic analysis.

The initial harmonic analysis approach I attempted did not succeed, but it still seems worth outlining. A major challenge when analyzing an improvisation is that, in theory, any pitch could occur within any harmonic framework. Yet, depending on the harmony, certain notes would assume specific roles, such as chromatic passing tones. This suggested that one might assign functions to different notes by extracting features. After labeling these functions and observing several notes, it might be possible to isolate the harmonic setting unambiguously through elimination.

Nevertheless, that path has not proved fruitful. Instead, a statistical method was tested. This assumes that though any pitch might arise anywhere, certain pitches are more probable in certain locations than in others. One isolated note conveys minimal information, but collecting and combining evidence from many notes through statistical means can produce a markedly clear overall picture. The aim therefore is to compute the most plausible location in the progression, given a large collection of "hints." Those hints take the form of pitches that occur more often in certain spots than in others.
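The statistical idea can be sketched as a maximum-likelihood search over the 48 possible beat locations. The probability table here is a toy assumption (chord tones of a common twelve-bar blues in C are weighted three times as heavily as other pitch classes); the source does not give the statistics it used, and the names below are hypothetical.

```python
import math

BEATS_PER_CHORUS = 48  # 12 measures of 4 beats, as in the text

# Assumed toy harmony: one dominant-seventh chord per measure, blues in C.
CHORD_TONES = {"C7": {0, 4, 7, 10}, "F7": {5, 9, 0, 3}, "G7": {7, 11, 2, 5}}
MEASURES = ["C7"] * 4 + ["F7"] * 2 + ["C7"] * 2 + ["G7", "F7", "C7", "C7"]


def pitch_probability(pitch_class, beat):
    """Assumed P(pitch class | location): chord tones are three times as likely."""
    chord = CHORD_TONES[MEASURES[(beat % BEATS_PER_CHORUS) // 4]]
    weight = 3.0 if pitch_class in chord else 1.0
    total = 3.0 * len(chord) + 1.0 * (12 - len(chord))
    return weight / total


def most_likely_location(notes):
    """notes: (beats_since_listening_began, midi_pitch) pairs.

    Returns the beat within the progression (0..47) at which listening most
    plausibly began, combining weak evidence from every note.
    """
    def log_likelihood(offset):
        return sum(math.log(pitch_probability(pitch % 12, beat + offset))
                   for beat, pitch in notes)
    return max(range(BEATS_PER_CHORUS), key=log_likelihood)


# A contrived solo that plays the third of each chord, entered at beat 20.
thirds = {"C7": 64, "F7": 69, "G7": 71}
solo = [(t, thirds[MEASURES[((t + 20) % 48) // 4]]) for t in range(48)]
print(most_likely_location(solo))
```

With only a handful of notes many locations score equally well, which reflects the point that a single note carries little information and that evidence must be accumulated.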

The foot tapper and a real-time version of this statistical method were combined into a live improvisation understanding program for further trials. The results are interesting but fall short of the standard needed for practical musical use. Human listeners clearly outperform this simplistic computer listener, and richer models are necessary before computer systems can be regarded as musically competent listeners.

Even if this approach remains insufficient for live performance, it points toward interesting analytical possibilities. Could such techniques characterize a performer or a blues style? The statistical methods allow one to compare multiple performances or build statistical summaries across performances, opening up diverse analytical routes.

Rhythm understanding

The "foot-tapping" challenge is pinpointing the location and length of beats in metrical music. In principle, foot tapping seems straightforward. One assumption is that note onsets often land on evenly spaced beats. The task then amounts to discovering a slowly shifting tempo function that predicts beats that correspond to observed onsets. If a beat prediction occurs just ahead of a note onset, the tempo assumption is considered too fast, and the estimate is decreased. If an onset occurs just before a predicted beat, the estimate is deemed too slow, and it is raised. In this manner, the anticipated beats align with detected onsets and hopefully the "true" beat.
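The correction rule in this paragraph can be written down directly. The sketch below is an assumed, minimal form of such a foot tapper; the `gain` value, the use of the same gain for phase and period, and the function name are illustrative choices rather than details from the source.

```python
def foot_tap(onsets, period, gain=0.3):
    """Return the final beat period (seconds) after following the onsets.

    onsets: note onset times in seconds, in increasing order.
    period: initial guess for the beat length in seconds.
    Assumes the first onset falls on a beat.
    """
    beat = onsets[0]
    for onset in onsets[1:]:
        beat += period                    # predict the next beat
        while onset - beat > period / 2:  # no onset near this beat: keep tapping
            beat += period
        error = onset - beat              # > 0: predicted beat came before the onset
        # Beat predicted too early means the tempo estimate is too fast,
        # so the beat period is lengthened; the opposite case shortens it.
        period += gain * error
        beat += gain * error              # nudge the phase toward the onset too
    return period


steady = [0.0, 0.5, 1.0, 1.5, 2.0]
slower = [0.55 * i for i in range(9)]
print(foot_tap(steady, 0.5))  # the estimate stays at the initial period
print(foot_tap(slower, 0.5))  # the estimate drifts toward the longer beat
```

Raising `gain` makes the tapper follow tempo changes sooner but also makes it react to ordinary timing fluctuations, so a single gain value has to trade one against the other.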

However, direct implementations of this idea are seldom reliable. To track tempo shifts the system must respond to the timing of just one or two notes, making it highly reactive to ordinary microtiming fluctuations that do not signal actual tempo changes. Conversely, reducing sensitivity removes the ability to follow changes effectively. Additionally, once the foot tapper veers off the beat, recapturing sync becomes tough.

Working with Paul Allen, I developed a revised approach that ought to handle timing fluctuations better while staying responsive to tempo changes. Our observation was that simple foot tappers often encountered note onsets with ambiguous interpretations. It is easy to confuse a tempo increase that puts an onset on a downbeat with a tempo decrease that puts it just before the downbeat. After making such an error, a basic foot tapper tends to commit further mistakes as it tries to force the data to fit its current estimates. The system then tends to drift away from the correct tempo rather than lock onto it.

To sidestep this, we built a system that keeps many concurrent interpretations of note onset timing via a beam-search technique. Beam search preserves a set of candidate interpretations. Each interpretation consists of an estimated beat length (tempo) and an estimated beat phase (the current position within the beat). Each new note in the performance data causes new interpretations to be generated from each stored alternative. Storing many candidates increases the chances that the correct one survives while implausible ones are discarded as more notes arrive. We employ heuristics to score interpretations: for example, we penalize large tempo changes and interpretations that imply complex rhythms, and we reject unlikely rhythmic combinations outright even when they are theoretically possible.
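A compact sketch of beam search over beat interpretations is given below. Everything specific in it is an assumption made for illustration: the allowed rhythmic values in `RATIOS`, their penalties, the 0.25 limit on log tempo change, and the averaging used to update the period. The source describes the heuristics only in general terms.

```python
import math

# Assumed interpretations of an inter-onset interval, in beats, each with an
# assumed rhythmic-complexity penalty.
RATIOS = {1.0: 0.0, 0.5: 0.2, 2.0: 0.2, 1.5: 0.5}


def beam_track(onsets, initial_periods, beam_width=5, max_change=0.25):
    """Return surviving hypotheses as (cost, period_sec, beat_position), best first.

    beat_position is the position of the latest onset in beats; its
    fractional part is the beat phase.
    """
    beam = [(0.0, period, 0.0) for period in initial_periods]
    for previous, current in zip(onsets, onsets[1:]):
        interval = current - previous
        successors = []
        for cost, period, position in beam:
            for beats, complexity in RATIOS.items():
                implied = interval / beats  # beat length if this reading is right
                change = abs(math.log(implied / period))
                if change > max_change:
                    continue                # reject implausible tempo jumps outright
                new_period = 0.5 * (period + implied)
                successors.append(
                    (cost + change + complexity, new_period, position + beats))
        if successors:                      # keep the old beam if all were rejected
            beam = sorted(successors)[:beam_width]
    return beam


# Quarter notes at 0.5 s with one pair of eighth notes in the middle.
onsets = [0.0, 0.5, 1.0, 1.25, 1.5, 2.0]
for cost, period, position in beam_track(onsets, [0.25, 0.5, 1.0]):
    print(round(cost, 2), period, position)
```

In this example the reading with a 0.5 s beat ends with the lowest cost, a reading with a 0.25 s beat survives at a higher cost, and the 1.0 s reading is eliminated when the eighth notes arrive.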

The foot tapper now runs in real time. Initial results indicate that it sometimes outperforms simpler methods. The system handles substantial tempo variations and the timing irregularities typical of inexperienced keyboard players. That said, tracking quality depends on the musical texture: steady eighth notes are straightforward, while highly syncopated passages remain harder. A more detailed characterization of its behavior is still needed, and understanding its weaknesses will direct future improvements.

The piano tutor

The Piano Tutor is an intelligent system for teaching beginning piano players. Its components include an electronic piano attached to a computer, a videodisc player, a computer monitor, and a separate video display. During a typical interaction, the Tutor picks a lesson and delivers a suitable presentation. Lesson presentations usually explain a newly required skill, for instance, identifying a G with the left hand and playing it. Afterwards, the student gets an opportunity for practice and assessment, often by performing a piece of music. Following the performance, the Tutor analyzes what happened, suggests remedies if the student made errors, and selects the next lesson upon successful completion, repeating the whole procedure.

The Piano Tutor uses score following to track the student's live playing, which lets it determine whether pitches and timing are correct. Music Understanding thus serves two critical roles: following the performance and detecting errors. It can, for example, keep its place in the score when the student stumbles or skips ahead.

The system can "turn pages" on a computer screen at the right moment. Computer accompaniment also rewards the student once a new piece is mastered. Beyond these score-following uses, the Piano Tutor makes score following the basis for evaluating student performances. A by-product of score following is a precise alignment between each performed note and its counterpart in the score. From this alignment it becomes simple to detect wrong notes, missed notes, extra notes, and notes that are too short, too long, too early, or too late.
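Given such an alignment, the low-level error categories listed above can be read off with a few comparisons. The sketch below assumes that the alignment arrives as pairs of (pitch, onset, duration) tuples measured in beats, with `None` marking an unpaired note, and that a quarter-beat tolerance separates acceptable from faulty timing; the data layout, the tolerance, and the function name are all hypothetical.

```python
def classify_errors(alignment, tolerance=0.25):
    """Return a list of (error_kind, pitch) for one aligned performance.

    alignment: (expected, played) pairs. Each side is a
    (pitch, onset_beats, duration_beats) tuple, or None when unpaired.
    """
    errors = []
    for expected, played in alignment:
        if played is None:
            errors.append(("missed", expected[0]))
            continue
        if expected is None:
            errors.append(("extra", played[0]))
            continue
        if played[0] != expected[0]:
            errors.append(("wrong", expected[0]))
        onset_error = played[1] - expected[1]
        if onset_error < -tolerance:
            errors.append(("early", expected[0]))
        elif onset_error > tolerance:
            errors.append(("late", expected[0]))
        length_error = played[2] - expected[2]
        if length_error < -tolerance:
            errors.append(("short", expected[0]))
        elif length_error > tolerance:
            errors.append(("long", expected[0]))
    return errors


alignment = [
    ((60, 0.0, 1.0), (60, 0.0, 1.0)),  # correct
    ((62, 1.0, 1.0), (61, 1.0, 1.0)),  # wrong pitch
    ((64, 2.0, 1.0), None),            # missed note
    (None, (70, 2.5, 0.5)),            # extra note
    ((65, 3.0, 1.0), (65, 3.5, 0.5)),  # late and too short
]
print(classify_errors(alignment))
```

A list of this kind is the raw material for higher-level diagnosis; for instance, a run of "early" entries could be summarized as the student rushing the tempo.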

These low‑level errors are examined to derive explanations. A cluster of early notes might mean the student is accelerating the tempo. If the first note lies in the wrong octave, the Piano Tutor instructs the student to find the correct octave and restart. When a wrong note relates to a concept or skill just taught, the system can repeat the explanation. In many respects, the Piano Tutor exhibits the deepest understanding among all the systems discussed here. This depth is partly possible because the domain is tightly constrained: we know in advance which concepts are being taught, what errors students typically make, and which pieces will be played. Such foreknowledge makes it easier for the Piano Tutor to analyze student mistakes and produce plausible responses.

Summary and conclusions

Five Music Understanding systems have been illustrated. Each aimed to recognize pattern or structure in music to perform a low‑level musical task. These examples addressed melodic and polyphonic (keyboard) score following, identifying correlations between a jazz improvisation and a chord progression, beat tracking, and analyzing student performance errors.

These tasks are only a few of the many that might be automated, and there is clearly room for improvement in the areas already discussed. Following an ensemble has not been studied, and vocal music following has not been carefully examined. Voices typically show far more variation in pitch and articulation than instruments. Incorporating learning into these and other Music Understanding systems is an important future direction that remains unexplored.

Music Understanding is a crucial factor in developing computer‑based music systems. Computer music systems are approaching a limit imposed by their human interface, and more sophisticated interfaces can evolve only when computer systems grasp musical concepts. Without such understanding capabilities, computer music systems can automate only mechanical aspects of music making, and the full potential of computers in music will remain unrealized.

Music Understanding systems also play a significant role in developing and testing theories of cognition. The quest to build music understanding systems encourages us to explore our own cognitive capacities and mechanisms. Moreover, computers provide an objective means to test theories and models of music cognition. Formal study of music understanding leads to more formal music models, which may spark interesting new music theories.

In summary, Music Understanding is a new but important area of computer music research today. It promises improved interfaces for computer music systems, new capabilities, advances in cognitive psychology, and developments in music theory.

Acknowledgments

This paper is based on a talk prepared for the International Wenner‑Gren Symposium on Music, Language, Speech, and Brain. This work would not have been possible without major contributions from several colleagues. Joshua Bloch co‑designed and implemented the first polyphonic computer accompaniment system. Bernard Mont‑Reynaud designed and implemented the beat tracker for the jazz improvisation understanding system, and Paul Allen co‑designed and implemented the foot tapper program and evaluated many alternative designs. The Piano Tutor concept originated with Marta Sanchez and Annabelle Joseph, and the system was developed with contributions from Peter Capell, Ron Saul (who implemented most of the analysis system), and Robert Joseph. Software contributions also came from John Maloney, and Hal Mukaino supplied numerous design suggestions. Hal also implemented the best polyphonic accompaniment system to date.

This work was largely made possible by the Carnegie Mellon University School of Computer Science and was partially supported by Yamaha (computer accompaniment) and the Markle Foundation (the Piano Tutor).