Turning DNA Alignments into Music with the CALIGN System

Turning DNA sequences into sound

Evolution and natural selection shape each organism's phenotype and genotype in distinctive ways. Homologous sequences descend from a common ancestor through a series of selective changes, diverging over time. Various selective pressures acting on a genomic sequence constrain its evolution and give rise to interesting structures—modularization being one example. These evolutionarily shaped structures become visible when sequences originating from a common ancestor are aligned. The outcome, also called an "alignment," is a matrix that is not only rich in information and revealing for a biology expert but also patterned in ways that can be aesthetically appealing. Some patterns emerge when using any of the numerous visualization tools available.

The modular and structured quality of music has long struck many researchers as offering a promising route to understanding genomic data by translating it into sound. However, only a few attempts have been made to use music to convey these patterns to an audience. Earlier efforts all concentrated on single DNA or protein sequences. The earliest approaches translated DNA directly into music, assigning two notes to each of the four nucleotide characters—this allowed some flexibility in arranging notes into musical themes. Sonifying protein sequences provided a larger set of starting characters (the twenty amino acids) but was even more constrained and tended to generate a monotonous string of notes with little musical depth. Incorporating additional properties of the characters, and deriving mathematical formulas from that extra information, led to more interesting music but obscured the underlying biological information. One tool, gene2music, enables automated conversion of protein-coding sequences into music by mapping the twenty amino acids onto thirteen chords. It groups chemically similar amino acids together, and the duration of each chord depends on the frequency of the underlying codon. Another system, PROMUSE, sonifies both amino acid features and structural information, as well as the similarity between related proteins along a sequence. This similarity between proteins and genomic sequences, arising from common ancestry and moderate variation, is central to studies of evolution and genomics.

Presenting highly complex, multidimensional data requires far more channels to carry information than the visual channel can handle on its own. While visualization and animation are relatively well developed, research into delivering information through sonification has gained interest only recently. Surprisingly, the complexity of information conveyed through the audio channel is generally low, even though music composed for entertainment or art often displays highly complex structures. In a multimedia setting, Lodha and colleagues demonstrated that sonification can effectively disambiguate data when the visual presentation alone is unclear. Still, a direct comparison of the efficiency of auditory versus visual information uptake is difficult to perform. It is reasonable to expect that perceiving data through sonification versus visualization involves fundamentally different cognitive processes. Whether this difference can benefit data presentation is an area we intend to keep exploring.

This report describes CALIGN, the first prototype for alignment sonification that translates genome-wide aligned data into a musical composition. Such an acoustic representation requires a unique mapping of alignment information onto musical parameters. While certain mappings are straightforward to define, our goal is a mapping that is intuitive, easy to perceive, and also meets the artistic requirements of being pleasant and interesting.

Methods

Mapping

The central aim of our approach is to sonify the presence and absence of characters in an alignment so that their assignment to a particular sequence, and therefore species, is clear. For simplicity, we assume sequences come from different species, which allows us to refer to "different sequences" as "different species." The source of the sequences is not essential to our theoretical framework and can be added later. We chose the mapping formalized below.

A musical motif or pattern is an ordered set of notes and rests played within one measure with a specific rhythm. Given a set of species, a set of instruments, and a set of distinct patterns, we assign to each species one instrument and one pattern, which the chosen instrument plays. Accordingly, we define an injective function that maps each species to an ordered pair consisting of an instrument and a pattern. This means each species receives a unique combination of instrument and pattern, and the number of species must not exceed the total number of available instrument-pattern combinations. The remaining degrees of freedom can be used to incorporate auxiliary information, such as the phylogenetic relationships among the species. Therefore, instruments are assigned to species in such a way that the relationships among the instruments reflect the relationships among the species. Because the perceived relatedness of instruments is subjective, this assignment is done manually. Using two independent features (instrument and pattern) to encode each species allows us to handle alignments with up to one hundred species and to represent two-dimensional phylogenetic information as provided by tools like SplitsTree. Beyond these options, we provide further motifs played on drums and cymbals. These rhythmic motifs are especially useful for sonifying outgroup species.
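The assignment just described can be sketched in Python. This is a minimal illustration and not CALIGN's actual code: the helper names `assign_voices` and `is_injective` and the instrument and motif labels are hypothetical, and the automatic pairing in product order merely stands in for the manual, phylogeny-guided assignment described in the text.

```python
from itertools import product

def assign_voices(species, instruments, patterns):
    """Map each species to a distinct (instrument, pattern) pair.

    Hypothetical helper: pairs are handed out in product order, whereas
    the text assigns instruments manually to mirror the phylogeny.
    """
    combos = list(product(instruments, patterns))
    if len(species) > len(combos):
        raise ValueError("more species than instrument-pattern combinations")
    return dict(zip(species, combos))

def is_injective(mapping):
    """True if no two species share the same (instrument, pattern) pair."""
    values = list(mapping.values())
    return len(values) == len(set(values))

voices = assign_voices(
    ["D. melanogaster", "D. simulans", "D. yakuba"],
    ["piano", "violin"],
    ["motif_a", "motif_b"],
)
```

The explicit size check mirrors the requirement that the number of species must not exceed the number of available instrument-pattern combinations.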

For a given sequence, we consider a set of units—subsequences of the sequence that are ordered so that one unit occurs before another when its position is earlier. Biologically, these units are characters in general, or "genes" in this contribution. Each unit can be classified as absent or present in either the forward or reverse orientation.

This allows us to define a matrix, which is also called an alignment. An entry is set to present in one orientation when the unit appears in a given species in that orientation; otherwise, it is set to absent. This means that for a fixed unit, all entries indicating presence are homologous. Since we have assigned each species a particular instrument playing a particular pattern, the instrument and pattern assigned to a given species will play during a particular time interval whenever that unit occurs in that species (i.e., the entry is not marked absent). Otherwise, the instrument rests. Whether a sound is produced depends only on the presence or absence of the unit, but three options can be activated to highlight specific information.
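As a rough sketch of how the alignment matrix drives the music, the following Python snippet lists, for each unit, the species whose instrument sounds in the corresponding measure. The state labels "NA", "+", and "-" and the function name `sounding_species` are assumptions made for illustration, as are the abbreviated species names.

```python
ABSENT = "NA"  # assumed label for an absent unit

def sounding_species(alignment, species):
    """For each unit (row), list the species whose voice sounds in that measure.

    `alignment` is a list of rows, one per unit; each row holds one state
    per species, in the same order as `species`.
    """
    schedule = []
    for row in alignment:
        playing = [sp for sp, state in zip(species, row) if state != ABSENT]
        schedule.append(playing)
    return schedule

species = ["mel", "sim", "yak"]
alignment = [
    ["+", "+", "+"],    # unit present in all species
    ["+", "NA", "-"],   # unit absent in "sim", reversed in "yak"
    ["+", "NA", "NA"],  # unit present only in the reference
]
schedule = sounding_species(alignment, species)
```

Every species missing from a row's list rests for that measure, which is the default behaviour before any of the three options is applied.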

Orientation. This option controls whether a pattern is played forwards or backwards, depending on the orientation of the occurring unit. More precisely, if a unit appears in a species with a given instrument and pattern, the pattern is played forwards or backwards when the entry for that species indicates a forward or reverse orientation, respectively. The default setting treats any present entry as having forward orientation.
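The orientation option can be illustrated with a short Python sketch. It assumes a pattern is a list of (pitch, duration) events and interprets "played backwards" as reversing the order of those events, which is one simple reading and not necessarily how CALIGN implements it. The function name `render_pattern` is hypothetical.

```python
def render_pattern(pattern, state, use_orientation=True):
    """Return the events to play for one measure, or [] for a rest.

    `pattern` is a list of (pitch, duration) events and `state` is one of
    "NA" (absent), "+" (forward), or "-" (reverse).
    """
    if state == "NA":
        return []
    if use_orientation and state == "-":
        return list(reversed(pattern))
    # Default: any present entry is treated as forward orientation.
    return list(pattern)

motif = [("C4", 1), ("E4", 1), ("G4", 2)]
backwards = render_pattern(motif, "-")
```

With `use_orientation=False`, a reverse entry is rendered exactly like a forward one, matching the default setting described above.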

Conservation. Conservation information is critically important for a biological researcher. In some situations, units present across all species are the most interesting and are analyzed in further detail. This option emphasizes units that are present, or conserved, in all species. It is implemented as a change in harmony. Altering the harmony of a motif is done through diatonic transposition, shifting every pitch of a pattern by a fixed number of scale steps relative to the pattern's musical scale. For each pattern, we apply a transposition chosen with a probability that depends on the pattern's current scale whenever a unit is present in all species. The probability values are partly based on general principles of common-practice tonal harmony to ensure well-formed harmonic progressions. Thus a transposition maps one pattern to a new pattern, which replaces the old one. This process is a first-order Markov chain.

For untuned idiophones and membranophones (drums and cymbals), the new pattern equals the old pattern—that is, these motifs cannot be transposed. The original pattern and its transposed version are perceived as identical except for the change in scale.
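The conservation option might be sketched as follows. Patterns are represented as scale degrees, so that diatonic transposition becomes adding a fixed number of scale steps. The transition table is purely hypothetical, since the text does not give the probabilities CALIGN uses; tracking a harmonic level relative to the base pattern is equivalent to repeatedly replacing the current pattern with its transposed version.

```python
import random

# Hypothetical transition probabilities between harmonic levels, written as
# diatonic offsets from the tonic (0 = I, 3 = IV, 4 = V). The values CALIGN
# actually uses are not given in the text.
TRANSITIONS = {
    0: {3: 0.5, 4: 0.5},
    3: {4: 0.7, 0: 0.3},
    4: {0: 0.8, 3: 0.2},
}

def transpose(degrees, steps, tuned=True):
    """Shift every scale degree by `steps`; rests (None) and untuned parts stay."""
    if not tuned:
        return list(degrees)
    return [None if d is None else d + steps for d in degrees]

def next_level(current, rng):
    """Draw the next harmonic level; it depends only on the current one."""
    targets = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]

def harmonize(base_degrees, conserved_flags, rng, tuned=True):
    """Yield one pattern per unit, changing level at each fully conserved unit."""
    level = 0
    for conserved in conserved_flags:
        if conserved:
            level = next_level(level, rng)
        yield transpose(base_degrees, level, tuned)

measures = list(harmonize([0, 2, 4], [False, True, True], random.Random(0)))
```

Because the next level is drawn using only the current level, the sequence of levels forms a first-order Markov chain, and passing `tuned=False` leaves drum and cymbal motifs unchanged.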

Compression. Phylogenetic analyses focus on differential information. In such contexts, conserved units are considered uninformative. This option compresses the detailed information in conserved units while simply indicating that a unit occurs in all species. Under default options, the musical motif is played in full. If the compression option is activated and a unit is present in all species, then for every species the chosen instruments simultaneously play the first note of each respective pattern, relative to their orientation, producing a so-called tutti chord.
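A minimal sketch of the compression option, under the assumption that each pattern is a plain list of pitch names beginning with a note and that a reverse orientation simply reverses that list. The names `oriented` and `render_unit` are illustrative only.

```python
def oriented(pattern, state):
    """Pattern as played: reversed for "-", unchanged for "+"."""
    return pattern[::-1] if state == "-" else list(pattern)

def render_unit(patterns, states, compress=False):
    """Render one unit for all species.

    Returns ("tutti", chord) when compression applies, otherwise
    ("voices", per-species note lists) with [] marking a resting instrument.
    """
    if compress and all(state != "NA" for state in states):
        # Every species contributes the first note of its oriented pattern.
        chord = [oriented(p, s)[0] for p, s in zip(patterns, states)]
        return ("tutti", chord)
    voices = [[] if s == "NA" else oriented(p, s) for p, s in zip(patterns, states)]
    return ("voices", voices)

patterns = [["C4", "E4", "G4"], ["A3", "C4", "F4"]]
conserved = render_unit(patterns, ["+", "-"], compress=True)
```

In this sketch, orientation still influences which note enters the chord, but the full forward or backward motif is no longer heard.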

Invertibility of the Mapping

Both visualization and sonification attempt to convey abstract information in intuitive ways. First, the information must be formally retrievable from the representation: the mapping needs to be injective, so that each sound corresponds to at most one source and the information can be recovered uniquely. Second, the information must be perceivable to the human ear, taking advantage of our sense of hearing.

When all options are turned off, it is easy to see that the species can be identified by their unique combination of instrument and pattern because the function mapping species to these value pairs is injective.

Constraints from Orientation. To determine whether a given unit appears in forward or backward orientation in a species, it must be possible to tell whether the motif is played forwards or backwards. This restriction means no symmetric patterns are allowed. Furthermore, there cannot be two patterns such that playing one pattern backwards sounds the same as playing the other pattern forwards.
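Both orientation constraints can be checked mechanically. The sketch below assumes patterns are lists of comparable events and that playing backwards means reversing the list; the function name `orientation_conflicts` and the pattern labels are hypothetical.

```python
from itertools import combinations

def orientation_conflicts(patterns):
    """Return human-readable violations of the two orientation constraints.

    `patterns` maps a pattern name to its list of events.
    """
    problems = []
    # Constraint 1: no pattern may sound the same forwards and backwards.
    for name, notes in patterns.items():
        if notes == notes[::-1]:
            problems.append(f"{name} is symmetric")
    # Constraint 2: no pattern may equal another pattern played backwards.
    for (a, pa), (b, pb) in combinations(patterns.items(), 2):
        if pa == pb[::-1]:
            problems.append(f"{a} is the reverse of {b}")
    return problems

problems = orientation_conflicts({
    "a": [1, 2, 3],
    "b": [3, 2, 1],
    "c": [1, 2, 1],
    "d": [1, 3, 5],
})
```

An empty result means a listener can, at least formally, tell from any motif both which pattern it is and in which direction it is played.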

Constraints from Conservation. This option imposes restrictions on instrument and pattern usage if we want to distinguish different species by listening to their respective combinations of instrument and pattern. We can examine two cases. First, if for every pair of species the instruments are different, then there is no restriction on patterns, because each species is uniquely identified by its instrument. Second, if some species share the same instrument, they must be distinguished by their pattern. Therefore, no sequence of transpositions applied to either pattern (where transposition is applicable) may make the two patterns identical. If the orientation option is also active, we must ensure that no transposition leads to a symmetric pattern. By the definition of transposition, this scenario cannot happen if no pattern is originally symmetric.
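The second case can be tested with a small Python sketch. It assumes patterns are lists of (scale degree, duration) events with `None` marking a rest, so two patterns can be transposed onto each other exactly when their rhythms agree and their degrees differ by one constant offset. The function names are hypothetical, and the sketch ignores the additional interaction with the orientation option.

```python
from itertools import combinations

def transposition_equivalent(p, q):
    """True if q is p shifted by one constant number of scale steps.

    Patterns are lists of (degree, duration); degree None marks a rest.
    """
    if len(p) != len(q):
        return False
    offsets = set()
    for (deg_p, dur_p), (deg_q, dur_q) in zip(p, q):
        if dur_p != dur_q or (deg_p is None) != (deg_q is None):
            return False
        if deg_p is not None:
            offsets.add(deg_q - deg_p)
    return len(offsets) <= 1

def conservation_conflicts(assignment):
    """Pairs of species sharing an instrument whose patterns can coincide."""
    conflicts = []
    for (sp_a, (ins_a, pat_a)), (sp_b, (ins_b, pat_b)) in combinations(
        assignment.items(), 2
    ):
        if ins_a == ins_b and transposition_equivalent(pat_a, pat_b):
            conflicts.append((sp_a, sp_b))
    return conflicts

up = [(0, 1), (2, 1), (4, 2)]
up_shifted = [(3, 1), (5, 1), (7, 2)]
other = [(0, 1), (1, 1), (4, 2)]
conflicts = conservation_conflicts({
    "A": ("violin", up),
    "B": ("violin", up_shifted),
    "C": ("violin", other),
    "D": ("flute", up),
})
```

Species on different instruments never conflict, which corresponds to the first case discussed above.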

Constraints from Compression. This option is used to emphasize occurrences of a unit in all species and to hide detailed information through compression. Many ways exist to implement this. One of the simplest is inserting a single beep. For musical reasons, we chose to play the tutti chord described earlier. We recognize that compression generally leads to a loss of information (for instance, orientation is lost). But we argue that, in most cases, the qualitative information that a unit is present in all species is sufficient. For the remaining cases, we recommend omitting the compression option.

Implementation

Our program CALIGN consists of a back end for composing music using Common Music, which runs within Gauche Scheme. Common Music is a valuable toolbox for algorithmic composition and for generating MIDI output. It provides a high-level description of compositional elements and convenient definition of transformation processes thanks to the expressive power of the Scheme language. Additionally, there is a web front end written in Haskell that acts as a CGI program, enabling easy use without requiring the installation of extra software. The data flow is illustrated in the accompanying figure.

The user uploads an input file. After the initial file analysis and automatic selection of settings, the user can adjust various parameters. These include selecting the reference sequence and assigning musical instruments and motifs to the individual sequences. The default settings are the ones discussed here, though a different assignment may be optimal depending on the biological question.

The alignment data is converted into music based on the user's settings. To do this, an appropriate Scheme file is generated, which is then processed by Common Music to create a MIDI file. The Scheme file contains the collection of motifs, the rules for composition, and the mapping of each species to one of the twelve motifs and available instruments. The user can listen to or download the resulting piece of music.

Input Format. We employed a custom comma-separated ASCII file type for input. The input file is a matrix with a number of rows equal to the number of units and a number of columns equal to three times the number of species. For each species, three columns hold the genomic start position, end position, and orientation of the unit. All columns within a row are separated by commas. If a unit is absent in a sequence, "NA" is entered for all three entries. Comment lines begin with a "#" symbol. The first block of columns is always treated as the reference species. In principle, any tabular data containing absence/presence information can be sonified with CALIGN. An example input file and corresponding output files are available in the supplementary material.
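A reader for this input format could look roughly like the Python sketch below. The function name `parse_alignment`, the sample coordinates, and the use of "+" and "-" as orientation symbols are assumptions for illustration; only the three-columns-per-species layout, the "NA" convention, and the "#" comment lines come from the description above.

```python
import csv

def parse_alignment(text):
    """Parse the comma-separated input into rows of per-species entries.

    Each entry is None for an absent unit ("NA" in all three columns) or a
    (start, end, orientation) tuple otherwise. The first entry of each row
    belongs to the reference species.
    """
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    rows = []
    for fields in csv.reader(lines):
        fields = [f.strip() for f in fields]
        if len(fields) % 3 != 0:
            raise ValueError("expected three columns per species")
        row = []
        for i in range(0, len(fields), 3):
            start, end, orient = fields[i:i + 3]
            if start == "NA":
                row.append(None)
            else:
                row.append((int(start), int(end), orient))
        rows.append(row)
    return rows

# Hypothetical two-species example; the coordinates are made up.
sample = """# start,end,orientation per species; first block is the reference
100,250,+,120,270,+
300,420,+,NA,NA,NA
500,640,+,770,910,-
"""
rows = parse_alignment(sample)
```

Each parsed row corresponds to one unit and therefore to one measure of the resulting piece.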

Demonstration and Findings

Application to Gene Annotation Alignments

We chose a real dataset consisting of twelve fly species, each assigned a unique instrument and pattern. One possible mapping is presented in the associated figures and tables. In all our applications, the assignment of species to instruments and patterns satisfies the conditions for a unique mapping under all parameter settings, except that orientation information is lost under the compression option.


We aimed to sonify such data flexibly. The motifs were designed to work in various musical registers, and they featured varied contours and rhythms to help individual motifs stand out within a dense musical texture.

As input, we used gene annotations and gene correspondences for chromosome 3R from Drosophila melanogaster and the other eleven sequenced Drosophilid genomes. The input was a matrix consisting of genes and species. The genes were listed by their genomic sequence interval and orientation. We sorted genes by their start position in the reference species (D. melanogaster). Furthermore, we used relative orientation information, with the orientation of D. melanogaster genes set to "+" and the orientation for the other genes set to "+" or "-" when the orientation is the same or reversed relative to D. melanogaster, respectively.
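The relative orientation described here can be computed with a few lines of Python. The sketch assumes each row lists the annotated strand of one gene per species as "+", "-", or `None` for an absent gene, with the reference species first; the function name is hypothetical.

```python
def relative_orientations(row):
    """Express strand information relative to the reference (first) entry.

    The reference becomes "+"; every other present entry becomes "+" when
    it matches the reference strand and "-" when it is reversed.
    """
    reference = row[0]
    if reference is None:
        raise ValueError("the reference species must contain the gene")
    return [
        None if strand is None else ("+" if strand == reference else "-")
        for strand in row
    ]

relative = relative_orientations(["-", "-", "+", None])
```

Normalizing in this way means that a reversed motif always signals a difference from D. melanogaster and not merely a gene on the minus strand.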

Figure 3: Mapping of fly species to instruments. The tree on the left-hand side represents the topology of the phylogenetic tree [Con07]. Branch lengths are arbitrary.

To make the instrumentation reflect the relative closeness of each species, we used the expert knowledge represented by the tree in Figure 3. Of the 12 Drosophila species, five are very closely related: D. melanogaster, D. simulans, D. sechellia, D. yakuba, and D. erecta. The model organism and reference species, D. melanogaster, received a continuous piano motif, which formed the basis for the rest of the music. The other four species were assigned to strings and woodwinds, offering timbral similarity along with enough register distinction to be identifiable (Figure 3).

Currently each measure lasts 2 seconds, resulting in an 11.5-minute piece for all 345 genes.
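The stated running time follows from one measure per gene, as this small Python check shows. The function name is hypothetical, and the calculation assumes that no measures are shortened by the compression option.

```python
def piece_duration_minutes(n_units, seconds_per_measure=2.0):
    """Length of the piece when every unit occupies exactly one measure."""
    return n_units * seconds_per_measure / 60.0

minutes = piece_duration_minutes(345)  # 345 genes at 2 s each = 690 s
```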

Evaluation

While the section "Invertibility of the Mapping" formally shows that selecting a unique instrument and pattern for each species enables a unique mapping under certain constraints, we also needed to evaluate how users perceive the sonification. The following analysis of CALIGN is based on impressions from 50 untrained, non-musician test subjects. The example in the section "Application to Gene Annotation Alignments" is just one of several tested cases with varied settings.

Number of Organisms/Instruments. Depending on their arts education, participants could recognize up to 12 instruments, though most felt comfortable distinguishing six. If more instrument tracks need to be discerned, most people would require training to differentiate instruments or patterns. Another possibility is using other types of instrumental or synthesized sounds that untrained ears can identify more easily.

With 2 or 3 species, the composition was described as “musically pleasing,” and users found it easy to hear which genes were present in which species. Yet the ability to resolve presence/absence patterns declined rapidly as the number of simultaneous instruments or motifs increased. Nonetheless, detecting presence/absence of genes within groups of species remained straightforward for listeners.

Most people who focused on a specific instrument and tracked presence/absence at a given time point arrived at the correct answer, regardless of the total number of instruments playing concurrently.

Conserved Sites – Changes in Harmony. Introducing harmony changes rooted in the local context enriched the artistic quality and sustained listener attention. All participants rated the music as far more interesting when the conservation option was enabled. Beyond its aesthetic effect, this also highlighted conserved regions and guided listeners’ focus.

Conserved Sites – Compressed Units. This approach clearly separates units present in all m species (where m is the total number of species in the alignment) from units present in fewer than m species, while also compressing time. It lets users concentrate on the biologically more informative absence/presence patterns. Participants preferred the compression option over the conservation option for emphasizing conservation.

All test subjects expressed enthusiasm after the addition of harmonic changes and compressed chords, praising the increased musical variety. The resulting output was described as much “happier,” “interesting, irregular,” “less crowded,” “rhythmically interesting,” and “dramatic.” An intriguing finding emerged: choices made largely for aesthetic reasons also enhanced the sonification’s legibility for users.

Orientation of a Gene – Forward and Backward Motifs. The asymmetry of individual motifs—some of which clearly ascend—proves essential for conveying directional information. The character of each motif let users identify mirrored versions as belonging to the same original. While results sounded pleasant, most test subjects found it hard to track which motifs were reversed when several instruments played simultaneously. It remains unclear whether the ear simply needs training or whether alternative strategies are necessary to communicate this information.

Mapping – Assignment of Instruments and Patterns. With different settings, we anticipated some combinations might sound unpleasant. For unconventional instrument pairings (such as drums, marimba, and trumpet), most participants unexpectedly found the result rich in character and interesting. When multiple outputs of the same data file were heard with different instruments and patterns, listeners felt this highlighted the underlying data structure.

Conclusion and Future Work

CALIGN is the first prototype of an alignment sonification tool. Existing sonification methods for single biological sequences map individual characters (for example, nucleotides or amino acids) to single notes or chords. We opted to map one character per measure. This had two major effects. First, it provided the necessary degrees of freedom to encode more information while allowing compositional considerations to ensure pleasant sound. Second, it stretched the information across a larger time interval, presented the data in an organized measure structure, and made perception easier. CALIGN gains its power from modular and flexible motif design and mapping rules. Biological sequence alignments are well-suited for sonification because individual information elements can blur in a composition when researchers focus on the broader picture, such as groups of species with conspicuous absence/presence patterns. Music may prove to be a suitable medium for conveying information at multiple simultaneous resolution levels. This naturally prompts the question: can sonification compete with or surpass the dominant visualization approach? If not, is sonification better suited for conveying certain kinds of information? The prevalence of visualization might suggest it outperforms sonification in all respects. However, a competitive sonification tool must first be fully developed; our prototype represents just a small step.

Based on our project experience, we plan to construct a mapping for alignments that can incorporate additional kinds of contextual information, such as character lengths, distances between characters, higher-order annotations, and phastCons scores. An interactive interface should let users edit parameters in real time and display scores and alignments in floating windows. This will allow interested users to play—and play with—their alignments.

“Play is the highest form of research.” (Albert Einstein)

Acknowledgments. This work was supported in part by the Graduierten-Kolleg Wissensrepräsentation and by a grant (01GQ0432) from the BMBF in the NNCS program. We thank the anonymous reviewers for their valuable and constructive comments.

References

[Con07] Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature, 450(7167):203–218, Nov 2007.

[DC99] John Dunn and Mary Ann Clark. Life Music: The Sonification of Proteins. Leonardo, 32(1):25–32, 1999.

[GJ05] S Griffiths-Jones. RALEE–RNA ALignment editor in Emacs. Bioinformatics, 21(2):257–259, Jan 2005.

[GS95] P Gena and C Strom. Musical synthesis of DNA sequences. In XI Colloquio di Informatica Musicale, pages 203–204, Bologna, I, 1995.

[GS01] P Gena and C Strom. A physiological approach to DNA music. In Proceedings of CADE 2001, pages 81–86, Glasgow, UK, 2001. Glasgow School of Art Press.

[HB06] D H Huson and D Bryant. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol, 23(2):254–267, Feb 2006.

[HCL⁺99] M D Hansen, E Charp, S Lodha, D Meads, and A Pang. PROMUSE: a system for multi-media data presentation of protein structural alignments. Pac Symp Biocomput, pages 368–379, 1999.

[HM84] K Hayashi and N Munakata. Basically musical. Nature, 310(5973):96–96, Jul 1984.

[HMR00] T Hermann, P Meinicke, and H Ritter. Principal Curve Sonification. In Proceedings of the Int. Conf. on Auditory Display, pages 81–86, 2000.

[HR05] Thomas Hermann and Helge Ritter. Crystallization sonification of high-dimensional datasets. ACM Trans. Applied Perception, 2(4):550–558, Oct 2005.

[Kaw] Shiro Kawai. Gauche Scheme. http://practical-scheme.net/gauche/index.html

[KKZ⁺09] R M Kuhn, D Karolchik, A S Zweig, T Wang, K E Smith, K R Rosenbloom, B Rhead, B J Raney, A Pohl, M Pheasant, L Meyer, F Hsu, A S Hinrichs, R A Harte, B Giardine, P Fujita, M Diekhans, T Dreszer, H Clawson, G P Barber, D Haussler, and W J Kent. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res, 37(Database issue):755–761, Jan 2009.

[KP00] Stefan M. Kostka and Dorothy Payne. Tonal Harmony, with an introduction to twentieth-century music. McGraw-Hill, Boston, 4th edition, 2000.

[LBB⁺07] M A Larkin, G Blackshields, N P Brown, R Chenna, P A McGettigan, H McWilliam, F Valentin, I M Wallace, A Wilm, R Lopez, J D Thompson, T J Gibson, and D G Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, Nov 2007.

[LWHC00] Suresh K Lodha, Doug Whitmore, Marc Hansen, and Eric Charp. Analysis and user evaluation of a musical-visual system: Does music make any difference. In Proceedings of the Int. Conf. on Auditory Displays, pages 167–172, 2000.

[Ohn87] S Ohno. Repetition as the essence of life on this earth: music and genes. Haematol Blood Transfus, 31:511–518, 1987.

[Ohn93] S Ohno. A song in praise of peptide palindromes. Leukemia, 7 Suppl 2:157–159, Aug 1993.

[OO86] S Ohno and M Ohno. The all pervasive principle of repetitious recurrence governs not only coding sequence construction but also human endeavor in musical composition. Immunogenetics, 24(2):71–78, 1986.

[RPC⁺00] K Rutherford, J Parkhill, J Crook, T Horsnell, P Rice, M A Rajandream, and B Barrell. Artemis: sequence visualization and annotation. Bioinformatics, 16(10):944–945, Oct 2000.

[TM07] R Takahashi and J H Miller. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns. Genome Biol, 8(5):405–405, 2007.