Interface Metaphors and Signal Representation
for Audiovisual Performance Systems
by Golan Levin

Thesis Proposal for the degree of Master of Science in Media Arts and Sciences
Massachusetts Institute of Technology
Original 5 November 1999
Revised 11 December 1999

Abstract

This thesis proposes an investigation into the development of software environments that enable the simultaneous performance of moving image and sound. The goal of such systems is to provide easily apprehensible yet extremely malleable environments for creative expression and self-discovery. Current systems for audiovisual authoring have little or no knowledge about gestural data, and employ interaction metaphors and interface devices that substantially constrict the space of possible results. The proposed works will introduce new metaphors and new technologies for mapping between the dimensions of an audiovisual simulation, and signal representations of gestural information captured from a variety of physical interfaces. The thesis document will include an analysis and taxonomy of the instrument design space, an evaluation of the introduced techniques, and a condensation of design principles for audiovisual instruments generally.

Motivation

"In the impossibility of replacing the essential element of color by words or other means lies the possibility of a monumental art. Here, amidst extremely rich and different combinations, there remains to be discovered one that is based upon the principle [that] the same inner sound can be rendered at the same moment by different arts. But apart from this general sound, each art will display that extra element which is essential and peculiar to itself, thereby adding to that inner sound which they have in common a richness and power that cannot be attained by one art alone." Wassily Kandinsky (1912)

A few weeks ago the New York Times reported the discovery of a 9,000-year-old flute in China. Remarkably enough, the flute was still playable. As I listened in awe to sound files of the flute that the Times had posted to the Web, I was struck by an awareness that the human drive toward creative expression, as it is realized through such vehicles as musical instruments and drawing materials, must be among the oldest and most universal of human desires.

The thesis I propose here seeks to fulfill our will to creative expression, by making new expressions possible, and by advancing the state of the art in our contemporary means. My focus is the design of systems which make possible the simultaneous performance of animated image and sound. I have chosen to implement these systems by making use of the digital computer's capacity to synthesize graphics and sound in response to real-time, high-bandwidth gestural inputs.

I am not the first person to attempt to design such a system. In fact, the vision of a performance medium which unifies sound and image has a long history, as Wassily Kandinsky's quote suggests. Instead, I hope to bring to this history a provocative new set of questions and answers about the power, beauty, sophistication and personality that it is possible for an audiovisual instrument to have.

A successful work to emerge from this thesis would be a meta-artwork whose interface was supple and easy to learn, but which also yielded interesting, infinitely variable, and personally expressive performances in both the visual and aural domains. I hope to produce several examples of just such works, by bringing two things to bear on the problem space of audiovisual instruments: new technologies, such as knowledge representation and multi-dimensional gestural interfaces, and a new aesthetic, which seeks to substantiate such works with an underpinning of perceptual motivation.

This work is important as it represents a vision for creative activity on the computer, in which uniquely ephemeral dynamic media blossom from the expressive "voice" of a human user. At the end of my design cycle, I intend to analyze my software artifacts in order to tease apart and taxonomize the elements of their design space, evaluate the success of the techniques I have introduced, and extract principles for the design of future audiovisual instruments.

Background

The synchrony of abstract image and sound, variably known as ocular music, visual music, color music, or music for the eyes, has a history that spans several centuries of work by dozens of gifted practitioners [17]. Despite the breadth and depth of this history, however, a casual Web search reveals an unfortunate ignorance of it, as numerous sites continue to advertise "an entirely novel concept, relating graphics and music" or something similar [5]. Adrien Bernard Klein, in his 1927 book Color-Music: the Art of Light, deftly characterized this myopia: "It is an odd fact that almost everyone who develops a color-organ is under the misapprehension that he, or she, is the first mortal to attempt to do so" [9].

The earliest known device for performing visual music was built in 1734 by the Jesuit priest and mathematician, Father Louis-Bertrand Castel. Castel's Ocular Harpsichord coupled the action of a harpsichord to the movement of transparent tapes, whose colors were believed by Castel to correspond to the notes of the occidental musical scale [14]. In 1789, Erasmus Darwin suggested that visual music could be produced by projecting light from oil lamps through colored glasses; his proposal was implemented in 1844 by D. D. Jameson, whose "color organ" filtered light through liquids of various colors and reflected it off metal plates onto a wall [14]. Thereafter followed a steady development of audiovisual instruments, employing a wide range of technologies and materials: Frederic Kastner's 1869 Pyrophone, for example, opened flaming gas jets into crystal tubes to create both sound and image [15], while an 1877 device by Bainbridge Bishop sat atop a pipe organ and produced light with a high-voltage electric arc [14]. An instrument patented by William Schooling in 1895 controlled the illumination of variously-shaped vacuum tubes with a keyboard and set of foot-pedals [14]. Other historic examples include George Hall's Musichrome (1930s), Morgan Russell and Stanton Macdonald-Wright's Kinetic Light Machine (1931), Charles Dockum's Mobile Color (1952), Gordon Pask and McKinnon Wood's Musicolour machines (1953), and Jordan Belson's liquid-based instruments from the late 1950's [13, 14, 15]. Apart from these, two twentieth-century instruments deserve special mention: Thomas Wilfred's Clavilux (1920's) [21], and Oskar Fischinger's Lumigraph (1948), both of which achieved considerable critical acclaim through international high-art performances. Both were optomechanical; the Clavilux filtered light through several stages of multicolored glass disks, while the Lumigraph interrupted colored beams of light with a flexible fabric surface.
Interestingly, these instruments also became modest commercial successes as home entertainment systems, and as such penetrated the collective cultural consciousness to an unprecedented degree.

While these innovators developed "real-time" tools for the performance of visual music, other pioneers composed elaborate visual statements in the off-line laboratory of the newly invented animation studio. Operating from deeply held beliefs in a "universal language of abstract form," animators like Walther Ruttmann, Viking Eggeling, Len Lye, Norman McLaren and Oskar Fischinger began systematic studies of abstract temporal composition in order to uncover "the rules of a plastic counterpoint" [18]. Landmark events in abstract cinema included the 1921 Frankfurt run of Ruttmann's short Lichtspiel Opus I, thought to have been the first screening ever of an abstract film for a general audience [18], and the 1924 release of Eggeling's Diagonal Symphony, which was the first entirely abstract film. The painstakingly constructed efforts of these and other artists dramatically expanded the language of dynamic visual form, at a time when the language of cinematic montage itself was only beginning to be understood. Of particular inspiration to the work I propose in this thesis is the cinematic vocabulary developed by the New Zealand animator Len Lye (active 1930-1960), who explored the dynamic properties of cameraless animation techniques such as drawing, scratching and painting directly on celluloid. Lye's work vaults the gulf between the vitality of performance and the precision of composition, for even though his movies were meticulously constructed in his animation studio, his process of improvisation survives on-screen in frenetic and biomorphic works that are a direct connection to his own experience, thought and mark-making [20].

Physical color organs are burdened by an inherent tradeoff in their ability to yield specific versus general content [19]. The control of detailed or precise images requires a specificity of generative means, whereas the use of highly general means tends to produce amorphous and difficult-to-control results. To display the image of a triangle in the physical world, for example, requires a triangular chip of transparent material, or a triangular aperture—and that triangular element can do little else but make triangles. By projecting light through a tray of immiscible colored liquids, on the other hand, one can produce an infinity of outcomes, but its inchoate and complex results can be only vaguely directed. Computer technology has made it possible for visual music designers to transcend the limitations of physics, mechanics and optics, and overcome the specific/general conflict inherent in electromechanical and optomechanical visual instruments. One of the first artists to take advantage of these means was the California filmmaker John Whitney, who began his studies of computational dynamic form in 1960 after twenty years of producing animations optomechanically. Shortly thereafter, Myron Krueger made some of the most fundamental developments in the connection between interaction and computer graphics; his 1969 Video Place used information from motion capture to direct interactions with abstract forms [10]. Since that time, the fields of computer graphics and human-computer interaction have burgeoned considerably.

Three important and relatively recent computational precursors to this research are Timepaint by John Maeda, the Motion Phone by Scott Snibbe [20], and Music Insects (later sold as SimTunes) by Toshio Iwai, all of which were developed in the early 1990's. John Maeda's Timepaint is a delicate illustration of the dynamic process by which apparently static marks are made: by extending a gesture's temporal record into the third dimension, Maeda's work can flip between a flat animated composition and a volumetric diagram of temporality. Snibbe's Motion Phone is an application for interactively authoring dynamic animations; it accretes recordings of gestures into an abstract animation loop, creating lively and rhythmic patterns of colorful triangles, squares, circles and lines. Although it produces no sound, it is nonetheless an important example of a purely "visual instrument." Music Insects, on the other hand, is a paint program in which the pixels deposited by the user operate as scorelike elements in a music-producing simulation. Music Insects differs from the work I propose here insofar as its users produce static rather than dynamic images, and in the fact that it treats user input as positional (discrete) rather than gestural (continuous) data.

Over the past three years, I have developed approximately twenty small interactive systems which interpret the dynamism of two-dimensional gestures in abstract animation spaces. Six were developed when I was at the Interval Research Corporation in Palo Alto: Disctopia, Blebs, Streamer, Escargogolator, Schizosticks, and Polygona Nervosa, the last four of which were produced in collaboration with Scott Snibbe. The remainder were developed over the last year and a half at MIT: Molassograph, Splat, Stripe, Ribble, Telephone, Polka, Directrix, Meshy, Curly, Floccus, Floo, and Aurora. What these all share is the treatment of temporal gestures as inputs to dynamic simulations. My two most recent applications, Yellowtail and Loom, are my first to additionally permit the performance of synthesized sound, and represent early forays into the work I propose in this thesis. In the next section, I detail the principal design goals, and concomitant space of technical challenges, that I have crystallized from my three years' research into the development of digital performance systems. It is my hope that by applying new techniques in the service of a personal aesthetic, it shall be possible to provide provocative new answers to some very old questions.

Methodology

In the process of building the body of work which has led to this proposal, and through conversations with numerous collaborators and mentors, I have come to articulate a set of desiderata for the design of audiovisual instruments. These stipulations, taken together with the technical challenges they entail, comprise the methodology which will direct the execution of my thesis work:

  • Instantly knowable, infinitely expressible interfaces. Most software systems are either easy to learn, or extremely powerful. Rarely are they both, for to be so demands that their rules of operation be simple, yet afford a boundless space of possible outcomes. This is difficult, and nearly contradictory. Nevertheless, there exist real-world exemplars of such systems, such as the piano and the pencil: although any four-year-old can discover their basic principles of operation, an adult can just as well spend fifty years practicing at them, and still feel like there remains more that can be expressed through them, and more mastery that can be achieved in performing with them. Such systems, moreover, have the extraordinary property that an individual may eventually, through their use, discover or reveal a unique and personal voice in that medium. We all have our own spatio-temporal signatures, our own unique ways of moving through space; successful instruments bring the character of these traces into relief and reflect them back to us. An essential goal of this thesis is the design of audiovisual instruments that possess these qualities, of simplicity, possibility, and transparency.

  • New metaphors relating sound and image. Over the past few years, metaphors for relating sound to image in interactive graphical environments have coalesced into three basic conventions: "timeline" metaphors, "control panel" metaphors, and "interactive widget" metaphors. MIDI sequencers and audio-editing programs, for example, typically use a diagrammatic score or "timeline" metaphor, in which a pitch or amplitude ordinate is plotted against an abscissa of time. Many software synthesizers, on the other hand, have adopted a "control panel" metaphor in which a screen full of knobs, dials, sliders and buttons—in imitation of classic analog hardware devices—provides precise control of a sound's parameters. Finally, some designers have experimented with an "interactive widget" metaphor, in which the properties of one or more reactive virtual widgets are mapped onto generated sounds; this last approach can be found in the works of Toshio Iwai, Media Laboratory graduates Reed Kram and Pete Rice, British designer Lukas Girling, Todd Robbins, and several others. Unfortunately, none of these metaphors fulfills the needs of this thesis, which seeks to create a highly malleable audiovisual environment that more closely resembles a paintable canvas than a diagram. To successfully build a visual performance system coincident with an equally plastic musical performance system, and not merely a GUI for a musical instrument, I will need to develop new metaphors for audiovisual interaction.

The above goals are chiefly aesthetic desiderata, but they entail many important technical questions. In the course of pursuing the above design goals, I expect to encounter and develop solutions to the following technological challenges:

  • Signal representations for gesture. Ordinarily, recorded gestures exist in a computer as arrays of Cartesian coordinates. Unfortunately, this representation is almost entirely opaque to the computer, as it provides no explicit representation of a gesture's unique character, and therefore no handles by which its character may be directly modified or used. If a system is designed to amplify or augment human gesture, it is reasonable to expect that it be able to base its transformations or manipulations on whatever higher-level information is available about its gestural input. For this reason, I intend to construct a small toolkit for the analysis and representation of gestural "content." This toolkit will apply signal-analysis techniques such as Hidden Markov Models and Gaussian Mixture Models to the domain of two-dimensional (and possibly three-dimensional) movements, in order to derive representations of a gesture's curvature, overall orientation, spatial frequency content, irregularity, intrinsic periodicities, and similarity to other marks. Once a set of adequate representations for gesture has been identified, it remains to develop interesting software artifacts that can make use of those representations. I have already made preliminary attempts at doing so: my recent Loom application, for example, couples the modulation index of an FM-synthesized sound to a gesture's local curvature, creating a tight connection between the spectral brightness of spatial and aural signals. More generally, the analysis toolkit I describe will be useful in expanding an instrument's temporal context, such as by permitting regular-expression-like searches through databases of gestures recorded days or weeks earlier. It should be noted that the representations I intend to use or develop are not representations of symbolic knowledge, but rather signal transfer functions which may have some close perceptual correlates.

  • Mappings between input and output that are based on perceptual "primitives". Many audio/visual software tools, and their user interfaces, base their affordances on the manipulation of computationally-convenient quantities instead of perceptually-meaningful phenomena. The software synthesizer Metasynth, for example, operates according to an essentially arbitrary set of mappings which could be summarized as "the horizontal index of a pixel corresponds to time, while its hue is mapped to the stereo placement of a sound." Many paint programs, to take another example, commonly expect a user to specify a color on a computer display according to the strengths of the electron guns that activate a pixel's red, green and blue components. Evidence from usability studies indicates, however, that color specification can in many circumstances be more intuitive in a space whose dimensions correspond to hue, saturation, and one of brightness (HSB), value (HSV), intensity (HSI), or lightness (HLS) [7]. Even though these alternative color models are technologically less convenient to implement, their grounding in the dimensions of human perception enables the systems that incorporate them to become more transparent, intuitive, and convenient in their use.

    A central tenet of this thesis is that an audiovisual performance system will be most successful if it can eschew numerically convenient mappings in favor of perceptually motivated ones. As a part of this thesis work, therefore, I intend to perform a basic survey of targeted literature on human perception in order to become familiar with the perceptual primitives of sight, sound and gestural movement [for example, in 1, 2, 3, 5, 11, 19]. My hope is that an expressive medium which can make use of such perceptually-grounded mappings could become its own interface, and in our use of it feel coextensive with ourselves.

  • High-bandwidth physical interface. Joy Mountford once observed that to design an interaction for a computer's mouse interface is to treat an entire person as if they were a single finger. Put another way, the mouse is an extremely narrow straw through which to suck all of expressive human movement. Although I have not made the design of physical interfaces a topic of research in this thesis, I am interested in using alternative commercial interfaces for gestural input where possible. The specific interface devices I will use, such as the Ascension Flock of Birds, the Haptek PenCat force-feedback stylus, and the Wacom tablet and puck, share the property that each provides substantially greater bandwidth than a mouse for gestural input. Unlike the mouse, moreover, the expressive potential of these devices is still vastly unexplored. The additional dimensions of expressive input afforded by these devices may be especially applicable to the important challenge of composing longer temporal structures. I intend to integrate this hardware at the earliest opportunity into our group's graphics programming environment.

  • Synthesized graphics, synthesized sound. The infinite plasticity of a synthetic canvas demands that any sonic counterpart to it be equally malleable and infinite in its possibilities. This can only occur if the system's model of sound generation ultimately affords the expressive control, however abstractly or indirectly, of every single sound sample. To provide any less—by resorting to a model based on the mixing or filtering of canned sound loops, for example—merely creates a toy instrument whose expressive depth is drastically attenuated and explicitly curtailed from the outset. I have settled on a methodology in which I create software synthesizers from scratch, exposing expressive software hooks into their inner mechanisms along the way. This method has worked successfully in two of my most recent pieces, Yellowtail and Loom, for which I constructed Additive and FM synthesizers; I expect to implement a granular synthesizer and a wave-terrain synthesizer soon.

    At this point the question may arise as to where, within the domains of design, engineering, or cognitive psychology, this thesis is positioned. Even though I intend to base audiovisual mappings on findings from the perceptual sciences, and to use algorithms from signal engineering in order to implement these mappings, I wish to emphasize that my attitude to the amassed knowledge of these disciplines will be appropriative. In other words, it is not my goal to conduct original research in either cognitive psychology or signal engineering, but rather to select and combine the fruits of these domains into novel systems that advance the field of interactive art and design.
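
The coupling mentioned above under "Signal representations for gesture," in which a gesture's local curvature drives the modulation index of an FM-synthesized sound, can be illustrated with a minimal sketch. This is not the Loom implementation; it is only one plausible reading of the idea under stated assumptions. The function names (local_curvature, curvature_to_index, fm_sample, render), the turning-angle estimate of curvature, the linear clamped mapping, and all numeric constants are hypothetical choices made for illustration.

```python
import math


def local_curvature(points):
    """Estimate curvature at each interior point of a polyline gesture.

    Assumed estimate: the unsigned turning angle between successive
    segments, divided by the mean length of those two segments.
    """
    curvatures = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0
        bx, by = x2 - x1, y2 - y1
        len_a = math.hypot(ax, ay)
        len_b = math.hypot(bx, by)
        if len_a == 0.0 or len_b == 0.0:
            curvatures.append(0.0)  # repeated sample: no direction change
            continue
        # Turning angle from the cross and dot products of the two segments.
        angle = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
        curvatures.append(abs(angle) / (0.5 * (len_a + len_b)))
    return curvatures


def curvature_to_index(curvature, max_curvature=1.0, max_index=8.0):
    """Map curvature linearly to a modulation index, clamped at max_index."""
    normalized = min(curvature / max_curvature, 1.0)
    return normalized * max_index


def fm_sample(t, carrier_hz, modulator_hz, index):
    """One sample of two-operator FM: sin(2*pi*fc*t + I*sin(2*pi*fm*t))."""
    return math.sin(2 * math.pi * carrier_hz * t
                    + index * math.sin(2 * math.pi * modulator_hz * t))


def render(points, sample_rate=8000, samples_per_point=80,
           carrier_hz=220.0, modulator_hz=110.0):
    """Render a short block of audio for each interior gesture point."""
    samples = []
    for i, k in enumerate(local_curvature(points)):
        index = curvature_to_index(k)  # sharper turns give a larger index
        for n in range(samples_per_point):
            t = (i * samples_per_point + n) / sample_rate
            samples.append(fm_sample(t, carrier_hz, modulator_hz, index))
    return samples
```

In this sketch a straight stroke yields a modulation index of zero and therefore a plain sine tone, while a sharply turning stroke raises the index and adds sidebands, which is one way to obtain the link between spatial and spectral "brightness" described above. A perceptually motivated mapping might well replace the linear one assumed here.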

Evaluation

In order to evaluate the success or failure of the proposed work, it is helpful to establish the context in which the work is positioned and according to whose standards it should be measured. As with many Media Laboratory theses, this is made difficult by the interdisciplinary nature of the work; the software systems that support this thesis inhabit a domain at the juncture of art, design, and the engineering of tools and instruments. As artworks, they fit within and extend an established Twentieth Century tradition in which artworks are themselves generative systems for other media; in Marshall McLuhan's terms, such systems are characterized by an "outer medium" (in my case, gestural performance and interaction) whose forms make possible the articulation of yet other expressions in an "inner medium" (for this work, synthetic animation and sound). Distinguishing such meta-artworks from the kinds of artifacts we conventionally call "tools" or "instruments" is largely a question of semantics and context; certainly the works I propose fit well within the usual definitions of these categories. I take exception to these labels only insofar as they carry with them the implication that a given tool or instrument is successful only if it is held to be useful and desirable by a broad base of consumers. I am not developing these systems with an audience of general users in mind, but rather as vehicles through which I can explore and present a strictly personal vocabulary of design practice, and suggest new technological solutions for human-machine interaction. In this sense this thesis will bear greater similarity to a "Hyperinstruments model" of artistic activity and technological craft (e.g., in which an artist originates specialized tools for himself or herself), than to a commercial, "Adobe model" of populist software development (e.g., in which market-driven usability specialists refine plug-and-play solutions for efficiency-seeking consumers). 
Thus, although my software may coincidentally have some potential marketability (an opinion drawn merely from my own observation that numerous people have enjoyed its use), I leave its evaluation by such metrics to those who are customarily concerned with maximizing this sort of value. Instead of the marketplace, I choose as contexts of evaluation the music hall and the art gallery, and submit that the software artifacts supporting this thesis should minimally be able to support (A) a public performance by expert users, and (B) an engaging experience for interested gallerygoers. In the next section of this proposal, Deliverables, I outline specific plans for just such situated review.

Deliverables

  • I intend to produce a corpus of approximately six software pieces that enable the simultaneous authoring of moving image and sound. The success of these works will be predicated on their use of new mappings between the elements of their audiovisual displays, and higher-level representations of the gestural information captured from high-bandwidth physical interfaces.
  • Additionally, I plan two public presentations of the above work: (A) a live audiovisual performance, composed by myself and performed by a small ensemble, that makes use of these software works, and (B) a public exhibition or installation of the work in a gallery-like venue, where casual, hands-on interaction is possible.
  • Finally, I will submit a written thesis which evaluates and analyzes these applications in order to taxonomize the features of their design space, and extract design principles for future audiovisual instruments.

Schedule

  • November 1999 will be spent creating software connections between the ACG graphics environment, software sound synthesizers, and such high-bandwidth hardware input devices as the Ascension Flock of Birds, the Wacom tablet and puck, and the Haptek PenCat force-feedback stylus.
  • December will be spent in the construction of a toolkit for the representation of higher-level gestural information.
  • January 2000 will be spent creating software sketches of several pieces/instruments.
  • February and March will be spent refining these instruments and building others.
  • April will be spent writing the thesis document, and also on the production of a live performance making use of these works.
  • The writing of the thesis document will be completed in May.

Resources

For software development I will need the regular use of one SGI Octane computer and one dedicated Windows NT computer. For sound performance and recording I will also need the occasional use of a small multi-channel mixer and an electronic reverb unit. All of these resources are currently in place or easily accessible.

Thesis Readers

John Maeda is Sony Career Development Professor of Media Arts and Sciences and Assistant Professor of Design and Computation at the MIT Media Laboratory, where he also directs the Aesthetics & Computation Group (ACG). His mission at MIT is to foster the development of individuals who can find the natural intersection between the disciplines of computer science and visual communication.

Tod Machover is Professor of Music & Media, Head of the Opera of the Future/ Hyperinstruments Group, and Co-Director of the Things That Think (TTT) and Toys of Tomorrow (TOT) consortia at the MIT Media Laboratory. He is also a composer, respected for his innovative syntheses of music and novel technologies.

Marc Davis is Chairman and Chief Technology Officer of Amova.com. His mission is to revolutionize popular culture with highly personalized video media. Marc received his doctorate from the Machine Understanding Group of the Learning and Common Sense Section at the MIT Media Laboratory, and has a diverse background in literary theory, media technology, film theory, and artificial intelligence.

References

  1. Jacques Bertin, Semiology of Graphics: Diagrams, Networks, Maps. University of Wisconsin Press, 1983.
  2. Ronald M. Baecker, William Buxton, and Jonathan Grudin (Editors), Readings in Human-Computer Interaction: Toward the Year 2000. Morgan Kaufmann, 1995.
  3. William S. Cleveland, Visualizing Data. Hobart Press, 1993.
  4. Fred Collopy, Imagers and Lumia. <http://imagers.cwru.edu/index.html>
  5. Perry R. Cook, Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics. MIT Press, 1999.
  6. Tom DeWitt, "Visual music: Searching for an aesthetic," Leonardo, 20 (1987), 115-122.
  7. Sarah Douglas and Ted Kirkpatrick, "Do Color Models Really Make a Difference?" Proceedings of CHI'96.
  8. Karl Gerstner, The forms of color: The interaction of visual elements, Cambridge, MA: MIT Press, 1986.
  9. Adrien Bernard Klein, Color-Music: The Art of Light. Crosby, Lockwood & Son, London, 1927.
  10. Myron W. Krueger, Artificial Reality. Reading, Mass: Addison-Wesley, 1983.
  11. Jock Mackinlay, "Automating the Design of Graphical Presentations of Relational Information." ACM Transactions on Graphics, Vol. 5, No. 2, April 1986, pp 110-141.
  12. Barton McLean, "Composition with sound and light," Leonardo Music Journal, 2 (1992), 13-18.
  13. William Moritz, "The Dream of Color Music," in The Spiritual in Art: Abstract Painting 1890-1985, ed. Maurice Tuchman. Abbeville, 1993.
  14. Kenneth Peacock, "Instruments to perform color-music: Two centuries of technological experimentation," Leonardo, 21 (1988), 397-406.
  15. Frank Popper, Origins and Development of Kinetic Art. 1968.
  16. A. Wallace Rimington, Colour-Music: The art of mobile colour, New York: Frederick A. Stokes Company, 1911.
  17. Don Ritter, "Interactive Video as a Way of Life," Musicworks 56, Fall 1993, 48-54.
  18. Robert Russett and Cecile Starr, Experimental Animation: Origins of a New Art (2nd edition), New York: Da Capo Press, 1988.
  19. Wayne Slawson, Sound Color. University of California Press, Berkeley, 1985.
  20. Scott Snibbe and Golan Levin, Towards Dynamic Abstraction. Interval Research Corporation: Internal document, October 1997.
  21. Thomas Wilfred, "Composing in the art of lumia," Journal of Aesthetics and Art Criticism, (VII) December 1948, 79-93.