Representation and alignment of sung queries for music information retrieval. Norman H. Adams and Gregory H. Wakefield (Dept. of Electrical Engineering and Computer Science, Univ. of Michigan, 1101 Beal Avenue, Ann Arbor, MI 48109-2110) The pursuit of robust and rapid query-by-humming systems, which search melodic databases using sung queries, is a common theme in music information retrieval. The retrieval aspect of this database problem has received considerable attention, whereas the front-end processing of sung queries and the data structure to represent melodies have been based on musical intuition and historical momentum. The present work explores three time series representations for sung queries: a sequence of notes, a "smooth" pitch contour, and a sequence of pitch histograms. The performance of the three representations is compared using a collection of naturally sung queries. It is found that the most robust performance is achieved by the representation with highest dimension, the smooth pitch contour, but that this representation presents a formidable computational burden. For all three representations, it is necessary to align the query and target in order to achieve robust performance. The computational cost of the alignment is quadratic, hence it is necessary to keep the dimension small for rapid retrieval. Accordingly, iterative deepening is employed to achieve both robust performance and rapid retrieval. Finally, the conventional iterative framework is expanded to adapt the alignment constraints based on previous iterations, further expediting retrieval without degrading performance.

Using music structure to improve beat tracking. Roger B. Dannenberg (Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, PA 15213) Beats are an important feature of most music. Beats are used in music information retrieval systems for genre classification, similarity search, and segmentation. However, beats can be difficult to identify, especially in music audio.
Traditional beat trackers attempt to (1) match predicted beats to observations of likely beats, and (2) maintain a fairly steady tempo. A third criterion can be added: when repetitions of musical passages occur, the beats in the first repetition should align with the beats in all other repetitions. This third criterion improves beat tracking performance significantly. Repetitions of musical passages are discovered in audio data by searching for similar sequences of chroma vectors. Beats are "tracked" by first locating a sequence of likely beats in the music audio using high frequency energy as an indicator of beat likelihood. This beat sequence is then extended by searching forward and backward for more matching beats, allowing slight variations in tempo, and using a relaxation algorithm to optimize the proposed beat locations with respect to the three criteria. Other high-level music features may offer further improvements in beat identification.

Using instrument recognition for melody extraction from polyphonic audio. Jana Eggink (Sony Deutschland GmbH, European Technology Center, Hedelfinger Strasse 61, 70327 Stuttgart, Germany) and Guy J. Brown (Dept. of Computer Science, Univ. of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, U. K.) A system is proposed that identifies the solo instrument in accompanied sonatas and concertos, and uses this knowledge to extract the melody line played by this instrument. The approach uses a feature representation based solely on the spectral peaks belonging to the harmonic series of a fundamental frequency (F0). Based on an initial approximate F0 estimation, this representation proved to be sufficient for instrument classification even in the presence of highly unpredictable background accompaniment. Once the solo instrument is known, a more accurate estimation of the melody line is carried out based on so-called melody models, which are trained on instrument specific training material.
In every time-frame multiple F0 candidates are extracted and their likelihood is evaluated according to the chosen melody model. Additional temporal constraints take the form of frame-to-frame transition probabilities, and are obtained from the same training material. The two knowledge sources are combined in a statistical search for the overall most likely melody line. When evaluated on realistic recordings of classical sonatas and concertos, the system was able to find the correct F0 in 72% of frames, an improvement of over 20% compared to a simple salience based approach.

Music to Knowledge: a visual programming environment. Andreas F. Ehmann (Dept. of Electrical and Computer Engineering, Univ. of Illinois Urbana-Champaign, Urbana, IL 61801) and J. Stephen Downie (Graduate School of Library and Information Science, Univ. of Illinois Urbana-Champaign, Champaign, IL 61820) The objective of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) project is the creation of a large, secure corpus of audio and symbolic music data accessible to the music information retrieval (MIR) community for the testing and evaluation of various MIR techniques. As part of the IMIRSEL project, a cross-platform JAVA based visual programming environment called Music to Knowledge (M2K) is being developed for a variety of music information retrieval related tasks. The primary objective of M2K is to supply the MIR community with a toolset that provides the ability to rapidly prototype algorithms, as well as foster the sharing of techniques within the MIR community through the use of a standardized set of tools. Due to the relatively large size of audio data and the computational costs associated with some digital signal processing and machine learning techniques, M2K is also designed to support distributed computing across computing clusters. In addition, facilities to allow the integration of non-JAVA based (e.g. C/C++, MATLAB, etc.)
algorithms and programs are provided within M2K. [Work supported by the Andrew W. Mellon Foundation and NSF Grants No. IIS-0340597 and No. IIS-0327371.]

Distributed digital music archives and libraries. Ichiro Fujinaga (Faculty of Music, McGill Univ., Montreal, Canada) The main goal of this research program is to develop and evaluate practices, frameworks, and tools for the design and construction of worldwide distributed digital music archives and libraries. Over the last few millennia, humans have amassed an enormous amount of musical information that is scattered around the world. It is becoming abundantly clear that the optimal path for acquisition is to distribute the task of digitizing the wealth of historical and cultural heritage material that exists in analog formats, which may include books and manuscripts related to music, music scores, photographs, videos, audio tapes, and phonograph records. In order to achieve this goal, libraries, museums, and archives throughout the world, large or small, need well-researched policies, proper guidance, and efficient tools to digitize their collections and to make them available economically. The research conducted within the program addresses unique and imminent challenges posed by the digitization and dissemination of music media. There are four major research projects in progress: development and evaluation of digitization methods for preservation of analog recordings; optical music recognition using microfilms; design of a workflow management system with automatic metadata extraction; and formulation of interlibrary communication strategies.

Speech-recognition interfaces for music information retrieval.
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, JAPAN, m.goto@aist.go.jp) This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech_if.html)

MPEG-7 - standardized tools for music information retrieval. Juergen Herre (Audio & Multimedia Departments, Fraunhofer Institute for Integrated Circuits (IIS), Am Wolfsmantel 33, 91058 Erlangen, Germany) Today, many applications in Music Information Retrieval (MIR) employ audio features which have been tailored individually by the algorithm developers.
For broader use, including in commercial applications, MIR technology can benefit significantly from a "common language" in audio signal description that can be used to annotate any type of multimedia asset in order to facilitate search & retrieval according to a wide range of conceivable criteria in an interoperable way. The audio part of the ISO/MPEG-7 "Multimedia Content Description Interface" provides such a common signal description language by defining a rather comprehensive set of standardized features (called "Low Level Descriptors", LLDs), application-centric subsets, and a unified way of exchanging this data based on XML. The talk provides an overview of the MPEG-7 Audio tool chest, including existing and forthcoming extensions. While the idea is clearly to create a universal platform for any conceivable MIR task, some of the initially conceived applications of MPEG-7 Audio are illustrated.

Automatic detection of the dominant melody in acoustic musical signals. Anssi P. Klapuri (Inst. of Signal Processing, Tampere Univ. of Technology, Korkeakoulunkatu 1, 33720 Tampere, Finland) An auditory-model based method is described for estimating the fundamental frequency contour of the dominant melody in complex music signals. The core method consists of a conventional cochlear model followed by a novel periodicity analysis mechanism within the subbands. As the output, the method computes the salience (i.e., strength) of different fundamental frequency candidates in successive time frames. The maximum value of this vector in each frame can be used to indicate the dominant fundamental frequency directly. In addition, however, it was noted that the first-order time differential of the salience vector leads to an efficient use of temporal features which improve the performance in the presence of a large number of concurrent sounds.
These temporal features include particularly the common amplitude or frequency modulation of the partials of the sound that is used to communicate the melody. A noise-suppression mechanism is described which improves the robustness of estimation in the presence of drums and percussive instruments. In evaluations, a database of complex music signals was used where the melody was manually annotated. Use of the method for music information retrieval and music summarization is discussed.

Using string alignment in a query-by-humming system for real world applications. Christian Sailer (Fraunhofer IDMT, Langewiesener Str. 22, 98693 Ilmenau, Germany) Though Query by Humming (i.e. retrieving music or information about music by singing a characteristic melody) has been a popular research topic during the last decade, few approaches have reached a level of usefulness beyond mere scientific interest. One of the main problems is the inherent contradiction between error tolerance and discriminative power in conventional melody matching algorithms that rely on a melody contour approach to handle intonation or transcription errors. Adopting the string matching / alignment techniques from bioinformatics to melody sequences makes it possible to assess the similarity between two melodies directly. This method takes an MPEG-7 compliant melody sequence (i.e. a list of note intervals and length ratios) as query and evaluates the steps necessary to transform it into the reference sequence. By introducing a musically founded cost-of-replace function and adequate post-processing, this method yields a measure for melodic similarity. Thus, it is possible to construct a query by humming system that can properly discriminate between thousands of melodies and still be sufficiently error tolerant to be used by untrained singers. The robustness has been verified in extensive tests and real world applications.
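
As an illustration of the string-alignment idea in the abstract above, the following is a minimal sketch of a weighted edit distance between two melodies given as lists of note intervals in semitones. It is not the author's implementation: the function names `replace_cost` and `melody_distance`, the cap of 4 semitones, and the unit insertion/deletion cost are all assumptions chosen for illustration, and the length ratios mentioned in the abstract are left out for brevity.

```python
from typing import Sequence


def replace_cost(a: int, b: int, cap: int = 4) -> float:
    """Hypothetical cost of replacing interval a with interval b.

    Assumption: small pitch errors are cheap, and any error of `cap`
    semitones or more costs as much as a full insertion or deletion.
    """
    return min(abs(a - b), cap) / cap


def melody_distance(query: Sequence[int],
                    reference: Sequence[int],
                    indel: float = 1.0) -> float:
    """Weighted edit distance between two interval sequences.

    Standard dynamic programming over a (len(query)+1) x (len(reference)+1)
    table, kept as two rows. Lower values mean more similar melodies.
    """
    m = len(reference)
    # Row 0: turning an empty query into reference[:j] takes j insertions.
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, len(query) + 1):
        # Column 0: turning query[:i] into an empty reference takes i deletions.
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel,          # delete query[i-1]
                curr[j - 1] + indel,      # insert reference[j-1]
                prev[j - 1] + replace_cost(query[i - 1], reference[j - 1]),
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    reference = [2, 2, 1, 2]        # intervals of a reference melody
    sung = [2, 3, 1, 2]             # one interval sung a semitone too wide
    print(melody_distance(sung, reference))
```

Under these assumed costs, a slightly mistuned interval is penalized far less than a missing or extra note, which is one way to trade off error tolerance against discriminative power; the run time is proportional to the product of the two sequence lengths.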