MIT-LCS-TM-642

LOUD: A 1020-Node Modular Microphone
Array and Beamformer for Intelligent
Computing Spaces
Eugene Weinstein, Kenneth Steele, Anant Agarwal, and James Glass
MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street
Cambridge, MA 02139 USA
{ecoder,steele,agarwal,glass}@csail.mit.edu
Abstract. Ubiquitous computing environments are characterized by an
unbounded amount of noise and crosstalk. In these environments, tradi-
tional methods of sound capture are insufficient, and array microphones
are needed in order to obtain a clean recording of desired speech. In this
work, we have designed, implemented, and tested LOUD, a novel 1020-
node microphone array utilizing the Raw tile parallel processor architec-
ture [1] for computation. To the best of our knowledge, this is currently
the largest microphone array in the world. We have explored the uses
of the array within ubiquitous computing scenarios by implementing an
acoustic beamforming algorithm for sound source amplification in a noisy
environment, and have obtained preliminary results demonstrating the
efficacy of the array. From one to 1020 microphones, we have shown a
13.7dB increase in peak SNR on a representative utterance, an 87.2%
drop in word error rate with interferer present, and an 89.6% drop in
WER without an interferer.
1 Introduction
The interaction between humans and computers has been a central focus of ubiq-
uitous computing research in recent times. In particular, communication through
speech has been extensively explored as a method for making human-computer
interaction more natural. However, computer recognition of human speech performs
well only when a recording can be made without much ambient noise or crosstalk.
Because they seek to create a natural setting, ubiquitous computing environments
tend to fall into the category of situations where natural-interaction speech
recognition is a challenging problem. When significant levels of noise are present,
or several humans are talking at the same time, recognition becomes difficult,
if not impossible, in the absence of an appropriate technology to separate the
desired speech from the undesired speech and ambient noise. As part of MIT’s
Project Oxygen [2], we have created a large modular microphone array, called
the Large acOUstic Data (LOUD) array.
Recently, arrays of microphones have been increasingly explored as an aid for
untethered acoustic source selection and amplification. When sound is recorded
in a noisy environment through a single microphone, proximity of the microphone
to the speaker’s mouth is essential for audio of sufficient quality for speech recog-
nition or transmission in a tele-conference. In many situations, this proximity
cannot be readily achieved – for instance, when the recording is taking place in a
conference room environment marred with crosstalk, a machine room with noisy
fans, or a large auditorium where an audience member is asking a question. In
these situations, a single microphone at a fixed location in the room cannot sepa-
rate the voice of one speaker from another, or the voice of the speaker from noise.
In contrast, arrays of microphones have a spatial extent that can be exploited
along with the propagation qualities of a sound wave to detect, separate, and am-
plify speech sources in a noisy environment. Microphone arrays can be “steered”
in software toward a desired sound source, filtering out undesired sources. When
an appropriate level of computational power is available, microphone arrays can
also “track” a desired source around a space as the source moves [3].
The LOUD modular microphone array currently consists of 1020 micro-
phones. Our reasons for building such a large array are twofold. First, the
performance of a microphone array improves linearly as the size of the array
grows. This is well established in the theoretical literature on microphone arrays
(e.g., [4, 5]), and our experimental results in Section 5 confirm this in practice.
To date, the largest microphone array known to us [6] has been a 512-element
array, and work with microphone arrays of this size has been extremely limited.
Second, ubiquitous computing applications often involve a large number of si-
multaneous feeds of streaming data (e.g., video, audio, haptics, etc). The I/O
bandwidth and computational power necessary to process these streaming data
has pushed the limits of traditional computer architectures and I/O schemes.
To this end, our lab has been designing a scalable parallel processing architec-
ture called Raw [1], specifically designed to handle large volumes of streaming
data, such as that created by a microphone array. Raw belongs to a new class of
microprocessors called tiled architectures, which are designed to be effective for
both general purpose and embedded computation. Our microphone array, which
generates nearly 50 MB of data every second, is an appropriate application to
test the limitations of this architecture. In fact, an acoustic beamforming algo-
rithm for 1020 microphones is present in the suite of tasks used to evaluate the
performance of the Raw processor [7].
In this paper we first present our architecture for a novel modular 1020-node
microphone array and beamformer utilizing a general-purpose tile processor ar-
chitecture for computation. The modularity of the array represents the first
major contribution of this work. Second, we present the results for a prelim-
inary round of experiments giving speech recognition accuracy rates for data
collected with the array in a noisy environment, both in and out of the pres-
ence of an interferer. We show an improvement in speech recognition accuracy
from 3.0% with one microphone to 87.6% with the full 1020-microphone array
(87.2% drop in word error rate) when an interferer is present, and from 9.6%
with one microphone to 90.6% with the full array with no interferer (89.6% drop
in WER). The SNR improves by 13.7dB from one to 1020 microphones for a
representative utterance. As the other main contribution of this work, we show
a steady improvement in recognition performance and SNR as the size of the ar-
ray is increased to 1020 microphones, thus clearly demonstrating the benefit of
the use of large arrays to record speech in noisy environments. A video demon-
strating the improvement in sound quality when using the array is available at
http://cag.csail.mit.edu/mic-array/videos/. Finally, we outline our use of
the Raw general-purpose tile parallel processor architecture, which, in contrast
with previously-used DSP chips, allows programmers to write code in conven-
tional programming languages. Raw is currently able to run an acoustic beam-
forming algorithm in real-time on all the 1020 simultaneous data streams from
the array.
We begin the paper with an overview of related work in Section 2. We then
outline the details of our microphone array hardware and processing software
implementation in Section 3. Section 4 presents the setup and methods used in
our experiments to evaluate the array. In Section 5, we present the results of our
experiments, and in Section 6, we discuss the results and relate our findings to
past work. In Section 7, we outline our plans for future work, and conclude the
paper.
2 Related Work
Sensor arrays have been extensively explored in the past half-century, initially as
a tool for radar-based tracking of objects [8], and then for a number of other ap-
plications including radio astronomy [9], sonar systems [10], and seismology [11].
Over the past two decades, arrays of microphones (i.e., acoustic sensors in air)
have been increasingly used for sound source separation and amplification, and
since the late 1980s have been explored as a tool for capturing audio in difficult
acoustic environments [12, 13].
Microphone arrays have quickly become popular as an aid for speech recog-
nition, and several recent projects are exploring the use of microphone arrays
for this purpose [14–16]. A number of projects report significant improvements
in recognition performance when using a microphone array when compared to
a single omnidirectional microphone. For instance, [14] reports a near three-fold
decrease in recognition error rates using a circular array of eight microphones
in a conference room, and [16] reports similar gains with an eight-microphone
linear array. However, all of these used substantially smaller arrays than the one
presented in this paper.
In the recent past, microphone arrays have seen increased exposure in ubiq-
uitous and multimodal computing applications. For instance, [17] used a two-
microphone array as part of a multi-modal person tracking system on a mobile
robot, and [18] used two microphones on a speaker’s tie to detect whether speech
was coming from the speaker or another person in the environment. [19] used a
32-microphone array in conjunction with a computer vision-based person track-
ing system to selectively amplify speakers in the presence of noise and interfering
speakers.
The literature [4, 5] states that the performance of microphone arrays should
theoretically continue to scale to much larger numbers of microphones. However,
most microphone arrays used in research and industry today have a small number
of microphones (i.e., fewer than 20). There are, however, a small number of larger
arrays in existence. [19–21] present intermediate-sized arrays of 32, 64, and 64
microphones, respectively. [22] shows a 400-microphone square array two meters
on a side, but no publications seem to be available about the project. Finally, the
Huge Microphone Array [6], an array of 512 microphones, has to our knowledge
been the largest microphone array to date. The researchers of this project have
designed custom hardware for both sound capture and processing (using DSP
chips). However, the publications stemming from this work all appear to use a
16-microphone subset of the large array known as the Brown Megamike. For
instance, [23] showed improved recognition performance for this smaller array
size. Based on investigation of past work, it appears that there is little or no
published work on speech recognition experiments using microphone arrays of
the scale of the array presented in this work.
3 Implementation
This section outlines our implementation of the 1020-node microphone array
and beamformer. We first outline the hardware and firmware design of the array
components and the connections to the Raw tile processor. We then present the
array geometry that we have used and our reasons for choosing this geometry.
Finally, we describe the software algorithms used to process the data recorded
by the array, and our mapping of these algorithms onto Raw.
3.1 Hardware
Our microphone array feeds data into the Raw microprocessor [1], which is a
parallel tile architecture currently being researched in our lab. The design of the
Raw 16-tile processor has been under way since 1997, and our lab received the first
prototype chips in early 2002. Raw is a parallel machine specifically designed
for applications requiring real-time processing of large amounts of streaming
data. By exposing the details of the interconnection networks on the chip to the
software, Raw allows for highly efficient systolic communication on the chip, and
thus exposes a potential for a great deal of parallel real-time computation. Raw
provides two static networks and two dynamic networks, with the static networks
being more efficient for systolic real-time computation. The static network is
controlled by an entirely independent switch processor on each tile, and the
routing code that runs on the switch is exposed to the software. In this work,
we utilize the static networks as outlined in Section 3.3.
The 1020-node microphone array (Figure 3) consists of 510 two-microphone
printed circuit boards (PCBs), pictured in Figure 1 (the microphones used were
Panasonic WM-54BT electret condenser microphones). We have opted to cre-
ate small microphone modules to ensure LEGO-like modularity in the design
of our array. Each PCB contains two microphones, one stereo A-to-D converter
(Cirrus Logic CS53L32A), and a small CPLD (Xilinx Coolrunner XCR3032XL).
The A-to-D converter samples at 16 KHz, generating 24-bit serial data for each
microphone. Our decision to place two microphones on one PCB was mainly due
to the fact that the A-to-D converter is able to accommodate two channels of
audio. The two-microphone boards are connected in chains of 16 boards (32 mi-
crophones), and each chain plugs into a connector board. The data are streamed
through the chain and into the connector board using time-division multiplexing.
Each connector board takes eight chains, and four connector boards are used to
accommodate 1024 microphones in total.
Fig. 1. A two-microphone board from the LOUD array.
The four connector boards are connected to an expansion connector on the
Raw parallel processor motherboard via a high-bandwidth micro-coax cable. The
cable can accommodate up to five connector boards, so 256 additional micro-
phones could be added with the current configuration. A large FPGA on the Raw
motherboard (Xilinx 3000E) converts the serial data from the array into packets
of parallel words. The packets are streamed into Raw on one of its sixteen I/O
ports. Currently two of the Raw I/O ports are linked to physical connectors on
the Raw motherboard, meaning a total of 2560 microphones could be accommo-
dated with the current motherboard; but if more connectors were added or if
other Raw boards were used, we could theoretically support an arbitrarily large
number of microphones.
Figure 2 illustrates the hardware design of the array, the interconnections
between the array and Raw, and the associated bandwidths at each link. Each of
the 32-microphone chains produces 12.3 Mbits/sec. Each connector board receives
eight microphone chains, or 98.3 Mbits/sec. There are four connector boards,
meaning the total bandwidth into Raw is 393 Mbits/sec, or 49.1 MBytes/sec.
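These figures follow directly from the per-microphone data rate; a minimal arithmetic check, using the 16 KHz sample rate and 24-bit samples stated above:

```python
# Bandwidth arithmetic for the array: 24-bit samples at 16 KHz per mic.
bits_per_mic = 16_000 * 24           # 384 kbit/s per microphone
chain = 32 * bits_per_mic            # per 32-microphone chain
board = 8 * chain                    # per connector board (8 chains)
total = 4 * board                    # four connector boards into Raw
print(chain / 1e6, board / 1e6, total / 1e6, total / 8 / 1e6)
# -> 12.288 98.304 393.216 49.152 (Mbits/sec and, last, MBytes/sec)
```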
Fig. 2. A schematic diagram of the LOUD microphone array hardware design, the con-
nections to the Raw tile processor, and the bandwidths required at each connection.
3.2 Geometry
Many array geometries have been suggested in past work, from linear to rect-
angular to circular; and, similarly, many microphone spacing schemes have been
suggested, from uniform to logarithmic. While many geometrical configurations
of the array are possible and potentially desirable, our initial 1020-microphone
geometry (pictured in Figure 3) is a rectangular array 60 microphones wide
by 17 microphones high. This allows us to steer the amplification beam verti-
cally as well as horizontally. Since microphone arrays sample the signal in both
space and time, spatial sampling [4] as well as temporal sampling can affect
the resulting waveform. Arrays spatially sample at the intra-microphone spacing
wavelength, and any source signal component with a wavelength shorter than
twice the spacing will be aliased (per the Nyquist criterion). For this work, we
have chosen to use uniform spacing at 3 cm (meaning the waveform is spatially
sampled at 342 m/s / 0.03 m = 11,400 Hz, so frequencies above 5,700 Hz will be
aliased). This decision was due both to practical considerations and to prelimi-
nary experiments with various spacings. The 3 cm spacing is maintained in both
the vertical and horizontal directions, by deliberate placement of microphone
boards horizontally on an aluminum plate, and by using spacers of appropriate
lengths to stack boards vertically.
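The spatial sampling computation above reduces to a few lines; a minimal check using the speed-of-sound value assumed in the text:

```python
# Spatial Nyquist check for the uniformly spaced array described above.
c = 342.0                  # speed of sound in m/s (value used in the text)
spacing = 0.03             # inter-microphone spacing in meters
f_spatial = c / spacing    # spatial sampling frequency
f_alias = f_spatial / 2    # components above this frequency are aliased
print(f_spatial, f_alias)  # -> 11400.0 5700.0
```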
Fig. 3. A picture of the LOUD 1020-node microphone array.
3.3 Software
In order to utilize the microphone array to selectively amplify sound coming
from a particular source or sources, we have used a beamforming algorithm [4,
24] on our tile parallel processor. Beamforming algorithms use the properties of
sound propagation through space for sound source separation. Currently, we are
using a delay-and-sum beamforming algorithm [24], which is the simplest way of
computing the beam. Delay-and-sum beamforming uses the fact that the delay
for the sound wave to propagate from one microphone in the array to the next
can be empirically measured or calculated from the array geometry. This delay
is different for each direction of sound propagation, i.e., from the sound source
position. By delaying the signal from each microphone by an amount of time
corresponding to the direction of propagation and then summing the delayed
signals, we selectively amplify sound coming from a particular direction. Sub-
sample precision delays are handled by interpolation between the two adjacent
integral sample values. Delay-and-sum beamforming assumes that the position
of the desired source relative to the array is known. The problem of accurately
localizing a source is crucial, but rather separate from the problem of amplifying
sound coming from a particular direction. For the work presented in this pa-
per, we assume that the position of the speaker is known in advance; however,
Section 7 outlines our plans to pursue work in source localization.
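To make the algorithm concrete, here is a minimal NumPy sketch of delay-and-sum beamforming with linear interpolation for sub-sample delays. The function name and array layout are illustrative assumptions; this is a sketch of the technique, not the authors' Raw implementation.

```python
import numpy as np

def delay_and_sum(signals, delays, fs=16000):
    """signals: (num_mics, num_samples) array of synchronized recordings.
    delays: per-microphone propagation delays in seconds from the focal
    point. Each channel is advanced by its delay, then all are summed."""
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        d = delays[m] * fs                 # delay in fractional samples
        i = int(np.floor(d))
        frac = d - i
        # Linear interpolation between the two adjacent integral samples
        # provides the sub-sample precision mentioned above. (np.roll
        # wraps at the edges; a real implementation would pad instead.)
        a = np.roll(signals[m], -i)
        b = np.roll(signals[m], -(i + 1))
        out += (1.0 - frac) * a + frac * b
    return out / num_mics
```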
The delay-and-sum beamforming algorithm runs on the Raw microprocessor.
The audio collected at each microphone in the array is streamed into one Raw
static network input port. The data are streamed from tile to tile on the static
network, with each tile’s switch processor directing a portion of the data into the
local processor. The processor then stores the data into memory, and retrieves
the appropriately-delayed sample for each of its microphones. The running sum
for the beamforming computation is passed along from tile to tile, also on the
static network. The mapping of computation onto Raw tiles is illustrated in
Figure 4. Twelve tiles are used for computation, one for delaying the output
for debugging, one for bandpass-filtering the output, and one for formatting
it properly for the output D-to-A converter (used for monitoring). One tile is
unused due to I/O port placement constraints.
Fig. 4. Beamforming algorithm mapping to Raw tiles.
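A toy model of this dataflow, under the simplifying assumption that each compute tile owns a contiguous block of microphones and forwards a running sum (standing in for the value passed on the static network):

```python
import numpy as np

def tiled_beamform(signals, delays_samples, n_compute_tiles=12):
    """delays_samples: integral per-microphone delays in samples.
    Each 'tile' adds its microphones' delayed samples to a running sum."""
    num_mics = signals.shape[0]
    per_tile = -(-num_mics // n_compute_tiles)   # ceiling division
    running_sum = np.zeros(signals.shape[1])
    for t in range(n_compute_tiles):
        for m in range(t * per_tile, min((t + 1) * per_tile, num_mics)):
            running_sum += np.roll(signals[m], -int(delays_samples[m]))
        # Here the running sum would be forwarded to the next tile.
    return running_sum / num_mics
```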
In order to enable this implementation, we have created a flexible framework
for mapping computation to appropriate tiles, including scripts to automatically
generate static network code for the Raw switch processor. We have optimized
the code to run the beamforming algorithm for one fixed source position from
all 1020 microphones in real time on one Raw chip (16 tiles). Due to current
firmware constraints, we are able to run Raw at 150 MHz; however, the Raw
chip can support clock rates upward of 400 MHz, thus in the future we could
easily accommodate more microphones, or more computation per microphone.
4 Evaluation
We have conducted preliminary experiments with the LOUD array. The exper-
iments involved recording a person speaking in a room where several sources
of noise were present. The room is a very noisy hardware laboratory. The main
noise sources are several tens of cooling fans for computers and custom hardware,
and a loud air conditioner.
The subject reads a series of digit strings, and the speech is recorded with var-
ious subsets of the LOUD microphone array and a high-quality noise-canceling
close-talking microphone (Sennheiser HMD-410) (i.e., “clean” speech). In some
of the trials, another person serves as the “interferer,” reading a text passage
(the “Rainbow Passage” often used in speech recognition experiments) at the
same time as the main speaker is speaking digit strings. The interferer scenario
models a situation where several people in a room are talking, but we are inter-
ested in recording the voice of only one person, such as in a conference or in a
surveillance situation.
As mentioned in Section 3.3, we have assumed a fixed position for our sub-
ject. The amount of time required for the sound to travel from the position
to each microphone in the array was determined in advance with the following
procedure. A broadband “chirp” (frequency sweep) was played through a small
loudspeaker located at the “focal point” – the desired point for amplification. A
reference recording was obtained with a single microphone at the loudspeaker. The
signal received by each array microphone was also recorded and stored on disk. The
data were then upsampled by a factor of 50 to obtain sub-sample precision in
the calculation. A cross-correlation function (basically a dot-product at every
possible time offset) was then calculated between the reference recording and
the signal from each microphone. The time shift for which the function was at
a maximum was taken as the propagation delay for that microphone. Anecdo-
tally, this method was shown to be more accurate than calculating microphone
delays from array and point geometry, probably due to some slop in microphone
positions on the boards.
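A minimal SciPy sketch of this calibration step; the function name and the use of resample for upsampling are illustrative assumptions:

```python
import numpy as np
from scipy.signal import correlate, resample

def estimate_delay(reference, mic, fs=16000, upsample=50):
    """Estimate the propagation delay of `mic` relative to `reference`
    by cross-correlation, with upsampling for sub-sample precision."""
    ref_up = resample(reference, len(reference) * upsample)
    mic_up = resample(mic, len(mic) * upsample)
    # Cross-correlation: a dot product at every possible time offset.
    xcorr = correlate(mic_up, ref_up, mode="full")
    lag = np.argmax(xcorr) - (len(ref_up) - 1)
    return lag / (fs * upsample)             # delay in seconds
```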
The layout of the experimental setup is as follows. The array is positioned on
a counter top 145cm from the ground. The “focal point” of the array is located
in line with the left edge of the array, or 88.5 cm to the left of the center of the
array, 137 cm in front of the array, and 25 cm above the bottom row of the array.
The interferer stands at a mirror-image point in line with the right edge of the
array, or 88.5 cm to the right of center, and 137 cm in front.
The data from the microphone array are streamed into the Raw micropro-
cessor as described in Section 3.1, and are stored in the 2GB of off-chip DRAM
currently available to the processor. Once the audio streams are stored in mem-
ory, the processor performs a delay-and-sum beamforming (see Section 3.3) run
for 23 different array sizes and configurations, ranging from one microphone
to all 1020 microphones. This process simulates simultaneously recording the
same audio stream with arrays of varying sizes, allowing us to compare the ef-
fect of the number of microphones on array performance. The microphones are
first taken from the bottom row in powers of two (1, 2, 4, 8, 16, 32), and then
row by row all the way to the top of the array (each row adds 60 microphones:
60, 120, 180, . . . , 1020).
The output of the beamforming algorithm for each virtual array configura-
tion is processed with a band-pass filter set to pass through frequencies between
300Hz and 3,500Hz, in order to eliminate low-frequency noise and high-frequency
content that would be aliased when downsampling to 8KHz (as needed for recog-
nition). Finally the output waveforms are streamed from the chip to a desktop
host machine via a static network port. The resulting waveforms are then writ-
ten to disk on the host machine. Due to current host interface constraints,
it was not practical to record the output of all 1020 microphones; however, this
capability will soon be in place, allowing us to record a corpus of large-array
recordings (see Section 7).
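A sketch of this post-processing stage in SciPy; the filter order and design are assumptions, since the text specifies only the passband:

```python
from scipy.signal import butter, sosfilt, resample_poly

def postprocess(beamformed, fs_in=16000):
    """Band-pass 300-3,500 Hz, then downsample 16 KHz -> 8 KHz for the
    recognizer, mirroring the steps described above."""
    sos = butter(4, [300.0, 3500.0], btype="bandpass", fs=fs_in,
                 output="sos")
    filtered = sosfilt(sos, beamformed)
    return resample_poly(filtered, 1, 2)     # 16 KHz -> 8 KHz
```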
In order to obtain an initial metric of array performance, we measured the
signal-to-noise ratio (SNR) of the beamformer output. The SNR for an utterance
can be defined in many ways. When a precise recording of the original source
audio is available, the SNR can be calculated from the variance of the difference
between the source signal and the noisy signal. When such a recording is not available, or cannot
be reliably made, the “peak” SNR can be approximated by taking the maximum
signal power over a time window during the speech segments of the waveform
as signal, and the signal power over a non-speech segment as the noise. While a
reference recording was available to us from the close-talking microphone, it was
still somewhat noisy; thus we chose to use the second method. However, since
the method required hand-segmenting the beamformed audio for the 23 different
microphone configurations (to find speech and non-speech components), we
obtained SNR figures for only one representative recording.
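A minimal sketch of this "peak" SNR approximation; the window length and the (start, end) segment labels are assumptions:

```python
import numpy as np

def peak_snr_db(x, speech_seg, noise_seg, fs=16000, win_s=0.02):
    """speech_seg, noise_seg: (start, end) times in seconds from a hand
    segmentation. Peak SNR is the maximum windowed power within the
    speech segment over the mean power of the noise segment, in dB."""
    n = max(1, int(win_s * fs))
    s = x[int(speech_seg[0] * fs):int(speech_seg[1] * fs)]
    windows = [np.mean(s[i:i + n] ** 2) for i in range(0, len(s) - n + 1, n)]
    nz = x[int(noise_seg[0] * fs):int(noise_seg[1] * fs)]
    return 10.0 * np.log10(max(windows) / np.mean(nz ** 2))
```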
The evaluation portion of our experiment consisted of running the output of
the beamforming algorithm through the MIT summit recognizer [25], which is
a feature-based finite state transducer speech recognizer created at the Spoken
Language Systems group in our laboratory. The recognizer was trained on a com-
bination of clean and noisy speech from the Aurora digits corpus [26]. The Aurora
corpus is based on over 4,000 samples of humans reading digit strings recorded
with a close-talking microphone from [27]. The clean data are augmented by
synthetically adding noise from typical environments (e.g., train, babble, car,
and exhibition hall) at various SNRs to the utterances to simulate noisy data.
The simulated noisy speech in combination with the clean speech constitutes
over 28,000 utterances, which are all used to train the summit recognizer. We
note that due to the channel differences between the close-talking microphone
used to record the Aurora data and the LOUD microphone array, it is unrealis-
tic to expect the array test data to match recognition rates given in the Aurora
literature.
For this initial round of experiments, we recorded 150 utterances from two
male speakers with an interferer, and 110 utterances from the same speakers
without interferers. The data for the close-talking microphone were collected
as 80 utterances with an interferer at the same time as the array experiments,
in order to provide a baseline for the speech recognition experiments. In the
interferer trials, the person not serving as the subject served as the interferer.
Certainly, much more extensive testing is necessary in order to evaluate the
microphone array in sufficient detail; this work is ongoing (see Section 7).
The amplification pattern of a microphone array beamformer consists of a
main amplification lobe and a number of smaller side lobes. The width of the
main lobe quantifies the precision of the beam. A metric known as half-power
beam width (HPBW) [4] measures the distance from the focus of the beam at
which the amplification strength of the beam drops off by a factor of two (3dB).
To measure HPBW, we played a sine wave from a one-inch diameter portable
speaker while moving it around in space. At the same time we continually ran
the power calculation code on Raw in conjunction with the beamformer. In this
fashion, we were able to obtain some preliminary measurements of the HPBW
of our array.
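For intuition, the far-field beam pattern of one array row can be computed in closed form; a sketch under the simplifying assumptions of plane waves and broadside steering (unlike the near-field measurement described above):

```python
import numpy as np

def beam_pattern_db(n_mics, spacing, freq, angles_deg, c=342.0):
    """Delay-and-sum array factor of a uniform linear array steered
    broadside, in dB versus arrival angle."""
    k = 2.0 * np.pi * freq / c
    theta = np.radians(angles_deg)
    phases = np.outer(np.arange(n_mics) * spacing * k, np.sin(theta))
    af = np.abs(np.exp(1j * phases).sum(axis=0)) / n_mics
    return 20.0 * np.log10(np.maximum(af, 1e-12))

# Angular span of the main lobe within 3dB of the peak, for one
# 60-microphone row at 3 cm spacing and a 1 KHz source.
angles = np.linspace(-90.0, 90.0, 10001)
pattern = beam_pattern_db(60, 0.03, 1000.0, angles)
main_lobe = angles[pattern >= -3.0]
print(main_lobe.min(), main_lobe.max())
```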
5 Results
Figure 5 gives approximate peak SNRs in dB for a representative utterance, dis-
playing the trend of improvement as the number of microphones is increased. The
close-talking microphone, with an SNR level of 35.0dB, serves as the baseline.
The SNR improves from 17.2dB with one microphone to 30.9dB with all 1020
microphones. This 13.7dB improvement corresponds to a 4.6-fold improvement
in the ratio of signal energy to noise energy.
Speech recognition quality is typically evaluated based on the word error
rate (WER), meaning the percentage of words that were recognized incorrectly.
In this work, we mostly give results in terms of accuracy rates, which are simply
calculated as (100% − WER). Figure 6 and Table 1 give the accuracy rates for
the experimental data that we have collected, for all the array sizes ranging
from one microphone to all 1020. Our baseline accuracy is for a close-talking
microphone, at 98.8%. This is consistent with results from the Aurora corpus
when tested on clean speech [26], meaning that our speech recognizer performs on
a level consistent with other state-of-the-art recognizers. For one microphone, the
accuracy is below 10% both with and without an interferer, meaning acceptable
recognition is impossible to achieve. The accuracy rises above 50% around 60
microphones (one full row of the array) – a reasonable recognition hypothesis
can sometimes be made at this level. All 1020 microphones yield 87.6% accuracy
(87.2% drop in word error rate) in the presence of an interferer and 90.6% (89.6%
drop in WER) without an interferer.
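The quoted drops in WER are relative reductions in word error rate, with accuracy defined as (100% − WER); a two-line check of the arithmetic:

```python
def relative_wer_drop(acc_before, acc_after):
    """Relative reduction in WER, where accuracy = 100% - WER."""
    wer_before, wer_after = 100.0 - acc_before, 100.0 - acc_after
    return 100.0 * (wer_before - wer_after) / wer_before

print(relative_wer_drop(3.0, 87.6))   # ~87.2 (interferer present)
print(relative_wer_drop(9.6, 90.6))   # ~89.6 (no interferer)
```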
Our measurements of the HPBW of the array show that when listening to a
1 KHz source, the energy level drops off by half when the source moves 5 inches
horizontally, or 10 inches vertically from the point the algorithm is amplifying.
Fig. 5. Peak SNRs for one representative recording from the microphone array. (Plot:
peak signal-to-noise ratio in dB versus the number of microphones, on a logarithmic
scale from 1 to 1020, for the microphone array and the close-talking microphone
baseline.)
This result is consistent with the 60x17 array geometry because the array has
a greater spatial extent horizontally than vertically. However the amplification
pattern is different for each frequency of sound, and speech is a broadband signal.
Thus it is difficult to quantify exactly how well the beamformer would be able
to separate two speakers at arbitrary positions in the space.
6 Discussion
The results in Figures 5 and 6 clearly demonstrate there is indeed a benefit to
having arrays of this large size. In this work, we focused on the design of the sys-
tem, and did not implement sophisticated beamforming algorithms or other
signal processing software components. However, even with the simplest beam-
former possible, we were able to obtain increasing SNRs and gains in recognition
accuracy all the way to 1020 microphones. This is perhaps the most significant
result.
The most drastic jump in the recognition accuracy curve is seen when the
number of microphones jumps from 32 to 60, most likely because this completes
the full bottom row of the array (60 microphones), roughly halving the beam width
relative to 32 microphones. After this point, adding more microphones
does not make the array wider, just taller. We note that the accuracy even with
1020 microphones (87.6% and 90.6%) is clearly significantly short of the 98.8%
baseline from the close-talking microphone; and this is consistent with the SNRs
Fig. 6. Experimental results from the LOUD microphone array. (Plot: recognition
accuracy in percent versus the number of microphones, on a logarithmic scale.) Results
for data recorded with the array both in the presence of an interferer and when the
interferer is absent are given. The baseline level of 98.8% accuracy is given when using
a high-quality close-talking microphone.
noted in the recordings. However, with more complicated signal processing and
beamforming algorithms and a better match between the recognizer training and
test conditions (see Section 4), we are confident that the recognition accuracy of
audio recorded with the array can approach that of a close-talking microphone.
Extrapolating the trends in Figures 5 and 6 suggests that array performance could
eventually reach close-talking microphone levels.
Comparison with past work is difficult for several reasons. One reason is dif-
ferences in experimental conditions. Our data were collected in a very noisy envi-
ronment; likely noisier than most of the currently-published results. For instance,
[14] cites an accuracy rate of 42.4% with a single omnidirectional microphone
and one interferer, compared to our 3.0%; [23] is at 58%; and [16] is at 33%.
While SNR is one intuitive way of comparing noise levels, it is actually difficult
to compare based on SNR, since the various methods for determining SNR can
produce very different results. Some of the previous work uses a form of average
SNR over a hand-segmented waveform; however in our experiments, due to time
constraints, it was impossible to perform such an analysis. Another alternative is
to use a speech recognizer to segment the waveform and then calculate average
SNRs; however this approach is often inaccurate because recognizer segmenta-
tion will be poor at low SNRs. In addition, many hold the opinion that SNR
may not be a good measure of speech quality at all [28, 23].
Number of Microphones   Peak SNR (dB)   Accuracy (Interferer Present)   Accuracy (No Interferer)
1                       17.2             3.0%                            7.1%
2                       16.7             3.2%                            6.0%
4                       17.9             3.6%                            6.7%
8                       18.5             5.7%                           11.2%
16                      20.7             9.5%                           19.6%
32                      21.0            14.7%                           30.7%
60                      22.1            54.0%                           70.0%
120                     23.4            61.8%                           74.6%
180                     24.4            65.5%                           81.1%
240                     25.3            69.6%                           83.8%
300                     26.6            72.4%                           86.3%
360                     26.1            75.8%                           86.8%
420                     27.2            77.1%                           87.5%
480                     27.6            78.6%                           87.7%
540                     28.2            80.8%                           88.3%
600                     28.6            82.0%                           88.9%
660                     29.6            83.2%                           89.6%
720                     29.6            83.9%                           89.5%
780                     30.0            84.7%                           89.8%
840                     30.2            85.1%                           90.3%
900                     30.4            86.0%                           90.8%
960                     30.5            87.0%                           90.8%
1020                    30.9            87.6%                           92.0%
Close-talking baseline  35.0            98.8%
Table 1. Recognition accuracy rates and signal-to-noise ratios under various test conditions.
Comparison with past results is further complicated by the fact that most
currently published speech recognition experiments with microphone arrays
use much smaller arrays than this work.
For instance, in his 1996 PhD thesis, Sullivan [16] writes,
... the [recognition] improvement appears to level off at about 8 sensors.
This suggests that there may be other factors involved in word errors that
the type of processing provided by microphone arrays cannot correct. In
our experiments it is fair to say that the amount of extra computation
and hardware needed to process additional microphones beyond 4-8 mi-
crophones appears to exceed the benefits that one obtains from including
them.
The work reports that recognition accuracy levels off at around 60% after
eight microphones. In our work, we were able to achieve recognition levels well
in excess of this; perhaps the discrepancy can be attributed to differences in
speech recognition technology (the work reported by Sullivan was published eight
years ago). Indeed, in a later work, Adcock [23] was able to achieve recognition
accuracy rates above 80% for noisy speech recorded with a 16-microphone array,
but starting at 58% for one microphone, as compared to 33% in Sullivan. It is
unknown whether the accuracy rates in Adcock would have continued improving
with larger array sizes. Our work starts at recognition rates below 10% for one
microphone, and accuracy exceeds the 50% mark only after 60 microphones
are used, clearly demonstrating a difference in noise conditions.
In general, it is probably fair to note that prevailing opinion in the signal
processing community has made research on large arrays seem impractical and
unrewarding. We submit that by demonstrating a
consistent improvement with increasing array sizes, we have shown that, at least
at the current time, this belief is not entirely accurate. While our accuracy rates
even with a 1020-microphone array fall short of those of recordings made with
a close-talking high-quality microphone, we believe that this merely serves to
motivate future research in adaptive beamforming and other advanced sound
source selection techniques (see Section 7).
Regarding the hardware needed for the real-time processing of data from a
large microphone array, we have also shown that using a parallel tile architec-
ture is both sufficient and practical for arrays of this size. Past work in large
arrays (e.g., [6]) has used special-purpose DSP chips. In contrast, with our use
of the Raw microprocessor, we have followed the recent trend in the computer
architecture field to leverage general-purpose tiled architectures to increase both
computation density and ease of programming.
7 Conclusion
In this work we have presented LOUD, a 1020-node microphone array and beam-
former for intelligent computing spaces. We have outlined our design of the array
hardware and software architecture, and our utilization of the Raw tile parallel
processor for computation. In addition, we have presented an analysis of exper-
imental data collected with the array. The data were captured with the array,
processed with a beamforming algorithm, and evaluated by means of running
a speech recognizer on the beamformed data. We have presented experimental
signal-to-noise ratios and recognition accuracy scores for 23 different array sizes,
ranging from one to 1020 microphones, showing a steady improvement all the
way to 1020 microphones. We believe that with these results we have made the
case that large microphone arrays deserve a thorough investigation both by the
ubiquitous computing and signal processing communities.
We are pursuing several directions for future work in this project. First, we
are continuing work to evaluate the microphone array’s usefulness in ubicomp
scenarios. The experiments presented in this work are only a first step towards
a full evaluation of the system. In order to prove the applicability of our micro-
phone array in ubiquitous computing, we plan to conduct user studies in which
the subjective quality of the speech produced by the array can be measured on
communication tasks such as tele-conferencing.
In addition, in order to evaluate the quantitative performance of the array
further, more speech data will be collected from more speakers, at more positions,
in different noise environments, with different array configurations and spacings,
etc. Our group is currently working on increasing the speed of data collection
by improving the USB 2.0 interface between the host computer and the Raw
motherboard. Improved speed here will allow us to record data from all 1020
microphones, rather than only the 23 beamformed channels. Once this work
is completed, the data collection effort will be greatly simplified, since the re-
quirement to run beamforming on the data at collection time will be eliminated.
The data will simply be compressed on the Raw chip and shipped off to stor-
age. Once we are able to quickly record data from all 1020 microphones, we will
be able to produce an array microphone recording corpus, which we will make
available for public distribution on the web. Other array corpora have been made
available in the past, but only with a smaller number of microphones (e.g. [29]
has data from 37 and 23 microphones).
Improved host interface speeds will enable us to perform experiments on the
data that currently run slower than real-time. For instance, we plan to experi-
ment with “tracking” speakers as they move in the space around the array. This
can be accomplished by performing subsequent rounds of perturbing the beam
position, or delay values, and moving in the direction of maximum energy over a
time window (basically, a gradient ascent approach). This approach is a type of
adaptive beamforming, or beamforming that adapts as aspects of the environ-
ment change. Even with our current stationary experimental setup, it would be
useful to have this algorithm in place: since it is impossible for the speaker to
stand in precisely the same place from trial to trial, we would start the beam at
the point corresponding to our initial delay measurements, and then search for
the speaker in the surrounding space. The position with the maximum energy
likely corresponds to the position of the speaker’s mouth. This approach can
be used generally to search for speakers in a room, although there are several
difficulties. For instance, if the speaker is silent over several energy windows, the
beam could wander away and actually amplify a noise source. Or, if two speakers
are present and both talking, it may be hard to maintain a consistent tracking of
one of the speakers. In order to compensate for these difficulties, this method can
be augmented with information from other modalities. For instance, a group in
our lab [19] has used computer vision algorithms for person detection
using stereo cameras. The vision-based person location was used as a first-order
approximation for a 32-microphone array beamformer, attempting to recognize
speech in the presence of noise and interferers. This multimodal sensor fusion
fits in well with the goals of our ubiquitous computing research under Project
Oxygen; and we plan to explore this approach with the LOUD array.
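A sketch of one such perturb-and-move update; `beamform_at` is a hypothetical callback that beamforms the recorded signals toward a candidate focal point, and the step size and window length are assumptions:

```python
import numpy as np

def track_step(signals, fs, focus, beamform_at, step_m=0.05, win_s=0.5):
    """One gradient-ascent-style update: beamform at the current focus
    and at small spatial perturbations, then move to the candidate with
    the highest output energy over the time window."""
    offsets = [np.zeros(3)]
    for axis in range(3):
        for sign in (1.0, -1.0):
            d = np.zeros(3)
            d[axis] = sign * step_m
            offsets.append(d)
    n = int(win_s * fs)
    candidates = [np.asarray(focus) + d for d in offsets]
    energies = [np.sum(beamform_at(signals, c)[:n] ** 2) for c in candidates]
    return candidates[int(np.argmax(energies))]
```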
We are pursuing several target applications for microphone arrays within
Project Oxygen. We plan to integrate the microphone array into our ubicomp
research spaces, such as our lab’s kiosk platform for human-computer interaction
experiments [30]. A microphone array will greatly increase the usability (and
hopefully, the use) of our kiosks; the current method is to ask the user to wear a
close-talking microphone headset, which is bulky and inconvenient for passers-by
wishing to only have a short interaction with the kiosk.
Another aspect that we plan to consider in the future is the effect that room
acoustics and environmental conditions (other than noise) have on array per-
formance. Array performance can be affected by reverberations and distortions
due to the room in which the array is located. This effect is likely present in our
work, since the array is currently located in a cluttered hardware lab with many
hard surfaces. We plan to measure array performance with different room con-
figurations in order to understand the effect of these factors. In addition, when
microphone delays are calculated from array geometry (as is usually the case
in beamforming), a particular value for the speed of sound must be assumed.
However, the speed of sound varies with room temperature and humidity; for
instance, a five-degree Celsius variation in temperature changes sound speed by
more than 3 m/s. The accuracy of the calculations, and thus the performance
of the array, can vary as the temperature changes.
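This sensitivity can be checked with the standard first-order approximation for the speed of sound in dry air (a textbook formula, not one given in the paper):

```python
def speed_of_sound(temp_celsius):
    """First-order approximation for the speed of sound in dry air."""
    return 331.3 + 0.606 * temp_celsius    # m/s

print(speed_of_sound(25.0) - speed_of_sound(20.0))   # ~3.0 m/s per 5 C
```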
Acknowledgments
Many thanks are in order to Karen Livescu, who set up the summit speech
recognizer for use with the Aurora speech corpus, and provided assistance during
this project. Additionally, the authors wish to acknowledge Arthur Baggeroer,
Patrick Miller, and Kevin Wilson for their valuable advice on microphone array
systems.
This work was funded in part by DARPA and an industrial consortium sup-
porting the MIT Oxygen alliance.
References
1. Taylor, M.B., et al.: The Raw microprocessor: A computational fabric for software
circuits and general purpose programs. IEEE Micro (2002)
2. MIT CSAIL: MIT Project Oxygen. http://oxygen.lcs.mit.edu/ (2004)
3. Sturim, D.E., Brandstein, M.S., Silverman, H.F.: Tracking multiple talkers us-
ing microphone-array measurements. In: Proceedings ICASSP, Munich, Germany
(1997)
4. Van Trees, H.L.: Optimum Array Processing. Wiley-Interscience (2002)
5. Johnson, D.H., Dudgeon, D.E.: Array Signal Processing. Prentice Hall, Englewood
Cliffs, New Jersey (1993)
6. Silverman, H.F., Patterson, W.R., Flanagan, J.L.: The huge microphone array.
Technical report, LEMS, Brown University (1996)
7. Taylor, M.B., et al.: Evaluation of the Raw microprocessor: An exposed-wire-delay
architecture for ILP and streams. In: Proceedings International Symposium on
Computer Architecture, München, Germany (2004)
8. Skolnik, M.I.: Introduction to Radar Systems. McGraw-Hill, New York (1980)
9. Haykin, S., ed.: 5. In: Array Signal Processing. Prentice Hall, Englewood Cliffs,
NJ (1985)
10. Oppenheim, A.V., ed.: 6. In: Applications of Digital Signal Processing. Prentice
Hall, Englewood Cliffs, NJ (1978)
11. Haykin, S., ed.: 2. In: Array Signal Processing. Prentice Hall, Englewood Cliffs,
NJ (1985)
12. Flanagan, J., Berkley, D., Elko, G., West, J., Sondhi, M.: Autodirective microphone
systems. Acustica 73 (1991) 58–91
13. Lin, Q., Jan, E., Che, C., Flanagan., J.L.: Speaker identification in teleconferenc-
ing environments using microphone arrays and neural networks. In: Proceedings
ESCA Workshop on Speaker Recognition, Identification and Verification, Switzer-
land (1994) 235–238
14. Moore, D., McCowan, I.: Microphone array speech recognition: Experiments on
overlapping speech in meetings. In: Proceedings ICASSP, Hong Kong (2003) 497–
500
15. Omologo, M., Matassoni, M., Svaizer, P.: In: Brandstein, M., Ward, D., eds.:
Microphone Arrays. Springer (2001) 331–353
16. Sullivan, T.M.: Multi-Microphone Correlation-Based Processing for Robust Auto-
matic Speech Recognition. PhD thesis, ECE Department, Carnegie Mellon Univer-
sity (1996)
17. Lang, S., Kleinehagenbrock, M., Hohenner, S., Fritsch, J., Fink, G.A., Sagerer, G.:
Providing the basis for human-robot-interaction: a multi-modal attention system
for a mobile robot. In: Proceedings ICMI, Vancouver, British Columbia, Canada
(2003)
18. Schmidt, A., Gellersen, H., Beigl, M.: A wearable context-awareness component:
Finally a good reason to wear a tie. In: Proceedings International Symposium on
Wearable Computers, San Francisco, California (1999) 176–177
19. Wilson, K., Rangarajan, V., Checka, N., Darrell, T.: Audiovisual arrays for un-
tethered spoken interfaces. In: Proceedings ICMI. (2002)
20. Havelock, D.I.: A large microphone array for outdoor sound propagation studies.
In: Proceedings of the Acoustical Society of America, Austin, Texas (1994)
21. Stanford, V.: The NIST Mark-III microphone array - infrastructure, reference
data, and metrics. In: Proceedings International Workshop on Microphone Array
Systems - Theory and Practice, Pommersfelden, Germany (2003)
22. Bell Laboratories: The 400 Element Square Planar Microphone Array.
http://www.bell-labs.com/org/1133/Research/Acoustics/MicArrayPicture.html
(2000)
23. Adcock, J.: Optimal Filtering and Speech Recognition With Microphone Arrays.
PhD thesis, Brown University (2001)
24. Van Veen, B.D., Buckley, K.: Beamforming: A versatile approach to spatial filter-
ing. IEEE ASSP Magazine 5 (1988) 4–24
25. Glass, J.: A probabilistic framework for segment-based speech recognition. Com-
puter, Speech, and Language 17 (2003) 137–152
26. Hirsch, H.G., Pearce, D.: The Aurora experimental framework for the performance
evaluation of speech recognition systems under noisy conditions. In: Proceedings
ISCA Tutorial and Research Workshop ASR2000, Paris, France (2000)
27. Leonard, R.: A database for speaker independent digit recognition. In: Proceedings
ICASSP, San Diego, California (1984)
28. Quackenbush, S.R., Barnwell, T.P., Clements, M.A.: Objective Measures of Speech
Quality. Prentice Hall, Englewood Cliffs, New Jersey (1988)
29. Jan, E., Svaizer, P., Flanagan, J.: A database for microphone array experimenta-
tion. In: Proceedings Eurospeech, Madrid, Spain (1995)
30. Van Kleek, M.: Intelligent environments for informal public spaces: the Ki/o kiosk
platform. Master’s thesis, MIT (2003)
