Many people are accustomed to donning headphones to enjoy music at a desired volume without inflicting their tunes on others nearby. But there are tradeoffs inherent in the headphones experience. For one, you’re generally physically tethered to the sound source via wires, inhibiting your movement. Second, the headphones tend to isolate you from your physical environment. What if such limitations could be overcome? Within Microsoft Research, a trio of researchers has moved beyond “what if” and “how” toward “when.”
“We’re trying to recreate the headphone experience without headphones,” says Jasha Droppo, a researcher with the Speech Technology Group. Droppo, along with teammates Ivan Tashev, a software architect, and Michael Seltzer, a fellow researcher, has developed a project called Personal Audio Space, which they define as a semi-private, energy-efficient system for real-time communication.
The project uses multiple speakers to focus sound around the user. This tailored approach enables the user to hear the audio clearly, while people adjacent to the focus area hear it, if at all, at a much lower volume than the target recipient does.
“Ivan, Mike, and I are trying to look at ways computers can work with audio to make the computing experience better,” Droppo says. “We are looking at computer control of multiple audio drivers, multiple speaker cones, to see what kind of interesting things are possible with those.
“There’s a whole area of mathematics that has been developed for microphone arrays and capturing sounds from different directions. How does that apply to sound rendering? What kind of interesting things can we do?”
Jasha Droppo (left), Ivan Tashev (center), and Mike Seltzer display the latest version of their Personal Audio Space sound-targeting speaker array.

The team has demonstrated its proof of concept with a deceptively simple piece of hardware. The first prototype consisted of 16 two-inch speaker cones attached to the front of a 42-inch 2-by-4. Even with this array of speakers, they were able to direct sound waves effectively, amplifying some and negating others to define a sweet spot that optimizes projected audio for the listener yet diminishes it for others within hearing range. Although the second prototype, pictured at left, features a baffling box and more intelligent wiring, the principle remains the same.
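The board’s geometry already dictates much of what such an array can do. As a back-of-the-envelope check (assuming, purely for illustration, that the 16 cones are evenly spaced along the 42-inch board; the article doesn’t specify the exact layout), the element spacing sets the highest frequency the array can steer without spatial aliasing:

```python
# Back-of-the-envelope array physics for the first prototype, assuming
# the 16 cones are evenly spaced along the 42-inch board (an assumption;
# the article does not give the actual layout).
C = 343.0                  # speed of sound in air, m/s
length_m = 42 * 0.0254     # 42 inches converted to metres
spacing = length_m / 15    # 15 gaps between 16 speakers
print(f"element spacing: {spacing * 100:.1f} cm")

# Above roughly c / (2 * spacing), the array develops grating lobes --
# copies of the beam in unwanted directions -- a classic array tradeoff.
print(f"alias-free steering up to about {C / (2 * spacing):.0f} Hz")
```

The takeaway: with inexpensive two-inch cones at this spacing, clean steering is limited to the lower part of the audio band, which is one reason array design involves more than just adding speakers.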
In a project FAQ, the three state: “The magic lies in independent computer control of multiple speaker drivers.” Droppo elaborates:
“If you have audio going into a single speaker, it pours into a room like water. It just goes everywhere. Once we have multiple speakers under computer control, we can pre-distort the audio so that it builds up in some areas of the room and cancels itself out in others. We can do computer simulations of how the sound is going to propagate from the individual speaker cones and specify that we want more sound in one region and less sound in another.
“What we’re doing is the simplest thing that computers know how to do to an audio signal. It’s called a linear filter. Basically, it shapes the frequency content of the signal in a rather straightforward way. For every frequency, we try to determine what kind of unique delay we can apply so that, as these speakers cooperate to produce a sound field, it does what we want.”
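The “unique delay per frequency” Droppo describes is, in its simplest form, delay-and-sum beamforming run as a renderer: each speaker’s signal is delayed so that all the wavefronts arrive in phase at the chosen sweet spot and partially cancel elsewhere. A minimal sketch of that idea (the geometry, sample rate, and pure-delay filters here are illustrative, not the team’s actual implementation):

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def steering_delays(speaker_xs, target):
    """Per-speaker delays (s) so all wavefronts arrive at `target` in phase.

    speaker_xs: (N, 2) array of speaker positions in metres.
    target:     (2,) position of the listening 'sweet spot'.
    """
    dists = np.linalg.norm(speaker_xs - target, axis=1)
    # Delay the nearer speakers so every path takes the same total time.
    return (dists.max() - dists) / C

def render_filters(delays, fs, n_fft=512):
    """Linear-phase filters: one pure delay per speaker.

    Returns an (N, n_fft//2 + 1) array of complex frequency responses,
    exp(-j*2*pi*f*delay) at each FFT bin -- a delay applied per frequency
    so the speakers cooperate to build the sound field at the target.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])

# 16 speakers spaced along a one-metre line, loosely like the prototype board.
xs = np.stack([np.linspace(0.0, 1.0, 16), np.zeros(16)], axis=1)
delays = steering_delays(xs, target=np.array([0.5, 2.0]))
H = render_filters(delays, fs=16000)
print(H.shape)  # one frequency response per speaker
```

This is only the pure-delay special case of a linear filter; more sophisticated designs also shape each filter’s magnitude to trade loudness at the target against leakage elsewhere.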
The effect, as one might expect, is liberating. While the technology is not yet ready to be included in a product, those who have been exposed to it invariably come away with smiles on their faces.
That sort of response is rewarding for Droppo, Tashev, and Seltzer, who bring a well-rounded research portfolio to their Personal Audio Space project.
“In the past,” Droppo says, “I have worked with speech enhancement and speech recognition. Once you have captured a speech signal, how can you make it sound better? How can you do better, more accurate recognition?
“The same kinds of tools that go into processing speech sounds on a computer are also useful for doing audio rendering. It’s the same basic tool set of linear algebra and convolution and frequency analysis.”
Tashev, meanwhile, is an expert on microphone arrays, which are similar to speaker arrays but designed for sound capture.
“A lot of the math that Ivan developed for his microphone-array technologies,” Droppo says, “is similar to what we’re using for audio rendering. He likes to joke that he just runs the software backward.”
Seltzer, too, has a background in microphone arrays and speech recognition. Together their talents put them in an advantageous position for tackling the Personal Audio Space project.
“There are some things about building a speaker array that are similar to building a microphone array,” Droppo says. “There are also a lot of things that are quite different. Discovering what these different things were was a goal in the first phase of our project.”
Another goal was to learn if such functionality was economically feasible.
“The other thing we were trying to do was see how cheaply we could build one of these things,” Droppo adds. “There exist on the market devices that are similar on the surface, but they cost a lot of money to build and to buy. One of the questions we were trying to answer was: Do we really need to spend a lot of money building these things to get a useful result out of them?”
Apparently not. The materials used to construct their first prototype consisted of little more than 16 small commodity speaker cones, a piece of lumber, some speaker wires, and a handful of fasteners.
“The way I like to design research projects,” Droppo says, “is that each phase answers at least one or two questions that we don’t know [the answer to]. While we don’t have anything spectacularly different yet, in the past few months, we’ve been able to catch up to the state of the art, and the real exciting part for me is where we’re taking this in the future.”
There are a number of usage scenarios that come to mind when considering the potential of the Personal Audio Space technology. One, for example, would enable an office worker to listen to music without disturbing those in adjacent workspaces. Another has broader applicability.
“One that Ivan, in particular, is very passionate about,” Droppo says, “is pairing the speaker array for audio rendering with the microphone array for audio capture and a screen and a camera for video capture and rendering. Once you have that complete solution, you can have a communications terminal that will track the users and deliver audio to the intended users, capture low-noise audio from them, do face tracking, and provide a really nice communications experience where it feels like you’re having a private-conversation video chat.
“The advantage of the speaker array in that scenario is that once you aren’t tethered to the computer anymore, you can wander around the room. Without the speaker array, you’d want to turn up the speakers so you could hear it everywhere. But with the speaker array and the user tracking, the audio could be delivered to you wherever you are in the room.”
And then there’s the babysitter scenario.
“It’s very simple,” Droppo explains. “It’s delivering audio to you at a higher volume than your kids are going to hear upstairs or in the next room while they’re sleeping.
“When you tell people about that scenario, it divides them into two groups. Either they don’t understand the utility, or they have children. The people who have children get it right away.
“About once a month,” Droppo smiles, “one of my kids will come downstairs just to see what’s going on, and they’ll complain about the TV being too loud. I don’t know if it’s actually too loud, because I try to be considerate, or if they’re just using that as an excuse to see what’s going on. I’d love to take that excuse away from them.”
One interesting aspect of the Personal Audio Space project is that, unlike many research projects that revolve around scientific inquiry and conjecture, this one required Droppo to build a physical object to put the concept to the test.
“The hardest part was overcoming the physical aspects of the project,” Droppo says, “because I had mainly been a software person before. As part of my job at Microsoft, I haven’t built anything this big before. It’s actually a physical thing, with amplifiers and digitizers and boards and glue and nails and screws. That was my favorite part of the project.”
Of course, there were more abstract aspects, particularly with regard to learning about acoustic rendering and the physics of sound. But Seltzer was able to obtain a copy of a seminal text, Acoustics, by Leo L. Beranek, that helped them master such subtleties.
As things stand, Droppo stipulates, it will be difficult to reduce the non-targeted audio to absolutely zero, a concept readily understood by anybody who has found themselves next to a person listening to high-volume music in a bus or on an airplane. But the Personal Audio Space technology can diminish the audio leakage to the point where, in a busy room, it is virtually impossible to detect.
“Most of the time,” Droppo says, “people will perceive that you have the audio much lower than you actually do.”
With that achieved, there are a number of ways this technology can be extended.
“One,” Droppo says, “is developing algorithms that produce better directional sound fields. Right now, we’re using a well-known technique, beam forming, that produces the effect that the audio exists in one region and not another. There are still interesting things to do in that space in order to produce more of a separation between where you want the audio to be and where you don’t want the audio to be, to make a better separation between the two.”
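For a uniform line array, the separation beamforming delivers can be quantified by its far-field pattern: on-axis contributions add coherently, while off-axis they increasingly cancel. A toy calculation (element count, spacing, and frequency chosen for illustration only, not taken from the project):

```python
import numpy as np

# Far-field pattern of a uniform 16-element line array steered broadside,
# evaluated at 1 kHz. Illustrative numbers only.
C, FREQ, N, D = 343.0, 1000.0, 16, 0.07  # m/s, Hz, elements, spacing (m)

def array_gain_db(theta_deg):
    """Relative level (dB) radiated toward angle theta (0 = broadside)."""
    k = 2 * np.pi * FREQ / C                      # wavenumber
    n = np.arange(N)
    # Geometric phase lag of each element's contribution at angle theta.
    phases = k * D * n * np.sin(np.radians(theta_deg))
    response = np.abs(np.exp(1j * phases).sum()) / N
    return 20 * np.log10(max(response, 1e-12))

for angle in (0, 20, 45):
    print(f"{angle:2d} deg: {array_gain_db(angle):6.1f} dB")
```

On axis the 16 contributions sum perfectly (0 dB); a few tens of degrees off axis the level drops by well over 15 dB. The “interesting things” Droppo mentions amount to pushing that contrast further, and holding it across the whole audio band rather than at a single frequency.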
Another direction to take could involve applications of the technology.
“Now that we know how to build these things and what a lot of the tradeoffs are in the design,” he says, “we can start looking at different applications of the technology and seeing how what we’ve been able to build can actually improve the customer’s experience.”
That might involve enabling users to mold the sound to their precise preferences using an enhanced user interface. Or it might mean constructing an actual hardware component to bring the demonstrated capacity to life.
“If you have a speaker array under computer control,” Droppo suggests, “it can produce many more patterns than you can reasonably choose from. Presenting these options to users and letting them make intelligent choices about what they want is a user-interface issue. Given today’s technology, we can build new calibration tools, new interaction tools, so that users can intelligently design the type of audio they want for their rooms.
“The other direction that we’re looking at is speaker arrays as a component technology. What kind of end-to-end systems can we build that make sense? Can we make something the width of a monitor that actually produces sound that is pleasant to listen to?”
There’s no doubt, though, that when Droppo, Tashev, and Seltzer demonstrate Personal Audio Space, the response they typically get is music to their ears.
“The coolest part to me,” Droppo says, “is that when I describe it to people, they just get it. Their eyes light up, and they can think of many different ways to use the technology. That is the kind of feedback I really like to get.”