Audio Human Extensions
I believe language is the largest market opportunity for the audio industry. It is essential for augmented reality and for any application that augments our human abilities, which I have characterized as human extensions. Language and voice are intrinsically connected and are cornerstones of today's audio product designs, both as a way to expand human interfaces and, more importantly, as the "code" that connects our experience with technology. I am always fascinated when I see how people explore our current generative AI engines, which are mostly conditioned to natural language prompts, in different languages, and how that "language" interface results in diverse outcomes.
As interactions with AI-powered products and solutions based on large language models (LLMs) expand to different levels, from work to entertainment, we are now witnessing the rapid transition to AI agents, which theoretically should be agnostic to human languages. Yet that's absolutely not the case, and that is why I believe this is such a promising field for the audio profession. I predict it will take at least a decade for language and voice technologies to evolve through this transition.
Through natural language processing (NLP), we have created a new level of experience for humans to interact with machines. We will continue to require knowledge of programming languages to develop computers, but much less to use them. We will be able to evolve from structured menus and steep interface learning curves to computers that understand how we humans already communicate, by both voice and text. This is where AI truly becomes the next level of the Information Age, making computers universally accessible. That is particularly true when AI shows promise to create the universal translator, the layer of human intent, and the most intuitive interface between human thinking and machine action.
Human knowledge is almost entirely encoded in language, and our LLMs are trained on our books, articles, laws, research, and more, but also on our interactions and conversations as text or speech. But because language carries intent, context, and meaning, it also becomes the hardest domain for AI. That's why we are not there yet. We are currently still restricted to speech-to-text and text-to-speech conversion, with a "machine" latency that is still far from human interactions.
I recently finished reading a fascinating biography of Claude Shannon, A Mind at Play by Jimmy Soni and Rob Goodman, which details how his published conceptual foundations mapped almost exactly onto what would eventually become modern NLP and LLMs. In his 1948 paper "A Mathematical Theory of Communication," Shannon modeled language using mathematical structures where the probability of the next symbol depends on the preceding symbols and, conceptually, described the very principle behind modern LLMs. In 1948.
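To make that principle concrete, here is a minimal Python sketch of a character-level model in which the probability of the next symbol depends only on the few symbols before it. It is an illustration under stated assumptions, not Shannon's own procedure and not how an LLM is built: the function names, the sample sentence, and the choice of a two-character context are all hypothetical.

```python
from collections import Counter, defaultdict


def train_char_model(text, order=2):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def next_char_probabilities(counts, context):
    """Turn the raw follow-up counts for one context into probabilities."""
    followers = counts.get(context)
    if not followers:
        return {}  # context never seen in the training text
    total = sum(followers.values())
    return {ch: n / total for ch, n in followers.items()}


# Hypothetical toy corpus, chosen only for illustration.
sample = "the theory of the thing is that the next letter depends on the last"
model = train_char_model(sample, order=2)
probs = next_char_probabilities(model, "th")
print(probs)
```

In this toy text, "e" comes out as the most likely symbol after "th". Modern LLMs predict tokens from far longer contexts with learned neural networks rather than lookup tables, but the question being asked, what comes next given what came before, is the same.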
Shortly after, in 1951, Claude Shannon expanded his research in "Prediction and Entropy of Printed English," where he derived upper and lower bounds on the entropy of the English language. The redundancy he quantified is what makes communication robust to noise, what allows humans to understand speech in a noisy room, and what makes compression possible. Shannon also provided the theoretical basis for all digital audio, including pulse-code modulation (PCM), speech coding, and ultimately speech recognition: the idea that voice is just another information source with a measurable entropy that can be compressed, transmitted, and reconstructed within provable limits. Shannon understood the importance of language modeling 70 years before all these things were implemented in silicon.
While reading this fascinating biography, I couldn't help but think how Shannon's findings have contributed, and can still contribute, to the evolution of key disciplines in audio, including wireless audio transmission. They span from the challenges of signal transmission over cables across the ocean to the coding that enabled modern communications and information technologies, in the process creating the very foundations for digital audio. This, I believe, will eventually also help us model sound accurately as a function of audio signals, which is where we continue to struggle.
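The notion of a source with a "measurable entropy" can be sketched in a few lines of Python. The snippet below is a simplified estimate based only on how often each symbol appears, ignoring context entirely. That is an assumption made for brevity: Shannon's 1951 bounds (roughly 0.6 to 1.3 bits per letter) came from long contexts and human prediction experiments. The function name and sample string are hypothetical.

```python
import math
from collections import Counter


def unigram_entropy_bits(text):
    """Shannon entropy in bits per symbol, from single-symbol frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample; any text or symbol stream could be used instead.
sample = "voice is just another information source"
print(round(unigram_entropy_bits(sample), 3), "bits per symbol")
```

A source that repeats one symbol scores 0 bits, while four equally likely symbols score 2 bits. The gap between such a measured value and the maximum possible for the alphabet is the redundancy that compression removes and that noisy channels can exploit.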
Unfortunately, our computers and our AI are not yet at the stage where they can compensate for our lack of understanding of key elements of sound perception and audio signals. While AI is helping us make progress in the understanding and virtualization of acoustics, what remains to be understood remains to be described. We can ask our computers to help us measure more and faster, but they cannot tell us what we still don't know we should measure, or how. Maybe our language interfaces will help us when we ask our computers for assistance in the process, but we remain in a conceptual deadlock regarding those key elements. We might get there faster when it comes to the evolution of audio technology, which deals with electric signals, than with sound and human perception.