Defeating Hidden Audio Channel Attacks on Voice Assistants via Audio-Induced Surface Vibrations
Voice access technologies are widely adopted in mobile devices and
voice assistant systems as a convenient way of user interaction. Recent
studies have demonstrated a potentially serious vulnerability of
the existing voice interfaces on these systems to “hidden voice commands”.
This attack uses synthetically rendered adversarial sounds
embedded within a voice command to trick the speech recognition
process into executing malicious commands, without being noticed
by legitimate users.
In this paper, we use low-cost motion sensors in a novel way
to detect these hidden voice commands. In particular, our proposed
system extracts and examines the unique audio signatures
of the issued voice commands in the vibration domain. We show
that the signatures of normal commands and of synthetic hidden voice
commands are distinct, which enables detection of the attacks.
The proposed system, which benefits from a speaker-motion sensor
setup, can be easily deployed on smartphones by reusing existing
on-board motion sensors or utilizing a cloud service that provides
the relevant setup environment. The system is based on the premise
that while the crafted audio features of the hidden voice commands
may fool an authentication system in the audio domain, their unique
audio-induced surface vibrations captured by the motion sensor are
hard to forge. Our proposed system therefore raises the bar for
the attacker, who must now forge the acoustic features in the
audio and vibration domains simultaneously. We extract the time
and frequency domain statistical features, and the acoustic features
(e.g., chroma vectors and MFCCs) from the motion sensor data and
use learning-based methods to distinguish normal commands from
hidden voice commands. The results show that our
system can tell hidden voice commands apart from normal commands
with 99.9% accuracy using only low-cost motion sensors
that have very low sampling frequencies.
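
The feature-extraction step described above can be illustrated with a
minimal sketch. It is not the authors' implementation: the abstract does
not list the exact statistics used, so the function names
(time_domain_features, frequency_domain_features), the chosen statistics,
and the 200 Hz sampling rate in the usage example are all assumptions made
for illustration. Chroma vectors and MFCCs are omitted. The sketch uses
only the Python standard library and a direct DFT, which is adequate for
short, low-rate motion sensor traces.

```python
import cmath
import math
import statistics


def time_domain_features(samples):
    """Basic time-domain statistics of one motion sensor axis (illustrative choice)."""
    squares = [x * x for x in samples]
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
        "rms": math.sqrt(statistics.fmean(squares)),
    }


def magnitude_spectrum(samples, sample_rate_hz):
    """One-sided DFT magnitudes of the mean-removed signal, computed directly (O(n^2))."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        freqs.append(k * sample_rate_hz / n)
        mags.append(abs(acc))
    return freqs, mags


def frequency_domain_features(samples, sample_rate_hz):
    """Basic frequency-domain statistics (illustrative choice, hypothetical names)."""
    freqs, mags = magnitude_spectrum(samples, sample_rate_hz)
    total = sum(mags)
    if total == 0:
        # A constant signal has no spectral content after mean removal.
        return {"dominant_hz": 0.0, "spectral_centroid_hz": 0.0, "spectral_energy": 0.0}
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return {
        "dominant_hz": freqs[peak_bin],
        "spectral_centroid_hz": sum(f * m for f, m in zip(freqs, mags)) / total,
        "spectral_energy": sum(m * m for m in mags) / len(samples),
    }


if __name__ == "__main__":
    # Assumed example: one second of a 20 Hz vibration sampled at 200 Hz.
    rate_hz = 200
    trace = [math.sin(2 * math.pi * 20 * i / rate_hz) for i in range(rate_hz)]
    features = {
        **time_domain_features(trace),
        **frequency_domain_features(trace, rate_hz),
    }
    print(features)
```

In a full pipeline, a feature vector like this would be computed per
command and passed to a learned classifier; the abstract does not say which
classifier is used, so that step is left out here.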