Compare commits

1 Commits

Author SHA1 Message Date
2971b8154f add thesis extracts and pandoc latex conversion 2025-02-20 12:46:57 +01:00
4 changed files with 889 additions and 1334 deletions


@@ -1,2 +0,0 @@
# varys thesis
View the thesis: [Master Thesis.pdf](<./Master Thesis.pdf>)

main.typ (1376 changed lines)

File diff suppressed because it is too large


@@ -29,22 +29,9 @@
set heading(numbering: "1.1.")
set figure(gap: 1em)
show figure.caption: emph
set table(align: left)
// Set run-in subheadings, starting at level 4.
show heading: it => {
if it.level > 3 {
parbreak()
text(11pt, style: "italic", weight: "regular", it.body + ".")
} else if it.level == 1 {
text(size: 20pt, it)
} else {
it
}
}
set par(leading: 0.75em)
// Title page.
@@ -120,11 +107,6 @@
v(1.618fr)
pagebreak()
// Table of contents.
show outline.entry.where(level: 1): it => text(weight: "bold", it)
outline(depth: 3, indent: true, fill: text(weight: "regular", repeat[.]))
pagebreak()
// Main body.
set par(justify: true)

thesis.tex (new file, 823 lines)

@@ -0,0 +1,823 @@
\documentclass{article}
\usepackage{amsmath}
\usepackage{hyperref}
\title{A Testbed for Voice Assistant Traffic Fingerprinting\\
Master Thesis}
\date{February 29, 2024}
\begin{document}
\maketitle
\section{Introduction}\label{introduction}
Smart speakers have become ubiquitous in many countries. Global sales
of Amazon Alexa devices surpassed half a billion in 2023
{[}Amazon\_2023{]} and Apple sells over \(10\) million HomePods every
year {[}Statista\_Apple\_2023{]}. A study shows that \(35\%\) of adults
in the U.S. own a smart speaker {[}NPR\_2022{]}. With half of the
participants in that study reporting that they have heard an advertisement on
their smart speaker before, concerns about the privacy and security of
these ever-listening devices are well founded, especially since they are
usually placed where private or confidential conversations take place.
To back claims about loss of privacy or decreased security, a thorough
analysis of smart speakers becomes necessary. In this thesis, we build a
system that allows us to collect a large amount of data on interactions
with voice assistants.
\subsection{Previous Work}
There are several previous works analysing the privacy and security
risks of voice assistants. Most of them focus specifically on Amazon
Alexa, since its adoption rate is significantly higher than that of
Apple's Siri or Google Assistant {[}Bolton\_2021{]}. Some examples
include experiments on smart speaker misactivations
{[}Ford2019{]}{[}Dubois\_2020{]}{[}schonherr2022101328{]}, analyses of
voice assistant skills
{[}Natatsuka\_2019{]}{[}Shezan\_2020{]}{[}255322{]}{[}Edu\_2022{]}{[}Xie\_2022{]}
and voice command fingerprinting {[}8802686{]}{[}Wang\_2020aa{]}. Apart
from some old work on traffic patterns of Siri {[}caviglione\_2013{]}
and research by Apple {[}Apple\_2023{]}{[}Apple\_2018{]}, we found
little information about that voice assistant.
\subsection{Thesis Overview}\label{overview}
The primary bottleneck when collecting large amounts of data on voice
assistants is the actual time speaking and waiting for the response.
According to our tests, the majority of interactions take between \(4\)
and \(8\) seconds plus any additional delay between the question and the
response. This aligns with the results by {[}Haas\_2022{]}, who found a
median of \(4.03\)s per typical sentence. With the goal of collecting a
large number of samples for a list of voice commands (henceforth called
``queries''), the system will therefore need to run for a substantial
amount of time.
With this in mind, the primary aim of this thesis is building a testbed
that can autonomously interact with a smart speaker. This system should
take a list of queries and continuously go through it while collecting
audio recordings, network packet traces, timestamps and other relevant
metadata about the interaction. In
\hyperref[background]{{[}background{]}}, we establish why it is of
interest to collect this data and the implementation of our system is
explored in \hyperref[system-design]{{[}system-design{]}}.
For the second part, documented in Sections
\hyperref[experiment]{{[}experiment{]}} and
\hyperref[results]{{[}results{]}}, we run an experiment using our system, collecting
interactions with an Apple HomePod mini\footnote{\url{https://apple.com/homepod-mini}}
smart speaker. With this dataset of interactions, we train an ML model
for traffic fingerprinting. The system is adapted from a similar
implementation by {[}Wang\_2020aa{]} but will be applied to data
collected from the Siri voice assistant instead of Alexa and Google
Assistant.
\section{Background}\label{background}
Having outlined our system for autonomously collecting interactions with
smart speakers, we now explain the reasoning behind
collecting that data.
{[}Papadogiannaki\_2021{]} write in their survey ``In the second
category \emph{interception of encrypted traffic,} we encounter
solutions that take advantage of behavioural patterns or signatures to
report a specific event.'' Our main focus lies on the encrypted
network traffic being collected. This traffic cannot be decrypted;
however, we can analyse it using a technique called traffic
fingerprinting.
\subsection{Traffic Fingerprinting}
Traffic fingerprinting uses pattern recognition approaches on collected,
encrypted traffic traces to infer user behaviour. This method is used
successfully to analyse web traffic {[}Wang\_2020{]}, traffic on the Tor
network {[}Oh\_2017{]}, IoT device traffic
{[}8802686{]}{[}Wang\_2020aa{]}{[}Mazhar\_2020aa{]}{[}Zuo\_2019{]}{[}Trimananda\_2019aa{]}
and user location {[}Ateniese\_2015aa{]}. Traffic fingerprinting on
smart speakers is generally used to infer the command given to the voice
assistant by the user. To enable us to run traffic fingerprinting, our
system stores the query and the network traffic for each interaction it
has with the smart speaker.
We adopt the concepts of \emph{open world} and \emph{closed world}
models from previous traffic fingerprinting works
{[}Wang\_2020aa{]}{[}Wang\_2020{]}. Both models define an attacker who
has a set of queries they are interested in. In the open world
scenario, the attacker wants to know whether or not a given command is
in that set; the traffic analysed in this case can come from either
known queries or queries not seen before. In the closed world scenario, the attacker
wants to differentiate between the different queries in that set, but
cannot tell whether a new query is in it or not. As an example, the open
world model could be used to infer whether the smart speaker is being
used to communicate with a person that an attacker is monitoring. In the
closed world case, the attacker might want to differentiate between
different people a user has contacted. In our thesis, we focus on the
closed world model.
\subsection{Threat Model}\label{threat-model}
Our threat model is similar to the ones introduced by {[}Wang\_2020aa{]}
and {[}8802686{]}, specifically,
\begin{itemize}
\item
the attacker has access to the network the smart speaker is connected
to;
\item
the MAC address of the smart speaker is known;
\item
the type of the voice assistant is known (e.g. Siri, Alexa or Google
Assistant).
\end{itemize}
Previous work {[}Mazhar\_2020aa{]}{[}Trimananda\_2019aa{]} shows that it
is feasible to extract this information from the encrypted network
traffic. DNS queries, which are typically unencrypted, can also be used
to identify smart speakers on a network. Our own analysis shows that the Apple
HomePod regularly sends DNS queries asking about the domain
\texttt{gsp-ssl.ls-apple.com.akadns.net} and {[}Wang\_2020aa{]} found
that the Amazon Echo smart speaker sends DNS queries about
\texttt{unagi-na.amazon.com}. The specific domains queried may differ
regionally (as indicated by the \texttt{-na} part in the Amazon
subdomain), but should be simple to obtain.
Finally, we assume the traffic is encrypted, since the attack would be
trivial otherwise.
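As an illustration of this DNS-based identification, the following
sketch (in Python with the scapy library; not part of varys, and the
MAC address and file name are placeholders) counts the DNS query names
sent from a known smart speaker MAC in a captured pcap file:
\begin{verbatim}
# Sketch: list DNS query names sent from a known smart speaker MAC.
# The MAC address and pcap file name are placeholders.
from collections import Counter

from scapy.all import DNSQR, Ether, rdpcap

SPEAKER_MAC = "aa:bb:cc:dd:ee:ff"  # hypothetical HomePod MAC address

queries = Counter()
for packet in rdpcap("capture.pcap"):
    if packet.haslayer(Ether) and packet.haslayer(DNSQR):
        if packet[Ether].src.lower() == SPEAKER_MAC:
            queries[packet[DNSQR].qname.decode()] += 1

for name, count in queries.most_common(5):
    print(f"{count:6d}  {name}")
\end{verbatim}
Domains such as \texttt{gsp-ssl.ls-apple.com.akadns.net} appearing near
the top of this list would point to a HomePod.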
\subsection{Voice Assistant Choice}
The most popular voice assistants are Apple's Siri, Amazon Alexa and
Google Assistant {[}Bolton\_2021{]}. Support for Microsoft Cortana was
dropped in August of 2023 {[}Microsoft\_Cortana\_2023{]}, likely in
favour of Copilot for Windows, which was announced one month later
{[}Microsoft\_Copilot\_2023{]} (for more on this see
\hyperref[vas-based-on-llms]{{[}vas-based-on-llms{]}}).
Our system supports the Siri and Alexa voice assistants but can be
easily extended with support for more. For the experiment in
\hyperref[experiment]{{[}experiment{]}}, we decided to use Siri as it is
still less researched {[}Bolton\_2021{]} than Alexa, where a host of
security and privacy issues have already been found
{[}Wang\_2020aa{]}{[}8802686{]}{[}Ford2019{]}{[}Edu\_2022{]}{[}Dubois\_2020{]}.
\section{System Design}\label{system-design}
The primary product of this thesis is a testbed that can autonomously
interact with a smart speaker for a long time. Our system, called varys,
collects the following data on its interactions with the voice
assistant:
\begin{itemize}
\item
Audio recordings of the query and the response
\item
A text transcript of the query and the response
\item
Encrypted network traffic traces from and to the smart speaker
\item
The durations of different parts of the interaction
\end{itemize}
\subsection{Modularisation}
The wide range of data our system needs to collect makes it necessary to
keep its complexity manageable. To simplify maintenance and increase the
extensibility, the functionality of our system is separated into five
modules:
\begin{description}
\item[varys]
The main executable combining all modules into the final system.
\item[{\texttt{varys-analysis}}]
Analysis of data collected by varys.
\item[{\texttt{varys-audio}}]
Recording audio and the TTS and STT systems.
\item[{\texttt{varys-database}}]
Abstraction of the database system where interactions are stored.
\item[{\texttt{varys-network}}]
Collection of network traffic, writing and parsing of pcap files.
\end{description}
The modules \texttt{varys-audio},
\texttt{varys-database} and \texttt{varys-network}
are isolated from the rest of varys. If the need arises (e.g. if a
different database connector is required or a more performant TTS system
is found), they can be easily swapped out.
The module \texttt{varys-analysis} does not currently do any
analysis of the recorded audio and thus does not depend on \texttt{varys-audio}.
In this section, we will lay out the design considerations behind the
modules that make up varys.
\subsection{Network}\label{network}
The system runs on a device that serves as a MITM between the Internet
and an \emph{internal network}, for which it acts as a router. The
latter is reserved for the devices that are monitored. This includes the
voice assistant and any devices required by it. Since the majority of
smart speakers do not have any wired connectivity, the \emph{internal
network} is wireless.
This layout enables the MITM device to record all incoming and outgoing
traffic to and from the voice assistant without any interference from
other devices on the network. (We filter the network traffic by the MAC
address of the smart speaker so this is not strictly necessary, but it
helps to reduce the amount of garbage data collected.)
Currently, we assume that the beginning and end of an interaction with
the voice assistant are known and that we can collect the network capture from
when the user started speaking to when the voice assistant stopped
speaking. We discuss this restriction in more detail in
\hyperref[traffic-trace-bounds]{{[}traffic-trace-bounds{]}}.
The \texttt{varys-network} module provides functionality to
start and stop capturing traffic on a specific network interface and
write it to a pcap file. Furthermore, it offers a wrapper structure
around the network packets read from those files, which is used by
\texttt{varys-analysis}. Each smart speaker has a different
MAC address that is used to distinguish between incoming and outgoing
traffic. The MAC address is therefore also stored in the database
alongside the path to the pcap file.
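As a rough illustration of this capture step (a Python sketch using
scapy; the actual \texttt{varys-network} implementation may use a
different capture library, and the interface name, MAC address and
duration are placeholders), traffic can be restricted to the smart
speaker with a BPF filter on its MAC address and written to a pcap file:
\begin{verbatim}
# Sketch: capture traffic to and from one MAC address and write it to
# a pcap file. Interface, MAC address and duration are placeholders.
from scapy.all import sniff, wrpcap

SPEAKER_MAC = "aa:bb:cc:dd:ee:ff"

packets = sniff(
    iface="en0",
    filter=f"ether host {SPEAKER_MAC}",  # BPF filter on the speaker MAC
    timeout=30,  # capture window covering one interaction
)
wrpcap("interaction.pcap", packets)
\end{verbatim}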
\subsubsection{Monitoring}\label{monitoring}
Since the system is running for long periods of time without
supervision, we need to monitor potential outages. We added support to
varys for \emph{passive} monitoring. A high-uptime server receives a
request from the system each time it completes an interaction. That
server in turn notifies us if it has not received anything for a
configurable amount of time.
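The client side of this passive monitoring can be as simple as a
heartbeat request after every completed interaction; the sketch below
uses the Python standard library and a placeholder monitoring URL (the
real endpoint and protocol are not specified here):
\begin{verbatim}
# Sketch: send a heartbeat after each completed interaction.
# The monitoring URL is a placeholder; errors are ignored so that
# monitoring problems never interrupt data collection.
import urllib.request


def send_heartbeat(url="https://monitor.example.org/ping"):
    try:
        urllib.request.urlopen(url, timeout=5)
    except OSError:
        pass  # the server alerts us once heartbeats stop arriving
\end{verbatim}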
\subsection{Audio}
The largest module of our system is \texttt{varys-audio}. Its
main components are the TTS and STT systems, expanded upon in the next
two subsections. Apart from that, it stores the recorded audio files
encoded in the space-optimised Opus {[}rfc6716{]} format and handles all
audio processing: converting the audio to mono, downsampling it for space
efficiency and trimming the silence from the beginning and end.
\subsubsection{Text-to-Speech}\label{text-to-speech}
The only interface that a voice assistant provides is via speaking to
it. Therefore, we need a system to autonomously talk, for which there
are two options:
\begin{itemize}
\item
Prerecorded audio
\item
Speech synthesis
\end{itemize}
The former was dismissed early for several reasons. Firstly, audio
recordings would not support conversations that last for more than one
query and one response. Secondly, each change in our dataset would
necessitate recording new queries. Lastly, given the scope of this
thesis, recording and editing more than 100 sentences was not feasible.
Thus, we decided to integrate a TTS system into our
\texttt{varys-audio} module. This system must provide high-quality
synthesis to minimise speech recognition mistakes on the smart
speaker, and be able to generate speech without any noticeable
performance overhead. The latter is typically measured by the real-time
factor (rtf), with \(\text{rtf} > 1\) if the system takes longer to synthesise text than
to speak that text and \(\text{rtf} \leq 1\) otherwise. A good
real-time factor is especially important if our system is to support
conversations in the future; the voice assistant only listens for a
certain amount of time before cancelling an interaction.
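Concretely, the real-time factor can be measured as the synthesis time
divided by the duration of the produced audio. A minimal sketch (the
\texttt{synthesise} function and the sample rate are placeholders):
\begin{verbatim}
# Sketch: measure the real-time factor (rtf) of a TTS system.
# `synthesise` is a placeholder returning mono samples at `sample_rate`.
import time


def real_time_factor(synthesise, text, sample_rate=22050):
    start = time.perf_counter()
    samples = synthesise(text)  # list or array of audio samples
    synthesis_time = time.perf_counter() - start
    audio_duration = len(samples) / sample_rate
    return synthesis_time / audio_duration  # rtf <= 1: faster than real time
\end{verbatim}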
We explored three different options with \(\text{rtf } \leq 1\) on the
machine we tested on:
\begin{itemize}
\item
espeak\footnote{\url{https://espeak.sourceforge.net}}
\item
Larynx\footnote{\url{https://github.com/rhasspy/larynx}}
\item
\texttt{tts-rs} using the \texttt{AVFoundation} backend\footnote{\url{https://github.com/ndarilek/tts-rs}
-- the \texttt{AVFoundation} backend is only supported on macOS.}
\end{itemize}
We primarily based our choice on how well Amazon Alexa\footnote{We
picked Alexa (over Siri) for these tests since its speech recognition
performed worse in virtually every case.} would be able to understand
different sentences. A test was performed manually by asking each of
three questions 20 times with every TTS system. As a ground truth, the
same was repeated verbally. The results show that \texttt{tts-rs} with
the \texttt{AVFoundation} backend performed approximately \(20\%\)
better on average than both Larynx and espeak and came within \(15\%\)
of the ground truth.
While performing this test, we could easily hear the difference in
quality, as only the \texttt{AVFoundation} voices come close to sounding
like a real voice. These were, apart from the ground truth, also the
only voices that could be differentiated from each other by the voice
assistants. The recognition of user voices worked well, while the guest
voice was wrongly recognised as one of the others \(65\%\) of the time.
With these results, we decided to use \texttt{tts-rs} on a macOS machine
to get access to the \texttt{AVFoundation} backend.
Since we ran this test, Larynx has been succeeded by Piper\footnote{\url{https://github.com/rhasspy/piper}}
during the development of our system. The new tool promises good
results and may be an option for running varys on non-macOS systems in
the future.
\subsubsection{Speech-to-Text}\label{speech-to-text}
Not all voice assistants provide a history of interactions, which is why
we use an STT system to transcribe the recorded answers.
In our module \texttt{varys-audio}, we utilise an ML-based
speech recognition system called whisper. It provides transcription
error rates comparable to those of human transcription
{[}Radford\_2022aa{]} and outperforms other local speech recognition
models {[}Seagraves\_2022{]} and commercial services
{[}Radford\_2022aa{]}. In
\hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}, we
explore how well whisper worked for our use case.
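As an illustration (not necessarily the interface used by varys),
transcribing a recorded response with the \texttt{openai-whisper}
Python package looks roughly like this; the model size and file name
are placeholders:
\begin{verbatim}
# Sketch: transcribe a recorded response with the openai-whisper package.
# The model size and file name are placeholders.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("response.wav", language="en")
print(result["text"])
\end{verbatim}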
\subsection{Data Storage}
The module \texttt{varys-database} acts as an abstraction of
our database, where all metadata about interactions is stored. Audio
files and pcap traces are saved to disk by
\texttt{varys-audio} and \texttt{varys-network}
respectively and their paths are referenced in the database.
Since some changes were made to the system while it was running, we
store the version of varys a session was run with. This way, if there
are any uncertainties about the collected data, the relevant code
history and database migrations can be inspected.
To prevent data corruption if the project is moved, all paths are
relative to a main \texttt{data} directory.
\subsection{Interaction Process}
The main executable of varys combines all modules introduced in this
section to run a series of interactions. When the system is run, it
loads the queries from a TOML file. Until it is stopped, the system then
runs interaction sessions, each with a shuffled version of the original
query list.
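As an illustration of this session loop (a Python sketch; the TOML key
name and the two callbacks are assumptions, not the actual varys
configuration format):
\begin{verbatim}
# Sketch: load queries from a TOML file and run shuffled sessions.
# The key name "queries" and the two callbacks are assumptions.
import random
import tomllib  # Python 3.11+


def run_sessions(path, run_interaction, should_stop):
    with open(path, "rb") as file:
        queries = tomllib.load(file)["queries"]
    while not should_stop():
        session = random.sample(queries, k=len(queries))  # shuffled copy
        for query in session:
            run_interaction(query)
\end{verbatim}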
Resetting the voice assistant is implemented differently for each one
supported.
\subsection{Performance Considerations}\label{performance}
The initial design of our system intended for it to be run on a
Raspberry Pi, a small form factor computer with limited processing and
storage capabilities. Mostly because we did not find a TTS system that
ran with satisfactory performance on the Raspberry Pi, we now use a
laptop computer.
\subsubsection{Data Compression}
Because of the initial plan of using a Raspberry Pi, we compress the
recorded audio with the Opus {[}rfc6716{]} codec and the captured
network traffic with DEFLATE {[}rfc1951{]}. However, we are now running
the experiment on a laptop, where space is no longer a concern. The
audio codec does not add a noticeable processing overhead, but the pcap
files are now stored uncompressed. Moreover, storing the network traces
uncompressed simplifies our data analysis since they can be used without
any preprocessing.
\subsubsection{Audio Transcription}
Apart from the actual time speaking (as mentioned in
\hyperref[overview]{{[}overview{]}}), the transcription of audio is the
main bottleneck for how frequently our system can run interactions. The
following optimisations allowed us to reduce the average time per
interaction from \(\sim 120\)s to \(\sim 30\)s:
\paragraph{Model Choice}
There are five different sizes available for the whisper model. The step
from the \texttt{medium} to the \texttt{large} and \texttt{large-v2}
models does not significantly decrease the word error rate but more than
doubles the number of parameters. Since, in our tests, this increase in
network size far more than doubles processing time, we opted for the
\texttt{medium} model.
\paragraph{Transcription in Parallel}
The bottleneck of waiting for whisper to transcribe the response could
be completely eliminated with a transcription queue that is processed in
parallel to the main system. However, if the system is run for weeks at
a time, this can lead to arbitrarily long post-processing times if the
transcription takes longer on average than an interaction.
Our solution is to process the previous response while the next
interaction is already running, but to delay further interactions
until the transcription queue (of length 1) is empty again. This
approach allows us to reduce the time per interaction to
\(\max(T_{t},T_{i})\), rather than \(T_{t} + T_{i}\), where \(T_{t}\)
represents the time taken for transcription and \(T_{i}\) the time taken
for the rest of the interaction.
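A minimal sketch of this bounded pipeline (the \texttt{interact} and
\texttt{transcribe} functions are placeholders; the actual varys
implementation may differ):
\begin{verbatim}
# Sketch: overlap the transcription of the previous response with the
# next interaction while allowing at most one pending transcription.
from concurrent.futures import ThreadPoolExecutor


def run(interact, transcribe, n_interactions):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # at most one transcription in flight
        for _ in range(n_interactions):
            audio = interact()  # runs while `pending` is transcribed
            if pending is not None:
                pending.result()  # wait for the previous transcription
            pending = pool.submit(transcribe, audio)
        if pending is not None:
            pending.result()
\end{verbatim}
Each iteration therefore takes roughly \(\max(T_{t},T_{i})\) instead of
\(T_{t} + T_{i}\).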
\paragraph{Audio Preprocessing}
Another step we took to reduce the transcription time is to reduce the
audio sample rate to \(16\) kHz, average both channels into one and trim
the silence from the start and the end. This preprocessing does not
noticeably reduce the quality of the transcription. Moreover, cutting
off silent audio eliminates virtually all whisper hallucinations (see
\hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}).
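A sketch of this preprocessing step in Python (librosa and soundfile
are illustrative choices, not necessarily the libraries used by varys;
the file names and the silence threshold are placeholders):
\begin{verbatim}
# Sketch: downsample to 16 kHz mono and trim leading/trailing silence.
import librosa
import soundfile as sf

audio, sample_rate = librosa.load("response.wav", sr=16000, mono=True)
trimmed, _ = librosa.effects.trim(audio, top_db=30)  # cut silence at ends
sf.write("response_preprocessed.wav", trimmed, sample_rate)
\end{verbatim}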
\section{Experiment}\label{experiment}
To test our data collection system, we conducted several experiments
where we let varys interact with an Apple HomePod mini.
Three different lists of queries were tested:
\begin{itemize}
\item
A large dataset with \(227\) queries, split into \(24\) categories
\item
A small dataset with \(13\) queries, each from a different category
\item
A binary dataset with the two queries ``Call John Doe'' and ``Call
Mary Poppins''
\end{itemize}
Apple does not release an official list of Siri's capabilities. A
dataset of \(\sim 800\) Siri commands was originally compiled from a
user-collected list {[}Hey\_Siri\_Commands{]} and extended with queries
from our own exploration. The HomePod does not support all Siri
queries\footnote{Any queries that require Siri to show something to the
user will be answered with something along the lines of \emph{``I can
show you the results on your iPhone.''}}. Therefore, during an initial
test period, we manually removed commands from the list that did not
result in a useful response. The number of redundant commands was
further reduced to increase the number of samples that can be collected
per query. Having fewer than 256 labels has the additional advantage
that each query label fits in one byte.
Further analysis of our results can be found in
\hyperref[results]{{[}results{]}}. In this section we detail the
experiment setup and our traffic fingerprinting system.
\subsection{Setup}
The experiment setup consists of a laptop running macOS\footnote{The
choice of using macOS voices is explained in
\hyperref[text-to-speech]{{[}text-to-speech{]}}.}, connected to a pair
of speakers and a microphone. The microphone, together with the HomePod, is
placed inside a sound-isolated box. Additionally, the user's iPhone is
required to be on the \emph{internal network} (see
\hyperref[network]{{[}network{]}}) for Siri to be able to respond to
personal requests.
The iPhone is set up with a dummy user account for \emph{Peter Pan} and
two contacts, \emph{John Doe} and \emph{Mary Poppins}. Otherwise, the
phone is kept at factory settings and the only changes come from
interactions with the HomePod.
Finally, to minimise system downtime, an external server is used for
monitoring as explained in \hyperref[monitoring]{{[}monitoring{]}}.
The experiment must be set up in a location with the following
characteristics:
\begin{itemize}
\item
Wired Internet connection\footnote{A wireless connection is not
possible, since macOS does not support connecting to WiFi and
creating a wireless network at the same time. If no wired connection
is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to
its Ethernet port) can be used to connect the Mac to the internet.}
\item
Power outlet available
\item
No persons speaking (the quieter the environment, the higher the voice
recognition accuracy)
\item
Noise disturbance from experiment not bothering anyone
\item
Access to room during working hours (or if possible at any time)
\end{itemize}
The sound isolation of the box can reduce ambient noise in the range of
\(50 - 60\) dB to a significantly lower \(35 - 45\) dB, which increases
the voice recognition and transcription accuracy. This also helps with
finding an experiment location, since the noise disturbance is less
significant.
When the hardware is set up, the command
\texttt{varys\ listen\ -\/-calibrate} is used to find the ambient noise
volume.
\subsection{Data Analysis}
We began data analysis by building a simple tool that takes a number of
traffic traces and visualises them. Red and blue represent incoming and
outgoing packets respectively, and the strength of the colour encodes
the packet size. The \(x\)-axis does not necessarily correspond with
time, but with the number of incoming and outgoing packets since the
beginning of the trace. This method of analysing traffic traces aligns
with previous work on traffic fingerprinting {[}Wang\_2020aa{]}.
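A minimal version of such a visualisation (a Python sketch with
matplotlib; the assumption that direction \(0\) marks incoming packets
is ours):
\begin{verbatim}
# Sketch: draw one traffic trace as a row of coloured cells, where the
# colour encodes the direction and its strength the packet size.
import numpy as np
import matplotlib.pyplot as plt


def plot_trace(sizes, directions, max_size=1514):
    # assume direction 0 = incoming (red), 1 = outgoing (blue)
    signs = np.where(np.array(directions) == 0, 1.0, -1.0)
    values = signs * np.array(sizes) / max_size
    plt.imshow(values[np.newaxis, :], cmap="bwr", vmin=-1, vmax=1,
               aspect="auto")
    plt.xlabel("packet index")
    plt.yticks([])
    plt.show()
\end{verbatim}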
\subsubsection{Traffic Preprocessing}
We write a network traffic trace as
\(\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right)\),
where \(n\) is the number of packets in that trace, \(s_{i}\) is the
size of a packet in bytes and \(d_{i} \in \left\{ 0,1 \right\}\) is the
direction of that packet.
To be used as input to our ML network, the traffic traces were
preprocessed into normalised tensors of length \(475\). Our collected
network traffic is converted as
\[\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right) \rightarrow \left( p_{1},p_{2},\ldots,p_{475} \right),\]
where for all \(i \in \lbrack 1,475\rbrack\) the components are
calculated as
\[p_{i} = \begin{cases}
( - 1)^{d_{i}} \cdot \frac{s_{i}}{1514} & \text{if } i \leq n, \\
0 & \text{else}.
\end{cases}\]
This pads traces shorter than \(475\) packets with zeroes, cuts off
the ones with more and normalises all traces into the range
\(\lbrack - 1,1\rbrack\)\footnote{The maximum size of the packets is
\(1514\) bytes, which is why we can simply divide by this to normalise
our data.}.
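In code, this preprocessing amounts to the following (a Python sketch;
the varys implementation may differ in details):
\begin{verbatim}
# Sketch: convert a trace of (size, direction) pairs into a normalised
# tensor of length 475, as defined above.
import numpy as np

MAX_SIZE = 1514    # maximum packet size in bytes
TRACE_LENGTH = 475


def preprocess(trace):
    tensor = np.zeros(TRACE_LENGTH, dtype=np.float32)
    for i, (size, direction) in enumerate(trace[:TRACE_LENGTH]):
        tensor[i] = (-1) ** direction * size / MAX_SIZE
    return tensor
\end{verbatim}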
\subsubsection{Adapting Network from Wang et
al.}\label{adapting-network}
Since there is similar previous work on traffic fingerprinting Amazon
Alexa and Google Assistant interactions by {[}Wang\_2020aa{]}, we based
our network on their CNN.
In addition to a CNN, they also tested an LSTM, an SAE and a combination
of the networks in the form of ensemble learning. However, their best
results for a single network came from their CNN, so we decided to use
that.
Owing to limited computing resources, we reduced the size of our
network. The network of {[}Wang\_2020aa{]} used a pool size of \(1\),
which is equivalent to a \emph{no-op}. Thus, we removed all \emph{Max
Pooling} layers. Additionally, we reduced the four groups of
convolutional layers to one.
Since the hyperparameter search performed by {[}Wang\_2020aa{]} found
\(180\) to be the optimal dense network size when that was the maximum
in their search space \(\left\{ 100,110,\ldots,170,180 \right\}\), we
can assume that the true optimal size is higher than \(180\). In our
case we used \(475\) -- the size of our input.
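The reduced architecture can be sketched as follows (in PyTorch; the
layer widths and kernel sizes are illustrative and not the exact
hyperparameters used for our experiments):
\begin{verbatim}
# Sketch of the reduced CNN: one convolutional group, no pooling layers
# and a dense layer of size 475. Layer sizes are illustrative only.
import torch
import torch.nn as nn


class TraceCNN(nn.Module):
    def __init__(self, n_classes, trace_length=475):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * trace_length, 475),  # dense size = input length
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(475, n_classes),
        )

    def forward(self, x):
        # x has shape (batch, 1, trace_length)
        return self.classifier(self.features(x))
\end{verbatim}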
To split our dataset into separate training, validation and testing
parts, we used the same proportions as {[}Wang\_2020aa{]}: of the full
dataset, \(64\%\) was used for training, \(16\%\) for validation and
\(20\%\) for testing.
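This split can be reproduced with two calls to scikit-learn's
\texttt{train\_test\_split} (a sketch; \texttt{traces} and
\texttt{labels} stand for the preprocessed tensors and their query
labels):
\begin{verbatim}
# Sketch: 64/16/20 split into training, validation and test sets
# (20% test, then 20% of the remainder, i.e. 16% overall, for validation).
from sklearn.model_selection import train_test_split

x_rest, x_test, y_rest, y_test = train_test_split(
    traces, labels, test_size=0.20, random_state=0
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.20, random_state=0
)
\end{verbatim}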
\section{Results}\label{results}
Our system ran for approximately \(800\) hours and collected data on a
total of \(73\prime 605\) interactions. The collected data is available
at \url{https://gitlab.com/m-vz/varys-data}.
In this section we will go over our results, beginning with the
statistics of our collected dataset.
\subsection{Statistics}\label{statistics}
Of the \(73\prime 605\) queries collected, \(71\prime 915\) were
successfully completed (meaning there was no timeout and the system was
not stopped during the interaction). Our system varys is therefore able
to interact with Siri with a \(97.70\%\) success rate.
As mentioned in \hyperref[overview]{{[}overview{]}}, there are
approximately \(4\) to \(8\) seconds of speaking during most
interactions. The remaining duration (measured as the total duration
minus the speaking duration) lies mostly between \(25\) and \(30\)
seconds. This includes the time spent waiting for the voice assistant to
answer and transcribing the response. It did not increase even for
outliers with a significantly longer speaking duration.
On average, a query took \(2.47\) seconds (\(\pm 0.64\)) to say, which
is expected since all query sentences are of similar length. The
responses lasted on average \(4.40\) seconds with a standard deviation
of \(5.63\)s.
During the first week, there were two bugs that resulted in several
downtimes (the longest of which lasted \(\sim 42\)h because we could not
get physical access to the system during the weekend). After that, the
system ran flawlessly except during one network outage at the university
and during maintenance.
\subsection{Transcription Accuracy}\label{transcription-accuracy}
Compared to the results by {[}Radford\_2022aa{]} and
{[}Seagraves\_2022{]}, our own results show an even higher transcription
accuracy apart from errors in URLs, certain names and slight formatting
differences. This is likely due to the audio coming from a synthetic
voice in a controlled environment.
For the query \emph{``How far away is Boston?''}, the transcription was
missing \emph{``as the crow flies''} from the correct response \emph{``{[}\ldots{]} it's
about 5,944 kilometers as the crow flies.''} about \(36\%\) of the time.
Together with the correctly transcribed cases, this makes up \(98.51\%\)
of this query's data. In the response to \emph{``What is the factorial
of 6?''}, Siri provided the source
\href{https://solumaths.com}{solumaths.com}, which is text that whisper
was unable to produce\footnote{In 1823 cases the name \emph{solumaths}
was recognised as either \emph{solomus}, \emph{solomoths},
\emph{solomons}, \emph{solomuths}, \emph{soloomiths}, \emph{salumith}
or \emph{solemnus}, but never correctly.}. Since mistakes in URLs are
expected, we marked those transcriptions as correct.
There are some rare cases where the voice assistant responds with a
single word. Since whisper does not support recognition of audio shorter
than one second, these responses have an exceedingly high failure rate.
One example is the query \emph{``Flip a coin''}, whose responses were only
transcribed correctly \(\sim 2\%\) of the time. Out of the \(140\)
incorrectly transcribed responses, \(134\) were cancelled due to the
recorded audio being too short.
Another problem we identified with whisper is its tendency to
hallucinate text when fed silent audio (one or more seconds of
\texttt{0}-samples). This can be prevented by simply trimming off
all silent audio in the preprocessing step before transcription.
\subsection{Traffic Fingerprinting}
With our implementation of a fingerprinting CNN, we were able to get
promising results. Our network was trained on the binary, small and
large datasets for 1000 epochs each. The accuracy axis of the training
progress is bounded by the interval
\(\left\lbrack \frac{100}{n},100 \right\rbrack\), where \(n\) is the
number of labels. The bottom of the accuracy axis is therefore
equivalent to random choice. (As an example, the accuracy of the binary
classifier starts at 50\%, which is choosing randomly between two
options.) The loss axis is bounded to \(\lbrack 0,3\rbrack\) to fit our
data. The loss for training our large dataset started outside that range
at \(5.05\).
Most of the trends in these graphs show that with more training, the
models could likely be improved. However, since this serves as a proof
of concept and due to limited time and processing capacity, we did not
train each network for more than \(1000\) epochs.
For binary classification between the queries ``Call John Doe'' and
``Call Mary Poppins'', our final accuracy on the test set was
\(\sim 71.19\%\). This result is especially concerning since it suggests
this method can differentiate between very similar queries. An attacker
could train a network on names of people they want to monitor and tell
from the encrypted network traffic whether the voice assistant was
likely used to call that person.
The second classification we attempted was between the \(13\) different
queries from the small dataset, each from a different category. The
result for this dataset is an \(86.19\%\) accuracy on the test set.
Compared to less than \(8\%\) for a random guess, this confirms our
concerns about the privacy and security of the HomePod.
The large dataset with \(227\) different queries showed an accuracy of
\(40.40\%\). This result is likely limited by the size of our network
and training capabilities. When compared to randomly guessing at
\(0.44\%\) though, the loss of privacy becomes apparent.
With adjustments to our CNN, more training data and hardware resources,
we estimate that highly accurate classification can be achieved even
between hundreds of different queries.
\subsection{On-Device Recognition}\label{on-device-recognition}
According to HomePod marketing,
\begin{quote}
HomePod mini works with your iPhone for requests like hearing your
messages or notes, so they are completed on device without revealing
that information to Apple.
--~\url{https://apple.com/homepod-mini}
\end{quote}
\section{Conclusion}
For this thesis, we built a highly reliable system that can autonomously
collect data on interacting with smart speakers for hundreds of hours.
Using our system, we ran an experiment to collect data on over
\(70\prime 000\) interactions with Siri on a HomePod.
The data we collected enabled us to train two ML models to distinguish
encrypted smart speaker traffic of \(13\) or \(227\) different queries
and a binary classifier to recognise which of two contacts a user is
calling on their voice assistant. With final test accuracies of
\(\sim 86\%\) for the small dataset, \(\sim 40\%\) for the large dataset
and \(\sim 71\%\) for the binary classifier, our system serves as a
proof of concept that even though network traffic on a smart speaker is
encrypted, patterns in the traffic can be used to accurately predict
user behaviour. While this was already known for Amazon Alexa
{[}Wang\_2020aa{]}{[}8802686{]} and Google Assistant {[}Wang\_2020aa{]},
we have shown the viability of the same attack on the HomePod despite
Apple's claims about the privacy and security of their smart speakers.
\subsection{Future Work}
While working on this thesis, several ideas for improvement, different
types of data collection and approaches to the analysis of our data came
up. Due to time constraints, we could not pursue all of these points. In
this section we will briefly touch on some open questions or directions
in which work could continue.
\subsubsection{Recognising Different Speakers}
All popular voice assistants support differentiating between users
speaking to them. They give personalised responses to queries such as
\emph{``Read my calendar''} or \emph{``Any new messages?''} by storing a
voice print of the user. Some smart speakers like the HomePod go a step
further and only respond to requests given by a registered user
{[}Apple\_2023{]}.
Similarly to inferring the query from a traffic fingerprint, it may be
possible to differentiate between different users using the smart
speaker. Our system supports different voices that can be used to
register different users on the voice assistants. To keep the scope of our
proof-of-concept experiment manageable, our traffic dataset was
collected with the same voice. By recording more data on the same
queries with different voices, we could explore the possibility of a
classifier inferring the speaker of a query from the traffic trace.
\subsubsection{Inferring Bounds of Traffic
Trace}\label{traffic-trace-bounds}
Currently, we assume an attacker knows when the user begins an
interaction and when it ends. Using a sliding window technique similar
to the one introduced by {[}Zhang\_2018{]}, a system could continuously
monitor network traffic and infer when an interaction has taken place
automatically.
\subsubsection{Classifier for Personal Requests}
As mentioned in
\hyperref[on-device-recognition]{{[}on-device-recognition{]}}, the
traffic traces for personal requests show a visible group of outgoing
traffic that should be straightforward to recognise automatically. A
binary classifier that distinguishes the traffic of personal requests
from that of normal queries on the HomePod should therefore be simple to build.
Since this feature appears consistently on all personal request traces
we looked at, in addition to recognising traffic for queries it was
trained on, this model could feasibly also classify queries it has never
seen.
\subsubsection{Voice Assistants Based on LLMs}\label{vas-based-on-llms}
With the release of Gemini by Google {[}Google\_LLM\_2024{]} and Amazon
announcing generative AI for their Alexa voice assistant
{[}Amazon\_LLM\_2023{]}, the next years will quite certainly show a
shift from traditional voice assistants to assistants based on Large
Language Models \emph{(LLMs)}. Existing concerns about the lack of
user privacy and security when using LLMs {[}Gupta\_2023{]}, combined
with the frequency at which voice assistants process PII, warrant
further research into voice assistants based on generative AI.
\subsubsection{Tuning the Model Hyperparameters}
For time reasons, we are using hyperparameters similar to the ones found
by {[}Wang\_2020aa{]}, adapted to fit our network by hand. Our model
performance could likely be improved by running our own hyperparameter
search.
\subsubsection{Non-Uniform Sampling of Interactions According to Real
World Data}
Currently, interactions with the voice assistants are sampled uniformly
from a fixed dataset. According to a user survey, there are large
differences in how often certain types of skills are used
{[}NPR\_2022{]}. For example, asking for a summary of the news is likely
done much less frequently than playing music.
An experiment could be run where the system asks questions with a
likelihood corresponding to how often they occur in the real world. This
might improve the performance of the recognition system trained on such
a dataset.
\subsubsection{Performance Improvements}
The main bottleneck currently preventing us from interacting without
pausing is our STT system. We estimate a lower bound of approximately 10
seconds per interaction. This entails the speaking duration plus the
time waiting for a server response. With a more performant
implementation of whisper, a superior transcription model or better
hardware to run the model on, we could thus collect up to six
interactions per minute.
\subsubsection{\texorpdfstring{Other Uses for
\texttt{varys}}{Other Uses for varys}}
During data analysis, we discovered certain \emph{easter eggs} in the
answers that Siri gives. As an example, it responds to the query
\emph{``Flip a coin''} with \emph{``It\ldots{} whoops! It fell in a
crack.''} about \(1\%\) of the time -- a response that forces the user
to ask again should they actually need an answer. Since this behaviour
does not have any direct security or privacy implications, we did not
pursue it further. Having said this, varys does provide all the tools
required to analyse these rare responses from a UX research point of
view.
\end{document}