Compare commits

1 Commits

Author SHA1 Message Date
2971b8154f add thesis extracts and pandoc latex conversion 2025-02-20 12:46:57 +01:00
4 changed files with 889 additions and 1334 deletions


@@ -1,2 +0,0 @@
# varys thesis
View the thesis: [Master Thesis.pdf](<./Master Thesis.pdf>)

main.typ (1376 changed lines)

File diff suppressed because it is too large


@@ -29,22 +29,9 @@
set heading(numbering: "1.1.")
set figure(gap: 1em)
show figure.caption: emph
set table(align: left)
// Set run-in subheadings, starting at level 4.
show heading: it => {
if it.level > 3 {
parbreak()
text(11pt, style: "italic", weight: "regular", it.body + ".")
} else if it.level == 1 {
text(size: 20pt, it)
} else {
it
}
}
set par(leading: 0.75em)
// Title page.
@@ -120,11 +107,6 @@
v(1.618fr)
pagebreak()
// Table of contents.
show outline.entry.where(level: 1): it => text(weight: "bold", it)
outline(depth: 3, indent: true, fill: text(weight: "regular", repeat[.]))
pagebreak()
// Main body.
set par(justify: true)

thesis.tex (new file, 823 lines)

@@ -0,0 +1,823 @@
\documentclass{article}
\usepackage{amsmath}
\usepackage{hyperref}
\title{A Testbed for Voice Assistant Traffic Fingerprinting\\
Master Thesis}
\date{February 29, 2024}
\begin{document}
\maketitle
\section{Introduction}\label{introduction}
Smart speakers have become ubiquitous in many countries. Global sales
of Amazon Alexa devices surpassed half a billion in 2023
{[}Amazon\_2023{]} and Apple sells over \(10\) million HomePods every
year {[}Statista\_Apple\_2023{]}. A study shows that \(35\%\) of adults
in the U.S. own a smart speaker {[}NPR\_2022{]}. With half of the
participants in that study reporting that they have heard an advertisement on
their smart speaker before, concerns about the privacy and security of
these ever-listening devices are well founded, especially since they are
usually placed where private or confidential conversations take place.
To back claims about loss of privacy or decreased security, a thorough
analysis of smart speakers becomes necessary. In this thesis, we build a
system that allows us to collect a large amount of data on interactions
with voice assistants.
\subsection{Previous Work}
There are several previous works analysing the privacy and security
risks of voice assistants. Most of them focus specifically on Amazon
Alexa, since its adoption rate is significantly higher than that of
Apple's Siri or Google Assistant {[}Bolton\_2021{]}. Some examples
include experiments on smart speaker misactivations
{[}Ford2019{]}{[}Dubois\_2020{]}{[}schonherr2022101328{]}, analyses of
voice assistant skills
{[}Natatsuka\_2019{]}{[}Shezan\_2020{]}{[}255322{]}{[}Edu\_2022{]}{[}Xie\_2022{]}
and voice command fingerprinting {[}8802686{]}{[}Wang\_2020aa{]}. Apart
from some old work on traffic patterns of Siri {[}caviglione\_2013{]}
and research by Apple {[}Apple\_2023{]}{[}Apple\_2018{]}, we found
little information about that voice assistant.
\subsection{Thesis Overview}\label{overview}
The primary bottleneck when collecting large amounts of data on voice
assistants is the actual time speaking and waiting for the response.
According to our tests, the majority of interactions take between \(4\)
and \(8\) seconds plus any additional delay between the question and the
response. This aligns with the results by {[}Haas\_2022{]}, who found a
median of \(4.03\)s per typical sentence. With the goal of collecting a
large number of samples for a list of voice commands (henceforth called
``queries''), the system will therefore need to run for a substantial
amount of time.
With this in mind, the primary aim of this thesis is building a testbed
that can autonomously interact with a smart speaker. This system should
take a list of queries and continuously go through it while collecting
audio recordings, network packet traces, timestamps and other relevant
metadata about the interaction. In
\hyperref[background]{{[}background{]}}, we establish why it is of
interest to collect this data and the implementation of our system is
explored in \hyperref[system-design]{{[}system-design{]}}.
For the second part, documented in Sections
\hyperref[experiment]{{[}experiment{]}} and
\hyperref[results]{{[}results{]}}, we run an experiment using our system, collecting
interactions with an Apple HomePod mini\footnote{\url{https://apple.com/homepod-mini}}
smart speaker. With this dataset of interactions, we train an ML model
for traffic fingerprinting. The system is adapted from a similar
implementation by {[}Wang\_2020aa{]} but will be applied to data
collected from the Siri voice assistant instead of Alexa and Google
Assistant.
\section{Background}\label{background}
Having outlined our system for autonomously collecting interactions with
smart speakers, we now explain the reasoning behind
collecting that data.
{[}Papadogiannaki\_2021{]} write in their survey ``In the second
category \emph{interception of encrypted traffic,} we encounter
solutions that take advantage of behavioural patterns or signatures to
report a specific event.'' Our main focus lies on the encrypted
network traffic being collected. This traffic cannot be decrypted;
however, we can analyse it using a technique called traffic
fingerprinting.
\subsection{Traffic Fingerprinting}
Traffic fingerprinting uses pattern recognition approaches on collected,
encrypted traffic traces to infer user behaviour. This method is used
successfully to analyse web traffic {[}Wang\_2020{]}, traffic on the Tor
network {[}Oh\_2017{]}, IoT device traffic
{[}8802686{]}{[}Wang\_2020aa{]}{[}Mazhar\_2020aa{]}{[}Zuo\_2019{]}{[}Trimananda\_2019aa{]}
and user location {[}Ateniese\_2015aa{]}. Traffic fingerprinting on
smart speakers is generally used to infer the command given to the voice
assistant by the user. To enable us to run traffic fingerprinting, our
system stores the query and the network traffic for each interaction it
has with the smart speaker.
We adopt the concepts of \emph{open world} and \emph{closed world}
models from previous traffic fingerprinting works
{[}Wang\_2020aa{]}{[}Wang\_2020{]}. Both models define an attacker who
has a set of queries they are interested in. In the open world
scenario, the attacker wants to know whether or not a given command is
in that set; the traffic analysed in this case can come from either
known queries or queries not seen before. In the closed world scenario, the attacker
wants to differentiate between the different queries in that set, but
cannot tell whether a new query is in it or not. As an example, the open
world model could be used to infer whether the smart speaker is being
used to communicate with a person that an attacker is monitoring. In the
closed world case, the attacker might want to differentiate between
different people a user has contacted. In our thesis, we focus on the
closed world model.
\subsection{Threat Model}\label{threat-model}
Our threat model is similar to the ones introduced by {[}Wang\_2020aa{]}
and {[}8802686{]}, specifically,
\begin{itemize}
\item
the attacker has access to the network the smart speaker is connected
to;
\item
the MAC address of the smart speaker is known;
\item
the type of the voice assistant is known (e.g. Siri, Alexa or Google
Assistant).
\end{itemize}
Previous work {[}Mazhar\_2020aa{]}{[}Trimananda\_2019aa{]} shows that it
is feasible to extract this information from the encrypted network
traffic. DNS queries, which are typically unencrypted, can also be used
to identify smart speakers on a network. Our own analysis shows that the Apple
HomePod regularly sends DNS queries asking about the domain
\texttt{gsp-ssl.ls-apple.com.akadns.net} and {[}Wang\_2020aa{]} found
that the Amazon Echo smart speaker sends DNS queries about
\texttt{unagi-na.amazon.com}. The specific domains queried may differ
regionally (as indicated by the \texttt{-na} part in the Amazon
subdomain), but should be simple to obtain.
Finally, we assume the traffic is encrypted, since the attack would be
trivial otherwise.
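As an illustration of this DNS-based identification, the following
sketch (in Python with the scapy library; not part of varys, and the
MAC address and file name are placeholders) counts the DNS query names
sent from a known smart speaker MAC in a captured pcap file:
\begin{verbatim}
# Sketch: list DNS query names sent from a known smart speaker MAC.
# The MAC address and pcap file name are placeholders.
from collections import Counter

from scapy.all import DNSQR, Ether, rdpcap

SPEAKER_MAC = "aa:bb:cc:dd:ee:ff"  # hypothetical HomePod MAC address

queries = Counter()
for packet in rdpcap("capture.pcap"):
    if packet.haslayer(Ether) and packet.haslayer(DNSQR):
        if packet[Ether].src.lower() == SPEAKER_MAC:
            queries[packet[DNSQR].qname.decode()] += 1

for name, count in queries.most_common(5):
    print(f"{count:6d}  {name}")
\end{verbatim}
Domains such as \texttt{gsp-ssl.ls-apple.com.akadns.net} appearing near
the top of this list would point to a HomePod.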
\subsection{Voice Assistant Choice}
The most popular voice assistants are Apple's Siri, Amazon Alexa and
Google Assistant {[}Bolton\_2021{]}. Support for Microsoft Cortana was
dropped in August of 2023 {[}Microsoft\_Cortana\_2023{]}, likely in
favour of Copilot for Windows, which was announced one month later
{[}Microsoft\_Copilot\_2023{]} (for more on this see
\hyperref[vas-based-on-llms]{{[}vas-based-on-llms{]}}).
Our system supports the Siri and Alexa voice assistants but can be
easily extended with support for more. For the experiment in
\hyperref[experiment]{{[}experiment{]}}, we decided to use Siri as it is
still less researched {[}Bolton\_2021{]} than Alexa, where a host of
security and privacy issues have already been found
{[}Wang\_2020aa{]}{[}8802686{]}{[}Ford2019{]}{[}Edu\_2022{]}{[}Dubois\_2020{]}.
\section{System Design}\label{system-design}
The primary product of this thesis is a testbed that can autonomously
interact with a smart speaker for a long time. Our system, called varys,
collects the following data on its interactions with the voice
assistant:
\begin{itemize}
\item
Audio recordings of the query and the response
\item
A text transcript of the query and the response
\item
Encrypted network traffic traces from and to the smart speaker
\item
The durations of different parts of the interaction
\end{itemize}
\subsection{Modularisation}
The wide range of data our system needs to collect makes it necessary to
keep its complexity manageable. To simplify maintenance and increase the
extensibility, the functionality of our system is separated into five
modules:
\begin{description}
\item[varys]
The main executable combining all modules into the final system.
\item[{\texttt{varys-analysis}}]
Analysis of data collected by varys.
\item[{\texttt{varys-audio}}]
Recording audio and the TTS and STT systems.
\item[{\texttt{varys-database}}]
Abstraction of the database system where interactions are stored.
\item[{\texttt{varys-network}}]
Collection of network traffic, writing and parsing of pcap files.
\end{description}
The modules \texttt{varys-audio},
\texttt{varys-database} and \texttt{varys-network}
are isolated from the rest of varys. If the need arises (e.g. if a
different database connector is required or a more performant TTS system
is found), they can be easily swapped out.
The module \texttt{varys-analysis} does not currently do any
analysis of the recorded audio and thus does not depend on \texttt{varys-audio}.
In this section, we will lay out the design considerations behind the
modules that make up varys.
\subsection{Network}\label{network}
The system runs on a device that serves as a MITM between the Internet
and an \emph{internal network}, for which it acts as a router. The
latter is reserved for the devices that are monitored. This includes the
voice assistant and any devices required by it. Since the majority of
smart speakers do not have any wired connectivity, the \emph{internal
network} is wireless.
This layout enables the MITM device to record all incoming and outgoing
traffic to and from the voice assistant without any interference from
other devices on the network. (We filter the network traffic by the MAC
address of the smart speaker so this is not strictly necessary, but it
helps to reduce the amount of garbage data collected.)
Currently, we assume that the beginning and end of an interaction with
the voice assistant are known and that we can collect the network capture from
when the user started speaking to when the voice assistant stopped
speaking. We discuss this restriction in more detail in
\hyperref[traffic-trace-bounds]{{[}traffic-trace-bounds{]}}.
The \texttt{varys-network} module provides functionality to
start and stop capturing traffic on a specific network interface and
write it to a pcap file. Furthermore, it offers a wrapper structure
around the network packets read from those files, which is used by
\texttt{varys-analysis}. Each smart speaker has a different
MAC address that is used to distinguish between incoming and outgoing
traffic. The MAC address is therefore also stored in the database
alongside the path to the pcap file.
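As a rough illustration of this capture step (a Python sketch using
scapy; the actual \texttt{varys-network} implementation may use a
different capture library, and the interface name, MAC address and
duration are placeholders), traffic can be restricted to the smart
speaker with a BPF filter on its MAC address and written to a pcap file:
\begin{verbatim}
# Sketch: capture traffic to and from one MAC address and write it to
# a pcap file. Interface, MAC address and duration are placeholders.
from scapy.all import sniff, wrpcap

SPEAKER_MAC = "aa:bb:cc:dd:ee:ff"

packets = sniff(
    iface="en0",
    filter=f"ether host {SPEAKER_MAC}",  # BPF filter on the speaker MAC
    timeout=30,  # capture window covering one interaction
)
wrpcap("interaction.pcap", packets)
\end{verbatim}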
\subsubsection{Monitoring}\label{monitoring}
Since the system is running for long periods of time without
supervision, we need to monitor potential outages. We added support to
varys for \emph{passive} monitoring. A high-uptime server receives a
request from the system each time it completes an interaction. That
server in turn notifies us if it has not received anything for a
configurable amount of time.
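The client side of this passive monitoring can be as simple as a
heartbeat request after every completed interaction; the sketch below
uses the Python standard library and a placeholder monitoring URL (the
real endpoint and protocol are not specified here):
\begin{verbatim}
# Sketch: send a heartbeat after each completed interaction.
# The monitoring URL is a placeholder; errors are ignored so that
# monitoring problems never interrupt data collection.
import urllib.request


def send_heartbeat(url="https://monitor.example.org/ping"):
    try:
        urllib.request.urlopen(url, timeout=5)
    except OSError:
        pass  # the server alerts us once heartbeats stop arriving
\end{verbatim}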
\subsection{Audio}
The largest module of our system is \texttt{varys-audio}. Its
main components are the TTS and STT systems, expanded upon in the next
two subsections. Apart from that, it stores the recorded audio files
encoded in the space-optimised Opus {[}rfc6716{]} format and handles all
audio processing: converting the audio to mono, downsampling it for space
efficiency and trimming the silence from the beginning and end.
\subsubsection{Text-to-Speech}\label{text-to-speech}
The only interface that a voice assistant provides is via speaking to
it. Therefore, we need a system to autonomously talk, for which there
are two options:
\begin{itemize}
\item
Prerecorded audio
\item
Speech synthesis
\end{itemize}
The former was dismissed early for several reasons. Firstly, audio
recordings would not support conversations that last for more than one
query and one response. Secondly, each change in our dataset would
necessitate recording new queries. Lastly, given the scope of this
thesis, recording and editing more than 100 sentences was not feasible.
Thus, we decided to integrate a TTS system into our
\texttt{varys-audio} module. This system must provide high-quality
synthesis to minimise speech recognition mistakes on the smart
speaker, and be able to generate speech without any noticeable
performance overhead. The latter is typically measured by the real-time
factor (rtf), with \(\text{rtf} > 1\) if the system takes longer to synthesise text than
to speak that text and \(\text{rtf} \leq 1\) otherwise. A good
real-time factor is especially important if our system is to support
conversations in the future; the voice assistant only listens for a
certain amount of time before cancelling an interaction.
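Concretely, the real-time factor can be measured as the synthesis time
divided by the duration of the produced audio. A minimal sketch (the
\texttt{synthesise} function and the sample rate are placeholders):
\begin{verbatim}
# Sketch: measure the real-time factor (rtf) of a TTS system.
# `synthesise` is a placeholder returning mono samples at `sample_rate`.
import time


def real_time_factor(synthesise, text, sample_rate=22050):
    start = time.perf_counter()
    samples = synthesise(text)  # list or array of audio samples
    synthesis_time = time.perf_counter() - start
    audio_duration = len(samples) / sample_rate
    return synthesis_time / audio_duration  # rtf <= 1: faster than real time
\end{verbatim}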
We explored three different options with \(\text{rtf } \leq 1\) on the
machine we tested on:
\begin{itemize}
\item
espeak\footnote{\url{https://espeak.sourceforge.net}}
\item
Larynx\footnote{\url{https://github.com/rhasspy/larynx}}
\item
\texttt{tts-rs} using the \texttt{AVFoundation} backend\footnote{\url{https://github.com/ndarilek/tts-rs}
-- the \texttt{AVFoundation} backend is only supported on macOS.}
\end{itemize}
We primarily based our choice on how well Amazon Alexa\footnote{We
picked Alexa (over Siri) for these tests since its speech recognition
performed worse in virtually every case.} would be able to understand
different sentences. A test was performed manually by asking each of
three questions 20 times with every TTS system. As a ground truth, the
same was repeated verbally. The results show that \texttt{tts-rs} with
the \texttt{AVFoundation} backend performed approximately \(20\%\)
better on average than both Larynx and espeak and came within \(15\%\)
of the ground truth.
While performing this test, we could easily hear the difference in
quality, as only the \texttt{AVFoundation} voices come close to sounding
like a real voice. These were, apart from the ground truth, also the
only voices that could be differentiated from each other by the voice
assistants. The recognition of user voices worked well, while the guest
voice was wrongly recognised as one of the others \(65\%\) of the time.
With these results, we decided to use \texttt{tts-rs} on a macOS machine
to get access to the \texttt{AVFoundation} backend.
Since we ran this test, Larynx has been succeeded by Piper\footnote{\url{https://github.com/rhasspy/piper}}
during the development of our system. The new tool promises good
results and may be an option for running varys on non-macOS systems in
the future.
\subsubsection{Speech-to-Text}\label{speech-to-text}
Not all voice assistants provide a history of interactions, which is why
we use an STT system to transcribe the recorded answers.
In our module \texttt{varys-audio}, we utilise an ML-based
speech recognition system called whisper. It provides transcription
error rates comparable to those of human transcription
{[}Radford\_2022aa{]} and outperforms other local speech recognition
models {[}Seagraves\_2022{]} and commercial services
{[}Radford\_2022aa{]}. In
\hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}, we
explore how well whisper worked for our use case.
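As an illustration (not necessarily the interface used by varys),
transcribing a recorded response with the \texttt{openai-whisper}
Python package looks roughly like this; the model size and file name
are placeholders:
\begin{verbatim}
# Sketch: transcribe a recorded response with the openai-whisper package.
# The model size and file name are placeholders.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("response.wav", language="en")
print(result["text"])
\end{verbatim}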
\subsection{Data Storage}
The module \texttt{varys-database} acts as an abstraction of
our database, where all metadata about interactions is stored. Audio
files and pcap traces are saved to disk by
\texttt{varys-audio} and \texttt{varys-network}
respectively and their paths are referenced in the database.
Since some changes were made to the system while it was running, we
store the version of varys a session was run with. This way, if there
are any uncertainties about the collected data, the relevant code
history and database migrations can be inspected.
To prevent data corruption if the project is moved, all paths are
relative to a main \texttt{data} directory.
\subsection{Interaction Process}
The main executable of varys combines all modules introduced in this
section to run a series of interactions. When the system is run, it
loads the queries from a TOML file. Until it is stopped, the system then
runs interaction sessions, each with a shuffled version of the original
query list.
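As an illustration of this session loop (a Python sketch; the TOML key
name and the two callbacks are assumptions, not the actual varys
configuration format):
\begin{verbatim}
# Sketch: load queries from a TOML file and run shuffled sessions.
# The key name "queries" and the two callbacks are assumptions.
import random
import tomllib  # Python 3.11+


def run_sessions(path, run_interaction, should_stop):
    with open(path, "rb") as file:
        queries = tomllib.load(file)["queries"]
    while not should_stop():
        session = random.sample(queries, k=len(queries))  # shuffled copy
        for query in session:
            run_interaction(query)
\end{verbatim}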
Resetting the voice assistant is implemented differently for each one
supported.
\subsection{Performance Considerations}\label{performance}
The initial design of our system intended for it to be run on a
Raspberry Pi, a small form factor computer with limited processing and
storage capabilities. Mostly because we did not find a TTS system that
ran with satisfactory performance on the Raspberry Pi, we now use a
laptop computer.
\subsubsection{Data Compression}
Because of the initial plan of using a Raspberry Pi, we compress the
recorded audio with the Opus {[}rfc6716{]} codec and the captured
network traffic with DEFLATE {[}rfc1951{]}. However, we are now running
the experiment on a laptop, where space is no longer a concern. The
audio codec does not add a noticeable processing overhead, but the pcap
files are now stored uncompressed. Moreover, storing the network traces
uncompressed simplifies our data analysis since they can be used without
any preprocessing.
\subsubsection{Audio Transcription}
Apart from the actual time speaking (as mentioned in
\hyperref[overview]{{[}overview{]}}), the transcription of audio is the
main bottleneck for how frequently our system can run interactions. The
following optimisations allowed us to reduce the average time per
interaction from \(\sim 120\)s to \(\sim 30\)s:
\paragraph{Model Choice}
There are five different sizes available for the whisper model. The step
from the \texttt{medium} to the \texttt{large} and \texttt{large-v2}
models does not significantly decrease the word error rate but more than
doubles the number of parameters. Since, in our tests, this increase in
network size far more than doubles processing time, we opted for the
\texttt{medium} model.
\paragraph{Transcription in Parallel}
The bottleneck of waiting for whisper to transcribe the response could
be completely eliminated with a transcription queue that is processed in
parallel to the main system. However, if the system is run for weeks at
a time, this can lead to arbitrarily long post-processing times if the
transcription takes longer on average than an interaction.
Our solution is to process the previous response while the next
interaction is already running, but to delay further interactions
until the transcription queue (of length 1) is empty again. This
approach allows us to reduce the time per interaction to
\(\max(T_{t},T_{i})\), rather than \(T_{t} + T_{i}\), where \(T_{t}\)
represents the time taken for transcription and \(T_{i}\) the time taken
for the rest of the interaction.
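A minimal sketch of this bounded pipeline (the \texttt{interact} and
\texttt{transcribe} functions are placeholders; the actual varys
implementation may differ):
\begin{verbatim}
# Sketch: overlap the transcription of the previous response with the
# next interaction while allowing at most one pending transcription.
from concurrent.futures import ThreadPoolExecutor


def run(interact, transcribe, n_interactions):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # at most one transcription in flight
        for _ in range(n_interactions):
            audio = interact()  # runs while `pending` is transcribed
            if pending is not None:
                pending.result()  # wait for the previous transcription
            pending = pool.submit(transcribe, audio)
        if pending is not None:
            pending.result()
\end{verbatim}
Each iteration therefore takes roughly \(\max(T_{t},T_{i})\) instead of
\(T_{t} + T_{i}\).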
\paragraph{Audio Preprocessing}
Another step we took to reduce the transcription time is to reduce the
audio sample rate to \(16\) kHz, average both channels into one and trim
the silence from the start and the end. This preprocessing does not
noticeably reduce the quality of the transcription. Moreover, cutting
off silent audio eliminates virtually all whisper hallucinations (see
\hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}).
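A sketch of this preprocessing step in Python (librosa and soundfile
are illustrative choices, not necessarily the libraries used by varys;
the file names and the silence threshold are placeholders):
\begin{verbatim}
# Sketch: downsample to 16 kHz mono and trim leading/trailing silence.
import librosa
import soundfile as sf

audio, sample_rate = librosa.load("response.wav", sr=16000, mono=True)
trimmed, _ = librosa.effects.trim(audio, top_db=30)  # cut silence at ends
sf.write("response_preprocessed.wav", trimmed, sample_rate)
\end{verbatim}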
\section{Experiment}\label{experiment}
To test our data collection system, we conducted several experiments
where we let varys interact with an Apple HomePod mini.
Three different lists of queries were tested:
\begin{itemize}
\item
A large dataset with \(227\) queries, split into \(24\) categories
\item
A small dataset with \(13\) queries, each from a different category
\item
A binary dataset with the two queries ``Call John Doe'' and ``Call
Mary Poppins''
\end{itemize}
Apple does not release an official list of Siri's capabilities. A
dataset of \(\sim 800\) Siri commands was originally compiled from a
user-collected list {[}Hey\_Siri\_Commands{]} and extended with queries
from our own exploration. The HomePod does not support all Siri
queries\footnote{Any queries that require Siri to show something to the
user will be answered with something along the lines of \emph{``I can
show you the results on your iPhone.''}}. Therefore, during an initial
test period, we manually removed commands from the list that did not
result in a useful response. The number of redundant commands was
further reduced to increase the number of samples that can be collected
per query. Having fewer than 256 labels has the additional advantage
that each query label fits in one byte.
Further analysis of our results can be found in
\hyperref[results]{{[}results{]}}. In this section we detail the
experiment setup and our traffic fingerprinting system.
\subsection{Setup}
The experiment setup consists of a laptop running macOS\footnote{The
choice of using macOS voices is explained in
\hyperref[text-to-speech]{{[}text-to-speech{]}}.}, connected to a pair
of speakers and a microphone. The microphone, together with the HomePod, is
placed inside a sound-isolated box. Additionally, the user's iPhone is
required to be on the \emph{internal network} (see
\hyperref[network]{{[}network{]}}) for Siri to be able to respond to
personal requests.
The iPhone is set up with a dummy user account for \emph{Peter Pan} and
two contacts, \emph{John Doe} and \emph{Mary Poppins}. Otherwise, the
phone is kept at factory settings and the only changes come from
interactions with the HomePod.
Finally, to minimise system downtime, an external server is used for
monitoring as explained in \hyperref[monitoring]{{[}monitoring{]}}.
The experiment must be set up in a location with the following
characteristics:
\begin{itemize}
\item
Wired Internet connection\footnote{A wireless connection is not
possible, since macOS does not support connecting to WiFi and
creating a wireless network at the same time. If no wired connection
is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to
its Ethernet port) can be used to connect the Mac to the internet.}
\item
Power outlet available
\item
No persons speaking (the quieter the environment, the higher the voice
recognition accuracy)
\item
Noise disturbance from experiment not bothering anyone
\item
Access to room during working hours (or if possible at any time)
\end{itemize}
The sound isolation of the box can reduce ambient noise in the range of
\(50 - 60\) dB to a significantly lower \(35 - 45\) dB, which increases
the voice recognition and transcription accuracy. This also helps with
finding an experiment location, since the noise disturbance is less
significant.
When the hardware is set up, the command
\texttt{varys\ listen\ -\/-calibrate} is used to find the ambient noise
volume.
\subsection{Data Analysis}
We began data analysis by building a simple tool that takes a number of
traffic traces and visualises them. Red and blue represent incoming and
outgoing packets respectively, and the strength of the colour encodes
the packet size. The \(x\)-axis does not necessarily correspond with
time, but with the number of incoming and outgoing packets since the
beginning of the trace. This method of analysing traffic traces aligns
with previous work on traffic fingerprinting {[}Wang\_2020aa{]}.
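A minimal version of such a visualisation (a Python sketch with
matplotlib; the assumption that direction \(0\) marks incoming packets
is ours):
\begin{verbatim}
# Sketch: draw one traffic trace as a row of coloured cells, where the
# colour encodes the direction and its strength the packet size.
import numpy as np
import matplotlib.pyplot as plt


def plot_trace(sizes, directions, max_size=1514):
    # assume direction 0 = incoming (red), 1 = outgoing (blue)
    signs = np.where(np.array(directions) == 0, 1.0, -1.0)
    values = signs * np.array(sizes) / max_size
    plt.imshow(values[np.newaxis, :], cmap="bwr", vmin=-1, vmax=1,
               aspect="auto")
    plt.xlabel("packet index")
    plt.yticks([])
    plt.show()
\end{verbatim}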
\subsubsection{Traffic Preprocessing}
We write a network traffic trace as
\(\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right)\),
where \(n\) is the number of packets in that trace, \(s_{i}\) is the
size of a packet in bytes and \(d_{i} \in \left\{ 0,1 \right\}\) is the
direction of that packet.
To be used as input to our ML network, the traffic traces were
preprocessed into normalised tensors of length \(475\). Our collected
network traffic is converted as
\[\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right) \rightarrow \left( p_{1},p_{2},\ldots,p_{475} \right),\]
where for all \(i \in \lbrack 1,475\rbrack\) the components are
calculated as
\[p_{i} = \begin{cases}
( - 1)^{d_{i}} \cdot \frac{s_{i}}{1514} & \text{if } i \leq n, \\
0 & \text{else}.
\end{cases}\]
This pads traces shorter than \(475\) packets with zeroes, cuts off
the ones with more and normalises all traces into the range
\(\lbrack - 1,1\rbrack\)\footnote{The maximum size of the packets is
\(1514\) bytes, which is why we can simply divide by this to normalise
our data.}.
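In code, this preprocessing amounts to the following (a Python sketch;
the varys implementation may differ in details):
\begin{verbatim}
# Sketch: convert a trace of (size, direction) pairs into a normalised
# tensor of length 475, as defined above.
import numpy as np

MAX_SIZE = 1514    # maximum packet size in bytes
TRACE_LENGTH = 475


def preprocess(trace):
    tensor = np.zeros(TRACE_LENGTH, dtype=np.float32)
    for i, (size, direction) in enumerate(trace[:TRACE_LENGTH]):
        tensor[i] = (-1) ** direction * size / MAX_SIZE
    return tensor
\end{verbatim}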
\subsubsection{Adapting Network from Wang et
al.}\label{adapting-network}
Since there is similar previous work on traffic fingerprinting Amazon
Alexa and Google Assistant interactions by {[}Wang\_2020aa{]}, we based
our network on their CNN.
In addition to a CNN, they also tested an LSTM, an SAE and a combination
of the networks in the form of ensemble learning. However, their best
results for a single network came from their CNN, so we decided to use
that.
Owing to limited computing resources, we reduced the size of our
network. The network of {[}Wang\_2020aa{]} used a pool size of \(1\),
which is equivalent to a \emph{no-op}. Thus, we removed all \emph{Max
Pooling} layers. Additionally, we reduced the four groups of
convolutional layers to one.
Since the hyperparameter search performed by {[}Wang\_2020aa{]} found
\(180\) to be the optimal dense network size when that was the maximum
in their search space \(\left\{ 100,110,\ldots,170,180 \right\}\), we
can assume that the true optimal size is higher than \(180\). In our
case we used \(475\) -- the size of our input.
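The reduced architecture can be sketched as follows (in PyTorch; the
layer widths and kernel sizes are illustrative and not the exact
hyperparameters used for our experiments):
\begin{verbatim}
# Sketch of the reduced CNN: one convolutional group, no pooling layers
# and a dense layer of size 475. Layer sizes are illustrative only.
import torch
import torch.nn as nn


class TraceCNN(nn.Module):
    def __init__(self, n_classes, trace_length=475):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * trace_length, 475),  # dense size = input length
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(475, n_classes),
        )

    def forward(self, x):
        # x has shape (batch, 1, trace_length)
        return self.classifier(self.features(x))
\end{verbatim}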
To split our dataset into separate training, validation and testing
parts, we used the same proportions as {[}Wang\_2020aa{]}: of the full
dataset, \(64\%\) was used for training, \(16\%\) for validation and
\(20\%\) for testing.
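This split can be reproduced with two calls to scikit-learn's
\texttt{train\_test\_split} (a sketch; \texttt{traces} and
\texttt{labels} stand for the preprocessed tensors and their query
labels):
\begin{verbatim}
# Sketch: 64/16/20 split into training, validation and test sets
# (20% test, then 20% of the remainder, i.e. 16% overall, for validation).
from sklearn.model_selection import train_test_split

x_rest, x_test, y_rest, y_test = train_test_split(
    traces, labels, test_size=0.20, random_state=0
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.20, random_state=0
)
\end{verbatim}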
\section{Results}\label{results}
Our system ran for approximately \(800\) hours and collected data on a
total of \(73\prime 605\) interactions. The collected data is available
at \url{https://gitlab.com/m-vz/varys-data}.
In this section we will go over our results, beginning with the
statistics of our collected dataset.
\subsection{Statistics}\label{statistics}
Of the \(73\prime 605\) queries collected, \(71\prime 915\) were
successfully completed (meaning there was no timeout and the system was
not stopped during the interaction). Our system varys is therefore able
to interact with Siri with a \(97.70\%\) success rate.
As mentioned in \hyperref[overview]{{[}overview{]}}, there are
approximately \(4\) to \(8\) seconds of speaking during most
interactions. The remaining duration (measured as the total duration
minus the speaking duration) lies mostly between \(25\) and \(30\)
seconds. This includes the time spent waiting for the voice assistant to
answer and transcribing the response. It did not increase even for
outliers with a significantly longer speaking duration.
On average, a query took \(2.47\) seconds (\(\pm 0.64\)) to say, which
is expected since all query sentences are of similar length. The
responses lasted on average \(4.40\) seconds with a standard deviation
of \(5.63\)s.
During the first week, there were two bugs that resulted in several
downtimes (the longest of which lasted \(\sim 42\)h because we could not
get physical access to the system during the weekend). After that, the
system ran flawlessly except during one network outage at the university
and during maintenance.
\subsection{Transcription Accuracy}\label{transcription-accuracy}
Compared to the results by {[}Radford\_2022aa{]} and
{[}Seagraves\_2022{]}, our own results show an even higher transcription
accuracy apart from errors in URLs, certain names and slight formatting
differences. This is likely due to the audio coming from a synthetic
voice in a controlled environment.
For the query \emph{``How far away is Boston?''}, the transcription was
missing \emph{``as the crow flies''} from the correct response \emph{``{[}\ldots{]} it's
about 5,944 kilometers as the crow flies.''} about \(36\%\) of the time.
Together with the correctly transcribed cases, this makes up \(98.51\%\)
of this query's data. In the response to \emph{``What is the factorial
of 6?''}, Siri provided the source
\href{https://solumaths.com}{solumaths.com}, which is text that whisper
was unable to produce\footnote{In 1823 cases the name \emph{solumaths}
was recognised as either \emph{solomus}, \emph{solomoths},
\emph{solomons}, \emph{solomuths}, \emph{soloomiths}, \emph{salumith}
or \emph{solemnus}, but never correctly.}. Since mistakes in URLs are
expected, we marked those transcriptions as correct.
There are some rare cases where the voice assistant responds with a
single word. Since whisper does not support recognition of audio shorter
than one second, these responses have an exceedingly high failure rate.
One example is the query \emph{``Flip a coin''}, whose responses were only
transcribed correctly \(\sim 2\%\) of the time. Out of the \(140\)
incorrectly transcribed responses, \(134\) were cancelled due to the
recorded audio being too short.
Another problem we identified with whisper is its tendency to
hallucinate text when fed silent audio (one or more seconds of
\texttt{0}-samples). This can be prevented by simply trimming off
all silent audio in the preprocessing step before transcription.
\subsection{Traffic Fingerprinting}
With our implementation of a fingerprinting CNN, we were able to get
promising results. Our network was trained on the binary, small and
large datasets for 1000 epochs each. The accuracy axis of the training
progress is bounded by the interval
\(\left\lbrack \frac{100}{n},100 \right\rbrack\), where \(n\) is the
number of labels. The bottom of the accuracy axis is therefore
equivalent to random choice. (As an example, the accuracy of the binary
classifier starts at 50\%, which is choosing randomly between two
options.) The loss axis is bounded to \(\lbrack 0,3\rbrack\) to fit our
data. The loss for training our large dataset started outside that range
at \(5.05\).
Most of the trends in these graphs show that with more training, the
models could likely be improved. However, since this serves as a proof
of concept and due to limited time and processing capacity, we did not
train each network for more than \(1000\) epochs.
For binary classification between the queries ``Call John Doe'' and
``Call Mary Poppins'', our final accuracy on the test set was
\(\sim 71.19\%\). This result is especially concerning since it suggests
this method can differentiate between very similar queries. An attacker
could train a network on names of people they want to monitor and tell
from the encrypted network traffic whether the voice assistant was
likely used to call that person.
The second classification we attempted was between the \(13\) different
queries from the small dataset, each from a different category. The
result for this dataset is an \(86.19\%\) accuracy on the test set.
Compared to less than \(8\%\) for a random guess, this confirms our
concerns about the privacy and security of the HomePod.
The large dataset with \(227\) different queries showed an accuracy of
\(40.40\%\). This result is likely limited by the size of our network
and training capabilities. When compared to randomly guessing at
\(0.44\%\) though, the loss of privacy becomes apparent.
With adjustments to our CNN, more training data and hardware resources,
we estimate that highly accurate classification can be achieved even
between hundreds of different queries.
\subsection{On-Device Recognition}\label{on-device-recognition}
According to HomePod marketing,
\begin{quote}
HomePod mini works with your iPhone for requests like hearing your
messages or notes, so they are completed on device without revealing
that information to Apple.
--~\url{https://apple.com/homepod-mini}
\end{quote}
\section{Conclusion}
For this thesis, we built a highly reliable system that can autonomously
collect data on interacting with smart speakers for hundreds of hours.
Using our system, we ran an experiment to collect data on over
\(70\prime 000\) interactions with Siri on a HomePod.
The data we collected enabled us to train two ML models to distinguish
encrypted smart speaker traffic of \(13\) or \(227\) different queries
and a binary classifier to recognise which of two contacts a user is
calling on their voice assistant. With final test accuracies of
\(\sim 86\%\) for the small dataset, \(\sim 40\%\) for the large dataset
and \(\sim 71\%\) for the binary classifier, our system serves as a
proof of concept that even though network traffic on a smart speaker is
encrypted, patterns in the traffic can be used to accurately predict
user behaviour. While this was already known for Amazon Alexa
{[}Wang\_2020aa{]}{[}8802686{]} and Google Assistant {[}Wang\_2020aa{]},
we have shown the viability of the same attack on the HomePod despite
Apple's claims about the privacy and security of their smart speakers.
\subsection{Future Work}
While working on this thesis, several ideas for improvement, different
types of data collection and approaches to the analysis of our data came
up. Due to time constraints, we could not pursue all of these points. In
this section we will briefly touch on some open questions or directions
in which work could continue.
\subsubsection{Recognising Different Speakers}
All popular voice assistants support differentiating between users
speaking to them. They give personalised responses to queries such as
\emph{``Read my calendar''} or \emph{``Any new messages?''} by storing a
voice print of the user. Some smart speakers like the HomePod go a step
further and only respond to requests given by a registered user
{[}Apple\_2023{]}.
Similarly to inferring the query from a traffic fingerprint, it may be
possible to differentiate between different users using the smart
speaker. Our system supports different voices that can be used to
register different users on the voice assistants. To keep the scope of our
proof-of-concept experiment manageable, our traffic dataset was
collected with the same voice. By recording more data on the same
queries with different voices, we could explore the possibility of a
classifier inferring the speaker of a query from the traffic trace.
\subsubsection{Inferring Bounds of Traffic
Trace}\label{traffic-trace-bounds}
Currently, we assume an attacker knows when the user begins an
interaction and when it ends. Using a sliding window technique similar
to the one introduced by {[}Zhang\_2018{]}, a system could continuously
monitor network traffic and infer when an interaction has taken place
automatically.
\subsubsection{Classifier for Personal Requests}
As mentioned in
\hyperref[on-device-recognition]{{[}on-device-recognition{]}}, the
traffic traces for personal requests show a visible group of outgoing
traffic that should be straightforward to recognise automatically. A
binary classifier that distinguishes the traffic of personal requests
from that of normal queries on the HomePod should therefore be simple to build.
Since this feature appears consistently on all personal request traces
we looked at, in addition to recognising traffic for queries it was
trained on, this model could feasibly also classify queries it has never
seen.
\subsubsection{Voice Assistants Based on LLMs}\label{vas-based-on-llms}
With the release of Gemini by Google {[}Google\_LLM\_2024{]} and Amazon
announcing generative AI for their Alexa voice assistant
{[}Amazon\_LLM\_2023{]}, the next years will quite certainly show a
shift from traditional voice assistants to assistants based on Large
Language Models \emph{(LLMs)}. Existing concerns about the lack of
user privacy and security when using LLMs {[}Gupta\_2023{]}, combined
with the frequency at which voice assistants process PII, warrant
further research into voice assistants based on generative AI.
\subsubsection{Tuning the Model Hyperparameters}
For time reasons, we are using hyperparameters similar to the ones found
by {[}Wang\_2020aa{]}, adapted to fit our network by hand. Our model
performance could likely be improved by running our own hyperparameter
search.
\subsubsection{Non-Uniform Sampling of Interactions According to Real
World Data}
Currently, interactions with the voice assistants are sampled uniformly
from a fixed dataset. According to a user survey, there are large
differences in how often certain types of skills are used
{[}NPR\_2022{]}. For example, asking for a summary of the news is likely
done much less frequently than playing music.
An experiment could be run where the system asks questions with a
likelihood corresponding to how often they occur in the real world. This
might improve the performance of the recognition system trained on such
a dataset.
\subsubsection{Performance Improvements}
The main bottleneck currently preventing us from interacting without
pausing is our STT system. We estimate a lower bound of approximately 10
seconds per interaction. This entails the speaking duration plus the
time waiting for a server response. With a more performant
implementation of whisper, a superior transcription model or better
hardware to run the model on, we could thus collect up to six
interactions per minute.
\subsubsection{\texorpdfstring{Other Uses for
\texttt{varys}}{Other Uses for varys}}
During data analysis, we discovered certain \emph{easter eggs} in the
answers that Siri gives. As an example, it responds to the query
\emph{``Flip a coin''} with \emph{``It\ldots{} whoops! It fell in a
crack.''} about \(1\%\) of the time -- a response that forces the user
to ask again should they actually need an answer. Since this behaviour
does not have any direct security or privacy implications, we did not
pursue it further. Having said this, varys does provide all the tools
required to analyse these rare responses from a UX research point of
view.
\end{document}