Compare commits: extracts-f...main

1 commit: 944d4e9d19
README.md (new file, +2)

@@ -0,0 +1,2 @@
+# varys thesis
+View the thesis: [Master Thesis.pdf](<./Master Thesis.pdf>)
template.typ (+18)

@@ -29,9 +29,22 @@
 set heading(numbering: "1.1.")

 set figure(gap: 1em)
+show figure.caption: emph

 set table(align: left)

+// Set run-in subheadings, starting at level 4.
+show heading: it => {
+  if it.level > 3 {
+    parbreak()
+    text(11pt, style: "italic", weight: "regular", it.body + ".")
+  } else if it.level == 1 {
+    text(size: 20pt, it)
+  } else {
+    it
+  }
+}
+
 set par(leading: 0.75em)

 // Title page.
@@ -107,6 +120,11 @@
 v(1.618fr)
 pagebreak()

+// Table of contents.
+show outline.entry.where(level: 1): it => text(weight: "bold", it)
+outline(depth: 3, indent: true, fill: text(weight: "regular", repeat[.]))
+pagebreak()
+
 // Main body.
 set par(justify: true)

thesis.tex (deleted, -823)

@@ -1,823 +0,0 @@
|
|||||||
\documentclass{article}
|
|
||||||
|
|
||||||
\title{February 29, 2024 A Testbed for Voice Assistant\\
|
|
||||||
Traffic Fingerprinting Master Thesis}
|
|
||||||
|
|
||||||
\begin{document}
|
|
||||||
|
|
||||||
\section{Introduction}\label{introduction}
|
|
||||||
|
|
||||||
Smart Speakers have become ubiquitous in many countries. The global sale
|
|
||||||
of Amazon Alexa devices surpassed half a billion in 2023
|
|
||||||
{[}Amazon\_2023{]} and Apple sells over \(10\) million HomePods every
|
|
||||||
year {[}Statista\_Apple\_2023{]}. A study shows that \(35\%\) of adults
|
|
||||||
in the U.S. own a smart speaker {[}NPR\_2022{]}. With half of the
|
|
||||||
participants in the study reporting to have heard an advertisement on
|
|
||||||
their smart speaker before, concerns about the privacy and security of
|
|
||||||
these ever-listening devices are well founded. Especially since they are
|
|
||||||
usually placed where private or confidential conversations take place.
|
|
||||||
|
|
||||||
To back claims about loss of privacy or decreased security, a thorough
|
|
||||||
analysis of smart speakers becomes necessary. In this thesis, we build a
|
|
||||||
system that allows us to collect a large amount of data on interactions
|
|
||||||
with voice assistants.
|
|
||||||
|
|
||||||
\subsection{Previous Work}

There are several previous works analysing the privacy and security risks of voice assistants. Most of them focus specifically on Amazon Alexa, since its adoption rate is significantly higher than that of Apple's Siri or Google Assistant {[}Bolton\_2021{]}. Some examples include experiments on smart speaker misactivations {[}Ford2019{]}{[}Dubois\_2020{]}{[}schonherr2022101328{]}, analyses of voice assistant skills {[}Natatsuka\_2019{]}{[}Shezan\_2020{]}{[}255322{]}{[}Edu\_2022{]}{[}Xie\_2022{]} and voice command fingerprinting {[}8802686{]}{[}Wang\_2020aa{]}. Apart from some early work on traffic patterns of Siri {[}caviglione\_2013{]} and research by Apple {[}Apple\_2023{]}{[}Apple\_2018{]}, we found little information about that voice assistant.

\subsection{Thesis Overview}\label{overview}

The primary bottleneck when collecting large amounts of data on voice assistants is the actual time speaking and waiting for the response. According to our tests, the majority of interactions take between \(4\) and \(8\) seconds plus any additional delay between the question and the response. This aligns with the results by {[}Haas\_2022{]}, who found a median of \(4.03\)s per typical sentence. With the goal of collecting a large number of samples for a list of voice commands (henceforth called ``queries''), the system will therefore need to run for a substantial amount of time.

With this in mind, the primary aim of this thesis is building a testbed that can autonomously interact with a smart speaker. This system should take a list of queries and continuously go through it while collecting audio recordings, network packet traces, timestamps and other relevant metadata about the interaction. In \hyperref[background]{{[}background{]}}, we establish why it is of interest to collect this data, and the implementation of our system is explored in \hyperref[system-design]{{[}system-design{]}}.

For the second part, documented in Sections \hyperref[experiment]{{[}experiment{]}} and \hyperref[results]{{[}results{]}}, we run an experiment using our system, collecting interactions with an Apple HomePod mini\footnote{\url{https://apple.com/homepod-mini}} smart speaker. With this dataset of interactions, we train an ML model for traffic fingerprinting. The system is adapted from a similar implementation by {[}Wang\_2020aa{]} but will be applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.

\section{Background}\label{background}

Having introduced our unsupervised system for collecting interactions with smart speakers, it is necessary to explain the reasoning behind collecting that data.

{[}Papadogiannaki\_2021{]} write in their survey: ``In the second category \emph{interception of encrypted traffic,} we encounter solutions that take advantage of behavioural patterns or signatures to report a specific event.'' The main focus for us lies on the encrypted network traffic being collected. This traffic cannot be decrypted; however, we can analyse it using a technique called traffic fingerprinting.

\subsection{Traffic Fingerprinting}

Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method has been used successfully to analyse web traffic {[}Wang\_2020{]}, traffic on the Tor network {[}Oh\_2017{]}, IoT device traffic {[}8802686{]}{[}Wang\_2020aa{]}{[}Mazhar\_2020aa{]}{[}Zuo\_2019{]}{[}Trimananda\_2019aa{]} and user location {[}Ateniese\_2015aa{]}. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.

We adopt the concept of \emph{open world} and \emph{closed world} models from previous traffic fingerprinting works {[}Wang\_2020aa{]}{[}Wang\_2020{]}. Both models define an attacker who has a set of queries they are interested in. The interest in the open world scenario is whether or not a given command is in that set. The traffic analysed in this case can be from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In our thesis, we focus on the closed world model.

\subsection{Threat Model}\label{threat-model}

Our threat model is similar to the ones introduced by {[}Wang\_2020aa{]} and {[}8802686{]}, specifically:

\begin{itemize}
\item
  the attacker has access to the network the smart speaker is connected to;
\item
  the MAC address of the smart speaker is known;
\item
  the type of the voice assistant is known (e.g. Siri, Alexa or Google Assistant).
\end{itemize}

Previous work {[}Mazhar\_2020aa{]}{[}Trimananda\_2019aa{]} shows that it is feasible to extract this information from the encrypted network traffic. The typically unencrypted DNS queries can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries asking about the domain \texttt{gsp-ssl.ls-apple.com.akadns.net} and {[}Wang\_2020aa{]} found that the Amazon Echo smart speaker sends DNS queries about \texttt{unagi-na.amazon.com}. The specific domains queried may differ regionally (as indicated by the \texttt{-na} part in the Amazon subdomain), but should be simple to obtain.

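
To make this concrete, a minimal Rust sketch of such a DNS-based heuristic is shown below. It is purely illustrative and not part of varys: the frame handling is simplified and the label list is an assumption based on the domains mentioned above.

\begin{verbatim}
// Illustrative sketch (not the varys implementation): given a raw Ethernet
// frame captured on the local network, guess whether it is a DNS query for a
// domain known to identify a smart speaker and, if so, return the source MAC
// address. DNS encodes names as length-prefixed labels, so a plain substring
// search for a single label is a crude but workable heuristic here.
const KNOWN_LABELS: &[&str] = &["gsp-ssl", "unagi-na"];

fn speaker_mac(frame: &[u8]) -> Option<[u8; 6]> {
    // Ethernet header: destination MAC (6 B), source MAC (6 B), EtherType (2 B).
    if frame.len() < 14 {
        return None;
    }
    let payload = &frame[14..];
    let matches = KNOWN_LABELS
        .iter()
        .any(|label| payload.windows(label.len()).any(|w| w == label.as_bytes()));
    if matches {
        let mut mac = [0u8; 6];
        mac.copy_from_slice(&frame[6..12]);
        Some(mac)
    } else {
        None
    }
}

fn main() {
    // Toy frame: 14-byte Ethernet header followed by a payload containing a known label.
    let mut frame = vec![0u8; 14];
    frame[6..12].copy_from_slice(&[0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff]);
    frame.extend_from_slice(b"...gsp-ssl.ls-apple.com...");
    println!("{:?}", speaker_mac(&frame)); // Some([170, 187, 204, 221, 238, 255])
}
\end{verbatim}
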
Finally, we assume the traffic is encrypted, since the attack would be trivial otherwise.

\subsection{Voice Assistant Choice}

The most popular voice assistants are Apple's Siri, Amazon Alexa and Google Assistant {[}Bolton\_2021{]}. Support for Microsoft Cortana was dropped in August 2023 {[}Microsoft\_Cortana\_2023{]}, likely in favour of Copilot for Windows, which was announced one month later {[}Microsoft\_Copilot\_2023{]} (for more on this see \hyperref[vas-based-on-llms]{{[}vas-based-on-llms{]}}).

Our system supports the Siri and Alexa voice assistants but can be easily extended with support for more. For the experiment in \hyperref[experiment]{{[}experiment{]}}, we decided to use Siri as it is still less researched {[}Bolton\_2021{]} than Alexa, where a host of security and privacy issues have already been found {[}Wang\_2020aa{]}{[}8802686{]}{[}Ford2019{]}{[}Edu\_2022{]}{[}Dubois\_2020{]}.

\section{System Design}\label{system-design}

The primary product of this thesis is a testbed that can autonomously interact with a smart speaker for a long time. Our system, called varys, collects the following data on its interactions with the voice assistant:

\begin{itemize}
\item
  Audio recordings of the query and the response
\item
  A text transcript of the query and the response
\item
  Encrypted network traffic traces from and to the smart speaker
\item
  The durations of different parts of the interaction
\end{itemize}

\subsection{Modularisation}

The wide range of data our system needs to collect makes it necessary to keep its complexity manageable. To simplify maintenance and increase extensibility, the functionality of our system is separated into five modules:

\begin{description}
\item[varys]
  The main executable combining all modules into the final system.
\item[\texttt{varys-analysis}]
  Analysis of data collected by varys.
\item[\texttt{varys-audio}]
  Recording audio and the TTS and STT systems.
\item[\texttt{varys-database}]
  Abstraction of the database system where interactions are stored.
\item[\texttt{varys-network}]
  Collection of network traffic, writing and parsing of pcap files.
\end{description}

The modules \texttt{varys-audio}, \texttt{varys-database} and \texttt{varys-network} are isolated from the rest of varys. If the need arises (e.g. if a different database connector is required or a more performant TTS system is found), they can be easily swapped out.

The module \texttt{varys-analysis} does not currently analyse the recorded audio and thus does not depend on \texttt{varys-audio}.

In this section, we will lay out the design considerations behind the modules that make up varys.

\subsection{Network}\label{network}

The system runs on a device that serves as a man-in-the-middle (MITM) between the Internet and an \emph{internal network}, for which it acts as a router. The latter is reserved for the devices that are monitored. This includes the voice assistant and any devices required by it. Since the majority of smart speakers do not have any wired connectivity, the \emph{internal network} is wireless.

This layout enables the MITM device to record all incoming and outgoing traffic to and from the voice assistant without any interference from other devices on the network. (We filter the network traffic by the MAC address of the smart speaker, so this is not strictly necessary, but it helps to reduce the amount of garbage data collected.)

Currently, we assume that the beginning and end of an interaction with the voice assistant are known and we can collect the network capture from when the user started speaking to when the voice assistant stopped speaking. We discuss this restriction in more detail in \hyperref[traffic-trace-bounds]{{[}traffic-trace-bounds{]}}.

The \texttt{varys-network} module provides functionality to start and stop capturing traffic on a specific network interface and write it to a pcap file. Furthermore, it offers a wrapper structure around the network packets read from those files, which is used by \texttt{varys-analysis}. Each smart speaker has a different MAC address that is used to distinguish between incoming and outgoing traffic. The MAC address is therefore also stored in the database alongside the path to the pcap file.

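
The following is a minimal sketch of what such a capture loop could look like. It is illustrative rather than the actual \texttt{varys-network} code; it assumes the Rust \texttt{pcap} crate (recent versions provide \texttt{next\_packet} and the two-argument \texttt{filter}), and the interface name and MAC address are placeholders.

\begin{verbatim}
// Minimal capture sketch in the spirit of varys-network (not the actual
// implementation). Uses the `pcap` crate; the exact method names assume a
// recent crate version.
use pcap::Capture;

fn capture_interaction(interface: &str, speaker_mac: &str, out: &str) -> Result<(), pcap::Error> {
    let mut cap = Capture::from_device(interface)?
        .promisc(true)
        .open()?;
    // Only keep traffic from and to the smart speaker, identified by its MAC address.
    cap.filter(&format!("ether host {speaker_mac}"), true)?;
    let mut file = cap.savefile(out)?;
    // In varys this loop would run between the start and end of an interaction;
    // here we simply store a fixed number of packets.
    for _ in 0..1000 {
        let packet = cap.next_packet()?;
        file.write(&packet);
    }
    Ok(())
}

fn main() {
    // Interface name and MAC address are placeholders.
    if let Err(e) = capture_interaction("en0", "aa:bb:cc:dd:ee:ff", "interaction.pcap") {
        eprintln!("capture failed: {e}");
    }
}
\end{verbatim}
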
\subsubsection{Monitoring}\label{monitoring}

Since the system runs for long periods of time without supervision, we need to monitor potential outages. We added support to varys for \emph{passive} monitoring. A high-uptime server receives a request from the system each time it completes an interaction. That server in turn notifies us if it has not received anything for a configurable amount of time.

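
A heartbeat of this kind can be very small. The sketch below is illustrative only; the endpoint URL and the \texttt{ureq} crate are assumptions, not the actual varys implementation.

\begin{verbatim}
// Sketch of the passive monitoring heartbeat (illustrative; endpoint URL and
// crate choice are assumptions). After every completed interaction the system
// pings a high-uptime server, which alerts us if no ping arrives for a
// configurable amount of time.
fn send_heartbeat() {
    // `ureq` is used here for a simple blocking HTTP GET.
    match ureq::get("https://monitor.example.org/ping/varys").call() {
        Ok(_) => {}
        // A failed heartbeat must not interrupt data collection.
        Err(e) => eprintln!("heartbeat failed: {e}"),
    }
}

fn main() {
    // Called once per completed interaction in the real system.
    send_heartbeat();
}
\end{verbatim}
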
\subsection{Audio}

The largest module of our system is \texttt{varys-audio}. Its main components are the TTS and STT systems, expanded upon in the next two subsections. Apart from that, it stores the recorded audio files encoded in the space-optimised OPUS {[}rfc6716{]} format and handles all audio processing: converting the audio to mono, downsampling it for space efficiency, and trimming the silence from the beginning and end.

\subsubsection{Text-to-Speech}\label{text-to-speech}

The only interface a voice assistant provides is speaking to it. Therefore, we need a system that can talk autonomously, for which there are two options:

\begin{itemize}
\item
  Prerecorded audio
\item
  Speech synthesis
\end{itemize}

The former was dismissed early for several reasons. Firstly, audio recordings would not support conversations that last for more than one query and one response. Secondly, each change in our dataset would necessitate recording new queries. Lastly, given the scope of this thesis, recording and editing more than 100 sentences was not feasible.

Thus, we decided to integrate a TTS system into our \texttt{varys-audio} module. This system must provide high-quality synthesis to minimise speech recognition mistakes on the smart speaker, and be able to generate speech without any noticeable performance overhead. The latter is typically measured by the real-time factor (rtf), with \(\text{rtf } > 1\) if the system takes longer to synthesise text than to speak that text and \(\text{rtf } \leq 1\) otherwise. A good real-time factor is especially important if our system is to support conversations in the future; the voice assistant only listens for a certain amount of time before cancelling an interaction.

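
For clarity, the real-time factor can be computed as follows; this is a minimal illustrative helper, not part of \texttt{varys-audio}, and the numbers in the example are hypothetical.

\begin{verbatim}
use std::time::Duration;

// Illustrative helper for the real-time factor (rtf) as used above: the time a
// TTS system needs to synthesise an utterance divided by the duration of the
// produced audio. rtf <= 1 means synthesis is at least as fast as playback.
fn real_time_factor(synthesis_time: Duration, samples: usize, sample_rate: u32) -> f64 {
    let audio_seconds = samples as f64 / sample_rate as f64;
    synthesis_time.as_secs_f64() / audio_seconds
}

fn main() {
    // Hypothetical numbers: 1.2 s to synthesise 3 s of audio at 16 kHz.
    let rtf = real_time_factor(Duration::from_millis(1200), 48_000, 16_000);
    println!("rtf = {rtf:.2}"); // 0.40, i.e. faster than real time
}
\end{verbatim}
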
We explored three different options with \(\text{rtf } \leq 1\) on the machine we tested on:

\begin{itemize}
\item
  espeak\footnote{\url{https://espeak.sourceforge.net}}
\item
  Larynx\footnote{\url{https://github.com/rhasspy/larynx}}
\item
  \texttt{tts-rs} using the \texttt{AVFoundation} backend\footnote{\url{https://github.com/ndarilek/tts-rs} -- the \texttt{AVFoundation} backend is only supported on macOS.}
\end{itemize}

We primarily based our choice on how well Amazon Alexa\footnote{We picked Alexa (over Siri) for these tests since its speech recognition performed worse in virtually every case.} would be able to understand different sentences. A test was performed manually by asking each of three questions 20 times with every TTS system. As a ground truth, the same was repeated verbally. The results show that \texttt{tts-rs} with the \texttt{AVFoundation} backend performed approximately \(20\%\) better on average than both Larynx and espeak and came within \(15\%\) of the ground truth.

While performing this test, we could easily hear the difference in quality, as only the \texttt{AVFoundation} voices come close to sounding like a real voice. These were, apart from the ground truth, also the only voices that could be differentiated from each other by the voice assistants. The recognition of user voices worked well, while the guest voice was wrongly recognised as one of the others \(65\%\) of the time.

With these results, we decided to use \texttt{tts-rs} on a macOS machine to get access to the \texttt{AVFoundation} backend.

Since we ran this test, Larynx has been succeeded by Piper\footnote{\url{https://github.com/rhasspy/piper}} during the development of our system. The new system promises good results and may be an option for running varys on non-macOS systems in the future.

\subsubsection{Speech-to-Text}\label{speech-to-text}

Not all voice assistants provide a history of interactions, which is why we use an STT system to transcribe the recorded answers.

In our module \texttt{varys-audio}, we utilise an ML-based speech recognition system called whisper. It provides transcription error rates comparable to those of human transcription {[}Radford\_2022aa{]} and outperforms other local speech recognition models {[}Seagraves\_2022{]} and commercial services {[}Radford\_2022aa{]}. In \hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}, we explore how well whisper worked for our use case.

\subsection{Data Storage}

The module \texttt{varys-database} acts as an abstraction of our database, where all metadata about interactions is stored. Audio files and pcap traces are saved to disk by \texttt{varys-audio} and \texttt{varys-network} respectively and their paths are referenced in the database.

Since some changes were made to the system while it was running, we store the version of varys a session was run with. This way, if there are any uncertainties about the collected data, the relevant code history and database migrations can be inspected.

To prevent data corruption if the project is moved, all paths are relative to a main \texttt{data} directory.

\subsection{Interaction Process}

The main executable of varys combines all modules introduced in this section to run a series of interactions. When the system is run, it loads the queries from a TOML file. Until it is stopped, the system then runs interaction sessions, each with a shuffled version of the original query list.

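
Loading and shuffling the query list is straightforward; the sketch below illustrates the idea with the \texttt{toml}, \texttt{serde} and \texttt{rand} crates, where the field name \texttt{queries} and the file layout are assumptions rather than the actual varys configuration format.

\begin{verbatim}
// Illustrative sketch (assumed crates: toml, serde with the derive feature,
// rand 0.8). The `queries` field and file layout are assumptions, not the
// actual varys configuration format.
use rand::seq::SliceRandom;
use serde::Deserialize;

#[derive(Deserialize)]
struct QueryFile {
    queries: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("queries.toml")?;
    let config: QueryFile = toml::from_str(&raw)?;

    // One session: a freshly shuffled copy of the original query list.
    let mut session = config.queries.clone();
    session.shuffle(&mut rand::thread_rng());
    for query in &session {
        println!("asking: {query}");
        // run_interaction(query)?;  // speak, record, capture traffic, ...
    }
    Ok(())
}
\end{verbatim}
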
Resetting the voice assistant is implemented differently for each supported assistant.

\subsection{Performance Considerations}\label{performance}

Our system was initially designed to run on a Raspberry Pi, a small-form-factor computer with limited processing and storage capabilities. Mostly because we did not find a TTS system that ran with satisfactory performance on the Raspberry Pi, we now use a laptop computer.

\subsubsection{Data Compression}

Because of the initial plan of using a Raspberry Pi, we compress the recorded audio with the OPUS {[}rfc6716{]} codec and the captured network traffic with DEFLATE {[}rfc1951{]}. However, we are now running the experiment on a laptop, where space is no longer a concern. The audio codec does not add a noticeable processing overhead, but the pcap files are now stored uncompressed. Moreover, storing the network traces uncompressed simplifies our data analysis since they can be used without any preprocessing.

\subsubsection{Audio Transcription}

Apart from the actual time speaking (as mentioned in \hyperref[overview]{{[}overview{]}}), the transcription of audio is the main bottleneck for how frequently our system can run interactions. The following optimisations allowed us to reduce the average time per interaction from \(\sim 120\)s to \(\sim 30\)s:

\paragraph{Model Choice}

There are five different sizes available for the whisper model. The step from the \texttt{medium} to the \texttt{large} and \texttt{large-v2} models does not significantly decrease the word error rate but more than doubles the number of parameters. Since, in our tests, this increase in network size far more than doubled processing time, we opted for the \texttt{medium} model.

\paragraph{Transcription in Parallel}

The bottleneck of waiting for whisper to transcribe the response could be completely eliminated with a transcription queue that is processed in parallel to the main system. However, if the system is run for weeks at a time, this can lead to arbitrarily long post-processing times if the transcription takes longer on average than an interaction.

Our solution is to process the previous response while the next interaction is already running, but to delay further interactions until the transcription queue (of length 1) is empty again. This approach allows us to reduce the time per interaction to \(\max(T_{t},T_{i})\), rather than \(T_{t} + T_{i}\), where \(T_{t}\) represents the time taken for transcription and \(T_{i}\) the time taken for the rest of the interaction.

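
This scheme maps naturally onto a rendezvous channel between the interaction loop and a transcription worker. The sketch below illustrates the idea with placeholder delays; it is not the varys implementation.

\begin{verbatim}
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Illustrative sketch of the approach described above (not the varys code):
// a rendezvous channel hands each recorded response to a transcription
// worker. Sending blocks until the worker is ready again, so the time per
// interaction becomes max(T_t, T_i) instead of T_t + T_i.
fn main() {
    let (sender, receiver) = mpsc::sync_channel::<String>(0);

    let transcriber = thread::spawn(move || {
        for recording in receiver {
            // Stand-in for running whisper on the recorded audio.
            thread::sleep(Duration::from_millis(300));
            println!("transcribed {recording}");
        }
    });

    for i in 0..3 {
        // Stand-in for speaking the query and recording the response.
        thread::sleep(Duration::from_millis(200));
        let recording = format!("interaction {i}");
        // Blocks here only if the previous transcription is still running.
        sender.send(recording).unwrap();
    }

    drop(sender); // Close the channel so the worker loop ends.
    transcriber.join().unwrap();
}
\end{verbatim}
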
\paragraph{Audio Preprocessing}

Another step we took to reduce the transcription time is to reduce the audio sample rate to \(16\) kHz, average both channels into one and trim the silence from the start and the end. This preprocessing does not noticeably reduce the quality of the transcription. Moreover, cutting off silent audio eliminates virtually all whisper hallucinations (see \hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}).

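
A simplified sketch of the mono mixdown and silence trimming is shown below; it is illustrative only (resampling to \(16\) kHz is omitted) and does not reflect the actual \texttt{varys-audio} code.

\begin{verbatim}
// Sketch of the preprocessing described above (simplified, not the
// varys-audio implementation): average stereo channels into one and trim
// leading and trailing samples whose amplitude stays below a threshold.
fn to_mono(interleaved_stereo: &[f32]) -> Vec<f32> {
    interleaved_stereo
        .chunks_exact(2)
        .map(|frame| (frame[0] + frame[1]) / 2.0)
        .collect()
}

fn trim_silence(samples: &[f32], threshold: f32) -> &[f32] {
    let start = samples.iter().position(|s| s.abs() > threshold).unwrap_or(0);
    let end = samples.iter().rposition(|s| s.abs() > threshold).map_or(start, |i| i + 1);
    &samples[start..end]
}

fn main() {
    let stereo = [0.0, 0.0, 0.2, 0.4, -0.5, -0.3, 0.0, 0.0];
    let mono = to_mono(&stereo);             // roughly [0.0, 0.3, -0.4, 0.0]
    let trimmed = trim_silence(&mono, 0.05); // roughly [0.3, -0.4]
    println!("{trimmed:?}");
}
\end{verbatim}
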
\section{Experiment}\label{experiment}

To test our data collection system, we conducted several experiments where we let varys interact with an Apple HomePod mini.

Three different lists of queries were tested:

\begin{itemize}
\item
  A large dataset with \(227\) queries, split into \(24\) categories
\item
  A small dataset with \(13\) queries, each from a different category
\item
  A binary dataset with the two queries ``Call John Doe'' and ``Call Mary Poppins''
\end{itemize}

Apple does not release an official list of Siri's capabilities. A dataset of \(\sim 800\) Siri commands was originally compiled from a user-collected list {[}Hey\_Siri\_Commands{]} and extended with queries from our own exploration. The HomePod does not support all Siri queries\footnote{Any queries that require Siri to show something to the user will be answered with something along the lines of \emph{``I can show you the results on your iPhone.''}}. Therefore, during an initial test period, we manually removed commands from the list that did not result in a useful response. We further reduced the number of redundant commands to increase the number of samples that can be collected per query. Having fewer than 256 labels has the additional advantage that each query label fits in one byte.

Further analysis of our results can be found in \hyperref[results]{{[}results{]}}. In this section we detail the experiment setup and our traffic fingerprinting system.

\subsection{Setup}

The experiment setup consists of a laptop running macOS\footnote{The choice of using macOS voices is explained in \hyperref[text-to-speech]{{[}text-to-speech{]}}.}, connected to a pair of speakers and a microphone. The latter is placed, together with the HomePod, inside a sound-isolated box. Additionally, the user's iPhone is required to be on the \emph{internal network} (see \hyperref[network]{{[}network{]}}) for Siri to be able to respond to personal requests.

The iPhone is set up with a dummy user account for \emph{Peter Pan} and two contacts, \emph{John Doe} and \emph{Mary Poppins}. Otherwise, the phone is kept at factory settings and the only changes come from interactions with the HomePod.

Finally, to minimise system downtime, an external server is used for monitoring as explained in \hyperref[monitoring]{{[}monitoring{]}}.

The experiment must be set up in a location with the following characteristics:

\begin{itemize}
\item
  Wired Internet connection\footnote{A wireless connection is not possible, since macOS does not support connecting to WiFi and creating a wireless network at the same time. If no wired connection is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to its Ethernet port) can be used to connect the Mac to the internet.}
\item
  Power outlet available
\item
  No people speaking nearby (the quieter the environment, the higher the voice recognition accuracy)
\item
  Noise from the experiment does not disturb anyone
\item
  Access to the room during working hours (or, if possible, at any time)
\end{itemize}

The sound isolation of the box can reduce ambient noise in the range of \(50 - 60\) dB to a significantly lower \(35 - 45\) dB, which increases the voice recognition and transcription accuracy. This also makes it easier to find an experiment location, since the noise disturbance is less significant.

When the hardware is set up, the command \texttt{varys\ listen\ -\/-calibrate} is used to find the ambient noise volume.

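
As an illustration of what such a calibration step might compute (the actual behaviour of \texttt{varys\ listen\ -\/-calibrate} is not detailed here), the RMS level of a short ambient recording can be expressed in dBFS and used as a silence threshold:

\begin{verbatim}
// Illustrative sketch only; the actual `varys listen --calibrate` behaviour
// is an assumption here. Computes the RMS level of recorded ambient audio in
// dBFS, which can then serve as a threshold for detecting silence.
fn rms_dbfs(samples: &[f32]) -> f32 {
    let mean_square: f32 =
        samples.iter().map(|s| s * s).sum::<f32>() / samples.len().max(1) as f32;
    10.0 * mean_square.max(f32::MIN_POSITIVE).log10()
}

fn main() {
    // Hypothetical ambient recording with a low, constant amplitude.
    let ambient = vec![0.01f32; 16_000];
    println!("ambient level: {:.1} dBFS", rms_dbfs(&ambient)); // -40.0 dBFS
}
\end{verbatim}
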
\subsection{Data Analysis}

We began data analysis by building a simple tool that takes a number of traffic traces and visualises them. Red and blue represent incoming and outgoing packets respectively and the strength of the colour visualises the packet size. The \(x\)-axis does not necessarily correspond to time, but to the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting {[}Wang\_2020aa{]}.

\subsubsection{Traffic Preprocessing}

We write a network traffic trace as \(\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right)\), where \(n\) is the number of packets in that trace, \(s_{i}\) is the size of the \(i\)-th packet in bytes and \(d_{i} \in \left\{ 0,1 \right\}\) is its direction.

To be used as input in our ML network, the traffic traces were preprocessed into normalised tensors of length \(475\). Our collected network traffic is converted as

\[\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right) \rightarrow \left( p_{1},p_{2},\ldots,p_{475} \right),\]

where for all \(i \in \lbrack 1,475\rbrack\) the components are calculated as

\[p_{i} = \begin{cases}
( - 1)^{d_{i}} \cdot \frac{s_{i}}{1514} & \text{if } i \leq n, \\
0 & \text{else}.
\end{cases}\]

This pads traces shorter than \(475\) packets with zeroes, truncates longer ones and normalises all traces into the range \(\lbrack - 1,1\rbrack\)\footnote{The maximum size of the packets is \(1514\) bytes, which is why we can simply divide by this to normalise our data.}.

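
The following Rust sketch implements exactly this conversion for a trace of \((s_{i},d_{i})\) pairs; it is illustrative and independent of our actual preprocessing code.

\begin{verbatim}
// Sketch of the preprocessing above: a trace of (size, direction) pairs is
// turned into a fixed-length vector of 475 values in [-1, 1]. 1514 is the
// maximum packet size; the sign follows (-1)^d, i.e. direction 0 maps to
// positive and direction 1 to negative values.
const TRACE_LEN: usize = 475;
const MAX_PACKET_SIZE: f32 = 1514.0;

fn to_input(trace: &[(u16, u8)]) -> [f32; TRACE_LEN] {
    let mut input = [0.0; TRACE_LEN];
    for (i, &(size, direction)) in trace.iter().take(TRACE_LEN).enumerate() {
        let sign = if direction == 0 { 1.0 } else { -1.0 };
        input[i] = sign * size as f32 / MAX_PACKET_SIZE;
    }
    input
}

fn main() {
    let trace = [(1514u16, 0u8), (66, 1), (1200, 0)];
    let input = to_input(&trace);
    println!("{:?}", &input[..4]); // [1.0, -0.0436..., 0.7926..., 0.0]
}
\end{verbatim}
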
\subsubsection{Adapting the Network from Wang et al.}\label{adapting-network}

Since there is similar previous work on traffic fingerprinting Amazon Alexa and Google Assistant interactions by {[}Wang\_2020aa{]}, we based our network on their CNN.

In addition to a CNN, they also tested an LSTM, an SAE and a combination of the networks in the form of ensemble learning. However, their best results for a single network came from their CNN, so we decided to use that.

Owing to limited computing resources, we reduced the size of our network. The network of {[}Wang\_2020aa{]} used a pool size of \(1\), which is equivalent to a \emph{no-op}. Thus, we removed all \emph{Max Pooling} layers. Additionally, we reduced the four groups of convolutional layers to one.

Since the hyperparameter search performed by {[}Wang\_2020aa{]} found \(180\) to be the optimal dense network size when that was the maximum in their search space \(\left\{ 100,110,\ldots,170,180 \right\}\), we can assume that the true optimal size is higher than \(180\). In our case we used \(475\) -- the size of our input.

To split our dataset into separate training, validation and testing parts, we used the same proportions as {[}Wang\_2020aa{]}: of the full datasets, \(64\%\) were used for training, \(16\%\) for validation, and \(20\%\) for testing.

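
As an illustration, such a split can be produced by shuffling the sample indices once and partitioning them; the sketch below (using the \texttt{rand} crate) is not our training code.

\begin{verbatim}
// Illustrative 64/16/20 split (assumes the rand 0.8 crate), not the actual
// training pipeline. The indices are shuffled once and then partitioned.
use rand::seq::SliceRandom;

fn split(n: usize) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let mut indices: Vec<usize> = (0..n).collect();
    indices.shuffle(&mut rand::thread_rng());
    let train_end = n * 64 / 100;
    let val_end = train_end + n * 16 / 100;
    let test = indices.split_off(val_end);
    let validation = indices.split_off(train_end);
    (indices, validation, test)
}

fn main() {
    let (train, validation, test) = split(1000);
    println!("{} / {} / {}", train.len(), validation.len(), test.len()); // 640 / 160 / 200
}
\end{verbatim}
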
\section{Results}\label{results}

Our system ran for approximately \(800\) hours and collected data on a total of \(73\prime 605\) interactions. The collected data is available at \url{https://gitlab.com/m-vz/varys-data}.

In this section we will go over our results, beginning with the statistics of our collected dataset.

\subsection{Statistics}\label{statistics}

Of the \(73\prime 605\) queries collected, \(71\prime 915\) were successfully completed (meaning there was no timeout and the system was not stopped during the interaction). Our system varys is therefore able to interact with Siri with a \(97.70\%\) success rate.

As mentioned in \hyperref[overview]{{[}overview{]}}, there are approximately \(4\) to \(8\) seconds of speaking during most interactions. The remaining duration (measured as the total duration minus the speaking duration) lies mostly between \(25\) and \(30\) seconds. This includes the time spent waiting for the voice assistant to answer and transcribing the response. It did not increase even for outliers with a significantly longer speaking duration.

On average, a query took \(2.47\) seconds (\(\pm 0.64\)) to say, which is expected since all query sentences are of similar length. The responses lasted on average \(4.40\) seconds with a standard deviation of \(5.63\)s.

During the first week, there were two bugs that resulted in several downtimes (the longest of which lasted \(\sim 42\)h because we could not get physical access to the system during the weekend). After that, the system ran flawlessly except during one network outage at the university and during maintenance.

\subsection{Transcription Accuracy}\label{transcription-accuracy}

Compared to the results by {[}Radford\_2022aa{]} and {[}Seagraves\_2022{]}, our own results show an even higher transcription accuracy apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.

For the query \emph{``How far away is Boston?''}, the transcription was missing \emph{``as the crow flies''} from the correct response \emph{``{[}\ldots{]} it's about 5,944 kilometers as the crow flies.''} about \(36\%\) of the time. Together with the correctly transcribed cases, this makes up \(98.51\%\) of this query's data. In the response to \emph{``What is the factorial of 6?''}, Siri provided the source \href{https://solumaths.com}{solumaths.com}, which whisper was unable to reproduce\footnote{In 1823 cases the name \emph{solumaths} was recognised as either \emph{solomus}, \emph{solomoths}, \emph{solomons}, \emph{solomuths}, \emph{soloomiths}, \emph{salumith} or \emph{solemnus}, but never correctly.}. Since mistakes in URLs are expected, we marked those transcriptions as correct.

There are some rare cases where the voice assistant responds with a single word. Since whisper does not support recognition of audio shorter than one second, these queries have an exceedingly high failure rate. One example is the query \emph{``Flip a coin''}, which was only recognised correctly \(\sim 2\%\) of the time. Out of the \(140\) incorrectly transcribed responses, \(134\) were cancelled due to the recorded audio being too short.

Another problem we identified with whisper is its tendency to hallucinate text when fed silent audio (one or more seconds of \texttt{0}-samples). This can be prevented by simply trimming all silent audio in the preprocessing step before transcription.

\subsection{Traffic Fingerprinting}

With our implementation of a fingerprinting CNN, we obtained promising results. Our network was trained on the binary, small and large datasets for \(1000\) epochs each. The accuracy axis of the training progress is bounded by the interval \(\left\lbrack \frac{100}{n},100 \right\rbrack\), where \(n\) is the number of labels. The bottom of the accuracy axis is therefore equivalent to random choice. (As an example, the accuracy of the binary classifier starts at 50\%, which is choosing randomly between two options.) The loss axis is bounded to \(\lbrack 0,3\rbrack\) to fit our data. The loss for training our large dataset started outside that range at \(5.05\).

Most of the trends in these graphs show that with more training, the models could likely be improved. However, since this serves as a proof of concept and due to limited time and processing capacity, we did not train each network for more than \(1000\) epochs.

For binary classification between the queries ``Call John Doe'' and ``Call Mary Poppins'', our final accuracy on the test set was \(\sim 71.19\%\). This result is especially concerning since it suggests this method can differentiate between very similar queries. An attacker could train a network on names of people they want to monitor and tell from the encrypted network traffic whether the voice assistant was likely used to call that person.

The second classification we attempted was between the \(13\) different queries from the small dataset, each from a different category. The result for this dataset is an \(86.19\%\) accuracy on the test set. Compared to less than \(8\%\) for a random guess, this confirms our concerns about the privacy and security of the HomePod.

The large dataset with \(227\) different queries showed an accuracy of \(40.40\%\). This result is likely limited by the size of our network and training capabilities. When compared to randomly guessing at \(0.44\%\) though, the loss of privacy becomes apparent.

With adjustments to our CNN, more training data and hardware resources, we estimate that highly accurate classification can be achieved even between hundreds of different queries.

\subsection{On-Device Recognition}\label{on-device-recognition}

According to HomePod marketing,

\begin{quote}
HomePod mini works with your iPhone for requests like hearing your messages or notes, so they are completed on device without revealing that information to Apple.

--~\url{https://apple.com/homepod-mini}
\end{quote}

\section{Conclusion}

For this thesis, we built a highly reliable system that can autonomously collect data on interactions with smart speakers for hundreds of hours. Using our system, we ran an experiment to collect data on over \(70\prime 000\) interactions with Siri on a HomePod.

The data we collected enabled us to train two ML models to distinguish encrypted smart speaker traffic of \(13\) or \(227\) different queries and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of \(\sim 86\%\) for the small dataset, \(\sim 40\%\) for the large dataset and \(\sim 71\%\) for the binary classifier, our system serves as a proof of concept that even though network traffic on a smart speaker is encrypted, patterns in the traffic can be used to accurately predict user behaviour. While this has already been shown for Amazon Alexa {[}Wang\_2020aa{]}{[}8802686{]} and Google Assistant {[}Wang\_2020aa{]}, we have shown the viability of the same attack on the HomePod despite Apple's claims about the privacy and security of their smart speakers.

\subsection{Future Work}

While working on this thesis, several ideas for improvement, different types of data collection and approaches to the analysis of our data came up. Due to time constraints, we could not pursue all of these points. In this section we will briefly touch on some open questions or directions in which work could continue.

\subsubsection{Recognising Different Speakers}

All popular voice assistants support differentiating between users speaking to them. They give personalised responses to queries such as \emph{``Read my calendar''} or \emph{``Any new messages?''} by storing a voice print of the user. Some smart speakers like the HomePod go a step further and only respond to requests given by a registered user {[}Apple\_2023{]}.

Similarly to inferring the query from a traffic fingerprint, it may be possible to differentiate between different users using the smart speaker. Our system supports different voices that can be used to register different users on the voice assistants. To keep the scope of our proof-of-concept experiment manageable, our traffic dataset was collected with the same voice. By recording more data on the same queries with different voices, we could explore the possibility of a classifier inferring the speaker of a query from the traffic trace.

\subsubsection{Inferring the Bounds of a Traffic Trace}\label{traffic-trace-bounds}

Currently, we assume an attacker knows when the user begins an interaction and when it ends. Using a sliding window technique similar to the one introduced by {[}Zhang\_2018{]}, a system could continuously monitor network traffic and automatically infer when an interaction has taken place.

\subsubsection{Classifier for Personal Requests}

As mentioned in \hyperref[on-device-recognition]{{[}on-device-recognition{]}}, the traffic traces for personal requests show a visible group of outgoing traffic that should be straightforward to recognise automatically. A binary classifier that distinguishes the traffic of personal requests from that of normal queries on the HomePod should therefore be simple to build.

Since this feature appears consistently on all personal request traces we looked at, in addition to recognising traffic for queries it was trained on, this model could feasibly also classify queries it has never seen.

\subsubsection{Voice Assistants Based on LLMs}\label{vas-based-on-llms}

With the release of Gemini by Google {[}Google\_LLM\_2024{]} and Amazon announcing generative AI for their Alexa voice assistant {[}Amazon\_LLM\_2023{]}, the next few years will quite certainly show a shift from traditional voice assistants to assistants based on Large Language Models (LLMs). Existing concerns about the lack of user privacy and security when using LLMs {[}Gupta\_2023{]}, combined with the frequency at which voice assistants process PII, warrant further research into voice assistants based on generative AI.

\subsubsection{Tuning the Model Hyperparameters}

Due to time constraints, we are using hyperparameters similar to the ones found by {[}Wang\_2020aa{]}, adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.

\subsubsection{Non-Uniform Sampling of Interactions According to Real World Data}

Currently, interactions with the voice assistants are sampled uniformly from a fixed dataset. According to a user survey, there are large differences in how often certain types of skills are used {[}NPR\_2022{]}. For example, asking for a summary of the news is likely done much less frequently than playing music.

An experiment could be run where the system asks questions with a likelihood corresponding to how often they occur in the real world. This might improve the performance of the recognition system trained on such a dataset.

\subsubsection{Performance Improvements}

The main bottleneck currently preventing us from interacting without pausing is our STT system. We estimate a lower bound of approximately 10 seconds per interaction. This entails the speaking duration plus the time waiting for a server response. With a more performant implementation of whisper, a superior transcription model or better hardware to run the model on, we could thus collect up to six interactions per minute.

\subsubsection{\texorpdfstring{Other Uses for \texttt{varys}}{Other Uses for varys}}

During data analysis, we discovered certain \emph{easter eggs} in the answers that Siri gives. As an example, it responds to the query \emph{``Flip a coin''} with \emph{``It\ldots{} whoops! It fell in a crack.''} about \(1\%\) of the time -- a response that forces the user to ask again should they actually need an answer. Since this behaviour does not have any direct security or privacy implications, we did not pursue it further. Having said this, varys does provide all the tools required to analyse these rare responses from a UX research point of view.

\end{document}