#import "template.typ": *

#set text(region: "GB")

// constants
// colours from https://mk.bcgsc.ca/colorblind/palettes.mhtml
#let colour_network = rgb("2271b2")
#let colour_network_transparent = rgb(34, 113, 178, 50%)
#let colour_audio = rgb("e69f00")
#let colour_database = rgb("f748a5")
#let colour_analysis = rgb("359b73")
#let colour_highlight = rgb("d55e00")

// replacements
#show "varys-network": name => box[#text(colour_network)[#raw(name.text.at(0))#raw(name.text.slice(1))]]
#show "varys-audio": name => box[#text(colour_audio)[#raw(name.text.at(0))#raw(name.text.slice(1))]]
#show "varys-database": name => box[#text(colour_database)[#raw(name.text.at(0))#raw(name.text.slice(1))]]
#show "varys-analysis": name => box[#text(colour_analysis)[#raw(name.text.at(0))#raw(name.text.slice(1))]] // the at(0)/slice(1) trick prevents an infinite recursion

#show: project.with(
  title: [A Testbed for Voice Assistant \ Traffic Fingerprinting],
  authors: (
    (name: "Milan van Zanten", email: "milan.vanzanten@unibas.ch"),
  ),
  examiner: (
    name: "Prof. Dr. Isabel Wagner",
    role: "Examiner",
  ),
  group: (
    faculty: "Faculty of Science",
    department: "Department of Mathematics and Computer Science",
    name: "Privacy-Enhancing Technologies Group",
    url: "https://pet.dmi.unibas.ch",
  ),
  acknowledgement: par(justify: true, block(width: 80%)[
    Firstly, I would like to thank Prof. Dr. Isabel Wagner for the opportunity to write my master thesis in her research group. Her insightful perspectives during our progress meetings have been invaluable. Furthermore, I am especially grateful for the help and explanations provided by Diego, Shiva and Patrick, who have generously shared their knowledge with a machine learning novice like me. I also wish to express my heartfelt appreciation to Patricia and Heike for assisting me with navigating the bureaucratic process of delaying my thesis during a difficult time. Last, but certainly not least, I thank Michelle, my family, and my friends for supporting me throughout the work on this thesis. I would not have been able to complete it without them.
  ]),
  abstract: par(justify: true, block(width: 80%)[
    In this thesis, we investigate the viability of traffic fingerprinting on voice assistants like Apple's Siri. Initially, we built a testbed that can house different kinds of smart speakers and autonomously interact with the voice assistants running on them. It collects network captures, audio recordings, timestamps and a transcript of the voice assistant's response. With this system, we conducted an experiment over one month to collect data on approximately 70'000 interactions with an Apple HomePod. As a proof of concept, the collected traffic traces were used to train a machine learning model that can infer what was said to the voice assistant from only the network traffic. Our findings indicate that our system can correctly classify unseen network traffic with an accuracy of up to 86%, revealing a vulnerability in Siri's security and privacy measures.
  ]),
  date: "February 29, 2024",
  logo: "images/logo-en.svg",
)

= Introduction <introduction>
Smart Speakers have become ubiquitous in many countries. The global sale of Amazon Alexa devices surpassed half a billion in 2023 @Amazon_2023 and Apple sells over $10$ million HomePods every year @Statista_Apple_2023. A study shows that $35%$ of adults in the U.S. own a smart speaker @NPR_2022. With half of the participants in the study reporting to have heard an advertisement on their smart speaker before, concerns about the privacy and security of these ever-listening devices are well founded. Especially since they are usually placed where private or confidential conversations take place.

To back claims about loss of privacy or decreased security, a thorough analysis of smart speakers becomes necessary. In this thesis, we build a system that allows us to collect a large amount of data on interactions with voice assistants.

== Previous Work
There are several previous works analysing the privacy and security risks of voice assistants. Most of them go into topics concerning specifically Amazon Alexa since their adoption rate is significantly higher than Apple's Siri or Google Assistant @Bolton_2021. Some examples include experiments of smart speaker misactivations @Ford2019 @Dubois_2020 @schonherr2022101328, analyses of voice assistant skills @Natatsuka_2019 @Shezan_2020 @255322 @Edu_2022 @Xie_2022 and voice command fingerprinting @8802686 @Wang_2020aa. Apart from some old work on traffic patterns of Siri @caviglione_2013 and research by Apple @Apple_2023 @Apple_2018, we found little information about that voice assistant.

== Thesis Overview <overview>
The primary bottleneck when collecting large amounts of data on voice assistants is the actual time speaking and waiting for the response. According to our tests, the majority of interactions take between $4$ and $8$ seconds plus any additional delay between the question and the response. This aligns with the results by #cite(<Haas_2022>, form: "prose"), who found a median of $4.03$s per typical sentence. With the goal of collecting a large number of samples for a list of voice commands (henceforth called "queries"), the system will therefore need to run for a substantial amount of time.

With this in mind, the primary aim of this thesis is building a testbed that can autonomously interact with a smart speaker. This system should take a list of queries and continuously go through it while collecting audio recordings, network packet traces, timestamps and other relevant metadata about the interaction. In @background, we establish why it is of interest to collect this data and the implementation of our system is explored in @system-design.

For the second part, documented in Sections @experiment[] and @results[], we run an experiment using our system, collecting interactions with an Apple HomePod mini#footnote[#link("https://apple.com/homepod-mini")] smart speaker. With this dataset of interactions, we train an ML model for traffic fingerprinting. The system is adapted from a similar implementation by #cite(<Wang_2020aa>, form: "prose") but will be applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.

#pagebreak()
= Background <background>
Having defined our unsupervised system for collecting interactions with smart speakers, it is necessary to explain the reasoning behind collecting that data.

#cite(<Papadogiannaki_2021>, form: "author") write in their survey #quote(attribution: <Papadogiannaki_2021>, [In the second category #emph[interception of encrypted traffic,] we encounter solutions that take advantage of behavioural patterns or signatures to report a specific event.]) The main focus for us lies on the encrypted network traffic being collected. This traffic cannot be decrypted, however, we can analyse it using a technique called traffic fingerprinting.

== Traffic Fingerprinting
Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method is used successfully to analyse web traffic @Wang_2020, traffic on the Tor network @Oh_2017, IoT device traffic @8802686 @Wang_2020aa @Mazhar_2020aa @Zuo_2019 @Trimananda_2019aa and user location @Ateniese_2015aa. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.

We assume the concept of #emph[open world] and #emph[closed world] models from previous traffic fingerprinting works @Wang_2020aa @Wang_2020. Both models define an attacker who has a set of queries they are interested in. The interest in the open world scenario is whether or not a given command is in that set. The traffic analysed in this case can be from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In our thesis, we focus on the closed world model.

== Threat Model <threat-model>
Our threat model is similar to the ones introduced by #cite(<Wang_2020aa>, form: "prose") and @8802686, specifically,

- the attacker has access to the network the smart speaker is connected to;
- the MAC address of the smart speaker is known;
- the type of the voice assistant is known (e.g. Siri, Alexa or Google Assistant).

Previous work @Mazhar_2020aa @Trimananda_2019aa shows that it is feasible to extract this information from the encrypted network traffic. Typically unencrypted DNS queries can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries asking about the domain `gsp-ssl.ls-apple.com.akadns.net` and #cite(<Wang_2020aa>, form: "prose") found that the Amazon Echo smart speaker sends DNS queries about `unagi-na.amazon.com`. The specific domains queried may differ regionally (as indicated by the `-na` part in the Amazon subdomain), but should be simple to obtain.

Finally, we assume the traffic is encrypted, since the attack would be trivial otherwise.

== Voice Assistant Choice
The most popular voice assistants are Apple's Siri, Amazon Alexa and Google Assistant @Bolton_2021. Support for Microsoft Cortana was dropped in August of 2023 @Microsoft_Cortana_2023, likely in favour of Copilot for Windows, which was announced one month later @Microsoft_Copilot_2023 (for more on this see @vas-based-on-llms).

Our system supports the Siri and Alexa voice assistants but can be easily extended with support for more. For the experiment in @experiment, we decided to use Siri as it is still less researched @Bolton_2021 than Alexa, where a host of security and privacy issues have already been found @Wang_2020aa @8802686 @Ford2019 @Edu_2022 @Dubois_2020.

#pagebreak()
= System Design <system-design>
The primary product of this thesis is a testbed that can autonomously interact with a smart speaker for a long time. Our system, called varys, collects the following data on its interactions with the voice assistant:

- Audio recordings of the query and the response
- A text transcript of the query and the response
- Encrypted network traffic traces from and to the smart speaker
- The durations of different parts of the interaction

== Modularisation
The wide range of data our system needs to collect makes it necessary to keep its complexity manageable. To simplify maintenance and increase the extensibility, the functionality of our system is separated into five modules:

/ varys: The main executable combining all modules into the final system.
/ varys-analysis: Analysis of data collected by varys.
/ varys-audio: Recording audio and the TTS and STT systems.
/ varys-database: Abstraction of the database system where interactions are stored.
/ varys-network: Collection of network traffic, writing and parsing of pcap files.

The modules varys-audio, varys-database and varys-network are isolated from the rest of varys. If the need arises (e.g. if a different database connector is required or a more performant TTS system is found), they can be easily swapped out.

The module varys-analysis does not currently do any analysis of the recorded audio and thus doesn't depend on that module.

In this section, we will lay out the design considerations behind the modules that make up varys.

== Network <network>
The system runs on a device that serves as a MITM between the Internet and an #emph[internal network], which it acts as a router for. The latter is reserved for the devices that are monitored. This includes the voice assistant and any devices required by it. Since the majority of smart speakers do not have any wired connectivity, the #emph[internal network] is wireless.

This layout enables the MITM device to record all incoming and outgoing traffic to and from the voice assistant without any interference from other devices on the network. (We filter the network traffic by the MAC address of the smart speaker so this is not strictly necessary, but it helps to reduce the amount of garbage data collected.)

Currently, we assume that the beginning and end of an interaction with the voice assistant is known and we can collect the network capture from when the user started speaking to when the voice assistant stopped speaking. We discuss this restriction in more detail in @traffic-trace-bounds.

The varys-network module provides functionality to start and stop capturing traffic on a specific network interface and write it to a pcap file. Furthermore, it offers a wrapper structure around the network packets read from those files, which is used by varys-analysis. Each smart speaker has a different MAC address that is used to distinguish between incoming and outgoing traffic. The MAC address is therefore also stored in the database alongside the path to the pcap file.

=== Monitoring <monitoring>
Since the system is running for long periods of time without supervision, we need to monitor potential outages. We added support to varys for #emph[passive] monitoring. A high-uptime server receives a request from the system each time it completes an interaction. That server in turn notifies us if it hasn't received anything for a configurable amount of time.

== Audio
The largest module of our system is varys-audio. Its main components are the TTS and STT systems, expanded upon in the next two subsections. Apart from that, it stores the recorded audio files encoded in the space-optimised OPUS @rfc6716 format and handles all audio processing. Converting it to mono; downsampling it for space efficiency; and trimming the silence from the beginning and end.

=== Text-to-Speech <text-to-speech>
The only interface that a voice assistant provides is via speaking to it. Therefore, we need a system to autonomously talk, for which there are two options:

- Prerecorded audio
- Speech synthesis

The former was dismissed early for several reasons. Firstly, audio recordings would not support conversations that last for more than one query and one response. Secondly, each change in our dataset would necessitate recording new queries. Lastly, given the scope of this thesis, recording and editing more than 100 sentences was not feasible.

Thus, we decided to integrate a TTS system into our varys-audio module. This system must provide high quality synthesis to minimise speech recognition mistakes on the smart speaker, and be able to generate speech without any noticeable performance overhead. The latter is typically measured by the rtf, with $"rtf" > 1$ if the system takes longer to synthesise text than to speak that text and $"rtf" <= 1$ otherwise. A good real-time factor is especially important if our system is to support conversations in the future; the voice assistant only listens for a certain amount of time before cancelling an interaction.

We explored three different options with $"rtf" <= 1$ on the machine we tested on:

- espeak#footnote(link("https://espeak.sourceforge.net"))
- Larynx#footnote(link("https://github.com/rhasspy/larynx"))
- `tts-rs` using the `AVFoundation` backend#footnote[#link("https://github.com/ndarilek/tts-rs") – the `AVFoundation` backend is only supported on macOS.]

We primarily based our choice on how well Amazon Alexa#footnote[We picked Alexa (over Siri) for these tests since its speech recognition performed worse in virtually every case.] would be able to understand different sentences. A test was performed manually by asking each of three questions 20 times with every TTS system. As a ground truth, the same was repeated verbally. The results show that `tts-rs` with the `AVFoundation` backend performed approximately $20%$ better on average than both Larynx and espeak and came within $15%$ of the ground truth.

While performing this test, the difference in quality was very easily heard, as only the `AVFoundation` voices come close to sounding like a real voice. These were, apart from the ground truth, also the only voices that could be differentiated from each other by the voice assistants. The recognition of user voices works well while the guest voice was wrongly recognised as one of the others $65%$ of the time.

With these results, we decided to use `tts-rs` on a macOS machine to get access to the `AVFoundation` backend.

Since running this test, Larynx has been succeeded by Piper#footnote(link("https://github.com/rhasspy/piper")) during the development of our system. The new system promises good results and may be an option for running varys on non-macOS systems in the future.

=== Speech-to-Text <speech-to-text>
Not all voice assistants provide a history of interactions, which is why we use a STT system to to transcribe the recorded answers.

In our module varys-audio, we utilise an ML-based speech recognition system called whisper. It provides transcription error rates comparable to those of human transcription @Radford_2022aa and outperforms other local speech recognition models @Seagraves_2022 and commercial services @Radford_2022aa. In @transcription-accuracy, we explore how well whisper worked for our use case.

== Data Storage
The module varys-database acts as an abstraction of our database, where all metadata about interactions are stored. Audio files and pcap traces are saved to disk by varys-audio and varys-network respectively and their paths are referenced in the database.

Since some changes were made to the system while it was running, we store the version of varys a session was ran with. This way, if there are any uncertainties about the collected data, the relevant code history and database migrations can be inspected.

To prevent data corruption if the project is moved, all paths are relative to a main `data` directory.

== Interaction Process
The main executable of varys combines all modules introduced in this section to run a series of interactions. When the system is run, it loads the queries from a TOML file. Until it is stopped, the system then runs interaction sessions, each with a shuffled version of the original query list.

Resetting the voice assistant is implemented differently for each one supported.

== Performance Considerations <performance>
The initial design of our system intended for it to be run on a Raspberry Pi, a small form factor computer with limited processing and storage capabilities. Mostly because we did not find a TTS system that ran with satisfactory performance on the Raspberry Pi, we now use a laptop computer.

=== Data Compression
Because of the initial plan of using a Raspberry Pi, we compress the recorded audio with the OPUS @rfc6716 codec and the captured network traffic with DEFLATE @rfc1951. We are however now running the experiment on a laptop, where space is no longer a concern. The audio codec does not add a noticeable processing overhead, but the pcap files are now stored uncompressed. Moreover, storing the network traces uncompressed simplifies our data analysis since they can be used without any preprocessing.

=== Audio Transcription
Apart from the actual time speaking (as mentioned in @overview), the transcription of audio is the main bottleneck for how frequently our system can run interactions. The following optimisations allowed us to reduce the average time per interaction from $~120$s to $~30$s:

==== Model Choice
There are five different sizes available for the whisper model. The step from the `medium` to the `large` and `large-v2` models does not significantly decrease Word-Error-Rate but more than doubles the number of parameters. Since in our tests, this increase in network size far more than doubles processing time, we opted for the `medium` model.

==== Transcription in Parallel
The bottleneck of waiting for whisper to transcribe the response could be completely eliminated with a transcription queue that is processed in parallel to the main system. However, if the system is run for weeks at a time, this can lead to arbitrarily long post-processing times if the transcription takes longer on average than an interaction.

Our solution is to process the previous response while the next interaction is already running, but to wait with further interactions until the transcription queue (of length 1) is empty again. This approach allows us to reduce the time per interaction to $max(T_t, T_i)$, rather than $T_t + T_i$, where $T_t$ represents the time taken for transcription and $T_i$ the time taken for the rest of the interaction.

==== Audio Preprocessing
Another step we took to reduce the transcription time is to reduce the audio sample rate to $16$ kHz, average both channels into one and trim the silence from the start and the end. This preprocessing does not noticeably reduce the quality of the transcription. Moreover, cutting off silent audio eliminates virtually all whisper hallucinations (see @transcription-accuracy).

#pagebreak()
= Experiment <experiment>
To test our data collection system, we conducted several experiments where we let varys interact with an Apple HomePod mini.

Three different lists of queries were tested:

- A large dataset with $227$ queries, split into $24$ categories
- A small dataset with $13$ queries, each from a different category
- A binary dataset with the two queries "Call John Doe" and "Call Mary Poppins"

Apple does not release an official list of Siri's capabilities. A dataset of $~800$ Siri commands was originally compiled from a user-collected list @Hey_Siri_Commands and extended with queries from our own exploration. The HomePod does not support all Siri queries#footnote[Any queries that require Siri to show something to the user will be answered with something along the lines of #emph(quote[I can show you the results on your iPhone.])]. Therefore, during an initial test period, we manually removed commands from the list that did not result in a useful response. The number of redundant commands was further reduced to increase the number of samples that can be collected per query. Having less than 256 labels has the additional advantage of our query labels fitting in one byte.

Further analysis of our results can be found in @results. In this section we detail the experiment setup and our traffic fingerprinting system.

== Setup
The experiment setup consists of a laptop running macOS#footnote[The choice of using macOS voices is explained in @text-to-speech.], connected to a pair of speakers and a microphone. The latter, together with the HomePod, are placed inside a sound-isolated box. Additionally, the user's iPhone is required to be on the #emph[internal network] (see @network) for Siri to be able to respond to personal requests.

The iPhone is set up with a dummy user account for #emph[Peter Pan] and two contacts, #emph[John Doe] and #emph[Mary Poppins]. Otherwise, the phone is kept at factory settings and the only changes come from interactions with the HomePod.

Finally, to minimise system downtime, an external server is used for monitoring as explained in @monitoring.

The experiment must be set up in a location with the following characteristics:
#block(breakable:  false)[
  - Wired Internet connection#footnote[A wireless connection is not possible, since macOS does not support connecting to WiFi and creating a wireless network at the same time. If no wired connection is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to its Ethernet port) can be used to connect the Mac to the internet.]
  - Power outlet available
  - No persons speaking (the quieter the environment, the higher the voice recognition accuracy)
  - Noise disturbance from experiment not bothering anyone
  - Access to room during working hours (or if possible at any time)
]

The sound isolation of the box can reduce ambient noise in the range of $50 - 60$ dB to a significantly lower $35 - 45$ dB, which increases the voice recognition and transcription accuracy. This also helps finding an experiment location, since the noise disturbance is less significant.

When the hardware is set up, the command `varys listen --calibrate` is used to find the ambient noise volume.

== Data Analysis
We began data analysis by building a simple tool that takes a number of traffic traces and visualises them. Red and blue represent incoming and outgoing packets respectively and the strength of the colour visualises the packet size. The $x$-axis does not necessarily correspond with time, but with the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting @Wang_2020aa.

=== Traffic Preprocessing
We write a network traffic trace as $((s_1, d_1), (s_2, d_2), ..., (s_n, d_n))$, where $n$ is the number of packets in that trace, $s_i$ is the size of a packet in bytes and $d_i in {0, 1}$ is the direction of that packet.

To be used as input in our ML network, the traffic traces were preprocessed into normalised tensors of length $475$. Our collected network traffic is converted like

$$$
  ((s_1, d_1), (s_2, d_2), ..., (s_n, d_n)) -> (p_1, p_2, ..., p_475),
$$$

where for all $i in [1, 475]$ the components are calculated as

$$$
  p_i = cases(
    (-1)^(d_i) dot s_i/1514 #h(0.8em) "if" i < n",",
    0 #h(5.5em) "else".
  )
$$$

This pads traces shorter than $475$ packets long with zeroes, cuts off the ones with more and normalises all traces into the range $[-1, 1]$#footnote[The maximum size of the packets is $1514$ bytes, which is why we can simply divide by this to normalise our data.].

=== Adapting Network from Wang et al. <adapting-network>
Since there is similar previous work on traffic fingerprinting Amazon Alexa and Google Assistant interactions by #cite(<Wang_2020aa>, form: "prose"), we based our network on their CNN.

In addition to a CNN, they also tested an LSTM, a SAE and a combination of the networks in the form of ensemble learning. However, their best results for a single network came from their CNN, so we decided to use that.

Owing to limited computing resources, we reduced the size of our network. The network of #cite(<Wang_2020aa>, form: "author") used a pool size of $1$, which is equivalent to a #emph[no-op]. Thus, we removed all #emph[Max Pooling] layers. Additionally, we reduced the four groups of convolutional layers to one.

Since the hyperparameter search performed by #cite(<Wang_2020aa>, form: "author") found $180$ to be the optimal dense network size when that was the maximum in their search space ${100, 110, ..., 170, 180}$, we can assume that the true optimal size is higher than $180$. In our case we used $475$ – the size of our input.

To split out dataset into separate training, validation and testing parts, we used the same proportions as #cite(<Wang_2020aa>, form: "prose")\; Of the full datasets, $64%$ were used for training, $16%$ for validation, and $20%$ for testing.

#pagebreak()
= Results <results>
Our system ran for approximately $800$ hours and collected data on a total of $73'605$ interactions. The collected data is available at #link("https://gitlab.com/m-vz/varys-data").

In this section we will go over our results, beginning with the statistics of our collected dataset.

== Statistics <statistics>
Of the $73'605$ queries collected, $71'915$ were successfully completed (meaning there was no timeout and the system was not stopped during the interaction). Our system varys is therefore able to interact with Siri with a $97.70%$ success rate.

As mentioned in @overview, there are approximately $4$ to $8$ seconds of speaking during most interactions. The remaining duration (measured as the total duration minus the speaking duration) lies mostly between $25$ and $30$ seconds. This includes the time spent waiting for the voice assistant to answer and transcribing the response. It did not increase even for outliers with a significantly longer speaking duration.

On average, a query took $2.47$ seconds ($±0.64$) to say, which is expected since all query sentences are of similar length. The responses lasted on average $4.40$ seconds with a standard deviation of $5.63$s.

During the first week, there were two bugs that resulted in several downtimes (the longest of which lasted $~42$h because we could not get physical access to the system during the weekend). After that, the system ran flawlessly except during one network outage at the university and during maintenance.

== Transcription Accuracy <transcription-accuracy>
Compared to the results by #cite(<Radford_2022aa>, form: "prose") and #cite(<Seagraves_2022>, form: "prose"), our own results show an even higher transcription accuracy apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.

The query #emph(quote[How far away is Boston?]) was missing #emph(quote[as the crow flies]) from the correct transcription #emph(quote[[...] it's about 5,944 kilometers as the crow flies.]) about $36%$ of the time. Together with the correctly transcribed cases, this makes up $98.51%$ of this query's data. In the response to #emph(quote[What is the factorial of 6?]), Siri provided the source #link("https://solumaths.com", "solumaths.com"), which is text that whisper was unable to produce#footnote[In 1823 cases the name #emph[solumaths] was recognised as either #emph[solomus], #emph[solomoths], #emph[solomons], #emph[solomuths], #emph[soloomiths], #emph[salumith] or #emph[solemnus], but never correctly.]. Since mistakes in URLs are expected, we marked those transcriptions as correct.

There are some rare cases where the voice assistant responds with a single word. Since whisper does not support recognition of audio shorter than one second, these queries have an exceedingly high failure rate. One example is the query #emph(quote[Flip a coin]), which was only recognised correctly $~2%$ of the time. Out of the $140$ incorrectly transcribed responses, $134$ were cancelled due to the recorded audio being too short.

Another problem we identified with whisper is its tendency to hallucinate text when fed silent audio (one or more seconds of `0`-samples). Preventing this can be done by simply trimming off all silent audio in the preprocessing step before the transcription.

== Traffic Fingerprinting
With our implementation of a fingerprinting CNN, we were able to get promising results. Our network was trained on the binary, small and large datasets for 1000 epochs each. The accuracy axis of the training progress is bounded by the interval $[100 / n, 100]$, where $n$ is the number of labels. The bottom of the accuracy axis is therefore equivalent to random choice. (As an example, the accuracy of the binary classifier starts at 50%, which is choosing randomly between two options.) The loss axis is bounded to $[0, 3]$ to fit our data. The loss for training our large dataset started outside that range at $5.05$.

Most of the trends in these graphs show that with more training, the models could likely be improved. However, since this serves as a proof of concept and due to limited time and processing capacity, we did not train each network for more than $1000$ epochs.

For binary classification between the queries "Call John Doe" and "Call Mary Poppins", our final accuracy on the test set was $~71.19%$. This result is especially concerning since it suggests this method can differentiate between very similar queries. An attacker could train a network on names of people they want to monitor and tell from the encrypted network traffic whether the voice assistant was likely used to call that person.

The second classification we attempted was between the $13$ different queries from the small dataset, each from a different category. The result for this dataset is an $86.19%$ accuracy on the test set. Compared to less than $8%$ for a random guess, this confirms our concerns about the privacy and security of the HomePod.

The large dataset with $227$ different queries showed an accuracy of $40.40%$. This result is likely limited by the size of our network and training capabilities. When compared to randomly guessing at $0.44%$ though, the loss of privacy becomes apparent.

With adjustments to our CNN, more training data and hardware resources, we estimate that highly accurate classification can be achieved even between hundreds of different queries.

== On-Device Recognition <on-device-recognition>
According to HomePod marketing,

#quote(block: true, attribution: [#link("https://apple.com/homepod-mini")])[HomePod mini works with your iPhone for requests like hearing your messages or notes, so they are completed on device without revealing that information to Apple.]

#pagebreak()
= Conclusion
For this thesis, we built a highly reliable system that can autonomously collect data on interacting with smart speakers for hundreds of hours. Using our system, we ran an experiment to collect data on over $70'000$ interactions with Siri on a HomePod.

The data we collected enabled us to train two ML models to distinguish encrypted smart speaker traffic of $13$ or $227$ different queries and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of $~86%$ for the small dataset, $~40%$ for the large dataset and $~71%$ for the binary classifier, our system serves as a proof of concept that even though network traffic on a smart speaker is encrypted, patterns in the traffic can be used to accurately predict user behaviour. While this has been known about Amazon Alexa @Wang_2020aa @8802686 and Google Assistant @Wang_2020aa, we have shown the viability of the same attack on the HomePod despite Apple's claims about the privacy and security of their smart speakers.

== Future Work
While working on this thesis, several ideas for improvement, different types of data collection and approaches to the analysis of our data came up. Due to time constraints, we could not pursue all of these points. In this section we will briefly touch on some open questions or directions in which work could continue.

=== Recognising Different Speakers
All popular voice assistants support differentiating between users speaking to them. They give personalised responses to queries such as #emph(quote[Read my calendar]) or #emph(quote[Any new messages?]) by storing a voice print of the user. Some smart speakers like the HomePod go a step further and only respond to requests given by a registered user @Apple_2023.

Similarly to inferring the query from a traffic fingerprint, it may be possible to differentiate between different users using the smart speaker. Our system supports different voices that can be used to register different users on the voice assistants. To keep the scope our proof of concept experiment manageable, our traffic dataset was collected with the same voice. By recording more data on the same queries with different voices, we could explore the possibility of a classifier inferring the speaker of a query from the traffic trace.

=== Inferring Bounds of Traffic Trace <traffic-trace-bounds>
Currently, we assume an attacker knows when the user begins an interaction and when it ends. Using a sliding window technique similar to the one introduced by #cite(<Zhang_2018>, form: "prose"), a system could continuously monitor network traffic and infer when an interaction has taken place automatically.

=== Classifier for Personal Requests
As mentioned in @on-device-recognition, the traffic traces for personal requests show a visible group of outgoing traffic that should be straightforward to recognise automatically. A binary classifier that distinguishes the traffic of personal requests and normal queries on the HomePod should be straightforward to build.

Since this feature appears consistently on all personal request traces we looked at, in addition to recognising traffic for queries it was trained on, this model could feasibly also classify queries it has never seen.

=== Voice Assistants Based on LLMs <vas-based-on-llms>
With the release of Gemini by Google @Google_LLM_2024 and Amazon announcing generative AI for their Alexa voice assistant @Amazon_LLM_2023, the next years will quite certainly show a shift from traditional voice assistants to assistants based on [Large Language Models #emph("(LLMs)")]. Existing concerns about the lack of user privacy and security when using LLMs @Gupta_2023, combined with the frequency at which voice assistants process PII, warrant further research into voice assistants based on generative AI.

=== Tuning the Model Hyperparameters
For time reasons, we are using hyperparameters similar to the ones found by #cite(<Wang_2020aa>, form: "prose"), adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.

=== Non-Uniform Sampling of Interactions According to Real World Data
Currently, interactions with the voice assistants are sampled uniformly from a fixed dataset. According to a user survey, there are large differences in how often certain types of skills are used @NPR_2022. For example, asking for a summary of the news is likely done much less frequently than playing music.

An experiment could be run where the system asks questions with a likelihood corresponding to how often they occur in the real world. This might improve the performance of the recognition system trained on such a dataset.

=== Performance Improvements
The main bottleneck currently preventing us from interacting without pausing is our STT system. We estimate a lower bound of approximately 10 seconds per interaction. This entails the speaking duration plus the time waiting for a server response. With a more performant implementation of whisper, a superior transcription model or better hardware to run the model on, we could thus collect up to six interactions per minute.

=== Other Uses for `varys`
During data analysis, we discovered certain #emph[easter eggs] in the answers that Siri gives. As an example, it responds to the query #emph(quote[Flip a coin]) with #emph(quote[It... whoops! It fell in a crack.]) about $1%$ of the time – a response that forces the user to ask again should they actually need an answer. Since this behaviour does not have any direct security or privacy implications, we did not pursue it further. Having said this, varys does provide all the tools required to analyse these rare responses from a UX research point of view.

#pagebreak()
#bibliography("literature.bib", style: "ieee.csl", full: true)