refactor: replace : with _ in citation references
main.typ
@@ -163,14 +163,14 @@ Smart Speakers have become ubiquitous in many countries. The global sale of Amaz
To back claims about loss of privacy or decreased security, a thorough analysis of smart speakers becomes necessary. In this thesis, we build a system that allows us to collect a large amount of data on interactions with voice assistants.
== Previous Work
-There are several previous works analysing the privacy and security risks of voice assistants. Most of them go into topics concerning specifically Amazon Alexa since their adoption rate is significantly higher than Apple's Siri or Google Assistant @Bolton_2021. Some examples include experiments of smart speaker misactivations @Ford2019 @Dubois_2020 @schonherr2022101328, analyses of voice assistant #gls("skill")s @Natatsuka_2019 @Shezan_2020 @255322 @Edu_2022 @Xie_2022 and voice command fingerprinting @8802686 @Wang:2020aa. Apart from some old work on traffic patterns of Siri @caviglione_2013 and research by Apple @Apple_2023 @Apple_2018, we found little information about that voice assistant.
+There are several previous works analysing the privacy and security risks of voice assistants. Most of them go into topics concerning specifically Amazon Alexa since their adoption rate is significantly higher than Apple's Siri or Google Assistant @Bolton_2021. Some examples include experiments of smart speaker misactivations @Ford2019 @Dubois_2020 @schonherr2022101328, analyses of voice assistant #gls("skill")s @Natatsuka_2019 @Shezan_2020 @255322 @Edu_2022 @Xie_2022 and voice command fingerprinting @8802686 @Wang_2020aa. Apart from some old work on traffic patterns of Siri @caviglione_2013 and research by Apple @Apple_2023 @Apple_2018, we found little information about that voice assistant.
== Thesis Overview <overview>
The primary bottleneck when collecting large amounts of data on voice assistants is the actual time spent speaking and waiting for the response. According to our tests, the majority of interactions take between $4$ and $8$ seconds plus any additional delay between the question and the response. This aligns with the results by #pcite(<Haas_2022>), who found a median of $4.03$s per typical sentence. With the goal of collecting a large number of samples for a list of voice commands (henceforth called "queries"), the system will therefore need to run for a substantial amount of time.
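As a rough illustration (a back-of-the-envelope estimate with assumed numbers, not a figure from the thesis): collecting the more than 70,000 interactions mentioned in the conclusion at 6 seconds each, plus an assumed 2-second pause between interactions, already amounts to over 150 hours.

```python
# Back-of-the-envelope runtime estimate; 6 s per interaction is within the
# 4-8 s range above, the 2 s gap between interactions is an assumption.
interactions = 70_000
seconds_each = 6 + 2
print(f"{interactions * seconds_each / 3600:.0f} hours")  # ~156 hours
```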
With this in mind, the primary aim of this thesis is building a testbed that can autonomously interact with a smart speaker. This system should take a list of queries and continuously go through it while collecting audio recordings, network packet traces, timestamps and other relevant metadata about the interaction. In @background, we establish why it is of interest to collect this data and the implementation of our system is explored in @system-design.
-For the second part, documented in Sections @experiment[] and @results[], we run an experiment using our system, collecting interactions with an Apple HomePod mini#footnote[#link("https://apple.com/homepod-mini")] smart speaker. With this dataset of interactions, we train an #gls("ml") model for traffic fingerprinting. The system is adapted from a similar implementation by #pcite(<Wang:2020aa>) but will be applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.
+For the second part, documented in Sections @experiment[] and @results[], we run an experiment using our system, collecting interactions with an Apple HomePod mini#footnote[#link("https://apple.com/homepod-mini")] smart speaker. With this dataset of interactions, we train an #gls("ml") model for traffic fingerprinting. The system is adapted from a similar implementation by #pcite(<Wang_2020aa>) but will be applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.
#pagebreak()
= Background <background>
@@ -208,25 +208,25 @@ Having defined our unsupervised system for collecting interactions with smart sp
#cite(<Papadogiannaki_2021>, form: "author") write in their survey #quote(attribution: <Papadogiannaki_2021>, [In the second category #emph[interception of encrypted traffic,] we encounter solutions that take advantage of behavioural patterns or signatures to report a specific event.]) The main focus for us lies on the encrypted network traffic being collected. This traffic cannot be decrypted; however, we can analyse it using a technique called traffic fingerprinting.
== Traffic Fingerprinting
-Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method is used successfully to analyse web traffic @Wang_2020, traffic on the Tor network @Oh_2017, #gls("iot") device traffic @8802686 @Wang:2020aa @Mazhar:2020aa @Zuo_2019 @Trimananda:2019aa and user location @Ateniese_2015aa. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.
+Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method is used successfully to analyse web traffic @Wang_2020, traffic on the Tor network @Oh_2017, #gls("iot") device traffic @8802686 @Wang_2020aa @Mazhar_2020aa @Zuo_2019 @Trimananda_2019aa and user location @Ateniese_2015aa. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.
-We assume the concept of #emph[open world] and #emph[closed world] models from previous traffic fingerprinting works @Wang:2020aa @Wang_2020. Both models define an attacker who has a set of queries they are interested in. The interest in the open world scenario is whether or not a given command is in that set. The traffic analysed in this case can be from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In our thesis, we focus on the closed world model.
+We assume the concept of #emph[open world] and #emph[closed world] models from previous traffic fingerprinting works @Wang_2020aa @Wang_2020. Both models define an attacker who has a set of queries they are interested in. The interest in the open world scenario is whether or not a given command is in that set. The traffic analysed in this case can be from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In our thesis, we focus on the closed world model.
== Threat Model <threat-model>
-Our threat model is similar to the ones introduced by #pcite(<Wang:2020aa>) and @8802686, specifically,
+Our threat model is similar to the ones introduced by #pcite(<Wang_2020aa>) and @8802686, specifically,
- the attacker has access to the network the smart speaker is connected to;
- the MAC address of the smart speaker is known;
- the type of the voice assistant is known (e.g. Siri, Alexa or Google Assistant).
-Previous work @Mazhar:2020aa @Trimananda:2019aa shows that it is feasible to extract this information from the encrypted network traffic. Typically unencrypted DNS queries can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries asking about the domain `gsp-ssl.ls-apple.com.akadns.net` and #pcite(<Wang:2020aa>) found that the Amazon Echo smart speaker sends DNS queries about `unagi-na.amazon.com`. The specific domains queried may differ regionally (as indicated by the `-na` part in the Amazon subdomain), but should be simple to obtain.
+Previous work @Mazhar_2020aa @Trimananda_2019aa shows that it is feasible to extract this information from the encrypted network traffic. Typically unencrypted DNS queries can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries asking about the domain `gsp-ssl.ls-apple.com.akadns.net` and #pcite(<Wang_2020aa>) found that the Amazon Echo smart speaker sends DNS queries about `unagi-na.amazon.com`. The specific domains queried may differ regionally (as indicated by the `-na` part in the Amazon subdomain), but should be simple to obtain.
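To illustrate how such DNS queries could be spotted in a capture, a minimal scapy sketch (the capture file name is a placeholder; this is not part of varys):

```python
from scapy.all import rdpcap, DNSQR

# Domains mentioned above; the capture path is a placeholder.
SPEAKER_DOMAINS = {b"gsp-ssl.ls-apple.com.akadns.net", b"unagi-na.amazon.com"}

for packet in rdpcap("capture.pcap"):
    if packet.haslayer(DNSQR) and packet[DNSQR].qname.rstrip(b".") in SPEAKER_DOMAINS:
        # The source MAC address identifies the smart speaker on the local network.
        print(packet.src, packet[DNSQR].qname.decode())
```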
Finally, we assume the traffic is encrypted, since the attack would be trivial otherwise.
== Voice Assistant Choice
The most popular voice assistants are Apple's Siri, Amazon Alexa and Google Assistant @Bolton_2021. Support for Microsoft Cortana was dropped in August of 2023 @Microsoft_Cortana_2023, likely in favour of Copilot for Windows, which was announced one month later @Microsoft_Copilot_2023 (for more on this see @vas-based-on-llms).
-Our system supports the Siri and Alexa voice assistants but can be easily extended with support for more. For the experiment in @experiment, we decided to use Siri as it is still less researched @Bolton_2021 than Alexa, where a host of security and privacy issues have already been found @Wang:2020aa @8802686 @Ford2019 @Edu_2022 @Dubois_2020.
+Our system supports the Siri and Alexa voice assistants but can be easily extended with support for more. For the experiment in @experiment, we decided to use Siri as it is still less researched @Bolton_2021 than Alexa, where a host of security and privacy issues have already been found @Wang_2020aa @8802686 @Ford2019 @Edu_2022 @Dubois_2020.
#pagebreak()
= System Design <system-design>
@@ -406,7 +406,7 @@ Since running this test, Larynx has been succeeded by Piper#footnote(link("https
=== Speech-to-Text <speech-to-text>
Not all voice assistants provide a history of interactions, which is why we use a #gls("stt") system to transcribe the recorded answers.
-In our module varys-audio, we utilise an #gls("ml")-based speech recognition system called #gls("whisper"). It provides transcription error rates comparable to those of human transcription @Radford:2022aa and outperforms other local speech recognition models @Seagraves_2022 and commercial services @Radford:2022aa. In @transcription-accuracy, we explore how well #gls("whisper") worked for our use case.
+In our module varys-audio, we utilise an #gls("ml")-based speech recognition system called #gls("whisper"). It provides transcription error rates comparable to those of human transcription @Radford_2022aa and outperforms other local speech recognition models @Seagraves_2022 and commercial services @Radford_2022aa. In @transcription-accuracy, we explore how well #gls("whisper") worked for our use case.
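For reference, transcription with the openai-whisper package can be as simple as the following sketch (the audio file name is a placeholder; varys-audio's actual integration may differ):

```python
import whisper  # the openai-whisper package

# "medium" is the model size chosen in the Model Choice section further down.
model = whisper.load_model("medium")
result = model.transcribe("response.wav")
print(result["text"])
```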
== Data Storage
The module varys-database acts as an abstraction of our database, where all metadata about interactions are stored. Audio files and #gls("pcap") traces are saved to disk by varys-audio and varys-network respectively and their paths are referenced in the database.
@@ -570,7 +570,7 @@ Apart from the actual time speaking (as mentioned in @overview), the transcripti
==== Model Choice
There are five different sizes available for the #gls("whisper") model, shown in @tab-whisper-sizes. The step from the `medium` to the `large` and `large-v2` models does not significantly decrease #gls("wer") but more than doubles the number of parameters. Since in our tests, this increase in network size far more than doubles processing time, we opted for the `medium` model.
-#figure(caption: [Comparison of #gls("whisper") models by #gls("wer") on English datasets @Radford:2022aa.])[
+#figure(caption: [Comparison of #gls("whisper") models by #gls("wer") on English datasets @Radford_2022aa.])[
#let tiny = (39, calc.round((15.7, 28.8, 11.6, 12.4).sum() / 4, digits: 2))
#let base = (74, calc.round((11.7, 21.9, 9.5, 8.9).sum() / 4, digits: 2))
#let small = (244, calc.round((8.3, 14.5, 8.2, 6.1).sum() / 4, digits: 2))
@@ -667,7 +667,7 @@ The sound isolation of the box can reduce ambient noise in the range of $50 - 60
When the hardware is set up, the command `varys listen --calibrate` is used to find the ambient noise volume.
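A generic way to measure the ambient noise level looks roughly like the following sounddevice sketch (an illustration only, not the implementation behind `varys listen --calibrate`):

```python
import numpy as np
import sounddevice as sd

def ambient_noise_dbfs(seconds=5, samplerate=44100):
    """Record a short sample of ambient noise and return its RMS level in dBFS."""
    recording = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1, dtype="float32")
    sd.wait()
    rms = float(np.sqrt(np.mean(np.square(recording))))
    return 20 * np.log10(max(rms, 1e-10))

print(f"ambient noise: {ambient_noise_dbfs():.1f} dBFS")
```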
== Data Analysis
-We began data analysis by building a simple tool that takes a number of traffic traces and visualises them as shown in @fig-example-trace. Red and blue represent incoming and outgoing packets respectively and the strength of the colour visualises the packet size. The $x$-axis does not necessarily correspond with time, but with the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting @Wang:2020aa.
+We began data analysis by building a simple tool that takes a number of traffic traces and visualises them as shown in @fig-example-trace. Red and blue represent incoming and outgoing packets respectively and the strength of the colour visualises the packet size. The $x$-axis does not necessarily correspond with time, but with the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting @Wang_2020aa.
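A matplotlib sketch of this kind of plot, assuming each trace is stored as a sequence of signed packet sizes (outgoing positive, incoming negative); this is not the actual tool:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_traces(traces, max_size=1514):
    """Stack traces of signed packet sizes into an image: one row per trace,
    one column per packet; sign encodes direction, intensity encodes size."""
    rows = np.zeros((len(traces), max(len(t) for t in traces)))
    for i, trace in enumerate(traces):
        rows[i, :len(trace)] = trace
    # Reversed blue-white-red map: negative (incoming) red, positive (outgoing) blue.
    plt.imshow(rows, cmap="bwr_r", vmin=-max_size, vmax=max_size, aspect="auto")
    plt.xlabel("packet index")
    plt.ylabel("trace")
    plt.show()
```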
#figure(
caption: [Example traffic trace visualisation of the query #emph(quote[What's the temperature outside?]).],
@@ -697,9 +697,9 @@ $$$
This pads traces shorter than $475$ packets with zeroes, truncates longer ones and normalises all traces into the range $[-1, 1]$#footnote[The maximum size of the packets is $1514$ bytes, which is why we can simply divide by this to normalise our data.].
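A minimal NumPy sketch of this step, assuming each trace is an array of signed packet sizes (not the exact varys implementation):

```python
import numpy as np

MAX_PACKETS = 475  # fixed trace length
MAX_SIZE = 1514    # maximum packet size in bytes, used for normalisation

def preprocess(trace):
    """Pad or truncate a trace of signed packet sizes to MAX_PACKETS and scale to [-1, 1]."""
    trace = np.asarray(trace, dtype=np.float32)[:MAX_PACKETS]  # truncate long traces
    padded = np.zeros(MAX_PACKETS, dtype=np.float32)           # pad short traces with zeroes
    padded[:len(trace)] = trace
    return padded / MAX_SIZE                                   # normalise into [-1, 1]
```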
=== Adapting Network from Wang et al. <adapting-network>
-Since there is similar previous work on traffic fingerprinting Amazon Alexa and Google Assistant interactions by #pcite(<Wang:2020aa>), we based our network on their #gls("cnn"), shown in @diag-cnn-wang.
+Since there is similar previous work on traffic fingerprinting Amazon Alexa and Google Assistant interactions by #pcite(<Wang_2020aa>), we based our network on their #gls("cnn"), shown in @diag-cnn-wang.
-#diagram(<diag-cnn-wang>)[The #gls("cnn") design used in @Wang:2020aa.][
+#diagram(<diag-cnn-wang>)[The #gls("cnn") design used in @Wang_2020aa.][
#fletcher.diagram(
node-stroke: 1pt,
node-fill: rgb("eee"),
@@ -756,7 +756,7 @@ Since there is similar previous work on traffic fingerprinting Amazon Alexa and
In addition to a #gls("cnn"), they also tested an #gls("lstm"), a #gls("sae") and a combination of the networks in the form of ensemble learning. However, their best results for a single network came from their #gls("cnn"), so we decided to use that.
-Owing to limited computing resources, we reduced the size of our network to the one shown in @diag-cnn. The network of #cite(<Wang:2020aa>, form: "author") used a pool size of $1$, which is equivalent to a #emph[no-op]. Thus, we removed all #emph[Max Pooling] layers. Additionally, we reduced the four groups of convolutional layers to one.
+Owing to limited computing resources, we reduced the size of our network to the one shown in @diag-cnn. The network of #cite(<Wang_2020aa>, form: "author") used a pool size of $1$, which is equivalent to a #emph[no-op]. Thus, we removed all #emph[Max Pooling] layers. Additionally, we reduced the four groups of convolutional layers to one.
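A hedged Keras sketch of such a reduced network; filter counts, kernel size, dropout and optimiser are illustrative assumptions, not the exact parameters shown in @diag-cnn:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 227  # the large query set; 13 for the small one

# One group of convolutional layers instead of four, no Max Pooling
# (a pool size of 1 is a no-op), dense size equal to the input length.
model = keras.Sequential([
    layers.Input(shape=(475, 1)),
    layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
    layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(475, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```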
#diagram(<diag-cnn>)[The #gls("cnn") implemented by us.][
#fletcher.diagram(
@@ -789,9 +789,9 @@ Owing to limited computing resources, we reduced the size of our network to the
)
]
-Since the hyperparameter search performed by #cite(<Wang:2020aa>, form: "author") found $180$ to be the optimal dense network size when that was the maximum in their search space ${100, 110, ..., 170, 180}$, we can assume that the true optimal size is higher than $180$. In our case we used $475$ – the size of our input.
+Since the hyperparameter search performed by #cite(<Wang_2020aa>, form: "author") found $180$ to be the optimal dense network size when that was the maximum in their search space ${100, 110, ..., 170, 180}$, we can assume that the true optimal size is higher than $180$. In our case we used $475$ – the size of our input.
-To split our dataset into separate training, validation and testing parts, we used the same proportions as #pcite(<Wang:2020aa>)\; Of the full datasets, $64%$ were used for training, $16%$ for validation, and $20%$ for testing.
+To split our dataset into separate training, validation and testing parts, we used the same proportions as #pcite(<Wang_2020aa>)\; Of the full datasets, $64%$ were used for training, $16%$ for validation, and $20%$ for testing.
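A sketch of that split with scikit-learn; the file names and the fixed random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.load("traces.npy")  # preprocessed traces (placeholder path)
y = np.load("labels.npy")  # query labels (placeholder path)

# 20% held out for testing, then 20% of the remainder for validation:
# 0.8 * 0.8 = 64% training, 0.8 * 0.2 = 16% validation, 20% testing.
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.2, random_state=0)
```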
#pagebreak()
= Results <results>
@@ -909,7 +909,7 @@ In @fig-uptime we show the uptime of our system in hours from when it was set up
] <fig-uptime>
== Transcription Accuracy <transcription-accuracy>
-Compared to the results by #pcite(<Radford:2022aa>) and #pcite(<Seagraves_2022>), our own results show an even higher transcription accuracy apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.
+Compared to the results by #pcite(<Radford_2022aa>) and #pcite(<Seagraves_2022>), our own results show an even higher transcription accuracy apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.
@tab-stt-accuracy shows the manually verified results for six randomly picked queries (see @sql-different-responses). We filtered out any invalid queries where Siri was waiting for the answer to a follow-up question or could not answer due to a network issue.
#figure(caption: [#gls("whisper") speech recognition accuracy samples.])[#table(
@@ -1048,7 +1048,7 @@ Since our threat model from @threat-model assumes we collect data from the netwo
= Conclusion
For this thesis, we built a highly reliable system that can autonomously collect data on interactions with smart speakers for hundreds of hours. Using our system, we ran an experiment to collect data on over $70'000$ interactions with Siri on a HomePod.
-The data we collected enabled us to train two #gls("ml") models to distinguish encrypted smart speaker traffic of $13$ or $227$ different queries and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of $~86%$ for the small dataset, $~40%$ for the large dataset and $~71%$ for the binary classifier, our system serves as a proof of concept that even though network traffic on a smart speaker is encrypted, patterns in the traffic can be used to accurately predict user behaviour. While this has been known about Amazon Alexa @Wang:2020aa @8802686 and Google Assistant @Wang:2020aa, we have shown the viability of the same attack on the HomePod despite Apple's claims about the privacy and security of their smart speakers.
+The data we collected enabled us to train two #gls("ml") models to distinguish encrypted smart speaker traffic of $13$ or $227$ different queries and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of $~86%$ for the small dataset, $~40%$ for the large dataset and $~71%$ for the binary classifier, our system serves as a proof of concept that even though network traffic on a smart speaker is encrypted, patterns in the traffic can be used to accurately predict user behaviour. While this has been known about Amazon Alexa @Wang_2020aa @8802686 and Google Assistant @Wang_2020aa, we have shown the viability of the same attack on the HomePod despite Apple's claims about the privacy and security of their smart speakers.
== Future Work
While working on this thesis, several ideas for improvement, different types of data collection and approaches to the analysis of our data came up. Due to time constraints, we could not pursue all of these points. In this section we will briefly touch on some open questions or directions in which work could continue.
@@ -1103,7 +1103,7 @@ As a simple example in @fig-llm-varys, we used the popular GPT-4 model#footnote[
) <fig-llm-varys>
=== Tuning the Model Hyperparameters
-For time reasons, we are using hyperparameters similar to the ones found by #pcite(<Wang:2020aa>), adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.
+For time reasons, we are using hyperparameters similar to the ones found by #pcite(<Wang_2020aa>), adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.
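Such a search could be as simple as a grid over the dense layer size, sketched below; it reuses the data split shown earlier, and the grid, epochs and batch size are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(dense_size, input_length=475, num_classes=227):
    """Same shape of network as the sketch above, with a variable dense layer size."""
    model = keras.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(dense_size, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# x_train, y_train, x_val, y_val come from the split shown earlier.
best_size, best_accuracy = None, 0.0
for dense_size in range(100, 500, 20):  # a wider grid than the original {100, 110, ..., 180}
    model = build_model(dense_size)
    model.fit(x_train, y_train, epochs=30, batch_size=128, verbose=0)
    _, accuracy = model.evaluate(x_val, y_val, verbose=0)
    if accuracy > best_accuracy:
        best_size, best_accuracy = dense_size, accuracy
```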
=== Non-Uniform Sampling of Interactions According to Real World Data
Currently, interactions with the voice assistants are sampled uniformly from a fixed dataset. According to a user survey, there are large differences in how often certain types of #gls("skill", display: "skills") are used @NPR_2022. For example, asking for a summary of the news is likely done much less frequently than playing music.
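A minimal sketch of weighted sampling; the categories and weights are made up and would have to come from survey data such as @NPR_2022:

```python
import random

# Hypothetical queries with made-up relative usage weights.
queries = ["play some music", "what's the weather?", "give me a news summary"]
weights = [0.6, 0.3, 0.1]

next_query = random.choices(queries, weights=weights, k=1)[0]
print(next_query)
```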
@@ -1190,7 +1190,7 @@ This section contains words, concepts and abbreviations used in this thesis that
key: "wer",
short: "WER",
long: "word error rate",
-desc: [A metric for evaluating and comparing speech recognition systems based on string edit distance @Radford:2022aa.]
+desc: [A metric for evaluating and comparing speech recognition systems based on string edit distance @Radford_2022aa.]
),
(
key: "rtf",