Compare commits
1 commit: main...extracts-f (2971b8154f)
@@ -1,2 +0,0 @@
-# varys thesis
-View the thesis: [Master Thesis.pdf](<./Master Thesis.pdf>)
template.typ (22 changes)
@@ -27,24 +27,11 @@
 show par: set block(above: 1.2em, below: 1.2em)
 
 set heading(numbering: "1.1.")
 
 set figure(gap: 1em)
-show figure.caption: emph
 
 set table(align: left)
 
-// Set run-in subheadings, starting at level 4.
-show heading: it => {
-  if it.level > 3 {
-    parbreak()
-    text(11pt, style: "italic", weight: "regular", it.body + ".")
-  } else if it.level == 1 {
-    text(size: 20pt, it)
-  } else {
-    it
-  }
-}
-
 set par(leading: 0.75em)
 
 // Title page.
@@ -61,7 +48,7 @@
 #text(2em, weight: 700, title)
 #v(1.8em, weak: true)
 #text(1.1em, "Master Thesis")
 
 // Author information.
 #pad(
   top: 6em,
@@ -120,11 +107,6 @@
 v(1.618fr)
 pagebreak()
 
-// Table of contents.
-show outline.entry.where(level: 1): it => text(weight: "bold", it)
-outline(depth: 3, indent: true, fill: text(weight: "regular", repeat[.]))
-pagebreak()
-
 // Main body.
 set par(justify: true)
 
thesis.tex (new file, 823 lines)
@@ -0,0 +1,823 @@
\documentclass{article}

\usepackage{amsmath}
\usepackage{hyperref}

\title{A Testbed for Voice Assistant\\Traffic Fingerprinting\\[1ex]\large Master Thesis}
\date{February 29, 2024}

\begin{document}

\maketitle

\section{Introduction}\label{introduction}

Smart speakers have become ubiquitous in many countries. Global sales of Amazon Alexa devices surpassed half a billion in 2023 [Amazon\_2023] and Apple sells over \(10\) million HomePods every year [Statista\_Apple\_2023]. A study shows that \(35\%\) of adults in the U.S. own a smart speaker [NPR\_2022]. With half of the participants in that study reporting that they have heard an advertisement on their smart speaker, concerns about the privacy and security of these ever-listening devices are well founded, especially since they are usually placed where private or confidential conversations take place.

To back claims about loss of privacy or decreased security, a thorough analysis of smart speakers becomes necessary. In this thesis, we build a system that allows us to collect a large amount of data on interactions with voice assistants.

\subsection{Previous Work}

There are several previous works analysing the privacy and security risks of voice assistants. Most of them focus specifically on Amazon Alexa, since its adoption rate is significantly higher than that of Apple's Siri or Google Assistant [Bolton\_2021]. Some examples include experiments on smart speaker misactivations [Ford2019][Dubois\_2020][schonherr2022101328], analyses of voice assistant skills [Natatsuka\_2019][Shezan\_2020][255322][Edu\_2022][Xie\_2022] and voice command fingerprinting [8802686][Wang\_2020aa]. Apart from some earlier work on traffic patterns of Siri [caviglione\_2013] and research by Apple [Apple\_2023][Apple\_2018], we found little information about that voice assistant.

\subsection{Thesis Overview}\label{overview}

The primary bottleneck when collecting large amounts of data on voice assistants is the time actually spent speaking and waiting for the response. According to our tests, the majority of interactions take between \(4\) and \(8\) seconds, plus any additional delay between the question and the response. This aligns with the results of [Haas\_2022], who found a median of \(4.03\)\,s per typical sentence. With the goal of collecting a large number of samples for a list of voice commands (henceforth called ``queries''), the system will therefore need to run for a substantial amount of time.

With this in mind, the primary aim of this thesis is to build a testbed that can autonomously interact with a smart speaker. This system should take a list of queries and continuously go through it while collecting audio recordings, network packet traces, timestamps and other relevant metadata about the interaction. In Section~\ref{background}, we establish why it is of interest to collect this data, and the implementation of our system is explored in Section~\ref{system-design}.

For the second part, documented in Sections~\ref{experiment} and~\ref{results}, we run an experiment using our system, collecting interactions with an Apple HomePod mini\footnote{\url{https://apple.com/homepod-mini}} smart speaker. With this dataset of interactions, we train an ML model for traffic fingerprinting. The system is adapted from a similar implementation by [Wang\_2020aa] but is applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.

\section{Background}\label{background}

Having introduced our unsupervised system for collecting interactions with smart speakers, it is necessary to explain the reasoning behind collecting that data.

[Papadogiannaki\_2021] write in their survey: ``In the second category \emph{interception of encrypted traffic}, we encounter solutions that take advantage of behavioural patterns or signatures to report a specific event.'' The main focus for us lies on the encrypted network traffic being collected. This traffic cannot be decrypted; however, we can analyse it using a technique called traffic fingerprinting.

\subsection{Traffic Fingerprinting}

Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method has been used successfully to analyse web traffic [Wang\_2020], traffic on the Tor network [Oh\_2017], IoT device traffic [8802686][Wang\_2020aa][Mazhar\_2020aa][Zuo\_2019][Trimananda\_2019aa] and user location [Ateniese\_2015aa]. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.

We adopt the concept of \emph{open world} and \emph{closed world} models from previous traffic fingerprinting works [Wang\_2020aa][Wang\_2020]. Both models define an attacker who has a set of queries they are interested in. In the open world scenario, the attacker wants to know whether or not a given command is in that set; the traffic analysed in this case can come from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In this thesis, we focus on the closed world model.

\subsection{Threat Model}\label{threat-model}

Our threat model is similar to the ones introduced by [Wang\_2020aa] and [8802686]; specifically,

\begin{itemize}
\item
  the attacker has access to the network the smart speaker is connected to;
\item
  the MAC address of the smart speaker is known;
\item
  the type of the voice assistant is known (e.g. Siri, Alexa or Google Assistant).
\end{itemize}

Previous work [Mazhar\_2020aa][Trimananda\_2019aa] shows that it is feasible to extract this information from the encrypted network traffic. DNS queries, which are typically unencrypted, can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries for the domain \texttt{gsp-ssl.ls-apple.com.akadns.net}, and [Wang\_2020aa] found that the Amazon Echo smart speaker sends DNS queries for \texttt{unagi-na.amazon.com}. The specific domains queried may differ regionally (as indicated by the \texttt{-na} part of the Amazon subdomain), but should be simple to obtain.

Finally, we assume the traffic is encrypted, since the attack would be trivial otherwise.

\subsection{Voice Assistant Choice}

The most popular voice assistants are Apple's Siri, Amazon Alexa and Google Assistant [Bolton\_2021]. Support for Microsoft Cortana was dropped in August 2023 [Microsoft\_Cortana\_2023], likely in favour of Copilot for Windows, which was announced one month later [Microsoft\_Copilot\_2023] (for more on this, see Section~\ref{vas-based-on-llms}).

Our system supports the Siri and Alexa voice assistants but can easily be extended with support for more. For the experiment in Section~\ref{experiment}, we decided to use Siri, as it is still less researched than Alexa [Bolton\_2021], where a host of security and privacy issues have already been found [Wang\_2020aa][8802686][Ford2019][Edu\_2022][Dubois\_2020].

\section{System Design}\label{system-design}

The primary product of this thesis is a testbed that can autonomously interact with a smart speaker for a long time. Our system, called varys, collects the following data on its interactions with the voice assistant:

\begin{itemize}
\item
  Audio recordings of the query and the response
\item
  A text transcript of the query and the response
\item
  Encrypted network traffic traces from and to the smart speaker
\item
  The durations of different parts of the interaction
\end{itemize}

\subsection{Modularisation}

The wide range of data our system needs to collect makes it necessary to keep its complexity manageable. To simplify maintenance and increase extensibility, the functionality of our system is separated into five modules:

\begin{description}
\item[varys]
  The main executable combining all modules into the final system.
\item[\texttt{varys-analysis}]
  Analysis of data collected by varys.
\item[\texttt{varys-audio}]
  Recording audio and the TTS and STT systems.
\item[\texttt{varys-database}]
  Abstraction of the database system where interactions are stored.
\item[\texttt{varys-network}]
  Collection of network traffic, writing and parsing of pcap files.
\end{description}

The modules \texttt{varys-audio}, \texttt{varys-database} and \texttt{varys-network} are isolated from the rest of varys. If the need arises (e.g. if a different database connector is required or a more performant TTS system is found), they can easily be swapped out.

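To make this isolation concrete, the following sketch (not taken from the actual varys source; the trait and type names are hypothetical) shows how a Rust trait boundary would let one TTS engine be replaced by another without touching the rest of the system:

\begin{verbatim}
/// Hypothetical module boundary: the rest of varys would only depend on
/// this trait, not on a concrete TTS engine.
pub trait SpeechSynthesiser {
    /// Synthesise `text` and return raw mono samples at `sample_rate` Hz.
    fn synthesise(&mut self, text: &str, sample_rate: u32) -> Result<Vec<f32>, String>;
}

/// One possible implementation; a Larynx- or Piper-backed engine could
/// implement the same trait and be swapped in without changing callers.
pub struct AvFoundationTts;

impl SpeechSynthesiser for AvFoundationTts {
    fn synthesise(&mut self, _text: &str, _sample_rate: u32) -> Result<Vec<f32>, String> {
        // Placeholder: a real implementation would call into the platform TTS.
        Ok(Vec::new())
    }
}

/// Callers are generic over the trait, so the engine is interchangeable.
pub fn speak_query<S: SpeechSynthesiser>(tts: &mut S, query: &str) -> Result<Vec<f32>, String> {
    tts.synthesise(query, 16_000)
}

fn main() {
    let mut tts = AvFoundationTts;
    let samples = speak_query(&mut tts, "Hey Siri, what time is it?").unwrap();
    println!("synthesised {} samples", samples.len());
}
\end{verbatim}

The same pattern applies to the database connector and the packet capture backend.
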
The module \texttt{varys-analysis} does not currently perform any analysis of the recorded audio and thus does not depend on \texttt{varys-audio}.

In this section, we lay out the design considerations behind the modules that make up varys.

\subsection{Network}\label{network}

The system runs on a device that serves as a man-in-the-middle (MITM) between the Internet and an \emph{internal network}, for which it acts as the router. The latter is reserved for the devices that are monitored. This includes the voice assistant and any devices required by it. Since the majority of smart speakers do not have any wired connectivity, the \emph{internal network} is wireless.

This layout enables the MITM device to record all incoming and outgoing traffic to and from the voice assistant without any interference from other devices on the network. (We filter the network traffic by the MAC address of the smart speaker, so this isolation is not strictly necessary, but it helps to reduce the amount of garbage data collected.)

Currently, we assume that the beginning and end of an interaction with the voice assistant are known and that we can collect the network capture from when the user starts speaking to when the voice assistant stops speaking. We discuss this restriction in more detail in Section~\ref{traffic-trace-bounds}.

The \texttt{varys-network} module provides functionality to start and stop capturing traffic on a specific network interface and write it to a pcap file. Furthermore, it offers a wrapper structure around the network packets read from those files, which is used by \texttt{varys-analysis}. Each smart speaker has a different MAC address, which is used to distinguish between incoming and outgoing traffic. The MAC address is therefore also stored in the database alongside the path to the pcap file.

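As a rough illustration of this functionality (a sketch only, not the \texttt{varys-network} implementation: it assumes a recent version of the Rust \texttt{pcap} crate, and the interface name and MAC address are placeholders), a live capture can be written to a pcap file while packets are classified as incoming or outgoing by their source MAC address:

\begin{verbatim}
use pcap::Capture;

const SPEAKER_MAC: [u8; 6] = [0xaa, 0xbb, 0xcc, 0x11, 0x22, 0x33]; // placeholder

fn main() -> Result<(), pcap::Error> {
    // Open a live capture on the monitored interface (placeholder name).
    let mut capture = Capture::from_device("en0")?
        .promisc(true)
        .snaplen(65_535)
        .open()?;
    // Everything we see is also written to a pcap file for later analysis.
    let mut savefile = capture.savefile("interaction.pcap")?;

    for _ in 0..100 {
        let packet = capture.next_packet()?;
        savefile.write(&packet);
        // Bytes 6..12 of an Ethernet frame hold the source MAC address:
        // if it matches the smart speaker, the packet is outgoing.
        let outgoing = packet.data.len() >= 12 && packet.data[6..12] == SPEAKER_MAC;
        println!("{} bytes, outgoing: {}", packet.header.len, outgoing);
    }
    Ok(())
}
\end{verbatim}
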
\subsubsection{Monitoring}\label{monitoring}

Since the system runs for long periods of time without supervision, we need to monitor it for potential outages. We added support to varys for \emph{passive} monitoring: a high-uptime server receives a request from the system each time it completes an interaction. That server in turn notifies us if it has not received anything for a configurable amount of time.

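A minimal sketch of such a heartbeat, assuming the \texttt{reqwest} crate and a hypothetical monitoring endpoint (the real server and URL are not shown in this document):

\begin{verbatim}
use std::time::Duration;

/// Notify the monitoring server that an interaction has completed.
/// The URL is a placeholder; the server only needs to record the time of
/// the last ping and alert us when no ping arrives for too long.
fn send_heartbeat() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(5))
        .build()?;
    client
        .post("https://monitor.example.org/varys/heartbeat") // placeholder
        .send()?
        .error_for_status()?;
    Ok(())
}

fn main() {
    // Called once per completed interaction; a failed ping is logged but
    // must not interrupt the data collection itself.
    if let Err(error) = send_heartbeat() {
        eprintln!("heartbeat failed: {error}");
    }
}
\end{verbatim}
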
\subsection{Audio}

The largest module of our system is \texttt{varys-audio}. Its main components are the text-to-speech (TTS) and speech-to-text (STT) systems, expanded upon in the next two subsections. Apart from that, it stores the recorded audio files encoded in the space-efficient Opus [rfc6716] format and handles all audio processing: converting the audio to mono, downsampling it for space efficiency, and trimming the silence from the beginning and end.

\subsubsection{Text-to-Speech}\label{text-to-speech}

The only interface a voice assistant provides is speaking to it. Therefore, we need a system that can talk autonomously, for which there are two options:

\begin{itemize}
\item
  Prerecorded audio
\item
  Speech synthesis
\end{itemize}

The former was dismissed early for several reasons. Firstly, audio recordings would not support conversations that last for more than one query and one response. Secondly, each change in our dataset would necessitate recording new queries. Lastly, given the scope of this thesis, recording and editing more than 100 sentences was not feasible.

Thus, we decided to integrate a TTS system into our \texttt{varys-audio} module. This system must provide high-quality synthesis to minimise speech recognition mistakes on the smart speaker, and it must be able to generate speech without any noticeable performance overhead. The latter is typically measured by the real-time factor (rtf), with \(\text{rtf} > 1\) if the system takes longer to synthesise text than to speak that text and \(\text{rtf} \leq 1\) otherwise. A good real-time factor is especially important if our system is to support conversations in the future; the voice assistant only listens for a certain amount of time before cancelling an interaction.

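Concretely, the real-time factor is the ratio of synthesis time to the duration of the synthesised audio; a trivial sketch:

\begin{verbatim}
use std::time::Duration;

/// Real-time factor: synthesis time divided by the duration of the
/// synthesised audio. Values above 1 mean the engine is slower than
/// real time; values at or below 1 are acceptable for live interaction.
fn real_time_factor(synthesis_time: Duration, audio_duration: Duration) -> f64 {
    synthesis_time.as_secs_f64() / audio_duration.as_secs_f64()
}

fn main() {
    // Example: 1.2 s to synthesise 3.0 s of speech gives rtf = 0.4.
    let rtf = real_time_factor(Duration::from_millis(1200), Duration::from_secs(3));
    assert!(rtf <= 1.0);
    println!("rtf = {rtf}");
}
\end{verbatim}
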
We explored three different options with \(\text{rtf} \leq 1\) on the machine we tested on:

\begin{itemize}
\item
  espeak\footnote{\url{https://espeak.sourceforge.net}}
\item
  Larynx\footnote{\url{https://github.com/rhasspy/larynx}}
\item
  \texttt{tts-rs} using the \texttt{AVFoundation} backend\footnote{\url{https://github.com/ndarilek/tts-rs} -- the \texttt{AVFoundation} backend is only supported on macOS.}
\end{itemize}

We primarily based our choice on how well Amazon Alexa\footnote{We picked Alexa (over Siri) for these tests since its speech recognition performed worse in virtually every case.} would be able to understand different sentences. The test was performed manually by asking each of three questions 20 times with every TTS system. As a ground truth, the same questions were asked with a real voice. The results show that \texttt{tts-rs} with the \texttt{AVFoundation} backend performed approximately \(20\%\) better on average than both Larynx and espeak and came within \(15\%\) of the ground truth.

During this test, the difference in quality was clearly audible, as only the \texttt{AVFoundation} voices come close to sounding like a real voice. These were, apart from the ground truth, also the only voices that could be differentiated from each other by the voice assistants: recognition of the user voices worked well, while the guest voice was wrongly recognised as one of the others \(65\%\) of the time.

With these results, we decided to use \texttt{tts-rs} on a macOS machine to get access to the \texttt{AVFoundation} backend.

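A minimal sketch of driving \texttt{tts-rs} (based on the crate's documented high-level interface; method names may differ between versions, and the query text is only an example):

\begin{verbatim}
use tts::Tts;

fn main() -> Result<(), tts::Error> {
    // On macOS this picks the AVFoundation backend, which produced the
    // best recognition rates in our tests.
    let mut tts = Tts::default()?;
    // Speak a query; `false` means we do not interrupt speech that is
    // already playing.
    let _ = tts.speak("Hey Siri, what time is it?", false)?;
    Ok(())
}
\end{verbatim}
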
Since we ran this test, Larynx has been succeeded by Piper\footnote{\url{https://github.com/rhasspy/piper}} during the development of our system. The new system promises good results and may be an option for running varys on non-macOS systems in the future.

\subsubsection{Speech-to-Text}\label{speech-to-text}

Not all voice assistants provide a history of interactions, which is why we use an STT system to transcribe the recorded answers.

In our \texttt{varys-audio} module, we utilise an ML-based speech recognition system called whisper. It provides transcription error rates comparable to those of human transcription [Radford\_2022aa] and outperforms other local speech recognition models [Seagraves\_2022] as well as commercial services [Radford\_2022aa]. In Section~\ref{transcription-accuracy}, we explore how well whisper worked for our use case.

\subsection{Data Storage}

The module \texttt{varys-database} acts as an abstraction of our database, where all metadata about interactions is stored. Audio files and pcap traces are saved to disk by \texttt{varys-audio} and \texttt{varys-network} respectively, and their paths are referenced in the database.

Since some changes were made to the system while it was running, we store the version of varys a session was run with. This way, if there are any uncertainties about the collected data, the relevant code history and database migrations can be inspected.

To prevent data corruption if the project is moved, all paths are relative to a main \texttt{data} directory.

\subsection{Interaction Process}

The main executable of varys combines all modules introduced in this section to run a series of interactions. When the system is run, it loads the queries from a TOML file. Until it is stopped, the system then runs interaction sessions, each with a shuffled version of the original query list.

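As an illustration of this loop (a sketch rather than the actual varys code; the exact TOML layout, here a single \texttt{queries} list, is an assumption), the query list can be deserialised with \texttt{serde}/\texttt{toml} and reshuffled for every session:

\begin{verbatim}
use rand::seq::SliceRandom;
use serde::Deserialize;

#[derive(Deserialize)]
struct QueryFile {
    // Assumed layout: queries = ["Hey Siri, ...", ...]
    queries: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("queries.toml")?;
    let file: QueryFile = toml::from_str(&raw)?;
    let mut rng = rand::thread_rng();

    // Each session works through a freshly shuffled copy of the query list.
    loop {
        let mut session = file.queries.clone();
        session.shuffle(&mut rng);
        for query in &session {
            println!("asking: {query}");
            // run_interaction(query)?; // speak, record audio and traffic, ...
        }
        break; // in varys this loop runs until the system is stopped
    }
    Ok(())
}
\end{verbatim}
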
Resetting the voice assistant is implemented differently for each supported assistant.

\subsection{Performance Considerations}\label{performance}

The initial design of our system intended for it to run on a Raspberry Pi, a small-form-factor computer with limited processing and storage capabilities. Mostly because we did not find a TTS system that ran with satisfactory performance on the Raspberry Pi, we now use a laptop computer.

\subsubsection{Data Compression}

Because of the initial plan to use a Raspberry Pi, we compress the recorded audio with the Opus [rfc6716] codec and the captured network traffic with DEFLATE [rfc1951]. However, we are now running the experiment on a laptop, where space is no longer a concern. The audio codec does not add a noticeable processing overhead, but the pcap files are now stored uncompressed. Moreover, storing the network traces uncompressed simplifies our data analysis, since they can be used without any preprocessing.

\subsubsection{Audio Transcription}

Apart from the actual time spent speaking (as mentioned in Section~\ref{overview}), the transcription of audio is the main bottleneck for how frequently our system can run interactions. The following optimisations allowed us to reduce the average time per interaction from \(\sim 120\)\,s to \(\sim 30\)\,s:

\paragraph{Model Choice}

There are five different sizes available for the whisper model. The step from the \texttt{medium} to the \texttt{large} and \texttt{large-v2} models does not significantly decrease the word error rate but more than doubles the number of parameters. Since, in our tests, this increase in network size far more than doubled processing time, we opted for the \texttt{medium} model.

\paragraph{Transcription in Parallel}

The bottleneck of waiting for whisper to transcribe the response can be completely eliminated with a transcription queue that is processed in parallel to the main system. However, if the system runs for weeks at a time, this can lead to arbitrarily long post-processing times if the transcription takes longer on average than an interaction.

Our solution is to process the previous response while the next interaction is already running, but to wait with further interactions until the transcription queue (of length 1) is empty again. This approach reduces the time per interaction to \(\max(T_{t},T_{i})\) rather than \(T_{t} + T_{i}\), where \(T_{t}\) is the time taken for transcription and \(T_{i}\) the time taken for the rest of the interaction.

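One way to express this hand-off is a rendezvous channel: handing the recorded response to the transcription thread blocks only while the previous transcription is still running. A self-contained sketch with the actual whisper call replaced by a placeholder delay:

\begin{verbatim}
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // Capacity 0 (rendezvous): handing over a recording blocks until the
    // transcriber has finished the previous one and asks for the next.
    // This yields max(T_t, T_i) per interaction instead of T_t + T_i.
    let (sender, receiver) = sync_channel::<String>(0);

    let transcriber = thread::spawn(move || {
        for recording in receiver {
            // Placeholder for the whisper transcription.
            thread::sleep(Duration::from_millis(300));
            println!("transcribed {recording}");
        }
    });

    for i in 0..5 {
        // Placeholder for speaking the query and recording the response.
        thread::sleep(Duration::from_millis(200));
        let recording = format!("response-{i}.opus");
        // Runs concurrently with the next interaction unless the
        // transcriber is still busy with the previous response.
        sender.send(recording).unwrap();
    }
    drop(sender); // close the channel so the worker thread finishes
    transcriber.join().unwrap();
}
\end{verbatim}
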
\paragraph{Audio Preprocessing}

Another step we took to reduce the transcription time is to reduce the audio sample rate to \(16\)\,kHz, average both channels into one and trim the silence from the start and the end. This preprocessing does not noticeably reduce the quality of the transcription. Moreover, cutting off silent audio eliminates virtually all whisper hallucinations (see Section~\ref{transcription-accuracy}).

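A sketch of this preprocessing on raw interleaved stereo samples (plain Rust; the silence threshold is an arbitrary placeholder, and the decimation step is deliberately naive, so a real implementation should low-pass filter before downsampling):

\begin{verbatim}
/// Average interleaved stereo samples into a single mono channel.
fn to_mono(stereo: &[f32]) -> Vec<f32> {
    stereo.chunks_exact(2).map(|pair| (pair[0] + pair[1]) / 2.0).collect()
}

/// Naive downsampling by an integer factor (e.g. 48 kHz -> 16 kHz is factor 3).
fn downsample(samples: &[f32], factor: usize) -> Vec<f32> {
    samples.iter().step_by(factor).copied().collect()
}

/// Trim leading and trailing samples whose amplitude is below `threshold`.
fn trim_silence(samples: &[f32], threshold: f32) -> &[f32] {
    let start = samples.iter().position(|s| s.abs() > threshold).unwrap_or(samples.len());
    let end = samples.iter().rposition(|s| s.abs() > threshold).map_or(start, |i| i + 1);
    &samples[start..end]
}

fn main() {
    let stereo = vec![0.0_f32, 0.0, 0.2, 0.4, 0.6, 0.2, 0.0, 0.0];
    let mono = to_mono(&stereo);            // ~[0.0, 0.3, 0.4, 0.0]
    let downsampled = downsample(&mono, 1); // factor 1 here, 3 for 48 -> 16 kHz
    let trimmed = trim_silence(&downsampled, 0.01);
    println!("{trimmed:?}");                // ~[0.3, 0.4]
}
\end{verbatim}
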
\section{Experiment}\label{experiment}

To test our data collection system, we conducted several experiments in which we let varys interact with an Apple HomePod mini.

Three different lists of queries were tested:

\begin{itemize}
\item
  A large dataset with \(227\) queries, split into \(24\) categories
\item
  A small dataset with \(13\) queries, each from a different category
\item
  A binary dataset with the two queries ``Call John Doe'' and ``Call Mary Poppins''
\end{itemize}

Apple does not release an official list of Siri's capabilities. A dataset of \(\sim 800\) Siri commands was originally compiled from a user-collected list [Hey\_Siri\_Commands] and extended with queries from our own exploration. The HomePod does not support all Siri queries\footnote{Any queries that require Siri to show something to the user are answered with something along the lines of \emph{``I can show you the results on your iPhone.''}}. Therefore, during an initial test period, we manually removed commands from the list that did not result in a useful response. The number of redundant commands was further reduced to increase the number of samples that can be collected per query. Having fewer than 256 labels has the additional advantage that each query label fits in one byte.

Further analysis of our results can be found in Section~\ref{results}. In this section, we detail the experiment setup and our traffic fingerprinting system.

\subsection{Setup}

The experiment setup consists of a laptop running macOS\footnote{The choice of using macOS voices is explained in Section~\ref{text-to-speech}.}, connected to a pair of speakers and a microphone. The speakers and microphone, together with the HomePod, are placed inside a sound-isolated box. Additionally, the user's iPhone is required to be on the \emph{internal network} (see Section~\ref{network}) for Siri to be able to respond to personal requests.

The iPhone is set up with a dummy user account for \emph{Peter Pan} and two contacts, \emph{John Doe} and \emph{Mary Poppins}. Otherwise, the phone is kept at factory settings and the only changes come from interactions with the HomePod.

Finally, to minimise system downtime, an external server is used for monitoring, as explained in Section~\ref{monitoring}.

The experiment must be set up in a location with the following characteristics:

\begin{itemize}
\item
  A wired Internet connection\footnote{A wireless connection is not possible, since macOS does not support connecting to WiFi and creating a wireless network at the same time. If no wired connection is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to its Ethernet port) can be used to connect the Mac to the Internet.}
\item
  An available power outlet
\item
  No persons speaking nearby (the quieter the environment, the higher the voice recognition accuracy)
\item
  Noise from the experiment does not disturb anyone
\item
  Access to the room during working hours (or, if possible, at any time)
\end{itemize}

The sound isolation of the box can reduce ambient noise in the range of \(50\)--\(60\)\,dB to a significantly lower \(35\)--\(45\)\,dB, which increases the voice recognition and transcription accuracy. This also makes it easier to find an experiment location, since the noise disturbance is less significant.

When the hardware is set up, the command \texttt{varys listen --calibrate} is used to find the ambient noise volume.

\subsection{Data Analysis}

We began data analysis by building a simple tool that takes a number of traffic traces and visualises them. Red and blue represent incoming and outgoing packets respectively, and the strength of the colour indicates the packet size. The \(x\)-axis does not necessarily correspond to time, but to the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting [Wang\_2020aa].

\subsubsection{Traffic Preprocessing}

We write a network traffic trace as \(\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right)\), where \(n\) is the number of packets in that trace, \(s_{i}\) is the size of a packet in bytes and \(d_{i} \in \left\{ 0,1 \right\}\) is the direction of that packet.

To be used as input to our ML network, the traffic traces are preprocessed into normalised tensors of length \(475\). Our collected network traffic is converted as
\[
\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right) \rightarrow \left( p_{1},p_{2},\ldots,p_{475} \right),
\]
where for all \(i \in \lbrack 1,475\rbrack\) the components are calculated as
\[
p_{i} = \begin{cases}
( - 1)^{d_{i}} \cdot \frac{s_{i}}{1514} & \text{if } i \leq n, \\
0 & \text{otherwise}.
\end{cases}
\]

This pads traces shorter than \(475\) packets with zeroes, truncates longer ones and normalises all values into the range \(\lbrack - 1,1\rbrack\)\footnote{The maximum size of the packets is \(1514\) bytes, which is why we can simply divide by this value to normalise our data.}.

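This conversion is mechanical enough to state directly as code; a sketch in plain Rust, with packets given as the size/direction pairs defined above:

\begin{verbatim}
const TRACE_LEN: usize = 475;
const MAX_PACKET_SIZE: f32 = 1514.0;

/// Convert a trace of (size in bytes, direction) pairs into a fixed-length,
/// normalised input vector: the sign encodes direction, the magnitude the size.
fn preprocess(trace: &[(u16, u8)]) -> [f32; TRACE_LEN] {
    let mut input = [0.0_f32; TRACE_LEN]; // zero padding for short traces
    for (i, &(size, direction)) in trace.iter().take(TRACE_LEN).enumerate() {
        let sign = if direction == 0 { 1.0 } else { -1.0 }; // (-1)^d
        input[i] = sign * size as f32 / MAX_PACKET_SIZE;    // value in [-1, 1]
    }
    input
}

fn main() {
    // Two outgoing packets and one incoming packet, then zero padding.
    let trace = [(1514u16, 0u8), (66, 0), (1514, 1)];
    let input = preprocess(&trace);
    println!("{:?}", &input[..4]); // [1.0, 0.0435..., -1.0, 0.0]
}
\end{verbatim}
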
\subsubsection{Adapting the Network from Wang et al.}\label{adapting-network}

Since there is similar previous work by [Wang\_2020aa] on traffic fingerprinting of Amazon Alexa and Google Assistant interactions, we based our network on their convolutional neural network (CNN).

In addition to a CNN, they also tested an LSTM, an SAE and a combination of the networks in the form of ensemble learning. However, their best results for a single network came from their CNN, so we decided to use that architecture.

Owing to limited computing resources, we reduced the size of our network. The network of [Wang\_2020aa] used a pool size of \(1\), which is equivalent to a \emph{no-op}; thus, we removed all \emph{max pooling} layers. Additionally, we reduced the four groups of convolutional layers to one.

Since the hyperparameter search performed by [Wang\_2020aa] found \(180\) to be the optimal dense network size when that was the maximum of their search space \(\left\{ 100,110,\ldots,170,180 \right\}\), we can assume that the true optimal size is higher than \(180\). In our case we used \(475\) -- the size of our input.

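To make the shape of the reduced network concrete, the following sketch builds a comparable model with the \texttt{tch} crate (Rust bindings for libtorch). It is an illustration only: the channel count and kernel size are placeholders, not the hyperparameters of [Wang\_2020aa] or of our final model.

\begin{verbatim}
use tch::{nn, nn::Module, Device, Tensor};

const INPUT_LEN: i64 = 475; // length of a preprocessed trace
const CHANNELS: i64 = 16;   // placeholder hyperparameter
const KERNEL: i64 = 8;      // placeholder hyperparameter

fn model(vs: &nn::Path, n_classes: i64) -> impl Module {
    // One group of convolutions (instead of four), no max pooling,
    // followed by a dense layer of size 475 and the classification layer.
    let conv_out = CHANNELS * (INPUT_LEN - KERNEL + 1); // stride 1, no padding
    nn::seq()
        .add_fn(|x| x.view([-1, 1, INPUT_LEN])) // (batch, channel, length)
        .add(nn::conv1d(vs / "conv", 1, CHANNELS, KERNEL, Default::default()))
        .add_fn(|x| x.relu())
        .add_fn(move |x| x.view([-1, conv_out]))
        .add(nn::linear(vs / "dense", conv_out, INPUT_LEN, Default::default()))
        .add_fn(|x| x.relu())
        .add(nn::linear(vs / "out", INPUT_LEN, n_classes, Default::default()))
}

fn main() {
    let vs = nn::VarStore::new(Device::Cpu);
    let net = model(&vs.root(), 13); // e.g. the small dataset with 13 queries
    let batch = Tensor::zeros(&[32, INPUT_LEN], (tch::Kind::Float, Device::Cpu));
    let logits = net.forward(&batch);
    assert_eq!(logits.size(), vec![32, 13]);
}
\end{verbatim}
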
To split our dataset into separate training, validation and testing parts, we used the same proportions as [Wang\_2020aa]: of the full datasets, \(64\%\) were used for training, \(16\%\) for validation and \(20\%\) for testing.

\section{Results}\label{results}

Our system ran for approximately \(800\) hours and collected data on a total of \(73'605\) interactions. The collected data is available at \url{https://gitlab.com/m-vz/varys-data}.

In this section we go over our results, beginning with the statistics of our collected dataset.

\subsection{Statistics}\label{statistics}

Of the \(73'605\) queries collected, \(71'915\) were successfully completed (meaning there was no timeout and the system was not stopped during the interaction). Our system varys is therefore able to interact with Siri with a \(97.70\%\) success rate.

As mentioned in Section~\ref{overview}, there are approximately \(4\) to \(8\) seconds of speaking during most interactions. The remaining duration (measured as the total duration minus the speaking duration) lies mostly between \(25\) and \(30\) seconds. This includes the time spent waiting for the voice assistant to answer and transcribing the response. It did not increase even for outliers with a significantly longer speaking duration.

On average, a query took \(2.47\) seconds (\(\pm 0.64\)) to say, which is expected since all query sentences are of similar length. The responses lasted on average \(4.40\) seconds with a standard deviation of \(5.63\)\,s.

During the first week, there were two bugs that resulted in several downtimes (the longest of which lasted \(\sim 42\)\,h because we could not get physical access to the system during the weekend). After that, the system ran flawlessly except during one network outage at the university and during maintenance.

\subsection{Transcription Accuracy}\label{transcription-accuracy}

Compared to the results of [Radford\_2022aa] and [Seagraves\_2022], our own results show an even higher transcription accuracy, apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.

For the query \emph{``How far away is Boston?''}, the transcription was missing \emph{``as the crow flies''} from the correct response \emph{``[\ldots] it's about 5,944 kilometers as the crow flies.''} about \(36\%\) of the time. Together with the correctly transcribed cases, this makes up \(98.51\%\) of this query's data. In the response to \emph{``What is the factorial of 6?''}, Siri provided the source \href{https://solumaths.com}{solumaths.com}, which is text that whisper was unable to produce\footnote{In 1823 cases the name \emph{solumaths} was recognised as either \emph{solomus}, \emph{solomoths}, \emph{solomons}, \emph{solomuths}, \emph{soloomiths}, \emph{salumith} or \emph{solemnus}, but never correctly.}. Since mistakes in URLs are expected, we marked those transcriptions as correct.

There are some rare cases where the voice assistant responds with a single word. Since whisper does not support recognition of audio shorter than one second, these queries have an exceedingly high failure rate. One example is the query \emph{``Flip a coin''}, which was only recognised correctly \(\sim 2\%\) of the time. Out of the \(140\) incorrectly transcribed responses, \(134\) were cancelled due to the recorded audio being too short.

Another problem we identified with whisper is its tendency to hallucinate text when fed silent audio (one or more seconds of \texttt{0}-samples). This can be prevented by simply trimming off all silent audio in the preprocessing step before the transcription.

\subsection{Traffic Fingerprinting}

With our implementation of a fingerprinting CNN, we were able to obtain promising results. Our network was trained on the binary, small and large datasets for \(1000\) epochs each. The accuracy axis of the training progress plots is bounded by the interval \(\left\lbrack \frac{100}{n},100 \right\rbrack\), where \(n\) is the number of labels. The bottom of the accuracy axis is therefore equivalent to random choice. (As an example, the accuracy of the binary classifier starts at \(50\%\), which is choosing randomly between two options.) The loss axis is bounded to \(\lbrack 0,3\rbrack\) to fit our data; the loss for training our large dataset started outside that range at \(5.05\).

Most of the trends in these graphs show that with more training, the models could likely be improved. However, since this serves as a proof of concept and due to limited time and processing capacity, we did not train each network for more than \(1000\) epochs.

For binary classification between the queries ``Call John Doe'' and ``Call Mary Poppins'', our final accuracy on the test set was \(\sim 71.19\%\). This result is especially concerning since it suggests that this method can differentiate between very similar queries. An attacker could train a network on the names of people they want to monitor and tell from the encrypted network traffic whether the voice assistant was likely used to call one of those people.

The second classification we attempted was between the \(13\) different queries from the small dataset, each from a different category. The result for this dataset is an \(86.19\%\) accuracy on the test set. Compared to less than \(8\%\) for a random guess, this confirms our concerns about the privacy and security of the HomePod.

The large dataset with \(227\) different queries showed an accuracy of \(40.40\%\). This result is likely limited by the size of our network and our training capabilities. When compared to randomly guessing at \(0.44\%\), though, the loss of privacy becomes apparent.

With adjustments to our CNN, more training data and more hardware resources, we estimate that highly accurate classification can be achieved even between hundreds of different queries.

\subsection{On-Device Recognition}\label{on-device-recognition}

According to HomePod marketing,

\begin{quote}
HomePod mini works with your iPhone for requests like hearing your messages or notes, so they are completed on device without revealing that information to Apple.

-- \url{https://apple.com/homepod-mini}
\end{quote}

\section{Conclusion}

For this thesis, we built a highly reliable system that can autonomously collect data on interactions with smart speakers for hundreds of hours. Using our system, we ran an experiment to collect data on over \(70'000\) interactions with Siri on a HomePod.

The data we collected enabled us to train two ML models to distinguish the encrypted smart speaker traffic of \(13\) or \(227\) different queries, and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of \(\sim 86\%\) for the small dataset, \(\sim 40\%\) for the large dataset and \(\sim 71\%\) for the binary classifier, our system serves as a proof of concept that even though the network traffic of a smart speaker is encrypted, patterns in that traffic can be used to accurately predict user behaviour. While this was already known for Amazon Alexa [Wang\_2020aa][8802686] and Google Assistant [Wang\_2020aa], we have shown the viability of the same attack on the HomePod, despite Apple's claims about the privacy and security of their smart speakers.

\subsection{Future Work}

While working on this thesis, several ideas for improvements, different types of data collection and approaches to the analysis of our data came up. Due to time constraints, we could not pursue all of these points. In this section we briefly touch on some open questions and directions in which work could continue.

\subsubsection{Recognising Different Speakers}

All popular voice assistants support differentiating between the users speaking to them. They give personalised responses to queries such as \emph{``Read my calendar''} or \emph{``Any new messages?''} by storing a voice print of the user. Some smart speakers like the HomePod go a step further and only respond to requests given by a registered user [Apple\_2023].

Similarly to inferring the query from a traffic fingerprint, it may be possible to differentiate between different users of the smart speaker. Our system supports different voices that can be used to register different users on the voice assistants. To keep the scope of our proof-of-concept experiment manageable, our traffic dataset was collected with a single voice. By recording more data on the same queries with different voices, we could explore the possibility of a classifier inferring the speaker of a query from the traffic trace.

\subsubsection{Inferring the Bounds of a Traffic Trace}\label{traffic-trace-bounds}

Currently, we assume an attacker knows when the user begins an interaction and when it ends. Using a sliding window technique similar to the one introduced by [Zhang\_2018], a system could continuously monitor network traffic and automatically infer when an interaction has taken place.

\subsubsection{Classifier for Personal Requests}

As mentioned in Section~\ref{on-device-recognition}, the traffic traces for personal requests show a visible group of outgoing traffic that should be straightforward to recognise automatically. A binary classifier that distinguishes the traffic of personal requests from that of normal queries on the HomePod should therefore be straightforward to build.

Since this feature appears consistently in all personal request traces we looked at, such a model could feasibly also classify queries it has never seen, in addition to recognising traffic for queries it was trained on.

\subsubsection{Voice Assistants Based on LLMs}\label{vas-based-on-llms}

With the release of Gemini by Google [Google\_LLM\_2024] and Amazon announcing generative AI for their Alexa voice assistant [Amazon\_LLM\_2023], the next years will quite certainly show a shift from traditional voice assistants to assistants based on large language models (LLMs). Existing concerns about the lack of user privacy and security when using LLMs [Gupta\_2023], combined with the frequency at which voice assistants process personally identifiable information (PII), warrant further research into voice assistants based on generative AI.

\subsubsection{Tuning the Model Hyperparameters}

For time reasons, we are using hyperparameters similar to the ones found by [Wang\_2020aa], adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.

\subsubsection{Non-Uniform Sampling of Interactions According to Real-World Data}

Currently, interactions with the voice assistants are sampled uniformly from a fixed dataset. According to a user survey, there are large differences in how often certain types of skills are used [NPR\_2022]. For example, asking for a summary of the news is likely done much less frequently than playing music.

An experiment could be run where the system asks questions with a likelihood corresponding to how often they occur in the real world. This might improve the performance of a recognition system trained on such a dataset.

\subsubsection{Performance Improvements}

The main bottleneck currently preventing us from interacting without pausing is our STT system. We estimate a lower bound of approximately \(10\) seconds per interaction, comprising the speaking duration plus the time spent waiting for the response. With a more performant implementation of whisper, a superior transcription model or better hardware to run the model on, we could thus collect up to six interactions per minute.

\subsubsection{\texorpdfstring{Other Uses for \texttt{varys}}{Other Uses for varys}}

During data analysis, we discovered certain \emph{easter eggs} in the answers that Siri gives. As an example, it responds to the query \emph{``Flip a coin''} with \emph{``It\ldots{} whoops! It fell in a crack.''} about \(1\%\) of the time -- a response that forces the user to ask again should they actually need an answer. Since this behaviour does not have any direct security or privacy implications, we did not pursue it further. Having said this, varys does provide all the tools required to analyse such rare responses from a UX research point of view.

\end{document}