Compare commits: extracts-f...main

1 commit: 944d4e9d19
README.md (new file, +2)

@@ -0,0 +1,2 @@
+# varys thesis
+View the thesis: [Master Thesis.pdf](<./Master Thesis.pdf>)
template.typ (+18)

@@ -29,9 +29,22 @@
 set heading(numbering: "1.1.")

 set figure(gap: 1em)
+show figure.caption: emph

 set table(align: left)

+// Set run-in subheadings, starting at level 4.
+show heading: it => {
+  if it.level > 3 {
+    parbreak()
+    text(11pt, style: "italic", weight: "regular", it.body + ".")
+  } else if it.level == 1 {
+    text(size: 20pt, it)
+  } else {
+    it
+  }
+}
+
 set par(leading: 0.75em)

 // Title page.
@@ -107,6 +120,11 @@
 v(1.618fr)
 pagebreak()

+// Table of contents.
+show outline.entry.where(level: 1): it => text(weight: "bold", it)
+outline(depth: 3, indent: true, fill: text(weight: "regular", repeat[.]))
+pagebreak()
+
 // Main body.
 set par(justify: true)

thesis.tex (deleted, -823)

@@ -1,823 +0,0 @@
|
|||||||
\documentclass{article}
|
|
||||||
|
|
||||||
\title{February 29, 2024 A Testbed for Voice Assistant\\
|
|
||||||
Traffic Fingerprinting Master Thesis}
|
|
||||||
|
|
||||||
\begin{document}
|
|
||||||
|
|
||||||
\section{Introduction}\label{introduction}
|
|
||||||
|
|
||||||
Smart Speakers have become ubiquitous in many countries. The global sale
|
|
||||||
of Amazon Alexa devices surpassed half a billion in 2023
|
|
||||||
{[}Amazon\_2023{]} and Apple sells over \(10\) million HomePods every
|
|
||||||
year {[}Statista\_Apple\_2023{]}. A study shows that \(35\%\) of adults
|
|
||||||
in the U.S. own a smart speaker {[}NPR\_2022{]}. With half of the
|
|
||||||
participants in the study reporting to have heard an advertisement on
|
|
||||||
their smart speaker before, concerns about the privacy and security of
|
|
||||||
these ever-listening devices are well founded. Especially since they are
|
|
||||||
usually placed where private or confidential conversations take place.
|
|
||||||
|
|
||||||
To back claims about loss of privacy or decreased security, a thorough
|
|
||||||
analysis of smart speakers becomes necessary. In this thesis, we build a
|
|
||||||
system that allows us to collect a large amount of data on interactions
|
|
||||||
with voice assistants.
|
|
||||||
|
|
||||||
\subsection{Previous Work}

There are several previous works analysing the privacy and security risks of voice assistants. Most of them focus specifically on Amazon Alexa, since its adoption rate is significantly higher than that of Apple's Siri or Google Assistant {[}Bolton\_2021{]}. Some examples include experiments on smart speaker misactivations {[}Ford2019{]}{[}Dubois\_2020{]}{[}schonherr2022101328{]}, analyses of voice assistant skills {[}Natatsuka\_2019{]}{[}Shezan\_2020{]}{[}255322{]}{[}Edu\_2022{]}{[}Xie\_2022{]} and voice command fingerprinting {[}8802686{]}{[}Wang\_2020aa{]}. Apart from some early work on traffic patterns of Siri {[}caviglione\_2013{]} and research by Apple {[}Apple\_2023{]}{[}Apple\_2018{]}, we found little information about that voice assistant.

\subsection{Thesis Overview}\label{overview}

The primary bottleneck when collecting large amounts of data on voice assistants is the actual time speaking and waiting for the response. According to our tests, the majority of interactions take between \(4\) and \(8\) seconds plus any additional delay between the question and the response. This aligns with the results by {[}Haas\_2022{]}, who found a median of \(4.03\)s per typical sentence. With the goal of collecting a large number of samples for a list of voice commands (henceforth called ``queries''), the system will therefore need to run for a substantial amount of time.

With this in mind, the primary aim of this thesis is building a testbed that can autonomously interact with a smart speaker. This system should take a list of queries and continuously go through it while collecting audio recordings, network packet traces, timestamps and other relevant metadata about the interaction. In \hyperref[background]{{[}background{]}}, we establish why it is of interest to collect this data, and the implementation of our system is explored in \hyperref[system-design]{{[}system-design{]}}.

For the second part, documented in Sections \hyperref[experiment]{{[}experiment{]}} and \hyperref[results]{{[}results{]}}, we run an experiment using our system, collecting interactions with an Apple HomePod mini\footnote{\url{https://apple.com/homepod-mini}} smart speaker. With this dataset of interactions, we train an ML model for traffic fingerprinting. The system is adapted from a similar implementation by {[}Wang\_2020aa{]} but will be applied to data collected from the Siri voice assistant instead of Alexa and Google Assistant.

\section{Background}\label{background}

Having introduced our unsupervised system for collecting interactions with smart speakers, it is necessary to explain the reasoning behind collecting that data.

{[}Papadogiannaki\_2021{]} write in their survey: ``In the second category \emph{interception of encrypted traffic,} we encounter solutions that take advantage of behavioural patterns or signatures to report a specific event.'' The main focus for us lies on the encrypted network traffic being collected. This traffic cannot be decrypted; however, we can analyse it using a technique called traffic fingerprinting.

\subsection{Traffic Fingerprinting}

Traffic fingerprinting uses pattern recognition approaches on collected, encrypted traffic traces to infer user behaviour. This method has been used successfully to analyse web traffic {[}Wang\_2020{]}, traffic on the Tor network {[}Oh\_2017{]}, IoT device traffic {[}8802686{]}{[}Wang\_2020aa{]}{[}Mazhar\_2020aa{]}{[}Zuo\_2019{]}{[}Trimananda\_2019aa{]} and user location {[}Ateniese\_2015aa{]}. Traffic fingerprinting on smart speakers is generally used to infer the command given to the voice assistant by the user. To enable us to run traffic fingerprinting, our system stores the query and the network traffic for each interaction it has with the smart speaker.

We adopt the concept of \emph{open world} and \emph{closed world} models from previous traffic fingerprinting works {[}Wang\_2020aa{]}{[}Wang\_2020{]}. Both models define an attacker who has a set of queries they are interested in. The interest in the open world scenario is whether or not a given command is in that set. The traffic analysed in this case can be from either known queries or queries not seen before. In the closed world scenario, the attacker wants to differentiate between the different queries in that set, but cannot tell whether a new query is in it or not. As an example, the open world model could be used to infer whether the smart speaker is being used to communicate with a person that an attacker is monitoring. In the closed world case, the attacker might want to differentiate between different people a user has contacted. In our thesis, we focus on the closed world model.

\subsection{Threat Model}\label{threat-model}

Our threat model is similar to the ones introduced by {[}Wang\_2020aa{]} and {[}8802686{]}, specifically:

\begin{itemize}
\item
  the attacker has access to the network the smart speaker is connected to;
\item
  the MAC address of the smart speaker is known;
\item
  the type of the voice assistant is known (e.g. Siri, Alexa or Google Assistant).
\end{itemize}

Previous work {[}Mazhar\_2020aa{]}{[}Trimananda\_2019aa{]} shows that it is feasible to extract this information from the encrypted network traffic. The typically unencrypted DNS queries can also be used to identify smart speakers on a network. Our own analysis shows that the Apple HomePod regularly sends DNS queries asking about the domain \texttt{gsp-ssl.ls-apple.com.akadns.net} and {[}Wang\_2020aa{]} found that the Amazon Echo smart speaker sends DNS queries about \texttt{unagi-na.amazon.com}. The specific domains queried may differ regionally (as indicated by the \texttt{-na} part in the Amazon subdomain), but should be simple to obtain.

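
To make this concrete, a minimal Rust sketch of such a DNS-based heuristic is shown below. It is purely illustrative and not part of varys: the frame handling is simplified and the label list is an assumption based on the domains mentioned above.

\begin{verbatim}
// Illustrative sketch (not the varys implementation): given a raw Ethernet
// frame captured on the local network, guess whether it is a DNS query for a
// domain known to identify a smart speaker and, if so, return the source MAC
// address. DNS encodes names as length-prefixed labels, so a plain substring
// search for a single label is a crude but workable heuristic here.
const KNOWN_LABELS: &[&str] = &["gsp-ssl", "unagi-na"];

fn speaker_mac(frame: &[u8]) -> Option<[u8; 6]> {
    // Ethernet header: destination MAC (6 B), source MAC (6 B), EtherType (2 B).
    if frame.len() < 14 {
        return None;
    }
    let payload = &frame[14..];
    let matches = KNOWN_LABELS
        .iter()
        .any(|label| payload.windows(label.len()).any(|w| w == label.as_bytes()));
    if matches {
        let mut mac = [0u8; 6];
        mac.copy_from_slice(&frame[6..12]);
        Some(mac)
    } else {
        None
    }
}

fn main() {
    // Toy frame: 14-byte Ethernet header followed by a payload containing a known label.
    let mut frame = vec![0u8; 14];
    frame[6..12].copy_from_slice(&[0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff]);
    frame.extend_from_slice(b"...gsp-ssl.ls-apple.com...");
    println!("{:?}", speaker_mac(&frame)); // Some([170, 187, 204, 221, 238, 255])
}
\end{verbatim}
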
Finally, we assume the traffic is encrypted, since the attack would be trivial otherwise.

\subsection{Voice Assistant Choice}

The most popular voice assistants are Apple's Siri, Amazon Alexa and Google Assistant {[}Bolton\_2021{]}. Support for Microsoft Cortana was dropped in August 2023 {[}Microsoft\_Cortana\_2023{]}, likely in favour of Copilot for Windows, which was announced one month later {[}Microsoft\_Copilot\_2023{]} (for more on this see \hyperref[vas-based-on-llms]{{[}vas-based-on-llms{]}}).

Our system supports the Siri and Alexa voice assistants but can be easily extended with support for more. For the experiment in \hyperref[experiment]{{[}experiment{]}}, we decided to use Siri as it is still less researched {[}Bolton\_2021{]} than Alexa, where a host of security and privacy issues have already been found {[}Wang\_2020aa{]}{[}8802686{]}{[}Ford2019{]}{[}Edu\_2022{]}{[}Dubois\_2020{]}.

\section{System Design}\label{system-design}

The primary product of this thesis is a testbed that can autonomously interact with a smart speaker for a long time. Our system, called varys, collects the following data on its interactions with the voice assistant:

\begin{itemize}
\item
  Audio recordings of the query and the response
\item
  A text transcript of the query and the response
\item
  Encrypted network traffic traces from and to the smart speaker
\item
  The durations of different parts of the interaction
\end{itemize}

\subsection{Modularisation}

The wide range of data our system needs to collect makes it necessary to keep its complexity manageable. To simplify maintenance and increase extensibility, the functionality of our system is separated into five modules:

\begin{description}
\item[varys]
  The main executable combining all modules into the final system.
\item[\texttt{varys-analysis}]
  Analysis of data collected by varys.
\item[\texttt{varys-audio}]
  Recording audio and the TTS and STT systems.
\item[\texttt{varys-database}]
  Abstraction of the database system where interactions are stored.
\item[\texttt{varys-network}]
  Collection of network traffic, writing and parsing of pcap files.
\end{description}

The modules \texttt{varys-audio}, \texttt{varys-database} and \texttt{varys-network} are isolated from the rest of varys. If the need arises (e.g. if a different database connector is required or a more performant TTS system is found), they can be easily swapped out.

The module \texttt{varys-analysis} does not currently analyse the recorded audio and thus does not depend on \texttt{varys-audio}.

In this section, we will lay out the design considerations behind the modules that make up varys.

\subsection{Network}\label{network}

The system runs on a device that serves as a man-in-the-middle (MITM) between the Internet and an \emph{internal network}, for which it acts as a router. The latter is reserved for the devices that are monitored. This includes the voice assistant and any devices required by it. Since the majority of smart speakers do not have any wired connectivity, the \emph{internal network} is wireless.

This layout enables the MITM device to record all incoming and outgoing traffic to and from the voice assistant without any interference from other devices on the network. (We filter the network traffic by the MAC address of the smart speaker, so this is not strictly necessary, but it helps to reduce the amount of garbage data collected.)

Currently, we assume that the beginning and end of an interaction with the voice assistant are known and we can collect the network capture from when the user started speaking to when the voice assistant stopped speaking. We discuss this restriction in more detail in \hyperref[traffic-trace-bounds]{{[}traffic-trace-bounds{]}}.

The \texttt{varys-network} module provides functionality to start and stop capturing traffic on a specific network interface and write it to a pcap file. Furthermore, it offers a wrapper structure around the network packets read from those files, which is used by \texttt{varys-analysis}. Each smart speaker has a different MAC address that is used to distinguish between incoming and outgoing traffic. The MAC address is therefore also stored in the database alongside the path to the pcap file.

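
The following is a minimal sketch of what such a capture loop could look like. It is illustrative rather than the actual \texttt{varys-network} code; it assumes the Rust \texttt{pcap} crate (recent versions provide \texttt{next\_packet} and the two-argument \texttt{filter}), and the interface name and MAC address are placeholders.

\begin{verbatim}
// Minimal capture sketch in the spirit of varys-network (not the actual
// implementation). Uses the `pcap` crate; the exact method names assume a
// recent crate version.
use pcap::Capture;

fn capture_interaction(interface: &str, speaker_mac: &str, out: &str) -> Result<(), pcap::Error> {
    let mut cap = Capture::from_device(interface)?
        .promisc(true)
        .open()?;
    // Only keep traffic from and to the smart speaker, identified by its MAC address.
    cap.filter(&format!("ether host {speaker_mac}"), true)?;
    let mut file = cap.savefile(out)?;
    // In varys this loop would run between the start and end of an interaction;
    // here we simply store a fixed number of packets.
    for _ in 0..1000 {
        let packet = cap.next_packet()?;
        file.write(&packet);
    }
    Ok(())
}

fn main() {
    // Interface name and MAC address are placeholders.
    if let Err(e) = capture_interaction("en0", "aa:bb:cc:dd:ee:ff", "interaction.pcap") {
        eprintln!("capture failed: {e}");
    }
}
\end{verbatim}
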
\subsubsection{Monitoring}\label{monitoring}

Since the system runs for long periods of time without supervision, we need to monitor potential outages. We added support to varys for \emph{passive} monitoring. A high-uptime server receives a request from the system each time it completes an interaction. That server in turn notifies us if it has not received anything for a configurable amount of time.

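
A heartbeat of this kind can be very small. The sketch below is illustrative only; the endpoint URL and the \texttt{ureq} crate are assumptions, not the actual varys implementation.

\begin{verbatim}
// Sketch of the passive monitoring heartbeat (illustrative; endpoint URL and
// crate choice are assumptions). After every completed interaction the system
// pings a high-uptime server, which alerts us if no ping arrives for a
// configurable amount of time.
fn send_heartbeat() {
    // `ureq` is used here for a simple blocking HTTP GET.
    match ureq::get("https://monitor.example.org/ping/varys").call() {
        Ok(_) => {}
        // A failed heartbeat must not interrupt data collection.
        Err(e) => eprintln!("heartbeat failed: {e}"),
    }
}

fn main() {
    // Called once per completed interaction in the real system.
    send_heartbeat();
}
\end{verbatim}
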
\subsection{Audio}

The largest module of our system is \texttt{varys-audio}. Its main components are the TTS and STT systems, expanded upon in the next two subsections. Apart from that, it stores the recorded audio files encoded in the space-optimised OPUS {[}rfc6716{]} format and handles all audio processing: converting the audio to mono, downsampling it for space efficiency, and trimming the silence from the beginning and end.

\subsubsection{Text-to-Speech}\label{text-to-speech}

The only interface a voice assistant provides is speaking to it. Therefore, we need a system that can talk autonomously, for which there are two options:

\begin{itemize}
\item
  Prerecorded audio
\item
  Speech synthesis
\end{itemize}

The former was dismissed early for several reasons. Firstly, audio recordings would not support conversations that last for more than one query and one response. Secondly, each change in our dataset would necessitate recording new queries. Lastly, given the scope of this thesis, recording and editing more than 100 sentences was not feasible.

Thus, we decided to integrate a TTS system into our \texttt{varys-audio} module. This system must provide high-quality synthesis to minimise speech recognition mistakes on the smart speaker, and be able to generate speech without any noticeable performance overhead. The latter is typically measured by the real-time factor (rtf), with \(\text{rtf } > 1\) if the system takes longer to synthesise text than to speak that text and \(\text{rtf } \leq 1\) otherwise. A good real-time factor is especially important if our system is to support conversations in the future; the voice assistant only listens for a certain amount of time before cancelling an interaction.

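
For clarity, the real-time factor can be computed as follows; this is a minimal illustrative helper, not part of \texttt{varys-audio}, and the numbers in the example are hypothetical.

\begin{verbatim}
use std::time::Duration;

// Illustrative helper for the real-time factor (rtf) as used above: the time a
// TTS system needs to synthesise an utterance divided by the duration of the
// produced audio. rtf <= 1 means synthesis is at least as fast as playback.
fn real_time_factor(synthesis_time: Duration, samples: usize, sample_rate: u32) -> f64 {
    let audio_seconds = samples as f64 / sample_rate as f64;
    synthesis_time.as_secs_f64() / audio_seconds
}

fn main() {
    // Hypothetical numbers: 1.2 s to synthesise 3 s of audio at 16 kHz.
    let rtf = real_time_factor(Duration::from_millis(1200), 48_000, 16_000);
    println!("rtf = {rtf:.2}"); // 0.40, i.e. faster than real time
}
\end{verbatim}
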
We explored three different options with \(\text{rtf } \leq 1\) on the machine we tested on:

\begin{itemize}
\item
  espeak\footnote{\url{https://espeak.sourceforge.net}}
\item
  Larynx\footnote{\url{https://github.com/rhasspy/larynx}}
\item
  \texttt{tts-rs} using the \texttt{AVFoundation} backend\footnote{\url{https://github.com/ndarilek/tts-rs} -- the \texttt{AVFoundation} backend is only supported on macOS.}
\end{itemize}

We primarily based our choice on how well Amazon Alexa\footnote{We picked Alexa (over Siri) for these tests since its speech recognition performed worse in virtually every case.} would be able to understand different sentences. A test was performed manually by asking each of three questions 20 times with every TTS system. As a ground truth, the same was repeated verbally. The results show that \texttt{tts-rs} with the \texttt{AVFoundation} backend performed approximately \(20\%\) better on average than both Larynx and espeak and came within \(15\%\) of the ground truth.

While performing this test, we could easily hear the difference in quality, as only the \texttt{AVFoundation} voices come close to sounding like a real voice. These were, apart from the ground truth, also the only voices that could be differentiated from each other by the voice assistants. The recognition of user voices worked well, while the guest voice was wrongly recognised as one of the others \(65\%\) of the time.

With these results, we decided to use \texttt{tts-rs} on a macOS machine to get access to the \texttt{AVFoundation} backend.

Since we ran this test, Larynx has been succeeded by Piper\footnote{\url{https://github.com/rhasspy/piper}} during the development of our system. The new system promises good results and may be an option for running varys on non-macOS systems in the future.

\subsubsection{Speech-to-Text}\label{speech-to-text}

Not all voice assistants provide a history of interactions, which is why we use an STT system to transcribe the recorded answers.

In our module \texttt{varys-audio}, we utilise an ML-based speech recognition system called whisper. It provides transcription error rates comparable to those of human transcription {[}Radford\_2022aa{]} and outperforms other local speech recognition models {[}Seagraves\_2022{]} and commercial services {[}Radford\_2022aa{]}. In \hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}, we explore how well whisper worked for our use case.

\subsection{Data Storage}

The module \texttt{varys-database} acts as an abstraction of our database, where all metadata about interactions is stored. Audio files and pcap traces are saved to disk by \texttt{varys-audio} and \texttt{varys-network} respectively and their paths are referenced in the database.

Since some changes were made to the system while it was running, we store the version of varys a session was run with. This way, if there are any uncertainties about the collected data, the relevant code history and database migrations can be inspected.

To prevent data corruption if the project is moved, all paths are relative to a main \texttt{data} directory.

\subsection{Interaction Process}

The main executable of varys combines all modules introduced in this section to run a series of interactions. When the system is run, it loads the queries from a TOML file. Until it is stopped, the system then runs interaction sessions, each with a shuffled version of the original query list.

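
Loading and shuffling the query list is straightforward; the sketch below illustrates the idea with the \texttt{toml}, \texttt{serde} and \texttt{rand} crates, where the field name \texttt{queries} and the file layout are assumptions rather than the actual varys configuration format.

\begin{verbatim}
// Illustrative sketch (assumed crates: toml, serde with the derive feature,
// rand 0.8). The `queries` field and file layout are assumptions, not the
// actual varys configuration format.
use rand::seq::SliceRandom;
use serde::Deserialize;

#[derive(Deserialize)]
struct QueryFile {
    queries: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("queries.toml")?;
    let config: QueryFile = toml::from_str(&raw)?;

    // One session: a freshly shuffled copy of the original query list.
    let mut session = config.queries.clone();
    session.shuffle(&mut rand::thread_rng());
    for query in &session {
        println!("asking: {query}");
        // run_interaction(query)?;  // speak, record, capture traffic, ...
    }
    Ok(())
}
\end{verbatim}
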
Resetting the voice assistant is implemented differently for each supported assistant.

\subsection{Performance Considerations}\label{performance}

Our system was initially designed to run on a Raspberry Pi, a small-form-factor computer with limited processing and storage capabilities. Mostly because we did not find a TTS system that ran with satisfactory performance on the Raspberry Pi, we now use a laptop computer.

\subsubsection{Data Compression}

Because of the initial plan of using a Raspberry Pi, we compress the recorded audio with the OPUS {[}rfc6716{]} codec and the captured network traffic with DEFLATE {[}rfc1951{]}. However, we are now running the experiment on a laptop, where space is no longer a concern. The audio codec does not add a noticeable processing overhead, but the pcap files are now stored uncompressed. Moreover, storing the network traces uncompressed simplifies our data analysis since they can be used without any preprocessing.

\subsubsection{Audio Transcription}

Apart from the actual time speaking (as mentioned in \hyperref[overview]{{[}overview{]}}), the transcription of audio is the main bottleneck for how frequently our system can run interactions. The following optimisations allowed us to reduce the average time per interaction from \(\sim 120\)s to \(\sim 30\)s:

\paragraph{Model Choice}

There are five different sizes available for the whisper model. The step from the \texttt{medium} to the \texttt{large} and \texttt{large-v2} models does not significantly decrease the word error rate but more than doubles the number of parameters. Since, in our tests, this increase in network size far more than doubled processing time, we opted for the \texttt{medium} model.

\paragraph{Transcription in Parallel}

The bottleneck of waiting for whisper to transcribe the response could be completely eliminated with a transcription queue that is processed in parallel to the main system. However, if the system is run for weeks at a time, this can lead to arbitrarily long post-processing times if the transcription takes longer on average than an interaction.

Our solution is to process the previous response while the next interaction is already running, but to delay further interactions until the transcription queue (of length 1) is empty again. This approach allows us to reduce the time per interaction to \(\max(T_{t},T_{i})\), rather than \(T_{t} + T_{i}\), where \(T_{t}\) represents the time taken for transcription and \(T_{i}\) the time taken for the rest of the interaction.

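
This scheme maps naturally onto a rendezvous channel between the interaction loop and a transcription worker. The sketch below illustrates the idea with placeholder delays; it is not the varys implementation.

\begin{verbatim}
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Illustrative sketch of the approach described above (not the varys code):
// a rendezvous channel hands each recorded response to a transcription
// worker. Sending blocks until the worker is ready again, so the time per
// interaction becomes max(T_t, T_i) instead of T_t + T_i.
fn main() {
    let (sender, receiver) = mpsc::sync_channel::<String>(0);

    let transcriber = thread::spawn(move || {
        for recording in receiver {
            // Stand-in for running whisper on the recorded audio.
            thread::sleep(Duration::from_millis(300));
            println!("transcribed {recording}");
        }
    });

    for i in 0..3 {
        // Stand-in for speaking the query and recording the response.
        thread::sleep(Duration::from_millis(200));
        let recording = format!("interaction {i}");
        // Blocks here only if the previous transcription is still running.
        sender.send(recording).unwrap();
    }

    drop(sender); // Close the channel so the worker loop ends.
    transcriber.join().unwrap();
}
\end{verbatim}
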
\paragraph{Audio Preprocessing}

Another step we took to reduce the transcription time is to reduce the audio sample rate to \(16\) kHz, average both channels into one and trim the silence from the start and the end. This preprocessing does not noticeably reduce the quality of the transcription. Moreover, cutting off silent audio eliminates virtually all whisper hallucinations (see \hyperref[transcription-accuracy]{{[}transcription-accuracy{]}}).

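
A simplified sketch of the mono mixdown and silence trimming is shown below; it is illustrative only (resampling to \(16\) kHz is omitted) and does not reflect the actual \texttt{varys-audio} code.

\begin{verbatim}
// Sketch of the preprocessing described above (simplified, not the
// varys-audio implementation): average stereo channels into one and trim
// leading and trailing samples whose amplitude stays below a threshold.
fn to_mono(interleaved_stereo: &[f32]) -> Vec<f32> {
    interleaved_stereo
        .chunks_exact(2)
        .map(|frame| (frame[0] + frame[1]) / 2.0)
        .collect()
}

fn trim_silence(samples: &[f32], threshold: f32) -> &[f32] {
    let start = samples.iter().position(|s| s.abs() > threshold).unwrap_or(0);
    let end = samples.iter().rposition(|s| s.abs() > threshold).map_or(start, |i| i + 1);
    &samples[start..end]
}

fn main() {
    let stereo = [0.0, 0.0, 0.2, 0.4, -0.5, -0.3, 0.0, 0.0];
    let mono = to_mono(&stereo);             // roughly [0.0, 0.3, -0.4, 0.0]
    let trimmed = trim_silence(&mono, 0.05); // roughly [0.3, -0.4]
    println!("{trimmed:?}");
}
\end{verbatim}
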
\section{Experiment}\label{experiment}

To test our data collection system, we conducted several experiments where we let varys interact with an Apple HomePod mini.

Three different lists of queries were tested:

\begin{itemize}
\item
  A large dataset with \(227\) queries, split into \(24\) categories
\item
  A small dataset with \(13\) queries, each from a different category
\item
  A binary dataset with the two queries ``Call John Doe'' and ``Call Mary Poppins''
\end{itemize}

Apple does not release an official list of Siri's capabilities. A dataset of \(\sim 800\) Siri commands was originally compiled from a user-collected list {[}Hey\_Siri\_Commands{]} and extended with queries from our own exploration. The HomePod does not support all Siri queries\footnote{Any queries that require Siri to show something to the user will be answered with something along the lines of \emph{``I can show you the results on your iPhone.''}}. Therefore, during an initial test period, we manually removed commands from the list that did not result in a useful response. We further reduced the number of redundant commands to increase the number of samples that can be collected per query. Having fewer than 256 labels has the additional advantage that each query label fits in one byte.

Further analysis of our results can be found in \hyperref[results]{{[}results{]}}. In this section we detail the experiment setup and our traffic fingerprinting system.

\subsection{Setup}

The experiment setup consists of a laptop running macOS\footnote{The choice of using macOS voices is explained in \hyperref[text-to-speech]{{[}text-to-speech{]}}.}, connected to a pair of speakers and a microphone. The latter is placed, together with the HomePod, inside a sound-isolated box. Additionally, the user's iPhone is required to be on the \emph{internal network} (see \hyperref[network]{{[}network{]}}) for Siri to be able to respond to personal requests.

The iPhone is set up with a dummy user account for \emph{Peter Pan} and two contacts, \emph{John Doe} and \emph{Mary Poppins}. Otherwise, the phone is kept at factory settings and the only changes come from interactions with the HomePod.

Finally, to minimise system downtime, an external server is used for monitoring as explained in \hyperref[monitoring]{{[}monitoring{]}}.

The experiment must be set up in a location with the following characteristics:

\begin{itemize}
\item
  Wired Internet connection\footnote{A wireless connection is not possible, since macOS does not support connecting to WiFi and creating a wireless network at the same time. If no wired connection is available, a network bridge (e.g. a Raspberry Pi bridging WiFi to its Ethernet port) can be used to connect the Mac to the internet.}
\item
  Power outlet available
\item
  No people speaking nearby (the quieter the environment, the higher the voice recognition accuracy)
\item
  Noise from the experiment does not disturb anyone
\item
  Access to the room during working hours (or, if possible, at any time)
\end{itemize}

The sound isolation of the box can reduce ambient noise in the range of \(50 - 60\) dB to a significantly lower \(35 - 45\) dB, which increases the voice recognition and transcription accuracy. This also makes it easier to find an experiment location, since the noise disturbance is less significant.

When the hardware is set up, the command \texttt{varys\ listen\ -\/-calibrate} is used to find the ambient noise volume.

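
As an illustration of what such a calibration step might compute (the actual behaviour of \texttt{varys\ listen\ -\/-calibrate} is not detailed here), the RMS level of a short ambient recording can be expressed in dBFS and used as a silence threshold:

\begin{verbatim}
// Illustrative sketch only; the actual `varys listen --calibrate` behaviour
// is an assumption here. Computes the RMS level of recorded ambient audio in
// dBFS, which can then serve as a threshold for detecting silence.
fn rms_dbfs(samples: &[f32]) -> f32 {
    let mean_square: f32 =
        samples.iter().map(|s| s * s).sum::<f32>() / samples.len().max(1) as f32;
    10.0 * mean_square.max(f32::MIN_POSITIVE).log10()
}

fn main() {
    // Hypothetical ambient recording with a low, constant amplitude.
    let ambient = vec![0.01f32; 16_000];
    println!("ambient level: {:.1} dBFS", rms_dbfs(&ambient)); // -40.0 dBFS
}
\end{verbatim}
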
\subsection{Data Analysis}

We began data analysis by building a simple tool that takes a number of traffic traces and visualises them. Red and blue represent incoming and outgoing packets respectively and the strength of the colour visualises the packet size. The \(x\)-axis does not necessarily correspond to time, but to the number of incoming and outgoing packets since the beginning of the trace. This method of analysing traffic traces aligns with previous work on traffic fingerprinting {[}Wang\_2020aa{]}.

\subsubsection{Traffic Preprocessing}

We write a network traffic trace as \(\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right)\), where \(n\) is the number of packets in that trace, \(s_{i}\) is the size of the \(i\)-th packet in bytes and \(d_{i} \in \left\{ 0,1 \right\}\) is its direction.

To be used as input in our ML network, the traffic traces were preprocessed into normalised tensors of length \(475\). Our collected network traffic is converted as

\[\left( \left( s_{1},d_{1} \right),\left( s_{2},d_{2} \right),\ldots,\left( s_{n},d_{n} \right) \right) \rightarrow \left( p_{1},p_{2},\ldots,p_{475} \right),\]

where for all \(i \in \lbrack 1,475\rbrack\) the components are calculated as

\[p_{i} = \begin{cases}
( - 1)^{d_{i}} \cdot \frac{s_{i}}{1514} & \text{if } i \leq n, \\
0 & \text{else}.
\end{cases}\]

This pads traces shorter than \(475\) packets with zeroes, truncates longer ones and normalises all traces into the range \(\lbrack - 1,1\rbrack\)\footnote{The maximum size of the packets is \(1514\) bytes, which is why we can simply divide by this to normalise our data.}.

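
The following Rust sketch implements exactly this conversion for a trace of \((s_{i},d_{i})\) pairs; it is illustrative and independent of our actual preprocessing code.

\begin{verbatim}
// Sketch of the preprocessing above: a trace of (size, direction) pairs is
// turned into a fixed-length vector of 475 values in [-1, 1]. 1514 is the
// maximum packet size; the sign follows (-1)^d, i.e. direction 0 maps to
// positive and direction 1 to negative values.
const TRACE_LEN: usize = 475;
const MAX_PACKET_SIZE: f32 = 1514.0;

fn to_input(trace: &[(u16, u8)]) -> [f32; TRACE_LEN] {
    let mut input = [0.0; TRACE_LEN];
    for (i, &(size, direction)) in trace.iter().take(TRACE_LEN).enumerate() {
        let sign = if direction == 0 { 1.0 } else { -1.0 };
        input[i] = sign * size as f32 / MAX_PACKET_SIZE;
    }
    input
}

fn main() {
    let trace = [(1514u16, 0u8), (66, 1), (1200, 0)];
    let input = to_input(&trace);
    println!("{:?}", &input[..4]); // [1.0, -0.0436..., 0.7926..., 0.0]
}
\end{verbatim}
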
\subsubsection{Adapting the Network from Wang et al.}\label{adapting-network}

Since there is similar previous work on traffic fingerprinting Amazon Alexa and Google Assistant interactions by {[}Wang\_2020aa{]}, we based our network on their CNN.

In addition to a CNN, they also tested an LSTM, an SAE and a combination of the networks in the form of ensemble learning. However, their best results for a single network came from their CNN, so we decided to use that.

Owing to limited computing resources, we reduced the size of our network. The network of {[}Wang\_2020aa{]} used a pool size of \(1\), which is equivalent to a \emph{no-op}. Thus, we removed all \emph{Max Pooling} layers. Additionally, we reduced the four groups of convolutional layers to one.

Since the hyperparameter search performed by {[}Wang\_2020aa{]} found \(180\) to be the optimal dense network size when that was the maximum in their search space \(\left\{ 100,110,\ldots,170,180 \right\}\), we can assume that the true optimal size is higher than \(180\). In our case we used \(475\) -- the size of our input.

To split our dataset into separate training, validation and testing parts, we used the same proportions as {[}Wang\_2020aa{]}: of the full datasets, \(64\%\) were used for training, \(16\%\) for validation, and \(20\%\) for testing.

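
As an illustration, such a split can be produced by shuffling the sample indices once and partitioning them; the sketch below (using the \texttt{rand} crate) is not our training code.

\begin{verbatim}
// Illustrative 64/16/20 split (assumes the rand 0.8 crate), not the actual
// training pipeline. The indices are shuffled once and then partitioned.
use rand::seq::SliceRandom;

fn split(n: usize) -> (Vec<usize>, Vec<usize>, Vec<usize>) {
    let mut indices: Vec<usize> = (0..n).collect();
    indices.shuffle(&mut rand::thread_rng());
    let train_end = n * 64 / 100;
    let val_end = train_end + n * 16 / 100;
    let test = indices.split_off(val_end);
    let validation = indices.split_off(train_end);
    (indices, validation, test)
}

fn main() {
    let (train, validation, test) = split(1000);
    println!("{} / {} / {}", train.len(), validation.len(), test.len()); // 640 / 160 / 200
}
\end{verbatim}
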
\section{Results}\label{results}

Our system ran for approximately \(800\) hours and collected data on a total of \(73\prime 605\) interactions. The collected data is available at \url{https://gitlab.com/m-vz/varys-data}.

In this section we will go over our results, beginning with the statistics of our collected dataset.

\subsection{Statistics}\label{statistics}

Of the \(73\prime 605\) queries collected, \(71\prime 915\) were successfully completed (meaning there was no timeout and the system was not stopped during the interaction). Our system varys is therefore able to interact with Siri with a \(97.70\%\) success rate.

As mentioned in \hyperref[overview]{{[}overview{]}}, there are approximately \(4\) to \(8\) seconds of speaking during most interactions. The remaining duration (measured as the total duration minus the speaking duration) lies mostly between \(25\) and \(30\) seconds. This includes the time spent waiting for the voice assistant to answer and transcribing the response. It did not increase even for outliers with a significantly longer speaking duration.

On average, a query took \(2.47\) seconds (\(\pm 0.64\)) to say, which is expected since all query sentences are of similar length. The responses lasted on average \(4.40\) seconds with a standard deviation of \(5.63\)s.

During the first week, there were two bugs that resulted in several downtimes (the longest of which lasted \(\sim 42\)h because we could not get physical access to the system during the weekend). After that, the system ran flawlessly except during one network outage at the university and during maintenance.

\subsection{Transcription Accuracy}\label{transcription-accuracy}

Compared to the results by {[}Radford\_2022aa{]} and {[}Seagraves\_2022{]}, our own results show an even higher transcription accuracy apart from errors in URLs, certain names and slight formatting differences. This is likely due to the audio coming from a synthetic voice in a controlled environment.

For the query \emph{``How far away is Boston?''}, the transcription was missing \emph{``as the crow flies''} from the correct response \emph{``{[}\ldots{]} it's about 5,944 kilometers as the crow flies.''} about \(36\%\) of the time. Together with the correctly transcribed cases, this makes up \(98.51\%\) of this query's data. In the response to \emph{``What is the factorial of 6?''}, Siri provided the source \href{https://solumaths.com}{solumaths.com}, which whisper was unable to reproduce\footnote{In 1823 cases the name \emph{solumaths} was recognised as either \emph{solomus}, \emph{solomoths}, \emph{solomons}, \emph{solomuths}, \emph{soloomiths}, \emph{salumith} or \emph{solemnus}, but never correctly.}. Since mistakes in URLs are expected, we marked those transcriptions as correct.

There are some rare cases where the voice assistant responds with a single word. Since whisper does not support recognition of audio shorter than one second, these queries have an exceedingly high failure rate. One example is the query \emph{``Flip a coin''}, which was only recognised correctly \(\sim 2\%\) of the time. Out of the \(140\) incorrectly transcribed responses, \(134\) were cancelled due to the recorded audio being too short.

Another problem we identified with whisper is its tendency to hallucinate text when fed silent audio (one or more seconds of \texttt{0}-samples). This can be prevented by simply trimming all silent audio in the preprocessing step before transcription.

\subsection{Traffic Fingerprinting}

With our implementation of a fingerprinting CNN, we obtained promising results. Our network was trained on the binary, small and large datasets for \(1000\) epochs each. The accuracy axis of the training progress is bounded by the interval \(\left\lbrack \frac{100}{n},100 \right\rbrack\), where \(n\) is the number of labels. The bottom of the accuracy axis is therefore equivalent to random choice. (As an example, the accuracy of the binary classifier starts at 50\%, which is choosing randomly between two options.) The loss axis is bounded to \(\lbrack 0,3\rbrack\) to fit our data. The loss for training our large dataset started outside that range at \(5.05\).

Most of the trends in these graphs show that with more training, the models could likely be improved. However, since this serves as a proof of concept and due to limited time and processing capacity, we did not train each network for more than \(1000\) epochs.

For binary classification between the queries ``Call John Doe'' and ``Call Mary Poppins'', our final accuracy on the test set was \(\sim 71.19\%\). This result is especially concerning since it suggests this method can differentiate between very similar queries. An attacker could train a network on names of people they want to monitor and tell from the encrypted network traffic whether the voice assistant was likely used to call that person.

The second classification we attempted was between the \(13\) different queries from the small dataset, each from a different category. The result for this dataset is an \(86.19\%\) accuracy on the test set. Compared to less than \(8\%\) for a random guess, this confirms our concerns about the privacy and security of the HomePod.

The large dataset with \(227\) different queries showed an accuracy of \(40.40\%\). This result is likely limited by the size of our network and training capabilities. When compared to randomly guessing at \(0.44\%\) though, the loss of privacy becomes apparent.

With adjustments to our CNN, more training data and hardware resources, we estimate that highly accurate classification can be achieved even between hundreds of different queries.

\subsection{On-Device Recognition}\label{on-device-recognition}

According to HomePod marketing,

\begin{quote}
HomePod mini works with your iPhone for requests like hearing your messages or notes, so they are completed on device without revealing that information to Apple.

--~\url{https://apple.com/homepod-mini}
\end{quote}

\section{Conclusion}

For this thesis, we built a highly reliable system that can autonomously collect data on interactions with smart speakers for hundreds of hours. Using our system, we ran an experiment to collect data on over \(70\prime 000\) interactions with Siri on a HomePod.

The data we collected enabled us to train two ML models to distinguish encrypted smart speaker traffic of \(13\) or \(227\) different queries and a binary classifier to recognise which of two contacts a user is calling on their voice assistant. With final test accuracies of \(\sim 86\%\) for the small dataset, \(\sim 40\%\) for the large dataset and \(\sim 71\%\) for the binary classifier, our system serves as a proof of concept that even though network traffic on a smart speaker is encrypted, patterns in the traffic can be used to accurately predict user behaviour. While this has already been shown for Amazon Alexa {[}Wang\_2020aa{]}{[}8802686{]} and Google Assistant {[}Wang\_2020aa{]}, we have shown the viability of the same attack on the HomePod despite Apple's claims about the privacy and security of their smart speakers.

\subsection{Future Work}

While working on this thesis, several ideas for improvement, different types of data collection and approaches to the analysis of our data came up. Due to time constraints, we could not pursue all of these points. In this section we will briefly touch on some open questions or directions in which work could continue.

\subsubsection{Recognising Different Speakers}

All popular voice assistants support differentiating between users speaking to them. They give personalised responses to queries such as \emph{``Read my calendar''} or \emph{``Any new messages?''} by storing a voice print of the user. Some smart speakers like the HomePod go a step further and only respond to requests given by a registered user {[}Apple\_2023{]}.

Similarly to inferring the query from a traffic fingerprint, it may be possible to differentiate between different users using the smart speaker. Our system supports different voices that can be used to register different users on the voice assistants. To keep the scope of our proof-of-concept experiment manageable, our traffic dataset was collected with the same voice. By recording more data on the same queries with different voices, we could explore the possibility of a classifier inferring the speaker of a query from the traffic trace.

\subsubsection{Inferring the Bounds of a Traffic Trace}\label{traffic-trace-bounds}

Currently, we assume an attacker knows when the user begins an interaction and when it ends. Using a sliding window technique similar to the one introduced by {[}Zhang\_2018{]}, a system could continuously monitor network traffic and automatically infer when an interaction has taken place.

\subsubsection{Classifier for Personal Requests}

As mentioned in \hyperref[on-device-recognition]{{[}on-device-recognition{]}}, the traffic traces for personal requests show a visible group of outgoing traffic that should be straightforward to recognise automatically. A binary classifier that distinguishes the traffic of personal requests from that of normal queries on the HomePod should therefore be simple to build.

Since this feature appears consistently on all personal request traces we looked at, in addition to recognising traffic for queries it was trained on, this model could feasibly also classify queries it has never seen.

\subsubsection{Voice Assistants Based on LLMs}\label{vas-based-on-llms}

With the release of Gemini by Google {[}Google\_LLM\_2024{]} and Amazon announcing generative AI for their Alexa voice assistant {[}Amazon\_LLM\_2023{]}, the next few years will quite certainly show a shift from traditional voice assistants to assistants based on Large Language Models (LLMs). Existing concerns about the lack of user privacy and security when using LLMs {[}Gupta\_2023{]}, combined with the frequency at which voice assistants process PII, warrant further research into voice assistants based on generative AI.

\subsubsection{Tuning the Model Hyperparameters}

Due to time constraints, we are using hyperparameters similar to the ones found by {[}Wang\_2020aa{]}, adapted to fit our network by hand. Our model performance could likely be improved by running our own hyperparameter search.

\subsubsection{Non-Uniform Sampling of Interactions According to Real World Data}

Currently, interactions with the voice assistants are sampled uniformly from a fixed dataset. According to a user survey, there are large differences in how often certain types of skills are used {[}NPR\_2022{]}. For example, asking for a summary of the news is likely done much less frequently than playing music.

An experiment could be run where the system asks questions with a likelihood corresponding to how often they occur in the real world. This might improve the performance of the recognition system trained on such a dataset.

\subsubsection{Performance Improvements}

The main bottleneck currently preventing us from interacting without pausing is our STT system. We estimate a lower bound of approximately 10 seconds per interaction. This entails the speaking duration plus the time waiting for a server response. With a more performant implementation of whisper, a superior transcription model or better hardware to run the model on, we could thus collect up to six interactions per minute.

\subsubsection{\texorpdfstring{Other Uses for \texttt{varys}}{Other Uses for varys}}

During data analysis, we discovered certain \emph{easter eggs} in the answers that Siri gives. As an example, it responds to the query \emph{``Flip a coin''} with \emph{``It\ldots{} whoops! It fell in a crack.''} about \(1\%\) of the time -- a response that forces the user to ask again should they actually need an answer. Since this behaviour does not have any direct security or privacy implications, we did not pursue it further. Having said this, varys does provide all the tools required to analyse these rare responses from a UX research point of view.

\end{document}