github/coqui-ai: mitigating tts misuse discussion


originally posted here:

https://github.com/coqui-ai/TTS/discussions/1036


I’ve spent most of my PhD figuring out ways to attack Mozilla DeepSpeech, but my PhD supervisor and I have spent some time discussing this topic for generative image models, so I’m gonna chip in here…

I’ll heavily caveat:

  • I’m not up to speed on TTS in the slightest (abusing properties of CTC has basically been most of my PhD)
  • detecting / preventing deepfakes / generative model misuse is an active research problem (and will be for a long time to come)

Quoting a couple of earlier posts in this thread:

  “I think that any technical solutions will be easily worked around, or someone will just reproduce the code; at some point it will be as trivial as running a jupyter notebook. The cat is out of the bag, so to speak. What is needed are societal and legal approaches.”

  “Like [above post] I don’t think there’s a technological fix here, and I agree that we need societal and legal measures.”

I disagree that this is solely a legal / societal issue.

Unfortunately this is the cat-and-mouse game of the adaptive security cycle (see Biggio and Roli). Someone designs mitigations, someone breaks them, someone fixes them, someone breaks them… repeat ad infinitum.

No system is ever going to be 100% secure — or 100% unable to generate malicious data in the generative model case.

In security we aim to make feasible attacks as hard as possible, rather than aiming for the completely impossible. So the job here is to make the “as trivial as running a jupyter notebook” case as unlikely as possible.

The real trick is to make a small change that has a big impact, and I think coqui could help in that regard by initially focussing on the tts-server application… which would be my first port of call for doing nefarious things with TTS.

Script kiddies vs. developers vs. ML developers

A breakdown of potential adversaries is always helpful when discussing things like this.

  • Script Kiddies: These folks don’t have much know-how and are looking for the “run a jupyter notebook” approach.
  • Developer: Knows how to clone the repo and modify source code, but doesn’t have the resources to alter or retrain the model.
  • ML Developer: Has the know-how and resources to modify the model and completely retrain it from scratch.

Script kiddies

Mitigations being enabled by default within tts-server would mean script kiddies are no longer given the option to do nefarious things (those covered by the mitigations, at least), as they’re no longer a pip install TTS && tts-server ... away from doing bad things.

As a simple example: want to turn off the mitigations? Then tts-server adds an audible watermark to the output to try and stop you doing nefarious things with it. Want to remove the watermark? Turn the mitigations back on.

Script Kiddie mitigations are a pretty good starter for 10. A positive side effect is that anyone who deploys an unmitigated instance of tts-server in the wild (or an application derived from the coqui code) won’t be able to let any random user on the web generate speech without an audible watermark, and mitigated instances would (ideally) block the generation of malicious voice data requested by users.

Developer

Developer-level adversaries are harder to mitigate against, but the tts-server modifications would at least require them to clone and modify the source code. The number of people who get past this barrier scales with how many are willing to actually dig in and work out how to modify the backend code for tts-server and/or TTS. Chances are, some of this set will be able to set up an unmitigated tts-server in the wild by disabling the audible watermarking example above in the source code.

Depending on how far you want to go with this, you could start to mitigate against some of the more determined attackers. Data- and/or model-level mitigations could make things prohibitive, in terms of cost and expertise, for the more determined Developer-level adversaries.

These developer-level adversaries can be dealt with later on (if deemed worth the effort).

ML Developer

Not much you can do here, as the project is open source, except make their life very difficult (but then you’ll be making everyone’s life difficult in the process).


An existing product: Descript’s Overdub

Descript have a couple of interesting points on their Ethics page regarding their voice cloning product Overdub (previously lyrebird.ai):

Content Authenticity Initiative

I’ll mostly leave this for future reading, but the CAI essentially aims to provide attribution verification, which is similar to the watermarking / NFT / digital signature threads above but also addresses the “Certificate Authority” problem.

Unfortunately the CAI only seems to support image files for attribution verification so far. But it seems like at least one step in the right direction, with Descript being a member (or it could just be Descript signing up for the kudos, who knows).

Registering content with the CAI can be done anonymously, and it might be possible to bake registration into the tts-server application by default, with an option to disable content registration (I have no idea how their API actually works, but the website says content can be submitted anonymously).

If it’s turned on by default, most generated speech created with tts-server would be registered, though I appreciate there will probably be concern regarding “opt out” defaults like this in FOSS.

Using the adversaries list above, Script Kiddies would be mitigated here (assuming they know nothing about command line arguments) but most Developers (+all ML Developers) would be able to disable it.

Verbal Consent Verification

Note: this assumes that the YourTTS + Speaking in Tongues demo is slated to end up as part of the tts-server application, and I’m not 100% sure it is (I just pip installed it and it wasn’t the first thing on my screen when following the README instructions).

Without signing up for an Overdub account to check how verbal consent verification works, it reads like they first pass the training audio to an STT model to verify that a specific transcription exists in the recording. If the transcription doesn’t include the required phrase(s), then a TTS voice is not generated (the audio is likely a recording of someone else).

This would be somewhat effective at mitigating against some simple replay attack examples which aim to make:

  1. David Bowie say things about the recent transition of his estate to Warner Chappell (David Bowie is deceased)
  2. some celebrity say kinky things about another celebrity
  3. my boss’ bank think he’s telling them to send $5000 to my bank account

For coqui specifically, it could be possible to implement a consent verification scheme as part of the tts-server application, where the user must say 5 randomly generated keywords at some point during their recorded audio. coqui already has the STT models to perform this verification. It would probably require something like downloading the STT models and changing the server API for tts-server.

This could be expanded on by randomly rotating when the user must speak the keywords, only providing a prompt on the tts-server front end with an N-second countdown timer. Or, even more simply, require the user to record themselves speaking a completely randomly generated transcript (with an equal mix of random words and complete sentences), along the lines of the sketch below.

Using the adversaries list above, Script Kiddies would be mitigated here, but some Developers (+all ML Developers) would probably work out a way to disable it within tts-server.