MarySpeech

The MarySpeech Service can be used to generate speech from text using the MaryTTS project.

1. General

It's different from most other speech services (AcapellaSpeech & GoogleSpeech) as it is OPEN SOURCE (Yay!) and doesn't even require an internet connection (Yay! x2) as everything is generated locally and offline.

The downside is that you have to pre-download the rather big voice-file(s) you want to use.

Also the quality might not be as great as that of its competitors (e.g. Acapella & Google), but it's free and nobody can simply "shut it down".

Another cool thing about MarySpeech is that you can apply parameterized voice effects to your voice.


#file : service/MarySpeech.py
#start Service
mouth = Runtime.createAndStart("MarySpeech", "MarySpeech")

#speak!
mouth.speakBlocking("Hello world")
mouth.speakBlocking("I speak English. More voices are available, but they need to be installed")
mouth.speakBlocking("Echo echo echo")
mouth.speakBlocking("What should I use")
mouth.speakBlocking("Happy birthday Kyle")

#install a voice:
#an overview of all official voices is available @ http://myrobotlab.org/service/MarySpeech
#mouth.installComponentsAcceptLicense(voicename)
#e.g.
#mouth.installComponentsAcceptLicense("bits1")

#switch voice:
#mouth.setVoice(voicename)
#mouth.setVoice("bits1")

#add voice effects:
#more effects and information @ http://myrobotlab.org/service/MarySpeech
mouth.setAudioEffects("FIRFilter+Robot(amount=50)")

2. Voices

MarySpeech has many different voices you can select and use; the table below lists them. There may be more voices (and languages) available, but these are the voices officially registered in MaryTTS.

All voices are available through an additional download. You can install a voice by simply calling:

maryspeech.installComponentsAcceptLicense(voicename)

and to select a different voice:

maryspeech.setVoice(voicename) or maryspeech.setLanguage(language)

NOTE: You have to install a voice before you are able to select it!

NOTE: You need to restart MRL after installing a voice!

NOTE: By installing a voice you accept its license!

DATE: 05.05.2017

NOTE: I don't feel responsible for errors in this matrix.

Language | Voicename | Gender | Type | Version | Description | License | Size (bytes) | Dependencies
DE | bits1 | female | unit selection | 5.2 | A female German unit selection voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 262691025 | DE, 5.2
DE | bits1-hsmm | female | hsmm | 5.2 | A female German hidden semi-Markov model voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 1359993 | DE, 5.2
DE | bits4 | female | unit selection | 5.2 | A female German unit selection voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 274825221 | DE, 5.2
DE | bits2 | male | unit selection | 5.2 | A male German unit selection voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 266072011 | DE, 5.2
DE | bits3 | male | unit selection | 5.2 | A male German unit selection voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 269955538 | DE, 5.2
DE | bits3-hsmm | male | hsmm | 5.2 | A male German hidden semi-Markov model voice, built from voice recordings provided by the BITS project at the Bavarian Archive of Speech Signals | BY-ND-3.0 | 1556358 | DE, 5.2
DE | dfki-pavoque-neutral | male | unit selection | 5.2 | A male German unit selection voice | BY-ND-3.0 | 448866455 | DE, 5.2
DE | dfki-pavoque-neutral-hsmm | male | hsmm | 5.2 | A male German hidden semi-Markov model voice | BY-ND-3.0 | 2834245 | DE, 5.2
DE | dfki-pavoque-styles | male | unit selection | 5.2 | A male German unit selection voice with expressive styles "happy", "sad", "angry", and "poker" | BY-ND-3.0 | 700875468 | DE, 5.2
EN_GB | dfki-poppy | female | unit selection | 5.2 | A female British English expressive unit selection voice: Cheerful Poppy | BY-ND-3.0 | 111958955 | EN-GB, 5.2
EN_GB | dfki-poppy-hsmm | female | hsmm | 5.2 | A female British English hidden semi-Markov model voice | BY-ND-3.0 | 1015143 | EN-GB, 5.2
EN_GB | dfki-prudence | female | unit selection | 5.2 | A female British English expressive unit selection voice: Pragmatic Prudence | BY-ND-3.0 | 293735841 | EN-GB, 5.2
EN_GB | dfki-prudence-hsmm | female | hsmm | 5.2 | A female British English hidden semi-Markov model voice | BY-ND-3.0 | 1559757 | EN-GB, 5.2
EN_GB | dfki-obadiah | male | unit selection | 5.2 | A male British English expressive unit selection voice: Gloomy Obadiah | BY-ND-3.0 | 165140911 | EN-GB, 5.2
EN_GB | dfki-obadiah-hsmm | male | hsmm | 5.2 | A male British English hidden semi-Markov model voice | BY-ND-3.0 | 1215660 | EN-GB, 5.2
EN_GB | dfki-spike | male | unit selection | 5.2 | A male British English expressive unit selection voice: Aggressive Spike | BY-ND-3.0 | 163980552 | EN-GB, 5.2
EN_GB | dfki-spike-hsmm | male | hsmm | 5.2 | A male British English hidden semi-Markov model voice | BY-ND-3.0 | 1082784 | EN-GB, 5.2
EN_US | cmu-slt | female | unit selection | 5.2 | A female English unit selection voice | CMU-ARCTIC | 104627156 | EN-US, 5.2
EN_US | cmu-bdl | male | unit selection | 5.2 | A male US English unit selection voice, built from recordings provided by Carnegie Mellon University | ARCTIC-LICENSE | 95244351 | EN-US, 5.2
EN_US | cmu-bdl-hsmm | male | hsmm | 5.2 | A male US English hidden semi-Markov model voice, built from recordings provided by Carnegie Mellon University | ARCTIC-LICENSE | 1016701 | EN-US, 5.2
EN_US | cmu-rms | male | unit selection | 5.2 | A male US English unit selection voice, built from recordings provided by Carnegie Mellon University | ARCTIC-LICENSE | 121504555 | EN-US, 5.2
EN_US | cmu-rms-hsmm | male | hsmm | 5.2 | A male US English hidden semi-Markov model voice, built from recordings provided by Carnegie Mellon University | ARCTIC-LICENSE | 1027287 | EN-US, 5.2
FR | enst-camille | female | unit selection | 5.2 | A female French unit selection voice, built at Télécom ParisTech (ENST) using data recorded by Camille Dianoux | BY-SA-3.0 | 214247758 | FR, 5.2
FR | enst-camille-hsmm | female | hsmm | 5.2 | A female French hidden semi-Markov model voice, built at Télécom ParisTech (ENST) using data recorded by Camille Dianoux | BY-SA-3.0 | 1517857 | FR, 5.2
FR | upmc-jessica | female | unit selection | 5.2 | A female French unit selection voice, built at ISIR (UPMC) using data recorded by Jessica Durand | BY-SA-3.0 | 151407773 | FR, 5.2
FR | upmc-jessica-hsmm | female | hsmm | 5.2 | A female French hidden semi-Markov model voice, built at ISIR (UPMC) using data recorded by Jessica Durand | BY-SA-3.0 | 1118194 | FR, 5.2
FR | enst-dennys-hsmm | male | hsmm | 5.2 | A male Québécois French hidden semi-Markov model voice, built at Télécom ParisTech (ENST) | BY-ND-3.0 | 1675605 | FR, 5.2
FR | upmc-pierre | male | unit selection | 5.2 | A male French unit selection voice, built at ISIR (UPMC) using data recorded by Pierre Chauvin | BY-SA-3.0 | 206409457 | FR, 5.2
FR | upmc-pierre-hsmm | male | hsmm | 5.2 | A male French hidden semi-Markov model voice, built at ISIR (UPMC) using data recorded by Pierre Chauvin | BY-SA-3.0 | 1556673 | FR, 5.2
IT | istc-lucia-hsmm | female | hsmm | 5.2 | Italian female Hidden semi-Markov model voice kindly made available by Fabio Tesser | BY-ND-3.0 | 1466178 | IT, 5.2
LB | marylux | female | unit selection | 5.2 | A female Luxembourgish unit selection voice | BY-NC-SA-4.0 | 118421559 | LB, 5.2
TE | cmu-nk-hsmm | female | hsmm | 5.2 | A female Telugu hidden semi-Markov model voice built from voice recordings provided by IIIT Hyderabad and Carnegie Mellon University | BY-ND-3.0 | 3396770 | TE, 5.2
TR | dfki-ot | male | unit selection | 5.2 | A male Turkish unit selection voice | BY-ND-3.0 | 161098455 | TR, 5.2
TR | dfki-ot-hsmm | male | hsmm | 5.2 | A male Turkish hidden semi-Markov model voice | BY-ND-3.0 | 1365754 | TR, 5.2

All voices together are somewhere around 5 GB.
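
Since the sizes differ so much between unit selection and hsmm voices, it can help to check what fits before downloading. The following is only a rough sketch: VOICES holds a few rows copied from the table above (sizes assumed to be in bytes), and voices_within is a hypothetical helper, not part of MarySpeech.

```python
# A few rows copied from the voice table: (language, voicename, type, size)
VOICES = [
    ("DE", "bits1", "unit selection", 262691025),
    ("DE", "bits1-hsmm", "hsmm", 1359993),
    ("EN_US", "cmu-bdl", "unit selection", 95244351),
    ("EN_US", "cmu-bdl-hsmm", "hsmm", 1016701),
    ("FR", "upmc-pierre", "unit selection", 206409457),
    ("FR", "upmc-pierre-hsmm", "hsmm", 1556673),
]

def voices_within(language, max_bytes):
    """Return the names of all listed voices of a language up to max_bytes."""
    return [name for lang, name, _type, size in VOICES
            if lang == language and size <= max_bytes]

small_french = voices_within("FR", 10 * 1000 * 1000)  # at most 10 MB
print(small_french)

# Inside MyRobotLab you could then install the result, e.g.:
# for name in small_french:
#     mouth.installComponentsAcceptLicense(name)
```

With the rows above, only the hsmm voice stays under the 10 MB limit.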

 

You can build your own voice as well, but you should have at least a basic knowledge of "working with computers".

 

3. Voice Effects

Notation:
"Effect1(param1=value1,param2=value2)+Effect2"

Some examples thankfully provided by MaryTTS:
"FIRFilter+Robot(amount=50)"
"Robot(amount=100)+Chorus(delay1=866, amp1=0.24, delay2=300, amp2=-0.40,)"
"Robot(amount=80)+Stadium(amount=50)"
"FIRFilter(type=3,fc1=6000, fc2=10000) + Robot"
"Stadium(amount=40) + Robot(amount=87) + Whisper(amount=65)+FIRFilter(type=1,fc1=1540;)++"

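If you assemble effects in a script, the notation above can be put together from a list instead of typing the string by hand. This is a minimal sketch; build_effects is a hypothetical helper (not part of MarySpeech or MaryTTS) and it only formats the string, it does not check the values.

```python
def build_effects(effects):
    """Build an effects string like "Effect1(p1=v1,p2=v2)+Effect2".

    effects: list of (name, params) pairs, where params is a list of
    (key, value) pairs; a list keeps the parameter order stable.
    """
    parts = []
    for name, params in effects:
        if params:
            args = ",".join("%s=%s" % (key, value) for key, value in params)
            parts.append("%s(%s)" % (name, args))
        else:
            parts.append(name)  # no parameters given: effect name only
    return "+".join(parts)

effects = build_effects([
    ("FIRFilter", []),
    ("Robot", [("amount", 50)]),
])
print(effects)

# Inside MyRobotLab, with mouth created as in the script at the top:
# mouth.setAudioEffects(effects)
```

The call above produces the first example string, "FIRFilter+Robot(amount=50)".
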
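(The following section is from the MaryTTS documentation. Its examples list parameters as "param:value;", while the notation above writes the same parameters as Effect(param=value).)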

3.1. Volume Effect:
Scales the output volume by a fixed amount.
Parameter:
   <amount>   Definition : Amount of scaling (the output is simply multiplied by amount)
   Range      : [0.0,10.0]
Example:
amount:2.0;

3.2. Vocal Tract Linear Scaling Effect:
Creates a shortened or lengthened vocal tract effect by shifting the formants.
Parameter:
   <amount>   Definition : The amount of formant shifting
   Range      : [0.25,4.0]
   For values of <amount> less than 1.0, the formants are shifted to lower frequencies
       resulting in a longer vocal tract (i.e. a deeper voice).
   Values greater than 1.0 shift the formants to higher frequencies.
       The result is a shorter vocal tract.

Example:
amount:1.5;

3.3. F0 scaling effect for HMM voices:
All voiced f0 values are multiplied by <f0Scale> for HMM voices.
This operation effectively scales the range of f0 values.
Note that mean f0 is preserved during the operation.
Parameter:
   <f0Scale>   Definition : Scale ratio for modifying the dynamic range of the f0 contour
                If f0Scale>1.0, the range is expanded (i.e. voice with more variable pitch)
                If f0Scale<1.0, the range is compressed (i.e. more monotonic voice)
                If f0Scale=1.0, the range is unchanged
   Range      : [0.0,3.0]
Example:
f0Scale:2.0;
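
To see how a range can be scaled while the mean stays put, here is a small worked sketch. The formula (stretch each value's distance from the mean) is an assumption chosen to match the description above; it is not taken from the MaryTTS source, which may work differently internally.

```python
def scale_f0_range(f0_values, f0_scale):
    """Scale the spread of voiced f0 values (in Hz) around their mean."""
    mean = sum(f0_values) / float(len(f0_values))
    return [mean + f0_scale * (f0 - mean) for f0 in f0_values]

contour = [100.0, 120.0, 140.0]          # voiced f0 values, mean is 120 Hz

expanded = scale_f0_range(contour, 2.0)    # more variable pitch
compressed = scale_f0_range(contour, 0.5)  # more monotonic voice

print(expanded)    # [80.0, 120.0, 160.0]
print(compressed)  # [110.0, 120.0, 130.0]
```

In both cases the mean stays at 120 Hz; only the distance of each value from the mean changes.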

3.4. F0 mean shifting effect for HMM voices:
Shifts the mean F0 value by <f0Add> Hz for HMM voices.
Parameter:
   <f0Add>   Definition : F0 shift of mean value in Hz for synthesized speech output
   Range      : [-300.0,300.0]
Example:
f0Add:50.0;

3.5. Duration scaling for HMM voices:
Scales the HMM output speech duration by <durScale>.
Parameter:
   <durScale>   Definition : Duration scaling factor for synthesized speech output
   Range      : [0.1,3.0]
Example:
durScale:1.5;

3.6. Robotiser Effect:
Creates a robotic voice by setting all phases to zero.
Parameter:
   <amount>   Definition : The amount of robotic voice at the output
   Range      : [0.0,100.0]
Example:
amount:100.0;

3.7. Whisper Effect:
Creates a whispered voice by replacing the LPC residual with white noise.
Parameter:
   <amount>   Definition : The amount of whisperised voice at the output
   Range      : [0.0,100.0]
Example:
amount:100.0;

3.8. Stadium Effect:
Adds stadium effect by applying a specially designed multi-tap chorus.
Parameter:
   <amount>   Definition : The amount of stadium effect at the output
   Range      : [0.0,200.0]
Example:
amount:100.0

3.9. Multi-Tap Chorus Effect:
Adds chorus effect by summing up the original signal with delayed and amplitude scaled versions.
The parameters should consist of delay and amplitude pairs for each tap.
A variable number of taps (max 20) can be specified by simply defining more delay-amplitude pairs.
Each tap outputs a delayed and gain-scaled version of the original signal.
All tap outputs are summed up with the original signal with appropriate gain normalization.
Parameters:
   <delay1>
   Definition : The amount of delay in milliseconds for tap #1
   Range      : [0,5000]
   <amp1>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for tap #1
   Range      : [-5.0,5.0]
   <delay2>
   Definition : The amount of delay in milliseconds in delayed channel #2
   Range      : [0,5000]
   <amp2>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for delayed channel #2
   Range      : [-5.0,5.0]
   ...
   <delayN>
   Definition : The amount of delay in milliseconds in delayed channel #N
   Range      : [0,5000]
   <ampN>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for delayed channel #N
   Range      : [-5.0,5.0]
   Note: Maximum possible number of taps is N=20. Parameters for more taps will simply be neglected.
Example: (A three-tap chorus effect)
delay1:466;amp1:0.54;delay2:600;amp2:-0.10;delay3:250;amp3:0.30
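
Numbering the delay/amp pairs by hand gets tedious with many taps, so here is a hedged sketch that generates them. chorus is a hypothetical helper; it writes the parameters in the Effect(param=value) notation used for setAudioEffects on this page, checks the ranges given above, and drops taps beyond the 20th.

```python
MAX_TAPS = 20  # limit stated in the description above

def chorus(taps):
    """taps: list of (delay_ms, amplitude) pairs, one pair per tap."""
    params = []
    for i, (delay, amp) in enumerate(taps[:MAX_TAPS], 1):
        if not 0 <= delay <= 5000:
            raise ValueError("delay%d out of range [0,5000]: %s" % (i, delay))
        if not -5.0 <= amp <= 5.0:
            raise ValueError("amp%d out of range [-5.0,5.0]: %s" % (i, amp))
        params.append("delay%d=%s" % (i, delay))
        params.append("amp%d=%s" % (i, amp))
    return "Chorus(%s)" % ",".join(params)

# The three-tap example from above
three_tap = chorus([(466, 0.54), (600, -0.10), (250, 0.30)])
print(three_tap)

# Inside MyRobotLab:
# mouth.setAudioEffects(three_tap)
```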

3.10. FIR filtering:
Filters the input signal by an FIR filter.
Parameters:
   <type>
   Definition : Type of filter (1:Lowpass, 2:Highpass, 3:Bandpass, 4:Bandreject)
   Range      : {1,2,3,4}
   <fc>   Definition : Cutoff frequency in Hz for lowpass and highpass filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
   <fc1>   Definition : Lower frequency cutoff in Hz for bandpass and bandreject filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
   <fc2>   Definition : Higher frequency cutoff in Hz for bandpass and bandreject filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
Example: (A band-pass filter)
type:3;fc1:500.0;fc2:2000.0
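
The filter type numbers and cutoff rules are easy to mix up, so the sketch below wraps them in a hypothetical helper, fir_filter. It checks the cutoffs against fs/2 and returns a string in the Effect(param=value) notation used on this page. The 16000 Hz sampling rate in the example is an assumption; use the rate of your voice.

```python
FILTER_TYPES = {"lowpass": 1, "highpass": 2, "bandpass": 3, "bandreject": 4}

def fir_filter(kind, sampling_rate, fc=None, fc1=None, fc2=None):
    """Build a FIRFilter effect string; cutoffs are given in Hz."""
    ftype = FILTER_TYPES[kind]
    nyquist = sampling_rate / 2.0
    if ftype in (1, 2):
        cutoffs = [("fc", fc)]                 # lowpass / highpass
    else:
        cutoffs = [("fc1", fc1), ("fc2", fc2)]  # bandpass / bandreject
    for name, value in cutoffs:
        if value is None or not 0.0 <= value <= nyquist:
            raise ValueError("%s must be within [0.0, %s]" % (name, nyquist))
    if ftype in (3, 4) and fc1 >= fc2:
        raise ValueError("fc1 must be lower than fc2")
    args = ["type=%d" % ftype] + ["%s=%s" % (n, v) for n, v in cutoffs]
    return "FIRFilter(%s)" % ",".join(args)

# The band-pass example from above, assuming fs = 16000 Hz
band_pass = fir_filter("bandpass", 16000, fc1=500.0, fc2=2000.0)
print(band_pass)

# Inside MyRobotLab:
# mouth.setAudioEffects(band_pass)
```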

3.11. Jet pilot effect:
Filters the input signal using an FIR bandpass filter.
Parameters: NONE
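
To avoid passing values outside the ranges listed in this section, you can check them before building the string. The sketch below is a hypothetical helper, not part of MarySpeech. The effect names Robot, Whisper, Stadium, F0Add and TractScaler appear elsewhere on this page; Volume, F0Scale and Rate are assumptions about how MaryTTS names those effects, so verify them against your MaryTTS version.

```python
# (effect name, parameter) -> (minimum, maximum), copied from the ranges above.
# "Volume", "F0Scale" and "Rate" are assumed effect names (see lead-in).
RANGES = {
    ("Volume", "amount"): (0.0, 10.0),
    ("TractScaler", "amount"): (0.25, 4.0),
    ("F0Scale", "f0Scale"): (0.0, 3.0),
    ("F0Add", "f0Add"): (-300.0, 300.0),
    ("Rate", "durScale"): (0.1, 3.0),
    ("Robot", "amount"): (0.0, 100.0),
    ("Whisper", "amount"): (0.0, 100.0),
    ("Stadium", "amount"): (0.0, 200.0),
}

def effect(name, param, value):
    """Return "name(param=value)" after checking the documented range."""
    low, high = RANGES[(name, param)]
    if not low <= value <= high:
        raise ValueError("%s.%s must be within [%s,%s], got %s"
                         % (name, param, low, high, value))
    return "%s(%s=%s)" % (name, param, value)

# Raise the pitch and shorten the vocal tract
higher_voice = effect("F0Add", "f0Add", 90.0) + "+" + effect("TractScaler", "amount", 1.2)
print(higher_voice)

# Inside MyRobotLab:
# mouth.setAudioEffects(higher_voice)
```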

 

4. TL;DR

PRO:

  • open source
  • offline
  • free
  • different voices
  • voice effects!

CON:

  • not as great as paid alternatives (yet)
  • rather big voice-files required

5. Further links

MaryTTS -> http://mary.dfki.de/

MaryTTS web interface -> http://mary.dfki.de:59125/

blog post about supporting different voices -> http://myrobotlab.org/content/marytts-multi-language-support

 

[TODO]

-> sample script

-> ???

moz4r

Thank you! Great documentation...

In French, the only good voice I found is pierre, a male voice,

which I changed to a female voice like this: mouth.setAudioEffects("F0Add(f0Add=90.0)+TractScaler(amount=1.2)")

hairygael

Good instructions, MaVo!!

Using the parameters is nice and handy!

Since you have your hands in voices: are the MBROLA voices a branch of MaryTTS? There seem to be a lot of different voices in their list.

http://tcts.fpms.ac.be/synthesis/mbrola/mbrcopybin.html

juerg

Large post - great, but almost a bit overwhelming.

What I miss is how exactly we can add a voice to MRL. My procedure was to first download the runtime package from the MaryTTS webpage. Its bin folder contains a "marytts-component-installer.bat". After starting it, we can select the voices to download.

You will get a zip file in the download folder, relative to where you started the .bat command. Unzipping it will create a new folder lib, which should now contain a "voice-xxx.jar". You need to copy this jar into your ".../mrl/libraries/jar" folder AND RESTART MRL!

It might be that we already have a magical feature in MRL which does this automatically - but using my method of creating an "mrl_<version>" folder with each new version I found that I had to copy the voice jar into the jar folder myself.

And it might help to add a python example line of how to apply the voice modifications for all the different options?

MaVo

To install a voice:

maryspeech.installComponentsAcceptLicense(voicename)

This will not only download the correct jar, but also the additional voice files (if the voice has them) and put both in the correct location.

(More information on installing voices at the start of "2. Voices")

 

A Python example script is really needed.

GroG

With brackets [ [service/MarySpeech.py] ] - the page will pick up the pyrobotlab/service/MarySpeech.py script and format it ...

I added it to the top but it seems pretty bare ... no example of loading voices ...

FYI - It's the develop branch of the script in pyrobotlab.
At some point I'll add a selector to service pages so you can choose which branch your documentation/script examples come from ...   

Another thing on the list :)

Hi MaVo !