MarySpeech

The MarySpeech Service can be used to generate speech from text using the MaryTTS project.

1. General

It's different from most other speech services (AcapellaSpeech & GoogleSpeech) as it is OPEN SOURCE (Yay!) and doesn't even require an internet connection (Yay! x2) as everything is generated locally and offline.

The downside is that you have to pre-download the rather big voice-file(s) you want to use.

Also, the quality might not be as good as that of its competitors (e.g. Acapella & Google), but it's free and nobody can simply "shut it down".

Another cool thing about MarySpeech is that you can apply parameterized voice effects to your voice.

2. Voices

[TODO]

3. Voice Effects

Notation:
"Effect1(param1=value1,param2=value2)+Effect2"

Some examples thankfully provided by MaryTTS:
"FIRFilter+Robot(amount=50)"
"Robot(amount=100)+Chorus(delay1=866, amp1=0.24, delay2=300, amp2=-0.40,)"
"Robot(amount=80)+Stadium(amount=50)"
"FIRFilter(type=3,fc1=6000, fc2=10000) + Robot"
"Stadium(amount=40) + Robot(amount=87) + Whisper(amount=65)+FIRFilter(type=1,fc1=1540;)++"

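A minimal sketch of how such an effect string could be assembled in a script, following the notation above. The helper name effect_string is made up for this illustration and is not part of MarySpeech or MaryTTS.

```python
def effect_string(effects):
    """Join (name, params) pairs into 'Name(k=v,...)+Name2' notation."""
    parts = []
    for name, params in effects:
        if params:
            args = ",".join(f"{key}={value}" for key, value in params.items())
            parts.append(f"{name}({args})")
        else:
            # An effect without parameters is written as its bare name.
            parts.append(name)
    return "+".join(parts)


chain = effect_string([
    ("Robot", {"amount": 80}),
    ("Stadium", {"amount": 50}),
])
print(chain)  # Robot(amount=80)+Stadium(amount=50)
```

The resulting string is the kind of value that gets handed to the service's setAudioEffects call.
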
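(The following section is from the MaryTTS documentation. Its examples list parameters as name:value; pairs. In an effect string, write them as name=value inside the parentheses, as in the notation above.)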

3.1. Volume Effect:
Scales the output volume by a fixed amount.
Parameter:
   <amount>   Definition : Amount of scaling (the output is simply multiplied by amount)
   Range      : [0.0,10.0]
Example:
amount:2.0;

3.2. Vocal Tract Linear Scaling Effect:
Creates a shortened or lengthened vocal tract effect by shifting the formants.
Parameter:
   <amount>   Definition : The amount of formant shifting
   Range      : [0.25,4.0]
   For values of <amount> less than 1.0, the formants are shifted to lower frequencies
       resulting in a longer vocal tract (i.e. a deeper voice).
   Values greater than 1.0 shift the formants to higher frequencies.
       The result is a shorter vocal tract.

Example:
amount:1.5;
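
The parameter examples in this part of the page are written as name:value; pairs, while the effect strings at the top of section 3 use Name(name=value,...). A small sketch that converts one form into the other could look like this. The helper name to_effect is hypothetical, and the effect name TractScaler is assumed to be the vocal tract scaling effect (it only appears in a user comment on this page).

```python
def to_effect(name, doc_params):
    """Turn 'a:1;b:2' (documentation form) into 'Name(a=1,b=2)'."""
    pairs = []
    for item in doc_params.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after a trailing ';'
        key, _, value = item.partition(":")
        pairs.append(f"{key.strip()}={value.strip()}")
    return f"{name}({','.join(pairs)})" if pairs else name


print(to_effect("TractScaler", "amount:1.5;"))  # TractScaler(amount=1.5)
```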

3.3. F0 scaling effect for HMM voices:
All voiced f0 values are multiplied by <f0Scale> for HMM voices.
This operation effectively scales the range of f0 values.
Note that mean f0 is preserved during the operation.
Parameter:
   <f0Scale>   Definition : Scale ratio for modifying the dynamic range of the f0 contour
                If f0Scale>1.0, the range is expanded (i.e. voice with more variable pitch)
                If f0Scale<1.0, the range is compressed (i.e. more monotonic voice)
                If f0Scale=1.0, the range is unchanged
   Range      : [0.0,3.0]
Example:
f0Scale:2.0;
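
To make "scaling the range while preserving the mean" concrete, here is a small numeric sketch. It is only one reading of the description above (each voiced value is moved away from or towards the mean by the factor f0Scale) and is not taken from the MaryTTS source. Treating 0 as "unvoiced" is also an assumption of this sketch.

```python
def scale_f0_range(f0, f0_scale):
    """Scale the spread of voiced f0 values around their mean.

    Frames with f0 == 0 are treated as unvoiced and left untouched
    (an assumption of this sketch).
    """
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)
    mean = sum(voiced) / len(voiced)
    return [mean + f0_scale * (v - mean) if v > 0 else v for v in f0]


contour = [100.0, 0.0, 120.0, 140.0]  # Hz, made-up values
print(scale_f0_range(contour, 2.0))  # [80.0, 0.0, 120.0, 160.0]
```

With f0Scale=2.0 the voiced values spread from 100..140 Hz to 80..160 Hz, while their mean stays at 120 Hz.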

3.4. F0 mean shifting effect for HMM voices:
Shifts the mean F0 value by <f0Add> Hz for HMM voices.
Parameter:
   <f0Add>   Definition : F0 shift of mean value in Hz for synthesized speech output
   Range      : [-300.0,300.0]
Example:
f0Add:50.0;

3.5. Duration scaling for HMM voices:
Scales the HMM output speech duration by <durScale>.
Parameter:
   <durScale>   Definition : Duration scaling factor for synthesized speech output
   Range      : [0.1,3.0]
Example:
durScale:1.5;

3.6. Robotiser Effect:
Creates a robotic voice by setting all phases to zero.
Parameter:
   <amount>   Definition : The amount of robotic voice at the output
   Range      : [0.0,100.0]
Example:
amount:100.0;

3.7. Whisper Effect:
Creates a whispered voice by replacing the LPC residual with white noise.
Parameter:
   <amount>   Definition : The amount of whisperised voice at the output
   Range      : [0.0,100.0]
Example:
amount:100.0;

3.8. Stadium Effect:
Adds stadium effect by applying a specially designed multi-tap chorus.
Parameter:
   <amount>   Definition : The amount of stadium effect at the output
   Range      : [0.0,200.0]
Example:
amount:100.0
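
Since every effect has its own range, it can help to check a value before building the effect string. The sketch below copies the ranges from sections 3.2, 3.4 and 3.6 to 3.8. The helper check_range is hypothetical, and only effect names that appear elsewhere on this page are used. What MaryTTS itself does with out-of-range values is not covered here.

```python
# Ranges copied from sections 3.2, 3.4 and 3.6 to 3.8 of this page.
RANGES = {
    ("TractScaler", "amount"): (0.25, 4.0),
    ("F0Add", "f0Add"): (-300.0, 300.0),
    ("Robot", "amount"): (0.0, 100.0),
    ("Whisper", "amount"): (0.0, 100.0),
    ("Stadium", "amount"): (0.0, 200.0),
}


def check_range(effect, param, value):
    """Return value unchanged, or raise ValueError if it is out of range."""
    low, high = RANGES[(effect, param)]
    if not low <= value <= high:
        raise ValueError(f"{effect}.{param}={value} outside [{low}, {high}]")
    return value


print(check_range("Stadium", "amount", 150.0))  # 150.0
```

The same value of 150.0 would be rejected for Robot or Whisper, whose ranges end at 100.0.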

3.9. Multi-Tap Chorus Effect:
Adds chorus effect by summing up the original signal with delayed and amplitude scaled versions.
The parameters should consist of delay and amplitude pairs for each tap.
A variable number of taps (max 20) can be specified by simply defining more delay-amplitude pairs.
Each tap outputs a delayed and gain-scaled version of the original signal.
All tap outputs are summed up with the original signal with appropriate gain normalization.
Parameters:
   <delay1>
   Definition : The amount of delay in milliseconds for tap #1
   Range      : [0,5000]
   <amp1>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for tap #1
   Range      : [-5.0,5.0]
   <delay2>
   Definition : The amount of delay in milliseconds in delayed channel #2
   Range      : [0,5000]
   <amp2>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for delayed channel #2
   Range      : [-5.0,5.0]
   ...
   <delayN>
   Definition : The amount of delay in milliseconds in delayed channel #N
   Range      : [0,5000]
   <ampN>
   Definition : Relative amplitude of the channel gain as compared to original signal gain for delayed channel #N
   Range      : [-5.0,5.0]
   Note: Maximum possible number of taps is N=20. Parameters for more taps will simply be neglected.
Example: (A three-tap chorus effect)
delay1:466;amp1:0.54;delay2:600;amp2:-0.10;delay3:250;amp3:0.30
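
The summing described above can be sketched in a few lines of plain Python. This is an illustration only, not the MaryTTS implementation: the gain normalization (dividing by 1 plus the sum of absolute tap amplitudes) and cutting the output to the input length are assumptions of this sketch.

```python
def multi_tap_chorus(signal, taps, sample_rate):
    """Mix `signal` with delayed, amplitude-scaled copies of itself.

    taps: list of (delay_ms, amp) pairs, at most 20 are used.
    """
    taps = taps[:20]  # taps beyond the 20th are ignored
    # Assumed normalization: divide by the total absolute gain.
    norm = 1.0 + sum(abs(amp) for _, amp in taps)
    out = list(signal)
    for delay_ms, amp in taps:
        delay = int(round(delay_ms * sample_rate / 1000.0))
        for n in range(delay, len(signal)):
            out[n] += amp * signal[n - delay]
    return [x / norm for x in out]


impulse = [1.0, 0.0, 0.0, 0.0]
print(multi_tap_chorus(impulse, [(1, 0.5), (2, -0.5)], sample_rate=1000))
# [0.5, 0.25, -0.25, 0.0]
```

Feeding in a single impulse shows each tap directly: the original at sample 0, then one scaled copy per tap at its delay.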

3.10. FIR filtering:
Filters the input signal by an FIR filter.
Parameters:
   <type>
   Definition : Type of filter (1:Lowpass, 2:Highpass, 3:Bandpass, 4:Bandreject)
   Range      : {1,2,3,4}
   <fc>   Definition : Cutoff frequency in Hz for lowpass and highpass filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
   <fc1>   Definition : Lower frequency cutoff in Hz for bandpass and bandreject filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
   <fc2>   Definition : Higher frequency cutoff in Hz for bandpass and bandreject filters
   Range      : [0.0, fs/2.0] where fs is the sampling rate in Hz
Example: (A band-pass filter)
type:3;fc1:500.0;fc2:2000.0
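
Which cutoff parameters are needed depends on the filter type, so a small helper can catch mistakes early. The sketch below follows the parameter list above. The function name fir_filter_effect, the requirement fc1 < fc2, and the 16000 Hz sampling rate in the usage line are assumptions made for this example (the real sampling rate depends on the voice).

```python
def fir_filter_effect(filter_type, fs, fc=None, fc1=None, fc2=None):
    """Build a 'FIRFilter(...)' string after checking the documented ranges."""
    nyquist = fs / 2.0
    if filter_type in (1, 2):  # lowpass, highpass: single cutoff
        if fc is None or not 0.0 <= fc <= nyquist:
            raise ValueError("fc must lie in [0, fs/2]")
        return f"FIRFilter(type={filter_type},fc={fc})"
    if filter_type in (3, 4):  # bandpass, bandreject: two cutoffs
        if fc1 is None or fc2 is None or not 0.0 <= fc1 < fc2 <= nyquist:
            raise ValueError("need 0 <= fc1 < fc2 <= fs/2")
        return f"FIRFilter(type={filter_type},fc1={fc1},fc2={fc2})"
    raise ValueError("type must be 1, 2, 3 or 4")


print(fir_filter_effect(3, fs=16000, fc1=500.0, fc2=2000.0))
# FIRFilter(type=3,fc1=500.0,fc2=2000.0)
```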

3.11. Jet pilot effect:
Filters the input signal using an FIR bandpass filter.
Parameters: NONE

4. TL;DR

PRO:

  • open source
  • offline
  • free
  • different voices
  • voice effects!

CON:

  • not as great as paid alternatives (yet)
  • rather big voice-files required

5. Further links

MaryTTS -> http://mary.dfki.de/

MaryTTS web interface -> http://mary.dfki.de:59125/

blog post about supporting different voices -> http://myrobotlab.org/content/marytts-multi-language-support

[TODO]

-> sample script

-> voices (both doc & service-support)

-> ???

moz4r:

Thank you! Great documentation...

In French, the only good voice I found is pierre, a male voice, which I changed to a female one like this: mouth.setAudioEffects("F0Add(f0Add=90.0)+TractScaler(amount=1.2)")

hairygael:

Good instructions MaVo!!

Using the parameters is nice and handy!

Since you have your hands in voices anyway: are the MBROLA voices a branch of MaryTTS? There seem to be a lot of different voices in their list.

http://tcts.fpms.ac.be/synthesis/mbrola/mbrcopybin.html