This past week my Hack-A-Thon project was to add MaryXML as a valid speech input inside of MyRobotLab. I decided to add this to improve the capabilities of MarySpeech inside MRL. MaryXML is a markup language that allows the voice to be modified in the middle of an utterance.

If you have seen some of my other posts, you will know I have a fascination with making my robot, "Junior", sing.  The idea was that if I could make him sing, I should be able to make him sound better when he speaks.  Of course, I still like having Junior sound like a robot, so what I am doing may not meet the expectations of every MRL user.

First, let's look at how I made Junior "sing" before:

from java.lang import String

python = Runtime.getService("python")
mouth = Runtime.createAndStart("MarySpeech", "MarySpeech")
mouth.setVoice("cmu-bdl-hsmm")
mouth.setAudioEffects("TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=60.0) + Robot(amount=8.0) + Rate(amount=1.75)")
 
singLowA = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=-9.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singLowB = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=4.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singC = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=10.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singD = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=28.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singE = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=45.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singF = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=58.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singG = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=77.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singA = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=102.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singB = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=130.0) + Robot(amount=8.0) + Rate(amount=1.75)"
singHighC = "TractScaler(amount=1.4)  + F0Scale(f0Scale=0.0) + F0Add(f0Add=142.0) + Robot(amount=8.0) + Rate(amount=1.75)"
 
mouth.setAudioEffects(singF) 
mouth.speakBlocking("Happ, pee")
mouth.setAudioEffects(singG) 
mouth.speakBlocking("birth")
mouth.setAudioEffects(singF) 
mouth.speakBlocking("day")
mouth.setAudioEffects(singB) 
...
 
The issue with this approach was that it produced a very choppy-sounding singing voice.  I have to change the audio effects and then call speakBlocking() for each syllable, which causes a noticeable break between syllables.
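The per-note effect strings above only differ in their F0Add value, so they can be generated from a single helper instead of hand-written constants. Here is a sketch of that idea; `sing_effect`, `NOTES`, and `sing` are my own hypothetical names, and `mouth` stands for the MarySpeech service from the script above:

```python
# Hypothetical helper: build the MaryTTS audio-effect string for a given
# F0Add offset instead of hand-writing one constant per note.
def sing_effect(f0_add):
    return ("TractScaler(amount=1.4) + F0Scale(f0Scale=0.0)"
            " + F0Add(f0Add=%.1f) + Robot(amount=8.0) + Rate(amount=1.75)" % f0_add)

# F0Add offsets copied from the note constants above
NOTES = {"lowA": -9.0, "lowB": 4.0, "C": 10.0, "D": 28.0, "E": 45.0,
         "F": 58.0, "G": 77.0, "A": 102.0, "B": 130.0, "highC": 142.0}

def sing(mouth, phrase):
    # phrase is a list of (note, syllable) pairs
    for note, syllable in phrase:
        mouth.setAudioEffects(sing_effect(NOTES[note]))
        mouth.speakBlocking(syllable)
```

With that, the "Happy Birthday" fragment becomes `sing(mouth, [("F", "Happ, pee"), ("G", "birth"), ("F", "day")])` -- still choppy for the same reason, but easier to edit.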
 
To remedy this issue I wanted to add MaryXML as an input to the MarySpeech.java service we use in MRL. I have added a couple of helpful methods which I will be adding to the dev branch of the MyRobotLab code in the near future. These are the functions I have added:
 
public boolean speakWithXMLBlocking(String dataString, String dataType) {}
 
This is my first pass at it, and I may change dataType to an enumeration so there is a finite set of inputs for it. Currently you can use "RAWMARYXML" or "EMOTIONML". The dataString is written in either MaryXML or EmotionML. Here are some examples of what that looks like in my test Python script:
 
mouth.speakWithXMLBlocking("<?xml version=\"1.0\" encoding=\"UTF-8\" ?> " +
"<maryxml version=\"0.5\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " +
"xmlns=\" http://mary.dfki.de/2002/MaryXML\" xml:lang=\"en-US\">" +
"<p>  <prosody contour=\"(0%,+0st)(100%,-0st)\" pitch=\"-18%\" rate=\"-30%\"> " +
"Come, there’s no use in crying like that." +
"</prosody> </p>" +
"</maryxml>", "RAWMARYXML") 
 
or 
 
mouth.speakWithXMLBlocking("<emotionml version=\"1.0\" xmlns=\"http://www.w3.org/2009/10/emotionml\"> "+
"Hello and good afternoon. " +
"<emotion><category name=\"happy\"/> " +
"Nice to see you again! " +
"</emotion> " +
"<emotion><category name=\"loving\"/> " +
"Nice to see you again! " +
"</emotion> " +
"<emotion><category name=\"sad\"/> " +
"Yeah I also had something else in mind than this. " +
"</emotion> " +
"<emotion><category name=\"content\"/> " +
"Well at least there is something nice to see. " +
"</emotion></emotionml>", "EMOTIONML")
 
This works well for testing what is possible with this new input method, but it may be more than most MRL users want to deal with, so I also created this method:
 

public boolean speakWithEmotionBlocking(String text, String emotion) {}

 

This allows for making a call in your Python script like:

 

mouth.speakWithEmotionBlocking("Yeah I also had something else in mind than this.", "sad")
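Under the hood, a call like this presumably just wraps the text in a single <emotion> element before handing it to MaryTTS. Here is my guess at that wrapping as a standalone Python sketch -- `emotionml_wrap` is my own hypothetical name, not the actual MarySpeech internals, though the namespace is the standard W3C EmotionML one:

```python
from xml.sax.saxutils import escape

def emotionml_wrap(text, emotion):
    # Hypothetical sketch of what speakWithEmotionBlocking might build internally:
    # one <emotion> with a single <category> marker around the escaped text.
    return ('<emotionml version="1.0" xmlns="http://www.w3.org/2009/10/emotionml">'
            '<emotion><category name="%s"/>%s</emotion>'
            '</emotionml>' % (emotion, escape(text)))
```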

 
 
The list of emotions can be found:
 
 
* Not all of these categories seem to be properly processed by MaryTTS, so you may need to test to see if they work.
* Also, not all voices seem to be fully compatible with EmotionML and MaryXML.
     From what I have read, the voices that begin with "voice-dfki" seem to work better for this than the
     "voice-cmu" voices.
 
Here is a quick video showing it in action.  
 
 
 
* Note:  I am running MRL from Eclipse and it is using the cmu-stl-hsmm voice. This voice doesn't have full emotional capabilities but sad seems to work well.
 
 
Resources:
mary.dfki.de/documentation/maryxml/
 
 

kyle.clinton

7 years 3 months ago

I forgot to add an example of my test of Junior singing using MaryXML:

mouth.speakWithXMLBlocking("<?xml version=\"1.0\" encoding=\"UTF-8\" ?> <maryxml version=\"0.4\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns=\"http://mary.dfki.de/2002/MaryXML\" xml:lang=\"en-US\">" +
"<p> <prosody  pitch=\"+2st\" range=\"-10%\" volume=\"loud\">" +
"Happ, pee" +
"</prosody> </p>" +
"<p> <prosody  pitch=\"+4st\" range=\"-10%\" volume=\"loud\">" +
"Birth" +
"</prosody> </p>" +
"<p> <prosody  pitch=\"+2st\" range=\"-10%\" volume=\"loud\">" +
"day" +
"</prosody> </p>" +
"<p> <prosody  pitch=\"+8st\" range=\"-10%\" volume=\"loud\">" +
"to" +
"</prosody> </p>" +
"<p> <prosody  pitch=\"+6st\" range=\"-10%\" volume=\"loud\">" +
"you" +
"</prosody> </p>" +
"</maryxml>", "RAWMARYXML")
 
* Currently this is still choppy, but I have not tried to improve it much.  I just wanted to test changing the pitch in semitones, e.g. <prosody pitch="+6st" ...>. This seems to work and should be an easier way of defining the different notes Junior "sings".
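Since every syllable gets the same prosody block with only the pitch changing, the whole document can be generated from a list of (syllable, semitone) pairs. A sketch of that, with `sing_maryxml` as my own hypothetical helper:

```python
def sing_maryxml(syllables):
    # syllables: list of (syllable, semitone_offset) pairs
    body = "".join(
        '<p><prosody pitch="%+dst" range="-10%%" volume="loud">%s</prosody></p>'
        % (st, syl) for syl, st in syllables)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<maryxml version="0.5" '
            'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
            'xmlns="http://mary.dfki.de/2002/MaryXML" xml:lang="en-US">'
            '%s</maryxml>' % body)
```

So the snippet above becomes `mouth.speakWithXMLBlocking(sing_maryxml([("Happ, pee", 2), ("Birth", 4), ("day", 2), ("to", 8), ("you", 6)]), "RAWMARYXML")`.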
 
So with a little more tweaking of this I think this will work better...
 
mouth.speakWithXMLBlocking("<?xml version=\"1.0\" encoding=\"UTF-8\" ?> " +
"<maryxml version=\"0.5\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " +
"xmlns=\"http://mary.dfki.de/2002/MaryXML\" xml:lang=\"en-US\"> " +
"Hello! <boundary duration=\"700\"/> " +
"<prosody contour=\"(0%,+0st)  (40%,+8st) (60%,+0st) (80%,+10st) (90%,+12st)\" > " +
"Hap pee birth day to you  " +
"</prosody> " +
"</maryxml>", "RAWMARYXML")
 
Still a work in progress.  I am not sure I am getting the notes exactly the way I would like them, but it no longer has the choppy breaks between different notes!!!
 

From the feedback of GroG I have added the non-blocking methods for speakWithEmotion and speakWithXML.

So the little oddity here is that speak() inside MarySpeech.java looks like:

public AudioData speak(String dataString)

so I made these new methods also return AudioData:

public AudioData speakWithXML(String dataString, String dataType)

public AudioData speakWithEmotion(String toSpeak, String emotion)

 

There is also another method that is technically available:

public boolean speakWithXMLInternal(String dataString, String dataType, boolean blocking)

 

This is actually where all the work for the new blocking and non-blocking methods happens. The dataTypes available for the withXML methods are "RAWMARYXML" and "EMOTIONML".
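The actual implementation lives in MarySpeech.java, but the dispatch pattern is simple enough to sketch in Python: validate the dataType, then route to the blocking or non-blocking speak call. All names below are my own illustration, not the real MRL code, and `backend` stands in for the MarySpeech service:

```python
VALID_TYPES = ("RAWMARYXML", "EMOTIONML")

def speak_with_xml_internal(backend, data_string, data_type, blocking):
    # Reject anything that isn't one of the two supported markup types
    if data_type not in VALID_TYPES:
        raise ValueError("unsupported dataType: %s" % data_type)
    # Route to the blocking or non-blocking speech call
    if blocking:
        return backend.speakBlocking(data_string)
    return backend.speak(data_string)
```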

 

For the emotions, I believe I defined those in my original post. Remember that, depending on the voice you are using for MarySpeech, the capabilities of these new methods may be limited. I am still working on getting all of the voices compiled so I can demo the different voices.

 

ENJOY!

 

Kyle

 

 

GroG

7 years 3 months ago

First .. thanks for the detailed post !

What about speak non blocking ?

I tried correlating the speech in the video with your xml - it's not exactly the same.  I was a bit surprised that I did not hear the emotion...  after watching the video several times I can imagine it making a difference, but I don't have an "un-emotional" sample to compare word for word...

I guess I was thinking more 'way over the top unbelievable emotion' - 

like a <category name="jimcarrey">

very cool work ... very interested in seeing where it goes :)

Thanks kyle !