Converter for AIML files encoded in "ISO-8859-1" to "UTF-8"

To make it short:

I found a free, open-source French chatbot based on ALICE2.0 that offers a lot of possibilities.

The problem is that all the AIML files are encoded in "ISO-8859-1".

Therefore, when I speak French through WebKitRecognition, it doesn't work correctly and the bot only returns the default answers.

#############################

Here is what the AIML files look like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<aiml version="1.0">

<category><pattern>PEUT ETRE *</pattern><template>Tu semble incertain. <sr/> </template></category>
<category><pattern>PEUT SEULEMENT *</pattern><template><srai>peut <star/></srai></template></category>
<category><pattern>PEUT ÊTRE *</pattern><template>Tu semble incertain. <sr/> </template></category>
<category><pattern>PEUT</pattern><template>Peut que? </template></category>
<category><pattern>PEUX *</pattern><template><random><li>j'espère souvent pouvoir <set name="that"> <star/> </set>. </li><li>Un livre peut il n'avoir aucun titre? </li><li>Peux quoi? </li></random></template></category>

....

##############################

So I have set up a Python file, called at startup, that tries to convert some of the input. The first replace line has some success, but the second one apparently isn't working:

def onText(text):
    #print text.replace("'", " ")
    inmoovFrench.getResponse(text.replace("'", " "))
    inmoovFrench.getResponse(text.replace("-", " "))

##############################

I'm wondering what method could be used to convert what comes from Chrome.

I tried replacing "ISO-8859-1" with "UTF-8" at the top of my AIML files, but it gives me an error.

If you guys have an idea...


hairygael's picture

I found this in a post by Anthony, though I still need to figure out which parts I need:

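import glob
import io
import os

# Remember the starting directory so it can be restored afterwards.
oridir = os.getcwd().replace("\\", "/")
# Directory holding the AIML files; adjust it to your own bot folder.
dir = oridir + "/ProgramAB/bots/rachel/aiml"
os.chdir(dir)
for file in glob.glob("*.aiml"):
    # Read the file, assuming it is currently ISO-8859-1.
    with io.open(dir + '/' + file, 'r', encoding='iso-8859-1') as f:
        text = f.read()

    # Overwrite the same file with the UTF-8 encoded bytes.
    with io.open(dir + '/' + file, 'wb') as f:
        f.write(text.encode('utf8'))
    print file + ' converted'
os.chdir(oridir)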

kwatters's picture

change encoding

That looks like a script that will read all the files in a directory, assume they are in ISO-8859-1 encoding, and re-encode them into UTF-8 before writing each file back in place and printing its name followed by the word "converted".

So I think there is still one change to make after that: the XML / AIML needs to have its encoding setting updated at the top to say:

<?xml version="1.0" encoding="utf-8" ?> 

I think you can probably just run that code in the python service in mrl.. the only thing you need to update is the "dir" to make sure it's pointing at the right directory where your aiml files are.
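
The two steps discussed above, re-encoding the bytes and updating the XML declaration, could be combined in one pass. This is only a sketch under assumptions: the function name latin1_aiml_to_utf8 is hypothetical, and it trusts the encoding declared at the top of the file. It leaves data that already declares UTF-8 untouched, since re-encoding that would garble the accented characters.

```python
import re

# Matches the encoding attribute of the XML declaration.
XML_ENCODING = re.compile(r'(<\?xml[^>]*encoding=["\'])([^"\']+)(["\'])')

def latin1_aiml_to_utf8(raw):
    """Return UTF-8 bytes for AIML data stored as ISO-8859-1 bytes.

    Hypothetical helper: `raw` is the whole file content as bytes.
    """
    # Every byte value is valid ISO-8859-1, so this decode never fails.
    text = raw.decode('iso-8859-1')
    match = XML_ENCODING.search(text)
    if match and match.group(2).lower() in ('utf-8', 'utf8'):
        # Already declared as UTF-8: converting again would corrupt accents.
        return raw
    # Point the declaration at the new encoding, then re-encode.
    text = XML_ENCODING.sub(r'\g<1>UTF-8\g<3>', text, count=1)
    return text.encode('utf-8')
```

To use it on a file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode, after making a backup copy.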

 

 

 

 

moz4r's picture

Hi Gael, I will test your Python code and AIML sample file to see if they are compatible with ProgramAB. Do you have them on GitHub? You can send a zip too, no problem.

(I don't remember this piece of code at all :) happy if it's worky)

 

@++

hairygael's picture

The issue I have is at the input text level. For example, WebKitRecognition hears:

"m'écoutes-tu", but to make it work with the AIML I have, I need to input:

"m ecoutes tu"

Therefore I tried to create a replacement function:

def onText(text):
    inmoovFrench.getResponse(text.replace("-", " ").replace("'", " ").replace("é", "e").replace("è", "e").replace("ê", "e").replace("à", "a").replace("û", "u").replace("ï", "i"))

This is kind of worky, but the robot repeats its own answer twice and not all the replacements work; only ("-", " ") and ("'", " ") take effect.
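
One possible reason only the apostrophe and hyphen replacements take effect (an assumption, the thread does not confirm it) is that accented literals typed into the script are not stored as the same characters the recognizer delivers. A sketch that avoids typed accents by using numeric code points is shown here; the names REPLACEMENTS and simplify are hypothetical, and the input is assumed to be a Unicode string.

```python
# Hypothetical lookup table: Unicode code point -> replacement text.
REPLACEMENTS = {
    ord(u"'"): u" ",
    ord(u"-"): u" ",
    0xE0: u"a",  # a with grave accent
    0xE2: u"a",  # a with circumflex
    0xE8: u"e",  # e with grave accent
    0xE9: u"e",  # e with acute accent
    0xEA: u"e",  # e with circumflex
    0xEB: u"e",  # e with diaeresis
    0xEF: u"i",  # i with diaeresis
    0xF4: u"o",  # o with circumflex
    0xF9: u"u",  # u with grave accent
    0xFB: u"u",  # u with circumflex
}

def simplify(text):
    """Replace apostrophes, hyphens and the listed accented letters."""
    # translate() looks each character up by code point in one pass.
    return text.translate(REPLACEMENTS)
```

With this table, the recognizer output "m'écoutes-tu" becomes "m ecoutes tu", the form the AIML patterns expect.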

##################

When I ran moz4r's code, it really ruined all the AIML files by replacing every "é" with "ÂÂÂÂÂÂÂÂ@".

And a few other things...

Luckily I had saved a copy before :)
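
Garbling of this kind is typical of text that was already UTF-8 being read as ISO-8859-1 and re-encoded, possibly several times. Whether that is what happened here is an assumption, but the effect is easy to reproduce. The helper name misconvert is hypothetical.

```python
def misconvert(text, times):
    """Simulate reading UTF-8 bytes as ISO-8859-1, `times` times over."""
    for _ in range(times):
        text = text.encode('utf-8').decode('iso-8859-1')
    return text

e_acute = u'\u00e9'             # the letter e with an acute accent
once = misconvert(e_acute, 1)   # two characters instead of one
twice = misconvert(e_acute, 2)  # four characters, and growing

# A single wrong pass can be undone by reversing the two steps.
repaired = once.encode('iso-8859-1').decode('utf-8')
```

Each extra pass lengthens the garbage, which is why a script like this should only be run once, on files that really are ISO-8859-1.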

The best conversion result was decoding from csISO20022JP to UTF-8, but it still doesn't solve my input problem.

http://string-functions.com/encodedecode.aspx

###################

@moz4r, this is one of the AIML files on GitHub:

https://github.com/MyRobotLab/pyrobotlab/blob/master/home/hairygael/peuxtu.aiml

 

So I'm still looking for a solution.

juerg's picture

Hi Gaël

Maybe my workaround for German umlauts helps:

The result of wksr (ear) is routed through an umlaut replacement function and then sent to marvin (ProgramAB):

#ear.addTextListener(marvin)
# route text over Umlaut replacing function
ear.addListener("publishText","python","replaceUmlaute")

def replaceUmlaute(data):
  data = data.replace(chr(228),"AE")
  data = data.replace(chr(246),"OE")
  data = data.replace(chr(252),"UE")
  print data
  marvin.getResponse(data)

wksr will understand and produce umlauts, e.g. glücklich (happy). replaceUmlaute turns this into glUEcklich, which matches my AIML pattern GLUECKLICH.

Careful: the chr numbers, e.g. chr(228), are not ASCII codes. I had to print out the codes for the umlauts coming from wksr. Use print ord(data[0]) to see the actual code wksr is using.
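
The ord() check can be extended to list every character outside the ASCII range at once. This is a small sketch, assuming the recognized text arrives as a Unicode string; the function name non_ascii_codes is hypothetical.

```python
def non_ascii_codes(text):
    """Return (character, code) pairs for every non-ASCII character."""
    return [(ch, ord(ch)) for ch in text if ord(ch) > 127]

# Example with the word from the post; the umlaut is written as an escape.
codes = non_ascii_codes(u'gl\u00fccklich')
```

The numbers it reports are the ones to pass to chr() in a replacement function.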

moz4r's picture

Worky script

OK guys, it's worky. I made a little script (sorry, mine are sometimes very big and uncommented, I'm working on improving that) and I'll try to explain it:
 
https://github.com/moz4r/aiml/raw/master/bots/BOTS-FRENCH/Inmoov_AI/peuxtu.zip
 
 
1/ If you need to intercept what the "ear" hears and redirect it to the AIML files for interpretation:

Don't waste time replacing all the characters "é à ...". You only need to replace one thing: the apostrophe ( ' ), or maybe the hyphen ( - ), if you don't want to rewrite the AIML files.
 
 
To intercept the ear data, you need to remove the direct ear connector:
 
 
ear.addTextListener(chatBot)
 
and add this instead :
 
htmlFilter.addListener("publishText", python.name, "talk")
def talk(data):
    # Speak only non-empty text coming out of the HTML filter.
    if data:
        mouth.speak(unicode(data,'utf-8'))
 
def onText(text):
 chatBot.getResponse(text.replace("'", " "))
 
python.subscribe(ear.getName(),"publishText")
 
 
2/ AIML format
You need to encode them in UTF-8 and add this at the top:
<?xml version="1.0" encoding="UTF-8"?>
 
3/ Replace gaelaimltest with your real folder. I didn't test the script because I'm at work today, but this is the basic idea I use. I'll launch it tonight. Have fun teaching French to your robot :)
Be careful with these ALICE files: sometimes the translation is bad and doesn't reflect what WebKitRecognition returns.
 
hairygael's picture

Thanks guys!

I'm leaning toward Anthony's option first because most of my script is already configured the same way as his example.

Although the line:

htmlFilter.addListener("publishText", python.name, "talk")

returns an error: expected 1 arg, got 3.

juerg's picture

Unfortunately, I personally find the methods for publishing and listening for messages rather confusing in MRL.

I am not able to follow the path of the messages in your example. For instance, is htmlFilter sending data to "talk"? How is onText() connected to the rest? And what is python.subscribe() good for?

Lots of inside knowledge required, I assume.

moz4r's picture

test ok

OK, it works. It's better when it isn't blind coding :)

I just start the ProgramAB session at the beginning:

https://github.com/moz4r/aiml/raw/master/bots/BOTS-FRENCH/Inmoov_AI/test...

@juerg I agree with you, sometimes it's hard to follow the path! But there are so many functions, and it's so powerful. Do you want me to publish a graphical description of these 2 or 3 functions? I will do that when I have a little time.

hairygael's picture

So finally I used the method of juerg and got it WORKY!!

#ear.addTextListener(inmoovFrench)
# route text over to replacing function
ear.addListener("publishText","python","replacer")

#We intercept what the robot is listening to change some values
#here we replace ' by space because AIML doesn't like '
def replacer(data):
    data = data.replace("'", " ")
    data = data.replace("-", " ")
    data = data.replace(chr(232),"E")#è
    data = data.replace(chr(233),"E")#é
    data = data.replace(chr(234),"E")#ê
    data = data.replace(chr(235),"E")#ë
    data = data.replace(chr(249),"U")#ù
    data = data.replace(chr(251),"U")#û
    data = data.replace(chr(224),"A")#à
    data = data.replace(chr(226),"A")#â
    data = data.replace(chr(244),"O")#ô
    data = data.replace(chr(239),"I")#ï
    print data
    #print ord(data[0])
    inmoovFrench.getResponse(data)

    #German replacer
    #data = data.replace(chr(228),"AE")
    #data = data.replace(chr(246),"OE")
    #data = data.replace(chr(252),"UE")

 

A few minutes after I got it working, kwatters proposed adding an option to the WebKitRecognition service that strips all accents automatically. You enable it by adding this line to your script:

wksr.setStripAccents(True)
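
For comparison, accents can also be stripped on the Python side without a per-letter table. The sketch below is not how setStripAccents is implemented; it is an alternative built on the standard unicodedata module, assuming that module is available in the interpreter and that the text is a Unicode string. The function name normalize_for_aiml is hypothetical.

```python
import unicodedata

def normalize_for_aiml(text):
    """Replace apostrophes and hyphens with spaces and drop accents."""
    text = text.replace(u"'", u" ").replace(u"-", u" ")
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep everything except the combining marks (the accents).
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This covers letters that were not listed in the replacer, but it drops the umlaut rather than expanding it, so the German AE/OE/UE convention would still need its own replacements.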

 

moz4r's picture

OK, I don't need to do that; it's strange that you have problems with accents! The most important thing is that it's worky! Have a great evening, guys.

 

hairygael's picture

You don't get the error with the args?

bartcam's picture

There is no need to replace all the special characters.

Try this in AIML; mine works with Polish ĄĘĆŚŃŹŻŁ etc.:


<?xml version="1.0" encoding="Windows-1250"?>

 
<aiml>
....
..
.
</aiml>

If you use special characters in MyRobotLab Python, try:

mouth.speak(u'ĄĘĆŚŃ')

works!!

I was struggling with Polish and had to get it working somehow.

Simple is better ;)

juerg's picture

The problem in my case was that wksr understands and returns the lowercase letter "ü".

This does not match anything in AIML, as the patterns are uppercase.

bartcam's picture

It works well for me; WKSR is not returning U for me.

Post some sample code here; maybe I will be able to understand your problem.