converter for AIML encoded "ISO-8859-1" to "UTF-8"

So to make it short.

I found a free French Open Source chat bot which contains a lot of possibilities based on ALICE2.0.

The problem is that all the AIML files are encoded in "ISO-8859-1"

Therefore when I use French with WebKitRecognition it doesn't work correctly and takes only the default answers.

#############################

Here is what the AIML files look like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<aiml version="1.0">

<category><pattern>PEUT ETRE *</pattern><template>Tu semble incertain. <sr/> </template></category>
<category><pattern>PEUT SEULEMENT *</pattern><template><srai>peut <star/></srai></template></category>
<category><pattern>PEUT ÊTRE *</pattern><template>Tu semble incertain. <sr/> </template></category>
<category><pattern>PEUT</pattern><template>Peut que? </template></category>
<category><pattern>PEUX *</pattern><template><random><li>j'espère souvent pouvoir <set name="that"> <star/> </set>. </li><li>Un livre peut il n'avoir aucun titre? </li><li>Peux quoi? </li></random></template></category>

....

##############################

So I have set up a Python file, called at startup, that tries to clean up some of the input. The first line works to some extent, but the second line apparently doesn't:

def onText(text):
    #print text.replace("'", " ")
    inmoovFrench.getResponse(text.replace("'", " "))
    inmoovFrench.getResponse(text.replace("-", " "))
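
A possible reason the second line seems to have no effect: each getResponse call sends its own request, so the bot would receive the text once with only the apostrophes replaced and once with only the hyphens replaced. A minimal sketch of chaining both replacements into one string before a single call follows. The helper name clean_text is hypothetical, and inmoovFrench is assumed to be the ProgramAB service created earlier in the script:

```python
def clean_text(text):
    # Apply both replacements to the same string, one after the other.
    return text.replace("'", " ").replace("-", " ")

def onText(text):
    # A single call, so only one request is sent, using the fully cleaned text.
    # inmoovFrench is assumed to exist already (the ProgramAB service from the script).
    inmoovFrench.getResponse(clean_text(text))
```

With this, "m'ecoutes-tu" is passed to the bot as "m ecoutes tu" in one request.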

##############################

I'm wondering what method could be used to convert what comes from Chrome.

I tried replacing "ISO-8859-1" by "UTF-8" on the top of my AIML but it gives me some error.

If you guys have an idea...

I found this on a post of

I found this in a post by Anthony, though I still need to figure out which parts I actually need:

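import io
import glob, os

# Remember the starting directory so it can be restored afterwards
oridir = os.getcwd().replace("\\", "/")
# Folder holding the AIML files; adjust this to your own bot
dir = oridir + "/ProgramAB/bots/rachel/aiml"
os.chdir(dir)
for file in glob.glob("*.aiml"):
    # Read the file, assuming it is encoded in ISO-8859-1
    with io.open(dir + '/' + file, 'r', encoding='iso-8859-1') as f:
        text = f.read()
    # Overwrite the same file with the text re-encoded as UTF-8
    with io.open(dir + '/' + file, 'wb') as f:
        f.write(text.encode('utf8'))
    print(file + ' converted')
os.chdir(oridir)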

change encoding

That looks like a script that reads all the files in a directory, assumes they are in ISO-8859-1 encoding, and re-encodes them as UTF-8, overwriting each file in place and printing the filename followed by "converted".

So I think there is still one change to make after that: the XML/AIML needs to have its encoding setting updated at the top to say:

<?xml version="1.0" encoding="utf-8" ?>

I think you can probably just run that code in the python service in mrl.. the only thing you need to update is the "dir" to make sure it's pointing at the right directory where your aiml files are.
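
That last step could also be scripted. Here is a minimal sketch, assuming the declaration is on the first line and uses double quotes as in the sample above; the function name fix_declaration is hypothetical and is not part of Anthony's script:

```python
import re

def fix_declaration(xml_text):
    # Rewrite only the encoding value inside the leading <?xml ... ?> declaration.
    # count=1 keeps the rest of the file untouched.
    return re.sub(r'(<\?xml[^>]*encoding=")[^"]*(")',
                  r'\g<1>UTF-8\g<2>', xml_text, count=1)

sample = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<aiml version="1.0">'
fixed = fix_declaration(sample)
```

In the conversion loop this would be applied to text right after f.read() and before the file is written back.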

Hi gael I will test your code

Hi Gael, I will test your Python code and the sample AIML file to see if they are compatible with ProgramAB. Do you have them on GitHub? You can send a zip too, no problem.

(I don't remember this piece of code at all :) happy if it works)

@++

The issue I have is at the

The issue I have is at the input text level. For example, WebKitRecognition is hearing:

"m'écoutes-tu" to make it work with the AIML I have, I need to input:

"m ecoutes tu"

Therefore I tried to create a chain of replacements:

def onText(text):
    inmoovFrench.getResponse(text.replace("-", " ").replace("'", " ").replace("é", "e").replace("è", "e").replace("ê", "e").replace("à", "a").replace("û", "u").replace("ï", "i"))

This kind of works, but the robot repeats its own answer twice and not all the replacements take effect: only ("-", " ") and ("'", " ") are working.

##################

When I ran moz4r's code, it really ruined all the AIML files by replacing every "é" with "ÂÂÂÂÂÂÂÂ@".

And more other things...

Luckily I had saved a copy before :)
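
One possible explanation for that kind of damage (an assumption, the thread does not confirm it) is that the conversion ran on files that were already UTF-8, or ran more than once: every extra pass reads the UTF-8 bytes as ISO-8859-1 and turns each accented letter into a longer run of junk. The sketch below shows the effect and a simple guard that first checks whether the bytes are already valid UTF-8. The function name to_utf8 is hypothetical:

```python
def to_utf8(raw_bytes):
    """Return UTF-8 bytes, leaving data that already is valid UTF-8 untouched."""
    try:
        raw_bytes.decode('utf-8')
        return raw_bytes                      # already UTF-8, nothing to do
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume ISO-8859-1 (an assumption about the input)
        return raw_bytes.decode('iso-8859-1').encode('utf-8')

latin1 = u'\u00e9'.encode('iso-8859-1')       # "é" as stored in the original files
once = to_utf8(latin1)                        # correct UTF-8 for "é"
twice = to_utf8(once)                         # guarded second pass changes nothing
# What a blind second conversion does instead: "é" grows into two junk characters
naive_twice = once.decode('iso-8859-1').encode('utf-8')
```

The guard is not perfect, since a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so keeping a backup remains the safer habit.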

The best conversion result was decoding from csISO2022JP to UTF-8, but it still doesn't solve my input problem.

http://string-functions.com/encodedecode.aspx

###################

@moz4r, this is one of the AIML on github:

https://github.com/MyRobotLab/pyrobotlab/blob/master/home/hairygael/peuxtu.aiml

So I'm still looking for a solution...

Hi Gaël Maybe my workaround

Hi Gaël

Maybe my workaround for German umlauts helps:

The result of wksr (ear) is routed through an umlaut replacement function and then sent to marvin (ProgramAB).

#ear.addTextListener(marvin)
# route text over Umlaut replacing function
ear.addListener("publishText","python","replaceUmlaute")

def replaceUmlaute(data):
  data = data.replace(chr(228),"AE")
  data = data.replace(chr(246),"OE")
  data = data.replace(chr(252),"UE")
  print data
  marvin.getResponse(data)

wksr will understand and produce umlauts, e.g. glücklich (happy). replaceUmlaute turns that into glUEcklich, which matches my AIML pattern GLUECKLICH.

Careful: the chr numbers, e.g. chr(228), are not ASCII codes (ASCII has no ä); I had to print out the codes of the umlauts coming from wksr. Use print ord(data[0]) to see the actual code wksr is using.
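
To check more than the first character, the same idea can be applied to the whole string. A minimal sketch, where non_ascii_codes is a hypothetical helper name:

```python
def non_ascii_codes(text):
    # Pair every non-ASCII character with its numeric code,
    # so the right chr() values for the replacer can be read off directly.
    return [(ch, ord(ch)) for ch in text if ord(ch) > 127]

codes = non_ascii_codes(u'gl\u00fccklich')    # the word "glücklich"
```

Calling it on the text received from wksr and printing the result lists each special character next to the number to pass to chr().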

Worky script

OK guys, it's worky. I made a little script (sorry, mine are sometimes very big and uncommented, I'm working on improving that) and I'll try to explain it:

https://github.com/moz4r/aiml/raw/master/bots/BOTS-FRENCH/Inmoov_AI/peu…

1/ If you need to intercept what the "ear" hears and redirect it to the AIML files for interpretation:

Don't waste time replacing all the characters "é à ...". You only need to replace one thing: the apostrophe ( ' ), or maybe the hyphen ( - ) if you don't want to rewrite the AIML files.

To intercept the ear data, you need to remove the direct ear connector:

ear.addTextListener(chatBot)

and add this instead :

htmlFilter.addListener("publishText", python.name, "talk")

def talk(data):
    if data:
        if data!="":
            mouth.speak(unicode(data,'utf-8'))

def onText(text):
    chatBot.getResponse(text.replace("'", " "))

python.subscribe(ear.getName(),"publishText")

2/ AIML format

You need to encode them in UTF-8 and add this at the top:

<?xml version="1.0" encoding="UTF-8"?>

3/ Replace gaelaimltest with your real folder. I didn't test the script because I'm at work today, but this is the basic idea I use. I'll launch it tonight; have fun teaching French to your robot :)

Be careful with these ALICE files: sometimes the translation is bad and doesn't match what WebKit recognition returns.

Thanks guys! I'm leaning

Thanks guys!

I'm leaning towards Anthony's option first because I already have most of my script configured the same way as his example.

Although the line:

htmlFilter.addListener("publishText", python.name, "talk")

returns an error, expected 1 arg, got 3.

Unfortunately I personally

Unfortunately, I personally find the methods for publishing and listening for messages in MRL rather confusing.

I am not able to follow the path of the messages from your example - e.g. is htmlFilter sending data to "talk"? And how is onText() connected to the rest? And what is the python.subscribe() good for?

Lots of inside knowledge required I assume.
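
As a rough mental model only, here is a toy publish/listen mechanism in plain Python. It is not MyRobotLab code: the classes ToyService and ToyPython are hypothetical, and the idea that subscribing to "publishText" ends up calling a function named onText is my reading of the example above, not something confirmed in this thread.

```python
class ToyService(object):
    """Toy stand-in for a service that publishes messages (NOT the MRL API)."""
    def __init__(self, name):
        self.name = name
        self.listeners = {}   # topic -> list of (target object, method name)

    def addListener(self, topic, target, method):
        # Remember who wants to be called when this topic is published
        self.listeners.setdefault(topic, []).append((target, method))

    def publish(self, topic, data):
        # Deliver the data to every registered callback for this topic
        for target, method in self.listeners.get(topic, []):
            getattr(target, method)(data)

class ToyPython(object):
    """Toy stand-in for the script side that receives the text."""
    def __init__(self):
        self.received = []

    def onText(self, text):
        # Same cleanup as in the thread: apostrophes become spaces
        self.received.append(text.replace("'", " "))

ear = ToyService("ear")
python = ToyPython()
ear.addListener("publishText", python, "onText")
ear.publish("publishText", "m'ecoutes")
```

In this model, removing the direct ear-to-chatbot connection and registering a callback instead is what lets the script clean the text before the bot sees it.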

test ok

OK, it works. It's better when it isn't blind coding :)

I just start the ProgramAB session at the beginning:

https://github.com/moz4r/aiml/raw/master/bots/BOTS-FRENCH/Inmoov_AI/tes…

@juerg I agree with you, sometimes it's hard to follow the path! But there are so many functions, and they're so powerful. Do you want me to publish a graphical description of these two or three functions? I will do that when I have a little time.

So finally I used the method

So finally I used the method of juerg and got it WORKY!!

#ear.addTextListener(inmoovFrench)
# route text over to replacing function
ear.addListener("publishText","python","replacer")

#We intercept what the robot is listening to change some values
#here we replace ' by space because AIML doesn't like '
def replacer(data):
    data = data.replace("'", " ")
    data = data.replace("-", " ")
    data = data.replace(chr(232),"E")#è
    data = data.replace(chr(233),"E")#é
    data = data.replace(chr(234),"E")#ê
    data = data.replace(chr(235),"E")#ë
    data = data.replace(chr(249),"U")#ù
    data = data.replace(chr(251),"U")#û
    data = data.replace(chr(224),"A")#à
    data = data.replace(chr(226),"A")#â
    data = data.replace(chr(244),"O")#ô (244 is lowercase ô; 212 is uppercase Ô)
    data = data.replace(chr(239),"I")#ï
    print data
    #print ord(data[0])
    inmoovFrench.getResponse(data)

    #German replacer
    #data = data.replace(chr(228),"AE")
    #data = data.replace(chr(246),"OE")
    #data = data.replace(chr(252),"UE")

A few minutes after I got it working, Kwatters proposed adding an option to the WebKitRecognition service that strips all accents. You enable it by adding this line to your script:

wksr.setStripAccents(True)
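
For anyone curious what stripping accents amounts to, here is a minimal sketch using Python's standard unicodedata module. It only illustrates the general technique and is not the implementation behind setStripAccents; the function name strip_accents is hypothetical:

```python
import unicodedata

def strip_accents(text):
    # Split each accented letter into its base letter plus combining marks,
    # then drop the marks and keep the base letters.
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

french = strip_accents(u"m'\u00e9coutes-tu")   # "m'écoutes-tu"
german = strip_accents(u'gl\u00fccklich')      # "glücklich"
```

Note that this approach turns ü into a plain u rather than UE, so AIML patterns written with AE/OE/UE would still need an explicit replacer like juerg's.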

<?xml version="1.0" encoding="Windows-1250"?>

<aiml>

....

</aiml>

If you use special characters in MyRobotLab Python, try:

mouth.speak(u'ĄĘĆŚŃ')

works!!

I was struggling in Polish and I had to somehow do it.

Simple is better ;)

problem in my case was that

The problem in my case was that wksr understands and returns the lowercase letter "ü".

This does not match anything in AIML, as the patterns are uppercase.

For me works well

It works well for me; WKSR is not returning U for me.

Post some sample code here, maybe I will be able to understand your problem.