Wow .. this is really "challenging" OCR

JNA/JNI is always a pain. It comes into play when native libraries are used for either performance or legacy reasons, and it makes Java programs much more complicated, fragile and difficult to maintain. But after kicking Tess4J around a bit, I wanted to see how well it worked, so I spent a few extra cycles kicking it and finally got results (that's the good news). The bad news is it does some phenomenally crappy OCR.

I used this image, which is a SCREEN SHOT mind you .. about as perfect a picture of text as there is. No fuzziness, no dusty street sign, this is as black and white as you could possibly get.

With the above screen shot I got this as recognized text.

 

 /(ekSI/ 1-)
noun
I a book a omenunlmn nrprmled walk regarded ntevms om cormm ramenhan Ils physical
form,
'2 um um explores paln and gner
synonym hook me wmlen wofk pinned walk, document
'2 um um explores paln and gner
2 the main bodyafa book H mm: pematwmngas dlsmlcl «mm mm maleual such 25 notes,
appeudmes, and .uusuaum,
'Ihe plclums are clear and relate well m the ten'
synonym wmds, warding, willing, Mme
m
I send a text messagela
'l mmgm n was fzmasuc am he umk Ihe [mthle m «m me'
 
Building Haar models to recognize characters in "pure" OpenCV seems like it might get better results :P
 
All I got to say is, "ifkd dsfuiej felk3 fslkjkd erxoiuu !!!"
 
Ok, so of course I spoke too soon... I'm always appreciative of the open source community for providing so much work and effort in a complex area.  
 
I found it frustrating at first because of (of course) my lack of knowledge regarding which images are suitable, and what things can "help" with OCR. I have a lot to learn.
 
Anyway, I downloaded the latest Tess4J 3.4.0 from SourceForge and ran ant.
It did a nice job of building it all and auto-magically testing the project with the supplied test data.
Here is one of the test images.
It is higher res than the screen shot above, which is 1024 x 768.
 
The results in text are:

The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose. as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,.schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra i] cane pigro. El zorro
marrón rápido salta sobre el perro
perezoso. A raposa marrom rápida
salta sobre o cão preguicoso.
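
The test image is higher res than my 1024 x 768 screen shot, so one cheap experiment is to enlarge a screen shot before handing it to the OCR engine. Here is a minimal sketch of that using only the standard Java imaging classes. The class name `OcrUpscale`, the method `upscale` and the factor of 3 are all made up for illustration, and I have not verified that this actually improves the results on the screen shot above.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tess4J: enlarges an image before OCR.
class OcrUpscale {

    // Returns a copy of src scaled by an integer factor using bicubic interpolation.
    static BufferedImage upscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation keeps glyph edges smoother than nearest-neighbour.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Blank stand-in for a 1024 x 768 screen shot; real code would load one with ImageIO.read.
        BufferedImage shot = new BufferedImage(1024, 768, BufferedImage.TYPE_INT_RGB);
        BufferedImage big = upscale(shot, 3);
        System.out.println(big.getWidth() + " x " + big.getHeight()); // prints 3072 x 2304
    }
}
```

The enlarged `BufferedImage` could then be passed to Tess4J, whose `ITesseract` interface has a `doOCR(BufferedImage)` overload.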
 

Very interesting article on Stack Overflow regarding details of OpenCV pre-processing:

https://stackoverflow.com/questions/23506105/extracting-text-opencv 
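
That thread is about OpenCV. As a rough plain-Java stand-in for one generic pre-processing step (thresholding an image to pure black and white), here is a minimal sketch. It is not the method from that thread and it does not use OpenCV. The names `OcrBinarize` and `binarize` and the threshold of 128 are assumptions made up for illustration, and a fixed threshold is the crudest possible choice.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: plain-Java threshold step, not OpenCV and not Tess4J.
class OcrBinarize {

    // Converts src to pure black and white: pixels whose luminance is at or
    // above threshold (0-255) become white, everything else becomes black.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the Rec. 601 luma weights.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Opaque white or opaque black, written as ARGB.
                dst.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Two-pixel demo image: one light gray pixel, one dark gray pixel.
        BufferedImage demo = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        demo.setRGB(0, 0, 0xF0F0F0);
        demo.setRGB(1, 0, 0x202020);
        BufferedImage bw = binarize(demo, 128);
        System.out.println(Integer.toHexString(bw.getRGB(0, 0) & 0xFFFFFF)); // prints ffffff
        System.out.println(Integer.toHexString(bw.getRGB(1, 0) & 0xFFFFFF)); // prints 0
    }
}
```

A screen shot with dark text on a light background should survive a mid-range threshold more or less intact; photos with uneven lighting would probably need something smarter, which is where the OpenCV tools come in.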

 


moz4r:

Last time I tested, I had some good results with the latest build: net.sourceforge.tess4j\3.4.0

GroG:

Hmmm perhaps there is still hope ..
I saw this - https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

needs more playing with I suppose

kyle.clinton:

How Hi-Res was the screen shot?

I had similar results initially when I was doing screen captures from my LiveCam HD USB camera through the "Take a Picture" functionality of OpenCV. I got much better results when I fed it super-high-resolution images from my Samsung Galaxy S7. It may be "Good Enough" or better than initially believed. I am not ready to give up on it totally. If you can pass on how you were testing, I may try to take it a bit further...