WordNet, ProgramAB, and Solr Graph Query

Ok, there's a lot of stuff going on here, so I'll do my best to break it down and provide some background.

For my day job, I have been working on a lot of large-scale search engine problems. Most of this work relates to indexing large amounts of data so that people can search through it very quickly. Much of it has been done with a project called Apache Solr. Solr is a blazingly fast open source search engine based on the open source library Lucene. A while back, I contributed a new search engine operator to this project called the "GraphQuery".

The graph query allows recursive querying of documents in a search engine index, which makes things like walking up and down folder structures possible. At the end of the day, it provides graph traversal for the search index. Ok, so what does that mean? It's always easiest with an example. This is where WordNet comes in.

I've been looking for a good test data set to show off the power of this new GraphQuery. So, I began looking for datasets that were hierarchical in nature. That means the data can be represented as a bunch of "nodes" and "edges". A node can have edges. Edges point to other nodes. These are the basic building blocks for any graph data structure.

That led me to the "WordNet" project from Princeton University. They have been amassing a data set that describes the words of the English language. In addition to being a dictionary, it includes links to synonyms for each of the words, as well as the part of speech of the word, depending on the context or "sense" in which it is used. Some words are nouns most of the time, but sometimes they can be adjectives or even verbs, depending on how they are used in context.

In addition to having synonyms, it illustrates the hierarchy of words and how they are related in a more general or more specific sense. For example, take the word "dog". In the more general case, a dog is also an "animal". This relationship is referred to as a "hypernym". In the more specific case, a poodle is a type of dog. This relationship is referred to as a "hyponym".

The WordNet data set provides information about all of the hypernyms and hyponyms for the words in the English language. There are many layers to this hierarchy.

Example: a dog is a type of canine, which is a type of carnivore, which is a type of placental mammal, which is a type of mammal, which is a type of vertebrate, which is a type of chordate, which is a type of animal, which is a type of living organism...

I used some code to put the WordNet data into a Solr index. Each document in the index represents one sense of a word from the WordNet dictionary. The documents in the index have the ids that point to their hypernyms and hyponyms respectively.
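
To make the document shape a bit more concrete, here is a minimal sketch in Python of what a few index entries might look like. The field names sense_lemma, synset_id and hypernym_id are the ones used in the queries in this post. Everything else is an assumption for illustration: the id field and all of the id values are invented (they are not real WordNet identifiers), and this is not the actual loader code.

```python
import json

# Illustrative documents only. The "id" field and every id value below are
# invented for this sketch; they are not real WordNet identifiers.
docs = [
    {"id": "dog-1", "sense_lemma": "dog",
     "synset_id": "s_dog", "hypernym_id": ["s_canine"]},
    {"id": "canine-1", "sense_lemma": "canine",
     "synset_id": "s_canine", "hypernym_id": ["s_carnivore"]},
    {"id": "carnivore-1", "sense_lemma": "carnivore",
     "synset_id": "s_carnivore", "hypernym_id": []},
]

# Solr's JSON update handler accepts a JSON array of documents,
# so a payload like this could be posted to a core's /update endpoint.
payload = json.dumps(docs, indent=2)
print(payload)
```

Each document carries its own synset_id plus the synset ids of its hypernyms, which is all the graph query needs in order to hop from one document to the next.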

Ok, if you're still with me, then thanks. Take a leap of faith that "sense_lemma" means "word". It will make this easier to talk about.

I can do a search against the Solr index for:

sense_lemma:dog

This search returns all entries in the WordNet dictionary that match the word "dog".

I can use the graph query in Solr to recursively return not just the word "dog", but all of the hypernyms of dog, with this odd-looking query:

{!graph from="synset_id" to="hypernym_id"}sense_lemma:dog 

The above query finds all the documents that match the query for "dog", and then it recursively finds all the linked documents based on the hypernym_id field. This returns the full hierarchy of words all the way up to "living thing" (for the example given above).
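
If it helps to see the idea without a search engine in the way, the traversal described above can be sketched in plain Python over a toy in-memory "index". This is only a conceptual model of the description in this post, not Solr's implementation, and it does not try to model the exact meaning of the from and to parameters. The ids are invented, and the sketch assumes one document per synset, which real WordNet data does not guarantee.

```python
from collections import deque

# Toy index with invented ids. Real WordNet data is far larger and can have
# several lemmas per synset; this sketch assumes one document per synset.
INDEX = [
    {"sense_lemma": "dog", "synset_id": "s_dog", "hypernym_id": ["s_canine"]},
    {"sense_lemma": "canine", "synset_id": "s_canine", "hypernym_id": ["s_carnivore"]},
    {"sense_lemma": "carnivore", "synset_id": "s_carnivore", "hypernym_id": ["s_mammal"]},
    {"sense_lemma": "mammal", "synset_id": "s_mammal", "hypernym_id": []},
    {"sense_lemma": "cat", "synset_id": "s_cat", "hypernym_id": ["s_feline"]},
    {"sense_lemma": "feline", "synset_id": "s_feline", "hypernym_id": ["s_carnivore"]},
]


def hypernym_closure(index, lemma):
    """Return the lemmas of the start documents plus everything reachable
    by repeatedly following hypernym_id to the matching synset_id."""
    by_synset = {doc["synset_id"]: doc for doc in index}
    # Root set: the documents that match sense_lemma:<lemma>.
    queue = deque(doc for doc in index if doc["sense_lemma"] == lemma)
    seen = set()
    result = []
    while queue:
        doc = queue.popleft()
        if doc["synset_id"] in seen:
            continue  # already visited; also protects against cycles
        seen.add(doc["synset_id"])
        result.append(doc["sense_lemma"])
        for parent_id in doc["hypernym_id"]:
            parent = by_synset.get(parent_id)
            if parent is not None:
                queue.append(parent)
    return result


print(hypernym_closure(INDEX, "dog"))
# ['dog', 'canine', 'carnivore', 'mammal']
```

Note that "cat" and "feline" never show up in the result for "dog", because nothing on the path upward from "dog" points at them.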

If we want to know if a "dog" is a type of "living thing", we only need to do an "AND" query between "living thing" and the hypernym graph traversal for "dog". If "living thing" is a hypernym of "dog", the result set will contain 1 document (for sense_lemma:"living thing").

That query looks like this:

+{!graph from="synset_id" to="hypernym_id"}sense_lemma:dog +sense_lemma:"living thing"
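
Building that string by hand gets tedious, so here is a rough Python sketch of a helper that assembles it and URL-encodes it as request parameters. The function names are made up for this example, and the host, port and core name in the URL are assumptions (nothing in this post specifies them). The helper only quotes multi-word terms; it does not escape other special query characters.

```python
from urllib.parse import urlencode


def term(word):
    """Quote a term if it contains whitespace. No other escaping is done."""
    return '"' + word + '"' if " " in word else word


def build_is_a_query(specific, general):
    """Build the 'is a <specific> a <general>' query shown in this post."""
    graph = '{!graph from="synset_id" to="hypernym_id"}sense_lemma:' + term(specific)
    return "+" + graph + " +sense_lemma:" + term(general)


q = build_is_a_query("dog", "living thing")
print(q)

# rows=0 because only the hit count matters here, not the documents themselves.
params = urlencode({"q": q, "rows": 0, "wt": "json"})

# Hypothetical host and core name, for illustration only.
url = "http://localhost:8983/solr/wordnet/select?" + params
print(url)
```

No request is sent here; the sketch stops at the URL so that it runs without a Solr server.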

Ok... so, now where does ProgramAB fit in? It's clear that the syntax of these queries is very complicated, even for this simple example. This is where ProgramAB helps us. My example from this video uses AIML and ProgramAB with its OOB tags to match a pattern of:

"IS A * A *"   

The values that match the first and second stars are passed into a Python method that programmatically builds up the appropriate Solr query string and runs a search against a "Solr" service running in MRL. Based on the number of hits returned from the search engine, the Python example then tells Acapela speech to read the definition of the word and assert an answer.
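
To show what "the values that match the stars" means, here is a small Python stand-in that uses a regular expression. To be clear, this is not AIML and not how ProgramAB matches patterns internally. It only mimics the "IS A * A *" pattern closely enough to pull out the two words, and the function name is made up for this example.

```python
import re

# Regex stand-in for the AIML pattern "IS A * A *".
# The first star is matched lazily so that it stops at the first " A ".
IS_A_PATTERN = re.compile(r"^IS A (.+?) A (.+)$")


def match_is_a(sentence):
    """Return (first_star, second_star) in lower case, or None if no match."""
    # Rough normalization: drop punctuation, collapse spaces, upper-case.
    cleaned = re.sub(r"[^A-Za-z ]", " ", sentence)
    normalized = " ".join(cleaned.upper().split())
    m = IS_A_PATTERN.match(normalized)
    if m is None:
        return None
    return m.group(1).lower(), m.group(2).lower()


print(match_is_a("Is a puma a cat?"))
# ('puma', 'cat')
print(match_is_a("What time is it?"))
# None
```

The two values that come back are exactly what the query-building method needs: the first is the word whose hypernyms get traversed, and the second is the word we AND against.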

For example, "Is a puma a cat?" could be a question.

This would search for puma and all its hypernyms, and that would be ANDed with a query for sense_lemma:cat. It would return that yes, cat is a hypernym of puma: the hit count returned from the search engine will be 1. If that's the case, the answer is "A puma is a cat". If the search engine returned 0 hits, it means that "A puma is not a cat"...
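
The last step, turning a hit count into an answer, can be sketched like this. Solr's JSON responses report the hit count as response.numFound. The two sample responses below are hand-written for illustration rather than captured from a real server, and the function name is made up.

```python
import json


def answer_from_response(raw_json, specific, general):
    """Turn a Solr JSON response into a yes/no sentence based on the hit count."""
    hits = json.loads(raw_json)["response"]["numFound"]
    if hits > 0:
        return "A {} is a {}".format(specific, general)
    return "A {} is not a {}".format(specific, general)


# Hand-written sample responses, not output from a real Solr instance.
sample_yes = '{"response": {"numFound": 1, "start": 0, "docs": []}}'
sample_no = '{"response": {"numFound": 0, "start": 0, "docs": []}}'

print(answer_from_response(sample_yes, "puma", "cat"))
# A puma is a cat
print(answer_from_response(sample_no, "puma", "dog"))
# A puma is not a dog
```

The returned sentence is what would then be handed to the speech service to say out loud.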

This is, I believe, an example of a computer program that is performing deductive reasoning. The program is using a "Knowledge Graph" to answer questions. Those questions require some of the basic forms of reasoning, such as a syllogism:

An A is a B

A B is a C 

Is an A a C? 

We are entering an age of Cognitive Computing and I hope that the above example using MyRobotLab can serve as a sort of primer into the topic and share with the community a bit of my vision of how to add this sort of functionality to MyRobotLab.

So, where do we go next? I've created a document processing pipeline and connectors that can crawl various data sources such as RSS feeds, file systems, CSV files, etc. These systems can serve as a connection to the "unstructured data" that is being created on the internet at any given time. Using OpenNLP, we can start adding part of speech (POS) tagging to content, as well as entity extraction. With libraries like this, we can start automatically creating these "knowledge graphs" based on news that is being published in real time, on sensor data being collected, and even on output from OpenCV filters...

I hope one of the next videos will be me asking ProgramAB "What is the news about today?" and ProgramAB will be able to convert that into a query for recent news and tell us what topics and concepts are most prominent in the news. Then I could ask ProgramAB to start reading the headlines...

I hope this is a glimpse into my vision for where MRL can serve as the AI platform for the modern open source smart home!

Here are some links for more info:

http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&r=1&s=panther&i=2&h=0100#c

http://lucene.apache.org/solr/

https://opennlp.apache.org/

http://www.alicebot.org/

https://issues.apache.org/jira/browse/SOLR-7543


Is there any information available on how you loaded Solr with the Wordnet graph? I am learning Solr and this seems like a very good sample project to experiment with.

thx,

elrod