Posted to users@opennlp.apache.org by Daniel Franc <df...@gmail.com> on 2013/03/09 00:05:49 UTC

Dictionary lookup novice question

Hello friends,

I am a novice at both OpenNLP and Java and have been fumbling
around to put together a working version of the software, with some success
thanks to the documentation provided!  Part of my eventual goal is to
look up terms in a pre-defined dictionary, and I've been able to use
the dictionary creator to build a basic dictionary to look up from, as here:

    dictionary.serialize(new FileOutputStream(
        "/Applications/apache-opennlp-1.5.2-incubating/dictionarynames.txt"));

My particular questions are:

1. Can someone help me with loading this dictionary after it has been
created?

2. Is there a straightforward way to implement a basic lookup mechanism for
tokenized text?

Thanks for your help!
-Dan

Re: Dictionary lookup novice question

Posted by James Kosin <ja...@gmail.com>.
Hi Dan,

The dictionary is meant to be added to the name recognizer, to help it
find names it would otherwise miss or to enforce recognition of
particular names.  I'm not sure this is quite what you want to do.

There is a lesser-used DictionaryNameFinder that may be better suited to
what you want to do... I think.  But the current version in 1.5.2 has a
few bugs.  A pre-release of our next release, which should help with
those problems, is available here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/

The dictionary format is fairly straightforward, though not well
documented.  There are also several CLI tools to convert files to the
dictionary format.

I guess I'll try to improve on the documentation here... :-)

<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="true">
  <entry>
    <token>Patrick</token>
  </entry>
</dictionary>
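
To make the format a bit more concrete, here is a minimal sketch that reads a file of this shape using only the JDK's XML parser. It is not OpenNLP's loader; in OpenNLP itself a serialized dictionary is normally read back through the Dictionary constructor that takes an InputStream. The class name DictionaryXmlSketch, the method readEntries, and the two-token "New York" entry are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of OpenNLP: reads <entry>/<token> elements
// from a dictionary file with the JDK's DOM parser.
final class DictionaryXmlSketch {

    /** Returns one list of tokens per <entry>, in document order. */
    static List<List<String>> readEntries(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            NodeList entries = doc.getElementsByTagName("entry");
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList tokens = entry.getElementsByTagName("token");
                List<String> entryTokens = new ArrayList<>();
                for (int j = 0; j < tokens.getLength(); j++) {
                    entryTokens.add(tokens.item(j).getTextContent());
                }
                result.add(entryTokens);
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("could not parse dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        // The second entry (two tokens) is an assumption added for illustration.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<dictionary case_sensitive=\"true\">"
                + "<entry><token>Patrick</token></entry>"
                + "<entry><token>New</token><token>York</token></entry>"
                + "</dictionary>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readEntries(in)); // prints [[Patrick], [New, York]]
    }
}
```

An entry with several token elements represents one multi-token term, which is what makes longest-match lookup meaningful.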

Each entry in the dictionary holds the tokens for one term.  When the
DictionaryNameFinder is called, it attempts to find the longest
matching token sequence from the dictionary in the document.

This sort of dictionary is best for keywords, some names, and special
words.  For example, you could populate one with the keywords of C/C++
and use it to parse and tag a program file with all of its keywords.
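
The longest-match idea can be sketched in plain Java as below. This is only an illustration of the technique, not the DictionaryNameFinder source; in OpenNLP you would call find(String[] tokens) on a DictionaryNameFinder and get Span objects back. The class LongestMatchSketch, its find method, and the sample entries are hypothetical, and the matching is case-sensitive throughout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of greedy longest-match lookup over tokens.
final class LongestMatchSketch {

    /**
     * Scans the tokens left to right. At each position it tries the longest
     * candidate first and records {start, end} (end exclusive) for a hit.
     */
    static List<int[]> find(String[] tokens, Set<List<String>> entries) {
        int maxLen = 0;
        for (List<String> entry : entries) {
            maxLen = Math.max(maxLen, entry.size());
        }
        List<String> tokenList = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            int longest = Math.min(maxLen, tokens.length - i);
            for (int len = longest; len >= 1; len--) {
                if (entries.contains(tokenList.subList(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                spans.add(new int[] {i, i + matched});
                i += matched; // continue after the matched term
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> entries = new HashSet<>();
        entries.add(List.of("Patrick"));
        entries.add(List.of("New"));
        entries.add(List.of("New", "York"));

        String[] tokens = "Patrick moved to New York".split(" ");
        for (int[] span : find(tokens, entries)) {
            System.out.println(Arrays.toString(span)); // prints [0, 1] then [3, 5]
        }
    }
}
```

Note that "New York" wins over the shorter entry "New" because longer candidates are tried first at each position.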

Let me know if I'm headed down the wrong path here....

Thanks,
James

On 3/8/2013 11:56 PM, Daniel Franc wrote:
> Hi James,
>
> Thanks for your reply.  Maybe my questions are too elementary, so sorry!
>
> I was running through the OpenNLP manual and went through the 
> "tokenizer" step 
> (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer). 
>
>
> Then when running through the "name finder" step it alluded to an 
> alternative separate dictionary lookup step (end of this section: 
> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api)
>
> I was able to create a dictionary for lookup, but I can't figure out 
> how to load it up or search with it.
>
> My eventual goal is to have a method to look up a set of terms within a 
> document as an alternative way to classify or tag the document, and not 
> necessarily use the statistical name finder.  I'm not familiar with 
> JWNL but I could give that a try.  It seems that I could manually code 
> a text search through a document, but I thought I'd try to use OpenNLP 
> first.
>
> Thanks again -- Dan


Re: Dictionary lookup novice question

Posted by James Kosin <ja...@gmail.com>.
Dan,

I'm guessing that when you say tokenized you mean with POS values.  If
so, a better approach would be to use the JWNL library to look up the
dictionary terms.  We use this with our coref component, and it isn't
hard to get working.  The biggest thing with POS is selecting the right
one.  It may be better to build a model for the POS tagger than to build
a dictionary for this, unless you mean this for a different language.

I guess I need more information from you on what you are trying to
accomplish.

James
