Posted to users@opennlp.apache.org by Siva Sakthi <ss...@gmail.com> on 2013/09/13 12:49:10 UTC

Tweets With Organization

Hi,
  We are using OpenNLP to find organizations (code below).

e.g.

1. Find out how Intel Xeon processors help make #EMC number 1 in backup at
#IDF13 going on now in San Francisco. #Speed2Lead Protect your data
>>
Opennlp returns "Intel" in the above sentence

2. NYPD Intel Division Chief Lashes Out At FBI Over Failed Terrorist Plot
http://t.co/V0XLKrp3TI
>>
Opennlp returns "Intel Division Chief Lashes"

Issue 1: I don't understand why it returns a composite string in the second
case, instead of just Intel
Issue 2: The "Intel" in the second sentence is not really "Intel"

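My code is as follows:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public static String findOrg(String message) throws Exception {
        // Naive whitespace tokenization of the tweet.
        String[] words = message.split(" ");
        TokenNameFinderModel tnf;
        // Close the model file once the model has been loaded.
        try (InputStream orgIs = new FileInputStream("en-ner-organization.bin")) {
            tnf = new TokenNameFinderModel(orgIs);
        }
        NameFinderME nf = new NameFinderME(tnf);
        Span[] sp = nf.find(words);
        String[] a = Span.spansToStrings(sp, words);
        StringBuilder sb = new StringBuilder();

        // One detected organization per line.
        for (String name : a) {
            sb.append(name).append("\n");
        }

        return sb.toString();
    }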

Thanks,
Ss

Re: Tweets With Organization

Posted by ch...@gmail.com.

No, it is not.
The right answer would be to explain the features used in the model.

On Sep 21, 2013, at 1:41 PM, Lance Norskog <go...@gmail.com> wrote:

> And yet it is the right one. How odd.

Re: Tweets With Organization

Posted by Lance Norskog <go...@gmail.com>.

And yet it is the right one. How odd.

On 09/20/2013 11:16 AM, charlesmartin14@gmail.com wrote:
> That is such a poor answer

Re: Tweets With Organization

Posted by ch...@gmail.com.

That is such a poor answer

On Sep 20, 2013, at 11:11 AM, Jeffrey Mershon <je...@gmail.com> wrote:

> Siva,
> 
> I'm assuming there is nothing wrong with your code. OpenNLP's named-entity
> recognizer is based on MaxEnt modeling, as opposed to rule-based
> programming, to identify named entities. So, the answer to "Why did OpenNLP
> return X as an organization" is always going to be "Because it was trained
> to do so". If the training set--that is, the set of sentences used to train
> the recognition model that you are using--does not possess similar
> characteristics to the sentences you are using that model to process, you
> are going to get sub-optimal results.
> 
> It looks to me as if you are processing tweets. If you're using the default
> recognizer, I doubt very much whether that was trained on tweets, and
> tweets possess very different characteristics than regular prose.
> Consequently, I suggest that you consider training a model using data that
> represents what you want to actually process.
> 
> In the examples you give, Intel is a company name in one case and a slang
> term (contraction of Intelligence) in another. You may find that it is not
> possible to train just one model to handle all cases. You might need
> individual strategies for different industries, depending on what you are
> trying to achieve. Good luck.
> 
> Regards,
> 
> Jeff

Re: Tweets With Organization

Posted by Jeffrey Mershon <je...@gmail.com>.

Siva,

I'm assuming there is nothing wrong with your code. OpenNLP's named-entity
recognizer is based on MaxEnt modeling, as opposed to rule-based
programming, to identify named entities. So, the answer to "Why did OpenNLP
return X as an organization" is always going to be "Because it was trained
to do so". If the training set--that is, the set of sentences used to train
the recognition model that you are using--does not possess similar
characteristics to the sentences you are using that model to process, you
are going to get sub-optimal results.

It looks to me as if you are processing tweets. If you're using the default
recognizer, I doubt very much whether that was trained on tweets, and
tweets possess very different characteristics than regular prose.
Consequently, I suggest that you consider training a model using data that
represents what you want to actually process.

In the examples you give, Intel is a company name in one case and a slang
term (contraction of Intelligence) in another. You may find that it is not
possible to train just one model to handle all cases. You might need
individual strategies for different industries, depending on what you are
trying to achieve. Good luck.

Regards,

Jeff
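
To make the training-data point above a little more concrete: statistical name finders of this kind commonly lean on surface cues such as capitalization and neighbouring words. The sketch below is a hypothetical, simplified "token shape" feature written for illustration only; the class name, method names, and shape labels are assumptions and are not OpenNLP's actual feature set. It shows that in the headline-style tweet every token carries a capitalized shape, so a capitalization cue learned from regular prose no longer separates names from ordinary words.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenShapeSketch {

    // Hypothetical, simplified capitalization feature (not OpenNLP's own).
    static String shape(String token) {
        boolean hasLetter = false;
        boolean allUpper = true;
        for (char c : token.toCharArray()) {
            if (Character.isLetter(c)) {
                hasLetter = true;
                if (!Character.isUpperCase(c)) {
                    allUpper = false;
                }
            }
        }
        if (!hasLetter) {
            return "other";      // digits, punctuation, empty strings
        }
        if (allUpper) {
            return "allCaps";    // e.g. NYPD, FBI, #EMC
        }
        if (Character.isUpperCase(token.charAt(0))) {
            return "initCap";    // e.g. Intel, Division
        }
        return "lower";
    }

    // Pairs each whitespace-separated token with its shape label.
    static List<String> shapes(String sentence) {
        List<String> out = new ArrayList<>();
        for (String t : sentence.split(" ")) {
            out.add(t + "/" + shape(t));
        }
        return out;
    }

    public static void main(String[] args) {
        // Regular prose: only the names are capitalized.
        System.out.println(shapes("how Intel Xeon processors help"));
        // Headline case: every token looks like a name.
        System.out.println(shapes("NYPD Intel Division Chief Lashes Out"));
    }
}
```

Whether the stock model actually weighs capitalization this way is not something this thread establishes; the sketch only illustrates why text with different surface characteristics can mislead a model trained on formal prose.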

On Fri, Sep 20, 2013 at 2:59 AM, Siva Sakthi <ss...@gmail.com> wrote:

> Can anyone answer the above question???
>
> Thanks

Re: Tweets With Organization

Posted by Siva Sakthi <ss...@gmail.com>.

Can anyone answer the above question???

Thanks


On Fri, Sep 13, 2013 at 4:19 PM, Siva Sakthi <ss...@gmail.com> wrote:

> Hi,
>   we are using opennlp for finding organizations (code below)
>
> e.g.
>
> 1. Find out how Intel Xeon processors help make #EMC number 1 in backup at
> #IDF13 going on now in San Francisco. #Speed2Lead Protect your data
> >>
> Opennlp returns "Intel" in the above sentence
>
> 2. NYPD Intel Division Chief Lashes Out At FBI Over Failed Terrorist Plot
> http://t.co/V0XLKrp3TI
> >>
> Opennlp returns "Intel Division Chief Lashes"
>
> Issue 1: I don't understand why it returns a composite string in the
> second case, instead of just Intel
> Issue 2: The "Intel" in the second sentence is not really "Intel"
>
> My code as follows,
>
>     public static String findOrg(String message) throws Exception {
>         String[] words = message.split(" ");
>         InputStream orgIs = new FileInputStream("en-ner-organization.bin");
>         TokenNameFinderModel tnf = new TokenNameFinderModel(orgIs);
>         NameFinderME nf = new NameFinderME(tnf);
>         Span sp[] = nf.find(words);
>         String a[] = Span.spansToStrings(sp, words);
>         StringBuilder sb = new StringBuilder();
>         int l = a.length;
>
>         for (int j = 0; j < l; j++) {
>             sb = sb.append(a[j] + "\n");
>         }
>
>         return sb.toString();
>     }
>
> Thanks,
> Ss
>
>

Re: Tweets With Organization

Posted by Lance Norskog <go...@gmail.com>.

Cool! This is a part-of-speech toolkit for Twitter:
http://www.ark.cs.cmu.edu/TweetNLP/

It's great that there is an NLP ecosystem developing around this new 
"grammar". Are there Twitter monitoring services which use this type of 
tool to fine-tune relevance? That would be a cool and resume-enhancing 
technical report.

Lance

On 09/20/2013 10:59 AM, Michael Schmitz wrote:
> You might find this package helpful--it's specifically for NER and tweets.
>
> https://github.com/aritter/twitter_nlp
>
> Peace.  Michael

Re: Tweets With Organization

Posted by Michael Schmitz <sc...@cs.washington.edu>.

You might find this package helpful--it's specifically for NER and tweets.

https://github.com/aritter/twitter_nlp

Peace.  Michael

On Fri, Sep 13, 2013 at 3:49 AM, Siva Sakthi <ss...@gmail.com> wrote:
> Hi,
>   we are using opennlp for finding organizations (code below)
>
> e.g.
>
> 1. Find out how Intel Xeon processors help make #EMC number 1 in backup at
> #IDF13 going on now in San Francisco. #Speed2Lead Protect your data
>>>
> Opennlp returns "Intel" in the above sentence
>
> 2. NYPD Intel Division Chief Lashes Out At FBI Over Failed Terrorist Plot
> http://t.co/V0XLKrp3TI
>>>
> Opennlp returns "Intel Division Chief Lashes"
>
> Issue 1: I don't understand why it returns a composite string in the second
> case, instead of just Intel
> Issue 2: The "Intel" in the second sentence is not really "Intel"
>
> My code as follows,
>
>     public static String findOrg(String message) throws Exception {
>         String[] words = message.split(" ");
>         InputStream orgIs = new FileInputStream("en-ner-organization.bin");
>         TokenNameFinderModel tnf = new TokenNameFinderModel(orgIs);
>         NameFinderME nf = new NameFinderME(tnf);
>         Span sp[] = nf.find(words);
>         String a[] = Span.spansToStrings(sp, words);
>         StringBuilder sb = new StringBuilder();
>         int l = a.length;
>
>         for (int j = 0; j < l; j++) {
>             sb = sb.append(a[j] + "\n");
>         }
>
>         return sb.toString();
>     }
>
> Thanks,
> Ss

Re: Tweets With Organization

Posted by Jörn Kottmann <ko...@gmail.com>.

On 09/15/2013 01:27 AM, Lance Norskog wrote:
>
> You might get better results if you make your own organization 
> training set. These training sets are old, and the business world 
> changes names rapidly. Also, advertising text has its own terse syntax 
> and the models are generally trained on more formal English. 

Yes, try to produce your own training data. As part of that, you can
precisely define what should be an organization name; maybe you want to
annotate product names as well.

HTH,
Jörn
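
As a rough illustration of what such training data looks like: the OpenNLP name finder is trained on one tokenized sentence per line, with names wrapped in `<START:type> ... <END>` markers. The sketch below is a small helper, assumed for illustration (the class, the `Annotation` type, and the method name are hypothetical and not part of OpenNLP), that turns tokens plus hand-chosen annotations into a line in that format. Which tokens get tagged, and with which type, is the annotator's decision.

```java
public class TrainingLineSketch {

    // Hypothetical holder for one annotated name: tokens [start, end) plus a type.
    static final class Annotation {
        final int start;
        final int end;
        final String type;

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Builds one training line. Annotations must be sorted by start
    // and must not overlap.
    static String toTrainingLine(String[] tokens, Annotation... annotations) {
        StringBuilder sb = new StringBuilder();
        int next = 0; // index of the next annotation to open or close
        for (int i = 0; i < tokens.length; i++) {
            if (next < annotations.length && annotations[next].start == i) {
                sb.append("<START:").append(annotations[next].type).append("> ");
            }
            sb.append(tokens[i]).append(' ');
            if (next < annotations.length && annotations[next].end == i + 1) {
                sb.append("<END> ");
                next++;
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] tokens = "Find out how Intel Xeon processors help".split(" ");
        // One possible annotation choice: Intel as organization, Xeon as product.
        System.out.println(toTrainingLine(tokens,
                new Annotation(3, 4, "organization"),
                new Annotation(4, 5, "product")));
    }
}
```

A useful model needs many such lines drawn from real tweets; the single sentence here only shows the shape of the data.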

Re: Tweets With Organization

Posted by Lance Norskog <go...@gmail.com>.

Xeon is probably not a word the model knows, so it only finds Intel.
Division and Chief are probably words that often occur in organization
names.
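
On the "composite string" itself: a name finder result is a span of token indices (start inclusive, end exclusive), and the text of a name is just the covered tokens joined by spaces. The sketch below mimics that joining step in plain Java; the class and method names are hypothetical, and the span indices are inferred from the output reported in the question rather than taken from an actual run.

```java
import java.util.Arrays;

public class SpanJoinSketch {

    // Joins tokens[start..end) with single spaces; end is exclusive.
    static String spanToString(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = "NYPD Intel Division Chief Lashes Out At FBI".split(" ");
        // A single span covering tokens 1..4 yields the reported composite.
        System.out.println(spanToString(tokens, 1, 5));
        // A one-token span would have yielded just "Intel".
        System.out.println(spanToString(tokens, 1, 2));
    }
}
```

So the longer string means the model treated four consecutive tokens as one name, not that separate results were concatenated.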

You might get better results if you make your own organization training 
set. These training sets are old, and the business world changes names 
rapidly. Also, advertising text has its own terse syntax and the models 
are generally trained on more formal English.

If you're doing tweets, there is a POS tagger for tweets from CMU.
Cross-checking the spans against noun/verb/etc. tags might help your results.
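
A minimal sketch of what such a cross-check could look like, assuming you already have one POS tag per token from some tagger. The tag names (PROPN, NOUN, VERB, ...) and the tags in the example are hand-assigned for illustration, not real tagger output, and the class and method names are hypothetical. The idea is simply to trim trailing tokens tagged as verbs off a candidate span.

```java
import java.util.Arrays;

public class PosCrossCheckSketch {

    // Shrinks the span [start, end) from the right while its last token
    // is tagged as a verb. Returns {start, newEnd}; the span may become empty.
    static int[] trimTrailingVerbs(int start, int end, String[] tags) {
        int newEnd = end;
        while (newEnd > start && tags[newEnd - 1].equals("VERB")) {
            newEnd--;
        }
        return new int[] {start, newEnd};
    }

    public static void main(String[] args) {
        String[] tokens = "NYPD Intel Division Chief Lashes Out At FBI".split(" ");
        // Hand-assigned, hypothetical tags; a real tagger may disagree.
        String[] tags = {"PROPN", "PROPN", "NOUN", "NOUN", "VERB", "ADV", "ADP", "PROPN"};

        int[] trimmed = trimTrailingVerbs(1, 5, tags);
        String name = String.join(" ",
                Arrays.copyOfRange(tokens, trimmed[0], trimmed[1]));
        System.out.println(name);
    }
}
```

This kind of filter only addresses the over-long span; it does nothing about "Intel" meaning intelligence rather than the company.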

On 09/13/2013 03:49 AM, Siva Sakthi wrote:
> Hi,
>    we are using opennlp for finding organizations (code below)
>
> e.g.
>
> 1. Find out how Intel Xeon processors help make #EMC number 1 in backup at
> #IDF13 going on now in San Francisco. #Speed2Lead Protect your data
> Opennlp returns "Intel" in the above sentence
>
> 2. NYPD Intel Division Chief Lashes Out At FBI Over Failed Terrorist Plot
> http://t.co/V0XLKrp3TI
> Opennlp returns "Intel Division Chief Lashes"
>
> Issue 1: I don't understand why it returns a composite string in the second
> case, instead of just Intel
> Issue 2: The "Intel" in the second sentence is not really "Intel"
>
> My code as follows,
>
>      public static String findOrg(String message) throws Exception {
>          String[] words = message.split(" ");
>          InputStream orgIs = new FileInputStream("en-ner-organization.bin");
>          TokenNameFinderModel tnf = new TokenNameFinderModel(orgIs);
>          NameFinderME nf = new NameFinderME(tnf);
>          Span sp[] = nf.find(words);
>          String a[] = Span.spansToStrings(sp, words);
>          StringBuilder sb = new StringBuilder();
>          int l = a.length;
>
>          for (int j = 0; j < l; j++) {
>              sb = sb.append(a[j] + "\n");
>          }
>
>          return sb.toString();
>      }
>
> Thanks,
> Ss
>