Posted to java-user@lucene.apache.org by Andrew Zhang <ro...@gmail.com> on 2009/10/06 13:42:23 UTC

Phrase Extraction, mainly for English

Hi guys,

The requirement is very simple. Take this sentence: 'The NBA
formally announced its new *social media* guidelines Wednesday'. I want to
treat '*social media*' as a single phrase term. The default English analyzers
that come with Lucene all work on single words, so if you want the most
frequent terms, *social* and *media* are counted separately, and neither of
them carries the meaning of *social media*, right?

I know there is an approach built on a phrase dictionary that matches
phrases already listed there, much like what is done for Chinese, but is
there an open source solution for English? I don't want to build a
phrase dictionary myself, and I would like a smart approach that can
"discover" phrases automatically. I have 2 million docs analyzed the normal
way, all as single terms, which I can use as a base source. It is possible to
find that *social media* occurs together frequently, but I really don't know
how to do the reverse.

I tried to find some phrase analyzers, but no luck. Any advice?

Regards,
Andrew
-- 
Simple is best

Re: Phrase Extraction, mainly for English

Posted by Andrew Zhang <ro...@gmail.com>.

Right, Vasu, I think NLP is a good route. I should take some time to look at
that. Thanks.

On Tue, Oct 6, 2009 at 8:10 PM, Vasudevan Comandur <vc...@gmail.com> wrote:

> Hi,
>
>   Take the NLP route and use modules like a POS tagger and an NP chunker.
>
>   OpenNLP has a stack for the English language. Try using it.
>
> Regards
>  Vasu

Re: Phrase Extraction, mainly for English

Posted by Vasudevan Comandur <vc...@gmail.com>.

Hi,

   Take the NLP route and use modules like a POS tagger and an NP chunker.

   OpenNLP has a stack for the English language. Try using it.

Regards
 Vasu
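
To make the POS tagger plus NP chunker idea concrete, here is a minimal
sketch in Python. It is not OpenNLP's API, and OpenNLP's chunker is a trained
statistical model rather than a hand-written pattern. The tags below are
hand-assigned Penn Treebank tags for the sentence from the original post (an
assumption for illustration), and the names tag_class, NP_PATTERN, and
noun_phrases are hypothetical.

```python
import re

def tag_class(tag):
    """Collapse a Penn Treebank tag into one letter for pattern matching."""
    if tag in ("DT", "PRP$"):
        return "D"  # determiner or possessive pronoun
    if tag.startswith("JJ"):
        return "J"  # adjective
    if tag in ("NNP", "NNPS"):
        return "P"  # proper noun
    if tag.startswith("NN"):
        return "N"  # common noun
    return "O"      # anything else

# Toy grammar (an assumption, not a trained model): an optional determiner,
# any adjectives, then common nouns; or an optional determiner and proper nouns.
NP_PATTERN = re.compile(r"D?J*N+|D?P+")

def noun_phrases(tagged):
    """Return the noun-phrase chunks found in a list of (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(classes)
    ]

# Hand-assigned tags; a real pipeline would get these from a POS tagger.
tagged = [
    ("The", "DT"), ("NBA", "NNP"), ("formally", "RB"), ("announced", "VBD"),
    ("its", "PRP$"), ("new", "JJ"), ("social", "JJ"), ("media", "NNS"),
    ("guidelines", "NNS"), ("Wednesday", "NNP"),
]
print(noun_phrases(tagged))
# ['The NBA', 'its new social media guidelines', 'Wednesday']
```

Note that the chunk is "its new social media guidelines", not "social media",
so chunks are only candidates. Picking out the recurring sub-phrase would
still need some counting across many documents.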


Re: Phrase Extraction, mainly for English

Posted by Andrew Zhang <ro...@gmail.com>.

Hi Erick,

To run a query, you already have to know the phrase, right? But I want to
discover the phrases: find which words occur together so often that we can
naturally treat them as a phrase.
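
One common way to score "words that occur together so often" is pointwise
mutual information (PMI) over adjacent word pairs. Nobody in this thread
proposes it, so treat the following Python sketch as one possible approach
under stated assumptions: the tokenizer is a naive regex, the sample
documents are made up, and pmi_bigrams is a hypothetical helper name.

```python
import math
import re
from collections import Counter

def pmi_bigrams(docs, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z0-9']+", doc.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # High PMI: the pair occurs more often than its words' frequencies predict.
        scored.append((f"{w1} {w2}", math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up sample documents, one sentence each.
docs = [
    "the nba announced new social media guidelines",
    "the league posted the rules on social media",
    "social media changed the way the fans follow the game",
    "the coach said the team read the rules",
]
for phrase, score in pmi_bigrams(docs):
    print(f"{phrase}\t{score:.2f}")
```

On this toy input "social media" ranks above "the rules", because "the" is
frequent everywhere and so is weak evidence of a real phrase.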



On Tue, Oct 6, 2009 at 8:12 PM, Erick Erickson <er...@gmail.com> wrote:

> Maybe I'm missing the problem entirely, but can you use phrase queries, or
> one of the Span* queries with a slop of 0, when searching?
>
> Best
> Erick

Re: Phrase Extraction, mainly for English

Posted by Erick Erickson <er...@gmail.com>.

Maybe I'm missing the problem entirely, but can you use phrase queries, or
one of the Span* queries with a slop of 0, when searching?

Best
Erick


Re: Phrase Extraction, mainly for English

Posted by Karl Wettin <ka...@gmail.com>.

There are many uses for shingles.

I've used them to find common phrases in text, which is my
understanding of what you are trying to achieve. It works rather well, is a
very simple solution, and is easy on resources compared to real semantic
analysis.

You'll get a lot of shingles such as "there is" and "we are",
but using a stop word list to filter out any shingle containing one or
more stop words should do the trick (I did that in post-processing,
keeping all shingles in my index). It will probably
require a bit of manual work, depending on your corpora, to get a really
clean list of common phrases that make sense. Just create a list,
inspect it by eye, and try to find patterns in the phrases you
want to get rid of. You might also want to look for punctuation in
your text to avoid creating shingles from text that spans different
sentences. There is a pretty good sentence extraction tool in GATE you
can use.


      karl
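
The steps Karl describes (build shingles, drop any that contain a stop word,
keep shingles inside sentence boundaries, then inspect the counts) can be
sketched in plain Python as follows. This is an illustration, not the Lucene
shingle package or GATE: the stop word list is a tiny made-up sample, the
sentence splitter is a naive regex that will mis-split abbreviations, and the
function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "its", "of", "on",
              "the", "there", "to", "we"}

def sentences(text):
    """Split on sentence-ending punctuation so shingles never span sentences."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def shingles(tokens, size=2):
    """Yield every run of `size` adjacent tokens as a tuple."""
    for i in range(len(tokens) - size + 1):
        yield tuple(tokens[i:i + size])

def common_phrases(docs, size=2, min_count=2):
    """Count shingles across docs, dropping any shingle that contains a stop word."""
    counts = Counter()
    for doc in docs:
        for sentence in sentences(doc):
            tokens = re.findall(r"[a-z0-9']+", sentence.lower())
            for shingle in shingles(tokens, size):
                if not any(tok in STOP_WORDS for tok in shingle):
                    counts[shingle] += 1
    return [(" ".join(s), n) for s, n in counts.most_common() if n >= min_count]

docs = [
    "The NBA formally announced its new social media guidelines Wednesday.",
    "There is a lot of debate on social media. We are following the guidelines.",
    "Teams use social media to reach fans.",
]
print(common_phrases(docs))
# [('social media', 3)]
```

On a real corpus the resulting list would still need the manual clean-up pass
described above.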

On 7 Oct 2009, at 01:39, Andrew Zhang wrote:

> Hi Karl,
>
> I think shingles are designed to make phrase search faster. They will
> generate a lot of "look-alike" phrases based on position only, completely
> disregarding meaning, and that's not good enough.
>
> Regards,
> Andrew


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Phrase Extraction, mainly for English

Posted by Andrew Zhang <ro...@gmail.com>.

Hi Karl,

I think shingles are designed to make phrase search faster. They will
generate a lot of "look-alike" phrases based on position only, completely
disregarding meaning, and that's not good enough.

Regards,
Andrew

On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <ka...@gmail.com> wrote:

> Hi Andrew,
>
> I think you are looking for the shingle package in contrib/analyzers.
>
>
>      karl

Re: Phrase Extraction, mainly for English

Posted by Karl Wettin <ka...@gmail.com>.

Hi Andrew,

I think you are looking for the shingle package in contrib/analyzers.


       karl
