Posted to general@lucene.apache.org by Harsha1 <99...@gmail.com> on 2009/06/26 14:14:08 UTC

Is categorization possible in Lucene?

Hi,
I went through the overview of Lucene and found that it is related to text
searching and similar tasks.

Please let me know whether the following can be done.

Suppose I have a paragraph such as:
This is a test program. I have done this using regex and some other functions
in Groovy. But what I am looking for is some kind of feature or template in
which I just specify the pattern I am interested in. Based on that pattern,
Groovy should automatically categorize the fields. Authors: Micheal Jackson,
Daniel O Reily and Harsha.

The format we are looking at is:
TITLE: NAME1 NAME2 NAME3

In this case:
TITLE = Authors
NAME1 = Micheal Jackson
NAME2 = Daniel O Reily
NAME3 = Harsha

Like this, when I pass in a paragraph, these fields (TITLE: NAME1 NAME2
NAME3) should be categorized automatically. Is that possible? (I have done it
in Java using regular expressions, but we don't want to code it from scratch;
we want a feature that does this automatically, or with less code.)
-- 
View this message in context: http://www.nabble.com/Catogarization-is-possible-in-Lucene--tp24219314p24219314.html
Sent from the Lucene - General mailing list archive at Nabble.com.


Re: Is categorization possible in Lucene?

Posted by Harsha1 <99...@gmail.com>.
Hi Ted,
Thanks for your reply.

You suggested searching for documents that have the string "Authors:" within
10 words of a particular name, but that was just an example. In practice I
won't know in advance what the title is or how many names follow it.
All we know beforehand is that the paragraph will contain a colon; the word
just before the colon is the title, and everything after the colon up to the
next period is the names.
So when I pass in this paragraph, I want it categorized as Title = ...,
Name1 = ..., Name2 = ..., and so on.

The paragraph may look like this:
 
"Lucene can use the output of such a system, but does not support doing the
extraction itself.  Many times, full scale named entity extraction is not
really necessary and in those cases the phrase query in Lucene can help you
out. Friends: Ted Dunning, Sree HArsha."

So here,
Title will be Friends
Name1 will be Ted Dunning
Name2 will be Sree HArsha

To do this, I am looking for a special feature that can help, as opposed to
plain Java, where we would need to code it from scratch.
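
For reference, the rule described above (the word before the colon is the
title, the text from the colon to the next period is the list of names) can
be sketched in a few lines of plain Java with java.util.regex. This is only a
minimal sketch, not a Lucene feature: the class name LabelledListExtractor and
the method extract are made up for illustration, and it assumes that names are
separated by commas or the word "and" and contain no periods or colons (so
"J. Smith" or a URL in the paragraph would confuse it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or any library.
class LabelledListExtractor {
    // One word, a colon, then everything up to the next period.
    private static final Pattern LABELLED_LIST =
            Pattern.compile("(\\w+):\\s*([^.:]+)\\.");
    // Assumption: names are separated by commas or by the word "and".
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s*,\\s*|\\s+and\\s+");

    /** Returns the title followed by the names, or an empty list if nothing matches. */
    static List<String> extract(String paragraph) {
        List<String> fields = new ArrayList<>();
        Matcher m = LABELLED_LIST.matcher(paragraph);
        if (m.find()) {                       // only the first "Title: ..." run is used
            fields.add(m.group(1));           // the title
            for (String name : SEPARATOR.split(m.group(2).trim())) {
                if (!name.isEmpty()) {
                    fields.add(name);         // one entry per name
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String paragraph = "In those cases the phrase query in Lucene can help "
                + "you out. Friends: Ted Dunning, Sree HArsha.";
        System.out.println(extract(paragraph));
    }
}
```

The first element of the returned list is the title and the rest are the
names, however many there are, so nothing about the title or the number of
names has to be known in advance.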





Re: Is categorization possible in Lucene?

Posted by Ted Dunning <te...@gmail.com>.
It sounds to me like what you are trying to do is information extraction.

Lucene can use the output of such a system, but does not support doing the
extraction itself. Many times, full-scale named entity extraction is not
really necessary, and in those cases the phrase query in Lucene can help you
out. For instance, you might search for documents that have the string
"Authors:" within 10 words of a particular name. That will only retrieve
documents, however, and would not, say, fill in an author table in a
database. You can help Lucene by doing simple pre-processing during initial
document processing, and Lucene can in turn help with information extraction
by finding documents that are likely to contain the information you need to
extract.
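
To make "within 10 words" concrete, here is a minimal plain-Java sketch of
what such a proximity test means. It is an illustration only and not how
Lucene implements it; the class name ProximityCheck and the method near are
made up. In Lucene's classic query parser syntax the comparable search is
written as a quoted phrase followed by ~N, for example "authors jackson"~10,
although the slop there is counted somewhat differently and the analyzer
normally drops the colon.

```java
import java.util.Locale;

// Hypothetical helper that mimics a proximity match on a single string.
class ProximityCheck {
    /** True if both words occur with at most maxGap word positions between them. */
    static boolean near(String text, String first, String second, int maxGap) {
        // Crude tokenizer: lower-case, split on anything that is not a letter or digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) lastA = i;   // most recent position of the first word
            if (tokens[i].equals(b)) lastB = i;   // most recent position of the second word
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxGap) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Authors: Micheal Jackson, Daniel O Reily and Harsha.";
        System.out.println(near(text, "Authors", "Harsha", 10));
    }
}
```

As with the Lucene query, this only says whether a document matches; it does
not pull the names out.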

I would recommend you look at the GATE system (if you want open source) or
LingPipe (if you can pay commercial prices or are doing research).

http://gate.ac.uk/
http://alias-i.com/lingpipe/



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)