Posted to java-user@lucene.apache.org by MFM <mf...@live.com> on 2009/03/24 16:04:59 UTC

question about grouping text

I have been able to successfully index and search text from structured
documents like PDF and MS Word. I am having a hard time figuring out how
to group the indexed strings together. For example, if my document had a
question and answer in a table, the search will return the question text
that matches the keyword. How would I group or associate the question
and answer as part of the indexing? I have tried using POI to read through
the MS Word file and group them, but it quickly turns into heavy
pattern matching.

Thanks
MFM
-- 
View this message in context: http://www.nabble.com/question-about-grouping-text-tp22682433p22682433.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: question about grouping text

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I'm not aware of anything in LingPipe that would do the Q&A part, though LP (and GATE) may have the building blocks for what you need.  For example, they both must have sentence boundary detection/sentence chunking, which might be one of the first sub-tasks you'd need to do to begin finding/evaluating questions and answers.
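
As a rough illustration of the sentence-chunking sub-task mentioned above, here is a minimal Java sketch. It uses the JDK's java.text.BreakIterator as a stand-in, not LingPipe or GATE, whose APIs are not shown here. The class name QuestionFinder, its methods, and the "ends with a question mark" heuristic are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene, LingPipe, or GATE.
class QuestionFinder {

    /** Splits text into sentences using the JDK's locale-aware BreakIterator. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    /** Naive heuristic (an assumption): a sentence ending in '?' is a candidate question. */
    static List<String> candidateQuestions(String text) {
        List<String> out = new ArrayList<>();
        for (String sentence : sentences(text)) {
            if (sentence.endsWith("?")) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "What is Lucene? Lucene is a search library. "
                + "How do I index a PDF? Extract the text first.";
        System.out.println(candidateQuestions(text));
    }
}
```

A real document would likely need something stronger than the question-mark check, since table cells often hold questions without punctuation.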

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Amin Mohammed-Coleman <am...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, March 26, 2009 3:54:59 AM
> Subject: Re: question about grouping text
> 
> Hi
> 
> I was wondering if something like LingPipe or GATE (for text extraction)
> might be an idea?  I've started looking at it and I'm just thinking it may
> be applicable (I may be wrong).
> 
> Cheers
> Amin
> 



Re: question about grouping text

Posted by Amin Mohammed-Coleman <am...@gmail.com>.
Hi

I was wondering if something like LingPipe or GATE (for text extraction)
might be an idea?  I've started looking at it and I'm just thinking it may
be applicable (I may be wrong).

Cheers
Amin

On Wed, Mar 25, 2009 at 4:18 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Hi MFM,
>
> This comes down to a preprocessing step that you would have to do before
> putting into Lucene, although I suppose you might be able to identify it
> during analysis and use the TeeTokenFilter and the SinkTokenizer.  Once you
> do this, then you can add them as fields on a Document.  I know that's not a
> great help, but not much Lucene can do b/c it is application specific.
>
> Document/field wise, I would probably have:
> Document
>   question
>   answer
>
> Then, when you search in the question field, you can also retrieve the
> answer.
>
> -Grant
>
>

Re: question about grouping text

Posted by Grant Ingersoll <gs...@apache.org>.
Hi MFM,

This comes down to a preprocessing step that you would have to do
before putting the text into Lucene, although I suppose you might be able
to identify the question and answer text during analysis and use the
TeeTokenFilter and the SinkTokenizer.  Once you do this, you can add them
as fields on a Document.  I know that's not a great help, but there is not
much Lucene can do b/c it is application specific.

Document/field wise, I would probably have:
Document
    question
    answer

Then, when you search in the question field, you can also retrieve the  
answer.
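
The layout above can be sketched in plain Java. This is a minimal stand-in for the idea, not Lucene's API: the QaIndex and QaDoc names, the tiny in-memory term map, and the tokenizer are all assumptions for illustration. It shows only the pairing itself: each table row becomes one record, only the question field is searchable, and the answer comes back with the hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in; a real application would use Lucene Documents and Fields.
class QaIndex {

    /** One record per table row: the question is indexed, the answer is only stored. */
    static final class QaDoc {
        final String question;
        final String answer;

        QaDoc(String question, String answer) {
            this.question = question;
            this.answer = answer;
        }
    }

    private final List<QaDoc> docs = new ArrayList<>();
    // term -> ids of the records whose question field contains that term
    private final Map<String, Set<Integer>> questionTerms = new HashMap<>();

    /** Adds a question/answer pair produced by the preprocessing step. */
    void add(String question, String answer) {
        int id = docs.size();
        docs.add(new QaDoc(question, answer));
        for (String term : tokenize(question)) {
            if (!term.isEmpty()) {
                questionTerms.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(id);
            }
        }
    }

    /** Looks up a single keyword in the question field and returns the matching pairs. */
    List<QaDoc> searchQuestion(String keyword) {
        Set<Integer> ids = questionTerms.getOrDefault(
                keyword.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        List<QaDoc> hits = new ArrayList<>();
        for (int id : ids) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    /** Lowercases and splits on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    public static void main(String[] args) {
        QaIndex index = new QaIndex();
        index.add("How do I reset my password?", "Use the reset link on the login page.");
        index.add("What is the refund policy?", "Refunds are issued within 30 days.");
        for (QaDoc hit : index.searchQuestion("refund")) {
            System.out.println(hit.question + " -> " + hit.answer);
        }
    }
}
```

The hard part remains the preprocessing that produces the pairs (for example, reading table rows with POI). Once each pair is one record, retrieval of the answer is just a stored-field lookup.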

-Grant

On Mar 24, 2009, at 4:04 PM, MFM wrote:

>
> I have been able to successfully index and search text from structured
> documents like PDF and MS Word. I am having a hard time figuring out how
> to group the indexed strings together. For example, if my document had a
> question and answer in a table, the search will return the question text
> that matches the keyword. How would I group or associate the question
> and answer as part of the indexing? I have tried using POI to read through
> the MS Word file and group them, but it quickly turns into heavy
> pattern matching.
>
> Thanks
> MFM

