Posted to java-user@lucene.apache.org by Shay Hummel <sh...@gmail.com> on 2015/04/14 19:12:21 UTC

Text dependent analyzer

Hi
I would like to create a text dependent analyzer.
That is, *given a string*, the analyzer will:
1. Read the entire text and break it into sentences.
2. Each sentence will then be tokenized, possesive removal, lowercased,
mark terms and stemmed.

The second part is essentially what happens in the English analyzer
(createComponents). However, that step does not depend on the text it receives -
which is the first part of what I am trying to do.
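Step 2's per-token pipeline, minus stemming, can be sketched in plain Java. A real English chain would use Lucene's StandardTokenizer, EnglishPossessiveFilter, LowerCaseFilter, and PorterStemFilter; the class and method names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of step 2's per-token normalization: tokenize on
// whitespace, strip possessive 's, and lowercase. Stemming is left out;
// Lucene's PorterStemFilter would handle that.
public class SentenceNormalizer {
    public static List<String> normalize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.split("\\s+")) {
            String token = raw.replaceAll("[^\\p{L}\\p{N}']", ""); // drop punctuation
            if (token.endsWith("'s")) {
                token = token.substring(0, token.length() - 2); // possessive removal
            }
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase());
            }
        }
        return tokens;
    }
}
```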

So ... How can it be achieved?

Thank you,

Shay Hummel

Re: Text dependent analyzer

Posted by Shay Hummel <sh...@gmail.com>.
Hi Rich

Thank you very much. I understand your solution and will try to do something
in that spirit.

Shay


Re: Text dependent analyzer

Posted by Rich Cariens <ri...@gmail.com>.
Ahoy, ahoy!

I was playing around with something similar for indexing multi-lingual
documents, Shay. The code is up on github
<https://github.com/whateverdood/cross-lingual-search> and needs attention,
but you're welcome to see if anything in there helps. The basic idea is
this:

   1. A custom CharFilter uses the ICU4J sentence BreakIterator to get
   sentences out of the char stream.
      1. Each sentence is language-identified using the Cybozu Detector.
      2. A ThreadLocal (ugh) is updated with languages and their offsets
      (where a run of a particular language ends).
   2. A custom TokenFilter then marks each token with its language (relying on
   that ThreadLocal) if possible, so that the next custom filter...
   3. ...checks each token's language and recruits the appropriate stemmer.
   4. Other filters like ICUFoldingFilter kick in to do their thing.
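The ThreadLocal handoff in steps 1-2 could be sketched like this, with the Lucene plumbing omitted. The class and method names here are hypothetical, not taken from the linked repo:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the CharFilter/TokenFilter handoff: the sentence splitter
// records where each language run ends, and downstream token filters
// later ask which language covers a given character offset.
public class LanguageRuns {
    // Keyed by the (exclusive) end offset of a run; value is the language code.
    private static final ThreadLocal<TreeMap<Integer, String>> RUNS =
            ThreadLocal.withInitial(TreeMap::new);

    // Called by the sentence-splitting CharFilter after language detection.
    public static void record(int endOffset, String language) {
        RUNS.get().put(endOffset, language);
    }

    // Called by downstream token filters with each token's start offset:
    // the covering run is the one whose end offset is strictly greater.
    public static String languageAt(int offset) {
        Map.Entry<Integer, String> run = RUNS.get().higherEntry(offset);
        return run == null ? null : run.getValue();
    }

    // Must be called per document, or runs leak across documents.
    public static void clear() {
        RUNS.get().clear();
    }
}
```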

Does this help at all?


Re: Text dependent analyzer

Posted by Benson Margulies <be...@basistech.com>.
If you want tokenization to depend on sentences, and you insist on
being inside Lucene, you have to be a Tokenizer. Your Tokenizer can
set an attribute on the token that ends a sentence. Then, downstream,
filters can read ahead to get the full sentence and buffer
tokens as needed.
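Stripped of Lucene's TokenStream plumbing, the buffering idea might look like this. The names are illustrative; a real implementation would be a TokenFilter using captureState/restoreState and a sentence-end attribute:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// A token carries its text plus the flag the tokenizer sets on the last
// token of each sentence (the "attribute" marking sentence end).
record Token(String text, boolean endOfSentence) {}

// Buffers tokens until the sentence-end flag is seen, then hands the
// whole sentence to a processor and releases the result downstream.
class SentenceBufferingFilter {
    private final UnaryOperator<List<Token>> perSentenceProcessor;
    private final List<Token> buffer = new ArrayList<>();

    SentenceBufferingFilter(UnaryOperator<List<Token>> perSentenceProcessor) {
        this.perSentenceProcessor = perSentenceProcessor;
    }

    // Returns an empty list until the current sentence is complete.
    List<Token> push(Token token) {
        buffer.add(token);
        if (!token.endOfSentence()) {
            return new ArrayList<>();
        }
        List<Token> sentence = new ArrayList<>(buffer);
        buffer.clear();
        return perSentenceProcessor.apply(sentence);
    }
}
```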




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Text dependent analyzer

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Hummel,

There was an effort to bring OpenNLP capabilities to Lucene:
https://issues.apache.org/jira/browse/LUCENE-2899

Lance was working on keeping it up-to-date, but it looks like it is not always best to accomplish everything inside Lucene.
I personally would do the sentence detection outside of Lucene.

By the way, I remember there was a way to consume the entire upstream token stream.

I think it consumed all input and injected one huge concatenated term/token.

KeywordTokenizer has similar behaviour: it injects a single token.
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
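Reduced to plain Java, the consume-everything behaviour described above is just a drain-and-join; a real version would live in a TokenFilter's incrementToken(), pulling from the upstream stream until it is exhausted:

```java
import java.util.Iterator;
import java.util.StringJoiner;

// Drains the entire upstream token stream and emits one concatenated
// token, analogous to how KeywordTokenizer emits its whole input as a
// single token.
public class ConcatenatingFilter {
    public static String concatenate(Iterator<String> upstream, char separator) {
        StringJoiner joiner = new StringJoiner(String.valueOf(separator));
        while (upstream.hasNext()) {
            joiner.add(upstream.next());
        }
        return joiner.toString();
    }
}
```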

Ahmet



Re: Text dependent analyzer

Posted by Jack Krupansky <ja...@gmail.com>.
Currently, how are you indexing sentence boundaries? Are you placing
sentences in distinct fields, leaving a position gap, or... what?

Ultimately it comes down to how you intend to query the data in a way that
respects sentence boundaries. To put it simply, why exactly do you care
where the sentence boundaries are? Be specific, because that determines
what your queries should look like, which determines what the indexed text
should look like, which determines how the text should be analyzed.
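On the position-gap option: if each sentence is indexed as a separate value of a multivalued field and the analyzer reports a large position increment gap between values, a proximity query with a slop smaller than the gap cannot match across sentences. A dependency-free sketch of that position arithmetic (the gap of 100 in the test is an arbitrary example; in Lucene the value would come from Analyzer#getPositionIncrementGap):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assigns token positions the way Lucene does for a multivalued field:
// consecutive tokens are one position apart, and a large gap is added
// between values (here, between sentences). Assumes unique tokens, since
// the demo maps token text to its position.
public class PositionGapDemo {
    public static Map<String, Integer> positions(List<List<String>> sentences, int gap) {
        Map<String, Integer> positionByToken = new LinkedHashMap<>();
        int position = -1;
        for (List<String> sentence : sentences) {
            // The first token of a new value jumps ahead by the gap.
            if (position >= 0) position += gap;
            for (String token : sentence) {
                position += 1;
                positionByToken.put(token, position);
            }
        }
        return positionByToken;
    }
}
```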

-- Jack Krupansky


Re: Text dependent analyzer

Posted by Shay Hummel <sh...@gmail.com>.
Hi Ahment,
Thank you for the reply,
That's exactly what I am doing. At the moment, to index a document, I break
it into sentences, and each sentence is analyzed (lemmatizing, stopword
removal, etc.).
Now, what I am looking for is a way to create an analyzer (a class which
extends Lucene's Analyzer). This analyzer will be used for index and query
processing. Like the English analyzer, it will receive the text and
produce tokens.
The Analyzer API requires implementing createComponents, which does not depend
on the text being analyzed. This is problematic since, as you know,
OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the
model files to provide spans of each sentence and then break them).
Is there a way around it?

Shay


Re: Text dependent analyzer

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Hummel,

You can perform sentence detection outside of Solr, using OpenNLP for instance, and then feed the sentences to Solr.
https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
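OpenNLP's SentenceDetectorME does this split with a trained model file. As a dependency-free stand-in for the same step, the JDK's BreakIterator illustrates the splitting (a sketch only; OpenNLP's statistical detector is generally more robust than the JDK's rule-based one):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits text into sentences before any Lucene/Solr analysis happens,
// mirroring the String[] that OpenNLP's SentenceDetectorME#sentDetect
// would return.
public class SentenceSplitter {
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Each returned sentence can then be fed through the usual per-sentence analysis chain.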

Ahmet




