Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2009/11/27 19:07:36 UTC

Sentence detection/extraction as Tokenizer?

Hello,

The contrib/wordnet package contains an AnalyzerUtil class with a method that extracts sentences from text/String.  It is super-simplistic, so probably not very accurate, but I am wondering if *conceptually* it would make sense to have a Tokenizer that extracts sentences?  I suppose that means each Token would be a complete sentence.

Would you say it makes sense to implement sentence detection/extraction as a Tokenizer?

Thanks,
Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Sentence detection/extraction as Tokenizer?

Posted by Shai Erera <se...@gmail.com>.
Hi Otis

I've implemented sentence detection as part of my tokenizer. It does not
extract sentences, but "detects" EOS (based on several characters from the
Unicode spec). Upon detection, it returns a Token of EOS type. I then have an
EOS Filter which can be configured with the appropriate behavior for handling
it: for example, setting posIncr to 100 on the next token, to prevent
phrase/fuzzy searches from finding matches across sentences. There are other
uses as well, such as highlighting.
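
The approach described above could be sketched roughly as follows. This is a
minimal, library-free illustration and not the actual implementation: the
names EosSketch, Tok, isEos, tokenize, applyEosGap and positions are all
hypothetical, the set of EOS characters is an assumed small sample, and a real
Lucene version would be written as a Tokenizer plus a TokenFilter that sets
the position increment on the token following an EOS token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a tokenizer that emits EOS marker tokens, plus a
// filter step that turns each marker into a position gap on the next token.
class EosSketch {

    /** Hypothetical token: text, an EOS flag, and a position increment. */
    static final class Tok {
        final String text;
        final boolean eos;
        int posIncr = 1; // default: each token is one position after the last

        Tok(String text, boolean eos) {
            this.text = text;
            this.eos = eos;
        }
    }

    /** Assumed sample of sentence terminators; a real list would be longer. */
    static boolean isEos(char c) {
        return c == '.' || c == '!' || c == '?' || c == '\u3002';
    }

    /** Splits on non-alphanumerics and emits an EOS token per terminator. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                out.add(new Tok(word.toString(), false));
                word.setLength(0);
            }
            if (isEos(c)) {
                out.add(new Tok(String.valueOf(c), true));
            }
        }
        if (word.length() > 0) {
            out.add(new Tok(word.toString(), false));
        }
        return out;
    }

    /** Drops EOS tokens and sets posIncr = gap on the token that follows one. */
    static List<Tok> applyEosGap(List<Tok> in, int gap) {
        List<Tok> out = new ArrayList<>();
        boolean pendingGap = false;
        for (Tok t : in) {
            if (t.eos) {
                pendingGap = true;
                continue;
            }
            if (pendingGap) {
                t.posIncr = gap;
                pendingGap = false;
            }
            out.add(t);
        }
        return out;
    }

    /** Absolute positions implied by the increments, starting at 0. */
    static int[] positions(List<Tok> toks) {
        int[] pos = new int[toks.size()];
        int p = -1;
        for (int i = 0; i < toks.size(); i++) {
            p += toks.get(i).posIncr;
            pos[i] = p;
        }
        return pos;
    }
}
```

With a gap of 100, the words on either side of a sentence boundary end up
about 100 positions apart, so a phrase or small-slop query cannot match across
the boundary. Note that this naive terminator check also fires on
abbreviations such as "e.g.", which a real detector would need to handle.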

So I would, personally, not think of EOS detection as a Tokenizer in and of
itself, but rather as a capability of a Tokenizer (Standard?).

Shai

On Fri, Nov 27, 2009 at 8:07 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> The contrib/wordnet package contains an AnalyzerUtil class with a method
> that extracts sentences from text/String.  It is super-simplistic, so
> probably not very accurate, but I am wondering if *conceptually* it would
> make sense to have a Tokenizer that extracts sentences?  I suppose that
> means each Token would be a complete sentence.
>
> Would you say it makes sense to implement sentence detection/extraction as
> a Tokenizer?
>
> Thanks,
> Otis