Posted to solr-user@lucene.apache.org by Alex Cougarman <ac...@bwc.org> on 2013/04/15 15:48:42 UTC

Tokenize on paragraphs and sentences

Hi. Is it possible to search within paragraphs or sentences in Solr? The PatternTokenizerFactory uses regular expressions, but how can this be done with plain ASCII docs that don't have <p> tags (HTML), yet they're broken into paragraphs? Thanks.

Warm regards,
Alex



RE: Tokenize on paragraphs and sentences

Posted by Alex Cougarman <ac...@bwc.org>.
Thanks, Jack. Sorry, took me a while to reply :)
It sounds like sentence/paragraph level searches won't be easy.

Warm regards,
Alex 




Re: Tokenize on paragraphs and sentences

Posted by Jack Krupansky <ja...@basetechnology.com>.
Technically, yes, but you would have to do a lot of work yourself. You would
need a sentence/paragraph recognizer that inserts sentence and paragraph
markers, and a query parser that lets you use SpanNear and SpanNot (to
selectively exclude sentence or paragraph marks based on your granularity of
search).
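
To make the SpanNear/SpanNot idea concrete, here is a minimal sketch in plain Python. It is not Lucene or Solr code and uses no Lucene API. It only imitates the logic on a token list: two terms match if they are close enough together and no sentence marker sits between them. The marker token `_SENT_` and the function name `near_without_marker` are hypothetical names chosen for illustration.

```python
SENT_MARK = "_SENT_"  # hypothetical marker token, not a Solr/Lucene built-in


def near_without_marker(tokens, term_a, term_b, slop, marker=SENT_MARK):
    """Return True if term_a and term_b occur with at most `slop` tokens
    between them and no marker token in between.

    This imitates "SpanNear, excluding spans that contain a marker" on a
    plain list of tokens. It is an illustration of the idea only.
    """
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    for i in positions_a:
        for j in positions_b:
            if i == j:
                continue
            lo, hi = min(i, j), max(i, j)
            if hi - lo - 1 > slop:
                continue  # too far apart
            if marker not in tokens[lo + 1:hi]:
                return True  # close enough and inside one sentence
    return False


tokens = ["solr", "indexes", "text", SENT_MARK,
          "lucene", "scores", "hits", SENT_MARK]

same_sentence = near_without_marker(tokens, "solr", "text", slop=5)
across_sentences = near_without_marker(tokens, "text", "lucene", slop=5)
```

With this token list, `same_sentence` is true and `across_sentences` is false, because a marker separates "text" from "lucene". Swapping in a paragraph marker would give paragraph-level granularity in the same way.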

The LucidWorks Search query parser has SpanNot support (or at least did at 
one point in time), but no sentence/paragraph marking.

You could come up with some heuristic regular expressions for sentence and 
paragraph marks, like consecutive newlines for a paragraph and dot followed 
by white space for sentence (with some more heuristics for abbreviations.)
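
The heuristics just described could be sketched roughly as follows, in plain Python run before the text is sent to Solr. The marker strings `_SENT_` and `_PARA_`, the function names, and the small abbreviation list are all assumptions made for illustration. Paragraphs are split with a regular expression on blank lines; the "dot followed by white space" rule is applied per whitespace-separated word rather than as a single regex, to keep the abbreviation check simple.

```python
import re

# Hypothetical marker tokens; any string that cannot occur in the text works.
SENT_MARK = "_SENT_"
PARA_MARK = "_PARA_"

# A deliberately tiny abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(paragraph):
    """Split on a dot followed by whitespace, skipping known abbreviations."""
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        if word.endswith(".") and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final dot
        sentences.append(" ".join(current))
    return sentences


def mark_text(text):
    """Append a sentence marker after each sentence and a paragraph marker
    after each paragraph (paragraphs are separated by blank lines)."""
    out = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        for sentence in split_sentences(paragraph):
            out.append(sentence + " " + SENT_MARK)
        out.append(PARA_MARK)
    return " ".join(out)


sample = "Dr. Smith went home. He slept.\n\nNext day he woke."
marked = mark_text(sample)
```

This only handles sentences that end in a dot; question marks, exclamation marks, and abbreviations missing from the list would need further heuristics.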

Or you could have an update processor do the marking.

-- Jack Krupansky
