
Posted to solr-user@lucene.apache.org by Caleb Land <ca...@gmail.com> on 2010/01/05 20:05:18 UTC

Basic sentence parsing with the regex highlighter fragmenter

Hello,
I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
basic sentences, and I'm running into a problem.

I'm using the default regex specified in the example solr configuration:

[-\w ,/\n\"']{20,200}

But I am using a larger fragment size (140) with a slop of 1.0.
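
For reference, a rough sketch of what those settings mean, assuming the standard Solr highlighting request parameters (hl.fragmenter, hl.fragsize, hl.regex.slop, hl.regex.pattern). The field name "text" is a placeholder that does not come from this thread, and fragment_bounds is a hypothetical helper based on the documented meaning of slop (a factor by which fragments may stray from the ideal size), not on Solr's source:

```python
from urllib.parse import urlencode


def fragment_bounds(fragsize, slop):
    """Approximate min/max fragment length the regex fragmenter may produce."""
    low = max(0, round(fragsize * (1.0 - slop)))
    high = round(fragsize * (1.0 + slop))
    return low, high


# Request parameters matching the setup described in this message.
# "text" is an assumed field name used only for illustration.
params = {
    "q": "Nulla",
    "hl": "true",
    "hl.fl": "text",
    "hl.fragmenter": "regex",
    "hl.fragsize": 140,
    "hl.regex.slop": 1.0,
    "hl.regex.pattern": r"""[-\w ,/\n\"']{20,200}""",
}

print(fragment_bounds(140, 1.0))
print(urlencode(params))
```

With a slop of 1.0 the allowed range is very wide, so the fragment size itself is unlikely to be what clips the punctuation.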

Given the passage:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
vitae, molestie quis nunc.

When I search for "Nulla" (the first word of the second sentence) and grab
the first highlighted snippet, this is what I get:

. <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus

As you can see, there's a leading period from the previous sentence and the
period from the current sentence is missing.

I understand this regex isn't that advanced, but I've tried everything I can
think of, regex-wise, to get this to work, and I always end up with this
problem.

For example, I've tried: \w[^.!?]{0,200}[.!?]

Which seems like it should include the ending punctuation, but it doesn't,
so I think I'm missing something.

Does anybody know a regex that works?
-- 
Caleb Land
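
As a quick sanity check, both patterns from the message above can be run against the raw passage outside Solr. This is only a sketch using Python's re module rather than java.util.regex (the two should agree for these simple patterns), and it says nothing about what the highlighter does with the matches afterwards:

```python
import re

passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a "
    "ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue "
    "vitae, molestie quis nunc."
)

# Default pattern from the example Solr configuration: no "." in the class,
# so a match can never contain sentence-ending punctuation.
default_pattern = re.compile(r"""[-\w ,/\n\"']{20,200}""")

# The alternative tried above: a word character, then anything up to and
# including the first sentence terminator.
sentence_pattern = re.compile(r"\w[^.!?]{0,200}[.!?]")

default_matches = default_pattern.findall(passage)
sentence_matches = sentence_pattern.findall(passage)

for match in sentence_matches:
    print(repr(match))
```

On the raw text the second pattern does capture each sentence together with its terminator, which suggests the punctuation is lost somewhere after the regex has matched.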

Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Regular expressions won't work well for sentence boundary detection.
If you want something free, you could plug in OpenNLP or GATE. LingPipe is another option, but it isn't free.

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch




Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Caleb Land <re...@gmail.com>.

On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson <er...@gmail.com> wrote:

> Hmmm, I'll have to defer to the highlighter experts here....
>
>
I've looked at the source code for the highlighter, and I think I know
what's going on. I haven't had time to play with this yet, so I could be
wrong, but this is my impression.

The highlighter builds a highlighted fragment by reading tokens in, and
appending their contents to a string buffer.

Now, every time a token is appended to a fragment, it adds the "whitespace"
between the previous token and the current token (this isn't strictly
whitespace, but really anything that was removed from the source text by the
tokenizer, like punctuation etc.).

I believe what is happening in my case is that the leading ". " is the
"whitespace" between the last token (of the previous fragment) and the first
token of the current fragment.

And, of course, the trailing punctuation is being cut off because
the fragment builder doesn't APPEND "whitespace" after the last token, it
just prepends this "whitespace".

You can see the code that does this, from
Highlighter#getBestTextFragments (line 233 in Lucene 3.0.0), here:

http://gist.github.com/271515
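
The behaviour described above can be modelled in a few lines. This is a simplified sketch of that description, not the Lucene code: tokenize and build_fragment are hypothetical helpers, and the tokenizer here simply drops all punctuation.

```python
import re


def tokenize(text):
    """Return (term, start, end) for each word; punctuation is dropped."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def build_fragment(text, tokens, first, last, prev_end):
    """Assemble tokens[first..last], prepending the gap before each token.

    prev_end is the end offset of the token just before the fragment, so the
    first gap is whatever the tokenizer removed between the two fragments.
    Nothing is appended after the last token.
    """
    parts = []
    for _term, start, end in tokens[first:last + 1]:
        parts.append(text[prev_end:start])  # gap text, e.g. ". " or " "
        parts.append(text[start:end])       # the token itself
        prev_end = end
    return "".join(parts)


text = "adipiscing elit. Nulla a neque a ipsum accumsan iaculis at id lacus. Sed magna"
tokens = tokenize(text)
first = next(i for i, t in enumerate(tokens) if t[0] == "Nulla")
last = next(i for i, t in enumerate(tokens) if t[0] == "lacus")

fragment = build_fragment(text, tokens, first, last, tokens[first - 1][2])
print(repr(fragment))
```

Under these assumptions the fragment starts with ". " and ends without a period, the same shape as the snippet Solr returns.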

If I do what I said in my second email (add preserveOriginal=1 to the
WordDelimiterFilter), things work because the ending punctuation is stored
with the token, and just the real whitespace is prepended by this code.

I'm not sure what the solution is, but currently I'm just trimming leading
punctuation + a space off on the client side, and leaving the sentence
terminator-less.
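
A minimal sketch of that client-side cleanup, assuming snippets arrive as plain strings with the <em> markup already in place. The function name and the exact set of characters stripped are illustrative choices:

```python
import re

# Leading whitespace and sentence punctuation left over from the previous
# sentence. Markup such as "<em>" is not matched, so highlights are kept.
_LEADING_JUNK = re.compile(r"^[\s.!?,;:]+")


def trim_leading_punctuation(snippet):
    """Remove stray punctuation and spaces from the start of a snippet."""
    return _LEADING_JUNK.sub("", snippet)


print(trim_leading_punctuation(". <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus"))
```

This only hides the leading period. The missing terminator at the end of the sentence cannot be restored this way, because it is not in the snippet at all.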

-- 
Caleb Land

Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, I'll have to defer to the highlighter experts here....

Erick

Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Caleb Land <re...@gmail.com>.

I've looked at the docs/source for WordDelimiterFilter, and I understand
what it does now.

Here is my configuration:

http://gist.github.com/270590

I've tried the StandardTokenizerFactory instead of the
WhitespaceTokenizerFactory, but I get the same problem as before: the
period from the previous sentence shows up and the period from the current
sentence is cut off of highlighter fragments.

I've tried the WhitespaceTokenizer with the StandardFilter, and this kinda
works, but to match a word at the end of a sentence, you need to search for
the period at the end of the sentence (the period is being tokenized along
with the word).

In any case, if I remove the WordDelimiterFilter or add preserveOriginal="1",
everything seems to work. (If I remove the WordDelimiterFilter, the periods
are indexed with the word they're connected to, and searching for those
words doesn't match unless the user includes the period)
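
To illustrate that trade-off, here is a rough stand-in for the two kinds of token stream. It uses plain string splitting rather than the actual Solr tokenizers and filters:

```python
passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a "
    "ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue "
    "vitae, molestie quis nunc."
)

# Whitespace-only tokenization keeps punctuation attached to the word.
whitespace_tokens = passage.split()

# Stripping punctuation makes the bare word searchable, but the period
# then no longer belongs to any token.
stripped_tokens = [token.strip(".,!?") for token in whitespace_tokens]

print("lacus" in whitespace_tokens)
print("lacus." in whitespace_tokens)
print("lacus" in stripped_tokens)
```

Keeping both forms of the token, which is roughly what preserveOriginal="1" appears to do, would make the bare word searchable while leaving the period inside a token.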

I'm trying to go through the code to understand how this works.

-- 
Caleb Land

Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, the name WordDelimiterFilterFactory might be leading
you astray. Its purpose isn't to break things up into "words"
that have anything to do with grammatical rules. Rather, its
purpose is to break up strings of funky characters into
searchable stuff. see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

In the grammatical sense, PowerShot should just be
PowerShot, not power shot (which is what WordDelimiterFilterFactory
gives you, options permitting). So I think you probably want
one of the other analyzers....
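
A very rough approximation of the splitting described above, for illustration only. The real filter has many more rules and options, and split_on_case_change is a hypothetical helper:

```python
import re


def split_on_case_change(token):
    """Split a token into parts at case changes, digits and delimiters."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)


parts = split_on_case_change("PowerShot")
print([part.lower() for part in parts])
```

The split is purely about character classes inside a single token and knows nothing about sentences or grammar.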

Have you tried any other analyzers? StandardAnalyzer might be
more friendly....

HTH
Erick

Re: Basic sentence parsing with the regex highlighter fragmenter

Posted by Caleb Land <ca...@gmail.com>.

I've tracked this problem down to the fact that I'm using the
WordDelimiterFilter. I don't quite understand what's happening, but if I
add preserveOriginal="1" as an option, everything looks fine. I think it has
to do with the period being stripped in the token stream.

-- 
Caleb Land