Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/10/06 00:24:30 UTC

PatternReplaceFilterFactory creating empty string as a term

  I am developing a new schema. It has a pattern filter that trims 
leading and trailing punctuation from terms.

<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>

It is resulting in empty terms, because there are situations in the 
analyzer stream where a term happens to be composed of nothing but 
punctuation. This problem is not happening in production. I want those 
terms removed.

This blank term makes the top of the list as far as term frequency. Out 
of 7.6 million documents, 4.8 million of them have it. From TermsComponent:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19106</int>
</lst>
<lst name="terms">
<lst name="catchall">
<int name="">4830648</int>
<int name="usa">1863264</int>
<int name="photo">1743551</int>
<int name="new">1544314</int>
<int name="de">1455691</int>
<int name="during">1412551</int>
<int name="los">1408855</int>
<int name="united">1368594</int>
<int name="2009">1271103</int>
<int name="la">1204441</int>
</lst>
</lst>
</response>

Is there any existing way to remove empty terms during analysis? I tried 
TrimFilterFactory but that made no difference. Is this a bug in 
PatternReplaceFilterFactory?

Shawn


Re: PatternReplaceFilterFactory creating empty string as a term

Posted by Robert Muir <rc...@gmail.com>.
Alternatively, you can use "+" instead of "*" in your regular expressions so
that you don't match them at all...

I think the PatternTokenizer is doing the right thing, if your expression
says that a blank term is acceptable.
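
One possible reading of the "+" suggestion, offered as an untested sketch and not necessarily what was meant above: require the captured middle group to contain at least one character. The pattern below is an assumption, not something from the original schema. With it, a term such as "(hello)," should still be trimmed to "hello", while a term made only of punctuation should collapse to a single punctuation character instead of an empty string. The term is still emitted either way, since this filter rewrites a token's text but does not drop the token.

```xml
<!-- Sketch only: (.+?) must match at least one character, so the
     replacement text is never empty for a non-empty input term. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^\p{Punct}*(.+?)\p{Punct}*$"
        replacement="$1"
/>
```

If punctuation-only terms should disappear entirely, a pattern change alone is probably not enough and a separate filtering step is still needed.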

On Tue, Oct 5, 2010 at 6:39 PM, Markus Jelsma <ma...@buyways.nl> wrote:

> Actually, it might be a good idea to add an optional setting to the
> PatternTokenizer that doesn't emit blank terms. Perhaps an
> allowBlanks="false" would be a pleasant addition to the PatternTokenizer
> so an additional LengthFilter can be left out and thus spare CPU cycles and
> some memory.
>
> -----Original message-----
> From: Markus Jelsma <ma...@buyways.nl>
> Sent: Wed 06-10-2010 00:29
> To: solr-user@lucene.apache.org;
> Subject: RE: PatternReplaceFilterFactory creating empty string as a term
>
> I'm not sure if this is the best approach but a LengthFilter will stop
> blank terms.
>
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory


-- 
Robert Muir
rcmuir@gmail.com

RE: PatternReplaceFilterFactory creating empty string as a term

Posted by Markus Jelsma <ma...@buyways.nl>.
Actually, it might be a good idea to add an optional setting to the PatternTokenizer that doesn't emit blank terms. Perhaps an allowBlanks="false" would be a pleasant addition to the PatternTokenizer so an additional LengthFilter can be left out and thus spare CPU cycles and some memory.
 


Re: PatternReplaceFilterFactory creating empty string as a term

Posted by Shawn Heisey <so...@elyograg.org>.
  On 10/5/2010 6:28 PM, Markus Jelsma wrote:
> I'm not sure if this is the best approach but a LengthFilter will stop blank terms.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

Two people with the answer I needed.  Thank you!

Shawn


RE: PatternReplaceFilterFactory creating empty string as a term

Posted by Markus Jelsma <ma...@buyways.nl>.
I'm not sure if this is the best approach but a LengthFilter will stop blank terms.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
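
A minimal sketch of the LengthFilter approach, assuming the pattern filter from the original post sits in an analyzer chain like the one below. The tokenizer and the max value are placeholders chosen for illustration and are not taken from the thread; only min="1" matters for dropping empty terms.

```xml
<analyzer>
  <!-- Tokenizer is an assumption; use whatever the field already has. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
  />
  <!-- Placed after the pattern filter so it sees the trimmed text.
       Tokens shorter than min or longer than max are discarded;
       max="255" is an arbitrary illustrative upper bound. -->
  <filter class="solr.LengthFilterFactory" min="1" max="255"/>
</analyzer>
```

The order matters here: if the length filter ran before the pattern filter, the punctuation-only terms would still have a non-zero length when checked and would pass through.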
 


Re: PatternReplaceFilterFactory creating empty string as a term

Posted by Shawn Heisey <so...@elyograg.org>.
  On 10/5/2010 10:38 PM, Shawn Heisey wrote:
> That fixed it.  Thank you.  If I have time, I'll peek at the 
> patternfilter source code and see if I can figure out how to make it 
> optionally remove empty terms. For me, it's not terribly critical, 
> because my database is the bottleneck in my indexing process, so Solr 
> is much faster than the data coming in.  For someone else, the time 
> involved in another analyzer step might actually matter.

I looked into the code for 1.4.1.  PatternReplaceFilter works by 
overriding incrementToken, and I could not figure out a way in that 
context to remove the token.  Looking into how other things remove 
tokens, I found that LengthFilter is using a deprecated API call, and 
RemoveDuplicates does it by overriding TokenStream.  Changing how 
PatternReplaceFilter is implemented is perhaps more than I am prepared 
to tackle.

If someone knows a way within incrementToken to remove a token, let me 
know and I will give it a try.  I will also look into the branch_3x code 
and see if I can find something helpful there.

Thanks,
Shawn


Re: PatternReplaceFilterFactory creating empty string as a term

Posted by Shawn Heisey <so...@elyograg.org>.
  On 10/5/2010 6:34 PM, Ken Krugler wrote:
>
>> Is there any existing way to remove empty terms during analysis? I 
>> tried TrimFilterFactory but that made no difference.
>
> You could use LengthFilterFactory to restrict terms to being at least 
> one character long.
>
>> Is this a bug in PatternReplaceFilterFactory?
>
> No, I don't believe so. PatternReplaceFilterFactory creates a 
> PatternReplaceFilter, and the JavaDoc for that says:
>> Note: Depending on the input and the pattern used and the input 
>> TokenStream, this TokenFilter may produce Tokens whose text is the 
>> empty string.
>>
>

That fixed it.  Thank you.  If I have time, I'll peek at the 
patternfilter source code and see if I can figure out how to make it 
optionally remove empty terms. For me, it's not terribly critical, 
because my database is the bottleneck in my indexing process, so Solr is 
much faster than the data coming in.  For someone else, the time 
involved in another analyzer step might actually matter.

Shawn


Re: PatternReplaceFilterFactory creating empty string as a term

Posted by Ken Krugler <kk...@transpac.com>.
On Oct 5, 2010, at 6:24pm, Shawn Heisey wrote:

> I am developing a new schema. It has a pattern filter that trims  
> leading and trailing punctuation from terms.
>
> <filter class="solr.PatternReplaceFilterFactory"
> pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
> replacement="$2"
> />
>
> It is resulting in empty terms, because there are situations in the  
> analyzer stream where a term happens to be composed of nothing but  
> punctuation. This problem is not happening in production. I want  
> those terms removed.
>
> This blank term makes the top of the list as far as term frequency.  
> Out of 7.6 million documents, 4.8 million of them have it. From  
> TermsComponent:
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">19106</int>
> </lst>
> <lst name="terms">
> <lst name="catchall">
> <int name="">4830648</int>

[snip]

> Is there any existing way to remove empty terms during analysis? I  
> tried TrimFilterFactory but that made no difference.

You could use LengthFilterFactory to restrict terms to being at least  
one character long.

> Is this a bug in PatternReplaceFilterFactory?

No, I don't believe so. PatternReplaceFilterFactory creates a  
PatternReplaceFilter, and the JavaDoc for that says:
> Note: Depending on the input and the pattern used and the input  
> TokenStream, this TokenFilter may produce Tokens whose text is the  
> empty string.
>

-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225




--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g