Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/08/31 16:23:45 UTC

Stripping leading/trailing punctuation with SOLR-1653

  I am trying to use PatternReplaceCharFilterFactory (SOLR-1653) to 
strip leading and trailing punctuation from terms.  It's not working.  
This was previously discussed here as part of something I was trying 
with WordDelimiterFilterFactory, but I think it needs its own thread now.

I seem to be having two problems, based on what I can see.  The first 
problem is that analysis shows the PatternReplaceCharFilterFactory 
applied in a different order than I have configured it - it's going 
first.  The other problem is that it's eating all my text, leaving any 
fields of that type (which is most of my index!) completely empty.  A 
screenshot showing the issue:

http://www.elyograg.org/punct_analysis.png

Here's my entire fieldType definition, but the same thing happens when I 
replace the pattern with a very basic "([0-9]*)(.*)([0-9]*)" and the 
input value with "9dummy".

<fieldType name="text" class="solr.TextField" sortMissingLast="true"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
                replaceWith="$2"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
    />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
                replaceWith="$2"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"
    />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Am I doing something wrong, or is this a bug?

Thanks,
Shawn

Re: Stripping leading/trailing punctuation with SOLR-1653

Posted by Shawn Heisey <el...@elyograg.org>.

  On 8/31/2010 8:49 AM, Shawn Heisey wrote:
>  I believe I may have solved this.  After a more careful reading of 
> SOLR-1653, I noticed that they referred to another filter.  I changed 
> my configuration from solr.PatternReplaceCharFilterFactory to 
> solr.PatternReplaceFilterFactory and updated the XML syntax 
> appropriately, and it looks OK now.  This filter is not mentioned on 
> the wiki page dealing with analyzers, which is why I did not use it 
> from the start.  When I searched that page for regex, the CharFilter 
> was the only one that came up.

Final working config, for anyone else who wants to do this:

<filter class="solr.PatternReplaceFilterFactory"
           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
           replacement="$2"
         />

I haven't yet tried it, but this filter seems to exist in the Solr 1.4.1 
war file, so it should work there too.
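
To see why the anchors and the reluctant (.*?) matter, here is a minimal 
Java sketch. It assumes the token filter applies the regex to each 
whitespace-separated token with java.util.regex semantics and replaces 
every match. The class and method names (PunctStripDemo, stripTokens) are 
made up for illustration and are not part of Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PunctStripDemo {
    // Pattern from the first attempt: unanchored, greedy middle group.
    static final Pattern GREEDY =
            Pattern.compile("(\\p{Punct}*)(.*)(\\p{Punct}*)");

    // Pattern from the working config: anchored, reluctant middle group.
    static final Pattern ANCHORED =
            Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    /**
     * Splits the text on whitespace (a rough stand-in for a whitespace
     * tokenizer) and replaces every match in each token with group 2.
     */
    static List<String> stripTokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            out.add(pattern.matcher(token).replaceAll("$2"));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "(hello!) \"quoted,\" U.S.A.";
        System.out.println(stripTokens(GREEDY, input));
        System.out.println(stripTokens(ANCHORED, input));
    }
}
```

With the greedy pattern, (.*) swallows the trailing punctuation before the 
third group gets a chance, so "(hello!)" only loses its leading "(".  With 
the anchored pattern, (.*?) grows only as far as needed for the trailing 
group to reach the end of the token, so "(hello!)" becomes "hello" and 
"U.S.A." keeps its internal dots but loses the last one.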

Thanks,
Shawn

Re: Stripping leading/trailing punctuation with SOLR-1653

Posted by Shawn Heisey <so...@elyograg.org>.

  I believe I may have solved this.  After a more careful reading of 
SOLR-1653, I noticed that they referred to another filter.  I changed my 
configuration from solr.PatternReplaceCharFilterFactory to 
solr.PatternReplaceFilterFactory and updated the XML syntax 
appropriately, and it looks OK now.  This filter is not mentioned on the 
wiki page dealing with analyzers, which is why I did not use it from the 
start.  When I searched that page for regex, the CharFilter was the only 
one that came up.

On 8/31/2010 8:29 AM, Shawn Heisey wrote:
>  I didn't give any particulars about my setup, sorry about that.  This 
> is branch_3x rev 990625, downloaded two days ago.  It passed all unit 
> tests.
>
> Linux idxst9-b 2.6.32-bpo.5-amd64 #1 SMP Fri Jun 11 08:42:31 UTC 2010 
> x86_64 GNU/Linux
>
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> On 8/31/2010 8:23 AM, Shawn Heisey wrote:
>>  I am trying to use PatternReplaceCharFilterFactory (SOLR-1653) to 
>> strip leading and trailing punctuation from terms.  It's not 
>> working.  This was previously discussed here as part of something I 
>> was trying with WordDelimiterFilterFactory, but I think it needs its 
>> own thread now.

Re: Stripping leading/trailing punctuation with SOLR-1653

Posted by Shawn Heisey <so...@elyograg.org>.

  I didn't give any particulars about my setup, sorry about that.  This 
is branch_3x rev 990625, downloaded two days ago.  It passed all unit tests.

Linux idxst9-b 2.6.32-bpo.5-amd64 #1 SMP Fri Jun 11 08:42:31 UTC 2010 
x86_64 GNU/Linux

Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

RE: Memcache for Solr

Posted by Markus Jelsma <ma...@buyways.nl>.

Hi,

On a restaurant index website, we used Memcache only to store the generated HTML facet list for q=*. This cached object was only used when no additional search parameters were specified. It was quite useful because the facet list was always present and only changed if real search parameters were specified.

We found it wasn't feasible to cache arbitrary result sets: there would be far too many of them, most would probably never be reused, and there is the problem of invalidating cached result sets. I'd rather rely on Solr's filter cache instead.

From that point of view, it's only feasible to cache generated objects (HTML or whatever format) that you know are being requested many times. That is easy to implement and doesn't fill memory with entries that won't be reused anyway.
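
As a rough illustration of that approach, here is a minimal Java sketch. It is not the code we ran: the class name FacetHtmlCache, the cache key and the in-memory map standing in for a memcached client are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical sketch: cache rendered facet HTML only for the bare q=* request. */
class FacetHtmlCache {
    // Stand-in for a memcached client; a real setup would call its get/set instead.
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private static final String DEFAULT_KEY = "facets:default";

    /** True when the request is q=* with no other search parameters. */
    static boolean isDefaultQuery(Map<String, String> params) {
        return params.size() == 1 && "*".equals(params.get("q"));
    }

    /** Returns facet HTML, rendering it at most once for the default query. */
    String facetHtml(Map<String, String> params, Supplier<String> renderer) {
        if (!isDefaultQuery(params)) {
            // Arbitrary queries are rendered every time and never cached.
            return renderer.get();
        }
        return store.computeIfAbsent(DEFAULT_KEY, key -> renderer.get());
    }
}
```

Only one key is ever stored, so there is nothing to size. Invalidation would come down to deleting that key when the index changes, which the sketch leaves out.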

 

Cheers,
 
-----Original message-----
From: Hitendra Molleti <hi...@itp.com>
Sent: Tue 31-08-2010 16:38
To: solr-user@lucene.apache.org; 
Subject: Memcache for Solr

Hi,

We were looking at implementing Memcache for Solr.

Can someone who has already implemented this let us know whether it is a
good option, i.e. how effective memcache is compared to Solr's
internal cache?

Also, are there any downsides to it, and is it difficult to implement?

Thanks

Hitendra

Re: Memcache for Solr

Posted by Glen Newton <gl...@gmail.com>.

Apologies Chris: my mistake.
-Glen

On 31 August 2010 23:27, Chris Hostetter <ho...@fucit.org> wrote:
>
> : ?
> : The second post was relevant to the original post.
> : And even dealt with some of the questions asked in the original:
>
> The first msg with subject "Memcache for Solr" was a thread-jack of
> an existing thread "Stripping leading/trailing punctuation with SOLR-1653"
>
> http://lucene.472066.n3.nabble.com/Stripping-leading-trailing-punctuation-with-SOLR-1653-td1394514.html#a1394514
>
> That was the msg i replied to w/ a request to cease thread hijacking.
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss      ...  Stump The Chump!
>

Re: Memcache for Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: ?
: The second post was relevant to the original post.
: And even dealt with some of the questions asked in the original:

The first msg with subject "Memcache for Solr" was a thread-jack of 
an existing thread "Stripping leading/trailing punctuation with SOLR-1653" 

http://lucene.472066.n3.nabble.com/Stripping-leading-trailing-punctuation-with-SOLR-1653-td1394514.html#a1394514

That was the msg i replied to w/ a request to cease thread hijacking.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!

Re: Memcache for Solr

Posted by Glen Newton <gl...@gmail.com>.

?
The second post was relevant to the original post.
And even dealt with some of the questions asked in the original:
Q > are there any down sides to it and difficult to implement....
A > We found it wasn't feasible to cache arbitrary result sets...
?

-glen

On 31 August 2010 15:11, Chris Hostetter <ho...@fucit.org> wrote:
>
> : References: <4C...@elyograg.org>
> : In-Reply-To: <4C...@elyograg.org>
> : Subject: Memcache for Solr
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
>
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss      ...  Stump The Chump!
>

RE: Memcache for Solr

Posted by Hitendra Molleti <hi...@itp.com>.

Apologies, did not realize it.

Thanks

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Tuesday, August 31, 2010 11:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Memcache for Solr

: References: <4C...@elyograg.org>
: In-Reply-To: <4C...@elyograg.org>
: Subject: Memcache for Solr

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!

Re: Memcache for Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: References: <4C...@elyograg.org>
: In-Reply-To: <4C...@elyograg.org>
: Subject: Memcache for Solr

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!

Memcache for Solr

Posted by Hitendra Molleti <hi...@itp.com>.

Hi,

We were looking at implementing Memcache for Solr.

Can someone who has already implemented this let us know whether it is a
good option, i.e. how effective memcache is compared to Solr's
internal cache?

Also, are there any downsides to it, and is it difficult to implement?

Thanks

Hitendra