Posted to solr-user@lucene.apache.org by "samuele.mattiuzzo" <sa...@gmail.com> on 2011/06/29 10:28:19 UTC

Regex replacement not working!

Hi, I have these lines in my schema.xml that should do a replacement,
but they don't work!

    <fieldType name="salary_max_text" class="solr.TextField" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
                    replacement="$2"/>
      </analyzer>
    </fieldType>


I need it to extract only the numbers from an arbitrary string. The strings
can be anything: letters only (in which case the result should be an empty
string) or letters plus numbers. The numbers can be in any of these formats:

17000 --> ok
17,000 --> should be replaced with 17000
17.000 --> should be replaced with 17000
17k --> should be replaced with 17000

How can I accomplish this?

--
View this message in context: http://lucene.472066.n3.nabble.com/Regex-replacement-not-working-tp3120748p3120748.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Regex replacement not working!

Posted by Michael Kuhlmann <so...@kuli.org>.
On 29.06.2011 12:30, samuele.mattiuzzo wrote:
>     <fieldType name="salary_min_text" class="solr.TextField" >
>       <analyzer type="index">
...

> this is the "final" version of my schema part, but what i get is this:
> 
> 
> <doc>
> <float name="score">1.0</float>
> <str name="salary">Negotiable</str>
> <str name="salary_max">Negotiable</str>
> <str name="salary_min">Negotiable</str>
> </doc>
...


The mistake is assuming that the filter is applied to the returned result.
It is not. Index-time filters only affect the index (as the name says),
not the stored contents.

Therefore, if you have copyFields that are stored, they'll always return
the same value as the original field.

Try inspecting your index data with Luke or the admin console. Then
you'll see whether your regex is applied.

Greetings,
Kuli

Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> too bad it is still in todo, that's
> why i was asking some for some tips on
> writing, compiling, registration, calling...

Here is general information on customizing Solr via plugins:
http://wiki.apache.org/solr/SolrPlugins

Here are the registration steps and a code example:
http://wiki.apache.org/solr/UpdateRequestProcessor

Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
Too bad it is still a TODO; that's why I was asking for some tips on
writing, compiling, registering, and calling it...



Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> ok, last question on the
> UpdateProcessor: can you please give me the steps to
> implement my own?
> i mean, i can push my custom processor in solr's code, and
> then what?
> i don't understand how i have to change the solrconf.xml
> and how can i bind
> that to the updater i just wrotea
> and also i don't understand how i do have to change the
> schema.xml
> 
> i'm sorry for this question, but i started working on solr
> 5 days ago and
> for some things i really need a lot of documentation, and
> this isn't fully
> covered anywhere

"Implementing a conditional copyField" example is a good place start. You can use it as a template. 

You don't need to modify the solr source code for this. You can write your class, compile it, put the resulting jar into solrHome/lib directory. It is explained here, how to register your new update processor in solrconfig.xml

http://wiki.apache.org/solr/SolrPlugins#UpdateRequestProcessorFactory  
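
As a rough sketch of that registration step, the solrconfig.xml entries might look like the fragment below. The factory class com.example.SalaryUpdateProcessorFactory and the chain name salary-fields are hypothetical names chosen for illustration. The request parameter is assumed to be update.chain (Solr 3.2 and later); earlier releases use update.processor, so check which one your version expects.

```xml
<!-- solrconfig.xml fragment (goes inside <config>) -->
<updateRequestProcessorChain name="salary-fields">
  <!-- hypothetical custom class from a jar in solrHome/lib, not part of Solr -->
  <processor class="com.example.SalaryUpdateProcessorFactory"/>
  <!-- standard processors that log and then actually index the document -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <!-- assumed parameter name; "update.processor" before Solr 3.2 -->
    <str name="update.chain">salary-fields</str>
  </lst>
</requestHandler>
```

Nothing in schema.xml has to change for the chain itself; the processor only needs the target fields to exist.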

Re: Regex replacement not working!

Posted by Adam Estrada <es...@gmail.com>.
I have had the same problems with regex and I went with the regular pattern
replace filter rather than the charfilter. When I added it to the very end
of the chain, only then would it work...I am on Solr 3.2. I have also
noticed that the HTML filter factory is not working either. When I dump the
field that it's supposed to be working on, all the hyperlinks and everything
that you would expect to be stripped are still present.
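
A minimal sketch of such a chain, assuming "the regular pattern replace filter" means solr.PatternReplaceFilterFactory placed as the last token filter. The field type name salary_max_filtered is hypothetical, and the pattern is the one posted elsewhere in this thread, copied without verification:

```xml
<fieldType name="salary_max_filtered" class="solr.TextField">
  <analyzer>
    <!-- keep the whole value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- pattern replacement runs last, on the already normalized token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
            replacement="$2"/>
  </analyzer>
</fieldType>
```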

Adam


Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
OK, last question on the UpdateProcessor: can you please give me the steps to
implement my own?
I mean, I can push my custom processor into Solr's code, and then what?
I don't understand how I have to change solrconfig.xml and how I can bind
that to the updater I just wrote.
I also don't understand how I have to change schema.xml.

I'm sorry for this question, but I started working on Solr 5 days ago and
for some things I really need a lot of documentation, and this isn't fully
covered anywhere.


Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> my goal is/was storing the value into
> the field, and i get i have to create
> my Update handler.
> 
> i was trying to use query with salary_min:[100 TO 200] and
> it's actually
> working... since i just need it to search, i'll stay with
> this solution
> 
> is the [100 TO 200] a performance killer? i remember
> reading something
> around, but cannot find it again...

Please be aware that this range query operates on strings, so it will return unwanted results. String ordering and integer ordering are different: compared as strings, "1000" falls between "100" and "200".

If you are after range queries, you need to define the price_min and price_max fields as trie-based types (tint, tdouble, etc.) and populate them with the update processor or on the client side.
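
For illustration, a trie-based setup in schema.xml might look like the sketch below. The tint definition follows the one shipped in the Solr 3.x example schema; reusing the salary_min and salary_max field names from this thread is an assumption.

```xml
<!-- schema.xml fragment: numeric type suitable for range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>

<!-- values must already be plain integers when they reach Solr -->
<field name="salary_min" type="tint" indexed="true" stored="true"/>
<field name="salary_max" type="tint" indexed="true" stored="true"/>
```

With fields like these, salary_min:[100 TO 200] compares numbers rather than strings. The copyField directives from salary would presumably have to be removed, since free text such as "Negotiable" cannot be parsed as an integer.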

Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
My goal was to store the value in the field, and I understand that for this
I have to create my own update handler.

I was trying a query with salary_min:[100 TO 200] and it's actually
working... Since I just need it for searching, I'll stay with this solution.

Is the [100 TO 200] a performance killer? I remember reading something
about it, but cannot find it again...


Re: Regex replacement not working!

Posted by Juan Grande <ju...@gmail.com>.
Hi Samuele,

It's not clear to me whether your goal is to search on that field (for example,
"salary_min:[100 TO 200]") or to show the transformed field to the user (so
that the result of the regex replacement is included in the search results).

If your goal is to show the results to the user, then (as Ahmet said in a
previous mail) it won't work, because the content of the documents is stored
verbatim. The analysis only affects the way that documents are searched.

If your goal is to search, could you please show us the query that you're
using to test the use case?

Thanks!

*Juan*




Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
OK, but I'm not applying the filtering on the copyFields.
This is how my schema looks:

<field name="salary" type="text" indexed="true" stored="true" />
<field name="salary_min" type="salary_min_text" indexed="true" stored="true" />
<field name="salary_max" type="salary_max_text" indexed="true" stored="true" />

<copyField source="salary" dest="salary_min" />
<copyField source="salary" dest="salary_max" />

and the two datatypes defined before. That's why I thought I could first use
"copyField" to copy the value and then index the copies with the filtering
from my two datatypes...


Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> i have the string "You may earn 25k
> dollars per week" stored in the field
> "salary"
> 
> i'm using 2 copyfields "salary_min" and "salary_max" with
> source in "salary"
> with those 2 datatypes 
> 
> salary is "text"
> salary_min is "salary_min_text"
> salary_max is "salary_max_text"
> 
> so, i was expecting this:
> 
> solr updates its index
> solr copies the value from salary to salary_min and applies
> the value with
> the regex
> solr copies the value from salary to salary_max and applies
> the value with
> the regex
> 
> 
> but it's not working, it copies the value from one field to
> another, but the
> filter isn't applied, even if it's working as you could
> see

Okay, that makes sense. copyField just copies the content. It has nothing to do with analyzers. Two solutions come to mind.

1-) If you are using the data import handler, I think (I am not good with regex) you can use the regex transformer to populate these two fields.

http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

2-) If not, you can populate these two fields in a custom UpdateRequestProcessor. There is an example you can modify as a starting point here:

http://wiki.apache.org/solr/UpdateRequestProcessor
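
If DataImportHandler is in use, the first option could be sketched roughly as follows. The entity name, the query, and the simplified regex are all assumptions for illustration; regex, sourceColName, and replaceWith are the attributes documented for RegexTransformer.

```xml
<!-- data-config.xml fragment: hypothetical entity and query -->
<entity name="job" transformer="RegexTransformer"
        query="select id, salary from jobs">
  <!-- keep only the first run of digits found in the salary text -->
  <field column="salary_min" sourceColName="salary"
         regex="^\D*(\d+).*$" replaceWith="$1"/>
</entity>
```

With this simplified pattern, a value such as "£125 to £150 per day" should become "125", while a value with no digits at all (for example "Negotiable") does not match and is left unchanged, so it would still need separate handling.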

Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
I have the string "You may earn 25k dollars per week" stored in the field
"salary".

I'm using 2 copyFields, "salary_min" and "salary_max", with "salary" as the
source, with those 2 datatypes:

salary is "text"
salary_min is "salary_min_text"
salary_max is "salary_max_text"

So I was expecting this:

Solr updates its index
Solr copies the value from salary to salary_min and applies the regex to
the value
Solr copies the value from salary to salary_max and applies the regex to
the value


But it's not working: it copies the value from one field to another, but the
filter isn't applied, even though the regex itself works, as you could see.



Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> Index Analyzer
> org.apache.solr.analysis.KeywordTokenizerFactory
> {luceneMatchVersion=LUCENE_31}
> position    1
> term text    £22000 - £25000 per annum +
> benefits
> startOffset    0
> endOffset    36
> 
> 
> org.apache.solr.analysis.PatternReplaceFilterFactory
> {replacement=$2,
> pattern=[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*,
> luceneMatchVersion=LUCENE_31}
> position    1
> term text    25000
> startOffset    0
> endOffset    36
> 
> 
> this is my output for the field salary_max, it seems to be
> working from the
> admin jsp interface

That's good to know. If you explain your final goal in detail, users can give better pointers. 

Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory
{luceneMatchVersion=LUCENE_31}
position	1
term text	£22000 - £25000 per annum + benefits
startOffset	0
endOffset	36


org.apache.solr.analysis.PatternReplaceFilterFactory {replacement=$2,
pattern=[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*,
luceneMatchVersion=LUCENE_31}
position	1
term text	25000
startOffset	0
endOffset	36


This is my output for the field salary_max; it seems to be working in the
admin JSP interface.


Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
>     <fieldType
> name="salary_min_text" class="solr.TextField" >
>       <analyzer type="index">
>         <charFilter
> class="solr.PatternReplaceCharFilterFactory"
> pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
> replacement="$1"/>
>         <tokenizer
> class="solr.KeywordTokenizerFactory"/>
>         <filter
> class="solr.LowerCaseFilterFactory" />
>         <filter
> class="solr.TrimFilterFactory" />
>       </analyzer>
>       <analyzer type="query">
>         <charFilter
> class="solr.PatternReplaceCharFilterFactory"
> pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
> replacement="$1"/>
>         <tokenizer
> class="solr.KeywordTokenizerFactory"/>
>         <filter
> class="solr.LowerCaseFilterFactory" />
>         <filter
> class="solr.TrimFilterFactory" />
>       </analyzer>
>     </fieldType>
> 
>     <fieldType name="salary_max_text"
> class="solr.TextField" >
>       <analyzer type="index">
>         <charFilter
> class="solr.PatternReplaceCharFilterFactory"
> pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
> replacement="$2"/>
>         <tokenizer
> class="solr.KeywordTokenizerFactory"/>
>         <filter
> class="solr.LowerCaseFilterFactory" />
>         <filter
> class="solr.TrimFilterFactory" />
>       </analyzer>
>       <analyzer type="query">
>         <charFilter
> class="solr.PatternReplaceCharFilterFactory"
> pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
> replacement="$2"/>
>         <tokenizer
> class="solr.KeywordTokenizerFactory"/>
>         <filter
> class="solr.LowerCaseFilterFactory" />
>         <filter
> class="solr.TrimFilterFactory" />
>       </analyzer>
>     </fieldType>
> 
> this is the "final" version of my schema part, but what i
> get is this:
> 
> 
> <doc>
> <float name="score">1.0</float>
> <str name="salary">Negotiable</str>
> <str name="salary_max">Negotiable</str>
> <str name="salary_min">Negotiable</str>
> </doc>
> <doc>
> <float name="score">1.0</float>
> <str name="salary">£7 to £8 per hour</str>
> <str name="salary_max">£7 to £8 per
> hour</str>
> <str name="salary_min">£7 to £8 per
> hour</str>
> </doc>
> <doc>
> <float name="score">1.0</float>
> <str name="salary">£125 to £150 per
> day</str>
> <str name="salary_max">£125 to £150 per
> day</str>
> <str name="salary_min">£125 to £150 per
> day</str>
> </doc>
> 
> which is not what i'm expecting... the regular expression
> works in
> http://www.fileformat.info/tool/regex.htm
> without any problem

I am not good with regular expressions, but the response always contains the untouched, un-analyzed version of the fields. You can visually test your fieldType/regex on the admin/analysis.jsp page. It shows the indexed terms step by step.

Re: Regex replacement not working!

Posted by "samuele.mattiuzzo" <sa...@gmail.com>.
    <fieldType name="salary_min_text" class="solr.TextField" >
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                    replacement="$1"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                    replacement="$1"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
    </fieldType>

    <fieldType name="salary_max_text" class="solr.TextField" >
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                    replacement="$2"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                    replacement="$2"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
    </fieldType>

This is the "final" version of my schema part, but what I get is this:


<doc>
<float name="score">1.0</float>
<str name="salary">Negotiable</str>
<str name="salary_max">Negotiable</str>
<str name="salary_min">Negotiable</str>
</doc>
<doc>
<float name="score">1.0</float>
<str name="salary">£7 to £8 per hour</str>
<str name="salary_max">£7 to £8 per hour</str>
<str name="salary_min">£7 to £8 per hour</str>
</doc>
<doc>
<float name="score">1.0</float>
<str name="salary">£125 to £150 per day</str>
<str name="salary_max">£125 to £150 per day</str>
<str name="salary_min">£125 to £150 per day</str>
</doc>

which is not what I'm expecting... The regular expression works at
http://www.fileformat.info/tool/regex.htm without any problem.


Re: Regex replacement not working!

Posted by Ahmet Arslan <io...@yahoo.com>.
> Hi, i have this bunch of lines in my
> schema.xml that should do a replacement
> but it doesn't work!
> 
>     <fieldType name="salary_max_text"
> class="solr.TextField"
> omitNorms="true">
>       <analyzer type="index">
>           <tokenizer
> class="solr.StandardTokenizerFactory"/>
>         <charFilter
> class="solr.PatternReplaceCharFilterFactory"
> pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
> replacement="$2"/>
>       </analyzer>
>     </fieldType>
> 

<charFilter> definitions must come before the <tokenizer> definition,
i.e., in this order:
<analyzer>
  <charFilter/>
  <tokenizer/>
  <filter/>
</analyzer>
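
Applied to the field type from the question, the reordering might look like the sketch below. Only the element order has changed; the pattern is copied as posted and is not verified here.

```xml
<fieldType name="salary_max_text" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- charFilters operate on the raw text, so they are declared first -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
                replacement="$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Note that this only changes the indexed terms; the stored value returned in search results remains the original text.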