
Posted to solr-user@lucene.apache.org by Andreas Owen <ao...@conx.ch> on 2013/09/05 18:03:35 UTC

charfilter doesn't do anything

I would like to filter/replace a word during indexing, but the char filter doesn't do anything and I don't get an error.

In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question: where can I specify that the expression is multiline, the way JavaScript lets me put /m at the end of the pattern?

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

Perfect. I tried it before, but always at the tail of the expression, with no effect. Thanks a lot. One last question: do you know how to keep the HTML comments from being filtered out before the transformer has done its work?



Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

Okay, I can repro the problem. Yes, it appears that the pattern replace char
filter does not default to dotall mode for pattern matching, so a dot stops at
line breaks and <body> on one line and </body> on another line cannot be
matched.

Now, whether that is by design or a bug or an option for enhancement is a
matter for some committer to comment on.

But the good news is that you can in fact set that mode in your pattern by
starting it with "(?s)" (the Java regex DOTALL flag, not the MULTILINE flag
"(?m)"), which means that dot accepts line break characters as well.
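
To see the effect outside Solr, here is a minimal sketch in plain Java. It assumes the char filter applies ordinary java.util.regex semantics to the whole field value; DotallDemo and keepBody are names made up for illustration and are not part of Solr.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // The pattern from the field type, without XML escaping and without flags.
    public static final Pattern DEFAULT =
        Pattern.compile("^.*<body>(.*)</body>.*$");

    // The same pattern with the embedded DOTALL flag, so '.' also matches \r and \n.
    public static final Pattern DOTALL =
        Pattern.compile("(?s)^.*<body>(.*)</body>.*$");

    // Replaces every match with capture group 1, like replacement="$1".
    public static String keepBody(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String oneLine = "abc <body>A test.</body> def";
        String multiLine = "<html>\r\nnav\r\n<body>\r\nkeep this\r\n</body>\r\nfooter</html>";

        // Single-line input works even without the flag.
        System.out.println(keepBody(DEFAULT, oneLine));                    // A test.

        // Without the flag nothing matches, so the input comes back unchanged.
        System.out.println(keepBody(DEFAULT, multiLine).equals(multiLine)); // true

        // With (?s) only the text between the body tags survives.
        System.out.println(keepBody(DOTALL, multiLine).trim());            // keep this
    }
}
```

Note that "(?m)" would not help here: it only changes where ^ and $ match and leaves the dot stopping at line breaks.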

So, here are my revised field types:

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_html_body_strip" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The first type accepts everything within <body>, including nested HTML 
formatting, while the latter strips nested HTML formatting as well.

The tokenizer will in fact strip out white space, but that happens after all 
character filters have completed.
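
The ordering can be approximated in plain Java, as a rough sketch only. The tag-stripping regex and the split on non-alphanumeric characters are crude, assumed stand-ins for HTMLStripCharFilterFactory and StandardTokenizerFactory, and AnalysisOrderSketch and analyze are invented names; the real Solr classes differ in many details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalysisOrderSketch {
    // Mimics the order of the chain: char filters first, then tokenizer, then lowercase.
    public static List<String> analyze(String input, boolean stripTags) {
        // Step 1: pattern-replace "char filter" keeps only the body content.
        String text = input.replaceAll("(?s)^.*<body>(.*)</body>.*$", "$1");

        // Step 2 (optional): naive tag removal, a stand-in for HTML stripping.
        if (stripTags) {
            text = text.replaceAll("<[^>]*>", " ");
        }

        // Step 3: "tokenizer" runs only now, so line breaks were still
        // present while the regex in step 1 was being applied.
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                // Step 4: lowercase "token filter".
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<p>A <b>Test</b>.</p>\n</body>\n</html>";
        System.out.println(analyze(html, false)); // [p, a, b, test, b, p]
        System.out.println(analyze(html, true));  // [a, test]
    }
}
```

Without the strip step the tag names inside the body end up as tokens, which corresponds to the difference between the two field types.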

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Tuesday, September 10, 2013 7:07 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

OK, I am getting there now, but if there are newlines involved the regex stops
as soon as it reaches a "\r\n", even if I try [\t\r\n.]* in the regex. I have
to get rid of the newlines. Why isn't WhitespaceTokenizerFactory the right
element for this?



Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

OK, I am getting there now, but if there are newlines involved the regex stops as soon as it reaches a "\r\n", even if I try [\t\r\n.]* in the regex. I have to get rid of the newlines. Why isn't WhitespaceTokenizerFactory the right element for this?
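
A side note on that character class, as a plain-Java sketch (CharClassDotDemo and matchesWhole are illustrative names, not Solr API, and it assumes the pattern is evaluated with java.util.regex rules): inside square brackets a dot is a literal '.', so [\t\r\n.]* only matches tabs, line breaks, and periods. That may be why the attempt had no effect. [\s\S]* or a leading (?s) covers every character.

```java
import java.util.regex.Pattern;

public class CharClassDotDemo {
    // True if the regex matches the entire input string.
    public static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "line one\r\nline two";

        // '.' inside [...] is a literal period, so letters are not matched.
        System.out.println(matchesWhole("[\\t\\r\\n.]*", text));      // false

        // The class does match input made only of periods, tabs and line breaks.
        System.out.println(matchesWhole("[\\t\\r\\n.]*", "..\r\n.")); // true

        // Two ways to really match any character, line breaks included.
        System.out.println(matchesWhole("[\\s\\S]*", text));          // true
        System.out.println(matchesWhole("(?s).*", text));             // true
    }
}
```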


On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

> Use XML then. Although you will need to escape the XML special characters as I did in the pattern.
> 
> The point is simply: Quickly and simply try to find the simple test scenario that illustrates the problem.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 7:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.
> 
> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
> 
>> Did you at least try the pattern I gave you?
>> 
>> The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 6:40 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> I've downloaded curl and tried it in the command prompt and PowerShell on my Windows 2008 R2 server; that's why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.
>> 
>> 
>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>> 
>>> Did you in fact try my suggested example? If not, please do so.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 4:42 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> i index html pages with a lot of lines and not just a string with the body-tag.
>>> it doesn't work with proper html files, even though i took all the new lines out.
>>> 
>>> html-file:
>>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>>> 
>>> solr update debug output:
>>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]
>>> 
>>> 
>>> 
>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>> 
>>>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>>>> 
>>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>> 
>>>> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
>>>> <analyzer>
>>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> </analyzer>
>>>> </fieldType>
>>>> 
>>>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>>>> 
>>>> Indexing this data:
>>>> 
>>>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
>>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>> 
>>>> And querying with these commands:
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>>> Shows all data
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>>> shows the body text
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>>> Shows nothing, HTML tag stripped
>>>> 
>>>> In your original query, you didn't show us what your default field, df parameter, was.
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -----Original Message----- From: Andreas Owen
>>>> Sent: Sunday, September 08, 2013 5:21 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: charfilter doesn't do anything
>>>> 
>>>> Yes, but that filters all HTML and not the specific tag I want.
>>>> 
>>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>> 
>>>>> Hmmm, have you looked at:
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>> 
>>>>> Not quite the <body>, perhaps, but might it help?
>>>>> 
>>>>> 
>>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>>>> 
>>>>>> OK, I have HTML pages with <html>.....<!--body-->content I
>>>>>> want....<!--/body-->.....</html>. I want to extract (index, store) only
>>>>>> what is between the body comments. I thought RegexTransformer would be the
>>>>>> best because XPath doesn't work in Tika and I can't nest an
>>>>>> XPathEntityProcessor to use XPath. What I have also found out is that the
>>>>>> HTML parser from Tika cuts my body comments out and tries to make
>>>>>> well-formed HTML, which I would like to switch off.
>>>>>> 
>>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>> 
>>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>>> I've managed to get it working if I use the RegexTransformer and the string
>>>>>>>> is on the same line in my Tika entity. But when the string is multilined it
>>>>>>>> isn't working, even though I tried ?s to set the dotall flag.
>>>>>>>> 
>>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>>>         transformer="RegexTransformer">
>>>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>>          replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>> </entity>
>>>>>>>> 
>>>>>>>> Then I tried it like this and I get a stack overflow:
>>>>>>>> 
>>>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>> 
>>>>>>>> In JavaScript this works, but maybe only because I used a small string.
>>>>>>> 
>>>>>>> Sounds like we've got an XY problem here.
>>>>>>> 
>>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>> 
>>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>>> and then we can find a solution for you?
>>>>>>> 
>>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>>> 
>>>>>>> 
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>> 
>>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>>> be able to search for individual words on your HTML input.  The entire
>>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>> 
>>>>>>> 
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>> 
>>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>>> Solr will always return the text that was originally indexed in search
>>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>>> need an Update Processor.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shawn

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

Use XML then. Although you will need to escape the XML special characters as 
I did in the pattern.
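
As a sketch of what that escaping looks like when building an XML update message by hand (XmlUpdateEscape, escapeXml, and addDoc are hypothetical helpers, and the id and body field names are taken from the earlier JSON example):

```java
public class XmlUpdateEscape {
    // Escapes the characters that are special in XML text and attribute values.
    public static String escapeXml(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds a one-document <add> message with escaped field values.
    public static String addDoc(String id, String body) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"body\">" + escapeXml(body) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        // The test document from the JSON example, now safe to embed in XML.
        System.out.println(addDoc("doc-1", "abc <body>A test.</body> def"));
        // The same escaping yields the pattern attribute used in schema.xml.
        System.out.println(escapeXml("(?s)^.*<body>(.*)</body>.*$"));
    }
}
```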

The point is simply: Quickly and simply try to find the simple test scenario 
that illustrates the problem.

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Monday, September 09, 2013 7:05 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I tried, but that isn't working either; it wants a data stream. I'll have to
check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
>
> The point of the curl was the data, not how you send the data. You can 
> just use the standard Solr simple post tool.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> I've downloaded curl and tried it in the command prompt and PowerShell on
> my Windows 2008 R2 server; that's why I used my data importer with a
> single-line HTML file and copy/pasted the lines into schema.xml.
>
>
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>
>> Did you in fact try my suggested example? If not, please do so.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> i index html pages with a lot of lines and not just a string with the 
>> body-tag.
>> it doesn't work with proper html files, even though i took all the new 
>> lines out.
>>
>> html-file:
>> <html>nav-content<body> nur das will ich 
>> sehen</body>footer-content</html>
>>
>> solr update debug output:
>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" 
>> content=\"text/html; 
>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
>> will ich sehenfooter-content</body></html>"]
>>
>>
>>
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>
>>> I tried this and it seems to work when added to the standard Solr 
>>> example in 4.4:
>>>
>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>
>>> <fieldType name="text_html_body" class="solr.TextField" 
>>> positionIncrementGap="100" >
>>> <analyzer>
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" 
>>> pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> That char filter retains only text between <body> and </body>. Is that 
>>> what you wanted?
>>>
>>> Indexing this data:
>>>
>>> curl 'localhost:8983/solr/update?commit=true' -H 
>>> 'Content-type:application/json' -d '
>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>
>>> And querying with these commands:
>>>
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>> Shows all data
>>>
>>> curl 
>>> "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>> shows the body text
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>> shows nothing (outside of body)
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>> shows nothing (outside of body)
>>>
>>> curl 
>>> "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>> Shows nothing, HTML tag stripped
>>>
>>> In your original query, you didn't show us what your default field, df 
>>> parameter, was.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>>
>>> Yes, but that filters all HTML and not the specific tag I want.
>>>
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Not quite the <body>, perhaps, but might it help?
>>>>
>>>>
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>>>
>>>>> ok i have html pages with <html>.....<!--body-->content i
>>>>> want....<!--/body-->.....</html>. i want to extract (index, store) 
>>>>> only
>>>>> that between the body-comments. i thought regexTransformer would be 
>>>>> the
>>>>> best because xpath doesn't work in tika and i cant nest a
>>>>> xpathEntetyProcessor to use xpath. what i have also found out is that 
>>>>> the
>>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>>> formed html, which i would like to switch off.
>>>>>
>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>
>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>> i've managed to get it working if i use the regexTransformer and 
>>>>>>> string
>>>>> is on the same line in my tika entity. but when the string is 
>>>>> multilined it
>>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>>>
>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" 
>>>>> format="html"
>>>>> transformer="RegexTransformer">
>>>>>>>  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>> </entity>
>>>>>>>
>>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>>
>>>>>>> <field column="text_html" 
>>>>>>> regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>>
>>>>>>> in javascript this works but maybe because i only used a small 
>>>>>>> string.
>>>>>>
>>>>>> Sounds like we've got an XY problem here.
>>>>>>
>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>
>>>>>> How about you tell us *exactly* what you'd actually like to have 
>>>>>> happen
>>>>>> and then we can find a solution for you?
>>>>>>
>>>>>> It sounds a little bit like you're interested in stripping all the 
>>>>>> HTML
>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>>
>>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>
>>>>>> Something that I already said: By using the KeywordTokenizer, you 
>>>>>> won't
>>>>>> be able to search for individual words on your HTML input.  The 
>>>>>> entire
>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>
>>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>
>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>> Solr will always return the text that was originally indexed in 
>>>>>> search
>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>> need an Update Processor.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.
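
For anyone who cannot get the curl command working on Windows, here is a minimal sketch of posting the same JSON document from Python instead. The URL and the sample document are taken from the curl example quoted in this thread (with `http://` added); the helper name `build_update_request` is made up for illustration, and whether your Solr version accepts JSON at this path is an assumption to check against your own setup.

```python
import json
import urllib.request

# Assumed endpoint, copied from the curl example quoted in this thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same test document as in the curl example.
request = build_update_request(
    [{"id": "doc-1", "body": "abc <body>A test.</body> def"}]
)

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The explicit Content-Type header is the part that matters here: it is what the `-H 'Content-type:application/json'` option does in the curl command.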

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
> 
> The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> I've downloaded curl and tried it in the command prompt and PowerShell on my Windows 2008 R2 server; that's why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.
> 
> 
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
> 
>> Did you in fact try my suggested example? If not, please do so.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> I index HTML pages with a lot of lines, not just a string with the body tag.
>> It doesn't work with proper HTML files, even though I took all the newlines out.
>> 
>> html-file:
>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>> 
>> solr update debug output:
>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]
>> 
>> 
>> 
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>> 
>>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>>> 
>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>> 
>>> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
>>> <analyzer>
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>> 
>>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>>> 
>>> Indexing this data:
>>> 
>>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>> 
>>> And querying with these commands:
>>> 
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>> Shows all data
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>> shows the body text
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>> shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>> shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>> Shows nothing, HTML tag stripped
>>> 
>>> In your original query, you didn't show us what your default field, df parameter, was.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> Yes, but that filters HTML and not the specific tag I want.
>>> 
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>> 
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Not quite the <body>, perhaps, but might it help?
>>>> 
>>>> 
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>>> 
>>>>> ok i have html pages with <html>.....<!--body-->content i
>>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>>> that between the body-comments. i thought regexTransformer would be the
>>>>> best because xpath doesn't work in tika and i cant nest a
>>>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>>> formed html, which i would like to switch off.
>>>>> 
>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>> 
>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>> i've managed to get it working if i use the regexTransformer and string
>>>>> is on the same line in my tika entity. but when the string is multilined it
>>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>>> 
>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>> transformer="RegexTransformer">
>>>>>>>  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>> </entity>
>>>>>>> 
>>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>> 
>>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>> 
>>>>>>> in javascript this works but maybe because i only used a small string.
>>>>>> 
>>>>>> Sounds like we've got an XY problem here.
>>>>>> 
>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>> 
>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>> and then we can find a solution for you?
>>>>>> 
>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>> 
>>>>>> 
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>> 
>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>> be able to search for individual words on your HTML input.  The entire
>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>> 
>>>>>> 
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>> 
>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>> Solr will always return the text that was originally indexed in search
>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>> need an Update Processor.
>>>>>> 
>>>>>> Thanks,
>>>>>> Shawn

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

Did you at least try the pattern I gave you?

The point of the curl was the data, not how you send the data. You can just 
use the standard Solr simple post tool.

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Monday, September 09, 2013 6:40 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I've downloaded curl and tried it in the command prompt and PowerShell on my 
Windows 2008 R2 server; that's why I used my data importer with a single-line 
HTML file and copy/pasted the lines into schema.xml.


On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

> Did you in fact try my suggested example? If not, please do so.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> I index HTML pages with a lot of lines, not just a string with the body tag.
> It doesn't work with proper HTML files, even though I took all the newlines out.
>
> html-file:
> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>
> solr update debug output:
> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" 
> content=\"text/html; 
> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
> will ich sehenfooter-content</body></html>"]
>
>
>
> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>
>> I tried this and it seems to work when added to the standard Solr example 
>> in 4.4:
>>
>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>
>> <fieldType name="text_html_body" class="solr.TextField" 
>> positionIncrementGap="100" >
>> <analyzer>
>>  <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>  <filter class="solr.LowerCaseFilterFactory"/>
>> </analyzer>
>> </fieldType>
>>
>> That char filter retains only text between <body> and </body>. Is that 
>> what you wanted?
>>
>> Indexing this data:
>>
>> curl 'localhost:8983/solr/update?commit=true' -H 
>> 'Content-type:application/json' -d '
>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>
>> And querying with these commands:
>>
>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>> Shows all data
>>
>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>> shows the body text
>>
>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>> Shows nothing, HTML tag stripped
>>
>> In your original query, you didn't show us what your default field, df 
>> parameter, was.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Andreas Owen
>> Sent: Sunday, September 08, 2013 5:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> Yes, but that filters HTML and not the specific tag I want.
>>
>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>
>>> Hmmm, have you looked at:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Not quite the <body>, perhaps, but might it help?
>>>
>>>
>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>>
>>>> ok i have html pages with <html>.....<!--body-->content i
>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>> that between the body-comments. i thought regexTransformer would be the
>>>> best because xpath doesn't work in tika and i cant nest a
>>>> xpathEntetyProcessor to use xpath. what i have also found out is that 
>>>> the
>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>> formed html, which i would like to switch off.
>>>>
>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>
>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>> i've managed to get it working if i use the regexTransformer and 
>>>>>> string
>>>> is on the same line in my tika entity. but when the string is 
>>>> multilined it
>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>>
>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>> transformer="RegexTransformer">
>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>> </entity>
>>>>>>
>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>
>>>>>> <field column="text_html" 
>>>>>> regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>
>>>>>> in javascript this works but maybe because i only used a small 
>>>>>> string.
>>>>>
>>>>> Sounds like we've got an XY problem here.
>>>>>
>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>
>>>>> How about you tell us *exactly* what you'd actually like to have 
>>>>> happen
>>>>> and then we can find a solution for you?
>>>>>
>>>>> It sounds a little bit like you're interested in stripping all the 
>>>>> HTML
>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>
>>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>
>>>>> Something that I already said: By using the KeywordTokenizer, you 
>>>>> won't
>>>>> be able to search for individual words on your HTML input.  The entire
>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>
>>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>
>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>> Solr will always return the text that was originally indexed in search
>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>> need an Update Processor.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

I've downloaded curl and tried it in the command prompt and PowerShell on my Windows 2008 R2 server; that's why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.


On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

> Did you in fact try my suggested example? If not, please do so.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> I index HTML pages with a lot of lines, not just a string with the body tag.
> It doesn't work with proper HTML files, even though I took all the newlines out.
> 
> html-file:
> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
> 
> solr update debug output:
> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]
> 
> 
> 
> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
> 
>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>> 
>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>> 
>> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
>> <analyzer>
>>  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>  <filter class="solr.LowerCaseFilterFactory"/>
>> </analyzer>
>> </fieldType>
>> 
>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>> 
>> Indexing this data:
>> 
>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>> 
>> And querying with these commands:
>> 
>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>> Shows all data
>> 
>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>> shows the body text
>> 
>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>> shows nothing (outside of body)
>> 
>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>> shows nothing (outside of body)
>> 
>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>> Shows nothing, HTML tag stripped
>> 
>> In your original query, you didn't show us what your default field, df parameter, was.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Sunday, September 08, 2013 5:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> Yes, but that filters HTML and not the specific tag I want.
>> 
>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>> 
>>> Hmmm, have you looked at:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>> 
>>> Not quite the <body>, perhaps, but might it help?
>>> 
>>> 
>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>> 
>>>> ok i have html pages with <html>.....<!--body-->content i
>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>> that between the body-comments. i thought regexTransformer would be the
>>>> best because xpath doesn't work in tika and i cant nest a
>>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>> formed html, which i would like to switch off.
>>>> 
>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>> 
>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>> i've managed to get it working if i use the regexTransformer and string
>>>> is on the same line in my tika entity. but when the string is multilined it
>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>> 
>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>> transformer="RegexTransformer">
>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>> </entity>
>>>>>> 
>>>>>> then i tried it like this and i get a stackoverflow
>>>>>> 
>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>> 
>>>>>> in javascript this works but maybe because i only used a small string.
>>>>> 
>>>>> Sounds like we've got an XY problem here.
>>>>> 
>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>> 
>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>> and then we can find a solution for you?
>>>>> 
>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>> 
>>>>> 
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>> 
>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>> be able to search for individual words on your HTML input.  The entire
>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>> 
>>>>> 
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>> 
>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>> Solr will always return the text that was originally indexed in search
>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>> need an Update Processor.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Monday, September 09, 2013 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I index HTML pages with a lot of lines, not just a string with the body tag.
It doesn't work with proper HTML files, even though I took all the newlines out.

html-file:
<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:
"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; 
charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
will ich sehenfooter-content</body></html>"]
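
A small sketch of one likely reason the pattern does nothing on this output: the value contains \r\n line breaks, and without the DOTALL flag the dot does not match across them. Python's `re` module is used here only as a stand-in for the Java regex engine that Solr uses; both treat a plain dot and the inline `(?s)` flag the same way in this simple case, but that equivalence is an assumption beyond it. The sample string is a shortened copy of the debug output above.

```python
import re

# The pattern from the schema.xml example, with XML entities decoded.
PATTERN = r"^.*<body>(.*)</body>.*$"

# Shortened stand-in for the debug output above, including its \r\n breaks.
tika_output = (
    "<html>\r\n\r\n<title></title>\r\n\r\n"
    "<body>nav-content nur das will ich sehenfooter-content</body></html>"
)

# Without DOTALL, "." stops at the line break, so the pattern never
# reaches <body> and the text passes through unchanged.
unchanged = re.sub(PATTERN, r"\1", tika_output)

# With the inline (?s) flag, "." also matches line breaks, so the whole
# value is matched and replaced by the captured group.
extracted = re.sub("(?s)" + PATTERN, r"\1", tika_output)

print(unchanged == tika_output)
print(extracted)
```

Note that in this debug output the nav and footer text already sit inside <body>, so even a matching pattern would keep them; that part comes from how the HTML was parsed before the char filter ever sees it.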



On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example 
> in 4.4:
>
> <field name="body" type="text_html_body" indexed="true" stored="true" />
>
> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
>
> That char filter retains only text between <body> and </body>. Is that 
> what you wanted?
>
> Indexing this data:
>
> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>
> And querying with these commands:
>
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
> Shows all data
>
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
> shows the body text
>
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
> shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
> shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
> Shows nothing, HTML tag stripped
>
> In your original query, you didn't show us what your default field, df 
> parameter, was.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> Yes, but that filters HTML and not the specific tag I want.
>
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>
>> Not quite the <body>, perhaps, but might it help?
>>
>>
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>>
>>> ok i have html pages with <html>.....<!--body-->content i
>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>> that between the body-comments. i thought regexTransformer would be the
>>> best because xpath doesn't work in tika and i cant nest a
>>> xpathEntetyProcessor to use xpath. what i have also found out is that 
>>> the
>>> htmlparser from tika cuts my body-comments out and tries to make well
>>> formed html, which i would like to switch off.
>>>
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> i've managed to get it working if i use the regexTransformer and 
>>>>> string
>>> is on the same line in my tika entity. but when the string is multilined 
>>> it
>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>
>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>> transformer="RegexTransformer">
>>>>>    <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>> </entity>
>>>>>
>>>>> then i tried it like this and i get a stackoverflow
>>>>>
>>>>> <field column="text_html" 
>>>>> regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>
>>>>> in javascript this works but maybe because i only used a small string.
>>>>
>>>> Sounds like we've got an XY problem here.
>>>>
>>>> http://people.apache.org/~hossman/#xyproblem
>>>>
>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>> and then we can find a solution for you?
>>>>
>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>
>>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>> be able to search for individual words on your HTML input.  The entire
>>>> input string is treated as a single token, and therefore ONLY exact
>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>
>>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>
>>>> Note that no matter what you do to your data with the analysis chain,
>>>> Solr will always return the text that was originally indexed in search
>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>> need an Update Processor.
>>>>
>>>> Thanks,
>>>> Shawn
>>>

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

I index HTML pages with a lot of lines, not just a string with the body tag.
It doesn't work with proper HTML files, even though I took all the newlines out.

html-file:
<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:
"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]



On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example in 4.4:
> 
> <field name="body" type="text_html_body" indexed="true" stored="true" />
> 
> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
> 
> That char filter retains only text between <body> and </body>. Is that what you wanted?
> 
> Indexing this data:
> 
> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
> 
> And querying with these commands:
> 
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
> Shows all data
> 
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
> shows the body text
> 
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
> shows nothing (outside of body)
> 
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
> shows nothing (outside of body)
> 
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
> Shows nothing, HTML tag stripped
> 
> In your original query, you didn't show us what your default field, df parameter, was.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> Yes, but that filters HTML and not the specific tag I want.
> 
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
> 
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>> 
>> Not quite the <body>, perhaps, but might it help?
>> 
>> 
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>> 
>>> ok i have html pages with <html>.....<!--body-->content i
>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>> that between the body-comments. i thought regexTransformer would be the
>>> best because xpath doesn't work in tika and i cant nest a
>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>> htmlparser from tika cuts my body-comments out and tries to make well
>>> formed html, which i would like to switch off.
>>> 
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>> 
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> i've managed to get it working if i use the regexTransformer and string
>>> is on the same line in my tika entity. but when the string is multilined it
>>> isn't working even though i tried ?s to set the flag dotall.
>>>>> 
>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>> transformer="RegexTransformer">
>>>>>    <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>> </entity>
>>>>> 
>>>>> then i tried it like this and i get a stackoverflow
>>>>> 
>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>> 
>>>>> in javascript this works but maybe because i only used a small string.
>>>> 
>>>> Sounds like we've got an XY problem here.
>>>> 
>>>> http://people.apache.org/~hossman/#xyproblem
>>>> 
>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>> and then we can find a solution for you?
>>>> 
>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>> 
>>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>> be able to search for individual words on your HTML input.  The entire
>>>> input string is treated as a single token, and therefore ONLY exact
>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>> 
>>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>> 
>>>> Note that no matter what you do to your data with the analysis chain,
>>>> Solr will always return the text that was originally indexed in search
>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>> need an Update Processor.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

I tried this and it seems to work when added to the standard Solr example in 
4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only text between <body> and </body>. Is that what 
you wanted?
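
The pattern can also be tried outside Solr. The sketch below assumes only that the char filter uses java.util.regex; the class and method names are made up for illustration, and it is not how Solr applies the filter internally.

```java
import java.util.regex.Pattern;

public class BodyPatternSketch {
    // Same pattern as in the charFilter above, with the XML entities decoded.
    static final Pattern BODY = Pattern.compile("^.*<body>(.*)</body>.*$");

    // Replace every match with capture group 1, i.e. the text between the tags.
    static String keepBody(String input) {
        return BODY.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Same sample value as the "body" field in the curl command.
        System.out.println(keepBody("abc <body>A test.</body> def"));
    }
}
```

With the single-line sample value, only the text between the tags should remain, which fits the query results listed further down.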

Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
Shows all data.

curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
Shows the document, because "test" is inside the body.

curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
Shows nothing ("abc" is outside the body).

curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
Shows nothing ("def" is outside the body).

curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
Shows nothing, because the HTML tag itself was stripped.

In your original query, you didn't show us what your default field (the df 
parameter) was.

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

Yes, but that filters out all HTML and not just the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>
> Not quite the <body>, perhaps, but might it help?
>
>
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
>
>> ok i have html pages with <html>.....<!--body-->content i
>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>> that between the body-comments. i thought regexTransformer would be the
>> best because xpath doesn't work in tika and i cant nest a
>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>> htmlparser from tika cuts my body-comments out and tries to make well
>> formed html, which i would like to switch off.
>>
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> i've managed to get it working if i use the regexTransformer and string
>> is on the same line in my tika entity. but when the string is multilined 
>> it
>> isn't working even though i tried ?s to set the flag dotall.
>>>>
>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>> transformer="RegexTransformer">
>>>>     <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>> </entity>
>>>>
>>>> then i tried it like this and i get a stackoverflow
>>>>
>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>
>>>> in javascript this works but maybe because i only used a small string.
>>>
>>> Sounds like we've got an XY problem here.
>>>
>>> http://people.apache.org/~hossman/#xyproblem
>>>
>>> How about you tell us *exactly* what you'd actually like to have happen
>>> and then we can find a solution for you?
>>>
>>> It sounds a little bit like you're interested in stripping all the HTML
>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>
>>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Something that I already said: By using the KeywordTokenizer, you won't
>>> be able to search for individual words on your HTML input.  The entire
>>> input string is treated as a single token, and therefore ONLY exact
>>> entire-field matches (or certain wildcard matches) will be possible.
>>>
>>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>
>>> Note that no matter what you do to your data with the analysis chain,
>>> Solr will always return the text that was originally indexed in search
>>> results.  If you need to affect what gets stored as well, perhaps you
>>> need an Update Processor.
>>>
>>> Thanks,
>>> Shawn
>>
>>

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

Yes, but that filters out all HTML and not just the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Not quite the <body>, perhaps, but might it help?
> 
> 
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:
> 
>> ok i have html pages with <html>.....<!--body-->content i
>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>> that between the body-comments. i thought regexTransformer would be the
>> best because xpath doesn't work in tika and i cant nest a
>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>> htmlparser from tika cuts my body-comments out and tries to make well
>> formed html, which i would like to switch off.
>> 
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>> 
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> i've managed to get it working if i use the regexTransformer and string
>> is on the same line in my tika entity. but when the string is multilined it
>> isn't working even though i tried ?s to set the flag dotall.
>>>> 
>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>> transformer="RegexTransformer">
>>>>     <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>> </entity>
>>>> 
>>>> then i tried it like this and i get a stackoverflow
>>>> 
>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>> 
>>>> in javascript this works but maybe because i only used a small string.
>>> 
>>> Sounds like we've got an XY problem here.
>>> 
>>> http://people.apache.org/~hossman/#xyproblem
>>> 
>>> How about you tell us *exactly* what you'd actually like to have happen
>>> and then we can find a solution for you?
>>> 
>>> It sounds a little bit like you're interested in stripping all the HTML
>>> tags out.  Perhaps the HTMLStripCharFilter?
>>> 
>>> 
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>> 
>>> Something that I already said: By using the KeywordTokenizer, you won't
>>> be able to search for individual words on your HTML input.  The entire
>>> input string is treated as a single token, and therefore ONLY exact
>>> entire-field matches (or certain wildcard matches) will be possible.
>>> 
>>> 
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>> 
>>> Note that no matter what you do to your data with the analysis chain,
>>> Solr will always return the text that was originally indexed in search
>>> results.  If you need to affect what gets stored as well, perhaps you
>>> need an Update Processor.
>>> 
>>> Thanks,
>>> Shawn
>> 
>>

Re: charfilter doesn't do anything

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the <body>, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <ao...@conx.ch> wrote:

> ok i have html pages with <html>.....<!--body-->content i
> want....<!--/body-->.....</html>. i want to extract (index, store) only
> that between the body-comments. i thought regexTransformer would be the
> best because xpath doesn't work in tika and i cant nest a
> xpathEntetyProcessor to use xpath. what i have also found out is that the
> htmlparser from tika cuts my body-comments out and tries to make well
> formed html, which i would like to switch off.
>
> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>
> > On 9/6/2013 7:09 AM, Andreas Owen wrote:
> >> i've managed to get it working if i use the regexTransformer and string
> is on the same line in my tika entity. but when the string is multilined it
> isn't working even though i tried ?s to set the flag dotall.
> >>
> >> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
> transformer="RegexTransformer">
> >>      <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> >> </entity>
> >>
> >> then i tried it like this and i get a stackoverflow
> >>
> >> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> >>
> >> in javascript this works but maybe because i only used a small string.
> >
> > Sounds like we've got an XY problem here.
> >
> > http://people.apache.org/~hossman/#xyproblem
> >
> > How about you tell us *exactly* what you'd actually like to have happen
> > and then we can find a solution for you?
> >
> > It sounds a little bit like you're interested in stripping all the HTML
> > tags out.  Perhaps the HTMLStripCharFilter?
> >
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> >
> > Something that I already said: By using the KeywordTokenizer, you won't
> > be able to search for individual words on your HTML input.  The entire
> > input string is treated as a single token, and therefore ONLY exact
> > entire-field matches (or certain wildcard matches) will be possible.
> >
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
> >
> > Note that no matter what you do to your data with the analysis chain,
> > Solr will always return the text that was originally indexed in search
> > results.  If you need to affect what gets stored as well, perhaps you
> > need an Update Processor.
> >
> > Thanks,
> > Shawn
>
>

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

OK, I have HTML pages of the form <html>.....<!--body-->content i want....<!--/body-->.....</html>. I want to extract (index and store) only the part between the body comments. I thought RegexTransformer would be the best fit, because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. I have also found out that Tika's HTML parser cuts my body comments out and tries to produce well-formed HTML, which I would like to switch off.
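
To make the goal concrete, here is a minimal sketch in plain Java of the extraction I am after. The class and method names are hypothetical, and it only helps if the <!--body--> comments are still in the text when the regex runs, which is exactly what Tika's parser currently prevents.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentMarkerSketch {
    // (?s) lets the content span several lines; the reluctant ".*?" stops
    // at the first closing marker instead of the last one.
    static final Pattern MARKED =
            Pattern.compile("(?s)<!--body-->(.*?)<!--/body-->");

    // Returns the text between the markers, or an empty string if none found.
    static String extract(String html) {
        Matcher m = MARKED.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String page = "<html>\n<!--body-->content i\nwant<!--/body-->\n</html>";
        System.out.println(extract(page));
    }
}
```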

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>> i've managed to get it working if i use the regexTransformer and string is on the same line in my tika entity. but when the string is multilined it isn't working even though i tried ?s to set the flag dotall.
>> 
>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
>> 	<field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>> </entity>
>> 			
>> then i tried it like this and i get a stackoverflow
>> 
>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>> 
>> in javascript this works but maybe because i only used a small string.
> 
> Sounds like we've got an XY problem here.
> 
> http://people.apache.org/~hossman/#xyproblem
> 
> How about you tell us *exactly* what you'd actually like to have happen
> and then we can find a solution for you?
> 
> It sounds a little bit like you're interested in stripping all the HTML
> tags out.  Perhaps the HTMLStripCharFilter?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Something that I already said: By using the KeywordTokenizer, you won't
> be able to search for individual words on your HTML input.  The entire
> input string is treated as a single token, and therefore ONLY exact
> entire-field matches (or certain wildcard matches) will be possible.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
> 
> Note that no matter what you do to your data with the analysis chain,
> Solr will always return the text that was originally indexed in search
> results.  If you need to affect what gets stored as well, perhaps you
> need an Update Processor.
> 
> Thanks,
> Shawn

Re: charfilter doesn't do anything

Posted by Shawn Heisey <so...@elyograg.org>.

On 9/6/2013 7:09 AM, Andreas Owen wrote:
> i've managed to get it working if i use the regexTransformer and string is on the same line in my tika entity. but when the string is multilined it isn't working even though i tried ?s to set the flag dotall.
> 
> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
> 	<field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> </entity>
> 			
> then i tried it like this and i get a stackoverflow
> 
> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> 
> in javascript this works but maybe because i only used a small string.

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

I've managed to get it working with the RegexTransformer when the string is on a single line in my Tika entity. But when the string spans multiple lines it doesn't work, even though I tried ?s to set the dotall flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
	<field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
</entity>
			
Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />

In JavaScript this works, but maybe only because I used a small string.
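
For reference, a sketch of the alternative in plain Java. As far as I understand, java.util.regex handles a repeated group such as (.|\n|\r)+ recursively, one level per matched character, which would explain the stack overflow on a full HTML page; the inline (?s) flag at the start of the pattern lets a plain ".+" cross line breaks instead. The class name, method names and generated test document are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallSketch {
    // (?s) turns on DOTALL, so ".+" matches line breaks too and the
    // "(.|\n|\r)+" alternation is not needed.
    static final Pattern BODY = Pattern.compile("(?s)<body>(.+)</body>");

    // Replaces everything from <body> to </body> with a literal replacement.
    static String replaceBody(String input, String replacement) {
        return BODY.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Hypothetical test document with many lines between the body tags.
    static String bigDocument(int lines) {
        StringBuilder sb = new StringBuilder("<html>\n<body>\n");
        for (int i = 0; i < lines; i++) {
            sb.append("line ").append(i).append('\n');
        }
        return sb.append("</body>\n</html>").toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceBody(bigDocument(100_000), "QQQQQQQQQQQQQQQ"));
    }
}
```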



On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that you changed your schema since you indexed the data? If so, re-index the data.
> 
> If a "*" query finds nothing, that implies that the default field is empty. Are you sure the "df" parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see what fields are being populated.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> the input string is a normal html page with the word Zahlungsverkehr in it and my query is ...solr/collection1/select?q=*
> 
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
> 
>> And show us an input string and a query that fail.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> i would like to filter / replace a word during indexing but it doesn't do anything and i dont get a error.
>>> 
>>> in schema.xml i have the following:
>>> 
>>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>
>>> 
>>> <fieldType name="text_cutHtml" class="solr.TextField">
>>> <analyzer>
>>> <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
>>> </analyzer>
>>>  </fieldType>
>>> 
>>> my 2. question is where can i say that the expression is multilined like in javascript i can use /m at the end of the pattern?
>> 
>> I don't know about your second question.  I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>> 
>> As for the first question, here's what I have.  Did you reindex?  That
>> will be required.
>> 
>> http://wiki.apache.org/solr/HowToReindex
>> 
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"?  The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>> 
>> Note that both the pattern and replacement are case sensitive.  This is
>> how regex works.  You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>> 
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>> 
>> Thanks,
>> Shawn

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

Is there any chance that you changed your schema since you indexed the 
data? If so, re-index the data.

If a "*" query finds nothing, that implies that the default field is empty. 
Are you sure the "df" parameter is set to the field containing your data? 
Show us your request handler definition and a sample of your actual Solr 
input (Solr XML or JSON?) so that we can see what fields are being 
populated.

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Friday, September 06, 2013 4:01 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

the input string is a normal html page with the word Zahlungsverkehr in it 
and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

> And show us an input string and a query that fail.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Shawn Heisey
> Sent: Thursday, September 05, 2013 2:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>> i would like to filter / replace a word during indexing but it doesn't do 
>> anything and i dont get a error.
>>
>> in schema.xml i have the following:
>>
>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" 
>> multiValued="true"/>
>>
>> <fieldType name="text_cutHtml" class="solr.TextField">
>> <analyzer>
>>  <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>  <charFilter class="solr.PatternReplaceCharFilterFactory" 
>> pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
>> </analyzer>
>>   </fieldType>
>>
>> my 2. question is where can i say that the expression is multilined like 
>> in javascript i can use /m at the end of the pattern?
>
> I don't know about your second question.  I don't know if that will be
> possible, but I'll leave that to someone who's more expert than I.
>
> As for the first question, here's what I have.  Did you reindex?  That
> will be required.
>
> http://wiki.apache.org/solr/HowToReindex
>
> Assuming that you did reindex, are you trying to search for ASDFGHJK in
> a field that contains more than just "Zahlungsverkehr"?  The keyword
> tokenizer might not do what you expect - it tokenizes the entire input
> string as a single token, which means that you won't be able to search
> for single words in a multi-word field without wildcards, which are
> pretty slow.
>
> Note that both the pattern and replacement are case sensitive.  This is
> how regex works.  You haven't used a lowercase filter, which means that
> you won't be able to search for asdfghjk.
>
> Use the analysis tab in the UI on your core to see what Solr does to
> your field text.
>
> Thanks,
> Shawn

Re: charfilter doesn't do anything

Posted by Andreas Owen <ao...@conx.ch>.

The input string is a normal HTML page with the word Zahlungsverkehr in it, and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

> And show us an input string and a query that fail.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Shawn Heisey
> Sent: Thursday, September 05, 2013 2:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>> i would like to filter / replace a word during indexing but it doesn't do anything and i dont get a error.
>> 
>> in schema.xml i have the following:
>> 
>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>
>> 
>> <fieldType name="text_cutHtml" class="solr.TextField">
>> <analyzer>
>>  <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
>> </analyzer>
>>   </fieldType>
>> 
>> my 2. question is where can i say that the expression is multilined like in javascript i can use /m at the end of the pattern?
> 
> I don't know about your second question.  I don't know if that will be
> possible, but I'll leave that to someone who's more expert than I.
> 
> As for the first question, here's what I have.  Did you reindex?  That
> will be required.
> 
> http://wiki.apache.org/solr/HowToReindex
> 
> Assuming that you did reindex, are you trying to search for ASDFGHJK in
> a field that contains more than just "Zahlungsverkehr"?  The keyword
> tokenizer might not do what you expect - it tokenizes the entire input
> string as a single token, which means that you won't be able to search
> for single words in a multi-word field without wildcards, which are
> pretty slow.
> 
> Note that both the pattern and replacement are case sensitive.  This is
> how regex works.  You haven't used a lowercase filter, which means that
> you won't be able to search for asdfghjk.
> 
> Use the analysis tab in the UI on your core to see what Solr does to
> your field text.
> 
> Thanks,
> Shawn

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

And show us an input string and a query that fail.

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Thursday, September 05, 2013 2:41 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote:
> i would like to filter / replace a word during indexing but it doesn't do 
> anything and i dont get a error.
>
> in schema.xml i have the following:
>
> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" 
> multiValued="true"/>
>
> <fieldType name="text_cutHtml" class="solr.TextField">
> <analyzer>
>   <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
>   <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
> </analyzer>
>    </fieldType>
>
> my 2. question is where can i say that the expression is multilined like 
> in javascript i can use /m at the end of the pattern?

I don't know about your second question.  I don't know if that will be
possible, but I'll leave that to someone who's more expert than I.

As for the first question, here's what I have.  Did you reindex?  That
will be required.

http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in
a field that contains more than just "Zahlungsverkehr"?  The keyword
tokenizer might not do what you expect - it tokenizes the entire input
string as a single token, which means that you won't be able to search
for single words in a multi-word field without wildcards, which are
pretty slow.

Note that both the pattern and replacement are case sensitive.  This is
how regex works.  You haven't used a lowercase filter, which means that
you won't be able to search for asdfghjk.

Use the analysis tab in the UI on your core to see what Solr does to
your field text.

Thanks,
Shawn

Re: charfilter doesn't do anything

Posted by Shawn Heisey <so...@elyograg.org>.

On 9/5/2013 10:03 AM, Andreas Owen wrote:
> i would like to filter / replace a word during indexing but it doesn't do anything and i dont get a error.
> 
> in schema.xml i have the following:
> 
> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>
> 
> <fieldType name="text_cutHtml" class="solr.TextField">
> 	<analyzer>
> 	  <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
> 	  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
> 	  <tokenizer class="solr.KeywordTokenizerFactory"/>
> 	</analyzer>
>    </fieldType>
> 
> my 2. question is where can i say that the expression is multilined like in javascript i can use /m at the end of the pattern?

I don't know about your second question.  I don't know if that will be
possible, but I'll leave that to someone who's more expert than I.

As for the first question, here's what I have.  Did you reindex?  That
will be required.

http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in
a field that contains more than just "Zahlungsverkehr"?  The keyword
tokenizer might not do what you expect - it tokenizes the entire input
string as a single token, which means that you won't be able to search
for single words in a multi-word field without wildcards, which are
pretty slow.

Note that both the pattern and replacement are case sensitive.  This is
how regex works.  You haven't used a lowercase filter, which means that
you won't be able to search for asdfghjk.
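
A small sketch of the case-sensitivity point in plain Java, assuming the pattern is handled as an ordinary java.util.regex pattern. The class name and sample sentences are made up, and the (?i) variant is only shown for comparison, not as a recommendation for your schema.

```java
import java.util.regex.Pattern;

public class CaseSketch {
    // Applies the given pattern and replaces every match with ASDFGHJK.
    static String replace(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).replaceAll("ASDFGHJK");
    }

    public static void main(String[] args) {
        // Exact case: the word is replaced.
        System.out.println(replace("Zahlungsverkehr", "Der Zahlungsverkehr heute"));
        // Different case in the input: nothing is replaced.
        System.out.println(replace("Zahlungsverkehr", "der zahlungsverkehr heute"));
        // Inline (?i) flag: the pattern now ignores case.
        System.out.println(replace("(?i)Zahlungsverkehr", "der zahlungsverkehr heute"));
    }
}
```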

Use the analysis tab in the UI on your core to see what Solr does to
your field text.

Thanks,
Shawn

Re: charfilter doesn't do anything

Posted by Jack Krupansky <ja...@basetechnology.com>.

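For the second question, there is no trailing /m-style flag. The pattern is 
a Java regular expression, so any flags have to be embedded at the start of 
the pattern, e.g. "(?s)" so that dot also matches line break characters, or 
"(?m)" so that ^ and $ match at line boundaries. By default, dot does not 
match line breaks.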
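
How line breaks interact with a pattern can be checked in plain Java, since the char filter's pattern is a java.util.regex pattern. The sketch below uses a made-up class name and a made-up two-line sample; it suggests that dot stops at a line break unless (?s) is given, and that (?m) only changes where ^ and $ may match.

```java
import java.util.regex.Pattern;

public class FlagSketch {
    // True if the pattern matches anywhere in the input.
    static boolean matches(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }

    public static void main(String[] args) {
        String twoLines = "<body>first\nsecond</body>";

        // No flag: dot does not match the line break.
        System.out.println(matches("<body>(.*)</body>", twoLines));
        // (?m) MULTILINE: affects ^ and $ only, dot still stops at the break.
        System.out.println(matches("(?m)<body>(.*)</body>", twoLines));
        // (?s) DOTALL: dot matches the line break as well.
        System.out.println(matches("(?s)<body>(.*)</body>", twoLines));
        // (?m) lets ^ match at the start of the second line.
        System.out.println(matches("(?m)^second", twoLines));
    }
}
```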

-- Jack Krupansky

-----Original Message----- 
From: Andreas Owen
Sent: Thursday, September 05, 2013 12:03 PM
To: solr-user@lucene.apache.org
Subject: charfilter doesn't do anything

I would like to filter / replace a word during indexing, but it doesn't do 
anything and I don't get an error.

in schema.xml i have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question: where can I specify that the expression is multiline, 
the way I can put /m at the end of a pattern in JavaScript?