Posted to solr-user@lucene.apache.org by jasimop <st...@gmail.com> on 2013/05/29 22:12:01 UTC

Problem with PatternReplaceCharFilter

Hi,

I have a problem using PatternReplaceCharFilter when indexing a field.
I created the following field type:
    <fieldType name="testfield" class="solr.TextField">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;TextDocument[^&#62;]*&#62;" replacement=""/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;/TextDocument&#62;" replacement=""/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;TextLine[^&#60;]+ content=\&#34;([^\&#34;]*)\&#34;[^/]+/&#62;"
                    replacement="$1 "/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_de.txt" format="snowball"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

And I created a field that is indexed and stored:
<field name="testfield" type="testfield" indexed="true" stored="true" />

I need to index a document with the following structure in this field:
<TextDocument filename="somefile.end" mime="..." created="..."><TextLine
aa="bb" cc="dd" content="the content to search in" ee="ff" /><TextLine
aa="bb" cc="dd" content="the second content line" ee="ff" /></TextDocument>

Basically I have some sort of XML structure. I only need to search in the
"content" attribute, but when highlighting I need to get back the
enclosing XML tags.

With the three regexes I want to remove all unwanted tags and tokenize/index
only the important data.
I know that I could use HTMLStripCharFilterFactory, but then the tag
names, attribute names and attribute values get indexed as well, and I don't
want that content to be searchable.

I read the following in the doc:
NOTE: If you produce a phrase that has different length to source string and
the field is used for highlighting for a term of the phrase, you will face a
trouble. 

Why is this the case? When I run the analysis in the Solr admin UI,
the CharFilters produce
"the content to search in the second content line", which looks perfect, but
the StandardTokenizer then gets the start and end positions of the tokens wrong.
Is there another solution to my problem?
Could I use the following method, which I saw in the documentation of
PatternReplaceCharFilter?
protected int correct(int currentOff) (documented as "Retrieve the corrected
offset.")

How could I solve such a task?






--
View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Problem with PatternReplaceCharFilter

Posted by jasimop <st...@gmail.com>.

Thanks again for your input.

In fact I already preprocess the data (concatenating only the content
values) and index the result into another field.

But my general problem is the following: my data has this cryptic format
and I have to search only within the content values. Therefore I preprocess
it and put it into a field. There everything works fine (highlighting etc.).
The problem is that when I get a hit in that field, I need to know which
<TextLine> it appeared in so that I can read its attribute values. They define
some rules for processing the search result, but it should not be possible to
search in them. Therefore I cannot just use the HTMLStripCharFilter.

So my idea was the following: index both my cleaned version and the raw format,
and make sure that both fields generate the same tokens (this is the hard part).
If I need to know the surrounding attribute values, I search in the raw version
and highlight the matching term. The highlight then tells me which attribute
values to use.

Another option would be to search in the cleaned version and, after the search,
try in my application to match that position to the one in the raw format based
on the highlighted term. But this is very error prone.
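
The second option gets less error prone if the bookkeeping is done while the
cleaned text is built, instead of being reconstructed afterwards from the
highlighted term. Below is a minimal Python sketch of that idea. The names
TEXTLINE_RE, ATTR_RE, index_lines and attrs_at are hypothetical, the patterns
assume the TextLine format shown in this thread, and it assumes the application
can recover the character offset of a hit in the cleaned text (Solr itself is
not involved here).

```python
import re
from bisect import bisect_right

# Hypothetical patterns for the sample format used in this thread.
TEXTLINE_RE = re.compile(r'<TextLine\b[^>]*?\scontent="([^"]*)"[^>]*/>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def index_lines(raw):
    """Build the cleaned text and remember which TextLine each part came from.

    Returns (clean, starts, attrs): clean is the content values joined by a
    space, starts[i] is the offset in clean where the i-th value begins, and
    attrs[i] is a dict of all attributes of the i-th TextLine.
    """
    values, starts, attrs = [], [], []
    pos = 0
    for m in TEXTLINE_RE.finditer(raw):
        starts.append(pos)
        attrs.append(dict(ATTR_RE.findall(m.group(0))))
        values.append(m.group(1))
        pos += len(m.group(1)) + 1  # +1 for the joining space
    return " ".join(values), starts, attrs


def attrs_at(offset, starts, attrs):
    """Attributes of the TextLine whose content value covers clean[offset]."""
    return attrs[bisect_right(starts, offset) - 1]


raw = ('<TextDocument filename="somefile.end">'
       '<TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" />'
       '<TextLine aa="b2" cc="dd" content="the second content line" ee="ff" />'
       '</TextDocument>')
clean, starts, attrs = index_lines(raw)
print(attrs_at(clean.index("second"), starts, attrs))
```

The cleaned text goes into the searchable field as before; the starts/attrs
tables would have to be stored alongside it or rebuilt from the stored raw
value when a hit needs to be resolved.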

Both solutions do not seem elegant to me.


Any suggestions?




--
View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869p4067265.html

Re: Problem with PatternReplaceCharFilter

Posted by Jack Krupansky <ja...@basetechnology.com>.

Just count the characters in the literal portions of the patterns and include 
that many spaces in the replacement.

So, "<TextLine " would become "          ".

It gets trickier if names are variable length. But I'm sure you could come 
up with patterns to replace one, two, three, etc. char names with equivalent 
spaces.
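
As far as I know, a plain replacement string cannot emit one space per matched
character, which is what makes the variable-length case awkward in a pure
regex. One way around that is to do the padding before the data reaches Solr.
The following is a minimal Python sketch of the equal-length idea, not a
PatternReplaceCharFilter configuration. CONTENT_RE and blank_out_markup are
hypothetical names, and the pattern assumes the TextLine format from this
thread.

```python
import re

# Hypothetical pattern for this thread's sample format: captures the value
# of the content attribute inside each <TextLine ... /> element.
CONTENT_RE = re.compile(r'<TextLine\b[^>]*?\scontent="([^"]*)"[^>]*/>')


def blank_out_markup(raw):
    """Return a string of the same length as raw in which everything
    except the content attribute values is replaced by spaces."""
    out = [" "] * len(raw)
    for m in CONTENT_RE.finditer(raw):
        start, end = m.span(1)
        # Copy the content value back at its original offsets.
        out[start:end] = raw[start:end]
    return "".join(out)


raw = ('<TextDocument filename="somefile.end">'
       '<TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" />'
       '</TextDocument>')
padded = blank_out_markup(raw)
hit = padded.index("search")
# The offset found in the padded text is valid in the raw text as well.
print(raw[hit:hit + len("search")])
```

Because the padded string has exactly the length of the raw one, any offset a
tokenizer computes on it points at the same characters in the raw markup.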

But... if all of this is too difficult for you, some people might find it 
easier to preprocess the data before sending it to Solr.

I mean, do you really need to highlight the content in such a cryptic input 
format?

Ultimately you might be better off with a custom char filter - sometimes 
people can cope better with straight Java code than cryptic regular 
expression sequences.
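
The core of such a custom char filter is an offset-correction table: whenever
it drops input, it records how far output offsets have drifted from the raw
input, so that a later lookup can map them back. The sketch below models that
idea in Python rather than Java, purely as an illustration. StrippingFilter and
its attributes are hypothetical, it is not Lucene code, and it assumes the
TextLine format from this thread. Its correct method plays the role that
correct(int currentOff) plays in a Lucene char filter.

```python
import re
from bisect import bisect_right

# Hypothetical pattern for the sample format in this thread.
TEXTLINE_RE = re.compile(r'<TextLine\b[^>]*?\scontent="([^"]*)"[^>]*/>')


class StrippingFilter:
    """Toy char filter: emits only content values and records corrections."""

    def __init__(self, raw):
        self.offsets = []  # output offsets at which a new correction applies
        self.diffs = []    # raw offset minus output offset from there on
        parts = []
        out_len = 0
        for m in TEXTLINE_RE.finditer(raw):
            self.offsets.append(out_len)
            self.diffs.append(m.start(1) - out_len)
            parts.append(m.group(1) + " ")  # same shape as replacement="$1 "
            out_len += len(m.group(1)) + 1
        self.text = "".join(parts)

    def correct(self, off):
        """Map an offset in self.text back to an offset in the raw input."""
        i = bisect_right(self.offsets, off) - 1
        return off + self.diffs[i] if i >= 0 else off


raw = ('<TextDocument filename="somefile.end">'
       '<TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" />'
       '</TextDocument>')
f = StrippingFilter(raw)
start = f.text.index("search")
rs, rend = f.correct(start), f.correct(start + len("search"))
# Wrap the hit in the raw markup using the corrected offsets.
print(raw[:rs] + "<em>" + raw[rs:rend] + "</em>" + raw[rend:])
```

With per-value corrections like these, the highlight markers land inside the
content attribute instead of on the tag name.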

-- Jack Krupansky

-----Original Message----- 
From: jasimop
Sent: Thursday, May 30, 2013 12:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with PatternReplaceCharFilter


Re: Problem with PatternReplaceCharFilter

Posted by jasimop <st...@gmail.com>.

Honestly, I have no idea how to do that.
PatternReplaceCharFilter doesn't seem to have a parameter like
preservePositions="true", optionally with fillCharacter=" ".
And I don't think I can express this simply as a regex. How would I count, in a
pure regex, the length difference before and after the match?

Well, the specific problem is that when highlighting, the term positions are
wrong and the result is not a valid XML structure that I can handle.
I expect something like
<TextLine aa="bb" cc="dd" content="the content to <em>search</em> in" ee="ff" />
but I get
<Tex<em>tLine</em>aa="bb" cc="dd" content="the content to <em>search</em>
in" ee="ff" />

Thanks for your help.



--
View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869p4066939.html

Re: Problem with PatternReplaceCharFilter

Posted by Jack Krupansky <ja...@basetechnology.com>.

Just replace the stripped markup with the equivalent number of spaces to 
maintain positions.

Was there some specific problem you were encountering?

-- Jack Krupansky

-----Original Message----- 
From: jasimop
Sent: Wednesday, May 29, 2013 4:12 PM
To: solr-user@lucene.apache.org
Subject: Problem with PatternReplaceCharFilter
