Posted to solr-user@lucene.apache.org by Ziqi Zhang <zi...@sheffield.ac.uk> on 2015/10/04 13:08:31 UTC

Does Solr remove "\r" from text content during indexing?

Hi

I am trying to pin down a mismatch between the offsets produced by the 
Solr indexing process and the original document content, which shows up 
when I use those offsets to substring from the original text. It seems 
that if the content contains "\r" (the Windows carriage return), Solr 
silently removes it, so "ok\r\nthis is the text\r\nand..." becomes 
"ok\nthis is the text\nand...", and as a result the offsets created 
during indexing no longer match the original content.
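
To make the mismatch concrete, here is a minimal illustration (not my 
actual code, just the arithmetic):

     // Offsets computed on the "\r"-stripped text no longer line up
     // with the original "\r\n" content.
     String original   = "ok\r\nthis is the text\r\nand...";
     String normalized = original.replace("\r", "");  // what appears to get indexed
     int start = normalized.indexOf("text");          // 15 in the normalized text
     System.out.println(normalized.substring(start, start + 4)); // "text"
     System.out.println(original.substring(start, start + 4));   // " tex" -- shifted

Each "\r" preceding a token shifts that token's offset by one character.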

I asked about this on the Lucene mailing list, and it was suggested 
that Solr is the likely cause.

*To reproduce this issue, here is what I have done:*

1. Compile OpenNLPTokenizer.java and OpenNLPTokenizerFactory.java (in 
the attachment), which I use to analyse a text field. 
OpenNLPTokenizer.java is almost identical to the version at 
https://issues.apache.org/jira/browse/LUCENE-6595, except that I 
adapted it to Lucene 5.3.0. If you look at line 74 of OpenNLPTokenizer, 
it takes the "input" variable (of type Reader) from its superclass 
Tokenizer and tokenizes its content. At runtime, in the debugger, I can 
see that the string content held by this variable has already had "\r" 
removed (details below).
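
(As a sanity check at that breakpoint, you can drain the Reader once to 
see exactly what the tokenizer receives. This is a throwaway debug 
snippet, not part of the attached tokenizer; note it consumes the 
Reader, so use it only while debugging:)

     // "input" is the protected Reader field inherited from Tokenizer;
     // read() throws IOException, so evaluate this in the debugger
     StringBuilder sb = new StringBuilder();
     for (int c = input.read(); c != -1; c = input.read()) {
         sb.append((char) c);
     }
     // prints false: the "\r" is already gone before the tokenizer sees it
     System.out.println("input contains \\r: " + (sb.indexOf("\r") >= 0));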

2. Configure solrconfig.xml and schema.xml to use the above tokenizer. 
In solrconfig.xml, add something like the following and place the 
compiled classes into that folder:
<lib dir="${solr.install.dir:../../../..}/classes" regex=".*\.class" />

In schema.xml define a new field type:
<fieldType name="testFieldType" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
                   sentenceModel=".../your_path/en-sent.bin"
                   tokenizerModel=".../your_path/en-token.bin"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>

Download "en-sent.bin" and "en-token.bin" from below and place it 
somewhere and then change the sentenceModel and tokenizerModel params 
above to point to them:
http://opennlp.sourceforge.net/models-1.5/en-token.bin
http://opennlp.sourceforge.net/models-1.5/en-sent.bin

Then define a new field in the schema:
<field name="content" type="testFieldType" indexed="true" stored="false"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true"/>

3. Run the test class TestIndexing.java (attachment) in debug mode; 
*you need to place a breakpoint on line 74 of OpenNLPTokenizer*.
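
(In essence the test just adds one document whose "content" field holds 
the raw string with Windows line endings. A minimal sketch of that, 
assuming an embedded core so the breakpoint is hit in-process; the solr 
home path and core name below are placeholders, and the real attachment 
may differ in detail:)

     import org.apache.solr.client.solrj.SolrClient;
     import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
     import org.apache.solr.common.SolrInputDocument;
     import org.apache.solr.core.CoreContainer;

     public class TestIndexing {
         public static void main(String[] args) throws Exception {
             // embedded core, so the analyzer (and the breakpoint) runs in this JVM
             CoreContainer container = new CoreContainer("/path/to/solr/home"); // placeholder
             container.load();
             SolrClient client = new EmbeddedSolrServer(container, "collection1"); // placeholder

             SolrInputDocument doc = new SolrInputDocument();
             doc.addField("id", "1");
             // the raw string deliberately contains "\r\n"
             doc.addField("content", "ok\r\nthis is the text\r\nand...");

             client.add(doc);
             client.commit();
             client.close();
         }
     }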

*To see the problem, notice that:*
- Line 19 of TestIndexing.java passes the raw string "ok\r\nthis is the 
text\r\nand..." to the field "content", which is analyzed with the 
"testFieldType" defined above, so it triggers the OpenNLPTokenizer.
- When you stop at line 74 of OpenNLPTokenizer, inspect the value of 
the variable "input". It is instantiated as a *ReusableStringReader*, 
and its value is now "ok\nthis is the text\nand...": every "\r" has 
been removed (see the snippet after this list for where that reader 
comes from).
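
(Where that ReusableStringReader comes from, as far as I can tell: when 
a field's value is a plain String, Lucene's Field.tokenStream(...) 
hands the String to Analyzer.tokenStream(String fieldName, String 
text), which wraps it in a reused ReusableStringReader. Simplified from 
the Lucene 5.x sources:)

     // org.apache.lucene.document.Field.tokenStream(...), simplified
     if (readerValue() != null) {
         return analyzer.tokenStream(name(), readerValue());
     } else if (stringValue() != null) {
         // Solr sets field contents as a String, so this branch is taken;
         // Analyzer.tokenStream(String, String) then feeds the String through
         // a ReusableStringReader -- the reader seen at the breakpoint
         return analyzer.tokenStream(name(), stringValue());
     }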


*In an attempt to solve the problem, I have learnt that:*
- (suggested by a Lucene developer) the ReusableStringReader I see is a 
consequence of the way Solr sets the field contents (as a String). If 
the string behind that reader no longer contains \r, then it is Solr's 
fault.
- following the debugger, I pinpointed line 299 of 
DefaultIndexingChain, shown below:

       for (IndexableField field : docState.doc) {
         fieldCount = processField(field, fieldGen, fieldCount);
       }

Again in the debugger, I can see that the field "content" is 
encapsulated in an IndexableField object and that its content already 
has "\r" removed; a quick check, shown below, confirms it. However, at 
this point I cannot trace any further to find out how such 
IndexableFields are created by Solr or Lucene...
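
(The check is just a throwaway debug variant of the loop above, 
evaluated at that breakpoint; stringValue() is what the indexing chain 
ends up analyzing for String-valued fields:)

     for (IndexableField field : docState.doc) {
         if ("content".equals(field.name())) {
             // already prints false: "\r" is gone before Lucene's analysis runs
             System.out.println(field.stringValue().contains("\r"));
         }
     }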


Any thoughts on this would be much appreciated!