Posted to solr-user@lucene.apache.org by Ziqi Zhang <zi...@sheffield.ac.uk> on 2015/10/04 13:08:31 UTC
Does solr remove "\r" from text content for indexing?
Hi
I am trying to pin down a mismatch that appears when I use the offsets
produced by the Solr indexing process to take substrings of the original
document content. It seems that if the text contains "\r" (the Windows
carriage return), Solr silently removes it, so "ok\r\nthis
is the text\r\nand..." becomes "ok\nthis is the text\nand...", and as a
result the offsets created during Solr indexing no longer line up with the
original content.
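To make the mismatch concrete, here is a minimal, self-contained sketch
(plain Java, no Solr involved) of how an offset computed against the
\r-stripped text lands in the wrong place in the original:

```java
public class OffsetDrift {
    public static void main(String[] args) {
        String original = "ok\r\nthis is the text\r\nand...";
        String stripped = original.replace("\r", "");

        // Offsets of the token "text", computed against the stripped
        // string (as the indexing chain would apparently see it):
        int start = stripped.indexOf("text"); // 15
        int end = start + "text".length();    // 19

        System.out.println(stripped.substring(start, end)); // "text"
        // Applied to the original string, the same offsets are shifted
        // by one position for each removed "\r" preceding the token:
        System.out.println(original.substring(start, end)); // " tex"
    }
}
```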
I asked about this issue on the Lucene mailing list and was told that
Solr is the likely cause.
*To reproduce this issue, here is what I have done:*
1. Compile OpenNLPTokenizer.java and OpenNLPTokenizerFactory.java (in the
attachment), which I use to analyse a text field. OpenNLPTokenizer.java
is almost identical to the one at
https://issues.apache.org/jira/browse/LUCENE-6595, except that I adapted
it to Lucene 5.3.0. Line 74 of OpenNLPTokenizer takes the "input"
variable (of type Reader) from its superclass Tokenizer and tokenizes
its content. At runtime, in the debugger, I can see that the string
content held by this variable has already had "\r" removed (details
below).
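For comparison, a plain java.io.StringReader does not strip "\r" on its
own, so whatever removes it must run before the Reader handed to the
tokenizer is built. A quick stdlib-only check, draining a Reader into a
String the same way one might inspect the tokenizer's "input" variable:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderCheck {
    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("ok\r\nthis is the text\r\nand...");
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            sb.append((char) c);
        }
        // The "\r" at index 2 survives: StringReader itself is not
        // the culprit.
        System.out.println(sb.indexOf("\r")); // 2
    }
}
```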
2. configure solrconfig.xml and schema.xml to use the above tokenizer:
In solrconfig.xml, add something like the following and place the
compiled classes into that folder:
<lib dir="${solr.install.dir:../../../..}/classes" regex=".*\.class" />
In schema.xml define a new field type:
<fieldType name="testFieldType" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer
class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
sentenceModel=".../your_path/en-sent.bin"
tokenizerModel=".../your_path/en-token.bin"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Download "en-sent.bin" and "en-token.bin" from the links below, place
them somewhere, and change the sentenceModel and tokenizerModel
parameters above to point to them:
http://opennlp.sourceforge.net/models-1.5/en-token.bin
http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Then define a new field in the schema:
<field name="content" type="testFieldType" indexed="true" stored="false"
multiValued="false" termVectors="true" termPositions="true"
termOffsets="true"/>
3. Run the testing class TestIndexing.java (attachment) in debug mode;
*you need to place a breakpoint on line 74 of OpenNLPTokenizer*.
*To see the problem, notice that:*
- Line 19 of TestIndexing.java passes the raw string "ok\r\nthis is the
text\r\nand..." to be added to the field "content", which is analyzed by
the "testFieldType" defined above, so it triggers the OpenNLPTokenizer
class.
- When you stop at line 74 of OpenNLPTokenizer, inspect the value of the
variable "input". It is instantiated as a *ReusableStringReader*, and
its value is now "ok\nthis is the text\nand..."; every "\r" has been
removed.
*In an attempt to solve the problem, I have learnt that:*
- (suggested by a Lucene developer) the ReusableStringReader I see comes
from the way Solr sets the field contents (as a String). If the
StringReader no longer contains "\r", then it is Solr's fault.
- Following the debugger, I pinpointed line 299 of DefaultIndexingChain,
shown below:
for (IndexableField field : docState.doc) {
    fieldCount = processField(field, fieldGen, fieldCount);
}
Again in the debugger, I can see that the field "content" is
encapsulated in an IndexableField object, and its content already has
"\r" removed.
However, at this point I cannot trace further to find out how such
IndexableFields are created by Solr or Lucene...
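For what it's worth, the workaround I am currently considering (assuming
the stripping cannot be switched off) is to apply the same "\r" removal
to my own copy of the document before taking substrings, so that my text
matches what Solr apparently indexed:

```java
public class NormalizeForOffsets {
    // Mirror the "\r" removal the indexed text apparently undergoes.
    static String normalize(String raw) {
        return raw.replace("\r", "");
    }

    public static void main(String[] args) {
        String raw = "ok\r\nthis is the text\r\nand...";
        String normalized = normalize(raw);
        // Offsets produced at index time now line up with `normalized`:
        System.out.println(normalized.substring(15, 19)); // "text"
    }
}
```

This is only a band-aid, though; I would still like to know where the
stripping actually happens.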
Any thoughts on this would be much appreciated!