Posted to solr-user@lucene.apache.org by Dean Thompson <dt...@mspoke.com> on 2008/12/05 19:44:15 UTC

Re: IOException: Mark invalid while analyzing HTML

Was this one ever addressed?  I'm seeing it in some small percentage of the
documents that I index in 1.4-dev 708596M.  I don't see a corresponding JIRA
issue.


James Brady-3 wrote:
> 
> Hi,
> I'm seeing the problem described in SOLR-42, "Highlighting problems with
> HTMLStripWhitespaceTokenizerFactory":
> https://issues.apache.org/jira/browse/SOLR-42
> 
> I'm indexing HTML documents, and am getting reams of "Mark invalid"  
> IOExceptions:
> SEVERE: java.io.IOException: Mark invalid
> 	at java.io.BufferedReader.reset(Unknown Source)
> 	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
> 	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
> 	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
> 	at java.io.Reader.read(Unknown Source)
> 	at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
> 	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
> 	at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
> 	at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
> 	at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
> 	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
> 	at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
> 	at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
> 	at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
> 	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
> 	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
> 	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
> 	at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
> 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
> 	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
> 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
> 
> 
> This is using a ~1 week old version of Solr 1.3 from SVN.
> 
> One workaround mentioned in that Jira issue was to move HTML stripping  
> outside of Solr; can anyone suggest a better approach than that?
> 
> Thanks
> James
> 
> 
> 



Re: IOException: Mark invalid while analyzing HTML

Posted by Grant Ingersoll <gs...@apache.org>.
About the only thing you can do here is to increase the readAheadLimit  
on the BufferedReader, but, by the looks of it, that also means we  
need to modify the TokenStream Factories that create the  
HTMLStripReader so that they take in some optional attributes.  If you  
can open a JIRA issue for this, that would be great.

-Grant
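
For readers unfamiliar with the exception, here is a minimal Java sketch of
how java.io.BufferedReader can produce "Mark invalid" and why a larger
readAheadLimit avoids it. It is not taken from the Solr source. The class
name MarkInvalidDemo, the method markReadReset, the sample text and the
deliberately tiny 8-char buffer are all assumptions made for illustration;
the limit and buffer size that HTMLStripReader actually uses are not shown
in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; not part of Solr or Lucene.
final class MarkInvalidDemo {

    /**
     * Marks the stream with the given readAheadLimit, reads up to
     * charsToRead characters, then calls reset().
     * Returns "reset ok" on success, or the IOException message on failure.
     */
    static String markReadReset(int readAheadLimit, int charsToRead) {
        // Sample input (assumption): any text longer than the buffer will do.
        String text = "<html><body><p>some &amp; text longer than the buffer</p></body></html>";
        // An 8-char buffer forces refills so the effect shows on short input.
        try (BufferedReader in = new BufferedReader(new StringReader(text), 8)) {
            in.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (in.read() == -1) {
                    break; // end of input
                }
            }
            in.reset(); // throws IOException if the mark was invalidated
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Reading 20 chars past a mark that only allows 4: the mark is lost.
        System.out.println(markReadReset(4, 20));   // prints "Mark invalid"
        // Same read with a limit of 64: reset() succeeds.
        System.out.println(markReadReset(64, 20));  // prints "reset ok"
    }
}
```

In the JDK implementations I am aware of, the mark is only dropped when the
internal buffer has to be refilled after the limit has been exceeded, so
whether a given document fails can depend on where the mark falls relative
to a buffer boundary. That may be why only a small percentage of documents
are affected, but I have not verified it against HTMLStripReader.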

On Dec 5, 2008, at 1:44 PM, Dean Thompson wrote:

>
> Was this one ever addressed?  I'm seeing it in some small percentage  
> of the
> documents that I index in 1.4-dev 708596M.  I don't see a  
> corresponding JIRA
> issue.
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ