Posted to solr-user@lucene.apache.org by Majisha Parambath <pa...@usc.edu> on 2015/03/23 00:04:07 UTC

Error trying to index files to Solr

Hello,

As part of an assignment, we crawled and collected the NSF and NASA Polar
Datasets using Nutch, then used the nutch dump command to dump out the
segments that were created as part of the crawl.
Now we have to index this data into Solr. I am running java -jar post.jar
filename to post it to Solr. However, after execution my file does not
appear to be indexed, and the log shows exceptions, which I am attaching to
this mail.

Could you please let me know if I am missing something?

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

Re: Error trying to index files to Solr

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/22/2015 5:04 PM, Majisha Parambath wrote:
> As part of an assignment, we initially crawled and collected  NSF and
> NASA Polar Datasets using Nutch. We used the nutch dump command to dump
> out the segments that were created as part of the crawl.
> Now we have to index this data into Solr. I am using java -jar post.jar
> filename to post to solr however after the execution I do not see my
> file indexed and checking the log I found exceptions which I am
> attaching with this mail.

Here's the first part of your exception:

org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0x0 (at char #10, byte #-1)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
	
Solr is expecting UTF-8, but the data you are sending is in another
character encoding and includes characters outside the ASCII range.  The
XMLLoader frame in the stack trace shows that Solr is parsing the data
as XML.

If you know which character encoding the data actually uses, you can
declare it in the XML declaration (the encoding attribute), and the XML
libraries that Solr uses can probably convert it to UTF-8 automatically.

http://www.w3schools.com/xml/xml_encoding.asp
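
As a rough illustration of the idea above (not something from the original
exchange), here is a minimal Python sketch that checks whether a payload is
valid UTF-8 and, if the XML declares its real encoding, re-serializes it as
UTF-8 before posting. The sample document, the ISO-8859-1 encoding, and the
function names are all assumptions made for the example; your dumped files
may use a different encoding or may not be XML at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a Solr-style <add> document stored as ISO-8859-1.
# The byte 0xE9 is "e with acute accent" in ISO-8859-1, but it is not a
# valid UTF-8 sequence when followed by "<".
LATIN1_XML = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<add><doc><field name="id">caf\xe9</field></doc></add>'
)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_xml_as_utf8(raw: bytes) -> bytes:
    """Parse XML using its declared encoding, then serialize it as UTF-8."""
    root = ET.fromstring(raw)  # honors the encoding in the XML declaration
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    print("original is valid UTF-8:", is_valid_utf8(LATIN1_XML))
    fixed = reencode_xml_as_utf8(LATIN1_XML)
    print("re-encoded is valid UTF-8:", is_valid_utf8(fixed))
```

This only helps when the declared encoding matches the actual bytes. If the
file has no declaration or the wrong one, the parser will still fail or
produce garbage.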

Thanks,
Shawn


Re: Error trying to index files to Solr

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Majisha,

Nutch's Solr indexing plugin has support for stripping non-UTF-8 code points
from the input, but it does so only on the content field, if I remember
correctly.

However, that stripping method was not built with the invalid middle byte 
exception in mind, and I had not seen it even once before Solr 5.x. We are 
upgrading parts of our infrastructure to Solr 5.x and were hit by this too.
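
For illustration only, a stripping step that also copes with invalid byte
sequences could look roughly like the Python sketch below. This is not
Nutch's actual implementation. It assumes the field arrives as raw bytes
that are meant to be UTF-8, drops bytes that do not decode, and then removes
characters that the XML 1.0 specification does not allow. The name
clean_field is hypothetical.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_field(raw: bytes) -> str:
    """Drop undecodable bytes, then strip characters XML 1.0 forbids."""
    # errors="ignore" silently discards bytes that are not valid UTF-8,
    # such as a lead byte followed by 0x00 instead of a continuation byte.
    text = raw.decode("utf-8", errors="ignore")
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # 0xE9 starts a multi-byte sequence, but 0x00 is not a valid middle byte.
    print(repr(clean_field(b"polar\xe9\x00data")))
```

Discarding bytes loses information, so a step like this is a last resort
compared with decoding the data using its real encoding.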

Can you confirm that it is the content field sent by Nutch that causes the 
problem?

Markus
