Posted to user@nutch.apache.org by neeraj <ne...@yahoo.com> on 2013/03/17 19:34:48 UTC

SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

I am getting the following exception when indexing documents to Solr from Nutch.

org.apache.solr.common.SolrException: An invalid XML character (Unicode:
0xffffffff) was found in the element content of the document.

Please let me know how to resolve this.

I am using Nutch 1.6 for crawling and Solr 4.1.

Thanks,
Neeraj.

  



--
View this message in context: http://lucene.472066.n3.nabble.com/SolrException-An-invalid-XML-character-Unicode-0xffffffff-was-found-in-the-element-content-of-the-do-tp4048290.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

Posted by feng lu <am...@gmail.com>.

I am not sure whether this error is caused by the property
"parser.character.encoding.default". Can you trace the error back to a
specific document? Then you can create a test environment, parse and index
that document again, and see what happens.
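
One way to narrow a failing batch down to a specific document is to split it
in half repeatedly. The following is only a rough Python sketch of that idea,
not anything that ships with Nutch or Solr: find_failing_docs, index_batch and
fake_index are hypothetical names, and fake_index merely stands in for a real
indexing call. Against a real index, re-sending documents has side effects, so
this belongs in a test environment.

```python
from typing import Callable, List, Sequence


def find_failing_docs(
    docs: Sequence[str],
    index_batch: Callable[[Sequence[str]], None],
) -> List[str]:
    """Return the documents that make index_batch raise, by repeated halving."""
    if not docs:
        return []
    try:
        index_batch(docs)
        return []  # the whole batch indexed cleanly
    except Exception:
        if len(docs) == 1:
            return [docs[0]]  # a single failing document has been isolated
        mid = len(docs) // 2
        return (find_failing_docs(docs[:mid], index_batch)
                + find_failing_docs(docs[mid:], index_batch))


# Hypothetical stand-in for a real indexing call: it rejects U+FFFF.
def fake_index(batch: Sequence[str]) -> None:
    for doc in batch:
        if "\uffff" in doc:
            raise ValueError("invalid XML character")


bad = find_failing_docs(["ok one", "bro\uffffken", "ok two"], fake_index)
print(bad)
```

Halving keeps the number of indexing attempts small when only a few documents
in a large batch are bad.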

On Mon, Mar 18, 2013 at 12:17 PM, neeraj <ne...@yahoo.com> wrote:

> So do I need to recrawl all my documents with the property changed to
> utf-8? Will that resolve the indexing problem?

-- 
Don't Grow Old, Grow Up... :-)

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

Posted by neeraj <ne...@yahoo.com>.

So do I need to recrawl all my documents with the property changed to
utf-8? Will that resolve the indexing problem?




Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

Posted by feng lu <am...@gmail.com>.

Yes, NUTCH-1016 already fixed this problem.

The property "parser.character.encoding.default" is used when
EncodingDetector cannot detect the content encoding. It sets the default
encoding for the page content. If that fallback is wrong, the parsed content
can sometimes come out as unreadable text, as in [0].

[0]
http://mail-archives.apache.org/mod_mbox/nutch-user/201303.mbox/%3CCAOeWMMp7nGLe6otgmBePb450CmObc3w0xJk6oHS1RAfF_5qAsw@mail.gmail.com%3E
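
To illustrate the kind of unreadable text a wrong default encoding produces,
here is a small Python sketch. It assumes a page that is really UTF-8 but is
decoded with windows-1252 as the fallback; the sample string is made up, and
this is not how Nutch itself performs the decoding.

```python
# Assumption: the page is really UTF-8, but no encoding was detected,
# so the configured default (windows-1252 here) is applied instead.
original = "Zürich café"
raw_bytes = original.encode("utf-8")

# Decoding with the wrong default yields unreadable text (mojibake).
wrong = raw_bytes.decode("windows-1252")
# Decoding with the correct encoding round-trips the text.
right = raw_bytes.decode("utf-8")

print(wrong)
print(right)
```

Each non-ASCII character in the original turns into two unrelated characters
in the wrongly decoded version, while the correctly decoded text is unchanged.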


On Mon, Mar 18, 2013 at 10:31 AM, neeraj <ne...@yahoo.com> wrote:

> Amuseme,
>
>    Thanks for the reply. I reviewed the exceptions given at the link and I
> am not getting any of those. I have crawled more than 5 million documents
> and was able to index 120K of them to Solr before this invalid XML
> character exception occurred.
>
> While investigating this issue, I found previous posts on the same topic
> where a patch was applied to stripNonCharCodepoints(). But that patch is
> already part of Nutch 1.6 and I am still getting the same exception.
>
> My "parser.character.encoding.default" was set to windows-1252 when
> crawling all these documents. Could that have led to this exception when
> indexing?
>
> Any insight on this will be helpful.
>
> Thanks,
> Neeraj.




Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

Posted by neeraj <ne...@yahoo.com>.

Amuseme,

   Thanks for the reply. I reviewed the exceptions given at the link and I
am not getting any of those. I have crawled more than 5 million documents
and was able to index 120K of them to Solr before this invalid XML
character exception occurred.

While investigating this issue, I found previous posts on the same topic
where a patch was applied to stripNonCharCodepoints(). But that patch is
already part of Nutch 1.6 and I am still getting the same exception.
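
For reference, the general idea of stripping characters that XML does not
allow can be sketched as below. This is a Python illustration based on the
"Char" production of the XML 1.0 specification. It is not the Nutch
stripNonCharCodepoints() code, which may filter a different set of code
points, and strip_invalid_xml_chars is a made-up name.

```python
import re

# Characters outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with every character that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = strip_invalid_xml_chars("title\x00 with\uffff bad\x0b chars\n")
print(repr(cleaned))
```

Tab, line feed and carriage return are kept, while other control characters
and U+FFFF are dropped.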

My "parser.character.encoding.default" was set to windows-1252 when crawling
all these documents. Could that have led to this exception when indexing?

Any insight on this will be helpful.

Thanks,
Neeraj.




Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

Posted by feng lu <am...@gmail.com>.

Hi Neeraj

schema-solr4.xml does not work with Solr 4.1.0. You could apply this
patch [0] and run the indexing again.

[0] https://issues.apache.org/jira/browse/NUTCH-1486


On Mon, Mar 18, 2013 at 2:34 AM, neeraj <ne...@yahoo.com> wrote:

> I am getting the following exception when indexing documents to Solr from
> Nutch.
>
> org.apache.solr.common.SolrException: An invalid XML character (Unicode:
> 0xffffffff) was found in the element content of the document.
>
> Please let me know how to resolve this.
>
> I am using Nutch 1.6 for crawling and Solr 4.1.
>
> Thanks,
> Neeraj.


