Posted to user@nutch.apache.org by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu> on 2012/05/21 17:06:56 UTC

error parsing some xml

Hi all.
When I try to crawl, I have a problem parsing some XML. I get the exception below, and I want to know which XML file is causing the problem at parse time.
**************************************************************************************
WARN  parse.ParsePluginsReader - Unable to parse [null].Reason is [org.xml.sax.SAXParseException; lineNumber: 37; columnNumber: 7; The string "--" is not permitted within comments.]
***************************************************************************************
Any help will be appreciated.


10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

RE: error parsing some xml

Posted by Markus Jelsma <ma...@openindex.io>.
Strange, it should show the bad URL. But since you have only 9 URLs, the easiest way to go is to run the parsechecker tool on each URL.
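
To run the check over a short seed list without typing each command by hand, something like the following minimal sketch could work. It assumes the `bin/nutch parsechecker <url>` command is available in your Nutch install and that you run it from the Nutch home directory. The helper names `parsechecker_command` and `check_urls` are hypothetical, and whether the exit code reflects a parse failure is an assumption, so inspect the captured output as well.

```python
import subprocess


def parsechecker_command(url, nutch_bin="bin/nutch"):
    """Build the argument list for one parsechecker run (path is an assumption)."""
    return [nutch_bin, "parsechecker", url]


def check_urls(urls, nutch_bin="bin/nutch"):
    """Run parsechecker once per URL and collect exit code and output."""
    results = {}
    for url in urls:
        proc = subprocess.run(
            parsechecker_command(url, nutch_bin),
            capture_output=True,
            text=True,
        )
        # Keep stdout and stderr together so any exception text is not lost.
        results[url] = (proc.returncode, proc.stdout + proc.stderr)
    return results
```

Calling `check_urls(["http://www.uci.cu"])` returns a dictionary keyed by URL; searching each output for `SAXParseException` would point at the offending document, if the problem is in a fetched page at all.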

 
 

Re: error parsing some xml

Posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu>.
I use Nutch 1.4 and Solr 3.4.
I think my error occurs when parsing an XML file with this structure:
<!--text with -- inside the comment-->
I have been reading but have not found much. This is my error log.
Please help.
*************************************************************************************************
2012-05-21 10:17:53,398 INFO  fetcher.Fetcher - Fetcher: starting at 2012-05-21 10:17:53
2012-05-21 10:17:53,399 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20120521101752
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Using queue mode : byHost
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Fetcher: threads: 20
2012-05-21 10:17:53,762 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2012-05-21 10:17:53,777 INFO  fetcher.Fetcher - QueueFeeder finished: total 9 records + hit by time limit :0
2012-05-21 10:17:53,804 WARN  parse.ParsePluginsReader - Unable to parse [null].Reason is [org.xml.sax.SAXParseException; lineNumber: 37; columnNumber: 7; The string "--" is not permitted within comments.]
2012-05-21 10:17:53,809 WARN  mapred.LocalJobRunner - job_local_0005
java.lang.RuntimeException: Parse Plugins preferences could not be loaded.
	at org.apache.nutch.parse.ParserFactory.<init>(ParserFactory.java:73)
	at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:53)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:581)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1075)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
****************************************************************************************************
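
One way to find which XML file is malformed is to run a well-formedness check on each candidate and read off the reported line and column. The trace above goes through ParsePluginsReader and ParserFactory, which suggests the file may be Nutch's own parse plugins configuration (conf/parse-plugins.xml by default) rather than a fetched document. That is an inference from the trace, not something confirmed in this thread. The sketch below uses Python's standard `xml.etree.ElementTree`; `find_xml_error` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text):
    """Return None if xml_text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        return line, column, str(err)
    return None


# A normal comment is fine; "--" inside a comment is not allowed in XML.
good = "<parse-plugins><!-- a normal comment --></parse-plugins>"
bad = "<parse-plugins><!-- text with -- inside the comment --></parse-plugins>"

print(find_xml_error(good))
print(find_xml_error(bad))
```

To check a file on disk, read it in binary mode and pass the bytes, for example `find_xml_error(open("conf/parse-plugins.xml", "rb").read())`. If that reports an error near line 37, rewording or removing the "--" inside the comment should make the file well-formed again.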





RE: error parsing some xml

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

Which version do you use? It should list the troubling URL. What's the stack trace?

Cheers

 
 