
Posted to user@nutch.apache.org by raviksingh <ra...@gmail.com> on 2013/03/05 17:38:10 UTC

Find which URL created exception

Hi,
  I am new to Nutch. I am using Nutch with MySQL.
While trying to crawl http://piwik.org/xmlrpc.php
Nutch throws an exception:

Parsing http://piwik.org/xmlrpc.php
Call completed
java.lang.RuntimeException: job failed: name=update-table, jobid=null
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
	at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:98)
	at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
	at org.apache.nutch.crawl.Crawler.run(Crawler.java:181)
	at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at ravi.crawler.MyCrawl.crawl(MyCrawl.java:13)
	at ravi.crawler.Crawler.AttachCrawl(Crawler.java:88)
	at scheduler.MyTask.run(MyTask.java:15)
	at java.util.TimerThread.mainLoop(Unknown Source)
	at java.util.TimerThread.run(Unknown Source)



Please check the link, as it looks like a service.

How can I resolve this?





--
View this message in context: http://lucene.472066.n3.nabble.com/Find-which-URL-created-exception-tp4044914.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Find which URL created exception

Posted by kiran chitturi <ch...@gmail.com>.

This is an issue with using MySQL as the backend. HBase is the current stable
backend for Nutch 2.x, AFAIK. There are a few bugs with MySQL as the backend.
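
In case it helps to narrow down which URL triggers the "Data too long for
column 'id'" error, here is a minimal Java sketch. It assumes the 'id' column
holds a key derived from the page URL and that the column is a VARCHAR with a
fixed maximum length. The class name LongKeyFinder, the method findTooLong and
the limit of 767 are made up for illustration and are not part of Nutch or
Gora; please check the real column definition (for example with
SHOW CREATE TABLE) before relying on that number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LongKeyFinder {
    // Hypothetical limit: replace with the real length of the 'id' column.
    static final int ASSUMED_MAX_ID_LENGTH = 767;

    // Returns the URLs whose length exceeds maxLength characters.
    // String.length() counts UTF-16 code units, which is a close
    // approximation of the character count for typical URLs.
    static List<String> findTooLong(List<String> urls, int maxLength) {
        List<String> tooLong = new ArrayList<>();
        for (String url : urls) {
            if (url.length() > maxLength) {
                tooLong.add(url);
            }
        }
        return tooLong;
    }

    public static void main(String[] args) {
        // Build an artificially long URL to stand in for a long outlink.
        StringBuilder longUrl = new StringBuilder("http://example.com/?q=");
        for (int i = 0; i < 800; i++) {
            longUrl.append('x');
        }

        List<String> candidates = Arrays.asList(
                "http://piwik.org/xmlrpc.php",
                longUrl.toString());

        for (String url : findTooLong(candidates, ASSUMED_MAX_ID_LENGTH)) {
            System.out.println("Too long (" + url.length() + " chars): " + url);
        }
    }
}
```

The page being parsed may be short while one of its outlinks is not, so it is
probably worth feeding the outlinks of the failing page into a check like this
rather than only the seed URL.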


On Tue, Mar 5, 2013 at 11:52 AM, raviksingh <ra...@gmail.com> wrote:

> This is the log :
> [...]
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data
> too long for column 'id' at row 1
> [...]



-- 
Kiran Chitturi

Re: Find which URL created exception

Posted by raviksingh <ra...@gmail.com>.

This is the log:

The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled
via the plugin.includes system property, and all claim to support the
content type text/plain, but they are not mapped to it  in the
parse-plugins.xml file
2013-03-05 22:06:54,076 WARN  parse.ParseUtil - Unable to successfully parse content http://piwik.org/xmlrpc.php of type text/plain
2013-03-05 22:06:54,955 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-03-05 22:06:55,706 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-03-05 22:06:55,707 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-03-05 22:06:55,707 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-03-05 22:06:55,707 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-03-05 22:06:56,216 WARN  mapred.LocalJobRunner - job_local_0005
java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'id' at row 1
	at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
	at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
	at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'id' at row 1
	at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
	at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
	at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
	... 5 more
Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'id' at row 1
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3607)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
	at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
	at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
	at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
	at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
	at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)




Re: Find which URL created exception

Posted by kiran chitturi <ch...@gmail.com>.

Hi!

Looking at 'logs/hadoop.log' will give you more information on why the job
has failed.
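
If the log is large, something like the following sketch might help pull out
the interesting lines. It is only an illustration: the class name
LogCauseScanner and the method findCauses are hypothetical, the default path
logs/hadoop.log is taken from the note above, and the matching rules (WARN,
ERROR and "Caused by:" lines) are an assumption about what is worth reading
first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class LogCauseScanner {
    // Keeps WARN/ERROR log entries and the "Caused by:" lines of stack traces.
    static List<String> findCauses(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Caused by:")
                    || trimmed.contains(" WARN ")
                    || trimmed.contains(" ERROR ")) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The log path can be passed as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        for (String hit : findCauses(Files.readAllLines(log))) {
            System.out.println(hit);
        }
    }
}
```

The last "Caused by:" line printed is usually the root cause of the failed job.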

To check if a single URL can be crawled, please use the parseChecker tool [0].

[0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker

I have checked using parseChecker and it worked for me.





On Tue, Mar 5, 2013 at 11:38 AM, raviksingh <ra...@gmail.com> wrote:

> While trying to crawl http://piwik.org/xmlrpc.php
> nutch throws exception :
> [...]
> java.lang.RuntimeException: job failed: name=update-table, jobid=null
> [...]



-- 
Kiran Chitturi