Posted to user@nutch.apache.org by j....@thomsonreuters.com on 2012/09/02 01:25:10 UTC

RE: Nutch 2.0 MySQL Data truncation: Data too long for column 'content' at row 1

I am using `content` mediumblob in the SQL table and have not had any
issues so far, but that means little, as it is probably due to
differences in what we are crawling.  However, I am running into a
similar problem with the outlinks column, so I suspect the original
webpage table creation script for MySQL has never been tested with real
production runs.  I am going to play around with that and see if I can
get something more robust this week.
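
In case anyone wants to experiment before then, the sketch I have in
mind is simply to widen the column; something like the following should
do it (untested on my side, column name as in the original webpage
table quoted below):

-- untested: plain blob tops out at 65,535 bytes, which a link-heavy
-- page can exceed; mediumblob raises the cap to 16 MB
ALTER TABLE webpage MODIFY `outlinks` mediumblob;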

-----Original Message-----
From: Matt MacDonald [mailto:matt@nearbyfyi.com] 
Sent: Friday, August 31, 2012 9:38 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column
'content' at row 1

Yeah I had tried using "MEDIUMBLOB" and ran into the same error message.

<field name="content" column="content" jdbc-type="MEDIUMBLOB"/>

It seems likely that we'd want to use HBase in production anyway, so I
ended up switching the configuration over to HBase for the crawl and
didn't run into any issues with the content length of the documents. If
anyone knows more about what the MySQL issue is, it might be worth
replying so that others encountering the issue don't hit the same dead
end that I did.

Thanks,
Matt

On Thu, Aug 30, 2012 at 4:05 PM, Ferdy Galema
<fe...@kalooga.com> wrote:

> Hi,
>
> It might be too obvious but have you already tried to play around with
> the following line in the mapping (seems like a limit to me)?
> <field name="content" column="content" length="65535"/>
>
> It has been a while since I tried to store big content with the 
> SqlStore, so I'm not sure how it works exactly.
>
> Ferdy.
>
> On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <ma...@nearbyfyi.com>
> wrote:
>
> > Hi,
> >
> > I'm using Nutch 2.0 from the 2.x branch on Github and used
> > http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as 
> > the storage backend. I'm seeing the following error show up in my 
> > hadoop.log file while fetching and wonder if others have ideas for 
> > moving past the error without having to set the http.content.limit:
> >
> > java.io.IOException: java.sql.BatchUpdateException: Data truncation:
> > Data too long for column 'content' at row 1
> >         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
> >         at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
> >         at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
> >         at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
> >         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> > Caused by: java.sql.BatchUpdateException: Data truncation: Data too
> > long for column 'content' at row 1
> >         at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
> >         at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
> >         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
> >
> >
> > The webpage table in MySQL looks like:
> >
> > mysql> desc webpage;
> > +-------------------+--------------+------+-----+---------+-------+
> > | Field             | Type         | Null | Key | Default | Extra |
> > +-------------------+--------------+------+-----+---------+-------+
> > | id                | varchar(512) | NO   | PRI | NULL    |       |
> > | headers           | blob         | YES  |     | NULL    |       |
> > | text              | mediumtext   | YES  |     | NULL    |       |
> > | status            | int(11)      | YES  |     | NULL    |       |
> > | markers           | blob         | YES  |     | NULL    |       |
> > | parseStatus       | blob         | YES  |     | NULL    |       |
> > | modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
> > | score             | float        | YES  |     | NULL    |       |
> > | typ               | varchar(32)  | YES  |     | NULL    |       |
> > | baseUrl           | varchar(512) | YES  |     | NULL    |       |
> > | content           | mediumblob   | YES  |     | NULL    |       |
> > | title             | varchar(512) | YES  |     | NULL    |       |
> > | reprUrl           | varchar(512) | YES  |     | NULL    |       |
> > | fetchInterval     | int(11)      | YES  |     | NULL    |       |
> > | prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
> > | inlinks           | blob         | YES  |     | NULL    |       |
> > | prevSignature     | blob         | YES  |     | NULL    |       |
> > | outlinks          | blob         | YES  |     | NULL    |       |
> > | fetchTime         | bigint(20)   | YES  |     | NULL    |       |
> > | retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
> > | protocolStatus    | blob         | YES  |     | NULL    |       |
> > | signature         | blob         | YES  |     | NULL    |       |
> > | metadata          | blob         | YES  |     | NULL    |       |
> > +-------------------+--------------+------+-----+---------+-------+
> > 23 rows in set (0.00 sec)
> >
> >
> >
> > And my gora-sql-mapping.xml file looks like:
> >
> > <class name="org.apache.nutch.storage.WebPage"
> >        keyClass="java.lang.String" table="webpage">
> >   <primarykey column="id" length="512"/>
> >     <field name="baseUrl" column="baseUrl" length="512"/>
> >     <field name="status" column="status"/>
> >     <field name="prevFetchTime" column="prevFetchTime"/>
> >     <field name="fetchTime" column="fetchTime"/>
> >     <field name="fetchInterval" column="fetchInterval"/>
> >     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
> >     <field name="reprUrl" column="reprUrl" length="512"/>
> >     <field name="content" column="content" length="65535"/>
> >     <field name="contentType" column="typ" length="32"/>
> >     <field name="protocolStatus" column="protocolStatus"/>
> >     <field name="modifiedTime" column="modifiedTime"/>
> >
> >     <!-- parse fields                                       -->
> >     <field name="title" column="title" length="512"/>
> >     <field name="text" column="text" length="32000"/>
> >     <field name="parseStatus" column="parseStatus"/>
> >     <field name="signature" column="signature"/>
> >     <field name="prevSignature" column="prevSignature"/>
> >
> >     <!-- score fields                                       -->
> >     <field name="score" column="score"/>
> >     <field name="headers" column="headers"/>
> >     <field name="inlinks" column="inlinks"/>
> >     <field name="outlinks" column="outlinks"/>
> >     <field name="metadata" column="metadata"/>
> >     <field name="markers" column="markers"/> </class>
> >
> >
> > The http.content.limit in nutch-default.xml looks like:
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> >
> > My my.cnf file looks like:
> >
> > [mysqld]
> > max_allowed_packet             = 200M
> >
> > character-set-server = utf8
> > collation-server=utf8_unicode_ci
> >
> >
> > I've tested that I could change the http.content.limit property to be
> > a nonnegative number (65535) and the fetch job completes, but I want
> > the non-truncated content available so that I'm crawling all links on
> > the page and storing the entire contents of the document, which I can
> > then index in Elasticsearch. Any ideas on how I can fetch and store
> > the full content in MySQL? If the answer is "use HBase", I'll do that;
> > I'm just trying to remove one more new variable as I learn how Nutch
> > works. With the content limit set, my crawl completes, but I'm missing
> > nearly a third of the documents I would expect because the content is
> > being truncated.
> >
> > Thanks for any advice you can offer.
> >
> > Thanks,
> > Matt
> >
>

RE: Nutch 2.0 MySQL Data truncation: Data too long for column 'content' at row 1

Posted by j....@thomsonreuters.com.
Thanks Paul. I had the table change part figured out already, but your
comment about malformed ids and DbUpdateReducer.reduce was really
helpful (and something worth its own thread). With the proper table set
up, there should be no problem with size in MySQL. However, the URL
parsing for the id has issues, and non-ASCII characters cause problems
in various places.
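
For anyone wondering whether their own crawl has picked up these
malformed ids, a quick hypothetical check against the webpage table
(schema as in Paul's post below) would be something like:

-- ids longer than the 512 chars the mapping allows are almost
-- certainly mangled URLs
SELECT id, CHAR_LENGTH(id) AS len
FROM webpage
WHERE CHAR_LENGTH(id) > 512
ORDER BY len DESC
LIMIT 10;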

Re: Nutch 2.0 MySQL Data truncation: Data too long for column 'content' at row 1

Posted by Paul Dhaliwal <su...@gmail.com>.
I am giving Nutch 2.0 MySQL a go also, and I am running into similar
issues.

I changed inlinks on both the host and webpage tables to longblob. I
also modified webpage.title to varchar(2048), because 512 was not
cutting it. I made a couple of code changes as well: I noticed that my
updatedb jobs kept failing because one of the pages had a really long
id, and it turned out that id contained body text. To handle this, I
skip pages with an id longer than 512 in DbUpdateReducer.reduce. Every
page skipped had a malformed id.

My current schema is posted below, and it has been running pretty
stably for me. I have only gone through a few hundred thousand pages,
but I seem to have come across all the funny ones.

Cheers,
Paul

----- my current nutch 2.0 mysql schema
CREATE TABLE `host` (
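  -- latin1 (1 byte/char) keeps this 767-char primary key within
  -- InnoDB's default 767-byte index key limit; 3-byte utf8 would not fit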
  `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '',
  `inlinks` longblob,
  `outlinks` longblob,
  `metadata` longblob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '',
  `headers` blob,
  `text` mediumtext,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` longblob,
  `prevSignature` blob,
  `outlinks` blob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
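
For anyone starting from the original webpage creation script instead
of rebuilding, the changes I described boil down to roughly the
following (untested as written; double-check against your own tables):

-- rough, untested in-place equivalent of the webpage schema above
ALTER TABLE webpage
  MODIFY `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '',
  MODIFY `title` varchar(2048) DEFAULT NULL,
  MODIFY `inlinks` longblob;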
