Posted to user@nutch.apache.org by Tamer Yousef <TY...@boardreader.com> on 2015/01/07 00:13:55 UTC

Nutch 2.2.1 error with MySql

Hi all!
I'm following the guide "Installing Nutch 2.2 with MySQL to handle UTF-8" at http://nlp.solutions.asia/?p=362 but I'm getting an error.
Here is my setup:

MySQL 5.6 on CentOS 6.6 on one box, and Nutch 2.2.1 on another box.

The config file is at /etc/my.cnf (not at /etc/mysql/my.cnf as in the article) and has the following content:
---------------------------------------------------------------------------------
[mysqld]

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M
---------------------------------------------------------------------------------
I have created the database and the webpage table as described in the article, and I have set up the Nutch and Gora configs the same way.

My gora-sql-mapping.xml contains:
--------------------------------------------------------------------------------------------------------------
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
  <primarykey column="id" length="767"/>
    <field name="baseUrl" column="baseUrl" length="512"/>
    <field name="status" column="status"/>
    <field name="prevFetchTime" column="prevFetchTime"/>
    <field name="fetchTime" column="fetchTime"/>
    <field name="fetchInterval" column="fetchInterval"/>
    <field name="retriesSinceFetch" column="retriesSinceFetch"/>
    <field name="reprUrl" column="reprUrl" length="512"/>
    <field name="content" column="content" length="65536"/>
    <field name="contentType" column="typ" length="32"/>
    <field name="protocolStatus" column="protocolStatus"/>
    <field name="modifiedTime" column="modifiedTime"/>
    <field name="prevModifiedTime" column="prevModifiedTime"/>
    <field name="batchId" column="batchId" length="32"/>

    <!-- parse fields                                       -->
    <field name="title" column="title" length="512"/>
    <field name="text" column="text" length="32000"/>
    <field name="parseStatus" column="parseStatus"/>
    <field name="signature" column="signature"/>
    <field name="prevSignature" column="prevSignature"/>

    <!-- score fields                                       -->
    <field name="score" column="score"/>
    <field name="headers" column="headers"/>
    <field name="inlinks" column="inlinks"/>
    <field name="outlinks" column="outlinks"/>
    <field name="metadata" column="metadata"/>
    <field name="markers" column="markers"/>
</class>
--------------------------------------------------------------------------------------------------------------

When I run my crawl script, I get the following error:
--------------------------------------------------------------------------------------------------------------
15/01/06 18:09:46 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.IOException: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
        at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
        at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
        ... 12 more
Caused by: java.sql.SQLException: Index column size too large. The maximum column size is 767 bytes.
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330)
        at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
--------------------------------------------------------------------------------------------------------------
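
For reference, here is a rough sketch in Python of the arithmetic that may be behind this error. It is not a confirmed diagnosis. It assumes the limit in the message is counted in bytes, that utf8mb4 reserves up to 4 bytes per character, and that the `length="767"` on the primary key in the mapping above is taken as a character count when Gora creates the index. The function and constant names are made up for illustration.

```python
# Assumptions (not confirmed by the stack trace): the 767 limit is in bytes,
# and the primarykey length in gora-sql-mapping.xml is in characters.
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}  # worst-case widths
INDEX_PREFIX_LIMIT_BYTES = 767  # value quoted in the error message


def index_bytes(length_chars, charset):
    """Worst-case bytes needed to index a column of length_chars characters."""
    return length_chars * BYTES_PER_CHAR[charset]


def max_indexable_chars(charset, limit=INDEX_PREFIX_LIMIT_BYTES):
    """Largest character length whose worst-case size fits within the limit."""
    return limit // BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    needed = index_bytes(767, "utf8mb4")
    print("bytes needed for id(767) in utf8mb4:", needed)
    print("fits in limit:", needed <= INDEX_PREFIX_LIMIT_BYTES)
    print("max chars under utf8mb4:", max_indexable_chars("utf8mb4"))
```

Under these assumptions a 767-character key would need far more than 767 bytes in utf8mb4, so the key length in the mapping, or the row format of the table that is actually being created, would be the first things to check.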


What do I need to change to avoid the "Index column size too large" error?

Thanks,
Tamer