Posted to user@nutch.apache.org by al...@aim.com on 2012/08/13 02:24:39 UTC

updatedb error in nutch-2.0


Hello,


I get the following error when I run bin/nutch updatedb in nutch-2.0 with hbase:

java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
        at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
        at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

I see this is because of reversing and unreversing urls. What is the idea behind this reversal and unreversal in nutch-2.0?

Thanks.
Alex.

 

Re: updatedb error in nutch-2.0

Posted by Ferdy Galema <fe...@kalooga.com>.
FYI, I have attached a patch to NUTCH-1448.


Re: updatedb error in nutch-2.0

Posted by al...@aim.com.
I found out that the key sent to unreverseUrl in DbUpdateMapper.map was ":index.php/http".

This happened at depth 3. I checked the seed file, and there was no line of the form http:/index.php

Thanks.
Alex.
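
For anyone who hits the same key, here is a minimal sketch of an unreverse step that rejects a malformed key instead of indexing into the split result blindly. It is not the Nutch TableUtil code. The class name, the method name tryUnreverse, and the key layout "reversed.host:protocol[:port]/path" (taken from the description elsewhere in this thread) are all assumptions made for illustration.

```java
import java.util.Optional;

// Hypothetical helper, not part of Nutch.
class UnreverseSketch {

    /** Returns the normal URL, or empty if the key is not "reversed.host:protocol[:port][/path]". */
    public static Optional<String> tryUnreverse(String key) {
        int pathBegin = key.indexOf('/');
        if (pathBegin == -1) {
            pathBegin = key.length();
        }
        // parts[0] = reversed host, parts[1] = protocol, parts[2] = optional port
        String[] parts = key.substring(0, pathBegin).split(":", -1);
        if (parts.length < 2 || parts.length > 3
                || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty(); // malformed key: report it instead of throwing
        }
        String[] labels = parts[0].split("\\.");
        StringBuilder url = new StringBuilder(parts[1]).append("://");
        for (int i = labels.length - 1; i >= 0; i--) {
            url.append(labels[i]);
            if (i > 0) {
                url.append('.');
            }
        }
        if (parts.length == 3) {
            url.append(':').append(parts[2]);
        }
        url.append(key.substring(pathBegin)); // path and query are stored unchanged
        return Optional.of(url.toString());
    }

    public static void main(String[] args) {
        // prints Optional[http://bar.foo.com:8983/to/index.html?a=b]
        System.out.println(tryUnreverse("com.foo.bar:http:8983/to/index.html?a=b"));
        // prints Optional.empty (the host part before the first ":" is empty)
        System.out.println(tryUnreverse(":index.php/http"));
    }
}
```

With a check like this, a mapper could log and skip the bad row rather than fail the whole job. Whether skipping is the right policy is a separate question.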




 

Re: updatedb error in nutch-2.0

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

In Alex's specific case, it means that a row name in the database is
malformed. Judging by the stack trace lines in TableUtil, it looks like a
URL is stored without a protocol (at least without a ":"). This is probably
because redirected URLs are not being checked correctly for well-formedness.
If you look at line 664 in the FetcherReducer (HEAD), it writes out a new
URL directly as a row in the database. I have never experienced this
exception, and this might be because I changed some behaviour that makes
sure a redirected URL is handled a bit more like a general outlink. I have
created an issue for this that I will update shortly:
https://issues.apache.org/jira/browse/NUTCH-1448

Ferdy.
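
To illustrate the kind of well-formedness check meant above, here is a minimal sketch that tests a redirect target before it is used as a row key. It is not the patch attached to NUTCH-1448 and not FetcherReducer code. The class name and the method name isUsableAsRowKey are hypothetical, and the rule "must have a scheme and a host" is an assumption about what well-formed should mean here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical check, not part of Nutch.
class RedirectCheckSketch {

    /** True only if the URL parses and has both a scheme and a non-empty host. */
    public static boolean isUsableAsRowKey(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            return uri.getScheme() != null && host != null && !host.isEmpty();
        } catch (URISyntaxException e) {
            return false; // unparseable targets are never written as rows
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsableAsRowKey("http://www.foo.com/index.php")); // true
        System.out.println(isUsableAsRowKey("http:/index.php"));              // false: no host
        System.out.println(isUsableAsRowKey("index.php"));                    // false: no scheme
    }
}
```

A relative redirect target such as "index.php" would normally be resolved against the page it came from before a check like this is applied.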


RE: updatedb error in nutch-2.0

Posted by j....@thomsonreuters.com.
The url is stored in a different order (reversed domain
name:protocol:port and path) from the order normally seen in your web
browser so that it can be searched more quickly in NoSQL data stores
like hbase. Nutch has a brief explanation and convenience utility
methods around this at TableUtil
(http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.html)
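
As a rough illustration of the ordering described above, here is a minimal sketch that builds such a key from a normal URL. It is not the TableUtil implementation. The class and method names are hypothetical, and the exact layout "reversed.host:protocol[:port]/path" is an assumption based only on the description in this message. Since HBase keeps rows sorted by key, keys built this way put pages from the same domain next to each other.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch.
class ReversedKeySketch {

    /** Builds "reversed.host:protocol[:port]/path[?query]" from a normal URL. */
    public static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (uri.getScheme() == null || host == null) {
            throw new IllegalArgumentException("URL needs a scheme and a host: " + url);
        }
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        // write the host labels from last to first: bar.foo.com -> com.foo.bar
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints com.foo.bar:http:8983/to/index.html?a=b
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
        // prints com.foo.www:http/index.php
        System.out.println(reverseUrl("http://www.foo.com/index.php"));
    }
}
```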

