Posted to user@nutch.apache.org by Žygimantas Medelis <zz...@gmail.com> on 2012/09/18 15:34:17 UTC

Nutch2 + Cassandra

Hi,

I have Nutch 2 configured with a Cassandra backend, as described here:
http://sujitpal.blogspot.com/2012/01/exploring-nutch-gora-with-cassandra.html

It fails to fetch pages after the first iteration. That is, it successfully
goes through the home pages, but then the fetcher gets 0 pages on
subsequent iterations.

Commands I am issuing

bin/nutch inject seed
bin/nutch generate
bin/nutch fetch ID1

At this point the log shows:
...
0/0 spinwaiting/active, 4 pages


bin/nutch parse ID1
bin/nutch updatedb
bin/nutch generate
bin/nutch fetch ID2

QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s,
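
When scripting these steps, one way to notice an empty batch early is to scan the fetcher output for the QueueFeeder summary line shown above. This is only a minimal Python sketch: the function name and the regular expression are illustrative assumptions based on the log lines quoted in this message, not on any Nutch API.

```python
import re

# Pattern modelled on the log line quoted above:
#   "QueueFeeder finished: total 0 records. Hit by time limit :0"
FEEDER_RE = re.compile(r"QueueFeeder finished: total (\d+) records")


def queued_records(log_text):
    """Return the record count reported by QueueFeeder, or None if absent."""
    match = FEEDER_RE.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "QueueFeeder finished: total 0 records. Hit by time limit :0\n"
        "-finishing thread FetcherThread0, activeThreads=0\n"
    )
    if queued_records(sample) == 0:
        print("generate produced an empty batch; nothing to fetch")
```

A count of 0 points at the generate step (or at what updatedb wrote) rather than at the fetcher itself.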

The exact same config works with v1.5.1.

Also, I was getting a NullPointerException on inject before
changing conf/gora-cassandra-mapping.xml
from:  <class keyClass="java.lang.String"
name="org.apache.nutch.storage.WebPage">
to: <class keyClass="java.lang.String"
name="org.apache.nutch.storage.WebPage" keyspace="webpage">
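
A quick way to check a mapping file for the missing attribute described above is to parse it and list every class element without a keyspace. This is a sketch under stated assumptions: the helper name is made up, and the sample XML is a trimmed, hypothetical fragment containing only the element quoted in this message (the gora-orm root element is assumed).

```python
import xml.etree.ElementTree as ET


def classes_missing_keyspace(mapping_xml):
    """Return the names of <class> elements that lack a keyspace attribute."""
    root = ET.fromstring(mapping_xml)
    return [
        cls.get("name")
        for cls in root.iter("class")
        if not cls.get("keyspace")
    ]


if __name__ == "__main__":
    # Hypothetical, trimmed-down mapping; a real file has more elements.
    before = """<gora-orm>
      <class keyClass="java.lang.String"
             name="org.apache.nutch.storage.WebPage"/>
    </gora-orm>"""
    after = before.replace(
        'name="org.apache.nutch.storage.WebPage"',
        'name="org.apache.nutch.storage.WebPage" keyspace="webpage"',
    )
    print(classes_missing_keyspace(before))
    print(classes_missing_keyspace(after))
```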

http.content.limit is set to -1, as was suggested a while back in a
similar thread, but it does not help.

Regards

Re: Nutch2 + Cassandra

Posted by Žygimantas Medelis <zz...@gmail.com>.
When running with gora 0.2.1, the outlinks field was not filled in. I spent
quite a bit of time trying to figure out what's wrong, but without success.




Re: Nutch2 + Cassandra

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Ahem...

No, I was wrong, and I apologise, as there does indeed seem to be a problem
with gora-cassandra v0.2.1. For the time being please roll back
to 0.2 until we can release another gora-cassandra artifact.

Time to get digging.

Ta

Lewis




-- 
Lewis

Re: Nutch2 + Cassandra

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Again,

On Wed, Sep 19, 2012 at 8:39 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:

> On Wed, Sep 19, 2012 at 1:54 PM, Žygimantas Medelis <zz...@gmail.com> wrote:
>> Its the problem with gora v0.2.1 which does not work with current nutch 2.

I've just run a medium-sized focused crawl today with Nutch v2.x head,
gora-core v0.2.1, gora-cassandra v0.2.1 and Cassandra v1.1.2. All
working perfectly on > 10 iterations... it would be great for you to
describe what exactly went wrong or what you think was wrong. Check
whether any outlinks were parsed from your seed URLs, check your
urlfilters, etc. to see why the 2nd iteration did not generate a batch.

>
>> Have also tested with sql store also fails.
>

I also undertook a small focused crawl with Nutch 2.x head, gora-core
v0.2.1, gora-sql v0.1.1-incubating and MySQL (most recent Debian
package from apt-get)... again all went swimmingly.

>
>> Changing dependency to gora v0.2 and rebuilding solves the problem
>

For completeness, I actually managed to get around to this as well...
and yes, gora-core v0.2.1 is compatible with gora-cassandra v0.2;
however, you then don't get any of the goodies we included in the Gora
0.2.1 release.

Would be great to hear about any further problems you have with this.

Lewis


-- 
Lewis

Re: Nutch2 + Cassandra

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Wed, Sep 19, 2012 at 1:54 PM, Žygimantas Medelis <zz...@gmail.com> wrote:
> Its the problem with gora v0.2.1 which does not work with current nutch 2.

Can you elaborate on what you think is wrong here? To give you some
insight: between Gora 0.2 and 0.2.1 a substantial effort was put
into improving the functionality of the gora-cassandra artifact. A
good bit of this was to do with serialization (amongst other
functionality), but unfortunately the user uptake has not been huge,
so I would not be surprised to find things are not quite
utopian as of yet.

> Have also tested with sql store also fails.

And again, please elaborate here. The SQL artifact has not changed since
it was last published (i.e. the 0.1.1-incubating revision) and, although
we know there are some problems, there is a growing body of users who
seem to be relatively competent with its operation.

> Changing dependency to gora v0.2 and rebuilding solves the problem

I don't suppose you can elaborate here either? What do you see now
when you list one of the rows in your column families, e.g. list p;?
Do you mean you get parse text returned instead of a serialization?

Lewis

Re: Nutch2 + Cassandra

Posted by Žygimantas Medelis <zz...@gmail.com>.
It's a problem with gora v0.2.1, which does not work with current nutch 2.
I have also tested with the sql store, which also fails.
Changing the dependency to gora v0.2 and rebuilding solves the problem.




Re: Nutch2 + Cassandra

Posted by Žygimantas Medelis <zz...@gmail.com>.
> Can you read your db and see if there are any pages pending a fetch?

After inject

[default@webpage] list f;
Using default limit of 100
-------------------
RowKey: 6c742e62616c7361732e7777773a687474702f
=> (column=6669, value=00278d00, timestamp=1348032953800000)
=> (column=73, value=3f800000, timestamp=1348032953802000)
=> (column=7473, value=00000139dd066f6b, timestamp=1348032953798000)
-------------------
RowKey: 6c742e6c72797461732e7777773a687474702f
=> (column=6669, value=00278d00, timestamp=1348032953811000)
=> (column=73, value=3f800000, timestamp=1348032953814000)
=> (column=7473, value=00000139dd066f6b, timestamp=1348032953809000)
-------------------
RowKey: 6c742e31356d696e2e7777773a687474702f
=> (column=6669, value=00278d00, timestamp=1348032953787000)
=> (column=73, value=3f800000, timestamp=1348032953789000)
=> (column=7473, value=00000139dd066f6b, timestamp=1348032953785000)
-------------------
RowKey: 6c742e64656c66692e7777773a687474702f
=> (column=6669, value=00278d00, timestamp=1348032953749000)
=> (column=73, value=3f800000, timestamp=1348032953752000)
=> (column=7473, value=00000139dd066f6b, timestamp=1348032953656000)

4 Rows Returned.
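
The CLI prints keys, column names and values as hex, which makes the listing above hard to read. The following Python sketch decodes one row of it. The helper names are made up, and the type of each value (a big-endian int for fi, a float for s, epoch milliseconds for ts) is an assumption about how these columns are serialized, not something confirmed here.

```python
import struct
from datetime import datetime, timezone


def hex_to_text(hex_str):
    """Decode a hex string as UTF-8 text (row keys and column names)."""
    return bytes.fromhex(hex_str).decode("utf-8")


def hex_to_int(hex_str):
    """Decode a hex string as a big-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_str), "big")


def hex_to_float(hex_str):
    """Decode 4 bytes of hex as a big-endian IEEE 754 float."""
    return struct.unpack(">f", bytes.fromhex(hex_str))[0]


def millis_to_utc(millis):
    """Convert epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the first row of the listing above.
    print(hex_to_text("6c742e62616c7361732e7777773a687474702f"))  # row key
    print(hex_to_text("6669"), hex_to_int("00278d00"))            # assumed int
    print(hex_to_text("73"), hex_to_float("3f800000"))            # assumed float
    print(hex_to_text("7473"), millis_to_utc(hex_to_int("00000139dd066f6b")))
```

The row key decodes to a reversed-host form of the seed URL, and the column names decode to the short qualifiers fi, s and ts.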

Then after fetch


Very very long sequence of this .......
d3e3c2f6469763e0a3c2f6469763e3c212d2d2064656c666920636f6e7461696e6572207772617070657220626567696e202d2d3e0a0a0a202020200a3c2f626f64793e0a3c2f68746d6c3e0a0a,
timestamp=1347972430537000)
=> (column=6669, value=00278d00, timestamp=1347972384062000)
=> (column=707473, value=00000139d96a3c7a, timestamp=1347972430534000)
=> (column=73, value=3f800000, timestamp=1347972384065000)
=> (column=7374, value=00000002, timestamp=1347972430531000)
=> (column=7473, value=0000013b0e6877e8, timestamp=1347974904068000)
=> (column=747970, value=6170706c69636174696f6e2f7868746d6c2b786d6c,
timestamp=1347972430640000)
4 Rows Returned.
Elapsed time: 10255 msec(s).

After parse, list p returns a similar very long sequence of byte codes.

After updatedb there are apparently no changes.

Then, starting a new generate, fetch, parse iteration:

list f


....02020200a3c2f626f64793e0a3c2f68746d6c3e0a0a, timestamp=1348033056939000)
=> (column=6669, value=00278d00, timestamp=1348032953749000)
=> (column=707473, value=00000139dd066f6b, timestamp=1348033056934000)
=> (column=73, value=3f800000, timestamp=1348032953752000)
=> (column=7374, value=00000002, timestamp=1348033056931000)
=> (column=7473, value=0000013a7786c6be, timestamp=1348033211184000)
=> (column=747970, value=6170706c69636174696f6e2f7868746d6c2b786d6c,
timestamp=1348033056949000)

4 Rows Returned.
Elapsed time: 11825 msec(s).
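
To see whether the rows above are even due for another fetch, the ts value can be compared with the time at which that column was written. This Python sketch assumes that ts holds the next fetch time in epoch milliseconds, that fi is the fetch interval in seconds, and that the cassandra-cli timestamp is in microseconds. All three are assumptions, as are the helper names.

```python
from datetime import timedelta


def hex_to_int(hex_str):
    """Decode a hex string as a big-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_str), "big")


def due_in(fetch_time_hex, written_at_micros):
    """Gap between the column write and the fetch time stored in it."""
    fetch_time = timedelta(milliseconds=hex_to_int(fetch_time_hex))
    written_at = timedelta(microseconds=written_at_micros)
    return fetch_time - written_at


if __name__ == "__main__":
    # ts column and its write timestamp, copied from the listing above.
    print("due in:", due_in("0000013a7786c6be", 1348033211184000))
    # fi column, read here as a number of seconds.
    print("interval:", timedelta(seconds=hex_to_int("00278d00")))
```

If this reading is right, the already fetched seed pages are not due again for roughly one fetch interval, so a second batch could only contain newly discovered outlinks.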


Also, I have added these jars to the nutch lib directory; maybe the versions are not right?

cassandra-all-1.1.2.jar
cassandra-thrift-1.1.2.jar
gora-core-0.2.1.jar
gora-cassandra-0.2.1.jar
hector-core-1.1-0.jar
thrift-0.2.0.jar (not needed I think, libthrift has everything necessary)
libthrift-0.7.0.jar

cassandra -v
1.0.11
That's a bit strange, since I downloaded v1.1.5 (I also tried the one which
installs via aptitude on Ubuntu).
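
A mismatch like the one above between the client jars and the running server can be spotted by pulling the versions out of the jar file names. This is a rough Python sketch: the regular expression and function name are illustrative assumptions, and the naming pattern (artifact, hyphen, version starting with a digit) is inferred only from the jars listed in this message.

```python
import re

# artifact name, then "-", then a version that starts with a digit.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def jar_versions(filenames):
    """Map artifact name to version for every file name that matches."""
    versions = {}
    for filename in filenames:
        match = JAR_RE.match(filename)
        if match:
            versions[match.group("name")] = match.group("version")
    return versions


if __name__ == "__main__":
    jars = [
        "cassandra-all-1.1.2.jar",
        "cassandra-thrift-1.1.2.jar",
        "gora-core-0.2.1.jar",
        "gora-cassandra-0.2.1.jar",
        "hector-core-1.1-0.jar",
        "thrift-0.2.0.jar",
        "libthrift-0.7.0.jar",
    ]
    server_version = "1.0.11"  # as reported by `cassandra -v` above
    client_version = jar_versions(jars)["cassandra-all"]
    if client_version != server_version:
        print(f"client jar {client_version} vs server {server_version}")
```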



Re: Nutch2 + Cassandra

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Tue, Sep 18, 2012 at 2:34 PM, Žygimantas Medelis <zz...@gmail.com> wrote:

> Commands I am issuing
>

Can you read your db and see if there are any pages pending a fetch?

>
> Also I was getting NullPointerException on inject before
> changing conf/gora-cassandra-mapping.xml
> from:  <class keyClass="java.lang.String"
> name="org.apache.nutch.storage.WebPage">
> to: <class keyClass="java.lang.String"
> name="org.apache.nutch.storage.WebPage" keyspace="webpage">

I've now fixed this in the 2.x branch. Thank you for reporting it.