Posted to user@nutch.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2012/12/09 01:26:08 UTC

Web pages parsed status

Hi all,

I have started playing around with Apache Nutch, but I think this
output is strange because I would actually like to fetch the
content.
Is there any configuration or a simple step I might be missing?


baseUrl:        null
status: 1 (status_unfetched)
fetchInterval:  2592000
fetchTime:      1355009213206
prevFetchTime:  0
retries:        0
modifiedTime:   0
protocolStatus: UNKNOWN_CODE_0, args=[]
parseStatus:    notparsed/ok (0/0), args=[]
title:  null
score:  8.525337E-12
markers:        {dist=2}


Hope someone can help me out (:


Renato M.

Re: Web pages parsed status

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Renato,

OK here we go :0)

On Mon, Dec 10, 2012 at 3:44 PM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> I did notice that these pages weren't fetched, but the thing is that I
> do want them to be fetched without having to fetch and parse
> them individually with the parsechecker tool, i.e. do it automatically.

No probs. This makes perfect sense and is easily achievable (once
GORA-182 is fixed). The patch submitted by Kaz could be tested;
however, you will need to use Nutch 2.x head with gora-0.2.1. My
understanding is that currently Nutch 2.x needs some work done to
accommodate the changes we made when creating the WebServices API for
Gora.

> I have downloaded Apache Nutch 2.1 but I haven't been able to find the
> crawl script [1]. I did run 'ant build' but still no luck. Am I missing
> something, or should I use trunk instead of the downloaded tar.gz?

You posted the link to the crawl script, which IIRC was added post-2.1
release. You would therefore need to manually add this to your Nutch
2.x copy under the bin directory. If you are using Nutch 2.x for R&D,
I would highly recommend using 2.x head. This way, if you find
something wrong, you can post a patch/fix ;0)
>
> Regarding the backend, the issue you are talking about is GORA-182?

Yes

> I read that one of the problems was when injecting URLs into Cassandra,
> right?

Partly; the problem is mainly when we wish to update the web DB after
a crawl iteration. Currently MAP structures are not handled
correctly, therefore the update is not successful and the web DB is
not populated with more links to crawl in the next cycle.

> I executed this command without problems.
>
> ./nutch inject ../../../conf/urls/seed.txt

For starters, if you are using the script in runtime/local, it might be
easier to simply create your urls directory in the same directory.
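
For example, something like this (just a sketch; the seed.txt path is
an assumption, adjust to your layout):

cd runtime/local
mkdir urls
cp /path/to/seed.txt urls/
bin/nutch inject urls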

>
> Is there any other way in which I can reproduce this error, so maybe I
> can try to tackle it :)

This currently has to do with some issues with Hector client
configuration. Kaz mentions this in his last post on GORA-182.

>
> There is one thing I am not understanding: the order in which I have
> to execute Nutch commands. My understanding of the Nutch execution
> process is as follows:
>
> Nutch 1.X
> ------------------
> 1. nutch inject
> 2. nutch generate
> 3. nutch fetch
> 4. nutch parse

then add
5. nutch updatedb
6. repeat from 2 (unless you wish to progress with graph analysis,
ranking and/or scoring of URLs, indexing, etc.)
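
As a rough sketch of one 1.x iteration from runtime/local (the crawl/
directory names here are just assumptions, adjust to taste):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT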

>
> Nutch 2.X
> -------------------
> 1. All this lifecycle is embedded within the crawl script, right? But
> we are supposed to use Solr indexing even if I don't want to.

There is also a crawl script for 1.x, so nothing is different in this
respect. You can either use the crawl script provided, or you can
manually run your own operations via the command line (or
programmatically). You DO NOT need to use Solr to index your URLs;
this depends on whether you want to search over the fields and
content, e.g. as a search engine. If you don't require this, then
simply don't index your content into Solr.
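
For example, the 2.x crawl script takes the Solr URL as just another
argument (usage quoted from memory, so check the header of the script
itself):

bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2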

> Shouldn't we make this optional?

It is optional. The crawl script assumes that you wish to undertake an
iterative/continuous crawl and then index to Solr. If this is not
required, the script is not difficult to alter for your own purposes.
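
Roughly speaking, you would comment out the indexing step inside the
script's loop. The exact line may differ in your copy, but it will
look something like:

# $bin/nutch solrindex $SOLRURL -all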

> Thanks in advance!

No hassle, get in touch if there is something else OK.

Best

Lewis

Re: Web pages parsed status

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Lewis,

I did notice that these pages weren't fetched, but the thing is that I
do want them to be fetched without having to fetch and parse
them individually with the parsechecker tool, i.e. do it automatically.
I have downloaded Apache Nutch 2.1 but I haven't been able to find the
crawl script [1]. I did run 'ant build' but still no luck. Am I missing
something, or should I use trunk instead of the downloaded tar.gz?

Regarding the backend, the issue you are talking about is GORA-182?
I read that one of the problems was when injecting URLs into Cassandra,
right? I executed this command without problems.

./nutch inject ../../../conf/urls/seed.txt

Is there any other way in which I can reproduce this error, so maybe I
can try to tackle it :)

There is one thing I am not understanding: the order in which I have
to execute Nutch commands. My understanding of the Nutch execution
process is as follows:

Nutch 1.X
------------------
1. nutch inject
2. nutch generate
3. nutch fetch
4. nutch parse

Nutch 2.X
-------------------
1. All this lifecycle is embedded within the crawl script, right? But
we are supposed to use Solr indexing even if I don't want to.
Shouldn't we make this optional?
Thanks in advance!


Renato M.

[1] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl

2012/12/8 Lewis John Mcgibbney <le...@gmail.com>:
> Hi Renato,
>
> On Sun, Dec 9, 2012 at 12:47 AM, Renato Marroquín Mogrovejo
> <re...@gmail.com> wrote:
>> ./nutch crawl  ../../../conf/urls/seed.txt -depth 10 -topN 10
>>
>> So I shouldn't be using this command? Which one should I use? The readdb command?
>
> Yeah, we would advise against using this command. You could use the
> crawl script [0] or else get to know the individual commands inside
> out [1]. NOTE: although the link at [1] is for Nutch v1.X, you will
> find that they are *almost* identical.
>
> Another note: if you look at the fetched status for the record you
> provided, it has not actually been fetched. It remains unfetched and
> therefore unparsed.
>
> You can also try parsing manually using the parsechecker tool; you
> will find this in the bin/nutch script.
>
> In gora-cassandra 0.2.1 there is currently a problem with MAP fields
> (headers, outlinks, inlinks, markers and metadata); however, this will
> be fixed in gora-cassandra 0.3. I would suggest that you revert to 0.2,
> or alternatively use one of the other Gora dependencies with a
> different web store.
>
> Please also check out nutch-default.xml [2] for available configuration overrides.
>
> Keep us posted mate.
>
> Best
>
> Lewis
>
> [0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
> [1] http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
> [2] https://svn.apache.org/repos/asf/nutch/branches/2.x/conf/nutch-default.xml

Re: Web pages parsed status

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Renato,

On Sun, Dec 9, 2012 at 12:47 AM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> ./nutch crawl  ../../../conf/urls/seed.txt -depth 10 -topN 10
>
> So I shouldn't be using this command? Which one should I use? The readdb command?

Yeah, we would advise against using this command. You could use the
crawl script [0] or else get to know the individual commands inside
out [1]. NOTE: although the link at [1] is for Nutch v1.X, you will
find that they are *almost* identical.

Another note: if you look at the fetched status for the record you
provided, it has not actually been fetched. It remains unfetched and
therefore unparsed.

You can also try parsing manually using the parsechecker tool; you
will find this in the bin/nutch script.
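
For example (the URL here is just a placeholder):

bin/nutch parsechecker -dumpText http://nutch.apache.org/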

In gora-cassandra 0.2.1 there is currently a problem with MAP fields
(headers, outlinks, inlinks, markers and metadata); however, this will
be fixed in gora-cassandra 0.3. I would suggest that you revert to 0.2,
or alternatively use one of the other Gora dependencies with a
different web store.
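
For reference, the store is selected in conf/nutch-site.xml (with the
matching dependency enabled in ivy/ivy.xml), e.g. to switch to the
HBase store instead of Cassandra:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>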

Please also check out nutch-default.xml [2] for available configuration overrides.
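
Anything in there can be overridden by copying the property into
conf/nutch-site.xml, e.g. http.agent.name, which you need to set for
fetching anyway (the value below is just a placeholder):

<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>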

Keep us posted mate.

Best

Lewis

[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
[1] http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
[2] https://svn.apache.org/repos/asf/nutch/branches/2.x/conf/nutch-default.xml

Re: Web pages parsed status

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hey Lewis!!!

Thanks a lot my friend! Well, I am using Nutch 2.1 with Cassandra as a
backend. The command line I am using is the following:

./nutch crawl  ../../../conf/urls/seed.txt -depth 10 -topN 10

So I shouldn't be using this command? Which one should I use? The readdb command?
Thanks again!


Renato M.

2012/12/8 Lewis John Mcgibbney <le...@gmail.com>:
> Hi Renato,
>
> Firstly, are you on 2.x? If so, what Gora storage backend are you on?
> If not, what version of 1.x are you using?
>
> After fetching have you parsed the pages?
>
> How are you executing your crawl cycle? The one-step command/script, or
> individually via a custom script? We advise against using the
> deprecated Crawl class in both distributions.
>
> Best
>
> Lewis
>
> On Sun, Dec 9, 2012 at 12:26 AM, Renato Marroquín Mogrovejo
> <re...@gmail.com> wrote:
>> Hi all,
>>
>> I have started playing around with Apache Nutch, but I think this
>> output is strange because I would actually like to fetch the
>> content.
>> Is there any configuration or a simple step I might be missing?
>>
>>
>> baseUrl:        null
>> status: 1 (status_unfetched)
>> fetchInterval:  2592000
>> fetchTime:      1355009213206
>> prevFetchTime:  0
>> retries:        0
>> modifiedTime:   0
>> protocolStatus: UNKNOWN_CODE_0, args=[]
>> parseStatus:    notparsed/ok (0/0), args=[]
>> title:  null
>> score:  8.525337E-12
>> markers:        {dist=2}
>>
>>
>> Hope someone can help me out (:
>>
>>
>> Renato M.
>
>
>
> --
> Lewis

Re: Web pages parsed status

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Renato,

Firstly, are you on 2.x? If so, what Gora storage backend are you on?
If not, what version of 1.x are you using?

After fetching have you parsed the pages?

How are you executing your crawl cycle? The one-step command/script, or
individually via a custom script? We advise against using the
deprecated Crawl class in both distributions.

Best

Lewis

On Sun, Dec 9, 2012 at 12:26 AM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> Hi all,
>
> I have started playing around with Apache Nutch, but I think this
> output is strange because I would actually like to fetch the
> content.
> Is there any configuration or a simple step I might be missing?
>
>
> baseUrl:        null
> status: 1 (status_unfetched)
> fetchInterval:  2592000
> fetchTime:      1355009213206
> prevFetchTime:  0
> retries:        0
> modifiedTime:   0
> protocolStatus: UNKNOWN_CODE_0, args=[]
> parseStatus:    notparsed/ok (0/0), args=[]
> title:  null
> score:  8.525337E-12
> markers:        {dist=2}
>
>
> Hope someone can help me out (:
>
>
> Renato M.



-- 
Lewis