
Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2014/03/27 10:29:52 UTC

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

Hi Shane,

It really helps both you and the other users of this list if you provide
more detailed questions.
Can you please state which versions of Nutch, the gora-core and gora-sql
artifacts, and MySQL you are using?
It would seem that you've not made much progress to date, so I would
suggest wiping the data in your MySQL WebPage table and
starting again.
I would advise you to use the readdb tool to check the stats of the DB
after EVERY phase of the crawl.
https://wiki.apache.org/nutch/bin/nutch%20readdb
Please see below for more feedback.
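
A minimal sketch of checking the stats after every phase, assuming a
Nutch 2.x install where readdb accepts -stats and reads from the configured
Gora store (the MySQL webpage table here) rather than from a crawldb
directory. The NUTCH variable and the crawl_round function are hypothetical
names introduced only for this illustration; the phase commands are the ones
quoted later in this thread.

```shell
#!/bin/sh
# Sketch only: run one crawl round and print store statistics after each phase.
# NUTCH is a hypothetical override; it defaults to the path used in this thread.
NUTCH="${NUTCH:-./bin/nutch}"

# crawl_round TOPN: generate, fetch, parse, updatedb, each followed by readdb -stats.
crawl_round() {
  topn="$1"
  for phase in "generate -topN $topn" "fetch -all" "parse -all" "updatedb"; do
    echo "== $phase =="
    # $phase is left unquoted on purpose so it splits into separate arguments.
    $NUTCH $phase || return 1
    $NUTCH readdb -stats || return 1
  done
}
```

After a one-off "./bin/nutch inject urls", calling "crawl_round 20" runs one
round. If the status counts do not change between phases, that phase is the
place to start looking.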

On Thu, Mar 27, 2014 at 8:54 AM, <us...@nutch.apache.org> wrote:

>
> mapred.FileOutputCommitter - Output path is null in cleanup
>
> What does this mean?

The above WARN can be ignored. It occurs when we commit a job and
clean up a temporary directory. This is not a problem.

> What would be the command line to index a single domain, say test.com?
>

Exactly the same as it would be to index multiple domains. Your configuration,
however, may need some tweaking. Have you looked over the wiki documentation
on URL filters? You'll have a better idea of where in the crawl things are
going wrong once you've analyzed the crawl progress as I've mentioned
above.
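
As an illustration of the URL filter tweak, here is a rough sketch that
writes regex filter rules restricting a crawl to test.com. It assumes the
urlfilter-regex plugin is enabled and that it reads conf/regex-urlfilter.txt,
where each rule is a + (accept) or - (reject) followed by a regular
expression and the first matching rule wins. The helper name
write_single_domain_filter is hypothetical.

```shell
#!/bin/sh
# Sketch only: write URL filter rules that keep a crawl on test.com.
# write_single_domain_filter is a hypothetical helper; pass the rules file path.
write_single_domain_filter() {
  cat > "$1" <<'EOF'
# Accept test.com and its subdomains over http or https.
+^https?://([a-z0-9-]+\.)*test\.com(/|$)
# Reject every other URL.
-.
EOF
}
```

Calling "write_single_domain_filter conf/regex-urlfilter.txt" overwrites the
existing rules, so back up the original file first. The seed list in urls
would also need to contain only test.com addresses.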

>
> Why does generate give me the same fetch list every time?

Because it would appear that these URLs are considered eligible for
fetching. This is more likely a mistake in your crawler configuration as
opposed to Nutch itself.

> I thought Nutch would only re-index the same page once every 30 days.
> My setup fetches the same pages every time I index, which seems a waste of
> resources.
>
>
As I originally stated, it helps if you describe in more detail whether you
have been able to index at all. Right now it is a mystery what you've
actually achieved.

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

Posted by Shane Wood <sh...@cbm8bit.com>.

How do you use the readdb command with MySQL, when no crawldb is
created? Can you list the command to use?
Or does Nutch still create a crawldb that I can't find? If so, where is it
created? I have a /crawl folder, but nothing appears in there.

I use Nutch 2.2 and MySQL version 5.6.16, as per this tutorial:
http://nlp.solutions.asia/?p=362. I followed it to the letter
and can index sites. The issue I am having is that only the same pages are
indexed every time I re-index; nothing new is added after the second
indexing.

I index with these commands.

./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb

Thanks
Shane.




Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

Posted by Shane Wood <sh...@cbm8bit.com>.

I set up Nutch as per this tutorial: http://nlp.solutions.asia/?p=362.

I wiped the data within MySQL and re-indexed several times, and these
fields remain NULL:
modifiedTime     prevModifiedTime

MySQL version 5.6.16
Nutch version 2.2

./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb

I run these commands in this order each time I index.
I'm very new to Nutch but learning.

Cheers
Shane.

