Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/23 16:05:16 UTC
Empty LinkDB after invertlinks
Hey Ho,
For some reason the invertlinks command produces an empty linkdb.
I did:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
LinkDb: starting at 2011-08-23 14:47:21
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/segments/20110817164804
LinkDb: adding segment: crawl/segments/20110817164912
LinkDb: adding segment: crawl/segments/20110817165053
LinkDb: adding segment: crawl/segments/20110817165524
LinkDb: adding segment: crawl/segments/20110817170729
LinkDb: adding segment: crawl/segments/20110817171757
LinkDb: adding segment: crawl/segments/20110817172919
LinkDb: adding segment: crawl/segments/20110819135218
LinkDb: adding segment: crawl/segments/20110819165658
LinkDb: adding segment: crawl/segments/20110819170807
LinkDb: adding segment: crawl/segments/20110819171841
LinkDb: adding segment: crawl/segments/20110819173350
LinkDb: adding segment: crawl/segments/20110822135934
LinkDb: adding segment: crawl/segments/20110822141229
LinkDb: adding segment: crawl/segments/20110822143419
LinkDb: adding segment: crawl/segments/20110822143824
LinkDb: adding segment: crawl/segments/20110822144031
LinkDb: adding segment: crawl/segments/20110822144232
LinkDb: adding segment: crawl/segments/20110822144435
LinkDb: adding segment: crawl/segments/20110822144617
LinkDb: adding segment: crawl/segments/20110822144750
LinkDb: adding segment: crawl/segments/20110822144927
LinkDb: adding segment: crawl/segments/20110822145249
LinkDb: adding segment: crawl/segments/20110822150757
LinkDb: adding segment: crawl/segments/20110822152354
LinkDb: adding segment: crawl/segments/20110822152503
LinkDb: adding segment: crawl/segments/20110822153900
LinkDb: adding segment: crawl/segments/20110822155321
LinkDb: adding segment: crawl/segments/20110822155732
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
After that:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 14:48:26
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
And then:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
As you can see, the dump size is 0 bytes.
Unfortunately I have no idea what went wrong.
I have attached the hadoop.log for the invertlinks process. Perhaps that
helps somebody?
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,
you are right.
I think the problem is a bit more general. There are some tools which
aren't very verbose about which configuration they use (and some tools
don't tell much at all ;-) ).
I think many of the problems discussed on the list were related to wrong
configuration files or to a default option the user hadn't noticed.
So it would be great if, when we ran a command, it told us which config
file it uses and which values it has detected (or which defaults it
falls back to).
Unfortunately I don't know much about the configuration architecture
that nutch uses.
It seems to me that the options for most of the tools can be defined
in nutch-site.xml, but I am not aware of how and WHERE this file is
interpreted by the tools.
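For what it's worth, Nutch reads its configuration through Hadoop's Configuration mechanism: conf/nutch-default.xml (which lists every option with its default) is loaded first, and conf/nutch-site.xml is overlaid on top, so a property set in nutch-site.xml wins. A rough sketch of that overlay logic in Python (illustrative only, not Nutch's actual code; the property shown is the real db.ignore.internal.links option):

```python
import xml.etree.ElementTree as ET

def parse_props(xml_text):
    """Parse a Hadoop-style configuration file into a dict of name -> value."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props

# nutch-default.xml ships with every option and its default value.
default_xml = """<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>
</configuration>"""

# nutch-site.xml holds the site-specific overrides.
site_xml = """<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>"""

def effective_config(default_text, site_text):
    config = parse_props(default_text)
    config.update(parse_props(site_text))  # site values override defaults
    return config

print(effective_config(default_xml, site_xml))
```

Here the site override wins, so the effective value of db.ignore.internal.links is "false".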
I think I'll take a look at the sources and see if I can manage to
teach the individual tools to be more verbose. :)
By the way, that touches on a question that has been on my mind for a
while: how can I know which options are available for a tool to
configure in nutch-site.xml?
For now I'll acquaint myself with JIRA, since I have never worked with it.
regards,
Marek
On 23.08.2011 23:56, lewis.mcgibbney@gmail.com wrote:
> Hi Marek,
>
> You make a reasonable point. If you feel that this is something that
> should be integrated then maybe consider filing a JIRA with a
> comprehensive description of the problem and a proposed solution. If you
> do not actually patch this yourself then maybe someone else can provide
> a patch in the future should they experience the same as it would be a
> nice indication of the situation identified by Sergey.
>
>
> On Aug 23, 2011 10:48pm, Marek Bachmann <m....@uni-kassel.de> wrote:
>> Oh yes, thank you very much Sergey, that was the problem.
>>
>> Would have been nice if the invertlinks command had told me that it had
>> ignored them :-)
>>
>> Cheers,
>> Marek
>>
>> [...]
Re: Re: Empty LinkDB after invertlinks
Posted by le...@gmail.com.
Hi Marek,
You make a reasonable point. If you feel this is something that should be
integrated, then maybe consider filing a JIRA issue with a comprehensive
description of the problem and a proposed solution. Even if you do not
patch this yourself, someone else who runs into the same situation Sergey
identified could provide a patch in the future.
On Aug 23, 2011 10:48pm, Marek Bachmann <m....@uni-kassel.de> wrote:
> Oh yes, thank you very much Sergey, that was the problem.
> Would have been nice if the invertlinks command had told me that it had
> ignored them :-)
> Cheers,
> Marek
> On 23.08.2011 19:26, Sergey A Volkov wrote:
> > Hi
> >
> > Is it possible that you fetched documents from just one site/domain?
> >
> > It looks like by default nutch ignores internal site links
> > (db.ignore.internal.links)
> >
> > Sergey Volkov
> >
> > On 08/23/2011 07:04 PM, Marek Bachmann wrote:
> >> Hi Lewis,
> >>
> >> thank you for your suggestion.
> >> Unfortunately this isn't the problem. I have also tried merging all
> >> segments together and feeding the one big segment to the invertlinks
> >> command. Same (non-)effect. :-(
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch mergesegs crawl/one-seg -dir crawl/segments/
> >> Merging 29 segments to crawl/one-seg/20110823165144
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
> >> SegmentMerger: using segment data from: content crawl_generate
> >> crawl_fetch crawl_parse parse_data parse_text
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm
> >> -rf crawl/linkdb/
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/
> >> -noNormalize -noFilter
> >> LinkDb: starting at 2011-08-23 17:01:44
> >> LinkDb: linkdb: crawl/linkdb
> >> LinkDb: URL normalize: false
> >> LinkDb: URL filter: false
> >> LinkDb: adding segment: crawl/one-seg/20110823165144
> >> LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> >> LinkDb dump: starting at 2011-08-23 17:03:12
> >> LinkDb dump: db: crawl/linkdb/
> >> LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
> >> linkdump/
> >>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> >> ll
> >> total 0
> >> -rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
> >>
> >>
> >> On 23.08.2011 16:44, lewis john mcgibbney wrote:
> >>> Hi
> >>>
> >>> Small suggestion, but I do not see any -dir argument passed
> >>> alongside your initial invertlinks command. I understand that you
> >>> have multiple segment directories, fetched over the last few days,
> >>> and the output would suggest the process executed properly. However,
> >>> I have never used the command without the -dir option (it has always
> >>> worked for me), so I can only suggest that this may be the problem.
> >>>
> >>>
> >>>
> >>> On Tue, Aug 23, 2011 at 3:29 PM, Marek
> >>> Bachmann <m.bachmann@uni-kassel.de> wrote:
> >>>
> >>>> Hi Markus,
> >>>>
> >>>> thank you for the quick reply. I already searched for this
> >>>> Configuration
> >>>> error and found:
> >>>>
> >>>>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
> >>>>
> >>>>
> >>>> Where they say that "This exception is innocuous - it helps to
> >>>> debug at
> >>>> which points in the code the Configuration instances are being
> >>>> created.
> >>>> (...)"
> >>>>
> >>>> I indeed do not have much disk space on the machine, but it should
> >>>> be enough at the moment:
> >>>>
> >>>>
> >>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
> >>>> Filesystem Size Used Avail Use% Mounted on
> >>>> /dev/vda1 20G 5.9G 15G 30% /home
> >>>>
> >>>> As I am root, and all directories under
> >>>> /home/nutchServer/relaunch_nutch/runtime/local/bin
> >>>> are set to root:root and 755, permissions shouldn't be the problem.
> >>>>
> >>>> Any further suggestions? :-/
> >>>>
> >>>> Thank you once again
> >>>>
> >>>>
> >>>>
> >>>> On 23.08.2011 16:10, Markus Jelsma wrote:
> >>>>
> >>>> There are some peculiarities in your log:
> >>>>>
> >>>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException:
> >>>>> config()
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
> >>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
> >>>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
> >>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
> >>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
> >>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
> >>>>>
> >>>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job:
> >>>>> job_local_0002
> >>>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException:
> >>>>> config(config)
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
> >>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
> >>>>> at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
> >>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
> >>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
> >>>>>
> >>>>>
> >>>>> Can you check permissions, disk space etc?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
> >>>>>
> >>>>>> Hey Ho,
> >>>>>>
> >>>>>> For some reason the invertlinks command produces an empty linkdb.
> >>>>>>
> >>>>>> I did:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize
> >>>>>> -noFilter
> >>>>>> LinkDb: starting at 2011-08-23 14:47:21
> >>>>>> LinkDb: linkdb: crawl/linkdb
> >>>>>> LinkDb: URL normalize: false
> >>>>>> LinkDb: URL filter: false
> >>>>>> LinkDb: adding segment: crawl/segments/20110817164804
> >>>>>> LinkDb: adding segment: crawl/segments/20110817164912
> >>>>>> LinkDb: adding segment: crawl/segments/20110817165053
> >>>>>> LinkDb: adding segment: crawl/segments/20110817165524
> >>>>>> LinkDb: adding segment: crawl/segments/20110817170729
> >>>>>> LinkDb: adding segment: crawl/segments/20110817171757
> >>>>>> LinkDb: adding segment: crawl/segments/20110817172919
> >>>>>> LinkDb: adding segment: crawl/segments/20110819135218
> >>>>>> LinkDb: adding segment: crawl/segments/20110819165658
> >>>>>> LinkDb: adding segment: crawl/segments/20110819170807
> >>>>>> LinkDb: adding segment: crawl/segments/20110819171841
> >>>>>> LinkDb: adding segment: crawl/segments/20110819173350
> >>>>>> LinkDb: adding segment: crawl/segments/20110822135934
> >>>>>> LinkDb: adding segment: crawl/segments/20110822141229
> >>>>>> LinkDb: adding segment: crawl/segments/20110822143419
> >>>>>> LinkDb: adding segment: crawl/segments/20110822143824
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144031
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144232
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144435
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144617
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144750
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144927
> >>>>>> LinkDb: adding segment: crawl/segments/20110822145249
> >>>>>> LinkDb: adding segment: crawl/segments/20110822150757
> >>>>>> LinkDb: adding segment: crawl/segments/20110822152354
> >>>>>> LinkDb: adding segment: crawl/segments/20110822152503
> >>>>>> LinkDb: adding segment: crawl/segments/20110822153900
> >>>>>> LinkDb: adding segment: crawl/segments/20110822155321
> >>>>>> LinkDb: adding segment: crawl/segments/20110822155732
> >>>>>> LinkDb: merging with existing linkdb: crawl/linkdb
> >>>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
> >>>>>>
> >>>>>> After that:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> >>>>>> LinkDb dump: starting at 2011-08-23 14:48:26
> >>>>>> LinkDb dump: db: crawl/linkdb/
> >>>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
> >>>>>>
> >>>>>> And then:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> cd
> >>>>>> linkdump/
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> >>>>>> ll
> >>>>>> total 0
> >>>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
> >>>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
> >>>>>> runtime/local/bin/linkdump#
> >>>>>>
> >>>>>> As you see, the dump size is 0 byte.
> >>>>>>
> >>>>>> Unfortunately I have no idea what went wrong.
> >>>>>>
> >>>>>> I have attached the hadoop.log for the inverlinks process.
> >>>>>> Perhaps that
> >>>>>> helps anybody?
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Oh yes, thank you very much Sergey, that was the problem.
It would have been nice if the invertlinks command had told me that it
ignored them :-)
Cheers,
Marek
Am 23.08.2011 19:26, schrieb Sergey A Volkov:
> Hi
>
> Is it possible that you fetch documents from just one site/domain?
>
> Looks like by default Nutch ignores internal site links
> (db.ignore.internal.links)
>
> Sergey Volkov
>
Re: Empty LinkDB after invertlinks
Posted by Sergey A Volkov <se...@gmail.com>.
Hi
Is it possible that you fetch documents from just one site/domain?
Looks like by default Nutch ignores internal site links
(db.ignore.internal.links)
Sergey Volkov
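[For readers landing here from a search: the property Sergey refers to can be overridden in conf/nutch-site.xml. In Nutch 1.x it defaults to true in nutch-default.xml, so a crawl confined to a single host records no inlinks at all. A minimal sketch of the override, placed inside the <configuration> element:

```xml
<!-- Keep links between pages of the same host when building the LinkDb.
     The default (true) drops them, which leaves the LinkDb empty for
     single-site crawls. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

After changing it, remove crawl/linkdb and re-run invertlinks so the LinkDb is rebuilt with internal links included.]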
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,
thank you for your suggestion.
Unfortunately this isn't the problem. I have also tried merging all
segments together and passing the one big segment to the invertlinks
command. Same (lack of) effect. :-(
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch mergesegs crawl/one-seg -dir crawl/segments/
Merging 29 segments to crawl/one-seg/20110823165144
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
SegmentMerger: using segment data from: content crawl_generate
crawl_fetch crawl_parse parse_data parse_text
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm
-rf crawl/linkdb/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/
-noNormalize -noFilter
LinkDb: starting at 2011-08-23 17:01:44
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/one-seg/20110823165144
LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 17:03:12
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
Am 23.08.2011 16:44, schrieb lewis john mcgibbney:
> Hi
>
> Small suggestion, but I do not see a -dir argument passed alongside your
> initial invertlinks command. I understand that you have multiple segment
> directories, fetched over the last few days, and the output suggests the
> process executed properly. However, I have never used the command without
> the -dir option (it has always worked for me), so I can only suggest that
> this may be the problem.
>
>
>
> On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <m....@uni-kassel.de> wrote:
>
>> Hi Markus,
>>
>> thank you for the quick reply. I already searched for this Configuration
>> error and found:
>>
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
>>
>> Where they say that "This exception is innocuous - it helps to debug at
>> which points in the code the Configuration instances are being created.
>> (...)"
>>
>> I admittedly do not have much disk space on the machine, but it should
>> be enough at the moment:
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 20G 5.9G 15G 30% /home
>>
>> As I am root and all directories under
>> /home/nutchServer/relaunch_nutch/runtime/local/bin are set to root:root
>> with 755 permissions, that shouldn't be the problem.
>>
>> Any further suggestions? :-/
>>
>> Thank you once again
>>
>>
>>
>> Am 23.08.2011 16:10, schrieb Markus Jelsma:
>>
>> There are some peculiarities in your log:
>>>
>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>>     at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>>     at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>>
>>>
>>> Can you check permissions, disk space etc?
>>>
>>>
>>>
>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>>
>>>> Hey Ho,
>>>>
>>>> for some reasons the inverlinks command produces an empty linkdb.
>>>>
>>>> I did:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
>>>> LinkDb: starting at 2011-08-23 14:47:21
>>>> LinkDb: linkdb: crawl/linkdb
>>>> LinkDb: URL normalize: false
>>>> LinkDb: URL filter: false
>>>> LinkDb: adding segment: crawl/segments/20110817164804
>>>> LinkDb: adding segment: crawl/segments/20110817164912
>>>> LinkDb: adding segment: crawl/segments/20110817165053
>>>> LinkDb: adding segment: crawl/segments/20110817165524
>>>> LinkDb: adding segment: crawl/segments/20110817170729
>>>> LinkDb: adding segment: crawl/segments/20110817171757
>>>> LinkDb: adding segment: crawl/segments/20110817172919
>>>> LinkDb: adding segment: crawl/segments/20110819135218
>>>> LinkDb: adding segment: crawl/segments/20110819165658
>>>> LinkDb: adding segment: crawl/segments/20110819170807
>>>> LinkDb: adding segment: crawl/segments/20110819171841
>>>> LinkDb: adding segment: crawl/segments/20110819173350
>>>> LinkDb: adding segment: crawl/segments/20110822135934
>>>> LinkDb: adding segment: crawl/segments/20110822141229
>>>> LinkDb: adding segment: crawl/segments/20110822143419
>>>> LinkDb: adding segment: crawl/segments/20110822143824
>>>> LinkDb: adding segment: crawl/segments/20110822144031
>>>> LinkDb: adding segment: crawl/segments/20110822144232
>>>> LinkDb: adding segment: crawl/segments/20110822144435
>>>> LinkDb: adding segment: crawl/segments/20110822144617
>>>> LinkDb: adding segment: crawl/segments/20110822144750
>>>> LinkDb: adding segment: crawl/segments/20110822144927
>>>> LinkDb: adding segment: crawl/segments/20110822145249
>>>> LinkDb: adding segment: crawl/segments/20110822150757
>>>> LinkDb: adding segment: crawl/segments/20110822152354
>>>> LinkDb: adding segment: crawl/segments/20110822152503
>>>> LinkDb: adding segment: crawl/segments/20110822153900
>>>> LinkDb: adding segment: crawl/segments/20110822155321
>>>> LinkDb: adding segment: crawl/segments/20110822155732
>>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>>
>>>> After that:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>>> LinkDb dump: db: crawl/linkdb/
>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>>
>>>> And then:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> cd
>>>> linkdump/
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
>>>> runtime/local/bin/linkdump#
>>>> ll
>>>> total 0
>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
>>>> runtime/local/bin/linkdump#
>>>>
>>>> As you see, the dump size is 0 byte.
>>>>
>>>> Unfortunately I have no idea what went wrong.
>>>>
>>>> I have attached the hadoop.log for the inverlinks process. Perhaps that
>>>> helps anybody?
>>>>
>>>
>>>
>>
>
>
Re: Empty LinkDB after invertlinks
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi
Small suggestion, but I do not see any -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories fetched over the last few days, and the output does suggest the
process executed properly. However, I have never used the command without the
-dir option (it has always worked for me that way), so I can only suggest that
this may be the problem.
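For reference, the -dir form would look roughly like this, using the paths from this thread. The tiny wrapper function is purely illustrative (not part of Nutch); it just builds the command line so you can see the difference from the glob form:

```shell
# Build the invertlinks command line using -dir, which points at the
# directory *containing* the segments instead of globbing each one.
# $1 = linkdb path, $2 = parent directory of the segments
invertlinks_dir_cmd() {
  echo "./nutch invertlinks $1 -dir $2 -noNormalize -noFilter"
}

# Run from runtime/local/bin; this prints the command to execute:
invertlinks_dir_cmd crawl/linkdb crawl/segments
# prints: ./nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
```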
On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <m....@uni-kassel.de> wrote:

> [quoted message trimmed; it appears in full below]
--
*Lewis*
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Markus,
thank you for the quick reply. I already searched for this Configuration
error and found:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
Where they say that "This exception is innocuous - it helps to debug at
which points in the code the Configuration instances are being created.
(...)"
I admittedly do not have much disk space on the machine, but it should be
enough for the moment:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 5.9G 15G 30% /home
As I am root, and all directories under
/home/nutchServer/relaunch_nutch/runtime/local/bin are owned by root:root
with 755 permissions, permissions shouldn't be the problem.
Any further suggestions? :-/
Thank you once again
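One more thing worth ruling out (not raised above, so treat the details as assumptions about a standard local Nutch 1.x layout): invertlinks reads outlinks from each segment's parse_data directory, so segments that were fetched but never parsed contribute nothing to the linkdb. A quick check, with the helper function being purely illustrative:

```shell
# Report segments under $1 that lack the parse_data subdirectory
# that invertlinks reads outlinks from.
check_segments() {
  for seg in "$1"/*; do
    if [ ! -d "$seg/parse_data" ]; then
      echo "missing parse_data: $seg"
    fi
  done
}

check_segments crawl/segments
```

If every segment does have parse_data, another common cause of an empty linkdb is the db.ignore.internal.links property, which defaults to true and drops links between pages on the same host, so a crawl confined to a single host inverts to nothing.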
On 23.08.2011 16:10, Markus Jelsma wrote:

> [quoted message trimmed; it appears in full below]
Re: Empty LinkDB after invertlinks
Posted by Markus Jelsma <ma...@openindex.io>.
There are some peculiarities in your log:
2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
    at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
Can you check permissions, disk space etc?
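The checks I have in mind can be sketched roughly like this (paths are the thread's local crawl; the helper function is just illustrative):

```shell
# Rough permission and disk-space checks for a local crawl directory.
# Prints "ok" or "problem" for each path passed in.
check_writable() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "ok: $1"
  else
    echo "problem: $1"
  fi
}

check_writable crawl
check_writable crawl/linkdb
df -h .   # free space on the filesystem holding the crawl
```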
On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
> Hey Ho,
>
> for some reason the invertlinks command produces an empty linkdb.
>
> I did:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
> LinkDb: starting at 2011-08-23 14:47:21
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: false
> LinkDb: URL filter: false
> LinkDb: adding segment: crawl/segments/20110817164804
> LinkDb: adding segment: crawl/segments/20110817164912
> LinkDb: adding segment: crawl/segments/20110817165053
> LinkDb: adding segment: crawl/segments/20110817165524
> LinkDb: adding segment: crawl/segments/20110817170729
> LinkDb: adding segment: crawl/segments/20110817171757
> LinkDb: adding segment: crawl/segments/20110817172919
> LinkDb: adding segment: crawl/segments/20110819135218
> LinkDb: adding segment: crawl/segments/20110819165658
> LinkDb: adding segment: crawl/segments/20110819170807
> LinkDb: adding segment: crawl/segments/20110819171841
> LinkDb: adding segment: crawl/segments/20110819173350
> LinkDb: adding segment: crawl/segments/20110822135934
> LinkDb: adding segment: crawl/segments/20110822141229
> LinkDb: adding segment: crawl/segments/20110822143419
> LinkDb: adding segment: crawl/segments/20110822143824
> LinkDb: adding segment: crawl/segments/20110822144031
> LinkDb: adding segment: crawl/segments/20110822144232
> LinkDb: adding segment: crawl/segments/20110822144435
> LinkDb: adding segment: crawl/segments/20110822144617
> LinkDb: adding segment: crawl/segments/20110822144750
> LinkDb: adding segment: crawl/segments/20110822144927
> LinkDb: adding segment: crawl/segments/20110822145249
> LinkDb: adding segment: crawl/segments/20110822150757
> LinkDb: adding segment: crawl/segments/20110822152354
> LinkDb: adding segment: crawl/segments/20110822152503
> LinkDb: adding segment: crawl/segments/20110822153900
> LinkDb: adding segment: crawl/segments/20110822155321
> LinkDb: adding segment: crawl/segments/20110822155732
> LinkDb: merging with existing linkdb: crawl/linkdb
> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>
> After that:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> LinkDb dump: starting at 2011-08-23 14:48:26
> LinkDb dump: db: crawl/linkdb/
> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>
> And then:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
> linkdump/
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> ll
> total 0
> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>
> As you can see, the dump size is 0 bytes.
>
> Unfortunately I have no idea what went wrong.
>
> I have attached the hadoop.log for the invertlinks process. Perhaps that
> helps anybody?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350