Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/23 16:05:16 UTC
Empty LinkDB after invertlinks
Hey Ho,
For some reason the invertlinks command produces an empty linkdb.
I did:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
LinkDb: starting at 2011-08-23 14:47:21
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/segments/20110817164804
LinkDb: adding segment: crawl/segments/20110817164912
LinkDb: adding segment: crawl/segments/20110817165053
LinkDb: adding segment: crawl/segments/20110817165524
LinkDb: adding segment: crawl/segments/20110817170729
LinkDb: adding segment: crawl/segments/20110817171757
LinkDb: adding segment: crawl/segments/20110817172919
LinkDb: adding segment: crawl/segments/20110819135218
LinkDb: adding segment: crawl/segments/20110819165658
LinkDb: adding segment: crawl/segments/20110819170807
LinkDb: adding segment: crawl/segments/20110819171841
LinkDb: adding segment: crawl/segments/20110819173350
LinkDb: adding segment: crawl/segments/20110822135934
LinkDb: adding segment: crawl/segments/20110822141229
LinkDb: adding segment: crawl/segments/20110822143419
LinkDb: adding segment: crawl/segments/20110822143824
LinkDb: adding segment: crawl/segments/20110822144031
LinkDb: adding segment: crawl/segments/20110822144232
LinkDb: adding segment: crawl/segments/20110822144435
LinkDb: adding segment: crawl/segments/20110822144617
LinkDb: adding segment: crawl/segments/20110822144750
LinkDb: adding segment: crawl/segments/20110822144927
LinkDb: adding segment: crawl/segments/20110822145249
LinkDb: adding segment: crawl/segments/20110822150757
LinkDb: adding segment: crawl/segments/20110822152354
LinkDb: adding segment: crawl/segments/20110822152503
LinkDb: adding segment: crawl/segments/20110822153900
LinkDb: adding segment: crawl/segments/20110822155321
LinkDb: adding segment: crawl/segments/20110822155732
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
After that:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 14:48:26
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
And then:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
As you can see, the dump size is 0 bytes.
Unfortunately I have no idea what went wrong.
I have attached the hadoop.log for the invertlinks process. Perhaps that
helps somebody?
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,
you are right.
I think the problem is a bit more general. There are some tools which
aren't very verbose about which configuration they use (and some tools
don't tell much at all ;-) ).
I think many of the problems discussed on the list were related to wrong
configuration files or to a default option the user hadn't noticed.
So it would be great if, when we ran a command, it told us which config
file it uses and which values it has detected (or which defaults it
falls back to).
Unfortunately I don't know much about the configuration architecture
that nutch uses.
It seems to me that the options for most of the tools can be defined
in nutch-site.xml, but I am not aware of how and WHERE this file is
interpreted by the tools.
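For what it's worth, Nutch reads its configuration through Hadoop's Configuration mechanism: conf/nutch-default.xml (which lists every option with its default) is loaded first, and conf/nutch-site.xml is overlaid on top, so a property set in nutch-site.xml wins. A rough sketch of that overlay logic in Python (illustrative only, not Nutch's actual code; the property shown is the real db.ignore.internal.links option):

```python
import xml.etree.ElementTree as ET

def parse_props(xml_text):
    """Parse a Hadoop-style configuration file into a dict of name -> value."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props

# nutch-default.xml ships with every option and its default value.
default_xml = """<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>
</configuration>"""

# nutch-site.xml holds the site-specific overrides.
site_xml = """<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>"""

def effective_config(default_text, site_text):
    config = parse_props(default_text)
    config.update(parse_props(site_text))  # site values override defaults
    return config

print(effective_config(default_xml, site_xml))
```

Here the site override wins, so the effective value of db.ignore.internal.links is "false".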
I think I'll take a look at the sources and see if I can manage to
teach the individual tools to be more verbose. :)
By the way, that touches on a question that has been on my mind for a
while: how can I know which options are available for a tool to
configure in nutch-site.xml?
For now I'll acquaint myself with JIRA, since I have never worked with it.
regards,
Marek
On 23.08.2011 23:56, lewis.mcgibbney@gmail.com wrote:
> Hi Marek,
>
> You make a reasonable point. If you feel that this is something that
> should be integrated then maybe consider filing a JIRA with a
> comprehensive description of the problem and a proposed solution. If you
> do not actually patch this yourself then maybe someone else can provide
> a patch in the future should they experience the same as it would be a
> nice indication of the situation identified by Sergey.
>
>
> On Aug 23, 2011 10:48pm, Marek Bachmann <m....@uni-kassel.de> wrote:
>> Oh yes, thank you very much Sergey, that was the problem.
>>
>> Would have been nice if the invertlinks command had told me that it had
>> ignored them :-)
>>
>> Cheers,
>> Marek
>>
>> [...]
Re: Re: Empty LinkDB after invertlinks
Posted by le...@gmail.com.
Hi Marek,
You make a reasonable point. If you feel this is something that should be
integrated, then maybe consider filing a JIRA issue with a comprehensive
description of the problem and a proposed solution. Even if you do not
patch this yourself, someone else who runs into the same situation Sergey
identified could provide a patch in the future.
On Aug 23, 2011 10:48pm, Marek Bachmann <m....@uni-kassel.de> wrote:
> Oh yes, thank you very much Sergey, that was the problem.
> Would have been nice if the invertlinks command had told me that it had
> ignored them :-)
> Cheers,
> Marek
> On 23.08.2011 19:26, Sergey A Volkov wrote:
> > Hi
> >
> > Is it possible that you fetched documents from just one site/domain?
> >
> > It looks like by default nutch ignores internal site links
> > (db.ignore.internal.links)
> >
> > Sergey Volkov
> >
> > On 08/23/2011 07:04 PM, Marek Bachmann wrote:
> >> Hi Lewis,
> >>
> >> thank you for your suggestion.
> >> Unfortunately this isn't the problem. I have also tried merging all
> >> segments together and feeding the one big segment to the invertlinks
> >> command. Same (non-)effect. :-(
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch mergesegs crawl/one-seg -dir crawl/segments/
> >> Merging 29 segments to crawl/one-seg/20110823165144
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
> >> SegmentMerger: adding
> >>
> file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
> >> SegmentMerger: using segment data from: content crawl_generate
> >> crawl_fetch crawl_parse parse_data parse_text
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm
> >> -rf crawl/linkdb/
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/
> >> -noNormalize -noFilter
> >> LinkDb: starting at 2011-08-23 17:01:44
> >> LinkDb: linkdb: crawl/linkdb
> >> LinkDb: URL normalize: false
> >> LinkDb: URL filter: false
> >> LinkDb: adding segment: crawl/one-seg/20110823165144
> >> LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
> >>
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> >> LinkDb dump: starting at 2011-08-23 17:03:12
> >> LinkDb dump: db: crawl/linkdb/
> >> LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
> >> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
> >> linkdump/
> >>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> >> ll
> >> total 0
> >> -rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
> >>
> >>
> >> On 23.08.2011 16:44, lewis john mcgibbney wrote:
> >>> Hi
> >>>
> >>> Small suggestion, but I do not see any -dir argument passed
> >>> alongside your initial invertlinks command. I understand that you
> >>> have multiple segment directories, fetched over the last few days,
> >>> and the output would suggest the process executed properly. However,
> >>> I have never used the command without the -dir option (it has always
> >>> worked for me), so I can only suggest that this may be the problem.
> >>>
> >>>
> >>>
> >>> On Tue, Aug 23, 2011 at 3:29 PM, Marek
> >>> Bachmann <m.bachmann@uni-kassel.de> wrote:
> >>>
> >>>> Hi Markus,
> >>>>
> >>>> thank you for the quick reply. I already searched for this
> >>>> Configuration
> >>>> error and found:
> >>>>
> >>>>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
> >>>>
> >>>>
> >>>> Where they say that "This exception is innocuous - it helps to
> >>>> debug at
> >>>> which points in the code the Configuration instances are being
> >>>> created.
> >>>> (...)"
> >>>>
> >>>> I indeed do not have much disk space on the machine, but it should
> >>>> be enough at the moment:
> >>>>
> >>>>
> >>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
> >>>> Filesystem Size Used Avail Use% Mounted on
> >>>> /dev/vda1 20G 5.9G 15G 30% /home
> >>>>
> >>>> As I am root, and all directories under
> >>>> /home/nutchServer/relaunch_nutch/runtime/local/bin
> >>>> are set to root:root and 755, permissions shouldn't be the problem.
> >>>>
> >>>> Any further suggestions? :-/
> >>>>
> >>>> Thank you once again
> >>>>
> >>>>
> >>>>
> >>>> On 23.08.2011 16:10, Markus Jelsma wrote:
> >>>>
> >>>> There are some peculiarities in your log:
> >>>>>
> >>>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException:
> >>>>> config()
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
> >>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
> >>>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
> >>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
> >>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
> >>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
> >>>>>
> >>>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job:
> >>>>> job_local_0002
> >>>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException:
> >>>>> config(config)
> >>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
> >>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
> >>>>> at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
> >>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
> >>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
> >>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
> >>>>>
> >>>>>
> >>>>> Can you check permissions, disk space etc?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
> >>>>>
> >>>>>> Hey Ho,
> >>>>>>
> >>>>>> For some reason the invertlinks command produces an empty linkdb.
> >>>>>>
> >>>>>> I did:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize
> >>>>>> -noFilter
> >>>>>> LinkDb: starting at 2011-08-23 14:47:21
> >>>>>> LinkDb: linkdb: crawl/linkdb
> >>>>>> LinkDb: URL normalize: false
> >>>>>> LinkDb: URL filter: false
> >>>>>> LinkDb: adding segment: crawl/segments/20110817164804
> >>>>>> LinkDb: adding segment: crawl/segments/20110817164912
> >>>>>> LinkDb: adding segment: crawl/segments/20110817165053
> >>>>>> LinkDb: adding segment: crawl/segments/20110817165524
> >>>>>> LinkDb: adding segment: crawl/segments/20110817170729
> >>>>>> LinkDb: adding segment: crawl/segments/20110817171757
> >>>>>> LinkDb: adding segment: crawl/segments/20110817172919
> >>>>>> LinkDb: adding segment: crawl/segments/20110819135218
> >>>>>> LinkDb: adding segment: crawl/segments/20110819165658
> >>>>>> LinkDb: adding segment: crawl/segments/20110819170807
> >>>>>> LinkDb: adding segment: crawl/segments/20110819171841
> >>>>>> LinkDb: adding segment: crawl/segments/20110819173350
> >>>>>> LinkDb: adding segment: crawl/segments/20110822135934
> >>>>>> LinkDb: adding segment: crawl/segments/20110822141229
> >>>>>> LinkDb: adding segment: crawl/segments/20110822143419
> >>>>>> LinkDb: adding segment: crawl/segments/20110822143824
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144031
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144232
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144435
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144617
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144750
> >>>>>> LinkDb: adding segment: crawl/segments/20110822144927
> >>>>>> LinkDb: adding segment: crawl/segments/20110822145249
> >>>>>> LinkDb: adding segment: crawl/segments/20110822150757
> >>>>>> LinkDb: adding segment: crawl/segments/20110822152354
> >>>>>> LinkDb: adding segment: crawl/segments/20110822152503
> >>>>>> LinkDb: adding segment: crawl/segments/20110822153900
> >>>>>> LinkDb: adding segment: crawl/segments/20110822155321
> >>>>>> LinkDb: adding segment: crawl/segments/20110822155732
> >>>>>> LinkDb: merging with existing linkdb: crawl/linkdb
> >>>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
> >>>>>>
> >>>>>> After that:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> >>>>>> LinkDb dump: starting at 2011-08-23 14:48:26
> >>>>>> LinkDb dump: db: crawl/linkdb/
> >>>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
> >>>>>>
> >>>>>> And then:
> >>>>>>
> >>>>>>
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> >>>>>>
> >>>>>> cd
> >>>>>> linkdump/
> >>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> >>>>>> ll
> >>>>>> total 0
> >>>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
> >>>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
> >>>>>> runtime/local/bin/linkdump#
> >>>>>>
> >>>>>> As you see, the dump size is 0 byte.
> >>>>>>
> >>>>>> Unfortunately I have no idea what went wrong.
> >>>>>>
> >>>>>> I have attached the hadoop.log for the inverlinks process.
> >>>>>> Perhaps that
> >>>>>> helps anybody?
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Oh yes, thank you very much Sergey, that was the problem.
It would have been nice if the invertlinks command had told me that it
ignored them :-)
Cheers,
Marek
Am 23.08.2011 19:26, schrieb Sergey A Volkov:
> Hi
>
> Is it possible that you fetch documents from just one site/domain?
>
> Looks like by default Nutch ignores internal site links
> (db.ignore.internal.links)
>
> Sergey Volkov
>
Re: Empty LinkDB after invertlinks
Posted by Sergey A Volkov <se...@gmail.com>.
Hi
Is it possible that you fetch documents from just one site/domain?
Looks like by default Nutch ignores internal site links
(db.ignore.internal.links)
Sergey Volkov
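[For readers landing here from a search: the property Sergey refers to can be overridden in conf/nutch-site.xml. In Nutch 1.x it defaults to true in nutch-default.xml, so a crawl confined to a single host records no inlinks at all. A minimal sketch of the override, placed inside the <configuration> element:

```xml
<!-- Keep links between pages of the same host when building the LinkDb.
     The default (true) drops them, which leaves the LinkDb empty for
     single-site crawls. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

After changing it, remove crawl/linkdb and re-run invertlinks so the LinkDb is rebuilt with internal links included.]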
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,
thank you for your suggestion.
Unfortunately this isn't the problem. I have also tried merging all
segments together and passing the one big segment to the invertlinks
command. Same (lack of) effect. :-(
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch mergesegs crawl/one-seg -dir crawl/segments/
Merging 29 segments to crawl/one-seg/20110823165144
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
SegmentMerger: adding
file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
SegmentMerger: using segment data from: content crawl_generate
crawl_fetch crawl_parse parse_data parse_text
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm
-rf crawl/linkdb/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/
-noNormalize -noFilter
LinkDb: starting at 2011-08-23 17:01:44
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/one-seg/20110823165144
LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 17:03:12
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
Am 23.08.2011 16:44, schrieb lewis john mcgibbney:
> Hi
>
> Small suggestion, but I do not see a -dir argument passed alongside your
> initial invertlinks command. I understand that you have multiple segment
> directories, fetched over the last few days, and the output suggests the
> process executed properly. However, I have never used the command without
> the -dir option (it has always worked for me), so I can only suggest that
> this may be the problem.
>
>
>
> On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <m....@uni-kassel.de> wrote:
>
>> Hi Markus,
>>
>> thank you for the quick reply. I already searched for this Configuration
>> error and found:
>>
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
>>
>> Where they say that "This exception is innocuous - it helps to debug at
>> which points in the code the Configuration instances are being created.
>> (...)"
>>
>> I admittedly do not have much disk space on the machine, but it should
>> be enough at the moment:
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 20G 5.9G 15G 30% /home
>>
>> As I am root and all directories under
>> /home/nutchServer/relaunch_nutch/runtime/local/bin are set to root:root
>> with 755 permissions, that shouldn't be the problem.
>>
>> Any further suggestions? :-/
>>
>> Thank you once again
>>
>>
>>
>> Am 23.08.2011 16:10, schrieb Markus Jelsma:
>>
>> There are some peculiarities in your log:
>>>
>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>>     at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
>>>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>>     at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>>     at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>>
>>>
>>> Can you check permissions, disk space etc?
>>>
>>>
>>>
>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>>
>>>> Hey Ho,
>>>>
>>>> for some reasons the inverlinks command produces an empty linkdb.
>>>>
>>>> I did:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
>>>> LinkDb: starting at 2011-08-23 14:47:21
>>>> LinkDb: linkdb: crawl/linkdb
>>>> LinkDb: URL normalize: false
>>>> LinkDb: URL filter: false
>>>> LinkDb: adding segment: crawl/segments/20110817164804
>>>> LinkDb: adding segment: crawl/segments/20110817164912
>>>> LinkDb: adding segment: crawl/segments/20110817165053
>>>> LinkDb: adding segment: crawl/segments/20110817165524
>>>> LinkDb: adding segment: crawl/segments/20110817170729
>>>> LinkDb: adding segment: crawl/segments/20110817171757
>>>> LinkDb: adding segment: crawl/segments/20110817172919
>>>> LinkDb: adding segment: crawl/segments/20110819135218
>>>> LinkDb: adding segment: crawl/segments/20110819165658
>>>> LinkDb: adding segment: crawl/segments/20110819170807
>>>> LinkDb: adding segment: crawl/segments/20110819171841
>>>> LinkDb: adding segment: crawl/segments/20110819173350
>>>> LinkDb: adding segment: crawl/segments/20110822135934
>>>> LinkDb: adding segment: crawl/segments/20110822141229
>>>> LinkDb: adding segment: crawl/segments/20110822143419
>>>> LinkDb: adding segment: crawl/segments/20110822143824
>>>> LinkDb: adding segment: crawl/segments/20110822144031
>>>> LinkDb: adding segment: crawl/segments/20110822144232
>>>> LinkDb: adding segment: crawl/segments/20110822144435
>>>> LinkDb: adding segment: crawl/segments/20110822144617
>>>> LinkDb: adding segment: crawl/segments/20110822144750
>>>> LinkDb: adding segment: crawl/segments/20110822144927
>>>> LinkDb: adding segment: crawl/segments/20110822145249
>>>> LinkDb: adding segment: crawl/segments/20110822150757
>>>> LinkDb: adding segment: crawl/segments/20110822152354
>>>> LinkDb: adding segment: crawl/segments/20110822152503
>>>> LinkDb: adding segment: crawl/segments/20110822153900
>>>> LinkDb: adding segment: crawl/segments/20110822155321
>>>> LinkDb: adding segment: crawl/segments/20110822155732
>>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>>
>>>> After that:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>>> LinkDb dump: db: crawl/linkdb/
>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>>
>>>> And then:
>>>>
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**runtime/local/bin#
>>>> cd
>>>> linkdump/
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
>>>> runtime/local/bin/linkdump#
>>>> ll
>>>> total 0
>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>>> root@hrz-vm180:/home/**nutchServer/relaunch_nutch/**
>>>> runtime/local/bin/linkdump#
>>>>
>>>> As you see, the dump size is 0 byte.
>>>>
>>>> Unfortunately I have no idea what went wrong.
>>>>
>>>> I have attached the hadoop.log for the inverlinks process. Perhaps that
>>>> helps anybody?
>>>>
>>>
>>>
>>
>
>
Re: Empty LinkDB after invertlinks
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi
Small suggestion, but I do not see any -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories fetched over the last few days, and the output does suggest the
process executed properly. However, I have never used the command without the
-dir option (it has always worked for me that way), so I can only suggest that
this may be the problem.
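For reference, the -dir form would look roughly like this, using the paths from this thread. The tiny wrapper function is purely illustrative (not part of Nutch); it just builds the command line so you can see the difference from the glob form:

```shell
# Build the invertlinks command line using -dir, which points at the
# directory *containing* the segments instead of globbing each one.
# $1 = linkdb path, $2 = parent directory of the segments
invertlinks_dir_cmd() {
  echo "./nutch invertlinks $1 -dir $2 -noNormalize -noFilter"
}

# Run from runtime/local/bin; this prints the command to execute:
invertlinks_dir_cmd crawl/linkdb crawl/segments
# prints: ./nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
```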
On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <m....@uni-kassel.de> wrote:

> [quoted message trimmed; it appears in full below]
--
*Lewis*
Re: Empty LinkDB after invertlinks
Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Markus,
thank you for the quick reply. I already searched for this Configuration
error and found:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
Where they say that "This exception is innocuous - it helps to debug at
which points in the code the Configuration instances are being created.
(...)"
I admittedly do not have much disk space on the machine, but it should be
enough for the moment:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 5.9G 15G 30% /home
As I am root, and all directories under
/home/nutchServer/relaunch_nutch/runtime/local/bin are owned by root:root
with 755 permissions, permissions shouldn't be the problem.
Any further suggestions? :-/
Thank you once again
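One more thing worth ruling out (not raised above, so treat the details as assumptions about a standard local Nutch 1.x layout): invertlinks reads outlinks from each segment's parse_data directory, so segments that were fetched but never parsed contribute nothing to the linkdb. A quick check, with the helper function being purely illustrative:

```shell
# Report segments under $1 that lack the parse_data subdirectory
# that invertlinks reads outlinks from.
check_segments() {
  for seg in "$1"/*; do
    if [ ! -d "$seg/parse_data" ]; then
      echo "missing parse_data: $seg"
    fi
  done
}

check_segments crawl/segments
```

If every segment does have parse_data, another common cause of an empty linkdb is the db.ignore.internal.links property, which defaults to true and drops links between pages on the same host, so a crawl confined to a single host inverts to nothing.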
On 23.08.2011 16:10, Markus Jelsma wrote:

> [quoted message trimmed; it appears in full below]
Re: Empty LinkDB after invertlinks
Posted by Markus Jelsma <ma...@openindex.io>.
There are some peculiarities in your log:
2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
    at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
Can you check permissions, disk space etc?
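The checks I have in mind can be sketched roughly like this (paths are the thread's local crawl; the helper function is just illustrative):

```shell
# Rough permission and disk-space checks for a local crawl directory.
# Prints "ok" or "problem" for each path passed in.
check_writable() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "ok: $1"
  else
    echo "problem: $1"
  fi
}

check_writable crawl
check_writable crawl/linkdb
df -h .   # free space on the filesystem holding the crawl
```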
On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
> Hey Ho,
>
> for some reason the invertlinks command produces an empty linkdb.
>
> I did:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
> LinkDb: starting at 2011-08-23 14:47:21
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: false
> LinkDb: URL filter: false
> LinkDb: adding segment: crawl/segments/20110817164804
> LinkDb: adding segment: crawl/segments/20110817164912
> LinkDb: adding segment: crawl/segments/20110817165053
> LinkDb: adding segment: crawl/segments/20110817165524
> LinkDb: adding segment: crawl/segments/20110817170729
> LinkDb: adding segment: crawl/segments/20110817171757
> LinkDb: adding segment: crawl/segments/20110817172919
> LinkDb: adding segment: crawl/segments/20110819135218
> LinkDb: adding segment: crawl/segments/20110819165658
> LinkDb: adding segment: crawl/segments/20110819170807
> LinkDb: adding segment: crawl/segments/20110819171841
> LinkDb: adding segment: crawl/segments/20110819173350
> LinkDb: adding segment: crawl/segments/20110822135934
> LinkDb: adding segment: crawl/segments/20110822141229
> LinkDb: adding segment: crawl/segments/20110822143419
> LinkDb: adding segment: crawl/segments/20110822143824
> LinkDb: adding segment: crawl/segments/20110822144031
> LinkDb: adding segment: crawl/segments/20110822144232
> LinkDb: adding segment: crawl/segments/20110822144435
> LinkDb: adding segment: crawl/segments/20110822144617
> LinkDb: adding segment: crawl/segments/20110822144750
> LinkDb: adding segment: crawl/segments/20110822144927
> LinkDb: adding segment: crawl/segments/20110822145249
> LinkDb: adding segment: crawl/segments/20110822150757
> LinkDb: adding segment: crawl/segments/20110822152354
> LinkDb: adding segment: crawl/segments/20110822152503
> LinkDb: adding segment: crawl/segments/20110822153900
> LinkDb: adding segment: crawl/segments/20110822155321
> LinkDb: adding segment: crawl/segments/20110822155732
> LinkDb: merging with existing linkdb: crawl/linkdb
> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>
> After that:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
> LinkDb dump: starting at 2011-08-23 14:48:26
> LinkDb dump: db: crawl/linkdb/
> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>
> And then:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
> linkdump/
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
> ll
> total 0
> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>
> As you can see, the dump size is 0 bytes.
>
> Unfortunately I have no idea what went wrong.
>
> I have attached the hadoop.log for the invertlinks process. Perhaps that
> helps anybody?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350