Posted to user@nutch.apache.org by Dean Pullen <de...@semantico.com> on 2012/01/05 18:28:52 UTC
parse data directory not found after merge
Hi all,
I'm upgrading from nutch 1 to 1.4 and am having problems running
invertlinks.
Error:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
I notice that the parse_data directories are produced after a fetch
(with fetcher.parse set to true), but after the merge the parse_data
directory doesn't exist.
What behaviour has changed since 1.0 and does anyone have a solution for
the above?
Thanks in advance,
Dean.
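[Editor's note: for anyone hitting the same InvalidInputException, it can help to confirm which segments actually contain a parse_data subdirectory before running invertlinks. A minimal sketch; the helper name and paths are illustrative, not from Nutch itself:]

```shell
# check_segments: list every segment under the given directory that lacks
# the parse_data subdirectory invertlinks expects. Helper name and paths
# are illustrative.
check_segments() {
  for seg in "$1"/*/; do
    [ -d "$seg" ] || continue            # glob matched nothing
    if [ ! -d "${seg}parse_data" ]; then
      echo "missing parse_data in: $seg"
    fi
  done
}

# Example invocation (adjust the path to your crawl directory):
check_segments /opt/nutch/data/crawl/segments
```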
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Pretty sure the same thing is happening with Hadoop 1.0...
On 10/01/2012 14:11, Dean Pullen wrote:
> Upgraded to Hadoop 0.20.205.0 and the DiskErrorException disappears,
> but the same result occurs, i.e. only the crawl_fetch and crawl_generate
> directories get merged; no parse_data directory exists.
>
> Arghhhhhhhhh.
>
>
> Dean.
>
> On 10/01/2012 11:33, Dean Pullen wrote:
>> I'm running in local mode (I believe) and using Hadoop 0.20.2, as
>> this is the library version shipped with Nutch 1.4.
>>
>> Dean.
>>
>> On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
>>> How are you running Nutch, in local or deploy mode? Which Hadoop
>>> version are you using, 0.20.2? This appears to be an open issue with
>>> that version [1].
>>>
>>> Also please have a look here [2] for a similar frustrating situation.
>>>
>>> [1] https://issues.apache.org/jira/browse/HADOOP-6958
>>> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>>>
>>>
>>> On Mon, Jan 9, 2012 at 4:14 PM, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> This is interesting, and something I've only just noticed in the logs:
>>>>
>>>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>>>>
>>>> in any of the configured local directories
>>>>
>>>> This is during the mergesegs job (and previous jobs).....but I'm
>>>> not sure
>>>> what it means or if it's actually a problem.
>>>>
>>>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>>>
>>>> It suggests that the map part of the hadoop job has not produced an
>>>> output
>>>> file, or it's looking in the wrong place?
>>>>
>>>> Dean
>>>
>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Upgraded to Hadoop 0.20.205.0 and the DiskErrorException disappears,
but the same result occurs, i.e. only the crawl_fetch and crawl_generate
directories get merged; no parse_data directory exists.
Arghhhhhhhhh.
Dean.
On 10/01/2012 11:33, Dean Pullen wrote:
> I'm running in local mode (I believe) and using Hadoop 0.20.2, as this
> is the library version shipped with Nutch 1.4.
>
> Dean.
>
> On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
>> How are you running Nutch, in local or deploy mode? Which Hadoop
>> version are you using, 0.20.2? This appears to be an open issue with
>> that version [1].
>>
>> Also please have a look here [2] for a similar frustrating situation.
>>
>> [1] https://issues.apache.org/jira/browse/HADOOP-6958
>> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>>
>>
>> On Mon, Jan 9, 2012 at 4:14 PM, Dean
>> Pullen<de...@semantico.com> wrote:
>>> This is interesting, and something I've only just noticed in the logs:
>>>
>>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>>>
>>> in any of the configured local directories
>>>
>>> This is during the mergesegs job (and previous jobs).....but I'm not
>>> sure
>>> what it means or if it's actually a problem.
>>>
>>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>>
>>> It suggests that the map part of the hadoop job has not produced an
>>> output
>>> file, or it's looking in the wrong place?
>>>
>>> Dean
>>
>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I'm running in local mode (I believe) and using Hadoop 0.20.2, as this
is the library version shipped with Nutch 1.4.
Dean.
On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
> How are you running Nutch, in local or deploy mode? Which Hadoop
> version are you using, 0.20.2? This appears to be an open issue with
> that version [1].
>
> Also please have a look here [2] for a similar frustrating situation.
>
> [1] https://issues.apache.org/jira/browse/HADOOP-6958
> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>
> On Mon, Jan 9, 2012 at 4:14 PM, Dean Pullen<de...@semantico.com> wrote:
>> This is interesting, and something I've only just noticed in the logs:
>>
>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>> in any of the configured local directories
>>
>> This is during the mergesegs job (and previous jobs).....but I'm not sure
>> what it means or if it's actually a problem.
>>
>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>
>> It suggests that the map part of the hadoop job has not produced an output
>> file, or it's looking in the wrong place?
>>
>> Dean
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
How are you running Nutch, in local or deploy mode? Which Hadoop
version are you using, 0.20.2? This appears to be an open issue with
that version [1].
Also please have a look here [2] for a similar frustrating situation.
[1] https://issues.apache.org/jira/browse/HADOOP-6958
[2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
On Mon, Jan 9, 2012 at 4:14 PM, Dean Pullen <de...@semantico.com> wrote:
> This is interesting, and something I've only just noticed in the logs:
>
> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
> in any of the configured local directories
>
> This is during the mergesegs job (and previous jobs).....but I'm not sure
> what it means or if it's actually a problem.
>
> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>
> It suggests that the map part of the hadoop job has not produced an output
> file, or it's looking in the wrong place?
>
> Dean
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
This is interesting, and something I've only just noticed in the logs:
2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
in any of the configured local directories
This is during the mergesegs job (and previous jobs)... but I'm not
sure what it means or if it's actually a problem.
mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
It suggests that the map part of the hadoop job has not produced an
output file, or it's looking in the wrong place?
Dean
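[Editor's note: in Hadoop releases of that era, mapred.local.dir is ordinarily set as a property in conf/mapred-site.xml (or in nutch-site.xml when running through bin/nutch in local mode). A minimal sketch, with the value taken from the message above; the file placement is an assumption about this particular setup:]

```xml
<!-- conf/mapred-site.xml (or nutch-site.xml for local runs) -->
<property>
  <name>mapred.local.dir</name>
  <value>/opt/nutch_1_4/data/local</value>
  <description>Local directory where MapReduce writes intermediate map
  output (e.g. file.out); it must exist and be writable by the process
  running the job.</description>
</property>
```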
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No, thank you for taking the time to look at it! I'm still on the case
but am hoping you'll find the problem.
Dean.
On 09/01/2012 14:24, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> I'll have a look into this later today if I get a chance. Anyone else
> experiencing problems using the mergesegs command or code?
>
> Thanks for persisting with this, Dean; hopefully we will get to the
> bottom of it soon.
>
> On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen<de...@semantico.com> wrote:
>> Looking through the code, I'm seeing
>> org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
>> crawl_fetch and crawl_generate.
>>
>> Prior to this org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
>> gets called for all components, i.e. crawl_generate crawl_fetch crawl_parse
>> parse_data parse_text
>>
>> I'm not quite sure what's going on in between these two calls...
>>
>> Dean.
>>
>>
>>
>> On 08/01/2012 22:51, Dean Pullen wrote:
>>> Where do we go from here? I can start looking/stepping through the
>>> mergesegs code, but I'm reluctant due to its probable complexity.
>>>
>>> Dean.
>>>
>>>
>>> On 08/01/2012 14:26, Dean Pullen wrote:
>>>> No Lewis, -linkdb was already being used for the solrindex command, so we
>>>> still have the same problem.
>>>>
>>>> Many thanks,
>>>>
>>>> Dean
>>>>
>>>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>>>> Hi Dean, is this sorted?
>>>>>
>>>>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> Sorry, you did mean on solrindex - which I already do...
>>>>>>
>>>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>>>
>>>>>> The -linkdb param isn't in the invertlinks docs
>>>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>>>> (However it is in the solrindex docs)
>>>>>>
>>>>>> Adding it makes no difference to invertlinks.
>>>>>>
>>>>>> I think the problem is definitely with mergesegs, as opposed to
>>>>> invertlinks etc.
>>>>>> Thanks again,
>>>>>>
>>>>>> Dean.
>>>>>>
>>>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>>>> parameter. This was implemented as not everyone wishes to create a
>>>>>> linkdb.
>>>>>>
>>>>>> Your invertlinks command should be passed as follows
>>>>>>
>>>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>>>> /path/to/segment/dirs
>>>>>> then
>>>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>>>
>>>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>>>> directory.
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> Only this:
>>>>>>
>>>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>>>> GenericOptionsParser
>>>>>> for parsing the arguments. Applications should implement Tool for the
>>>>> same.
>>>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>> where
>>>>>> applicable
>>>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>>>> 2012-01-06
>>>>>> 17:15:51
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>>>> true
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>>>> exist:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>> at
>>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>> at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>>
>>>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
>>>>>> at
>>>>>> 2012-01-06 17:15:52
>>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>>> IndexerMapReduce:
>>>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>>> IndexerMapReduce:
>>>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
I'll have a look into this later today if I get a chance. Anyone else
experiencing problems using the mergesegs command or code?
Thanks for persisting with this, Dean; hopefully we will get to the
bottom of it soon.
On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen <de...@semantico.com> wrote:
> Looking through the code, I'm seeing
> org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
> crawl_fetch and crawl_generate.
>
> Prior to this org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
> gets called for all components, i.e. crawl_generate crawl_fetch crawl_parse
> parse_data parse_text
>
> I'm not quite sure what's going on in between these two calls...
>
> Dean.
>
>
>
> On 08/01/2012 22:51, Dean Pullen wrote:
>>
>> Where do we go from here? I can start looking/stepping through the
>> mergesegs code, but I'm reluctant due to its probable complexity.
>>
>> Dean.
>>
>>
>> On 08/01/2012 14:26, Dean Pullen wrote:
>>>
>>> No Lewis, -linkdb was already being used for the solrindex command, so we
>>> still have the same problem.
>>>
>>> Many thanks,
>>>
>>> Dean
>>>
>>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean, is this sorted?
>>>>
>>>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Sorry, you did mean on solrindex - which I already do...
>>>>>
>>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>>
>>>>> The -linkdb param isn't in the invertlinks docs
>>>>
>>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>>>
>>>>> (However it is in the solrindex docs)
>>>>>
>>>>> Adding it makes no difference to invertlinks.
>>>>>
>>>>> I think the problem is definitely with mergesegs, as opposed to
>>>>
>>>> invertlinks etc.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>>
>>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>>> parameter. This was implemented as not everyone wishes to create a
>>>>> linkdb.
>>>>>
>>>>> Your invertlinks command should be passed as follows
>>>>>
>>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>>> /path/to/segment/dirs
>>>>> then
>>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>>
>>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>>> directory.
>>>>>
>>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>>>
>>>> wrote:
>>>>>
>>>>> Only this:
>>>>>
>>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>>> GenericOptionsParser
>>>>> for parsing the arguments. Applications should implement Tool for the
>>>>
>>>> same.
>>>>>
>>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>
>>>> where
>>>>>
>>>>> applicable
>>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>>>
>>>> 2012-01-06
>>>>>
>>>>> 17:15:51
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>>> true
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>>> exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>
>>>>> at
>>>>
>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
>>>>> at
>>>>> 2012-01-06 17:15:52
>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>> IndexerMapReduce:
>>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>> IndexerMapReduce:
>>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>>
>>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Looking through the code, I'm seeing
org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
crawl_fetch and crawl_generate.
Prior to this
org.apache.nutch.segment.SegmentMerger.getRecordWriter(...) gets called
for all components, i.e. crawl_generate crawl_fetch crawl_parse
parse_data parse_text
I'm not quite sure what's going on in between these two calls...
Dean.
On 08/01/2012 22:51, Dean Pullen wrote:
> Where do we go from here? I can start looking/stepping through the
> mergesegs code, but I'm reluctant due to its probable complexity.
>
> Dean.
>
>
> On 08/01/2012 14:26, Dean Pullen wrote:
>> No Lewis, -linkdb was already being used for the solrindex command, so
>> we still have the same problem.
>>
>> Many thanks,
>>
>> Dean
>>
>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>> Hi Dean, is this sorted?
>>>
>>> On Saturday, January 7, 2012, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> Sorry, you did mean on solrindex - which I already do...
>>>>
>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>
>>>> The -linkdb param isn't in the invertlinks docs
>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>> (However it is in the solrindex docs)
>>>>
>>>> Adding it makes no difference to invertlinks.
>>>>
>>>> I think the problem is definitely with mergesegs, as opposed to
>>> invertlinks etc.
>>>> Thanks again,
>>>>
>>>> Dean.
>>>>
>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>
>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>> parameter. This was implemented as not everyone wishes to create a
>>>> linkdb.
>>>>
>>>> Your invertlinks command should be passed as follows
>>>>
>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>> /path/to/segment/dirs
>>>> then
>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>
>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>> directory.
>>>>
>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> Only this:
>>>>
>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>> GenericOptionsParser
>>>> for parsing the arguments. Applications should implement Tool for the
>>> same.
>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>> where
>>>> applicable
>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>> 2012-01-06
>>>> 17:15:51
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>> true
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>> exist:
>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>> at
>>>>
>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>
>>>> at
>>>>
>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>
>>>> at
>>>>
>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>
>>>> at
>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>
>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>>> starting at
>>>> 2012-01-06 17:15:52
>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>> IndexerMapReduce:
>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>> IndexerMapReduce:
>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.
Dean.
On 08/01/2012 14:26, Dean Pullen wrote:
> No Lewis, -linkdb was already being used for the solrindex command, so
> we still have the same problem.
>
> Many thanks,
>
> Dean
>
> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>> Hi Dean, is this sorted?
>>
>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>> wrote:
>>> Sorry, you did mean on solrindex - which I already do...
>>>
>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>
>>> The -linkdb param isn't in the invertlinks docs
>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>> (However it is in the solrindex docs)
>>>
>>> Adding it makes no difference to invertlinks.
>>>
>>> I think the problem is definitely with mergesegs, as opposed to
>> invertlinks etc.
>>> Thanks again,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>
>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>> parameter. This was implemented as not everyone wishes to create a
>>> linkdb.
>>>
>>> Your invertlinks command should be passed as follows
>>>
>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>> /path/to/segment/dirs
>>> then
>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>> path/to/linkdb -dir path/to/segment/dirs
>>>
>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>> exception will be thrown, as the linkdb is now treated as a segment
>>> directory.
>>>
>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>> Only this:
>>>
>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for the
>> same.
>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>> where
>>> applicable
>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>> 2012-01-06
>>> 17:15:51
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>> /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>> true
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> at
>>>
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>
>>> at
>>>
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>
>>> at
>>>
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>
>>> at
>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>> starting at
>>> 2012-01-06 17:15:52
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No Lewis, -linkdb was already being used for the solrindex command, so we
still have the same problem.
Many thanks,
Dean
On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
> Hi Dean, is this sorted?
>
> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com> wrote:
>> Sorry, you did mean on solrindex - which I already do...
>>
>> On 07/01/2012 13:15, Dean Pullen wrote:
>>
>> The -linkdb param isn't in the invertlinks docs
> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>> (However it is in the solrindex docs)
>>
>> Adding it makes no difference to invertlinks.
>>
>> I think the problem is definitely with mergesegs, as opposed to
> invertlinks etc.
>> Thanks again,
>>
>> Dean.
>>
>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>
>> OK, so now I think we're at the bottom of it. If you wish to create a
>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>> parameter. This was implemented as not everyone wishes to create a
>> linkdb.
>>
>> Your invertlinks command should be passed as follows
>>
>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>> /path/to/segment/dirs
>> then
>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>> path/to/linkdb -dir path/to/segment/dirs
>>
>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>> exception will be thrown, as the linkdb is now treated as a segment
>> directory.
>>
>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
> wrote:
>> Only this:
>>
>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the
> same.
>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
> where
>> applicable
>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
> 2012-01-06
>> 17:15:51
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>> /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> at
>>
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at
>>
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at
>>
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
>> 2012-01-06 17:15:52
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>
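[Editor's note: pulling the two commands Lewis quotes above into one place, a Nutch 1.4 invocation would look roughly like this; the Solr URL and all paths are placeholders, not values from this setup:]

```shell
# Placeholders throughout; adjust to your crawl layout and Solr instance.
bin/nutch invertlinks data/crawl/linkdb -dir data/crawl/segments
bin/nutch solrindex http://localhost:8983/solr data/crawl/crawldb \
    -linkdb data/crawl/linkdb -dir data/crawl/segments
```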
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean, is this sorted?
On Saturday, January 7, 2012, Dean Pullen <de...@semantico.com> wrote:
> Sorry, you did mean on solrindex - which I already do...
>
> On 07/01/2012 13:15, Dean Pullen wrote:
>
> The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
>
> (However it is in the solrindex docs)
>
> Adding it makes no difference to invertlinks.
>
> I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
>
> Thanks again,
>
> Dean.
>
> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>
> OK so now I think we're at the bottom of it. If you wish to create a
> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
> parameter. This was implemented as not everyone wishes to create a
> linkdb.
>
> Your invertlinks command should be passed as follows
>
> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
> /path/to/segment/dirs
> then
> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
> path/to/linkdb -dir path/to/segment/dirs
>
> If you do not pass -linkdb path/to/linkdb explicitly, an exception
> will be thrown, because the linkdb is now treated as a segment
> directory.
>
> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
wrote:
>
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the
same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
where
> applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
2012-01-06
> 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
> /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
> 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>
--
*Lewis*
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Sorry, you did mean on solrindex - which I already do...
On 07/01/2012 13:15, Dean Pullen wrote:
> The -linkdb param isn't in the invertlinks docs
> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>
> (However it is in the solrindex docs)
>
> Adding it makes no difference to invertlinks.
>
> I think the problem is definitely with mergesegs, as opposed to
> invertlinks etc.
>
> Thanks again,
>
> Dean.
>
> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>> OK so now I think we're at the bottom of it. If you wish to create a
>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>> parameter. This was implemented as not everyone wishes to create a
>> linkdb.
>>
>> Your invertlinks command should be passed as follows
>>
>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>> /path/to/segment/dirs
>> then
>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>> path/to/linkdb -dir path/to/segment/dirs
>>
>> If you do not pass -linkdb path/to/linkdb explicitly, an exception
>> will be thrown, because the linkdb is now treated as a segment
>> directory.
>>
>> On Fri, Jan 6, 2012 at 5:17 PM, Dean
>> Pullen<de...@semantico.com> wrote:
>>> Only this:
>>>
>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for
>>> the same.
>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java
>>> classes where
>>> applicable
>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>> 2012-01-06
>>> 17:15:51
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>> /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>> true
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>> starting at
>>> 2012-01-06 17:15:52
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduces:
>>> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
>>> Input path does not exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> Input path does not exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
>>> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump:
>>> starting
>>> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
>>> /opt/nutch_1_4/data/crawl/crawldb/
>>> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for
>>> the same.
>>> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>>>
>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
(However it is in the solrindex docs)
Adding it makes no difference to invertlinks.
I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
Thanks again,
Dean.
On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
> OK so now I think we're at the bottom of it. If you wish to create a
> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
> parameter. This was implemented as not everyone wishes to create a
> linkdb.
>
> Your invertlinks command should be passed as follows
>
> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
> /path/to/segment/dirs
> then
> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
> path/to/linkdb -dir path/to/segment/dirs
>
> If you do not pass -linkdb path/to/linkdb explicitly, an exception
> will be thrown, because the linkdb is now treated as a segment
> directory.
>
> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com> wrote:
>> Only this:
>>
>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the same.
>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06
>> 17:15:51
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>> /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
>> 2012-01-06 17:15:52
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces:
>> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
>> Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
>> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
>> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
>> /opt/nutch_1_4/data/crawl/crawldb/
>> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the same.
>> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK so now I think we're at the bottom of it. If you wish to create a
linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
parameter. This was implemented as not everyone wishes to create a
linkdb.
Your invertlinks command should be passed as follows
bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
/path/to/segment/dirs
then
bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
path/to/linkdb -dir path/to/segment/dirs
If you do not pass -linkdb path/to/linkdb explicitly, an exception
will be thrown, because the linkdb is now treated as a segment
directory.
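[Editor's note: the two commands above can be combined into a dry-run sketch. The relative paths and the Solr URL below are illustrative assumptions, not values taken from this thread.]

```shell
# Dry-run sketch of the >= Nutch 1.4 sequence with an explicit -linkdb.
# NUTCH, CRAWL, and the Solr URL are assumed values for illustration;
# the commands are echoed rather than executed.
NUTCH=bin/nutch
CRAWL=data/crawl
echo "$NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments"
echo "$NUTCH solrindex http://localhost:8983/solr $CRAWL/crawldb -linkdb $CRAWL/linkdb -dir $CRAWL/segments"
```

Dropping `echo` runs the real tools; the key point is that `-linkdb` must appear explicitly in the solrindex invocation.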
On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <de...@semantico.com> wrote:
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06
> 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
> /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
> 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
> /opt/nutch_1_4/data/crawl/crawldb/
> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Only this:
2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
2012-01-06 17:15:51
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
/opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
at 2012-01-06 17:15:52
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduces: adding segment:
/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
Input path does not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
Input path does not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
/opt/nutch_1_4/data/crawl/crawldb/
2012-01-06 17:15:54,212 WARN mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Another thing which I have stupidly not asked yet, have you checked
your hadoop.log to see if there are any problems around the parse
phase?
It should begin
LOG.info("ParseSegment: starting at " + sdf.format(start));
LOG.info("ParseSegment: segment: " + segment);
...
if successful
...
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
...
if not then
...
LOG.warn("Error parsing: " etc
Any joy?
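[Editor's note: a quick way to follow the suggestion above is to grep hadoop.log for the parse-phase messages Lewis lists. The log path below is an assumption based on the install layout used elsewhere in this thread.]

```shell
# Hedged sketch: scan hadoop.log for parse-phase log lines.
# The default LOG path is an assumption; override with NUTCH_LOG.
LOG="${NUTCH_LOG:-/opt/nutch_1_4/logs/hadoop.log}"
if [ -f "$LOG" ]; then
  # show the last few ParseSegment start/success/failure messages
  grep -E 'ParseSegment|Parsed \(|Error parsing' "$LOG" | tail -n 20
else
  echo "log not found: $LOG"
fi
```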
On Fri, Jan 6, 2012 at 4:38 PM, Dean Pullen <de...@semantico.com> wrote:
> Two iterations do the same thing - the parse_data directory is missing.
>
> Interestingly, just doing the mergesegs on ONE crawl also removes the
> parse_data dir etc!
>
> Dean.
>
>
>
> On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
>>
>> How about merging segs after every subsequent iteration of the crawl
>> cycle... surely this is a problem with producing the specific
>> parse_data directory. If it doesn't work after two iterations then we
>> know that it is happening early on in the crawl cycle. Have you
>> manually checked that the directories exist after fetching and
>> parsing?
>>
>> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> Good spot because all of that was meant to be removed! No, I'm afraid
>>> that's
>>> just a copy/paste problem.
>>>
>>> Dean
>>>
>>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>>>
>>>> Ok then,
>>>>
>>>> How about your generate command:
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>> 26
>>>>
>>>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>>>> when everything else being utilised within the crawl cycle points to
>>>> an entirely different <segment_dirs> path which is
>>>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>>>
>>>> Was this intentional?
>>>>
>>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Lewis,
>>>>>
>>>>> Changing the merge to * returns a similar response:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>>>
>>>>> And yes, your assumption was correct - it's a different segment
>>>>> directory
>>>>> each loop.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> Hi Dean,
>>>>>>
>>>>>> Without discussing any of your configuration properties can you please
>>>>>> try
>>>>>>
>>>>>> 6) MERGE SEGMENTS:
>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>>>
>>>>>> paying attention to the wildcard /* in -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>>>
>>>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>>>> times, you are not recursively generating, fetching, parsing and
>>>>>> updating the WebDB with
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>>>
>>>>>>>
>>>>>>> Firstly I have a seed URL XML document here:
>>>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a
>>>>>>> URL
>>>>>>> within it.
>>>>>>>
>>>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>>>
>>>>>>> # allow urls in ukcigarforums.com domain
>>>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>>>> # deny anything else
>>>>>>> -.
>>>>>>>
>>>>>>>
>>>>>>> Here's the procedure:
>>>>>>>
>>>>>>>
>>>>>>> 1) INJECT:
>>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/seed/
>>>>>>>
>>>>>>> 2) GENERATE:
>>>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000
>>>>>>> -adddays
>>>>>>> 26
>>>>>>>
>>>>>>> 3) FETCH:
>>>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 4) PARSE:
>>>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 5) UPDATE DB:
>>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>>>
>>>>>>>
>>>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>>>
>>>>>>> 6) MERGE SEGMENTS:
>>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>>>
>>>>>>>
>>>>>>> Interestingly, this prints out:
>>>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>>>> crawl_parse parse_data parse_text"
>>>>>>>
>>>>>>> MERGEDsegments segment directory then has just two directories,
>>>>>>> instead
>>>>>>> of
>>>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>>>> crawl_fetch
>>>>>>>
>>>>>>> (we then delete the contents of the segments directory and copy
>>>>>>> the MERGEDsegments results into it)
>>>>>>>
>>>>>>>
>>>>>>> Lastly we run invert links after merge segments:
>>>>>>>
>>>>>>> 7) INVERT LINKS:
>>>>>>> /opt/nutch_1_4/bin/nutch invertlinks
>>>>>>> /opt/nutch_1_4/data/crawl/linkdb/
>>>>>>> -dir
>>>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>>>
>>>>>>> Which produces:
>>>>>>>
>>>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>>>> does
>>>>>>> not
>>>>>>> exist:
>>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>>>
>>>>>>>
>>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Two iterations do the same thing - the parse_data directory is missing.
Interestingly, just doing the mergesegs on ONE crawl also removes the
parse_data dir etc!
Dean.
On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
> How about merging segs after every subsequent iteration of the crawl
> cycle... surely this is a problem with producing the specific
> parse_data directory. If it doesn't work after two iterations then we
> know that it is happening early on in the crawl cycle. Have you
> manually checked that the directories exist after fetching and
> parsing?
>
> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<de...@semantico.com> wrote:
>> Good spot because all of that was meant to be removed! No, I'm afraid that's
>> just a copy/paste problem.
>>
>> Dean
>>
>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>> Ok then,
>>>
>>> How about your generate command:
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>
>>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>>> when everything else being utilised within the crawl cycle points to
>>> an entirely different <segment_dirs> path which is
>>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>>
>>> Was this intentional?
>>>
>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> Lewis,
>>>>
>>>> Changing the merge to * returns a similar response:
>>>>
>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>>
>>>> And yes, your assumption was correct - it's a different segment directory
>>>> each loop.
>>>>
>>>> Many thanks,
>>>>
>>>> Dean.
>>>>
>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>> Hi Dean,
>>>>>
>>>>> Without discussing any of your configuration properties can you please
>>>>> try
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>>
>>>>> paying attention to the wildcard /* in -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>>
>>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>>> times, you are not recursively generating, fetching, parsing and
>>>>> updating the WebDB with
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>>
>>>>>>
>>>>>> Firstly I have a seed URL XML document here:
>>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>>> within it.
>>>>>>
>>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>>
>>>>>> # allow urls in ukcigarforums.com domain
>>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>>> # deny anything else
>>>>>> -.
>>>>>>
>>>>>>
>>>>>> Here's the procedure:
>>>>>>
>>>>>>
>>>>>> 1) INJECT:
>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/nutch_1_4/data/seed/
>>>>>>
>>>>>> 2) GENERATE:
>>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>>>> 26
>>>>>>
>>>>>> 3) FETCH:
>>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>
>>>>>> 4) PARSE:
>>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>
>>>>>> 5) UPDATE DB:
>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>>
>>>>>>
>>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>>
>>>>>> 6) MERGE SEGMENTS:
>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>>
>>>>>>
>>>>>> Interestingly, this prints out:
>>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>>> crawl_parse parse_data parse_text"
>>>>>>
>>>>>> MERGEDsegments segment directory then has just two directories, instead
>>>>>> of
>>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>>> crawl_fetch
>>>>>>
>>>>>> (we then delete the contents of the segments directory and copy
>>>>>> the MERGEDsegments results into it)
>>>>>>
>>>>>>
>>>>>> Lastly we run invert links after merge segments:
>>>>>>
>>>>>> 7) INVERT LINKS:
>>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>>> -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>>
>>>>>> Which produces:
>>>>>>
>>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>>> does
>>>>>> not
>>>>>> exist:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>>
>>>>>>
>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
How about merging segs after every subsequent iteration of the crawl
cycle... surely this is a problem with producing the specific
parse_data directory. If it doesn't work after two iterations then we
know that it is happening early on in the crawl cycle. Have you
manually checked that the directories exist after fetching and
parsing?
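[Editor's note: one way to do that manual check, sketched under the assumption of the segment layout used elsewhere in this thread. The SEGMENTS path is illustrative.]

```shell
# Hedged sketch: report any segment missing one of the subdirectories
# expected after a fetch + parse. SEGMENTS is an assumed default path;
# override it with the SEGMENTS environment variable.
SEGMENTS="${SEGMENTS:-/opt/nutch_1_4/data/crawl/segments}"
for seg in "$SEGMENTS"/*/; do
  [ -d "$seg" ] || continue   # skip when the glob matched nothing
  for sub in crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "missing: $seg$sub"
  done
done
```

An empty output means every segment has all five subdirectories; any `missing: .../parse_data` line pinpoints where the merge or parse step lost data.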
On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <de...@semantico.com> wrote:
> Good spot because all of that was meant to be removed! No, I'm afraid that's
> just a copy/paste problem.
>
> Dean
>
> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>
>> Ok then,
>>
>> How about your generate command:
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>> when everything else being utilised within the crawl cycle points to
>> an entirely different <segment_dirs> path which is
>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>
>> Was this intentional?
>>
>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> Lewis,
>>>
>>> Changing the merge to * returns a similar response:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>
>>> And yes, your assumption was correct - it's a different segment directory
>>> each loop.
>>>
>>> Many thanks,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean,
>>>>
>>>> Without discussing any of your configuration properties can you please
>>>> try
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>
>>>> paying attention to the wildcard /* in -dir
>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>
>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>> times, you are not recursively generating, fetching, parsing and
>>>> updating the WebDB with
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>
>>>>>
>>>>> Firstly I have a seed URL XML document here:
>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>> within it.
>>>>>
>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>
>>>>> # allow urls in ukcigarforums.com domain
>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>> # deny anything else
>>>>> -.
>>>>>
>>>>>
>>>>> Here's the procedure:
>>>>>
>>>>>
>>>>> 1) INJECT:
>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/seed/
>>>>>
>>>>> 2) GENERATE:
>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>>> 26
>>>>>
>>>>> 3) FETCH:
>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 4) PARSE:
>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 5) UPDATE DB:
>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>
>>>>>
>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>
>>>>>
>>>>> Interestingly, this prints out:
>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>> crawl_parse parse_data parse_text"
>>>>>
>>>>> MERGEDsegments segment directory then has just two directories, instead
>>>>> of
>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>> crawl_fetch
>>>>>
>>>>> (when then delete from the segments directory and copy the
>>>>> MERGEDsegments
>>>>> results into it)
>>>>>
>>>>>
>>>>> Lastly we run invert links after merge segments:
>>>>>
>>>>> 7) INVERT LINKS:
>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>> -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>
>>>>> Which produces:
>>>>>
>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>> does
>>>>> not
>>>>> exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>
>>>>>
>>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Good spot because all of that was meant to be removed! No, I'm afraid
that's just a copy/paste problem.
Dean
On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
> Ok then,
>
> How about your generate command:
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
> when everything else being utilised within the crawl cycle points to
> an entirely different <segment_dirs> path which is
> /opt/nutch_1_4/data/crawl/segments/segment_date
>
> Was this intentional?
>
> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com> wrote:
>> Lewis,
>>
>> Changing the merge to * returns a similar response:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>
>> And yes, your assumption was correct - it's a different segment directory
>> each loop.
>>
>> Many thanks,
>>
>> Dean.
>>
>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>> Hi Dean,
>>>
>>> Without discussing any of your configuration properties can you please try
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>
>>> paying attention to the wildcard /* in -dir
>>> /opt/nutch_1_4/data/crawl/segments/*
>>>
>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>> times, you are not recursively generating, fetching, parsing and
>>> updating the WebDB with
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>> with every iteration of the g/f/p/updatedb cycle.
>>>
>>> Thanks
>>>
>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> No problem Lewis, I appreciate you looking into it.
>>>>
>>>>
>>>> Firstly I have a seed URL XML document here:
>>>> http://www.ukcigarforums.com/injectlist.xml
>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>> within it.
>>>>
>>>> Nutch's regex-urlfilter.txt contains this:
>>>>
>>>> # allow urls in ukcigarforums.com domain
>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>> # deny anything else
>>>> -.
>>>>
>>>>
>>>> Here's the procedure:
>>>>
>>>>
>>>> 1) INJECT:
>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/seed/
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>> 26
>>>>
>>>> 3) FETCH:
>>>> /opt/nutch_1_4/bin/nutch fetch
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 4) PARSE:
>>>> /opt/nutch_1_4/bin/nutch parse
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 5) UPDATE DB:
>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>
>>>>
>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>
>>>>
>>>> Interestingly, this prints out:
>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>> crawl_parse parse_data parse_text"
>>>>
>>>> MERGEDsegments segment directory then has just two directories, instead
>>>> of
>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>> crawl_fetch
>>>>
>>>> (when then delete from the segments directory and copy the MERGEDsegments
>>>> results into it)
>>>>
>>>>
>>>> Lastly we run invert links after merge segments:
>>>>
>>>> 7) INVERT LINKS:
>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>> -dir
>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>
>>>> Which produces:
>>>>
>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>>> not
>>>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>
>>>>
>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Ok then,
How about your generate command:
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
when everything else being utilised within the crawl cycle points to
an entirely different <segment_dirs> path which is
/opt/nutch_1_4/data/crawl/segments/segment_date
Was this intentional?
On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <de...@semantico.com> wrote:
> Lewis,
>
> Changing the merge to * returns a similar response:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>
> And yes, your assumption was correct - it's a different segment directory
> each loop.
>
> Many thanks,
>
> Dean.
>
> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>
>> Hi Dean,
>>
>> Without discussing any of your configuration properties can you please try
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs
>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>
>> paying attention to the wildcard /* in -dir
>> /opt/nutch_1_4/data/crawl/segments/*
>>
>> Also presumably, when you mention you repeat steps 2-5 another 4
>> times, you are not recursively generating, fetching, parsing and
>> updating the WebDB with
>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>> with every iteration of the g/f/p/updatedb cycle.
>>
>> Thanks
>>
>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> No problem Lewis, I appreciate you looking into it.
>>>
>>>
>>> Firstly I have a seed URL XML document here:
>>> http://www.ukcigarforums.com/injectlist.xml
>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>> within it.
>>>
>>> Nutch's regex-urlfilter.txt contains this:
>>>
>>> # allow urls in ukcigarforums.com domain
>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>> # deny anything else
>>> -.
>>>
>>>
>>> Here's the procedure:
>>>
>>>
>>> 1) INJECT:
>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/seed/
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>> 26
>>>
>>> 3) FETCH:
>>> /opt/nutch_1_4/bin/nutch fetch
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 4) PARSE:
>>> /opt/nutch_1_4/bin/nutch parse
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 5) UPDATE DB:
>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>
>>>
>>> Repeat steps 2 to 5 another 4 times, then:
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>
>>>
>>> Interestingly, this prints out:
>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>> crawl_parse parse_data parse_text"
>>>
>>> MERGEDsegments segment directory then has just two directories, instead
>>> of
>>> all of those listed in the last output, i.e. just: crawl_generate and
>>> crawl_fetch
>>>
>>> (when then delete from the segments directory and copy the MERGEDsegments
>>> results into it)
>>>
>>>
>>> Lastly we run invert links after merge segments:
>>>
>>> 7) INVERT LINKS:
>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>> -dir
>>> /opt/nutch_1_4/data/crawl/segments/
>>>
>>> Which produces:
>>>
>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>> not
>>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>
>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Lewis,
Changing the merge to * returns a similar response:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
And yes, your assumption was correct - it's a different segment
directory each loop.
Many thanks,
Dean.
On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> Without discussing any of your configuration properties can you please try
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs
> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>
> paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*
>
> Also presumably, when you mention you repeat steps 2-5 another 4
> times, you are not recursively generating, fetching, parsing and
> updating the WebDB with
> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
> with every iteration of the g/f/p/updatedb cycle.
>
> Thanks
>
> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com> wrote:
>> No problem Lewis, I appreciate you looking into it.
>>
>>
>> Firstly I have a seed URL XML document here:
>> http://www.ukcigarforums.com/injectlist.xml
>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>> within it.
>>
>> Nutch's regex-urlfilter.txt contains this:
>>
>> # allow urls in ukcigarforums.com domain
>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>> # deny anything else
>> -.
>>
>>
>> Here's the procedure:
>>
>>
>> 1) INJECT:
>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/seed/
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> 3) FETCH:
>> /opt/nutch_1_4/bin/nutch fetch
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 4) PARSE:
>> /opt/nutch_1_4/bin/nutch parse
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 5) UPDATE DB:
>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>
>>
>> Repeat steps 2 to 5 another 4 times, then:
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>
>>
>> Interestingly, this prints out:
>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>> crawl_parse parse_data parse_text"
>>
>> MERGEDsegments segment directory then has just two directories, instead of
>> all of those listed in the last output, i.e. just: crawl_generate and
>> crawl_fetch
>>
>> (when then delete from the segments directory and copy the MERGEDsegments
>> results into it)
>>
>>
>> Lastly we run invert links after merge segments:
>>
>> 7) INVERT LINKS:
>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
>> /opt/nutch_1_4/data/crawl/segments/
>>
>> Which produces:
>>
>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>
>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
Without discussing any of your configuration properties can you please try
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/* -filter -normalize
paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*
Also presumably, when you mention you repeat steps 2-5 another 4
times, you are not recursively generating, fetching, parsing and
updating the WebDB with
/opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
with every iteration of the g/f/p/updatedb cycle.
Thanks
On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <de...@semantico.com> wrote:
> No problem Lewis, I appreciate you looking into it.
>
>
> Firstly I have a seed URL XML document here:
> http://www.ukcigarforums.com/injectlist.xml
> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
> within it.
>
> Nutch's regex-urlfilter.txt contains this:
>
> # allow urls in ukcigarforums.com domain
> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
> # deny anything else
> -.
>
>
> Here's the procedure:
>
>
> 1) INJECT:
> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/seed/
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> 3) FETCH:
> /opt/nutch_1_4/bin/nutch fetch
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 4) PARSE:
> /opt/nutch_1_4/bin/nutch parse
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 5) UPDATE DB:
> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>
>
> Repeat steps 2 to 5 another 4 times, then:
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>
>
> Interestingly, this prints out:
> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
> crawl_parse parse_data parse_text"
>
> MERGEDsegments segment directory then has just two directories, instead of
> all of those listed in the last output, i.e. just: crawl_generate and
> crawl_fetch
>
> (when then delete from the segments directory and copy the MERGEDsegments
> results into it)
>
>
> Lastly we run invert links after merge segments:
>
> 7) INVERT LINKS:
> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
> /opt/nutch_1_4/data/crawl/segments/
>
> Which produces:
>
> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No problem Lewis, I appreciate you looking into it.
Firstly I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
This basically has 'http://www.ukcigarforums.com/content.php' as a URL
within it.
Nutch's regex-urlfilter.txt contains this:
# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.
Here's the procedure:
1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
Repeat steps 2 to 5 another 4 times, then:
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/ -filter -normalize
Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"
The MERGEDsegments directory then contains just two subdirectories,
crawl_generate and crawl_fetch, instead of all of those listed in the
output above.
(we then delete the old segments from the segments directory and copy the
MERGEDsegments results into it)
Lastly we run invert links after merge segments:
7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
-dir /opt/nutch_1_4/data/crawl/segments/
Which produces:
"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
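The repeated steps 2 to 5 above can be sketched as a loop. This is a hypothetical dry-run script: the `run` wrapper only echoes each command, and the newest-segment lookup via `ls -t` is an assumption about how the timestamped segment directory would be picked up in a real run.

```shell
#!/bin/sh
# Dry-run sketch of the inject + 5x(generate/fetch/parse/updatedb) cycle.
# 'run' only echoes each command; make its body execute "$@" for a real crawl.
NUTCH=/opt/nutch_1_4/bin/nutch
CRAWLDB=/opt/nutch_1_4/data/crawl/crawldb
SEGMENTS=/opt/nutch_1_4/data/crawl/segments

run() { echo "$@"; }

run "$NUTCH" inject "$CRAWLDB" /opt/nutch_1_4/data/seed/

for i in 1 2 3 4 5; do
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 10000 -adddays 26
  # In a real run, generate creates a new timestamped segment; pick the newest.
  SEGMENT="$SEGMENTS/$(ls -t "$SEGMENTS" 2>/dev/null | head -n 1)"
  run "$NUTCH" fetch "$SEGMENT" -threads 15
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT" -normalize -filter
done
```

Scripting the loop this way also guards against the mismatch Lewis spotted, since the crawldb and segments paths are defined once rather than retyped per step.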
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Can you please post your script, or the commands (and
parameters) you are passing... I suspect there may be something
lurking which we could fix now, e.g. differences between the 1.0/1.3
commands and the current 1.4.
If not then you may have flagged up something which requires some TLC.
Thanks
On Fri, Jan 6, 2012 at 12:14 PM, Dean Pullen <de...@semantico.com> wrote:
> I've also tried nutch v1.3 with the same outcome (i.e. parse_data directory
> is not found).
>
>
>
> On 06/01/2012 10:42, Dean Pullen wrote:
>>
>> I'd like to reiterate that this all works in v1...
>>
>> Dean
>>
>> On 06/01/2012 10:04, Dean Pullen wrote:
>>>
>>> Lewis,
>>>
>>> Many thanks for your reply.
>>>
>>> I've separated the parsing from the fetching, and although each segment -
>>> we run the crawl 5 times - has the parse_data directory after parsing
>>> (observed via pausing the process), the mergesegs command does not reproduce
>>> the parse_data directory meaning invertlinks fails with the same parse_data
>>> not found error.
>>>
>>> The merged segments directory simply has the crawl_generate and
>>> crawl_fetch directories, not any of the others you can see in the other
>>> segments directories.
>>>
>>> Regards,
>>>
>>> Dean.
>>>
>>>
>>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>>
>>>> Hi Dean,
>>>>
>>>> Depending on the size of the segments your fetching, in most cases I
>>>> would advise you to separate out fetching and parsing into individual
>>>> steps. This becomes self explanatory as your segments increase in size
>>>> and the possibility of something going wrong with the fetching and
>>>> parsing when done together. This looks to be a segments which when
>>>> being fetched has experienced problems during parsing, therefore no
>>>> parse_data was produced.
>>>>
>>>> Can you please try a test fetch (with parsing boolean set to false) on
>>>> a sample segment then an individual parse and report back to us with
>>>> this one please.
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>>>> invertlinks.
>>>>>
>>>>> Error:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>>>> not
>>>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>> at
>>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> I notice that the parse_data directories are produced after a fetch
>>>>> (with
>>>>> fetcher.parse set to true), but after the merge the parse_data
>>>>> directory
>>>>> doesn't exist.
>>>>>
>>>>> What behaviour has changed since 1.0 and does anyone have a solution
>>>>> for the
>>>>> above?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Dean.
>>>>
>>>>
>>>>
>>>> --
>>>> Lewis
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I've also tried nutch v1.3 with the same outcome (i.e. parse_data
directory is not found).
On 06/01/2012 10:42, Dean Pullen wrote:
> I'd like to reiterate that this all works in v1...
>
> Dean
>
> On 06/01/2012 10:04, Dean Pullen wrote:
>> Lewis,
>>
>> Many thanks for your reply.
>>
>> I've separated the parsing from the fetching, and although each
>> segment - we run the crawl 5 times - has the parse_data directory
>> after parsing (observed via pausing the process), the mergesegs
>> command does not reproduce the parse_data directory meaning
>> invertlinks fails with the same parse_data not found error.
>>
>> The merged segments directory simply has the crawl_generate and
>> crawl_fetch directories, not any of the others you can see in the
>> other segments directories.
>>
>> Regards,
>>
>> Dean.
>>
>>
>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>
>>> Hi Dean,
>>>
>>> Depending on the size of the segments your fetching, in most cases I
>>> would advise you to separate out fetching and parsing into individual
>>> steps. This becomes self explanatory as your segments increase in size
>>> and the possibility of something going wrong with the fetching and
>>> parsing when done together. This looks to be a segments which when
>>> being fetched has experienced problems during parsing, therefore no
>>> parse_data was produced.
>>>
>>> Can you please try a test fetch (with parsing boolean set to false) on
>>> a sample segment then an individual parse and report back to us with
>>> this one please.
>>>
>>> Thanks
>>>
>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> Hi all,
>>>>
>>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>>> invertlinks.
>>>>
>>>> Error:
>>>>
>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>> does not
>>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>> at
>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>
>>>> I notice that the parse_data directories are produced after a fetch
>>>> (with
>>>> fetcher.parse set to true), but after the merge the parse_data
>>>> directory
>>>> doesn't exist.
>>>>
>>>> What behaviour has changed since 1.0 and does anyone have a
>>>> solution for the
>>>> above?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Dean.
>>>
>>>
>>> --
>>> Lewis
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I'd like to reiterate that this all works in v1...
Dean
On 06/01/2012 10:04, Dean Pullen wrote:
> Lewis,
>
> Many thanks for your reply.
>
> I've separated the parsing from the fetching, and although each segment - we run the crawl 5 times - has the parse_data directory after parsing (observed via pausing the process), the mergesegs command does not reproduce the parse_data directory meaning invertlinks fails with the same parse_data not found error.
>
> The merged segments directory simply has the crawl_generate and crawl_fetch directories, not any of the others you can see in the other segments directories.
>
> Regards,
>
> Dean.
>
>
> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>
>> Hi Dean,
>>
>> Depending on the size of the segments your fetching, in most cases I
>> would advise you to separate out fetching and parsing into individual
>> steps. This becomes self explanatory as your segments increase in size
>> and the possibility of something going wrong with the fetching and
>> parsing when done together. This looks to be a segments which when
>> being fetched has experienced problems during parsing, therefore no
>> parse_data was produced.
>>
>> Can you please try a test fetch (with parsing boolean set to false) on
>> a sample segment then an individual parse and report back to us with
>> this one please.
>>
>> Thanks
>>
>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen<de...@semantico.com> wrote:
>>> Hi all,
>>>
>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>> invertlinks.
>>>
>>> Error:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>> at
>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at
>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at
>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> I notice that the parse_data directories are produced after a fetch (with
>>> fetcher.parse set to true), but after the merge the parse_data directory
>>> doesn't exist.
>>>
>>> What behaviour has changed since 1.0 and does anyone have a solution for the
>>> above?
>>>
>>> Thanks in advance,
>>>
>>> Dean.
>>
>>
>> --
>> Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Lewis,
Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment (we run the crawl 5 times) has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same parse_data-not-found error.
The merged segments directory simply has the crawl_generate and crawl_fetch directories, not any of the others you can see in the other segments directories.
Regards,
Dean.
On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> Depending on the size of the segments your fetching, in most cases I
> would advise you to separate out fetching and parsing into individual
> steps. This becomes self explanatory as your segments increase in size
> and the possibility of something going wrong with the fetching and
> parsing when done together. This looks to be a segments which when
> being fetched has experienced problems during parsing, therefore no
> parse_data was produced.
>
> Can you please try a test fetch (with parsing boolean set to false) on
> a sample segment then an individual parse and report back to us with
> this one please.
>
> Thanks
>
> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <de...@semantico.com> wrote:
>> Hi all,
>>
>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>> invertlinks.
>>
>> Error:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>> at
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> I notice that the parse_data directories are produced after a fetch (with
>> fetcher.parse set to true), but after the merge the parse_data directory
>> doesn't exist.
>>
>> What behaviour has changed since 1.0 and does anyone have a solution for the
>> above?
>>
>> Thanks in advance,
>>
>> Dean.
>
>
>
> --
> Lewis
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
Depending on the size of the segments you're fetching, in most cases I
would advise you to separate fetching and parsing into individual
steps. This becomes increasingly important as your segments grow in
size, and with them the chance of something going wrong when fetching
and parsing are done together. This looks to be a segment which
experienced problems during parsing while being fetched, therefore no
parse_data was produced.
Can you please try a test fetch (with the parsing boolean set to
false) on a sample segment, then an individual parse, and report back
to us with the result.
Thanks
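For reference, a decoupled cycle along the lines described above might look like the sketch below. The paths, the -topN value, and the segment selection are illustrative, and it assumes fetcher.parse has been set to false in conf/nutch-site.xml; it is a sketch of the Nutch 1.4 command sequence, not a drop-in script:

```shell
# Generate a new segment from the crawldb (paths are illustrative).
bin/nutch generate crawl/crawldb crawl/segments -topN 100

# Pick up the newest segment directory.
SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)

# Fetch only: with fetcher.parse=false this writes crawl_fetch and content.
bin/nutch fetch "$SEGMENT"

# Parse as a separate step: this is what produces parse_data and parse_text.
bin/nutch parse "$SEGMENT"

# invertlinks should now find parse_data in every segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```

If the separate parse step fails, its error output pinpoints the parsing problem that a combined fetch-and-parse run can silently swallow.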
On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <de...@semantico.com> wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch (with
> fetcher.parse set to true), but after the merge the parse_data directory
> doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for the
> above?
>
> Thanks in advance,
>
> Dean.
--
Lewis
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
There is no zip. Anyway, I just did three fetch and parse cycles of
nutch.apache.org with trunk. Trunk has no changes concerning segments etc.
with regard to 1.4. I injected nutch.apache.org and then did two fetches of
-topN 4 pages, so I got 9 pages in three segments. I also configured it to
stay within the domain.
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 28
retry 0: 28
min score: 0.0010
avg score: 0.080714285
max score: 1.588
status 1 (db_unfetched): 19
status 2 (db_fetched): 9
CrawlDb statistics: done
crawl/segments/20120111122321/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_text
crawl/segments/20120111122438/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:24 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_text
crawl/segments/20120111122539/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:26 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_text
Let's merge the three segments into one:
$ bin/nutch mergesegs merged_segment -dir crawl/segments/
Merging 3 segments to merged_segment/20120111122826
SegmentMerger: adding file:/PATH/crawl/segments/20120111122539
SegmentMerger: adding file:/PATH/crawl/segments/20120111122438
SegmentMerger: adding file:/PATH/crawl/segments/20120111122321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch
crawl_parse parse_data parse_text
... it takes a while but finishes. Then I've got this:
$ ls merged_segment/20120111122826/
content crawl_fetch crawl_generate crawl_parse parse_data parse_text
I don't see the problem, but this should reproduce yours as your steps
are not really different from mine. Is it still the parse_data directory that
is missing?
Why are you merging anyway? It is not mandatory at all.
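As a sanity check before (or instead of) merging, a small shell loop can confirm that every segment actually contains all six subdirectories; a segment fetched with fetcher.parse=true whose parsing failed will show up with parse_data missing. This is an illustrative helper, not part of Nutch, and the crawl/segments path is an assumption:

```shell
# List any missing subdirectories for each segment under crawl/segments.
# A complete segment has all six; an incomplete one will trip up
# mergesegs and invertlinks.
for seg in crawl/segments/*/; do
  for sub in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "MISSING: $seg$sub"
  done
done
```

Running this right after each fetch/parse cycle narrows down whether parse_data was never written or was lost during the merge.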
On Wednesday 11 January 2012 12:09:57 Dean Pullen wrote:
> A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
> thing.
>
> I've zipped up the nutch/hadoop dir with all config etc, would either of
> you (Markus/Lewis) care to look at it?
>
> Any help at this stage would be immensely appreciated.
>
> Regards,
>
> Dean.
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Markus,
I didn't include the zip, I was just saying I have it if you would like
to see/use it! Shall I send?
Can you zip up and send to me what you've just done? Presumably it must
be a config thing?!
I know mergesegs isn't mandatory, but since I believed there was a problem
with it I've been trying to track the issue down for its own sake...
Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I ran the merge locally only. I've never merged on a Hadoop cluster since we
don't need it there.
On Wednesday 11 January 2012 12:21:20 Dean Pullen wrote:
> For further reference, below is the Hadoop job task log for the
> mergesegs command.
> You'll see that the parse_data merge, among others, is performed.
>
>
> Completed Tasks
>
> Task Complete Status Start Time Finish Time Errors
> Counters
> task_201201111048_0031_m_000000 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259
> 11-Jan-2012 11:16:22 11-Jan-2012 11:16:25 (3sec)
> 9
> task_201201111048_0031_m_000001 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234
> 11-Jan-2012 11:16:22 11-Jan-2012 11:16:25 (3sec)
> 9
> task_201201111048_0031_m_000002 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129
> 11-Jan-2012 11:16:25 11-Jan-2012 11:16:28 (3sec)
> 9
> task_201201111048_0031_m_000003 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129
> 11-Jan-2012 11:16:25 11-Jan-2012 11:16:28 (3sec)
> 9
> task_201201111048_0031_m_000004 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:31 (3sec)
> 9
> task_201201111048_0031_m_000005 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:31 (3sec)
>
>
>
>
> And the parse_data job itself:
>
> attempt_201201111048_0031_m_000004_0
> /default-rack/dhcp-192-168-4-26.semantico.net SUCCEEDED 100.00%
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:30 (1sec)
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
For further reference, below is the Hadoop job task log for the
mergesegs command.
You'll see that the parse_data merge, among others, is performed.
Completed Tasks
Task Complete Status Start Time Finish Time Errors
Counters
task_201201111048_0031_m_000000 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259
11-Jan-2012 11:16:22
11-Jan-2012 11:16:25 (3sec)
9
task_201201111048_0031_m_000001 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234
11-Jan-2012 11:16:22
11-Jan-2012 11:16:25 (3sec)
9
task_201201111048_0031_m_000002 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129
11-Jan-2012 11:16:25
11-Jan-2012 11:16:28 (3sec)
9
task_201201111048_0031_m_000003 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129
11-Jan-2012 11:16:25
11-Jan-2012 11:16:28 (3sec)
9
task_201201111048_0031_m_000004 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128
11-Jan-2012 11:16:28
11-Jan-2012 11:16:31 (3sec)
9
task_201201111048_0031_m_000005 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128
11-Jan-2012 11:16:28
11-Jan-2012 11:16:31 (3sec)
And the parse_data job itself:
attempt_201201111048_0031_m_000004_0
/default-rack/dhcp-192-168-4-26.semantico.net SUCCEEDED 100.00%
11-Jan-2012 11:16:28 11-Jan-2012 11:16:30 (1sec)
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
thing.
I've zipped up the nutch/hadoop dir with all config etc, would either of
you (Markus/Lewis) care to look at it?
Any help at this stage would be immensely appreciated.
Regards,
Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
Well, set it up to crawl nutch.apache.org only, fetch some cycles, and see what
happens. If merging goes bad then I can reproduce and perhaps fix it.
If not, you may want to start debugging the thing step by step.
On Tuesday 10 January 2012 18:06:34 Dean Pullen wrote:
> Yes, this is about the parse_data directory disappearing after a merge.
>
> I've used a clean Nutch 1.4 multiple times; I've not yet used an example
> crawl though.
>
> Anything specific you recommend?
>
> Dean.
>
> On 10/01/2012 16:59, Markus Jelsma wrote:
> > I haven't followed the entire thread, but is this about the parse_data
> > directory disappearing after a merge? We have no issues with merges on
> > small crawls.
> >
> > Do you still store content despite the parsing fetcher? Can you reproduce
> > this on a clean Nutch 1.4 build with an example crawl?
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Yes, this is about the parse_data directory disappearing after a merge.
I've used a clean Nutch 1.4 multiple times; I've not yet used an example
crawl though.
Anything specific you recommend?
Dean.
On 10/01/2012 16:59, Markus Jelsma wrote:
> I haven't followed the entire thread, but is this about the parse_data
> directory disappearing after a merge? We have no issues with merges on small
> crawls.
>
> Do you still store content despite the parsing fetcher? Can you reproduce this
> on a clean Nutch 1.4 build with an example crawl?
>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
The disk errors were solved by upgrading Hadoop to 0.20.203; they no
longer appear.
Dean.
On 10/01/2012 17:01, Markus Jelsma wrote:
> I might want to ask about your Hadoop temp dir since you seem to have disk
> errors. Have you set it?
>
> On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
>> I haven't followed the entire thread, but is this about the parse_data
>> directory disappearing after a merge? We have no issues with merges on small
>> crawls.
>>
>> Do you still store content despite the parsing fetcher? Can you reproduce
>> this on a clean Nutch 1.4 build with an example crawl?
>>
>> On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
>>> Hi all,
>>>
>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>> invertlinks.
>>>
>>> Error:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>> not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>
>>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> I notice that the parse_data directories are produced after a fetch
>>> (with fetcher.parse set to true), but after the merge the parse_data
>>> directory doesn't exist.
>>>
>>> What behaviour has changed since 1.0 and does anyone have a solution for
>>> the above?
>>>
>>> Thanks in advance,
>>>
>>> Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I might want to ask about your Hadoop temp dir since you seem to have disk
errors. Have you set it?
On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
> I haven't followed the entire thread, but is this about the parse_data
> directory disappearing after a merge? We have no issues with merges on small
> crawls.
>
> Do you still store content despite the parsing fetcher? Can you reproduce
> this on a clean Nutch 1.4 build with an example crawl?
>
> On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
> > Hi all,
> >
> > I'm upgrading from nutch 1 to 1.4 and am having problems running
> > invertlinks.
> >
> > Error:
> >
> > LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
> > not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> >
> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> > at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
> >
> > I notice that the parse_data directories are produced after a fetch
> > (with fetcher.parse set to true), but after the merge the parse_data
> > directory doesn't exist.
> >
> > What behaviour has changed since 1.0 and does anyone have a solution for
> > the above?
> >
> > Thanks in advance,
> >
> > Dean.
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I haven't followed the entire thread, but is this about the parse_data
directory disappearing after a merge? We have no issues with merges on small
crawls.
Do you still store content despite the parsing fetcher? Can you reproduce this
on a clean Nutch 1.4 build with an example crawl?
On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
> not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch
> (with fetcher.parse set to true), but after the merge the parse_data
> directory doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for
> the above?
>
> Thanks in advance,
>
> Dean.
--
Markus Jelsma - CTO - Openindex