Posted to user@nutch.apache.org by Dean Pullen <de...@semantico.com> on 2012/01/05 18:28:52 UTC
parse data directory not found after merge
Hi all,
I'm upgrading from nutch 1 to 1.4 and am having problems running
invertlinks.
Error:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
I notice that the parse_data directories are produced after a fetch
(with fetcher.parse set to true), but after the merge the parse_data
directory doesn't exist.
What behaviour has changed since 1.0 and does anyone have a solution for
the above?
Thanks in advance,
Dean.
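[Editor's note: for anyone hitting the same InvalidInputException, it can help to confirm which segments actually contain a parse_data subdirectory before running invertlinks. A minimal sketch; the helper name and paths are illustrative, not from Nutch itself:]

```shell
# check_segments: list every segment under the given directory that lacks
# the parse_data subdirectory invertlinks expects. Helper name and paths
# are illustrative.
check_segments() {
  for seg in "$1"/*/; do
    [ -d "$seg" ] || continue            # glob matched nothing
    if [ ! -d "${seg}parse_data" ]; then
      echo "missing parse_data in: $seg"
    fi
  done
}

# Example invocation (adjust the path to your crawl directory):
check_segments /opt/nutch/data/crawl/segments
```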
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Pretty sure the same thing is happening with Hadoop 1.0...
On 10/01/2012 14:11, Dean Pullen wrote:
> Upgraded to Hadoop 0.20.205.0 and the DiskErrorException disappears,
> but the same result occurs, i.e. only the crawl_fetch and crawl_generate
> directories get merged; no parse_data directory exists.
>
> Arghhhhhhhhh.
>
>
> Dean.
>
> On 10/01/2012 11:33, Dean Pullen wrote:
>> I'm running in local mode (I believe) and using Hadoop 0.20.2, as
>> this is the library version shipped with Nutch 1.4.
>>
>> Dean.
>>
>> On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
>>> How are you running Nutch, in local or deploy mode? Which Hadoop
>>> version are you using, 0.20.2? This appears to be an open issue with
>>> that version [1].
>>>
>>> Also please have a look here [2] for a similar frustrating situation.
>>>
>>> [1] https://issues.apache.org/jira/browse/HADOOP-6958
>>> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>>>
>>>
>>> On Mon, Jan 9, 2012 at 4:14 PM, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> This is interesting, and something I've only just noticed in the logs:
>>>>
>>>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>>>>
>>>> in any of the configured local directories
>>>>
>>>> This is during the mergesegs job (and previous jobs).....but I'm
>>>> not sure
>>>> what it means or if it's actually a problem.
>>>>
>>>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>>>
>>>> It suggests that the map part of the hadoop job has not produced an
>>>> output
>>>> file, or it's looking in the wrong place?
>>>>
>>>> Dean
>>>
>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Upgraded to Hadoop 0.20.205.0 and the DiskErrorException disappears,
but the same result occurs, i.e. only the crawl_fetch and crawl_generate
directories get merged; no parse_data directory exists.
Arghhhhhhhhh.
Dean.
On 10/01/2012 11:33, Dean Pullen wrote:
> I'm running in local mode (I believe) and using Hadoop 0.20.2, as this
> is the library version shipped with Nutch 1.4.
>
> Dean.
>
> On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
>> How are you running Nutch, in local or deploy mode? Which Hadoop
>> version are you using, 0.20.2? This appears to be an open issue with
>> that version [1].
>>
>> Also please have a look here [2] for a similar frustrating situation.
>>
>> [1] https://issues.apache.org/jira/browse/HADOOP-6958
>> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>>
>>
>> On Mon, Jan 9, 2012 at 4:14 PM, Dean
>> Pullen<de...@semantico.com> wrote:
>>> This is interesting, and something I've only just noticed in the logs:
>>>
>>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>>>
>>> in any of the configured local directories
>>>
>>> This is during the mergesegs job (and previous jobs).....but I'm not
>>> sure
>>> what it means or if it's actually a problem.
>>>
>>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>>
>>> It suggests that the map part of the hadoop job has not produced an
>>> output
>>> file, or it's looking in the wrong place?
>>>
>>> Dean
>>
>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I'm running in local mode (I believe) and using Hadoop 0.20.2, as this
is the library version shipped with Nutch 1.4.
Dean.
On 09/01/2012 16:41, Lewis John Mcgibbney wrote:
> How are you running Nutch, in local or deploy mode? Which Hadoop
> version are you using, 0.20.2? This appears to be an open issue with
> that version [1].
>
> Also please have a look here [2] for a similar frustrating situation.
>
> [1] https://issues.apache.org/jira/browse/HADOOP-6958
> [2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
>
> On Mon, Jan 9, 2012 at 4:14 PM, Dean Pullen<de...@semantico.com> wrote:
>> This is interesting, and something I've only just noticed in the logs:
>>
>> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
>> in any of the configured local directories
>>
>> This is during the mergesegs job (and previous jobs).....but I'm not sure
>> what it means or if it's actually a problem.
>>
>> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>>
>> It suggests that the map part of the hadoop job has not produced an output
>> file, or it's looking in the wrong place?
>>
>> Dean
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
How are you running Nutch, in local or deploy mode? Which Hadoop
version are you using, 0.20.2? This appears to be an open issue with
that version [1].
Also please have a look here [2] for a similar frustrating situation.
[1] https://issues.apache.org/jira/browse/HADOOP-6958
[2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html
On Mon, Jan 9, 2012 at 4:14 PM, Dean Pullen <de...@semantico.com> wrote:
> This is interesting, and something I've only just noticed in the logs:
>
> 2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
> in any of the configured local directories
>
> This is during the mergesegs job (and previous jobs).....but I'm not sure
> what it means or if it's actually a problem.
>
> mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
>
> It suggests that the map part of the hadoop job has not produced an output
> file, or it's looking in the wrong place?
>
> Dean
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
This is interesting, and something I've only just noticed in the logs:
2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out
in any of the configured local directories
This is during the mergesegs job (and previous jobs)... but I'm not
sure what it means or if it's actually a problem.
mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.
It suggests that the map part of the hadoop job has not produced an
output file, or it's looking in the wrong place?
Dean
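[Editor's note: in Hadoop releases of that era, mapred.local.dir is ordinarily set as a property in conf/mapred-site.xml (or in nutch-site.xml when running through bin/nutch in local mode). A minimal sketch, with the value taken from the message above; the file placement is an assumption about this particular setup:]

```xml
<!-- conf/mapred-site.xml (or nutch-site.xml for local runs) -->
<property>
  <name>mapred.local.dir</name>
  <value>/opt/nutch_1_4/data/local</value>
  <description>Local directory where MapReduce writes intermediate map
  output (e.g. file.out); it must exist and be writable by the process
  running the job.</description>
</property>
```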
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No, thank you for taking the time to look at it! I'm still on the case
but am hoping you'll find the problem.
Dean.
On 09/01/2012 14:24, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> I'll have a look into this later today if I get a chance. Anyone else
> experiencing problems using the mergesegs command or code?
>
> Thanks for persisting with this, Dean; hopefully we will get to the
> bottom of it soon.
>
> On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen<de...@semantico.com> wrote:
>> Looking through the code, I'm seeing
>> org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
>> crawl_fetch and crawl_generate.
>>
>> Prior to this org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
>> gets called for all components, i.e. crawl_generate crawl_fetch crawl_parse
>> parse_data parse_text
>>
>> I'm not quite sure what's going on in between these two calls...
>>
>> Dean.
>>
>>
>>
>> On 08/01/2012 22:51, Dean Pullen wrote:
>>> Where do we go from here? I can start looking/stepping through the
>>> mergesegs code, but I'm reluctant due to its probable complexity.
>>>
>>> Dean.
>>>
>>>
>>> On 08/01/2012 14:26, Dean Pullen wrote:
>>>> No Lewis, -linkdb was already being used for the solrindex command, so we
>>>> still have the same problem.
>>>>
>>>> Many thanks,
>>>>
>>>> Dean
>>>>
>>>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>>>> Hi Dean, is this sorted?
>>>>>
>>>>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> Sorry, you did mean on solrindex - which I already do...
>>>>>>
>>>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>>>
>>>>>> The -linkdb param isn't in the invertlinks docs
>>>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>>>> (However it is in the solrindex docs)
>>>>>>
>>>>>> Adding it makes no difference to invertlinks.
>>>>>>
>>>>>> I think the problem is definitely with mergesegs, as opposed to
>>>>> invertlinks etc.
>>>>>> Thanks again,
>>>>>>
>>>>>> Dean.
>>>>>>
>>>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>>>> parameter. This was implemented as not everyone wishes to create a
>>>>>> linkdb.
>>>>>>
>>>>>> Your invertlinks command should be passed as follows
>>>>>>
>>>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>>>> /path/to/segment/dirs
>>>>>> then
>>>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>>>
>>>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>>>> directory.
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> Only this:
>>>>>>
>>>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>>>> GenericOptionsParser
>>>>>> for parsing the arguments. Applications should implement Tool for the
>>>>> same.
>>>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>> where
>>>>>> applicable
>>>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>>>> 2012-01-06
>>>>>> 17:15:51
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>>>> true
>>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>>>> exist:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>> at
>>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>> at
>>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>> at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>>
>>>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
>>>>>> at
>>>>>> 2012-01-06 17:15:52
>>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>>> IndexerMapReduce:
>>>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>>> IndexerMapReduce:
>>>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
I'll have a look into this later today if I get a chance. Anyone else
experiencing problems using the mergesegs command or code?
Thanks for persisting with this, Dean; hopefully we will get to the
bottom of it soon.
On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen <de...@semantico.com> wrote:
> Looking through the code, I'm seeing
> org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
> crawl_fetch and crawl_generate.
>
> Prior to this org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
> gets called for all components, i.e. crawl_generate crawl_fetch crawl_parse
> parse_data parse_text
>
> I'm not quite sure what's going on in between these two calls...
>
> Dean.
>
>
>
> On 08/01/2012 22:51, Dean Pullen wrote:
>>
>> Where do we go from here? I can start looking/stepping through the
>> mergesegs code, but I'm reluctant due to its probable complexity.
>>
>> Dean.
>>
>>
>> On 08/01/2012 14:26, Dean Pullen wrote:
>>>
>>> No Lewis, -linkdb was already being used for the solrindex command, so we
>>> still have the same problem.
>>>
>>> Many thanks,
>>>
>>> Dean
>>>
>>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean, is this sorted?
>>>>
>>>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Sorry, you did mean on solrindex - which I already do...
>>>>>
>>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>>
>>>>> The -linkdb param isn't in the invertlinks docs
>>>>
>>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>>>
>>>>> (However it is in the solrindex docs)
>>>>>
>>>>> Adding it makes no difference to invertlinks.
>>>>>
>>>>> I think the problem is definitely with mergesegs, as opposed to
>>>>
>>>> invertlinks etc.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>>
>>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>>> parameter. This was implemented as not everyone wishes to create a
>>>>> linkdb.
>>>>>
>>>>> Your invertlinks command should be passed as follows
>>>>>
>>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>>> /path/to/segment/dirs
>>>>> then
>>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>>
>>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>>> directory.
>>>>>
>>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>>>
>>>> wrote:
>>>>>
>>>>> Only this:
>>>>>
>>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>>> GenericOptionsParser
>>>>> for parsing the arguments. Applications should implement Tool for the
>>>>
>>>> same.
>>>>>
>>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>
>>>> where
>>>>>
>>>>> applicable
>>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>>>
>>>> 2012-01-06
>>>>>
>>>>> 17:15:51
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>>> true
>>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>>> exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>
>>>>> at
>>>>>
>>>>
>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>
>>>>> at
>>>>
>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
>>>>> at
>>>>> 2012-01-06 17:15:52
>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>> IndexerMapReduce:
>>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>>> IndexerMapReduce:
>>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>>
>>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Looking through the code, I'm seeing
org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
crawl_fetch and crawl_generate.
Prior to this
org.apache.nutch.segment.SegmentMerger.getRecordWriter(...) gets called
for all components, i.e. crawl_generate crawl_fetch crawl_parse
parse_data parse_text
I'm not quite sure what's going on in between these two calls...
Dean.
On 08/01/2012 22:51, Dean Pullen wrote:
> Where do we go from here? I can start looking/stepping through the
> mergesegs code, but I'm reluctant due to its probable complexity.
>
> Dean.
>
>
> On 08/01/2012 14:26, Dean Pullen wrote:
>> No Lewis, -linkdb was already being used for the solrindex command, so
>> we still have the same problem.
>>
>> Many thanks,
>>
>> Dean
>>
>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>> Hi Dean, is this sorted?
>>>
>>> On Saturday, January 7, 2012, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> Sorry, you did mean on solrindex - which I already do...
>>>>
>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>
>>>> The -linkdb param isn't in the invertlinks docs
>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>> (However it is in the solrindex docs)
>>>>
>>>> Adding it makes no difference to invertlinks.
>>>>
>>>> I think the problem is definitely with mergesegs, as opposed to
>>> invertlinks etc.
>>>> Thanks again,
>>>>
>>>> Dean.
>>>>
>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>
>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>>> parameter. This was implemented as not everyone wishes to create a
>>>> linkdb.
>>>>
>>>> Your invertlinks command should be passed as follows
>>>>
>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>> /path/to/segment/dirs
>>>> then
>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>
>>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>>> exception will be thrown, as the linkdb is now treated as a segment
>>>> directory.
>>>>
>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> Only this:
>>>>
>>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>>> GenericOptionsParser
>>>> for parsing the arguments. Applications should implement Tool for the
>>> same.
>>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>> where
>>>> applicable
>>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>> 2012-01-06
>>>> 17:15:51
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>>> /opt/nutch_1_4/data/crawl/linkdb
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>>> true
>>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>>> exist:
>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>> at
>>>>
>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>
>>>> at
>>>>
>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>
>>>> at
>>>>
>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>
>>>> at
>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>
>>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>>> starting at
>>>> 2012-01-06 17:15:52
>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>> IndexerMapReduce:
>>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>>> IndexerMapReduce:
>>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.
Dean.
On 08/01/2012 14:26, Dean Pullen wrote:
> No Lewis, -linkdb was already being used for the solrindex command, so
> we still have the same problem.
>
> Many thanks,
>
> Dean
>
> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>> Hi Dean, is this sorted?
>>
>> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com>
>> wrote:
>>> Sorry, you did mean on solrindex - which I already do...
>>>
>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>
>>> The -linkdb param isn't in the invertlinks docs
>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>> (However it is in the solrindex docs)
>>>
>>> Adding it makes no difference to invertlinks.
>>>
>>> I think the problem is definitely with mergesegs, as opposed to
>> invertlinks etc.
>>> Thanks again,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>
>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>>> parameter. This was implemented as not everyone wishes to create a
>>> linkdb.
>>>
>>> Your invertlinks command should be passed as follows
>>>
>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>> /path/to/segment/dirs
>>> then
>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>> path/to/linkdb -dir path/to/segment/dirs
>>>
>>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>>> exception will be thrown, as the linkdb is now treated as a segment
>>> directory.
>>>
>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>> Only this:
>>>
>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for the
>> same.
>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>> where
>>> applicable
>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>> 2012-01-06
>>> 17:15:51
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>> /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>> true
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> at
>>>
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>
>>> at
>>>
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>
>>> at
>>>
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>
>>> at
>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>> starting at
>>> 2012-01-06 17:15:52
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No Lewis, -linkdb was already being used for the solrindex command, so we
still have the same problem.
Many thanks,
Dean
On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
> Hi Dean, is this sorted?
>
> On Saturday, January 7, 2012, Dean Pullen<de...@semantico.com> wrote:
>> Sorry, you did mean on solrindex - which I already do...
>>
>> On 07/01/2012 13:15, Dean Pullen wrote:
>>
>> The -linkdb param isn't in the invertlinks docs
> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>> (However it is in the solrindex docs)
>>
>> Adding it makes no difference to invertlinks.
>>
>> I think the problem is definitely with mergesegs, as opposed to
> invertlinks etc.
>> Thanks again,
>>
>> Dean.
>>
>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>
>> OK, so now I think we're at the bottom of it. If you wish to create a
>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>> parameter. This was implemented as not everyone wishes to create a
>> linkdb.
>>
>> Your invertlinks command should be passed as follows
>>
>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>> /path/to/segment/dirs
>> then
>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>> path/to/linkdb -dir path/to/segment/dirs
>>
>> If you are not passing the -linkdb path/to/linkdb explicitly, an
>> exception will be thrown, as the linkdb is now treated as a segment
>> directory.
>>
>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
> wrote:
>> Only this:
>>
>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the
> same.
>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
> where
>> applicable
>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
> 2012-01-06
>> 17:15:51
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>> /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> at
>>
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at
>>
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at
>>
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
>> 2012-01-06 17:15:52
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>
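[Editor's note: pulling the two commands Lewis quotes above into one place, a Nutch 1.4 invocation would look roughly like this; the Solr URL and all paths are placeholders, not values from this setup:]

```shell
# Placeholders throughout; adjust to your crawl layout and Solr instance.
bin/nutch invertlinks data/crawl/linkdb -dir data/crawl/segments
bin/nutch solrindex http://localhost:8983/solr data/crawl/crawldb \
    -linkdb data/crawl/linkdb -dir data/crawl/segments
```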
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean, is this sorted?
On Saturday, January 7, 2012, Dean Pullen <de...@semantico.com> wrote:
> Sorry, you did mean on solrindex - which I already do...
>
> On 07/01/2012 13:15, Dean Pullen wrote:
>
> The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
>
> (However it is in the solrindex docs)
>
> Adding it makes no difference to invertlinks.
>
> I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
>
> Thanks again,
>
> Dean.
>
> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>
> OK so now I think we're at the bottom of it. If you wish to create a
> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
> parameter. This was implemented as not everyone wishes to create a
> linkdb.
>
> Your invertlinks command should be passed as follows
>
> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
> /path/to/segment/dirs
> then
> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
> path/to/linkdb -dir path/to/segment/dirs
>
> If you do not pass -linkdb path/to/linkdb explicitly, an exception
> will be thrown, because the linkdb is now treated as a segment
> directory.
>
> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com>
wrote:
>
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the
same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
where
> applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
2012-01-06
> 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
> /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
> 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>
--
*Lewis*
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Sorry, you did mean on solrindex - which I already do...
On 07/01/2012 13:15, Dean Pullen wrote:
> The -linkdb param isn't in the invertlinks docs
> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>
> (However it is in the solrindex docs)
>
> Adding it makes no difference to invertlinks.
>
> I think the problem is definitely with mergesegs, as opposed to
> invertlinks etc.
>
> Thanks again,
>
> Dean.
>
> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>> OK so now I think we're at the bottom of it. If you wish to create a
>> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
>> parameter. This was implemented as not everyone wishes to create a
>> linkdb.
>>
>> Your invertlinks command should be passed as follows
>>
>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>> /path/to/segment/dirs
>> then
>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>> path/to/linkdb -dir path/to/segment/dirs
>>
>> If you do not pass -linkdb path/to/linkdb explicitly, an exception
>> will be thrown, because the linkdb is now treated as a segment
>> directory.
>>
>> On Fri, Jan 6, 2012 at 5:17 PM, Dean
>> Pullen<de...@semantico.com> wrote:
>>> Only this:
>>>
>>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for
>>> the same.
>>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java
>>> classes where
>>> applicable
>>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
>>> 2012-01-06
>>> 17:15:51
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>>> /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize:
>>> true
>>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer:
>>> starting at
>>> 2012-01-06 17:15:52
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduces:
>>> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
>>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
>>> Input path does not exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>> Input path does not exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
>>> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump:
>>> starting
>>> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
>>> /opt/nutch_1_4/data/crawl/crawldb/
>>> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use
>>> GenericOptionsParser
>>> for parsing the arguments. Applications should implement Tool for
>>> the same.
>>> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>>>
>>
>>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
(However it is in the solrindex docs)
Adding it makes no difference to invertlinks.
I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
Thanks again,
Dean.
On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
> OK so now I think we're at the bottom of it. If you wish to create a
> linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
> parameter. This was implemented as not everyone wishes to create a
> linkdb.
>
> Your invertlinks command should be passed as follows
>
> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
> /path/to/segment/dirs
> then
> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
> path/to/linkdb -dir path/to/segment/dirs
>
> If you do not pass -linkdb path/to/linkdb explicitly, an exception
> will be thrown, because the linkdb is now treated as a segment
> directory.
>
> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<de...@semantico.com> wrote:
>> Only this:
>>
>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the same.
>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06
>> 17:15:51
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
>> /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
>> 2012-01-06 17:15:52
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> linkdb: /opt/nutch_1_4/data/crawl/linkdb
>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces:
>> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
>> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
>> Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>> Input path does not exist:
>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
>> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
>> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
>> /opt/nutch_1_4/data/crawl/crawldb/
>> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
>> for parsing the arguments. Applications should implement Tool for the same.
>> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK so now I think we're at the bottom of it. If you wish to create a
linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
parameter. This was implemented as not everyone wishes to create a
linkdb.
Your invertlinks command should be passed as follows
bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
/path/to/segment/dirs
then
bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
path/to/linkdb -dir path/to/segment/dirs
If you do not pass -linkdb path/to/linkdb explicitly, an exception
will be thrown, because the linkdb is now treated as a segment
directory.
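[Editor's note: the two commands above can be combined into a dry-run sketch. The relative paths and the Solr URL below are illustrative assumptions, not values taken from this thread.]

```shell
# Dry-run sketch of the >= Nutch 1.4 sequence with an explicit -linkdb.
# NUTCH, CRAWL, and the Solr URL are assumed values for illustration;
# the commands are echoed rather than executed.
NUTCH=bin/nutch
CRAWL=data/crawl
echo "$NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments"
echo "$NUTCH solrindex http://localhost:8983/solr $CRAWL/crawldb -linkdb $CRAWL/linkdb -dir $CRAWL/segments"
```

Dropping `echo` runs the real tools; the key point is that `-linkdb` must appear explicitly in the solrindex invocation.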
On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <de...@semantico.com> wrote:
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06
> 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
> /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
> 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
> /opt/nutch_1_4/data/crawl/crawldb/
> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Only this:
2012-01-06 17:15:47,972 WARN mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at
2012-01-06 17:15:51
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
/opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting
at 2012-01-06 17:15:52
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce -
IndexerMapReduces: adding segment:
/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
Input path does not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
Input path does not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
/opt/nutch_1_4/data/crawl/crawldb/
2012-01-06 17:15:54,212 WARN mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Another thing which I have stupidly not asked yet, have you checked
your hadoop.log to see if there are any problems around the parse
phase?
It should begin
LOG.info("ParseSegment: starting at " + sdf.format(start));
LOG.info("ParseSegment: segment: " + segment);
...
if successful
...
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
...
if not then
...
LOG.warn("Error parsing: " etc
Any joy?
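[Editor's note: a quick way to follow the suggestion above is to grep hadoop.log for the parse-phase messages Lewis lists. The log path below is an assumption based on the install layout used elsewhere in this thread.]

```shell
# Hedged sketch: scan hadoop.log for parse-phase log lines.
# The default LOG path is an assumption; override with NUTCH_LOG.
LOG="${NUTCH_LOG:-/opt/nutch_1_4/logs/hadoop.log}"
if [ -f "$LOG" ]; then
  # show the last few ParseSegment start/success/failure messages
  grep -E 'ParseSegment|Parsed \(|Error parsing' "$LOG" | tail -n 20
else
  echo "log not found: $LOG"
fi
```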
On Fri, Jan 6, 2012 at 4:38 PM, Dean Pullen <de...@semantico.com> wrote:
> Two iterations do the same thing - the parse_data directory is missing.
>
> Interestingly, just doing the mergesegs on ONE crawl also removes the
> parse_data dir etc!
>
> Dean.
>
>
>
> On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
>>
>> How about merging segs after every subsequent iteration of the crawl
>> cycle... surely this is a problem with producing the specific
>> parse_data directory. If it doesn't work after two iterations then we
>> know that it is happening early on in the crawl cycle. Have you
>> manually checked that the directories exist after fetching and
>> parsing?
>>
>> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> Good spot because all of that was meant to be removed! No, I'm afraid
>>> that's
>>> just a copy/paste problem.
>>>
>>> Dean
>>>
>>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>>>
>>>> Ok then,
>>>>
>>>> How about your generate command:
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>> 26
>>>>
>>>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>>>> when everything else being utilised within the crawl cycle points to
>>>> an entirely different <segment_dirs> path which is
>>>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>>>
>>>> Was this intentional?
>>>>
>>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Lewis,
>>>>>
>>>>> Changing the merge to * returns a similar response:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>>>
>>>>> And yes, your assumption was correct - it's a different segment
>>>>> directory
>>>>> each loop.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> Hi Dean,
>>>>>>
>>>>>> Without discussing any of your configuration properties can you please
>>>>>> try
>>>>>>
>>>>>> 6) MERGE SEGMENTS:
>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>>>
>>>>>> paying attention to the wildcard /* in -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>>>
>>>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>>>> times, you are not recursively generating, fetching, parsing and
>>>>>> updating the WebDB with
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>>>
>>>>>>>
>>>>>>> Firstly I have a seed URL XML document here:
>>>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a
>>>>>>> URL
>>>>>>> within it.
>>>>>>>
>>>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>>>
>>>>>>> # allow urls in ukcigarforums.com domain
>>>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>>>> # deny anything else
>>>>>>> -.
>>>>>>>
>>>>>>>
>>>>>>> Here's the procedure:
>>>>>>>
>>>>>>>
>>>>>>> 1) INJECT:
>>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/seed/
>>>>>>>
>>>>>>> 2) GENERATE:
>>>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000
>>>>>>> -adddays
>>>>>>> 26
>>>>>>>
>>>>>>> 3) FETCH:
>>>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 4) PARSE:
>>>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 5) UPDATE DB:
>>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>>>
>>>>>>>
>>>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>>>
>>>>>>> 6) MERGE SEGMENTS:
>>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>>>
>>>>>>>
>>>>>>> Interestingly, this prints out:
>>>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>>>> crawl_parse parse_data parse_text"
>>>>>>>
>>>>>>> MERGEDsegments segment directory then has just two directories,
>>>>>>> instead
>>>>>>> of
>>>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>>>> crawl_fetch
>>>>>>>
>>>>>>> (we then delete the contents of the segments directory and copy
>>>>>>> the MERGEDsegments results into it)
>>>>>>>
>>>>>>>
>>>>>>> Lastly we run invert links after merge segments:
>>>>>>>
>>>>>>> 7) INVERT LINKS:
>>>>>>> /opt/nutch_1_4/bin/nutch invertlinks
>>>>>>> /opt/nutch_1_4/data/crawl/linkdb/
>>>>>>> -dir
>>>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>>>
>>>>>>> Which produces:
>>>>>>>
>>>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>>>> does
>>>>>>> not
>>>>>>> exist:
>>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>>>
>>>>>>>
>>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Two iterations do the same thing - the parse_data directory is missing.
Interestingly, just doing the mergesegs on ONE crawl also removes the
parse_data dir etc!
Dean.
On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
> How about merging segs after every subsequent iteration of the crawl
> cycle... surely this is a problem with producing the specific
> parse_data directory. If it doesn't work after two iterations then we
> know that it is happening early on in the crawl cycle. Have you
> manually checked that the directories exist after fetching and
> parsing?
>
> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<de...@semantico.com> wrote:
>> Good spot because all of that was meant to be removed! No, I'm afraid that's
>> just a copy/paste problem.
>>
>> Dean
>>
>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>> Ok then,
>>>
>>> How about your generate command:
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>
>>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>>> when everything else being utilised within the crawl cycle points to
>>> an entirely different <segment_dirs> path which is
>>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>>
>>> Was this intentional?
>>>
>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> Lewis,
>>>>
>>>> Changing the merge to * returns a similar response:
>>>>
>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>>
>>>> And yes, your assumption was correct - it's a different segment directory
>>>> each loop.
>>>>
>>>> Many thanks,
>>>>
>>>> Dean.
>>>>
>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>> Hi Dean,
>>>>>
>>>>> Without discussing any of your configuration properties can you please
>>>>> try
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>>
>>>>> paying attention to the wildcard /* in -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>>
>>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>>> times, you are not recursively generating, fetching, parsing and
>>>>> updating the WebDB with
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>>> wrote:
>>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>>
>>>>>>
>>>>>> Firstly I have a seed URL XML document here:
>>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>>> within it.
>>>>>>
>>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>>
>>>>>> # allow urls in ukcigarforums.com domain
>>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>>> # deny anything else
>>>>>> -.
>>>>>>
>>>>>>
>>>>>> Here's the procedure:
>>>>>>
>>>>>>
>>>>>> 1) INJECT:
>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/nutch_1_4/data/seed/
>>>>>>
>>>>>> 2) GENERATE:
>>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>>>> 26
>>>>>>
>>>>>> 3) FETCH:
>>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>
>>>>>> 4) PARSE:
>>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>
>>>>>> 5) UPDATE DB:
>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>>
>>>>>>
>>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>>
>>>>>> 6) MERGE SEGMENTS:
>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>>
>>>>>>
>>>>>> Interestingly, this prints out:
>>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>>> crawl_parse parse_data parse_text"
>>>>>>
>>>>>> MERGEDsegments segment directory then has just two directories, instead
>>>>>> of
>>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>>> crawl_fetch
>>>>>>
>>>>>> (we then delete the contents of the segments directory and copy
>>>>>> the MERGEDsegments results into it)
>>>>>>
>>>>>>
>>>>>> Lastly we run invert links after merge segments:
>>>>>>
>>>>>> 7) INVERT LINKS:
>>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>>> -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>>
>>>>>> Which produces:
>>>>>>
>>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>>> does
>>>>>> not
>>>>>> exist:
>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>>
>>>>>>
>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
How about merging segs after every subsequent iteration of the crawl
cycle... surely this is a problem with producing the specific
parse_data directory. If it doesn't work after two iterations then we
know that it is happening early on in the crawl cycle. Have you
manually checked that the directories exist after fetching and
parsing?
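[Editor's note: one way to do that manual check, sketched under the assumption of the segment layout used elsewhere in this thread. The SEGMENTS path is illustrative.]

```shell
# Hedged sketch: report any segment missing one of the subdirectories
# expected after a fetch + parse. SEGMENTS is an assumed default path;
# override it with the SEGMENTS environment variable.
SEGMENTS="${SEGMENTS:-/opt/nutch_1_4/data/crawl/segments}"
for seg in "$SEGMENTS"/*/; do
  [ -d "$seg" ] || continue   # skip when the glob matched nothing
  for sub in crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "missing: $seg$sub"
  done
done
```

An empty output means every segment has all five subdirectories; any `missing: .../parse_data` line pinpoints where the merge or parse step lost data.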
On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <de...@semantico.com> wrote:
> Good spot because all of that was meant to be removed! No, I'm afraid that's
> just a copy/paste problem.
>
> Dean
>
> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>
>> Ok then,
>>
>> How about your generate command:
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>> when everything else being utilised within the crawl cycle points to
>> an entirely different <segment_dirs> path which is
>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>
>> Was this intentional?
>>
>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> Lewis,
>>>
>>> Changing the merge to * returns a similar response:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>
>>> And yes, your assumption was correct - it's a different segment directory
>>> each loop.
>>>
>>> Many thanks,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean,
>>>>
>>>> Without discussing any of your configuration properties can you please
>>>> try
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>
>>>> paying attention to the wildcard /* in -dir
>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>
>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>> times, you are not recursively generating, fetching, parsing and
>>>> updating the WebDB with
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>
>>>>>
>>>>> Firstly I have a seed URL XML document here:
>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>> within it.
>>>>>
>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>
>>>>> # allow urls in ukcigarforums.com domain
>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>> # deny anything else
>>>>> -.
>>>>>
>>>>>
>>>>> Here's the procedure:
>>>>>
>>>>>
>>>>> 1) INJECT:
>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/seed/
>>>>>
>>>>> 2) GENERATE:
>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>>> 26
>>>>>
>>>>> 3) FETCH:
>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 4) PARSE:
>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 5) UPDATE DB:
>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>
>>>>>
>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>
>>>>>
>>>>> Interestingly, this prints out:
>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>> crawl_parse parse_data parse_text"
>>>>>
>>>>> MERGEDsegments segment directory then has just two directories, instead
>>>>> of
>>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>>> crawl_fetch
>>>>>
>>>>> (when then delete from the segments directory and copy the
>>>>> MERGEDsegments
>>>>> results into it)
>>>>>
>>>>>
>>>>> Lastly we run invert links after merge segments:
>>>>>
>>>>> 7) INVERT LINKS:
>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>> -dir
>>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>>
>>>>> Which produces:
>>>>>
>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>> does
>>>>> not
>>>>> exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>
>>>>>
>>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Good spot because all of that was meant to be removed! No, I'm afraid
that's just a copy/paste problem.
Dean
On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
> Ok then,
>
> How about your generate command:
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
> when everything else being utilised within the crawl cycle points to
> an entirely different <segment_dirs> path which is
> /opt/nutch_1_4/data/crawl/segments/segment_date
>
> Was this intentional?
>
> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<de...@semantico.com> wrote:
>> Lewis,
>>
>> Changing the merge to * returns a similar response:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>
>> And yes, your assumption was correct - it's a different segment directory
>> each loop.
>>
>> Many thanks,
>>
>> Dean.
>>
>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>> Hi Dean,
>>>
>>> Without discussing any of your configuration properties can you please try
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>
>>> paying attention to the wildcard /* in -dir
>>> /opt/nutch_1_4/data/crawl/segments/*
>>>
>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>> times, you are not recursively generating, fetching, parsing and
>>> updating the WebDB with
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>> with every iteration of the g/f/p/updatedb cycle.
>>>
>>> Thanks
>>>
>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>>> wrote:
>>>> No problem Lewis, I appreciate you looking into it.
>>>>
>>>>
>>>> Firstly I have a seed URL XML document here:
>>>> http://www.ukcigarforums.com/injectlist.xml
>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>> within it.
>>>>
>>>> Nutch's regex-urlfilter.txt contains this:
>>>>
>>>> # allow urls in ukcigarforums.com domain
>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>> # deny anything else
>>>> -.
>>>>
>>>>
>>>> Here's the procedure:
>>>>
>>>>
>>>> 1) INJECT:
>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/seed/
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>>> 26
>>>>
>>>> 3) FETCH:
>>>> /opt/nutch_1_4/bin/nutch fetch
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 4) PARSE:
>>>> /opt/nutch_1_4/bin/nutch parse
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 5) UPDATE DB:
>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>
>>>>
>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>
>>>>
>>>> Interestingly, this prints out:
>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>> crawl_parse parse_data parse_text"
>>>>
>>>> MERGEDsegments segment directory then has just two directories, instead
>>>> of
>>>> all of those listed in the last output, i.e. just: crawl_generate and
>>>> crawl_fetch
>>>>
>>>> (when then delete from the segments directory and copy the MERGEDsegments
>>>> results into it)
>>>>
>>>>
>>>> Lastly we run invert links after merge segments:
>>>>
>>>> 7) INVERT LINKS:
>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>> -dir
>>>> /opt/nutch_1_4/data/crawl/segments/
>>>>
>>>> Which produces:
>>>>
>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>>> not
>>>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>
>>>>
>>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Ok then,
How about your generate command:
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
when everything else being utilised within the crawl cycle points to
an entirely different <segment_dirs> path which is
/opt/nutch_1_4/data/crawl/segments/segment_date
Was this intentional?
On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <de...@semantico.com> wrote:
> Lewis,
>
> Changing the merge to * returns a similar response:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>
> And yes, your assumption was correct - it's a different segment directory
> each loop.
>
> Many thanks,
>
> Dean.
>
> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>
>> Hi Dean,
>>
>> Without discussing any of your configuration properties can you please try
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs
>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>
>> paying attention to the wildcard /* in -dir
>> /opt/nutch_1_4/data/crawl/segments/*
>>
>> Also presumably, when you mention you repeat steps 2-5 another 4
>> times, you are not recursively generating, fetching, parsing and
>> updating the WebDB with
>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>> with every iteration of the g/f/p/updatedb cycle.
>>
>> Thanks
>>
>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com>
>> wrote:
>>>
>>> No problem Lewis, I appreciate you looking into it.
>>>
>>>
>>> Firstly I have a seed URL XML document here:
>>> http://www.ukcigarforums.com/injectlist.xml
>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>> within it.
>>>
>>> Nutch's regex-urlfilter.txt contains this:
>>>
>>> # allow urls in ukcigarforums.com domain
>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>> # deny anything else
>>> -.
>>>
>>>
>>> Here's the procedure:
>>>
>>>
>>> 1) INJECT:
>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/seed/
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>> 26
>>>
>>> 3) FETCH:
>>> /opt/nutch_1_4/bin/nutch fetch
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 4) PARSE:
>>> /opt/nutch_1_4/bin/nutch parse
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 5) UPDATE DB:
>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>
>>>
>>> Repeat steps 2 to 5 another 4 times, then:
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>
>>>
>>> Interestingly, this prints out:
>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>> crawl_parse parse_data parse_text"
>>>
>>> MERGEDsegments segment directory then has just two directories, instead
>>> of
>>> all of those listed in the last output, i.e. just: crawl_generate and
>>> crawl_fetch
>>>
>>> (when then delete from the segments directory and copy the MERGEDsegments
>>> results into it)
>>>
>>>
>>> Lastly we run invert links after merge segments:
>>>
>>> 7) INVERT LINKS:
>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>> -dir
>>> /opt/nutch_1_4/data/crawl/segments/
>>>
>>> Which produces:
>>>
>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>> not
>>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>
>>>
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Lewis,
Changing the merge to * returns a similar response:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
And yes, your assumption was correct - it's a different segment
directory each loop.
Many thanks,
Dean.
On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> Without discussing any of your configuration properties can you please try
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs
> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>
> paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*
>
> Also presumably, when you mention you repeat steps 2-5 another 4
> times, you are not recursively generating, fetching, parsing and
> updating the WebDB with
> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
> with every iteration of the g/f/p/updatedb cycle.
>
> Thanks
>
> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<de...@semantico.com> wrote:
>> No problem Lewis, I appreciate you looking into it.
>>
>>
>> Firstly I have a seed URL XML document here:
>> http://www.ukcigarforums.com/injectlist.xml
>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>> within it.
>>
>> Nutch's regex-urlfilter.txt contains this:
>>
>> # allow urls in ukcigarforums.com domain
>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>> # deny anything else
>> -.
>>
>>
>> Here's the procedure:
>>
>>
>> 1) INJECT:
>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/seed/
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> 3) FETCH:
>> /opt/nutch_1_4/bin/nutch fetch
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 4) PARSE:
>> /opt/nutch_1_4/bin/nutch parse
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 5) UPDATE DB:
>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>
>>
>> Repeat steps 2 to 5 another 4 times, then:
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>
>>
>> Interestingly, this prints out:
>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>> crawl_parse parse_data parse_text"
>>
>> MERGEDsegments segment directory then has just two directories, instead of
>> all of those listed in the last output, i.e. just: crawl_generate and
>> crawl_fetch
>>
>> (when then delete from the segments directory and copy the MERGEDsegments
>> results into it)
>>
>>
>> Lastly we run invert links after merge segments:
>>
>> 7) INVERT LINKS:
>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
>> /opt/nutch_1_4/data/crawl/segments/
>>
>> Which produces:
>>
>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>
>>
>
>
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
Without discussing any of your configuration properties can you please try
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/* -filter -normalize
paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*
Also presumably, when you mention you repeat steps 2-5 another 4
times, you are not recursively generating, fetching, parsing and
updating the WebDB with
/opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
with every iteration of the g/f/p/updatedb cycle.
Thanks
On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <de...@semantico.com> wrote:
> No problem Lewis, I appreciate you looking into it.
>
>
> Firstly I have a seed URL XML document here:
> http://www.ukcigarforums.com/injectlist.xml
> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
> within it.
>
> Nutch's regex-urlfilter.txt contains this:
>
> # allow urls in ukcigarforums.com domain
> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
> # deny anything else
> -.
>
>
> Here's the procedure:
>
>
> 1) INJECT:
> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/seed/
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> 3) FETCH:
> /opt/nutch_1_4/bin/nutch fetch
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 4) PARSE:
> /opt/nutch_1_4/bin/nutch parse
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 5) UPDATE DB:
> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>
>
> Repeat steps 2 to 5 another 4 times, then:
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>
>
> Interestingly, this prints out:
> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
> crawl_parse parse_data parse_text"
>
> MERGEDsegments segment directory then has just two directories, instead of
> all of those listed in the last output, i.e. just: crawl_generate and
> crawl_fetch
>
> (when then delete from the segments directory and copy the MERGEDsegments
> results into it)
>
>
> Lastly we run invert links after merge segments:
>
> 7) INVERT LINKS:
> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
> /opt/nutch_1_4/data/crawl/segments/
>
> Which produces:
>
> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
No problem Lewis, I appreciate you looking into it.
Firstly I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
This basically has 'http://www.ukcigarforums.com/content.php' as a URL
within it.
Nutch's regex-urlfilter.txt contains this:
# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.
Here's the procedure:
1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
Repeat steps 2 to 5 another 4 times, then:
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/ -filter -normalize
Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"
The MERGEDsegments directory then contains just two subdirectories,
crawl_generate and crawl_fetch, instead of all of those listed in the
output above.
(we then delete the old segments from the segments directory and copy the
MERGEDsegments results into it)
Lastly we run invert links after merge segments:
7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
-dir /opt/nutch_1_4/data/crawl/segments/
Which produces:
"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
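The repeated steps 2 to 5 above can be sketched as a loop. This is a hypothetical dry-run script: the `run` wrapper only echoes each command, and the newest-segment lookup via `ls -t` is an assumption about how the timestamped segment directory would be picked up in a real run.

```shell
#!/bin/sh
# Dry-run sketch of the inject + 5x(generate/fetch/parse/updatedb) cycle.
# 'run' only echoes each command; make its body execute "$@" for a real crawl.
NUTCH=/opt/nutch_1_4/bin/nutch
CRAWLDB=/opt/nutch_1_4/data/crawl/crawldb
SEGMENTS=/opt/nutch_1_4/data/crawl/segments

run() { echo "$@"; }

run "$NUTCH" inject "$CRAWLDB" /opt/nutch_1_4/data/seed/

for i in 1 2 3 4 5; do
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 10000 -adddays 26
  # In a real run, generate creates a new timestamped segment; pick the newest.
  SEGMENT="$SEGMENTS/$(ls -t "$SEGMENTS" 2>/dev/null | head -n 1)"
  run "$NUTCH" fetch "$SEGMENT" -threads 15
  run "$NUTCH" parse "$SEGMENT"
  run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT" -normalize -filter
done
```

Scripting the loop this way also guards against the mismatch Lewis spotted, since the crawldb and segments paths are defined once rather than retyped per step.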
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Can you please post your script, or the commands (and
parameters) you are passing... I suspect there may be something
lurking which we could fix now, e.g. differences between the 1.0/1.3
commands and the current 1.4.
If not then you may have flagged up something which requires some TLC.
Thanks
On Fri, Jan 6, 2012 at 12:14 PM, Dean Pullen <de...@semantico.com> wrote:
> I've also tried nutch v1.3 with the same outcome (i.e. parse_data directory
> is not found).
>
>
>
> On 06/01/2012 10:42, Dean Pullen wrote:
>>
>> I'd like to reiterate that this all works in v1...
>>
>> Dean
>>
>> On 06/01/2012 10:04, Dean Pullen wrote:
>>>
>>> Lewis,
>>>
>>> Many thanks for your reply.
>>>
>>> I've separated the parsing from the fetching, and although each segment -
>>> we run the crawl 5 times - has the parse_data directory after parsing
>>> (observed via pausing the process), the mergesegs command does not reproduce
>>> the parse_data directory meaning invertlinks fails with the same parse_data
>>> not found error.
>>>
>>> The merged segments directory simply has the crawl_generate and
>>> crawl_fetch directories, not any of the others you can see in the other
>>> segments directories.
>>>
>>> Regards,
>>>
>>> Dean.
>>>
>>>
>>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>>
>>>> Hi Dean,
>>>>
>>>> Depending on the size of the segments your fetching, in most cases I
>>>> would advise you to separate out fetching and parsing into individual
>>>> steps. This becomes self explanatory as your segments increase in size
>>>> and the possibility of something going wrong with the fetching and
>>>> parsing when done together. This looks to be a segments which when
>>>> being fetched has experienced problems during parsing, therefore no
>>>> parse_data was produced.
>>>>
>>>> Can you please try a test fetch (with parsing boolean set to false) on
>>>> a sample segment then an individual parse and report back to us with
>>>> this one please.
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen<de...@semantico.com>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>>>> invertlinks.
>>>>>
>>>>> Error:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>>>> not
>>>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>> at
>>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>> at
>>>>>
>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> I notice that the parse_data directories are produced after a fetch
>>>>> (with
>>>>> fetcher.parse set to true), but after the merge the parse_data
>>>>> directory
>>>>> doesn't exist.
>>>>>
>>>>> What behaviour has changed since 1.0 and does anyone have a solution
>>>>> for the
>>>>> above?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Dean.
>>>>
>>>>
>>>>
>>>> --
>>>> Lewis
>>
>>
>
--
Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I've also tried nutch v1.3 with the same outcome (i.e. parse_data
directory is not found).
On 06/01/2012 10:42, Dean Pullen wrote:
> I'd like to reiterate that this all works in v1...
>
> Dean
>
> On 06/01/2012 10:04, Dean Pullen wrote:
>> Lewis,
>>
>> Many thanks for your reply.
>>
>> I've separated the parsing from the fetching, and although each
>> segment - we run the crawl 5 times - has the parse_data directory
>> after parsing (observed via pausing the process), the mergesegs
>> command does not reproduce the parse_data directory meaning
>> invertlinks fails with the same parse_data not found error.
>>
>> The merged segments directory simply has the crawl_generate and
>> crawl_fetch directories, not any of the others you can see in the
>> other segments directories.
>>
>> Regards,
>>
>> Dean.
>>
>>
>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>
>>> Hi Dean,
>>>
>>> Depending on the size of the segments your fetching, in most cases I
>>> would advise you to separate out fetching and parsing into individual
>>> steps. This becomes self explanatory as your segments increase in size
>>> and the possibility of something going wrong with the fetching and
>>> parsing when done together. This looks to be a segments which when
>>> being fetched has experienced problems during parsing, therefore no
>>> parse_data was produced.
>>>
>>> Can you please try a test fetch (with parsing boolean set to false) on
>>> a sample segment then an individual parse and report back to us with
>>> this one please.
>>>
>>> Thanks
>>>
>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean
>>> Pullen<de...@semantico.com> wrote:
>>>> Hi all,
>>>>
>>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>>> invertlinks.
>>>>
>>>> Error:
>>>>
>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>> does not
>>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>> at
>>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>
>>>> I notice that the parse_data directories are produced after a fetch
>>>> (with
>>>> fetcher.parse set to true), but after the merge the parse_data
>>>> directory
>>>> doesn't exist.
>>>>
>>>> What behaviour has changed since 1.0 and does anyone have a
>>>> solution for the
>>>> above?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Dean.
>>>
>>>
>>> --
>>> Lewis
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
I'd like to reiterate that this all works in v1...
Dean
On 06/01/2012 10:04, Dean Pullen wrote:
> Lewis,
>
> Many thanks for your reply.
>
> I've separated the parsing from the fetching, and although each segment - we run the crawl 5 times - has the parse_data directory after parsing (observed via pausing the process), the mergesegs command does not reproduce the parse_data directory meaning invertlinks fails with the same parse_data not found error.
>
> The merged segments directory simply has the crawl_generate and crawl_fetch directories, not any of the others you can see in the other segments directories.
>
> Regards,
>
> Dean.
>
>
> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>
>> Hi Dean,
>>
>> Depending on the size of the segments your fetching, in most cases I
>> would advise you to separate out fetching and parsing into individual
>> steps. This becomes self explanatory as your segments increase in size
>> and the possibility of something going wrong with the fetching and
>> parsing when done together. This looks to be a segments which when
>> being fetched has experienced problems during parsing, therefore no
>> parse_data was produced.
>>
>> Can you please try a test fetch (with parsing boolean set to false) on
>> a sample segment then an individual parse and report back to us with
>> this one please.
>>
>> Thanks
>>
>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen<de...@semantico.com> wrote:
>>> Hi all,
>>>
>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>> invertlinks.
>>>
>>> Error:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>> at
>>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at
>>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at
>>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> I notice that the parse_data directories are produced after a fetch (with
>>> fetcher.parse set to true), but after the merge the parse_data directory
>>> doesn't exist.
>>>
>>> What behaviour has changed since 1.0 and does anyone have a solution for the
>>> above?
>>>
>>> Thanks in advance,
>>>
>>> Dean.
>>
>>
>> --
>> Lewis
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Lewis,
Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment (we run the crawl 5 times) has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same parse_data-not-found error.
The merged segments directory simply has the crawl_generate and crawl_fetch directories, not any of the others you can see in the other segments directories.
Regards,
Dean.
On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> Depending on the size of the segments your fetching, in most cases I
> would advise you to separate out fetching and parsing into individual
> steps. This becomes self explanatory as your segments increase in size
> and the possibility of something going wrong with the fetching and
> parsing when done together. This looks to be a segments which when
> being fetched has experienced problems during parsing, therefore no
> parse_data was produced.
>
> Can you please try a test fetch (with parsing boolean set to false) on
> a sample segment then an individual parse and report back to us with
> this one please.
>
> Thanks
>
> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <de...@semantico.com> wrote:
>> Hi all,
>>
>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>> invertlinks.
>>
>> Error:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>> at
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> I notice that the parse_data directories are produced after a fetch (with
>> fetcher.parse set to true), but after the merge the parse_data directory
>> doesn't exist.
>>
>> What behaviour has changed since 1.0 and does anyone have a solution for the
>> above?
>>
>> Thanks in advance,
>>
>> Dean.
>
>
>
> --
> Lewis
Re: parse data directory not found after merge
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Dean,
Depending on the size of the segments you're fetching, in most cases I
would advise you to separate fetching and parsing into individual
steps. This becomes increasingly important as your segments grow in
size, and with them the chance of something going wrong when fetching
and parsing are done together. This looks to be a segment which
experienced problems during parsing while being fetched, therefore no
parse_data was produced.
Can you please try a test fetch (with the parsing boolean set to
false) on a sample segment, then an individual parse, and report back
to us with the result.
Thanks
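For reference, a decoupled cycle along the lines described above might look like the sketch below. The paths, the -topN value, and the segment selection are illustrative, and it assumes fetcher.parse has been set to false in conf/nutch-site.xml; it is a sketch of the Nutch 1.4 command sequence, not a drop-in script:

```shell
# Generate a new segment from the crawldb (paths are illustrative).
bin/nutch generate crawl/crawldb crawl/segments -topN 100

# Pick up the newest segment directory.
SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)

# Fetch only: with fetcher.parse=false this writes crawl_fetch and content.
bin/nutch fetch "$SEGMENT"

# Parse as a separate step: this is what produces parse_data and parse_text.
bin/nutch parse "$SEGMENT"

# invertlinks should now find parse_data in every segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```

If the separate parse step fails, its error output pinpoints the parsing problem that a combined fetch-and-parse run can silently swallow.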
On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <de...@semantico.com> wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch (with
> fetcher.parse set to true), but after the merge the parse_data directory
> doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for the
> above?
>
> Thanks in advance,
>
> Dean.
--
Lewis
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
There is no zip. Anyway, I just did three fetch and parse cycles of
nutch.apache.org with trunk. Trunk has no changes concerning segments etc.
with regard to 1.4. I injected nutch.apache.org and then did two fetches of
-topN 4 pages, so I got 9 pages in three segments. I also configured it to
stay within the domain.
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 28
retry 0: 28
min score: 0.0010
avg score: 0.080714285
max score: 1.588
status 1 (db_unfetched): 19
status 2 (db_fetched): 9
CrawlDb statistics: done
crawl/segments/20120111122321/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_text
crawl/segments/20120111122438/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:24 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_text
crawl/segments/20120111122539/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:26 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_text
Let's merge the three segments into one:
$ bin/nutch mergesegs merged_segment -dir crawl/segments/
Merging 3 segments to merged_segment/20120111122826
SegmentMerger: adding file:/PATH/crawl/segments/20120111122539
SegmentMerger: adding file:/PATH/crawl/segments/20120111122438
SegmentMerger: adding file:/PATH/crawl/segments/20120111122321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch
crawl_parse parse_data parse_text
... it takes a while but finishes. Then I've got this:
$ ls merged_segment/20120111122826/
content crawl_fetch crawl_generate crawl_parse parse_data parse_text
I don't see the problem, but this should reproduce yours as your steps
are not really different from mine. Is it still the parse_data directory that
is missing?
Why are you merging anyway? It is not mandatory at all.
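As a sanity check before (or instead of) merging, a small shell loop can confirm that every segment actually contains all six subdirectories; a segment fetched with fetcher.parse=true whose parsing failed will show up with parse_data missing. This is an illustrative helper, not part of Nutch, and the crawl/segments path is an assumption:

```shell
# List any missing subdirectories for each segment under crawl/segments.
# A complete segment has all six; an incomplete one will trip up
# mergesegs and invertlinks.
for seg in crawl/segments/*/; do
  for sub in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "MISSING: $seg$sub"
  done
done
```

Running this right after each fetch/parse cycle narrows down whether parse_data was never written or was lost during the merge.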
On Wednesday 11 January 2012 12:09:57 Dean Pullen wrote:
> A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
> thing.
>
> I've zipped up the nutch/hadoop dir with all config etc, would either of
> you (Markus/Lewis) care to look at it?
>
> Any help at this stage would be immensely appreciated.
>
> Regards,
>
> Dean.
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Markus,
I didn't include the zip, I was just saying I have it if you would like
to see/use it! Shall I send?
Can you zip up and send to me what you've just done? Presumably it must
be a config thing?!
I know mergesegs isn't mandatory, but since I believed there was a problem
with it I've been trying to track the issue down for its own sake...
Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I ran the merge locally only. I've never merged on a Hadoop cluster since we
don't need it there.
On Wednesday 11 January 2012 12:21:20 Dean Pullen wrote:
> For further reference, below is the Hadoop job task log for the
> mergesegs command.
> You'll see that the parse_data merge, among others, is performed.
>
>
> Completed Tasks
>
> Task Complete Status Start Time Finish Time Errors
> Counters
> task_201201111048_0031_m_000000 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259
> 11-Jan-2012 11:16:22 11-Jan-2012 11:16:25 (3sec)
> 9
> task_201201111048_0031_m_000001 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234
> 11-Jan-2012 11:16:22 11-Jan-2012 11:16:25 (3sec)
> 9
> task_201201111048_0031_m_000002 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129
> 11-Jan-2012 11:16:25 11-Jan-2012 11:16:28 (3sec)
> 9
> task_201201111048_0031_m_000003 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129
> 11-Jan-2012 11:16:25 11-Jan-2012 11:16:28 (3sec)
> 9
> task_201201111048_0031_m_000004 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:31 (3sec)
> 9
> task_201201111048_0031_m_000005 100.00%
> file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:31 (3sec)
>
>
>
>
> And the parse_data job itself:
>
> attempt_201201111048_0031_m_000004_0
> /default-rack/dhcp-192-168-4-26.semantico.net SUCCEEDED 100.00%
> 11-Jan-2012 11:16:28 11-Jan-2012 11:16:30 (1sec)
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
For further reference, below is the Hadoop job task log for the
mergesegs command.
You'll see that the parse_data merge, among others, is performed.
Completed Tasks
Task Complete Status Start Time Finish Time Errors
Counters
task_201201111048_0031_m_000000 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259
11-Jan-2012 11:16:22
11-Jan-2012 11:16:25 (3sec)
9
task_201201111048_0031_m_000001 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234
11-Jan-2012 11:16:22
11-Jan-2012 11:16:25 (3sec)
9
task_201201111048_0031_m_000002 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129
11-Jan-2012 11:16:25
11-Jan-2012 11:16:28 (3sec)
9
task_201201111048_0031_m_000003 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129
11-Jan-2012 11:16:25
11-Jan-2012 11:16:28 (3sec)
9
task_201201111048_0031_m_000004 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128
11-Jan-2012 11:16:28
11-Jan-2012 11:16:31 (3sec)
9
task_201201111048_0031_m_000005 100.00%
file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128
11-Jan-2012 11:16:28
11-Jan-2012 11:16:31 (3sec)
And the parse_data job itself:
attempt_201201111048_0031_m_000004_0
/default-rack/dhcp-192-168-4-26.semantico.net SUCCEEDED 100.00%
11-Jan-2012 11:16:28 11-Jan-2012 11:16:30 (1sec)
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
thing.
I've zipped up the nutch/hadoop dir with all config etc, would either of
you (Markus/Lewis) care to look at it?
Any help at this stage would be immensely appreciated.
Regards,
Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
Well, set it up to crawl nutch.apache.org only, fetch some cycles, and see what
happens. If merging goes bad then I can reproduce and perhaps fix it.
If not, you may want to start debugging the thing step by step.
On Tuesday 10 January 2012 18:06:34 Dean Pullen wrote:
> Yes, this is about the parse_data directory disappearing after a merge.
>
> I've used a clean Nutch 1.4 multiple times; I've not yet used an example
> crawl though.
>
> Anything specific you recommend?
>
> Dean.
>
> On 10/01/2012 16:59, Markus Jelsma wrote:
> > I haven't followed the entire thread, but is this about the parse_data
> > directory disappearing after a merge? We have no issues with merges on
> > small crawls.
> >
> > Do you still store content despite the parsing fetcher? Can you reproduce
> > this on a clean Nutch 1.4 build with an example crawl?
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
Yes, this is about the parse_data directory disappearing after a merge.
I've used a clean Nutch 1.4 multiple times; I've not yet used an example
crawl though.
Anything specific you recommend?
Dean.
On 10/01/2012 16:59, Markus Jelsma wrote:
> I haven't followed the entire thread, but is this about the parse_data
> directory disappearing after a merge? We have no issues with merges on small
> crawls.
>
> Do you still store content despite the parsing fetcher? Can you reproduce this
> on a clean Nutch 1.4 build with an example crawl?
>
>
Re: parse data directory not found after merge
Posted by Dean Pullen <de...@semantico.com>.
The disk errors were solved by upgrading Hadoop to 0.20.203; they no
longer appear.
Dean.
On 10/01/2012 17:01, Markus Jelsma wrote:
> I might want to ask about your Hadoop temp dir since you seem to have disk
> errors. Have you set it?
>
> On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
>> I haven't followed the entire thread, but is this about the parse_data
>> directory disappearing after a merge? We have no issues with merges on small
>> crawls.
>>
>> Do you still store content despite the parsing fetcher? Can you reproduce
>> this on a clean Nutch 1.4 build with an example crawl?
>>
>> On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
>>> Hi all,
>>>
>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>> invertlinks.
>>>
>>> Error:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>> not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>
>>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> I notice that the parse_data directories are produced after a fetch
>>> (with fetcher.parse set to true), but after the merge the parse_data
>>> directory doesn't exist.
>>>
>>> What behaviour has changed since 1.0 and does anyone have a solution for
>>> the above?
>>>
>>> Thanks in advance,
>>>
>>> Dean.
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I might want to ask about your Hadoop temp dir since you seem to have disk
errors. Have you set it?
On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
> I haven't followed the entire thread, but is this about the parse_data
> directory disappearing after a merge? We have no issues with merges on small
> crawls.
>
> Do you still store content despite the parsing fetcher? Can you reproduce
> this on a clean Nutch 1.4 build with an example crawl?
>
> On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
> > Hi all,
> >
> > I'm upgrading from nutch 1 to 1.4 and am having problems running
> > invertlinks.
> >
> > Error:
> >
> > LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
> > not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> >
> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> > at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
> >
> > I notice that the parse_data directories are produced after a fetch
> > (with fetcher.parse set to true), but after the merge the parse_data
> > directory doesn't exist.
> >
> > What behaviour has changed since 1.0 and does anyone have a solution for
> > the above?
> >
> > Thanks in advance,
> >
> > Dean.
--
Markus Jelsma - CTO - Openindex
Re: parse data directory not found after merge
Posted by Markus Jelsma <ma...@openindex.io>.
I haven't followed the entire thread, but is this about the parse_data
directory disappearing after a merge? We have no issues with merges on small
crawls.
Do you still store content despite the parsing fetcher? Can you reproduce this
on a clean Nutch 1.4 build with an example crawl?
On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
> not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch
> (with fetcher.parse set to true), but after the merge the parse_data
> directory doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for
> the above?
>
> Thanks in advance,
>
> Dean.
--
Markus Jelsma - CTO - Openindex