Posted to user@nutch.apache.org by gsamsa <ma...@gmail.com> on 2014/09/24 15:36:40 UTC

Apache nutch 1.9 error - Input path does not exist

Hello guys,

I have installed *Apache Nutch 1.9* and *Solr 3.6.2*, which run on an Ubuntu
virtual machine in VirtualBox.

*Description of error*


I start a crawl like this:

*./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*

However, I get the following error (this is the log from
`nutch/logs/hadoop.log`):

  

    2014-09-24 14:39:46,252 INFO  crawl.Injector - Injector: starting at 2014-09-24 14:39:46
    2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: crawlDb: -solr/crawldb
    2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: urlDir: urls
    2014-09-24 14:39:46,260 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
    2014-09-24 14:39:47,263 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-09-24 14:39:47,375 WARN  snappy.LoadSnappy - Snappy native library not loaded
    2014-09-24 14:39:49,076 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
    2014-09-24 14:39:49,132 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
    2014-09-24 14:39:50,001 INFO  crawl.Injector - Injector: Total number of urls rejected by filters: 0
    2014-09-24 14:39:50,002 INFO  crawl.Injector - Injector: Total number of urls after normalization: 2
    2014-09-24 14:39:50,003 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
    2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: overwrite: false
    2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: update: false
    2014-09-24 14:39:52,116 INFO  crawl.Injector - Injector: URLs merged: 2
    2014-09-24 14:39:52,136 INFO  crawl.Injector - Injector: Total new urls injected: 0
    2014-09-24 14:39:52,139 INFO  crawl.Injector - Injector: finished at 2014-09-24 14:39:52, elapsed: 00:00:05
    2014-09-24 14:39:55,557 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-09-24 14:39:55,571 INFO  crawl.Generator - Generator: starting at 2014-09-24 14:39:55
    2014-09-24 14:39:55,574 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
    2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: filtering: false
    2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: normalizing: true
    2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: topN: 50000
    2014-09-24 14:39:58,013 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
    2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
    2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
    2014-09-24 14:39:58,044 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
    2014-09-24 14:39:58,291 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
    2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
    2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
    2014-09-24 14:39:58,370 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
    2014-09-24 14:39:58,782 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
    2014-09-24 14:39:59,785 INFO  crawl.Generator - Generator: segment: -solr/segments/20140924143959
    2014-09-24 14:40:00,313 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
    2014-09-24 14:40:01,032 INFO  crawl.Generator - Generator: finished at 2014-09-24 14:40:01, elapsed: 00:00:05
    2014-09-24 14:40:03,462 INFO  fetcher.Fetcher - Fetcher: starting at 2014-09-24 14:40:03
    2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher: segment: -solr/segments
    2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher Timelimit set for : 1411573203467
    2014-09-24 14:40:04,207 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-09-24 14:40:04,301 ERROR security.UserGroupInformation - PriviledgedActionException as:testUser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
    2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)

I basically configured my Solr like in the tutorial on the  apache wiki
<http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:

    mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org

    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
    vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml

    At line 351, insert exactly: <field name="_version_" type="long" indexed="true" stored="true"/>
This is what I get when I start solr:

<http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg> 

*What I tried:*


According to this  thread
<http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>
the issue should be fixed by deleting all segment files in
*-solr/segments*; however, that did not resolve the issue.

Any recommendations where this error might come from and what I can do to
fix it?




--
View this message in context: http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Apache nutch 1.9 error - Input path does not exist

Posted by atawfik <co...@gmail.com>.
Hi,

To query the core, you need to provide the full URL. For instance,
you can query everything using http://127.0.0.1:8983/solr/collection1/select?q=*:*. You can read more about that on the Solr site.

If you want to explore the core, you can navigate to http://127.0.0.1:8983/solr/#/collection1
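A minimal command-line sketch of the same check (this is an assumption about the setup, not from the thread: it takes `collection1` as the core name, since that is the default in the Solr example configs, and keeps the actual HTTP request commented out so the snippet works without a running Solr):

```shell
# Build the core-qualified select URL described above.
# "collection1" is Solr's default example core name in 3.x/4.x setups.
CORE_URL="http://127.0.0.1:8983/solr/collection1"
QUERY="${CORE_URL}/select?q=*:*"
echo "$QUERY"
# With Solr running locally, fetch the results:
#   curl "$QUERY"
```

The point is that the query handler (`/select`) lives under the core, not under the bare `/solr/` root, which is why the root URL alone returns nothing useful.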

Regards
Ameer 
On Sep 24, 2014, at 11:49 PM, gsamsa [via Lucene] <ml...@n3.nabble.com> wrote:

> Thx for your answer! 
> 
> I immediately tried it, however, it gives me: 
> 
> 
> 
> Any recommendations, what I am doing wrong? 
> 
> Should I start nutch with this url(http://127.0.0.1:8983/solr/collection1/)? 
> 






Re: Apache nutch 1.9 error - Input path does not exist

Posted by gsamsa <ma...@gmail.com>.
Thx for your answer!

I immediately tried it; however, it gives me:

<http://lucene.472066.n3.nabble.com/file/n4160996/notFound.jpg> 

Any recommendations on what I am doing wrong?

Should I start Nutch with this URL (http://127.0.0.1:8983/solr/collection1/)?




Re: Apache nutch 1.9 error - Input path does not exist

Posted by atawfik <co...@gmail.com>.
Hi,

Your Solr address is wrong. You should include the core name. In your case,
it will be http://127.0.0.1:8983/solr/collection1/
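Putting this together with the earlier fix to the `bin/crawl` arguments, the re-run would look something like this sketch (the crawl-directory name `crawl` is an assumption for illustration, not from the thread; the snippet only prints the command so it can be inspected before running):

```shell
# Corrected bin/crawl invocation with the core-qualified Solr URL.
# Run from the Nutch runtime directory; "crawl" is a hypothetical crawl dir.
SOLR_URL="http://127.0.0.1:8983/solr/collection1/"
CMD="bin/crawl urls/ crawl $SOLR_URL 1"
echo "$CMD"
# To actually run it:
#   $CMD
```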

Regards
Ameer




Re: Apache nutch 1.9 error - Input path does not exist

Posted by gsamsa <ma...@gmail.com>.
Thx a lot, that works like a charm.

However, my current problem is that I cannot see anything in Solr. Any
recommendations on what I am doing wrong? I have done it exactly as
described on the wiki page. This is my schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.5">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="id" type="string" stored="true" indexed="true" required="true"/>

        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="string" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="date" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
        <field name="contentLength" type="long" stored="true" indexed="false"/>
        <field name="lastModified" type="date" stored="true" indexed="false"/>
        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="date" stored="true" indexed="true"/>
        <field name="updatedDate" type="date" stored="true" indexed="true"/>

        <!-- fields for creativecommons plugin -->
        <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

        <!-- fields for tld plugin -->
        <field name="tld" type="string" stored="false" indexed="false"/>

        <field name="_version_" type="long" stored="true" indexed="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>


Furthermore, would it be better to use a later Solr version, like 4.10?

I appreciate your reply!

On Wed, Sep 24, 2014 at 4:35 PM, Jonathan Cooper-Ellis [via Lucene] <
ml-node+s472066n4160936h20@n3.nabble.com> wrote:

> Hello,
>
> It looks like you're confusing the usage of bin/crawl with the old
> bin/nutch crawl command. You want to start the crawl like this:
>
> bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>
>
> So, the script thinks "-solr" is your crawl directory (which does not
> exist):
>
> 2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/testUser/Desktop/nutch-solr-example/apache-
> nutch-1.9/-solr/segments/crawl_generate
>
> Hope that helps!
>
> -jce
>
>
>
> On Wed, Sep 24, 2014 at 9:36 AM, gsamsa wrote:
> > [original message quoted in full; trimmed]





Re: Apache nutch 1.9 error - Input path does not exist

Posted by Jonathan Cooper-Ellis <jc...@ziftr.com>.
Hello,

It looks like you're confusing the usage of bin/crawl with the old
bin/nutch crawl command. You want to start the crawl like this:

bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>

So, the script thinks "-solr" is your crawl directory (which does not
exist):

2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
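The mix-up is easiest to see by walking the arguments positionally, the way the 1.9 crawl script consumes them. This is a toy sketch, not the actual script: `SEEDDIR`, `CRAWL_PATH`, `SOLRURL`, and `LIMIT` are illustrative names for the four positional slots.

```shell
# Simulate how bin/crawl (Nutch 1.9) reads four positional arguments;
# there is no "-solr" option, so the flag lands in the crawl-dir slot.
set -- urls/ -solr http://127.0.0.1:8983/solr/ 1
SEEDDIR="$1"; CRAWL_PATH="$2"; SOLRURL="$3"; LIMIT="$4"
echo "seed dir:  $SEEDDIR"
echo "crawl dir: $CRAWL_PATH"   # "-solr" -- hence the -solr/segments paths in the log
echo "solr url:  $SOLRURL"
echo "rounds:    $LIMIT"
```

That is exactly why the log shows `-solr/crawldb` and `-solr/segments`: the literal string "-solr" was taken as the crawl directory name.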

Hope that helps!

-jce



On Wed, Sep 24, 2014 at 9:36 AM, gsamsa <ma...@gmail.com> wrote:

> [original message quoted in full; trimmed]