
Posted to dev@nutch.apache.org by "salima abdulsalam (JIRA)" <ji...@apache.org> on 2009/08/21 15:38:14 UTC

[jira] Created: (NUTCH-749) Fetching the url from crawldb

Fetching the url from crawldb
-----------------------------

                 Key: NUTCH-749
                 URL: https://issues.apache.org/jira/browse/NUTCH-749
             Project: Nutch
          Issue Type: Bug
         Environment: Nutch with solr integration
            Reporter: salima abdulsalam


Hi,
 I am new to using Nutch with Solr. I followed http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for the integration. I am getting an error while fetching the URLs from the crawldb.

I set SEGMENT as follows:

  export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

and then ran this command:

  bin/nutch fetch $SEGMENT -noParsing

After running the command, I get the following error:


Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20090821062021
Exception in thread "main" java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob 20090821062021 at 30
        at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1086)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1071)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.<init>(FileSystem.java:989)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:955)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:964)
        at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:904)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:868)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:159)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:101)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)

Can anyone help with this?

Thanks,
Salima


 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-749) Fetching the url from crawldb

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-749.
-------------------------------

    Resolution: Invalid

Please use the nutch-user mailing list to ask questions.

As for your problem, you need to add something like this to your nutch-site.xml:

<property>
  <name>http.robots.agents</name>
  <value>nutch-solr-integration,*</value>
</property>

Change nutch-solr-integration to your robot name.
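
The warning in the log says the 'http.agent.name' value should be listed first in 'http.robots.agents', so the two properties need to agree. A minimal sketch of how they might look together, assuming both are set in nutch-site.xml and that the agent name is nutch-solr-integration (a placeholder here; substitute your own robot name in both places):

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
<!-- "nutch-solr-integration" is a placeholder agent name, not a required value. -->
<property>
  <name>http.agent.name</name>
  <value>nutch-solr-integration</value>
</property>
<property>
  <name>http.robots.agents</name>
  <!-- the same name comes first, followed by the wildcard -->
  <value>nutch-solr-integration,*</value>
</property>
```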
