
Posted to dev@nutch.apache.org by "Rameez Raja (JIRA)" <ji...@apache.org> on 2010/10/28 05:59:19 UTC

[jira] Created: (NUTCH-927) Sub pages are not getting crawled

Sub pages are not getting crawled
---------------------------------

                 Key: NUTCH-927
                 URL: https://issues.apache.org/jira/browse/NUTCH-927
             Project: Nutch
          Issue Type: Bug
          Components: injector
    Affects Versions: 2.0
            Reporter: Rameez Raja


The objective of my program is to crawl all the pages and fetch their contents. Fetching by category works perfectly, but the sub pages are not getting crawled. That is, the next pages are reached through links at the bottom of the page.

The link appears in the page as:
<a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page &gt;" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>.

Can anyone help solve this problem?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-927) Sub pages are not getting crawled

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-927.
-------------------------------

    Resolution: Not A Problem

Not a bug. Please use the mailing lists to ask questions.

> Sub pages are not getting crawled
> ---------------------------------
>
>                 Key: NUTCH-927
>                 URL: https://issues.apache.org/jira/browse/NUTCH-927
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 2.0
>            Reporter: Rameez Raja
>
> The objective of my program is to crawl all the pages and fetch their contents. Fetching by category works perfectly, but the sub pages are not getting crawled. That is, the next pages are reached through links at the bottom of the web page, as shown below:
> <a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page &gt;" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>.
> I am using the script below to crawl the site:
> $NUTCH_HOME/search/scripts/crawl.sh testcrawlreviews 5 > crawl.log 2>&1 &
> where 5 is the depth.
> Here is a snapshot of the script:
> cd $NUTCH_HOME
> bin/nutch inject $BASEDIR/crawldb urls
> bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments
> SEGMENT=`ls $BASEDIR/segments/ | tail -1`
> echo processing segment $SEGMENT
> bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
> bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
> done


[jira] Updated: (NUTCH-927) Sub pages are not getting crawled

Posted by "Rameez Raja (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rameez Raja updated NUTCH-927:
------------------------------

    Description: 
The objective of my program is to crawl all the pages and fetch their contents. Fetching by category works perfectly, but the sub pages are not getting crawled. That is, the next pages are reached through links at the bottom of the web page, as shown below:

<a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page &gt;" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>.
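
One thing that may be worth checking for a link like this is the URL filter configuration, since the next-page URL carries a query string (?page=2). As far as I know, the stock regex URL filter file shipped with Nutch contains a deny rule for URLs with characters such as ? and =, and the -filter option on updatedb applies the configured URL filters. The sketch below only imitates such a rule with grep; DENY_PATTERN and url_filter_verdict are hypothetical names for illustration, not part of Nutch, and the actual rules in your conf directory may differ.

```shell
#!/bin/sh
# Assumption: a deny rule similar to the "-[?*!@=]" line found in the
# stock regex URL filter file. Check your own conf directory for the real rules.
DENY_PATTERN='[?*!@=]'

# Hypothetical helper: prints "rejected" if the URL matches the deny pattern,
# otherwise "accepted". It imitates a single filter rule, nothing more.
url_filter_verdict() {
  if printf '%s\n' "$1" | grep -q -e "$DENY_PATTERN"; then
    echo rejected
  else
    echo accepted
  fi
}

# The next-page link from the description contains "?" and "=".
url_filter_verdict 'http://reviews.logitech.com/7061/224/reviews.htm?page=2'
# The same page without the query string.
url_filter_verdict 'http://reviews.logitech.com/7061/224/reviews.htm'
```

If a rule like this turns out to be the cause, relaxing it, or placing an accept rule for the review pages ahead of it, is probably the first thing to try.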

I am using the script below to crawl the site:
$NUTCH_HOME/search/scripts/crawl.sh testcrawlreviews 5 > crawl.log 2>&1 &

where 5 is the depth.


Here is a snapshot of the script:

cd $NUTCH_HOME
bin/nutch inject $BASEDIR/crawldb urls
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments
SEGMENT=`ls $BASEDIR/segments/ | tail -1`
echo processing segment $SEGMENT
bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
done
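
The snapshot above ends with "done" but does not show the loop header, so how the depth argument drives the rounds is not visible. A minimal sketch of what the whole loop might look like follows. The function name crawl_rounds, the counter loop, and the NUTCH override variable are assumptions made for illustration; only the bin/nutch commands themselves are taken from the snapshot.

```shell
#!/bin/sh
# Sketch only: the loop header and argument handling are assumptions,
# because the snapshot shows just the loop body.

# Hypothetical override so the loop can be dry-run without Nutch,
# e.g. NUTCH="echo bin/nutch".
NUTCH="${NUTCH:-bin/nutch}"

crawl_rounds() {
  basedir="$1"   # crawl directory, e.g. testcrawlreviews
  depth="$2"     # number of generate/fetch/updatedb rounds

  # Seed the crawldb once from the urls directory.
  $NUTCH inject "$basedir/crawldb" urls

  i=1
  while [ "$i" -le "$depth" ]; do
    $NUTCH generate "$basedir/crawldb" "$basedir/segments"
    # Take the last entry in the segments directory, as the snapshot does.
    segment=$(ls "$basedir/segments/" | tail -1)
    echo "processing segment $segment"
    $NUTCH fetch "$basedir/segments/$segment" -threads 10
    # Links discovered in this round only reach the next round if they
    # survive the URL filters applied here by -filter.
    $NUTCH updatedb "$basedir/crawldb" "$basedir/segments/$segment" -filter
    i=$((i + 1))
  done
}
```

Called as crawl_rounds testcrawlreviews 5, this would run five rounds. With NUTCH set to "echo bin/nutch" it only prints the commands, which is a cheap way to confirm the loop really iterates as many times as the depth.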


  was:
The objective of my program is to crawl all the pages and fetch their contents. Fetching by category works perfectly, but the sub pages are not getting crawled. That is, the next pages are reached through links at the bottom of the page.

The link appears in the page as:
<a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page &gt;" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>.

Can anyone help solve this problem?

