Posted to user@nutch.apache.org by Yousin Kim <yo...@gmail.com> on 2015/04/13 05:00:04 UTC

I want to crawl deep pages

Hello, I compiled Nutch 2.3 with Gora 0.6 and MongoDB and tried to crawl an
online shop.

But I only got the front pages, not the product detail pages.
How can I get the product detail pages?

Thank you :)

I want to get URLs like:
http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2

my seed list is http://www.vanillashu.co.kr/

regex-urlfilter
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

+^(http|https)://.*vanillashu.co.kr/

Re: I want to crawl deep pages

Posted by Michael Joyce <jo...@apache.org>.
Do you have any additional information? The config you're using, crawl
stats, etc.

In general, my approach to deep, single-site crawls has been to keep my
config as liberal as possible about which links it excludes, and then use
the regex filter to keep the crawl from leaving the relevant domain(s).
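
For a single-site crawl like this one, that might boil down to a
regex-urlfilter.txt roughly like the following (just a sketch, not tested
against your setup; the main points are that the '?' filter stays commented
out so detail.html?product_no=... URLs survive, and that everything outside
the shop is rejected last):

# regex-urlfilter.txt sketch (untested)
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# leave the query-string filter disabled, the product detail pages need it
#-[?*!@=]

# accept this shop only (any subdomain), dots escaped
+^https?://([a-z0-9.-]*\.)?vanillashu\.co\.kr/

# reject everything else
-.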

One relevant property that I've had bite me before is:
<name>db.ignore.internal.links</name>
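
It's worth checking what that is set to: if it's true, outlinks back to the
same host get dropped when the db is updated, which looks exactly like a
crawl that never gets past the front page. A nutch-site.xml override might
look like this (a sketch, adjust to your setup):

<!-- nutch-site.xml sketch: keep same-host outlinks so the crawl can go deeper -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>Keep outlinks that point back to the same host so a
  single-site crawl can follow category and product-detail links.</description>
</property>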


-- Jimmy

On Fri, Apr 17, 2015 at 4:13 PM, steve labar <st...@gmail.com>
wrote:

> I have similar problems. For me it seems to happen when many of the pages
> get a very low ranking and therefore never get fetched. If I kick off the
> crawl again, it goes one more layer deeper down the rabbit hole. I thought
> about trying to reduce the threshold those pages need to reach before they
> get fetched. Honestly I still haven't solved it, but I thought I'd mention
> that I'm seeing similar tendencies.
>
> On Sun, Apr 12, 2015 at 8:00 PM, Yousin Kim <yo...@gmail.com> wrote:
>
> > Hello, I compiled Nutch 2.3 with Gora 0.6 and MongoDB and tried to crawl
> > an online shop.
> >
> > But I only got the front pages, not the product detail pages.
> > How can I get the product detail pages?
> >
> > Thank you :)
> >
> > I want to get URLs like:
> >
> > http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2
> >
> > my seed list is http://www.vanillashu.co.kr/
> >
> > regex-urlfilter
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >
> > # accept anything else
> > #+.
> >
> > +^(http|https)://.*vanillashu.co.kr/
> >
>

Re: I want to crawl deep pages

Posted by steve labar <st...@gmail.com>.
I have similar problems. For me it seems to happen when many of the pages
get a very low ranking and therefore never get fetched. If I kick off the
crawl again, it goes one more layer deeper down the rabbit hole. I thought
about trying to reduce the threshold those pages need to reach before they
get fetched. Honestly I still haven't solved it, but I thought I'd mention
that I'm seeing similar tendencies.
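
If it helps, the knobs I've been poking at are on the generator side: each
round only puts the top-scoring URLs into the fetch list (the -topN argument
to generate), optionally capped per host or domain, so low-scoring deep pages
only surface after more rounds. A nutch-site.xml sketch of the per-host cap
(assuming the stock property names; not a verified fix for this problem):

<!-- nutch-site.xml sketch (unverified fix): no per-host cap in each fetch list -->
<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>-1 means no per-host/domain cap on URLs in a fetch list.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Count URLs per host when applying generate.max.count.</description>
</property>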

On Sun, Apr 12, 2015 at 8:00 PM, Yousin Kim <yo...@gmail.com> wrote:

> Hello, I compiled Nutch 2.3 with Gora 0.6 and MongoDB and tried to crawl an
> online shop.
>
> But I only got the front pages, not the product detail pages.
> How can I get the product detail pages?
>
> Thank you :)
>
> I want to get URLs like:
>
> http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2
>
> my seed list is http://www.vanillashu.co.kr/
>
> regex-urlfilter
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> #+.
>
> +^(http|https)://.*vanillashu.co.kr/
>