Posted to user@nutch.apache.org by eric park <hk...@gmail.com> on 2010/06/18 07:29:02 UTC

Bulletin board crawling

Hi, I'm trying to crawl a bulletin board containing about 700 pages. I set
the Nutch crawler depth to 5 and ran the crawler, but it crawled only about 60
pages. I don't think Nutch crawls bulletin boards recursively. Has anyone found
a way to crawl a bulletin board recursively?

Thank you

Re: Bulletin board crawling

Posted by Neal Richter <nr...@gmail.com>.

Do you have an RSS feed of all 700 pages that you can use as the
input (first page)? Or could you just generate the list yourself and give
them all as inputs?
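
As a rough sketch of the "generate the list yourself" option: the script below builds one seed URL per index page and writes them one per line, which is the layout Nutch's inject step reads from a seed directory. The URL template, the `page` parameter name, and the output path are all made up for illustration; substitute whatever the real board uses.

```python
from pathlib import Path


def build_seed_urls(template, first_page, last_page):
    """Return one URL per index page, filling {page} in the template."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def write_seed_file(path, urls):
    """Write the URLs one per line, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


# Hypothetical URL pattern: the real board's pagination URLs will differ.
TEMPLATE = "http://board.example.com/list?page={page}"
SEED_URLS = build_seed_urls(TEMPLATE, 1, 70)
```

Calling `write_seed_file("urls/seed.txt", SEED_URLS)` would produce a 70-line seed file. With every index page injected as a seed, the posts are only one link away, so a shallow depth should be enough. Note that a pattern like the one above contains "?", so the URL filters would also need to let it through.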


Re: Bulletin board crawling

Posted by eric park <hk...@gmail.com>.

Hello, Alex

Thank you for your help. The problem is that I cannot set the crawler depth
to 70, because it would take forever crawling unnecessary web pages. The
bulletin board I'm trying to crawl is divided into 70 index pages, each
linking to 10 pages. When I set the crawler depth to 5 and crawl, it crawls
only 50 pages. I'm looking for a way to crawl a bulletin board containing 700
pages recursively without setting the crawler depth to 70. I would appreciate
any ideas or help.

Thank you.


Re: Bulletin board crawling

Posted by Alex McLintock <al...@gmail.com>.

Hello Eric,

I'm not sure I see your problem. There is nothing special about a
bulletin board compared to any other website. Here are some ideas
which may help.

Have you iterated the "generate list of urls, crawl them, index them"
stage or have you only run it once?
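
To illustrate why the number of iterations matters, here is a toy simulation of the generate/fetch/update cycle. It is not Nutch code. It assumes a board whose index pages link only to their own 10 posts and to the next index page, which is a guess at the layout; real boards often link to several index pages at once.

```python
def build_board(index_pages=70, posts_per_page=10):
    """Return a dict mapping each URL to the list of URLs it links to."""
    links = {}
    for i in range(1, index_pages + 1):
        posts = [f"post/{i}-{j}" for j in range(1, posts_per_page + 1)]
        next_page = [f"index/{i + 1}"] if i < index_pages else []
        links[f"index/{i}"] = posts + next_page
        for post in posts:
            links[post] = []  # posts link nowhere in this toy model
    return links


def crawl(links, seeds, rounds):
    """Run up to `rounds` generate/fetch/update cycles; return fetched URLs."""
    known = set(seeds)
    fetched = set()
    for _ in range(rounds):
        batch = known - fetched  # generate: known but not yet fetched
        if not batch:
            break
        for url in batch:  # fetch, then update the known set with outlinks
            fetched.add(url)
            known.update(links.get(url, []))
    return fetched


board = build_board()
single_seed = len(crawl(board, ["index/1"], rounds=5))
all_seeds = len(crawl(board, [f"index/{i}" for i in range(1, 71)], rounds=2))
```

In this model each round reaches only one more index page, so a single seed with 5 rounds fetches a small fraction of the 770 URLs, while seeding every index page covers all of them in 2 rounds.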

By default Nutch will ignore URLs with "?" in them, as this introduces
program parameters (a query string). This is generally a wise thing to do,
but you can override it by examining the regular expressions used to filter
pages (find the regex config files).
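
A small sketch of how such a filter file behaves, written in Python rather than as Nutch's own implementation. It assumes the usual convention: each rule is a "+" (accept) or "-" (reject) prefix followed by a regex, the first rule that matches decides, and a URL matching no rule is rejected. The `-[?*!@=]` rule is quoted from memory of the stock filter file, so check it against your own conf directory. The board URL pattern is hypothetical.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


DEFAULT_RULES = parse_rules(r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
""")

# Hypothetical override: accept the board's paginated listing before the
# generic "?" rule gets a chance to reject it.
BOARD_RULES = parse_rules(r"""
+^http://board\.example\.com/list\?page=\d+$
-[?*!@=]
+.
""")
```

With these rules, `accepts(DEFAULT_RULES, "http://board.example.com/list?page=2")` is False and the same call with `BOARD_RULES` is True, while other query URLs on the site stay filtered out. Putting the narrow accept rule first keeps the crawl from wandering into every parameterized link.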
