Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2014/01/26 04:39:50 UTC

Strange: Nutch didn't crawl level 2 (depth 2) pages

Hi,

I just realized that my Nutch does not crawl the articles/pages (depth 2)
that are shown on the frontpage.
My target URL is: http://bappenas.go.id

As shown on that frontpage (top right, below the slider banners) there is a
text link:

"Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot"
and its URL:
http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?&kid=1390691937

I tried to search for the keyword "Liberal Party" (with quotes), which appears
on the page linked above, but got no results :(

Following is the search link queried:
http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22

I use the following script of individual commands to crawl:

===
# Defines env variables
export JAVA_HOME="/opt/searchengine/jdk1.7.0_45"
export PATH="$JAVA_HOME/bin:$PATH"
NUTCH="/opt/searchengine/nutch"

# Start by injecting the seed url(s) to the nutch crawldb:
$NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb $NUTCH/urls/seed.txt

# Generate fetch list
$NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb $NUTCH/BappenasCrawl/segments

# last segment
export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr $NUTCH/BappenasCrawl/segments | tail -1`

# Launch the crawler!
$NUTCH/bin/nutch fetch $SEGMENT -noParsing

# Parse the fetched content:
$NUTCH/bin/nutch parse $SEGMENT

# We need to update the crawl database to ensure that, for all future
# crawls, Nutch only checks the already crawled pages and only fetches new
# and changed pages.
$NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter -normalize

# Index our crawl DB with Solr
$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ \
  $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
===

I run this script daily, but it seems it never reaches the individual article
pages shown on the frontpage.

From what Tejas explained on another thread (quoted below), should I loop
(generate -> fetch -> parse -> update) two or three times to produce 2 or 3
depth levels?

QUOTES from Tejas' e-mail (subject: Questions/issues with nutch):
*****************
On Sat, Jun 29, 2013 at 2:49 PM, Tejas Patil <te...@gmail.com>wrote:
Yes. Nutch would parse the HTML and extract the content out of it. Tweaking
around the code surrounding the parser would have made that happen. If you
did something else, would you mind sharing it ?

The "depth" is used by the Crawl class in 1.x which is deprecated in 2.x.
Use bin/crawl instead.
While running the "bin/crawl" script, the "<numberOfRounds>" option is
nothing but the depth till which you want the crawling to be performed.

If you want to use the individual commands instead, run generate -> fetch
-> parse -> update multiple times. The crawl script internally does the
same thing.
eg. If you want to fetch till depth 3, this is how you could do:
inject -> (generate -> fetch -> parse -> update)
          -> (generate -> fetch -> parse -> update)
          -> (generate -> fetch -> parse -> update)
               -> solrindex
*****************
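
If that is the case, I guess the individual commands above would just need to
be wrapped in a loop, roughly like this (an untested sketch using the same
paths as my script; DEPTH is only an example value):

===
# untested sketch: repeat generate -> fetch -> parse -> updatedb DEPTH times
DEPTH=3
for i in `seq 1 $DEPTH`; do
  $NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb $NUTCH/BappenasCrawl/segments
  # pick up the segment that was just generated
  SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr $NUTCH/BappenasCrawl/segments | tail -1`
  $NUTCH/bin/nutch fetch $SEGMENT -noParsing
  $NUTCH/bin/nutch parse $SEGMENT
  $NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter -normalize
done

# index everything at the end; bin/crawl does the same loop internally
# (check its usage output for the exact arguments in your version)
$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ \
  $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
===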

I have also commented out the line below in the regex-urlfilter.txt file:
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
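
For completeness, the tail of my regex-urlfilter.txt now looks roughly like
this; the host-restricting rule is shown only as an illustration (and is
commented out), while the final catch-all accept rule is the stock one:

===
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# illustration only: restrict the crawl to the target host
# +^https?://([a-z0-9-]+\.)*bappenas\.go\.id/

# accept anything else
+.
===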

Apps: nutch 1.7 and Solr 4.5.1

Thank you so much!

-- 
wassalam,
[bayu]

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Yupe, thanks!

---
wassalam,
[bayu]

/sent from Android phone/

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Tejas Patil <te...@gmail.com>.
On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

> Hi Tejas,
>
> It's works and great! :)
> After reconfigured and many times of generate, fetch, parse & update, the
> pages on 2nd level is being crawled.
>
> 1 question, Is it fine and correct if I modified my current
> crawler+indexing script into this pseudo (skeleton):
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> # example number of levels / depth (loop)
> LOOP=4
>
> nutch->inject()
>
> loop[ =< $LOOP]
> {
>     nutch->generate()
>     nutch->fetch(a_segment)
>     nutch->parse(a_segment)
>     nutch->updatedb(a_segment)
> }
>
> nutch->solrindex()
>
I don't think that this should be a problem. Remember to pass all the
segments generated in the crawl loop to the solrindex job using the "-dir"
option.

>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Thank you!

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Tejas,

It works great! :)
After reconfiguring and running generate, fetch, parse & update several times,
the 2nd-level pages are now being crawled.

One question: is it fine and correct if I modify my current crawler+indexing
script into this pseudo-code (skeleton):

>>>>>>>>>>>>>>>>>>>>>>>>>>>
# example number of levels / depth (loop)
LOOP=4

nutch->inject()

loop[ =< $LOOP]
{
    nutch->generate()
    nutch->fetch(a_segment)
    nutch->parse(a_segment)
    nutch->updatedb(a_segment)
}

nutch->solrindex()

>>>>>>>>>>>>>>>>>>>>>>>>>>>

Thank you!





-- 
wassalam,
[bayu]

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Bayu Widyasanyata <bw...@gmail.com>.
OK, I will apply it first and report back with the result.

Thanks.-





-- 
wassalam,
[bayu]

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Tejas Patil <te...@gmail.com>.
Please copy this at the end (but above the end tag '</configuration>') in
your $NUTCH/conf/nutch-site.xml:

<property>
  <name>http.content.limit</name>
  <value>999999999</value>
</property>

<property>
  <name>http.timeout</name>
  <value>2147483640</value>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>999999999</value>
</property>

Please check if the url got fetched correctly after every round:
For the first round with the seed http://bappenas.go.id, after the "updatedb"
job, run these to check if they are in the crawldb. The first url must be
db_fetched while the second one must be db_unfetched:

bin/nutch readdb <YOUR_CRAWLDB> -url http://bappenas.go.id/
bin/nutch readdb <YOUR_CRAWLDB> -url
http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/

Now crawl for the next depth. After the "updatedb" job, check if the second url
got fetched using the same command again, i.e.
bin/nutch readdb <YOUR_CRAWLDB> -url
http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/

Note that if there was any redirection, you need to look for the target url
in the redirection chain and use that url ahead for debugging. Verify if
the content you got for that url had text "Liberal Party" in the parsed
output using this command:

bin/nutch readseg -get <LATEST_SEGMENT>
http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/

For larger segments, you might get an OOM error. In that case, take the
entire segment dump using:
bin/nutch readseg -dump <LATEST_SEGMENT>  <OUTPUT>
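
To save some typing, the checks above can be wrapped into a rough helper
script; this sketch assumes the crawldb and segments paths from your script,
so adjust them if yours differ:

CRAWLDB=$NUTCH/BappenasCrawl/crawldb
SEGMENTS=$NUTCH/BappenasCrawl/segments
URL="http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/"

# status of the seed and of the article url in the crawldb
$NUTCH/bin/nutch readdb $CRAWLDB -url http://bappenas.go.id/
$NUTCH/bin/nutch readdb $CRAWLDB -url $URL

# parsed content of the article url in the most recent segment
SEGMENT=$SEGMENTS/`ls -tr $SEGMENTS | tail -1`
$NUTCH/bin/nutch readseg -get $SEGMENT $URL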

After all this is verified and everything looks good from the crawling
side, run solrindex and check if you get the query results. If not, then
there was a problem while indexing the stuff.

Thanks,
Tejas

