Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2014/02/02 13:24:13 UTC

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Hi Tejas,

It works, and it's great! :)
After reconfiguring and running generate, fetch, parse & update several
times, the pages on the 2nd level are now being crawled.

One question: is it fine and correct if I modify my current
crawler+indexing script into this pseudocode (skeleton)?

>>>>>>>>>>>>>>>>>>>>>>>>>>>
# example number of levels / depth (loop)
LOOP=4

nutch->inject()

loop [i <= $LOOP]
{
    nutch->generate()
    nutch->fetch(a_segment)
    nutch->parse(a_segment)
    nutch->updatedb(a_segment)
}

nutch->solrindex()

>>>>>>>>>>>>>>>>>>>>>>>>>>>
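
For reference, here is a runnable bash sketch of that skeleton, reusing the
$NUTCH path and BappenasCrawl layout from the script quoted below; the LOOP
count and the Solr core URL are illustrative:

===
#!/bin/bash
# Number of generate -> fetch -> parse -> updatedb rounds (one per depth level)
LOOP=4

NUTCH="/opt/searchengine/nutch"
CRAWL="$NUTCH/BappenasCrawl"

# Inject the seed url(s) once
$NUTCH/bin/nutch inject $CRAWL/crawldb $NUTCH/urls/seed.txt

for ((i = 1; i <= LOOP; i++)); do
    $NUTCH/bin/nutch generate $CRAWL/crawldb $CRAWL/segments

    # Pick the segment that was just generated (the newest one)
    SEGMENT=$CRAWL/segments/$(ls -tr $CRAWL/segments | tail -1)

    $NUTCH/bin/nutch fetch $SEGMENT -noParsing
    $NUTCH/bin/nutch parse $SEGMENT
    $NUTCH/bin/nutch updatedb $CRAWL/crawldb $SEGMENT -filter -normalize
done

# Index the crawldb and all segments at once via -dir
$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ \
    $CRAWL/crawldb -dir $CRAWL/segments
===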

Thank you!


On Mon, Jan 27, 2014 at 3:46 AM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> OK I will apply it first and update the result.
>
> Thanks.-
>
>
> On Sun, Jan 26, 2014 at 11:01 PM, Tejas Patil <te...@gmail.com> wrote:
>
>> Please copy this at the end (but above the closing '</configuration>' tag)
>> of your $NUTCH/conf/nutch-site.xml. These raise the per-page content limit,
>> the HTTP timeout, and the per-page outlink cap, so large pages are fetched
>> completely and all of their outlinks make it into the crawldb:
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>999999999</value>
>> </property>
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>2147483640</value>
>> </property>
>>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>999999999</value>
>> </property>
>>
>> Please check whether the urls got fetched correctly after every round.
>> For the first round, with http://bappenas.go.id as the seed, run these
>> after the "updatedb" job to check whether the urls are in the crawldb. The
>> first url must be db_fetched, while the second one must be db_unfetched:
>>
>> bin/nutch readdb <YOUR_CRAWLDB> -url http://bappenas.go.id/
>> bin/nutch readdb <YOUR_CRAWLDB> -url
>>
>> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
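>>
>> If helpful, you can also print aggregate per-status counts for the whole
>> crawldb (same placeholder as above) and watch db_unfetched shrink across
>> rounds:
>>
>> bin/nutch readdb <YOUR_CRAWLDB> -stats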
>>
>> Now crawl the next depth. After the "updatedb" job, check whether the
>> second url got fetched, using the same command again, i.e.:
>> bin/nutch readdb <YOUR_CRAWLDB> -url
>>
>> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
>>
>> Note that if there was any redirection, you need to look for the target
>> url in the redirection chain and use that url for further debugging.
>> Verify that the content you got for that url has the text "Liberal Party"
>> in the parsed output using this command:
>>
>> bin/nutch readseg -get <LATEST_SEGMENT>
>>
>> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
>>
>> For larger segments, you might get an OOM error; in that case, take the
>> entire segment dump using:
>> bin/nutch readseg -dump <LATEST_SEGMENT> <OUTPUT>
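>>
>> You can then search the dump output for the expected text with a plain
>> grep, for example:
>>
>> grep -ri "Liberal Party" <OUTPUT>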
>>
>> After all this is verified and everything looks good on the crawling
>> side, run solrindex and check whether you get the query results. If not,
>> the problem is on the indexing side.
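>>
>> One quick way to check the indexed results is to query the Solr core
>> directly, for example (core URL taken from your script; adjust as needed):
>>
>> curl 'http://localhost:8080/solr/bappenasgoid/select?q=%22Liberal+Party%22&wt=json'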
>>
>> Thanks,
>> Tejas
>>
>>
>> On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata
>> <bw...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I just realized that my nutch didn't crawl the articles/pages (depth 2)
>> > which are shown on the frontpage.
>> > My target URL is: http://bappenas.go.id
>> >
>> > As shown on that frontpage (top right, below the slider banners), there
>> > is a text link:
>> >
>> > "Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot"
>> > and its URL:
>> >
>> >
>> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?&kid=1390691937
>> >
>> > I tried to search for the keyword "Liberal Party" (with quotes), which
>> > appears on the page linked above, but got no results :(
>> >
>> > The following is the search link that was queried:
>> >
>> >
>> http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22
>> >
>> > I use the individual-commands script below to crawl:
>> >
>> > ===
>> > # Defines env variables
>> > export JAVA_HOME="/opt/searchengine/jdk1.7.0_45"
>> > export PATH="$JAVA_HOME/bin:$PATH"
>> > NUTCH="/opt/searchengine/nutch"
>> >
>> > # Start by injecting the seed url(s) to the nutch crawldb:
>> > $NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb
>> $NUTCH/urls/seed.txt
>> >
>> > # Generate fetch list
>> > $NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb
>> > $NUTCH/BappenasCrawl/segments
>> >
>> > # last segment
>> > export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr
>> > $NUTCH/BappenasCrawl/segments|tail -1`
>> >
>> > # Launch the crawler!
>> > $NUTCH/bin/nutch fetch $SEGMENT -noParsing
>> >
>> > # Parse the fetched content:
>> > $NUTCH/bin/nutch parse $SEGMENT
>> >
>> > # We need to update the crawl database to ensure that for all future
>> > crawls, Nutch only checks the already crawled pages, and only fetches
>> new
>> > and changed pages.
>> > $NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter
>> > -normalize
>> >
>> > # Indexing our crawl DB with solr
>> > $NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/
>> > $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
>> > ===
>> >
>> > I run this script daily, but it looks like it never reaches the
>> > individual article pages shown on the frontpage.
>> >
>> > From what Tejas explained on another thread (quoted below), should I
>> > loop (generate -> fetch -> parse -> update) two or three times to reach
>> > depth levels 2 or 3?
>> >
>> > QUOTES from Tejas' e-mail (subject: Questions/issues with nutch):
>> > *****************
>> > On Sat, Jun 29, 2013 at 2:49 PM, Tejas Patil <tejas.patil.cs@gmail.com
>> > > wrote:
>> > Yes. Nutch would parse the HTML and extract the content out of it.
>> Tweaking
>> > around the code surrounding the parser would have made that happen. If
>> you
>> > did something else, would you mind sharing it?
>> >
>> > The "depth" is used by the Crawl class in 1.x which is deprecated in
>> 2.x.
>> > Use bin/crawl instead.
>> > While running the "bin/crawl" script, the "<numberOfRounds>" option is
>> > nothing but the depth till which you want the crawling to be performed.
>> >
>> > If you want to use the individual commands instead, run generate ->
>> fetch
>> > -> parse -> update multiple times. The crawl script internally does the
>> > same thing.
>> > eg. If you want to fetch till depth 3, this is how you could do:
>> > inject -> (generate -> fetch -> parse -> update)
>> >           -> (generate -> fetch -> parse -> update)
>> >           -> (generate -> fetch -> parse -> update)
>> >                -> solrindex
>> > *****************
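>> >
>> > For example, a minimal depth-3 run with the 1.x crawl script, assuming
>> > its usage is "crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
>> > (paths and Solr core URL are illustrative):
>> >
>> > bin/crawl urls BappenasCrawl http://localhost:8080/solr/bappenasgoid/ 3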
>> >
>> > I have also commented out the line below in the regex-urlfilter.txt file:
>> > # skip URLs containing certain characters as probable queries, etc.
>> > #-[?*!@=]
>> >
>> > Apps: nutch 1.7 and Solr 4.5.1
>> >
>> > Thank you so much!
>> >
>> > --
>> > wassalam,
>> > [bayu]
>> >
>>
>
>
>
> --
> wassalam,
> [bayu]
>



-- 
wassalam,
[bayu]

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Yupe, thanks!

---
wassalam,
[bayu]

/sent from Android phone/

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

Posted by Tejas Patil <te...@gmail.com>.
On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> Hi Tejas,
>
> It works, and it's great! :)
> After reconfiguring and running generate, fetch, parse & update several
> times, the pages on the 2nd level are now being crawled.
>
> One question: is it fine and correct if I modify my current
> crawler+indexing script into this pseudocode (skeleton)?
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> # example number of levels / depth (loop)
> LOOP=4
>
> nutch->inject()
>
> loop [i <= $LOOP]
> {
>     nutch->generate()
>     nutch->fetch(a_segment)
>     nutch->parse(a_segment)
>     nutch->updatedb(a_segment)
> }
>
> nutch->solrindex()
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Thank you!

I don't think that this should be a problem. Remember to pass all the
segments generated in the crawl loop to the solrindex job using the "-dir"
option.
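
A minimal sketch of that last step, reusing the crawl layout from the script
earlier in the thread (the Solr core URL is illustrative):

$NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ \
    $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments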