Posted to user@nutch.apache.org by Rajani Maski <ra...@gmail.com> on 2012/12/17 06:48:06 UTC

Crawling localhost Webapps - regex-urlfilter query

Hi users,

   I am trying to crawl web applications running on a local Apache
Tomcat web server. Note: Tomcat version 7, running on port 8080.


The main HTML page is http://43.44.111.123:8080/nutch-test-site/ch-1.html.
This main page has a hyperlink to its child page,
http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html,
and that child in turn links to its own child,
http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html


Now *I would like to know what filter has to be given in
regex-urlfilter.txt to accept this site for crawling*,
because the log says "No more URLs to fetch." This seems to be a mistake
in my regex-urlfilter.txt or seed.txt.

I tried the following setups:

*Case 1*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/nutch-test-site/child-1.html

seed.txt -
  http://43.44.111.123:8080/nutch-test-site/child-1.html


*Case 2*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/

seed.txt -
  http://43.44.111.123:8080/nutch-test-site/child-1.html


*Case 3*
regex-urlfilter.txt  -
   # accept anything else
   +^http://43.44.111.123:8080/

seed.txt -
  http://43.44.111.123:8080/


Output : Stopping at depth=1 - no more URLs to fetch.


*Nutch command: *
* bin/nutch crawl urls -dir tomcatcrawl -solr
http://localhost:8080/solrnutch -depth 3 -topN 5 *
Can you please point out my mistake here?

Regards
Rajani.

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Additionally, wget is great for fetching pages on the fly, but it does not
necessarily mean that your Nutch crawler will and/or should be able to
fetch the page.

I would always recommend using the ParserChecker [0] tool for on-the-fly
fetching and parse checking. It can be run from the command line very
easily.
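
For example, something like this should work for your setup (assuming a
recent 1.x release where the bin/nutch script exposes the tool as
"parsechecker"; the URL is just the seed from earlier in this thread):

bin/nutch parsechecker -dumpText http://localhost:8080/nutch-test-site/child-1.html

It fetches and parses that single page on the spot and prints the parse
status and the extracted outlinks, so you can see straight away whether the
child links are being discovered.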

hth

Lewis

[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java

On Wed, Dec 19, 2012 at 1:20 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> This sounds most like non-existence of robots.txt on the webserver.
>
> Lewis
>
>
> On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <ra...@gmail.com>wrote:
>
>> Hi Tejas,
>>
>>  I found out the reason for why the blog was not getting crawled :
>> http://rajinimaski.blogspot.in/
>> This is because of the proxy that has filter(block) for blog sites. Used
>> different IP and
>>  Now I am able to crawl the above blog site successfully.
>>
>> However the html files that I have put in local tomcat webserver are not
>> getting crawled and there are no errors also. attached is the log file and
>> html sample pages.I will look at the robot rules for this and get back.
>>
>> Thanks very much
>> Regards
>> Rajani
>>
>>
>>
>>
>>
>> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <te...@gmail.com>wrote:
>>
>>> Hi Rajani,
>>>
>>> *Robot rules? I didn't get this check. Did you mean any setting in
>>> nutch-site
>>> xml ?*
>>> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>>
>>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
>>> end.
>>> Without any error or exception its hard to tell issue. Set the logger to
>>> TRACE or DEBUG and see the logs created for the fetch phase.
>>> There must be some message regarding the url like
>>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http
>>> code=403,
>>> url=http://www.abcd.edu/~pqr/homework.html
>>> or
>>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching
>>> http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
>>> http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>>
>>> or something else that can shed the light on the issue.
>>>
>>> Thanks,
>>> Tejas Patil
>>>
>>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <ra...@gmail.com>
>>> wrote:
>>>
>>> > Hi Tejas,
>>> > Thank you for detailed information. For the checks,
>>> >
>>> > Check 1  - can the url be fetched via wget command :
>>> >
>>> > ubuntu@ubuntu-OptiPlex-390:~$ wget
>>> > http://localhost:8080/nutch-test-site/child-1.html
>>> > --2012-12-18 16:07:34--
>>> > http://localhost:8080/nutch-test-site/child-1.html
>>> > Resolving localhost (localhost)... 127.0.0.1
>>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>>> > HTTP request sent, awaiting response... 200 OK
>>> > Length: 102 [text/html]
>>> > Saving to: `child-1.html.1'
>>> >
>>> > 100%[======================================>] 102         --.-K/s   in
>>> 0s
>>> >
>>> >
>>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>>> >
>>> > Check 2 : what are the robots rules defined for the host ? Do they
>>> allow
>>> > the
>>> > crawler to crawl that url ? this will address #5.
>>> > Robot rules? I didn't get this check. Did you mean any setting in
>>> > nutch-site xml ?
>>> >
>>> > 3. After changing the parent page url from IP based to localhost and
>>> > running a *fresh* crawl, did you see any error or exception in the
>>> logs ?
>>> > try running fresh crawl in local mode, its helps in debugging things
>>> > quickly.
>>> >
>>> > Did a fresh crawl. There are no errors only warnings. The stats is
>>> same as
>>> > above.
>>> > configuration : regexurl-filter.txt has "+." and urls/seed.txt has
>>> > http://localhost:8080/nutch-test-site/child-1.html
>>> >
>>> > Also important observation is when I set other sites for crawling like
>>> > http://viterbi.usc.edu/admission/ etc.,. crawl is successful and
>>> indexed
>>> > to
>>> > solr. But when I crawl the above html page nothing is fetched. Also
>>> when I
>>> > am trying to crawl the site: http://rajinimaski.blogspot.in/  (this
>>> has 3
>>> > blogs) there is 403 status - failed to fetch.
>>> >
>>> >
>>> > thanks & Regards
>>> > Rajani
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <tejas.patil.cs@gmail.com
>>> > >wrote:
>>> >
>>> > > Hi Rajani,
>>> > >
>>> > > A url is marked as "db_gone" when nutch receives below HTTP error
>>> codes
>>> > for
>>> > > the request:
>>> > > 1. Bad request (error code: 400)
>>> > > 2. Not found (error code: 404)
>>> > > 3. Access denied (error code: 401)
>>> > > 4. Permanently gone (error code: 410)
>>> > >
>>> > > Apart from this, a url can also be marked as "db_gone" if:
>>> > > 5. its not getting crawled due to "Robots denied" or
>>> > > 6. some exception is triggered while fetching the content from the
>>> server
>>> > > (eg. Read time out, Broken socket etc.)
>>> > >
>>> > > (NOTE: as we are dealing with a HTTP url here, it made sense to
>>> focus on
>>> > > HTTP codes only. For FTP protocol, nutch has similar stuff. I
>>> preferred
>>> > to
>>> > > avoid discussing that.)
>>> > >
>>> > > The reason why you could not see the child pages in the crawldb:
>>> because
>>> > > the parent page has not been fetched successfully.
>>> > >
>>> > > Quick checks that you can try:
>>> > > 1. can the url be fetched via wget command
>>> > > <http://linux.die.net/man/1/wget>on the terminal ? this will address
>>> > > cases 1-4
>>> > > 2. what are the robots rules defined for the host ? Do they allow the
>>> > > crawler to crawl that url ? this will address #5.
>>> > > 3. After changing the parent page url from IP based to localhost and
>>> > > running a *fresh* crawl, did you see any error or exception in the
>>> logs ?
>>> > > try running fresh crawl in local mode, its helps in debugging things
>>> > > quickly.
>>> > >
>>> > > Thanks,
>>> > > Tejas Patil
>>> > >
>>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <
>>> rajinimaski@gmail.com
>>> > > >wrote:
>>> > >
>>> > > >  Can you please tell me what does this mean : Status: 3 (db_gone)
>>> > >
>>> >
>>>
>>
>>
>
>
> --
> *Lewis*
>



-- 
*Lewis*

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Lewis John Mcgibbney <le...@gmail.com>.
This sounds most like the non-existence of a robots.txt on the web server.
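
If that is the case, a minimal permissive robots.txt is enough. For Tomcat
it usually has to live in the webapp that answers the root context (by
default webapps/ROOT - an assumption, adjust for your layout) so that
http://<host>:8080/robots.txt resolves:

User-agent: *
Disallow:

An empty Disallow line means everything is allowed.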

Lewis

On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <ra...@gmail.com> wrote:

> Hi Tejas,
>
>  I found out the reason for why the blog was not getting crawled :
> http://rajinimaski.blogspot.in/
> This is because of the proxy that has filter(block) for blog sites. Used
> different IP and
>  Now I am able to crawl the above blog site successfully.
>
> However the html files that I have put in local tomcat webserver are not
> getting crawled and there are no errors also. attached is the log file and
> html sample pages.I will look at the robot rules for this and get back.
>
> Thanks very much
> Regards
> Rajani
>
>
>
>
>
> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <te...@gmail.com>wrote:
>
>> Hi Rajani,
>>
>> *Robot rules? I didn't get this check. Did you mean any setting in
>> nutch-site
>> xml ?*
>> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>
>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
>> end.
>> Without any error or exception its hard to tell issue. Set the logger to
>> TRACE or DEBUG and see the logs created for the fetch phase.
>> There must be some message regarding the url like
>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http
>> code=403,
>> url=http://www.abcd.edu/~pqr/homework.html
>> or
>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching
>> http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
>> http://www.ics.uci.edu/~dan/class/260/notes/
>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>
>> or something else that can shed the light on the issue.
>>
>> Thanks,
>> Tejas Patil
>>
>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <ra...@gmail.com>
>> wrote:
>>
>> > Hi Tejas,
>> > Thank you for detailed information. For the checks,
>> >
>> > Check 1  - can the url be fetched via wget command :
>> >
>> > ubuntu@ubuntu-OptiPlex-390:~$ wget
>> > http://localhost:8080/nutch-test-site/child-1.html
>> > --2012-12-18 16:07:34--
>> > http://localhost:8080/nutch-test-site/child-1.html
>> > Resolving localhost (localhost)... 127.0.0.1
>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>> > HTTP request sent, awaiting response... 200 OK
>> > Length: 102 [text/html]
>> > Saving to: `child-1.html.1'
>> >
>> > 100%[======================================>] 102         --.-K/s   in
>> 0s
>> >
>> >
>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>> >
>> > Check 2 : what are the robots rules defined for the host ? Do they allow
>> > the
>> > crawler to crawl that url ? this will address #5.
>> > Robot rules? I didn't get this check. Did you mean any setting in
>> > nutch-site xml ?
>> >
>> > 3. After changing the parent page url from IP based to localhost and
>> > running a *fresh* crawl, did you see any error or exception in the logs
>> ?
>> > try running fresh crawl in local mode, its helps in debugging things
>> > quickly.
>> >
>> > Did a fresh crawl. There are no errors only warnings. The stats is same
>> as
>> > above.
>> > configuration : regexurl-filter.txt has "+." and urls/seed.txt has
>> > http://localhost:8080/nutch-test-site/child-1.html
>> >
>> > Also important observation is when I set other sites for crawling like
>> > http://viterbi.usc.edu/admission/ etc.,. crawl is successful and
>> indexed
>> > to
>> > solr. But when I crawl the above html page nothing is fetched. Also
>> when I
>> > am trying to crawl the site: http://rajinimaski.blogspot.in/  (this
>> has 3
>> > blogs) there is 403 status - failed to fetch.
>> >
>> >
>> > thanks & Regards
>> > Rajani
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <tejas.patil.cs@gmail.com
>> > >wrote:
>> >
>> > > Hi Rajani,
>> > >
>> > > A url is marked as "db_gone" when nutch receives below HTTP error
>> codes
>> > for
>> > > the request:
>> > > 1. Bad request (error code: 400)
>> > > 2. Not found (error code: 404)
>> > > 3. Access denied (error code: 401)
>> > > 4. Permanently gone (error code: 410)
>> > >
>> > > Apart from this, a url can also be marked as "db_gone" if:
>> > > 5. its not getting crawled due to "Robots denied" or
>> > > 6. some exception is triggered while fetching the content from the
>> server
>> > > (eg. Read time out, Broken socket etc.)
>> > >
>> > > (NOTE: as we are dealing with a HTTP url here, it made sense to focus
>> on
>> > > HTTP codes only. For FTP protocol, nutch has similar stuff. I
>> preferred
>> > to
>> > > avoid discussing that.)
>> > >
>> > > The reason why you could not see the child pages in the crawldb:
>> because
>> > > the parent page has not been fetched successfully.
>> > >
>> > > Quick checks that you can try:
>> > > 1. can the url be fetched via wget command
>> > > <http://linux.die.net/man/1/wget>on the terminal ? this will address
>> > > cases 1-4
>> > > 2. what are the robots rules defined for the host ? Do they allow the
>> > > crawler to crawl that url ? this will address #5.
>> > > 3. After changing the parent page url from IP based to localhost and
>> > > running a *fresh* crawl, did you see any error or exception in the
>> logs ?
>> > > try running fresh crawl in local mode, its helps in debugging things
>> > > quickly.
>> > >
>> > > Thanks,
>> > > Tejas Patil
>> > >
>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <rajinimaski@gmail.com
>> > > >wrote:
>> > >
>> > > >  Can you please tell me what does this mean : Status: 3 (db_gone)
>> > >
>> >
>>
>
>


-- 
*Lewis*

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Rajani Maski <ra...@gmail.com>.
Hi Tejas,

 I found out the reason why the blog http://rajinimaski.blogspot.in/ was
not getting crawled: the proxy here has a filter that blocks blog sites. I
used a different IP, and now I am able to crawl that blog site successfully.

However, the HTML files that I have put on the local Tomcat web server are
still not getting crawled, and there are no errors either. Attached are the
log file and the sample HTML pages. I will look at the robots rules for
this and get back.

Thanks very much
Regards
Rajani





On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <te...@gmail.com>wrote:

> Hi Rajani,
>
> *Robot rules? I didn't get this check. Did you mean any setting in
> nutch-site
> xml ?*
> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>
> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
> end.
> Without any error or exception its hard to tell issue. Set the logger to
> TRACE or DEBUG and see the logs created for the fetch phase.
> There must be some message regarding the url like
> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http
> code=403,
> url=http://www.abcd.edu/~pqr/homework.html
> or
> 2012-12-18 11:24:58,436 TRACE http.Http - fetching
> http://www.ics.uci.edu/~dan/class/260/notes/
> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
> http://www.ics.uci.edu/~dan/class/260/notes/
> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>
> or something else that can shed the light on the issue.
>
> Thanks,
> Tejas Patil
>
> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <ra...@gmail.com>
> wrote:
>
> > Hi Tejas,
> > Thank you for detailed information. For the checks,
> >
> > Check 1  - can the url be fetched via wget command :
> >
> > ubuntu@ubuntu-OptiPlex-390:~$ wget
> > http://localhost:8080/nutch-test-site/child-1.html
> > --2012-12-18 16:07:34--
> > http://localhost:8080/nutch-test-site/child-1.html
> > Resolving localhost (localhost)... 127.0.0.1
> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 102 [text/html]
> > Saving to: `child-1.html.1'
> >
> > 100%[======================================>] 102         --.-K/s   in 0s
> >
> >
> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
> >
> > Check 2 : what are the robots rules defined for the host ? Do they allow
> > the
> > crawler to crawl that url ? this will address #5.
> > Robot rules? I didn't get this check. Did you mean any setting in
> > nutch-site xml ?
> >
> > 3. After changing the parent page url from IP based to localhost and
> > running a *fresh* crawl, did you see any error or exception in the logs ?
> > try running fresh crawl in local mode, its helps in debugging things
> > quickly.
> >
> > Did a fresh crawl. There are no errors only warnings. The stats is same
> as
> > above.
> > configuration : regexurl-filter.txt has "+." and urls/seed.txt has
> > http://localhost:8080/nutch-test-site/child-1.html
> >
> > Also important observation is when I set other sites for crawling like
> > http://viterbi.usc.edu/admission/ etc.,. crawl is successful and indexed
> > to
> > solr. But when I crawl the above html page nothing is fetched. Also when
> I
> > am trying to crawl the site: http://rajinimaski.blogspot.in/  (this has
> 3
> > blogs) there is 403 status - failed to fetch.
> >
> >
> > thanks & Regards
> > Rajani
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <tejas.patil.cs@gmail.com
> > >wrote:
> >
> > > Hi Rajani,
> > >
> > > A url is marked as "db_gone" when nutch receives below HTTP error codes
> > for
> > > the request:
> > > 1. Bad request (error code: 400)
> > > 2. Not found (error code: 404)
> > > 3. Access denied (error code: 401)
> > > 4. Permanently gone (error code: 410)
> > >
> > > Apart from this, a url can also be marked as "db_gone" if:
> > > 5. its not getting crawled due to "Robots denied" or
> > > 6. some exception is triggered while fetching the content from the
> server
> > > (eg. Read time out, Broken socket etc.)
> > >
> > > (NOTE: as we are dealing with a HTTP url here, it made sense to focus
> on
> > > HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred
> > to
> > > avoid discussing that.)
> > >
> > > The reason why you could not see the child pages in the crawldb:
> because
> > > the parent page has not been fetched successfully.
> > >
> > > Quick checks that you can try:
> > > 1. can the url be fetched via wget command
> > > <http://linux.die.net/man/1/wget>on the terminal ? this will address
> > > cases 1-4
> > > 2. what are the robots rules defined for the host ? Do they allow the
> > > crawler to crawl that url ? this will address #5.
> > > 3. After changing the parent page url from IP based to localhost and
> > > running a *fresh* crawl, did you see any error or exception in the
> logs ?
> > > try running fresh crawl in local mode, its helps in debugging things
> > > quickly.
> > >
> > > Thanks,
> > > Tejas Patil
> > >
> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <rajinimaski@gmail.com
> > > >wrote:
> > >
> > > >  Can you please tell me what does this mean : Status: 3 (db_gone)
> > >
> >
>

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Tejas Patil <te...@gmail.com>.
Hi Rajani,

*Robot rules? I didn't get this check. Did you mean any setting in nutch-site
xml ?*
No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard

I was able to crawl http://rajinimaski.blogspot.in/ successfully at my end.
Without an error or exception it is hard to tell what the issue is. Set the
logger to TRACE or DEBUG and look at the logs created for the fetch phase.
There must be some message regarding the url, like:
fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403,
url=http://www.abcd.edu/~pqr/homework.html
or
2012-12-18 11:24:58,436 TRACE http.Http - fetching
http://www.ics.uci.edu/~dan/class/260/notes/
2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from
http://www.ics.uci.edu/~dan/class/260/notes/
2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required

or something else that can shed light on the issue.
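
Raising the verbosity is a one-line change per logger in
conf/log4j.properties; a sketch of what I would add (the logger names are
taken from the sample messages above, so treat them as assumptions and
adjust if your build logs under different names):

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
log4j.logger.org.apache.nutch.protocol.http.Http=TRACE

Then rerun the crawl and grep logs/hadoop.log for your url.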

Thanks,
Tejas Patil

On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <ra...@gmail.com> wrote:

> Hi Tejas,
> Thank you for detailed information. For the checks,
>
> Check 1  - can the url be fetched via wget command :
>
> ubuntu@ubuntu-OptiPlex-390:~$ wget
> http://localhost:8080/nutch-test-site/child-1.html
> --2012-12-18 16:07:34--
> http://localhost:8080/nutch-test-site/child-1.html
> Resolving localhost (localhost)... 127.0.0.1
> Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 102 [text/html]
> Saving to: `child-1.html.1'
>
> 100%[======================================>] 102         --.-K/s   in 0s
>
>
> 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>
> Check 2 : what are the robots rules defined for the host ? Do they allow
> the
> crawler to crawl that url ? this will address #5.
> Robot rules? I didn't get this check. Did you mean any setting in
> nutch-site xml ?
>
> 3. After changing the parent page url from IP based to localhost and
> running a *fresh* crawl, did you see any error or exception in the logs ?
> try running fresh crawl in local mode, its helps in debugging things
> quickly.
>
> Did a fresh crawl. There are no errors only warnings. The stats is same as
> above.
> configuration : regexurl-filter.txt has "+." and urls/seed.txt has
> http://localhost:8080/nutch-test-site/child-1.html
>
> Also important observation is when I set other sites for crawling like
> http://viterbi.usc.edu/admission/ etc.,. crawl is successful and indexed
> to
> solr. But when I crawl the above html page nothing is fetched. Also when I
> am trying to crawl the site: http://rajinimaski.blogspot.in/  (this has 3
> blogs) there is 403 status - failed to fetch.
>
>
> thanks & Regards
> Rajani
>
>
>
>
>
> On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <tejas.patil.cs@gmail.com
> >wrote:
>
> > Hi Rajani,
> >
> > A url is marked as "db_gone" when nutch receives below HTTP error codes
> for
> > the request:
> > 1. Bad request (error code: 400)
> > 2. Not found (error code: 404)
> > 3. Access denied (error code: 401)
> > 4. Permanently gone (error code: 410)
> >
> > Apart from this, a url can also be marked as "db_gone" if:
> > 5. its not getting crawled due to "Robots denied" or
> > 6. some exception is triggered while fetching the content from the server
> > (eg. Read time out, Broken socket etc.)
> >
> > (NOTE: as we are dealing with a HTTP url here, it made sense to focus on
> > HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred
> to
> > avoid discussing that.)
> >
> > The reason why you could not see the child pages in the crawldb: because
> > the parent page has not been fetched successfully.
> >
> > Quick checks that you can try:
> > 1. can the url be fetched via wget command
> > <http://linux.die.net/man/1/wget>on the terminal ? this will address
> > cases 1-4
> > 2. what are the robots rules defined for the host ? Do they allow the
> > crawler to crawl that url ? this will address #5.
> > 3. After changing the parent page url from IP based to localhost and
> > running a *fresh* crawl, did you see any error or exception in the logs ?
> > try running fresh crawl in local mode, its helps in debugging things
> > quickly.
> >
> > Thanks,
> > Tejas Patil
> >
> > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <rajinimaski@gmail.com
> > >wrote:
> >
> > >  Can you please tell me what does this mean : Status: 3 (db_gone)
> >
>

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Rajani Maski <ra...@gmail.com>.
Hi Tejas,
Thank you for the detailed information. Here are my results for the checks:

Check 1  - can the url be fetched via wget command :

ubuntu@ubuntu-OptiPlex-390:~$ wget
http://localhost:8080/nutch-test-site/child-1.html
--2012-12-18 16:07:34--  http://localhost:8080/nutch-test-site/child-1.html
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102 [text/html]
Saving to: `child-1.html.1'

100%[======================================>] 102         --.-K/s   in 0s


2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]

Check 2 : what are the robots rules defined for the host ? Do they allow the
crawler to crawl that url ? this will address #5.
Robot rules? I didn't get this check. Did you mean any setting in
nutch-site xml ?

3. After changing the parent page url from IP based to localhost and
running a *fresh* crawl, did you see any error or exception in the logs ?
try running fresh crawl in local mode, its helps in debugging things
quickly.

Did a fresh crawl. There are no errors, only warnings. The stats are the
same as above.
Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
http://localhost:8080/nutch-test-site/child-1.html

Another important observation: when I set up other sites for crawling,
like http://viterbi.usc.edu/admission/, the crawl is successful and indexed
to Solr. But when I crawl the above html page, nothing is fetched. Also,
when I try to crawl the site http://rajinimaski.blogspot.in/ (this has 3
blogs) there is a 403 status - failed to fetch.


thanks & Regards
Rajani





On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <te...@gmail.com>wrote:

> Hi Rajani,
>
> A url is marked as "db_gone" when nutch receives below HTTP error codes for
> the request:
> 1. Bad request (error code: 400)
> 2. Not found (error code: 404)
> 3. Access denied (error code: 401)
> 4. Permanently gone (error code: 410)
>
> Apart from this, a url can also be marked as "db_gone" if:
> 5. its not getting crawled due to "Robots denied" or
> 6. some exception is triggered while fetching the content from the server
> (eg. Read time out, Broken socket etc.)
>
> (NOTE: as we are dealing with a HTTP url here, it made sense to focus on
> HTTP codes only. For FTP protocol, nutch has similar stuff. I preferred to
> avoid discussing that.)
>
> The reason why you could not see the child pages in the crawldb: because
> the parent page has not been fetched successfully.
>
> Quick checks that you can try:
> 1. can the url be fetched via wget command
> <http://linux.die.net/man/1/wget>on the terminal ? this will address
> cases 1-4
> 2. what are the robots rules defined for the host ? Do they allow the
> crawler to crawl that url ? this will address #5.
> 3. After changing the parent page url from IP based to localhost and
> running a *fresh* crawl, did you see any error or exception in the logs ?
> try running fresh crawl in local mode, its helps in debugging things
> quickly.
>
> Thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <rajinimaski@gmail.com
> >wrote:
>
> >  Can you please tell me what does this mean : Status: 3 (db_gone)
>

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Tejas Patil <te...@gmail.com>.
Hi Rajani,

A url is marked as "db_gone" when Nutch receives one of the below HTTP
error codes for the request:
1. Bad request (error code: 400)
2. Not found (error code: 404)
3. Access denied (error code: 401)
4. Permanently gone (error code: 410)

Apart from this, a url can also be marked as "db_gone" if:
5. it is not getting crawled due to "Robots denied", or
6. some exception is triggered while fetching the content from the server
(e.g. read timed out, broken socket, etc.)

(NOTE: as we are dealing with an HTTP url here, it made sense to focus on
HTTP status codes only. Nutch has similar handling for the FTP protocol; I
preferred to avoid discussing that.)

The reason you could not see the child pages in the crawldb is that the
parent page has not been fetched successfully.

Quick checks that you can try:
1. Can the url be fetched via the wget command
<http://linux.die.net/man/1/wget> on the terminal? This will address
cases 1-4.
2. What are the robots rules defined for the host? Do they allow the
crawler to crawl that url? This will address #5 (a quick way to check is
shown below).
3. After changing the parent page url from IP-based to localhost and
running a *fresh* crawl, did you see any error or exception in the logs?
Try running a fresh crawl in local mode; it helps in debugging things
quickly.
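
For check #2 the quickest way is to request the file directly (shown here
with your localhost url as an example):

wget http://localhost:8080/robots.txt

If the file exists, look for Disallow rules that match your path; if the
request returns 404, there is simply no robots.txt on that host.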

Thanks,
Tejas Patil

On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <ra...@gmail.com>wrote:

>  Can you please tell me what does this mean : Status: 3 (db_gone)

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Rajani Maski <ra...@gmail.com>.
Hi Tejas,

  Please find my replies embedded. Thank you for the reply and your time.

>
> *"status 1 (db_unfetched): 1"* means that url [1] is NOT crawled.
> (FYI: it is not interpreted as "db_unfetched - status is 1". The number 1
> here indicates that there is 1 url in the crawldb with status as
> db_unfetched.)
>
> You said that there are no exceptions in the log file. Which log file did
> you see ?

> If you are running in the distributed mode, then you must see the hadoop
> logs (on jobtracker) for the nutch jobs.
>


It is a basic local setup.

I downloaded the binary version of Apache Nutch 1.5.1 and followed the setup
steps mentioned in the wiki. The log file path is
../Downloads/apache-nutch-1.5.1/logs/hadoop.log
   Note: I was using the system IP address yesterday, but today there is an
exception for the same url (Exception: 2012-12-18 12:42:04,558 INFO
 fetcher.Fetcher - fetch of
http://43.44.111.123:8080/nutch-test-site/home.html failed with:
java.net.SocketTimeoutException: Read timed out)
So I changed it to localhost, and now the stats are:

ubuntu@ubuntu-OptiPlex-390:~/Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -stats
CrawlDb statistics start: crawlnewtest/crawldb
Statistics for CrawlDb: crawlnewtest/crawldb
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
CrawlDb statistics: done

>
> Also, can you send the entry of the url [1] from the crawldb ? The command
> is:
> *bin/nutch readdb <path to the crawldb> -url <url>*
> *
> *
>
Command : /Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -url http://localhost:8080/nutch-test-site/home.html
URL: http://localhost:8080/nutch-test-site/home.html
Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0:
http://localhost:8080/nutch-test-site/home.html

 Can you please tell me what this means: Status: 3 (db_gone)? [Or could you
point me to a reference link where I can learn what such a response means?]

> If you are not able to get any output for above command, then get the dump
> of whole crawldb using this command:
> *bin/nutch readdb <path to the crawldb> -dump <output directory>**
>

/Downloads/apache-nutch-1.5.1$ bin/nutch readdb crawlnewtest/crawldb -dump
/home/ubuntu/Downloads/apache-nutch-1.5.1/test
CrawlDb dump: starting
CrawlDb db: crawlnewtest/crawldb
CrawlDb dump: done

Output is :
http://localhost:8080/nutch-test-site/home.html Version: 7
Status: 3 (db_gone)
Fetch time: Fri Feb 01 12:26:54 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: gone(11), lastModified=0:
http://localhost:8080/nutch-test-site/home.html

Regarding point B in your previous email (the case where the db status is
fetched):
*B. The main url gets crawled successfully but the rest 2 child pages are
not getting crawled.*
If the url [1] is db_fetched, then use the same
command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
pages.
Can you please correct me if the command I am using is wrong, because it
lists only the one main url?
Command : /Downloads/apache-nutch-1.5.1$ bin/nutch readdb
crawlnewtest/crawldb -stats

Thank you very much.

Regards
Rajani




> thanks,
> Tejas Patil
>
> On Mon, Dec 17, 2012 at 8:51 PM, Rajani Maski <ra...@gmail.com>
> wrote:
>
> > status 1 (db_unfetched): 1
>

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Tejas Patil <te...@gmail.com>.
Hi Rajani,

*"status 1 (db_unfetched): 1"* means that url [1] is NOT crawled.
(FYI: it is not interpreted as "db_unfetched - status is 1". The number 1
here indicates that there is 1 url in the crawldb with status as
db_unfetched.)

You said that there are no exceptions in the log file. Which log file did
you look at?
If you are running in distributed mode, then you must check the hadoop
logs (on the jobtracker) for the nutch jobs.

Also, can you send the entry of the url [1] from the crawldb ? The command
is:
*bin/nutch readdb <path to the crawldb> -url <url>*

If you are not able to get any output for the above command, then get the
dump of the whole crawldb using this command:
*bin/nutch readdb <path to the crawldb> -dump <output directory>*
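
For example, with the crawl directory from your stats command (assuming the
usual layout where the crawldb sits inside it, and substituting whichever
url you actually seeded):

bin/nutch readdb crawlnewtest/crawldb -url http://localhost:8080/nutch-test-site/child-1.html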

thanks,
Tejas Patil

On Mon, Dec 17, 2012 at 8:51 PM, Rajani Maski <ra...@gmail.com> wrote:

> status 1 (db_unfetched): 1

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Rajani Maski <ra...@gmail.com>.
Hi Tejas,

 Thank you very much for the detailed reply.

Please find the observations embedded in the email :

*A. The main url [1] does NOT get crawled.*
This can happen due to some regex mismatch, or exception while crawling
the url.
One naive way to forget about regex rules is to simply add "+." at the
start of the regex rules file. It will start accepting any url.
Done
Now run a fresh crawl and see if the url is getting fetched or not. How to
check ? Use "bin/nutch CrawlDbReader" command.

Command used is :  bin/nutch readdb crawlnewtest -stats
CrawlDb statistics start: crawlnewtest
Statistics for CrawlDb: crawlnewtest
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done

db_unfetched - status is 1.
The log has only info and warnings, no errors:
WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
WARN  mapred.JobClient - Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.

If the status shown is db_unfetched, then check for logs which can provide
some exceptions responsible for the failure.
If it was fetched, then goto B.

Does the above status mean that url [1] is crawled? If that is the case,
why is it not indexed to Solr? The Nutch command I have used is: bin/nutch
crawl urls -dir crawlnewtest -solr http://localhost:8080/solrnutch -depth 3
-topN 5

I do not see any error logs other than the warnings that I have mentioned
above. I have yet to follow the last few steps of reading the segments.

*B. The main url gets crawled successfully but the rest 2 child pages are
not getting crawled.*
If the url [1] is db_fetched, then use the same
command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
pages.

If they are both db_unfetched, then there was some exception which caused
the issue. See logs for details.

If the child pages are not found in the DB, then there was some issue with
link extraction ie. getting the child links from the content of the main
page.
Read the segment for the first round of crawl (which will have the content
of the main page). Extract the content of the main page using the
"bin/nutch SegmentReader" command. Check if the content fetched has the
child urls in it. If yes, then the issue is with link extraction.









On Mon, Dec 17, 2012 at 2:13 PM, Tejas Patil <te...@gmail.com>wrote:

> Lets break down the possibilities:
>
> *A. The main url [1] does NOT gets crawled. *
> This can happen due to some regex mismatch, or expception while crawling
> the url.
> One naive way to forget about regex rules is to simply add "+." at the
> start of the regex rules file. It will start accepting any url.
> Now run a fresh crawl and see if the url is getting fetched or not. How to
> check ? Use "bin/nutch CrawlDbReader" command.
> If the status shown is db_unfetched, then check for logs which can provide
> some exceptions responsible behind the failiure.
> If it was fetched, then goto B.
>
> *B. The main url gets crawled successfully but the rest 2 child pages are
> not getting crawled.*
> If the url [1] is db_fetched, then use the same
> command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
> pages.
> If they are both db_unfetched, then there was some exception which caused
> the issue. See logs for details.
>
> If the child pages are not found in the DB, then there was some issue with
> link extraction ie. getting the child links from the content of the main
> page.
> Read the segment for the first round of crawl (which will have the content
> of the main page). Extract the content of the main page using the
> "bin/nutch SegmentReader" command. Check if the content fetched has the
> child urls in it. If yes, then the issue is with link extraction.
>
> Please do this and revert back to this group with your observations.
>
> Thanks,
> Tejas Patil
>
>
> [1] : http://43.44.111.123:8080/nutch-test-site/ch-1.html
>
>
> On Sun, Dec 16, 2012 at 9:48 PM, Rajani Maski <ra...@gmail.com>
> wrote:
>
> > Hi users,
> >
> >    I am trying to crawl the web applications running on the local apache
> > tomcat webserver. Note : tomcat version 7, running on 8080 port.
> >
> >
> > The Main html page is :
> > http://43.44.111.123:8080/nutch-test-site/ch-1.html.
> > This main page is having an hyperlink to call its sub child  -
> > http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
> > and the sub-child is again having its own child as hyperlink   -
> > http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html
> >
> >
> > Now *I would like to know what is the filter that has to be given in
> > regex-url-filter.txt to accept crawling for this site*.
> > Because I am getting log as No more urls to fetch. This seems to be
> mistake
> > in my regex-urlfilter.txt or seed.txt
> >
> > I tried with the following cases setup:
> >
> > *Case 1*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> > seed.txt -
> >   http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 2*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/
> >
> > seed.txt -
> >   http://43.44.111.123:8080/nutch-test-site/child-1.html
> >
> >
> > *Case 3*
> > regex-urlfilter.txt  -
> >    # accept anything else
> >    +^http://43.44.111.123:8080/
> >
> > seed.txt -
> >   http://43.44.111.123:8080/
> >
> >
> > Output : Stopping at depth=1 - no more URLs to fetch.
> >
> >
> > *Nutch command: *
> > * bin/nutch crawl urls -dir tomcatcrawl -solr
> > http://localhost:8080/solrnutch -depth 3 -topN 5 *
> > *
> > *
> > *
> > *
> > Can you please point me out the mistake here.?
> >
> > Regards
> > Rajani.
> >
>

Re: Crawling localhost Webapps - regex-urlfilter query

Posted by Tejas Patil <te...@gmail.com>.
Let's break down the possibilities:

*A. The main url [1] does NOT get crawled.*
This can happen due to some regex mismatch, or an exception while crawling
the url.
One naive way to take the regex rules out of the picture is to simply add
"+." at the start of the regex rules file. It will then accept any url.
Now run a fresh crawl and see if the url is getting fetched or not. How to
check? Use the "bin/nutch CrawlDbReader" command.
If the status shown is db_unfetched, then check the logs, which can provide
some exceptions responsible for the failure.
If it was fetched, then go to B.
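
(The command-line alias for CrawlDbReader is "readdb", so with the crawl
directory from your command the check would look roughly like this - the
path is an assumption based on the default layout:

bin/nutch readdb tomcatcrawl/crawldb -stats

which prints a count of urls per status in the crawldb.)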

*B. The main url gets crawled successfully but the rest 2 child pages are
not getting crawled.*
If the url [1] is db_fetched, then use the same
command "bin/nutch CrawlDbReader" to see the status of the rest 2 child
pages.
If they are both db_unfetched, then there was some exception which caused
the issue. See logs for details.

If the child pages are not found in the DB, then there was some issue with
link extraction, i.e. getting the child links from the content of the main
page.
Read the segment from the first round of the crawl (it will have the content
of the main page). Extract the content of the main page using the
"bin/nutch SegmentReader" command. Check if the fetched content has the
child urls in it. If yes, then the issue is with link extraction.
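
A rough sketch of that last step (the "readseg" alias runs SegmentReader;
the segment name below is made up - use whatever your crawl actually
created under tomcatcrawl/segments/):

bin/nutch readseg -dump tomcatcrawl/segments/20121217103000 segdump -nofetch -nogenerate -noparse -noparsedata -noparsetext

The dump written under segdump/ keeps only the raw fetched content, so you
can grep it for the child hrefs.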

Please try this and report back to the group with your observations.

Thanks,
Tejas Patil


[1] : http://43.44.111.123:8080/nutch-test-site/ch-1.html


On Sun, Dec 16, 2012 at 9:48 PM, Rajani Maski <ra...@gmail.com> wrote:

> Hi users,
>
>    I am trying to crawl the web applications running on the local apache
> tomcat webserver. Note : tomcat version 7, running on 8080 port.
>
>
> The Main html page is :
> http://43.44.111.123:8080/nutch-test-site/ch-1.html.
> This main page is having an hyperlink to call its sub child  -
> http://43.44.111.123:8080/nutch-test-site/ch1/ch1-1.html
> and the sub-child is again having its own child as hyperlink   -
> http://43.44.111.123:8080/nutch-test-site/ch2/ch2-2.html
>
>
> Now *I would like to know what is the filter that has to be given in
> regex-url-filter.txt to accept crawling for this site*.
> Because I am getting log as No more urls to fetch. This seems to be mistake
> in my regex-urlfilter.txt or seed.txt
>
> I tried with the following cases setup:
>
> *Case 1*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/nutch-test-site/child-1.html
>
> seed.txt -
>   http://43.44.111.123:8080/nutch-test-site/child-1.html
>
>
> *Case 2*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/
>
> seed.txt -
>   http://43.44.111.123:8080/nutch-test-site/child-1.html
>
>
> *Case 3*
> regex-urlfilter.txt  -
>    # accept anything else
>    +^http://43.44.111.123:8080/
>
> seed.txt -
>   http://43.44.111.123:8080/
>
>
> Output : Stopping at depth=1 - no more URLs to fetch.
>
>
> *Nutch command: *
> * bin/nutch crawl urls -dir tomcatcrawl -solr
> http://localhost:8080/solrnutch -depth 3 -topN 5 *
> *
> *
> *
> *
> Can you please point me out the mistake here.?
>
> Regards
> Rajani.
>