Posted to user@nutch.apache.org by Manu Warikoo <mw...@hotmail.com> on 2008/09/25 20:12:12 UTC

FW: Indexing Files on Local File System





Hi,

I am running Nutch 0.9 and am attempting to use it to index files on my local file system, without much luck. I believe I have configured things correctly; however, no files are being indexed and no errors are being reported. Note that I have looked through the various posts on this topic on the mailing list and tried various variations on the configuration.

I am providing details of my configuration and log files below. I would appreciate any insight people might have.
Best,
mw

Details:
OS: Windows Vista (note I have turned off Defender and the firewall)
<command> bin/nutch crawl urls -dir crawl_results -depth 4 -topN 500 >& logs/crawl.log
urls file contains only
```````````````````````````````````````````````````
file:///C:/MyData/
```````````````````````````````````````````````````
nutch-site.xml
`````````````````````````````````````
<property>
 <name>http.agent.url</name>
 <value></value>
 <description>none</description>
</property>
<property>
 <name>http.agent.email</name>
 <value>none</value>
 <description></description>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-file|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
 <name>file.content.limit</name>
 <value>-1</value>
</property>
</configuration>
```````````````````````````````````````````````````
crawl-urlfilter.txt
```````````````````````````````````````````````````
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
# -^(file|ftp|mailto):
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
# -.
# get everything else
+^file:///C:/MyData/*
-.*
```````````````````````````````````````````````````

Re: Indexing Files on Local File System

Posted by Srinivas Gokavarapu <sr...@gmail.com>.
hi,
Check this link for crawling local pages in Nutch:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
Follow the steps on that site and try again.


Re: Indexing Files on Local File System

Posted by Kevin MacDonald <ke...@hautesecure.com>.
Manu,
The only way I was able to figure out why Nutch was not crawling URLs that I
expected it to crawl was by digging into the code and adding extra logging
lines. I suggest you look at org.apache.nutch.fetcher.Fetcher.run() and get an
idea of what it is doing. Also look at Fetcher.handleRedirect(). Put a whole
bunch of extra logging lines in that file to figure out whether a filter or a
normalizer is stripping out the URLs that you want crawled. You can also try
disabling all normalizers by adding something like this to your nutch-site.xml
file. Note that I stripped out just about everything; you might only want to
strip out the normalizers. See the original settings in nutch-default.xml.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|js)|scoring-opic</value>
</property>
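
As for the extra logging, something along these lines is all it takes. Treat it
as a sketch: the variable name is only illustrative, so adapt it to whatever
holds the URL at that point in your copy of Fetcher.run() or handleRedirect()
(Fetcher in 0.9 should already have a commons-logging LOG field; if yours does
not, declare one with LogFactory.getLog(Fetcher.class)).

// Sketch only: drop a line like this wherever the fetcher has the URL in hand.
// 'url' stands for whichever local variable holds the URL at that point.
if (LOG.isInfoEnabled()) {
  LOG.info("fetcher handling url: " + url);
}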



RE: Indexing Files on Local File System

Posted by Manu Warikoo <mw...@hotmail.com>.
hi, 
Thanks for responding.
Just tried the changes that you suggested, no change.
The log files look exactly the same except that now the dir ref comes up with only two slashes.
Any other ideas?
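
For what it's worth, a tiny standalone check (java.net.URL only, nothing
Nutch-specific) shows how each spelling of the seed gets split into host and
path, which may be where the two- versus three-slash difference in the logs
comes from. The class name is just for illustration:

import java.net.MalformedURLException;
import java.net.URL;

// Prints how java.net.URL splits each spelling of the seed into host and path.
// Purely illustrative -- Nutch applies its own normalizers on top of this.
public class SlashCheck {
  public static void main(String[] args) throws MalformedURLException {
    String[] seeds = { "file:///C:/MyData/", "file://C:/MyData/" };
    for (String s : seeds) {
      URL u = new URL(s);
      System.out.println(s + " -> host='" + u.getHost() + "' path='" + u.getPath() + "'");
    }
  }
}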
mw

Re: FW: Indexing Files on Local File System

Posted by Srinivas Gokavarapu <sr...@gmail.com>.
hi,
You should change the URL to file://C:/MyData/ and also, in
crawl-urlfilter.txt, change the file:// line to
+^file://C:/MyData/*
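
If the seed still gets dropped, it can also help to check which '+' or '-' rule
in crawl-urlfilter.txt fires first for it. Below is a rough standalone
imitation of that first-match-wins logic with plain java.util.regex; it is not
Nutch's real RegexURLFilter, and the rules shown are just the ones from this
thread, so edit them to mirror your own file:

import java.util.regex.Pattern;

// Rough imitation of crawl-urlfilter.txt: the first matching pattern wins,
// '+' accepts the URL, '-' rejects it.  Edit the rules to mirror your file.
public class FilterCheck {
  public static void main(String[] args) {
    String[] rules = {
        "-^(http|ftp|mailto):",
        "+^file://C:/MyData/",   // or the file:/// spelling, whichever you settle on
        "-.*"                    // catch-all reject
    };
    String url = args.length > 0 ? args[0] : "file://C:/MyData/somefile.txt";
    for (String rule : rules) {
      if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
        System.out.println(url + " -> "
            + (rule.charAt(0) == '+' ? "ACCEPTED" : "REJECTED") + " by " + rule);
        return;
      }
    }
    System.out.println(url + " -> no rule matched (ignored)");
  }
}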
