Posted to user@nutch.apache.org by kauu <ba...@gmail.com> on 2006/04/01 02:56:23 UTC

Re: Crawling the local file system with Nutch - Document-

Thanks for your idea!
But I have a question:
how do you modify search.jsp and the cached servlet so that Word and PDF documents
can be viewed seamlessly, as the user requests?



On 4/1/06, Vertical Search <ve...@gmail.com> wrote:
>
> Nutchians,
> I have tried to document the sequence of steps for using Nutch to crawl and
> search the local file system on a Windows machine.
> I have been able to do it successfully using Nutch 0.8-dev.
> The configuration is as follows:
> *Inspiron 630m
> Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz), Genuine
> Windows XP Professional*
> *If someone can review it, that would be very helpful.*
>
> Crawling the local filesystem with Nutch
> Platform: Microsoft Windows / Nutch 0.8-dev
> For a Linux version, please refer to
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> That link helped me get this off the ground.
>
> I have been working on adopting Nutch in a vertical domain. All of a
> sudden, I was asked to develop a proof of concept
> for using Nutch to crawl and search the local file system.
> Initially I did face some problems, but some mail archives helped me
> proceed further.
> The intention is to provide an overview of the steps to crawl local file
> systems and search through the browser.
>
> I downloaded the Nutch nightly build, then:
> 1. Create an environment variable such as "NUTCH_HOME". (Not mandatory,
> but it helps.)
> 2. Extract the downloaded nightly build. <Don't build yet>
> 3. Create a folder --> c:/LocalSearch --> and copy the following folders and
> libraries into it:
> 1. bin/
> 2. conf/
> 3. *.job, *.jar and *.war files
> 4. urls/ <URLS folder>
> 5. plugins/ folder
> 4. Modify nutch-site.xml to point at the plugins folder (see the
> plugin.folders sketch after the example below).
> 5. Modify nutch-site.xml to list the plugin includes. An example is as
> follows:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <nutch-conf>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
>   </property>
>   <property>
>     <name>file.content.limit</name>
>     <value>-1</value>
>   </property>
> </nutch-conf>
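>
> For step 4, the location of the copied plugins can be given with the
> plugin.folders property (a standard Nutch property; the absolute path below
> is an assumption matching the c:/LocalSearch layout above, not taken from
> the original post). It goes inside the same <nutch-conf> element:
>
> <property>
>   <name>plugin.folders</name>
>   <value>c:/LocalSearch/plugins</value>
>   <description>Directories where Nutch plugins are located.</description>
> </property>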
>
> 6. Modify crawl-urlfilter.txt.
> Remember, we have to crawl the local file system, so modify the entries as
> follows:
>
> #skip http:, ftp:, & mailto: urls
> ##-^(file|ftp|mailto):
>
> -^(http|ftp|mailto):
>
> #skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> #skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> #accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> #accept anything else
> +.*
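>
> Since only file: URLs need to be fetched here, a tighter alternative to the
> catch-all "+.*" rule is to accept file: URLs explicitly and reject the rest
> (a sketch, not from the original post; use it instead of "+.*", not in
> addition to it):
>
> #accept file: urls, skip everything else
> +^file://
> -.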
>
> 7. urls folder
> Create a file listing all the URLs to be crawled, with entries as shown
> below, and save the file under the urls folder.
>
> The directories should be in "file://" format. Example entries:
>
> file://c:/resumes/word
> file://c:/resumes/pdf
>
> #file:///data/readings/semanticweb/
>
> Nutch recognises that the third line does not contain a valid file URL and
> skips it.
>
> 8. Ignore the parent directories. As suggested in the Linux flavor of the
> local-fs crawl (the link above), I modified the code in
> org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
> java.io.File f).
>
> I changed the following line:
>
> this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
> true);
> to
>
> this.content = list2html(f.listFiles(), path, false);
> and recompiled.
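>
> For context, a sketch of the change with the role of the third argument of
> list2html spelled out; the interpretation of that flag is my reading of the
> change, not stated in the original post:
>
> // In FileResponse.getDirAsHttpResponse(java.io.File f), the third argument
> // of list2html() controls whether a ".." parent-directory link is added to
> // the generated directory listing.
> //
> // Before: emit the parent link everywhere except at the filesystem root
> // this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
> //
> // After: never emit the parent link, so the crawl stays inside the seed directories
> this.content = list2html(f.listFiles(), path, false);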
>
> 9. Compile the changes. I just compiled the whole source tree; it did not
> take more than 2 minutes.
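>
> A minimal sketch of the build step, assuming the stock Ant build that ships
> with the nightly (run from the extracted source directory, not from
> c:/LocalSearch; the path below is illustrative):
>
> cd /cygdrive/c/nutch-nightly   # extracted nightly source
> ant                            # recompile the changed classes
> ant job                        # rebuild the .job file used by bin/nutch (target name assumed)
>
> Afterwards, copy the rebuilt *.job and jar files back into c:/LocalSearch.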
>
> 10. Crawl the file system.
>     On my desktop I have a shortcut into the Cygwin tree ("cygdrive");
>     double-click it, run pwd, then:
>     cd ../../cygdrive/c/$NUTCH_HOME
>
>     Execute
>     bin/nutch crawl urls -dir c:/localfs/database
>
> Voila, that is it. After 20 minutes the files were indexed, merged, and all
> done.
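>
> The crawl command also accepts depth and size limits; the flags below are
> standard options of the crawl tool, but the values are chosen purely for
> illustration:
>
> bin/nutch crawl urls -dir c:/localfs/database -depth 5 -topN 1000
>
> -depth bounds how many link levels below the seed directories are followed,
> and -topN caps the number of pages fetched per round.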
>
> 11. Extract the nutch-0.8-dev.war file to the <TOMCAT_HOME>/webapps/ROOT
> folder.
>
> Open nutch-site.xml and add the following snippet to point at the search
> folder:
> <property>
>   <name>searcher.dir</name>
>   <value>c:/localfs/database</value>
>   <description>
>   Path to root of crawl.  This directory is searched (in
>   order) for either the file search-servers.txt, containing a list of
>   distributed search servers, or the directory "index" containing
>   merged indexes, or the directory "segments" containing segment
>   indexes.
>   </description>
> </property>
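>
> To search, restart Tomcat and open the search page in the browser. A sketch,
> assuming a default Tomcat setup with the webapp deployed to ROOT as above
> (the port and URL are the Tomcat defaults, not taken from the original post):
>
> cd $TOMCAT_HOME/bin
> ./startup.sh          # startup.bat from a plain Windows console
>
> Then browse to http://localhost:8080/ and run a query.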
>
> 12. Searching locally was a bit slow, so I changed the hosts file to map the
> machine name to localhost. That sped up searching considerably.
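>
> On Windows XP the hosts file normally lives at
> C:\WINDOWS\system32\drivers\etc\hosts (the original post calls it hosts.ini;
> the path and the machine name below are assumptions for illustration):
>
> 127.0.0.1    localhost    mydesktop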
>
> 13. Modified search.jsp and the cached servlet so that Word and PDF files
> are displayed seamlessly when the user requests them.
>
>
> I hope this helps folks who are trying to adopt Nutch for local file system
> search.
> Personally, I believe corporations should adopt Nutch rather than buying a
> Google appliance :)
>
>


--
www.babatu.com

Re: Crawling the local file system with Nutch - Document-

Posted by kauu <ba...@gmail.com>.
Hi Sudhendra Seshachala,
thanks so much for your code.
Yes, I would like it.

On 4/5/06, sudhendra seshachala <su...@yahoo.com> wrote:
>
> I just modified search.jsp: basically, I set the content type based on the
> document type being queried.
>   The rest is handled by the protocol and the browser.
>
>   I can send the code if you would like.
>
>   Thanks
>
--
www.babatu.com

Re: Crawling the local file system with Nutch - Document-

Posted by sudhendra seshachala <su...@yahoo.com>.
I just modified search.jsp: basically, I set the content type based on the
document type being queried.
  The rest is handled by the protocol and the browser.
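
A hypothetical sketch of that idea: pick the HTTP response content type from
the cached document's extension before streaming it, so the browser opens
Word/PDF natively. The parameter name and the extension checks below are
assumptions for illustration, not the actual search.jsp/cached.jsp code:

<%
  // Decide the content type from the cached document's URL.
  String docUrl = request.getParameter("url");   // hypothetical parameter name
  String contentType = "text/html";              // default for cached HTML pages
  if (docUrl != null) {
    String lower = docUrl.toLowerCase();
    if (lower.endsWith(".pdf")) {
      contentType = "application/pdf";
    } else if (lower.endsWith(".doc")) {
      contentType = "application/msword";
    }
  }
  response.setContentType(contentType);          // the protocol/browser handle the rest
%>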
   
  I can send the code if you would like.
   
  Thanks

kauu <ba...@gmail.com> wrote:
  Thanks for your idea!
But I have a question:
how do you modify search.jsp and the cached servlet so that Word and PDF documents
can be viewed seamlessly, as the user requests?






  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		