Posted to user@nutch.apache.org by Wenhao Xu <xu...@gmail.com> on 2011/02/09 19:19:48 UTC

How to use Nutch index files on localdisk?

Hi all,
   I am new to Nutch. I want to use Nutch's MapReduce indexer to index
files on a local filesystem, and I want to customize the fields added to the
index. I searched the Internet for a while but haven't found an answer.
Could you give me some advice? Thanks very much.

Regards,
Wen

-- 
~_~
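For context, indexing the local filesystem usually comes down to two pieces: a seed file containing a file: URL, and overriding plugin.includes in conf/nutch-site.xml so protocol-file replaces protocol-http. A minimal sketch follows; the exact plugin list varies by Nutch version, so treat the value below as an assumption to check against your install's nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: enable the file: protocol handler (sketch, Nutch 1.x-era names) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The seed file (e.g. urls/seed.txt) then holds a file: URL such as file:///path/to/docs/. As for customizing the indexed fields, the usual route is writing an IndexingFilter plugin modeled on index-basic.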

Re: How to use Nutch index files on localdisk?

Posted by Wenhao Xu <xu...@gmail.com>.
Hi Markus,
  Thanks. It works.
  But Nutch seems to crawl only a single level of the directory.  My
directory structure is:
nutch-crawl
   |-- conf
   |     |-- many xml and text files
   |-- new

   Below is a snapshot of the crawl command's output.  It stops fetching at
depth 1.  I glanced at the protocol-file implementation: it reads the
directory/file and generates an HTML response with links reflecting the
directory structure. Therefore, after fetching, the HTML parser should be
called and should update the crawl db correctly, and then the next round of
fetching should happen. However, here it only fetches the nutch-crawl
directory.
    Does anybody have any advice on this? I am a newbie with Nutch. Thanks
for the help.

rootUrlDir = urls
threads = 10
depth = 3
indexer=lucene
topN = 50
Injector: starting at 2011-02-12 15:48:12
Injector: crawlDb: crawl_local_results/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-02-12 15:48:14, elapsed: 00:00:02
Generator: starting at 2011-02-12 15:48:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
...
Finishing thread FetcherThread, activeThreads=1
fetching file:///Users/peter/storage/nutch-crawl/
...
Fetcher: finished at 2011-02-12 15:48:20, elapsed: 00:00:02
ParseSegment: starting at 2011-02-12 15:48:20
ParseSegment: segment: crawl_local_results/segments/20110212154817
ParseSegment: finished at 2011-02-12 15:48:21, elapsed: 00:00:01
CrawlDb update: starting at 2011-02-12 15:48:22
CrawlDb update: db: crawl_local_results/crawldb
....
Generator: starting at 2011-02-12 15:48:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.


Regards,
Wen.

On Thu, Feb 10, 2011 at 2:59 AM, Markus Jelsma
<ma...@openindex.io> wrote:

> Here's an old post on this one which probably doesn't work anymore:
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> And here the info on the Wiki's FAQ page:
> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>
>
> On Wednesday 09 February 2011 19:19:48 Wenhao Xu wrote:
> > Hi all,
> >    I am new to Nutch. I want to use  Nutch's MapReduce indexer to index
> > files on a local filesystem. And I want to customize the field adding to
> > the index. I searched the Internet for a while, but haven't found the
> > answer. Could you give me some advice? Thanks very much.
> >
> > Regards,
> > Wen
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
~_~
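A likely culprit for stopping at depth 1 on a file: crawl is the URL filter: the stock crawl-urlfilter.txt / regex-urlfilter.txt ships with a rule rejecting file: (and ftp:, mailto:) URLs, so every outlink parsed from the directory listing gets filtered out and the generator selects 0 records. The sketch below is a simplified model of the filter's first-match-wins behavior, not Nutch's actual code, and the rule sets are illustrative:

```python
import re

# Simplified model of Nutch's urlfilter-regex plugin: rules are applied in
# order; '+' accepts, '-' rejects, and the first matching rule wins.
def url_filter(url, rules):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

# Default-style rules (as shipped in conf/crawl-urlfilter.txt, roughly):
default_rules = [
    ("-", r"^(file|ftp|mailto):"),  # rejects every file: outlink
    ("+", r"."),
]

# Adjusted rules for a local-filesystem crawl:
local_rules = [
    ("-", r"^(http|https|ftp|mailto):"),
    ("+", r"^file:"),
]

print(url_filter("file:///Users/peter/storage/nutch-crawl/conf/", default_rules))  # False
print(url_filter("file:///Users/peter/storage/nutch-crawl/conf/", local_rules))    # True
```

If this matches the symptom, editing conf/crawl-urlfilter.txt so the `-^(file|ftp|mailto):` line no longer rejects file: URLs (and adding a `+^file:` rule) typically lets the crawl descend past depth 1.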


Re: How to use Nutch index files on localdisk?

Posted by Markus Jelsma <ma...@openindex.io>.
Here's an old post on this one which probably doesn't work anymore:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

And here's the info on the Wiki's FAQ page:
http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
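For completeness, the end-to-end shape of a local-filesystem crawl with the 1.x one-shot crawl command looks roughly like this. Paths and the seed URL are placeholders, and the bin/nutch line is commented out since it assumes a working Nutch install with protocol-file enabled and the URL filters adjusted:

```shell
# Hypothetical layout; adjust paths to your setup.
mkdir -p urls
echo 'file:///Users/peter/storage/nutch-crawl/' > urls/seed.txt

# Then, from NUTCH_HOME (not run here):
# bin/nutch crawl urls -dir crawl_local_results -depth 3 -topN 50
cat urls/seed.txt
```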


On Wednesday 09 February 2011 19:19:48 Wenhao Xu wrote:
> Hi all,
>    I am new to Nutch. I want to use  Nutch's MapReduce indexer to index
> files on a local filesystem. And I want to customize the field adding to
> the index. I searched the Internet for a while, but haven't found the
> answer. Could you give me some advice? Thanks very much.
> 
> Regards,
> Wen

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350