Posted to user@nutch.apache.org by og...@yahoo.com on 2007/03/29 00:03:28 UTC

1 Nutch, multiple indices?

Hi,

Nutch Wiki used to have this page:
  http://wiki.apache.org/nutch/SearchOverMultipleIndexes

But that page is now gone.  Does anyone know what happened?  Google has it indexed, so I know the page existed, but unfortunately only an empty version of it is indexed.

In any case, my question is as follows:
I've always thought of Nutch, and used it, as a tool to create, maintain, and search a single index.  However, I am now wondering about maintaining a set of distinct indices under one instance of Nutch.  I may want to search them all in parallel, search a specific one, or search a subset of them.  Is this even doable?

Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

Re: Nutch dataset dirstructure

Posted by pi...@kw.nl.

Hi

> > Can anyone point me to some documentation about
> > the directory structure Nutch creates and maintains
> > when crawling, indexing, etc.?

> Well, unfortunately there is not much documentation out there.

I was afraid someone would say that :-) Thanks for the
extensive answer; I have some starting points now!


thanks!
*-pike

Re: Nutch dataset dirstructure

Posted by Enis Soztutar <en...@gmail.com>.

pike wrote:
> Hi
>
> I'm new to Nutch.
> Can anyone point me to some documentation about
> the directory structure Nutch creates and maintains
> when crawling, indexing, etc.? We're doing "whole-web"
> crawls step by step. Since I have no reference, it's
> hard to see whether crawling, merging, indexing, etc.
> went OK.
>
>
> thanks!
> *-pike
>
Well, unfortunately there is not much documentation out there. But you should
start by reading the articles on the Nutch wiki. For the index
structure, you should seek help on the Lucene wiki, since Nutch uses
Lucene as its inverted index. To look at the generated indexes, you can
use the Luke or lucli (command-line) tools. lucli can be found in the contrib
directory of Lucene.

Nutch stores the crawl state of the URLs in the crawldb. The crawldb is
an instance of Hadoop's MapFile, which is a sorted sequence of <key,value>
pairs. The keys in the crawldb are URLs and the values are CrawlDatum objects.
A MapFile uses two SequenceFiles, one for storing the data and the other for
indexing it. You should check the javadocs of these classes for
further info.
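
To make the data-plus-index idea concrete, here is a minimal toy sketch in Python. It is not Hadoop's actual on-disk format or API; the class name ToyMapFile, the index_interval parameter, and the status strings standing in for CrawlDatum objects are all made up for illustration. It only shows why a small sparse index over sorted <key,value> pairs makes lookups cheap.

```python
import bisect


class ToyMapFile:
    """Toy model of the MapFile idea: sorted data plus a sparse index.

    Hypothetical illustration only; not Hadoop's real format or API.
    """

    def __init__(self, pairs, index_interval=2):
        # "data" part: every <key, value> pair, sorted by key.
        self.data = sorted(pairs, key=lambda kv: kv[0])
        self.index_interval = index_interval
        # "index" part: every index_interval-th key and its position in data.
        self.index = [
            (self.data[i][0], i)
            for i in range(0, len(self.data), index_interval)
        ]

    def get(self, key):
        # Find the last indexed key that is <= the requested key.
        index_keys = [k for k, _ in self.index]
        slot = bisect.bisect_right(index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before everything in the file
        # Scan only the short run of data entries covered by that index slot.
        start = self.index[slot][1]
        for k, v in self.data[start:start + self.index_interval]:
            if k == key:
                return v
        return None


# URLs as keys; plain strings stand in for CrawlDatum values (an assumption).
crawldb = ToyMapFile([
    ("http://b.example/", "fetched"),
    ("http://a.example/", "unfetched"),
    ("http://c.example/", "fetched"),
])
print(crawldb.get("http://a.example/"))
```

The point of the design is that the index stays small enough to hold in memory while the data can be large: a lookup searches the index first, then reads only a short stretch of the data.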

The linkdb is also stored as MapFiles, mapping URLs to Inlinks objects.
For further info, you should really browse the javadocs and skim
through the code to get a deeper understanding of the system.
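
As a rough sanity check on what a crawl step left on disk, one could look for directories that contain the two files a MapFile is made of. The sketch below assumes those two files are named "data" and "index", which is the naming Hadoop's MapFile uses; the function name find_mapfiles and the "crawl" path in the usage line are hypothetical. It only checks that the files are present, not that their contents are valid.

```python
import os


def find_mapfiles(root):
    """Return sorted paths (relative to root) of MapFile-like directories.

    A directory counts as MapFile-like if it holds both a 'data' and an
    'index' file. This checks presence only; it does not read the files.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "data" in filenames and "index" in filenames:
            found.append(os.path.relpath(dirpath, root))
    return sorted(found)


if __name__ == "__main__":
    # "crawl" is a placeholder; point this at your own crawl directory.
    for path in find_mapfiles("crawl"):
        print(path)
```

If a crawldb or linkdb directory shows no such entries after an update step, that is a hint (not proof) that the step did not complete.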

Nutch dataset dirstructure

Posted by pike <pi...@kw.nl>.

Hi

I'm new to Nutch.
Can anyone point me to some documentation about
the directory structure Nutch creates and maintains
when crawling, indexing, etc.? We're doing "whole-web"
crawls step by step. Since I have no reference, it's
hard to see whether crawling, merging, indexing, etc.
went OK.


thanks!
*-pike

Re: 1 Nutch, multiple indices?

Posted by "Steve W." <mi...@gmail.com>.

I documented my approach to this under Debian on the Nutch Wiki here:

http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

Steve Walker
Middle Fork Geographic Information Services
http://mfgis.com
