Posted to user@nutch.apache.org by ja...@thomson.com on 2006/09/22 23:21:15 UTC

0.8 output\index versus output\indexes

Hi All-

I am just curious if someone could explain the difference between the
'index' folder and the 'indexes' folder inside the output directory of
the crawl?

I noticed that indexes have parts (though mine only has one part) but
index just contains the Lucene index.  My theory is that each part is
the result of a hadoop reduce task, and since I am only crawling with
one machine there is only the one part... And index is the merge of
those parts.. Am I correct or just creative?

The motivation for my question is that I am trying to determine what
parts of the crawl need to be deployed to my searcher machines (I don't
use servlet searcher but a custom class using the Nutch API).  It looks
like it works with just 'index' and 'segments', but I want to be sure
that I should not be deploying 'indexes' instead/in-addition.

Thanks,
Jared-

Re: 0.8 output\index versus output\indexes

Posted by liv <li...@hotmail.com>.
I found this message while trying to get the subcollection field updated
on reindexing. I had no explanation for my issue: I fetched and indexed
some sites using subcollection.xml, then changed subcollection.xml and
reindexed. Whether I inspected the index with Luke or used the web search,
the collections looked unchanged. The whole story is here:

http://www.nabble.com/subcollections-tf2821188.html

I manually looked over all the files, and this is what I found: when doing a
reindex operation, only the "indexes" files change; the "index" files don't.
And since, as you say, the "index" folder takes precedence over "indexes",
this means that... it's a bug of some sort!

To benefit from the new subcollection.xml after a reindex, I need to
remove the "index" folder (unchanged by the reindex) and let the searcher
work only with the "indexes" folder. Please tell me if I am wrong, or if
there is any other method to accomplish this.
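
The workaround described above (taking "index" out of the way so that the
searcher falls back to "indexes") could be sketched roughly as follows. This
is only an illustration under assumptions: StaleIndex, setAside and the
"index.stale-" backup name are hypothetical and not part of Nutch, and the
sketch renames the folder rather than deleting it so the step can be undone.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical helper, not part of Nutch: moves a stale merged index aside. */
final class StaleIndex {
    private StaleIndex() {}

    /**
     * Renames crawlDir/index to crawlDir/index.stale-<millis>, if it exists.
     * Returns the backup location, or empty if there was nothing to move
     * or the rename failed.
     */
    static Optional<File> setAside(File crawlDir) {
        File merged = new File(crawlDir, "index");
        if (!merged.isDirectory()) {
            return Optional.empty();
        }
        // Assumed naming scheme for the backup; any unused name would do.
        File backup = new File(crawlDir, "index.stale-" + System.currentTimeMillis());
        return merged.renameTo(backup) ? Optional.of(backup) : Optional.empty();
    }
}
```

The searcher would need to be restarted (or its index reopened) afterwards
for the change to be picked up.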

Also, what are the drawbacks or advantages of using "index" versus
"indexes"?

Also, could you point me to a tutorial-style source that walks through
the internals of Nutch?

Thanks!


Andrzej Bialecki wrote:
> 
> jared.dunne@thomson.com wrote:
>> I am just curious if someone could explain the difference between the
>> 'index' folder and the 'indexes' folder inside the output directory of
>> the crawl?
> 
>> The motivation for my question is that I am trying to determine what
>> parts of the crawl need to be deployed to my searcher machines (I don't
>> use servlet searcher but a custom class using the Nutch API).  It looks
>> like it works with just 'index' and 'segments', but I want to be sure
>> that I should not be deploying 'indexes' instead/in-addition.
>>   
> 
> That's correct. NutchBean first tries to use "index", if it can't be 
> found then it tries "indexes".
> 
> 

-- 
View this message in context: http://www.nabble.com/0.8-output%5Cindex-versus-output%5Cindexes-tf2320120.html#a7994100


Re: 0.8 output\index versus output\indexes

Posted by Andrzej Bialecki <ab...@getopt.org>.
jared.dunne@thomson.com wrote:
> Hi All-
>
> I am just curious if someone could explain the difference between the
> 'index' folder and the 'indexes' folder inside the output directory of
> the crawl?
>
> I noticed that indexes have parts (though mine only has one part) but
> index just contains the Lucene index.  My theory is that each part is
> the result of a hadoop reduce task, and since I am only crawling with
> one machine there is only the one part... And index is the merge of
> those parts.. Am I correct or just creative?
>   

Correct.

> The motivation for my question is that I am trying to determine what
> parts of the crawl need to be deployed to my searcher machines (I don't
> use servlet searcher but a custom class using the Nutch API).  It looks
> like it works with just 'index' and 'segments', but I want to be sure
> that I should not be deploying 'indexes' instead/in-addition.
>   

That's correct. NutchBean first tries to use "index"; if that can't be
found, it then tries "indexes".
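
For a custom searcher class, the lookup order just described could be
mirrored roughly like this. It is a minimal sketch, not the NutchBean code:
IndexLocator and locate are hypothetical names, only java.io is used, and
it assumes each part is a sub-directory of "indexes".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Nutch: mirrors the "index" then "indexes" order. */
final class IndexLocator {
    private IndexLocator() {}

    /**
     * Returns the index directories to open for a crawl directory:
     * the merged "index" if it exists, otherwise every sub-directory
     * (part) found under "indexes". Returns an empty list if neither exists.
     */
    static List<File> locate(File crawlDir) {
        List<File> result = new ArrayList<>();
        File merged = new File(crawlDir, "index");
        if (merged.isDirectory()) {
            // The merged index takes precedence; the parts are ignored.
            result.add(merged);
            return result;
        }
        File[] parts = new File(crawlDir, "indexes").listFiles(File::isDirectory);
        if (parts != null) {
            Arrays.sort(parts); // stable, predictable order of parts
            result.addAll(Arrays.asList(parts));
        }
        return result;
    }
}
```

One consequence of this order: as long as an "index" folder is present, any
changes made only under "indexes" are not seen by the searcher.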

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com