
Posted to java-user@lucene.apache.org by Prasenjit Mukherjee <pr...@aol.com> on 2006/03/06 07:01:44 UTC

Distributed Lucene..

I already have an implementation of a distributed crawler farm, where 
crawler instances are running on different boxes. I want to come up with 
a distributed indexing scheme using Lucene that takes advantage of the 
distributed nature of my crawlers. Here is what I am thinking.

Crawlers will analyze and tokenize the content of every URL (aka 
document) and create the following data for each one:
<url-id, <field1, <term-f1-t1, term-f1-t2, term-f1-t3, ...>>,
         <field2, <term-f2-t1, term-f2-t2, term-f2-t3, ...>>, ... >
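
A minimal Java sketch of that per-URL record, assuming terms are kept in
their original order per field. The class name CrawledDocument and its
methods are hypothetical, chosen only for illustration; nothing here is
part of Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical container for the analyzed output of one crawled URL. */
final class CrawledDocument {
    private final String urlId;
    // field name -> terms produced by the crawler's analyzer, in order
    private final Map<String, List<String>> termsByField = new LinkedHashMap<>();

    CrawledDocument(String urlId) {
        this.urlId = urlId;
    }

    /** Appends one term to the given field and returns this for chaining. */
    CrawledDocument addTerm(String field, String term) {
        termsByField.computeIfAbsent(field, k -> new ArrayList<>()).add(term);
        return this;
    }

    String urlId() {
        return urlId;
    }

    /** Terms of one field, or an empty list if the field is absent. */
    List<String> terms(String field) {
        List<String> terms = termsByField.get(field);
        return terms == null ? Collections.<String>emptyList()
                             : Collections.unmodifiableList(terms);
    }

    /** Read-only view of all fields and their terms. */
    Map<String, List<String>> termsByField() {
        return Collections.unmodifiableMap(termsByField);
    }
}
```

A crawler would build one such object per URL, for example
new CrawledDocument("url-1").addTerm("title", "lucene"), before handing
its terms to the partitioning step.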

And then, based on some partitioning function, the crawlers can send a 
subset of tokens (aka terms) to each indexing server. The partitioning 
function can be as simple as the starting character of the term. Say 
we have 5 indexers; we would distribute the indexing data in the 
following manner:

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z
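
The first-letter scheme above could be sketched in Java roughly as
follows. TermPartitioner is a hypothetical helper, not a Lucene class.
Two points are assumptions the original post does not specify: terms
that do not start with a-z are sent to the last indexer, and any letters
left over after an even split also go to the last indexer (which is how
Indexer5 ends up with u-z).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: routes terms to indexers by their first letter. */
final class TermPartitioner {
    private static final int ALPHABET_SIZE = 26;
    private final int numIndexers;
    private final int lettersPerIndexer;

    TermPartitioner(int numIndexers) {
        if (numIndexers < 1 || numIndexers > ALPHABET_SIZE) {
            throw new IllegalArgumentException("numIndexers must be in 1..26");
        }
        this.numIndexers = numIndexers;
        this.lettersPerIndexer = ALPHABET_SIZE / numIndexers;
    }

    /** Returns a zero-based indexer number; 0 corresponds to Indexer1. */
    int indexerFor(String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("empty term");
        }
        char first = Character.toLowerCase(term.charAt(0));
        if (first < 'a' || first > 'z') {
            // Assumption: digits, punctuation and non-ASCII letters
            // all go to the last indexer.
            return numIndexers - 1;
        }
        // Leftover letters (here: 'z') are clamped into the last range.
        return Math.min((first - 'a') / lettersPerIndexer, numIndexers - 1);
    }

    /** Splits a list of terms into one batch per indexer. */
    List<List<String>> partition(List<String> terms) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < numIndexers; i++) {
            batches.add(new ArrayList<String>());
        }
        for (String term : terms) {
            batches.get(indexerFor(term)).add(term);
        }
        return batches;
    }
}
```

With new TermPartitioner(5), terms starting with a-e map to indexer 0,
f-j to 1, k-o to 2, p-t to 3 and u-z to 4, matching the table above.
Note that first letters are not evenly distributed in real text, so the
load per indexer is likely to be skewed, and a query with several terms
may have to contact several indexers.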

Does it make any sense? I would also like to know if there are other 
ways to distribute Lucene's indexing/searching.

thanks,
prasen

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Distributed Lucene..

Posted by Andrzej Bialecki <ab...@getopt.org>.

Prasenjit Mukherjee wrote:
> I think Nutch has a distributed Lucene implementation. I could have 
> used Nutch straightaway, but I have a different crawler, and I also 
> don't want to use NDFS (which is used by Nutch). What I proposed 
> earlier is basically based on the MapReduce paradigm, which is used 
> by Nutch as well.
>
> It would be nice to get some articles specifically detailing the 
> distributed architecture used in Nutch.
>

A few comments:

* you can use your own crawler, and then only write some glue code to 
convert the output of that crawler to the format that Nutch uses.

* Nutch can be run in a so-called "local" mode, without using NDFS.

* the core map-reduce and I/O functionality has been split out into its 
own project, Hadoop, where the development is taking place at a furious rate 
;-) This code is completely independent of Nutch or Lucene. You can 
implement your own data processing using this framework.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Distributed Lucene..

Posted by Prasenjit Mukherjee <pr...@aol.com>.

I think Nutch has a distributed Lucene implementation. I could have used 
Nutch straightaway, but I have a different crawler, and I also don't want 
to use NDFS (which is used by Nutch). What I proposed earlier is 
basically based on the MapReduce paradigm, which is used by Nutch as well.

It would be nice to get some articles specifically detailing the 
distributed architecture used in Nutch.

prasen

Samuru Jackson wrote:

>> Does it make any sense? I would also like to know if there are other
>> ways to distribute Lucene's indexing/searching.
>
> I'm interested in such a distributed architecture too.
>
> What I have in mind is some kind of Lucene index cluster where
> several machines hold subindexes in memory. Given a search query,
> the machines should perform fast because the index is in memory
> and no hard disk access is performed.
>
> Is there anything like this available?


Re: Distributed Lucene..

Posted by Samuru Jackson <sa...@googlemail.com>.

> Does it make any sense? I would also like to know if there are other
> ways to distribute Lucene's indexing/searching.

I'm interested in such a distributed architecture too.

What I have in mind is some kind of Lucene index cluster where several
machines hold subindexes in memory. Given a search query, the machines
should perform fast because the index is in memory and no hard disk
access is performed.

Is there anything like this available?
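
To make the idea concrete, here is a toy Java sketch of such a cluster,
under heavy assumptions: the "machines" are plain objects in one JVM,
each sub-index is a simple term-to-document-ids map rather than a real
Lucene index, and documents are assigned to sub-indexes by hashing their
id. The names InMemorySubIndex and SubIndexCluster are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory sub-index: term -> ids of documents containing it. */
final class InMemorySubIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns a copy of the matching document ids (empty if none). */
    Set<String> search(String term) {
        Set<String> hits = postings.get(term);
        return hits == null ? new TreeSet<String>() : new TreeSet<>(hits);
    }
}

/** Hypothetical front end: sends a query to every sub-index and merges hits. */
final class SubIndexCluster {
    private final List<InMemorySubIndex> subIndexes = new ArrayList<>();

    SubIndexCluster(int numSubIndexes) {
        if (numSubIndexes < 1) {
            throw new IllegalArgumentException("need at least one sub-index");
        }
        for (int i = 0; i < numSubIndexes; i++) {
            subIndexes.add(new InMemorySubIndex());
        }
    }

    /** Assumption: each document lives in exactly one sub-index, chosen by hash. */
    void add(String docId, List<String> terms) {
        int target = Math.floorMod(docId.hashCode(), subIndexes.size());
        subIndexes.get(target).add(docId, terms);
    }

    /** Queries all sub-indexes (sequentially here) and unions the results. */
    Set<String> search(String term) {
        Set<String> merged = new TreeSet<>();
        for (InMemorySubIndex subIndex : subIndexes) {
            merged.addAll(subIndex.search(term));
        }
        return merged;
    }
}
```

In a real deployment each sub-index would sit on its own machine and be
queried in parallel over the network; the sketch only shows the
scatter-and-merge shape of the search path.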
