Posted to java-user@lucene.apache.org by Prasenjit Mukherjee <pr...@aol.com> on 2006/03/06 07:01:44 UTC
Distributed Lucene..
I already have an implementation of a distributed crawler farm, where
crawler instances are running on different boxes. I want to come up with
a distributed indexing scheme using Lucene that takes advantage of the
distributed nature of my crawlers. Here is what I am
thinking.
Crawlers will analyze and tokenize the content of every URL (aka
document) and create the following data for each URL document:
<url-id, <field1, <term-f1-t1,term-f1-t2,term-f1-t3 etc.>> <field-2,
<term-f2-t1,term-f2-t2,term-f2-t3, >> ...... >
Then, based on some partitioning function, the crawlers can send a
subset of tokens (aka terms) to an indexing server. The partitioning
function can be as simple as one based on the starting character of the
terms. Let's say we have 5 indexers; we will distribute the indexing
data in the following manner:
Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z
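A minimal sketch of the first-letter partitioning described above, in plain Java with no Lucene dependency. The class and method names (TermPartitioner, indexerFor, partition) are hypothetical. Sending terms that do not start with a-z (digits, for example) to the last indexer is an assumption, since the scheme above does not say where they go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: routes terms to indexers by their first letter. */
final class TermPartitioner {
    private TermPartitioner() {}

    /**
     * Returns a 0-based indexer number for the term. With 5 indexers this
     * gives a-e -> 0, f-j -> 1, k-o -> 2, p-t -> 3, u-z -> 4.
     */
    static int indexerFor(String term, int numIndexers) {
        if (numIndexers < 1 || numIndexers > 26) {
            throw new IllegalArgumentException("numIndexers must be in 1..26");
        }
        // Assumption: empty terms and terms not starting with a-z
        // go to the last indexer.
        if (term.isEmpty()) {
            return numIndexers - 1;
        }
        char first = Character.toLowerCase(term.charAt(0));
        if (first < 'a' || first > 'z') {
            return numIndexers - 1;
        }
        // Equal-width letter ranges; leftover letters fall into the last range.
        int width = 26 / numIndexers;
        return Math.min((first - 'a') / width, numIndexers - 1);
    }

    /** Groups one document's terms into per-indexer batches. */
    static Map<Integer, List<String>> partition(List<String> terms, int numIndexers) {
        Map<Integer, List<String>> batches = new TreeMap<>();
        for (String term : terms) {
            batches.computeIfAbsent(indexerFor(term, numIndexers),
                    k -> new ArrayList<>()).add(term);
        }
        return batches;
    }
}
```

Note that with this kind of term-based partitioning the terms of a single document end up on several indexers, so a query with terms from different letter ranges has to consult more than one indexer.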
Does this make sense? I would also like to know if there are other ways
to distribute Lucene's indexing/searching.
thanks,
prasen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Distributed Lucene..
Posted by Andrzej Bialecki <ab...@getopt.org>.
Prasenjit Mukherjee wrote:
> I think Nutch has a distributed Lucene implementation. I could have
> used Nutch straightaway, but I have a different crawler, and also don't
> want to use NDFS (which is used by Nutch). What I proposed
> earlier is basically based on the MapReduce paradigm, which is used by
> Nutch as well.
>
> It would be nice to get some articles specifically detailing the
> distributed architecture used in Nutch.
>
A few comments:
* You can use your own crawler, and then only write some glue code to
convert the output of that crawler to the format that Nutch uses.
* Nutch can be run in a so-called "local" mode, without using NDFS.
* The core map-reduce and I/O functionality has been split into its own
project, Hadoop, where development is taking place at a furious rate
;-) This code is completely independent of Nutch or Lucene. You can
implement your own data processing using this framework.
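To make the map-reduce idea in the last point concrete, here is a rough sketch in plain Java of building term -> URL lists in two steps. It deliberately does not use the Hadoop API. The names (InvertedIndexSketch, Posting, map, reduce) and the simple regex tokenizer are assumptions made up for illustration, and everything runs in one process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical single-process illustration of a map step and a reduce step. */
final class InvertedIndexSketch {
    private InvertedIndexSketch() {}

    /** One (term, urlId) pair emitted by the map step. */
    static final class Posting {
        final String term;
        final String urlId;

        Posting(String term, String urlId) {
            this.term = term;
            this.urlId = urlId;
        }
    }

    /** Map step: tokenize one document and emit a pair per token. */
    static List<Posting> map(String urlId, String content) {
        List<Posting> out = new ArrayList<>();
        // Assumption: lower-casing and splitting on non-word characters
        // stands in for a real analyzer.
        for (String token : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                out.add(new Posting(token, urlId));
            }
        }
        return out;
    }

    /** Reduce step: group the emitted pairs by term. */
    static Map<String, SortedSet<String>> reduce(List<Posting> postings) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Posting p : postings) {
            index.computeIfAbsent(p.term, k -> new TreeSet<>()).add(p.urlId);
        }
        return index;
    }
}
```

In a real framework the map calls would run on many machines and the pairs would be shuffled by term before the reduce step; here both steps are ordinary method calls.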
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Distributed Lucene..
Posted by Prasenjit Mukherjee <pr...@aol.com>.
I think Nutch has a distributed Lucene implementation. I could have used
Nutch straightaway, but I have a different crawler, and also don't want
to use NDFS (which is used by Nutch). What I proposed earlier is
basically based on the MapReduce paradigm, which is used by Nutch as well.
It would be nice to get some articles specifically detailing the
distributed architecture used in Nutch.
prasen
Samuru Jackson wrote:
>>Does it make any sense ? Also would like to know if there are other ways
>>to distribute lucene's indexing/searching ?
>>
>>
>
>I'm interested in such a distributed architecture too.
>
>What I have in mind is some kind of Lucene index cluster where
>several machines hold subindexes in memory. So if you have
>a search query, the machines should perform fast because the index
>is in memory and no hard disk access is performed.
>
>Is there anything like this available?
>
>
>
Re: Distributed Lucene..
Posted by Samuru Jackson <sa...@googlemail.com>.
> Does it make any sense ? Also would like to know if there are other ways
> to distribute lucene's indexing/searching ?
I'm interested in such a distributed architecture too.
What I have in mind is some kind of Lucene index cluster where
several machines hold subindexes in memory. So if you have
a search query, the machines should perform fast because the index
is in memory and no hard disk access is performed.
Is there anything like this available?