Posted to general@lucene.apache.org by Richard Marr <ri...@gmail.com> on 2008/08/21 17:36:01 UTC
Re: Lucene-based Distributed Index Leveraging Hadoop
> Stefan Groschupf (4 Apr) wrote:
>
> I just noticed - too late though - Ning already contributed the code to
> hadoop. So I guess my question should be rephrased: what is the idea of
> moving this into its own project?
Hi all,
It was interesting to hear Mark Butler present his implementation of
Distributed Lucene at the Hadoop User Group meeting in London on
Tuesday. There's obviously been quite a bit of discussion on the
subject, and lots of interested parties. Mark, not sure if you're on
this list but thanks for sharing.
Is this the forum to ask about open projects? I'm interested in
joining a project as long as its goals aren't too distant from what I'm
looking for. Based mostly on gut feeling I'd rather go for a
stand-alone project that wasn't dependent on HDFS/Hadoop, but willing
to be convinced otherwise.
Rich
RE: Lucene-based Distributed Index Leveraging Hadoop
Posted by marcus clemens <ma...@hotmail.com>.
Message to Mark Butler: I am looking for candidates who have Lucene experience for contract and permanent positions.
Can you please send me your CV?
Re: Lucene-based Distributed Index Leveraging Hadoop
Posted by Richard Marr <ri...@gmail.com>.
2008/8/21 Stefan Groschupf <sg...@101tec.com>:
>
> Is there any material published about this? I would be very interested to
> see Mark's slides and hear about the discussion.
>
In case anybody wants to see Mark's talk, the slides and video are here:
http://skillsmatter.com/podcast/home/distributed-lucene-for-hadoop
Rich
Re: Lucene-based Distributed Index Leveraging Hadoop
Posted by Richard Marr <ri...@gmail.com>.
Stefan,
I've got a lot of reading and learning to do :o)
Thanks for the info, and good luck with your deployment.
Rich
Re: Lucene-based Distributed Index Leveraging Hadoop
Posted by Stefan Groschupf <sg...@101tec.com>.
Hi,
> In terms of which project best fits my needs my gut feeling is that
> dlucene is pretty close. It supports incremental updates, and doesn't
> build in dependencies on systems like HDFS or Terracotta (I don't yet
> understand all the implications of those systems so would rather keep
> things simple if possible).
Upgrades...
The way we solve this with katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes (since they slow
things down) together into a big new index.
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key,
always using the document with the latest timestamp.
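As a rough illustration of that client-side dedup, here is a minimal
sketch in Python. It is not Katta code: the Hit class, its field names
and the index names are made up for the example, and it only assumes
that every hit carries a key and a timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One search hit. All fields are hypothetical, not Katta's API."""
    key: str        # application-level document key
    timestamp: int  # when the document was indexed (e.g. epoch seconds)
    index: str      # name of the index the hit came from
    score: float    # relevance score reported for the hit


def dedup_latest(hits: Iterable[Hit]) -> List[Hit]:
    """Keep one hit per key: the one with the newest timestamp."""
    latest: Dict[str, Hit] = {}
    for hit in hits:
        current = latest.get(hit.key)
        if current is None or hit.timestamp > current.timestamp:
            latest[hit.key] = hit
    # Re-sort by score so the merged list is still ranked.
    return sorted(latest.values(), key=lambda h: h.score, reverse=True)


# Hits as they might come back when searching "*" across all indexes:
# "doc-1" exists in the big nightly index and in a newer small index.
hits = [
    Hit("doc-1", 100, "big-index", 0.9),
    Hit("doc-1", 250, "update-0821", 0.7),
    Hit("doc-2", 120, "big-index", 0.8),
]

for hit in dedup_latest(hits):
    print(hit.key, hit.timestamp, hit.index)
```

Only the newer copy of "doc-1" survives. Once the nightly merge has
folded the small indexes into the big one, the dedup step has nothing
left to remove for those documents.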
Dependencies...
Katta is independent of those technologies; it is built on Lucene, ZooKeeper
and Hadoop RPC (instead of RMI, HTTP or Apache Mina). We do
support loading index shards from a Hadoop file system, but you can also
load them from a mounted remote HDD, a NAS or whatever you like.
> The obvious drawback being that dlucene
> doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the katta svn
and I saw Marko checking in changes to dlucene. There was a discussion
between Mark and me about bringing dlucene and katta together, and I
would really love to see that happen, but unfortunately we had a lot of
pressure from our customer to deliver something, so we had to focus on
other things. More developers getting involved would clearly help
here.. :-)
>
>
> Thanks for the reply Stefan. I'll certainly be taking a look through
> the code for Katta since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in
less than 4 weeks - so we are working hard to iron out issues.
However, katta has been running for 6 weeks in a 10 node test
environment under heavy load.
Stefan
Re: Lucene-based Distributed Index Leveraging Hadoop
Posted by Richard Marr <ri...@gmail.com>.
Stefan,
> Is there any material published about this? I would be very interested to
> see Mark's slides and hear about the discussion.
I believe all the slides will be available. I'll post a link as soon
as I have one.
> Please keep in mind that katta is very young and compass or solr might be
> more interesting if you need something working now, though they might have
> different goals and focus than dlucene or katta.
I am looking to have something working relatively quickly, but my
performance needs and use cases are relatively modest (for now) so
some degree of string and sticky tape in the implementation is okay in
the short term. My main aim is to ensure that whatever I implement
scales horizontally without too much drama.
In terms of which project best fits my needs my gut feeling is that
dlucene is pretty close. It supports incremental updates, and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems so would rather keep
things simple if possible). The obvious drawback being that dlucene
doesn't seem to be an active public project.
Thanks for the reply Stefan. I'll certainly be taking a look through
the code for Katta since no doubt there's a lot to learn in there.
All the best...
Rich
Re: Lucene-based Distributed Index Leveraging Hadoop
Posted by Stefan Groschupf <sg...@101tec.com>.
Hi All, Hi Mark,
> It was interesting to hear Mark Butler present his implementation of
> Distributed Lucene at the Hadoop User Group meeting in London on
> Tuesday. There's obviously been quite a bit of discussion on the
> subject, and lots of interested parties. Mark, not sure if you're on
> this list but thanks for sharing.
Is there any material published about this? I would be very interested
to see Mark's slides and hear about the discussion.
> Is this the forum to ask about open projects? I'm interested in
> joining a project as long as its goals aren't too distant from what I'm
> looking for. Based mostly on gut feeling I'd rather go for a
> stand-alone project that wasn't dependent on HDFS/Hadoop, but willing
> to be convinced otherwise.
Rich, as you know there are a couple of projects in this area (solr,
compass, dlucene and katta), and since all are open source I guess the
easiest way to get involved is to join the mailing lists.
I can only speak for katta - we are very interested in getting more
people involved to get other perspectives. There is quite some activity
in our project, since it is part of an upcoming production
system, but low traffic on the mailing list (so far all developers work in
the same room).
You can find our mailing list on our SourceForge page:
http://katta.wiki.sourceforge.net/
Please keep in mind that katta is very young and compass or solr might
be more interesting if you need something working now, though they
might have different goals and focus than dlucene or katta.
Stefan Groschupf