Posted to dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2008/10/08 19:22:46 UTC

Re: Realtime Search for Social Networks Collaboration

Hi Joaquin,

Are you interested in integration of realtime search using Lucene with
Oracle?  This may be something that will benefit many users.

Jason

On Sun, Sep 21, 2008 at 11:38 PM, J. Delgado <jo...@gmail.com> wrote:
> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ्
> <no...@gmail.com> wrote:
>>
>> Moving back to the RDBMS model would be a big step backwards, where we lose
>> multivalued fields and arbitrary fields.
>
> No one is suggesting that we "lose" any of the virtues of the field-based
> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
> model with Lucene-based indexes one can map relational rows to documents and
> columns to fields. Note that one relational column can be mapped to one or
> more text-based fields, and multi-valued fields will still be allowed.
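
As a rough illustration of the row-to-document mapping described above, here is a minimal Python sketch. It does not use Lucene or Oracle; the function row_to_document, the mapping dictionary, and the sample row are all hypothetical names chosen for the example. It only shows that one column can feed several fields and that a list-valued column can become a multi-valued field.

```python
from collections import defaultdict


def row_to_document(row, column_to_fields):
    """Map one relational row (a dict) to a document: field name -> list of values."""
    doc = defaultdict(list)
    for column, value in row.items():
        # One column may feed one or more fields; by default it maps to itself.
        for field in column_to_fields.get(column, [column]):
            # A list-valued column becomes a multi-valued field.
            values = value if isinstance(value, list) else [value]
            doc[field].extend(str(v) for v in values)
    return dict(doc)


# Hypothetical sample data, for illustration only.
row = {"id": 7, "title": "Realtime search", "tags": ["lucene", "h2"]}
mapping = {"title": ["title", "title_exact"]}
doc = row_to_document(row, mapping)
```

Here the "title" column is copied into two fields, and "tags" keeps both of its values in a single multi-valued field.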
>
> Please check the Lucene OJVM implementation for details on the implementation
> and philosophy of the RDBMS-Lucene converged model:
>
> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>
> There is more discussion on the blog of Marcelo, who will be presenting at
> Oracle World 2008 this week.
> http://marceloochoa.blogspot.com/
>
> BTW, it just happens that this was implemented using Oracle, but a similar
> implementation in H2 seems not only feasible but desirable.
>
> -- Joaquin
>
>
>>
>>
>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>> <ja...@gmail.com> wrote:
>> > Cool.  I mention H2 because it does have some Lucene code in it yes.
>> > Also according to some benchmarks it's the fastest of the open source
>> > databases.  I think it's possible to integrate realtime search for H2.
>> >  I suppose there is no need to store the data in Lucene in this case?
>> > One loses the multiple values per field Lucene offers, and the schema
>> > becomes static.  Perhaps it's a trade-off?
>> >
>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <jo...@gmail.com>
>> > wrote:
>> >> Yes, both Marcelo and I would be interested.
>> >>
>> >> We looked into H2 and it looks like something similar to Oracle's ODCI can
>> >> be implemented. Plus, the primitive full-text implementation is based on
>> >> Lucene. I say primitive because, looking at the code, I saw that one cannot
>> >> define an Analyzer; for each scan corresponding to a where clause a
>> >> searcher is opened and closed, instead of using a pool; and there is no
>> >> way to queue changes to reduce the use of the IndexWriter, etc.
>> >>
>> >> But it's open source, and that is a great starting point!
>> >>
>> >> -- Joaquin
>> >>
>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> >> <ja...@gmail.com> wrote:
>> >>>
>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>> >>> www.h2database.com to take advantage of both models.  I'm not sure how
>> >>> exactly that would work, but it seems like it would not be too
>> >>> difficult.  Perhaps this would solve being able to perform faster
>> >>> hierarchical queries and perhaps other types of queries that Lucene is
>> >>> not capable of.
>> >>>
>> >>> Is this something Joaquin you are interested in collaborating on?  I
>> >>> am definitely interested in it.
>> >>>
>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <jo...@gmail.com>
>> >>> wrote:
>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>> >>> > <ot...@yahoo.com> wrote:
>> >>> >>
>> >>> >> Regarding real-time search and Solr, my feeling is the focus should be on
>> >>> >> first adding real-time search to Lucene, and then we'll figure out how to
>> >>> >> incorporate that into Solr later.
>> >>> >
>> >>> >
>> >>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>> >>> > Note that Lucene, being an indexing/search library (and not a full-blown
>> >>> > search engine), is by definition "real-time": once you add/write a
>> >>> > document to the index it becomes immediately searchable, and once a
>> >>> > document is deleted it is logically removed and no longer returned in a
>> >>> > search, though physical deletion happens during an index optimization.
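
To make the distinction between logical and physical deletion concrete, here is a minimal Python sketch. It is a toy in-memory index, not Lucene's API: the class ToyIndex and its methods (add, delete, search, optimize) are hypothetical names, and matching is a naive whitespace token test.

```python
class ToyIndex:
    """Toy index: deletes are only marked, and removed for real on optimize()."""

    def __init__(self):
        self.docs = []        # stored documents; list position is the doc id
        self.deleted = set()  # ids marked deleted but still physically present

    def add(self, text):
        # The document is visible to search() as soon as this returns.
        self.docs.append(text)
        return len(self.docs) - 1

    def delete(self, doc_id):
        # Logical deletion: mark the id, keep the stored document.
        self.deleted.add(doc_id)

    def search(self, term):
        # Skip documents that are marked deleted.
        return [d for i, d in enumerate(self.docs)
                if i not in self.deleted and term in d.lower().split()]

    def optimize(self):
        # Physical deletion: rebuild storage without the marked documents.
        # In this toy, surviving documents get new ids as a side effect.
        self.docs = [d for i, d in enumerate(self.docs) if i not in self.deleted]
        self.deleted.clear()


idx = ToyIndex()
first = idx.add("realtime search")
second = idx.add("batch search")
idx.delete(first)
hits_after_delete = idx.search("search")
stored_before_optimize = len(idx.docs)
idx.optimize()
stored_after_optimize = len(idx.docs)
```

After delete() the first document no longer appears in search results but still occupies storage; only optimize() reclaims it.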
>> >>> >
>> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
>> >>> > transaction, and making these documents available for search immediately
>> >>> > after the transaction is committed sounds more like a search engine
>> >>> > problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are
>> >>> > known to be I/O expensive and thus are usually implemented as batched
>> >>> > processes with some kind of sync mechanism, which makes them non
>> >>> > real-time.
>> >>> >
>> >>> > For example, in my previous life, I designed and helped implement a
>> >>> > quasi-realtime enterprise search engine using Lucene, with a set of
>> >>> > multi-threaded indexers hitting a set of multiple indexes allocated
>> >>> > across different search services, which powered a broker-based
>> >>> > distributed search interface. The most recent documents provided to the
>> >>> > indexers were always added to the smaller in-memory (RAM) indexes, which
>> >>> > usually could absorb the load of a bulk "add" transaction; these would
>> >>> > later be merged into larger disk-based indexes and then flushed to make
>> >>> > them ready to absorb new fresh docs. We even had further partitioning of
>> >>> > the indexes that reflected time periods, with caps on size, so that they
>> >>> > could be merged into older, more archive-based indexes which were used
>> >>> > less (yes, the search engine's default search was on data no more than
>> >>> > 1 month old, though the user could open the time window by including
>> >>> > archives).
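
The RAM-then-disk tiering described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the engine Joaquin describes: the class TieredIndex, the ram_cap parameter, and the use of plain lists in place of real Lucene indexes are all hypothetical.

```python
class TieredIndex:
    """Toy two-tier index: a small fresh tier that is merged into a larger one."""

    def __init__(self, ram_cap=3):
        self.ram_cap = ram_cap  # hypothetical size cap for the in-memory tier
        self.ram = []           # small tier that absorbs new documents
        self.disk = []          # stands in for the larger disk-based index

    def add(self, doc):
        # New documents always land in the small tier first.
        self.ram.append(doc)
        if len(self.ram) >= self.ram_cap:
            self.flush()

    def flush(self):
        # Merge the small tier into the large one, then empty it for new docs.
        self.disk.extend(self.ram)
        self.ram = []

    def search(self, term):
        # Consult both tiers, newest documents first.
        hits = [d for d in reversed(self.ram) if term in d.split()]
        hits += [d for d in reversed(self.disk) if term in d.split()]
        return hits


tiers = TieredIndex(ram_cap=2)
tiers.add("a x")
hits_one = tiers.search("x")   # served from the RAM tier only
tiers.add("b x")               # reaches the cap, so the RAM tier is merged
tiers.add("c x")
hits_all = tiers.search("x")   # combines both tiers
```

The point of the design is that a document is searchable as soon as it reaches the small tier, while the expensive merge into the large tier happens only when the cap is hit.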
>> >>> >
>> >>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>> >>> > engines are becoming more and more like relational databases with
>> >>> > full-text search capabilities (without the benefit of full relational
>> >>> > algebra -- for example, joins are not possible using SOLR). Notice that
>> >>> > "real-time" CRUD operations and transactionality are core DB concepts and
>> >>> > have been studied and developed by database communities for quite a long
>> >>> > time. There have been recent efforts on how to efficiently integrate
>> >>> > Lucene into relational databases (see the Lucene JVM ORACLE integration:
>> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html).
>> >>> >
>> >>> > I think we should seriously look at joining efforts with open-source
>> >>> > database engine projects written in Java (see
>> >>> > http://java-source.net/open-source/database-engines) in order to blend IR
>> >>> > and ORM once and for all.
>> >>> >
>> >>> > -- Joaquin
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>> >>> >> times to understand bits and pieces of it.  I have to admit there is still
>> >>> >> some fuzziness about the whole thing in my head - is "Ocean" something that
>> >>> >> already works, a separate project on googlecode.com?  I think so.  If so,
>> >>> >> and if you are working on getting it integrated into Lucene, would it make
>> >>> >> it less confusing to just refer to it as "real-time search", so there is no
>> >>> >> confusion?
>> >>> >>
>> >>> >> If this is to be initially integrated into Lucene, why are things like
>> >>> >> replication, crowding/field collapsing, locallucene, name service, tag
>> >>> >> index, etc. all mentioned there on the Wiki and bundled with the description
>> >>> >> of how real-time search works and is to be implemented?  I suppose mentioning
>> >>> >> replication kind of makes sense because the replication approach is closely
>> >>> >> tied to real-time search - all query nodes need to see index changes fast.
>> >>> >> But Lucene itself offers no replication mechanism, so maybe the replication
>> >>> >> is something to figure out separately, say on the Solr level, later on "once
>> >>> >> we get there".  I think even just the essential real-time search requires
>> >>> >> substantial changes to Lucene (I remember seeing large patches in JIRA),
>> >>> >> which makes it hard to digest, understand, comment on, and ultimately commit
>> >>> >> (hence the lukewarm response, I think).  Bringing other non-essential
>> >>> >> elements into the discussion at the same time makes it more difficult to
>> >>> >> process all this new stuff, at least for me.  Am I the only one who finds
>> >>> >> this hard?
>> >>> >>
>> >>> >> That said, it sounds like we have some discussion going (Karl...),
>> >>> >> so I
>> >>> >> look forward to understanding more! :)
>> >>> >>
>> >>> >>
>> >>> >> Otis
>> >>> >> --
>> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> ----- Original Message ----
>> >>> >> > From: Yonik Seeley <yo...@apache.org>
>> >>> >> > To: java-dev@lucene.apache.org
>> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >>> >> >
>> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>> >>> >> > wrote:
>> >>> >> > > I also think it's got a lot of things now which makes integration
>> >>> >> > > difficult to do properly.
>> >>> >> >
>> >>> >> > I agree, and that's why the major bump in version number rather than
>> >>> >> > minor - we recognize that some features will need some amount of
>> >>> >> > rearchitecture.
>> >>> >> >
>> >>> >> > > I think the problem with integration with SOLR is it was designed with
>> >>> >> > > a different problem set in mind than Ocean, originally the CNET
>> >>> >> > > shopping application.
>> >>> >> >
>> >>> >> > That was the first use of Solr, but it actually existed before that
>> >>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>> >>> >> > based search servers (that's actually where some of the parameter
>> >>> >> > names come from... the default /select URL instead of /search, the
>> >>> >> > "rows" parameter, etc).
>> >>> >> >
>> >>> >> > But you're right... some things like the replication strategy were
>> >>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>> >>> >> > would be OK to have slightly "stale" views of the data in the range of
>> >>> >> > minutes.  It just made things easier/possible at the time.  But tons
>> >>> >> > of Solr and Lucene users want almost instantaneous visibility of added
>> >>> >> > documents, if they can get it.  It's hardly restricted to social
>> >>> >> > network applications.
>> >>> >> >
>> >>> >> > Bottom line is that Solr aims to be a general enterprise search
>> >>> >> > platform, and getting as real-time as we can get, and as scalable as
>> >>> >> > we can get are some of the top priorities going forward.
>> >>> >> >
>> >>> >> > -Yonik
>> >>> >> >
>> >>> >> >
>> >>> >> > ---------------------------------------------------------------------
>> >>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>>
>
>