Posted to java-user@lucene.apache.org by "Angel, Eric" <ea...@business.com> on 2009/10/09 04:00:12 UTC

Realtime & distributed

Does anyone have any recommendations?  I've looked at Katta, but it  
doesn't seem to support realtime searching.  It also uses hdfs, which  
I've heard can be slow.  I'm looking to serve 40gb of indexes and  
support about 1 million updates per day.

Thx

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
Hi Mike,

  Zoie itself doesn't do anything new with the distributed
side of things - it just plays nicely with it.   Zoie, at its core,
exposes a couple of primary interfaces (well, this is a slightly
simplified form of them):

  interface IndexReaderFactory { List getIndexReaders(); }, and
  interface DataConsumer { void consume(Collection events); }

To do distributed realtime search with zoie, you just need to
make sure you get your indexing events to each of your nodes
as fast as they show up, push them in through the DataConsumer
API, and IndexReaders exposed through getIndexReaders() are
then a fresh realtime read-only view on the index on each node.

Doing distributed search with a setup like this now means just
pushing your Query to all of the nodes, returning the top n hits
from each back to a broker, sorting all n * num_nodes results
by score and taking the top n of that combined list.
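
That scatter-gather merge can be sketched like this (the Hit class and method
names are made up for illustration; zoie/Lucene would use ScoreDoc or
similar, but the merge logic is the same):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative broker-side merge: each node returns its top-n scored hits;
// the broker concatenates the n * numNodes candidates, sorts them by score
// descending, and keeps the overall top n.
public class BrokerMerge {
    // A scored hit; in real Lucene code this would be a ScoreDoc or similar.
    public static final class Hit {
        public final String docId;
        public final float score;
        public Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    public static List<Hit> mergeTopN(List<List<Hit>> perNodeHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> nodeHits : perNodeHits) {
            all.addAll(nodeHits);          // top n hits from each node
        }
        // Sort the combined candidate list by score, highest first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all.subList(0, Math.min(n, all.size()));
    }
}
```

Since each node already returns its own top n, the broker never needs more
than n * num_nodes candidates to produce a correct global top n.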

Depending on your system's setup, you either push events to
the nodes or pull events from somewhere to them; if you do
the latter, the real-timeliness will of course be bounded by
how often you poll.

  -jake

On Fri, Oct 9, 2009 at 9:09 PM, Michael Masters <mm...@gmail.com> wrote:

> Hi Jake,
>
> Zoie looks like a really cool project. I'd like to learn more about
> the distributed part of the setup. Any way you could describe that
> here or on the wiki?
>
> -Mike
>
> On Thu, Oct 8, 2009 at 9:24 PM, Jake Mannix <ja...@gmail.com> wrote:
> > On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
> >
> >>
> >> Does anyone have any recommendations?  I've looked at Katta, but it
> doesn't
> >> seem to support realtime searching.  It also uses hdfs, which I've heard
> can
> >> be slow.  I'm looking to serve 40gb of indexes and support about 1
> million
> >> updates per day.
> >>
> >>
> > Hi Eric,
> >
> >  As I mentioned in my response to Jason, we at LinkedIn serve our roughly
> > 50 million document profile index on a real-time distributed setup (we're
> > serving facets in real-time also), serving tens of millions of queries a
> > day with 1-10ms latency per node, based on the open source zoie project
> > (built here at LinkedIn): http://zoie.googlecode.com
> >
> >  Zoie doesn't handle the distributed part of the setup, it's just the
> > real-time side.  Distribution is done pretty straightforwardly in our
> case
> > though: N shards each getting a different contiguous slice of the user
> base,
> > each replicated K times, and all N*K nodes get indexing events
> distributed
> > by a message queue independently.
> >
> >  If you have any questions about zoie, let me know.  The documentation
> > could get filled in a little further, and it doesn't touch on the
> > distributed side of things, so feel free to ping me.
> >
> >  -jake
> >
>
>
>

Re: Realtime & distributed

Posted by Michael Masters <mm...@gmail.com>.
Hi Jake,

Zoie looks like a really cool project. I'd like to learn more about
the distributed part of the setup. Any way you could describe that
here or on the wiki?

-Mike

On Thu, Oct 8, 2009 at 9:24 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
>
>>
>> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
>> seem to support realtime searching.  It also uses hdfs, which I've heard can
>> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
>> updates per day.
>>
>>
> Hi Eric,
>
>  As I mentioned in my response to Jason, we at LinkedIn serve our roughly
> 50 million document profile index on a real-time distributed setup (we're
> serving facets in real-time also), serving tens of millions of queries a day
> with 1-10ms latency per node, based on the open source zoie project (built
> here at LinkedIn): http://zoie.googlecode.com
>
>  Zoie doesn't handle the distributed part of the setup, it's just the
> real-time side.  Distribution is done pretty straightforwardly in our case
> though: N shards each getting a different contiguous slice of the user base,
> each replicated K times, and all N*K nodes get indexing events distributed
> by a message queue independently.
>
>  If you have any questions about zoie, let me know.  The documentation
> could get filled in a little further, and it doesn't touch on the
> distributed side of things, so feel free to ping me.
>
>  -jake
>



Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:

>
> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
> seem to support realtime searching.  It also uses hdfs, which I've heard can
> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
> updates per day.
>
>
Hi Eric,

  As I mentioned in my response to Jason, we at LinkedIn serve our roughly
50 million document profile index on a real-time distributed setup (we're
serving facets in real-time also), serving tens of millions of queries a day
with 1-10ms latency per node, based on the open source zoie project (built
here at LinkedIn): http://zoie.googlecode.com

  Zoie doesn't handle the distributed part of the setup, it's just the
real-time side.  Distribution is done pretty straightforwardly in our case
though: N shards each getting a different contiguous slice of the user base,
each replicated K times, and all N*K nodes get indexing events distributed
by a message queue independently.
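
A rough sketch of that layout (the contiguous-range routing math and the
shard-major node numbering here are illustrative assumptions, not the actual
LinkedIn code):

```java
// Illustrative routing for N shards of contiguous user-id ranges, each
// replicated K times: a lookup maps a user id to its shard, and an indexing
// event fans out to all K replica nodes of that shard.
public class ShardRouter {
    private final int numShards;   // N
    private final int replicas;    // K
    private final long maxUserId;  // ids 0..maxUserId-1 split into N contiguous slices

    public ShardRouter(int numShards, int replicas, long maxUserId) {
        this.numShards = numShards;
        this.replicas = replicas;
        this.maxUserId = maxUserId;
    }

    // Contiguous slicing: shard s owns ids [s*width, (s+1)*width).
    public int shardFor(long userId) {
        long width = (maxUserId + numShards - 1) / numShards; // ceiling division
        return (int) Math.min(userId / width, numShards - 1);
    }

    // All K replica node ids for a shard; every one of them gets the
    // indexing event (e.g. via the message queue).
    public int[] replicaNodes(int shard) {
        int[] nodes = new int[replicas];
        for (int r = 0; r < replicas; r++) {
            nodes[r] = shard * replicas + r;   // simple shard-major numbering
        }
        return nodes;
    }
}
```

Queries go to one replica per shard; indexing events go to all K replicas of
the owning shard, which is what keeps every node's realtime view current.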

  If you have any questions about zoie, let me know.  The documentation
could get filled in a little further, and it doesn't touch on the
distributed side of things, so feel free to ping me.

  -jake

Re: Realtime & distributed

Posted by Bradford Stephens <br...@gmail.com>.
My deepest apologies for the spam, everyone. I slipped on my G-mail button :)

On Fri, Oct 9, 2009 at 9:09 PM, Bradford Stephens
<br...@gmail.com> wrote:
> Hey Eric,
>
> My consulting company specializes in scalable, real-time search with
> distributed Lucene. I'm more than happy to chat, if you'd like! :)
>
> Cheers,
> Bradford
>
> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
>>
>> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
>> seem to support realtime searching.  It also uses hdfs, which I've heard can
>> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
>> updates per day.
>>
>> Thx
>>
>>
>>
>
>
>
> --
> http://www.drawntoscaleconsulting.com - Scalability, Hadoop, HBase,
> and Distributed Lucene Consulting
>
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>



-- 
http://www.drawntoscaleconsulting.com - Scalability, Hadoop, HBase,
and Distributed Lucene Consulting

http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science



Re: Realtime & distributed

Posted by Bradford Stephens <br...@gmail.com>.
Hey Eric,

My consulting company specializes in scalable, real-time search with
distributed Lucene. I'm more than happy to chat, if you'd like! :)

Cheers,
Bradford

On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
>
> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
> seem to support realtime searching.  It also uses hdfs, which I've heard can
> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
> updates per day.
>
> Thx
>
>
>



-- 
http://www.drawntoscaleconsulting.com - Scalability, Hadoop, HBase,
and Distributed Lucene Consulting

http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science



Re: Realtime & distributed

Posted by John Wang <jo...@gmail.com>.
I can provide some preliminary numbers (we will need to do some detailed
analysis and post it somewhere):

Dataset: medline
Starting index: empty
Add only, no updates, for 30 min
Maximum indexing load: 1000 docs/sec

Under stress, we take indexing events (add only) and stream them into both
systems: Zoie and an NRT consumer.

First dimension to track: realtime-ness, i.e. making documents available to
the searcher as readily as possible. For this, for each batch in the stream
we call IndexWriter.commit on the NRT side:

We found the indexing speed for NRT is slow: e.g. in 30 min, zoie indexed
1.4 million docs, whereas NRT indexed only 200k. This is actually expected,
because zoie batches writes to the disk index - the actual realtime indexing
goes into the small memory index - whereas NRT is always adding to the
target disk index.

We added batching to NRT, e.g. only calling indexWriter.commit when the
number of requests reaches 1000. This made the indexing speed with NRT more
comparable. However, at this point zoie remained realtime, while NRT did not.
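
That batching can be sketched generically (the class and threshold here are
illustrative; in the actual test the flush step is the call to
IndexWriter.commit):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative batched consumer: buffer indexing events and only trigger
// the expensive flush (IndexWriter.commit in the NRT test) once the buffer
// reaches a threshold, e.g. 1000 requests.
public class BatchingConsumer<T> {
    public interface Flusher<T> { void flush(List<T> batch); }

    private final int threshold;
    private final Flusher<T> flusher;
    private final List<T> buffer = new ArrayList<>();

    public BatchingConsumer(int threshold, Flusher<T> flusher) {
        this.threshold = threshold;
        this.flusher = flusher;
    }

    public void consume(T event) {
        buffer.add(event);
        if (buffer.size() >= threshold) {
            flusher.flush(new ArrayList<>(buffer)); // hand off a copy
            buffer.clear();
        }
    }

    // Docs accepted but not yet visible to searchers.
    public int pending() { return buffer.size(); }
}
```

The pending() count is exactly the realtime-ness gap described here: those
buffered docs stay invisible to searchers until the next flush fires.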

IMHO, lucene NRT provides a good way to do stream/batch indexing without
having to make cumbersome calls to track IndexReader/Writer instances.
Furthermore, one of the biggest benefits of Lucene NRT in 2.9 is the
segment-level search. This was a major refactor that provided major benefits
to lucene, and it really shows off lucene's incremental update feature.

The question of "how realtime" can lead to a very academic discussion :)

Under stress and heavy load, batching is fine: the load keeps pushing docs
to be indexed, so the delay is small. Under semi-heavy load, docs are
batched until the queue size is "ripe" before being added to the index; when
the load is lighter, the impact on indexing performance becomes less
significant.

To be truly realtime, IMHO, you need some sort of memory helper to handle
transient indexing requests. That is where the actual challenge lies.

-John

On Fri, Oct 9, 2009 at 1:06 PM, Jason Rutherglen <jason.rutherglen@gmail.com
> wrote:

> The dimensions sound good.  It's unclear if you're going to post a
> chart again, numbers, or code?  There's a LUCENE-1577 Jira issue for
> code.
>
> On Fri, Oct 9, 2009 at 12:37 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Jason,
> >
> >  We've been running some perf/load/stress tests lately, but on a
> suggestion
> >
> > from Ted Dunning, I've been trying to come up with a more "realistic" set
> of
> > stress
> > tests and indexing rates to see where NRT performs well and where it does
> > not,
> > instead of just indexing at maximum rate, looping over all docs in the
> test
> > set
> > and then doing them again and again.
> >
> >  Once we've got a good test set, which hits on the variety of dimensions:
> > indexing
> > rate, document size, query rate while indexing, and delay-to-visibility
> of
> > indexed docs,
> > we'll certainly post that, as John did for the zoie tests on the zoie
> wiki.
> >
> >  -jake
> >
> > On Fri, Oct 9, 2009 at 12:29 PM, Jason Rutherglen <
> > jason.rutherglen@gmail.com> wrote:
> >
> >> Jake and John,
> >>
> >> It would be interesting and enlightening to see NRT performance
> >> numbers in a variety of configurations. The best way to go about
> >> this is to post benchmarks that others may run in their
> >> environment which can then be tweaked for their unique edge
> >> cases. I wish I had more time to work on it.
> >>
> >> -J
> >>
> >> On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> >> > Jason,
> >> >
> >> > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <
> >> jason.rutherglen@gmail.com
> >> >> wrote:
> >> >
> >> >> Today near realtime search (with or without SSDs) comes at a
> >> >> price, that is reduced indexing speed due to continued in RAM
> >> >> merging. People typically hack something together where indexes
> >> >> are held in a RAMDir until being flushed to disk. The problem
> >> >> with this is, merging in the background becomes really tricky
> >> >> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> >> >> IW.getReader). There is the Zoie system which uses the RAMDir
> >> >> solution, however it's implemented using a customized deleted
> >> >> doc set based on a bloomfilter backed by an inefficient RB tree
> >> >> which slows down queries. There's always a trade off when trying
> >> >> to build an NRT system, currently.
> >> >>
> >> >
> >> >  I'm not sure what numbers you are using to justify saying that zoie
> >> > "slows down queries" - latency at LinkedIn using zoie has a typical
> >> > median response time of 4-8ms at the searcher node level (slower
> >> > at the broker due to a lot of custom stuff that happens before
> >> > queries are actually sent to the nodes), while dealing with sustained
> >> > rapid indexing throughput, all with basically zero time between
> indexing
> >> > event to index visibility (ie. true real-time, not "near real time",
> >> unless
> >> > indexing events are coming in *very* fast).
> >> >
> >> >  You say there's a tradeoff, but as you should remember from your
> >> > time at LinkedIn, we do distributed realtime faceted search while
> >> > maintaining extremely low latency and still indexing sometimes more
> >> > than a thousand new docs a minute per node (I should dredge up
> >> > some new numbers to verify what that is exactly these days).
> >> >
> >> >
> >> > Deletes can pile up in segments so the
> >> >> BalancedSegmentMergePolicy could be used to remove those faster
> >> >> than LogMergePolicy, however I haven't tested it, and it may be
> >> >> trying to not do large segment merges altogether which IMO
> >> >> is less than ideal because query performance soon degrades
> >> >> (similar to an unoptimized index).
> >> >>
> >> >
> >> > Not optimizing all the way has shown in our case to actually be
> >> > *better* than the "optimal" case of a 1-segment index, at least in
> >> > the case of realtime indexing at rapid update pace.
> >> >
> >> >
> >> >  -jake
> >> >
> >>
> >>
> >>
> >
>
>
>

Re: Realtime & distributed

Posted by Jason Rutherglen <ja...@gmail.com>.
The dimensions sound good.  It's unclear whether you're going to post a
chart again, numbers, or code.  There's a LUCENE-1577 Jira issue for
code.

On Fri, Oct 9, 2009 at 12:37 PM, Jake Mannix <ja...@gmail.com> wrote:
> Jason,
>
>  We've been running some perf/load/stress tests lately, but on a suggestion
>
> from Ted Dunning, I've been trying to come up with a more "realistic" set of
> stress
> tests and indexing rates to see where NRT performs well and where it does
> not,
> instead of just indexing at maximum rate, looping over all docs in the test
> set
> and then doing them again and again.
>
>  Once we've got a good test set, which hits on the variety of dimensions:
> indexing
> rate, document size, query rate while indexing, and delay-to-visibility of
> indexed docs,
> we'll certainly post that, as John did for the zoie tests on the zoie wiki.
>
>  -jake
>
> On Fri, Oct 9, 2009 at 12:29 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> Jake and John,
>>
>> It would be interesting and enlightening to see NRT performance
>> numbers in a variety of configurations. The best way to go about
>> this is to post benchmarks that others may run in their
>> environment which can then be tweaked for their unique edge
>> cases. I wish I had more time to work on it.
>>
>> -J
>>
>> On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix <ja...@gmail.com> wrote:
>> > Jason,
>> >
>> > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <
>> jason.rutherglen@gmail.com
>> >> wrote:
>> >
>> >> Today near realtime search (with or without SSDs) comes at a
>> >> price, that is reduced indexing speed due to continued in RAM
>> >> merging. People typically hack something together where indexes
>> >> are held in a RAMDir until being flushed to disk. The problem
>> >> with this is, merging in the background becomes really tricky
>> >> unless it's performed inside of IndexWriter (see LUCENE-1313 and
>> >> IW.getReader). There is the Zoie system which uses the RAMDir
>> >> solution, however it's implemented using a customized deleted
>> >> doc set based on a bloomfilter backed by an inefficient RB tree
>> >> which slows down queries. There's always a trade off when trying
>> >> to build an NRT system, currently.
>> >>
>> >
>> >  I'm not sure what numbers you are using to justify saying that zoie
>> > "slows down queries" - latency at LinkedIn using zoie has a typical
>> > median response time of 4-8ms at the searcher node level (slower
>> > at the broker due to a lot of custom stuff that happens before
>> > queries are actually sent to the nodes), while dealing with sustained
>> > rapid indexing throughput, all with basically zero time between indexing
>> > event to index visibility (ie. true real-time, not "near real time",
>> unless
>> > indexing events are coming in *very* fast).
>> >
>> >  You say there's a tradeoff, but as you should remember from your
>> > time at LinkedIn, we do distributed realtime faceted search while
>> > maintaining extremely low latency and still indexing sometimes more
>> > than a thousand new docs a minute per node (I should dredge up
>> > some new numbers to verify what that is exactly these days).
>> >
>> >
>> > Deletes can pile up in segments so the
>> >> BalancedSegmentMergePolicy could be used to remove those faster
>> >> than LogMergePolicy, however I haven't tested it, and it may be
>> >> trying to not do large segment merges altogether which IMO
>> >> is less than ideal because query performance soon degrades
>> >> (similar to an unoptimized index).
>> >>
>> >
>> > Not optimizing all the way has shown in our case to actually be
>> > *better* than the "optimal" case of a 1-segment index, at least in
>> > the case of realtime indexing at rapid update pace.
>> >
>> >
>> >  -jake
>> >
>>
>>
>>
>



Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
Jason,

  We've been running some perf/load/stress tests lately, but on a suggestion
from Ted Dunning, I've been trying to come up with a more "realistic" set of
stress tests and indexing rates to see where NRT performs well and where it
does not, instead of just indexing at maximum rate, looping over all docs in
the test set and then doing them again and again.

  Once we've got a good test set, which hits on the variety of dimensions
(indexing rate, document size, query rate while indexing, and
delay-to-visibility of indexed docs), we'll certainly post that, as John did
for the zoie tests on the zoie wiki.

  -jake

On Fri, Oct 9, 2009 at 12:29 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Jake and John,
>
> It would be interesting and enlightening to see NRT performance
> numbers in a variety of configurations. The best way to go about
> this is to post benchmarks that others may run in their
> environment which can then be tweaked for their unique edge
> cases. I wish I had more time to work on it.
>
> -J
>
> On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix <ja...@gmail.com> wrote:
> > Jason,
> >
> > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com
> >> wrote:
> >
> >> Today near realtime search (with or without SSDs) comes at a
> >> price, that is reduced indexing speed due to continued in RAM
> >> merging. People typically hack something together where indexes
> >> are held in a RAMDir until being flushed to disk. The problem
> >> with this is, merging in the background becomes really tricky
> >> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> >> IW.getReader). There is the Zoie system which uses the RAMDir
> >> solution, however it's implemented using a customized deleted
> >> doc set based on a bloomfilter backed by an inefficient RB tree
> >> which slows down queries. There's always a trade off when trying
> >> to build an NRT system, currently.
> >>
> >
> >  I'm not sure what numbers you are using to justify saying that zoie
> > "slows down queries" - latency at LinkedIn using zoie has a typical
> > median response time of 4-8ms at the searcher node level (slower
> > at the broker due to a lot of custom stuff that happens before
> > queries are actually sent to the nodes), while dealing with sustained
> > rapid indexing throughput, all with basically zero time between indexing
> > event to index visibility (ie. true real-time, not "near real time",
> unless
> > indexing events are coming in *very* fast).
> >
> >  You say there's a tradeoff, but as you should remember from your
> > time at LinkedIn, we do distributed realtime faceted search while
> > maintaining extremely low latency and still indexing sometimes more
> > than a thousand new docs a minute per node (I should dredge up
> > some new numbers to verify what that is exactly these days).
> >
> >
> > Deletes can pile up in segments so the
> >> BalancedSegmentMergePolicy could be used to remove those faster
> >> than LogMergePolicy, however I haven't tested it, and it may be
> >> trying to not do large segment merges altogether which IMO
> >> is less than ideal because query performance soon degrades
> >> (similar to an unoptimized index).
> >>
> >
> > Not optimizing all the way has shown in our case to actually be
> > *better* than the "optimal" case of a 1-segment index, at least in
> > the case of realtime indexing at rapid update pace.
> >
> >
> >  -jake
> >
>
>
>

Re: Realtime & distributed

Posted by Jason Rutherglen <ja...@gmail.com>.
Jake and John,

It would be interesting and enlightening to see NRT performance
numbers in a variety of configurations. The best way to go about
this is to post benchmarks that others may run in their
environment which can then be tweaked for their unique edge
cases. I wish I had more time to work on it.

-J

On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix <ja...@gmail.com> wrote:
> Jason,
>
> On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherglen@gmail.com
>> wrote:
>
>> Today near realtime search (with or without SSDs) comes at a
>> price, that is reduced indexing speed due to continued in RAM
>> merging. People typically hack something together where indexes
>> are held in a RAMDir until being flushed to disk. The problem
>> with this is, merging in the background becomes really tricky
>> unless it's performed inside of IndexWriter (see LUCENE-1313 and
>> IW.getReader). There is the Zoie system which uses the RAMDir
>> solution, however it's implemented using a customized deleted
>> doc set based on a bloomfilter backed by an inefficient RB tree
>> which slows down queries. There's always a trade off when trying
>> to build an NRT system, currently.
>>
>
>  I'm not sure what numbers you are using to justify saying that zoie
> "slows down queries" - latency at LinkedIn using zoie has a typical
> median response time of 4-8ms at the searcher node level (slower
> at the broker due to a lot of custom stuff that happens before
> queries are actually sent to the nodes), while dealing with sustained
> rapid indexing throughput, all with basically zero time between indexing
> event to index visibility (ie. true real-time, not "near real time", unless
> indexing events are coming in *very* fast).
>
>  You say there's a tradeoff, but as you should remember from your
> time at LinkedIn, we do distributed realtime faceted search while
> maintaining extremely low latency and still indexing sometimes more
> than a thousand new docs a minute per node (I should dredge up
> some new numbers to verify what that is exactly these days).
>
>
> Deletes can pile up in segments so the
>> BalancedSegmentMergePolicy could be used to remove those faster
>> than LogMergePolicy, however I haven't tested it, and it may be
>> trying to not do large segment merges altogether which IMO
>> is less than ideal because query performance soon degrades
>> (similar to an unoptimized index).
>>
>
> Not optimizing all the way has shown in our case to actually be
> *better* than the "optimal" case of a 1-segment index, at least in
> the case of realtime indexing at rapid update pace.
>
>
>  -jake
>



Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
Jason,

On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherglen@gmail.com
> wrote:

> Today near realtime search (with or without SSDs) comes at a
> price, that is reduced indexing speed due to continued in RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flushed to disk. The problem
> with this is, merging in the background becomes really tricky
> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> IW.getReader). There is the Zoie system which uses the RAMDir
> solution, however it's implemented using a customized deleted
> doc set based on a bloomfilter backed by an inefficient RB tree
> which slows down queries. There's always a trade off when trying
> to build an NRT system, currently.
>

  I'm not sure what numbers you are using to justify saying that zoie
"slows down queries" - latency at LinkedIn using zoie has a typical
median response time of 4-8ms at the searcher node level (slower
at the broker due to a lot of custom stuff that happens before
queries are actually sent to the nodes), while dealing with sustained
rapid indexing throughput, all with basically zero time between indexing
event and index visibility (i.e. true real-time, not "near real time",
unless indexing events are coming in *very* fast).

  You say there's a tradeoff, but as you should remember from your
time at LinkedIn, we do distributed realtime faceted search while
maintaining extremely low latency and still indexing sometimes more
than a thousand new docs a minute per node (I should dredge up
some new numbers to verify what that is exactly these days).


Deletes can pile up in segments so the
> BalancedSegmentMergePolicy could be used to remove those faster
> than LogMergePolicy, however I haven't tested it, and it may be
> trying to not do large segment merges altogether which IMO
> is less than ideal because query performance soon degrades
> (similar to an unoptimized index).
>

Not optimizing all the way has shown in our case to actually be
*better* than the "optimal" case of a 1-segment index, at least in
the case of realtime indexing at rapid update pace.


  -jake

Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherglen@gmail.com
> wrote:

> There is the Zoie system which uses the RAMDir
> solution,
>

Also, to clarify: zoie does not index into a RAMDir and then periodically
merge that down to disk, as, for one thing, this has a bad failure mode when
the system crashes: you lose the entire RAMDir and have to figure out how
far back to look in your transaction log to know how much to reindex.

Zoie instead indexes "redundantly": every incoming document is indexed into
a RAMDir *and* the FSDirectory simultaneously, but the disk IndexReader for
the FSDirectory is only reopened every 15 minutes or so, while the
IndexReader for the RAMDirectory is reopened for every query, to guarantee
real-timeliness of the index.
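
A toy model of that redundant scheme (the list-backed "indexes" are
stand-ins for the real RAMDirectory/FSDirectory readers; real zoie also
reconciles deletes and updates between the two views, which is elided here):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of zoie's redundant indexing: every event is written to both an
// in-memory index and the durable disk index.  The disk *view* is a
// snapshot that is only reopened periodically (every ~15 min in zoie),
// while the RAM view is read fresh on every query, so searches always see
// (stale disk snapshot) + (up-to-the-moment RAM docs).
public class DualIndex {
    private final List<String> ram = new ArrayList<>();    // stands in for the RAMDirectory
    private final List<String> disk = new ArrayList<>();   // stands in for the FSDirectory
    private List<String> diskSnapshot = new ArrayList<>(); // the reopened disk IndexReader

    public void index(String doc) {
        ram.add(doc);   // visible on the very next query
        disk.add(doc);  // durable, but not visible until the next reopen
    }

    // Called periodically: reopen the disk reader and drop the RAM docs
    // it now covers (a simplification of zoie's actual handoff).
    public void reopenDiskReader() {
        diskSnapshot = new ArrayList<>(disk);
        ram.clear();
    }

    // Every query sees the disk snapshot plus all RAM docs.
    public List<String> search() {
        List<String> view = new ArrayList<>(diskSnapshot);
        view.addAll(ram);
        return view;
    }
}
```

A crash loses at most the small RAM view's *visibility*, not the documents:
they were written toward the durable index all along.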

The only case where zoie *isn't* realtime is when indexing updates come in
faster than they can be indexed into the RAMDirectory - in that case,
updates pile up in a queue served by the indexing thread and won't be
visible until that thread has caught up.  In practice, this doesn't happen
unless a given node is trying to index a hundred documents (depending on
size, of course) per second.

Of course, since the IndexWriter buffers some documents in RAM before
flushing to disk, you are not totally immune to system failures, but zoie is
no more susceptible to that than non-realtime search, as it's writing
directly to disk all the time as well (and yes, this is redundant, but ever
since the fantastic indexing speed improvements of Lucene 2.3, I've yet to
see indexing be the bottleneck anymore).

  -jake

Re: Realtime & distributed

Posted by John Wang <jo...@gmail.com>.
Eric:

   For more specific Zoie questions, let's move it to the zoie discussion
group instead.

Thanks

-John

On Sun, Oct 11, 2009 at 2:31 PM, John Wang <jo...@gmail.com> wrote:

> Hi Eric:
>
> I regret the direction the thread has taken and partly take responsibility
> for it...
>
> As to your question:
>
> We have 2 nodes per commodity server, each holding 5 million docs (although
> given the numbers we are seeing, we think we were a bit too conservative,
> and may increase to 10). In terms of indexing, each partition is doing
> indexing in realtime. We have a total of about 12 partitions, so 6
> machines, with about 4-5 replications.
>
> RamDir only holds the transient index; once it is flushed to the disk
> index, the RamDir is emptied. So yes, it is the second part of your
> question. The trick is the synchronization logic, as well as the handling
> of deletes and updates between the ram and disk index.
>
> I am not sure I can disclose what HW we are using, but Zoie is designed to
> run on commodity HW.
>
> I think it is always a good idea to archive your data. Since with our
> setup, we have replications that holds its own copy of the index, so there
> is already redundancy. So having a set of offline nodes doing just indexing
> is not necc.
>
> Yes, we are working hard to make zoie 2.9 compatible. As Jake has
> previously mentioned, Lucene 2.9 has changed alot internally
>
> (I personally think these changes are awesome and really allows
> applications the flexibility to unleash the powers of the lucene. Plus these
> changes are very performance oriented for incremental indexing, which is
> important to us. Much kudos to the lucene team and the contributors)
>
> so to fully take advantage of this work while maintaining backward
> compatibility is not a trivial project.
>
> Expect to see another maintenance release of zoie before 2.9. We hope to
> have 2.9 work done soon, but in terms of timing, lucene 3.0 (partiticularly
> looking forward to custom indexing) is also coming out, we are deciding
> whether to wait and include 3.0.
>
> Hope this helps.
>
> -John
>
>
> On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric <ea...@business.com> wrote:
>
>> Man, this thread really went south.  Anyhow, I have a few questions about
>> Zoie:
>>
>> * How many nodes are you using to support the speeds you desire at LI?
>> * Am I wrong to assume that the RAMDir holds the entire index - just as
>> the FSDir?  Or does RAMDir only hold a portion of the index that hasn't yet
>> been flushed to disk?
>> * Katta is supposed to be able to be able to run on commodity hardware -
>> is that the same case for Zoie?
>> * Would you agree that it's a good idea to build an "offline" index
>> parallel to the online index in case there is a crash on the online index
>> and data is lost?
>> * I see that there are plans to have Zoie use Lucene 2.9.  How long would
>> you say before it's available?
>>
>> Thanks,
>>
>> E
>>
>> -----Original Message-----
>> From: Jason Rutherglen [mailto:jason.rutherglen@gmail.com]
>> Sent: Sat 10/10/2009 12:16 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Realtime & distributed
>>
>> John,
>>
>> Actually everyone is entitled to their technical opinion and
>> none of the comments were misleading. Jake and yourself
>> validated that they are true in your comments. I'm simply trying
>> to create better technology as is everyone on here. The process
>> takes time and coordination between many parties of many
>> backgrounds around the globe. Sometimes there are differences of
>> opinion, however those are easily ironed out over time (and quite
>> frankly in this case benchmarks).
>>
>> However I am very concerned about your ignorant disregard of some of the
>> most basic human rights in existence.
>>
>> -J
>>
>> On Thu, Oct 8, 2009 at 10:26 PM, John Wang <jo...@gmail.com> wrote:
>> > Jason:
>> >        I would really appreciate it if you would stop making false
>> > statements and misinformation. Everyone is entitled to his/her opinions
>> on
>> > technologies, but deliberately making misleading and false information
>> on
>> > such a distribution is just unethical, and you'll end up just
>> discrediting
>> > yourself.
>> >
>> >        Making unsubstantiated comments while not willing to put in any
>> > effort is the primary reason you are no longer working at Linkedin and
>> on
>> > Zoie.
>> >
>> > "The problem
>> > with this is, merging in the background becomes really tricky
>> > unless it's performed inside of IndexWriter" - *what does this really
>> mean?
>> > Merging happens regardless in an incremental indexing system. Especially
>> > with high indexing load, segments are created often, merging is
>> crucial.*
>> > "There is the Zoie system which uses the RAMDir
>> > solution, however it's implemented using a customized deleted
>> > doc set based on a bloomfilter backed by an inefficient RB tree
>> > which slows down queries"  -* if you ever spend the time to read the
>> code,
>> > (even when you were working on it), it is just not true. We did have an
>> RB
>> > set for deleted docs, quite a few releases ago, and we changed to a
>> special
>> > type of bloomfilter set backed by a hash int set. You knew this and was
>> part
>> > of the discussion on it, and now saying such a thing is just plain
>> > disappointing.*
>> >
>> >        Thanks Jake for the clarification, and Eric, let me know if you
>> to
>> > know more in detail with how we are dealing with realtime
>> indexing/search
>> > with Zoie here at linkedin in a production environment powering a real
>> > internet company with real traffic.
>> >
>> > -John
>> >
>> > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <
>> jason.rutherglen@gmail.com
>> >> wrote:
>> >
>> >> Eric,
>> >>
>> >> Katta doesn't require HDFS which would be slow to search on,
>> >> though Katta can be used to copy indexes out of HDFS onto local
>> >> servers. The best bet is hardware that uses SSDs because merges
>> >> and update latency will greatly decrease and there won't be a
>> >> synchronous IO issue as there is with hard drives. Also, IO
>> >> caches get flushed as large merges occur, which means subsequent
>> >> queries may hit the HD and slow down. With SSDs this is much
>> >> less of an issue.
>> >>
>> >> Today near realtime search (with or without SSDs) comes at a
>> >> price, that is reduced indexing speed due to continued in RAM
>> >> merging. People typically hack something together where indexes
>> >> are held in a RAMDir until being flushed to disk. The problem
>> >> with this is, merging in the background becomes really tricky
>> >> unless it's performed inside of IndexWriter (see LUCENE-1313 and
>> >> IW.getReader). There is the Zoie system which uses the RAMDir
>> >> solution, however it's implemented using a customized deleted
>> >> doc set based on a bloomfilter backed by an inefficient RB tree
>> >> which slows down queries. There's always a trade off when trying
>> >> to build an NRT system, currently.
>> >>
>> >> Also, there isn't a clear way to replicate segments in realtime
>> >> so people usually end up analyzing documents on each replicated
>> >> node, which is redundant. A long term solution here could be a
>> >> distributed transaction log where encoded segments are stored
>> >> and replicated to N nodes.
>> >>
>> >> Deletes can pile up in segments so the
>> >> BalancedSegmentMergePolicy could be used to remove those faster
>> >> than LogMergePolicy, however I haven't tested it, and it may be
>> >> trying to not do large segment merges altogether which IMO
>> >> is less than ideal because query performance soon degrades
>> >> (similar to an unoptimized index).
>> >>
>> >> Hopefully in the future we can offer searching over
>> >> IndexWriter's RAM buffer where indexing and search speed would
>> >> be roughly what it is today. That combined with a way to insure
>> >> segments don't get flushed out of the IO cache during large
>> >> segment merges would mean really efficient NRT, even on systems
>> >> with HDs. In the interim, you'd need to play around and see what
>> >> works for your requirements.
>> >>
>> >> -J
>> >>
>> >> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com>
>> wrote:
>> >> >
>> >> > Does anyone have any recommendations?  I've looked at Katta, but it
>> >> doesn't
>> >> > seem to support realtime searching.  It also uses hdfs, which I've
>> heard
>> >> can
>> >> > be slow.  I'm looking to serve 40gb of indexes and support about 1
>> >> million
>> >> > updates per day.
>> >> >
>> >> > Thx
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

Re: Realtime & distributed

Posted by John Wang <jo...@gmail.com>.
Hi Eric:

I regret the direction the thread has taken and partly take responsibility
for it...

As to your question:

We have 2 nodes per commodity server, each holding 5 million docs (although
given the numbers we are seeing, we think we were a bit too conservative, and
may increase that to 10). Each partition does its indexing in realtime. We
have about 12 partitions in total, so 6 machines, with about 4-5 replicas.

RamDir only holds the transient index; once it is flushed to the disk index,
RamDir is emptied. So yes, it is the second part of your question. The trick
is the synchronization logic, as well as the handling of deletes and updates
between the ram and disk indexes.
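
The delete/update coordination John mentions can be sketched like this
(purely illustrative plain Java, not the actual zoie implementation): the
disk snapshot is frozen between reopens, so an update must both mask the
stale on-disk copy via a delete set and surface the new copy from RAM.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of ram/disk synchronization: RAM wins for a uid, then the
// disk snapshot is consulted only if the uid isn't masked as deleted.
class RamDiskView {
    private final Map<String, String> diskSnapshot;            // uid -> doc, frozen until reopen
    private final Map<String, String> ram = new HashMap<>();   // uid -> doc, realtime
    private final Set<String> deletedOnDisk = new HashSet<>(); // uids masked in the snapshot

    RamDiskView(Map<String, String> snapshot) {
        this.diskSnapshot = new HashMap<>(snapshot);
    }

    void update(String uid, String doc) {
        deletedOnDisk.add(uid); // hide any stale on-disk version
        ram.put(uid, doc);
    }

    void delete(String uid) {
        deletedOnDisk.add(uid);
        ram.remove(uid);
    }

    // What a search would see for this uid.
    String lookup(String uid) {
        if (ram.containsKey(uid)) return ram.get(uid);
        return deletedOnDisk.contains(uid) ? null : diskSnapshot.get(uid);
    }

    public static void main(String[] args) {
        Map<String, String> snap = new HashMap<>();
        snap.put("u1", "v1");
        snap.put("u2", "v2");
        RamDiskView view = new RamDiskView(snap);
        view.update("u1", "v1-new"); // realtime update shadows the disk copy
        view.delete("u2");
        System.out.println(view.lookup("u1")); // v1-new
        System.out.println(view.lookup("u2")); // null
    }
}
```

On reopen, the refreshed disk snapshot already contains the updates, so the
RAM map and delete set can be discarded together.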

I am not sure I can disclose what HW we are using, but Zoie is designed to
run on commodity HW.

I think it is always a good idea to archive your data. With our setup, we
have replicas that each hold their own copy of the index, so there is already
redundancy. So having a set of offline nodes doing just indexing is not
necessary.

Yes, we are working hard to make zoie 2.9-compatible. As Jake has previously
mentioned, Lucene 2.9 has changed a lot internally

(I personally think these changes are awesome and really give applications
the flexibility to unleash the power of Lucene. Plus, these changes are very
performance-oriented for incremental indexing, which is important to us. Much
kudos to the lucene team and the contributors)

so to fully take advantage of this work while maintaining backward
compatibility is not a trivial project.

Expect to see another maintenance release of zoie before 2.9. We hope to have
the 2.9 work done soon, but in terms of timing, Lucene 3.0 (we're particularly
looking forward to custom indexing) is also coming out, and we are deciding
whether to wait and include 3.0.

Hope this helps.

-John

On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric <ea...@business.com> wrote:

> Man, this thread really went south.  Anyhow, I have a few questions about
> Zoie:
>
> * How many nodes are you using to support the speeds you desire at LI?
> * Am I wrong to assume that the RAMDir holds the entire index - just as the
> FSDir?  Or does RAMDir only hold a portion of the index that hasn't yet been
> flushed to disk?
> * Katta is supposed to be able to run on commodity hardware - is
> that the same case for Zoie?
> * Would you agree that it's a good idea to build an "offline" index
> parallel to the online index in case there is a crash on the online index
> and data is lost?
> * I see that there are plans to have Zoie use Lucene 2.9.  How long would
> you say before it's available?
>
> Thanks,
>
> E

Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
Ok, never mind actually - the simultaneous indexing was something done in
zoie 1.3; it was changed in 1.4 to call addIndexesNoOptimize() on the
RAMDirectory indexes as soon as they are big enough.

It's still true that you can throw away the RAMDirectory once the disk index
is reopened, though.

  -jake



Re: Realtime & distributed

Posted by Jake Mannix <ja...@gmail.com>.
Hey Eric,

  One clarification before letting the rest of this discussion sneak over to
the zoie list:

On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric <ea...@business.com> wrote:

* Am I wrong to assume that the RAMDir holds the entire index - just as the
> FSDir?  Or does RAMDir only hold a portion of the index that hasn't yet been
> flushed to disk?
>

With zoie, you index to the FSDir *and* the RAMDir simultaneously (so there
is increased CPU usage for indexing because of this), but you only reopen()
the IndexReader on the FSDir every 15 minutes (or so), so the fact that
you've been writing to it the whole while is invisible to the application in
the intervening time.  This means that a) you don't need to worry about
disaster recovery any more than in a regular non-realtime setup, and b) when
it's time to reopen the FSDir-based index, you don't need to write the RAMDir
to disk - you can just throw it away, as the disk already has the docs that
are in that RAMDir.
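
The dual-write scheme Jake describes can be sketched as follows (stand-in
types in plain Java, not the real zoie/Lucene API): every document goes to
both indexes, searches see the last disk snapshot plus everything in RAM, and
a reopen refreshes the snapshot and discards the RAM copy.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of zoie's dual-write: docs are written to both the disk
// index (invisible until the periodic reopen) and the RAM index
// (visible immediately); reopening makes the RAM copy redundant.
class DualWriteIndex {
    private final List<String> disk = new ArrayList<>();    // durable
    private final List<String> ram = new ArrayList<>();     // transient, realtime
    private List<String> diskSnapshot = new ArrayList<>();  // what the disk reader sees

    void index(String doc) { // write to both, as zoie's indexing thread does
        disk.add(doc);
        ram.add(doc);
    }

    void reopenDiskReader() { // the periodic (~15-minute) reopen
        diskSnapshot = new ArrayList<>(disk);
        ram.clear(); // safe to throw away: the snapshot already has these docs
    }

    Set<String> searchableDocs() { // disk snapshot + RAM, deduplicated
        Set<String> view = new LinkedHashSet<>(diskSnapshot);
        view.addAll(ram);
        return view;
    }

    public static void main(String[] args) {
        DualWriteIndex idx = new DualWriteIndex();
        idx.index("a");
        idx.index("b");
        System.out.println(idx.searchableDocs()); // [a, b] via RAM, pre-reopen
        idx.reopenDiskReader();
        System.out.println(idx.searchableDocs()); // [a, b] via the disk snapshot
        idx.index("c");
        System.out.println(idx.searchableDocs()); // [a, b, c]
    }
}
```

The key property is that the searchable view never changes across a reopen:
docs simply move from being served by RAM to being served by disk.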

  -jake

RE: Realtime & distributed

Posted by "Angel, Eric" <ea...@business.com>.
Man, this thread really went south.  Anyhow, I have a few questions about Zoie:

* How many nodes are you using to support the speeds you desire at LI?
* Am I wrong to assume that the RAMDir holds the entire index - just as the FSDir?  Or does RAMDir only hold a portion of the index that hasn't yet been flushed to disk?
* Katta is supposed to be able to run on commodity hardware - is that the same case for Zoie?
* Would you agree that it's a good idea to build an "offline" index parallel to the online index in case there is a crash on the online index and data is lost?
* I see that there are plans to have Zoie use Lucene 2.9.  How long would you say before it's available?

Thanks,

E



Re: Realtime & distributed

Posted by Jason Rutherglen <ja...@gmail.com>.
John,

Actually, everyone is entitled to their technical opinion, and
none of the comments were misleading - Jake and yourself
validated that they are true in your own comments. I'm simply
trying to create better technology, as is everyone on here. The
process takes time and coordination between many parties of many
backgrounds around the globe. Sometimes there are differences of
opinion, but those are easily ironed out over time (and, quite
frankly in this case, by benchmarks).

However I am very concerned about your ignorant disregard of some of the
most basic human rights in existence.

-J

On Thu, Oct 8, 2009 at 10:26 PM, John Wang <jo...@gmail.com> wrote:
> Jason:
>        I would really appreciate it if you would stop making false
> statements and misinformation. Everyone is entitled to his/her opinions on
> technologies, but deliberately making misleading and false information on
> such a distribution is just unethical, and you'll end up just discrediting
> yourself.
>
>        Making unsubstantiated comments while not willing to put in any
> effort is the primary reason you are no longer working at Linkedin and on
> Zoie.
>
> "The problem
> with this is, merging in the background becomes really tricky
> unless it's performed inside of IndexWriter" - *what does this really mean?
> Merging happens regardless in an incremental indexing system. Especially
> with high indexing load, segments are created often, merging is crucial.*
> "There is the Zoie system which uses the RAMDir
> solution, however it's implemented using a customized deleted
> doc set based on a bloomfilter backed by an inefficient RB tree
> which slows down queries"  -* if you ever spend the time to read the code,
> (even when you were working on it), it is just not true. We did have an RB
> set for deleted docs, quite a few releases ago, and we changed to a special
> type of bloomfilter set backed by a hash int set. You knew this and was part
> of the discussion on it, and now saying such a thing is just plain
> disappointing.*
>
>        Thanks Jake for the clarification, and Eric, let me know if you to
> know more in detail with how we are dealing with realtime indexing/search
> with Zoie here at linkedin in a production environment powering a real
> internet company with real traffic.
>
> -John
>
> On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherglen@gmail.com
>> wrote:
>
>> Eric,
>>
>> Katta doesn't require HDFS which would be slow to search on,
>> though Katta can be used to copy indexes out of HDFS onto local
>> servers. The best bet is hardware that uses SSDs because merges
>> and update latency will greatly decrease and there won't be a
>> synchronous IO issue as there is with hard drives. Also, IO
>> caches get flushed as large merges occur, which means subsequent
>> queries may hit the HD and slow down. With SSDs this is much
>> less of an issue.
>>
>> Today near realtime search (with or without SSDs) comes at a
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Realtime & distributed

Posted by John Wang <jo...@gmail.com>.
Jason:
        I would really appreciate it if you would stop spreading false
statements and misinformation. Everyone is entitled to his/her opinions on
technologies, but deliberately spreading misleading and false information on
such a distribution list is just unethical, and you'll end up discrediting
yourself.

        Making unsubstantiated comments while not willing to put in any
effort is the primary reason you are no longer working at Linkedin and on
Zoie.

"The problem
with this is, merging in the background becomes really tricky
unless it's performed inside of IndexWriter" - *What does this really mean?
Merging happens regardless in an incremental indexing system. Especially
with a high indexing load, segments are created often and merging is crucial.*
"There is the Zoie system which uses the RAMDir
solution, however it's implemented using a customized deleted
doc set based on a bloomfilter backed by an inefficient RB tree
which slows down queries" - *if you had ever spent the time to read the
code (even while you were working on it), you would know that is just not
true. We did have an RB set for deleted docs quite a few releases ago, and
we changed to a special type of bloomfilter set backed by a hash int set.
You knew this and were part of the discussion on it; saying such a thing
now is just plain disappointing.*
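
The structure described here - a Bloom filter in front of an exact hash int
set, so the common "not deleted" case is answered by a couple of bit probes -
can be sketched roughly as follows. The class name, filter size, and hash
mixing constants below are illustrative assumptions for the sketch, not
Zoie's actual implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative deleted-doc set (NOT Zoie's real code): a Bloom filter
// answers the common "not deleted" case with two bit probes; the exact
// hash set is only consulted when the filter says "maybe deleted".
public class BloomBackedDeleteSet {
    private final long[] bits;          // Bloom filter bit array
    private final int mask;             // bitCount - 1 (bitCount is a power of two)
    private final Set<Integer> exact = new HashSet<>(); // resolves false positives

    public BloomBackedDeleteSet(int bitCountPow2) {
        bits = new long[bitCountPow2 >> 6]; // 64 bits per long
        mask = bitCountPow2 - 1;
    }

    // Two cheap integer mixing hashes (constants chosen for illustration).
    private int h1(int docId) { return (docId * 0x9E3779B9) & mask; }
    private int h2(int docId) { return ((docId ^ (docId >>> 16)) * 0x85EBCA6B) & mask; }

    private void setBit(int b) { bits[b >> 6] |= 1L << (b & 63); }
    private boolean getBit(int b) { return (bits[b >> 6] & (1L << (b & 63))) != 0; }

    public void delete(int docId) {
        setBit(h1(docId));
        setBit(h2(docId));
        exact.add(docId);
    }

    public boolean isDeleted(int docId) {
        // Fast path: if either probe misses, the doc is definitely live.
        if (!getBit(h1(docId)) || !getBit(h2(docId))) return false;
        return exact.contains(docId); // a Bloom "hit" may be a false positive
    }
}
```

At query time every candidate hit is checked against this set, which is why
the fast negative path matters: almost all docs are live, so most checks never
touch the hash set at all.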

        Thanks Jake for the clarification. Eric, let me know if you want to
know in more detail how we are dealing with realtime indexing/search with
Zoie here at LinkedIn, in a production environment powering a real internet
company with real traffic.

-John

On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherglen@gmail.com
> wrote:

> Eric,
>
> Katta doesn't require HDFS which would be slow to search on,
> though Katta can be used to copy indexes out of HDFS onto local
> servers. The best bet is hardware that uses SSDs because merges
> and update latency will greatly decrease and there won't be a
> synchronous IO issue as there is with hard drives. Also, IO
> caches get flushed as large merges occur, which means subsequent
> queries may hit the HD and slow down. With SSDs this is much
> less of an issue.
>
> Today near realtime search (with or without SSDs) comes at a
> price, that is reduced indexing speed due to continued in RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flushed to disk. The problem
> with this is, merging in the background becomes really tricky
> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> IW.getReader). There is the Zoie system which uses the RAMDir
> solution, however it's implemented using a customized deleted
> doc set based on a bloomfilter backed by an inefficient RB tree
> which slows down queries. There's always a trade off when trying
> to build an NRT system, currently.
>
> Also, there isn't a clear way to replicate segments in realtime
> so people usually end up analyzing documents on each replicated
> node, which is redundant. A long term solution here could be a
> distributed transaction log where encoded segments are stored
> and replicated to N nodes.
>
> Deletes can pile up in segments so the
> BalancedSegmentMergePolicy could be used to remove those faster
> than LogMergePolicy, however I haven't tested it, and it may be
> trying to not do large segment merges altogether which IMO
> is less than ideal because query performance soon degrades
> (similar to an unoptimized index).
>
> Hopefully in the future we can offer searching over
> IndexWriter's RAM buffer where indexing and search speed would
> be roughly what it is today. That combined with a way to insure
> segments don't get flushed out of the IO cache during large
> segment merges would mean really efficient NRT, even on systems
> with HDs. In the interim, you'd need to play around and see what
> works for your requirements.
>
> -J
>
> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
> >
> > Does anyone have any recommendations?  I've looked at Katta, but it
> doesn't
> > seem to support realtime searching.  It also uses hdfs, which I've heard
> can
> > be slow.  I'm looking to serve 40gb of indexes and support about 1
> million
> > updates per day.
> >
> > Thx
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Realtime & distributed

Posted by Jason Rutherglen <ja...@gmail.com>.
Eric,

Katta doesn't require HDFS (which would be slow to search on);
rather, Katta can be used to copy indexes out of HDFS onto local
servers. The best bet is hardware that uses SSDs, because merges
and update latency will greatly decrease and there won't be the
synchronous IO issue there is with hard drives. Also, IO
caches get flushed as large merges occur, which means subsequent
queries may hit the HD and slow down. With SSDs this is much
less of an issue.

Today near-realtime search (with or without SSDs) comes at a
price: reduced indexing speed due to continual in-RAM
merging. People typically hack something together where indexes
are held in a RAMDir until being flushed to disk. The problem
with this is, merging in the background becomes really tricky
unless it's performed inside of IndexWriter (see LUCENE-1313 and
IW.getReader). There is the Zoie system which uses the RAMDir
solution, however it's implemented using a customized deleted
doc set based on a bloomfilter backed by an inefficient RB tree
which slows down queries. There's always a trade-off when trying
to build an NRT system, currently.
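
The RAMDir-until-flush pattern described above can be sketched in plain
Java. This is a conceptual illustration only: a Map stands in for the Lucene
directories, and the class and method names (RamThenDiskIndex, flush, and so
on) are invented for the sketch rather than taken from Zoie's or Lucene's API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the RAM-buffer-then-flush pattern: new documents
// land in a small in-memory tier that is searched alongside the on-disk
// tier, and are periodically flushed into the disk tier in one batch.
public class RamThenDiskIndex {
    private final Map<Integer, String> ramIndex = new HashMap<>();  // stands in for RAMDirectory
    private final Map<Integer, String> diskIndex = new HashMap<>(); // stands in for FSDirectory

    public synchronized void add(int docId, String text) {
        ramIndex.put(docId, text); // visible to searches immediately: the "near realtime" part
    }

    // Search both tiers; a real system would merge and rank the two result sets.
    public synchronized List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        diskIndex.forEach((id, text) -> { if (text.contains(term)) hits.add(id); });
        ramIndex.forEach((id, text) -> { if (text.contains(term)) hits.add(id); });
        return hits;
    }

    // Periodic flush: move the RAM tier into the disk tier in one batch.
    // Coordinating this with background segment merges is exactly where
    // the real systems get tricky, as discussed above.
    public synchronized void flush() {
        diskIndex.putAll(ramIndex);
        ramIndex.clear();
    }

    public synchronized int ramDocCount() { return ramIndex.size(); }
}
```

The sketch shows why freshness costs indexing speed: every search pays for the
extra RAM tier, and every flush competes with ongoing merges for IO.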

Also, there isn't a clear way to replicate segments in realtime,
so people usually end up analyzing documents on each replicated
node, which is redundant. A long-term solution here could be a
distributed transaction log where encoded segments are stored
and replicated to N nodes.
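
A minimal sketch of such a fan-out log, assuming a push model and ignoring
durability, acknowledgement, and failure handling (all names here are
invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative fan-out transaction log: indexing events are appended once
// in order and pushed to each replica. In this naive per-event scheme every
// node still analyzes every document itself; shipping pre-encoded segments
// through the log instead would move that work to a single writer.
public class ReplicatedEventLog {
    public interface Node { void consume(String event); }

    private final List<String> log = new ArrayList<>();   // durable ordered log (in-memory here)
    private final List<Node> replicas = new ArrayList<>();

    public void register(Node node) { replicas.add(node); }

    public void append(String event) {
        log.add(event);                           // record once, in order
        for (Node n : replicas) n.consume(event); // push to all N replicas
    }

    public int size() { return log.size(); }
}
```

Because the log is ordered, a lagging or restarted replica could also catch
up by replaying entries from its last consumed position.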

Deletes can pile up in segments, so the
BalancedSegmentMergePolicy could be used to remove those faster
than LogMergePolicy; however, I haven't tested it, and it may
avoid large segment merges altogether, which IMO
is less than ideal because query performance soon degrades
(similar to an unoptimized index).

Hopefully in the future we can offer searching over
IndexWriter's RAM buffer, where indexing and search speed would
be roughly what they are today. That, combined with a way to ensure
segments don't get flushed out of the IO cache during large
segment merges, would mean really efficient NRT, even on systems
with HDs. In the interim, you'd need to play around and see what
works for your requirements.

-J

On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ea...@business.com> wrote:
>
> Does anyone have any recommendations?  I've looked at Katta, but it doesn't
> seem to support realtime searching.  It also uses hdfs, which I've heard can
> be slow.  I'm looking to serve 40gb of indexes and support about 1 million
> updates per day.
>
> Thx
>
