Posted to dev@lucene.apache.org by Dorian Hoxha <do...@gmail.com> on 2017/01/20 16:12:53 UTC

How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Hi friends,

I was thinking about how the ScyllaDB architecture
<http://www.scylladb.com/technology/architecture/> compares to
Cassandra's, which gives them 10x+ performance and lower latency. If you
were starting Lucene and Solr from scratch, what would you do to achieve
something similar?

Different language (Rust/C++?) for better SIMD
<http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html> ?
Use a GPU with an SSD for posting-list intersection (not out yet)? (a scalar
sketch of that intersection follows this list)
Make it in-memory and use better data structures?
Shard on cores like ScyllaDB (so 1 shard for each core on the machine)?
External cache (like keeping n redis servers with big RAM/network & slow
CPU/disk just for cache)?
Use better data structures (like the Algolia autocomplete radix
<https://blog.algolia.com/inside-the-algolia-engine-part-2-the-indexing-challenge-of-instant-search/>)?
Distribute documents by term instead of by id
<http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf> ?
Use an ASIC / FPGA?
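
To make the posting-list point concrete, here is a minimal scalar sketch
(plain Java, with made-up doc-id arrays, nothing to do with Lucene's real
codecs) of the intersection step that a SIMD/GPU approach would try to
speed up:

import java.util.ArrayList;
import java.util.List;

public class PostingIntersection {

    // Intersect two sorted posting lists of doc ids (the scalar baseline).
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // doc contains both terms
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;                   // advance whichever list is behind
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] foo = {1, 4, 7, 9, 42};
        int[] bar = {2, 4, 9, 10, 42};
        System.out.println(intersect(foo, bar)); // prints [4, 9, 42]
    }
}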

Regards,
Dorian

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
RediSearch seems to be fully in-memory and to have no analysis or query
chain. Or any real multilingual support. It is an apples-and-pears
comparison, and their "big" feature is what Lucene started from (a term
list). I don't even see phrase search support, as they don't seem to
implement posting lists, just the terms.

Also, I don't see them publishing their Elasticsearch or Solr
configuration, which from past experience is often left untuned.

But yes, good for them. And good for Postgres for adding full-text
search some months ago. Even good for Oracle for having a commercial
(however hardcoded and terrible) full-text search.

I think the summary - in my mind - is that, if software is swallowing
the world, then search is swallowing the software. Maybe it will become
that last "kitchen sink" proof, replacing email. And the more
interesting ideas go around, the better. And some of them, I am sure,
will end up in Lucene/Solr/Elasticsearch, as - after all - they are
the most popular platforms and people will bring those extra things to
the core platform they use, if they really want them.

Regards,
   Alex.


----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 10 February 2017 at 11:38, Dorian Hoxha <do...@gmail.com> wrote:
> @Alex,
> I don't know if you've seen it, but there's also redissearch module which
> they claim to be faster (ofc less features):
> https://redislabs.com/blog/adding-search-engine-redis-adventures-module-land/
> http://www.slideshare.net/RedisLabs/redis-for-search
> https://github.com/RedisLabsModules/RediSearch
>
> On Fri, Feb 10, 2017 at 1:36 PM, Dorian Hoxha <do...@gmail.com>
> wrote:
>>
>>
>>
>> On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <ar...@gmail.com>
>> wrote:
>>>
>>> One you filter out the JIRA messages, the forum is very strong and
>>> alive. It is just very focused on its purpose - building Solr and
>>> Lucene and ElasticSearch.
>>
>> Will do just that. Thanks.
>>>
>>>
>>> As to "perfection" - nothing is perfect, you can just look at the list
>>> of the open JIRAs to confirm that for Lucene and/or Solr. But there is
>>> constant improvement and ever-deepening of the features and
>>> performance improvement.
>>>
>>> You can also look at Elasticsearch for inspiration, as they build on
>>> Lucene (and are contributing to it) and had a chance to rebuild the
>>> layers above it.
>>
>> They have more fancy features, but less advanced ones (ex: shard
>> splitting!)
>>>
>>>
>>> On your question specifically, I think it is hard to answer it well.
>>> Partially because I am not sure your assumptions are all that thought
>>> out. For example:
>>> 1) Different language than Java - Solr relies on Zookeeper, Tika and
>>> other libraries. All of those are in Java. Language change implies
>>> full change of the dependencies and ecosystem and - without looking -
>>> I doubt there is an open-source comprehensive MSWord parser in
>>> C++/Rust.
>>
>> Usually indexing-speed is not the bottleneck (beside logging and some
>> other scenarios) so you could probably use a java service (for tika).
>> Zookeeper is again not a bottleneck when serving requests, and you can
>> still use it with a non-java db.
>>>
>>> 2) Algolia radix? Lucene uses pre-compiled DFA (deterministic finite
>>> automata). Are you sure the open graph chosen because Algolia wants to
>>> run on the phone is an improvement on the DFA
>>
>> The `suggesters` which are backed by DFA can't be used with normal
>> filters/queries which is critical (and algolia-radix can do)
>>>
>>> 3) Document distribution is already customizable with _route_ key,
>>> though obviously Maguro algorithm is beyond single key's reach. On the
>>> other hand, I am not sure Maguro is designed for good faceting,
>>> streaming, enumerations, or other features Lucene/Solr has in its
>>> core.
>>
>> Yes, seems very special use case.
>>>
>>>
>>> As to the rest (GPU!, FPGA), we accept contributions. Including large,
>>> complex, interesting contributions (streams, learning to rank,
>>> docvalues, etc).
>>
>> I mean just in the "ideas case", not do it for me.
>>>
>>> And, long term, it is probably more effective to be
>>> able to innovate without the well-established framework rather than
>>> reinventing things from scratch. After all, even Twitter and LinkedIn
>>> built their internal implementations on top of Lucene rather than
>>> reinventing absolutely everything.
>>
>> Depends how core it is to your comp and how good at low-level your team
>> is. Most of the time yes but sometimes you gotta (like the scylladb case,
>> they've built A LOT from scratch, like custom scheduler etc)
>>>
>>>
>>> Still, Elasticsearch had a - very successful - go at the "Innovator's
>>> Dilemma" situation. If you want to create a team trying to
>>> rebuild/improve the approaches completely from scratch, I am sure you
>>> will find a lot of us looking at your efforts with interest. I, for
>>> one, would be happy to point out a new radically-different approach to
>>> search engine implementation on my Solr Start mailing list.
>>
>> That's why I'm asking for ideas. This is what I got from another dev on
>> the same question:  https://news.ycombinator.com/item?id=13249724
>> Quote:"Multicores parallel shared nothing architecture like the on in the
>> TurboPFor inverted index app and a ram resident inverted index."
>>
>>
>>
>>>
>>> Regards and good luck,
>>>    Alex.
>>> ----
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>>> experienced
>>>
>>>
>>> On 8 February 2017 at 03:39, Dorian Hoxha <do...@gmail.com> wrote:
>>> > So, am I asking too much (maybe), is this forum dead (then where to ask
>>> > ?
>>> > there is extreme noise here), is lucene perfect(of course not) ?
>>> >
>>> >
>>> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <do...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Was thinking also how bing doesn't use posting lists and also
>>> >> compiling
>>> >> queries !
>>> >> About the queries, I would've think it wouldn't be as high overhead as
>>> >> queries in in rdbms since those apply on each row while on search they
>>> >> apply
>>> >> on each bitset.
>>> >>
>>> >>
>>> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
>>> >> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> I’ve had some curiosity about this question too.
>>> >>>
>>> >>>
>>> >>>
>>> >>> For a while, I watched for a seastar-like library for the JVM, but
>>> >>> https://github.com/bestwpw/windmill was the only one I came across,
>>> >>> and it
>>> >>> doesn’t seem to be going anywhere. Since one of the points of the JVM
>>> >>> is to
>>> >>> abstract away the platform, I certainty wonder if the JVM will ever
>>> >>> get the
>>> >>> kinds of machine affinity these other projects see. Your
>>> >>> one-shard-per-core
>>> >>> could probably be faked with multiple JVMs and numactl - could be an
>>> >>> interesting experiment.
>>> >>>
>>> >>>
>>> >>>
>>> >>> That said, I’m aware that a phenomenal amount of optimization effort
>>> >>> has
>>> >>> gone into Lucene, and I’d also be interested in hearing about things
>>> >>> that
>>> >>> worked well.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Dorian Hoxha <do...@gmail.com>
>>> >>> Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>>> >>> Date: Friday, January 20, 2017 at 8:12 AM
>>> >>> To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>>> >>> Subject: How would you architect solr/lucene if you were starting
>>> >>> from
>>> >>> scratch for them to be 10X+ faster/efficient ?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Hi friends,
>>> >>>
>>> >>> I was thinking how scylladb architecture works compared to cassandra
>>> >>> which gives them 10x+ performance and lower latency. If you were
>>> >>> starting
>>> >>> lucene and solr from scratch what would you do to achieve something
>>> >>> similar
>>> >>> ?
>>> >>>
>>> >>> Different language (rust/c++?) for better SIMD ?
>>> >>>
>>> >>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>>> >>>
>>> >>> Make it in-memory and use better data structures?
>>> >>>
>>> >>> Shard on cores like scylladb (so 1 shard for each core on the
>>> >>> machine) ?
>>> >>>
>>> >>> External cache (like keeping n redis-servers with big ram/network &
>>> >>> slow
>>> >>> cpu/disk just for cache) ??
>>> >>>
>>> >>> Use better data structures (like algolia autocomplete radix )
>>> >>>
>>> >>> Distributing documents by term instead of id ?
>>> >>>
>>> >>> Using ASIC / FPGA ?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Dorian
>>> >>
>>> >>
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Dorian Hoxha <do...@gmail.com>.
@Alex,
I don't know if you've seen it, but there's also the RediSearch module,
which they claim is faster (of course with fewer features):
https://redislabs.com/blog/adding-search-engine-redis-adventures-module-land/
http://www.slideshare.net/RedisLabs/redis-for-search
https://github.com/RedisLabsModules/RediSearch

On Fri, Feb 10, 2017 at 1:36 PM, Dorian Hoxha <do...@gmail.com>
wrote:

>
>
> On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <ar...@gmail.com>
> wrote:
>
>> One you filter out the JIRA messages, the forum is very strong and
>> alive. It is just very focused on its purpose - building Solr and
>> Lucene and ElasticSearch.
>>
> Will do just that. Thanks.
>
>>
>> As to "perfection" - nothing is perfect, you can just look at the list
>> of the open JIRAs to confirm that for Lucene and/or Solr. But there is
>> constant improvement and ever-deepening of the features and
>> performance improvement.
>>
>> You can also look at Elasticsearch for inspiration, as they build on
>> Lucene (and are contributing to it) and had a chance to rebuild the
>> layers above it.
>>
> They have more fancy features, but less advanced ones (ex: shard
> splitting!)
>
>>
>> On your question specifically, I think it is hard to answer it well.
>> Partially because I am not sure your assumptions are all that thought
>> out. For example:
>> 1) Different language than Java - Solr relies on Zookeeper, Tika and
>> other libraries. All of those are in Java. Language change implies
>> full change of the dependencies and ecosystem and - without looking -
>> I doubt there is an open-source comprehensive MSWord parser in
>> C++/Rust.
>>
> Usually indexing-speed is not the bottleneck (beside logging and some
> other scenarios) so you could probably use a java service (for tika).
> Zookeeper is again not a bottleneck when serving requests, and you can
> still use it with a non-java db.
>
>> 2) Algolia radix? Lucene uses pre-compiled DFA (deterministic finite
>> automata). Are you sure the open graph chosen because Algolia wants to
>> run on the phone is an improvement on the DFA
>>
> The `suggesters` which are backed by DFA can't be used with normal
> filters/queries which is critical (and algolia-radix can do)
>
>> 3) Document distribution is already customizable with _route_ key,
>> though obviously Maguro algorithm is beyond single key's reach. On the
>> other hand, I am not sure Maguro is designed for good faceting,
>> streaming, enumerations, or other features Lucene/Solr has in its
>> core.
>>
> Yes, seems very special use case.
>
>>
>> As to the rest (GPU!, FPGA), we accept contributions. Including large,
>> complex, interesting contributions (streams, learning to rank,
>> docvalues, etc).
>
> I mean just in the "ideas case", not do it for me.
>
>> And, long term, it is probably more effective to be
>> able to innovate without the well-established framework rather than
>> reinventing things from scratch. After all, even Twitter and LinkedIn
>> built their internal implementations on top of Lucene rather than
>> reinventing absolutely everything.
>>
> Depends how core it is to your comp and how good at low-level your team
> is. Most of the time yes but sometimes you gotta (like the scylladb case,
> they've built A LOT from scratch, like custom scheduler etc)
>
>>
>> Still, Elasticsearch had a - very successful - go at the "Innovator's
>> Dilemma" situation. If you want to create a team trying to
>> rebuild/improve the approaches completely from scratch, I am sure you
>> will find a lot of us looking at your efforts with interest. I, for
>> one, would be happy to point out a new radically-different approach to
>> search engine implementation on my Solr Start mailing list.
>>
> That's why I'm asking for ideas. This is what I got from another dev on
> the same question:  https://news.ycombinator.com/item?id=13249724
> Quote:"Multicores parallel shared nothing architecture like the on in the
> TurboPFor inverted index app and a ram resident inverted index."
>
>
>
>
>> Regards and good luck,
>>    Alex.
>> ----
>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>
>>
>> On 8 February 2017 at 03:39, Dorian Hoxha <do...@gmail.com> wrote:
>> > So, am I asking too much (maybe), is this forum dead (then where to ask
>> ?
>> > there is extreme noise here), is lucene perfect(of course not) ?
>> >
>> >
>> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <do...@gmail.com>
>> > wrote:
>> >>
>> >> Was thinking also how bing doesn't use posting lists and also compiling
>> >> queries !
>> >> About the queries, I would've think it wouldn't be as high overhead as
>> >> queries in in rdbms since those apply on each row while on search they
>> apply
>> >> on each bitset.
>> >>
>> >>
>> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
>> >> wrote:
>> >>>
>> >>>
>> >>>
>> >>> I’ve had some curiosity about this question too.
>> >>>
>> >>>
>> >>>
>> >>> For a while, I watched for a seastar-like library for the JVM, but
>> >>> https://github.com/bestwpw/windmill was the only one I came across,
>> and it
>> >>> doesn’t seem to be going anywhere. Since one of the points of the JVM
>> is to
>> >>> abstract away the platform, I certainty wonder if the JVM will ever
>> get the
>> >>> kinds of machine affinity these other projects see. Your
>> one-shard-per-core
>> >>> could probably be faked with multiple JVMs and numactl - could be an
>> >>> interesting experiment.
>> >>>
>> >>>
>> >>>
>> >>> That said, I’m aware that a phenomenal amount of optimization effort
>> has
>> >>> gone into Lucene, and I’d also be interested in hearing about things
>> that
>> >>> worked well.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> From: Dorian Hoxha <do...@gmail.com>
>> >>> Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>> >>> Date: Friday, January 20, 2017 at 8:12 AM
>> >>> To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>> >>> Subject: How would you architect solr/lucene if you were starting from
>> >>> scratch for them to be 10X+ faster/efficient ?
>> >>>
>> >>>
>> >>>
>> >>> Hi friends,
>> >>>
>> >>> I was thinking how scylladb architecture works compared to cassandra
>> >>> which gives them 10x+ performance and lower latency. If you were
>> starting
>> >>> lucene and solr from scratch what would you do to achieve something
>> similar
>> >>> ?
>> >>>
>> >>> Different language (rust/c++?) for better SIMD ?
>> >>>
>> >>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>> >>>
>> >>> Make it in-memory and use better data structures?
>> >>>
>> >>> Shard on cores like scylladb (so 1 shard for each core on the
>> machine) ?
>> >>>
>> >>> External cache (like keeping n redis-servers with big ram/network &
>> slow
>> >>> cpu/disk just for cache) ??
>> >>>
>> >>> Use better data structures (like algolia autocomplete radix )
>> >>>
>> >>> Distributing documents by term instead of id ?
>> >>>
>> >>> Using ASIC / FPGA ?
>> >>>
>> >>>
>> >>>
>> >>> Regards,
>> >>>
>> >>> Dorian
>> >>
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Dorian Hoxha <do...@gmail.com>.
On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> One you filter out the JIRA messages, the forum is very strong and
> alive. It is just very focused on its purpose - building Solr and
> Lucene and ElasticSearch.
>
Will do just that. Thanks.

>
> As to "perfection" - nothing is perfect, you can just look at the list
> of the open JIRAs to confirm that for Lucene and/or Solr. But there is
> constant improvement and ever-deepening of the features and
> performance improvement.
>
> You can also look at Elasticsearch for inspiration, as they build on
> Lucene (and are contributing to it) and had a chance to rebuild the
> layers above it.
>
They have more fancy features, but fewer advanced ones (e.g. shard splitting!)

>
> On your question specifically, I think it is hard to answer it well.
> Partially because I am not sure your assumptions are all that thought
> out. For example:
> 1) Different language than Java - Solr relies on Zookeeper, Tika and
> other libraries. All of those are in Java. Language change implies
> full change of the dependencies and ecosystem and - without looking -
> I doubt there is an open-source comprehensive MSWord parser in
> C++/Rust.
>
Usually indexing speed is not the bottleneck (besides logging and some other
scenarios), so you could probably use a Java service (for Tika).
ZooKeeper is again not a bottleneck when serving requests, and you can
still use it with a non-Java db.

> 2) Algolia radix? Lucene uses pre-compiled DFA (deterministic finite
> automata). Are you sure the open graph chosen because Algolia wants to
> run on the phone is an improvement on the DFA
>
The `suggesters` which are backed by a DFA can't be combined with normal
filters/queries, which is critical (and the Algolia radix can do that).

> 3) Document distribution is already customizable with _route_ key,
> though obviously Maguro algorithm is beyond single key's reach. On the
> other hand, I am not sure Maguro is designed for good faceting,
> streaming, enumerations, or other features Lucene/Solr has in its
> core.
>
Yes, it seems like a very special use case.

>
> As to the rest (GPU!, FPGA), we accept contributions. Including large,
> complex, interesting contributions (streams, learning to rank,
> docvalues, etc).

I meant it just as ideas, not "do it for me".

> And, long term, it is probably more effective to be
> able to innovate without the well-established framework rather than
> reinventing things from scratch. After all, even Twitter and LinkedIn
> built their internal implementations on top of Lucene rather than
> reinventing absolutely everything.
>
Depends how core it is to your company and how good your team is at low-level
work. Most of the time yes, but sometimes you have to (like the ScyllaDB case:
they've built A LOT from scratch, like a custom scheduler etc.).

>
> Still, Elasticsearch had a - very successful - go at the "Innovator's
> Dilemma" situation. If you want to create a team trying to
> rebuild/improve the approaches completely from scratch, I am sure you
> will find a lot of us looking at your efforts with interest. I, for
> one, would be happy to point out a new radically-different approach to
> search engine implementation on my Solr Start mailing list.
>
That's why I'm asking for ideas. This is what I got from another dev on the
same question:  https://news.ycombinator.com/item?id=13249724
Quote:"Multicores parallel shared nothing architecture like the on in the
TurboPFor inverted index app and a ram resident inverted index."




> Regards and good luck,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 8 February 2017 at 03:39, Dorian Hoxha <do...@gmail.com> wrote:
> > So, am I asking too much (maybe), is this forum dead (then where to ask ?
> > there is extreme noise here), is lucene perfect(of course not) ?
> >
> >
> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <do...@gmail.com>
> > wrote:
> >>
> >> Was thinking also how bing doesn't use posting lists and also compiling
> >> queries !
> >> About the queries, I would've think it wouldn't be as high overhead as
> >> queries in in rdbms since those apply on each row while on search they
> apply
> >> on each bitset.
> >>
> >>
> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
> >> wrote:
> >>>
> >>>
> >>>
> >>> I’ve had some curiosity about this question too.
> >>>
> >>>
> >>>
> >>> For a while, I watched for a seastar-like library for the JVM, but
> >>> https://github.com/bestwpw/windmill was the only one I came across,
> and it
> >>> doesn’t seem to be going anywhere. Since one of the points of the JVM
> is to
> >>> abstract away the platform, I certainty wonder if the JVM will ever
> get the
> >>> kinds of machine affinity these other projects see. Your
> one-shard-per-core
> >>> could probably be faked with multiple JVMs and numactl - could be an
> >>> interesting experiment.
> >>>
> >>>
> >>>
> >>> That said, I’m aware that a phenomenal amount of optimization effort
> has
> >>> gone into Lucene, and I’d also be interested in hearing about things
> that
> >>> worked well.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Dorian Hoxha <do...@gmail.com>
> >>> Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
> >>> Date: Friday, January 20, 2017 at 8:12 AM
> >>> To: "dev@lucene.apache.org" <de...@lucene.apache.org>
> >>> Subject: How would you architect solr/lucene if you were starting from
> >>> scratch for them to be 10X+ faster/efficient ?
> >>>
> >>>
> >>>
> >>> Hi friends,
> >>>
> >>> I was thinking how scylladb architecture works compared to cassandra
> >>> which gives them 10x+ performance and lower latency. If you were
> starting
> >>> lucene and solr from scratch what would you do to achieve something
> similar
> >>> ?
> >>>
> >>> Different language (rust/c++?) for better SIMD ?
> >>>
> >>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
> >>>
> >>> Make it in-memory and use better data structures?
> >>>
> >>> Shard on cores like scylladb (so 1 shard for each core on the machine)
> ?
> >>>
> >>> External cache (like keeping n redis-servers with big ram/network &
> slow
> >>> cpu/disk just for cache) ??
> >>>
> >>> Use better data structures (like algolia autocomplete radix )
> >>>
> >>> Distributing documents by term instead of id ?
> >>>
> >>> Using ASIC / FPGA ?
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Dorian
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Once you filter out the JIRA messages, the forum is very strong and
alive. It is just very focused on its purpose - building Solr and
Lucene and Elasticsearch.

As to "perfection" - nothing is perfect, you can just look at the list
of the open JIRAs to confirm that for Lucene and/or Solr. But there is
constant improvement and ever-deepening of the features and
performance improvement.

You can also look at Elasticsearch for inspiration, as they build on
Lucene (and are contributing to it) and had a chance to rebuild the
layers above it.

On your question specifically, I think it is hard to answer well.
Partially because I am not sure your assumptions are all that thought
out. For example:
1) Different language than Java - Solr relies on ZooKeeper, Tika and
other libraries. All of those are in Java. A language change implies a
full change of the dependencies and ecosystem, and - without looking -
I doubt there is an open-source comprehensive MS Word parser in
C++/Rust.
2) Algolia radix? Lucene uses pre-compiled DFAs (deterministic finite
automata). Are you sure the structure Algolia chose because it wants to
run on the phone is an improvement on the DFA?
3) Document distribution is already customizable with the _route_ key,
though obviously the Maguro algorithm is beyond a single key's reach. On
the other hand, I am not sure Maguro is designed for good faceting,
streaming, enumerations, or other features Lucene/Solr has in its
core.

As to the rest (GPU!, FPGA), we accept contributions. Including large,
complex, interesting contributions (streams, learning to rank,
docvalues, etc.). And, long term, it is probably more effective to be
able to innovate within the well-established framework rather than
reinventing things from scratch. After all, even Twitter and LinkedIn
built their internal implementations on top of Lucene rather than
reinventing absolutely everything.

Still, Elasticsearch had a - very successful - go at the "Innovator's
Dilemma" situation. If you want to create a team trying to
rebuild/improve the approaches completely from scratch, I am sure you
will find a lot of us looking at your efforts with interest. I, for
one, would be happy to point out a new radically-different approach to
search engine implementation on my Solr Start mailing list.

Regards and good luck,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 8 February 2017 at 03:39, Dorian Hoxha <do...@gmail.com> wrote:
> So, am I asking too much (maybe), is this forum dead (then where to ask ?
> there is extreme noise here), is lucene perfect(of course not) ?
>
>
> On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <do...@gmail.com>
> wrote:
>>
>> Was thinking also how bing doesn't use posting lists and also compiling
>> queries !
>> About the queries, I would've think it wouldn't be as high overhead as
>> queries in in rdbms since those apply on each row while on search they apply
>> on each bitset.
>>
>>
>> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
>> wrote:
>>>
>>>
>>>
>>> I’ve had some curiosity about this question too.
>>>
>>>
>>>
>>> For a while, I watched for a seastar-like library for the JVM, but
>>> https://github.com/bestwpw/windmill was the only one I came across, and it
>>> doesn’t seem to be going anywhere. Since one of the points of the JVM is to
>>> abstract away the platform, I certainty wonder if the JVM will ever get the
>>> kinds of machine affinity these other projects see. Your one-shard-per-core
>>> could probably be faked with multiple JVMs and numactl - could be an
>>> interesting experiment.
>>>
>>>
>>>
>>> That said, I’m aware that a phenomenal amount of optimization effort has
>>> gone into Lucene, and I’d also be interested in hearing about things that
>>> worked well.
>>>
>>>
>>>
>>>
>>>
>>> From: Dorian Hoxha <do...@gmail.com>
>>> Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>>> Date: Friday, January 20, 2017 at 8:12 AM
>>> To: "dev@lucene.apache.org" <de...@lucene.apache.org>
>>> Subject: How would you architect solr/lucene if you were starting from
>>> scratch for them to be 10X+ faster/efficient ?
>>>
>>>
>>>
>>> Hi friends,
>>>
>>> I was thinking how scylladb architecture works compared to cassandra
>>> which gives them 10x+ performance and lower latency. If you were starting
>>> lucene and solr from scratch what would you do to achieve something similar
>>> ?
>>>
>>> Different language (rust/c++?) for better SIMD ?
>>>
>>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>>>
>>> Make it in-memory and use better data structures?
>>>
>>> Shard on cores like scylladb (so 1 shard for each core on the machine) ?
>>>
>>> External cache (like keeping n redis-servers with big ram/network & slow
>>> cpu/disk just for cache) ??
>>>
>>> Use better data structures (like algolia autocomplete radix )
>>>
>>> Distributing documents by term instead of id ?
>>>
>>> Using ASIC / FPGA ?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Dorian
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Dorian Hoxha <do...@gmail.com>.
So, am I asking too much (maybe)? Is this forum dead (then where should I
ask? there is extreme noise here)? Is Lucene perfect (of course not)?

On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <do...@gmail.com>
wrote:

> Was thinking also how bing doesn't use posting lists
> <http://bitfunnel.org/strangeloop/> and also compiling queries
> <https://github.com/BitFunnel/NativeJIT> !
> About the queries, I would've think it wouldn't be as high overhead as
> queries in in rdbms since those apply on each row while on search they
> apply on each bitset.
>
>
> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
> wrote:
>
>>
>>
>> I’ve had some curiosity about this question too.
>>
>>
>>
>> For a while, I watched for a seastar-like library for the JVM, but
>> https://github.com/bestwpw/windmill was the only one I came across, and
>> it doesn’t seem to be going anywhere. Since one of the points of the JVM is
>> to abstract away the platform, I certainty wonder if the JVM will ever get
>> the kinds of machine affinity these other projects see. Your
>> one-shard-per-core could probably be faked with multiple JVMs and numactl -
>> could be an interesting experiment.
>>
>>
>>
>> That said, I’m aware that a phenomenal amount of optimization effort has
>> gone into Lucene, and I’d also be interested in hearing about things that
>> worked well.
>>
>>
>>
>>
>>
>> *From: *Dorian Hoxha <do...@gmail.com>
>> *Reply-To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
>> *Date: *Friday, January 20, 2017 at 8:12 AM
>> *To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
>> *Subject: *How would you architect solr/lucene if you were starting from
>> scratch for them to be 10X+ faster/efficient ?
>>
>>
>>
>> Hi friends,
>>
>> I was thinking how scylladb architecture
>> <http://www.scylladb.com/technology/architecture/> works compared to
>> cassandra which gives them 10x+ performance and lower latency. If you were
>> starting lucene and solr from scratch what would you do to achieve
>> something similar ?
>>
>> Different language (rust/c++?) for better SIMD
>> <http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html>
>> ?
>>
>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>>
>> Make it in-memory and use better data structures?
>>
>> Shard on cores like scylladb (so 1 shard for each core on the machine) ?
>>
>> External cache (like keeping n redis-servers with big ram/network & slow
>> cpu/disk just for cache) ??
>>
>> Use better data structures (like algolia autocomplete radix
>> <https://blog.algolia.com/inside-the-algolia-engine-part-2-the-indexing-challenge-of-instant-search/>
>> )
>>
>> Distributing documents by term instead of id
>> <http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf>
>> ?
>>
>> Using ASIC / FPGA ?
>>
>>
>>
>> Regards,
>>
>> Dorian
>>
>
>

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Mikhail Khludnev <mk...@apache.org>.
On Wed, Jan 25, 2017 at 7:01 PM, Dorian Hoxha <do...@gmail.com>
wrote:

> Was thinking also how bing doesn't use posting lists
> <http://bitfunnel.org/strangeloop/> and
>
This is an interesting case with an unlimited term dictionary.
I think you can do that in Lucene if you rewrite terms to a conjunction
of hashes at query time:

name:foo => +h1(name:foo) +h2(name:foo) +h3(name:foo) ..

and the same should be done during indexing.
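
A rough sketch of what that query-time rewrite might look like (the
"hashed" field, the number of hash functions and the hash mixing here are
all made up for illustration; the index side would have to emit the same
hashed tokens):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HashedTermRewriter {

    private static final int NUM_HASHES = 3;        // k hash functions, arbitrary
    private static final int NUM_BUCKETS = 1 << 20; // size of the hash space, arbitrary

    // Rewrite field:text into +hashed:h1 +hashed:h2 +hashed:h3,
    // i.e. all hashes of the original term must match.
    public static Query rewrite(String field, String text) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        String key = field + ":" + text;
        for (int k = 0; k < NUM_HASHES; k++) {
            // cheap per-function hashes: perturb the seed for each k
            int bucket = Math.floorMod(key.hashCode() ^ (k * 0x9E3779B9), NUM_BUCKETS);
            builder.add(new TermQuery(new Term("hashed", Integer.toString(bucket))),
                        BooleanClause.Occur.MUST);
        }
        return builder.build();
    }
}

As I understand BitFunnel, this trades a bounded term dictionary for some
false positives, which you would then have to verify (or accept).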

> also compiling queries <https://github.com/BitFunnel/NativeJIT> !
> About the queries, I would've think it wouldn't be as high overhead as
> queries in in rdbms since those apply on each row while on search they
> apply on each bitset.
>
>
> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com>
> wrote:
>
>>
>>
>> I’ve had some curiosity about this question too.
>>
>>
>>
>> For a while, I watched for a seastar-like library for the JVM, but
>> https://github.com/bestwpw/windmill was the only one I came across, and
>> it doesn’t seem to be going anywhere. Since one of the points of the JVM is
>> to abstract away the platform, I certainty wonder if the JVM will ever get
>> the kinds of machine affinity these other projects see. Your
>> one-shard-per-core could probably be faked with multiple JVMs and numactl -
>> could be an interesting experiment.
>>
>>
>>
>> That said, I’m aware that a phenomenal amount of optimization effort has
>> gone into Lucene, and I’d also be interested in hearing about things that
>> worked well.
>>
>>
>>
>>
>>
>> *From: *Dorian Hoxha <do...@gmail.com>
>> *Reply-To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
>> *Date: *Friday, January 20, 2017 at 8:12 AM
>> *To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
>> *Subject: *How would you architect solr/lucene if you were starting from
>> scratch for them to be 10X+ faster/efficient ?
>>
>>
>>
>> Hi friends,
>>
>> I was thinking how scylladb architecture
>> <http://www.scylladb.com/technology/architecture/> works compared to
>> cassandra which gives them 10x+ performance and lower latency. If you were
>> starting lucene and solr from scratch what would you do to achieve
>> something similar ?
>>
>> Different language (rust/c++?) for better SIMD
>> <http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html>
>> ?
>>
>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>>
>> Make it in-memory and use better data structures?
>>
>> Shard on cores like scylladb (so 1 shard for each core on the machine) ?
>>
>> External cache (like keeping n redis-servers with big ram/network & slow
>> cpu/disk just for cache) ??
>>
>> Use better data structures (like algolia autocomplete radix
>> <https://blog.algolia.com/inside-the-algolia-engine-part-2-the-indexing-challenge-of-instant-search/>
>> )
>>
>> Distributing documents by term instead of id
>> <http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf>
>> ?
>>
>> Using ASIC / FPGA ?
>>
>>
>>
>> Regards,
>>
>> Dorian
>>
>
>


-- 
Sincerely yours
Mikhail Khludnev

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Dorian Hoxha <do...@gmail.com>.
I was also thinking about how Bing doesn't use posting lists
<http://bitfunnel.org/strangeloop/> and also compiles queries
<https://github.com/BitFunnel/NativeJIT> !
About compiled queries, I would think the overhead wouldn't be as high as
for queries in an RDBMS, since those apply to each row while in search
they apply to each bitset.

On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jw...@whitepages.com> wrote:

>
>
> I’ve had some curiosity about this question too.
>
>
>
> For a while, I watched for a seastar-like library for the JVM, but
> https://github.com/bestwpw/windmill was the only one I came across, and
> it doesn’t seem to be going anywhere. Since one of the points of the JVM is
> to abstract away the platform, I certainty wonder if the JVM will ever get
> the kinds of machine affinity these other projects see. Your
> one-shard-per-core could probably be faked with multiple JVMs and numactl -
> could be an interesting experiment.
>
>
>
> That said, I’m aware that a phenomenal amount of optimization effort has
> gone into Lucene, and I’d also be interested in hearing about things that
> worked well.
>
>
>
>
>
> *From: *Dorian Hoxha <do...@gmail.com>
> *Reply-To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
> *Date: *Friday, January 20, 2017 at 8:12 AM
> *To: *"dev@lucene.apache.org" <de...@lucene.apache.org>
> *Subject: *How would you architect solr/lucene if you were starting from
> scratch for them to be 10X+ faster/efficient ?
>
>
>
> Hi friends,
>
> I was thinking how scylladb architecture
> <http://www.scylladb.com/technology/architecture/> works compared to
> cassandra which gives them 10x+ performance and lower latency. If you were
> starting lucene and solr from scratch what would you do to achieve
> something similar ?
>
> Different language (rust/c++?) for better SIMD
> <http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html>
> ?
>
> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>
> Make it in-memory and use better data structures?
>
> Shard on cores like scylladb (so 1 shard for each core on the machine) ?
>
> External cache (like keeping n redis-servers with big ram/network & slow
> cpu/disk just for cache) ??
>
> Use better data structures (like algolia autocomplete radix
> <https://blog.algolia.com/inside-the-algolia-engine-part-2-the-indexing-challenge-of-instant-search/>
> )
>
> Distributing documents by term instead of id
> <http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf>
> ?
>
> Using ASIC / FPGA ?
>
>
>
> Regards,
>
> Dorian
>

Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Posted by Jeff Wartes <jw...@whitepages.com>.
I’ve had some curiosity about this question too.

For a while, I watched for a Seastar-like library for the JVM, but https://github.com/bestwpw/windmill was the only one I came across, and it doesn’t seem to be going anywhere. Since one of the points of the JVM is to abstract away the platform, I certainly wonder if the JVM will ever get the kinds of machine affinity these other projects see. Your one-shard-per-core could probably be faked with multiple JVMs and numactl - could be an interesting experiment.
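
For what it’s worth, a minimal sketch of the shared-nothing, shard-per-core pattern inside one JVM might look like the class below: one single-threaded executor per core, each owning its own state, with requests routed by hash. The class and its fields are hypothetical, and the actual thread-to-core pinning would still have to come from numactl/taskset outside the JVM.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Shared-nothing sketch: one single-threaded "shard" per core, no locks,
// every access to a shard's state happens on that shard's own thread.
public class ShardPerCore {

    private final int shards = Runtime.getRuntime().availableProcessors();
    private final ExecutorService[] executors = new ExecutorService[shards];
    private final List<Map<String, String>> data = new ArrayList<>(); // per-shard state, never shared

    public ShardPerCore() {
        for (int i = 0; i < shards; i++) {
            executors[i] = Executors.newSingleThreadExecutor();
            data.add(new HashMap<>());
        }
    }

    public CompletableFuture<Void> put(String key, String value) {
        int shard = Math.floorMod(key.hashCode(), shards);
        return CompletableFuture.runAsync(() -> data.get(shard).put(key, value), executors[shard]);
    }

    public CompletableFuture<String> get(String key) {
        int shard = Math.floorMod(key.hashCode(), shards);
        return CompletableFuture.supplyAsync(() -> data.get(shard).get(key), executors[shard]);
    }
}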

That said, I’m aware that a phenomenal amount of optimization effort has gone into Lucene, and I’d also be interested in hearing about things that worked well.


From: Dorian Hoxha <do...@gmail.com>
Reply-To: "dev@lucene.apache.org" <de...@lucene.apache.org>
Date: Friday, January 20, 2017 at 8:12 AM
To: "dev@lucene.apache.org" <de...@lucene.apache.org>
Subject: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?

Hi friends,
I was thinking how scylladb architecture<http://www.scylladb.com/technology/architecture/> works compared to cassandra which gives them 10x+ performance and lower latency. If you were starting lucene and solr from scratch what would you do to achieve something similar ?

Different language (rust/c++?) for better SIMD<http://blog-archive.griddynamics.com/2015/06/lucene-simd-codec-benchmark-and-future.html> ?
Use a GPU with a SSD for posting-list intersection ?(not out yet)
Make it in-memory and use better data structures?
Shard on cores like scylladb (so 1 shard for each core on the machine) ?
External cache (like keeping n redis-servers with big ram/network & slow cpu/disk just for cache) ??
Use better data structures (like algolia autocomplete radix<https://blog.algolia.com/inside-the-algolia-engine-part-2-the-indexing-challenge-of-instant-search/> )
Distributing documents by term instead of id<http://research.microsoft.com/en-us/um/people/trishulc/papers/Maguro.pdf> ?
Using ASIC / FPGA ?

Regards,
Dorian